Brief of GPU & CUDA
Chun-Yuan Lin
112/04/21
About Myself
Name: Chun-Yuan Lin (林俊淵) (1977)
Education: Ph.D., Dept. CS, FCU Univ.
Experience: Postdoctoral Fellow, Institute of Molecular and Cellular Biology; Postdoctoral Fellow, Dept. CS, NTHU Univ.
Research: Parallel and Distributed Processing, Parallel and Distributed Programming Languages, Algorithm Analysis, Information Retrieval, Genomics, Proteomics, and Bioinformatics
Contact: [email protected]
http://sslab.cs.nthu.edu.tw/~cylin/
My family
Please let me know who you are
Introduction to Shared Memory Programming
Types of Parallel Computers
Two principal types:
Shared memory multiprocessor
Distributed memory multicomputer
Another type: Distributed shared memory multiprocessor
Shared Memory Multiprocessor System
Natural way to extend the single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module:
[Diagram: processors connected through an interconnection network to memory modules, forming one address space]
Simplistic view of a small shared memory multiprocessor
Examples: dual Pentiums, quad Pentiums
[Diagram: processors connected to shared memory over a bus]
Programming Shared Memory Multiprocessors
Threads - the programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads.
Example: Pthreads
Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism.
Example: OpenMP - industry standard - needs an OpenMP compiler
Flynn's Classifications
Flynn (1966) created a classification for computers based upon instruction streams and data streams:
Single instruction stream-single data stream (SISD) computer
Single processor computer - a single stream of instructions is generated from the program. Instructions operate upon a single stream of data items.
Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer
General-purpose multiprocessor system - each processor has a separate program and one instruction stream is generated from each program for each processor. Each instruction operates upon different data.
Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.
Single Instruction Stream-Multiple Data Stream (SIMD) Computer
A specially designed computer - a single instruction stream from a single program, but multiple data streams exist. Instructions from program broadcast to more than one processor. Each processor executes same instruction in synchronism, but using different data.
Developed because a number of important applications mostly operate upon arrays of data.
Multiple Program Multiple Data (MPMD) Structure
Within the MIMD classification, each processor will have its own program to execute:
[Diagram: two programs, each generating its own instruction stream for its own processor and operating on its own data]
Single Program Multiple Data (SPMD) Structure
A single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism.
Source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer.
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on their memory structure (global or distributed) and the mechanism used for communication/synchronization (shared variables or message passing).
Programming with Shared Memory
We outline the methods of programming systems that have shared memory, including the use of processes, threads, parallel programming languages, and sequential languages with compiler directives and library routines:
the standard UNIX process "fork-join" model
the IEEE thread standard Pthreads
OpenMP, a widely accepted industry standard for parallel programming on a shared memory multiprocessor
Shared memory multiprocessor system
Any memory location can be accessible by any of the processors. (dual- and quad-Pentium systems, cost-effective)
A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.
Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.)
Threads
The process created with UNIX fork is a "heavyweight" process; it is a completely separate program with its own variables, stack, and personal memory allocation. Heavyweight processes are expensive in time and memory space.
A much more efficient mechanism is one in which independent concurrent sequences are defined within a process, so-called threads.
The threads all share the same memory space and global variables of the process and are much less expensive in time and memory space than the processes themselves.
Each thread needs its own stack and also stores information regarding registers but shares the code and other parts.
Creation of a thread can take three orders of magnitude less time than process creation.
In addition, a thread will immediately have access to shared global variables.
Equally important, threads can be synchronized much more efficiently than processes.
Multithreading also helps alleviate the long latency of message-passing.
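A minimal sketch of these ideas in C with Pthreads (compile with gcc demo.c -lpthread; all names are illustrative): the main thread sets a global variable, creates four threads that can read it immediately, and joins them.

    #include <pthread.h>
    #include <stdio.h>

    int shared_data = 0;                  /* global: visible to every thread */

    void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d sees shared_data = %d\n", id, shared_data);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        int ids[4];
        shared_data = 42;                 /* written by the main thread */
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);   /* wait for all threads to finish */
        return 0;
    }

No shared memory system call is needed for shared_data; a thread sees it simply because it lives in the process's single address space.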
Differences between a process and threads
Creating Shared Data
The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor.
If UNIX heavyweight processes are to share data, additional shared memory system calls are necessary. Typically, each process has its own virtual address space within the virtual memory management system.
It is not necessary to create shared data items explicitly when using threads. Variables declared at the top of the main program (main thread) are global and are available to all threads.
Accessing Shared Data
Accessing shared data needs careful control (for both processes and threads); writing shared data is where problems arise.
Consider two processes, each of which is to add one to a shared data item, x. It is necessary for the contents of the location x to be read, x + 1 computed, and the result written back to the location:
With x = 10, the answer should be 12, but we may get 11.
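A minimal sketch of this race with two Pthreads (the structure is illustrative): both threads may read x = 10 before either writes back, so one increment is lost.

    #include <pthread.h>
    #include <stdio.h>

    int x = 10;                 /* shared data item */

    void *add_one(void *arg) {
        int tmp = x;            /* read the location x */
        tmp = tmp + 1;          /* compute x + 1 */
        x = tmp;                /* write back: may overwrite the other thread's update */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, add_one, NULL);
        pthread_create(&t2, NULL, add_one, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x = %d (should be 12, may be 11)\n", x);
        return 0;
    }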
Critical Section
A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and arrange that only one such critical section is executed at a time (when a process reaches a critical section, it may have to wait…).
This mechanism is known as mutual exclusion.
This concept also appears in operating systems.
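A sketch of one common realization of mutual exclusion: guarding the update with a Pthreads mutex, so only one thread executes the critical section at a time (the lock name is illustrative).

    #include <pthread.h>

    int x = 10;
    pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

    void *add_one_safely(void *arg) {
        pthread_mutex_lock(&x_lock);     /* enter the critical section */
        x = x + 1;                       /* read, add, write back as one exclusive step */
        pthread_mutex_unlock(&x_lock);   /* leave the critical section */
        return NULL;
    }

With the lock in place, the two-thread example above always yields 12.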
Deadlock
Can occur with two processes when each requires a resource held by the other.
Barriers
Process/thread synchronization is often needed in shared memory programs.
Pthreads do not have a native barrier, so barriers have to be hand-coded using a condition variable and mutex.
A global counter variable is incremented each time a thread reaches the barrier, and all the threads are released when the counter has reached a defined number of threads. The threads are released by the last thread reaching the barrier using a broadcast signal.
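A sketch of such a hand-coded barrier (names are illustrative; the generation counter guards against spurious wakeups when the barrier is reused):

    #include <pthread.h>

    static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  barrier_cond = PTHREAD_COND_INITIALIZER;
    static int arrived = 0;          /* global counter of threads at the barrier */
    static int generation = 0;       /* incremented each time the barrier opens */

    void barrier(int nthreads) {
        pthread_mutex_lock(&barrier_lock);
        int my_gen = generation;
        if (++arrived == nthreads) {                /* last thread to arrive */
            arrived = 0;                            /* reset for reuse */
            generation++;
            pthread_cond_broadcast(&barrier_cond);  /* release all waiters */
        } else {
            while (my_gen == generation)            /* wait until released */
                pthread_cond_wait(&barrier_cond, &barrier_lock);
        }
        pthread_mutex_unlock(&barrier_lock);
    }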
Language Constructs for Parallelism
Shared Data
Shared memory variables might be declared as shared with, say,
shared int x; shared int *p;
par Construct
For specifying concurrent statements:
par {
   S1;
   S2;
   ..
   Sn;
}
The keyword par indicates that statements in the body are to be executed concurrently. This is instruction-level parallelism.
Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently.
par {
   proc1;
   proc2;
   ..
   procn;
}
forall Construct
To start multiple similar processes together:
forall (i = 0; i < n; i++) {
   S1;
   S2;
   ..
   Sm;
}
which generates n processes each consisting of the statements forming the body of the for loop, S1, S2, …, Sm. Each process uses a different value of i.
Example
forall (i = 0; i < 5; i++)
   a[i] = 0;
clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
OpenMP
An accepted standard developed in the late 1990s by a group of industry specialists.
Consists of a small set of compiler directives, augmented with a small set of library routines and environment variables using the base language Fortran and C/C++.
The compiler directives can specify such things as the par and forall operations described previously.
Several OpenMP compilers available.
For C/C++, the OpenMP directives are contained in #pragma statements. The OpenMP #pragma statements have the format:
#pragma omp directive_name ...
where omp is an OpenMP keyword.
May be additional parameters (clauses) after the directive name for different options.
Some directives require code to be specified in a structured block (a statement or statements) that follows the directive, and then the directive and structured block form a "construct".
Parallel Directive
#pragma omp parallel
structured_block
creates multiple threads, each one executing the specified structured_block, either a single statement or a compound statement created with { ... } with a single entry point and a single exit point.
There is an implicit barrier at the end of the construct. The directive corresponds to the forall construct.
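A minimal sketch of the directive in use (compile with gcc -fopenmp; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP library routines):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* implicit barrier at the end of the construct */
        return 0;
    }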
Number of threads in a team
Established by either:
1. a num_threads clause after the parallel directive, or
2. the omp_set_num_threads() library routine being previously called, or
3. the environment variable OMP_NUM_THREADS being defined,
in the order given, or is system dependent if none of the above.
Number of threads available can also be altered automatically to achieve best use of system resources by a “dynamic adjustment” mechanism.
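A sketch of the first two mechanisms, showing the precedence order given above (the clause overrides the library routine, which overrides the environment variable):

    #include <omp.h>

    void demo(void) {
        omp_set_num_threads(8);                 /* 2. library routine */

        #pragma omp parallel num_threads(4)     /* 1. clause wins: team of 4 */
        { /* ... */ }

        #pragma omp parallel                    /* team of 8, from the routine */
        { /* ... */ }
    }
    /* 3. otherwise: OMP_NUM_THREADS=16 ./a.out */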
Work-Sharing
Three constructs in this classification:
sections
for
single
In all cases, there is an implicit barrier at the end of the construct unless a nowait clause is included.
Note that these constructs do not start a new team of threads. That is done by an enclosing parallel construct.
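A sketch showing all three work-sharing constructs inside one parallel region (the work itself is illustrative):

    #include <omp.h>
    #include <stdio.h>

    void work_share(int n, float *a) {
        #pragma omp parallel
        {
            #pragma omp for               /* iterations divided among the team */
            for (int i = 0; i < n; i++)
                a[i] = 0.0f;
            /* implicit barrier here */

            #pragma omp sections          /* each section runs on one thread */
            {
                #pragma omp section
                printf("section A\n");
                #pragma omp section
                printf("section B\n");
            }

            #pragma omp single nowait     /* executed by one thread only; nowait skips the barrier */
            printf("done by one thread\n");
        }
    }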
Shared Memory Programming
Performance Issues
Shared Data in Systems with Caches
All modern computer systems have cache memory, high-speed memory closely attached to each processor for holding recently referenced data and code.
Cache coherence protocols
Update policy - copies of data in all caches are updated at the time one copy is altered.
Invalidate policy - when one copy of data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache). These copies are only updated when the associated processor makes a reference to it.
False Sharing
Different parts of a block may be required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated even though the actual data is not shared.
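A sketch of the standard remedy: pad per-processor data so items updated by different processors fall in different cache blocks (a 64-byte line is assumed; the names are illustrative):

    #define CACHE_LINE 64

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* keep neighbors in other lines */
    };

    struct padded_counter counts[8];   /* one per thread: no false sharing */
    /* versus: long counts[8];  adjacent elements share a cache line, so a write
       by one processor invalidates or updates the line in every other cache */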
Critical Sections Serializing Code
High performance programs should have as few critical sections as possible, because their use can serialize the code.
Suppose, all processes happen to come to their critical section together.
They will execute their critical sections one after the other.
In that situation, the execution time becomes almost that of a single processor.
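A sketch of the point using OpenMP's critical directive (partial_work() is a hypothetical per-iteration computation): keeping the real work outside the critical section leaves only the tiny shared update serialized.

    float partial_work(int i);            /* hypothetical parallel computation */

    float sum(int n) {
        float total = 0.0f;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            float v = partial_work(i);    /* runs in parallel, outside the lock */
            #pragma omp critical
            total += v;                   /* only this update is serialized */
        }
        return total;
    }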
What is a GPU?
Graphics Processing Units
The Challenge
Render infinitely complex scenes
And extremely high resolution
In 1/60th of one second
Luxo Jr. 1985 took 2-3 hours per frame to render on a Cray-1 supercomputer
Today we can easily render that in 1/30th of one second
Over 300,000x faster. Still not even close to where we need to be… but look how far we've come!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL1, University of Illinois, Urbana-Champaign
PC/DirectX Shader Model Timeline
[Timeline, 1998-2004, pairing DirectX versions with NVIDIA hardware:]
DirectX 5: Riva 128
DirectX 6 (Multitexturing): Riva TNT
DirectX 7 (T&L, TextureStageState): GeForce 256
DirectX 8 (SM 1.x): GeForce 3, Cg
DirectX 9 (SM 2.0): GeForceFX
DirectX 9.0c (SM 3.0): GeForce 6
Example games: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3
Why Massively Parallel Processor
A quiet revolution and potential build-up
Calculation: 367 GFLOPS vs. 32 GFLOPS
Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
Until last year, programmed through graphics API
GPU in every PC and workstation - massive volume and potential impact
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Diagram: the host feeds an input assembler and thread execution manager; an array of SMs with parallel data caches and texture units connects through load/store paths to global memory]
G80 Characteristics
367 GFLOPS peak performance (25-50 times current high-end microprocessors)
265 GFLOPS sustained for apps such as VMD
Massively parallel, 128 cores, 90W
Massively threaded, sustains 1000s of threads per app
30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers."
- John Stone, VMD group, Physics UIUC
Objective
To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
The feeds and speeds of the traditional CPU world
The feeds and speeds when employing a GPU
To form a solid knowledge base for performance programming in modern GPUs
Knowing yesterday, today, and tomorrow
The PC world is becoming flatter
Outsourcing of computation is becoming easier…
Future Apps Reflect a Concurrent World
Exciting applications in the future mass computing market have traditionally been considered "supercomputing applications"
Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
These "super-apps" represent and model the physical, concurrent world
Various granularities of parallelism exist, but…
the programming model must not hinder parallel implementation
data delivery needs careful management
Stretching from Both Ends for the Meat
New GPUs cover the massively parallel parts of applications better than CPUs
Attempts to grow current CPU architectures "out" or domain-specific architectures "in" lack success
Using a strong combination on apps is a compelling idea: CUDA
[Diagram: traditional applications under current architecture coverage, new applications under domain-specific architecture coverage, with obstacles between the two]
Bandwidth - Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance
Especially true for massively parallel systems processing massive amounts of data
Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
Ultimately, the performance falls back to what the "speeds and feeds" dictate
Classic PC architecture
Northbridge connects 3 components that must communicate at high speed: CPU, DRAM, video
Video also needs to have 1st-class access to DRAM
Previous NVIDIA cards are connected to AGP, up to 2 GB/s transfers
Southbridge serves as a concentrator for slower I/O devices
[Diagram: CPU, DRAM, and video connected through the core logic chipset]
PCI Bus Specification
Connected to the southbridge
Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate
More recently 66 MHz, 64-bit, 512 MB/second peak
Upstream bandwidth remains slow for devices (256 MB/s peak)
Shared bus with arbitration
Winner of arbitration becomes bus master and can connect to CPU or DRAM through the southbridge and northbridge
An Example of Physical Reality Behind CUDA
[Diagram: the CPU (host) connected through the northbridge to the GPU with its local DRAM (the device)]
Northbridge handles "primary" PCIe to video/GPU and DRAM. PCIe x16 bandwidth at 8 GB/s (4 GB/s each direction)
Graphics Processing Unit
Parallel Computing on a GPU
NVIDIA GPU Computing Architecture
Via a separate HW interface
In laptops, desktops, workstations, servers
G80 to G200
8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
Programmable in C with CUDA tools
Multithreaded SPMD model uses application data parallelism and thread parallelism
Tesla C870, Tesla D870, Tesla S870
Tesla C1060: 1 TFLOPS
NVIDIA® Tesla™ S1070: 4 teraflop 1U system
What is GPGPU?
General Purpose computation using a GPU in applications other than 3D graphics
GPU accelerates the critical path of the application
Data parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation
Applications - see GPGPU.org
Game effects (FX) physics, image processing
Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
DirectX 5 / OpenGL 1.0 and Before
Hardwired pipeline
Inputs are DIFFUSE, FOG, TEXTURE
Operations are SELECT, MUL, ADD, BLEND
Blended with FOG:
RESULT = (1.0-FOG)*COLOR + FOG*FOGCOLOR
Example hardware: RIVA 128, Voodoo 1, Reality Engine, Infinite Reality
No "ops", "stages", programs, or recirculation
The 3D Graphics Pipeline
[Diagram: on the host, the Application and Scene Management stages; on the GPU, Geometry, Rasterization, Pixel Processing, and ROP/FBI/Display, backed by frame buffer memory]
The GeForce Graphics Pipeline
[Diagram: the host feeds Vertex Control (with a vertex cache), then VS/T&L, Triangle Setup, Raster, Shader (with a texture cache), ROP, and FBI, which reads and writes frame buffer memory]
Traditional Graphics Pipeline
Unified Shader Pipeline (G80~)
Feeding the GPU
GPU accepts a sequence of commands and data
Vertex positions, colors, and other shader parameters
Texture map images
Commands like "draw triangles with the following vertices until you get a command to stop drawing triangles"
Application pushes data using Direct3D or OpenGL
GPU can pull commands and data from system memory or from its local memory
CUDA
"Compute Unified Device Architecture"
General purpose programming model
GPU = dedicated super-threaded, massively data parallel co-processor
Targeted software stack
Compute oriented drivers, language, and tools
Driver for loading computation programs into the GPU
Standalone driver - optimized for computation
Interface designed for compute - graphics-free API
Guaranteed maximum download & readback speeds
Explicit GPU memory management
GeForce-8 Series HW Overview
[Diagram: the Streaming Processor Array is built from Texture Processor Clusters (TPCs); each TPC contains a texture unit (TEX) and two Streaming Multiprocessors (SMs); each SM has instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, 8 Streaming Processors (SPs), and 2 Special Function Units (SFUs)]
CUDA Programming Model: A Highly Multithreaded Coprocessor
The GPU is viewed as a compute device that:
Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
Differences between GPU and CPU threads:
GPU threads are extremely lightweight
Very little creation overhead
GPU needs 1000s of threads for full efficiency
Multi-core CPU needs only a few
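A minimal sketch of this model in CUDA C (names are illustrative): a data-parallel kernel run by one lightweight thread per element, launched from the host with enough blocks to cover the data.

    __global__ void add_one(float *d_a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n)
            d_a[i] += 1.0f;                   /* each thread handles one element */
    }

    void run(float *d_a, int n) {
        int threads = 256;                          /* threads per block */
        int blocks = (n + threads - 1) / threads;   /* enough blocks to cover n */
        add_one<<<blocks, threads>>>(d_a, n);       /* launch 1000s of threads */
    }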
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution
Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 as Grid 1, a 3x2 array of blocks, and Kernel 2 as Grid 2 on the device; Block (1, 1) is expanded into a 5x3 array of threads. Courtesy: NVIDIA]
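A sketch of launching the layout drawn above (a 3x2 grid of 5x3-thread blocks) and of how each thread derives its own element index; the kernel and parameter names are illustrative.

    __global__ void kernel(float *d_data, int width) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */
        int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row index */
        d_data[y * width + x] = 0.0f;     /* each thread touches one element */
    }

    void launch(float *d_data, int width) {
        dim3 block(5, 3);                 /* 5x3 threads per block */
        dim3 grid(3, 2);                  /* 3x2 blocks per grid */
        kernel<<<grid, block>>>(d_data, width);
    }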
CUDA Device Memory Space Overview
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Diagram: the device grid contains blocks; each block has its own shared memory and threads with per-thread registers and local memory; global, constant, and texture memories are shared across the grid and accessible from the host]
The host can R/W global, constant, and texture memories
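A sketch tying these spaces together (256 threads per block assumed; names are illustrative): each block stages data from global memory into its shared memory, synchronizes, then writes back.

    __global__ void reverse_in_block(float *d_global) {
        __shared__ float s[256];          /* per-block shared memory */
        int t = threadIdx.x;              /* t lives in a per-thread register */
        int base = blockIdx.x * blockDim.x;
        s[t] = d_global[base + t];        /* global -> shared */
        __syncthreads();                  /* all threads in the block wait here */
        d_global[base + t] = s[blockDim.x - 1 - t];   /* shared -> global */
    }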
Global, Constant, and Texture Memories (Long Latency Accesses)
Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Texture and constant memories
Constants initialized by host
Contents visible to all threads
SIMT (Single Instruction, Multiple Threads)
[Diagram: execution alternates between serial code on the CPU and parallel kernels on the GPU]
What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about)
So, more transistors can be devoted to data processing rather than data caching and flow control
[Diagram: the CPU devotes large areas to control logic and cache with a few ALUs; the GPU devotes most of its area to many ALUs, with little control logic and cache; both are backed by DRAM]
CPU vs. GPU
Resources
CUDA Zone: http://www.nvidia.com.tw/object/cuda_home_tw.html#
CUDA Course: http://www.nvidia.com.tw/object/cuda_university_courses_tw.html