CUDA Massively Parallel Programming
Zhao Kaiyong (赵开勇)
[email protected]
http://www.comp.hkbu.edu.hk/~kyzhao
http://blog.csdn.net/openhero
Department of Computer Science, Hong Kong Baptist University
GPU High-Performance Development Consultant, Inspur
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
New-generation Fermi GPU
Graphic Processing Unit (GPU)
A dedicated graphics device used in personal computers, workstations, and game consoles
Discrete graphics cards
• NVIDIA and ATI (now AMD) are the main manufacturers
  > Intel plans to enter this market with Larrabee
Motherboard-integrated graphics
• Intel
The 3D Graphics Pipeline
A typical frame: 1M triangles, 3M vertices, 25M fragments
[Figure: pipeline from CPU to GPU: Vertex Processor, Rasterizer, Fragment Processor (fed by Texture), Framebuffer]
At 30 frames/s this amounts to 30M triangles/s, 90M vertices/s, 750M fragments/s
Traditional GPU Architecture
Graphics program → Vertex processors → Fragment processors → Pixel operations → Output image
The GPU's Powerful Compute Capability
[Figure: memory bandwidth (GB/s) from 2003 to 2007, GPUs (NV30, NV40, G71, G80, G80 Ultra) versus CPUs (Northwood, Prescott EE, Woodcrest, Harpertown)]
• Data-level parallelism: uniform computation
• Dedicated memory channels
• Effectively hides memory latency
General Purpose Computing on GPU (GPGPU)
GPGPU
Core idea
• Describe general-purpose computing problems in a graphics language
• Map data onto vertex or fragment processors
But
• Hardware resources are used inefficiently
• Memory access patterns are severely restricted
• Hard to debug and verify
• Requires advanced graphics-processing and programming skills
G80 GPU
[Figure: G80 block diagram. Host, Input Assembler, Vtx/Geom/Pixel Thread Issue, and Setup/Rstr/ZCull feed a thread-processor array of SP pairs; each cluster has an L1 cache and texture fetch (TF) units, backed by L2 cache and frame buffer (FB) partitions]
Streaming Multiprocessor (SM)
Streaming Processor (SP)
CUDA: Compute Unified Device Architecture
CUDA: integrated CPU + GPU C applications, a general-purpose parallel computing model
Single-instruction, multiple-data execution model (SIMD)
• All threads execute the same code (1000s of threads on the fly)
• Massive parallel compute resources process different data
Hides memory latency
• Raises the computation/communication ratio
• Coalesces memory accesses to adjacent addresses
• Fast thread switching: 1 cycle@GPU vs. ~1000 cycles@CPU
Evolution of CUDA-Enabled GPUs
Compute 1.0: basic CUDA compatibility
• G80
Compute 1.1: asynchronous memory copies and atomic global operations
• G84, G86, G92, G94, G96, and G98
Compute 1.2: dramatically improved memory coalescing rules, double the register count, intra-warp voting primitives, atomic shared memory operations
• GT21x
Compute 1.3: double precision
• GT200
CUDA Success Stories
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
New-generation Fermi GPU
Dimensions of Parallelism
1D: y = a + b  // y, a, b are vectors
2D: P = M × N  // P, M, N are matrices
3D: CT or MRI imaging
[Figure: element-wise vector addition, y[i] = a[i] + b[i] for i = 0 … n]
Parallel Thread Organization
Thread: the basic unit of parallelism
Thread block: a group of cooperating threads
• Cooperative Thread Array (CTA)
• Threads may synchronize with one another
• Exchange data through fast shared memory
• Organized in 1, 2, or 3 dimensions
• Contains at most 512 threads
Grid: a set of thread blocks
• Organized in 1 or 2 dimensions
• Blocks share global memory
Kernel: the core program executed on the GPU
• One kernel ↔ one grid
[Figure: the host launches Kernel 1 on Grid 1, made up of Blocks (0,0) … (2,1), and Kernel 2 on Grid 2; inside Block (1,1), threads (0,0) … (4,2) form a 2D array]
Parallel Program Organization in CUDA
Software to hardware mapping:
• Thread → SP
• Thread block → SM
• Grid → GPU (an array of TPCs, each grouping several SMs)
Parallel Thread Execution
Launching a kernel function requires an execution configuration
Threads and blocks have IDs
• threadIdx: 1D, 2D, or 3D
• blockIdx: 1D or 2D
• These determine which data each thread processes

__global__ void kernel(...);
dim3 DimGrid(3, 2);     // 6 thread blocks
dim3 DimBlock(16, 16);  // 256 threads per block
kernel<<<DimGrid, DimBlock>>>(...);
Example 1: Element-Wise Addition

// CPU program
// sum of two vectors a and b
void add_cpu(float *a, float *b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] += b[idx];
}

void main()
{
    .....
    add_cpu(a, b, N);
}

// CUDA program
// sum of two vectors a and b
__global__ void add_gpu(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] += b[idx];
}

void main()
{
    …..
    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);   // = ceil(N / 256)
    add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
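The main() above elides allocation and data movement. A minimal, self-contained host driver for the add_gpu kernel is sketched below, assuming the kernel is defined as above; the problem size N and the device pointer names d_a, d_b are illustrative choices, not from the slides.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;                       // illustrative problem size
    size_t bytes = N * sizeof(float);

    float *a = (float*)malloc(bytes);
    float *b = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    float *d_a, *d_b;                            // device copies of a and b
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);
    add_gpu<<<dimGrid, dimBlock>>>(d_a, d_b, N); // a[idx] += b[idx] on the device

    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);
    printf("a[0] = %f\n", a[0]);                 // expect 3.0

    cudaFree(d_a); cudaFree(d_b);
    free(a); free(b);
    return 0;
}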
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
New-generation Fermi GPU
CUDA Processing Flow
Parallel Thread Execution
Within an SM, threads execute in parallel in units of warps (a warp = 32 threads)
Threads in the same warp execute the same instruction
The half-warp is the basic unit of memory operations
[Figure: Blocks 0, 1, and 2, each partitioned into warps]
Control Flow
Branches within the same warp may take different instruction paths
• Threads on different paths can only execute sequentially
• Each pass executes one of the warp's possible paths
• N instruction paths → 1/N throughput
Only divergence within a single warp matters; different paths in different warps are unrelated
G80 uses instruction predication to speed up execution
Control Flow
Common case: when the branch condition is a function of the thread ID, divergence results easily
Example with divergence:
• if (threadIdx.x > 2) { }
  > Creates two different instruction paths in the thread block
• Branch granularity < warp size: threads 0, 1, and 2 follow a different path from the other threads in the first warp
Example without divergence:
• if (threadIdx.x / WARP_SIZE > 2) { }
  > Also creates two different instruction paths in the thread block
• Branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path
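To make the contrast concrete, here is a small sketch of both patterns in one kernel; the kernel name branch_demo and the data array are illustrative, not from the slides.

#define WARP_SIZE 32

__global__ void branch_demo(float *data)
{
    int tid = threadIdx.x;

    // Divergent: the condition splits threads inside the same warp,
    // so the two paths are serialized for that warp.
    if (tid > 2)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Non-divergent: the condition is uniform across each warp,
    // so every warp takes exactly one path at full throughput.
    if (tid / WARP_SIZE > 2)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}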
Thread Synchronization
void __syncthreads();
Barrier synchronization
• Synchronizes all threads within a thread block
• Avoids RAW/WAR/WAW hazards when accessing shared memory

__shared__ float scratch[256];
scratch[threadIdx.x] = begin[threadIdx.x];
__syncthreads();   // wait here until all threads have arrived, then continue below
float left = scratch[threadIdx.x - 1];
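A self-contained version of the fragment above, as a sketch; the kernel name shift_left, the out array, and the guard for thread 0 are our additions.

__global__ void shift_left(const float *begin, float *out)
{
    __shared__ float scratch[256];               // one element per thread in the block

    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();                             // all writes to scratch are now visible

    if (threadIdx.x > 0)                         // thread 0 has no left neighbour
        out[threadIdx.x] = scratch[threadIdx.x - 1];
}

// Launched with one block of 256 threads: shift_left<<<1, 256>>>(d_in, d_out);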
Deadlock with __syncthreads
Deadlock occurs if some threads have val larger than threshold and others do not:

__global__ void compute(...)
{
    // do some computation for val

    if (val > threshold)
        return;

    __syncthreads();

    // work with val & store it
    return;
}
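One way to remove the deadlock, sketched under the assumption that the early-exit threads simply have no work to do: keep every thread alive until the barrier and turn the early return into a flag.

__global__ void compute_safe(const float *in, float *out, float threshold)
{
    // do some computation for val
    float val = in[threadIdx.x];
    bool keep = (val <= threshold);   // remember the test instead of returning early

    __syncthreads();                  // every thread in the block reaches the barrier

    if (keep) {
        // work with val & store it
        out[threadIdx.x] = val;
    }
}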
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
New-generation Fermi GPU
CUDA Language Extensions
Declspecs
• global, device, shared, local, constant
Keywords
• threadIdx, blockIdx, blockDim, gridDim
Intrinsics
• __syncthreads
Runtime API
• Memory, symbol, and execution management
Function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
foo<<<100, 10>>>(parameters);
Memory Spaces
R/W per-thread registers
• 1-cycle latency
R/W per-thread local memory
• Slow – register spilling to global memory
R/W per-block shared memory
• 1-cycle latency, "__shared__"
• But bank conflicts may drag performance down
R/W per-grid global memory
• ~500-cycle latency, "__device__"
• But coalesced accesses can hide the latency
Read-only per-grid constant and texture memories
• ~500-cycle latency, but cached
[Figure: memory hierarchy of a (Device) Grid. Each Block (0,0), (1,0) has its own shared memory; each thread has registers and local memory; all blocks share the grid's global, constant, and texture memory, which the host can also access]
GPU Global Memory Allocation
cudaMalloc()
• Allocates global memory in device memory
• Two parameters: the address of a pointer to the allocated object, and the size of the object
cudaFree()
• Frees global memory in device memory
• One parameter: pointer to the object

int blk_sz = 64;
float* Md;
int size = blk_sz * blk_sz * sizeof(float);
cudaMalloc((void**)&Md, size);
…
cudaFree(Md);
Host – Device Data Exchange
cudaMemcpy()
• Memory data transfer
• Requires four parameters:
  > Pointer to destination
  > Pointer to source
  > Number of bytes copied
  > Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device

cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);
cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);
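Every call above returns a cudaError_t. A common checking pattern is sketched here; the CHECK macro is our own convention, not part of the CUDA API.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a readable message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(1);                                                   \
        }                                                              \
    } while (0)

// Usage with the calls from the last two slides:
//   CHECK(cudaMalloc((void**)&Md, size));
//   CHECK(cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice));
//   CHECK(cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost));
//   CHECK(cudaFree(Md));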
CUDA Function Qualifiers
__global__ defines a kernel function and must return void
__device__ functions cannot have their address taken with &, and do not support recursion, static variables, or variable-length argument lists

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host
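A minimal sketch of the three qualifiers working together; the function names square, square_all, and launch are illustrative.

__device__ float square(float x)                 // device code, callable from device code only
{
    return x * x;
}

__global__ void square_all(float *data, int N)   // kernel: runs on the device, launched from the host
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        data[idx] = square(data[idx]);
}

__host__ void launch(float *d_data, int N)       // ordinary host function
{
    square_all<<<(N + 255) / 256, 256>>>(d_data, N);
}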
CUDA Math Functions
pow, sqrt, cbrt, hypot, exp, exp2, expm1, log, log2, log10, log1p, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, ceil, floor, trunc, round, etc.
Only scalar operands are supported
Many functions have a faster, less accurate counterpart
• Prefixed with "__", e.g. __sinf()
• The compiler switch -use_fast_math forces the fast versions to be generated
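For illustration, both variants of the sine function side by side; sinf is the standard single-precision device function and __sinf the fast intrinsic (the kernel itself is a sketch).

__global__ void fast_vs_accurate(float *out, const float *in, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float accurate = sinf(in[idx]);     // full-precision math function
        float fast     = __sinf(in[idx]);   // faster, lower-accuracy intrinsic
        out[idx] = accurate - fast;         // inspect the error introduced
    }
}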
Example 2: Matrix Multiplication
Matrix data type – not part of CUDA!
• Single-precision floats, width × height elements
• Matrix elements stored in elements
• A 1-D array holds the matrix data, in row-major storage

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

[Figure: matrices A, B, and C = A × B, with their width and height dimensions labelled]
Example 2: Matrix Multiplication
C = A × B of size WIDTH x WIDTH
One thread computes one matrix element
Simplification: assume WIDTH x WIDTH < 512
• Only one thread block is needed
Each thread loads one row of A and one column of B, and performs one multiply and one add for each pair of corresponding elements
[Figure: matrices A, B, and C, each WIDTH x WIDTH]
CUDA Implementation – Host Side

// Matrix multiplication on the device
void Mul(const Matrix A, const Matrix B, Matrix C)
{
    int size = A.width * A.width * sizeof(float);

    // Load A and B to the device
    float *Ad, *Bd, *Cd;
    cudaMalloc((void**)&Ad, size);   // matrix stored in linear order
    cudaMemcpy(Ad, A.elements, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    cudaMalloc((void**)&Cd, size);
CUDA Implementation – Host Side

    // Launch the device computation threads!
    dim3 dimGrid(1);
    dim3 dimBlock(A.width, A.width);
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, A.width);

    // Read C from the device
    cudaMemcpy(C.elements, Cd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
CUDA Implementation – Kernel

// Matrix multiplication kernel – thread specification
__global__ void Muld(float* Ad, float* Bd, float* Cd, int width)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // cvalue is used to store the element of the matrix
    // that is computed by the thread
    float cvalue = 0;
CUDA Implementation – Kernel
[Figure: thread (tx, ty) reads row ty of A and column tx of B to compute one element of C]

    for (int k = 0; k < width; ++k) {
        float ae = Ad[ty * width + k];
        float be = Bd[tx + k * width];
        cvalue += ae * be;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    Cd[ty * width + tx] = cvalue;
}
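Looking ahead to the shared-memory section, the usual next step is to tile Muld so each block stages TILE × TILE sub-matrices of A and B in shared memory, cutting global-memory reads per element from WIDTH down to WIDTH/TILE. A sketch, assuming width is a multiple of TILE; the name Muld_tiled and TILE = 16 are our choices, not from the slides.

#define TILE 16

__global__ void Muld_tiled(float* Ad, float* Bd, float* Cd, int width)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float cvalue = 0;
    for (int m = 0; m < width / TILE; ++m) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[ty][tx] = Ad[row * width + m * TILE + tx];
        Bs[ty][tx] = Bd[(m * TILE + ty) * width + col];
        __syncthreads();                              // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            cvalue += As[ty][k] * Bs[k][tx];
        __syncthreads();                              // finish before overwriting the tiles
    }
    Cd[row * width + col] = cvalue;
}

// Launched with dim3 dimBlock(TILE, TILE); dim3 dimGrid(width / TILE, width / TILE);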
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
New-generation Fermi GPU
Shared Memory
Located inside each streaming multiprocessor
Shared by all threads within a thread block
Entirely software-controlled
Accessing one address takes only 1 clock cycle
Shared Memory Organization
G80 shared memory is organized into 16 banks
• Addressed in 4-byte words: bank ID = (4-byte address) % 16
• Adjacent 4-byte addresses map to adjacent banks
• Each bank has a bandwidth of 4 bytes per clock cycle
Simultaneous accesses to the same bank cause a bank conflict
• Conflicting accesses can only be processed sequentially
• Only among threads of the same thread block
[Figure: banks 0 to 15, with addresses 0, 16, 32, … in bank 0; 1, 17, 33, … in bank 1; and so on up to 15, 31, 47, … in bank 15]
Bank Addressing Examples
No bank conflicts
• Linear addressing, stride == 1 (s=1)
No bank conflicts
• Random 1:1 permutation
[Figure: threads 0 to 15 mapping one-to-one onto banks 0 to 15 in both cases]

__shared__ float shared[256];
float foo = shared[threadIdx.x];
Bank Addressing Examples
2-way bank conflicts
• Linear addressing, stride == 2 (s=2)
8-way bank conflicts
• Linear addressing, stride == 8 (s=8)
[Figure: with stride 2, pairs of threads land in the same bank; with stride 8, eight threads land in each of banks 0 and 8]

__shared__ float shared[256];
float foo = shared[2 * threadIdx.x];

__shared__ float shared[256];
float foo = shared[8 * threadIdx.x];
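When a power-of-two stride like the ones above comes from walking a column of a 2D shared array, padding each row by one element is the usual fix. A sketch for a single 16 x 16 tile transpose; the kernel name and tile size are illustrative choices.

__global__ void transpose16(const float *in, float *out)
{
    __shared__ float tile[16][17];      // 17 = 16 + 1: the padding column shifts
                                        // successive rows into different banks

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 16 + x];        // row-wise write: conflict-free
    __syncthreads();

    out[y * 16 + x] = tile[x][y];       // column-wise read: with a row stride of 16 all
                                        // 16 reads would hit one bank; with 17 they do not
}

// Launched as transpose16<<<1, dim3(16, 16)>>>(d_in, d_out);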
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
New-generation Fermi GPU
Global Memory
Global memory is not cached on G80/GT200
• Constant memory and texture memory have small caches
Access latency: 400-600 clock cycles
Very easily becomes the performance bottleneck
• Optimizing it is the key to performance!
Coalesced Global Memory Accesses
Non-Coalesced Global Memory Accesses
Non-Coalesced Global Memory Accesses
Coalescing on 1.2 and Higher Devices
Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment of size equal to:
• 32 bytes if all threads access 8-bit words
• 64 bytes if all threads access 16-bit words
• 128 bytes if all threads access 32-bit or 64-bit words
Any pattern of addresses requested by the half-warp qualifies
• Including patterns where multiple threads access the same address
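Two access patterns, sketched for contrast (the kernel names are illustrative): on compute 1.2+ devices the first is served by one segment transaction per half-warp, while the second touches a separate segment for almost every thread once the stride grows.

__global__ void copy_coalesced(const float *in, float *out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = in[idx];              // consecutive threads read consecutive 32-bit words
}

__global__ void copy_strided(const float *in, float *out, int N, int stride)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx * stride < N)
        out[idx] = in[idx * stride];     // consecutive threads are `stride` words apart
}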
Example of New Coalescing Rules
[Figure: threads 0 to 15 of a half-warp access consecutive 32-bit words at addresses 116 through 188, straddling Segment 0 (128B) and Segment 1 (128B); one of the two segment transactions is reduced to 32B]
Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
New-generation Fermi GPU
Downloading the CUDA Software
http://www.nvidia.cn/object/cuda_get_cn.html
CUDA driver
• The hardware driver
CUDA toolkit
• The tool set
CUDA SDK
• Example programs and dynamic link libraries
CUDA Visual Profiler
• Program profiling tool
[Figure: software stack. The Application sits on the CUDA Libraries (CUFFT & CUBLAS), CUDA Runtime Libraries, and CUDA Driver, spanning the CPU (Host) and the GPU (Device)]
Compiling a CUDA Program
CUDA source files are processed by nvcc
• nvcc is a compiler driver
nvcc outputs:
• PTX (Parallel Thread eXecution)
  > A virtual ISA for multiple GPU hardware generations
  > Just-in-time compiled by the CUDA runtime
• GPU binary
  > A device-specific binary object
• Standard C code
  > With explicit parallelism
[Figure: a C/C++ CUDA application is compiled by NVCC into PTX code plus C/C++ CPU code; generic PTX is JIT-compiled by the CUDA runtime, while a specialized CUDA binary targets a specific GPU (G80, GT200, or other GPUs)]
DEBUG
make dbg=1
• The CPU code is compiled in debug mode
• Can be run under a debugger (e.g. gdb, Visual Studio)
• But intermediate results of the GPU code cannot be inspected
make emu=1
• Runs sequentially on the CPU in emulation mode
• printf() can be used to print intermediate results
• Essentially sequential execution
• But cannot reproduce races between threads
• Floating-point results may show small differences
Checking Resource Usage
Compile with the -cubin flag
Examine the "code" section of the .cubin file:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
    name = BlackScholesGPU
    lmem = 0       <- per-thread local memory
    smem = 68      <- per-thread-block shared memory
    reg = 20       <- per-thread registers
    bar = 0
    bincode {
        0xa0004205 0x04200780 0x40024c09 0x00200780
        …
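If you would rather not read the .cubin by hand, ptxas can report the same per-kernel numbers at compile time; a hedged alternative (the flag is a standard nvcc option, but the exact wording of the report varies with the toolkit version):

nvcc --ptxas-options=-v foo.cu -o foo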
CUDA Debugger: cuda-gdb
Released with CUDA 2.2
• A ported version of the GNU Debugger, gdb
• Red Hat Enterprise Linux 5.x, 32-bit and 64-bit
Compiling with debug support
• nvcc -g -G foo.cu -o foo
Single-step individual warps ("next" or "step")
• Advances all threads in the same warp
Display device memory in the device kernel
• Data that resides in various GPU memory regions, such as shared, local, and global memory
Switch to any CUDA block/thread
• thread <<<(BX,BY),(TX,TY,TZ)>>>
Breaking into running applications
• Ctrl+C to break into hanging programs
"Nexus" GPU/CPU Development Suite
Major components
• Nexus Debugger
  > Source code debugger for GPU source code
  > CUDA, DirectCompute, HLSL, …
• Nexus Analyzer
  > System-wide event viewer for both GPU & CPU events
• Nexus Graphics Inspector
  > For frame-based, deep inspection of textures and geometry
Full integration with Visual Studio
Windows 7/Vista
Available on Oct. 29, 2009
Outline
From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
New-generation Fermi GPU
3 Major Generations of CUDA GPUs

GPU                                         G80                 GT200               GT300 (Fermi)
CUDA cores                                  128                 240                 512
Process (nm)                                90                  65                  40
Transistors                                 681 million         1.4 billion         3.0 billion
Double-precision floating-point capability  None                30 FMA ops/clock    256 FMA ops/clock
Single-precision floating-point capability  128 MAD ops/clock   240 MAD ops/clock   512 MAD ops/clock
Warp schedulers / SM                        1                   1                   2
Special function units / SM                 2                   2                   4
CUDA cores / SM                             8                   8                   32
Shared memory / SM                          16KB                16KB                Configurable 48KB or 16KB
L1 cache / SM                               None                None                Configurable 16KB or 48KB
L2 cache                                    None                None                768KB
Concurrent kernels                          1                   1                   Up to 16
Load/store memory space                     32-bit              32-bit              64-bit
Fermi GPU Architecture
[Figure: Fermi die layout with CUDA cores (SPs), LD/ST units, special function units, thread schedulers, GDDR5 DRAM interfaces, and the 768KB L2 cache]
Third-Generation Streaming Multiprocessor
32 CUDA cores (SPs) per SM, 4x over GT200
8x the peak double-precision floating-point performance of GT200
Dual warp scheduler that schedules and dispatches two warps of 32 threads
Memory
• 16 x 128KB = 2048KB register file
• 16 x 64KB = 1024KB L1 cache / shared memory
  > Configurable partitioning
• 768KB L2 cache
Dual Warp Scheduler
[Figure: two warp schedulers, each with its own instruction dispatch unit, issue instructions from different warps side by side over time, e.g. warp 8 instruction 11 together with warp 9 instruction 11, then warp 2 instruction 42 with warp 3 instruction 33, and so on]
Second-Generation Parallel Thread Execution ISA
64-bit memory address space, with ECC
Unified address space with full C++ support
Optimized execution of OpenCL and DirectCompute
Full IEEE 754-2008 single- and double-precision floating-point numbers
NVIDIA GigaThread™ Engine
Multi-kernel execution
• 10x faster application context switching
• Concurrent kernel execution
• Out-of-order thread block execution
Two streaming transfer engines
• Work in a pipelined, overlapped manner
• Each could saturate the PCIe interface
Performance