CUDA Massively Parallel Programming (CUDA 超大规模并行程序设计)


Page 1: CUDA  超大规模并行程序设计

CUDA Massively Parallel Programming
Kaiyong Zhao (赵开勇)

[email protected]
http://www.comp.hkbu.edu.hk/~kyzhao
http://blog.csdn.net/openhero

Department of Computer Science, Hong Kong Baptist University
GPU High-Performance Development Consultant, Inspur (浪潮)

Page 2: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 3: CUDA  超大规模并行程序设计

Graphic Processing Unit (GPU)

A dedicated graphics display device for personal computers, workstations, and game consoles

Discrete graphics cards
• nVidia and ATI (now AMD) are the main manufacturers
  > Intel planned to enter this market with Larrabee
Integrated on the motherboard
• Intel

Page 4: CUDA  超大规模并行程序设计

The 3D graphics pipeline

A typical frame: 1M triangles, 3M vertices, 25M fragments

[Pipeline diagram: CPU feeds the GPU; Vertex Processor, Rasterizer, Fragment Processor (with Texture memory), Framebuffer]

At 30 frames/s: 30M triangles/s, 90M vertices/s, 750M fragments/s

Page 5: CUDA  超大规模并行程序设计

Traditional GPU architecture

Graphics program -> Vertex processors -> Fragment processors -> Pixel operations -> Output image

Page 6: CUDA  超大规模并行程序设计

The GPU's formidable computing power

[Chart: memory bandwidth (GB/s), 2003-2007; the GPU curve (NV30, NV40, G71, G80, G80 Ultra) rises far faster than the CPU curve (Northwood, Prescott EE, Woodcrest, Harpertown)]

• Data-level parallelism: computational uniformity
• Dedicated memory channels
• Effectively hides memory latency

Page 7: CUDA  超大规模并行程序设计


General Purpose Computing on GPU (GPGPU)

Page 8: CUDA  超大规模并行程序设计

GPGPU

Core idea: describe general-purpose computations in a graphics language
Map the data onto vertex or fragment processors

But:
• hardware resources are used inefficiently
• memory access patterns are severely restricted
• difficult to debug and trace errors
• requires advanced graphics-processing and programming skills

Page 9: CUDA  超大规模并行程序设计

G80 GPU

[Block diagram: Host, Input Assembler, Vtx / Geom / Pixel Thread Issue, Setup / Rstr / ZCull, feeding a thread processor array of streaming multiprocessors (each with SP pairs, L1 cache, and texture filtering (TF) units), backed by L2 caches and framebuffer (FB) partitions]

Streaming Multiprocessor (SM)

Streaming Processor (SP)

Page 10: CUDA  超大规模并行程序设计

CUDA: Compute Unified Device Architecture

CUDA: integrated CPU + GPU C applications; a general-purpose parallel computing model

Single-instruction, multiple-data (SIMD) execution
• all threads execute the same code (1000s of threads on the fly)
• massive parallel computing resources process different data

Hiding memory latency
• raise the computation/communication ratio
• coalesce memory accesses to adjacent addresses
• fast thread switching: 1 cycle on the GPU vs. ~1000 cycles on the CPU

Page 11: CUDA  超大规模并行程序设计

Evolution of CUDA-Enabled GPUs

Compute 1.0: basic CUDA compatibility (G80)
Compute 1.1: asynchronous memory copies and atomic global memory operations (G84, G86, G92, G94, G96, and G98)
Compute 1.2: dramatically improved memory coalescing rules, double the register count, intra-warp voting primitives, atomic shared memory operations (GT21x)
Compute 1.3: double precision (GT200)
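As a quick way to check which of these compute capabilities a given card supports, here is a minimal sketch (not from the slides) that uses the standard CUDA runtime call cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor give the compute capability, e.g. 1.3 for GT200
        printf("Device %d: %s, compute %d.%d, %d SMs\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}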

Page 12: CUDA  超大规模并行程序设计

CUDA success stories

Page 13: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 14: CUDA  超大规模并行程序设计

Dimensions of parallelism

1D: y = a + b          // y, a, b are vectors
2D: P = M N            // P, M, N are matrices
3D: CT or MRI imaging

  a[0] a[1] ... a[n]
+ b[0] b[1] ... b[n]
= y[0] y[1] ... y[n]

Page 15: CUDA  超大规模并行程序设计

Parallel thread organization

Thread: the basic unit of parallelism
Thread block: a group of cooperating threads
  a Cooperative Thread Array (CTA)
  threads may synchronize with one another
  threads exchange data through fast shared memory
  organized in 1, 2, or 3 dimensions
  contains at most 512 threads
Grid: a set of thread blocks
  organized in 1 or 2 dimensions
  shares global memory
Kernel: the core program executed on the GPU
  one kernel, one grid

[Diagram: the host launches Kernel 1 on Grid 1 (a 3x2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5x3 array of threads, Thread (0,0) through Thread (4,2)]

Page 16: CUDA  超大规模并行程序设计

Parallel Program Organization in CUDA

Software        Hardware
Thread          SP (streaming processor)
Thread block    SM (streaming multiprocessor)
Grid            GPU (multiple TPCs, each containing several SMs)

Page 17: CUDA  超大规模并行程序设计

Parallel thread execution

Calling a kernel function requires specifying an execution configuration
Threads and blocks have IDs
  threadIdx: 1D, 2D, or 3D
  blockIdx: 1D or 2D
  these determine which data each thread processes

__global__ void kernel(...);
dim3 DimGrid(3, 2);     // 6 thread blocks
dim3 DimBlock(16, 16);  // 256 threads per block
kernel<<<DimGrid, DimBlock>>>(...);
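With a 2D configuration like the one above, each thread usually derives the coordinates of its data element from blockIdx, blockDim, and threadIdx. A minimal sketch (the kernel name and bounds are illustrative, not from the slides):

__global__ void kernel2D(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;                 // process exactly one element
}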

Page 18: CUDA  超大规模并行程序设计

Example 1: Element-Wise Addition

// CPU program
// sum of two vectors a and b
void add_cpu(float *a, float *b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] += b[idx];
}

void main()
{
    .....
    add_cpu(a, b, N);
}

// CUDA program
// sum of two vectors a and b
__global__ void add_gpu(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] += b[idx];
}

void main()
{
    .....
    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);   // ceil(N / 256) thread blocks
    add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
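The CUDA version above assumes a and b already reside in device memory. A minimal host-side sketch (names illustrative, error checking omitted) of the allocation, copies, and launch around add_gpu:

void vector_add_on_gpu(float *h_a, const float *h_b, int N)
{
    size_t bytes = N * sizeof(float);
    float *d_a, *d_b;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);
    add_gpu<<<dimGrid, dimBlock>>>(d_a, d_b, N);          // a[idx] += b[idx] on the device

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // result accumulates into a
    cudaFree(d_a);
    cudaFree(d_b);
}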

Page 19: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 20: CUDA  超大规模并行程序设计


CUDA Processing Flow

Page 21: CUDA  超大规模并行程序设计

Parallel thread execution

Within an SM, threads execute in units of warps (a warp = 32 threads)
Threads within a warp execute the same instruction
A half-warp is the basic unit of memory operations

[Diagram: Blocks 0, 1, and 2, each partitioned into warps]

Page 22: CUDA  超大规模并行程序设计

Control Flow

Branch statements within the same warp may follow different instruction paths
Threads on different instruction paths can only execute sequentially
• each pass executes one of the warp's possible paths
• N instruction paths -> 1/N throughput

Only divergence within a warp matters; different instruction paths in different warps are independent of one another

The G80 uses instruction predication to speed up execution

Page 23: CUDA  超大规模并行程序设计

Control Flow

Common case: when the branch condition is a function of the thread ID, divergence is likely.

Example with divergence:
• if (threadIdx.x > 2) { }
  > produces two different instruction paths within the thread block
• branch granularity < warp size
• threads 0, 1, and 2 follow a different instruction path than the rest of the 1st warp

Example without divergence:
• if (threadIdx.x / WARP_SIZE > 2) { }
  > also produces two different instruction paths within the thread block
• branch granularity is a whole multiple of the warp size
• all threads within any given warp follow the same instruction path
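A small kernel sketch (ours, not the slide's) that puts the two conditions side by side:

#define WARP_SIZE 32

__global__ void branch_demo(float *out)
{
    int tid = threadIdx.x;

    // Divergent: within the first warp, threads 0-2 and threads 3-31 take different paths
    if (tid > 2)
        out[tid] = 1.0f;
    else
        out[tid] = -1.0f;

    // Not divergent: the branch granularity is a whole number of warps,
    // so all 32 threads of any given warp take the same path
    if (tid / WARP_SIZE > 2)
        out[tid] += 1.0f;
}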

Page 24: CUDA  超大规模并行程序设计

Thread synchronization: void __syncthreads();

Barrier synchronization
  synchronizes all threads within a thread block
  avoids RAW/WAR/WAW hazards when accessing shared memory

__shared__ float scratch[256];
scratch[threadID] = begin[threadID];
__syncthreads();                    // waits here until every thread has arrived,
int left = scratch[threadID - 1];   // then execution continues with the code below

Page 25: CUDA  超大规模并行程序设计

Deadlock with __syncthreads

Deadlock occurs if some threads have val larger than threshold and others do not:

__global__ void compute(...)
{
    // do some computation for val

    if (val > threshold)
        return;              // these threads never reach the barrier

    __syncthreads();

    // work with val & store it
    return;
}
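One way to keep the barrier safe (a sketch under assumed names, not the slide's code) is to let every thread reach __syncthreads() and merely skip the extra work afterwards:

__global__ void compute_safe(float *data, float threshold)
{
    float val = data[threadIdx.x];        // some computation producing val
    bool skip = (val > threshold);        // remember the condition instead of returning early

    __syncthreads();                      // every thread of the block reaches the barrier

    if (!skip) {
        data[threadIdx.x] = val * 2.0f;   // work with val & store it
    }
}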

Page 26: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 27: CUDA  超大规模并行程序设计

CUDA language extensions

Declspecs: global, device, shared, local, constant
Keywords: threadIdx, blockIdx, blockDim, gridDim
Intrinsics: __syncthreads
Runtime API: memory, symbol, and execution management
Function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
foo<<<100, 10>>>(parameters);

Page 28: CUDA  超大规模并行程序设计

Memory spaces

R/W per-thread registers: 1-cycle latency
R/W per-thread local memory: slow (registers spilled to global memory)
R/W per-block shared memory: 1-cycle latency, "__shared__", but bank conflicts may drag performance down
R/W per-grid global memory: ~500-cycle latency, "__device__", but coalesced accesses can hide the latency
Read-only per-grid constant and texture memories: ~500-cycle latency, but cached

[Diagram: (Device) Grid; each block has its own shared memory; each thread has its own registers and local memory; all blocks share the per-grid global, constant, and texture memories, which the host can also access]
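A short declaration sketch (illustrative names) showing where the qualifiers of this hierarchy place data:

__constant__ float coeffs[16];           // per-grid constant memory: read-only in kernels, cached
__device__   float global_buf[1024];     // per-grid global memory: ~500-cycle latency

__global__ void memory_spaces_demo(const float *in)
{
    int tid = threadIdx.x;
    float r = in[tid];                   // r lives in a per-thread register (1-cycle latency)
    __shared__ float tile[256];          // per-block shared memory (1-cycle latency)
    tile[tid] = r * coeffs[tid % 16];
    __syncthreads();
    global_buf[tid] = tile[tid];
}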

Page 29: CUDA  超大规模并行程序设计

GPU global memory allocation

cudaMalloc()
  allocates global memory on the device
  two parameters:
  • pointer to the allocated array and the allocation size in bytes
cudaFree()
  frees global memory on the device
  • pointer to the array

int blk_sz = 64;
float* Md;
int size = blk_sz * blk_sz * sizeof(float);

cudaMalloc((void**)&Md, size);
…
cudaFree(Md);

Page 30: CUDA  超大规模并行程序设计

Host-device data exchange

cudaMemcpy()
  memory data transfer
  requires four parameters:
  • pointer to destination
  • pointer to source
  • number of bytes copied
  • type of transfer
    > host to host, host to device, device to host, device to device

cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);

cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);
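cudaMalloc, cudaMemcpy, and cudaFree all return a cudaError_t. A hedged sketch (the macro is ours, not part of the CUDA API) of the usual way to check those return codes:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message when a CUDA runtime call fails
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice));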

Page 31: CUDA  超大规模并行程序设计

CUDA function qualifiers

__global__ defines a kernel function; it must return void
A __device__ function cannot have its address taken with the & operator, and does not support recursion, static variables, or variable-length argument lists

                                   Executed on    Callable from
__device__ float DeviceFunc()      device         device
__global__ void  KernelFunc()      device         host
__host__   float HostFunc()        host           host
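A minimal sketch (names are ours) that uses all three qualifiers together:

__device__ float square(float x)                 // device function: callable only from device code
{
    return x * x;
}

__global__ void square_all(float *data, int n)   // kernel: runs on the device, launched from the host
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = square(data[idx]);
}

__host__ void launch_square_all(float *d_data, int n)   // ordinary host function
{
    square_all<<<(n + 255) / 256, 256>>>(d_data, n);
}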

Page 32: CUDA  超大规模并行程序设计

CUDA math functions

pow, sqrt, cbrt, hypot, exp, exp2, expm1, log, log2, log10, log1p, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, ceil, floor, trunc, round, etc.
Only scalar operands are supported
Many functions have a faster, less accurate counterpart
• prefixed with "__", e.g. __sinf()
• the compiler switch -use_fast_math forces the object code to use these versions
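A tiny sketch of the trade-off (illustrative):

__global__ void fast_math_demo(float *out, float x)
{
    out[0] = sinf(x);      // full-precision single-precision sine
    out[1] = __sinf(x);    // faster, less accurate intrinsic
    // compiling with -use_fast_math maps sinf() onto __sinf() automatically
}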

Page 33: CUDA  超大规模并行程序设计

Example 2: Matrix multiplication

Matrix data type (not part of CUDA!)
  single-precision floats
  width x height elements
  matrix elements stored in 'elements'
  a 1-D array holds the matrix data, in row-major order

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

[Diagram: matrices A, B, and the product C; A's width equals B's height]

Page 34: CUDA  超大规模并行程序设计

Example 2: Matrix multiplication

C = A x B, each of size WIDTH x WIDTH
One thread computes one element of the matrix
Simplification: assume WIDTH x WIDTH < 512
• only a single thread block is needed
Each thread loads one row of A and one column of B, and performs one multiplication and one addition for each pair of corresponding elements of A and B

[Diagram: C = A x B; all three matrices are WIDTH x WIDTH]

Page 35: CUDA  超大规模并行程序设计

CUDA Implementation – Host Side

// Matrix multiplication on the device
void Mul(const Matrix A, const Matrix B, Matrix C)
{
    int size = A.width * A.width * sizeof(float);

    // Load A and B to the device
    float *Ad, *Bd, *Cd;
    cudaMalloc((void**)&Ad, size);   // matrix stored in linear order
    cudaMemcpy(Ad, A.elements, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    cudaMalloc((void**)&Cd, size);

Page 36: CUDA  超大规模并行程序设计

CUDA Implementation – Host Side

    // Launch the device computation threads!
    dim3 dimGrid(1);
    dim3 dimBlock(A.width, A.width);
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, A.width);

    // Read C back from the device
    cudaMemcpy(C.elements, Cd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

Page 37: CUDA  超大规模并行程序设计

CUDA Implementation – Kernel

// Matrix multiplication kernel – thread specification
__global__ void Muld(float* Ad, float* Bd, float* Cd, int width)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // cvalue is used to store the element of the matrix
    // that is computed by the thread
    float cvalue = 0;

Page 38: CUDA  超大规模并行程序设计

CUDA Implementation – Kernel

[Diagram: thread (tx, ty) walks along row ty of A and column tx of B]

    for (int k = 0; k < width; ++k) {
        float ae = Ad[ty * width + k];
        float be = Bd[tx + k * width];
        cvalue += ae * be;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    Cd[ty * width + tx] = cvalue;
}
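The kernel above relies on a single thread block (WIDTH x WIDTH < 512 threads). A hedged sketch of how the same computation scales to larger matrices by tiling the grid with 16x16 blocks (the block size is our choice, not the slide's):

__global__ void MuldLarge(float* Ad, float* Bd, float* Cd, int width)
{
    // Global row and column of the element this thread computes
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= width || col >= width)
        return;

    float cvalue = 0;
    for (int k = 0; k < width; ++k)
        cvalue += Ad[row * width + k] * Bd[k * width + col];
    Cd[row * width + col] = cvalue;
}

// Launch:
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid((width + 15) / 16, (width + 15) / 16);
//   MuldLarge<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, width);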

Page 39: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 40: CUDA  超大规模并行程序设计

Shared Memory

Located inside each streaming multiprocessor
Shared by all threads of one thread block
Entirely software-managed
Accessing an address takes only 1 clock cycle

Page 41: CUDA  超大规模并行程序设计

Shared memory organization

The G80's shared memory is organized into 16 banks
  addressed in 4-byte words
  bank ID = (4-byte word address) % 16
  adjacent 4-byte addresses map to adjacent banks
  each bank provides 4 bytes per clock cycle
Simultaneous accesses to the same bank cause a bank conflict
  conflicting accesses can only be processed sequentially
  this applies only to threads within the same thread block

[Diagram: Bank 0 holds words 0, 16, 32, ...; Bank 1 holds 1, 17, 33, ...; Bank 2 holds 2, 18, 34, ...; Bank 15 holds 15, 31, 47, ...]

Page 42: CUDA  超大规模并行程序设计

Bank addressing examples: no bank conflicts

No bank conflicts: linear addressing, stride == 1 (s=1)
No bank conflicts: random 1:1 permutation

[Diagram: threads 0-15 map one-to-one onto banks 0-15 in both cases]

__shared__ float shared[256];
float foo = shared[threadIdx.x];

Page 43: CUDA  超大规模并行程序设计

Bank addressing examples: bank conflicts

2-way bank conflicts: linear addressing, stride == 2 (s=2)
8-way bank conflicts: linear addressing, stride == 8 (s=8)

[Diagram: with stride 2, threads 0-15 map onto only the even banks, two threads per bank; with stride 8, the half-warp maps onto just two banks (0 and 8), eight threads per bank]

__shared__ float shared[256];
float foo = shared[2 * threadIdx.x];

__shared__ float shared[256];
float foo = shared[8 * threadIdx.x];
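A common remedy (not shown on the slide) is to pad a shared 2D tile by one element per row, so that column-wise accesses no longer send many threads of a half-warp to the same bank:

__global__ void bank_padding_demo(float *out)
{
    // 16 banks on G80: without padding, reading a column of a 16x16 tile
    // would send all 16 threads of a half-warp to the same bank.
    __shared__ float tile[16][16 + 1];       // +1 shifts each row to a different bank

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    tile[ty][tx] = (float)(ty * 16 + tx);    // row-wise write, conflict-free
    __syncthreads();
    out[tx * 16 + ty] = tile[tx][ty];        // column-wise read, conflict-free thanks to the padding
}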

Page 44: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 45: CUDA  超大规模并行程序设计

Global Memory

Global memory is not cached on G80/GT200
  constant memory and texture memory have small caches
Access latency: 400-600 clock cycles
Very easily becomes the performance bottleneck
  optimizing global memory access is the key to performance!

Page 46: CUDA  超大规模并行程序设计


Coalesced Global Memory Accesses

Page 47: CUDA  超大规模并行程序设计


Non-Coalesced Global Memory Accesses

Page 48: CUDA  超大规模并行程序设计


Non-Coalesced Global Memory Accesses

Page 49: CUDA  超大规模并行程序设计

Coalescing on 1.2 and Higher Devices

Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment of size:
• 32 bytes if all threads access 8-bit words
• 64 bytes if all threads access 16-bit words
• 128 bytes if all threads access 32-bit or 64-bit words

Any pattern of addresses requested by the half-warp qualifies, including patterns where multiple threads access the same address.
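A minimal sketch (array names illustrative) contrasting an access pattern that coalesces under these rules with one that does not:

__global__ void coalesced_read(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive 32-bit words: a half-warp's
    // requests fall into a single segment and are coalesced.
    out[idx] = in[idx];
}

__global__ void strided_read(const float *in, float *out, int stride)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // With a large stride the half-warp's requests scatter across many
    // segments, so several memory transactions are issued instead of one.
    out[idx] = in[idx * stride];
}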

Page 50: CUDA  超大规模并行程序设计

Example of New Coalescing Rules

[Diagram: a half-warp (threads 0-15) accesses consecutive 32-bit words that straddle the boundary between two 128B segments (segment 0: addresses 0-124, segment 1: addresses 128-252); two transactions are issued, and the one for segment 0 is reduced to 32B]

Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.

Page 51: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 52: CUDA  超大规模并行程序设计

Downloading the CUDA software

http://www.nvidia.cn/object/cuda_get_cn.html

CUDA driver: the hardware driver
CUDA toolkit: the development toolkit
CUDA SDK: example programs and dynamic-link libraries
CUDA Visual Profiler: profiling tool

[Software stack: Application, CUDA Libraries (CUFFT & CUBLAS), CUDA Runtime Library, CUDA Driver on the CPU (host), driving the GPU (device)]

Page 53: CUDA  超大规模并行程序设计

Compiling CUDA programs

CUDA source files are processed by nvcc
  nvcc is a compiler driver
nvcc outputs:
  PTX (Parallel Thread eXecution)
  • a virtual ISA covering multiple GPU generations
  • just-in-time compiled by the CUDA runtime
  GPU binary
  • a device-specific binary object
  Standard C code
  • with explicit parallelism

[Compilation flow: C/C++ CUDA application goes through NVCC, producing generic PTX code (JIT-compiled by the CUDA runtime for G80, GT200, and other GPUs), specialized CUDA binaries, and C/C++ CPU code]

Page 54: CUDA  超大规模并行程序设计

DEBUG

make dbg=1
  the CPU code is compiled in debug mode
  it can be run under a debugger (e.g. gdb, Visual Studio)
  • but intermediate results of the GPU code cannot be inspected

make emu=1
  runs sequentially on the CPU in emulation mode
  printf() can be used to print intermediate results
  • essentially sequential execution
  • races between threads cannot be reproduced
  • floating-point results may differ slightly

Page 55: CUDA  超大规模并行程序设计

Checking resource usage

Compile with the -cubin flag and examine the "code" section of the .cubin file:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
    name = BlackScholesGPU
    lmem = 0      // per-thread local memory
    smem = 68     // per-thread-block shared memory
    reg = 20      // per-thread registers
    bar = 0
    bincode {
        0xa0004205 0x04200780 0x40024c09 0x00200780
        …

Page 56: CUDA  超大规模并行程序设计

CUDA Debugger: cuda-gdb

Released with CUDA 2.2
A ported version of the GNU Debugger, gdb
Red Hat Enterprise Linux 5.x, 32-bit and 64-bit

Compiling with debug support:
  nvcc -g -G foo.cu -o foo
Single-step individual warps ("next" or "step")
  advances all threads in the same warp
Display device memory in the device kernel
  data residing in the various GPU memory regions, such as shared, local, and global memory
Switch to any CUDA block/thread:
  thread <<<(BX,BY),(TX,TY,TZ)>>>
Breaking into running applications:
  Ctrl+C to break into hanging programs

Page 57: CUDA  超大规模并行程序设计

"Nexus" GPU/CPU Development Suite

Major components
  Nexus Debugger
  • source-code debugger for GPU source code
  • CUDA, DirectCompute, HLSL, …
  Nexus Analyzer
  • system-wide event viewer for both GPU & CPU events
  Nexus Graphics Inspector
  • for frame-based, deep inspection of textures and geometry

Full integration with Visual Studio
Windows 7/Vista
Available on Oct. 29, 2009

Page 58: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 59: CUDA  超大规模并行程序设计

3 Major Generations of CUDA GPUs

                                     G80                 GT200               GT300 (Fermi)
CUDA cores                           128                 240                 512
Process (nm)                         90                  65                  40
Transistors                          681 million         1.4 billion         3.0 billion
Double-precision floating point      none                30 FMA ops/clock    256 FMA ops/clock
Single-precision floating point      128 MAD ops/clock   240 MAD ops/clock   512 MAD ops/clock
Warp schedulers / SM                 1                   1                   2
Special function units / SM          2                   2                   4
CUDA cores / SM                      8                   8                   32
Shared memory / SM                   16KB                16KB                configurable 48KB or 16KB
L1 cache / SM                        none                none                configurable 16KB or 48KB
L2 cache                             none                none                768KB
Concurrent kernels                   1                   1                   up to 16
Load/store address space             32-bit              32-bit              64-bit

Page 60: CUDA  超大规模并行程序设计

Fermi GPU architecture

[Block diagram, labeling: CUDA core (SP), LD/ST unit, special function unit, thread scheduler, GDDR5 DRAM interface, 768KB L2 cache]

Page 61: CUDA  超大规模并行程序设计

Third-Generation Streaming Multiprocessor

32 CUDA cores (SPs) per SM, 4x over GT200
8x the peak double-precision floating-point performance of GT200
Dual warp schedulers that schedule and dispatch two warps of 32 threads each
Memory
  16 x 128KB = 2048KB register file
  16 x 64KB of combined L1 cache / shared memory
  • configurable partitioning
  768KB L2 cache

Page 62: CUDA  超大规模并行程序设计

Dual Warp Scheduler

[Diagram: two independent warp schedulers, each with its own instruction dispatch unit, issuing over time:
  Scheduler 0: Warp 8 instruction 11, Warp 2 instruction 42, Warp 14 instruction 95, Warp 8 instruction 12, Warp 14 instruction 96, Warp 2 instruction 43
  Scheduler 1: Warp 9 instruction 11, Warp 3 instruction 33, Warp 15 instruction 95, Warp 9 instruction 12, Warp 3 instruction 34, Warp 15 instruction 96]

Page 63: CUDA  超大规模并行程序设计

Second-Generation Parallel Thread Execution ISA

64-bit memory address space, with ECC
Unified address space with full C++ support
Optimized execution of OpenCL and DirectCompute
Full IEEE 754-2008 single- and double-precision floating point

Page 64: CUDA  超大规模并行程序设计

NVIDIA GigaThread™ Engine

Multi-kernel execution
  10x faster application context switching
  concurrent kernel execution
  out-of-order thread block execution
Two streaming transfer engines (see the sketch below)
  work in a pipelined, overlapped manner
  each could saturate the PCIe interface
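A hedged sketch (not from the slides) of the kind of pipelined, overlapped transfer and execution this enables, using two CUDA streams and pinned host memory:

__global__ void process(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// h_buf is assumed to be allocated with cudaMallocHost() (page-locked),
// which is required for cudaMemcpyAsync to overlap with kernel execution.
void run_overlapped(float *h_buf, int n)
{
    int half = n / 2;
    size_t bytes = half * sizeof(float);
    float *d0, *d1;
    cudaMalloc((void**)&d0, bytes);
    cudaMalloc((void**)&d1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d0, h_buf,        bytes, cudaMemcpyHostToDevice, s0);
    process<<<(half + 255) / 256, 256, 0, s0>>>(d0, half);
    cudaMemcpyAsync(d1, h_buf + half, bytes, cudaMemcpyHostToDevice, s1);  // overlaps stream 0
    process<<<(half + 255) / 256, 256, 0, s1>>>(d1, half);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaFree(d0);  cudaFree(d1);
    cudaStreamDestroy(s0);  cudaStreamDestroy(s1);
}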

Page 65: CUDA  超大规模并行程序设计


Performance

Page 66: CUDA  超大规模并行程序设计

References (books)

1. NVIDIA, CUDA Programming Guide, NVIDIA, 2008, 2009.
2. 张舒、褚艳利、赵开勇、张钰勃, GPU高性能运算之CUDA (High-Performance GPU Computing with CUDA), 2009.
3. http://www.hpctech.com/
4. T. Mattson et al., Patterns for Parallel Programming, Addison-Wesley, 2005.

Page 67: CUDA  超大规模并行程序设计

References: special issues on GPU computing

IEEE, Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
ACM Queue, Vol. 6, No. 2, March/April 2008.
Elsevier, Journal of Parallel and Distributed Computing, Vol. 68, No. 10, October 2008.