CUDA Massively Parallel Programming (CUDA 超大规模并行程序设计)


Page 1: CUDA  超大规模并行程序设计

CUDA Massively Parallel Programming
Kaiyong Zhao (赵开勇)

[email protected]
http://www.comp.hkbu.edu.hk/~kyzhao
http://blog.csdn.net/openhero

Department of Computer Science, Hong Kong Baptist University
GPU High-Performance Development Consultant, Inspur (浪潮)

Page 2: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 3: CUDA  超大规模并行程序设计

Graphic Processing Unit (GPU)

A dedicated graphics display device for personal computers, workstations, and game consoles

Discrete graphics cards
• nVidia and ATI (now AMD) are the main manufacturers
  > Intel planned to enter this market with Larrabee
Integrated on the motherboard
• Intel

Page 4: CUDA  超大规模并行程序设计

The 3D graphics pipeline

A typical frame: 1M triangles, 3M vertices, 25M fragments

[Pipeline diagram: CPU feeds the GPU; Vertex Processor, Rasterizer, Fragment Processor (with Texture memory), Framebuffer]

At 30 frames/s: 30M triangles/s, 90M vertices/s, 750M fragments/s

Page 5: CUDA  超大规模并行程序设计

Traditional GPU architecture

Graphics program -> Vertex processors -> Fragment processors -> Pixel operations -> Output image

Page 6: CUDA  超大规模并行程序设计

The GPU's formidable computing power

[Chart: memory bandwidth (GB/s), 2003-2007; the GPU curve (NV30, NV40, G71, G80, G80 Ultra) rises far faster than the CPU curve (Northwood, Prescott EE, Woodcrest, Harpertown)]

• Data-level parallelism: computational uniformity
• Dedicated memory channels
• Effectively hides memory latency

Page 7: CUDA  超大规模并行程序设计


General Purpose Computing on GPU (GPGPU)

Page 8: CUDA  超大规模并行程序设计

GPGPU

Core idea: describe general-purpose computations in a graphics language
Map the data onto vertex or fragment processors

But:
• hardware resources are used inefficiently
• memory access patterns are severely restricted
• difficult to debug and trace errors
• requires advanced graphics-processing and programming skills

Page 9: CUDA  超大规模并行程序设计

G80 GPU

[Block diagram: Host, Input Assembler, Vtx / Geom / Pixel Thread Issue, Setup / Rstr / ZCull, feeding a thread processor array of streaming multiprocessors (each with SP pairs, L1 cache, and texture filtering (TF) units), backed by L2 caches and framebuffer (FB) partitions]

Streaming Multiprocessor (SM)

Streaming Processor (SP)

Page 10: CUDA  超大规模并行程序设计

CUDA: Compute Unified Device Architecture

CUDA: integrated CPU + GPU C applications; a general-purpose parallel computing model

Single-instruction, multiple-data (SIMD) execution
• all threads execute the same code (1000s of threads on the fly)
• massive parallel computing resources process different data

Hiding memory latency
• raise the computation/communication ratio
• coalesce memory accesses to adjacent addresses
• fast thread switching: 1 cycle on the GPU vs. ~1000 cycles on the CPU

Page 11: CUDA  超大规模并行程序设计

Evolution of CUDA-Enabled GPUs

Compute 1.0: basic CUDA compatibility (G80)
Compute 1.1: asynchronous memory copies and atomic global memory operations (G84, G86, G92, G94, G96, and G98)
Compute 1.2: dramatically improved memory coalescing rules, double the register count, intra-warp voting primitives, atomic shared memory operations (GT21x)
Compute 1.3: double precision (GT200)
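As a quick way to check which of these compute capabilities a given card supports, here is a minimal sketch (not from the slides) that uses the standard CUDA runtime call cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor give the compute capability, e.g. 1.3 for GT200
        printf("Device %d: %s, compute %d.%d, %d SMs\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}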

Page 12: CUDA  超大规模并行程序设计

CUDA success stories

Page 13: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 14: CUDA  超大规模并行程序设计

Dimensions of parallelism

1D: y = a + b          // y, a, b are vectors
2D: P = M N            // P, M, N are matrices
3D: CT or MRI imaging

  a[0] a[1] ... a[n]
+ b[0] b[1] ... b[n]
= y[0] y[1] ... y[n]

Page 15: CUDA  超大规模并行程序设计

Parallel thread organization

Thread: the basic unit of parallelism
Thread block: a group of cooperating threads
  a Cooperative Thread Array (CTA)
  threads may synchronize with one another
  threads exchange data through fast shared memory
  organized in 1, 2, or 3 dimensions
  contains at most 512 threads
Grid: a set of thread blocks
  organized in 1 or 2 dimensions
  shares global memory
Kernel: the core program executed on the GPU
  one kernel, one grid

[Diagram: the host launches Kernel 1 on Grid 1 (a 3x2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5x3 array of threads, Thread (0,0) through Thread (4,2)]

Page 16: CUDA  超大规模并行程序设计

Parallel Program Organization in CUDA

Software        Hardware
Thread          SP (streaming processor)
Thread block    SM (streaming multiprocessor)
Grid            GPU (multiple TPCs, each containing several SMs)

Page 17: CUDA  超大规模并行程序设计

Parallel thread execution

Calling a kernel function requires specifying an execution configuration
Threads and blocks have IDs
  threadIdx: 1D, 2D, or 3D
  blockIdx: 1D or 2D
  these determine which data each thread processes

__global__ void kernel(...);
dim3 DimGrid(3, 2);     // 6 thread blocks
dim3 DimBlock(16, 16);  // 256 threads per block
kernel<<<DimGrid, DimBlock>>>(...);
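With a 2D configuration like the one above, each thread usually derives the coordinates of its data element from blockIdx, blockDim, and threadIdx. A minimal sketch (the kernel name and bounds are illustrative, not from the slides):

__global__ void kernel2D(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;                 // process exactly one element
}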

Page 18: CUDA  超大规模并行程序设计

Example 1: Element-Wise Addition

// CPU program
// sum of two vectors a and b
void add_cpu(float *a, float *b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] += b[idx];
}

void main()
{
    .....
    add_cpu(a, b, N);
}

// CUDA program
// sum of two vectors a and b
__global__ void add_gpu(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] += b[idx];
}

void main()
{
    .....
    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);   // ceil(N / 256) thread blocks
    add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
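The CUDA version above assumes a and b already reside in device memory. A minimal host-side sketch (names illustrative, error checking omitted) of the allocation, copies, and launch around add_gpu:

void vector_add_on_gpu(float *h_a, const float *h_b, int N)
{
    size_t bytes = N * sizeof(float);
    float *d_a, *d_b;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);
    dim3 dimGrid((N + 255) / 256);
    add_gpu<<<dimGrid, dimBlock>>>(d_a, d_b, N);          // a[idx] += b[idx] on the device

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // result accumulates into a
    cudaFree(d_a);
    cudaFree(d_b);
}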

Page 19: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 20: CUDA  超大规模并行程序设计


CUDA Processing Flow

Page 21: CUDA  超大规模并行程序设计

Parallel thread execution

Within an SM, threads execute in units of warps (a warp = 32 threads)
Threads within a warp execute the same instruction
A half-warp is the basic unit of memory operations

[Diagram: Blocks 0, 1, and 2, each partitioned into warps]

Page 22: CUDA  超大规模并行程序设计

Control Flow

Branch statements within the same warp may follow different instruction paths
Threads on different instruction paths can only execute sequentially
• each pass executes one of the warp's possible paths
• N instruction paths -> 1/N throughput

Only divergence within a warp matters; different instruction paths in different warps are independent of one another

The G80 uses instruction predication to speed up execution

Page 23: CUDA  超大规模并行程序设计

Control Flow

Common case: when the branch condition is a function of the thread ID, divergence is likely.

Example with divergence:
• if (threadIdx.x > 2) { }
  > produces two different instruction paths within the thread block
• branch granularity < warp size
• threads 0, 1, and 2 follow a different instruction path than the rest of the 1st warp

Example without divergence:
• if (threadIdx.x / WARP_SIZE > 2) { }
  > also produces two different instruction paths within the thread block
• branch granularity is a whole multiple of the warp size
• all threads within any given warp follow the same instruction path
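A small kernel sketch (ours, not the slide's) that puts the two conditions side by side:

#define WARP_SIZE 32

__global__ void branch_demo(float *out)
{
    int tid = threadIdx.x;

    // Divergent: within the first warp, threads 0-2 and threads 3-31 take different paths
    if (tid > 2)
        out[tid] = 1.0f;
    else
        out[tid] = -1.0f;

    // Not divergent: the branch granularity is a whole number of warps,
    // so all 32 threads of any given warp take the same path
    if (tid / WARP_SIZE > 2)
        out[tid] += 1.0f;
}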

Page 24: CUDA  超大规模并行程序设计

Thread synchronization: void __syncthreads();

Barrier synchronization
  synchronizes all threads within a thread block
  avoids RAW/WAR/WAW hazards when accessing shared memory

__shared__ float scratch[256];
scratch[threadID] = begin[threadID];
__syncthreads();                    // waits here until every thread has arrived,
int left = scratch[threadID - 1];   // then execution continues with the code below

Page 25: CUDA  超大规模并行程序设计

Deadlock with __syncthreads

Deadlock occurs if some threads have val larger than threshold and others do not:

__global__ void compute(...)
{
    // do some computation for val

    if (val > threshold)
        return;              // these threads never reach the barrier

    __syncthreads();

    // work with val & store it
    return;
}
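One way to keep the barrier safe (a sketch under assumed names, not the slide's code) is to let every thread reach __syncthreads() and merely skip the extra work afterwards:

__global__ void compute_safe(float *data, float threshold)
{
    float val = data[threadIdx.x];        // some computation producing val
    bool skip = (val > threshold);        // remember the condition instead of returning early

    __syncthreads();                      // every thread of the block reaches the barrier

    if (!skip) {
        data[threadIdx.x] = val * 2.0f;   // work with val & store it
    }
}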

Page 26: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
CUDA programming tools
The new-generation Fermi GPU

Page 27: CUDA  超大规模并行程序设计

CUDA language extensions

Declspecs: global, device, shared, local, constant
Keywords: threadIdx, blockIdx, blockDim, gridDim
Intrinsics: __syncthreads
Runtime API: memory, symbol, and execution management
Function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
foo<<<100, 10>>>(parameters);

Page 28: CUDA  超大规模并行程序设计

Memory spaces

R/W per-thread registers: 1-cycle latency
R/W per-thread local memory: slow (registers spilled to global memory)
R/W per-block shared memory: 1-cycle latency, "__shared__", but bank conflicts may drag performance down
R/W per-grid global memory: ~500-cycle latency, "__device__", but coalesced accesses can hide the latency
Read-only per-grid constant and texture memories: ~500-cycle latency, but cached

[Diagram: (Device) Grid; each block has its own shared memory; each thread has its own registers and local memory; all blocks share the per-grid global, constant, and texture memories, which the host can also access]
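A short declaration sketch (illustrative names) showing where the qualifiers of this hierarchy place data:

__constant__ float coeffs[16];           // per-grid constant memory: read-only in kernels, cached
__device__   float global_buf[1024];     // per-grid global memory: ~500-cycle latency

__global__ void memory_spaces_demo(const float *in)
{
    int tid = threadIdx.x;
    float r = in[tid];                   // r lives in a per-thread register (1-cycle latency)
    __shared__ float tile[256];          // per-block shared memory (1-cycle latency)
    tile[tid] = r * coeffs[tid % 16];
    __syncthreads();
    global_buf[tid] = tile[tid];
}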

Page 29: CUDA  超大规模并行程序设计

GPU global memory allocation

cudaMalloc()
  allocates global memory on the device
  two parameters:
  • pointer to the allocated array and the allocation size in bytes
cudaFree()
  frees global memory on the device
  • pointer to the array

int blk_sz = 64;
float* Md;
int size = blk_sz * blk_sz * sizeof(float);

cudaMalloc((void**)&Md, size);
…
cudaFree(Md);

Page 30: CUDA  超大规模并行程序设计

Host-device data exchange

cudaMemcpy()
  memory data transfer
  requires four parameters:
  • pointer to destination
  • pointer to source
  • number of bytes copied
  • type of transfer
    > host to host, host to device, device to host, device to device

cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);

cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);
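cudaMalloc, cudaMemcpy, and cudaFree all return a cudaError_t. A hedged sketch (the macro is ours, not part of the CUDA API) of the usual way to check those return codes:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message when a CUDA runtime call fails
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice));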

Page 31: CUDA  超大规模并行程序设计

CUDA function qualifiers

__global__ defines a kernel function; it must return void
A __device__ function cannot have its address taken with the & operator, and does not support recursion, static variables, or variable-length argument lists

                                   Executed on    Callable from
__device__ float DeviceFunc()      device         device
__global__ void  KernelFunc()      device         host
__host__   float HostFunc()        host           host
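A minimal sketch (names are ours) that uses all three qualifiers together:

__device__ float square(float x)                 // device function: callable only from device code
{
    return x * x;
}

__global__ void square_all(float *data, int n)   // kernel: runs on the device, launched from the host
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = square(data[idx]);
}

__host__ void launch_square_all(float *d_data, int n)   // ordinary host function
{
    square_all<<<(n + 255) / 256, 256>>>(d_data, n);
}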

Page 32: CUDA  超大规模并行程序设计

CUDA math functions

pow, sqrt, cbrt, hypot, exp, exp2, expm1, log, log2, log10, log1p, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, ceil, floor, trunc, round, etc.
Only scalar operands are supported
Many functions have a faster, less accurate counterpart
• prefixed with "__", e.g. __sinf()
• the compiler switch -use_fast_math forces the object code to use these versions
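A tiny sketch of the trade-off (illustrative):

__global__ void fast_math_demo(float *out, float x)
{
    out[0] = sinf(x);      // full-precision single-precision sine
    out[1] = __sinf(x);    // faster, less accurate intrinsic
    // compiling with -use_fast_math maps sinf() onto __sinf() automatically
}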

Page 33: CUDA  超大规模并行程序设计

Example 2: Matrix multiplication

Matrix data type (not part of CUDA!)
  single-precision floats
  width x height elements
  matrix elements stored in 'elements'
  a 1-D array holds the matrix data, in row-major order

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

[Diagram: matrices A, B, and the product C; A's width equals B's height]

Page 34: CUDA  超大规模并行程序设计

Example 2: Matrix multiplication

C = A x B, each of size WIDTH x WIDTH
One thread computes one element of the matrix
Simplification: assume WIDTH x WIDTH < 512
• only a single thread block is needed
Each thread loads one row of A and one column of B, and performs one multiplication and one addition for each pair of corresponding elements of A and B

[Diagram: C = A x B; all three matrices are WIDTH x WIDTH]

Page 35: CUDA  超大规模并行程序设计

CUDA Implementation – Host Side

// Matrix multiplication on the device
void Mul(const Matrix A, const Matrix B, Matrix C)
{
    int size = A.width * A.width * sizeof(float);

    // Load A and B to the device
    float *Ad, *Bd, *Cd;
    cudaMalloc((void**)&Ad, size);   // matrix stored in linear order
    cudaMemcpy(Ad, A.elements, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    cudaMalloc((void**)&Cd, size);

Page 36: CUDA  超大规模并行程序设计

CUDA Implementation – Host Side

    // Launch the device computation threads!
    dim3 dimGrid(1);
    dim3 dimBlock(A.width, A.width);
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, A.width);

    // Read C back from the device
    cudaMemcpy(C.elements, Cd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

Page 37: CUDA  超大规模并行程序设计

CUDA Implementation – Kernel

// Matrix multiplication kernel – thread specification
__global__ void Muld(float* Ad, float* Bd, float* Cd, int width)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // cvalue is used to store the element of the matrix
    // that is computed by the thread
    float cvalue = 0;

Page 38: CUDA  超大规模并行程序设计

CUDA Implementation – Kernel

[Diagram: thread (tx, ty) walks along row ty of A and column tx of B]

    for (int k = 0; k < width; ++k) {
        float ae = Ad[ty * width + k];
        float be = Bd[tx + k * width];
        cvalue += ae * be;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    Cd[ty * width + tx] = cvalue;
}
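The kernel above relies on a single thread block (WIDTH x WIDTH < 512 threads). A hedged sketch of how the same computation scales to larger matrices by tiling the grid with 16x16 blocks (the block size is our choice, not the slide's):

__global__ void MuldLarge(float* Ad, float* Bd, float* Cd, int width)
{
    // Global row and column of the element this thread computes
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= width || col >= width)
        return;

    float cvalue = 0;
    for (int k = 0; k < width; ++k)
        cvalue += Ad[row * width + k] * Bd[k * width + col];
    Cd[row * width + col] = cvalue;
}

// Launch:
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid((width + 15) / 16, (width + 15) / 16);
//   MuldLarge<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, width);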

Page 39: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 40: CUDA  超大规模并行程序设计

Shared Memory

Located inside each streaming multiprocessor
Shared by all threads of one thread block
Entirely software-managed
Accessing an address takes only 1 clock cycle

Page 41: CUDA  超大规模并行程序设计

Shared memory organization

The G80's shared memory is organized into 16 banks
  addressed in 4-byte words
  bank ID = (4-byte word address) % 16
  adjacent 4-byte addresses map to adjacent banks
  each bank provides 4 bytes per clock cycle
Simultaneous accesses to the same bank cause a bank conflict
  conflicting accesses can only be processed sequentially
  this applies only to threads within the same thread block

[Diagram: Bank 0 holds words 0, 16, 32, ...; Bank 1 holds 1, 17, 33, ...; Bank 2 holds 2, 18, 34, ...; Bank 15 holds 15, 31, 47, ...]

Page 42: CUDA  超大规模并行程序设计

Bank addressing examples: no bank conflicts

No bank conflicts: linear addressing, stride == 1 (s=1)
No bank conflicts: random 1:1 permutation

[Diagram: threads 0-15 map one-to-one onto banks 0-15 in both cases]

__shared__ float shared[256];
float foo = shared[threadIdx.x];

Page 43: CUDA  超大规模并行程序设计

Bank addressing examples: bank conflicts

2-way bank conflicts: linear addressing, stride == 2 (s=2)
8-way bank conflicts: linear addressing, stride == 8 (s=8)

[Diagram: with stride 2, threads 0-15 map onto only the even banks, two threads per bank; with stride 8, the half-warp maps onto just two banks (0 and 8), eight threads per bank]

__shared__ float shared[256];
float foo = shared[2 * threadIdx.x];

__shared__ float shared[256];
float foo = shared[8 * threadIdx.x];
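A common remedy (not shown on the slide) is to pad a shared 2D tile by one element per row, so that column-wise accesses no longer send many threads of a half-warp to the same bank:

__global__ void bank_padding_demo(float *out)
{
    // 16 banks on G80: without padding, reading a column of a 16x16 tile
    // would send all 16 threads of a half-warp to the same bank.
    __shared__ float tile[16][16 + 1];       // +1 shifts each row to a different bank

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    tile[ty][tx] = (float)(ty * 16 + tx);    // row-wise write, conflict-free
    __syncthreads();
    out[tx * 16 + ty] = tile[tx][ty];        // column-wise read, conflict-free thanks to the padding
}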

Page 44: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 45: CUDA  超大规模并行程序设计

Global Memory

Global memory is not cached on G80/GT200
  constant memory and texture memory have small caches
Access latency: 400-600 clock cycles
Very easily becomes the performance bottleneck
  optimizing global memory access is the key to performance!

Page 46: CUDA  超大规模并行程序设计


Coalesced Global Memory Accesses

Page 47: CUDA  超大规模并行程序设计


Non-Coalesced Global Memory Accesses

Page 48: CUDA  超大规模并行程序设计


Non-Coalesced Global Memory Accesses

Page 49: CUDA  超大规模并行程序设计

Coalescing on 1.2 and Higher Devices

Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment of size:
• 32 bytes if all threads access 8-bit words
• 64 bytes if all threads access 16-bit words
• 128 bytes if all threads access 32-bit or 64-bit words

Any pattern of addresses requested by the half-warp qualifies, including patterns where multiple threads access the same address.
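A minimal sketch (array names illustrative) contrasting an access pattern that coalesces under these rules with one that does not:

__global__ void coalesced_read(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive 32-bit words: a half-warp's
    // requests fall into a single segment and are coalesced.
    out[idx] = in[idx];
}

__global__ void strided_read(const float *in, float *out, int stride)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // With a large stride the half-warp's requests scatter across many
    // segments, so several memory transactions are issued instead of one.
    out[idx] = in[idx * stride];
}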

Page 50: CUDA  超大规模并行程序设计

Example of New Coalescing Rules

[Diagram: a half-warp (threads 0-15) accesses consecutive 32-bit words that straddle the boundary between two 128B segments (segment 0: addresses 0-124, segment 1: addresses 128-252); two transactions are issued, and the one for segment 0 is reduced to 32B]

Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.

Page 51: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 52: CUDA  超大规模并行程序设计

Downloading the CUDA software

http://www.nvidia.cn/object/cuda_get_cn.html

CUDA driver: the hardware driver
CUDA toolkit: the development toolkit
CUDA SDK: example programs and dynamic-link libraries
CUDA Visual Profiler: profiling tool

[Software stack: Application, CUDA Libraries (CUFFT & CUBLAS), CUDA Runtime Library, CUDA Driver on the CPU (host), driving the GPU (device)]

Page 53: CUDA  超大规模并行程序设计

Compiling CUDA programs

CUDA source files are processed by nvcc
  nvcc is a compiler driver
nvcc outputs:
  PTX (Parallel Thread eXecution)
  • a virtual ISA covering multiple GPU generations
  • just-in-time compiled by the CUDA runtime
  GPU binary
  • a device-specific binary object
  Standard C code
  • with explicit parallelism

[Compilation flow: C/C++ CUDA application goes through NVCC, producing generic PTX code (JIT-compiled by the CUDA runtime for G80, GT200, and other GPUs), specialized CUDA binaries, and C/C++ CPU code]

Page 54: CUDA  超大规模并行程序设计

DEBUG

make dbg=1
  the CPU code is compiled in debug mode
  it can be run under a debugger (e.g. gdb, Visual Studio)
  • but intermediate results of the GPU code cannot be inspected

make emu=1
  runs sequentially on the CPU in emulation mode
  printf() can be used to print intermediate results
  • essentially sequential execution
  • races between threads cannot be reproduced
  • floating-point results may differ slightly

Page 55: CUDA  超大规模并行程序设计

Checking resource usage

Compile with the -cubin flag and examine the "code" section of the .cubin file:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
    name = BlackScholesGPU
    lmem = 0      // per-thread local memory
    smem = 68     // per-thread-block shared memory
    reg = 20      // per-thread registers
    bar = 0
    bincode {
        0xa0004205 0x04200780 0x40024c09 0x00200780
        …

Page 56: CUDA  超大规模并行程序设计

CUDA Debugger: cuda-gdb

Released with CUDA 2.2
A ported version of the GNU Debugger, gdb
Red Hat Enterprise Linux 5.x, 32-bit and 64-bit

Compiling with debug support:
  nvcc -g -G foo.cu -o foo
Single-step individual warps ("next" or "step")
  advances all threads in the same warp
Display device memory in the device kernel
  data residing in the various GPU memory regions, such as shared, local, and global memory
Switch to any CUDA block/thread:
  thread <<<(BX,BY),(TX,TY,TZ)>>>
Breaking into running applications:
  Ctrl+C to break into hanging programs

Page 57: CUDA  超大规模并行程序设计

"Nexus" GPU/CPU Development Suite

Major components
  Nexus Debugger
  • source-code debugger for GPU source code
  • CUDA, DirectCompute, HLSL, …
  Nexus Analyzer
  • system-wide event viewer for both GPU & CPU events
  Nexus Graphics Inspector
  • for frame-based, deep inspection of textures and geometry

Full integration with Visual Studio
Windows 7/Vista
Available on Oct. 29, 2009

Page 58: CUDA  超大规模并行程序设计

Outline

From GPGPU to CUDA
Parallel program organization
Parallel execution model
CUDA basics
Memory
  Shared memory
  Global memory
CUDA programming tools
The new-generation Fermi GPU

Page 59: CUDA  超大规模并行程序设计

3 Major Generations of CUDA GPUs

                                     G80                 GT200               GT300 (Fermi)
CUDA cores                           128                 240                 512
Process (nm)                         90                  65                  40
Transistors                          681 million         1.4 billion         3.0 billion
Double-precision floating point      none                30 FMA ops/clock    256 FMA ops/clock
Single-precision floating point      128 MAD ops/clock   240 MAD ops/clock   512 MAD ops/clock
Warp schedulers / SM                 1                   1                   2
Special function units / SM          2                   2                   4
CUDA cores / SM                      8                   8                   32
Shared memory / SM                   16KB                16KB                configurable 48KB or 16KB
L1 cache / SM                        none                none                configurable 16KB or 48KB
L2 cache                             none                none                768KB
Concurrent kernels                   1                   1                   up to 16
Load/store address space             32-bit              32-bit              64-bit

Page 60: CUDA  超大规模并行程序设计

Fermi GPU architecture

[Block diagram, labeling: CUDA core (SP), LD/ST unit, special function unit, thread scheduler, GDDR5 DRAM interface, 768KB L2 cache]

Page 61: CUDA  超大规模并行程序设计

Third-Generation Streaming Multiprocessor

32 CUDA cores (SPs) per SM, 4x over GT200
8x the peak double-precision floating-point performance of GT200
Dual warp schedulers that schedule and dispatch two warps of 32 threads each
Memory
  16 x 128KB = 2048KB register file
  16 x 64KB of combined L1 cache / shared memory
  • configurable partitioning
  768KB L2 cache

Page 62: CUDA  超大规模并行程序设计

Dual Warp Scheduler

[Diagram: two independent warp schedulers, each with its own instruction dispatch unit, issuing over time:
  Scheduler 0: Warp 8 instruction 11, Warp 2 instruction 42, Warp 14 instruction 95, Warp 8 instruction 12, Warp 14 instruction 96, Warp 2 instruction 43
  Scheduler 1: Warp 9 instruction 11, Warp 3 instruction 33, Warp 15 instruction 95, Warp 9 instruction 12, Warp 3 instruction 34, Warp 15 instruction 96]

Page 63: CUDA  超大规模并行程序设计

Second-Generation Parallel Thread Execution ISA

64-bit memory address space, with ECC
Unified address space with full C++ support
Optimized execution of OpenCL and DirectCompute
Full IEEE 754-2008 single- and double-precision floating point

Page 64: CUDA  超大规模并行程序设计

NVIDIA GigaThread™ Engine

Multi-kernel execution
  10x faster application context switching
  concurrent kernel execution
  out-of-order thread block execution
Two streaming transfer engines (see the sketch below)
  work in a pipelined, overlapped manner
  each could saturate the PCIe interface
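A hedged sketch (not from the slides) of the kind of pipelined, overlapped transfer and execution this enables, using two CUDA streams and pinned host memory:

__global__ void process(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// h_buf is assumed to be allocated with cudaMallocHost() (page-locked),
// which is required for cudaMemcpyAsync to overlap with kernel execution.
void run_overlapped(float *h_buf, int n)
{
    int half = n / 2;
    size_t bytes = half * sizeof(float);
    float *d0, *d1;
    cudaMalloc((void**)&d0, bytes);
    cudaMalloc((void**)&d1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d0, h_buf,        bytes, cudaMemcpyHostToDevice, s0);
    process<<<(half + 255) / 256, 256, 0, s0>>>(d0, half);
    cudaMemcpyAsync(d1, h_buf + half, bytes, cudaMemcpyHostToDevice, s1);  // overlaps stream 0
    process<<<(half + 255) / 256, 256, 0, s1>>>(d1, half);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaFree(d0);  cudaFree(d1);
    cudaStreamDestroy(s0);  cudaStreamDestroy(s1);
}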

Page 65: CUDA  超大规模并行程序设计


Performance

Page 66: CUDA  超大规模并行程序设计

References (books)

1. NVIDIA, CUDA Programming Guide, NVIDIA, 2008, 2009.
2. 张舒、褚艳利、赵开勇、张钰勃, GPU高性能运算之CUDA (High-Performance GPU Computing with CUDA), 2009.
3. http://www.hpctech.com/
4. T. Mattson et al., Patterns for Parallel Programming, Addison-Wesley, 2005.

Page 67: CUDA  超大规模并行程序设计

References: special issues on GPU computing

IEEE, Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
ACM Queue, Vol. 6, No. 2, March/April 2008.
Elsevier, Journal of Parallel and Distributed Computing, Vol. 68, No. 10, October 2008.