パフォーマンス解析ツールカーネルや関数（API ）の名称 gputime GPU上で処理に要した時間（ s単位） cputime CPUで処理（=カーネル起動）に要した時間

パフォーマンス解析ツール

長岡技術科学大学電気電子情報工学専攻出川智啓

今回の内容

2015/05/20GPGPU実践プログラミング302

プロファイリング

Compute Profiler (Nsight)

Warpと並列実行の制約

Occupancy Calculator

ベクトル和のプロファイリング

プログラムの挙動


経験的なプログラム高速化手法

スレッド数は128か256 メモリはアドレス順にアクセス

等

プログラム内部で挙動がどのように変化するか

パフォーマンスカウンタ機能による確認




パフォーマンスカウンタの数値を出力

プログラム実行時の挙動の記録

ボトルネックの特定

プロファイラ

コマンドラインで実行するプロファイラ

Compute Profiler nvprof(CUDA 5以降)

GUIから操作するプロファイラ

Compute Visual Profiler Nsight

Compute Profiler


CUDAの実行環境に組み込まれたプロファイラ

環境変数を設定するだけで有効化

使い方

1. 環境変数の設定 $ export CUDA_PROFILE=1

2. プログラムの実行 $./a.out

3. プロファイル結果(標準はcuda_profile_0.log)の確認

Compute Profilerの環境変数


CUDA_PROFILE 1でプロファイル実行，0でプロファイル停止

CUDA_PROFILE_CONFIG プロファイル項目を記述したファイルの名前

CUDA_PROFILE_LOG 出力ファイル名（標準はcuda_profile_0.log）

CUDA_PROFILE_CSV 1で出力ファイルをCSV（カンマ区切り）形式に変更

差分法のプロファイル


$ export CUDA_PROFILE=1

$ nvcc ‐arch=sm_20 differentiate.cu

$ ./a.out cuda_profile_0.logが出力される

$cat cuda_profile_0.log

プロファイル結果の確認


標準で出力される値

method カーネルや関数（API）の名称

gputime GPU上で処理に要した時間（s単位）

cputime CPUで処理（=カーネル起動）に要した時間

実際の実行時間=cputime+gputime occupancy 同時実行されているWarp数と実行可能な大

Warp数の比（一般的には1に近いほどよい）

$cat cuda_profile_0.log# CUDA_PROFILE_LOG_VERSION 2.0# CUDA_DEVICE 0 Tesla M2050# TIMESTAMPFACTOR fffff614a81cd038method,gputime,cputime,occupancymethod=[ memcpyHtoD ] gputime=[ 3285.216 ] cputime=[ 3657.000 ]method=[ _Z13differentiatePdS_ ] gputime=[ 197.408 ] cputime=[ 18.000 ] occupancy=[ 1.000 ]method=[ memcpyDtoH ] gputime=[ 3598.752 ] cputime=[ 4505.000 ]

GPUの並列処理の実際


GPUのスレッドを管理

Streaming Multiprocessor内のCUDA Core数 Tesla世代 8 Fermi世代 32 Kepler世代 192 Maxwell世代 128

Warp（ウォープ）という処理の単位の導入

32スレッドをまとめて1 Warp 1 Warpごとに処理を実行

256スレッドの同時実行は不可能



処理はWarp単位で実行

32スレッドごとに処理を実行

1ブロックあたりのスレッド数が256=Warp 8個

ある命令を発行すると，Warp単位で同じ命令を実行

Warp内でCUDA CoreはSIMD的に動作

複数のまとまったスレッドが同じ演算を同時に実行

SIMT(Single Instruction Multiple Threads)

厳密なSIMDではない


パイプライン実行

あるWarpがグローバルメモリへアクセス，データ待ち

その他の実行可能なWarpが計算を実行

グローバルメモリへのアクセスのレイテンシを隠蔽


計算メモリアクセス計算Warp0

Warp1

Warp2

Warp3

実行開始処理時間

計算メモリアクセス計算

計算メモリアクセス計算

計算メモリアクセス

・・・

・・・



Warpの同時実行

同時に実行されているWarp(Active Warp)が多いほどレイテンシ隠蔽が容易

Active Warpを多くすることで高速化

Warpの同時実行数はブロック内で使用している資源（レジスタ，共有メモリ）によって変化

占有率(Occupancy) 同時実行されているWarp数と実行可能な大Warp数の比

一般的には1に近い方が高速

並列実行に対する制約


GPUの構成（資源の量，スケジューラ）に起因するハードウェア的・ソフトウェア的制約

Tesla世代 Fermi世代

Warpサイズ 32スレッド

1ブロックあたりのスレッド数 512 1024

1SMあたりの大スレッド数 1024 1536

1SMあたりのWarp数 32 48

1SMあたりのブロック数 8 8

1SMあたりの共有メモリサイズ 16384 byte 49152 byte

1SMあたりの32bitレジスタ数 16384本 32768本

https://www.softek.co.jp/SPG/Pgi/TIPS/public/accel/gpu‐accel2.html



1ブロックあたりのスレッド数 256 1ブロックあたりのWarp数

256thread/block / 32thread/Warp = 8 Warp/block

1SMが処理するブロック数（1SMあたり大48Warpを処理）

48 Warp/SM / 8 Warp/block = 6 block/SM（<8） 1SMあたり6ブロックを並列処理すればよい（大8ブロック）

同時実行されるWarp数は

8 Warp/block × 6 block/SM = 48 Warp/SM. 1SMあたり大48Warp同時実行できるので，占有率は

48 Warp/SM /48 Warp/SM = 1(=100%)

太字：利用するGPUによって決定下線：ユーザの設定により決まる値斜体：処理内容やコンパイラによって決定



1ブロックあたりのスレッド数 64 1ブロックあたりのWarp数

64thread/block / 32thread/Warp = 2 Warp/block

1SMが処理するブロック数（1SMあたり大48Warpを処理）

48 Warp/SM / 2 Warp/block = 24 block/SM（>8） 1SMあたり8ブロックを並列処理（24ブロックの処理は不可能）

同時実行されるWarp数は

2 Warp/block × 8 block/SM = 16 Warp/SM. 1SMあたり大48Warp同時実行できるので，占有率は

16 Warp/SM /48 Warp/SM = 0.333(=33%)




並列実行されるブロックの数

1ブロックあたりの共有メモリやレジスタの使用量で制限

1ブロックあたりのスレッド数 Nt 1ブロックが利用する共有メモリS[byte]

49152/S [block/SM]

1スレッドが利用するレジスタR[本]32768/(Nt×R) [block/SM]

並列実行されるブロック数

min{8, 48/(Nt/32), 49152/S, 32768/(Nt×R)}


Occupancyの計算


CUDA Occupancy Calculator NVIDIAが提供* Occupancy等の値を実行前に推定

必要な情報

GPUの世代

共有メモリのサイズ設定

64kBのうち何kBを共有メモリとするか

1ブロックあたりのスレッド数

1スレッドあたりのレジスタ使用数

1ブロックあたりの共有メモリ使用量

*http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

CUDA Occupancy Calculator デモ

GPGPU実践プログラミング318 2015/05/20

共有メモリやレジスタ数使用量の確認


コンパイルオプションを付加

‐Xptxas –v もしくは ‐‐ptxas‐options=‐v 実行可能バイトコードの情報を出力

$ nvcc ‐arch=sm_20 differentiate.cu ‐Xptxas ‐vptxas info : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'ptxas info : Function properties for _Z13differentiatePdS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loadsptxas info : Used 19 registers, 48 bytes cmem[0], 12 bytes cmem[16]

使用レジスタ数コンスタントメモリ使用量（サイズ）（コンパイラの適化により利用）

共有メモリやレジスタ数使用量の確認


共有メモリを利用したカーネル

コンスタントメモリを利用したカーネル

$ nvcc ‐arch=sm_20 differentiate_shared.cu ‐Xptxas ‐vptxas info : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'ptxas info : Function properties for _Z13differentiatePdS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loadsptxas info : Used 18 registers, 2064+0 bytes smem, 48 bytes cmem[0], 4 bytes cmem[16]

共有メモリ使用量（サイズ）

$ nvcc ‐arch=sm_20 differentiate_constant.cu ‐Xptxas ‐vptxas info : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'ptxas info : Function properties for _Z13differentiatePdS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loadsptxas info : Used 13 registers, 2064+0 bytes smem, 48 bytes cmem[0], 8 bytes cmem[2]

CUDA Occupancy Calculator デモ


レジスタ数の制約*


Compute Capalicity2.0世代のGPUのレジスタ利用

1 Warpあたり64本単位でレジスタを確保

レジスタ数が問題となってOccupancy100%の達成が困難

Occupancy 100%を達成するためには

ハードウェア的制約

1SMあたりのレジスタ数 32768本以下

1SMあたりのWarp数 48(=48×32スレッド)以下

1 Warpあたり64本単位でレジスタを確保

*http://www.gdep.jp/column/view/33

レジスタ数の制約*


Occupancy 100%を達成するためには

1スレッドあたりの利用可能レジスタ数

32768本/(48Warp/SM×32thread/Warp)=21.3=21本/thread 1スレッドあたり21本以下

1 Warpが確保するレジスタ数

21本/thread×32thread/Warp=672→704(64の倍数)に切り上げ

1SMあたりの使用レジスタ数

704本/Warp×48Warp/SM=33792本/SM (>32768)

1スレッドあたりの使用レジスタ数が20本以下でないと100%を達成できない

*http://www.gdep.jp/column/view/33

詳細なプロファイリング


gputime, cputime, occupancy以外のプロファイル

設定ファイルに測定する項目を記述

設定ファイルを環境変数CUDA_PROFILE_CONFIGで指定

prof.confがよく使われる

$ export CUDA_PROFILE_CONFIG=prof.conf 設定後，CUDA_PROFILE=1にしてプログラムを実行

$ export CUDA_PROFILE=1

詳細なプロファイリング


prof.confの内容

Profiler User's Guide http://docs.nvidia.com/cuda/profiler‐users‐guide/index.html

4.3. Command Line Profiler Configuration中のTable 2 9. Metrics Reference中のTable 3

nvprofを利用することも可能（CUDA 5以降）

$ nvprof ‐‐query‐eventsの結果を参照

カーネル起動に関する情報


名称内容

gpustarttimestamp カーネルもしくはメモリ転送開始時のタイムスタンプ(ns，64bit16進数)

gpuendtimestamp カーネルもしくはメモリ転送完了時のタイムスタンプ(ns，64bit16進数)

timestamp カーネル起動もしくはメモリ転送開始時のタイムスタンプ(s，単精度実数)

gridsize カーネル起動時のブロック数(x,y)

gridsize3d カーネル起動時のブロック数(x,y,z)

threadblocksize カーネル起動時のスレッド数(x,y,z)

dynsmemperblock 動的に割り当てられた共有メモリのサイズ

stasmemperblock 静的に割り当てられた共有メモリのサイズ

regperthread 1スレッドあたりのレジスタ数

memtransferdir メモリ転送の方向(0. host‐>device, 1. device‐>host)

memtransfersze メモリ転送サイズ

memtransferhostmemtype ホストメモリの種類（ページング可能かページロックか）

streamid カーネルを起動したストリームのID

cacheconfigrequested キャッシュ設定の要求共有メモリ：L1キャッシュ0. 設定無し 1. 48kB:16kB2. 16kB:48kB 3. 32kB:32kBcacheconfigexecuted カーネル起動時のキャッシュ設定

カーネル実行時のイベント


全てのGPUで取得できる情報

名称内容

local_load ローカルメモリからの読込回数（1SM内のWarpあたり）

local store ローカルメモリからの書込回数（1SM内のWarpあたり）

gld_request グローバルメモリからの読込回数（1SM内のWarpあたり）

gst_request グローバルメモリへの書込回数（1SM内のWarpあたり）

divergent_branch divergent branchの発生回数

branch プログラム中の分岐命令の数

sm_cta_launched 1SM内で実行されたブロックの数



Compute Capability 1.xのGPUで取得できる情報名称内容

gld_incoherent 非コアレスアクセスによるグローバルメモリからの読込回数

gld_coherent コアレスアクセスによるグローバルメモリからの読込回数

gld_32b 32バイト単位で行われたグローバルメモリからの読込回数



gst_incoherent 非コアレスアクセスによるグローバルメモリへの書込回数

gst_coherent コアレスアクセスによるグローバルメモリへの書込回数

gst_32b 32バイト単位で行われたグローバルメモリへの書込回数



instructions 実行された命令数

warp_serialize 共有メモリ，コンスタントメモリでコンフリクトした回数

cta_launched 起動されたブロックの数




instructions_issued（バージョンによってはinst_issued）

命令発行数（命令のreplay*を含む）

instructions_issued1_0 命令群0の1命令発行に要するサイクル数




instructions_executed（バージョンによってはinst_executed）

命令発行数（replayを含まない）

warps_launched 起動されたWarpの数（1SMあたり）

threads_launched 起動されたスレッドの数（1SMあたり）

active_cycles 1SM内で低でも1 Warpが実行されたサイクル数

active_warps 1SM内でアクティブになったWarpの数の合計サイクルごとにアクティブなWarpの数（0~48の間で）を加算

shared_load 共有メモリからの読込回数（1SM内のWarpあたり）

shared_store 共有メモリへの書込回数（1SM内のWarpあたり）

*同一Warp内のスレッド

が同じ命令を逐次的に実行

バンクコンフリクト等が原因で並列実行できない場合に発生http://on‐demand.gputechconf.com/gtc‐express/2011/presentations/Inst_limited_kernels_Oct2011.pdf




l1_global_load_hit L1キャッシュにヒットしたグローバルメモリからの読込回数

l1_global_load_miss L1キャッシュでキャッシュミスしたグローバルメモリからの読込回数

l1_local_load_hit L1キャッシュにヒットしたローカルメモリからの読込回数

l1_local_load_miss L1キャッシュでキャッシュミスしたローカルメモリからの読込回数

l1_local_store_hit L1キャッシュにヒットしたローカルメモリへの書込回数

l1_local_store_miss L1キャッシュでキャッシュミスしたローカルメモリへの書込回数

l1_shared_bank_conflict 共有メモリ内でバンクコンフリクトした回数

uncached_global_load_transaction

キャッシュされないグローバルメモリからの読込回数32,64,128ビットアクセスに対して1,2,4で加算

global_store_transaction グローバルメモリへの書込回数32,64,128ビットアクセスに対して1,2,4で加算



合計で70個程度(※動作は未確認)

•elapsed_cycles_sm•global_store_transaction•l1_global_load_hit•l1_global_load_miss•l1_local_load_hit•l1_local_load_miss•l1_local_store_hit•l1_local_store_miss•l1_shared_bank_conflict•sm_cta_launched•tex0_cache_sector_misses•tex0_cache_sector_queries•uncached_global_load_transaction•fb_subp0_read_sectors•fb_subp0_write_sectors•fb_subp1_read_sectors•fb_subp1_write_sectors•l2_subp0_read_hit_sectors•l2_subp0_read_sector_misses•l2_subp0_read_sector_queries•l2_subp0_read_sysmem_sector_queries•l2_subp0_read_tex_hit_sectors•l2_subp0_read_tex_sector_queries•l2_subp0_total_read_sector_queries•l2_subp0_total_write_sector_queries

•l2_subp0_write_sector_misses•l2_subp0_write_sector_queries l2•l2_subp0_write_sysmem_sector_queries•l2_subp1_read_hit_sectors•l2_subp1_read_sector_misses•l2_subp1_read_sector_queries•l2_subp1_read_sysmem_sector_queries•l2_subp1_read_tex_hit_sectors•l2_subp1_read_tex_sector_queries•l2_subp1_total_read_sector_queries•l2_subp1_total_write_sector_queries•l2_subp1_write_sector_misses l2 subp1•l2_subp1_write_sector_queries l2 subp1•l2_subp1_write_sysmem_sector_queries•gld_inst_128bit•gld_inst_16bit•gld_inst_32bit•gld_inst_64bit•gld_inst_8bit•gst_inst_128bit•gst_inst_16bit•gst_inst_32bit•gst_inst_64bit•gst_inst_8bit•active_cycles

•active_warps•atom_count•branch•divergent_branch•gld_request•gred_count•gst_request•inst_executed•inst_issued•local_load•local_store•prof_trigger_00•prof_trigger_01•prof_trigger_02•prof_trigger_03•prof_trigger_04•prof_trigger_05•prof_trigger_06•prof_trigger_07•shared_load•shared_store•thread_inst_executed_0•thread_inst_executed_1•threads_launched•warps_launched

http://www.ics.uci.edu/~roblim1/docs/cupticounters.txt

プログラム挙動の確認


スレッド数の変化によるメモリアクセスの変化

グローバルメモリからの読込(prof_gld) gld_request l1_global_load_hit l1_global_load_miss

グローバルメモリへの書込(prof_gst) gst_request global_store_transaction

命令

inst_issued (prof_inst_issued) inst_executed (prof_inst_executed)

#define N (512000)#define Nbytes (N*sizeof(float))#define NT (256)#define NB (N/NT) //=2000

__global__ void init(float *a,float *b, float *c){

int i = blockIdx.x*blockDim.x+ threadIdx.x;

a[i] = 1.0;b[i] = 2.0;c[i] = 0.0;

}__global__ void add(float *a,

float *b, float *c){int i = blockIdx.x*blockDim.x

+ threadIdx.x;c[i] = a[i] + b[i];

}

int main(void){float *a,*b,*c;

cudaMalloc((void **)&a, Nbytes);cudaMalloc((void **)&b, Nbytes);cudaMalloc((void **)&c, Nbytes);

init<<< NB, NT>>>(a,b,c);add<<< NB, NT>>>(a,b,c);

return 0;}

ベクトル和（コアレスアクセス版）


vectoradd_coalesce.cu

#define N (512000)#define Nbytes (N*sizeof(float))#define NT (250)#define NB (N/NT) //=2048



a[i] = 1.0;b[i] = 2.0;c[i] = 0.0;




}




return 0;}

ベクトル和（非コアレスアクセス版）


vectoradd_noncoalesce.cu

export CUDA_PROFILE=1nvcc ‐arch=sm_20 vectoradd_coalesce.cu#export CUDA_PROFILE_CONFIG=prof_gldexport CUDA_PROFILE_LOG=vectoradd_c_profile_gld.log./a.out#export CUDA_PROFILE_CONFIG=prof_gstexport CUDA_PROFILE_LOG=vectoradd_c_profile_gst.log./a.out#export CUDA_PROFILE_CONFIG=prof_inst_issuedexport CUDA_PROFILE_LOG=vectoradd_c_profile_inst_issued.log./a.out#export CUDA_PROFILE_CONFIG=prof_inst_executedexport CUDA_PROFILE_LOG=vectoradd_c_profile_inst_executed.log./a.out##

#nvcc ‐arch=sm_20 vectoradd_noncoalesce.cu#export CUDA_PROFILE_CONFIG=prof_gldexport CUDA_PROFILE_LOG=vectoradd_n_profile_gld.log./a.out#export CUDA_PROFILE_CONFIG=prof_gstexport CUDA_PROFILE_LOG=vectoradd_n_profile_gst.log./a.out#export CUDA_PROFILE_CONFIG=prof_inst_issuedexport CUDA_PROFILE_LOG=vectoradd_n_profile_inst_issued.log./a.out#export CUDA_PROFILE_CONFIG=prof_inst_executedexport CUDA_PROFILE_LOG=vectoradd_n_profile_inst_executed.log./a.out#export CUDA_PROFILE=0

コンパイルと実行（シェルスクリプト）


vectoradd.sh

実行結果（ベクトル和のみ抜粋, Occupancyはどちらも1）


l1_global_load_missの値が増加

globa_store_transactionの値が大幅に増加

inst_issuedの値が大幅に増加

内容コアレスアクセス非コアレスアクセス

gputime+cputime（4回の平均）

56.808+103=159.808

76.616+122.5=199.116

gld_request 2336 2368

l1_global_load_hit 0 1941

l1_global_load_miss 2336 2577

gst_request 1136 1168

global_store_transaction 1136 2252

inst_issued 24113 31399

inst_executed 19720 19992

メモリアクセスの評価


transactions/request 1 Warpがアクセス要求に対してキャッシュラインをいくつ使うか

理想値

32 thread/Warp × variable size[byte/thread]/ 128 byte/line

4バイト変数では1，8バイト変数では2

読込

(l1_global_load_hit+l1_global_load_miss)/gld_request

書込

global_store_transaction/gst_requet

transactions/request


理想値

float型（4バイト）の読込 1 float型（4バイト）の書込 1

非コアレスアクセスではメモリ転送の効率が50%に低下

読込書込

コアレスアクセス (2336+0)/2336=1 1136/1136=1

非コアレスアクセス (1941+2577)/2368≈2 2252/1168≈2

プログラム挙動の確認


命令の実行に要したサイクルの評価

実行した命令数

inst_executed (prof_inst_executed)

Warpが起動したサイクル数

active_cycles (prof_cycle)

#define N (8*1024) //64kBに収める#define Nbytes (N*sizeof(float))#define NT (256)#define NB (N/NT)



a[i] = 1.0;b[i] = 2.0;c[i] = 0.0;




}




return 0;}

GPUプログラム（グローバルメモリ利用）

2015/05/20GPGPU実践プログラミング

vectoradd.cu

340

#define N (8*1024) #define Nbytes (N*sizeof(float))#define NT (256)#define NB (N/NT)__constant__ float a[N],b[N];

__global__ void init(float *c){int i = blockIdx.x*blockDim.x

+ threadIdx.x;c[i] = 0.0f;

}__global__ void add(float *c){


c[i] = a[i] + b[i];}int main(void){

float *c;float *host_a,*host_b;

int i;

host_a=(float *)malloc(Nbytes);host_b=(float *)malloc(Nbytes);cudaMalloc((void **)&c,Nbytes);for(i=0;i<N;i++){

host_a[i] = 1.0f;host_b[i] = 2.0f;

}cudaMemcpyToSymbol

(a,host_a,Nbytes);cudaMemcpyToSymbol

(b,host_b,Nbytes);init<<< NB, NT>>>(c);add<<< NB, NT>>>(c);

return 0;}

コンスタントメモリ（単純な置き換え）


vectoradd_constant.cu

341

#define N (8*1024) #define Nbytes (N*sizeof(float))#define NT (256)#define NB (N/NT)__constant__ float a, b;

__global__ void init(float *c){int i = blockIdx.x*blockDim.x

+ threadIdx.x;c[i] = 0.0f;

}__global__ void add(float *c){


c[i] = a + b;}int main(void){

float *c;float host_a,host_b;

host_a=1.0f;host_b=2.0f;cudaMalloc((void **)&c,Nbytes);

//host_a,host_bが配列ではないので//アドレスを取り出すために&を付けるcudaMemcpyToSymbol

(a,&host_a,sizeof(float));cudaMemcpyToSymbol

(b,&host_b,sizeof(float));init<<< NB, NT>>>(c);add<<< NB, NT>>>(c);

return 0;}

コンスタントメモリ（同一アドレス参照）


vectoradd_broadcast.cu

342

export CUDA_PROFILE=1#nvcc ‐arch=sm_20 vectoradd.cu#export CUDA_PROFILE_CONFIG=prof_inst_executedexport CUDA_PROFILE_LOG=vectoradd_profile_inst_executed.log./a.out#export CUDA_PROFILE_CONFIG=prof_cycleexport CUDA_PROFILE_LOG=vectoradd_profile_cycle.log./a.out##nvcc ‐arch=sm_20 vectoradd_constant.cu#export CUDA_PROFILE_CONFIG=prof_inst_executedexport CUDA_PROFILE_LOG=vectoradd_constant_profile_inst_executed.log

./a.out#export CUDA_PROFILE_CONFIG=prof_cycleexport CUDA_PROFILE_LOG=vectoradd_constant_profile_cycle.log./a.out

##nvcc ‐arch=sm_20 vectoradd_broadcast.cu#export CUDA_PROFILE_CONFIG=prof_inst_executedexport CUDA_PROFILE_LOG=vectoradd_broadcast_profile_inst_executed.log

./a.out#export CUDA_PROFILE_CONFIG=prof_cycleexport CUDA_PROFILE_LOG=vectoradd_broadcast_profile_cycle.log./a.out#export CUDA_PROFILE=0

コンパイルと実行（シェルスクリプト）


vectoradd_cycle.sh

実行結果（ベクトル和のみ抜粋, Occupancyはどちらも1）


inst_issuedの値が減少

active_cyclesの値が大幅に増加

同じ処理の実行に余分なサイクルが必要

1サイクル（クロック）あたりの命令実行数(≠演算回数) グローバルメモリ 408/ 658=0.62 コンスタントメモリ 336/2984=0.11 ブロードキャスト 288/ 599=0.48

内容グローバルメモリコンスタントメモリブロードキャスト

gputime+cputime（2回の平均）

2.96+49=51.96

7.216+53.5=60.716

2.96+49.5=52.46

inst_executed 408 336 288

active_cycles 658 2984 599

簡易的なプロファイリング(CUDA 5以降)


実行と同時にプロファイル結果を出力

$ nvprof ./a.out

$ nvprof ‐‐query‐events プロファイルできるイベント一覧を表示

$ nvprof ‐events inst_executed ./a.out プロファイルするイベントを指定して実行

Nsight デバッグ・プロファイルツール

https://developer.nvidia.com/nsight‐eclipse‐edition https://developer.nvidia.com/nvidia‐nsight‐visual‐studio‐edition

Visual StudioやEclipseと統合し，開発→デバッグ→プロファイリングが可能


Available Events


The profiler supports logging of following counters during kernel execution only on GPUs with Compute Capability 2.0 or higher

Name description

instructions_issued Number of instructions issued including replays

instructions_issued1_0 Number of cycles that issue one instruction for instruction group 0

instructions_issued2_0 Number of cycles that issue two instructions for instruction group 0

instructions_issued1_1 Number of cycles that issue one instruction for instruction group 1

instructions_issued2_1 Number of cycles that issue two instructions for instruction group 1

instructions_executed Number of instructions executed, do not include replays

warps_launched Number of warps launched in a SM

threads_launched Number of threads launched in a SM

active_cycles Count of cycles in which at least one warp is active in a multiprocessor

active_warpsAccumulated count of no. of warps which are active per cycle in a multiprocessor. Each cycle increments it by the number of warps active in that cycle (in range 0‐48)

shared_load Number of executed shared load instructions per warp in a SM

shared_store Number of executed shared store instructions per warp in a SM

http://www.cs.cmu.edu/afs/cs/academic/class/15668‐s11/www/cuda‐doc/Compute_Profiler.txt

Available Events



Name description

l1_global_load_hit Number of global load hits in L1 cache

l1_global_load_miss Number of global load misses in L1 cache

l1_local_load_hit Number of local load hits in L1 cache

l1_local_load_miss Number of local load misses in L1 cache

l1_local_store_hit Number of local store hits in L1 cache

l1_local_store_miss Number of local store misses in L1 cache

l1_shared_bank_conflict Count of no. of bank conflicts in shared memory


Number of uncached global load transactions. This increments by 1, 2 or 4 for 32, 64 and 128 bit accesses respectively


Number of uncached global load transactions. This increments by 1, 2, or 4 for 32, 64 and 128 bit accesses respectively

global_store_transaction Number of global store transactions. This increments by 1, 2, or 4 for 32, 64 and 128 bit accesses respectively

Available Events



Name description

l2_subp0_write_sector_misses Accumulated write sector misses from L2 cache for slice 0 for all the L2 cache units

l2_subp1_write_sector_misses Accumulated write sectors misses from L2 cache for slice 1 for all the L2 cache units

l2_subp0_read_sector_misses Accumulated read sectors misses from L2 cache for slice 0 for all the L2 cache units

l2_subp1_read_sector_misses Accumulated read sectors misses from L2 cache for slice 1 for all the L2 cache units

l2_subp0_write_sector_queries Accumulated write sector queries from L1 to L2 cache for slice 0 of all the L2 cache units

l2_subp1_write_sector_queries Accumulated write sector queries from L1 to L2 cache for slice 1 of all the L2 cache units

l2_subp0_read_sector_queries Accumulated read sector queries from L1 to L2 cache for slice 0 of all the L2 cache units

l2_subp1_read_sector_queries Accumulated read sector queries from L1 to L2 cache for slice 1 of all the L2 cache units

tex0_cache_sector_queries Number of texture cache sector queries for texture unit 0

tex0_cache_sector_misses Number of texture cache sector misses for texture unit 0

tex1_cache_sector_queries Number of texture cache sector queries for texture unit 1

tex1_cache_sector_misses Number of texture cache sector misses for texture unit 1

Available Events



Name description

fb_subp0_read_sectors Number of read requests sent to sub‐partition 0 of all the DRAM units

fb_subp1_read_sectors Number of read requests sent to sub‐partition 1 of all the DRAM units

fb_subp0_write_sectors Number of write requests sent to sub‐partition 0 of all the DRAM units

fb_subp1_write_sectors Number of read requests sent to sub‐partition 1 of all the DRAM units

fb0_subp0_read_sectors Number of read requests sent to sub‐partition 0 of DRAM unit 0


fb0_subp0_write_sectors Number of write requests sent to sub‐partition 0 of DRAM unit 0






Documents

パフォーマンス解析ツール カーネルや関数（API ）の名称 gputime GPU上で処理に要した時間（ s単位） cputime CPUで処理（=カーネル起動）に要した時間

パフォーマンス解析ツールカーネルや関数（API ）の名称 gputime GPU上で処理に要した時間（ s単位） cputime CPUで処理（=カーネル起動）に要した時間