Multicore Programming Techniques -- OpenCL and WebCL

maruyama097 Fujio Maruyama (丸山不二夫)

Scalable algorithms and libraries can be the best legacy we can leave behind from this era

-- Sanjay J. Patel

Agenda

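o  Toward the era of multicore and many-core
o  Two directions of many-core chip evolution and heterogeneous environments -- Intel Xeon Phi and NVIDIA Tesla
o  Many-core and parallel programming
o  Programming the Xeon Phi: OpenMP
o  Programming NVIDIA GPUs: CUDA
o  OpenCL
o  WebCL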

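Toward the Era of Multicore and Many-Core

The ever-expanding world of ICT, 2001-2013

[Figure: worldwide growth of smartphones, mobile phones, the Internet, and fixed-line telephones. Source: "ICT Facts and Figures 2013"]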

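World transistor count, 2005 to 2010

From the IDF2011 keynote, 2011/9/13

Projected world transistor count, 2010 to 2015

From the IDF2011 keynote, 2011/9/13

The number of transistors on a chip keeps growing without pause

Moore's Law

How should the growing transistor count be turned into chip performance?

o  More transistors do not automatically make a chip more powerful. There are several options.
   -  Raise the processing power of each core
   -  Strengthen pipelining
   -  Add new instructions such as vector operations
   -  ....
   -  Enlarge the cache
   -  Increase the number of cores
   -  ....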

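Chip clock rates have hit a ceiling

Problems with the chip clock

o  The most direct way to raise chip performance is to raise the clock frequency. However, this runs into several serious problems.
o  Power consumption and heat increase.
o  Higher frequencies require higher voltages, but leakage current also grows and performance drops.
o  Information travels no faster than the speed of light, so in principle the size of the chip sets a limit.

Core counts began rising markedly around 2005

Two Directions of Many-Core Chip Evolution and Heterogeneous Environments

Intel Xeon Phi and NVIDIA Tesla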

TOP500, June 2013

http://www.top500.org/list/2013/06/

Growth in TOP500 performance

Rank | Site | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s)
1 | National University of Defense Technology, China | Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P (NUDT) | 3120000 | 33862.7 | 54902.4
2 | DOE/SC/Oak Ridge National Laboratory, United States | Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x (Cray Inc.) | 560640 | 17590.0 | 27112.5

Titan Cray XK7 Compute Node

XK7 Compute Node Characteristics
o  AMD Series 6200 (Interlagos)
o  NVIDIA Kepler
o  Host Memory: 32GB, 1600 MT/s DDR3
o  NVIDIA Tesla X2090 Memory: 6GB GDDR5 capacity
o  Gemini High Speed Interconnect
o  Keplers in final installation

[Figure: compute node block diagram with HT3 links, PCIe Gen2, and the X/Y/Z axes of the interconnect]

http://en.wikipedia.org/wiki/Titan_(supercomputer)


Rank | Site | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) (continued)
6 | Texas Advanced Computing Center/Univ. of Texas, United States | Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P (Dell) | 462462 | 5168.1 | 8520.1
10 | National Supercomputing Center in Tianjin, China | Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050 (NUDT) | 186368 | 2566.0 | 4701.0
16 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050 (Dawning) | 120640 | 1271.0 | 2984.3
21 | GSIC Center, Tokyo Institute of Technology, Japan | TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows (NEC/HP) | 73278 | 1192.0 | 2287.6

Intel Xeon Phi

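Xeon Phi architecture

o  The 60 cores are connected by a high-speed bidirectional ring network.
o  The cores run at a clock of 1 GHz or more.
o  Caches are kept coherent across the entire coprocessor.
o  Each core has a local 512KB L2 cache. The L2 caches can access one another at high speed. (The total L2 cache on the chip exceeds 25MB.)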

http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-block-diagram.html

From “Intel Xeon Phi Coprocessor Architecture and Tools”

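Instructions added to 64-bit x86

o  Wide 512-bit SIMD vector operations, in place of the narrower MMX, Intel SSE, and AVX operations
o  Fast support for multiplication and division, square roots, powers, and similar operations
o  Scatter/gather and streaming-store support, which enable fast use of memory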

From “Intel Xeon Phi Coprocessor High Performance Programming”

Transforming and Tuning

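The Offload model and the Native model

o  Offload model: a processor-centric model in which the program runs on the processor and parts of it are offloaded to the coprocessor for execution.

o  Native model: a model in which the program runs on both the processor and the coprocessor, which communicate with each other in various ways.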

NVIDIA Tesla

Kepler Tesla K20

KEY FEATURES

o  GPU
   -  Number of processor cores: 2496
   -  Processor core clock: 706 MHz
o  Board
   -  PCI Express Gen2 x16 system interface
o  Memory
   -  Memory clock: 2.6 GHz
   -  Memory bandwidth: 208 GB/sec
   -  Interface: 320-bit
   -  Total board memory: 5 GB
   -  20 pieces of 64M x 16 GDDR5, SDRAM

GK110 Kepler architecture

o  A full implementation of Kepler GK110 has 15 SMX units and six 64-bit memory controllers.

o  Each SMX unit of Kepler GK110 contains 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic units.

o  Each SMX has 64 double-precision units, 32 special function units (SFUs), and 32 load/store units (LD/ST).

[Figure: GK110 block diagram: PCI Express 3.0 interface, L2 cache, SMX x 15, Memory Controller x 6]

Composition of each SMX

[Figure: SMX block diagram: Instruction Cache, Warp Scheduler, Dispatch Unit, Core x 192, DP Unit x 64, SFU x 32, Load/Store x 32, Interconnect, Shared Memory/L1, Read-Only Cache, Tex]

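Forecast: by 2021, CPU and GPU performance will have grown as much as 100-fold

Two Directions of Many-Core Chip Evolution and Heterogeneous Environments

Architectural differences between CPUs and GPUs

CPUs and GPUs are based on fundamentally different design philosophies.

CPU: latency-oriented design

o  Large caches
   -  Turn long memory-access latencies into short cache latencies
o  Sophisticated control
   -  Branch prediction and speculative execution to reduce branch latency
   -  Data prefetching to reduce data latency
o  Powerful arithmetic units
   -  Reduce operation latency

[Figure: CPU block diagram with Control, Cache, a few ALUs, and DRAM]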

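GPU: throughput-oriented design

o  Small caches
   -  Raise memory throughput
o  Simple control
   -  No branch prediction
   -  No data prefetching
o  Energy-efficient arithmetic units
o  Requires a massive number of threads to overcome latency

[Figure: GPU block diagram with DRAM]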

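Benefits of using both a CPU and a GPU

o  Use the CPU for the sequential parts of execution, where latency matters.

o  For sequential code, a CPU is more than 10 times faster than a GPU.

o  Use the GPU for the parallel parts of execution, where throughput matters.

o  For parallel code, a GPU is more than 10 times faster than a CPU.

The arrival of heterogeneous environments

Hardware evolution of devices as heterogeneous environments

The power of mobile devices, seen through video frame rates

Year | Resolution | Frame rate | Type
1990~ | 176 x 144 | 15 fps |
2010 | 1920 x 1080 | 30 fps | Full HD
2013 | 3840 x 2160 | 60 fps | "4K"
2015 | 7680 x 4320 | 120 fps | "8K"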

Apple iPhone 5s A7

http://www.appbank.net/2013/09/20/iphone-news/671017.php

Samsung Exynos 5 Octa (Exynos 5420)

o  CPU: big.LITTLE octa-core configuration of ARM Cortex-A15 x 4 cores + Cortex-A7 x 4 cores
o  GPU: Mali-T628 (8 cores)
o  New model: "Heterogeneous Multi-Processing (HMP)", in which all eight cores run simultaneously, announced in September 2013

Qualcomm Snapdragon 800

o  Quad core Krait 400 CPU at up to 2.3GHz per core, 28nm HPm
o  Adreno 330 GPU
o  USB 3.0 support

Intel Atom Merrifield

o  CPU: Tangier (Silvermont x 2) 22nm

http://pc.watch.impress.co.jp/docs/column/ubiq/20130507_598283.html

NVIDIA Tegra 4

o  CPU cores: 4 + 1
o  CPU architecture: ARM Cortex-A15
o  Maximum clock speed: 1.9GHz
o  Custom GPU cores: 72

The Tegra 4 GPU

Logan, the successor to Tegra 4

o  "Logan will be the first (in the Tegra family) to carry a state-of-the-art GPU. It will be the first mobile processor that can use CUDA. Logan carries a Kepler GPU, fully supports CUDA 5, and also supports OpenGL 4.3. Logan should easily be able to enter production early next year (2014)."

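Many-Core and Parallel Programming

Parallel programming is essential for drawing out the power of many-core chips. Several approaches and implementations exist.

Parallel and distributed

o  One way to distinguish parallel processing from distributed processing is whether the processing crosses a network. Fundamentally, however, how tightly or loosely the computing units are coupled is determined by the time needed to communicate between them, as in the following figures.

   -  L1: 3 cycles
   -  L2: 14 cycles
   -  RAM: 250 cycles
   -  DISK: 41,000,000 cycles
   -  NETWORK: 240,000,000 cycles

Parallel and distributed

o  These differences in communication time are quantitative, to be sure, but they are large enough to conclude that there is a qualitative difference between L1, L2, and RAM on one side and DISK and NETWORK on the other.

   -  L1: 3 cycles
   -  L2: 14 cycles
   -  RAM: 250 cycles
   -  DISK: 41,000,000 cycles
   -  NETWORK: 240,000,000 cycles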
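To make the size of the gap concrete, the cycle counts in the list can be converted to wall-clock time. The following is a minimal sketch: the function names and the 1 GHz clock used in the usage note are illustrative assumptions, not figures from the slides.

```c
#include <assert.h>

/* Converts a latency given in clock cycles to milliseconds.
   clock_hz is an assumed clock frequency chosen by the caller. */
double cycles_to_ms(double cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return cycles * 1000.0 / clock_hz;
}

/* How many times slower one kind of access is than another. */
double latency_ratio(double slow_cycles, double fast_cycles)
{
    assert(fast_cycles > 0.0);
    return slow_cycles / fast_cycles;
}
```

Under the assumed 1 GHz clock, cycles_to_ms(240000000.0, 1e9) is about 240 ms for one NETWORK access, and latency_ratio(240000000.0, 3.0) shows it to be 80,000,000 times slower than an L1 access.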

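Large-scale distributed systems and supercomputers

o  Stating the difference between large-scale distributed systems such as those of Google and Facebook and large-scale parallel clusters such as Tianhe-2 and Titan is surprisingly difficult.

o  Of course, differences are also easy to find. But Google's and Facebook's systems differ greatly from each other, and so do the architectures of Tianhe-2 and Titan.

o  In the same way, it is relatively easy to find what they have in common.

Future changes

o  This probably indicates that today's systems are still at a primitive stage of evolution.

o  The direction of evolution clearly lies in faster DISK and NETWORK. Once memory and disk merge and the network becomes faster than the internal bus, system architecture will change.

"Gilder's prediction", "quantum optical communication", .... new device technologies, ....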

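Task parallelism and data parallelism

o  Parallel programming has two basic models: task parallelism, which runs mutually independent tasks concurrently, and data parallelism, which applies to data whose individual elements can be processed independently. The two can also be combined.

o  Task parallelism is relatively coarse-grained and suits multicore CPUs. Data parallelism is fine-grained, massively parallel processing and suits many-core GPUs.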
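The two models can be contrasted in a few lines of C with OpenMP pragmas. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the pragmas only take effect when the compiler's OpenMP support is enabled (for example with -fopenmp); without it the code still runs, sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Data parallel: every element is processed independently, so the
   iterations can be spread over however many cores are available. */
void scale_all(float *x, size_t n, float a)
{
    assert(x != NULL || n == 0);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        x[i] = a * x[i];
}

/* Task parallel: two independent tasks (a sum and a maximum) run
   side by side, each in its own section. */
void sum_and_max(const float *x, size_t n, float *sum, float *max)
{
    assert(n > 0);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            float s = 0.0f;
            for (size_t i = 0; i < n; i++) s += x[i];
            *sum = s;
        }
        #pragma omp section
        {
            float m = x[0];
            for (size_t i = 1; i < n; i++) if (x[i] > m) m = x[i];
            *max = m;
        }
    }
}
```

The data-parallel loop scales with the array size, while the task-parallel version can never use more than two cores, which is the coarse-grained versus fine-grained distinction drawn above.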

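Data parallelism and functional languages

o  The advantage of functional languages in parallel computing is tied to the data-parallel model.

o  Instead of looping over the elements and processing them one at a time, data parallelism is achieved with side-effect-free functions that act on a whole data structure and take the function for the per-element processing as an argument.

o  Unfortunately, this approach does not yet appear to be used in large-scale systems.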
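The idea of a function that acts on a whole data structure and takes the per-element function as an argument can be sketched even in C. The names map_f, unary_fn, and square below are hypothetical, chosen only for illustration; a functional language would provide this as a built-in map.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*unary_fn)(float);

/* Applies f to every element of in[] and stores the results in out[].
   If f has no side effects, no iteration depends on any other, so the
   loop body could be handed to n independent threads. */
void map_f(unary_fn f, const float *in, float *out, size_t n)
{
    assert(f != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
}

/* An example per-element function with no side effects. */
static float square(float x)
{
    return x * x;
}
```

A call such as map_f(square, in, out, n) expresses what to do with each element without fixing the order in which elements are visited, which is what leaves room for parallel execution.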

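Parallel programming languages and their models

Even among those that are supported by the major computer vendors and have become standard programming interfaces, there are many parallel programming languages and models. This section outlines MPI, OpenMP, OpenACC, CUDA, and OpenCL.

"Programming Massively Parallel Processors"

Parallel Programming Languages and Models

o  The most widely used have been MPI (Message Passing Interface), for scalable cluster computing, and OpenMP, for shared-memory multiprocessors.

o  Both are supported by the major computer vendors and have become standard programming interfaces.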

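MPI (Message Passing Interface)

o  MPI was designed for scalable cluster computing.

o  It is a model for the case where the compute nodes of a cluster do not share memory. All data sharing and interaction must take place through explicit message passing.

o  MPI has been successful in high-performance computing (HPC). Applications written in MPI run successfully on cluster computing systems with more than 100,000 nodes.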

OpenMP

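o  OpenMP was designed for execution on shared-memory multiprocessors.

o  An OpenMP implementation consists of a compiler and a runtime. The programmer specifies directives (commands) and pragmas (hints) about loops to the OpenMP compiler.

o  Using these directives and pragmas, the OpenMP compiler generates parallel code. The runtime system supports the execution of the parallel code by managing parallel threads and resources.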
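As a small illustration of a loop annotated for the OpenMP compiler, the sketch below computes a dot product. The function name dot is hypothetical; the reduction clause asks the runtime to give each thread a private partial sum and combine them at the end. It assumes OpenMP support is enabled at compile time, and otherwise runs as ordinary sequential C.

```c
#include <assert.h>

/* Dot product of a[0..n-1] and b[0..n-1].
   The pragma tells the OpenMP compiler that iterations may run in
   parallel and that `sum` is combined across threads with +. */
double dot(const double *a, const double *b, int n)
{
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The source code is unchanged apart from the single pragma line; thread creation and the distribution of iterations are left to the compiler and the runtime.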

OpenACC

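o  OpenACC was proposed for programming heterogeneous computer systems.

o  Its main advantage is that it provides compiler automation and runtime support that, through abstraction, free the programmer from many of the fine details of parallel computing.

o  Even so, programming effectively in OpenACC still requires an understanding of all the details involved in parallel programming.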

CUDA

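o  CUDA was proposed for parallel programming that exploits GPUs.

o  Because CUDA gives the programmer explicit control over the details of parallel programming, it is an excellent learning vehicle even for those who intend to use OpenMP or OpenACC as their primary programming interface.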

OpenCL

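o  In 2009, major industry players including Apple, Intel, AMD, and NVIDIA jointly developed a standardized programming model called the Open Computing Language (OpenCL).

o  Like CUDA, the OpenCL programming model defines language extensions and runtime APIs that let programmers manage parallel execution and data delivery on massively parallel processors.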

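Combining CUDA and MPI

o  Today, many HPC clusters use heterogeneous nodes.

o  CUDA is an effective interface for each individual node, but most application developers need to use MPI at the cluster level.

o  It is therefore important for HPC parallel programmers to understand how to combine MPI and CUDA.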

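Problems with MPI

o  The amount of work needed to port an application to MPI can be very large because of the lack of shared memory across computing nodes.

o  The programmer must perform domain decomposition to partition the input and output data across the cluster nodes. Based on this decomposition, the programmer must also call message sending and receiving functions to manage the exchange of data between nodes.

o  CUDA, by contrast, addresses this difficulty by providing shared memory for parallel execution within the GPU.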

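Problems with CUDA

o  For communication between the CPU and the GPU, CUDA previously provided only very limited shared-memory capability.

o  Programmers had to manage data transfers between the CPU and the GPU in a manner resembling "one-sided" message passing.

o  New support for a global address space and automatic data transfer in heterogeneous computing systems is now available in GMAC and CUDA 4.0.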

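Data transfer optimization and GMAC

o  With GMAC, a CUDA or OpenCL programmer can define data structures shared between the CPU and the GPU as C variables.

o  The GMAC runtime maintains data coherence and automatically performs optimized data transfer operations on the programmer's behalf as needed. This support greatly reduces the complexity of CUDA and OpenCL programming that involves data transfers overlapped with computation and I/O.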

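Problems with OpenCL

o  OpenCL is a standardized programming model in which applications developed in OpenCL run correctly, without modification, on every processor that supports the OpenCL language extensions and API.

o  However, achieving high performance on a new processor will likely require modifying the application.

Parallel Programming Languages and Models

o  Those familiar with both OpenCL and CUDA will recognize remarkable similarities between the concepts and features of the two. A CUDA programmer can therefore learn OpenCL programming with minimal effort. More importantly, practically all of the techniques learned with CUDA can be applied to OpenCL programming.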

Programming the Xeon Phi: OpenMP

"Intel Xeon Phi Coprocessor High Performance Programming"

    omp_set_num_threads(2);
    kmp_set_defaults("KMP_AFFINITY=compact");

    #pragma omp parallel
    #pragma omp master
    numthreads = omp_get_num_threads();

    printf("Initializing\r\n");
    #pragma omp parallel for
    for (i = 0; i < FLOPS_ARRAY_SIZE; i++) {
        fa[i] = (float)i + 0.1;
        fb[i] = (float)i + 0.2;
    }
    printf("Starting Compute on %d threads\r\n", numthreads);
    tstart = dtime();

    // scale the calculation across threads requested
    // need to set environment variables
    // OMP_NUM_THREADS and KMP_AFFINITY
    #pragma omp parallel for private(j,k)
    for (i = 0; i < numthreads; i++) {
        // each thread will work its own array section
        // calc offset into the right section
        int offset = i * LOOP_COUNT;
        // loop many times to get lots of calculations
        for (j = 0; j < MAXFLOPS_ITERS; j++) {
            // scale 1st array and add in the 2nd array
            for (k = 0; k < LOOP_COUNT; k++) {
                fa[k+offset] = a * fa[k+offset] + fb[k+offset];
            }
        }
    }

    % export OMP_NUM_THREADS=122
    % export KMP_AFFINITY=scatter

    #pragma offload target (mic)
    #pragma omp parallel
    #pragma omp master
    numthreads = omp_get_num_threads();

    printf("Initializing\r\n");
    #pragma omp parallel for
    for (i = 0; i < FLOPS_ARRAY_SIZE; i++) {
        fa[i] = (float)i + 0.1;
        fb[i] = (float)i + 0.2;
    }
    printf("Starting Compute on %d threads\r\n", numthreads);
    tstart = dtime();

    #pragma offload target (mic)
    #pragma omp parallel for private(j,k)
    for (i = 0; i < numthreads; i++) {
        .......

    % export OMP_NUM_THREADS=2
    % ./hellomem
    Initializing
    Starting BW Test on 2 threads
    Gbytes = 1024.000, Secs = 104.381, GBytes per sec = 9.810
    % export OMP_NUM_THREADS=10
    % ./hellomem
    Initializing
    Starting BW Test on 10 threads
    Gbytes = 1024.000, Secs = 21.991, GBytes per sec 46.565
    % export OMP_NUM_THREADS=20
    % ./hellomem
    Initializing
    Starting BW Test on 20 threads
    Gbytes = 1024.000, Secs = 13.198, GBytes per sec = 77.585
    ....
    ....
    % export OMP_NUM_THREADS=61
    % ./hellomem
    Initializing
    Starting BW Test on 61 threads
    Gbytes = 1024.000, Secs = 7.386, GBytes per sec = 138.637
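The bandwidth figures in the log follow directly from the reported data volume and elapsed time. The sketch below reproduces that arithmetic; the function names are hypothetical and are not part of the hellomem program.

```c
#include <assert.h>

/* Bandwidth as reported in the log: data volume divided by elapsed time. */
double gbytes_per_sec(double gbytes, double secs)
{
    assert(secs > 0.0);
    return gbytes / secs;
}

/* Speedup of a run relative to a baseline run of the same workload. */
double speedup(double baseline_secs, double secs)
{
    assert(secs > 0.0);
    return baseline_secs / secs;
}
```

With the logged values, gbytes_per_sec(1024.0, 104.381) is about 9.81 GB/s for 2 threads, and speedup(104.381, 7.386) is about 14, so going from 2 to 61 threads makes the test roughly 14 times faster rather than the 30.5 times that the thread count alone would suggest.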

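Increasing the number of cores

Programming Tesla: CUDA

History of GPU computing

1.  Evolution of the graphics pipeline
2.  GPGPU: an intermediate stage
3.  GPU computing

Evolution of the graphics pipeline

o  The era of fixed-function graphics pipelines
   -  Early 1980s to the late 1990s
   -  The first seven generations of DirectX
o  Evolution of programmable real-time graphics
   -  2001: GeForce 3 / 2002: ATI Radeon 9700
   -  2005: Xbox 360
o  Unification of graphics and computing processors
   -  2006: GeForce 8800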

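GPGPU (General Purpose GPU): an intermediate stage

o  In the era of DirectX 9 GPUs, some researchers took note of the growth in GPU arithmetic performance and began to consider using it for HPC.

o  Tsuyoshi Hamada, then an assistant professor at Nagasaki University's Faculty of Engineering, achieved 158 TFlops using 380 NVIDIA GeForce boards on a budget of 38 million yen, and won the 2009 Gordon Bell Prize.

o  This exceeded the 41 TFlops of the 2002 Earth Simulator (60 billion yen) and the 122 TFlops of the 2009 Earth Simulator 2 (upgraded for 15.8 billion yen).

While saying that "high-performance computers are important", he said of the conventional policy of spending enormous sums on development, "I cannot honestly say it is good. The direction is backwards", and expressed the view that lower costs are possible.

GPGPU (General Purpose GPU): an intermediate stage

o  However, DirectX 9 GPUs were designed exclusively for graphics APIs, so accessing the GPU's computational power required recasting the problem as graphics operations, for example pixel shader operations.

o  Input data was sent to the GPU as texture images, and output came back as a set of pixels.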

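GPU computing

o  GPUs of the DirectX 10 generation came to support high-performance floating-point arithmetic.

o  In designing the Tesla GPU architecture, NVIDIA recognized the importance of making the GPU not a graphics-only chip but a programmer-programmable processor that takes on data-parallel processing, and set out on that path.

o  In Tesla, the former shader processors became fully programmable processors with instruction memory, an instruction cache, and instruction control logic.

The Tesla architecture

o  The hardware added for these changes was kept down by sharing the instruction cache and the instruction control logic circuitry.

o  NVIDIA added memory load/store instructions that allow the random memory access required by C programs.

o  Tesla introduced a hierarchical parallel-thread programming model, barrier synchronization, and facilities for dispatching and managing parallel work.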

CUDA

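o  NVIDIA developed the CUDA compiler, its libraries, and its runtime so that programmers could easily use the new data-parallel computing model.

o  Programmers no longer need to go through a graphics API, as in GPGPU, to run parallel computations on the GPU.

Multicore CPUs and GPUs, and coarse-grained versus fine-grained parallelism

o  Going multicore means using the growth in transistor count that Moore's Law makes available on a chip not to raise the performance of a single core but to increase the number of cores of the same performance. This holds for CPUs and GPUs alike.

o  Parallelization on multicore CPUs is coarse-grained, and a growing core count can force the processing to be rewritten. Parallelization on multicore GPUs is fine-grained and scales with the number of cores.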

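Programming in CUDA

Task parallelism and data parallelism

o  A large application usually contains many independent tasks, so many tasks can be executed in parallel.

o  In general, data parallelism is the main source of scalability for parallel programs.

o  With large data sets, one can find a great deal of data parallelism able to use a massive number of parallel processors. This allows application performance to grow with each generation of hardware.

An example of data parallelism: vector addition

Vector addition can be computed independently for each element.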

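Structure of a CUDA program

o  The structure of a CUDA program reflects the coexistence of one host (CPU) and one or more devices (GPUs) in the computer.

o  A CUDA source file contains both host code and device code.

o  An ordinary C program is a CUDA program that contains only host code.

o  Device code is clearly marked by CUDA keywords added to data and function declarations. Code containing these CUDA keywords cannot be compiled by an ordinary C compiler.

Kernels in a CUDA program

o  Data-parallel functions on the device (GPU) side, called kernels, and their associated data structures are labeled with CUDA keywords.

o  NVCC (NVIDIA's C compiler for CUDA) uses the CUDA keywords to separate the program into a host part and a device part.

o  The device code is compiled by the runtime component of NVCC and executed on the GPU.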

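Compiling a CUDA program

Execution of a CUDA program and the grid

o  Execution of a CUDA program starts with the execution of the host (CPU) program.

o  When a kernel function is called, a large number of threads run on the device. All the threads generated together by one kernel call are called a grid.

o  When all the threads of a kernel have finished executing, the corresponding grid terminates.

o  Control then returns to the host until the next kernel is called.

An example of CUDA program execution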

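Grid, block, and thread

o  When host code launches a kernel, the CUDA runtime generates the threads of a grid.

o  A grid is organized as a two-level hierarchy. Each grid consists of blocks, which are arrays of threads. Each block in a grid has a unique blockIdx.

o  All blocks of a grid are the same size and can contain up to 1024 threads. The number of threads per block is specified by the host when the kernel is launched and can be read through blockDim.

o  Each thread in a block has a unique threadIdx.

Grid, block, and thread

All threads in a grid execute the same kernel code.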

Defining a CUDA kernel function

Vector addition: ordinary C code

    // Compute vector sum C = A+B
    void vecAdd(float* A, float* B, float* C, int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

Vector addition: CUDA kernel C code

    // Compute vector sum C = A+B
    __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

Where did the for loop go?

o  This kernel code contains no for loop. The code is executed independently by every thread in the grid.

o  The n iterations of the loop are executed as n threads.
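One way to see how the loop turns into threads is to emulate the launch in plain C, running the body once for every (block, thread) pair. This is only a sequential sketch under stated assumptions: the names vec_add_thread and vec_add_emulated are hypothetical, and a real GPU runs these bodies concurrently rather than in nested loops.

```c
#include <assert.h>

/* Body of the kernel for one thread, with the CUDA built-in variables
   blockIdx.x, blockDim.x and threadIdx.x passed in as parameters. */
static void vec_add_thread(int block_idx, int block_dim, int thread_idx,
                           const float *a, const float *b, float *c, int n)
{
    int i = thread_idx + block_dim * block_idx;
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Emulates a 1-D launch: enough blocks of block_dim threads to cover
   n elements, each thread running the same body exactly once. */
void vec_add_emulated(const float *a, const float *b, float *c,
                      int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    int grid_dim = (n + block_dim - 1) / block_dim;
    for (int bx = 0; bx < grid_dim; bx++)
        for (int tx = 0; tx < block_dim; tx++)
            vec_add_thread(bx, block_dim, tx, a, b, c, n);
}
```

With n = 5 and block_dim = 4 the emulation runs 8 bodies in 2 blocks; the first 5 each compute one element and the last 3 do nothing, so the original loop index has become the thread's position in the grid.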



Vector addition

Vector addition can be computed independently for each element.

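The __global__ keyword

o  __global__ indicates that the declared function is a CUDA kernel function.

o  The function is called from the host and executed on the device.

    __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }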

Keywords for CUDA function declarations

int i = threadIdx.x + blockDim.x * blockIdx.x;

o  What does the number i computed by this expression represent?

o  threadIdx is the index of the thread within its block, and blockIdx is the index of the block within the grid.

o  Because each block holds blockDim threads, i uniquely identifies the thread within the grid.

o  What does the .x in *Idx.x stand for? In fact, CUDA provides not only *Idx.x but also *Idx.y and *Idx.z.

o  It denotes the x component of *Idx.



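if (i < n) C_d[i] = A_d[i] + B_d[i];

o  What is the condition if (i < n) in this statement for? Threads are always launched in whole blocks, and block sizes are normally chosen as multiples of 32, so the number of threads generated is a multiple of the block size and can exceed n.

o  The condition keeps the surplus threads from doing any work. With it, exactly n threads do work.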
Calling a CUDA kernel function

Calling vector addition: ordinary C code

    int main()
    {
        // Memory allocation for A_h, B_h, and C_h
        // I/O to read A_h and B_h, N elements (omitted)
        vecAdd(A_h, B_h, C_h, N);
    }

Calling vector addition: CUDA host C code

    int main()
    {
        // Memory allocation for A_h, B_h, and C_h
        // 256.0 forces floating-point division so that ceil() rounds up
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }

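vecAddKernel<<<ceil(n/256.0), 256>>>(...)

o  What are the <<< and >>> symbols? In CUDA, the arguments between them configure the grid.

o  The first argument specifies the number of blocks in the grid, and the second the number of threads in each block.

The arguments of <<< , >>>

o  The statement above, that the first argument is the number of blocks in the grid and the second is the number of threads in a block, was a simplification and is not strictly accurate.

o  In general, a grid is a three-dimensional array of blocks, and likewise a block is a three-dimensional array of threads.

o  In general, therefore, the first argument specifies the three-dimensional arrangement of blocks in the grid, and the second the three-dimensional arrangement of threads in a block.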

The dim3 type

o  dim3 is a C struct with three unsigned integer fields, x, y, and z. Kernel launch parameters are passed with dim3 as shown below. For grids and blocks of fewer than three dimensions, the unused dimensions are set to 1.

    void vecAdd(float* A, float* B, float* C, int n)
    {
        ....
        dim3 DimGrid(n/256, 1, 1);
        if (n%256) DimGrid.x++;   // one extra block for the remainder
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
    }
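The block count in this launch is a rounded-up division, and it is easy to get wrong with integer arithmetic. The sketch below, with the hypothetical helper names blocks_for and idle_threads, shows an integer-only form that gives the same result as "n/256 plus one if there is a remainder", together with the number of surplus threads that the if (i < n) guard keeps idle.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements: ceil(n / block_dim),
   computed with integers only. */
int blocks_for(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Threads launched beyond n; these are the ones that fail the
   `i < n` test in the kernel and do no work. */
int idle_threads(int n, int block_dim)
{
    return blocks_for(n, block_dim) * block_dim - n;
}
```

For n = 1000 and 256 threads per block, blocks_for gives 4 blocks and idle_threads gives 24. Plain integer division 1000/256 would give only 3 blocks and leave the last 232 elements uncomputed.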

Example of a two-dimensional grid and three-dimensional blocks

    dim3 dimGrid(2,2,1);
    dim3 dimBlock(4,2,2);
    kernelFunction<<<dimGrid, dimBlock>>>( ... )
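The sizes in this example can be checked with a little arithmetic. The sketch below uses a plain C struct named Dim3 as a stand-in for CUDA's dim3 (an assumption made so the code compiles without CUDA) and computes how many threads the launch creates and where a thread sits inside its block when the three-dimensional index is flattened with x varying fastest.

```c
#include <assert.h>

/* Plain-C stand-in for CUDA's dim3: three unsigned components. */
typedef struct { unsigned x, y, z; } Dim3;

/* Number of elements in a 3-D arrangement of the given extents. */
unsigned volume(Dim3 d)
{
    return d.x * d.y * d.z;
}

/* Flattens a 3-D index inside extents `dim`, x varying fastest. */
unsigned linear_index(Dim3 idx, Dim3 dim)
{
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

For dimGrid(2,2,1) and dimBlock(4,2,2) there are 4 blocks of 16 threads, 64 threads in total, and the thread with index (3,1,1) is the last one in its block, at linear position 15.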

CUDA program sample

    #include <cuda.h>
    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;
        ...
        // 1. Allocate device memory for A and B
        //    Copy A and B to device memory
        // 2. Run the kernel code and have the device
        //    perform the actual vector sum
        // 3. Copy C from device memory
        //    Free device vectors
    }

[Figure: Part 1 copies data from host memory (CPU) to device memory (GPU), Part 2 runs on the GPU, Part 3 copies the result back to host memory]

Vector addition: CUDA C code

    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;

        // 1. Copy A and B to device memory
        cudaMalloc((void **) &A_d, size);
        cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **) &B_d, size);
        cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

        // Allocate device memory for C
        cudaMalloc((void **) &C_d, size);

        // 2. Kernel invocation code (shown later)
        ...

        // 3. Transfer C from the device to the host
        cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
        // Free the device memory for A, B, and C
        cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    }

Overview of CUDA memory

o  The device can:
   -  read and write per-thread registers
   -  read and write per-grid global memory
o  The host can:
   -  transfer data in both directions to and from per-grid global memory

Global, constant, and texture memory persist across kernels called by the same application.

[Figure: the Host connected to the (Device) Grid; Global Memory is shared by Block (0, 0) and Block (1, 0), each containing Thread (0, 0) and Thread (1, 0) with their own Registers]

CUDA device memory management API

o  cudaMalloc()
   -  Allocates an object in device global memory
   -  Two arguments
      -  Address of a pointer to the allocated object
      -  Size of the allocated object (in bytes)
o  cudaFree()
   -  Frees an object from device global memory
   -  The argument is a pointer to the object to be freed

Host-device data transfer API

o  cudaMemcpy()
   -  Memory data transfer
   -  Four arguments
      -  Pointer to the destination
      -  Pointer to the source
      -  Number of bytes to copy
      -  Type/direction of the transfer
   -  Transfers to the device are performed asynchronously

[Figure: (Device) Grid with Global Memory, Block (0, 0) and Block (1, 0), each containing Thread (0, 0) and Thread (1, 0) with Registers]

Vector addition: CUDA kernel code

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

Vector addition: CUDA host code

    void vecAdd(float* A, float* B, float* C, int n)
    {
        // allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        dim3 DimGrid(n/256, 1, 1);
        if (n%256) DimGrid.x++;
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
    }

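CUDA and OpenCL

Background of OpenCL

o  OpenCL is a standardized, cross-platform parallel computing API based on the C language. It is designed to enable the development of portable parallel applications for systems made up of heterogeneous computing devices.

o  The development of OpenCL was motivated by the need for a standardized, high-performance application development platform for the rapidly growing variety of parallel computing platforms.

Background of OpenCL

o  In particular, OpenCL addresses the limits on application portability in earlier programming models for heterogeneous parallel computing systems.

o  The development of OpenCL was initiated by Apple and managed by the Khronos Group, which also manages the OpenGL standard.

o  On one hand, it was strongly influenced by CUDA in areas such as support for a single code base for heterogeneous parallel computing, data-parallel processing, and complex memory hierarchies.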

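o  On the other hand, OpenCL has a more complex platform and device model because it supports multi-platform and multi-vendor portability.

o  OpenCL implementations exist not only for x86 but also for AMD/ATI and NVIDIA GPUs. In principle, it can be extended to other types of devices such as DSPs and FPGAs. OpenCL supports code portability across devices from different vendors, but this portability does not come for free.

o  OpenCL programs must be prepared to handle a greater diversity of hardware and therefore tend to be more complex.

o  At the same time, many OpenCL features are optional and may not be supported on every device. Portable OpenCL code must avoid these optional features.

o  However, some of these optional features can bring significant performance gains to applications on the devices that support them. As a result, portable OpenCL code may not be able to reach its performance potential on any device.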

Data Parallelism Model

o  OpenCL uses a data-parallel execution model. In this respect it corresponds directly to CUDA.

o  An OpenCL program consists of two parts: the kernels, which execute on OpenCL devices, and the host program, which manages the execution of the kernels.

Correspondence between OpenCL and CUDA

OpenCL | CUDA
Kernel | Kernel
Host Program | Host Program
NDRange (Index Space) | Grid
Work Item | Thread
Work Group | Block

The most basic ideas of OpenCL and its four models

The following four hierarchical models are used to describe the most basic ideas of OpenCL.

o  Platform Model
o  Memory Model
o  Execution Model
o  Programming Model

Platform Model

Memory Model

Execution Model

Platform Model

o  This model consists of a host connected to one or more OpenCL devices.

o  An OpenCL device is divided into one or more compute units (CUs), which are further divided into one or more processing elements (PEs).

o  Computation on a device takes place on the processing elements.

Memory Model

OpenCL Address Space

o __private (CUDA local)

o __local (CUDA shared)

o __constant (CUDA constant)

o __global (CUDA global)

Execution Model

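Kernel and host program

o  The OpenCL execution model is defined in terms of two distinct units of execution: kernels, which execute on OpenCL devices, and a host program, which executes on the host.

o  The kernel is where the "work" associated with a computation is done. This work is carried out by work-items that execute in groups (work-groups).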

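Context

A kernel executes in a context managed by the host. The context defines the execution environment of the kernel. A context contains the following resources:

1.  Devices: the collection of OpenCL devices to be used by the host
2.  Kernels: the OpenCL functions that run on OpenCL devices
3.  Program Objects: the program source and executables that implement the kernels
4.  Memory Objects: memory objects visible to the host and the OpenCL devices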

Command Queue

o  The host and the devices interact through command-queues. Each command-queue is associated with a single device. The commands placed in a command-queue are of three types.

1.  Kernel-enqueue commands
2.  Memory commands
3.  Synchronization commands

State of a command

A command reports its state through an Event Object.

Work item, Work group, NDRange

o  When a kernel function is launched, its code is executed by work-items, which correspond to CUDA threads.

o  Work-items form work-groups, which correspond to CUDA thread blocks.

o  OpenCL work-items are identified by a global dimension index range, the NDRange. The index space defines the work-items and how data is mapped onto them.

Kernel functions in OpenCL and CUDA

OpenCL

    __kernel void addVec(__global const float *a,
                         __global const float *b,
                         __global float *result)
    {
        int gid = get_global_id(0);
        result[gid] = a[gid] + b[gid];
    }

CUDA

    __global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

OpenCL Execution Model

[Figure: overview of the OpenCL parallel execution model, showing work-items grouped into work-groups]

OpenCL | Meaning | CUDA equivalent
get_global_id(0) | global index of the work-item in the x dimension | blockIdx.x * blockDim.x + threadIdx.x
get_local_id(0) | index of the work-item within its work-group (x dimension) | threadIdx.x
get_global_size(0) | size of the NDRange (x dimension) | gridDim.x * blockDim.x
get_local_size(0) | size of the work-group (x dimension) | blockDim.x

Indexes in OpenCL and CUDA
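The relationship between the global index and the (work-group, local index) pair can be written out in both directions. The sketch below is plain C, not OpenCL C; the type WorkItemPos and the function names are hypothetical, and it assumes a one-dimensional NDRange with a global offset of zero.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work-item: which work-group it belongs to and its
   index inside that group. */
typedef struct {
    size_t group_id;  /* CUDA counterpart: blockIdx.x  */
    size_t local_id;  /* CUDA counterpart: threadIdx.x */
} WorkItemPos;

/* Global index from its parts; mirrors
   blockIdx.x * blockDim.x + threadIdx.x. */
size_t global_id_of(WorkItemPos p, size_t local_size)
{
    assert(p.local_id < local_size);
    return p.group_id * local_size + p.local_id;
}

/* Inverse mapping: recovers the work-group and the local index
   from a global index. */
WorkItemPos position_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    WorkItemPos p = { global_id / local_size, global_id % local_size };
    return p;
}
```

With a work-group size of 64, global index 133 belongs to work-group 2 at local index 5, and mapping that pair back gives 133 again.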

Work Item Identifier

Programming Model

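Flow of an OpenCL program

o  An OpenCL program basically proceeds as follows.

1.  Select an OpenCL platform and create a context
2.  Choose a device and create a command-queue
3.  Create a program object
4.  Create a kernel object and the memory objects for its arguments
5.  Execute the kernel and read back the result
6.  Check for errors

The context provides the execution environment for OpenCL devices

Components of OpenCL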

OpenCL Program Sample

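Initialization

    cl_int err;
    cl_platform_id platform;
    cl_context context;
    cl_device_id devices;
    cl_command_queue cmd_queue;
    // clGetDeviceIDs takes the platform as its first argument
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
    context = clCreateContext(0, 1, &devices, NULL, NULL, &err);
    cmd_queue = clCreateCommandQueue(context, devices, 0, NULL);

Specify a device and create the context in which the computation runs

Discovering Devices clGetDeviceIDs

Device Property clGetDeviceInfo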

Memory allocation

    cl_mem ax_mem = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                   atom_buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, ax_mem, CL_TRUE, 0,
                               atom_buffer_size, (void*)ax, 0, NULL, NULL);
    clFinish(cmd_queue);

Allocate the resources used on the device

Memory Buffer clCreateBuffer

Memory Buffer clEnqueueWriteBuffer

Creating the kernel and program

    cl_program program[1];
    cl_kernel kernel[1];
    program[0] = clCreateProgramWithSource(context, 1,
                     (const char**)&program_source, NULL, &err);
    err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);
    kernel[0] = clCreateKernel(program[0], "mdh", &err);

Load the program and create the kernel.

Building Programs clBuildProgram

Execution

    size_t global_work_size[2], local_work_size[2];
    global_work_size[0] = nx;
    global_work_size[1] = ny;
    local_work_size[0] = nx/2;
    local_work_size[1] = ny/2;
    err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &ax_mem);
    // the size arrays are passed directly; they decay to const size_t *
    err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL,
              global_work_size, local_work_size, 0, NULL, NULL);

Pass values to the kernel and execute it.
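In OpenCL 1.x, each global work size passed to clEnqueueNDRangeKernel must be a multiple of the corresponding local work size. When the problem size is not such a multiple, a common approach is to pad the global size upward and have the kernel skip the extra work-items with a bounds check. The helper below is a sketch of that padding; its name is hypothetical and it is not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= n.
   Intended for computing a padded global work size. */
size_t round_up_to_multiple(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t rem = n % local_size;
    return rem == 0 ? n : n + (local_size - rem);
}
```

For example, 1000 elements with a work-group size of 256 give a padded global size of 1024, so 24 work-items fall outside the data and must be filtered out inside the kernel by comparing the global index with the real element count.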

Cleanup

    err = clEnqueueReadBuffer(cmd_queue, val_mem, CL_TRUE, 0,
                              grid_buffer_size, val, 0, NULL, NULL);
    // kernel and program were declared as one-element arrays
    clReleaseKernel(kernel[0]);
    clReleaseProgram(program[0]);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);

Return the result to the host and release the resources.

Execution / Read

OpenCL Design and Programming Guide for the Intel Xeon Phi Coprocessor

http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor

WebCL

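http://www2012.wwwconference.org/proceedings/nocompanion/DevTrack_Slides/008_WebCL_for_hardware_accelerated_web_applications.pdf

Purpose of WebCL

o  Enable high-performance parallel processing on GPUs and multicore hardware in Web applications.
   -  Portable and efficient access to heterogeneous multicore devices
   -  A platform-independent, standards-compliant solution
   -  Integration of OpenCL capabilities into the JavaScript environment
   -  A broader range of interactive Web applications that meet high computational demands on mobile platforms with multicore resources

Design goals of WebCL

o  Enable general-purpose parallel programming on heterogeneous processing elements under the following design philosophy.
   -  A single, uniform standard across desktop and mobile
   -  An open, royalty-free standard for general-purpose parallel programming
   -  Promotion of openness through public specification drafts, mailing lists, forums, and so on

WebCL's approach

o  Stay close to the OpenCL standard
   -  Preserve developer familiarity and ease adoption
   -  Let developers transfer their OpenCL knowledge to the Web environment
   -  Make it easy to keep OpenCL and WebCL in sync as both evolve
o  Aim to be an interface on top of OpenCL
o  Security-focused design
o  Portability by design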

Nokia’s WebCL Prototype

o  Nokia open sourced their prototype in May 2011 (LGPL).

o  Web-based interactive photo editor utilizing GPU for image processing, through WebGL & Nokia’s OpenCL bindings for JavaScript.

o  YouTube Demo: http://www.youtube.com/watch?v=9BF7zzUM1kY

o  Add-on for Firefox 4 on Win/Linux(Firefox 5 coming soon)

o  Visit http://webcl.nokiaresearch.com for binaries, source code, demos and tutorials.

Samsung WebCL Prototype

o  Samsung open sourced their prototype WebCL implementation for WebKit in July 2011 (BSD license).

o  Allows JavaScript to run computations on GPU.

o  Demos on YouTube: http://www.youtube.com/user/SamsungSISA Demos use WebGL for 3D rendering.

o  Code available at http://code.google.com/p/webcl/

o  For comparison, same computations were also done in pure JavaScript. - WebCL gave performance increases of up to 100x.

WebCL Working Draft

2013/10/23 https://cvs.khronos.org/svn/repos/registry/trunk/public/webcl/spec/latest/index.html

    interface WebCL {
        // Functions
        sequence<WebCLPlatform> getPlatforms();
        WebCLContext? createContext(optional WebCLContextProperties properties);
        sequence<DOMString>? getSupportedExtensions();
        object? enableExtension(DOMString extensionName);
        void waitForEvents(sequence<WebCLEvent> eventWaitList,
                           optional WebCLCallback whenFinished);
        void releaseAll();

    dictionary WebCLContextProperties {
        sequence<WebCLDevice>? devices = null;  // Default: let the implementation decide
        WebCLPlatform? platform = null;         // Default: let the implementation decide
        CLenum deviceType = 0x1;                // 0x1 == WebCL.DEVICE_TYPE_DEFAULT
    };

    interface WebCLPlatform {
        any getInfo(CLenum name);
        sequence<WebCLDevice> getDevices(optional CLenum deviceType);
        sequence<DOMString>? getSupportedExtensions();
        object? enableExtension(DOMString extensionName);
    };

    interface WebCLDevice {
        any getInfo(CLenum name);
        sequence<DOMString>? getSupportedExtensions();
        object? enableExtension(DOMString extensionName);
    };

    interface WebCLContext {
        WebCLBuffer createBuffer(CLenum memFlags, CLuint sizeInBytes,
                                 optional ArrayBufferView hostPtr);
        WebCLCommandQueue createCommandQueue(optional WebCLDevice? device,
                                             optional CLenum properties);
        WebCLImage createImage(CLenum memFlags,
                               WebCLImageDescriptor descriptor,
                               optional ArrayBufferView hostPtr);
        WebCLProgram createProgram(DOMString source);
        WebCLSampler createSampler(CLboolean normalizedCoords,
                                   CLenum addressingMode,
                                   CLenum filterMode);
        WebCLUserEvent createUserEvent();
        any getInfo(CLenum name);
        ...

    interface WebCLCommandQueue {

        // Copying: Buffer <-> Buffer, Image <-> Image, Buffer <-> Image
        void enqueueCopyBuffer(WebCLBuffer srcBuffer,
                               WebCLBuffer dstBuffer,
                               CLuint srcOffset,
                               CLuint dstOffset,
                               CLuint numBytes,
                               optional sequence<WebCLEvent>? eventWaitList,
                               optional WebCLEvent? event);

        // Reading: Buffer -> Host, Image -> Host
        void enqueueReadBuffer(WebCLBuffer buffer,
                               CLboolean blockingRead,
                               CLuint bufferOffset,
                               CLuint numBytes,
                               ArrayBufferView hostPtr,
                               optional sequence<WebCLEvent>? eventWaitList,
                               optional WebCLEvent? event);

        // Writing: Host -> Buffer, Host -> Image
        void enqueueWriteBuffer(WebCLBuffer buffer,
                                CLboolean blockingWrite,
                                CLuint bufferOffset,
                                CLuint numBytes,
                                ArrayBufferView hostPtr,
                                optional sequence<WebCLEvent>? eventWaitList,
                                optional WebCLEvent? event);

        // Executing kernels
        void enqueueNDRangeKernel(WebCLKernel kernel,
                                  CLuint workDim,
                                  sequence<CLuint>? globalWorkOffset,
                                  sequence<CLuint> globalWorkSize,
                                  sequence<CLuint>? localWorkSize,
                                  optional sequence<WebCLEvent>? eventWaitList,
                                  optional WebCLEvent? event);

        // Synchronization
        void enqueueMarker(WebCLEvent event);
        void enqueueBarrier();
        void enqueueWaitForEvents(sequence<WebCLEvent> eventWaitList);
        void finish();
        void flush();

        // Querying command queue information
        any getInfo(CLenum name);
        void release();
    };

interface WebCLMemoryObject {
  any getInfo(CLenum name);
  void release();
};

interface WebCLBuffer : WebCLMemoryObject {
  WebCLBuffer createSubBuffer(CLenum memFlags, CLuint origin,
                              CLuint sizeInBytes);
};

interface WebCLImage : WebCLMemoryObject {
  WebCLImageDescriptor getInfo();
};

interface WebCLSampler {
  any getInfo(CLenum name);
  void release();
};

interface WebCLProgram {
  any getInfo(CLenum name);
  any getBuildInfo(WebCLDevice device, CLenum name);
  void build(optional sequence<WebCLDevice>? devices,
             optional DOMString? options,
             optional WebCLCallback whenFinished);
  WebCLKernel createKernel(DOMString kernelName);
  sequence<WebCLKernel> createKernelsInProgram();
  void release();
};

interface WebCLKernel {
  any getInfo(CLenum name);
  any getWorkGroupInfo(WebCLDevice device, CLenum name);
  void setArg(CLuint index, WebCLMemoryObject value);
  void setArg(CLuint index, WebCLSampler value);
  void setArg(CLuint index, ArrayBufferView value);
  void release();
};

[Constructor]
interface WebCLEvent {
  readonly attribute CLenum status;
  readonly attribute WebCLMemoryObject buffer;
  any getInfo(CLenum name);
  any getProfilingInfo(CLenum name);
  void setCallback(CLenum commandExecCallbackType, WebCLCallback notify);
  void release();
};

WebCL Hardware-Accelerated Web Application

http://download.tizen.org/misc/media/conference2012/tuesday/ballroom-b/2012-05-08-1415-1455-webcl_for_hardware-accelerated_web_applications.pdf

WebCL: Initialization

<script>
var platforms = WebCL.getPlatforms();
var devices = platforms[0].getDevices(WebCL.DEVICE_TYPE_GPU);
var context = WebCL.createContext({ WebCLDevice: devices[0] });
var queue = context.createCommandQueue();
</script>

WebCL: Create Kernel

<script id="squareProgram" type="x-kernel">
__kernel void square(__global float* input,
                     __global float* output,
                     const unsigned int count)
{
  int i = get_global_id(0);
  if (i < count)
    output[i] = input[i] * input[i];
}
</script>

WebCL: Create Kernel

<script>
// getProgramSource is a JavaScript function using DOM APIs
var programSource = getProgramSource("squareProgram");
var program = context.createProgram(programSource);
program.build();
var kernel = program.createKernel("square");
</script>

WebCL: Run Kernel 1

<script>
…
var inputBuf = context.createBuffer(WebCL.MEM_READ_ONLY,
    Float32Array.BYTES_PER_ELEMENT * count);
var outputBuf = context.createBuffer(WebCL.MEM_WRITE_ONLY,
    Float32Array.BYTES_PER_ELEMENT * count);
var data = new Float32Array(count);
// populate data …
queue.enqueueWriteBuffer(inputBuf, data, true); // last arg indicates API is blocking

WebCL: Run Kernel 2

kernel.setKernelArg(0, inputBuf);
kernel.setKernelArg(1, outputBuf);
kernel.setKernelArg(2, count, WebCL.KERNEL_ARG_INT);
var workGroupSize = kernel.getWorkGroupInfo(devices[0],
    WebCL.KERNEL_WORK_GROUP_SIZE);
queue.enqueueNDRangeKernel(kernel, [count], [workGroupSize]);

WebCL: Run Kernel 3

queue.finish(); // this API blocks
queue.enqueueReadBuffer(outputBuf, data, true); // last arg indicates API is blocking
</script>
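The launch above passes [count] as the global size and [workGroupSize] as the local size. OpenCL 1.x requires the global size to be a multiple of the local size, and the kernel's if (i < count) guard is what makes a padded launch safe. The following C sketch shows that arithmetic, assuming WebCL inherits the same rule. round_up_global_size and square_reference are hypothetical helpers for illustration, not WebCL or OpenCL calls.

```c
#include <stddef.h>

/* Round `count` up to the next multiple of `local_size`.
 * Hypothetical helper; assumes local_size > 0. */
size_t round_up_global_size(size_t count, size_t local_size)
{
    size_t remainder = count % local_size;
    return remainder == 0 ? count : count + (local_size - remainder);
}

/* Sequential model of the "square" kernel launched over `global_size`
 * work items. Ids at or beyond `count` do nothing, mirroring the
 * `if (i < count)` guard in the kernel source. */
void square_reference(const float *input, float *output,
                      size_t count, size_t global_size)
{
    for (size_t i = 0; i < global_size; i++) {
        if (i < count)
            output[i] = input[i] * input[i];
    }
}
```

With count = 1000 and a work-group size of 256, the padded global size is 1024, and the last 24 work items fall through the guard without touching memory.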

WebCL: Image Object Creation

o  From Uint8Array()

<script>
var bpp = 4; // bytes per pixel
var pixels = new Uint8Array(width * height * bpp);
var pitch = width * bpp;
var clImage = context.createImage(WebCL.MEM_READ_ONLY, {
    channelOrder: WebCL.RGBA,
    channelType: WebCL.UNORM_INT8,
    size: [width, height],
    pitch: pitch
});
</script>

WebCL: Image Object Creation

o  From <img> or <canvas> or <video>

<script>
var canvas = document.getElementById("aCanvas");
// format and size are taken from the element
var clImage = context.createImage(WebCL.MEM_READ_ONLY, canvas);
</script>

WebCL: Vertex Buffer Initialization

<script>
// WebGL
var points = new Float32Array(NPOINTS * 3);
var glVertexBuffer = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, glVertexBuffer);
gl.bufferData(gl.ARRAY_BUFFER, points, gl.DYNAMIC_DRAW);

// WebCL
var clVertexBuffer = context.createFromGLBuffer(WebCL.MEM_READ_WRITE,
    glVertexBuffer);
kernel.setKernelArg(0, NPOINTS, WebCL.KERNEL_ARG_INT);
kernel.setKernelArg(1, clVertexBuffer);
</script>

WebCL: Vertex Buffer Update and Draw

<script>
function DrawLoop() {
  // WebCL
  queue.enqueueAcquireGLObjects([clVertexBuffer]);
  queue.enqueueNDRangeKernel(kernel, [NPOINTS], [workGroupSize]);
  queue.enqueueReleaseGLObjects([clVertexBuffer]);

  // WebGL
  gl.bindBuffer(gl.ARRAY_BUFFER, glVertexBuffer);
  gl.clear(gl.COLOR_BUFFER_BIT);
  gl.drawArrays(gl.POINTS, 0, NPOINTS);
  gl.flush();
}
</script>

Texture Initialization

<script>
var glTexture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, glTexture);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, image);
var clTexture = context.createFromGLTexture2D(WebCL.MEM_READ_WRITE,
    gl.TEXTURE_2D, glTexture);
kernel.setKernelArg(0, NWIDTH, WebCL.KERNEL_ARG_INT);
kernel.setKernelArg(1, NHEIGHT, WebCL.KERNEL_ARG_INT);
kernel.setKernelArg(2, clTexture);
</script>

Texture Update and Draw

<script>
function DrawLoop() {
  queue.enqueueAcquireGLObjects([clTexture]);
  queue.enqueueNDRangeKernel(kernel, [NWIDTH, NHEIGHT],
      [tileWidth, tileHeight]);
  queue.enqueueReleaseGLObjects([clTexture]);
  gl.clear(gl.COLOR_BUFFER_BIT);
  gl.activeTexture(gl.TEXTURE0);
  gl.bindTexture(gl.TEXTURE_2D, glTexture);
  gl.flush();
}
</script>

WebCL: Initialization (draft)



References

OpenCL Design and Programming Guide for the Intel Xeon Phi Coprocessor

http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor

Why is this paper needed?

o  While OpenCL is a portable programming model, performance portability is not guaranteed. Traditional GPUs and the Intel Xeon Phi coprocessor have different HW designs, and they benefit from different application optimizations. For example, traditional GPUs rely on the existence of fast shared local memory, which the programmer needs to program explicitly. The Intel Xeon Phi coprocessor includes a fully coherent cache hierarchy, similar to regular CPU caches, which automatically speeds up memory accesses.

o  Another example: while some traditional GPUs are based on HW scheduling of many tiny threads, Intel Xeon Phi coprocessors rely on the device OS to schedule medium size threads. These and other differences suggest that applications usually benefit from tuning to the HW they’re intended to run on.

Will I need to have different OpenCL optimizations for different devices?

o  Not necessarily. Will you add a small #ifdef in your code to run 50% faster on Intel Xeon Phi coprocessor? Will you duplicate a 1000-line file for that? Would you do it for only 10% speedup? Or, maybe you would prefer adding the optimization unconditionally and pay 10% slowdown on other devices for 50% improvement on Intel Xeon Phi coprocessor? It is totally your decision. In some cases, you will need to make the tradeoff between cross device performance and maintainability of your OpenCL application.

o  We really encourage developers to explore the performance potential of the Intel Xeon Phi coprocessor, using the guidelines available in this document and then decide based on the performance numbers. This document doesn’t intend to answer all the questions, but instead give you some tools to answer them yourself.

o  An Intel Xeon Phi coprocessor contains many cores, each with a 512-bit vector arithmetic unit, capable of executing SIMD vector instructions. An L1 cache is included in each core (32 KB data + 32 KB instructions). An L2 cache is associated with each core (512 KB combined Data and Instr, L1 D cache is inclusive). A high-speed interconnect allows data transfer between the L2 caches and the memory subsystem. Each core can execute up to four HW threads simultaneously.

o  This simultaneous multi-threading helps hide instruction and memory latencies. OpenCL hides most of these details from the programmer.

Key Intel Xeon Phi Coprocessor Performance Aspects

o  Multi-threading parallelism

o  Intel Xeon Phi coprocessor HW includes many cores, depending on the SKU (60 are assumed in this paper). Each core is capable of running up to four HW threads. In most cases, populating the 240 threads with tasks is essential to maximize performance. The exact number of HW threads can be queried with clGetDeviceInfo() and the CL_DEVICE_MAX_COMPUTE_UNITS parameter.

o  In-Core Vectorization

o  The vector size in the Intel Xeon Phi coprocessor is 512-bit-wide SIMD. Typically, this vector represents 8 double precision floating point numbers, or 16 single precision floating point numbers. Each Intel Xeon Phi coprocessor core can issue a single vector computation instruction per cycle.

o  PCI Express* (PCIe) Bus Interface

o  The Intel Xeon Phi coprocessor resides on the PCIe bus. Transferring data over the PCIe bus has the highest latency and the lowest bandwidth. As you would do with any other PCIe device, you should reduce this traffic to a minimum.

o  Memory subsystem

o  The Intel Xeon Phi coprocessor includes three levels of memory (GDDR, L2 cache, and L1 cache). The following table includes important cache information:

                 L1 (Data + Instructions)    Shared L2
Total Size       32 KB + 32 KB               512 KB
Miss Latency     15-30 cycles                500-1000 cycles

o  Since the Intel Xeon Phi coprocessor is an in-order machine, the latency of memory accesses has significant impact on software performance. Luckily, the programmer can reduce these latencies. Prefetches are one of the tools that can help hide memory latencies. We will discuss it in more detail later.

Data Access Pattern

o  Accessing memory consecutively is the fastest way to access memory on the Intel Xeon Phi coprocessor. It improves cache efficiency, reduces the number of TLB (Translation Lookaside Buffer) misses, and allows the HW prefetcher to kick in.

Mapping the OpenCL constructs to Intel Xeon Phi coprocessor

o  Understanding how the key OpenCL constructs are implemented on the Intel Xeon Phi coprocessor will help you better design your application to take advantage of the coprocessor’s HW. It will also help you avoid the coprocessor’s performance pitfalls.

o  Conceptually, at initialization time, the OpenCL driver creates 240 SW threads and pins them to the HW threads (for a 60-core configuration). Then, following a clEnqueueNDRangeKernel() call, the driver schedules the work groups (WGs) of the current NDRange on the 240 threads. A WG is the smallest task scheduled on the threads, so calling clEnqueueNDRangeKernel() with fewer than 240 WGs leaves the coprocessor underutilized.

o  The OpenCL compiler creates an optimized routine that executes a WG. This routine is built from up to three nested loops, as shown in the following pseudo code:

__kernel ABC(…)
for (int i = 0; i < get_local_size(2); i++)
  for (int j = 0; j < get_local_size(1); j++)
    for (int k = 0; k < get_local_size(0); k++)
      Kernel_Body;

o  Note that the innermost loop is used for dimension zero of the NDRange. This directly impacts the access pattern of your performance critical code. It also impacts the implicit vectorization efficiency.

o  The OpenCL compiler implicitly vectorizes the WG routine based on dimension zero loop, i.e., the dimension zero loop is unrolled by the vector size. So the WG code with vectorization looks like:

__kernel ABC(…)
for (int i = 0; i < get_local_size(2); i++)
  for (int j = 0; j < get_local_size(1); j++)
    for (int k = 0; k < get_local_size(0); k += VECTOR_SIZE)
      Vector_Kernel_Body;

o  The vector size of Intel Xeon Phi coprocessor is 16, regardless of the data types used in the kernel. However, in the future, we may increase the vectorization size to allow more instruction level parallelism.
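The loop nest described above can be modeled in plain C. This is only a scalar sketch of the iteration order, not what the Intel compiler actually emits. LocalSize, KernelBody, Trace, run_work_group and record_linear_index are names invented for this illustration.

```c
#include <stddef.h>

#define TRACE_CAPACITY 64

/* Local size of one work group in dimensions 0, 1 and 2. */
typedef struct { size_t dim0, dim1, dim2; } LocalSize;

/* Hypothetical kernel body, called once per work item. */
typedef void (*KernelBody)(size_t id0, size_t id1, size_t id2, void *user);

/* Scalar model of the work-group routine: dimension zero is innermost. */
void run_work_group(LocalSize ls, KernelBody body, void *user)
{
    for (size_t i = 0; i < ls.dim2; i++)
        for (size_t j = 0; j < ls.dim1; j++)
            for (size_t k = 0; k < ls.dim0; k++)
                body(k, j, i, user);
}

/* Records the row-major linear index touched by each work item, in order. */
typedef struct {
    size_t width;                 /* row length of the 2D buffer */
    size_t next;                  /* number of entries recorded so far */
    size_t order[TRACE_CAPACITY]; /* linear indices in execution order */
} Trace;

void record_linear_index(size_t id0, size_t id1, size_t id2, void *user)
{
    Trace *t = (Trace *)user;
    (void)id2; /* unused in this 2D illustration */
    if (t->next < TRACE_CAPACITY)
        t->order[t->next++] = id1 * t->width + id0;
}
```

Running a 4 x 2 x 1 work group over a buffer of width 4 records the indices 0 through 7 in ascending order, which is why indexing memory with dimension zero as the fastest-varying index gives consecutive accesses.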

Exposing algorithm parallelism

o  While the OpenCL specification provides various ways to express parallelism and concurrency, some of them will not map well to Intel Xeon Phi coprocessor. We will show you how the key OpenCL constructs are mapped to the coprocessor, so you can design your application to exploit its parallelism.

Multi-threading

o  To get good utilization of the 240 HW threads, it’s best to have more than 1000 WGs per NDRange. Having 180 to 240 WGs per NDRange will provide basic thread utilization; however, the execution may suffer from poor load balancing and high invocation overhead.

o  Recommendation: Have at least 1000 WGs per NDRange to optimally utilize the Intel Xeon Phi coprocessor HW threads. Applications with NDRange of 100 WGs or less will suffer from serious under-utilization of threads.

o  Single WG execution duration also impacts the threading efficiency. Lightweight WGs are also not recommended, as these may suffer from relatively high overheads.
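A rough way to see why the WG count matters is to model the 240 threads as running equal-cost WGs in lock-step rounds. This is a deliberate simplification (the real runtime schedules dynamically and WGs rarely cost the same), and thread_utilization is a hypothetical helper, not an OpenCL call.

```c
/* Fraction of thread slots doing useful work when `num_wgs` equal-cost
 * work groups run on `num_threads` threads in lock-step rounds.
 * Simplified model for illustration only. */
double thread_utilization(unsigned num_wgs, unsigned num_threads)
{
    if (num_wgs == 0 || num_threads == 0)
        return 0.0;
    /* Number of rounds needed, rounding up. */
    unsigned rounds = (num_wgs + num_threads - 1) / num_threads;
    return (double)num_wgs / ((double)rounds * (double)num_threads);
}
```

Under this model 100 WGs keep fewer than half of the 240 threads busy, and 241 WGs force a second round that is almost empty. With about 1000 WGs a partly filled last round costs far less, because it is one round out of five rather than one out of two.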

Vectorization

o  OpenCL on Intel Xeon Phi coprocessor includes an implicit vectorization module. The OpenCL compiler automatically vectorizes the implicit WG loop over the work items in dimension zero (see example above). The vectorization width is currently 16, regardless of the data type used in the kernel. In future implementations, we may vectorize even 32 elements. As OpenCL work items are guaranteed to be independent, the OpenCL vectorizer needs no feasibility analysis to apply vectorization.

o  However, the vectorized kernel is only used if the local size of dimension zero is greater than or equal to 16. Otherwise, the OpenCL runtime runs the scalar kernel for each of the work items. If the WG size at dimension zero is not divisible by 16, then the tail of the WG needs to be executed by scalar code. This isn’t an issue for large WGs, e.g., 1024 items at dimension zero, but it is for WGs of size 31 on dimension zero.

o  Recommendation 1: Don’t manually vectorize kernels, as the OpenCL compiler is going to scalarize your code to prepare it for implicit vectorization.

o  Recommendation 2: Avoid using a WG size that is not divisible by 32 (16 will work for now).
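The split between vectorized iterations and the scalar tail follows directly from the rules above. A small C sketch, assuming the vector width of 16 stated in the text; WgSplit and split_dimension_zero are hypothetical names.

```c
#define VECTOR_SIZE 16u

/* How one work group's dimension-zero loop is divided. */
typedef struct {
    unsigned vector_iterations; /* iterations of the 16-wide kernel */
    unsigned scalar_items;      /* work items left for scalar code */
} WgSplit;

/* Local sizes below VECTOR_SIZE yield zero vector iterations, which
 * matches the rule that the vectorized kernel is not used at all. */
WgSplit split_dimension_zero(unsigned local_size0)
{
    WgSplit s;
    s.vector_iterations = local_size0 / VECTOR_SIZE;
    s.scalar_items = local_size0 % VECTOR_SIZE;
    return s;
}
```

A local size of 1024 gives 64 vector iterations and no scalar tail. A local size of 31 gives one vector iteration followed by 15 scalar work items, so nearly half of the work items run through the scalar path.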

Work-Item-ID nonuniform control flow

o  In this section, we explain the difference between uniform and nonuniform control flow in the context of implicit vectorization. The distinction matters because uniform control flow may have a small negative impact on performance, whereas nonuniform control flow creates significant overhead within the innermost NDRange dimension. What matters is uniformity with respect to the vectorized loop (dimension zero).

o  A branch is uniform if it is statically guaranteed that all work items within a WG execute the same side of the branch.

Uniform branch example:

// isSimple is a kernel argument
int LID = get_local_id(0);
if (isSimple == 0)
  Res = buff[LID];

Nonuniform branch example:

int LID = get_local_id(0);
if (LID == 0)
  Res = -1;

Another uniform branch example:

int LID = get_local_id(1);
// Uniform, as the IF is based on dimension one, while vectorization is on dimension zero.
if (LID == 0)
  Res = -1;

o  While vectorizing, the compiler has to linearize (flatten) any code dominated by nonuniform control flow via predication. The first and major cost of predication is the execution of both sides of the branch. Additional penalties result from the masked execution.

o  Recommendation: Avoid branches, especially those that are nonuniform on dimension zero.

// Assuming the following original kernel code:
int gid = get_global_id(0);
if (gid % 32 == 0)
  Res = HandleEdgeCase();
else
  Res = HandleCommonCase();

// After vectorization (and predication), the code looks like:
int16 gid = get16_global_id(0);
uint mask;
mask = compare16int((gid % broadcast16(32)), 0);
res_if = HandleEdgeCase();
res_else = HandleCommonCase();
Res = (res_if & mask) | (res_else & not(mask));
// Note that both the IF and the ELSE are executed for all of the work items.
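The mask-and-blend step can be emulated lane by lane in ordinary C. This is a scalar sketch of the idea rather than the compiler's real output. handle_edge_case and handle_common_case are placeholder bodies made up for the example, and predicated_step is a hypothetical name.

```c
#include <stdint.h>

#define LANES 16

/* Placeholder branch bodies, invented for this illustration. */
static uint32_t handle_edge_case(uint32_t gid)   { return gid + 1000u; }
static uint32_t handle_common_case(uint32_t gid) { return gid * 2u; }

/* Emulates one 16-wide predicated iteration starting at `first_gid`.
 * Both sides of the branch are evaluated for every lane, and a
 * per-lane mask selects which result is kept. */
void predicated_step(uint32_t first_gid, uint32_t res[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        uint32_t gid = first_gid + (uint32_t)lane;
        /* All ones where the condition holds, all zeros elsewhere. */
        uint32_t mask = (gid % 32u == 0u) ? UINT32_MAX : 0u;
        uint32_t res_if = handle_edge_case(gid);     /* always executed */
        uint32_t res_else = handle_common_case(gid); /* always executed */
        res[lane] = (res_if & mask) | (res_else & ~mask);
    }
}
```

Only one lane in 32 takes the edge case, yet every lane pays for both function bodies. That doubled work is the main cost of predication described above.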

Data Alignment

o  For various reasons, memory access that is vector-size-aligned is faster than unaligned memory access. In the Intel Xeon Phi coprocessor, OpenCL buffers are guaranteed to start on a vector-size-aligned address. However, this only guarantees that the first WG starts at an aligned address. To guarantee that all WGs start at a properly aligned location, the WG size (local size) needs to be divisible by 16, or even by 32 if you want to take advantage of potential product improvements.

o  Calling clEnqueueNDRangeKernel() with a NULL local size lets the OpenCL driver choose the best WG size for you. The driver should be smart enough to choose a WG size matching the alignment requirements. However, the programmer needs to make sure that the global size is divisible by VECTOR_SIZE and that the quotient is big enough to let the runtime split the work efficiently into WGs. “Big enough” is 1,000,000 for a small kernel and 1000 for a huge kernel that contains a 1000-iteration loop. Also note that NDRange offsetting can break the alignment.

o  Recommendation 1: Don’t use an NDRange offset. If you have to use an offset, then make it a multiple of 32, or at least a multiple of 16.

o  Recommendation 2: Use local size that is a multiple of 32, or at least of 16.
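The effect of the local size and the NDRange offset on alignment can be checked with simple index arithmetic. The sketch below assumes that each work item touches the buffer element whose index equals its global ID and that the buffer itself starts on an aligned address. wg_start_index and all_wgs_aligned are hypothetical helpers.

```c
#include <stdbool.h>
#include <stddef.h>

#define VECTOR_SIZE 16u

/* Element index at which work group `wg` starts along dimension zero. */
size_t wg_start_index(size_t global_offset, size_t local_size, size_t wg)
{
    return global_offset + wg * local_size;
}

/* True if the first `num_wgs` work groups all start on a multiple of
 * VECTOR_SIZE elements. */
bool all_wgs_aligned(size_t global_offset, size_t local_size, size_t num_wgs)
{
    for (size_t wg = 0; wg < num_wgs; wg++) {
        if (wg_start_index(global_offset, local_size, wg) % VECTOR_SIZE != 0)
            return false;
    }
    return true;
}
```

A local size of 64 with no offset keeps every WG aligned. A local size of 24 breaks alignment at the second WG, and an offset of 8 breaks it for every WG even when the local size is 64.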

Design your algorithm to benefit from the Intel Xeon Phi coprocessor memory subsystem

o  Since Intel Xeon Phi coprocessor is an in-order machine, it is very sensitive to memory latencies. Memory-related optimizations, at the application level, can lead to 2X-4X performance speedup.

Intra WG data reuse

o  Designing your application to maximize the amount of data reuse from the caches is the first memory optimization to apply. However, only certain algorithms need to reuse data. For example, adding two matrices involves no opportunity to reuse any data. But multiplying two matrices (GEMM) involves significant data reuse. Therefore, it is an obvious candidate for blocking/tiling optimization. Please see more details in the Intel SDK for OpenCL Applications XE – Optimization Guide.

o  To benefit from data reuse, you need to take into account the WG implicit loop(s), as described earlier in this document. The programmer’s control over these loops is through the local size definition. The programmer can add additional loop(s) (explicit) in the kernel.
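Blocking is easiest to see in a sequential setting. The following is a plain C sketch of a tiled matrix multiply, not an OpenCL kernel and not code from the Intel guide. TILE is an illustrative value that real code would tune to the cache size, and gemm_tiled is a hypothetical name.

```c
#include <stddef.h>

#define TILE 4 /* illustrative tile edge; real code would tune this */

/* Computes C += A * B for row-major n x n matrices, visiting the
 * operands in TILE x TILE blocks so that each block of A and B is
 * reused from cache before the loops move on. */
void gemm_tiled(size_t n, const float *a, const float *b, float *c)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* Work inside one block; the bounds also handle
                 * matrices whose size is not a multiple of TILE. */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

In an OpenCL kernel the outer block loops would typically be replaced by the WG decomposition, with the local size playing the role of the tile, and any remaining loops written explicitly in the kernel.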

Cross WG data reuse

o  Cross-group data reuse is a greater challenge. Currently, OpenCL on Intel Xeon Phi coprocessor doesn’t allow enough control over the WGs scheduling. Therefore, cross WG data reuse is almost impossible. We will keep this section as a placeholder for future development.

Data access pattern

o  Consecutive data access usually allows the best memory system performance. When one considers consecutive memory access, understanding the structure of the WG implicit loops is crucial. The innermost implicit loop is the loop over dimension zero. If your kernel introduces no additional (explicit) loop, then you should try having most of your memory accesses consecutive with that implicit dimension zero loop in mind. For example:

o  The following code accesses the 2D buffers consecutively in memory (recommended):

__kernel ABC(…){
  int ID1 = get_global_id(1);
  int ID0 = get_global_id(0);
  res[ID1][ID0] = param1 * buffer[ID1][ID0];
}

o  The following code doesn’t access the 2D buffers consecutively in memory (not recommended):

__kernel ABC(…){
  int ID1 = get_global_id(1);
  int ID0 = get_global_id(0);
  res[ID0][ID1] = param1 * buffer[ID0][ID1];
}

o  The second code example scans the 2D buffers “column major.” With vectorization, this causes two problems: 1) The input vector data needs to be gathered along the column from 16 consecutive rows, and the result is stored via scatter instructions to 16 different rows. Both operations are slow. 2) Memory access is not consecutive from iteration to iteration. Both problems increase the pressure on the TLB and prevent prefetching.

Simple one-dimensional example

Consecutive access (recommended):

int id = get_global_id(0);
A[id] = B[id];

Non-consecutive access (not recommended):

int id = get_global_id(0);
A[id*4] = B[id*4];

Recommendation: Use ID(0) to index memory consecutively within the row. With explicit 2D buffer: buffer[ID1][ID0]. With 2D indexing into 1D buffer: buffer[STRIDE * ID1 + ID0]

o  If your kernel includes an explicit loop, then you should remember that the implicit vectorization is still based on the ID(0) implicit loop. So accessing buffers through the OpenCL IDs should follow the recommendation above (buffer[ID1][ID0]). This will keep vector access consecutive and efficient. Accessing buffers through the inner loop index (idx), will be consecutive within the inner loop (buffer[ID1][idx]) and will be uniform to the vectorized loop, which is excellent! However, mixing ID0 and idx should be avoided. For example, buffer[ID0][idx] is strided to the vectorized loop, therefore will result in gather/scatter.
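The difference between the two indexing patterns comes down to how far apart neighboring work items land in memory. A C sketch of that arithmetic for a row-major buffer; index_row_major, step_recommended and step_transposed are hypothetical helpers.

```c
#include <stddef.h>

/* Linear index of element [row][col] in a row-major 2D buffer whose
 * rows are `stride` elements long. */
size_t index_row_major(size_t row, size_t col, size_t stride)
{
    return stride * row + col;
}

/* Distance in elements between the accesses of work items id0 and
 * id0 + 1 for the recommended pattern buffer[ID1][ID0]. */
size_t step_recommended(size_t id1, size_t id0, size_t stride)
{
    return index_row_major(id1, id0 + 1, stride)
         - index_row_major(id1, id0, stride);
}

/* The same distance for the transposed pattern buffer[ID0][ID1]. */
size_t step_transposed(size_t id1, size_t id0, size_t stride)
{
    return index_row_major(id0 + 1, id1, stride)
         - index_row_major(id0, id1, stride);
}
```

For a stride of 1024 the recommended pattern moves one element per work item, so 16 lanes read one contiguous run. The transposed pattern jumps 1024 elements per work item, which is what forces gather and scatter.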

Data layout

o  Pure SOA (Structure-of-Arrays) data layout results in simple and efficient vector loads and stores. However, spatial locality is lower, the pressure on the TLB is higher, and the number of pages used simultaneously can be higher.

o  With AOS (Array-of-Structures) data layout, the generated vectorized kernel needs to load and store data via gather and scatter instructions, which are less efficient than simple vector load and store. However, for random access pattern, AOS layout is often more efficient than SOA because of better spatial locality.

o  Please remember that random access of SOA data layout creates gather and scatter instructions too.

o  The third option is AOSOA—an array of structures of small arrays. The size of the small arrays should be 32 for the Intel Xeon Phi coprocessor. This would allow vectorization of up to 32 elements vector.

typedef struct { float x[32], y[32], z[32]; } Point32;
__kernel void ABC(__global Point32* ptrData)

o  AOSOA allows efficient vectorization using simple vector loads, while not overloading the TLB, nor spreading the accesses across many pages. The problem of AOSOA is the readability of the code. Most people don’t naturally think in AOSOA terms.
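Much of the readability problem comes from the index arithmetic, which can be hidden behind a small accessor. A host-side C sketch of the Point32 layout; point_x and scale_x are hypothetical helpers, and the block size of 32 is taken from the text above.

```c
#include <stddef.h>

#define BLOCK 32

/* One AOSOA block: 32 points stored as three small arrays. */
typedef struct { float x[BLOCK], y[BLOCK], z[BLOCK]; } Point32;

/* Address of the x coordinate of logical point `i`: the block is
 * i / BLOCK and the lane inside the block is i % BLOCK. */
float *point_x(Point32 *blocks, size_t i)
{
    return &blocks[i / BLOCK].x[i % BLOCK];
}

/* Scales every x coordinate. The inner loop walks 32 consecutive
 * floats, which is the access shape a vectorizer handles with plain
 * vector loads and stores. */
void scale_x(Point32 *blocks, size_t num_blocks, float factor)
{
    for (size_t b = 0; b < num_blocks; b++)
        for (size_t lane = 0; lane < BLOCK; lane++)
            blocks[b].x[lane] *= factor;
}
```

Logical point 35 lives in block 1, lane 3. Code that needs per-point access can go through point_x, while bulk loops iterate block by block as scale_x does.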

Data prefetching

o  With the Intel Xeon Phi coprocessor being an in-order machine, data prefetching is an essential way to bring data closer to the cores, in parallel with other computations. Loads and stores are executed serially, with no parallelism between them. For example, any two load instructions are executed entirely serially. The prefetch instruction is the exception: it is executed in parallel with other instructions, including other prefetch instructions. Therefore, even a prefetch that hasn’t finished in time can still improve performance, because its memory request proceeds in parallel with other instructions.

o  A cache miss means a thread stall plus a few cycles penalty to reissue the instruction. The Intel Xeon Phi coprocessor includes a simple automatic HW prefetcher to the L2. It takes some time for the HW prefetcher to kick in, and it needs to restart on every 4 KB virtual page boundary.

o  Automatic SW prefetches to the L1 and L2 are inserted by the OpenCL compiler for data accessed in future iterations, whenever it figures out (through analysis) that such can be inserted and provide benefit. The beta release includes partial support for automatic SW prefetching.

o  Manual prefetching can be inserted by the programmer into the OpenCL kernel, via the prefetch built-in. Currently, manual prefetches are inserted exactly at the location and to the address that the programmer requested, but these are limited to L2 prefetches. In the future, the OpenCL compiler may add both L2 and L1 prefetches for the PREFETCH built-in. It may also improve the location and stride indicated by the programmer. Manual prefetches should be inserted at least 500 cycles before the data is going to be actually used. Usually only the main input and output buffers need to be prefetched.
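The 500-cycle guideline translates into a number of loop iterations to look ahead. The C sketch below shows that calculation and a host-side loop that prefetches at that distance. It uses the GCC/Clang __builtin_prefetch extension as a stand-in for the OpenCL prefetch built-in, which is an assumption of this example and not code from the Intel guide. prefetch_distance and sum_with_prefetch are hypothetical names, and the cycles-per-iteration figure would have to be measured.

```c
#include <stddef.h>

/* Iterations to look ahead so that about `latency_cycles` pass between
 * issuing a prefetch and using the data (rounded up). */
size_t prefetch_distance(size_t latency_cycles, size_t cycles_per_iteration)
{
    if (cycles_per_iteration == 0)
        return 0;
    return (latency_cycles + cycles_per_iteration - 1) / cycles_per_iteration;
}

/* Sums `n` floats while prefetching `distance` elements ahead.
 * __builtin_prefetch is a GCC/Clang extension; it only hints and
 * never changes the result. */
float sum_with_prefetch(const float *data, size_t n, size_t distance)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + distance < n)
            __builtin_prefetch(&data[i + distance], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

With a loop body of roughly 20 cycles, the 500-cycle guideline gives a look-ahead of 25 iterations.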

Local memory and Barriers

o  While traditional GPUs include Shared Local Memory (SLM), which requires manual management, Intel Xeon Phi coprocessor includes a two-level cache system (automatic), similar to most modern CPUs. Therefore, using the OpenCL SLM provides no benefit on the Intel Xeon Phi coprocessor. Furthermore, local memory in the coprocessor is allocated on the regular GDDR memory and is supported by the cache system like any other memory. Therefore, it introduces additional overhead in terms of redundant data copy and management.

o  Recommendation: Avoid using Shared Local Memory on the Intel Xeon Phi coprocessor.

o  The Intel Xeon Phi coprocessor includes no special HW support for barriers. Therefore, barriers are emulated by OpenCL on the coprocessor. We recommend avoiding the use of barriers. Also, splitting the kernel into two separate kernels will be slower than a barrier, so we don’t recommend taking this path either.

o  As of the beta release, the combination of barrier and WG size nondivisible by 16, results in execution of a scalar kernel. Please avoid this combination. Currently, we don’t see a justification to optimize it within the OpenCL compiler.

Summary

o  While designing your OpenCL application for Intel Xeon Phi coprocessor, you should pay careful attention to the following aspects:

1.  Include enough work groups within each NDRange—a minimum of 1000 is recommended.

2.  Avoid lightweight work groups. Don’t hesitate using the maximum local size allowed (currently 1024). Keep the WG size a multiple of 32.

3.  Avoid ID(0) dependent control flow. This allows efficient implicit vectorization.

5.  Prefer consecutive data access.

6.  Data layout preferences: AOS for sparse random access; pure SOA or AOSOA(32) otherwise.

7.  Exploit data reuse through the caches within the WG—tiling/blocking.

8.  If auto-prefetching didn’t kick in, use the PREFETCH built-in to bring the global data to the cache 500‒1000 cycles before use.

9.  Don’t use local memory. Avoid using barriers.

End of World