
OpenCL - Khronos Group · 2014-04-08 · Overview of OpenCL OpenCL概览 CPUs Multiple cores driving performance increases 多芯驱动性能增长 GPUs Increasingly general purpose


Page 1

2nd Floor, 45 York Place, Edinburgh EH1 3HP, United Kingdom

Visit us at www.codeplay.com

OpenCL

Colin Riddell, GPU Compiler Developer, Codeplay

Page 2

In this talk

• Codeplay
– Our market
• Overview of OpenCL
• Codeplay + OpenCL
– Our technology
• Future of OpenCL

Page 3

Codeplay: GPU Compiler Experts

• Developing whole GPU compilers
• Working on customers' GPU compilers (LLVM and proprietary tech)
• Optimization, bug-fixing, new languages, testing
• Infrastructure for testing and tracking performance

Producing GPU compilers since 2002, building on the founders' work that started in 1999

• Collaborative R&D
• Academic collaborations
• Creating test suites
• Helping define new standards
• GPGPU as well as graphics
• Debuggers, IDE integration
• Graphics and games software

15 expert development staff based in Edinburgh, Scotland, UK, and an expert in Uppsala, Sweden

Page 4

Page 5

Codeplay's Technologies

• Testplay – GPU compiler development and monitoring tool
– Allows us to test compilers very quickly
– Monitors progress towards performance and quality targets
• Offload – A range of technologies to enable complex C++ software on GPUs
– Compiler: source-to-source translator or Clang/LLVM patch
– Libraries, being standardized with Khronos and others
• VectorC – A retargetable optimizing compiler for GPU-type processors

Page 6

Our Market

Testplay compiler testing system

• Small specialized teams work with large hardware companies
• Help with optimizing their compiler tech
• Major focus on conformance, then performance
• Use in-house testing tools such as Testplay

Page 7

• Codeplay
• Overview of OpenCL
• Codeplay + OpenCL
– Our technology
– Our market
• Future of OpenCL

Page 8

Overview of OpenCL

• CPUs: multiple cores driving performance increases
• GPUs: increasingly general-purpose data-parallel computing
• Graphics APIs and shading languages
• Multi-processor programming, e.g. OpenMP
• Emerging intersection: heterogeneous computing

Diagram thanks to Neil Trevett, Khronos Group

Page 9

OpenCL – Heterogeneous

• Host
• Compute Device
• Compute Unit
• Processing Element

Page 10

The BIG Idea behind OpenCL

• OpenCL execution model
– Define an N-dimensional computation domain
– Execute a kernel at each point in the computation domain
• C derivative to write kernels, based on ISO C99
• APIs to discover devices in a system and distribute work to them
• Targeting many types of device
– GPUs, CPUs, DSPs, embedded systems, mobile phones, even FPGAs
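The execution model can be pictured on the host as a loop nest that calls the kernel once per point of the domain. The following is a minimal C++ sketch of that idea for a 2D domain. The names `run_over_domain_2d` and `fill_with_coordinates_sum` are hypothetical and are not part of OpenCL; a real device is free to run the points in parallel and in any order.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a 2D launch: calls `kernel` once for
// every (x, y) point of a width x height computation domain.
template <class Kernel>
void run_over_domain_2d(std::size_t width, std::size_t height, Kernel kernel) {
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            kernel(x, y);  // on a real device these calls may run in parallel
}

// Example "kernel": each point writes only its own output element, so the
// result does not depend on the order in which points are visited.
std::vector<int> fill_with_coordinates_sum(std::size_t width, std::size_t height) {
    std::vector<int> out(width * height, 0);
    run_over_domain_2d(width, height, [&](std::size_t x, std::size_t y) {
        out[y * width + x] = static_cast<int>(x + y);
    });
    return out;
}
```

Because every point touches a distinct element, the same body would stay correct if the points were executed concurrently, which is the property data-parallel kernels rely on.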

Page 11

Memory Model and domain

• Private Memory is per work-item (for a single kernel execution only)
• Local Memory is shared within a work-group (between all work-items, i.e. "instances" of the kernel, in that group)
• Global/Constant Memory is visible to all work-groups (between all kernels)
• Inputs from the host need to be explicitly passed in

Diagram thanks to Khronos Group
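One way to get a feel for the three memory regions is to emulate them on the CPU. The sketch below is an illustration only, not OpenCL code: `group_sums` is a hypothetical function that computes one partial sum per work-group, and it assumes the input length is a multiple of the group size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of the memory regions. Computes one partial sum
// per work-group; input.size() is assumed to be a multiple of group_size.
std::vector<float> group_sums(const std::vector<float>& input,  // "global" memory
                              std::size_t group_size) {
    const std::size_t num_groups = input.size() / group_size;
    std::vector<float> result(num_groups, 0.0f);                // "global" memory

    for (std::size_t group = 0; group < num_groups; ++group) {
        // "Local" memory: one scratch array per work-group, shared by its items.
        std::vector<float> local_scratch(group_size);

        for (std::size_t local_id = 0; local_id < group_size; ++local_id) {
            // "Private" memory: variables owned by a single work-item.
            const std::size_t global_id = group * group_size + local_id;
            const float private_value = input[global_id];
            local_scratch[local_id] = private_value;
        }

        // On a device, a barrier would be needed here before reading the scratch.
        float sum = 0.0f;
        for (float v : local_scratch) sum += v;
        result[group] = sum;
    }
    return result;
}
```

The scratch array is recreated for each group, which mirrors the rule that one work-group cannot see another group's local memory.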

Page 12

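Memory Model and domain

Memory space keywords

    kernel void dp_mul(global const float *a,
                       global const float *b,
                       global float *c) {
        // global index of this work-item in dimension 0
        int id = get_global_id(0);
        // private: one copy per work-item (OpenCL C vector literal syntax)
        __private float3 priv = (float3)(0.0f, 1.0f, 2.0f);
        // local: shared by the work-group; a __local variable declared
        // in a kernel cannot have an initializer
        __local int loc;
        c[id] = a[id] * b[id];
    } // execute over "n" work-items

Diagram thanks to Khronos Group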
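When bringing up a kernel like `dp_mul`, it can help to keep a plain host-side reference to compare device results against. A minimal sketch follows, assuming the inputs are held in `std::vector<float>`; `dp_mul_reference` is a hypothetical helper name, not something from the slides.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference for dp_mul: the loop variable plays the role of
// get_global_id(0), and each iteration corresponds to one work-item.
std::vector<float> dp_mul_reference(const std::vector<float>& a,
                                    const std::vector<float>& b) {
    // Only the common prefix is processed if the lengths differ.
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<float> c(n);
    for (std::size_t id = 0; id < n; ++id)
        c[id] = a[id] * b[id];
    return c;
}
```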

Page 13

On the host side

• Get the devices – clGetDeviceIDs()
• Create a context – clCreateContext()
• Create a command queue – clCreateCommandQueue()
• Allocate GPU in/out memory for buffers – clCreateBuffer()
• Create the program, build it, obtain a kernel handle – clCreateProgramWithSource(), clBuildProgram(), clCreateKernel()
• Associate GPU memory with kernel args – clSetKernelArg()
• Launch – clEnqueueNDRangeKernel()
• Copy the result from GPU out to CPU memory – clEnqueueReadBuffer()

Page 14

OpenCL 1.2

• OpenCL 1.2 is the latest publicly available spec
• Custom devices and built-in kernels supported
• Image improvements:
– clEnqueueFillBuffer and clEnqueueFillImage can fill with a color or pattern
– clCreateImage() to create an image object
– New 1D image, 1D image from a buffer object, 1D image array
• Memory APIs:
– clEnqueueMigrateMemObjects() gives better control over the location of memory objects
• Separate compilation and linking of kernel programs
• Device partitioning, based on a number of partitioning schemes

Page 15

• Codeplay
– Our market
• Overview of OpenCL
• Codeplay + OpenCL
– Our technology
• Future of OpenCL

Page 16

• Performance: GPUs (and FPGAs) are fast!
• OpenCL lets you run on a wide variety of parallel devices with high performance
– GPU/CPU/FPGA/Cell BE, etc.
• Open standard, widely supported, lots going on
• A higher-level language may have domain-specific knowledge about software to enable parallelization
– Can make GPU acceleration easy?

OpenCL as a backend for compiler tools

Page 17

What are the problems?

• A GPU (or FPGA) is not a 100% general-purpose programmable device
• May change in the future, but for now:
– Data-parallel, with limited task-parallel support
– No recursion, no globals, no function pointers
– Only access data in buffers
– Understand memory spaces
• Shipping source for unknown platforms
– Need to be able to handle run-time compilation
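The restrictions on recursion and on data outside buffers usually mean restructuring host-style code before it can run in a kernel. As an illustration in plain C++ (not kernel code), the sketch below sums a binary tree without recursion: the nodes live in a flat array, children are referred to by index instead of pointer, and an explicit fixed-size stack replaces the call stack. `Node`, `sum_tree_iterative` and the depth limit of 64 are assumptions made for this example.

```cpp
// Node of a binary tree stored in a flat array; -1 means "no child".
struct Node {
    int value;
    int left;
    int right;
};

// Sums all values reachable from `root` without recursion, using a fixed-size
// explicit stack (kernels cannot allocate memory dynamically either).
// Assumes at most kMaxDepth nodes are pending at once; children that do not
// fit on the stack are skipped.
int sum_tree_iterative(const Node* nodes, int root) {
    constexpr int kMaxDepth = 64;
    int stack[kMaxDepth];
    int top = 0;
    int sum = 0;
    if (root >= 0) stack[top++] = root;
    while (top > 0) {
        const Node& n = nodes[stack[--top]];
        sum += n.value;
        if (n.left >= 0 && top < kMaxDepth) stack[top++] = n.left;
        if (n.right >= 0 && top < kMaxDepth) stack[top++] = n.right;
    }
    return sum;
}
```

Storing the tree as an array of index-linked nodes also fits the "only access data in buffers" rule, since the whole structure can be copied to the device as one buffer.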

Page 18

    #include "OffloadCL.h"
    #include "SeismicSimulation.h"
    using namespace Offload;

    void OffloadCLUpdateStressPerf() {
        array_view<float,2> mS(clS), mT(clT), mV(clV);
        array_view<const float,2> mM(clM);
        parallel_for_each(grid<2>(extent<2>(UniverseHeight-1, UniverseWidth-1)),
            [=](index<2> pt) {
                int x = pt.get_x();
                int y = pt.get_y();
                mS[pt] = mS[pt] + mM[pt]*(mV[index<2>(y,x+1)]-mV[pt]);
                mT[pt] = mT[pt] + mM[pt]*(mV[index<2>(y+1,x)]-mV[pt]);
            });
    }

OffloadCL – C++ on the GPU

• parallel_for_each used with a lambda function
• The lambda function runs on the GPU as OpenCL when compiled with OffloadCL

Page 19

    template<class Functor>
    void parallel_for_each(grid<2> range, Functor functor) {
    #ifdef __offloadcplusplus
        int w = range.get_x(), h = range.get_y();
        offloadparallel<2>(h, w, functor) {
            index<2> currentPoint = index<2>::GetCurrentPoint();
            functor(currentPoint);
        }
    #else // standard C++ compilers
        int range_x = range.get_x(), range_y = range.get_y();
        for (int y = 0; y < range_y; y++)
            for (int x = 0; x < range_x; x++) {
                index<2> currentPoint(x, y);
                functor(currentPoint);
            }
    #endif
    }

OffloadCL – C++ on the GPU

• offloadparallel runs on the GPU as an OpenCL kernel
• Standard C++ in the pre-processor else block runs on the host
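To try the host fallback path without the Offload headers, the pattern can be reproduced with small stand-in types. The sketch below is an assumption-laden illustration: `Index2`, `Grid2`, `parallel_for_each_host` and `scale_field` are hypothetical names that only mimic the shape of the slide's `index<2>`, `grid<2>` and `parallel_for_each`, and the field is assumed to hold exactly width x height elements in row-major order.

```cpp
#include <vector>

// Hypothetical stand-ins for the Offload types, just enough for the host path.
struct Index2 {
    int x, y;
    int get_x() const { return x; }
    int get_y() const { return y; }
};

struct Grid2 {
    int width, height;
    int get_x() const { return width; }
    int get_y() const { return height; }
};

// Host fallback only: visits every point of the grid sequentially.
template <class Functor>
void parallel_for_each_host(Grid2 range, Functor functor) {
    for (int y = 0; y < range.get_y(); y++)
        for (int x = 0; x < range.get_x(); x++)
            functor(Index2{x, y});
}

// Usage: scale a width x height field stored row-major in a flat vector.
std::vector<float> scale_field(const std::vector<float>& in,
                               int width, int height, float factor) {
    std::vector<float> out(in.size());
    parallel_for_each_host(Grid2{width, height}, [&](Index2 pt) {
        const int i = pt.get_y() * width + pt.get_x();
        out[i] = in[i] * factor;
    });
    return out;
}
```

Keeping the functor free of any dependence on visiting order is what would let the same lambda be handed to a device back end instead of the sequential loop.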

Page 20

OffloadCL – Seismic demo

• Adapted from Intel's Threading Building Blocks
• Can switch between OffloadCL, C++ AMP and CPU versions

Page 21

• Codeplay
– Our market
• Overview of OpenCL
• Codeplay + OpenCL
– Our technology
• Future of OpenCL

Page 22

• Native C++ support?
• Task parallelization in OpenCL 2.0?
• Changes in hardware bringing shared memory spaces?
– SoCs

Future of OpenCL

Page 23

• Codeplay
– Our market
• Overview of OpenCL
• Codeplay + OpenCL
– Our technology
• Future of OpenCL

Page 24

Questions?

2nd Floor, 45 York Place, Edinburgh EH1 3HP, United Kingdom

Visit us at www.codeplay.com