Introduction to OpenCL OpenCL Workshop | December 1, 2010 | Brisbane, Australia Tomasz Bednarz, CESRE How to select OpenCL devices, initialise a compute context, allocate device memory, compile and run kernels, output results

Introduction to OpenCL, 2010


DESCRIPTION

Introduction to OpenCL, presentation from OpenCL workshop at OzViz 2010 held in Brisbane, Australia.


Page 1: Introduction to OpenCL, 2010

Introduction to OpenCL

OpenCL Workshop | December 1, 2010 | Brisbane, Australia
Tomasz Bednarz, CESRE

How to select OpenCL devices, initialise a compute context, allocate device memory, compile and run kernels, output results

Page 2: Introduction to OpenCL, 2010

Welcome to Open Computing Language (OpenCL™)

•  N-Body Simulation Demo
•  Khronos Group and OpenCL standard
•  OpenCL Anatomy
    •  Platform Model
    •  Execution Model
    •  Memory Model
•  Short Introduction to OpenCL Programming
    •  OpenCL C language
    •  Supported data types
    •  Synchronisation primitives
•  Additional information and resources

OpenCL is a trademark of Apple, Inc.

CSIRO. Introduction to OpenCL. OpenCL Workshop at the OzViz 2010, Brisbane, December 2010.

Page 3: Introduction to OpenCL, 2010

N-Body Simulation: demo

Page 4: Introduction to OpenCL, 2010

N-Body Simulation

•  Applications
    •  Molecular dynamics
    •  Astronomical and astrophysical simulations
    •  Fluid dynamics simulation
    •  Radiosity (radiometric transfer)
•  N² interactions to compute per time-step
    •  For the brute-force all-pairs approach discussed here
•  Highly parallel
•  High arithmetic intensity

Two of these galaxies attract each other.

Lars Nyland, Mark Harris, Jan Prins “Fast N-Body Simulation with CUDA”. In Hubert Nguyen, editor, GPU Gems 3, chapter 31, pages 677-695, Addison Wesley 2007.

Page 5: Introduction to OpenCL, 2010

N-Body Simulation (http://developer.nvidia.com/gpugems3)

•  N-body simulation models the motion of particles subject to forces arising from the pairwise interactions between all particles in the system
•  Typical example: simulation of stars in a galaxy subject to the gravitational force
•  Given N bodies with an initial position x_i and velocity v_i for 1 ≤ i ≤ N, the force f_ij on body i caused by its gravitational attraction to body j is given by:

$$ \mathbf{f}_{ij} = G \, \frac{m_i m_j}{\lVert \mathbf{r}_{ij} \rVert^{2}} \cdot \frac{\mathbf{r}_{ij}}{\lVert \mathbf{r}_{ij} \rVert}, \qquad \mathbf{r}_{ij} = \mathbf{x}_j - \mathbf{x}_i $$

where m_i and m_j are the masses of bodies i and j, and G is the gravitational constant.

•  The total force on body i is the sum over all other bodies:

$$ \mathbf{F}_i = \sum_{\substack{1 \le j \le N \\ j \ne i}} \mathbf{f}_{ij} = G \, m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \mathbf{r}_{ij}}{\lVert \mathbf{r}_{ij} \rVert^{3}} $$

•  The acceleration is computed as:

$$ \mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} $$

Page 6: Introduction to OpenCL, 2010

N-Body Simulation

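•  As bodies approach each other, the force between them grows without bound, so a softening factor ε² > 0 may be added:

$$ \mathbf{F}_i \approx G \, m_i \sum_{1 \le j \le N} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^{2} + \varepsilon^{2} \right)^{3/2}} $$

•  With ε² > 0 the j = i term is zero (r_ii = 0), so the condition j ≠ i is no longer needed
•  The softening factor limits the magnitude of the force between the bodies, which is desirable for numerical integration of the system state
•  Acceleration:

$$ \mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} \approx G \sum_{1 \le j \le N} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^{2} + \varepsilon^{2} \right)^{3/2}} $$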

Page 7: Introduction to OpenCL, 2010

N-Body Simulation: parallel concept

•  Particles i, j interact with each other
•  OpenCL can be used to compute the acceleration on all bodies in parallel
•  N/p work-groups of p work-items process p bodies at a time
•  Every work-item loads all other body positions from off-chip memory
    •  N² loads … bandwidth bound = poor performance
•  Optimisation (using tiles) to be presented in the afternoon session

[Figure: N × N grid of pairwise interactions. The outer loop runs over particle i and the inner loop over particle j; each cell is a single interaction between i and j.]

Page 8: Introduction to OpenCL, 2010

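N-Body Simulation: body-body force calculation

$$ \mathbf{F}_i \approx G \, m_i \sum_{1 \le j \le N} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^{2} + \varepsilon^{2} \right)^{3/2}} $$

$$ \mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} \approx G \sum_{1 \le j \le N} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^{2} + \varepsilon^{2} \right)^{3/2}} $$

http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclNbody
http://developer.apple.com/library/mac/#samplecode/OpenCL_NBody_Simulation_Example/Introduction/Intro.html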

Page 9: Introduction to OpenCL, 2010

N-Body Simulation: demo

Page 10: Introduction to OpenCL, 2010

The Khronos Group

Page 11: Introduction to OpenCL, 2010

http://www.khronos.org/opencl/

Page 12: Introduction to OpenCL, 2010

http://www.khronos.org/opencl/

Page 13: Introduction to OpenCL, 2010

What is OpenCL?

Heterogeneous Computing

•  CPUs: multiple cores driving performance increases
    •  Multi-processor programming, threading libraries (e.g. OpenMP)
•  GPUs: increasingly general-purpose data-parallel computing
    •  Graphics APIs and shading languages, vendor compute APIs
•  Emerging intersection: heterogeneous computing

OpenCL (Open Computing Language): an open, royalty-free standard for heterogeneous parallel programming at the intersection of GPU and multi-core CPU capabilities.

http://www.khronos.org/opencl/

Page 14: Introduction to OpenCL, 2010

What is OpenCL?

Cross-platform desktop 3D

Embedded 3D

Streaming Media and Image Processing

Surface and synch abstraction

Heterogeneous Parallel Programming

3D for Web

Desktop 3D Ecosystem

Parallel computing and visualisation

Streamlined APIs for mobile and embedded graphics, media and compute acceleration

Roadmap convergence: OpenGL 4.0 and OpenGL ES 2.0 are both streamlined, programmable pipelines. The GL and ES working groups are working on convergence. WebGL is a positive pressure for portable 3D content on all platforms.

Desktop Visual Computing: OpenGL and OpenCL have direct interoperability. OpenCL objects can be created from OpenGL textures, buffer objects and renderbuffers.

Mobile Visual Computing: compute, graphics and AV APIs interoperate through EGL.

Hundreds of man-years invested by industry experts in a coordinated ecosystem.

OpenCL: the center of a visual computing ecosystem with parallel computation, 3D, video, audio, and image processing on desktop, embedded and mobile systems.

Based on http://www.khronos.org/opencl/

Page 15: Introduction to OpenCL, 2010

OpenCL Timeline

•  June 2008: OpenCL working group is proposed by Apple. Draft spec is contributed to Khronos.
•  December 2008: Khronos publicly releases OpenCL 1.0 as a royalty-free specification.
•  May 2009: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations.
•  2nd half 2009: Multiple conformant implementations ship across diverse OSes and platforms.
•  June 2010: OpenCL 1.1 spec is released and first implementations ship.

•  OpenCL 1.0 was released six months after the proposal was created
•  OpenCL shipped first on Apple's Mac OS X Snow Leopard
•  18-month cadence between OpenCL 1.0 and OpenCL 1.1
    •  Backward compatible to protect software investment

Based on http://www.khronos.org/opencl/

Page 16: Introduction to OpenCL, 2010

OCL Quick Reference Cards http://www.khronos.org/files/opencl-quick-reference-card.pdf

Page 17: Introduction to OpenCL, 2010

Design goals of OpenCL

•  Enable all compute resources in the system
    •  CPUs, GPUs, and other processors enabled as peers
    •  Data- and task-parallel compute model
•  Efficient parallel programming model
    •  ANSI C99-based kernel language
•  Low-level abstraction
    •  Abstracts the specifics of the underlying hardware
    •  High-performance, but device independent
•  Define precision requirements for all floating-point computations
    •  Consistent results on all platforms and devices
•  Interoperability with graphics APIs
    •  Dedicated support for OpenGL, OpenGL ES and DirectX
•  Drive future hardware requirements
    •  Applicable to both consumer and HPC applications

Page 18: Introduction to OpenCL, 2010

OpenCL Platform Model

Page 19: Introduction to OpenCL, 2010

It’s a heterogeneous world

•  Platform model encapsulates compute resources
•  A modern platform includes:
    •  One or more CPUs
    •  One or more GPUs
    •  Optional accelerators (e.g. DSPs)
    •  Other?

Using OpenCL, programmers write a single portable program that uses ALL resources in the heterogeneous platform.

Based on http://www.khronos.org/opencl/

Page 20: Introduction to OpenCL, 2010

OpenCL Platform Model

[Figure: a Host connected to several Compute Devices; each Compute Device contains Compute Units, and each Compute Unit contains Processing Elements.]

•  One Host connected to one or more Compute Devices
    •  A Compute Device can be a CPU, GPU or other processor
•  Each Compute Device is composed of one or more Compute Units
    •  A Compute Unit may be a core, multi-processor, etc.
•  Each Compute Unit is further divided into one or more Processing Elements
    •  Processing Elements execute code as SIMD or SPMD

Page 21: Introduction to OpenCL, 2010

Anatomy of OpenCL Application

[Figure: an OpenCL application spans the Host and its Compute Devices, each made of Compute Units.]

•  Host code: written in C/C++, executes on the host
•  Device code: written in OpenCL C, executes on the device
•  Host code sends commands to the devices:
    •  To transfer data between host memory and device memories
    •  To execute device code

Page 22: Introduction to OpenCL, 2010

Anatomy of OpenCL Application

•  Serial code executes in a Host (CPU) thread
•  Parallel code executes in many Device (GPU) threads across multiple processing elements

[Figure: an OpenCL application alternates between serial code on the host (CPU) and parallel code on the device (GPU).]

Page 23: Introduction to OpenCL, 2010

OpenCL Execution Model

Page 24: Introduction to OpenCL, 2010

OpenCL Execution Model

•  An OpenCL application runs on a Host, which submits work to the Compute Devices
    •  Work-item: the basic unit of work on an OpenCL device
    •  Kernel: the code for a work-item, essentially a C function
    •  Program: a collection of kernels and other functions (analogous to a dynamic library), managed by the host
    •  Context: the environment within which work-items execute; it includes devices, their memories and command queues (all resources for computation)
    •  Command queue: a queue used by the Host application to submit work (kernel execution instances) to a Device
        •  Work is queued in-order; each queue targets one device
        •  Work can be executed in-order or out-of-order
        •  Events are used for synchronisation

[Figure: a Context containing GPU and CPU devices, their memory, and the GPU and CPU command queues that carry commands.]

Page 25: Introduction to OpenCL, 2010

OpenCL Execution Model

•  Portable execution model that allows a kernel to execute at each point in a problem domain (an N-dimensional computational domain) → decomposition of a task into work-items

Traditional loop as a function in C:

    void addVector(const float *A,
                   const float *B,
                   float *C,
                   int N)
    {
        int index;

        /* one loop iteration per element */
        for (index = 0; index < N; index++)
            C[index] = A[index] + B[index];
    }

OpenCL C kernel:

    __kernel void addVector(__global const float *A,
                            __global const float *B,
                            __global float *C,
                            int N)
    {
        /* one work-item per element; the loop index becomes the global ID */
        int index = get_global_id(0);

        if (index < N)
            C[index] = A[index] + B[index];
    }

•  Work-item: the basic unit of work on an OpenCL device
•  Kernel: the code for a work-item, essentially a C function
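To make the loop-to-kernel decomposition concrete, the sketch below emulates a 1D kernel launch in plain C: the kernel body is written as a function of one work-item index, and a serial loop stands in for the device. The names `add_vector_item` and `launch_1d` are hypothetical; a real device would run the work-items in parallel and in no fixed order.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the addVector kernel for a single work-item. `gid` stands in for
 * the value get_global_id(0) would return on a device. */
void add_vector_item(size_t gid, const float *A, const float *B,
                     float *C, size_t n)
{
    if (gid < n)                 /* guard: the global size may exceed n */
        C[gid] = A[gid] + B[gid];
}

/* Serial stand-in for a 1D kernel launch: one call per work-item. */
void launch_1d(size_t global_size, const float *A, const float *B,
               float *C, size_t n)
{
    assert(global_size >= n);    /* every element needs a work-item */
    for (size_t gid = 0; gid < global_size; ++gid)
        add_vector_item(gid, A, B, C, n);
}
```

The `gid < n` guard matters when the global size is rounded up to a multiple of the work-group size: the extra work-items then do nothing instead of writing out of bounds.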

Page 26: Introduction to OpenCL, 2010

Kernel Execution on Platform Model

[Figure: a work-item maps to a compute element, a work-group to a compute unit, and a kernel execution instance to a compute device.]

•  Each work-item is executed by a compute element (processing element)
•  Each work-group is executed on a compute unit
•  Several concurrent work-groups can reside on one compute unit, depending on the work-group's memory requirements and the compute unit's memory resources
•  Each kernel is executed on a compute device

Page 27: Introduction to OpenCL, 2010

Benefits of Work-Groups

•  Automatic scalability across devices with different numbers of compute units
    •  Work-groups can execute in any order, concurrently or sequentially
•  Efficient cooperation between work-items of the same work-group
    •  Fast shared memory and synchronisation
•  Independence between work-groups gives scalability:
    •  A kernel scales across any number of compute units

[Figure: a kernel launch of eight work-groups (0 to 7). On a device with 2 compute units, unit 0 runs work-groups 0, 2, 4, 6 and unit 1 runs work-groups 1, 3, 5, 7. On a device with 4 compute units, unit 0 runs 0 and 4, unit 1 runs 1 and 5, unit 2 runs 2 and 6, and unit 3 runs 3 and 7.]
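The placement drawn on this slide (eight work-groups spread over two or four compute units) can be reproduced with a simple round-robin rule. The sketch below is purely illustrative: the function names are hypothetical, and a real OpenCL implementation is free to schedule work-groups in any order on any unit.

```c
#include <assert.h>
#include <stddef.h>

/* Round-robin placement as drawn on the slide: work-group g goes to
 * compute unit g % num_units. Real schedulers may choose differently. */
size_t unit_for_group(size_t group_id, size_t num_units)
{
    assert(num_units > 0);
    return group_id % num_units;
}

/* Number of work-groups a given unit receives under that placement. */
size_t groups_on_unit(size_t unit, size_t num_groups, size_t num_units)
{
    assert(unit < num_units);
    size_t count = 0;
    for (size_t g = 0; g < num_groups; ++g)
        if (unit_for_group(g, num_units) == unit)
            ++count;
    return count;
}
```

Because work-groups are independent, the same eight groups fit either device: each unit simply receives fewer groups as the number of units grows.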

Page 28: Introduction to OpenCL, 2010

Work-group synchronisation

•  Always define the best N-dimensional index space (NDRange) for your algorithm (currently 1D, 2D and 3D index spaces are supported)
    •  Kernels are executed across a global domain of work-items
    •  Work-items are single points of execution and are grouped into local work-groups
•  Global dimensions: 1024 × 1024 (whole problem space)
•  Local dimensions: 32 × 32 (work-group)
•  Synchronisation between work-items is possible only within work-groups: barriers and memory fences
•  Work-items cannot synchronise across work-groups

[Figure: a 1024 × 1024 global domain divided into work-groups.]

Page 29: Introduction to OpenCL, 2010

Work-items and work-groups

•  A kernel is a function executed at each point of a problem domain (for each work-item)
•  Number of work-items = 4096 (16 work-groups of 256 work-items each):

[Figure: a 1D NDRange on the device, split into work-groups 0 … 15, each holding work-items 0 … 255. Labelled values: get_global_size(0) = 4096, get_local_size(0) = 256, get_num_groups(0) = 16, get_group_id(0) = 2, get_local_id(0) = 255, get_global_id(0) = 1792.]

    __kernel void addVector(__global const float *A,
                            __global const float *B,
                            __global float *C,
                            int N)
    {
        int index = get_global_id(0);

        if (index < N)
            C[index] = A[index] + B[index];
    }
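The work-item functions on this slide are tied together by simple arithmetic. The plain C sketch below shows that relationship for one dimension, assuming no global work offset; the helper names `num_groups` and `global_id` are hypothetical stand-ins for what get_num_groups(0) and get_global_id(0) report.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups when the global size is a multiple of the local size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Global ID of a work-item from its group ID and local ID
 * (assuming a zero global offset). */
size_t global_id(size_t group_id, size_t local_id, size_t local_size)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With 4096 work-items in groups of 256, `num_groups(4096, 256)` gives the 16 groups from the slide, and the global ID 1792 corresponds to local ID 0 in group 7.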

Page 30: Introduction to OpenCL, 2010

Work-items and work-groups in 2D

•  Number of work-items to execute: 128 × 128 = 16384 (a kernel is executed at each point of a problem domain)

[Figure: a 2D NDRange on the device. Work-groups are indexed by (get_group_id(0), get_group_id(1)) from 0,0 to 7,7; within a work-group, work-items are indexed by (get_local_id(0), get_local_id(1)) from 0,0 to 15,15. The work-group extent is get_local_size(0) × get_local_size(1), the NDRange extent is get_global_size(0) × get_global_size(1), and each work-item has a global index (get_global_id(0), get_global_id(1)).]
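For the 2D case, the same index arithmetic applies per dimension. The sketch below assumes a 128 × 128 NDRange split into 8 × 8 work-groups of 16 × 16 work-items and no global offset; the `id2` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t x, y; } id2;   /* hypothetical 2D index pair */

/* Per-dimension global ID: group * local_size + local (zero global offset). */
id2 global_id_2d(id2 group, id2 local, id2 local_size)
{
    assert(local.x < local_size.x && local.y < local_size.y);
    id2 g = { group.x * local_size.x + local.x,
              group.y * local_size.y + local.y };
    return g;
}

/* Row-major offset of a 2D global ID into a flat buffer of the given width. */
size_t flat_index(id2 gid, size_t width)
{
    assert(gid.x < width);
    return gid.y * width + gid.x;
}
```

A kernel working on a 2D image stored in a 1D buffer would typically combine get_global_id(0) and get_global_id(1) in the same way as `flat_index`.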

Page 31: Introduction to OpenCL, 2010

OpenCL Memory Model

Page 32: Introduction to OpenCL, 2010

OpenCL Memory Model

•  Address spaces
    •  Private: read/write access for a work-item only
    •  Local: read/write access for the entire work-group
    •  Global/Constant: visible to all work-groups
    •  Host: accessible by the CPU
•  Synchronisation
    •  All synchronisation of memory accesses must be done explicitly
•  Memory management is explicit
    •  You must move data from host → global → local … and back

[Figure: a Compute Device holds Compute Units 1 … N. In each Compute Unit, work-items 1 … J run on processing elements (PE), each with its own Private Memory, and share one Local Memory. All Compute Units share Global/Constant Memory. The Host has separate Host Memory.]
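The explicit host → global → local → global → host movement can be mimicked in plain C with ordinary arrays and `memcpy`. This is only an analogy under stated assumptions: `global_mem` and `local_mem` are ordinary host arrays standing in for device memory, `TILE` plays the role of the work-group size, and `scale_on_device` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 4   /* stands in for the work-group size */

/* Multiplies n floats by `factor`, staging the data the way an OpenCL
 * program would: host -> global -> local -> global -> host. */
void scale_on_device(const float *host_in, float *host_out,
                     float *global_mem, size_t n, float factor)
{
    assert(n % TILE == 0);
    memcpy(global_mem, host_in, n * sizeof *host_in);        /* host -> global */

    for (size_t g = 0; g < n / TILE; ++g) {                   /* one pass per work-group */
        float local_mem[TILE];                                /* per-group scratch */
        memcpy(local_mem, global_mem + g * TILE, sizeof local_mem);  /* global -> local */
        for (size_t l = 0; l < TILE; ++l)                     /* one step per work-item */
            local_mem[l] *= factor;
        memcpy(global_mem + g * TILE, local_mem, sizeof local_mem);  /* local -> global */
    }

    memcpy(host_out, global_mem, n * sizeof *host_out);       /* global -> host */
}
```

In real OpenCL code the first and last copies correspond to buffer write and read commands issued by the host, while the staging through local memory happens inside the kernel.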

Page 33: Introduction to OpenCL, 2010

OpenCL Programming

•  How to define the platform
•  How to execute code on the platform
•  How to move data around in memory
•  How to write (and build) programs

Page 34: Introduction to OpenCL, 2010

Host application

Page 35: Introduction to OpenCL, 2010

OpenCL Language and API Highlights

•  Platform Layer API (called from host)
    •  Abstraction layer for diverse computational resources
    •  Query, select and initialise compute devices
    •  Create compute contexts and work-queues
•  Runtime API (called from host)
    •  Launch compute kernels
    •  Set kernel execution configuration
    •  Manage scheduling, compute, and memory resources
•  OpenCL language
    •  For writing C-based compute kernels that execute on a compute device
    •  Includes a rich set of built-in functions
    •  Can be compiled JIT/online or offline

Page 36: Introduction to OpenCL, 2010

OpenCL Language Highlights

•  Function qualifiers
    •  The __kernel qualifier declares a function as a kernel
•  Address space qualifiers
    •  __global, __local, __constant, __private
•  Work-item functions
    •  get_work_dim(), get_global_id(), get_local_id(), get_group_id(), get_local_size()
•  Image functions
    •  Images must be accessed through built-in functions
    •  Reads/writes are performed through sampler objects, passed from the host or defined in source
•  Synchronisation functions
    •  Barriers: all work-items within a work-group must execute the barrier function before any work-item in the work-group can continue

    __kernel void addVector(__global const float *A,
                            __global const float *B,
                            __global float *C,
                            int N)
    {
        int index = get_global_id(0);

        if (index < N)
            C[index] = A[index] + B[index];
    }

Page 37: Introduction to OpenCL, 2010

OpenCL Framework: Overview

•  Platform layer: platform query and context creation
•  Compiler for OpenCL C
•  Runtime: memory management and command execution within a context

[Figure: a Context spanning GPU and CPU compute devices and holding Programs, Kernels, Memory Objects and Command Queues. Programs: kernel source compiled to a CPU binary and a GPU binary (compile code). Kernels: addVector with arg[0], arg[1] and arg[2] values (create args and data). Memory objects: buffers and images. Command queues: in-order queue and out-of-order queue (send to execution).]

Kernel source shown in the figure:

    __kernel void addVector(__global float *A,
                            __global const float *B,
                            __global float *C)
    {
        int i = get_global_id(0);
        C[i] = A[i] + B[i];
    }

Page 38: Introduction to OpenCL, 2010

OpenCL Framework: Object Types

•  cl_platform_id: identifier for a specific platform
•  cl_device_id: identifier for a specific compute device
•  cl_context: handle for a compute context
•  cl_command_queue: handle for a command queue (for a compute device)
•  cl_mem: handle for a memory resource (managed by the context)
•  cl_program: handle for a program resource (library of kernels)
•  cl_kernel: handle for a compute kernel

•  All object types are opaque handles
    •  Enables cross-platform compatibility for complex data types
•  All objects are reference counted and garbage collected
    •  When the reference count reaches zero, the object is deallocated
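The retain/release life cycle behind these handles can be sketched in plain C. The `mem_object` type and the `mem_create`, `mem_retain` and `mem_release` functions below are hypothetical stand-ins, not OpenCL calls; they only mirror the convention that creation yields a reference count of one and that the object is freed when the count reaches zero.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned ref_count;   /* number of outstanding references */
    float   *data;        /* payload owned by the object */
} mem_object;

/* Creation hands back an object with a reference count of 1. */
mem_object *mem_create(size_t n)
{
    mem_object *m = malloc(sizeof *m);
    if (!m) return NULL;
    m->data = calloc(n, sizeof *m->data);
    if (!m->data) { free(m); return NULL; }
    m->ref_count = 1;
    return m;
}

/* Each additional owner takes a reference. */
void mem_retain(mem_object *m)
{
    assert(m && m->ref_count > 0);
    ++m->ref_count;
}

/* Drops one reference; returns 1 if this call freed the object. */
int mem_release(mem_object *m)
{
    assert(m && m->ref_count > 0);
    if (--m->ref_count > 0)
        return 0;
    free(m->data);
    free(m);
    return 1;
}
```

The practical rule is the same as in OpenCL host code: every create or retain must be matched by exactly one release.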

Page 39: Introduction to OpenCL, 2010

OpenCL Framework: Platform Layer

•  To query platform information:
    •  clGetPlatformIDs() → obtain the list of platforms available
    •  clGetPlatformInfo() → platform profile, version, name, vendor, extensions
•  To query devices:
    •  clGetDeviceIDs() → obtain the list of devices available on a platform
    •  clGetDeviceInfo() → type, capabilities, vendor, name, etc.
•  Create an OpenCL context for one or more devices

Context (cl_context):
•  One or more devices (cl_device_id)
•  Memory and device code shared by these devices (cl_mem, cl_program)
•  Command queues to send commands to these devices (cl_command_queue)

Page 40: Introduction to OpenCL, 2010

Context creation: platform IDs

•  Get all platform IDs:

    // get number of OpenCL platforms available
    cl_int err;
    cl_uint num_platforms;
    std::vector<cl_platform_id> platformIDs;

    // num_entries is a cl_uint, so pass 0 (not NULL) when only counting
    err = clGetPlatformIDs(0, NULL, &num_platforms);
    if (err != CL_SUCCESS) { … }

    platformIDs.resize(num_platforms);

    // get all OpenCL platform IDs
    err = clGetPlatformIDs(num_platforms, &platformIDs[0], NULL);

•  Function signature (pointer arguments passed as NULL are ignored):

    cl_int clGetPlatformIDs(cl_uint num_entries,
                            cl_platform_id *platforms,
                            cl_uint *num_platforms)

•  SIMPLE EXAMPLE: get the platform ID:

    // get first OpenCL platform ID available
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
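The count-then-fetch idiom used for platform IDs recurs throughout the OpenCL host API. The sketch below shows it against a mock: `mock_get_ids` is a hypothetical stand-in that only imitates the calling convention (NULL output pointers are ignored) and is not an OpenCL function, and the ID values are made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in mimicking the clGetPlatformIDs calling convention:
 * output pointers that are NULL are ignored. Returns 0 on success. */
int mock_get_ids(unsigned num_entries, int *ids, unsigned *num_ids)
{
    static const int available[] = {101, 102, 103};   /* made-up IDs */
    const unsigned total = 3;
    if (num_ids)
        *num_ids = total;
    if (ids)
        for (unsigned k = 0; k < num_entries && k < total; ++k)
            ids[k] = available[k];
    return 0;
}

/* Two-call pattern: ask for the count, allocate, then fetch the entries.
 * The caller owns the returned array and must free() it. */
int *query_all_ids(unsigned *count)
{
    assert(count != NULL);
    if (mock_get_ids(0, NULL, count) != 0 || *count == 0)
        return NULL;
    int *ids = malloc(*count * sizeof *ids);
    if (ids && mock_get_ids(*count, ids, NULL) != 0) {
        free(ids);
        ids = NULL;
    }
    return ids;
}
```

The same shape applies to device IDs and to the various info queries: the first call sizes the buffer, the second fills it.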

Page 41: Introduction to OpenCL, 2010

Context creation: device IDs

•  Get all device IDs:

    cl_uint nDevices;
    cl_device_type deviceType = CL_DEVICE_TYPE_ALL;  // or any device type listed below
    std::vector<cl_device_id> deviceIDs;

    // selectedPlatform: index into platformIDs chosen by the application
    if (platformIDs.size() == 0) {
        // get number of device IDs for default platform
        err = clGetDeviceIDs(NULL, deviceType, 0, NULL, &nDevices);
    } else {
        // get number of device IDs for selected platform
        err = clGetDeviceIDs(platformIDs[selectedPlatform], deviceType, 0, NULL, &nDevices);
    }
    deviceIDs.resize(nDevices);
    if (platformIDs.size() == 0) {
        // get default device IDs of default platform
        err = clGetDeviceIDs(NULL, deviceType, nDevices, &deviceIDs[0], NULL);
    } else {
        // get device IDs of selected platform
        err = clGetDeviceIDs(platformIDs[selectedPlatform], deviceType, nDevices, &deviceIDs[0], NULL);
    }

•  SIMPLE: get the first GPU associated with the platform:

    cl_device_id device;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

•  Function signature:

    cl_int clGetDeviceIDs(cl_platform_id platform,
                          cl_device_type device_type,
                          cl_uint num_entries,
                          cl_device_id *devices,
                          cl_uint *num_devices)

•  Device types:
    •  CL_DEVICE_TYPE_CPU
    •  CL_DEVICE_TYPE_GPU
    •  CL_DEVICE_TYPE_ACCELERATOR
    •  CL_DEVICE_TYPE_DEFAULT
    •  CL_DEVICE_TYPE_ALL

Page 42: Introduction to OpenCL, 2010

Context creation

•  Create an OpenCL context for several devices:

    cl_int err;
    cl_context context;

    context = clCreateContext(NULL, deviceIDs.size(), &deviceIDs[0], NULL, NULL, &err);
    if (err != CL_SUCCESS) { … }

•  SIMPLE EXAMPLE: create a context object:

    cl_context context;
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

•  Function signature:

    cl_context clCreateContext(const cl_context_properties *properties,
                               cl_uint num_devices,
                               const cl_device_id *devices,
                               void (CL_CALLBACK *pfn_notify)(const char *errinfo,
                                                              const void *private_info,
                                                              size_t cb,
                                                              void *user_data),
                               void *user_data,
                               cl_int *errcode_ret)

•  cl_context_properties names:
    •  CL_CONTEXT_PLATFORM
    •  CL_CONTEXT_D3D10_DEVICE_KHR
    •  CL_GL_CONTEXT_KHR
    •  CL_EGL_DISPLAY_KHR
    •  …

Page 43: Introduction to OpenCL, 2010

Error Handling and Resource Deallocation

•  Error handling:
    •  All host functions report an error code (as the return value or through errcode_ret)
    •  Context error callback
        •  The callback function may be called asynchronously by OpenCL, and it is the application's responsibility to ensure that the callback function is thread-safe
•  Resource deallocation
    •  Reference counting API: clRetain*(), clRelease*()
        •  clRetainContext()
        •  clReleaseContext()
        •  clRetainMemObject()
        •  clReleaseMemObject()
        •  clRetainKernel()
        •  clReleaseKernel()

Page 44: Introduction to OpenCL, 2010

OpenCL C

•  Derived from ISO C99
•  Features added to the language:
    •  Work-items and work-groups
    •  Vector types
    •  Synchronisation
    •  Address space qualifiers
•  Also includes a large set of built-in functions:
    •  Image manipulation
    •  Work-item manipulation
    •  Math functions

Page 45: Introduction to OpenCL, 2010

OpenCL C

Language restrictions:
•  No functions defined in C99 standard headers
•  No recursion
•  Pointers to functions are not permitted
•  Pointers to pointers are allowed within a kernel, but not as a kernel argument
•  No variable-length arrays or structures
•  Bit fields are not supported
•  Writes to a pointer to a type smaller than 32 bits are not supported*
•  Double types are not supported, but reserved
•  3D image writes are not supported

* Some restrictions are addressed through extensions

Page 46: Introduction to OpenCL, 2010

OpenCL C Optional Extensions

•  Extensions are optional features exposed through OpenCL
•  The OpenCL working group has already approved many extensions to the OpenCL specification:
    •  Double-precision floating-point types
    •  Built-in functions to support doubles
    •  Atomic functions*
    •  Byte-addressable stores (writes to pointers to types < 32 bits)*
    •  3D image writes
    •  Built-in functions to support half types

* New core features in OpenCL 1.1

Page 47: Introduction to OpenCL, 2010

OpenCL C: Data Types

•  Scalar data types
    •  char, uchar, short, ushort, int, uint, long, ulong, float
    •  bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
•  Image types
    •  image2d_t, image3d_t, sampler_t, event_t
•  Vector data types
    •  Vector lengths 2, 3*, 4, 8, 16 (char2, ushort4, int8, float16, double2^, …)
    •  Endian safe
    •  Aligned at vector length
    •  Vector operations
    •  Built-in functions

* New core feature in OpenCL 1.1
^ double is an optional type in OpenCL

Page 48: Introduction to OpenCL, 2010

OpenCL C: Synchronisation Primitives

•  Built-in functions to order memory operations and synchronise execution:
    •  mem_fence(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE)
        •  Waits until all reads/writes to local and/or global memory made by the calling work-item prior to mem_fence() are visible to all work-items in the work-group
    •  barrier(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE)
        •  Waits until all work-items in the work-group have reached this point, and calls mem_fence(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE)
•  Used to coordinate accesses to local or global memory shared among work-items
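Barrier semantics can be illustrated without a device by running a work-group serially in phases. In the plain C sketch below each loop over `lid` is one phase that every work-item finishes before the next phase starts, which is the guarantee barrier(CLK_LOCAL_MEM_FENCE) gives inside a kernel. `LOCAL_SIZE`, `scratch` and `group_sum` are hypothetical names, and the serial loops are only an emulation.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE 4   /* stands in for get_local_size(0) */

/* Serial emulation of one work-group summing LOCAL_SIZE input values. */
float group_sum(const float *in)
{
    assert(in != NULL);
    float scratch[LOCAL_SIZE];              /* stands in for __local memory */

    /* Phase 1: every work-item stores one value into local memory. */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        scratch[lid] = in[lid];

    /* barrier: all phase-1 writes are complete and visible to the group */

    /* Phase 2: work-item 0 reads the values written by the other items. */
    float total = 0.0f;
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        if (lid == 0)
            for (size_t k = 0; k < LOCAL_SIZE; ++k)
                total += scratch[k];
    return total;
}
```

Without the barrier between the two phases, work-item 0 on a real device could read slots that other work-items have not written yet.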

Page 49: Introduction to OpenCL, 2010

OpenCL Runtime

•  Command queue creation and management
•  Device memory allocation and management
•  Device code compilation and execution
•  Event creation and management (synchronisation and profiling)

Page 50: Introduction to OpenCL, 2010

Kernel Compilation

•  We use a cl_program object, which encapsulates some source code and its last successful build (it may contain several kernel functions):
    •  clCreateProgramWithSource() → creates a program object for a context, and loads the source code specified by the strings array into the program object
    •  clCreateProgramWithBinary() → creates a program object and loads the binary into it
    •  clBuildProgram() → compiles and links a program executable from the program source or binary
•  We'll also use a cl_kernel object, which encapsulates the values of the kernel's arguments used when the kernel is executed:
    •  clCreateKernel() → creates a kernel object from a successfully built program
    •  clSetKernelArg() → sets the value of a specific argument of a kernel

Page 51: Introduction to OpenCL, 2010

Kernel Compilation

•  Write a kernel:

    const char* src = "__kernel void vectorMul(__global const float *a,\n"
                      "                        __global const float *b,\n"
                      "                        __global float *c,\n"
                      "                        int numElements)\n"
                      "{\n"
                      "    int i = get_global_id(0);\n"
                      "    if (i < numElements)\n"
                      "        c[i] = a[i]*b[i];\n"
                      "}\n";

•  Create program:

    cl_program program =
        clCreateProgramWithSource(context, 1, &src, NULL, NULL);

•  Build program and create kernel:

    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "vectorMul", NULL);

•  Set kernel arguments:

    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&devSrcA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&devSrcB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&devDst);
    clSetKernelArg(kernel, 3, sizeof(cl_int), (void*)&numElements);

•  Function signatures:

    cl_program clCreateProgramWithSource(cl_context context,
                                         cl_uint count,
                                         const char **strings,
                                         const size_t *lengths,
                                         cl_int *errcode_ret)

    cl_kernel clCreateKernel(cl_program program,
                             const char *kernel_name,
                             cl_int *errcode_ret)

    cl_int clBuildProgram(cl_program program,
                          cl_uint num_devices,
                          const cl_device_id *device_list,
                          const char *options,
                          void (CL_CALLBACK *pfn_notify)(cl_program program,
                                                         void *user_data),
                          void *user_data)

•  Example build options: -cl-opt-disable, -cl-mad-enable, …

Page 52: Introduction to OpenCL, 2010

Memory Objects

•  Memory objects (cl_mem) are categorized into two types:
    •  Buffer objects
    •  Image objects
•  Memory objects can be copied to host memory, from host memory, or to other memory objects
•  Kernels take memory objects as input, and output to one or more memory objects
•  Regions of a memory object can be accessed by the host by mapping them into the host address space

Page 53: Introduction to OpenCL, 2010

Memory Objects: Buffer Object

•  A buffer object stores a one-dimensional collection of elements (1D array)
•  Elements of a buffer object can be:
    •  A scalar data type (such as int or float)
    •  A vector data type
    •  A user-defined structure
•  Elements in a buffer are stored sequentially and can be accessed through a pointer by a kernel executing on a device
•  Data is stored in the same format as it is accessed by the kernel

Page 54: Introduction to OpenCL, 2010

Memory Objects: Image Object

•  An image object stores a two- or three-dimensional texture, frame-buffer or image
    •  Can be created from an existing OpenGL texture or render-buffer
    •  The elements of an image object are selected from a list of predefined image formats
    •  Image elements are always a 4-component vector (each component can be a float or a signed/unsigned integer) in a kernel
    •  Accessed within the device via built-in functions (storage format not exposed to the application)
    •  Sampler objects are used to configure how built-in functions sample images (addressing modes, filtering modes)

Page 55: Introduction to OpenCL, 2010

Command Queue

•  Memory, program and kernel objects are created using a context
•  Operations on objects are performed using a command-queue
•  The command-queue is used to schedule commands for execution on a device
    •  Enqueuing functions: clEnqueue*()
    •  Multiple queues can execute on the same device
•  Modes of execution:
    •  In-order: each command in the queue executes only when the preceding command has completed (including memory writes)
    •  Out-of-order: no guaranteed order of completion for commands
    •  CL_QUEUE_PROFILING_ENABLE: enables or disables profiling of commands in the command-queue


Page 56: Introduction to OpenCL, 2010

Command Queue

•  Create command queue for a specific device

cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_command_queue clCreateCommandQueue(
    cl_context context,
    cl_device_id device,
    cl_command_queue_properties properties,
    cl_int *errcode_ret)

•  Properties
    •  CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE determines whether commands in the command-queue are executed in-order or out-of-order. If set, the commands are executed out-of-order.
    •  CL_QUEUE_PROFILING_ENABLE enables or disables profiling of commands in the command-queue. If set, profiling of commands is enabled.


Page 57: Introduction to OpenCL, 2010

Data Transfer between Host and Device

•  Create buffers on host and device

size_t size = 100000*sizeof(int);
int *host_buffer = (int*)malloc(size);
cl_mem devSrcA = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, NULL);
cl_mem devSrcB = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, NULL);
…

•  Write to buffer objects from host memory

clEnqueueWriteBuffer(queue, devSrcA,
    CL_FALSE, 0, size, host_buffer, 0, NULL, NULL);
…

•  Read from buffer object to host memory

clEnqueueReadBuffer(queue, devDst,
    CL_TRUE, 0, size, host_buffer, 0, NULL, NULL);
…

cl_mem clCreateBuffer(
    cl_context context,
    cl_mem_flags flags,
    size_t size,
    void *host_ptr,
    cl_int *errcode_ret)

cl_mem_flags: CL_MEM_READ_WRITE, CL_MEM_WRITE_ONLY, CL_MEM_READ_ONLY, …

cl_int clEnqueueWriteBuffer(
    cl_command_queue queue,
    cl_mem buffer,
    cl_bool blocking_write,
    size_t offset,
    size_t size,
    const void *ptr,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)


Page 58: Introduction to OpenCL, 2010

Kernel Invocation over NDRange

•  Host code invokes a kernel over an index space NDRange (1D, 2D or 3D)
•  Work-group dimensionality matches work-item dimensionality
•  Set number of work-items in a work-group

size_t localWorkSize = 256;
size_t numWorkGroups = (N + localWorkSize - 1) / localWorkSize; // round up
size_t globalWorkSize = numWorkGroups * localWorkSize; // must be divisible by localWorkSize

•  Enqueue kernel

clEnqueueNDRangeKernel(
    queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);

cl_int clEnqueueNDRangeKernel(
    cl_command_queue queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)
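The round-up step on this slide can be checked in isolation. Below is a small sketch in plain C of the same arithmetic; the function names `num_work_groups` and `round_up_global_size` are made up for this illustration and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups needed to cover n work-items
   when each group holds local_size work-items (rounds up). */
size_t num_work_groups(size_t n, size_t local_size)
{
    assert(local_size > 0); /* a zero-sized work-group is meaningless */
    return (n + local_size - 1) / local_size;
}

/* Smallest multiple of local_size that is >= n; this is the value
   the slide calls globalWorkSize. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    return num_work_groups(n, local_size) * local_size;
}
```

For example, with N = 1000 and a local size of 256 this gives 4 work-groups and a global size of 1024. Because the global size can exceed N, a kernel launched this way typically compares its global ID against N and returns early for the surplus work-items.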


Page 59: Introduction to OpenCL, 2010

Command Synchronisation

•  Queue barrier command: clEnqueueBarrier()
    •  Commands after the barrier start executing only after all commands before the barrier have completed
•  Events: a cl_event object can be associated with each command
    •  Commands return events and obey event wait lists
    •  clEnqueue*(…, num_events_in_wait_list, *event_wait_list, *event);
•  Any command (or clWaitForEvents()) can wait on events before executing
•  An event object can be queried to track the execution status of its associated command and to get profiling information
•  Some clEnqueue*() calls can be optionally blocking
    •  clEnqueueReadBuffer(…, CL_TRUE, …);


Page 60: Introduction to OpenCL, 2010

Synchronisation: Queues & Events

•  You must explicitly synchronise between queues
•  Multiple devices each have their own queue (possibly multiple queues per device)
•  Use events to synchronise kernel executions between queues


Page 61: Introduction to OpenCL, 2010

OpenCL Resources

•  OpenCL at Khronos
    •  http://www.khronos.org/opencl (spec, registry, man, forums, reference card)
•  NVIDIA OpenCL website, forum
    •  http://www.nvidia.com/object/cuda_opencl_new.html
    •  http://developer.nvidia.com/object/opencl.html (drivers, profiler, code samples)
•  AMD Developer Central
    •  http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx
•  Intel OpenCL SDK
    •  http://software.intel.com/en-us/articles/intel-opencl-sdk/
•  IBM OpenCL Development Kit for Linux on Power
    •  http://www.alphaworks.ibm.com/tech/opencl
•  OpenCL Studio
    •  http://www.opencldev.com (develop, visualize, prototype UIs)


Page 62: Introduction to OpenCL, 2010

Contact us
Phone: 1300 363 400 or +61 3 9545 2176
Email: [email protected]
Web: www.csiro.au

Thank you …

Earth Science and Resource Engineering
Tomasz P Bednarz
3D Visualisation Engineer
Mining Technology Team
Mobile: +61 429 153 274
Email: tomasz.bednarz(_at_)csiro.au
Web: www.tomaszbednarz.com

Acknowledgments
Mark Harris, Derek Gerstmann, Mike Houston, Justin Hensley, Jason Young, Dominik Behr, Con Caris, John Taylor, Khronos Group, AMD, NVIDIA and all others for publicly sharing the GPGPU knowledge on which this presentation is based.