
OpenCL - The Open Standard for Heterogeneous Parallel Programming


DLR TechTalk by Dr. Achim Basermann (14 July 2009, Köln-Porz)


Page 1: OpenCL - The Open Standard for Heterogeneous Parallel Programming

Folie 1 20090714-1 TechTalk OpenCL SC-VK Basermann

OpenCL – The Open Standard for Heterogeneous Parallel Programming

Dr.-Ing. Achim BasermannDeutsches Zentrum für Luft- und Raumfahrt e.V.

Abteilung Verteilte Systeme und Komponentensoftware, Köln-Porz

Page 2: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Survey

Motivation
- Trends in parallel processing
- Role of OpenCL

Khronos' OpenCL
- Platform model
- Execution model
- Memory model
- Programming model

Conclusions

Material used: OpenCL survey and specification from http://www.khronos.org/opencl/

Page 3: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: Trends in Parallel Processing, HW

Trend #1: Multicore processor chips
- Maintain (or even reduce) frequency while replicating cores

Trend #2: Accelerators (e.g. GPGPUs)
- Previously, processors would "catch up" with accelerator functions in the next generation
- Accelerator design expense was not amortized well
- New accelerator designs are more likely to maintain their performance advantage
- And will maintain an enormous power advantage for target workloads

Trend #2b: Heterogeneous multicore in general (e.g. IBM Cell)
- Mixes of powerful cores, smaller cores, and accelerators potentially offer the most efficient nodes
- The challenge is harnessing them efficiently

Page 4: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: Programming Issues

Many cores per node, and accelerators/heterogeneity

Future performance gains will come via massive parallelism (not clock speed)
- An unwelcome situation for HPC apps!

Need new programming models to exploit this parallelism

At the system/cluster level:
- Message passing to connect node-level languages, or
- Global addressing to make communication implicit?

Page 5: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: Classic Programming Models

Page 6: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: OpenCL

New open standard that specifically addresses parallel compute accelerators

Extension to C

Provides data parallel and task parallel models

Facilitates a natural transition from the growing number of CUDA (Compute Unified Device Architecture, NVIDIA) programs

Porting of Cell applications to a standard model

Plays well with MPI

Can interoperate with Fortran and OpenMP

Page 7: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: Roles of OpenCL

Page 8: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: Before OpenCL

Page 9: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Motivation: The promise of OpenCL

Page 10: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL

[Diagram: OpenCL sits at the emerging intersection of several trends]
- CPUs: multiple cores driving performance increases
- GPUs: increasingly general-purpose data-parallel computing, improving numerical precision
- Graphics APIs and shading languages
- Multi-processor programming, e.g. OpenMP

OpenCL – Open Computing Language
Open, royalty-free standard for the portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors

Page 11: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL Working Group

Diverse industry participation
- Processor vendors, system OEMs, middleware vendors, application developers
- Many industry-leading experts involved in OpenCL's design
- A healthy diversity of industry perspectives

Apple initially proposed OpenCL and is very active in the working group
- Serving as specification editor
- Many other companies participate in the OpenCL working group

Page 12: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL Timeline

Six months from proposal to released specification
- Due to a strong initial proposal and a shared commercial incentive to work quickly

Apple's Mac OS X Snow Leopard will include OpenCL
- Improving speed and responsiveness for a wide spectrum of applications

Multiple OpenCL implementations expected in the next 12 months, on diverse platforms

Timeline:
- Jun 08: Apple works with AMD, Intel, NVIDIA and others on a draft proposal; Apple proposes an OpenCL working group and contributes the draft specification to Khronos
- Jun-Oct 08: The OpenCL working group develops the draft into a cross-vendor specification
- Oct 08: The working group sends the completed draft to the Khronos Board for ratification
- Dec 08: Khronos publicly releases OpenCL as a royalty-free specification
- May 09: Khronos to release conformance tests to ensure high-quality implementations

Page 13: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Part of the Khronos API Ecosystem

[Diagram: the Khronos API family, spanning the silicon and software communities]
- OpenCL: heterogeneous parallel computing
- Embedded 3D
- Cross-platform desktop 3D
- 3D asset interchange format
- Enhanced audio
- Vector 2D
- Surface and synchronization abstraction
- Streaming media and image processing
- Mobile OS abstraction
- Integrated mixed-media stack
- Desktop 3D ecosystem: parallel computing and visualization in scientific and consumer applications
- Umbrella specifications define coherent acceleration stacks for mobile application portability
- Streamlined APIs for mobile and embedded graphics, media and compute acceleration

OpenCL is at the center of an emerging visual computing ecosystem that includes 3D graphics, video and image processing on desktop, embedded and mobile systems.

Page 14: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Platform Model

One Host + one or more Compute Devices
- Each Compute Device is composed of one or more Compute Units
- Each Compute Unit is further divided into one or more Processing Elements

Page 15: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Execution Model

An OpenCL program consists of:
- Kernels: the basic unit of executable code, similar to C functions or CUDA kernels; data-parallel or task-parallel
- A host program: a collection of compute kernels and internal functions, analogous to a dynamic library

Kernel execution:
- The host program invokes a kernel over an index space called an NDRange
- An NDRange ("N-Dimensional Range") can be a 1D, 2D, or 3D space
- A single kernel instance at a point in the index space is called a work-item
- Work-items have unique global IDs from the index space (CUDA: thread IDs)
- Work-items are further grouped into work-groups
- Work-groups have a unique work-group ID; work-items have a unique local ID within a work-group (CUDA: block IDs)

Page 16: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Execution Model, example 2D NDRange

- Total number of work-items = Gx * Gy
- Size of each work-group = Sx * Sy
- The global ID can be computed from the work-group ID and the local ID

Page 17: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Execution Model

Contexts are used to contain and manage the state of the "world". Kernels are executed in contexts defined and manipulated by the host:
- Devices
- Kernels: OpenCL functions
- Program objects: kernel source and executables
- Memory objects

Command-queue: coordinates the execution of kernels
- Kernel execution commands
- Memory commands: transfer or map memory object data
- Synchronization commands: constrain the order of commands

Execution order of commands:
- Launched in-order
- Executed in-order or out-of-order
- Events are used to implement appropriate synchronization of execution instances

Page 18: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Memory Model

- Shared memory with relaxed consistency (similar to CUDA)
- Multiple distinct address spaces (which can be collapsed)

[Diagram: a Compute Device with Processing Elements (PEs), Local Memory per Compute Unit, a Global/Constant Memory Data Cache, and Global Memory in Compute Device Memory]

Address spaces:
- Global memory: qualifier __global (e.g. __global float4 *p); global memory in CUDA
- Constant memory: qualifier __constant; constant memory in CUDA
- Local memory (shared within a work-group): qualifier __local; shared memory in CUDA
- Private memory (private to a work-item): qualifier __private (e.g. __private char *px); local memory in CUDA

Page 19: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Memory Consistency

A relaxed consistency memory model:
- Across work-items (CUDA: threads): no consistency
- Within a work-item (CUDA: thread): load/store consistency
- Consistency of memory shared between commands is enforced through synchronization

Page 20: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Programming Model

Data-Parallel Model

- Must be implemented by all OpenCL compute devices
- Define an N-dimensional computation domain
- Each independent element of execution in an N-dimensional domain is called a work-item
- The N-dimensional domain defines the total number of work-items that execute in parallel (= global work size)
- Work-items can be grouped together into a work-group
- Work-items in a group can communicate with each other
- Execution can be synchronized among the work-items in a group to coordinate memory access
- Multiple work-groups execute in parallel
- The mapping of the global work size to work-groups can be implicit or explicit

Page 21: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Programming Model

Task-Parallel Model

- Some compute devices can also execute task-parallel compute kernels
- These execute as a single work-item
- Users express parallelism by:
  - using vector data types implemented by the device,
  - enqueuing multiple tasks (compute kernels written in OpenCL), and/or
  - enqueuing native kernels developed using a programming model orthogonal to OpenCL (e.g. native C/C++ functions)

Page 22: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Programming Model, Synchronization

- Work-items in a single work-group (work-group barrier); similar to __syncthreads() in CUDA
- No mechanism for synchronization between work-groups
- Synchronization points between commands in command-queues; similar to multiple kernels in CUDA but more generalized:
  - Command-queue barrier
  - Waiting on an event

Page 23: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL C for Compute Kernels

Derived from ISO C99
- A few restrictions: recursion, function pointers, functions in C99 standard headers, ...
- Preprocessing directives defined by C99 are supported

Built-in data types
- Scalar and vector data types, pointers
- Data-type conversion functions: convert_type<_sat><_roundingmode>
- Image types: image2d_t, image3d_t and sampler_t

Built-in functions (required)
- Work-item functions, math.h, image reads and writes
- Relational, geometric, and synchronization functions

Built-in functions (optional)
- Double precision (the latest CUDA supports this), atomics to global and local memory
- Selection of rounding mode, writes to image3d_t surfaces

Page 24: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL C Language Highlights

Function qualifiers
- The __kernel qualifier declares a function as a kernel
- Kernels can call other kernel functions

Address space qualifiers
- __global, __local, __constant, __private
- Pointer kernel arguments must be declared with an address space qualifier

Work-item functions
- Query work-item identifiers: get_work_dim(), get_global_id(), get_local_id(), get_group_id()

Image functions
- Images must be accessed through built-in functions
- Reads/writes are performed through sampler objects from the host or defined in source

Synchronization functions
- Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue
- Memory fences: provide ordering between memory operations

Page 25: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL C Language Restrictions

- Pointers to functions are not allowed
- Pointers to pointers are allowed within a kernel, but not as an argument
- Bit-fields are not supported
- Variable-length arrays and structures are not supported
- Recursion is not supported
- Writes through a pointer to a type smaller than 32 bits are not supported
- Double types are not supported, but reserved
- 3D image writes are not supported

Some restrictions are addressed through extensions.

Page 26: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Basic OpenCL Program Structure

Page 27: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL: Kernel Code Example

Simple element-by-element vector addition:

For all i, C(i) = A(i) + B(i)

__kernel void VectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c)
{
    int iGID = get_global_id(0);
    c[iGID] = a[iGID] + b[iGID];
}

Page 28: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL VectorAdd: Contexts and Queues

cl_context       cxMainContext;   // OpenCL context
cl_command_queue cqCommandQue;    // OpenCL command queue
cl_device_id*    cdDevices;       // OpenCL device list
size_t           szParmDataBytes; // byte length of parameter storage

// create the OpenCL context on a GPU device
cxMainContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, 0, NULL, &szParmDataBytes);
cdDevices = (cl_device_id*)malloc(szParmDataBytes);
clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, szParmDataBytes, cdDevices, NULL);

// create a command-queue
cqCommandQue = clCreateCommandQueue(cxMainContext, cdDevices[0], 0, NULL);

Page 29: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL VectorAdd: Create Memory Objects,Program and Kernel

Create Memory Objects

Create Program and Kernel

// allocate the first source buffer memory object
//   ... source data, so read only ...
// allocate the second source buffer memory object
//   ... source data, so read only ...
// allocate the destination buffer memory object
//   ... result data, so write only ...

// create the program
// ...
// build the program
// ...
// create the kernel
// ...
// set the kernel argument values
// ...

Page 30: OpenCL - The Open Standard for Heterogeneous Parallel Programming


OpenCL VectorAdd: Launch Kernel

cl_command_queue cqCommandQue;        // OpenCL command queue
cl_kernel        ckKernel;            // OpenCL kernel "VectorAdd"
size_t           szGlobalWorkSize[1]; // global # of work-items
size_t           szLocalWorkSize[1];  // # of work-items per work-group
int              iTestN = 10000;      // length of demo test vectors

// set work-item dimensions
szGlobalWorkSize[0] = iTestN;
szLocalWorkSize[0]  = 1;

// execute kernel
ciErrNum = clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL,
                                  szGlobalWorkSize, szLocalWorkSize,
                                  0, NULL, NULL);

// Cleanup: release kernel, program, and memory objects
// ...

Page 31: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Summary: OpenCL versus CUDA

                    OpenCL                                    CUDA
Execution model     Work-groups / work-items                  Blocks / threads
Memory model        Global / constant / local / private       Global / constant / shared / local + texture
Memory consistency  Weak consistency                          Weak consistency
Synchronization     Work-group barrier (between work-items)   __syncthreads() between threads

Page 32: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Conclusions

OpenCL meets the trends in HPC:

Supports multicore SMPs

Supports accelerators (in particular GPGPUs)

Allows programming of heterogeneous processors

Supports vectorization (SIMD operations)

Explicitly supports parallel image processing

Suitable for massive parallelism: OpenCL + MPI

OpenCL is a low-level, complicated parallel language.

If you master it, you master (heterogeneous) parallel programming.

It might be the basis for new high-level parallel languages.

Page 33: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Questions?

Page 34: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Next TechTalk on CUDA: Wednesday, July 22, 2009 (15:00-16:00, Raum KP-2b-06, Funktional)

CUDA - The Compute Unified Device Architecture from NVIDIA

Jens Rühmkorf

Page 35: OpenCL - The Open Standard for Heterogeneous Parallel Programming


Additional material: OpenCL Demos

NVIDIA: http://www.youtube.com/watch?v=PJ1jydg8mLg

AMD (1): http://www.youtube.com/watch?v=MCaGb40Bz58&feature=related

AMD (2): http://www.youtube.com/watch?v=mcU89Td53Gg