CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

HSA ENABLEMENT OF APARAPI EASING THE DEVELOPER PATH TO APU/GPU ACCELERATED JAVA APPLICATIONS

VIGNESH RAVI – SOFTWARE DEVELOPER HSA TEAM AMD GARY FROST – SOFTWARE FELLOW AMD

2 | HSA ENABLEMENT OF APARAPI | NOVEMBER 2013|

HSA ENABLEMENT OF APARAPI : AGENDA

!  Java GPU enablement via Aparapi ‒ Why Java? ‒ Aparapi

‒ What is it and how is it used?

!  Introduction to HSA !  How HSA simplifies Java GPU programming with Aparapi

‒ Simpler programming model using lambda expressions ‒ Removal of previous constraints thanks to SVM (Shared Virtual Memory)

!  The nuts and bolts of our current HSA enablement ‒ HSAIL generation ‒ Dispatch via HSA Runtime APIs

!  Summary !  Q&A


WHY JAVA?

!  Java by the numbers ‒ 9 Million Developers ‒ 1 Billion Java downloads per year ‒ 97% Enterprise desktops run Java ‒ 100% of blue ray players ship with Java hVp://oracle.com.edgesuite.net/[meline/java/

!  Java 7 language & libraries already include concurrency features ‒ primi[ves (threads, locks, monitors, atomic ops) ‒ libraries (fork/join, thread pools, executors, futures)

!  Upcoming Java 8 include stream processing enhancements ‒ support for ‘lambda’ expressions ‒ Lambda centric concurrent stream processing libs/apis (java.u[l.stream.*)


INITIAL APARAPI PROJECT OVERVIEW (2011)

!  Open Source framework !  Allows Java developers access to GPU compute !  Aparapi Java API for expressing data parallel workloads

Kernel kernel = new Kernel(){ @Override public void run(){ int i=getGlobalID(); square[i]=in[i]*in[i]; } }; kernel.execute(size);

!  Aparapi runtime capable of converting bytecode to OpenCL™ ‒ Execution on OpenCL™ 1.1+ capable devices (GPUs and APUs) Or… ‒ Execute via a thread pool if OpenCL™ is unavailable.

GPU ISA

JVM

Java Applica[on

OpenCL™

OpenCL™ compiler & Runtime

Overload Aparapi Kernel Base Class’s run() method

CPU ISA

Overload Aparapi Kernel Class’s run() method

Aparapi converts bytecode to OpenCL™

CPU GPU


MEET HSA AND HSAIL

! Heterogeneous System Architecture standardizes CPU/GPU func[onality ‒ Be ISA-‐agnos[c for both CPUs and accelerators ‒ Support high-‐level programming languages ‒ Provide the ability to access pageable system memory from the GPU ‒ Maintain cache coherency for system memory between CPU and GPU

! Specifica[ons and simulator from HSA Founda[on ‒ HSAIL portable ISA is “finalized” to par[cular hardware ISA at run[me ‒ Run[me specifica[on for job launch and control ‒ HSAIL™ simulator for development and tes[ng before hardware availability


APARAPI HSA ENABLEMENT (2013-‐2014)

!  Open Source project sponsored !  Enhanced to support HSA and Java 8 lambda expression

Device.hsa().forEach(size, i -> square[i]=in[i]*in[i] );

!  Allow developers to efficiently represent data parallel algorithms using new Java 8 Lambda expressions

!  API’s have same look & feel as proposed Java 8 stream API features

!  No modifica[ons to the JVM. ‒ We provide external JNI/Java libraries.

GPU ISA

JVM

Java Applica[on

GPU CPU

HSAIL

HSA Finalizer & Runtime

Aparapi Lambda based API

Aparapi converts bytecode to HSAIL

CPU ISA


HSA AND LAMBDA ENABLED APARAPI EXECUTION EXAMPLE

Device.hsa().forEach(size, i -> square[i]=int[i]*int[i] );

Is this the first execuAon of this

lambda instance?

Execute HSAIL Kernel on GPU/APU

Does PlaLorm Supports HSA?

Y Can bytecode be converted to

HSAIL?

Y

Convert bytecode to

HSAIL

Y

Y Do we have HSAIL for this lambda ?

N

Execute Kernel using Java thread Pool

N

N

N


SUMATRA PROJECT : NATIVE SUPPORT FOR GPU OFFLOAD ADDED TO JAVA

!  AMD/Oracle sponsored Open Source (OpenJDK) project

!  Targeted at OpenJDK Java 9 (2015) !  Allow developers to efficiently represent data parallel algorithms in Java using

Stream API + Lambda expressions

!  Sumatra is not pushing new ‘programming model’

!  Instead we ‘repurpose’ Stream API + Lambda to enable both CPU or GPU compu[ng

!  A Sumatra enabled Java Virtual Machine™ will dispatch ‘selected’ constructs to HSA enabled devices at run[me.

!  Developers already refactoring JDK to use stream & lambda API’s ‒  So anyone using exis[ng JDK should see GPU accelera[on without any code changes.

!  Links: ‒  hVp://openjdk.java.net/projects/sumatra ‒  hVps://wikis.oracle.com/display/HotSpotInternals/Sumatra ‒  hVp://mail.openjdk.java.net/pipermail/sumatra-‐dev

GPU ISA

JVM

Java Applica[on

GPU CPU

HSAIL

HSA Finalizer & Runtime

CPU ISA

Java JDK Stream + Lambda API

Java GRAAL JIT backend


HSA ENABLEMENT OF JAVA

9

CPU ISA GPU ISA

JVM

Java Applica[on

GPU CPU

OpenCL™

OpenCL™ Compiler and Run[me

APARAPI API

Java 7 – OpenCL enabled Aparapi •  AMD ini[ated Open Source project •  APIs for data parallel algorithms

GPU accelerate Java applica[ons

No need to learn OpenCL

•  Ac[ve community captured mindshare ~20 contributors >7000 downloads ~150 visits per day

CPU ISA GPU ISA

JVM

Java Applica[on

GPU CPU

HSAIL™

HSA Finalizer & Run[me

APARAPI + Lambda API

Java 8 – HSA enabled Aparapi •  Java 8 brings Stream + Lambda API.

More natural way of expressing data parallel algorithms Ini[ally targeted at mul[-‐core.

•  APARAPI will :-‐ Support Java 8 Lambdas Dispatch code to HSA enabled devices at run[me via HSAIL

Java 9 – HSA enabled Java (Sumatra) •  Adds na[ve GPU compute support to Java Virtual Machine

(JVM)

•  Developer uses JDK provided Lambda + Stream API

•  JVM uses GRAAL compiler to generate HSAIL

•  JVM decides at run[me to execute on either CPU or GPU depending on workload characteris[cs.

GPU ISA

JVM

Java Applica[on

GPU CPU

HSAIL™

HSA™ Finalizer & Run[me

Java JDK Stream + Lambda API

Java GRAAL JIT backend

CPU ISA

We plan to provide HSA Enabled Aparapi (Java 8)

as a bridge technology between OpenCL based Aparapi (Java 7)

and HSA Enabled Sumatra (Java 9)

10 | HSA ENABLEMENT OF APARAPI | NOVEMBER 2013| 10

A CASE STUDY CENTERED ON NBODY

!  A Java developer implemen[ng a sequen[al version of NBody would probably… ‒  Create a class to represent each body

!  Loop through each Body (in array of bodies[]) to update and display

class Body{ float x,y,z,m,vx,vy,vz; // Include method to update position and display void updateAndShow(Screen screen, Body[] bodies){ for (Body other:bodies){

// accumulate forces between other and this } // update vx,vy,vz,x,y and z from accumulated data screen.paint(x,y,z);

} }

for (Body b: bodies) b.updateAndShow(screen, bodies);


WITHOUT HSA WE CAN’T (EFFICIENTLY) USE OBJECTS

!  In Java; allocated Objects are scaVered on the heap. ‒  There is no way to allocate an array of objects in con[guous memory (as with C++) ‒ We force the developer to resort to using parallel arrays of primi[ves (which are con[guous)

‒  And to infer that x[n], y[n] and z[n] holds the state for bodies[n].

‒  Then the kernel can be used to execute the code on the GPU

float x[], y[], z[], m[], vx,[], vy[], vz[];

Kernel kernel = new Kernel(){ public void run(){

int i = getGlobalId(0); for (int j=0; j<bodies.length; j++){

// accum forces between (x,y,z)[j] and (x,y,z)[i] } // update vx[j],vy[j],vz[j],x[j],y[j] and z[j]

} };

Kernel.execute(bodies.length);


HSA ENABLED APARAPI (AND SUMATRA) ALLOWS USE OF OBJECTS

!  So we code our Body class exactly as we would if execu[ng in Java.

!  Then use new Aparapi lambda enabled API to coordinate dispatch to theGPU

class Body{ float x,y,z,m,vx,vy,vz; // Include method to update position and display void updateAndShow(Screen screen, Body[] bodies){ for (Body other:bodies){

// accumulate forces between other and this } // update vx,vy,vz,x,y and z from accumulated data screen.paint(x,y,z);

} }

Device.hsa().forEach(bodies, b -> { b.updateAndShow(screen, bodies); });


OVERVIEW OF HSA ENABLED APARAPI

! HSA enabled Aparapi, at run[me: ‒ Step 0: Generate HSAIL from Bytecode ‒ Step 1: Generate host HSA Run[me calls

‒ Step 1.1: Ini[alize HSA run[me, device, queue …

‒ Step 1.2: Finalize HSAIL to generate GPU ISA ‒ Step 1.3: Bind Java args to HSA args ‒ Step 1.4: Dispatch the kernel ‒ Step 1.5: Wait for comple[on ‒ Repeat steps 1.3 -‐ 1.5 for next itera[on of same kernel

‒ Repeat step 0 – 1 for each new kernel

CPU ISA

JVM

Application

Aparapi

GPU CPU

Run

time

MyLambda.java

javac (compiler)

MyLambda.class

Dev

elop

men

t tim

e

Generate HSAIL

Generate HSA RT

calls

Initialize

Bind Args

Finalize

Dispatch GPU ISA

Input

Contains


HIGH LEVEL HSA FEATURES

! Features currently being defined in the HSA Working Groups** ‒ Unified addressing across all processors ‒ Opera[on into pageable system memory ‒ Full memory coherency ‒ Pla|orm atomics ‒ User mode dispatch

‒ Enables fast dispatch with no driver involvement ‒ Architected queuing language

‒ Flexible compute dispatch, easier GPU self-‐enqueue ‒ High level language support for GPU compute processors ‒ Preemp[on and context switching

** All features subject to change, pending comple[on and ra[fica[on of specifica[ons in the HSA Working Groups

@ Copyright 2012 HSA Founda[on. All Rights Reserved.


HSA INTERMEDIATE LANGUAGE (HSAIL)**

! HSAIL is a virtual ISA for parallel programs ‒ Finalized to vendor-‐specific ISA by a JIT compiler or “Finalizer” ‒ ISA independent by design for CPU & GPU

! Explicitly parallel ‒ Designed for data parallel programming

! Support for excep[ons, virtual func[ons, and other high level language features ! Lower level than OpenCL™ SPIR

‒ Fits naturally in the OpenCL™ compila[on stack

! Suitable to support addi[onal high level languages and programming models: ‒ Java, C++, OpenMP, etc

** Subject to change, pending comple[on and ra[fica[on of specifica[ons in the HSA Working Groups @ Copyright 2012 HSA Founda[on. All Rights Reserved.


HSAIL OVERVIEW**

!  Similar to assembly language for a RISC CPU ‒  Load-‐store architecture

ld_global_u64 $d0, [$d6 + 120]; $d0= load($d6+120)

add_u64 $d1, $d2, 24; $d1= $d2+24

!  136 opcodes (Java™ bytecode has 200) ‒  Floa[ng point (single, double, half (f16)) ‒  Integer (32-‐bit, 64-‐bit) ‒  Some packed opera[ons ‒ Branches ‒  Func[on calls ‒ Pla$orm Atomic Opera[ons: and, or, xor, exch, add, sub, inc, dec, max, min, cas ‒  Synchronize host CPU and HSA Component!

!  Text and Binary formats (“BRIG”)

** Subject to change, pending comple[on and ra[fica[on of specifica[ons in the HSA Working Groups

!  Four classes of registers ‒ C: 1-‐bit, Control Registers ‒  S: 32-‐bit, Single-‐precision FP or Int ‒ D: 64-‐bit, Double-‐precision FP or Long Int ‒ Q: 128-‐bit, Packed data.

!  Fixed number of registers: ‒ 8 C ‒  S, D, Q share a single pool of resources

S + 2*D + 4*Q <= 128

Up to 128 S or 64 D or 32 Q (or a blend)

!  Register alloca[on done in high-‐level compiler ‒  Finalizer doesn’t have to perform expensive register alloca[on

INSTRUCTION SET REGISTERS



SEGMENTS AND MEMORY **

!  7 segments of memory ‒  global, readonly, group, spill, private, arg, kernarg, ‒ Memory instruc[ons can (op[onally) specify a segment

!  Global Segment ‒ Visible to all HSA agents (including host CPU)

!  Group Segment ‒ Provides high-‐performance memory shared in the work-‐group.

‒ Group memory can be read and wriVen by any work-‐item in the work-‐group

‒ HSAIL provides sync opera[ons to control visibility of group memory

!  Spill, Private, Arg Segments ‒ Represent different regions of a per-‐work-‐item stack ‒ Typically generated by compiler, not specified by programmer


!  Kernarg Segment ‒ Programmer writes kernarg segment to pass arguments to a kernel

!  Read-‐Only Segment ‒ Remains constant during execu[on of kernel

!  Flat Addressing ‒ Each segment mapped into virtual address space

‒  Flat addresses can map to segments based on virtual address

‒  Instruc[ons with no explicit segment use flat addressing

‒ Very useful for high-‐level language support (ie classes, libraries)

‒ Aligns well with OpenCL 2.0 “generic” addressing feature

ld_global_u64 $d0, [$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] ld_kernarg_u64 $d6, [%_arg0] ld_u64 $d0,[$d6+24] ; flat



EXAMPLE – BYTECODE TO HSAIL GENERATION

int in[], out[]; Device.hsa().forEach(len, i-> out[i] = in[i] * in[i] );

javac –g squares.java

0: aload_0 //out[] 1: iload_2 //i 2: aload_1 //in[] 3: iload_2 4: iaload 5: aload_1 6: iload_2 7: iaload 8: imul 9: iastore 10: return

version 0:95: $full : $large; kernel &run( kernarg_u64 %_arg0, //out[] kernarg_u64 %_arg1, //in[] kernarg_s32 %_arg2 ){ ld_kernarg_u64 $d0, [%_arg0]; ld_kernarg_u64 $d1, [%_arg1]; ld_kernarg_s32 $s2, [%_arg2]; workitemabsid_u32 $s2, 0; //i mov_b64 $d3, $d0; mov_b32 $s4, $s2; mov_b64 $d5, $d1; mov_b32 $s6, $s2; cvt_u64_s32 $d6, $s6; mad_u64 $d6, $d6, 4, $d5; ld_global_s32 $s5, [$d6+24]; mov_b64 $d6, $d1; mov_b32 $s7, $s2; cvt_u64_s32 $d7, $s7; mad_u64 $d7, $d7, 4, $d6; ld_global_s32 $s6, [$d7+24]; mul_s32 $s5, $s5, $s6; cvt_u64_s32 $d4, $s4; mad_u64 $d4, $d4, 4, $d3; st_global_s32 $s5, [$d4+24]; ret; };

Generated HSAIL


APARAPI JNI CALL -‐> HSA RUNTIME API

Device Discovery & Queue Crea[on APIs**

!  Discover HSA Device ‒ Both count and device_list are out params ‒ User can iterate over HSA devices in the list

!  User-‐Mode Queue Crea[on ‒ User can provide pre-‐allocated buffer ‒  If not, API will allocate a buffer ‒ queue is the user-‐mode queue

HsaStatus HsaGetDevices(unsigned int *count, const HsaDevice **device_list);

HsaStatus HsaCreateUserModeQueue(const HsaDevice *device, void *buffer, size_t buffer_size,

HsaQueuePriority queue_priority, HsaQueueFrac[on queue_frac[on, HsaQueue **queue);

** All APIs subject to change, pending comple[on and ra[fica[on of specifica[ons in the HSA Working Groups


APARAPI JNI -‐> HSA RUNTIME API

Finalize HSAIL to GPU ISA**

!  Transla[ng HSAIL text to Binary (BRIG) ‒ BRIG is a binary container for several sec[ons

‒  Code ‒  String ‒ Direc[ve ‒ …

‒  libHsail is an assembler/disassembler ‒  This is a standalone compiler library ‒ Not part of Run[me

!  Finalize Brig to IHV specific GPU ISA ‒  Input: Brig ‒ Output: HsaKernelCode which contains ISA

** All APIs subject to change, pending comple[on and ra[fica[on of specifica[ons in the HSA Working Groups

Status Assemble (const char* hsail_text, HsaBrig *brig);

HsaStatus HsaFinalizeBrig(const HsaDevice *device, HsaBrig *brig, const char *kernel_name, const char *op[ons, HsaKernelCode **kernel);


APARAPI JNI -‐> POPULATION OF AQL DISPATCH PACKET

!  AQL Dispatch Packet** ‒ Header enables:

‒ Different packet types ‒  Specify if this packet should wait for all previous to complete

‒  Control visibility of data and memory fences before and a�er dispatch

‒ Body enables: ‒  Specify the problem fan out using launch config related fields

‒ How much workgroup memory? ‒  Loca[on of IHV specific GPU ISA ‒  Loca[on of where kernelargs can be found ‒  A signal mechanism to wait on kernel comple[on

!  Only popula[ng Kernel info and signal are opaque, so require run[me APIs ‒ Other fields are open, so simple assignments


typedef struct HsaAqlDispatchPacket { uint32_t format : 8; uint32_t barrier : 1; uint32_t acquire_fence_scope : 2; uint32_t release_fence_scope : 2; uint32_t invalidate_instruction_cache : 1; uint32_t invalidate_roi_image_cache : 1; uint32_t dimensions : 2; uint32_t reserved : 15; uint16_t workgroup_size[3]; uint16_t reserved2; uint32_t grid_size[3]; uint32_t private_segment_size_bytes; uint32_t group_segment_size_bytes; uint64_t kernel_object_address; uint64_t kernel_arg_address; uint64_t reserved3; uint64_t completion_signal; } HsaAqlDispatchPacket;

Header Fields

Launch Config

Kernel Info

Kernel SynchronizaAon


POPULATING KERNEL INFO AND SIGNAL USING HSA RT API**

HsaStatus HsaFinalizeBrig(const HsaDevice *device, HsaBrig *brig, const char *kernel_name, const char *op[ons, HsaKernelCode **kernel);

typedef struct HsaKernelCode { … uint32_t workitem_private_segment_byte_size; uint32_t workgroup_group_segment_byte_size; uint64_t kernarg_segment_byte_size; … } HsaKernelCode;


typedef struct HsaAqlDispatchPacket { … uint32_t private_segment_size_bytes; uint32_t group_segment_size_bytes; uint64_t kernel_object_address; uint64_t kernel_arg_address; … uint64_t completion_signal; }

Pack Java Args into a vector in JNI

HsaStatus HsaRegisterSystemMemory(void *address, size_t size);

Register vector data address

HsaStatus HsaCreateSignal(HsaSignal *signal);


DISPATCH AND WAIT ON KERNEL COMPLETION

!  Dispatch ‒  Submit AQL Packet into the HsaQueue ‒ Thread safe API

!  Wait on Kernel Comple[on


HsaStatus HsaSubmitAql(HsaQueue *queue,HsaAqlDispatchPacket *aql_packet);

bool is_done = false; while (!is_done) { status = HsaQuerySignal(signal, &is_done); assert(status == kHsaStatusSuccess); }

!  A�er comple[on, disposing HSA resources ‒ Release queue ‒ Release signal ‒ Release Kernel object ‒ Deregister kernel args related memory

HsaStatus HsaDestroyUserModeQueue(HsaQueue *queue); HsaStatus HsaDestroySignal(HsaSignal signal); HsaStatus HsaFreeKernelCode(HsaKernelCode *kernel); HsaStatus HsaDeregisterSystemMemory(void *address);


DEMO


SUMMARY

!  Aparapi is already an establish framework for simplifying execu[on of Java on GPU devices

!  HSA enabled Aparapi further simplifies GPU accelera[on of Java applica[ons ‒ Aligns with Java 8 features to support ‘lambda’ expression for compactness ‒ Enables ‘large unified’ system memory for GPU accelera[on ‒ Eases programming by enabling direct access to Java objects on heap ‒ Enables fast offload of Java kernels through User-‐mode queue and AQL

!  HSA enabled Aparapi lends to more interes[ng future possibili[es ‒  Simplified communica[on and workload balancing across both CPU and GPU ‒ Exploit new computa[on paVerns and recursions through kernel self-‐enqueue


QUESTIONS & ANSWERS?


DISCLAIMER & ATTRIBUTION

The informa[on presented in this document is for informa[onal purposes only and may contain technical inaccuracies, omissions and typographical errors.

The informa[on contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, so�ware changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obliga[on to update or otherwise correct or revise this informa[on. However, AMD reserves the right to revise this informa[on and to make changes from [me to [me to the content hereof without obliga[on of AMD to no[fy any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combina[ons thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdic[ons. OpenCL is a trademark of Apple Inc. HSA is a trademark of the Heterogeneous System Architecture Founda[on. Other names are for informa[onal purposes only and may be trademarks of their respec[ve owners.

Technology

CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi