27
HSA ENABLEMENT OF APARAPI EASING THE DEVELOPER PATH TO APU/GPU ACCELERATED JAVA APPLICATIONS VIGNESH RAVI – SOFTWARE DEVELOPER HSA TEAM AMD GARY FROST – SOFTWARE FELLOW AMD

CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

Embed Size (px)

DESCRIPTION

Presentation CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

Citation preview

Page 1: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

HSA  ENABLEMENT  OF  APARAPI    EASING  THE  DEVELOPER  PATH  TO  APU/GPU  ACCELERATED  JAVA  APPLICATIONS  

VIGNESH  RAVI  –  SOFTWARE  DEVELOPER  HSA  TEAM  AMD    GARY  FROST  –  SOFTWARE  FELLOW  AMD  

Page 2: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

2   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

HSA  ENABLEMENT  OF  APARAPI  :  AGENDA  

!  Java GPU enablement via Aparapi ‒ Why Java? ‒ Aparapi

‒ What is it and how is it used?

!  Introduction to HSA !  How HSA simplifies Java GPU programming with Aparapi

‒ Simpler programming model using lambda expressions ‒ Removal of previous constraints thanks to SVM (Shared Virtual Memory)

!  The nuts and bolts of our current HSA enablement ‒ HSAIL generation ‒ Dispatch via HSA Runtime APIs

!  Summary !  Q&A

Page 3: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

3   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

WHY  JAVA?  

!  Java  by  the  numbers    ‒ 9  Million  Developers  ‒ 1  Billion  Java  downloads  per  year  ‒ 97%    Enterprise  desktops  run  Java  ‒ 100%    of  blue  ray  players  ship  with  Java  hVp://oracle.com.edgesuite.net/[meline/java/  

!  Java  7  language  &  libraries  already  include  concurrency  features    ‒ primi[ves  (threads,  locks,  monitors,  atomic  ops)  ‒ libraries  (fork/join,  thread  pools,  executors,  futures)  

!  Upcoming  Java  8  include  stream  processing  enhancements  ‒ support  for  ‘lambda’    expressions    ‒ Lambda  centric  concurrent  stream  processing  libs/apis    (java.u[l.stream.*)      

Page 4: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

4   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

INITIAL  APARAPI  PROJECT  OVERVIEW  (2011)  

!  Open Source framework !  Allows Java developers access to GPU compute !  Aparapi Java API for expressing data parallel workloads

Kernel kernel = new Kernel(){ @Override public void run(){ int i=getGlobalID(); square[i]=in[i]*in[i]; } }; kernel.execute(size);

!  Aparapi runtime capable of converting bytecode to OpenCL™ ‒ Execution on OpenCL™ 1.1+ capable devices (GPUs and APUs) Or… ‒ Execute via a thread pool if OpenCL™ is unavailable.  

GPU ISA

JVM

 Java  Applica[on  

OpenCL™

OpenCL™ compiler & Runtime

Overload  Aparapi  Kernel  Base  Class’s  run()  method  

CPU ISA

Overload  Aparapi  Kernel  Class’s    run()  method  

Aparapi  converts  bytecode  to  OpenCL™    

CPU GPU

Page 5: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

5   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

MEET  HSA  AND  HSAIL  

! Heterogeneous  System  Architecture  standardizes  CPU/GPU  func[onality  ‒ Be  ISA-­‐agnos[c  for  both  CPUs  and  accelerators  ‒ Support  high-­‐level  programming  languages  ‒ Provide  the  ability  to  access  pageable  system  memory  from  the  GPU  ‒ Maintain  cache  coherency  for  system  memory  between  CPU  and  GPU  

! Specifica[ons  and  simulator  from  HSA  Founda[on  ‒ HSAIL  portable  ISA  is    “finalized”  to  par[cular  hardware  ISA  at  run[me  ‒ Run[me  specifica[on  for  job  launch  and  control  ‒ HSAIL™  simulator  for  development  and  tes[ng  before  hardware  availability  

Page 6: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

6   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

APARAPI  HSA  ENABLEMENT  (2013-­‐2014)  

!  Open  Source  project  sponsored    !  Enhanced  to  support  HSA  and  Java  8  lambda  expression  

Device.hsa().forEach(size, i -> square[i]=in[i]*in[i] );  

!  Allow  developers  to  efficiently  represent  data  parallel  algorithms  using  new  Java  8  Lambda  expressions  

!  API’s  have  same  look  &  feel  as  proposed  Java  8  stream  API  features  

!  No  modifica[ons  to  the  JVM.      ‒ We  provide  external  JNI/Java  libraries.    

GPU ISA

JVM

 Java  Applica[on  

GPU CPU

HSAIL

HSA Finalizer & Runtime

Aparapi  Lambda  based    API  

Aparapi  converts  bytecode  to  HSAIL  

CPU ISA

Page 7: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

7   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

HSA  AND  LAMBDA  ENABLED  APARAPI  EXECUTION  EXAMPLE  

Device.hsa().forEach(size, i -> square[i]=int[i]*int[i] );

 

Is  this  the  first  execuAon  of  this  

lambda    instance?  

Execute    HSAIL  Kernel  on  GPU/APU  

Does  PlaLorm  Supports  HSA?  

 

Y Can  bytecode  be  converted  to  

HSAIL?  

 

Y

Convert  bytecode  to  

HSAIL  

Y

Y Do  we  have  HSAIL  for  this  lambda  ?  

 

N

Execute  Kernel  using  Java  thread  Pool  

N

N

N

Page 8: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

8   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

SUMATRA  PROJECT  :  NATIVE  SUPPORT  FOR  GPU  OFFLOAD  ADDED  TO  JAVA  

!  AMD/Oracle  sponsored  Open  Source  (OpenJDK)  project  

!  Targeted  at  OpenJDK  Java  9  (2015)  !  Allow  developers  to  efficiently  represent  data  parallel  algorithms  in  Java  using  

Stream  API  +  Lambda  expressions  

!  Sumatra  is  not  pushing  new  ‘programming  model’    

!  Instead  we  ‘repurpose’  Stream  API  +  Lambda  to  enable  both  CPU  or  GPU  compu[ng  

!  A  Sumatra  enabled  Java  Virtual  Machine™  will  dispatch  ‘selected’  constructs  to  HSA  enabled  devices  at  run[me.  

!  Developers  already  refactoring  JDK  to  use  stream  &  lambda  API’s  ‒  So  anyone  using  exis[ng  JDK  should  see  GPU  accelera[on  without  any  code  changes.  

!  Links:  ‒  hVp://openjdk.java.net/projects/sumatra  ‒  hVps://wikis.oracle.com/display/HotSpotInternals/Sumatra  ‒  hVp://mail.openjdk.java.net/pipermail/sumatra-­‐dev  

GPU ISA

JVM

 Java  Applica[on  

GPU CPU

HSAIL

HSA Finalizer & Runtime

CPU ISA

Java  JDK  Stream  +  Lambda  API  

Java  GRAAL  JIT  backend  

Page 9: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

9   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

HSA  ENABLEMENT  OF  JAVA  

9

CPU ISA GPU ISA

JVM  

 Java  Applica[on  

GPU  CPU  

   

OpenCL™

OpenCL™  Compiler  and  Run[me  

APARAPI  API  

Java  7  –  OpenCL  enabled  Aparapi      •  AMD  ini[ated  Open  Source  project    •  APIs  for  data  parallel  algorithms    

GPU  accelerate  Java  applica[ons  

No  need  to  learn  OpenCL    

•  Ac[ve  community  captured  mindshare  ~20  contributors    >7000  downloads  ~150  visits  per  day  

CPU ISA GPU ISA

JVM  

 Java  Applica[on  

GPU  CPU  

   

HSAIL™

 HSA  Finalizer  &  Run[me  

APARAPI  +  Lambda  API  

Java  8  –    HSA  enabled  Aparapi      •  Java  8  brings  Stream  +  Lambda  API.  

More  natural  way  of  expressing  data  parallel  algorithms    Ini[ally  targeted  at  mul[-­‐core.    

•  APARAPI  will  :-­‐  Support  Java  8  Lambdas    Dispatch  code  to  HSA  enabled  devices  at  run[me  via  HSAIL  

Java  9  –  HSA  enabled  Java  (Sumatra)      •  Adds  na[ve  GPU  compute  support  to  Java  Virtual  Machine  

(JVM)      

•  Developer  uses  JDK  provided    Lambda  +  Stream  API  

•  JVM  uses  GRAAL  compiler  to  generate  HSAIL      

•  JVM  decides  at  run[me  to  execute  on  either  CPU  or  GPU  depending  on  workload  characteris[cs.    

GPU ISA

JVM  

 Java  Applica[on  

GPU  CPU  

   

HSAIL™

 HSA™  Finalizer  &  Run[me  

Java  JDK  Stream  +  Lambda  API  

Java  GRAAL  JIT  backend  

CPU ISA

We  plan  to  provide    HSA  Enabled  Aparapi  (Java  8)  

as  a  bridge  technology  between    OpenCL  based  Aparapi  (Java  7)  

 and    HSA  Enabled  Sumatra  (Java  9)  

Page 10: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

10   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  10

A  CASE  STUDY  CENTERED  ON  NBODY  

!  A  Java  developer  implemen[ng  a  sequen[al  version  of  NBody  would  probably…  ‒  Create  a  class    to  represent  each  body  

!  Loop  through  each  Body  (in  array  of  bodies[])  to  update  and  display  

class Body{ float x,y,z,m,vx,vy,vz; // Include method to update position and display void updateAndShow(Screen screen, Body[] bodies){ for (Body other:bodies){

// accumulate forces between other and this } // update vx,vy,vz,x,y and z from accumulated data screen.paint(x,y,z);

} }  

for (Body b: bodies) b.updateAndShow(screen, bodies);

Page 11: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

11   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  11

WITHOUT  HSA  WE  CAN’T  (EFFICIENTLY)  USE  OBJECTS    

!  In  Java;  allocated  Objects  are  scaVered  on  the  heap.  ‒  There  is  no  way  to  allocate  an  array  of  objects  in  con[guous  memory  (as    with  C++)  ‒ We  force  the  developer  to  resort  to  using  parallel  arrays  of  primi[ves  (which  are  con[guous)  

 ‒  And  to  infer  that      x[n],  y[n]  and  z[n]  holds  the  state  for  bodies[n].  

‒  Then  the  kernel    can  be  used  to  execute  the    code  on  the  GPU  

float x[], y[], z[], m[], vx,[], vy[], vz[];

Kernel kernel = new Kernel(){ public void run(){

int i = getGlobalId(0); for (int j=0; j<bodies.length; j++){

// accum forces between (x,y,z)[j] and (x,y,z)[i] } // update vx[j],vy[j],vz[j],x[j],y[j] and z[j]

} };

Kernel.execute(bodies.length);

Page 12: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

12   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  12

HSA  ENABLED  APARAPI  (AND  SUMATRA)  ALLOWS  USE  OF  OBJECTS  

!  So  we  code  our  Body  class  exactly  as  we  would  if  execu[ng  in  Java.    

!  Then  use  new  Aparapi  lambda  enabled  API  to  coordinate  dispatch  to  theGPU  

class Body{ float x,y,z,m,vx,vy,vz; // Include method to update position and display void updateAndShow(Screen screen, Body[] bodies){ for (Body other:bodies){

// accumulate forces between other and this } // update vx,vy,vz,x,y and z from accumulated data screen.paint(x,y,z);

} }  

Device.hsa().forEach(bodies, b -> { b.updateAndShow(screen, bodies); });

Page 13: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

13   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

OVERVIEW  OF  HSA  ENABLED  APARAPI  

! HSA  enabled  Aparapi,  at  run[me:  ‒ Step  0:  Generate  HSAIL  from  Bytecode  ‒ Step  1:  Generate  host  HSA  Run[me  calls  

‒ Step  1.1:  Ini[alize  HSA  run[me,  device,  queue  …  

‒ Step  1.2:  Finalize  HSAIL  to  generate  GPU  ISA  ‒ Step  1.3:  Bind  Java  args  to  HSA  args  ‒ Step  1.4:  Dispatch  the  kernel  ‒ Step  1.5:  Wait  for  comple[on  ‒ Repeat  steps  1.3  -­‐  1.5  for  next  itera[on  of  same  kernel  

‒ Repeat  step  0  –  1  for  each  new  kernel    

CPU ISA

JVM

Application

Aparapi

GPU CPU

Run

time

MyLambda.java

javac (compiler)

MyLambda.class

Dev

elop

men

t tim

e

Generate HSAIL

Generate HSA RT

calls

Initialize

Bind Args

Finalize

Dispatch GPU ISA

Input

Contains

Page 14: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

14   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

HIGH  LEVEL  HSA  FEATURES  

! Features  currently  being  defined  in  the  HSA  Working  Groups**  ‒ Unified  addressing  across  all  processors  ‒ Opera[on  into  pageable  system  memory  ‒ Full  memory  coherency  ‒ Pla|orm    atomics  ‒ User  mode  dispatch  

‒ Enables  fast  dispatch  with  no  driver  involvement  ‒ Architected  queuing  language  

‒ Flexible  compute  dispatch,  easier  GPU  self-­‐enqueue  ‒ High  level  language  support  for  GPU  compute  processors  ‒ Preemp[on  and  context  switching  

 **  All  features  subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

@  Copyright  2012  HSA  Founda[on.  All  Rights  Reserved.  

Page 15: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

15   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

HSA  INTERMEDIATE  LANGUAGE  (HSAIL)**  

! HSAIL  is  a  virtual  ISA  for  parallel  programs  ‒ Finalized  to  vendor-­‐specific  ISA  by  a  JIT  compiler  or  “Finalizer”  ‒ ISA  independent  by  design  for  CPU  &  GPU  

! Explicitly  parallel  ‒ Designed  for  data  parallel  programming  

! Support  for  excep[ons,  virtual  func[ons,  and  other  high  level  language  features  ! Lower  level  than  OpenCL™  SPIR  

‒ Fits  naturally  in  the  OpenCL™  compila[on  stack  

! Suitable  to  support  addi[onal  high  level  languages  and  programming  models:  ‒ Java,  C++,  OpenMP,  etc  

**  Subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  @  Copyright  2012  HSA  Founda[on.  All  Rights  Reserved.  

Page 16: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

16   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

HSAIL  OVERVIEW**  

!  Similar  to  assembly  language  for  a  RISC  CPU  ‒  Load-­‐store  architecture  

ld_global_u64 $d0, [$d6 + 120]; $d0= load($d6+120)

add_u64 $d1, $d2, 24; $d1= $d2+24

!  136  opcodes  (Java™  bytecode  has  200)  ‒  Floa[ng  point  (single,  double,  half  (f16))  ‒  Integer  (32-­‐bit,  64-­‐bit)  ‒  Some  packed  opera[ons    ‒ Branches  ‒  Func[on  calls  ‒ Pla$orm  Atomic  Opera[ons:    and,  or,  xor,  exch,  add,  sub,  inc,  dec,  max,  min,  cas  ‒  Synchronize  host  CPU  and  HSA  Component!  

!  Text  and  Binary  formats  (“BRIG”)  

 

**  Subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

!  Four  classes  of  registers  ‒ C:  1-­‐bit,  Control  Registers  ‒  S:  32-­‐bit,  Single-­‐precision  FP  or  Int  ‒ D:  64-­‐bit,  Double-­‐precision  FP  or  Long  Int  ‒ Q:  128-­‐bit,  Packed  data.  

!  Fixed  number  of  registers:  ‒ 8  C    ‒  S,  D,  Q  share  a  single  pool  of  resources  

S + 2*D + 4*Q <= 128

Up to 128 S or 64 D or 32 Q (or a blend)

!  Register  alloca[on  done  in  high-­‐level  compiler    ‒  Finalizer  doesn’t  have  to  perform  expensive  register  alloca[on  

INSTRUCTION  SET   REGISTERS  

@  Copyright  2012  HSA  Founda[on.  All  Rights  Reserved.  

Page 17: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

17   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

SEGMENTS  AND  MEMORY  **  

!  7  segments  of  memory  ‒  global,  readonly,  group,  spill,  private,  arg,  kernarg,    ‒ Memory  instruc[ons  can  (op[onally)  specify  a  segment  

!  Global  Segment  ‒ Visible  to  all  HSA  agents  (including  host  CPU)  

!  Group  Segment  ‒ Provides  high-­‐performance  memory  shared  in  the  work-­‐group.  

‒ Group  memory  can  be  read  and  wriVen  by  any  work-­‐item  in  the  work-­‐group  

‒ HSAIL  provides  sync  opera[ons  to  control  visibility  of  group  memory  

!  Spill,  Private,  Arg  Segments  ‒ Represent  different  regions  of  a  per-­‐work-­‐item  stack  ‒ Typically  generated  by  compiler,  not  specified  by  programmer  

 **  Subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

!  Kernarg  Segment  ‒ Programmer  writes  kernarg  segment  to  pass  arguments  to  a  kernel  

!  Read-­‐Only  Segment  ‒ Remains  constant  during  execu[on  of  kernel  

!  Flat  Addressing  ‒ Each  segment  mapped  into  virtual  address  space  

‒  Flat  addresses  can  map  to  segments  based  on  virtual  address  

‒  Instruc[ons  with  no  explicit  segment  use  flat  addressing  

‒ Very  useful  for  high-­‐level  language  support  (ie  classes,  libraries)  

‒ Aligns  well  with  OpenCL  2.0  “generic”  addressing  feature  

ld_global_u64 $d0, [$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] ld_kernarg_u64 $d6, [%_arg0] ld_u64 $d0,[$d6+24] ; flat

@  Copyright  2012  HSA  Founda[on.  All  Rights  Reserved.  

Page 18: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

18   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

EXAMPLE  –  BYTECODE  TO  HSAIL  GENERATION  

int in[], out[]; Device.hsa().forEach(len, i-> out[i] = in[i] * in[i] );

javac –g squares.java

0: aload_0 //out[] 1: iload_2 //i 2: aload_1 //in[] 3: iload_2 4: iaload 5: aload_1 6: iload_2 7: iaload 8: imul 9: iastore 10: return

version 0:95: $full : $large; kernel &run( kernarg_u64 %_arg0, //out[] kernarg_u64 %_arg1, //in[] kernarg_s32 %_arg2 ){ ld_kernarg_u64 $d0, [%_arg0]; ld_kernarg_u64 $d1, [%_arg1]; ld_kernarg_s32 $s2, [%_arg2]; workitemabsid_u32 $s2, 0; //i mov_b64 $d3, $d0; mov_b32 $s4, $s2; mov_b64 $d5, $d1; mov_b32 $s6, $s2; cvt_u64_s32 $d6, $s6; mad_u64 $d6, $d6, 4, $d5; ld_global_s32 $s5, [$d6+24]; mov_b64 $d6, $d1; mov_b32 $s7, $s2; cvt_u64_s32 $d7, $s7; mad_u64 $d7, $d7, 4, $d6; ld_global_s32 $s6, [$d7+24]; mul_s32 $s5, $s5, $s6; cvt_u64_s32 $d4, $s4; mad_u64 $d4, $d4, 4, $d3; st_global_s32 $s5, [$d4+24]; ret; };

Generated HSAIL

Page 19: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

19   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

APARAPI  JNI  CALL  -­‐>  HSA  RUNTIME  API  

Device  Discovery  &  Queue  Crea[on  APIs**  

!  Discover  HSA  Device  ‒ Both  count  and  device_list  are  out  params  ‒ User  can  iterate  over  HSA  devices  in  the  list  

!  User-­‐Mode  Queue  Crea[on  ‒ User  can  provide  pre-­‐allocated  buffer  ‒  If  not,  API  will  allocate  a  buffer  ‒ queue  is  the  user-­‐mode  queue  

HsaStatus  HsaGetDevices(unsigned  int  *count,                                                                  const  HsaDevice  **device_list);  

HsaStatus  HsaCreateUserModeQueue(const  HsaDevice  *device,                                                                                    void  *buffer,  size_t  buffer_size,  

     HsaQueuePriority  queue_priority,                                                                                    HsaQueueFrac[on  queue_frac[on,                                                                                    HsaQueue  **queue);    

**  All  APIs  subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

Page 20: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

20   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

APARAPI  JNI  -­‐>  HSA  RUNTIME  API  

Finalize  HSAIL  to  GPU  ISA**  

!  Transla[ng  HSAIL  text  to  Binary  (BRIG)  ‒ BRIG  is  a  binary  container  for  several  sec[ons  

‒  Code  ‒  String  ‒ Direc[ve  ‒ …  

‒  libHsail  is  an  assembler/disassembler  ‒  This  is  a  standalone  compiler  library  ‒ Not  part  of  Run[me  

!  Finalize  Brig  to  IHV  specific  GPU  ISA  ‒  Input:  Brig  ‒ Output:  HsaKernelCode  which  contains  ISA  

**  All  APIs  subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

Status  Assemble  (const  char*  hsail_text,  HsaBrig  *brig);  

HsaStatus  HsaFinalizeBrig(const  HsaDevice  *device,                                                                        HsaBrig  *brig,                                                                        const  char  *kernel_name,                                                                        const  char  *op[ons,                                                                        HsaKernelCode  **kernel);  

Page 21: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

21   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

APARAPI  JNI  -­‐>  POPULATION  OF  AQL  DISPATCH  PACKET  

!  AQL  Dispatch  Packet**  ‒ Header  enables:  

‒ Different  packet  types  ‒  Specify  if  this  packet  should  wait  for  all  previous  to  complete  

‒  Control  visibility  of  data  and  memory  fences  before  and  a�er  dispatch  

‒ Body  enables:  ‒  Specify  the  problem  fan  out  using  launch  config  related  fields  

‒ How  much  workgroup  memory?  ‒  Loca[on  of  IHV  specific  GPU  ISA  ‒  Loca[on  of  where  kernelargs  can  be  found  ‒  A  signal  mechanism  to  wait  on  kernel  comple[on  

!  Only  popula[ng  Kernel  info  and  signal  are  opaque,  so  require  run[me  APIs  ‒ Other  fields    are  open,  so  simple  assignments  

**  Subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

typedef  struct  HsaAqlDispatchPacket  {  uint32_t  format  :  8;  uint32_t  barrier  :  1;  uint32_t  acquire_fence_scope  :  2;  uint32_t  release_fence_scope  :  2;  uint32_t  invalidate_instruction_cache  :  1;  uint32_t  invalidate_roi_image_cache  :  1;  uint32_t  dimensions  :  2;  uint32_t  reserved  :  15;  uint16_t  workgroup_size[3];  uint16_t  reserved2;  uint32_t  grid_size[3];  uint32_t  private_segment_size_bytes;  uint32_t  group_segment_size_bytes;  uint64_t  kernel_object_address;  uint64_t  kernel_arg_address;  uint64_t  reserved3;  uint64_t  completion_signal;  }  HsaAqlDispatchPacket;  

Header  Fields  

Launch  Config  

Kernel  Info  

Kernel  SynchronizaAon  

Page 22: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

22   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

POPULATING  KERNEL  INFO  AND  SIGNAL  USING  HSA  RT  API**  

HsaStatus  HsaFinalizeBrig(const  HsaDevice  *device,                                                                        HsaBrig  *brig,                                                                        const  char  *kernel_name,                                                                        const  char  *op[ons,                                                                        HsaKernelCode  **kernel);  

typedef  struct  HsaKernelCode  {        …        uint32_t  workitem_private_segment_byte_size;        uint32_t  workgroup_group_segment_byte_size;        uint64_t  kernarg_segment_byte_size;              …  }  HsaKernelCode;  

 

**  Subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

typedef  struct  HsaAqlDispatchPacket  {        …        uint32_t  private_segment_size_bytes;        uint32_t  group_segment_size_bytes;        uint64_t  kernel_object_address;        uint64_t  kernel_arg_address;        …        uint64_t  completion_signal;  }  

Pack  Java  Args  into  a  vector  in  JNI  

HsaStatus  HsaRegisterSystemMemory(void  *address,  size_t  size);  

Register  vector  data  address  

HsaStatus  HsaCreateSignal(HsaSignal  *signal);  

Page 23: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

23   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

DISPATCH  AND  WAIT  ON  KERNEL  COMPLETION  

!  Dispatch  ‒  Submit  AQL  Packet  into  the  HsaQueue  ‒ Thread  safe  API  

!  Wait  on  Kernel  Comple[on  

**  Subject  to  change,  pending  comple[on  and  ra[fica[on  of  specifica[ons  in  the  HSA  Working  Groups  

HsaStatus  HsaSubmitAql(HsaQueue  *queue,HsaAqlDispatchPacket  *aql_packet);  

bool  is_done  =  false;  while  (!is_done)  {          status  =  HsaQuerySignal(signal,  &is_done);          assert(status  ==  kHsaStatusSuccess);  }  

!  A�er  comple[on,  disposing  HSA  resources  ‒ Release  queue  ‒ Release  signal  ‒ Release  Kernel  object  ‒ Deregister  kernel  args  related  memory  

HsaStatus  HsaDestroyUserModeQueue(HsaQueue  *queue);  HsaStatus  HsaDestroySignal(HsaSignal  signal);  HsaStatus  HsaFreeKernelCode(HsaKernelCode  *kernel);  HsaStatus  HsaDeregisterSystemMemory(void  *address);    

Page 24: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

24   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

DEMO  

Page 25: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

25   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

SUMMARY  

!  Aparapi  is  already  an  establish  framework  for  simplifying  execu[on  of  Java  on  GPU  devices  

!  HSA  enabled  Aparapi  further  simplifies  GPU  accelera[on  of  Java  applica[ons  ‒ Aligns  with  Java  8  features  to  support  ‘lambda’  expression  for  compactness  ‒ Enables  ‘large  unified’  system  memory  for  GPU  accelera[on  ‒ Eases  programming  by  enabling  direct  access  to  Java  objects  on  heap  ‒ Enables  fast  offload  of  Java  kernels  through  User-­‐mode  queue  and  AQL  

!  HSA  enabled  Aparapi  lends  to  more  interes[ng  future  possibili[es  ‒  Simplified  communica[on  and  workload  balancing  across  both  CPU  and  GPU  ‒ Exploit  new  computa[on  paVerns  and  recursions  through  kernel  self-­‐enqueue    

Page 26: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

26   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

QUESTIONS  &  ANSWERS?  

Page 27: CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

27   |      HSA  ENABLEMENT    OF  APARAPI      |  NOVEMBER  2013|  

DISCLAIMER  &  ATTRIBUTION  

The  informa[on  presented  in  this  document  is  for  informa[onal  purposes  only  and  may  contain  technical  inaccuracies,  omissions  and  typographical  errors.    

The  informa[on  contained  herein  is  subject  to  change  and  may  be  rendered  inaccurate  for  many  reasons,  including  but  not  limited  to  product  and  roadmap  changes,  component  and  motherboard  version  changes,  new  model  and/or  product  releases,  product  differences  between  differing  manufacturers,  so�ware  changes,  BIOS  flashes,  firmware  upgrades,  or  the  like.  AMD  assumes  no  obliga[on  to  update  or  otherwise  correct  or  revise  this  informa[on.  However,  AMD  reserves  the  right  to  revise  this  informa[on  and  to  make  changes  from  [me  to  [me  to  the  content  hereof  without  obliga[on  of  AMD  to  no[fy  any  person  of  such  revisions  or  changes.    

AMD  MAKES  NO  REPRESENTATIONS  OR  WARRANTIES  WITH  RESPECT  TO  THE  CONTENTS  HEREOF  AND  ASSUMES  NO  RESPONSIBILITY  FOR  ANY  INACCURACIES,  ERRORS  OR  OMISSIONS  THAT  MAY  APPEAR  IN  THIS  INFORMATION.    

AMD  SPECIFICALLY  DISCLAIMS  ANY  IMPLIED  WARRANTIES  OF  MERCHANTABILITY  OR  FITNESS  FOR  ANY  PARTICULAR  PURPOSE.  IN  NO  EVENT  WILL  AMD  BE  LIABLE  TO  ANY  PERSON  FOR  ANY  DIRECT,  INDIRECT,  SPECIAL  OR  OTHER  CONSEQUENTIAL  DAMAGES  ARISING  FROM  THE  USE  OF  ANY  INFORMATION  CONTAINED  HEREIN,  EVEN  IF  AMD  IS  EXPRESSLY  ADVISED  OF  THE  POSSIBILITY  OF  SUCH  DAMAGES.  

 

ATTRIBUTION  

©  2013  Advanced  Micro  Devices,  Inc.  All  rights  reserved.  AMD,  the  AMD  Arrow  logo  and  combina[ons  thereof  are  trademarks  of  Advanced  Micro  Devices,  Inc.  in  the  United  States  and/or  other  jurisdic[ons.  OpenCL  is  a  trademark  of  Apple  Inc.    HSA  is  a  trademark  of  the  Heterogeneous  System  Architecture  Founda[on.  Other  names  are  for  informa[onal  purposes  only  and  may  be  trademarks  of  their  respec[ve  owners.