
Peter Calvert

Parallelisation of Java for Graphics Processors

Computer Science Tripos, Part II

Trinity College

May 11, 2010


Proforma

Name: Peter Calvert

College: Trinity College

Project Title: Parallelisation of Java for Graphics Processors

Examination: Computer Science Tripos, Part II, June 2010

Word Count: 11983 words

Project Originator: Peter Calvert

Supervisors: Dr Andrew Rice and Dominic Orchard

Original Aims of the Project

The aim of the project was to allow extraction and compilation of Java virtual machine bytecode for parallel execution on graphics cards, specifically the NVIDIA CUDA framework, by both explicit and automatic means.

Work Completed

The compiler that was produced successfully extracts and compiles code from class files into CUDA C++ code, and outputs transformed classes that make use of this native code. Developers can indicate loops that should be parallelised by use of Java annotations. Loops can also be automatically detected as 'safe' using a dependency checking algorithm.

On benchmarks, speedups of up to a factor of 187 were measured. Evaluation of the automatic dependency analysis showed 85% accuracy over a range of sample code.

Special Difficulties

None.


Declaration

I, Peter Calvert of Trinity College, being a candidate for Part II of the Computer Science Tripos, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.

Signed

Date


Contents

1 Introduction
  1.1 Motivation
  1.2 Project Description
  1.3 Related Work
    1.3.1 JavaB [4]
    1.3.2 Within JikesRVM [16]
    1.3.3 JCUDA [25]

2 Preparation
  2.1 Requirements Analysis
    2.1.1 Extensions
  2.2 Development Process
  2.3 Methods of Testing
  2.4 Development Environment
  2.5 The Java Platform
    2.5.1 State
    2.5.2 Performance
  2.6 NVIDIA CUDA Architecture
    2.6.1 Thread Model
    2.6.2 Memory Model
  2.7 Common Compiler Analysis Techniques
    2.7.1 General Dataflow Analysis
    2.7.2 Loop Detection
    2.7.3 Live Variable Analysis
    2.7.4 Constant Propagation
    2.7.5 Data Dependencies
  2.8 Summary

3 Implementation
  3.1 Overall Implementation Structure
  3.2 Internal Code Representation (ICR)
    3.2.1 Code Graph
    3.2.2 Visitor Pattern
    3.2.3 Bytecode to ICR Translation
    3.2.4 Type Inference
  3.3 Dataflow Analysis
    3.3.1 Support for Arrays and Objects
    3.3.2 Increment Variables
    3.3.3 May-Alias
    3.3.4 Usage Information
  3.4 Loop Detection
    3.4.1 Loop Trivialisation
  3.5 Kernel Extraction
    3.5.1 Copy In
    3.5.2 Copy Out
  3.6 Dependency Analysis
    3.6.1 Annotation Based
    3.6.2 Automatic
  3.7 Code Generation
    3.7.1 C++
    3.7.2 Kernel Invocation
    3.7.3 Data Copying
  3.8 Compiler Tool
    3.8.1 Feedback to the User
  3.9 Summary

4 Evaluation
  4.1 Correctness
  4.2 Performance
    4.2.1 Model of Overheads
    4.2.2 Component Benchmarks
    4.2.3 Java Grande Benchmark Suite [7]
    4.2.4 Mandelbrot Set Computation
    4.2.5 Conway's Game of Life
    4.2.6 Summary
  4.3 Accuracy of Dependency Analysis
  4.4 Comparison with Existing Work
  4.5 Summary

5 Conclusions
  5.1 Comparison with Requirements
  5.2 Future Work
    5.2.1 Further Hardware Support
    5.2.2 Further Optimisations
    5.2.3 Further Automatic Detection
  5.3 Final Conclusions

Bibliography

A Dataflow Convergence Proofs
  A.1 General Dataflow Analysis
  A.2 Live Variable Analysis
  A.3 Constant Propagation

B Code Generation Details

C Command Line Interface

D Sample Code Used
  D.1 Java Grande Benchmark Suite
  D.2 Mandelbrot Computation
  D.3 Conway's Game of Life

E Testing Gold Standards

F Class Index

G Source Code Extract

H Project Proposal


List of Figures

1.1 Build process.
2.1 Iterative development process.
2.2 Development environment.
2.3 Software model of threads under CUDA.
2.4 CUDA hardware architecture.
2.5 Various examples of loops.
3.1 Outline call graph for main classes.
3.2 Garbage collection of unreachable blocks.
3.3 Unification algorithm.
3.4 Outline of kernel extraction algorithm.
3.5 Form of multiple dimension kernels.
3.6 Array and object type templates for on-GPU execution.
4.1 Effect on copy performance (host-to-device) of single vs. multiple allocations.
4.2 Comparison of measured performance with model (using CUDA SDK).
4.3 Values of td and th for measurements (using CUDA SDK).
4.4 Fit of model (green) to component benchmarks.
4.5 Fit of model to Fourier Series benchmark, using previously calculated parameters.
4.6 Fit of model to Mandelbrot benchmark, using previously calculated parameters.
4.7 Speedups and overhead for Mandelbrot benchmark with fixed iteration limit (250).
4.8 Speedups and overhead for Mandelbrot benchmark with fixed grid size (8000 × 8000).
4.9 Overall times for simulation of Conway's Game of Life.
5.1 Minimum finding algorithms.
D.1 3 generations of the Game of Life.


List of Tables

2.1 CUDA memory spaces.
3.1 Summary of JVM instructions and their internal representation.
3.2 Unification details.
4.1 Tests made for each compiler state.
4.2 Expected timings for overhead stages according to model.
4.3 Model of overheads for component benchmark versions.
4.4 Model parameters, as measured using component benchmarks.
4.5 Speedup factors for the component benchmarks.
4.6 Summary of speedup factors.
4.7 Comparison of Java Grande benchmark timings with JCUDA.
D.1 Summary of Section 2 of the Java Grande Benchmark Suite.


List of Examples

1.1 Mandelbrot Set computation (kernel highlighted)
2.1 Example of thread divergence.
3.1 Graph for Mandelbrot computation
3.2 UML sequence diagram for Visitor pattern operation.
3.3 Basic block that causes difficulties when exporting.
3.4 Reuse of local variable locations.
3.5 Results from increment variable analysis computation.
3.6 Example inter-procedural may-alias computation.
3.7 Non-termination of may-alias analysis.
3.8 Mandelbrot control flow graph after various stages of loop detection.
3.9 Examples of the automatic dependency check.
3.10 C++ code generation for float Cr = (x * spacing - 1.5f);


Acknowledgements

Much thanks is owed to everyone who has given me guidance, feedback and encouragement throughout this project. Specifically, my two supervisors, Dr Andrew Rice and Dominic Orchard, have been invaluable in advising me at tricky points. I owe particular thanks to Andy, who stopped me from naïvely attempting an even more ambitious project!


CHAPTER 1
Introduction

This chapter explains the motivation for using parallel architectures, before describing the scope of this project. I also provide a short overview of other relevant work, and highlight the differences between these and the approach taken here.

1.1 Motivation

In the past, improvements in processor performance have taken the form of increased clock speeds. However, since 2002, developments have come from the use of multiple processors to solve independent parts of a problem in parallel [24]. Commodity parallel processing is now available not only as multi-core CPUs, but also graphics processors (GPUs) that allow many more threads to be executed in parallel with the restriction that they share a program counter, i.e. single instruction multiple data (SIMD).

Unfortunately, most existing code is sequential, so the performance gains from executing it on parallel architectures are limited. Often, it must be rewritten to benefit. Automatic parallelisation aims to address this by analysing sequential code during compilation, and identifying regions that can be executed in parallel.

However, determining whether dependencies exist between two regions of code is undecidable in the general case [15]. Therefore any analysis must be approximate to some extent, and developers may find that small changes in code result in disproportionate changes in performance. This suggests that a mix between explicit and automatic parallelism might be desirable, with detailed feedback in the automatic case being an important feature.


Figure 1.1: Build process (Java, Scala or other source is compiled by javac, scalac, etc. to bytecode plus libraries; the parallelising compiler transforms this into new bytecode and native code, which then run on the JVM).

1.2 Project Description

This project focuses on the data parallel, or SIMD, pattern used on graphics processors. For NVIDIA graphics devices, this is provided by extensions to C++ in their CUDA framework [20]. However, this framework and similar cross-platform APIs, such as OpenCL, operate at a low level, with developers manually handling data transfers and 'kernel' invocations. Ports of CUDA to other languages generally still require kernels to be written in C++ (e.g. Py-CUDA [14] for Python, and JCUDA [25] for Java).

This project allows these graphics processors to be used from a high level language through both explicit annotations (parallel for loops) and automatic analysis. For reasons of familiarity, I consider the Java Virtual Machine (JVM), although similar techniques could be applied to other virtual machines such as Microsoft's Common Language Runtime. By operating at the bytecode level, no modifications are made to the syntax of Java, and the compiler should work with languages other than Java that compile onto the JVM. The compiler fits in as an additional step in the build process (Figure 1.1), taking a class file (compiled bytecode) as input and producing a replacement along with any required supplementary files. For clarity, this report gives examples in Java rather than bytecode whenever possible.

One example used throughout this report is the computation of the Mandelbrot Set (Example 1.1). The parallelising compiler extracts lines 4 to 16 (highlighted) as a two dimensional kernel that can be executed in parallel on the GPU.

1.3 Related Work

Parallel computation is currently a huge research field. There have been many attempts at both intuitive frameworks and effective automatic analysis. Approaches for parallelising Java have included both static analyses similar to my work, and also direct ports of CUDA.


     1  public void compute() {
     2    for (int y = 0; y < size; y++) {
     3      for (int x = 0; x < size; x++) {
     4        float Zr = 0.0f, Zi = 0.0f;
     5        float Cr = (x * spacing - 1.5f), Ci = (y * spacing - 1.0f);
     6        float ZrN = 0, ZiN = 0;
     7        int i;
     8
     9        for (i = 0; (i < ITERATIONS) && (ZiN + ZrN <= LIMIT); i++) {
    10          Zi = 2.0f * Zr * Zi + Ci;
    11          Zr = ZrN - ZiN + Cr;
    12          ZiN = Zi * Zi;
    13          ZrN = Zr * Zr;
    14        }
    15
    16        data[y][x] = (short) ((i * 255) / ITERATIONS);
    17      }
    18    }
    19  }

Example 1.1: Mandelbrot Set computation (kernel: lines 4 to 16)

1.3.1 JavaB [4]

Developed by Aart Bik in 1997, this work adopts a similar transformation approach to that of my project, although it targets multiple CPUs rather than GPUs. It detects regions of code that can be executed in parallel, and produces a modified class file that uses Java threads to exploit the parallelism. The detection is partially automatic, with user input to make the analysis more accurate. However, this input is at the level of 'do variables x and y alias?', not 'should this loop be run in parallel?', and is specified to the compiler rather than in source code.

1.3.2 Within JikesRVM [16]

This recent work (2009) implements automatic analysis within the JikesRVM virtual machine (originally Jalapeño [2]), operating on intermediate code in a similar manner to this project. It has the advantage over a compile-time approach that all applications are modified, but requires that users install a specific virtual machine. Unlike static approaches, it has access to runtime information. However, it cannot provide compile-time feedback, possibly resulting in unpredictable performance. The benchmarks were all written by the author, and therefore it is hard to know how effective the analysis might be on more typical code and full applications.

1.3.3 JCUDA [25]

This paper, also published in 2009, details a partial port of CUDA to an extended Java syntax, providing the same low level interface to invoke kernels and copy data. Kernels must still be written in C++. This gives an unusual mix of Java's high level approach with low level exposure to hardware. This project's approach of using annotations is preferable, since the source code may still be compiled with a standard Java compiler, which simply ignores the parallel annotations.

Their performance results, based on hand written CUDA versions of the Java Grande benchmarks [7], give a reference point for possible speedups (assuming similar hardware).

This work should not be confused with a library of the same name. The jCUDA library provides access to a number of numerical routines, written using CUDA, from within Java.


CHAPTER 2
Preparation

In order to complete this project successfully and develop a compiler that produced correct results with real benefits, it was crucial to be clear from the outset what was required and to have a sensible plan for achieving this. This chapter documents this process, and introduces the key concepts and theory on which the compiler implementation is based.

2.1 Requirements Analysis

Given the large array of possible directions that this project could have taken, there was a real need to set clear goals and requirements. For any inherently technical software, such as a compiler, it is difficult to separate what it must achieve from how this might be done. Requirements analysis aims to concentrate on the first of these, setting out goals that can be verified objectively.

The following core requirements (C1–C6) are derived from the success criteria set out in the project proposal. They are made from the perspective of a developer with no knowledge of compiler internals, and should capture their expectations. This gives the first property to which all compilers must adhere:

C1. Correctness: Application of the compiler to JVM bytecode should not affect the results of the code in any significant way.

Moving to the specific requirements on this project, the user should be able to gain tangible benefits from using the compiler relatively easily. As it is an optional step in the build process, this is required to warrant its inclusion.


C2. Performance: It must be possible to achieve improvements in execution time by using the compiler.

C3. Usage: Any code modifications required to achieve speedups must be minimal and transparent to standard compilers. These modifications must make it possible to specify that a for loop is run in parallel. Furthermore, if multiple tightly-nested loops are specified, the inner body should be run in parallel across each of the dimensions.

For these benefits to be observed consistently, they should apply as universally as possible:

C4. Scope: Ideally, it should be possible for any JVM instructions to be executed on the graphics processor. However, GPU architectures place some restrictions on what is possible, and for the core of the project, support is restricted to use of basic arithmetic on primitive local variables and arrays. This notably excludes support for exceptions, monitors and objects.

For various reasons, code specified for parallel execution may not be executable as such. In this case, it is important that sufficient feedback is given:

C5. Feedback: There must be varying levels of output available that indicate reasons if certain regions of code were not appropriate for parallel execution.

The final requirement on the core of the project ensures that the above can be verified objectively by the developer:

C6. Verifiable: Supplementary tools and pools of example code must be made available so that developers can evaluate the compiler objectively.

2.1.1 Extensions

The project proposal also outlines several areas where the project might be extended. These are formally set out below so that they can be assessed in the evaluation of the project.

E1. Automatic Detection of Loop Bounds: The number of iterations of a loop should be inferred from the bytecode, without any user input.

E2. Automatic Dependency Checking: The compiler should detect, with little help from annotations, regions of code for parallel execution.


Figure 2.1: Iterative development process (a cycle of evaluation/design, prototyping, refactoring, implementation and testing).

E3. Runtime Checks: Some annotations (for example, any introduced by E2) should be replaced with runtime checks (as in [16]) that can determine whether to execute the kernel in parallel, or to use the original CPU code.

E4. Support for Objects on GPU: It would be useful to include object-oriented code in parallel regions, within the scope of what the graphics processor supports.

E5. Further Code Optimisation: Some optimisations that neither the virtual machine nor the GPU compiler can make (due to splitting the code between the CPU and GPU) should be reimplemented (e.g. loop invariant code motion).

E6. Code Transformations: In cases where code is not suitable for parallel execution, it may be possible to modify the code, perhaps by splitting loops into parallelisable and non-parallelisable chunks (loop fission) or by matching common patterns (e.g. minimum finding).

2.2 Development Process

The development process adopted for this project was based on an iterative style similar to the Spiral Model [5]. This enabled compiler stages to be tested on real class files as early in the timetable as possible. From this position, iterations consisted of the following main stages (Figure 2.1):

1. Evaluation of which stage or feature should be implemented next, based on measurements and observations indicating which was most applicable.

2. Refactoring of existing code to allow the new feature to be integrated naturally.

3. Implementation of the new stage or feature.


4. Testing on an increasing pool of sample code, and fixing compiler bugs in order that more code could be compiled correctly.

In this way, feedback from each iteration informed future development, avoiding wasted time on unnecessary features. The process also suited the integrated testing strategy described in the next section.

A slight deviation compared with Boehm's original Spiral model is the omission of a separate prototyping stage between 1 and 2. This was primarily due to time constraints. However, some prototype code was written prior to starting the main implementation in order to experiment with suitable internal representations (Section 3.2).

In comparison, the classical Waterfall method would have delayed integrated tests until the later stages of the project, preventing benchmarks and measurements from directing design decisions such as selection of extensions.

2.3 Methods of Testing

Given the importance of maintaining correctness (Requirement C1), it seemed natural that testing should include full integration tests over the whole compiler. The first development iteration allowed a subset of JVM bytecode to be imported into an internal representation, and re-exported to a new class file. As more stages were added, these tests were rerun (i.e. regression testing) to ensure that correctness was maintained, and new samples were added to test new features.

The integration tests consisted of a range of self-testing Java code to test the compiler from both a black box and a white box perspective. The first of these could only be done by using code written by other developers, such as the Java Grande Benchmark Suite [7]. The white box testing was done by specific examples written to cover different features of the compiler.

At a finer granularity, analysis stages of the compiler were unit tested by comparing their results, for the same pool of sample code, to a gold standard produced manually.

2.4 Development Environment

The overall development environment is presented in Figure 2.2. Here I highlight some key aspects of this and the decisions made.

Language. The Java language was the natural choice for implementing the compiler due to familiarity, along with some use of C++ as required by CUDA.


Figure 2.2: Development environment. (Development machine holding the working copy; master SVN repository on the Public Workstation Facility, replicated on every commit to a duplicate repository on the SRCF via SVN over SSH with key authentication, with UCS backups; Computer Laboratory file server for test results; test machines earlybird, a shared workstation with a Core 2 Quad 2.66GHz, 8GB RAM and a GeForce 9600 GT (512MB global memory, 6 multiprocessors), and bing, a dedicated machine with 2× Pentium 4 3.20GHz, 1GB RAM and a GeForce GTX 260 (896MB global memory, 27 multiprocessors, double precision support); tools used included the Sun JDK 1.6.0.18, NetBeans, LaTeX, TikZ, SQLite, gnuplot and Matlab.)

The availability of the ASM [6] library for reading and writing class files also influenced this decision. Note that GCC 4.3.3 was used rather than the more recent GCC 4.4 due to compatibility issues with CUDA 2.3.

Version Control. Subversion was used for storing all project files. This allowed changes to be rolled back, and code to be transferred between machines in a coherent manner. Since this dissertation was written in LaTeX, with diagrams written in TikZ and graphs produced using gnuplot and shell scripts, most binary files could be reproduced and did not need to be stored.

Backups. These were predominantly provided by the regular PWF backups made by the University Computing Service. The copy replicated on the SRCF1 was intended to guard against accidental deletion of the master repository and to reduce downtime if the PWF became unavailable. The Computer Laboratory filespace was only used during testing, with results being transferred to the working copy immediately, and therefore did not need backing up.

Testing Hardware. Two machines with compatible graphics cards were available (earlybird and bing). Since the resources on earlybird were shared with other users and an X server, bing was generally preferred.

1 Student Run Computing Facility (http://www.srcf.ucam.org/)


2.5 The Java Platform

The Java language and corresponding virtual machine were developed in the 1990s, and made up the first mainstream platform of their type. More recent alternatives, such as the Common Language Runtime, have used hindsight to improve the design in some areas. However, Java remains commonly used, and compilers that target the JVM are still developed by third parties for other languages.

The virtual machine is stack-based, and its instruction set can be considered RISC-like2 and mostly orthogonal (i.e. each instruction is available for each type). The features below are key for this project. Java also supports garbage collection, objects, synchronisation monitors and exceptions.

Annotations. These have been available since Java 1.5 and are maintained in the compiled bytecode. They have been used widely to allow tools to modify and instrument bytecode after compilation. Source code utilising annotations also remains compatible with a standard compiler.

Native Interface (JNI). This offers the facility for using code written in other languages, which may make use of system calls not abstracted by the Java libraries, at the cost of portability. JNI specifies [18] the format of shared libraries that implement 'native' methods and the functions that allow interaction with Java objects and code.
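A minimal sketch of the Java side of such an interface follows; the library and method names are illustrative only, and the matching shared library would export a function named according to the JNI convention (here Java_VectorOps_scale):

    class VectorOps {
        static {
            // Loads libvectorops.so (or vectorops.dll) from java.library.path.
            System.loadLibrary("vectorops");
        }

        // Declared native: the implementation lives in the shared library and is
        // resolved through JNI when the method is first called.
        static native void scale(float[] data, float factor);
    }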

2.5.1 State

Data within the JVM can exist in four locations (from the perspective of bytecode): the operand stack, the local variable stack, static variables and the heap. All instruction operands are taken from the operand stack, and results are pushed onto this. Local variables and statics can be used to store any of the 'primitive' datatypes3 [19]. Objects and arrays reside in the heap, and are identified by references. Monitor synchronisation support and exception handling also introduce state associated with control flow.

2 Reduced instruction set computers (RISC) provide only common instructions, choosing to optimise these rather than offering more complex instructions.
3 boolean, byte, short, int, long, float, double, char and references.
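As a small illustration of how these locations are used, the bytecode for a simple arithmetic method moves values between the local variable stack and the operand stack (instruction names shown in comments; local 0 holds this for an instance method):

    class StackDemo {
        int sum(int a, int b) {
            int c = a + b;   // ILOAD_1, ILOAD_2  push locals a and b onto the operand stack
                             // IADD              pop two ints, push their sum
                             // ISTORE_3          pop the result into local variable c
            return c;        // ILOAD_3, IRETURN
        }
    }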


2.5.2 Performance

Originally, JVMs interpreted bytecode at runtime, causing very poor performance. Whilst this is still a common belief, since the introduction of Just-In-Time (JIT) compilation there have been studies suggesting that performance is comparable to that of C and C++ [17]. In some cases, the studies even show that Java can take advantage of runtime information to outperform C.

It was suggested in the late 1990s that Java might be an appropriate language for future high performance computing (HPC) applications [23]. Whilst this has never materialised, a recent study concludes that in most cases there is no reason why Java shouldn't be used for computationally expensive applications, although they do note that there are significant overheads in communication intensive applications [3].

2.6 NVIDIA CUDA Architecture

When released in 2007, CUDA was one of the only general purpose frameworks for graphics processors. Previously, general purpose computation had to be formulated as graphics operations [22]. CUDA supports many programming constructs, including conditional and looping control flow, although it does lack support for recursion and virtual function lookups [20, Appendix B.1.4].

Operations that are invoked on the GPU are executed asynchronously from the perspective of the CPU code. There are therefore some useful constructs provided in the CUDA API that allow accurate timing of operations.

The framework is based on C++ with keywords for specifying whether functions should be compiled for the GPU or host, and in which memory space variables should be stored. It also adds a syntax for invoking kernels. With each new version of CUDA, the provided compiler (nvcc) moves closer to supporting all the features of C++.

2.6.1 Thread Model

The threading model exposed to software by CUDA is illustrated in Figure 2.3. Each thread must execute the same code, but is given coordinates so that it may operate on different data. The two level approach is due to the hardware architecture, which also places limits on the dimensions of both grids and blocks.

Figure 2.3: Software model of threads under CUDA (a grid of blocks, each block containing many threads).

The CUDA hardware architecture is shown in Figure 2.4. Each block is assigned to a multiprocessor which contains 8 processors, each executing 4 threads. As such, 32 threads can be executed concurrently in each block, in what is called a warp. There are therefore advantages to ensuring that the number of threads in a block is a multiple of this. It is also worth noting that there is only one double precision unit per multiprocessor, and as such there is a significant penalty for performing double precision arithmetic.

Since each processor within a multiprocessor must execute the same instructions, there is a performance hit whenever threads within a single warp diverge. This occurs when two or more threads take different paths through the control flow graph, as in Example 2.1. In this case, the hardware must execute the different paths sequentially.

CUDA also provides primitives for synchronization between threads; however, these are not used in this project. Without these, the thread model can be considered as a parallel for loop over a number of dimensions.
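The correspondence can be sketched as index arithmetic (illustrative names only; in practice the block and thread coordinates and the launch configuration are supplied by the CUDA runtime rather than computed in Java):

    class ThreadModelSketch {
        // Host side: enough blocks to cover the iteration space, rounded up.
        static int blocksNeeded(int size, int blockSize) {
            return (size + blockSize - 1) / blockSize;   // ceiling division
        }

        // Per-thread view: recover the loop indices from block and thread coordinates,
        // and skip the body for threads that fall outside the iteration space.
        static void iteration(int blockIdxX, int blockIdxY, int threadIdxX, int threadIdxY,
                              int blockSize, int sizeX, int sizeY) {
            int x = blockIdxX * blockSize + threadIdxX;
            int y = blockIdxY * blockSize + threadIdxY;
            if (x < sizeX && y < sizeY) {
                // ... body of the parallel loop for iteration (x, y) ...
            }
        }
    }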

2.6.2 Memory Model

The hardware model also has implications for the software memory model. As shown in Figure 2.4, there are a variety of memory areas, each with different properties, as summarised in Table 2.1.

When memory accesses are consecutive within a warp (i.e. thread i is reading arr[n + i]), the hardware can coalesce these into fewer memory accesses that utilise the full width of the memory bus.

It is worth noting that often there is less memory available on the GPU than the host. Therefore computations offloaded to the GPU may fail 'early', or be forced to revert to CPU execution, giving a 'wall' in performance.


Figure 2.4: CUDA hardware architecture. Host memory connects over the PCI-e bus to device (global) memory; the device contains up to about 30 multiprocessors, each with 8 processors and their registers, shared memory, an instruction unit, a constant cache and a texture cache. (Based on a figure used in various NVIDIA presentations.)

    if (index & 1) s[index >> 1] = sin(in[index >> 1]);
    else           c[index >> 1] = cos(in[index >> 1]);

    if (index < W) s[index]     = sin(in[index]);
    else           c[index - W] = cos(in[index - W]);

index ranges between 0 and 2W − 1, where W is a multiple of the warp size. The second case runs roughly twice as fast, since there is no thread divergence (for W = 51200, the timings are 0.102ms and 0.043ms respectively).

Example 2.1: Example of thread divergence.


Memory      Location   Cached   Access       Scope                    Size5
Registers   On-chip    N/A6     Read/write   One thread               16384
Shared      On-chip    N/A6     Read/write   All threads in a block   16KB
Local       Off-chip   No       Read/write   One thread               (resides in global memory)
Global      Off-chip   No       Read/write   All threads and host     896MB
Texture     Off-chip   Yes      Read         All threads and host     (resides in global memory)
Constant    Off-chip   Yes      Read         All threads and host     64KB

5 Sizes for a GeForce GTX 260 card.
6 Neither registers nor shared memory need a cache, since both are accessed within a single clock cycle.

Table 2.1: CUDA memory spaces.

2.7 Common Compiler Analysis Techniques

In this section, I introduce some common methods used within compilers [1], and indicate why each is applicable.

2.7.1 General Dataflow Analysis

Dataflow analysis describes a common framework used for determining properties of programs [13], such as which variables must be transferred to the graphics processor (Section 2.7.3) and the behaviour of writes to variables (Sections 2.7.4 and 3.3.2).

The result of an analysis for an instruction or block of code, R(b) ∈ X, is given by Equation 2.1, where (X, ⊑) is a complete lattice.

Definition 1. A complete lattice is a partially ordered set in which every subset has a unique least upper bound (its join or lub) and a unique greatest lower bound (its meet or glb). We denote the join of the whole set as ⊤ and the meet as ⊥.

The function children(n) is usually defined to be either the predecessor set (forward analysis) or the successor set (backward analysis), with F_init giving the value at entry points or exits respectively. The combining operator ⨅ can be chosen either as the join (lub) or meet (glb) operator. F_b : X → X is the transfer function that alters a result in accordance with the instruction or block b.

    R(b) = F_b( ⨅_{c ∈ children(b)} R(c) )    if children(b) ≠ ∅
    R(b) = F_b(F_init)                        if children(b) = ∅        (2.1)


Figure 2.5: Various examples of loops: (a) single entry/single exit; (b) multiple entries; (c) multiple exits.

Since the control flow graph may be cyclic (due to loops), R(b) must be computed iteratively until a fixed point solution is reached. Initially, each R(b) is set to the least element ⊥ ∈ X. The number of iterations until convergence depends on the order in which instructions and blocks are considered. For forward analysis, they should be considered from start to end, and for backward the converse. For each specific dataflow analysis, it is necessary to prove that the analysis will converge. This can be shown to be a consequence of (X, ⊑) having finite height, and F_b being monotone. The proof of this, and also convergence of the specific analyses that follow, is given in Appendix A.
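A minimal sketch of this iterative computation is given below; the Lattice and TransferFunction interfaces and the DataflowSolver class are illustrative stand-ins rather than the compiler's own analysis.dataflow classes, and the block type B is kept generic:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    interface Lattice<X> {
        X bottom();              // least element, used to initialise every R(b)
        X combine(X a, X b);     // the chosen join (lub) or meet (glb) operator
    }

    interface TransferFunction<B, X> {
        X apply(B block, X input);   // F_b
    }

    class DataflowSolver<B, X> {
        Map<B, X> solve(List<B> blocks,              // visited in forward or backward order
                        Map<B, List<B>> children,    // predecessors or successors of each block
                        Lattice<X> lattice,
                        TransferFunction<B, X> f,
                        X init) {                    // the F_init value for entry/exit blocks
            Map<B, X> result = new HashMap<>();
            for (B b : blocks) result.put(b, lattice.bottom());

            boolean changed = true;
            while (changed) {                        // iterate to a fixed point
                changed = false;
                for (B b : blocks) {
                    List<B> cs = children.get(b);
                    X in;
                    if (cs == null || cs.isEmpty()) {
                        in = init;
                    } else {
                        in = result.get(cs.get(0));
                        for (int i = 1; i < cs.size(); i++) {
                            in = lattice.combine(in, result.get(cs.get(i)));
                        }
                    }
                    X out = f.apply(b, in);
                    if (!out.equals(result.get(b))) {
                        result.put(b, out);
                        changed = true;
                    }
                }
            }
            return result;
        }
    }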

2.7.2 Loop Detection

JVM instructions provide only unstructured control flow with branches to arbitrary labels, and all structured information regarding loops and conditionals is discarded at compile-time. Therefore, in order to extract loop bodies for parallel execution, some of this structure must be reconstructed using loop detection. Ideally, this should be done without needing user annotations (as per Requirement C3).

Definition 2. A natural loop is defined as a loop with only a single entry point.

In general, detection is made difficult by the possibility of loops with multiple entries and exits, as in Figure 2.5. However, by restricting detection to the case of natural loops, a simple algorithm can be used [1, p655]. This case still includes all loops either expressible using standard for and while constructs in high-level languages, or suitable for GPU execution (see Section 2.6.1).

Definition 3. In a control flow graph of basic blocks, we define a block m to be a dominator of another block n if all execution paths to n contain m.

Definition 4. A back edge is defined to be an edge whose end dominates its start.


Since a single entry point must dominate every block in the loop body, the edge to the entry from the end of the body must be a back edge. Therefore, each natural loop corresponds to a back edge in the control flow graph, and can be detected by the following simple algorithm.

Step 1 Calculate the set of dominators D(b) of each block b.

Step 2 Find any edge m → n such that n ∈ D(m). This gives a natural loop with body from n to m.

Since each dominator of a block b must also be a dominator of all of b's immediate predecessors, the dominator set of b, D(b) ∈ ℘(Blocks), can be described by:

    D(b) = {b} ∪ ⋂_{p ∈ pred(b)} D(p)        (2.2)

This is a form of forward dataflow analysis over the lattice (℘(Blocks), ⊆) using the meet operator (i.e. set intersection) and the transfer function in Equation 2.3. This can therefore be computed iteratively, initialising each D(b) to the set of all blocks, so that the intersections can only remove blocks that do not dominate. Since set union is monotone, the analysis is also guaranteed to converge.

    F_b(x) = x ∪ {b}        (2.3)

Step 2 can be performed trivially to find all loops. The set of blocks in the loop body from n to m is given by S_n(m), defined recursively as follows:

    S_n(b) = {n}                                  if b = n
    S_n(b) = {b} ∪ ⋃_{p ∈ pred(b)} S_n(p)         otherwise        (2.4)
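The two steps can be sketched as follows (an illustrative standalone version, not the compiler's LoopDetector; blocks are numbered 0..n-1 with block 0 as the entry, and dominator sets are seeded with the full block set so that the intersections can only shrink them):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class NaturalLoops {
        // Step 1: compute D(b) = {b} united with the intersection over immediate predecessors.
        static List<Set<Integer>> dominators(int n, List<List<Integer>> pred) {
            Set<Integer> all = new HashSet<>();
            for (int b = 0; b < n; b++) all.add(b);

            List<Set<Integer>> dom = new ArrayList<>();
            for (int b = 0; b < n; b++) {
                Set<Integer> init = new HashSet<>();
                if (b == 0) init.add(0); else init.addAll(all);   // the entry dominates only itself
                dom.add(init);
            }

            boolean changed = true;
            while (changed) {
                changed = false;
                for (int b = 1; b < n; b++) {
                    Set<Integer> d = new HashSet<>(all);
                    for (int p : pred.get(b)) d.retainAll(dom.get(p));   // meet = intersection
                    d.add(b);
                    if (!d.equals(dom.get(b))) { dom.set(b, d); changed = true; }
                }
            }
            return dom;
        }

        // Step 2: an edge m -> n is a back edge, and so closes a natural loop with
        // body from n to m, exactly when its destination dominates its source.
        static boolean isBackEdge(int m, int n, List<Set<Integer>> dom) {
            return dom.get(m).contains(n);
        }
    }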

2.7.3 Live Variable Analysis

Definition 5. A variable is live at a given point if, on some execution path starting from that point, the variable is read before it is written to.

Live variables [1, p608] can be calculated using backward dataflow analysis on the lattice (℘(Vars), ⊆) using the join operator (i.e. set union) and the transfer function given in Equation 2.5, where Write(n) and Read(n) indicate the sets of writes and reads made by an instruction n.

    F_n(x) = (x \ Write(n)) ∪ Read(n)        (2.5)
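The transfer function itself is a small set manipulation; the sketch below keeps variables abstract as strings and is illustrative rather than the compiler's LiveVariable class:

    import java.util.HashSet;
    import java.util.Set;

    class LiveVariableTransfer {
        // F_n applied to the set of variables live after instruction n.
        static Set<String> transfer(Set<String> liveAfter, Set<String> writes, Set<String> reads) {
            Set<String> live = new HashSet<>(liveAfter);
            live.removeAll(writes);   // x \ Write(n)
            live.addAll(reads);       // ... ∪ Read(n)
            return live;
        }
    }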


2.7.4 Constant Propagation

Forward dataflow analysis can be used to determine the value of a variable at a point in code, if it is a constant [1, p632]. For each variable v and block b, we maintain a result R_v(b) taken from the 'flat' lattice over constants, i.e. ({⊥, ⊤} ∪ Constants, ⊑) where:

    x ⊑ y  ⟺  (x = ⊥) ∨ (x = y) ∨ (y = ⊤)        (2.6)

    R_v(b) = c ∈ Constants   if constant c is the value of v at the end of b
           = ⊤               if the value of v is not constant at the end of b
           = ⊥               if no writes are made to v before the end of b        (2.7)

This can be computed using the join operator with transfer function F_{n,v} for variable v as follows:

    F_{n,v}(x) = c    if n assigns c to v
               = ⊤    if n writes a non-constant to v
               = x    otherwise        (2.8)
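A small sketch of this flat lattice follows (the sentinel-object representation is illustrative only, not the compiler's ReachingConstants implementation):

    class FlatConstantLattice {
        static final Object BOTTOM = new Object();   // no write seen yet
        static final Object TOP = new Object();      // written, but not a constant

        // Join of two lattice elements; any other Object is treated as a constant value.
        static Object join(Object a, Object b) {
            if (a == BOTTOM) return b;
            if (b == BOTTOM) return a;
            if (a.equals(b)) return a;   // equal constants (or TOP with TOP)
            return TOP;                  // differing constants, or a constant joined with TOP
        }

        // Transfer function F_{n,v}: the effect of one instruction on variable v.
        static Object transfer(Object before, boolean writesV, Object writtenConstant) {
            if (!writesV) return before;                             // x otherwise
            return writtenConstant != null ? writtenConstant : TOP;  // c, or TOP for a non-constant write
        }
    }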

2.7.5 Data Dependencies

When considering whether two instructions or regions of code can be run in parallel, the data dependencies between them must be considered. There are three types: true dependencies (read-after-write), anti-dependencies (write-after-read) and output dependencies (write-after-write). We can then determine whether there are any loop-carried dependencies that prevent the loop from being executed in parallel. The core requirements of the project require that the programmer will consider this before marking a loop as parallel.
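For example (an illustrative pair of loops, not taken from the benchmarks), the first loop below carries a true dependency between iterations and so cannot be marked parallel, whereas the second touches each element independently and can:

    class DependencyExamples {
        // Loop-carried true dependency: iteration i reads the value written by iteration i-1.
        static void prefixSum(int[] a) {
            for (int i = 1; i < a.length; i++) {
                a[i] = a[i] + a[i - 1];
            }
        }

        // No loop-carried dependency: each iteration reads and writes only element i.
        static void scale(int[] a, int k) {
            for (int i = 0; i < a.length; i++) {
                a[i] = a[i] * k;
            }
        }
    }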

Determining dependencies automatically is desirable, but becomes hard as soon as memory references are introduced, which Java does through objects and arrays. Difficulty arises because writes to two distinct references can affect the same state. Alias analysis aims to determine statically whether this may occur at a given point in code. There are two variations of this problem, may-alias and must-alias. For this project, may-alias is required, since by overestimating conflicts to a memory address, it will always be safe (see Section 3.3.3 for this analysis).

In languages such as Java, "with if statements, loops, dynamic storage, and recursive data structures", alias analysis can be shown to be undecidable by reduction to the Halting Problem [15].

2.8 Summary

This chapter has given objective and verifiable requirements for the project. The development and testing strategy that was employed to ensure these were met has also been outlined. Finally, brief introductions to the Java Platform, NVIDIA's CUDA framework and some common compiler analysis techniques have been given. It was from this base of knowledge and planning that the project was started.


CHAPTER 3
Implementation

This chapter first outlines the overall implementation structure as well as the central data structure. Some new analysis techniques are then introduced and developed, before descriptions of individual compiler stages are given. Finally, the overarching compiler tool is briefly explained.

The size of the implementation1, despite containing a significant proportion of boilerplate for supporting the JVM instruction set, is too large to describe in detail. As such, this chapter gives a high-level view, identifying specific details only when necessary. Further information is given in the appendices.

3.1 Overall Implementation Structure

The compilation process can be divided into five main stages: importing classes; loop detection; kernel extraction including dependency checks; code generation; and finally exporting new class files.

The final structure of the project implementation is shown by Figure 3.1. This gives a high level view of class interactions with time running (roughly) down the page. To keep the diagram relatively simple, commonly used classes have been left out, notably graph.* (see Section 3.2), analysis.dataflow.SimpleUsed (see Section 3.3.4) and analysis.BlockCollector. Colour coding is used to indicate when each class was added to the compiler.

1 SLOCCount gives a total of 7686 lines.


Figure 3.1: Outline call graph for main classes. (Colouring denotes the development cycle on which the code was written: 1, translation between bytecode and an internal representation; 2, detection of loops for representation as structured control flow; 3, detection of loop bounds and increments to give trivial loops; 4, generation of C++ from bytecode; 5, extraction of 1D kernels based on annotations, and generation of GPU wrappers; 6, support of multiple dimension kernels; 7, basic automatic dependency analysis. External libraries used include JOpt Simple, ASM and NVCC.)


Figure 3.2: Garbage collection of unreachable blocks (strong references along successor edges, weak references along predecessor edges; unreachable blocks can be collected).

3.2 Internal Code Representation (ICR)

The internal representation of classes, methods and fields under transformation is central to the compiler. This provides similar capabilities to the Java reflection classes, but with added support for modification. Therefore, the graph package contains ClassNode, Method, state.Field, Annotation and Modifier as 'replacements' for the corresponding reflection classes. The Method class in turn references a graph giving the implementation. It is on this graph that the compiler analyses and transformations act.

3.2.1 Code Graph

The implementation graphs are made up of two main types of block: basic blocks and loops. For a block b, the notation pred(b) is used to denote its immediate predecessors in the graph, and succ(b) its successors.

Definition 6. A basic block is a sequence of instructions i_1, ..., i_n where only the first instruction may have multiple predecessors (|pred(i_k)| = 1, for 1 < k ≤ n), and only the last multiple successors (|succ(i_k)| = 1, for 1 ≤ k < n).

In my implementation, successors are represented by a standard set. However, in order to minimise the housekeeping required when modifying the graph, the predecessor set is stored internally as a weakly referenced list (util.WeakList). Then, whenever it is accessed, it is returned as a standard set. By using weak references, any code that becomes unreachable can be garbage-collected, as shown in Figure 3.2. A list is used internally to count how many links exist from each predecessor, making it easy to update. For example, a switch instruction might branch to the same block for multiple cases. If one of these were changed, it would be necessary to determine whether to modify the predecessor set of the destination block.


Result producing instructions (Producer)
  Arithmetic: *ADD, *SUB, *MUL, *DIV, *REM, *AND, *OR, *SHL, *SHR
  Convert: *2*
  Negate: *NEG
  Constant: *CONST_*, LDC, *PUSH
  Compare: *CMP*
  ArrayLength: ARRAYLENGTH
  NewArray: ANEWARRAY
  NewMultiArray: MULTIANEWARRAY
  NewObject: NEW
  CheckCast: CHECKCAST
  InstanceOf: INSTANCEOF
  Read: *LOAD, GETSTATIC, GETFIELD
  Call: INVOKE*

Stateful instructions (Stateful)
  Write: *STORE, PUTSTATIC, PUTFIELD
  Call: INVOKE*
  Read: *LOAD, GETSTATIC, GETFIELD
  Increment: IINC

Branching instructions (Branch)
  Return: RETURN
  ValueReturn: *RETURN
  Condition: IF*
  TryCatch: N/A
  Switch: TABLESWITCH, LOOKUPSWITCH
  Throw: ATHROW

Other instructions
  unsupported: RET, JSR (these are used for finally blocks), MONITOR*
  StackOperation: SWAP, POP, POP2, DUP, DUP2, DUP_X1, DUP_X2, DUP2_X1, DUP2_X2

Table 3.1: Summary of JVM instructions and their internal representation.

The instructions within a basic block are connected in a directed acyclic graph that gives the dataflow representation of the code. In general, this forms a graph rather than a tree, since each instruction can be used as an argument to multiple other instructions. Whilst the original bytecode will have an order for all instructions within a basic block, this ordering is only important for stateful instructions. Therefore, each basic block also holds a timeline of stateful instructions, and a final branch instruction. This approach sits between the common techniques: linear lists of instructions, and complete dataflow graphs. A full summary of instructions and their internal groupings is given in Table 3.1.

Definition 7. An instruction is stateful if the time at which it is executed may affect its result or effect.

Loops are represented by the start and end blocks for the body.

As an example, the ICR for the Mandelbrot computation (Example 1.1) is shown in Example 3.1.


Example 3.1: Graph for Mandelbrot computation (the control flow graph of basic blocks for Example 1.1, each block holding a timeline of stateful instructions such as reads, writes and increments, together with the dataflow graphs of their operands).


Example 3.2: UML sequence diagram for Visitor pattern operation (a BlockExporter calls accept on a Write node, which calls visit(Write) on the InstructionExporter; the exporter queries the node's state, an Arithmetic node, which in turn accepts the visitor, calls visit(Arithmetic), and has its operands queried via getOperandA and getOperandB).

3.2.2 Visitor Pattern

In order for other classes to traverse this structure easily, the visitor pattern [10, p331] is utilised for both the control and dataflow graphs. The abstract classes graph.BlockVisitor and graph.CodeVisitor emulate multiple dispatch, which is not supported natively by the JVM. With multiple dispatch, the choice of method to invoke is based on the runtime type of all arguments. The JVM does support single dispatch, where the runtime type of the object, but not the arguments, is considered. The visitor pattern makes use of this in its implementation, as shown in Example 3.2.

In addition to the above, a decorator (analysis.CodeTraverser) is provided for the code graph that causes a child visitor to do a depth-first traversal of a given dataflow graph.
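A reduced sketch of the mechanism (simplified signatures, not the actual graph.CodeVisitor hierarchy) shows how each node's accept call selects the visit overload for its own runtime type, which is how single dispatch is used twice to emulate double dispatch:

    interface CodeVisitor {
        void visit(Write w);
        void visit(Arithmetic a);
    }

    abstract class Node {
        abstract void accept(CodeVisitor v);
    }

    class Write extends Node {
        Node value;
        @Override void accept(CodeVisitor v) { v.visit(this); }   // statically resolves to visit(Write)
    }

    class Arithmetic extends Node {
        Node operandA, operandB;
        @Override void accept(CodeVisitor v) { v.visit(this); }   // statically resolves to visit(Arithmetic)
    }

    class InstructionExporter implements CodeVisitor {
        @Override public void visit(Write w) {
            w.value.accept(this);        // export the operand's subgraph first
            // ... then emit the store instruction ...
        }
        @Override public void visit(Arithmetic a) {
            a.operandA.accept(this);
            a.operandB.accept(this);
            // ... then emit the arithmetic instruction ...
        }
    }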

3.2.3 Bytecode to ICR Translation

The internal code representation must be interchangeable with JVM bytecode. The stack-based nature of the JVM makes this relatively straightforward in the standard cases, although there are some issues that make the general case more difficult.

    double[][] arr = {{0.1, 0.2}};

    ICONST 1
    ANEWARRAY "[D"
    DUP
    ICONST 0
    ICONST 2
    NEWARRAY double
    DUP
    ICONST 0
    LDC double 0.1d
    DASTORE
    DUP
    ICONST 1
    LDC double 0.2d
    DASTORE
    AASTORE
    ASTORE 2

(1) Bytecode

(2) Graph (the dataflow graph of the two NewArray nodes and the numbered Write operations into them; multiple arrows out of a node imply a DUP instruction (or similar) is needed).

Example 3.3: Basic block that causes difficulties when exporting.

Rather than producing import and export code from scratch, a class reading library, ASM [6], was used. This provides visitor pattern access to the files rather than producing any data structures. The library does this to remain lightweight and fast for applications that can perform transformations in a single pass (i.e. do not need to store the bytecode). It also allows use with whatever data structures an application might require.

For importing, the timeline and dataflow graph for a basic block can be built in a single pass through the code using symbolic execution, with the standard operand stack containing graph nodes rather than real results. In cases where the operand stack is not empty at the end of a basic block (as occurs with ternary conditionals, expr ? a : b), the values are stored as being 'emitted' by the block and successor blocks are marked as 'accepting' values of the respective types. These values can then be accessed using the RestoreStack pseudo-instruction.

Unfortunately, exporting to bytecode is only easy in cases where stack operations (e.g. DUP, POP, ...) are not required, since these are represented implicitly by the structure of the dataflow graph rather than individual nodes (see Example 3.3). Therefore, the compiler makes use of the correct bytecode sequence that is seen for each basic block in the input class file by maintaining a cache2.

2 Using a WeakHashMap so entries are not held unnecessarily if a basic block is discarded.

Page 38: 2010 Javagpu Diss

7/31/2019 2010 Javagpu Diss

http://slidepdf.com/reader/full/2010-javagpu-diss 38/103

26 CHAPTER 3. IMPLEMENTATION 

In the case of code inserted by the compiler, no stack operations are required

since:

• Dataflow graphs form a tree (i.e. results are never used more than once, so no need for DUP etc. instructions).

• Reads and result producing calls occur in the timeline in the same order as

given by a depth-first search of the dataflow graph.

• Results of calls are always used.

It is therefore possible to produce bytecode by performing a depth-first search of 

the dataflow graph corresponding to each timeline entry in order.
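As a rough illustration of this export step (the Node interface and emit method here are placeholders, not the compiler's real ICR classes), a post-order walk over each timeline entry suffices:

import java.util.List;

// Sketch only: emit operands before the instruction that consumes them.
final class TimelineExporter {
    interface Node {
        List<Node> getOperands();
        String emit();
    }

    // Depth-first: arguments are generated before their use.
    static void export(Node node, List<String> out) {
        for (Node operand : node.getOperands()) {
            export(operand, out);
        }
        out.add(node.emit());
    }

    // Timeline order preserves the ordering of stateful instructions.
    static void exportBlock(List<Node> timeline, List<String> out) {
        for (Node entry : timeline) {
            export(entry, out);
        }
    }
}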

It is worth noting that all code in a transformed class is exported from the

above structure. Whilst it may have been possible to simply copy unmodified

methods, or even portions of methods, from the original class, this approach

would have been less elegant, and required either a second pass of the original

file, or storage of all original bytecode. A consequence of this decision is that

it is necessary for all instructions (including monitor and exception operations)

to be handled by the code representation, even if they cannot be executed on a

graphics card.

3.2.4 Type Inference

Compilation to bytecode loses the majority of type information, so it is necessary

to infer types, in order to copy state onto and off a graphics processor. Primitive

types are clear from the instruction used to load the value. However, reference

types can only be inferred by usage. This is achieved using a Damas-Milner style

type-checking algorithm [8]. At each instruction, we take a fresh  type corre-

sponding to the usage and unify (Figure 3.3) this with the type maintained for

the object operated on. This process ensures that the stored type is valid for all

contexts. If unification ever fails, then this indicates that the input bytecode was

badly typed. Table 3.2 gives details of the unification operation performed for

some instructions.

Unfortunately, the existing Type class provided by ASM had a private con-

structor, so could not be extended to include the unification functionality. There-

fore, graph.Type is based heavily on the ASM code, supplemented with unifica-

tion and some convenient methods for dealing with array types.

Type inference is slightly complicated by reuse of local variables—for instance,

in Example 3.4, variables i and j are likely to share a location on the local variable

stack. We can overcome this by using live variable analysis (Section 2.7.3) to


if x is a supertype of y then
    x ← y
else if y is a supertype of x then
    y ← x
else
    return failure
end if

Figure 3.3: Unification algorithm.
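A minimal Java sketch of this unification step, assuming a hypothetical Type abstraction with a supertype test (the real graph.Type is based on ASM's representation):

// Sketch of Figure 3.3: keep the more specific of the two types, or fail.
final class Unifier {
    interface Type {
        boolean isSupertypeOf(Type other);
    }

    static Type unify(Type x, Type y) {
        if (x.isSupertypeOf(y)) {
            return y;      // y is more specific: x ← y
        } else if (y.isSupertypeOf(x)) {
            return x;      // x is more specific: y ← x
        } else {
            return null;   // failure: the input bytecode was badly typed
        }
    }
}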

Instruction            Unification performed
PUT/GETSTATIC          The value passed/returned must unify with the type of the static field.
PUT/GETFIELD           The object type must unify with the owner class of the field, and the value passed/returned must unify with the type of the field.
<T>ALOAD/<T>ASTORE     The object given must unify with an array of element type <T>.
CALL                   Each argument's type must unify with the corresponding type in the method descriptor.

Table 3.2: Unification Details

for (int i = 0; i < 10; i++) f(i);
for (int j = 0; j < 10; j++) g(j);

Example 3.4: Reuse of local variable locations.

determine the live ranges of each variable and ensure that the types across a

range are consistent. Since the unification algorithm is simple and not time-

consuming, these unification steps are integrated into the live variable analysis

code. Thus, at the end of each method import, live variable analysis is performed on the code to infer the types.

3.3 Dataflow Analysis

A general framework for dataflow analysis was outlined in Section 2.7.1. Here

specific dataflow analyses that were developed for use in the compiler are described.


i++;              Ri = 1, Rj = 0
j++;              Ri = 1, Rj = 1
if (...) {
    i += 3;       Ri = 4, Rj = 1
} else {
    i += 2;       Ri = 3, Rj = 1
    j = i + 10;   Ri = 3, Rj = ⊤
    i++;          Ri = 4, Rj = ⊤
}
                  Ri = 4, Rj = ⊤

Example 3.5: Results from increment variable analysis computation.

3.3.1 Support for Arrays and Objects

The live variable analysis previously given explicitly excludes array and object

accesses. However, for analysis of JVM bytecode, this is insufficient. The simple

approach taken here defines the effect_n function such that array and object vari-

ables become live when any of their elements or fields are either read or written.

The only way the variable can stop being live is if it is directly assigned a value

(e.g. a new array or object reference). This ensures safety.

3.3.2 Increment Variables

This analysis returns information about integer-typed variables for which it is possible to statically determine the effect of a region of code. The result for each variable is taken from a flat lattice over the integers $(\{\top\} \cup \mathbb{Z}, \sqsubseteq)$ with:

$$x \sqsubseteq y \iff (x = y) \vee (y = \top) \quad (3.1)$$

The result for a variable $v$ at the end of a block $b$, $R_v(b)$, has the behaviour described by Equation 3.2 (also see Example 3.5).

$$R_v(b) = \begin{cases} n \in \mathbb{Z} & \text{if the overall effect on } v \text{ is to increment by } n \\ \top & \text{if } v \text{ is written to in a more complex manner} \end{cases} \quad (3.2)$$

Note that this also includes ‘decrement’ variables (i.e. n < 0). The results

can be calculated using forward dataflow analysis with the join operator (least

upper bound) and a transfer function as defined below. Each $R_v(b)$ is initialised


to 0.

$$F_{n,v}(X) = \begin{cases} X + i & \text{if } n \text{ increments } v \text{ by } i \text{ and } X \in \mathbb{Z} \\ \top & \text{if } n \text{ writes to } v \text{ in a more complex manner} \\ X & \text{otherwise} \end{cases} \quad (3.3)$$

Theorem 1. Iterative computation of increment variables converges.

Proof. Since our lattice does not contain ⊥, we must adopt a different style of 

proof. Suppose the analysis does not terminate, then there must be a loop which

increments a variable v. However, there must be an entry point and for the outer-

most loop, this gives a fixed increment for $v$. Therefore, the join on entering the loop will give $\top$ for $v$, and since $\forall n, v.\ F_{n,v}(\top) = \top$, the analysis must terminate. Hence, we have a contradiction and our assumption of non-convergence must be incorrect.
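A sketch of the join and transfer function in Java, using null to stand for ⊤ and a placeholder Instruction interface (this illustrates Equations 3.1 to 3.3; it is not the compiler's code):

// Flat lattice over integers: null represents ⊤.
final class IncrementAnalysis {
    interface Instruction {
        Integer incrementsBy(String v);   // amount, or null if not a simple increment of v
        boolean writesTo(String v);       // any other write to v
    }

    // Join (least upper bound): equal values join to themselves, otherwise ⊤.
    static Integer join(Integer a, Integer b) {
        if (a == null || b == null) return null;
        return a.equals(b) ? a : null;
    }

    // Transfer function F_{n,v} applied to the current value X for variable v.
    static Integer transfer(Instruction n, String v, Integer x) {
        Integer inc = n.incrementsBy(v);
        if (inc != null) {
            return (x == null) ? null : x + inc;   // increment, unless already ⊤
        } else if (n.writesTo(v)) {
            return null;                           // complex write: ⊤
        } else {
            return x;                              // instruction does not affect v
        }
    }
}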

3.3.3 May-Alias

May-alias analysis is used in the compiler to establish which variables may be

affected by a write. This is then used both to determine which variables must be

copied back off the graphics card, and also in automatically detecting dependen-

cies. Computing may-alias sets is the most complex analysis performed in the

compiler. The approach presented here is an approximation, flagging some cases

as inaccurate.

Whilst reference states (i.e. array elements and object fields) are represented

within the compiler as chains of reads (e.g. a[i] would first read a and then

an element), for the description here, states will be considered as in Equation

3.4 (where c ∈ Call represents the return value of a call). I also define loose

states (Equation 3.5) that allow comparison ignoring array indices, and a function

(Equation 3.6) to ‘loosen’ states.

State ::= v | s | c where v ∈ Var, s ∈ Static, c ∈ Call

| State.f  | State[expr] where f  ∈ Field (3.4)

LooseState ::= v | s | c where v ∈ Var, s ∈ Static, c ∈ Call

| LooseState.f  where f  ∈ Field

| LooseState[•] (3.5)

Page 42: 2010 Javagpu Diss

7/31/2019 2010 Javagpu Diss

http://slidepdf.com/reader/full/2010-javagpu-diss 42/103

30 CHAPTER 3. IMPLEMENTATION 

$$\mathrm{loosen}(s) = \begin{cases} \mathrm{loosen}(p).f & \text{if } s = p.f \\ \mathrm{loosen}(p)[\bullet] & \text{if } s = p[expr] \\ s & \text{otherwise} \end{cases} \quad (3.6)$$

Forward dataflow analysis can then compute, for each block b, a result M (b),

where M (b)(s) gives the set of states which may share the same reference as s.

For example, consider the code:

a = b; a[f(x)] = objA; b[g(x)] = objB;

Statically, $f(x)$ and $g(x)$ may be unknown, so we should deduce that any element in either array a or b could point to either objA or objB (i.e. $\{a[\bullet] \mapsto \{\text{objA}, \text{objB}\},\ b[\bullet] \mapsto \{\text{objA}, \text{objB}\}\}$).

We use the lattice over functions $(\text{LooseState} \to \wp(\text{State}), \sqsubseteq)$ with $\sqsubseteq$ as defined in Equation 3.7. Therefore, joins can be considered as pointwise union. $M^*_m : \text{State} \to \wp(\text{State})$ (Equation 3.8) gives the closure under dereferencing of a function $m : \text{LooseState} \to \wp(\text{State})$.

$$f \sqsubseteq g \iff \forall s.\ f(s) \subseteq g(s) \quad (3.7)$$

$$M^*_m(s) = \begin{cases} m(\mathrm{loosen}(s)) \cup \{x.f \mid x \in M^*_m(p)\} & \text{if } s = p.f \\ m(\mathrm{loosen}(s)) \cup \{x[e] \mid x \in M^*_m(a)\} & \text{if } s = a[e] \\ m(\mathrm{loosen}(s)) & \text{otherwise} \end{cases} \quad (3.8)$$

The transfer function is defined in Equation 3.9, where $a \xleftarrow{\tau} b$ indicates a write to $a$ of value $b$ with type $\tau$. The 5 different cases will be referred to as $A$ to $E$.

$$F_n(m) = \lambda y.\ \begin{cases} \mathrm{Recurse}(m, c) & \text{if } n = c \text{ and } y = c \\ M^*_m(x) & \text{if } n = (v \xleftarrow{\text{ref}} x) \text{ and } y = v \\ M^*_m(x) \cup m(y) & \text{if } n = (a[\bullet] \xleftarrow{\text{ref}} x) \text{ and } \exists a' \in M^*_m(a).\ y = a'[\bullet] \\ M^*_m(x) \cup m(y) & \text{if } n = (o.f \xleftarrow{\text{ref}} x) \text{ and } \exists o' \in M^*_m(o).\ y = o'.f \\ m(y) & \text{otherwise} \end{cases} \quad (3.9)$$

The initial value, $M_{\mathrm{init}}$, at the entry of a code graph must be provided and should indicate which states might alias.

We must also maintain a set of states $R$ from the lattice $(\wp(\text{State}), \subseteq)$ that contains all states which might be returned from a method.³

³ Note that R is associated with the function rather than any particular block.


int x = ...;                     m = {}, R = {}                                             B
List temp;                       m = {temp → {temp}}, R = {}
List[] temp2 = new List[1];      m = {temp2 → {new0}}, R = {}                               B
List[] data = ...;               m = {data → {data}}, R = {}                                B
temp = data[0];                  m = {..., temp → {data[0]}}, R = {}                        B
temp = data[x];                  m = {..., temp → {data[x]}}, R = {}                        B
temp2[0] = data[100];            m = {..., new0[•] → {data[100]}}, R = {}                   C
temp2[0] = data[x];              m = {..., new0[•] → {data[100], data[x]}}, R = {}          C
return f(temp, temp2[0]);        m = {...}, R = {data[100], data[x]}                        A

List f(List a, List b) {         M_init = {a → {data[x]}, b → {data[100], data[x]}}
  if (Math.sqrt(4.0) < 4.0)      m = {...}, R = {}
    return a;                    m = {...}, R = {data[x]}
  else                           m = {...}, R = {data[x]}
    return b;                    m = {...}, R = {data[x], data[100]}
}

The case of F_n that is applied is given on the right hand side.

Example 3.6: Example inter-procedural may-alias computation.

The transfer function $G_n$ below computes $R$ using the current $m : \text{LooseState} \to \wp(\text{State})$ as context.

$$G_n(m, R) = \begin{cases} R \cup M^*_m(s) & \text{if } n = \mathrm{RETURN}(s) \\ R & \text{otherwise} \end{cases} \quad (3.10)$$

This allows $\mathrm{Recurse}(m, c)$ to be defined as $R$ from recursive analysis on $f$ (where $c = f(a_0, \ldots, a_n)$), with $M_{\mathrm{init}}$ given by Equation 3.11. However, no alias information other than $R$ is returned, so if the function contains reference writes (i.e. $x \xleftarrow{\text{ref}} y$) then the analysis must be marked inaccurate.

$$M_{\mathrm{init}}(s) = \begin{cases} M^*_m(a_i) & \text{if } s = v_i \text{ and } i \leq n \\ M^*_m(s) & \text{if } s \in \text{Static} \\ \emptyset & \text{otherwise} \end{cases} \quad (3.11)$$

Example 3.6 gives an example of the results achieved when the inter-

procedural case is used.

The analysis that has been described so far may not terminate (Example 3.7). Therefore, the number of iterations is bounded, and if convergence does not

occur, the analysis is flagged inaccurate.
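To make the state representation concrete, here is an illustrative Java sketch of states and the loosen operation of Equation 3.6 (the compiler itself represents states as chains of read nodes, as noted above; equals/hashCode are omitted although a real implementation needs them for set membership):

abstract class State { }

final class Root extends State {                // a variable, static or call result
    final String name;
    Root(String name) { this.name = name; }
}

final class FieldAccess extends State {
    final State base; final String field;
    FieldAccess(State base, String field) { this.base = base; this.field = field; }
}

final class ArrayAccess extends State {
    final State base; final String indexExpr;   // null once loosened (the '•' form)
    ArrayAccess(State base, String indexExpr) { this.base = base; this.indexExpr = indexExpr; }
}

final class Loosener {
    // Drop array index expressions so that states compare ignoring which element was accessed.
    static State loosen(State s) {
        if (s instanceof FieldAccess) {
            FieldAccess f = (FieldAccess) s;
            return new FieldAccess(loosen(f.base), f.field);
        } else if (s instanceof ArrayAccess) {
            ArrayAccess a = (ArrayAccess) s;
            return new ArrayAccess(loosen(a.base), null);   // forget the index
        } else {
            return s;                                       // v, s or c unchanged
        }
    }
}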

3.3.4 Usage Information

At various stages in the compiler, it is useful to know the set of accesses made

by a block of code. Accesses are either direct  or indirect :


                     {a[•] → {a0}}
while (a[i] != null) {
    b = a[i].next;   {a[•] → {a0, a0.next, ...}, b → {a0.next, a0.next.next, ...}}
    a[i] = b;        {a[•] → {a0, a0.next, ...}, b → {a0.next, a0.next.next, ...}}
}

Example 3.7: Non-termination of may-alias analysis.

Definition 8. An access is direct  if it accesses the value of a variable or static

field.

Definition 9. An access is indirect  if it accesses a value in the heap—i.e. it

requires one or more dereferences. Each indirect access can be described by a list

of indices—for example, arr[i].video.data[x][y] corresponds to [i,x,y].

The class graph.dataflow.SimpleUsed collects sets of state for the cate-

gories: variables used, statics used and state directly written. It also collects a

set of classes used. This is done simply by unioning across all instructions (i.e.

nothing is ever removed from these sets).

The case of indirect accesses is much harder to compute due to the effects of 

aliasing. Therefore, the may-alias analysis described in the previous section is

used to form sets of all state that could have been written to or read from. This

is all done within the graph.dataflow.AliasUsed class.

3.4 Loop Detection

Loop detection is done in three stages: natural loop detection, loop trivialisation

and loop nesting. Example 3.8 shows the effect of these. The first is implemented

as a version of the algorithms in Section 2.7.2, restricted to cases with both a

single entry and  single exit. This corresponds to the style of loops that can

be executed in parallel on GPUs (see Section 2.6.1). Loop nesting is done by

checking whether a loop is contained in the body of another.

3.4.1 Loop Trivialisation

In order to execute a loop on a graphics processor, it is necessary that the di-

mensions and limits of the loop can be determined. The compiler detects these

automatically for trivial loops as defined below. The definition is more inclusive

than that used in JavaB [4], with positive or negative increments to the loop

variable permitted anywhere in the loop body.


S ← root level loops
while S is not empty do
    l ← S.remove
    if extract(l) fails then
        S.add(l.children)
    end if
end while

Figure 3.4: Outline of kernel extraction algorithm.

Definition 10. A loop is trivial  if there is only a single conditional branch that

exits the loop after comparing the loop index i with an expression. Furthermore, no writes can occur before the branch, and i must be an ‘increment variable’ as

defined by the analysis of Section 3.3.2.

Therefore a trivial loop is defined by its index, its limit and a mapping between

its increment variables (of which the index must be one) and their increments.

These can be detected by the increment variables analysis in Section 3.3.2, along

with inspection of the exit condition, and are represented by extended loop nodes

in the code graph.
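For example, a loop of the following shape (illustrative code, not one of the samples) is trivial: there is a single exit branch comparing i with n, no writes occur before it, and i is an increment variable.

for (int i = 0; i < n; i += 2) {
    out[i] = in[i] * 2;
}
// Recorded as: index = i, limit = n, increment map = {i -> 2}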

3.5 Kernel Extraction

In order to extract kernels from loop bodies, the tree provided by the nesting stage

must be considered, since it is not possible to extract both an outer loop and one

of its inner loops independently. In this project, outer loops are parallelised

preferentially since this minimises the number of data copies to and from the

GPU. This gives the outline algorithm for kernel extraction shown in Figure 3.4.
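In Java, that outline might be realised with a simple worklist; Loop and tryExtract below are placeholders for the compiler's loop-tree types and extraction logic.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch of Figure 3.4: try outer loops first; only descend to children when
// extraction of the parent fails (dependency check or CUDA limitation).
final class KernelExtractor {
    interface Loop {
        List<Loop> children();
    }

    void extractAll(List<Loop> rootLoops) {
        Deque<Loop> worklist = new ArrayDeque<>(rootLoops);
        while (!worklist.isEmpty()) {
            Loop l = worklist.remove();
            if (!tryExtract(l)) {
                worklist.addAll(l.children());
            }
        }
    }

    boolean tryExtract(Loop l) {
        // dependency check + code generation; returns false on failure
        return false;   // placeholder
    }
}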

For the one-dimensional case, extract(l) simply uses a dependency checker

to determine whether the loop l can be run in parallel, and if so attempts to

extract it. Note that an extraction may fail due to limitations of the CUDA architecture; this type of failure is handled exactly as though the dependency

check failed.

For the n-dimensional case, the first level is checked as for the 1D case. For

subsequent levels, it is required that there is only one loop child and also that

the form in Figure 3.5 is followed before the level may be added as a further

dimension of the kernel.


for (...) {                           Outer loop
    v = constant (∀v ∈ Inc_inner)     Checked using constant propagation (Section 2.7.4)
    for (...) {
        ...                           Parallel inner loop (checked by dependency checker)
    }
    v += constant (∀v ∈ Inc_outer)
}

Figure 3.5: Form of multiple dimension kernels.

There may be other viable approaches that don’t always select the outer loop

if the compiler were capable of leaving state in GPU memory between kernel

invocations, as was done in [16] with multi-pass loops, but these are not considered

here.

3.5.1 Copy In

The copy in  state for a kernel is the set of state that must be supplied to the

GPU for kernel execution. This is the set of variables made live by the loop body

plus any dimension indices not already in this set.

3.5.2 Copy Out

Since the kernel is executed in parallel, all direct writes should be local to the kernel (i.e. not live immediately following the loop). If this were not the case,

then an output dependency would exist. The copy out  set is therefore given

by the indirect writes set computed by analysis.dataflow.AliasUsed (Section

3.3.4). When the may-alias analysis is flagged as inaccurate, all copy in state is

included in the copy out set.

3.6 Dependency Analysis

The dependency analysis portion of the compiler is used by the kernel extraction

stage (Section 3.5) to determine whether it is safe to parallelise a given loop.

Both the user annotation and automated checks implement the same interface

(DependencyCheck) so can be used interchangeably.


3.6.1 Annotation Based

Developers can use method annotations to both express explicit parallelism and

override automatic analysis. The annotation (@Parallel) has a single property, loops, that takes an array of index variable names for trivial loops which should be

executed in parallel. This still requires that the corresponding loop is detected

and found to be of a trivial form. The class must have been compiled with

debugging information so that variable names are available.
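For example, assuming the two-dimensional loop nest below (the surrounding class and loop body are illustrative; @Parallel and its loops property are as described above), the annotation requests that the loops indexed by y and x be executed in parallel:

class Grid {
    final int width, height;
    final float[][] data;

    Grid(int width, int height) {
        this.width = width;
        this.height = height;
        this.data = new float[height][width];
    }

    @Parallel(loops = {"y", "x"})
    void compute() {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                data[y][x] = y * width + x;   // illustrative independent per-element work
            }
        }
    }
}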

3.6.2 Automatic

This test consists of two checks to ensure there are no loop-carried dependencies:

Direct Writes. All direct writes must be to variables that are local to the loop body, i.e. the variable must not be live either at the start of the loop body,

or immediately following the loop.

Indirect Writes. Momentarily ignoring the effect of aliasing, we compare each

write with all accesses (including itself) to the same loose state (i.e. states

that are the same ignoring array indices, see Equation 3.5). To be sure

they don’t access the same location on different iterations, there must be an

increment variable at the same position in each list of indices (see Definition

9 of indirect accesses). The variable must also have been incremented by

the same amount in each access. Several examples are given in Example 3.9.

The effects of aliasing are managed by the AliasUsed class, which expands

each write to all states it may have affected. The may-alias analysis is

initialised using information provided by @Restrict annotations. When

marked as such, the programmer is asserting that the variable, and all

references reachable from it, do not alias with any other state. If the may-

alias is flagged inaccurate, then the loop is not accepted.

3.7 Code Generation

The top level algorithm for code generation deals with the difficulty inherent in

code generation for CUDA, which can fail due to both unsupported instructions

(e.g. exceptions, monitors and memory allocation) and calls to methods in classes

not supplied to the compiler.


short[] f(short[] data, short[] dummy) {
    if (Math.sqrt(4.0) < 4.0) {
        return data;
    } else {
        return dummy;
    }
}

void compute() {
    short[][] dummy = new short[height][];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            ...
            dummy[y] = data[y];
            f(dummy[y], data[y])[x] = ...;
        }
    }
}

(1) Correct Acceptance

while (i < LIMIT) {
    arr[i] = ...
    i++;
    arr[i] = ...
    i++;
}

(2) False Rejection

while (i < LIMIT) {
    arr[i] = ...
    i += 2;
    arr[i] = ...
    i--;
}

(3) Correct Rejection

Example 3.9: Examples of the automatic dependency check.

Before outputting code for any method or kernel, all of the static fields, classes

and methods on which it depends (Section 3.3.4) must be exported. This is

implemented by buffering all C++ code and recursing onto a new buffer whenever

a call is reached. Only when a method is completely exported, along with its

recursions, is its buffer flushed. As a result, some methods may be exported and

then never used, since they were exported for a kernel that later failed to export.

I will now describe how the C++ code generation itself works, before moving on to describe the ‘launcher’ method that is called in place of parallelisable

loops to execute the kernel. Details regarding naming conventions are given in

Appendix B.

Finally, an extension of Java’s PrintStream (cuda.Beautifier) indents code

based on the location of curly braces. This was done to facilitate debugging.


ILOAD 2 (x)
I2F
ALOAD 0 (this)
GETFIELD spacing:F
FMUL
LDC 1.5f
FSUB
FSTORE 5

(1) Bytecode

(2) Code Graph [diagram omitted]: Read x and Read spacing (via Read this) feed a multiply, 1.5f is subtracted, and the result is written to Cr.

const jint t0 = v2_INT;
const Object<Data_samples_Mandelbrot> t1 = v0_2101451235;
const jfloat t2 = DEVPTR(t1.device)->spacing;
const jfloat t3 = (jfloat) t0;
const jfloat t4 = t3*t2;
const jfloat t5 = 1.5f;
const jfloat t6 = t4-t5;
v5_FLOAT = t6;

(3) C++

Example 3.10: C++ code generation for float Cr = (x * spacing - 1.5f);

3.7.1 C++

Exporting the basic blocks to C++ is performed with a depth-first search of each

timeline entry in turn. This ensures that stateful instructions are executed in the

correct order, and that all arguments are generated before their use. Results from

instructions (i.e. Producers, see Table 3.1, page 22) are assigned to temporary

const variables. The names of these temporary variables are stored in a map so

that each instruction is only visited once. An example of a basic block and its

exported form is given in Example 3.10.

Control flow is exported using a combination of while, for recognised loops,

and goto, for all conditionals and loops not detected.

3.7.2 Kernel Invocation

The kernel is invoked on the graphics processor using the CUDA runtime library.

This requires dimensions for both the grid of blocks and the blocks themselves

(see Section 2.6.1). Dimensions are chosen using the following rules and heuristics

to maximise performance and ensure the execution succeeds. For each dimension $i$, the grid size is denoted by $g_i$, the block size by $b_i$ and the number of required iterations by $r_i$.

1. $\prod_i b_i$ is less than or equal to the maximum number of threads per block. This is governed by register and shared memory usage of the kernel.

2. b1 must be a multiple of the warp size (therefore the number of threads per

block will also be a multiple of the warp size), or less than a single warp.


Object<T>              Array<T>
  jobject object         jarray object
  T* host                T* host
  T* device              T* device
                         jsize length

Figure 3.6: Array and object type templates for on-GPU execution.

3. $b_{i+1} > 1 \implies b_i \geq r_i$

4. $g_i = \min\{r_i/b_i, G_i\}$ where $G_i$ is the maximum size of the grid in dimension $i$.

This means that the developer does not need to consider the specification of 

their specific graphics card, or have knowledge of the threading model.
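As an illustration only (the real implementation also applies rule 3 across multiple dimensions and accounts for the kernel's register usage), the one-dimensional choice might look like this in Java:

// Sketch of the 1D heuristics; maxThreadsPerBlock, warpSize and maxGridSize
// would come from the device properties. Assumes iterations >= 1.
final class LaunchConfig {
    final int blockSize;
    final int gridSize;

    LaunchConfig(long iterations, int maxThreadsPerBlock, int warpSize, int maxGridSize) {
        if (iterations < warpSize) {
            // Rule 2 (second clause): less than a single warp.
            blockSize = (int) iterations;
        } else {
            // Rules 1 and 2: the largest multiple of the warp size that fits in a block.
            blockSize = (maxThreadsPerBlock / warpSize) * warpSize;
        }
        // Rule 4: enough blocks to cover all iterations (rounded up here),
        // capped by the maximum grid size for the dimension.
        long blocks = (iterations + blockSize - 1) / blockSize;
        gridSize = (int) Math.min(blocks, maxGridSize);
    }
}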

3.7.3 Data Copying

Primitive types are transferred directly into the corresponding C++ types. In the

case of  doubles, data must first be switched to single precision if it is to be used

on cards without double precision support. For arrays of  doubles, a single check

is made to determine whether this is necessary in order to avoid unnecessary

overheads.

For reference types, the C++ types, Object<T> and Array<T> (Figure 3.6),

are defined using template meta-programming, enabling recursive types to be

built up (e.g. Array<Array<Object<struct foo> > >). The object identifier

allows objects to be ‘switched’ during GPU computation (for example, reversing

the rows of a 2D array), while the host pointer is used to record the location in

host memory where the object is held. It would have been possible to free this

memory while the GPU code executed, reallocating space to perform the export.

However, I felt that the further allocation overheads outweighed any benefit.

On import, each reference is placed in a map to ensure it is not imported

twice. If this did occur and both copies were modified, then only one set of 

changes would be preserved by the export. The map is also used as a list of 

objects that must be exported. Without this, an object that became unreachable

as a result of the kernel might not be exported, even though it may still be

reachable from elsewhere in the program.


Arrays

Arrays with primitive elements are imported using JNI functions that force

the JVM to provide direct access to the array without copying the data (GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical). This

avoids the need for two copies (first into a C buffer and then onto the device) at

the expense of halting the virtual machine’s garbage collector.

However, for arrays with reference-typed elements, each element must be read

separately and then imported appropriately, causing two copy stages.

Objects

Since CUDA devices support C structures, these can be used to represent Java

objects on the graphics processor. Unfortunately, populating these via the JNI API requires a function call to access each field of each object, which creates

noticeable overheads for large objects or large numbers of objects.

Memory Allocation

In order to minimise the number of memory allocations required, all device mem-

ory is allocated with a single allocation, and then divided up as needed. This

also results in improvements in copy performance (see Section 4.2.1).

Similarly, the host memory for an array of objects is allocated in a single

batch rather than one-by-one.

Statics

Rather than passing statics to the kernel as arguments, which must in turn be

passed on to any other methods called, they are stored in CUDA’s __constant__

memory. The read-only nature of this memory is not a problem, since a static will

never be directly written to (Section 3.5.2). There are also possible performance

gains as it allows caching by the GPU. The restricted size of __constant__ memory (64 KB for the card used in development) is unlikely to be an issue, since even on 64-bit machines, Array<T> only requires 28 bytes.⁴

⁴ JVM array lengths are defined as 32-bit integers even on 64-bit machines: jsize → jint → int.


3.8 Compiler Tool

The compiler is brought together in tools.Parallelise. This makes calls to the

stages of the compiler: import, the 3 stages of loop detection, kernel extraction

(which in turn performs dependency analysis and code generation) and finally

export. A description of the available arguments and their effects is given in

Appendix C. These are parsed by an open source library, “jopt simple”.⁵

The compiler also invokes the CUDA compiler (nvcc) automatically, so that a

developer does not need to understand the process of producing JNI compatible

libraries from CUDA code.

3.8.1 Feedback to the User

Compiler feedback is provided at a variety of levels,⁶ ranging from just fatal errors through to full debugging information. Logging messages are managed, like command line arguments, by an external library, “log4j”.⁷ Since logging is a standard problem in many applications, it was unnecessary to implement a custom set of logging classes. Log messages are categorised by the module of the compiler and a level.

As well as controlling the verbosity of messages, when the logging level is set

to debug , debugging output is added to the generated CUDA code. This then

provides information regarding the invocation sizes used (Section 3.7.2) and a

breakdown of the GPU execution time into the following stages:

1. Importing data from Java (using JNI) and allocating any extra host memory

required, as well as calculating how much device memory to allocate.

2. Allocating device memory and copying data to the GPU.

3. Executing the kernel on the GPU.

4. Copying data back from the GPU.

5. Exporting any data back to Java as required and freeing memory resources.

3.9 Summary

In this chapter, I have given a complete overview of the internals of the parallelis-

ing compiler. This includes the theoretical basis for the analysis— most notably

⁵ http://jopt-simple.sourceforge.net/
⁶ The possible levels are FATAL, ERROR, WARN, INFO, DEBUG and TRACE.
⁷ http://logging.apache.org/log4j/


CHAPTER 4

Evaluation

This chapter evaluates the compiler and presents a model of the overheads caused

by data copying to the GPU. An objective comparison with related work in the

literature is also provided. Descriptions of all sample code, including their origins,

are given in Appendix D.

4.1 Correctness

As described in Section 2.2, the compiler was developed by the gradual introduc-

tion of stages. It was checked that all sample code (see Appendix D) supported 

at the time continued to produce correct results after compilation.

Unit tests were also performed for the analysis stages (Table 4.1). These

consisted of gold standards (Appendix E) for each of the sample codes that could

be compared with the given results.

Compiler Stage        Tests
Loop Detection        Correct number.
Loop Trivialisation   Correct increments and bounds.
Kernel Extraction     Correct dimensions and copy in state. Safe copy out state.
Code Generation       Specific code for different aspects (e.g. objects).
Dependency Analysis   Safe results.

Table 4.1: Tests made for each compiler stage.

For the scope of target programs defined by C4 (see Section 2.1), all tests were passed. When moving outside this scope, specifically making use of object


inheritance, the code generation and dependency analysis stages both wrongly

assume that methods are final so that they can be exported, since it is not

possible to know what classes may later extend and override these methods. The

alternative of rejecting code generation in these cases would prevent many valid

compilations, since the final keyword is often omitted, even if applicable.

4.2 Performance

The performance benefits achievable using the compiler depend on the combina-

tion of the speedup due to parallel execution on the graphics processor, and the

overheads due to data copying.

The execution speedup is difficult to predict due to the differences between

GPU and CPU architectures. CPU execution time depends heavily on the

amount of instruction level parallelism that can be achieved through out-of-order

execution. Whilst the GPU is simpler in this respect, its performance can be

affected by the locality of memory accesses (due to coalescing, see Section 2.6.2)

and also the runtime effect of thread divergence (see Section 2.6.1). This sec-

tion therefore comments on the measured speedups rather than trying to predict

them.

The overheads are more predictable, allowing a model to be developed and

then tested against measurements made on the collection of sample code.

In order to achieve fair results, benchmarks were run on the dedicated machine (bing) with the GPU in dedicated mode. As far as possible, other programs were

terminated before benchmarking to avoid contention for CPU time. Benchmarks

were repeated 10 times and the median of these used. All execution timings were

made against wall clock time. Using CPU time would have given biased results,

since time spent on the GPU appears as I/O and would not have been included.

4.2.1 Model of Overheads

The overheads related to off-loading computation onto the graphics processor can

be split into four categories as in Section 3.8.1: importing from Java; copying to the GPU; copying back from the GPU; and exporting to Java. The operations within these stages (Section 3.7.3) suggest the following costs. In general, I expect these to behave linearly (i.e. an initial latency $l_*$, plus a further cost $n t_*$ depending on the size $n$ of the operation).

Stopping Garbage Collection (lg). Since the ‘critical’ array access JNI func-

tions are used, there will be a constant cost for stopping garbage collection.


Stage      Overhead Time
Import     $l_g + l_r \sum_p A(p) + S\,l_s$            for all parameters $p$, where $S$ is the number of statics used
Copy On    $l_d \sum_p R(p) + t_d \sum_p M(p)$         for all parameters $p$
Copy Off   $l_h \sum_p R(p) + t_h \sum_p M(p)$         for copy off parameters $p$
Export     $l_f + l_w \sum_p A(p) + t_w \sum_p E(p)$   for copy off parameters $p$

Table 4.2: Expected timings for overhead stages according to model.

JNI Reads (lr). For each read from Java, there will be a constant cost. This

also applies to the ‘critical’ array accesses, since no copy is performed.

Constant Setting (ls). When statics are used, CUDA constant memory must

be set.

Copies (ld, td, lh and th). Copies in each direction are likely to have different

bandwidths. I ignore the allocation cost at the beginning of the ‘copy on’

stage, since this will be negligible compared to the copy.

JNI Writes (lw and tw). The ‘critical’ array access functions allow changes to

be aborted, suggesting that a copy-on-write may occur internally. This is

therefore modelled as a linear cost.

Freeing (lf ). Finally, there is the cost of freeing the used device memory.

This gives the expressions in Table 4.2 for the overheads associated with each

of the four stages. These rely on knowing certain values for each parameter p of 

the kernel.

• The number of accesses A( p) required to read or write the parameter from

Java.

• The total amount of memory E ( p) that is exported by these accesses.

• The number of memory regions R( p) that this data is spread out over.

• The total amount of memory M ( p) that the data occupies once in the C++

code (this is higher due to the representations shown in Figure 3.6).

These can be calculated recursively based on the type of the parameter (and

array lengths), as shown in Equations 4.1 to 4.4.


Accesses

$$\begin{aligned} A(\text{primitive}) &= 0 \\ A(\text{array of primitive}) &= 1 \\ A(\text{array of } \tau) &= \text{length} \cdot (1 + A(\tau)) \\ A(\text{object } \tau) &= \textstyle\sum_{\tau' \in \text{fields}(\tau)} (1 + A(\tau')) \end{aligned} \quad (4.1)$$

Exported Memory

$$\begin{aligned} E(\text{primitive}) &= \text{sizeof}(\text{primitive}) \\ E(\text{array of } \tau) &= \text{sizeof}(\text{pointer}) + \text{length} \cdot E(\tau) \\ E(\text{object } \tau) &= \text{sizeof}(\text{pointer}) + \textstyle\sum_{\tau' \in \text{fields}(\tau)} E(\tau') \end{aligned} \quad (4.2)$$

Memory Regions

$$\begin{aligned} R(\text{primitive}) &= 0 \\ R(\text{array of object } \tau) &= 2 + \text{length} \cdot \textstyle\sum_{\tau' \in \text{fields}(\tau)} R(\tau') \\ R(\text{array of } \tau) &= 1 + \text{length} \cdot R(\tau) \\ R(\text{object } \tau) &= 1 + \textstyle\sum_{\tau' \in \text{fields}(\tau)} R(\tau') \end{aligned} \quad (4.3)$$

Total Memory

$$\begin{aligned} M(\text{primitive}) &= \text{sizeof}(\text{primitive}) \\ M(\text{array of } \tau) &= 3 \cdot \text{sizeof}(\text{pointer}) + 4 + \text{length} \cdot M(\tau) \\ M(\text{object } \tau) &= 3 \cdot \text{sizeof}(\text{pointer}) + \textstyle\sum_{\tau' \in \text{fields}(\tau)} M(\tau') \end{aligned} \quad (4.4)$$
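A Java sketch of how the first two of these quantities could be computed over a simplified type description (this illustrates Equations 4.1 and 4.2; it is not the compiler's code, array lengths would in practice be read from the values being copied, and the pointer size is assumed to be 8 bytes):

import java.util.List;

final class OverheadModel {
    interface TypeDesc { }
    static final class Primitive implements TypeDesc {
        final int size;                       // sizeof(primitive)
        Primitive(int size) { this.size = size; }
    }
    static final class ArrayOf implements TypeDesc {
        final TypeDesc element; final int length;
        ArrayOf(TypeDesc element, int length) { this.element = element; this.length = length; }
    }
    static final class ObjectOf implements TypeDesc {
        final List<TypeDesc> fields;
        ObjectOf(List<TypeDesc> fields) { this.fields = fields; }
    }

    static final int POINTER = 8;             // assumed sizeof(pointer) on a 64-bit host

    // A(p): JNI accesses needed to read or write the parameter (Equation 4.1).
    static long accesses(TypeDesc t) {
        if (t instanceof Primitive) return 0;
        if (t instanceof ArrayOf) {
            ArrayOf a = (ArrayOf) t;
            if (a.element instanceof Primitive) return 1;   // one 'critical' access
            return (long) a.length * (1 + accesses(a.element));
        }
        ObjectOf o = (ObjectOf) t;
        long sum = 0;
        for (TypeDesc f : o.fields) sum += 1 + accesses(f);
        return sum;
    }

    // E(p): memory exported back to Java by these accesses (Equation 4.2).
    static long exported(TypeDesc t) {
        if (t instanceof Primitive) return ((Primitive) t).size;
        if (t instanceof ArrayOf) {
            ArrayOf a = (ArrayOf) t;
            return POINTER + (long) a.length * exported(a.element);
        }
        ObjectOf o = (ObjectOf) t;
        long sum = POINTER;
        for (TypeDesc f : o.fields) sum += exported(f);
        return sum;
    }
}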

Measurement of Copy Parameters

As a preliminary test of the copy on  and copy off  models, a CUDA program that

measured the time taken to copy N  arrays of  N  doubles (i.e. 8N  bytes) was

written in C++. It became apparent that the model only holds when the copies

are within a single device memory allocation. The test program was therefore

extended to allow for a variety of memory locations both on the host and the

device. These were as follows:

Separate Each array is allocated separately with a call to the relevant memory

allocator.

Sequential The memory for all arrays is allocated at once, and then the locations

allocated sequentially from this pool.

Non-sequential Again the memory for all arrays is allocated at once, but the

regions of memory are allocated alternately from the start and end of this

pool. This was designed to simulate the case where the order of the copies

could not be predicted and would not be ‘in-order’.


[Plot omitted: ‘Comparison of Copies’, time (ms) against N, comparing the Separate and Single Allocation strategies, with a cubic fit and the model prediction overlaid.]

Figure 4.1: Effect on copy performance (host-to-device) of single vs. multiple allocations.

As shown in Figure 4.1, the predicted model ($N l_d + 8N^2 t_d$) is only followed in the two cases of single allocation, with the separate case exhibiting cubic behaviour. This also shows the improvement in copy performance that can be achieved by performing just a single allocation. An appropriate modification was therefore made to the compiler.

The model parameters given by gnuplot's fitting function were $l_d = (8.07 \pm 0.11) \times 10^{-3}$ ms and $t_d = (1.614 \pm 0.002) \times 10^{-6}$ ms byte⁻¹. The respective values for device-to-host copy were $l_h = (8.56 \pm 0.15) \times 10^{-3}$ ms and $t_h = (2.548 \pm 0.003) \times 10^{-6}$ ms byte⁻¹.

NVIDIA provide a similar tool in their SDK that measures memory copy performance. This gives results which can then be used to estimate $t_d$ and $t_h$. Timing single copies does not give sufficient accuracy to measure the latencies, so the values for $l_d$ and $l_h$ are taken from above. However, as shown in Figure 4.2, for small copies (< 50 KB) the model does not hold, with $t_d$ and $t_h$ taking varying values as shown in Figure 4.3. For the remainder of the evaluation, I continue to assume the simple linear model, but split each parameter into ‘small’ and ‘large’ values, for below and above 50 KB respectively (i.e. $t_{d,\text{small}}$, $t_{d,\text{large}}$, ...).
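As a small illustration of how this piecewise model is applied (the class and field names are mine, and the threshold is the 50 KB boundary identified above):

// Piecewise-linear estimate of one copy: one (latency, rate) pair below the
// threshold, another above it. Fitted values come from the measurements here
// and in Section 4.2.2.
final class CopyModel {
    final double latencyMs;            // l_d or l_h
    final double rateSmallMsPerByte;   // t_{d,small} or t_{h,small}
    final double rateLargeMsPerByte;   // t_{d,large} or t_{h,large}
    static final long THRESHOLD_BYTES = 50 * 1024;

    CopyModel(double latencyMs, double rateSmall, double rateLarge) {
        this.latencyMs = latencyMs;
        this.rateSmallMsPerByte = rateSmall;
        this.rateLargeMsPerByte = rateLarge;
    }

    double copyTimeMs(long bytes) {
        double rate = (bytes < THRESHOLD_BYTES) ? rateSmallMsPerByte : rateLargeMsPerByte;
        return latencyMs + rate * bytes;
    }
}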


[Log-log plot omitted: copy time (ms) against size (bytes) for host-to-device and device-to-host transfers, compared with the linear models $l_d + N t_d$ and $l_h + N t_h$.]

Figure 4.2: Comparison of measured performance with model (using CUDA SDK).

[Plot omitted: time per byte (ms/byte) against copy size (bytes), showing the values of $t_d$ and $t_h$ implied by each measurement assuming constant latency.]

Figure 4.3: Values of $t_d$ and $t_h$ for measurements (using CUDA SDK).


4.2.2 Component Benchmarks

Here I present a number of micro-benchmarks that compute $\sin^2 x + \cos^2 x$ over a sequence of random numbers (length $N$). Each version of the benchmark stores the sequence in a different manner. This allows the overheads model to be tested on code produced by the compiler. It also evaluates whether speedups can be achieved when very little computation is performed. The versions produced were:

Baseline The baseline version stores the numbers in a local 1D array.

Statics In this case, the numbers are stored as a static variable.

Objects Each number is placed inside a class, and the computation is performed

as a method of this class.

Two Dimensions The numbers are stored in a rectangular array with roughly the same number of elements. The dimensions for the array were chosen as $\sqrt{N} \times \sqrt{N}$.

Using the model in the previous section, the overhead times for each of these

versions can be predicted—as shown in Table 4.3. Measurements for a range

of  N  can then be used to assess whether the model fits accurately and to give

estimates for its parameters. Due to the nature of the parameters, it is necessary

to consider the complete data set (e.g. in the static case lg and ls could be varied

arbitrarily provided lg + ls gave a suitable value). The measured values are shown

in Table 4.4 and are reasonably consistent with those measured in the previous

section. The slight shift in copy latencies may be due to an overlooked difference

between the C++ test program in the previous section, and the copies performed

for offloading Java. Recalculating the rates in Figure 4.3 using the new values of 

ld and lh gives values that coincide with the rates calculated here. An indication

of the quality of the fits is given by the graphs in Figure 4.4.

The benchmark timings also give an indication of the execution speedup. The

results are summarised in Table 4.5. The performance when executed on the CPU

was the same for all versions.

The baseline benchmark is encouraging as it shows that even when little com-

putation is performed on the GPU, the overhead associated with transferring data

to the graphics card is not prohibitive. The statics version performs similarly, as

would be expected, with a slightly improved speedup possibly due to the array

pointer being held in constant  memory which can be cached (see Section 2.6.2).

When an object  array is considered, the overheads (although vastly improved

by using single memory allocation) make offloading to the GPU impractical. The


            Import                              Export
Baseline    $l_g + l_r$                         $l_f + l_w + (8 + 8N)t_w$
Statics     $l_g + l_r + l_s$                   $l_f + l_w + (8 + 8N)t_w$
Objects     $l_g + 2N l_r$                      $l_f + 2N l_w + (8 + 16N)t_w$
2D          $l_g + 2 l_r \sqrt{N}$              $l_f + 2\sqrt{N} l_w + (8 + 8\sqrt{N} + 8N)t_w$

            Copy On (for Copy Off replace $l_d$ and $t_d$ with $l_h$ and $t_h$)
Baseline    $l_d + (28 + 8N)t_d$
Statics     $l_d + (28 + 8N)t_d$
Objects     $2l_d + (28 + 32N)t_d$
2D          $(1 + \sqrt{N})l_d + (28 + 28\sqrt{N} + 8N)t_d$

Table 4.3: Model of overheads for component benchmark versions.

           $l_g$        $l_r$        $l_s$        $l_d$        $l_h$
ms         7.37×10⁻³    3.76×10⁻⁴    2.08×10⁻²    1.05×10⁻²    1.04×10⁻²

           $l_f$        $l_w$
ms         1.43×10⁻¹    2.40×10⁻⁴

           $t_{d,small}$   $t_{d,large}$   $t_{h,small}$   $t_{h,large}$   $t_w$
ms/byte    9.56×10⁻⁷       6.82×10⁻⁷       1.75×10⁻⁶       1.23×10⁻⁶       1.97×10⁻⁹

Table 4.4: Model parameters, as measured using component benchmarks.

[Grid of plots omitted: for each benchmark version (Baseline, Statics, Objects, 2D), time against N is shown for the Import, Copy On, Copy Off and Export stages, with the fitted model overlaid.]

Figure 4.4: Fit of model (green) to component benchmarks.


Version    Execute Only    Inc. Overheads
Baseline   192             40
Statics    239             41
Objects    220             0.18
2D         229             22

Table 4.5: Speedup factors for the component benchmarks.

[Plots omitted: measured Import, Copy On, Copy Off and Export times (ms) against N for the Series benchmark, compared with the model.]

Figure 4.5: Fit of model to Fourier Series benchmark, using previously calculated parameters.

inaccuracy of the model during the import stage may be due to unexpected

overheads associated with the map used for listing references. Further work is

needed to isolate this and make suitable improvements.

The overheads in the two-dimensional  case are also much reduced by the

single memory allocation and this improves the overall speedup from 5.6 to 22.

4.2.3 Java Grande Benchmark Suite [7]

The Java Grande benchmark suite was used as a source of external unbiased code

that could be passed to the compiler. The sequential code was annotated and

fed to the compiler. Timings were then compared between the GPU and original

versions.

A full description of the suite is given in Appendix D—including an explana-

tion of which benchmarks were used. Here I give the results of the Series and

Crypt benchmarks, relating these to the hypothesised overheads model, and also

the hardware characteristics of CUDA.

Series: Fourier Series of  (x + 1)x

This benchmark exhibited the biggest speedup factor (187 overall). The break-

down of the execution time shows that only 0.5% of the GPU time was due to

overheads. These overheads were generally in agreement with those predicted

using the parameters measured in the previous section (Figure 4.5).


[Plots omitted: measured Import, Copy On, Copy Off and Export times (ms) against N for the Mandelbrot benchmark, compared with the model.]

Figure 4.6: Fit of model to Mandelbrot benchmark, using previously calculated parameters.

Crypt: IDEA Encryption/Decryption

Whilst only using integer operations, the graphics processor execution still

achieves a significant speedup factor (8.7). This is again helped by the relatively

small amount of data required for computation. The lower factor is probably due

to the CPU performing better on integer benchmarks.

4.2.4 Mandelbrot Set Computation

The Mandelbrot set is defined as the set of complex values c such that the absolute

value of  zn remains bounded for any value of  n, where zn is defined as:

$$z_n = \begin{cases} 0 & \text{if } n = 0 \\ z_{n-1}^2 + c & \text{if } n > 0 \end{cases} \quad (4.5)$$

For computation, we must define a limit on the size of  n (the iteration limit )

and also a bound on values of  zn. Here the bound is set as 4.0 (as used in the

original code). The iteration limit means that it is possible to vary the amount

of computation performed on the data, altering the significance of the overheads.
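For reference, a simplified version of the per-point escape-time loop looks as follows (an illustrative reduction of the sample code, using the 4.0 bound; the value returned for points that never escape is an assumption here):

// Simplified escape-time computation for one grid point c = (cr, ci).
static short escapeTime(float cr, float ci, int iterations) {
    float zr = 0.0f, zi = 0.0f;
    for (int i = 0; i < iterations; i++) {
        float zr2 = zr * zr;
        float zi2 = zi * zi;
        if (zr2 + zi2 > 4.0f) {
            return (short) (255 * i / iterations);   // escaped: shade by iteration count
        }
        float newZi = 2.0f * zr * zi + ci;           // z_n = z_{n-1}^2 + c
        float newZr = zr2 - zi2 + cr;
        zr = newZr;
        zi = newZi;
    }
    return 0;                                        // assumed: still bounded after the limit
}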

Again, the measured parameters from Section 4.2.2 were used to predict the

overheads, giving very accurate results (Figure 4.6). This demonstrates that the

model does not suffer from overfitting.

Turning to the speedup during the actual execution portion, I first consider

the case where the iteration limit is fixed at 250 (as in [16]) and the grid size is altered. Figure 4.7 plots the speedup achieved on the execute portion, and also

the overall speedup when overheads are included. The reason the execute-only

speedup is lower in this benchmark could be due to the effect of thread divergence

(described in Section 2.6.1). This means that the calculation for each pixel takes

as long as the ‘slowest pixel’ in its warp.

Similarly, the variation in speedup can be investigated as the iteration limit is

altered. This is done for a fixed size computation (8000 × 8000 grid) and plotted


[Plot omitted: speedup factor (execute only and overall) and percentage overhead against grid size N.]

Figure 4.7: Speedups and overhead for Mandelbrot benchmark with fixed iteration limit (250).

in Figure 4.8. Since the overheads are fixed, they become less significant as the

number of iterations rises, with the overall speedup tending towards the execute

speedup.

4.2.5 Conway’s Game of Life

Conway’s Game of Life is a cellular automaton. The evolution of each cell in a

2D grid requires independent computation (see Appendix D for details).

The simulation of such a ‘game’ provides an interesting benchmark for parallel computing, since there is a trade-off between the naïve computation that is easy to parallelise, and more sophisticated algorithms that are less suited. In particular, I will consider the HashLife implementation [12] that accelerates simulation by recording the evolution of subgrids to avoid later recomputation.

As shown in Figure 4.9, the naïve algorithm running on the GPU in fact runs slower than on the CPU. Both are much slower than HashLife. One reason for

this is that all data is copied back and forth from the graphics card on each

iteration, even though the data is not used by the host in between each kernel


[Plot omitted: speedup factor (execute only and overall) and percentage overhead against the iteration limit.]

Figure 4.8: Speedups and overhead for Mandelbrot benchmark with fixed grid size (8000 × 8000).

invocation. Other work [16] introduces multi-pass loops, where the loop body

only consists of GPU code, allowing data to be left on the GPU. In the case of 

this specific benchmark, a more advanced approach would be needed, since a new

array is used for each iteration rather than double buffering.

A second issue is the manner in which a cell’s neighbours are counted. Since

the world is stored as an array of booleans, there is an if ...else control flow

structure for each neighbour. This suggests that the execution may be suffering

from thread divergence.

4.2.6 Summary

These results show that significant performance improvements are possible over

a range of benchmarks. Whilst accurate predictions of execute speedups have

not been possible, the factors measured are consistent with those expected given

the number of cores on the GPU and also the number of double precision units

available. Both the execute and overall speedups for each benchmark are sum-

marised in Table 4.6. These are combined using the geometric mean (see [9] for


[3D plot omitted: overall time (ms) against grid size and number of generations for the GPU and CPU versions.]

Figure 4.9: Overall times for simulation of Conway’s Game of Life.

Benchmark                        Double Precision   Execute   Overall
Baseline                         ✓                  192       40
Statics                          ✓                  239       41
Objects                          ✓                  220       0.18
2D                               ✓                  229       22
Mandelbrot (250 iterations)      ×                  83        39
Mandelbrot (8000 × 8000 grid)    ×                  106       79
Life                             ×                  –         –¹
Series                           ✓                  189       187
Crypt                            ×                  –         8.7
(Geometric mean)                                    182.4     20.6

Table 4.6: Summary of speedup factors.

reasons why this is appropriate) to give an average speedup factor of 20.6.

The overheads model has also been evaluated, with the parameters measured

from the component benchmarks giving accurate predictions of the overheads in

other cases. However, some aspects are not fully understood (i.e. GPU behaviour

for small copies, and object import time).

¹ The Life speedup factors were all very low (≪ 1) but varied considerably. Therefore, there was not a suitable single value.


Benchmark                       Series (Floating Point)            Crypt (Integer)
Data Size                       10⁴       10⁵       10⁶            3·10⁶    2·10⁷    5·10⁷
CPU on bing (ms)                17971     182894    2878469        414      2190     5344
This Project on bing (ms)       99        968       9358           41       245      545
JCUDA on Tesla C1060 (ms)       110       1040      10140          20       160      450

Table 4.7: Comparison of Java Grande benchmark timings with JCUDA.

4.3 Accuracy of Dependency Analysis

Using the same gold standards as were used for testing (i.e. @Parallel annota-

tions), it was possible to measure the accuracy of the automatic analysis. This

showed that an accuracy of 85% (29/34) was achieved for the range of benchmarks.

In cases where the check was too conservative, the behaviour could be explained

by the may-alias and checking algorithms.

4.4 Comparison with Existing Work

This project’s approach was compared with other related work in Section 1.3. In

terms of performance, published results allow some quantitative comparisons to

be made regarding the speedups achieved. Unfortunately, the JikesRVM work

[16] uses a much older card (GeForce 7800) so is incomparable. JCUDA [25]

uses a similar card (NVIDIA Tesla C1060, 1.3GHz, 240 cores) to that of  bing

(NVIDIA GTX 260, 1.24GHz, 216 cores). Their work ports the Java Grande

benchmarks [7] to C++ so that the GPU performance can be compared to that

of raw Java. My results for the Series and Crypt benchmarks (Section 4.2.3)

are broadly similar as shown in Table 4.7.

Turning to the automatic dependency analysis, neither [16] nor javab [4]

give accuracy figures for their analyses (javab instead compares the number of 

parallelisable loops with the total). However, it would be expected that the

approach of [16] could use runtime information to produce more accurate results

than either this work or javab.

4.5 Summary

In this chapter, three key aspects of the project have been evaluated. First,

tests were used to demonstrate compiler correctness within the required scope.

A model for overheads was then developed and tested. It was found to be accu-


rate with large copies, but the bandwidth to the card behaved in a manner not

fully understood with small copies. The modelling also indicated a significant im-

provement that could be made to the compiler. Execution speedups were found

to be in line with what would be expected based on the hardware architecture.

Finally, investigations were made into the accuracy of the automatic analysis.

Some quantitative comparisons with existing work have also been made, adding

to those in Section 1.3.


CHAPTER 5

Conclusions

This dissertation has highlighted the key aspects of the project and the compiler

that it produced. It has explained the existing work and knowledge that was used

(Chapter 2), and how this allowed a novel compiler to be developed (Chapter 3).

Evaluation of the compiler (Chapter 4) has shown it to both maintain correctness

and provide significant speedups in the majority of sample cases. This chapter

assesses the project formally with respect to its goals, and suggests future work

to improve the compiler.

5.1 Comparison with Requirements

Ultimately, the project should be judged by whether it meets the requirements

that were elaborated from the project proposal in Section 2.1. The evaluation

allows each of these to be considered in this section.

The tests that were carried out during the project (Section 4.1) showed that

the compiler maintained correctness whenever it succeeded in compiling. A

marginal case is exhibited when the graphics processor's memory is exceeded: this
causes the JVM to exit gracefully with a suitable error message. Use of recursion
within parallel loops is a notable case where compilation fails, due to restrictions
of CUDA. This evidence demonstrates that the project meets Requirement C1.

The various performance benchmarks that have been evaluated (Section 4.2)

show a clear benefit from using the compiler, satisfying Requirement C2.

The annotations that the compiler uses to assess code (@Parallel and

@Restrict) are both unobtrusive and transparent to the standard Java com-

piler. Transparency allows source code containing these annotations to be built

normally for environments without compatible GPUs. The annotations also al-


low explicit marking of parallel for loops of multiple dimensions. Therefore,

Requirement C3 is met. The nature of  @Parallel also means that the loop

bound detection extension (E1) was fully implemented.
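As an illustration (a constructed example rather than one of the benchmarks; the annotation element name follows the usage read back in Appendix G, although the exact element syntax shown here is an assumption), a two-dimensional loop nest might be marked as follows:

import tools.Parallel;

public class Scale {
  private final float[][] data;

  public Scale(float[][] data) {
    this.data = data;
  }

  // The annotation names the loop indices that may be run in parallel;
  // here both dimensions of the nest are marked.
  @Parallel(loops = {"y", "x"})
  public void multiply(float factor) {
    for (int y = 0; y < data.length; y++) {
      for (int x = 0; x < data[0].length; x++) {
        data[y][x] = data[y][x] * factor;
      }
    }
  }
}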

The scope of code that can be compiled for GPU execution meets the re-

quirements of C4, although recursive code cannot be used due to restrictions in

current GPU architectures. Extension E4 for support of objects has also been

completed up to the limits of the architecture (i.e. no inheritance or allocation).

The compiler provides user feedback, giving reasons whenever parallel com-

pilation fails. This avoids unexplained performance changes when utilising the

automatic dependency analysis, satisfying C5.

The sample code (Appendix D) used in the evaluation has been fully de-

scribed, meaning that all claims can be checked objectively. This fulfils the final

core requirement, C6.

The implementation of simple automatic dependency checking (Section 3.6.2)

means that Extension E2 has also been completed.

5.2 Future Work

There are many additions and improvements that could be made to further de-

velop the compiler, of which I describe a few here.

5.2.1 Further Hardware Support

With the release of NVIDIA's new Fermi cards [21] and CUDA 3.0, it is now

possible to provide a more complete set of features for GPU execution, including

recursion and more complete object support. While recursion would be supported

automatically via nvcc, some features would require more work. Support for

allocations might be possible in some cases by pessimistically allocating space for

all possible allocations, and then freeing unused blocks after the kernel invocation.

Support for multiple graphics cards would also be useful. However, exporting

arrays back to Java, after different portions have been modified on different cards,

may cause difficulties, and extra overheads.

5.2.2 Further Optimisations

There is certainly scope for further transformations within the compiler to im-

prove performance. For example, when copying objects onto the device, it makes

sense only to copy fields that will be used. As mentioned in the original ex-


for (int i = 0; i < length; i++) {
  if (arr[i] < minimum) minimum = arr[i];
}

(a) Sequential min

[(b) Parallel reduction: a tree of pairwise comparisons in which each level
halves the number of candidate values until a single minimum remains.]

Figure 5.1: Minimum finding algorithms

tensions, there may also be optimisations that neither nvcc nor the JVM can

perform, such as loop invariant code motion (see Section 2.1, E5).
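A hedged illustration of the kind of rewrite meant here (a generic example, not taken from the sample code): an expression that is identical on every iteration can be hoisted so that the loop body, and hence any kernel extracted from it, does less redundant work.

public class InvariantExample {
  // Before: Math.sqrt(limit) is recomputed on every iteration of the loop.
  static void scaleNaive(double[] values, double[] results, double limit) {
    for (int i = 0; i < values.length; i++) {
      results[i] = values[i] / Math.sqrt(limit);
    }
  }

  // After loop-invariant code motion: the invariant expression is evaluated
  // once, so the loop body that would become the GPU kernel does less work.
  static void scaleHoisted(double[] values, double[] results, double limit) {
    double root = Math.sqrt(limit);
    for (int i = 0; i < values.length; i++) {
      results[i] = values[i] / root;
    }
  }
}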

As exhibited in the Game of Life benchmark (Section 4.2.5), support for multi-

pass loops (as implemented in [16]) could also improve performance dramatically

in some iterative algorithms.

5.2.3 Further Automatic Detection

Given the undecidability of the automatic parallelisation, there will always be

scope for introduction of more accurate and sophisticated tests. However, an

alternative might be to leave a CPU version of the code in the class, selecting

which to use at runtime. This could be based, not just on correctness, but also

on whether the number of iterations justifies the expected overheads.

There is also the potential for ‘pattern matching’ transformations to yield

significant benefits (albeit in a limited number of cases). For example, common

implementations of minimum, maximum and sum (Figure 5.1a) are not suitable

for parallel execution as written; however, they can be sped up using parallel reduction (Figure 5.1b).
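The following is a minimal Java sketch of the reduction idea (illustrative only, and assuming for brevity a non-empty array whose length is a power of two): the sequential loop of Figure 5.1a carries a dependency through minimum, whereas each pass of the tree-shaped version below has independent iterations and is therefore a candidate for parallel execution.

import java.util.Arrays;

public class MinReduction {
  // Computes the minimum by repeatedly halving the array: on each pass,
  // element i is combined with element i + stride. The iterations of the
  // inner loop are independent, so each pass could run as a parallel kernel.
  static int parallelStyleMin(int[] input) {
    int[] arr = Arrays.copyOf(input, input.length);  // keep the caller's data intact
    for (int stride = arr.length / 2; stride > 0; stride /= 2) {
      for (int i = 0; i < stride; i++) {
        arr[i] = Math.min(arr[i], arr[i + stride]);
      }
    }
    return arr[0];
  }
}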

5.3 Final Conclusions

Overall, I believe that the compiler is able to offer a higher level of abstraction

than other attempts (Section 1.3) without sacrificing performance (Section 4.4).


Bibliography

[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers : Principles,

Techniques, & Tools, Second Edition . Addison-Wesley, second edition, 2007.

[2] B. Alpern, A. Cocchi, D. Lieber, M. Mergen, and V. Sarkar. Jalapeno-

a compiler-supported Java virtual machine for servers. In Workshop on 

Compiler Support for Software System (WCSSS 99), volume 14, pages 87–

94. Citeseer, 1999.

[3] B. Amedro, V. Bodnartchouk, D. Caromel, C. Delbe, F. Huet, and

G. Taboada. Current State of Java for HPC. Technical Report RT-0353,

INRIA, 2008.

[4] A. Bik and D. Gannon. javab - A prototype bytecode parallelization tool. In

ACM Workshop on Java for High-Performance Network Computing , 1998.

[5] B. Boehm. A spiral model of software development and enhancement. SIG-

SOFT Softw. Eng. Notes, 11(4):14–24, 1986.

[6] E. Bruneton, R. Lenglet, and T. Coupaye. ASM: a code manipulation tool to

implement adaptable systems. Adaptable and extensible component systems,

2002.

[7] J. M. Bull, L. A. Smith, M. D. Westhead, D. S. Henty, and R. A. Davey.

A methodology for benchmarking Java Grande applications. In JAVA ’99:

Proceedings of the ACM 1999 conference on Java Grande, pages 81–88, New

York, NY, USA, 1999. ACM.

[8] L. Damas and R. Milner. Principal type-schemes for functional programs. In

POPL ’82: Proceedings of the 9th ACM SIGPLAN-SIGACT symposium on 

Principles of programming languages, pages 207–212, New York, NY, USA,

1982. ACM.


[9] P. Fleming and J. Wallace. How not to lie with statistics: the correct way

to summarize benchmark results. Communications of the ACM , 29(3):221,

1986.

[10] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design patterns: elements

of reusable object-oriented software. Addison-Wesley Reading, MA, 1995.

[11] M. Gardner. Mathematical games: The fantastic combinations of John Conway's
new solitaire game 'Life'. Scientific American, 223(4):120–123, 1970.

[12] R. Gosper. Exploiting regularities in large cellular spaces. Physica D Non-

linear Phenomena , 10:75–80, 1984.

[13] G. A. Kildall. A unified approach to global program optimization. In POPL

’73: Proceedings of the 1st annual ACM SIGACT-SIGPLAN symposium on 

Principles of programming languages, pages 194–206, New York, NY, USA,

1973. ACM.

[14] A. Klockner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. Py-

CUDA: GPU Run-Time Code Generation for High-Performance Computing.

Arxiv preprint arXiv:0911.3456 , 2009.

[15] W. Landi. Undecidability of static analysis. ACM Letters on Programming 

Languages and Systems, 1(4):323–337, 1992.

[16] A. Leung, O. Lhotak, and G. Lashari. Automatic parallelization for graph-

ics processing units. In Proceedings of the 7th International Conference on 

Principles and Practice of Programming in Java , pages 91–100, 2009.

[17] J. Lewis and U. Neumann. Performance of Java versus C++. Computer 

Graphics and Immersive Technology Lab, University of Southern California,

Jan , 2003 (updated 2004).

[18] S. Liang. Java Native Interface 6.0 Specification . Sun, 1999.

[19] T. Lindholm and F. Yellin. The Java(TM) Virtual Machine Specification (2nd Edition). Prentice Hall, 1999.

[20] NVIDIA. Compute Unified Device Architecture. Programming Guide, Au-

gust 2009. Version 2.3.1.

[21] NVIDIA. Fermi: NVIDIA’s Next Generation CUDA Compute Architecture.

White paper, October 2009.


[22] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and

T. Purcell. A survey of general-purpose computation on graphics hardware.

In Computer Graphics Forum , volume 26, pages 80–113, 2007.

[23] L. Smith and M. Bull. Java for High Performance Computing.

[24] H. Sutter. The free lunch is over: A fundamental turn toward concurrency

in software. Dr Dobb’s Journal , March 2005.

[25] Y. Yan, M. Grossman, and V. Sarkar. JCUDA: A Programmer-Friendly

Interface for Accelerating Java Programs with CUDA. In Proceedings of the

15th International Euro-Par Conference on Parallel Processing , 2009.


APPENDIX A
Dataflow Convergence Proofs

In this appendix, proofs are given for the convergence of the iterative computation

of the various dataflow analyses, as described in Sections 2.7.1 to 2.7.4.

A.1 General Dataflow Analysis

As stated in Section 2.7.1, for an analysis over the complete lattice (X, ⊑), with
transfer function F_b : X → X, convergence is guaranteed if (X, ⊑) is of finite
height and F_b is monotone.

The proof is based on that given in [1, pp. 627 to 628], but is adjusted so

that each calculation makes use of the latest result, rather than always looking

to the previous iteration.

Definition 11. A function F : X → X is monotone if a ⊑ b =⇒ F(a) ⊑ F(b).

Theorem 2. If F_b (for all b) is monotone and the lattice is of finite height, then
the dataflow analysis converges.

Proof. For all b, we consider the value of R(b) on the i-th iteration (i.e. R_i(b)). If
we can show that ∀b. R_i(b) ⊑ R_{i+1}(b), then the iterative calculation must converge,
since we must either reach a fixed point, or all R(b) will eventually equal the upper
bound on the lattice (since the lattice has finite height, there are no infinite chains).

For the case where children(b) = ∅, R_i(b) is constant, so trivially R_i(b) ⊑ R_{i+1}(b).
We consider the other cases by induction.

Base Case: Since we initialise R_0(b) as ⊥, no matter what value R_1(b) takes,
we have that R_0(b) ⊑ R_1(b).


Induction Step: Now we consider R_{i+1}(b), assuming ∀x. (R_{i-1}(x) ⊑ R_i(x)).
Without loss of generality, we can also presume that there is an ordering of
calculations within the iteration, although this is not necessarily the same for
all iterations. We denote the set of blocks or instructions calculated before b as
calc(b). Using an inner induction proof, we can now show that ∀b. R_i(b) ⊑ R_{i+1}(b).

Inner Base Case: For calc(b) = ∅, the value of R_{i+1}(b) is calculated as:

    R_{i+1}(b) = F_b( ⊔_{c ∈ children(b)} R_i(c) )

We also know that R_i(b) was calculated as:

    R_i(b) = F_b( ⊔_{c ∈ children(b)} (R_i(c) ⊓ R_{i-1}(c)) )

By assumption and reflexivity, we have:

    ∀c ∈ children(b). R_{i-1}(c) ⊓ R_i(c) ⊑ R_i(c)

Therefore, since F_b and both join and meet1 are monotone, we have that
R_i(b) ⊑ R_{i+1}(b) if calc(b) = ∅.

Inner Induction Step: Now we assume that ∀x ∈ calc(b). R_i(x) ⊑ R_{i+1}(x).
In this case, R_{i+1}(b) is calculated as (using calc(b) as a partition):

    R_{i+1}(b) = F_b( (⊔_{c ∈ children(b) ∩ calc(b)} R_{i+1}(c)) ⊔ (⊔_{c ∈ children(b) \ calc(b)} R_i(c)) )

Again since F_b and both meet and join are monotone, and also by our assumptions,
we have that R_i(b) ⊑ R_{i+1}(b) if ∀x ∈ calc(b). R_i(x) ⊑ R_{i+1}(x).

Therefore, using both the inner and outer inductions in turn, we have that
∀i, b. R_i(b) ⊑ R_{i+1}(b). This proves that the iterative calculation converges.

A.2 Live Variable Analysis

Recall that live variable analysis is performed over the lattice (℘(Vars), ⊆) with
transfer function:

    F_n(x) = (x \ Write(n)) ∪ Read(n)

1It is a standard result of lattices that join and meet are monotone.


Theorem 3. Iterative computation of liveness information converges.

Proof. Convergence can be shown with the help of Theorem 2 by showing that
F_n is monotone and that the lattice has finite height. This is trivially the case,
since F_n is the composition of two monotone operations, set minus and set union.
Also, since ℘(Vars) is finite and has top element Vars, the lattice must have finite
height.
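For concreteness, this transfer function can be written directly as a small set operation (an illustrative sketch only; the method and parameter names here are not those of the compiler's LiveVariable class):

import java.util.HashSet;
import java.util.Set;

public class Liveness {
  // F_n(x) = (x \ Write(n)) ∪ Read(n): the variables live after n, minus those
  // that n overwrites, plus those that n reads.
  static Set<String> transfer(Set<String> liveOut, Set<String> writes, Set<String> reads) {
    Set<String> result = new HashSet<>(liveOut);
    result.removeAll(writes);
    result.addAll(reads);
    return result;
  }
}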

A.3 Constant Propagation

Recall that constant propagation is performed over the lattice
({⊥, ⊤} ∪ Constants, ⊑) and transfer function F_{n,v} where:

    x ⊑ y ⇐⇒ (x = ⊥) ∨ (y = ⊤)

    F_{n,v}(x) = c   if n assigns c to v
                 ⊤   if n writes a non-constant to v
                 x   otherwise

Theorem 4. Iterative computation of constant propagation converges.

Proof. First we show that F_{n,v} is monotone. The definition of F_{n,v} can be
considered in two cases. When n writes to v, F_{n,v} is simply a constant function, so
is trivially monotone. Equally, when n does not write to v, F_{n,v} is the identity
function, so is also monotone.

We can also show that the lattice is of finite height. By definition of ⊑, the only
increasing chains are ⊥ ⊑ ⊤ and, for each c ∈ Constants, ⊥ ⊑ c ⊑ ⊤.

Therefore, as for other dataflow analyses, convergence is guaranteed by these

two properties according to Theorem 2.
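For concreteness, the join on this flat lattice can be sketched as follows (an illustrative encoding only, using null for ⊥ and a sentinel object for ⊤; the compiler's ReachingConstants class may represent these differently):

public class ConstantLattice {
  // Distinguished top element; any other value represents a constant,
  // and null represents bottom. This encoding is purely illustrative.
  static final Object TOP = new Object();

  // Join (least upper bound) on the flat lattice {⊥} ∪ Constants ∪ {⊤}.
  static Object join(Object a, Object b) {
    if (a == null) return b;        // ⊥ ⊔ b = b
    if (b == null) return a;        // a ⊔ ⊥ = a
    if (a.equals(b)) return a;      // c ⊔ c = c (and ⊤ ⊔ ⊤ = ⊤)
    return TOP;                     // distinct constants, or ⊤ involved, give ⊤
  }
}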


APPENDIX B
Code Generation Details

This appendix describes the naming conventions used within the code generation

stage of the compiler.

Within a kernel or method, the temporary variables used are simply named

consecutively (i.e. t1, t2, . . . ). For local variables, it is necessary to append

a type suffix since the same local variable might be used for different types in

different live ranges (as in Example 3.4). This gives names of  vi TypeSort  for

variable i where the type sort is any of the primitive types, or a unique number

for reference types.

Kernel launcher methods (i.e. those called as a replacement for the loops) are
named using the hashcode of the internal object representing the kernel (i.e.
kernel_<hashcode>, or kernel_M<-hashcode> if the value is negative). This

gives a unique name amongst the kernels exported, and is unlikely to conflict

with any methods within the original class.

JNI specifies a mangling scheme for converting Java method names to C++

[18, Table 2-1]. This must be adopted for the launcher, but is also used by the

compiler for method and static variable names with altered prefixes in place of 

Java_ (Static_ for statics and none for methods). This is necessary to ensure

that there are no conflicts in naming (e.g. a naïve approach might result in
ClassX_Test.f() and ClassX.Test_f() both mapping to ClassX_Test_f).
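As a rough sketch of why the escaping matters (this handles only the two simplest escapes of the JNI scheme and is not the compiler's own Helper class):

public class JniNameSketch {
  // Simplified JNI-style mangling: '/' in the class name becomes '_', and any
  // '_' already present is escaped as "_1" (other JNI escapes are omitted).
  static String mangle(String prefix, String className, String methodName) {
    String cls = className.replace("_", "_1").replace('/', '_');
    String mth = methodName.replace("_", "_1");
    return prefix + cls + "_" + mth;
  }

  public static void main(String[] args) {
    // Prints Java_samples_Mandelbrot_compute, a launcher-style entry point.
    System.out.println(mangle("Java_", "samples/Mandelbrot", "compute"));
  }
}

With the underscore escape, ClassX_Test.f() and ClassX.Test_f() map to the distinct names ClassX_1Test_f and ClassX_Test_1f, avoiding the collision described above.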


APPENDIX C
Command Line Interface

The command line interface to the compiler has a number of optional arguments

that affect its behaviour. These are shown in the table below:

Option        Description

cuda          Directory into which the CUDA toolkit was installed; should
              contain bin/nvcc.
jdk           Directory into which the JDK was installed; should contain an
              include directory with the JNI header files.
includes      Directory in which the compiler's include files are stored
              (parallel.h et al.).
library       Name of the shared library that should be generated by the
              compiler (defaults to libparallel).
classpath     Paths (separated by :) in which input classes can be found.
output        Output directory for the shared library and modified class files.
log           Log level for feedback, accepting each of the Log4J possibilities.
detect        Dependency checking method. This can be either manual (default),
              auto or combined.
generate      When specified, the shared library is not compiled and the C++
              code is saved.
nonportable   Allows bytecode from the core Java class library to be compiled
              onto the GPU. This may allow more code to be compiled, but is
              not portable between library versions.


Below an example of the compiler output is given, for automatic detection,

with the logging level set to INFO:

bing:dist$ java -jar Parallel.jar --log info --detect auto samples.Mandelbrot

INFO [core]: Considering samples/Mandelbrot.<init>(I)V

INFO [core]: Considering samples/Mandelbrot.compute()V

INFO [loops.detect]: Natural loop on line 62.

INFO [loops.detect]: Natural loop on line 63.

INFO [loops.detect]: Natural loop on line 73.

INFO [loops.trivialise]: Loop has multiple exit points (line 73).

INFO [loops.trivialise]: Trivial loop found (line 62): y#1 (I) <

READ ->samples/Mandelbrot.height [I] {y#1 (I)=1}

INFO [loops.trivialise]: Trivial loop found (line 63): x#2 (I) <

READ ->samples/Mandelbrot.width [I] {x#2 (I)=1}

INFO [check.Basic]: Accepted loop (line 62) based on basic test.

INFO [check.Basic]: Accepted loop (line 63) based on basic test.
INFO [extract]: Kernel of 2 dimensions extracted (line 63).

INFO [extract]: Copy In: [Var#0 (Lsamples/Mandelbrot;), Var#2 (I), Var#1 (I)]

INFO [extract]: Copy Out: [Var#0 (Lsamples/Mandelbrot;)]

INFO [core]: Considering samples/Mandelbrot.main([Ljava/lang/String;)V

INFO [core]: Considering samples/Mandelbrot.output(Ljava/io/File;)V

INFO [loops.detect]: Natural loop on line 90.

INFO [loops.detect]: Natural loop on line 89.

INFO [loops.trivialise]: Trivial loop found (line 89): y#4 (I) <

READ ->samples/Mandelbrot.height [I] {y#4 (I)=1}

INFO [loops.trivialise]: Trivial loop found (line 90): x#5 (I) <

READ ->samples/Mandelbrot.width [I] {x#5 (I)=1}

INFO [check.Basic]: Alias analysis not accurate enough to judge loop (line 89).

INFO [check.Basic]: Alias analysis not accurate enough to judge loop (line 90).

INFO [core]: Considering samples/Mandelbrot.<init>(II)V

INFO [core]: Considering samples/Mandelbrot.run(I)J


APPENDIX D
Sample Code Used

This appendix gives further details on the sample code used in the evaluation.

D.1 Java Grande Benchmark Suite [7]

The suite is split into 3 distinct sections. The first concentrates on testing the

performance of “low level operations” such as arithmetic, and is not relevant to

this project. The second provides 7 kernel benchmarks, while the third concen-

trates on larger scale applications. A summary of the Section 2  benchmarks

available1 is given in Table D.1.

Benchmarks that could not be parallelised through use of parallel for loops

were not considered, since the goal was to use unmodified code.

1Version 2.0 of the sequential suite was used.

Benchmark    Description                        Used
Series       Fourier coefficient analysis.      ✓
LUFact       LU factorisation.                  ×
SOR          Successive over-relaxation.        ×
HeapSort     Integer sorting.                   ×
Crypt        IDEA encryption.                   ✓
FFT          Fast Fourier transform.            ×
Sparse       Sparse matrix multiplication.      ×

Table D.1: Summary of Section 2 of the Java Grande Benchmark Suite.


Figure D.1: 3 generations of the Game of Life.

D.2 Mandelbrot Computation

A brief description of the Mandelbrot set is given in Section 4.2.4. The routine

used is from The Computer Language Benchmarks Game 2. Whilst the bench-

marks are now considered a bad way of comparing performance of languages,

they are still valid when comparing performance of different compilers (or run-

times) for a single language.

The only modification made to the source code was to re-express the

do { ... } while(...); loop as a standard while(...) { ... } loop. This

allows trivialisation of the loop.
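For clarity, the rewrite amounts to the following shape change (an illustrative fragment rather than the benchmark source itself):

public class LoopShape {
  // Original shape: the body of a do { ... } while (...) loop always runs at
  // least once before the condition is tested.
  static int countDoWhile(int limit) {
    int i = 0;
    do {
      i++;
    } while (i < limit);
    return i;
  }

  // Rewritten shape: a standard while (...) { ... } loop. When the condition
  // holds on entry (as in the benchmark's inner loop), the behaviour is the
  // same, and the loop now matches the pattern the compiler can trivialise.
  static int countWhile(int limit) {
    int i = 0;
    while (i < limit) {
      i++;
    }
    return i;
  }
}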

D.3 Conway’s Game of Life

Conway’s Game of Life is a cellular automaton. The evolution of each cell in a

2D grid is described by three simple rules (quoted from [11]), considered with

respect to the cell’s eight neighbours (an example evolution is given in Figure

D.1):

1. Survivals: “Every counter with two or three neighboring counters survives

for the next generation.”

2. Deaths: “Each counter with four or more neighbors dies (is removed)

from overpopulation. Every counter with one neighbor or none dies from

isolation.”

3. Births: “Each empty cell adjacent to exactly three neighbors – no more,

no fewer – is a birth cell. A counter is placed on it at the next move.”
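A compact Java sketch of a naïve generation step implementing these rules is given below (written purely for illustration; it is not the course code referred to next, and it treats cells outside the grid as dead):

public class LifeStep {
  // Computes the next generation of a 2D grid; true means a live cell.
  // The two outer loops are independent across cells, which is what makes the
  // naïve algorithm a candidate for parallel execution.
  static boolean[][] step(boolean[][] grid) {
    int h = grid.length;
    int w = grid[0].length;
    boolean[][] next = new boolean[h][w];

    for (int y = 0; y < h; y++) {
      for (int x = 0; x < w; x++) {
        int neighbours = 0;
        for (int dy = -1; dy <= 1; dy++) {
          for (int dx = -1; dx <= 1; dx++) {
            if (dy == 0 && dx == 0) continue;
            int ny = y + dy;
            int nx = x + dx;
            if (ny >= 0 && ny < h && nx >= 0 && nx < w && grid[ny][nx]) {
              neighbours++;
            }
          }
        }
        // Survival with two or three neighbours; birth with exactly three.
        next[y][x] = grid[y][x] ? (neighbours == 2 || neighbours == 3)
                                : (neighbours == 3);
      }
    }
    return next;
  }
}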

The source code used for both the naïve algorithm and Hashlife is that devel-

oped by Dr Andrew Rice for use in a Java programming course 3.

2http://shootout.alioth.debian.org/
3http://www.cl.cam.ac.uk/teaching/0809/ProgJava/


APPENDIX E
Testing Gold Standards

The gold standard for loop trivialisation is given in the table below. Similar style

checks were made for both loop detection and kernel extraction.

Sample                  Details of Trivial Loops

Component Benchmarks
Base (Trigonometry)     27 (i < nums.length, i=+1), 34 (i < nums.length, i=+1),
                        43 (j < nums.length, j=+1)
Static (Statics)        28 (i < nums.length, i=+1), 35 (i < nums.length, i=+1),
                        44 (j < nums.length, j=+1)
Objects (Objects)       33 (i < nums.length, i=+1), 40 (i < nums.length, i=+1),
                        49 (j < nums.length, j=+1)
2D (MultiDimension)     27 (k < nums.length, k=+1), 28 (l < nums[0].length, l=+1),
                        36 (k < nums.length, k=+1), 37 (l < nums[0].length, l=+1),
                        47 (i < nums.length, i=+1), 48 (j < nums[0].length, j=+1)

Java Grande Benchmarks
JGFCryptBench           48 (i < array_rows, i=+1)
IDEATest                115 (i < 8, i=+1), 130 (j < array_rows, j=+1), 154 (k < 52, k=+1),
                        157 (k < 8, i=+1), 174 (i < 52, i=+1), 222 (i < 7, i=+1),
                        273 (i < text1.length, i=+8, i1=+8, i2=+8), 291 (r != 0, j=-1)
JGFSeriesBench          56 (i < 4, i=+1), 57 (j < 2, j=+1)
SeriesTest              103 (i < array_rows, i=+1), 169 (nsteps > 0, nsteps=-1)
Mandelbrot              62 (y < height, y=+1), 63 (x < width, x=+1),
                        89 (y < height, y=+1), 90 (x < width, x=+1)
ReverseArray            16 (i < 3, i=+1), 20 (j < 3, j=+1), 24 (i < 3, i=+1)

The majority of benchmarks tested a range of the code generation features.

Since many benchmarks were represented by an object at the top level, this


immediately tested object support. However, several benchmarks were used for

ensuring test coverage of other features:

Statics Tested support for static class fields.

MultiDimension Tested support for arrays, and arrays of arrays.

ReverseArray Tested support for manipulation of references on the GPU.

Objects Tested support for full use of objects, involving modification of multiple

classes and invoking instance methods.

Testing of the automatic dependency analysis could be done against the

@Parallel annotations that were already in place to mark parallel loops.


APPENDIX F
Class Index

SLOC1 Class Name Relevant Sections

105 analysis.dataflow.Dataflow

57 analysis.dataflow.ReachingConstants 2.7.4

153 analysis.dataflow.LiveVariable 2.7.3, 3.2.4, 3.3.1

330 analysis.dataflow.AliasUsed 3.3.3, 3.3.4

71 analysis.dataflow.IncrementVariables 3.3.2

131 analysis.dataflow.SimpleUsed 3.3.4

7 analysis.dependency.DependencyCheck 3.6

174 analysis.dependency.BasicCheck 3.6.2

32 analysis.dependency.AnnotationCheck 3.6.1
23 analysis.dependency.CombinedCheck

104 analysis.loops.LoopDetector 2.7.2

141 analysis.loops.LoopTrivialiser 3.4.1

40 analysis.loops.LoopNester 2.7.2

16 analysis.AliasMap

35 analysis.BlockCollector

72 analysis.CanonicalState ‘State’ in 3.3.3

17 analysis.CodeTraverser 3.2.2

25 analysis.InstructionCollector

154 analysis.KernelExtractor 3.5

71 analysis.LooseState ‘LooseState’ in 3.3.3

80 bytecode.AnnotationImporter

119 bytecode.BlockExporter 3.2.3

462 bytecode.InstructionExporter 3.2.3

99 bytecode.ClassImporter

76 bytecode.ClassExporter

626 bytecode.MethodImporter 3.2.3

81 bytecode.ClassFinder

320 cuda.Helper Appendix B

258 cuda.CppGenerator 3.7.1


108 cuda.BlockExporter 3.7.1

182 cuda.CUDAExporter 3.7

40 cuda.Beautifier 3.7
60 debug.ControlFlowOutput e.g. Example 3.8

20 debug.LinePropagator

32 exceptions.UnsupportedInstruction

1108 graph.instructions.* Table 3.1

10 graph.state.State

57 graph.state.ArrayElement

69 graph.state.Variable

75 graph.state.Field

49 graph.state.InstanceField

36 graph.Annotation 3.2

85 graph.BasicBlock 3.2.1

64 graph.Block 3.2.1

22 graph.BlockVisitor 3.2.2

152 graph.ClassNode 3.2

39 graph.CodeVisitor 3.2.2

71 graph.Kernel

29 graph.Loop 3.2.1

111 graph.Method 3.2

123 graph.Modifier 3.2

32 graph.TrivialLoop 3.2.1, 3.4.1

254 graph.Type 3.2.4

36 tools.Benchmark

202 tools.Parallelise 3.8
9 tools.Restrict

10 tools.Parallel

51 util.Utils

28 util.EquatableWeakReference

23 util.ConsList

16 util.MapIterable

25 util.Tree

54 util.WeakList 3.2.1

23 util.TransformIterable

10 parallel.h

50 parallel/launch.h 3.7.2

24 parallel/types.h
212 parallel/memory.h 3.7.3

206 parallel/transfer.h 3.7.3

7686

1As calculated by SLOCCount—http://www.dwheeler.com/sloccount/.


APPENDIX G
Source Code Extract

/*
 * Parallelising JVM Compiler
 * Part II Project, Computer Science Tripos
 *
 * Copyright (c) 2009, 2010 - Peter Calvert, University of Cambridge
 */

package analysis.dependency;

import graph.Annotation;
import graph.Method;
import graph.TrivialLoop;
import graph.Type;

import java.util.Collections;
import java.util.List;

import org.apache.log4j.Logger;

/**
 * Checks dependencies based on annotations on the containing method.
 */
public class AnnotationCheck implements DependencyCheck {
  /**
   * Names of loop indices that should be run in parallel in the current
   * context.
   */
  private List<String> loopIndices;

  /**
   * Sets the context in which loops should be considered.
   *
   * @param method Method in which loops that follow are contained.
   */
  @Override
  public void setContext(Method method) {
    Annotation annotation = method.getAnnotation(
      Type.getObjectType("tools/Parallel")
    );

    if (annotation == null) {
      loopIndices = Collections.emptyList();
    } else {
      loopIndices = (List<String>) annotation.get("loops");
    }
  }

  /**
   * Checks whether it is safe to execute the given <code>TrivialLoop</code> in
   * parallel based on the name of the loop index.
   *
   * @param loop Trivial loop to check.
   * @return <code>true</code> if safe to run in parallel,
   *         <code>false</code> otherwise.
   */
  @Override
  public boolean check(TrivialLoop loop) {
    if (loopIndices.contains(loop.getIndex().getName())) {
      Logger.getLogger("annotation").info("Accepted " + loop + " for parallelisation.");
      return true;
    } else {
      Logger.getLogger("annotation").info("Rejected " + loop + " for parallelisation.");
      return false;
    }
  }
}


APPENDIX H
Project Proposal

Peter Calvert

Trinity College

prc33

Computer Science Tripos Part II Individual Project Proposal

Parallelisation of Java for Graphics Processors

October 22, 2009

Project Originator: Peter Calvert

Resources Required: See attached Project Resource Form

Project Supervisors: Dr Andrew Rice and Dominic Orchard

Signatures:

Directors of Studies: Dr Arthur Norman and Dr Sean Holden

Signatures:

Overseers: Dr David Greaves and Dr Marcelo Fiore

Signatures:


Introduction and Description of the Work

In the past, improvements in computational performance have taken the form

of higher clock speeds. However, more recently, increased performance has come

from the use of multiple processors, to solve independent parts of a problem in

parallel.

Graphics processors (GPUs) are a good example of this, and are commonly ar-

chitected as stream  processors, meaning that they can apply the same set of 

instructions across a grid in parallel. As a result of this, there has been signifi-

cant recent interest in using them for more general computation. In particular,

they are suited to running loops in parallel.

However, it is a well known problem that developers find it hard to reason about

the interactions of code running in parallel. Furthermore, most existing code is sequential, and thus there are no performance gains from executing it on parallel

architectures. It must be recompiled, or in some cases rewritten, to benefit.

Automatic parallelisation aims to address this by analysing existing sequential

code, and identifying areas that can be run in parallel.

This project aims to make it possible to utilise parallel processors by compiling

appropriate loops for GPU execution. Initially, developer input will be required

to determine whether the conversion maintains correctness. However, as the

project develops, it is hoped that some of these decisions can be automated. The

project will be evaluated both by the performance gain resulting from parallel

computation, and also by the scope of the analyses made.

The compilation will be made from Java Virtual Machine (JVM) bytecode, since

it is possible to compile a number of languages1 for it (including Ruby, Python

and Scala). It is also relatively simple, and libraries exist to aid in its analysis 2.

The Low Level Virtual Machine (LLVM) would have been a viable alternative

for similar reasons, but was dismissed due to lack of familiarity.

The target of the compilation will be NVIDIA’s devices, due to the complete

framework (CUDA) that they have made available to allow GPU kernels to be

written alongside CPU code, which will make development easier. A more stan-

dardised approach, OpenCL, is still at the draft stage.

While in general determining whether a loop’s iterations are independent is unde-

cidable, there are solutions given certain constraints which could be introduced.

There are also transformations that could be applied beforehand to remove some

dependencies. A major difficulty often experienced is related to checking whether

variables are aliasing, so this will be left in as a check for the user to make. These

1http://en.wikipedia.org/wiki/List_of_JVM_languages
2ASM (http://asm.ow2.org/)


automatic extensions could be evaluated in terms of the accuracy of their analysis,

and also the proportion of loops in sample code that they can consider.

Resources Required

Access will be needed to a suitable graphics processor that supports the NVIDIA

CUDA architecture. However, during development it will be possible to use the

emulation mode included in the NVIDIA development tools.

Starting Point

This project will be undertaken starting from the following knowledge and expe-rience:

• General knowledge of JVM bytecode from Part Ib course Compiler Con-

struction .

• Successful compilation and execution of a couple of CUDA examples under

the emulation environment.

• Rudimentary code put together during the first week of Michaelmas term

that produces an unrefined graph of JVM code using the ASM library, and

then detects loops in this.

• Preliminary reading over the long vacation into compiler optimisation tech-

niques and dependency analysis.

Further knowledge will be gained during Michaelmas term of Part II from the

Optimizing Compilers course.

Substance and Structure of the Project

In order to allow any compilation or analysis to occur, the Java bytecode must

first be read in and represented in a suitable structure for both control and data

flow analysis. This will be a graph of basic blocks, within each of which a data

flow graph will be contained. To allow the compiler, analysis and transformers

to traverse the structure, a variant of the visitor  pattern should be implemented.

The project can then be divided into the following stages; starred items are being

considered as possible extensions rather than core parts:


1. Detection of loops within the control flow graph (JVM bytecode represents

control flow in an unstructured manner) and insertion of the appropriate

‘loop’ nodes. This can be done using analysis of each basic block’s domi-

nators.

2. Wrappers that can transfer the various JVM primitive types and arrays

to the GPU. This would be done using Java’s native code interface (JNI).

At this stage it is also necessary to be able to invoke the kernels over the

required dimensions, converting these into a suitable grid of blocks for the

size of GPU available.

3. Compilation of loop bodies for execution on a NVIDIA CUDA compatible

GPU. Since NVIDIA already provide a C compiler for this, the simplest

approach here is to generate C code from JVM bytecode.

4. Detection of which variables need to be passed into the CUDA kernel.

5. Transformation of the Java class to use the relevant wrappers in place of 

the original loop code.

* Automatic detection of the loop variable and its bounds rather than

prompts to the user. This will be characterised by the variable that is

used in the exit condition, and which is also only written to by a single

INCREMENT instruction on each iteration (this instruction also accepts neg-

ative increments for the case of a decrementing loop).

* Basic dependency analysis of variable and field usage; for array accesses the
relatively simple GCD test should be used (allowing analysis where array
usages are of the form ax + b). A sketch of this test is given after this list.

* Support for compiling object oriented JVM code to CUDA C.

* Loop-invariant code motion: this is a common optimisation that is used

by all compilers, however, since the code is being split and passed to two

separate compilers, there is no scope for code to be moved from inside the

loop, to the outside.

* Runtime checks for aliases and regular shaped arrays.

* A constrained version of  loop fission  (or loop distribution ) in which we

require that the loop body does not contain conditional blocks (i.e. just

sequential instructions and nested loops). This splits existing loops into

multiple loops, so that at least some of these can be run in parallel, even if 

the combined loop could not.
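As an illustrative sketch only (not project code): for two accesses of the form a1*i + b1 and a2*j + b2, the GCD test reports a possible dependency unless gcd(a1, a2) fails to divide b2 - b1.

public class GcdTest {
  static int gcd(int a, int b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      int t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  // Returns false only when accesses a1*i + b1 and a2*j + b2 can never refer
  // to the same element; true means a dependency has not been ruled out.
  static boolean mayDepend(int a1, int b1, int a2, int b2) {
    int g = gcd(a1, a2);
    if (g == 0) {
      return b1 == b2;   // both accesses are to constant indices
    }
    return (b2 - b1) % g == 0;
  }
}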


Using existing code from a benchmark suite3 as well as other code that can be

sourced, an evaluation will then be drawn up on the performance gains that can

be achieved. Additionally, these gains will be compared with those made by

hand-written parallel versions of some of the benchmarks. The success of the

automatic checks at detecting safe loops will also be evaluated. Where safe loops

were not detected as such, it will be noted (when obvious) what further analysis

or transformations may have helped. This could then be used to guide any future

work.

Success Criteria

The core parts of the project will have been a success if:

1. Existing Java code (that has had GPU areas manually marked) can be run

using CUDA hardware, producing the same results.

2. The performance of CUDA-enabled benchmarks can be compared to their

original running time, and also to the running time when the conversion to

CUDA code is done by hand.

3. In some cases, an overall speed-up can be found. However, this will not

always be possible due to the transfer overhead associated with using the

GPU. Given sufficiently large problem sizes, this overhead should become negligible.

The automatic detection extension to the project will have been a success if 

common dependency analysis techniques can be evaluated based on their ability

to detect loops that are safe for parallelisation.

Timetable and Milestones

The timeline below is structured into 2 week ‘slots’. In allocating work to slots,

there were several aims in mind:

• To have a general structure in place that allows independent testing of 

separate components as early as possible.

3Java Grande (http://www2.epcc.ed.ac.uk/computing/research_activities/java_

grande/sequential.html)


• To attempt the most difficult and risky parts of the project early on, so

that there is plenty of recovery time if problems do arise.

• To implement all required features and evaluate these before extensions are incorporated.

• To write a draft dissertation as the work is done, rather than leaving it as

a big job for the end.

Slot 0: 1st October to 16th October

• Discuss with Researchers, Overseers and Director of Studies the feasibility

of the project idea, along with background reading to assess the existing

work in the area, and the quantity of work entailed.

• Arrange with Project Supervisors a schedule of meetings to ensure the

project stays on track.

• Organise access to equipment for the project (i.e. a capable computer with

CUDA GPU), as well as setting up a regular backup system.

Milestones: Project proposal and availability of CUDA GPU.

Slot 1: 17th October to 30th October

• Experiment with CUDA and gain familiarity with what it can do.

• Rework preliminary flow graph producing code, taking more care over the

data structure.

• Based on the algorithms being used, implement traversal facilities for the

flow graph that give easy access to the relevant information and structure.

• Rework the preliminary loop detection code using the structure from above.

Milestone: Be able to read in JVM class files and represent both the control flow and data flow inherent in them, recovering loop structures.

Slot 2: 31st October to 13th November

• Produce code that can transfer primitive Java types and also arrays onto a

GPU.


• Produce code that can invoke a compiled CUDA kernel from Java.

Milestone: Implementation of all required CUDA wrappers in JNI.

Slot 3: 14th November to 27th November

• Produce code that can detect which variables need to be transferred to and

from the GPU for a given block of code.

• Produce code that generates valid CUDA C for a given section of JVM

bytecode.

Milestones: Be able to detect which variables need to be transferred to and

from the GPU, and be able to generate CUDA C from bytecode.

Slot 4: 28th November to 11th December

Use this time to consolidate and tidy up any loose ends in the code, and test it

on a wider range of JVM bytecode.

Due to end of term events and also a ski holiday (4th to 13th December), less

work has been scheduled for this slot.

Slot 5: 12th December to 25th December

• Tie components together to be able to produce rewritten class files that

invoke GPU kernels rather than the original loops.

• Start drafting a dissertation for the core parts of the project, using notes

made whilst this was implemented in slots 1 to 3.

Milestones: Core implementation complete, and dissertation with most struc-

ture drafted along with content for the core preparation/implementation.

Slot 6: 26th December to 8th January

• Catch up time to fix non-critical bugs that have been put off during previous

slots.

• Source as many benchmarks and suitable applications written in JVM lan-

guages as possible (ideally containing a couple of hundred loops in total

across all the code).


• Work out safe loops in the benchmark code collected.

•Evaluate the performance improvements from the CUDA compilation for

the benchmark code.

Milestone: Extensive set of benchmarks for CPU and CUDA versions.

Slot 7: 9th January to 22nd January

• Manually produce CUDA versions of some of the benchmarks, and add the

performance of these to the evaluation.

• Start writing the evaluation section of dissertation based on the results.

• Decide on whether to implement extensions, and if so how much of the automated detection to attempt.

Slot 8: 23rd January to 5th February

• Prepare the required progress report and the accompanying presentation.

• Work on extensions / catch up.

Milestones: Progress report and presentation.

Slot 9: 6th February to 19th February

Further extensions and catch up time.

Milestone: Complete code base.

Slot 10: 20th February to 5th March

Update the dissertation with details of any extension work, and prepare it to

draft standard (based on the work already achieved).

Milestone: Complete draft dissertation.

Slot 11: 6th March to 19th March

End of Lent term / Easter holiday, emphasis on revision.

Slots 12 and 13: 20th March to 16th April

Easter holiday, emphasis on revision.


Slot 14: 17th April to 30th April

This coincides with the beginning of Easter term. This time will be spent final-

ising the dissertation, and proof reading.

Milestone: Printed dissertation ready to hand in.

Slot 15: 1st May to 14th May

This slot ends with the final deadline for the dissertation. It is intended that

this slot won’t be used, and therefore it provides some buffer time for any serious

issues.