
Proceedings of EUROMICRO-22: 22nd Euromicro Conference, Beyond 2000 - Hardware and Software Design Strategies, Prague, Czech Republic, 2-5 Sept. 1996. IEEE Computer Society Press.


Optimising Pseudoknot in ΓCMC

G.G. da Cruz Neto, R.M.F. Lima, R.D. Lins & A.L.M. Santos
Departamento de Informática, Universidade Federal de Pernambuco, Recife, PE, Brazil

e-mail: ggn, rmfl, rdl, [email protected]

Abstract

Benchmarking implementations is fundamental to analysing performance across different platforms. The choice of a benchmark that makes possible a reliable and fair comparison of a particular aspect is a difficult task, however. The Pseudoknot benchmark is a floating-point intensive application taken from molecular biology which was used to compare the compile-time and execution-time performance of over 25 different implementations of functional languages. Amongst those implementations was ΓCMC, an abstract machine for the efficient implementation of lazy functional languages. ΓCMC pioneered the transference of the control of the execution flow to C, as much as possible, to take advantage of the extremely low cost of procedure calls in modern RISC architectures. ΓCMC was amongst the machines that presented good Pseudoknot figures, although it did not use some of the sophisticated optimisations of most of the other implementations. The experience of implementing Pseudoknot in ΓCMC was most valuable in providing insights into new ways of optimising it. This paper describes several optimisations introduced to ΓCMC which bring better Pseudoknot performance.

1 Introduction

One of the outcomes of the 1994 Dagstuhl Workshop on Applications of Functional Programming in the Real World was the need for a medium-size benchmark that could be easily translated into the many different functional languages, allowing their compile and execution times to be compared. Pieter Hartel and Mark Feeley led the team that used the Pseudoknot benchmark: a symbolic computation and floating-point intensive application taken from molecular biology. World-wide collaboration from the community of implementors of functional programming languages allowed over 25 implementations to be tested [2]. Amongst them is ΓCMC [8], an abstract machine developed with the aim of providing fast implementations of lazy functional languages. This abstract machine is based on the rewriting system of Categorical Multi-Combinators [6].

ΓCMC is implemented in C [4]. Functions that are strict on all arguments and generate results of ground type, as well as all arithmetic expressions, are translated directly either into macros or procedures in C. All other functions are handled by the abstract machine, which glues together abstract machine code and C code.

An intermediate language, here also called ΓCMC, was used to write the Pseudoknot benchmark. The ΓCMC language can be seen as a simple functional language, which will serve as the basis for the implementation of more sophisticated languages. Front-ends to Haskell [3] and SML [9] are under way.

The ΓCMC front-end performs a number of intermediate-code optimisations, as described in [7]. ΓCMC is written entirely in C, which makes it portable, but its degree of code optimisation is still very low compared with the Chalmers LML/HBC compiler or the Glasgow Haskell Compiler, for instance. Source-code "massaging" was also allowed in Pseudoknot, but in the data we provided [2] we made no use of non-automatic source-code transformations, in order to get the most reliable performance spectrum of ΓCMC. Taking these facts into account, the performance figures for the original version of Pseudoknot in ΓCMC can be considered very good.

Although ΓCMC provides lazy evaluation, Pseudoknot is strict. This information is used to generate better Pseudoknot code. The set of optimisations presented here is suitable for both the lazy and strict versions of ΓCMC, unless otherwise stated. In this paper, we present a number of new code optimisations that yielded a much better performance.

2 The Original ΓCMC Machine

The original ΓCMC abstract machine is presented in reference [8]. Reference [7] largely optimised the original machine, but its basic philosophy remained unchanged. A brief introduction to ΓCMC is presented in this section.

1089-6503/96 $5.00 © 1996 IEEE. Proceedings of EUROMICRO-22.


2.1 Compiling into ΓCMC

A program in ΓCMC is formed by a set of function definitions plus an expression to be evaluated. The expression to be evaluated is compiled by scheme E, and each function in the script is compiled depending on its nature. Functions that are strict on all arguments and produce results of ground type are called special and are compiled directly as procedures in C. This is the key to the efficiency of ΓCMC, because in doing so it takes advantage of the very fast implementation of procedure calls in RISC architectures. The compilation schemes for the kernel of ΓCMC are presented in [8], and their behaviour can be summarised as follows:

Scheme E is responsible for the printing routine and driving the evaluation mechanism.

Scheme S is responsible for starting-up the compilation of special functions generating procedures in C.

Scheme S' is ancillary to S and is responsible for the compilation of inner parts of the body of a special function generating parts of procedure code in C.

Scheme T is responsible for the compilation of ordinary functions and generates code which is handled by the abstract machine.

Scheme Z generates code which, when executed, fills the fields of a cell in an evaluation environment.

Scheme T' produces code which when executed generates cells on the top of the T-stack. It also makes parameters ready for special functions or arithmetic expressions whenever called inside an ordinary function.

Scheme Z' makes parameters ready for special functions or arithmetic expressions whenever called inside a cell-generating scheme.

Scheme L is responsible for the compilation of lists and functions over lists.

2.2 Example of Compilation

Let us show an example of ΓCMC compiled code. The script:

fib n = if n < 2 then 1 else fib(n-1) + fib(n-2)
twi f x = f (f x)
twi fib 5?

will be compiled as,

twi fib 5 → MKEcell(2); MKEpc(A, 1); MKEcte(5, 0); Pushfun(twi); Popenv; print();
A   → eval'(0); MKTk(0, fib((*(topT))->rem.val));
twi → MKTcomp(A'); eval_env(1);
A'  → MKTvar(0); eval_env(1);
fib → if {n < 2} return(1)
      else {return(fib(n-1) + fib(n-2))}

As we can see, the result of the compilation of the special function fib is a procedure in C, which needs only a heading with type declarations to be compiled and executed by the C compiler.

2.3 State Transition Laws

We present ΓCMC as a state transition machine. A state of ΓCMC is a 5-tuple

(C, T, H, O, E)

in which each component is interpreted in the following way:

C: The code to be executed. This code is generated by the translation rules presented by the compilation schemes above.

T: The reduction stack. The top of T points to the part of the graph to be evaluated.

H: The heap where graphs are stored. The notation H[d = e1 ... en] means that there is in H an n-component cell named d. The fields of d are filled with e1 ... en, in this order. Cells are fully boxed.

O: The output.

E: The environment stack. Its top contains a reference to the current environment.

ΓCMC is defined as a set of transition rules. The transition

(C, T, H, O, E) → (C', T', H', O', E')

must be interpreted as: "when the machine arrives at state (C, T, H, O, E), it can get to state (C', T', H', O', E')".

3 Automatic Optimisations

In this section, we present a list of optimisations implemented on ΓCMC that were incorporated directly into the compiler, without any need for code annotations or massaging by the programmer.

The gains in performance for each of the optimisations can be found in section 4.

3.1 Removing Environment Cells

The notion of having an explicit environment is at the very kernel of Categorical Multi-Combinators. The original ΓCMC machine [8] built environment cells whenever a function was executed. This environment serves for variable look-ups (exchanging formal parameters for actual ones) or to store the state of execution in closures, for possible evaluation at a later stage of the computation (the case of lazy evaluation or higher-order functions). Avoiding the generation of environment cells, by re-using them immediately or by sharing existing ones, has proved to be an important source of gains in execution-time performance in ΓCMC [7].

Now, we have restricted the generation of environment cells to the case when closures need to be generated. Actual parameters are stored in a stack, and variables access them directly without any need to generate environment cells. Whenever a closure is to be built, these parameters are transferred from the stack to the environment part of the closure-cell, which binds them together with the code to be executed. Opening a closure copies the environment back onto the stack and enters the corresponding code. Closure generation happens far less frequently than ordinary function execution, so the number of closure cells generated is far lower than the number of environment cells generated before. In addition, this optimisation removes an indirection level in ordinary variable look-up, as parameters are fetched directly from the stack instead of from the cell referenced by the top of the environment stack.

This optimisation was the basis for all the others presented in this paper. Observing the tables presented in figures 1 and 2 below, one can see that the compilation of this version of Pseudoknot took far too long. The execution time of 13.71 seconds is already reasonable, compared with most of the other machines benchmarked. In the following section we analyse the reason and show how to reduce the compilation time in ΓCMC.

3.2 Datatypes Built by Procedures

Originally, datatype definitions were implemented as C macros. In the case of Pseudoknot, there is a large number of datatype definitions (constants), and implementing them as C macros resulted in a high compile-time overhead. We decided to experiment with implementing datatypes as C functions. This resulted in major improvements in compilation time and, surprisingly, improvements in execution time.

The performance figures for compile and execution time can be found in the corresponding lines of figures 1 and 2 in section 4.

3.3 Optimising Recursion

Recursion is fundamental to functional programming languages. Many functions are not special and thus cannot benefit from the very efficient handling of recursion by the C compiler, which takes advantage of the fast context-switching mechanism of RISC architectures. In order to reduce the cost of recursive procedure calls in ordinary functions, ΓCMC reuses the evaluation environment between function calls and replaces recursion by iteration.

3.3.1 Reusing the "Environment" Stack

The compilation schemes for the original ΓCMC machine [8] generated a new environment cell every time a recursive call was made and discarded the environment cell used for the previous call. In [7], schemes R and K were introduced to allow reusing the environment cell of recursive functions.

In the optimisation presented in section 3.1, actual parameters are stored in the stack and variables access them directly, without any need to generate environment cells. The changes performed to ΓCMC do not invalidate the possibility of reusing the evaluation environments of recursive calls. In this case, instead of reusing environments, we apply the same techniques explained in [7] to reuse cells in the stack that holds the actual parameters.

Tail-recursive functions are of widespread use in functional programs and are defined as:

fn z0 ... zn = if a then b else fn y0 ... yn

Similarly to the optimisation presented in [7], we can also apply the technique of reusing evaluation environments, provided the function is the outermost one under evaluation and no composition is generated.

Besides bringing a substantial gain in performance by reducing stack manipulation (about 7% in Pseudoknot), this optimisation also overcomes problems of stack overflow.

3.3.2 Replacing Recursion by Iteration

In [7] we showed that the technique of replacing recursion by iteration could be applied to simple recursive functions and to tail-recursive functions, provided they are the outermost functions in the script. In Pseudoknot this optimisation was responsible for a gain in user-time of about 7.8%.

3.4 Discarding Parameters

In some user-defined functions not all formal parameters appear in their bodies, exhibiting a projection-like behaviour. This kind of function is common whenever pattern-matching is compiled. The compilation rules of the original version of ΓCMC did not take this behaviour into account, and an environment to store all the actual parameters was generated.

In this optimisation, functions save only the parameters needed in their execution, discarding unnecessary parameters as soon as possible.

3.5 Unboxing

Storing either the original or intermediate data in cells of the abstract machine brings simplicity to it. On the other hand, this policy involves the cost of cell allocation and management. In order to avoid these costs, the original ΓCMC machine used unboxing in the intermediate steps of the computation of arithmetic operations and in evaluating strict functions whose results are of ground type (special functions), by working with them directly in C.

Following this philosophy, we introduce three new automatic optimisation strategies to ΓCMC, which are presented below. In section 4.1, along the same lines, user-provided annotations perform the unboxing of the abstract datatypes used in Pseudoknot.

3.5.1 Local Variables in C

Variables local to a function are accessed only within its code, without external reference from any other function in the script. If a local variable is of ground type, it can be translated directly into a C variable, without the generation of any cell of the abstract machine.

A reduction of about 170 thousand cells was observed in the space required by Pseudoknot, bringing a reduction in user-time of almost 2 seconds (20% faster than its predecessor).

3.5.2 Ground-type Actual Parameters

In this optimisation we allow any ground-type strict parameter to an ordinary function to be passed unboxed. This optimisation decomposes ordinary functions into their intrinsically ordinary and special parts. The performance figures for this optimisation can be found in the table of figure 2.

3.5.3 Unboxing Results

Often, the result of a function call or of the evaluation of an arithmetic expression is immediately used by another function call. ΓCMC stores this intermediate result in a cell for possible use at a later stage of the computation. If we are sure that the result is of ground type, we can avoid generating this cell and pass the result directly to the new function call.

This optimisation reduced the number of cells allocated in Pseudoknot by about 100,000, yielding a user-time reduction of about 5%.

3.6 Inlining C Function Definitions

A wise code-inlining policy may yield an increase in execution-time performance at a negligible cost in compile-time performance. Excessive code inlining may bring unnecessary code expansion, causing rapid growth in compile time, and may also bring run-time degradation. Code inlining in functional languages has been studied in depth by André Santos in his recent Ph.D. thesis [11], amongst other source-code optimisations in Haskell.

We adopted a simple strategy for inlining some of the C functions generated:

1. all non-recursive functions which are called only once in the body of the program are inlined.

2. small functions that do not call any other function and that appear fewer than 4 times in the code are also inlined. We consider a function small if its code is less than ten abstract-machine instructions.

This optimisation brought a gain in user-time of about 2.5%, as one can observe in the corresponding line of the table in figure 2.

4 Fine-Tuning the Code

In this section, we present a list of non-automatic (user-annotated) optimisation strategies that provided even better performance. This sort of optimisation, together with source-code massaging, was allowed in the Pseudoknot paper [2].

4.1 Unboxing Datatypes

As we explained in section 3.5 above, unboxing is known to be an important source of optimisation. We experimented with allowing user-annotated unboxed datatypes, i.e. the user is allowed to tag some datatypes to be implemented unboxed instead of boxed.

For our experiments we tagged the TFO and PT datatypes, which are the most used datatypes in Pseudoknot. Their type-definitions are:

type TFO :: float * float * float

type PT  :: float * float * float * float * float * float *
            float * float * float * float * float * float

The results were very significant: the number of cells fell to less than 25% of the original number, the number of garbage collections dropped from 7 to 4, and the execution time dropped by about one-third.

This optimisation was also used by some of the other Pseudoknot contributors [2].

4.2 Heap-size Variation

The garbage collection algorithm currently implemented in ΓCMC is a Fenichel-Yochelson copying collector [5]. The size of the heap available to the user program can have dramatic effects on the performance of every program, and this one is no exception. Unfortunately, this is yet another area where it is very difficult to obtain a good estimate of the ideal setup, so the only solution is to experiment with different configurations. In the case of the Pseudoknot program, the heap residency is very small, so one can easily use heap sizes that fit entirely in the secondary cache of the machine. This is confirmed by our experiments, which show that a smaller heap resulted in better performance. Notice that if we keep decreasing the heap we eventually start getting worse results again, due to the large number of garbage collections we have to perform. A detailed analysis of the cost of different garbage collection algorithms can be found in [5].

In our case, the data in figure 2 present a user-time variation of about 10%, but the system-time variation was about 500%. These performance figures show that garbage collection places no real overhead on user programs, contrary to the popular belief of some years ago.

4.3 Maximum-Map

It is not possible to map a function directly over a list of unboxed values using the normal map function, because unboxed values cannot be passed to polymorphic functions. In order to increase unboxing, the Glasgow Haskell version of Pseudoknot [2] used a composite function, called maximum_map, to replace three uses of the composition of the functions maximum and map that appear in the original code. The definition of this new function in Haskell is:

maximum_map :: (a -> Float#) -> [a] -> Float#
maximum_map f (h:t) = max f t (f h)
  where
    max :: (a -> Float#) -> [a] -> Float# -> Float#
    max f (x:xs) m = max f xs (let fx = f x in
                               if fx `gtFloat#` m then fx else m)
    max f [] m = m

At run-time, about 70 thousand fewer cells were needed, yielding a user-time reduction of about 5%.

4.4 Floating Point Precision

As we mentioned before, Pseudoknot is a floating-point intensive benchmark. Its execution involves about 7 million floating-point operations. Here we simply changed from double-precision to single-precision floating-point arithmetic.

Single precision brings a reduction in the size of cells, yielding a heap one-third smaller. The improvement in user-time performance is about 10%.

5 Overall Performance Results

The table in figure 1 presents the Pseudoknot compilation times for three versions of the ΓCMC machine, for the lazy and strict versions of the Glasgow Haskell compiler, and for Standard ML of New Jersey.

The table in figure 2 presents run-time performance figures for Pseudoknot for each of the optimisations implemented in ΓCMC, cumulatively. In the last rows we present the Pseudoknot performance in Glasgow Haskell (a lazy language, in both lazy and strict versions) and in Standard ML of New Jersey (a strict language, like the implementation of ΓCMC we used¹).

¹We also have a lazy version of ΓCMC.


The data were obtained on a Sun SPARC 20 clocked at 90 MHz with 64 Mb of RAM; we present user and system times.

6 Conclusions

This paper showed that restricting the use of environment cells to functions that generate closures makes the machine much neater and more efficient. The μΓCMC machine [1] is the result of incorporating the optimisations described in this paper directly into the compilation schemes and state-transition laws of ΓCMC. The number of compilation schemes and state-transition laws in μΓCMC is about one-third of those in ΓCMC.

The optimisations presented made the Pseudoknot program substantially faster in ΓCMC than in its original implementation. We believe that the performance gains obtained in Pseudoknot carry over to other programs. We are currently linking the Glasgow Haskell front-end to ΓCMC. This will make possible more extensive benchmarking using the NoFib suite [10]. We also envisage a number of new optimisations aimed at reducing the user-provided information required and at improving compile- and execution-time performance.

Acknowledgements

The ΓCMC group is supported by CNPq (Brazilian Government) grants 40.9110/88-4, 46.0782/89.4, and 80.4520/88-7, and by IBM-Brazil.

References

[1] G.G. da Cruz Neto, R.M.F. Lima, R.D. Lins & A.L.M. Santos. μΓCMC: A Faster Way of Implementing Functional Languages. In preparation.

[2] P.H. Hartel, M. Alt, R.D. Lins, et al. Benchmarking Implementations of Functional Languages with "Pseudoknot", a Float-Intensive Benchmark. To appear in Journal of Functional Programming, Cambridge University Press, 1996.

[3] P. Hudak, S.L. Peyton Jones, and P.L. Wadler (editors). Report on the Programming Language Haskell, a Non-strict Purely Functional Language, Version 1.2. ACM SIGPLAN Notices, 27(5):1-162, May 1992.

[4] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice Hall, Englewood Cliffs, New Jersey, 1988.

[5] R.E. Jones & R.D. Lins. Algorithms for Garbage Collection. John Wiley & Sons, 1996.

[6] R.D. Lins. Categorical Multi-Combinators. In Gilles Kahn, editor, Functional Programming Languages and Computer Architecture, pages 60-79. Springer-Verlag, September 1987. LNCS 274.

[7] R.D. Lins, G.G. da Cruz Neto & R.M.F. Lima. Implementing and Optimising ΓCMC. In Proceedings of Euromicro'94, pp. 353-361, IEEE Press, September 1994.

[8] R.D. Lins & B.O. Lira. ΓCMC: A Novel Way of Implementing Functional Languages. Journal of Programming Languages, 1:19-39, Chapman & Hall, January 1993.

[9] R. Milner, M. Tofte, and R. Harper. The Definition of Standard ML. MIT Press, Cambridge, Massachusetts, 1990.

[10] W. Partain. The NoFib Benchmark Suite. Available from: Computing Department, The University of Glasgow.

[11] A.L.M. Santos. Compilation by Transformation of Lazy Functional Languages. Ph.D. thesis, University of Glasgow, July 1995.

[12] Z. Shao. Compiling Standard ML for Efficient Execution on Modern Machines. Ph.D. thesis, Princeton University, Princeton, New Jersey, November 1994.


Figure 1: Compilation-time figures for Pseudoknot (ΓCMC; Haskell, lazy; Haskell, strict; SML-NJ).

Figure 2: Execution-time figures for Pseudoknot. The rows still legible give user time, system time, heap size, cells allocated and number of garbage collections: Maximum-Map: 4.04 s, 0.90 s, 6 Mb, 363,605 cells, 9 GCs; Single Precision: 3.53 s, 0.51 s, 4 Mb, 363,605 cells, 9 GCs.
