An Efficient Strip-mining Algorithm for Improving SRF Bandwidth Utilization on Imagine

Wenjing Yang†, Jing Du‡, Fujiang Ao‡, Xuejun Yang‡

† School of Computer, Beijing University of Aeronautics and Astronautics, Beijing, 100083, China

‡ PDL, School of Computer, National University of Defense Technology, Changsha, 410073, China

[email protected]

Abstract

Strip-mining is a crucial technique for memory hierarchy optimization. In this paper, we propose an efficient strip-mining algorithm for improving SRF bandwidth utilization on Imagine. First, we show how to determine the optimal kernel set for strip-mining, based on a novel structure that we call the kernel reuse graph. Second, we select the optimal strip size so as to balance stream reuse against stream prefetching. Finally, we present the strip-mining algorithm itself, which is implemented in SCompiler. The experimental results show that our strip-mining algorithm is a practical and promising way to improve SRF locality and hide memory access overhead on Imagine.

Keywords. Imagine, SRF, Strip-mining, Stream reuse,

Stream prefetching

1. Introduction

The Imagine processor is designed to address the processor-memory gap through streaming technology at low cost and low power [1-5]. It integrates a large on-chip memory, the Stream Register File (SRF), to buffer streams [6, 7]. Because memory latencies are high and increasing, optimizing SRF usage is crucial to achieving high performance on Imagine. Stream applications on Imagine are structured as computation kernels that operate on sequences of data records called streams [8-11]. Most applications face the same memory access bottleneck: they cannot fit all of their streams in the SRF, which harms SRF locality. To address this problem, long streams must be partitioned into segments known as strips, such that all of the intermediate state for the computation on a single strip fits in the SRF [6-9], thus avoiding the memory transfers. Figure 1 illustrates the strip-mining technique.

To implement this optimization, this paper proposes an efficient strip-mining algorithm that achieves high SRF bandwidth utilization on Imagine. The contributions of our work are as follows.

• The optimal kernel set for strip-mining is determined. Kernel reordering, based on a novel kernel reuse graph, gives this set high stream reuse between kernels.

• The optimal strip size is selected. The technique balances stream reuse against stream prefetching, so as to achieve high SRF reuse and hide memory latency effectively.

• An efficient strip-mining algorithm is proposed. It is implemented in SCompiler, a compiler that maps Fortran programs to Imagine.

To evaluate the effectiveness of the proposed strip-mining algorithm, we apply it to 9 application kernels on ISIM, a simulator of Imagine. The experimental results show that our strip-mining optimization is a practical and promising way to improve SRF locality and hide memory access overhead on Imagine.

The remainder of this paper is organized as follows:

Section 2 overviews the Imagine processing system.

Section 3 presents the key techniques of the proposed

strip-mining optimization. The experiments are presented in Section 4. In Section 5, we conclude the paper and discuss future work.

Third International IEEE Conference on Signal-Image technologies and Internet-Based System

978-0-7695-3122-9/08 $25.00 © 2008 IEEE

DOI 10.1109/SITIS.2007.23

Figure 1. Strip-mining technique

2. The Imagine Stream Processing System

Imagine, developed at Stanford University, is a single-chip stream processor. It consists of 48 ALUs arranged as 8 SIMD clusters and a three-level memory hierarchy designed to keep the functional units saturated during stream processing. The memory hierarchy consists of several local register files (LRFs), a 128 KB stream register file (SRF), and off-chip DRAM [12]. The architecture is centered on the SRF, which reads data from off-chip DRAM through a memory system interface and sequentially feeds the 8 arithmetic clusters. Figure 2 shows the Imagine stream architecture.

[Figure 2 block diagram. Labels: Host Processor; Imagine Stream Processor containing Stream Controller, Micro-controller, Network Interface, SRF, and arithmetic clusters 0 to 7 with their LRFs; DRAM Controller; four SDRAM chips; Network.]

Figure 2. The Imagine stream architecture

Programming on Imagine is divided into two levels: the stream level (using StreamC) and the kernel level (using KernelC). Logically, these levels correspond to stream scheduling and stream processing, respectively. In this model, an application is composed of a collection of data streams passing through a series of computation kernels. However, with this explicit stream model programmers must manage stream organization and communication themselves, which increases programming complexity [8-11]. Compiler optimization is therefore important for achieving significant performance improvement on the stream processor [13-16].

3. Strip-mining Technique

The strip-mining technique addresses the problem of long streams. In this section, we explore its key factors: the optimal kernel set for strip-mining, the optimal strip size, and the strip-mining implementation.

3.1. Optimal kernel set for strip-mining

To improve the degree of reuse in the SRF, we first need to determine the kernel set for strip-mining. This kernel set must benefit from the strip-mining operation.

Producer-consumer locality in the SRF is exposed by forwarding the streams produced by one kernel to subsequent kernels [12]. So that neighboring kernels are supplied with reused streams, the kernels need to be reordered for high computational intensity and good locality in the SRF. Kernel reordering must satisfy safety and profitability considerations. Safety means that reordering kernels cannot violate the ordering constraints implied by reuses. Profitability depends on the data reuse between kernels after reordering. To guarantee safety and profitability, some ordering constraints are needed.

Firstly, we should identify stream reuse between kernels, which is the basis of kernel reordering. Therefore, we propose the following definitions.

Definition 1. For kernels $K_i$ and $K_j$, if there is a stream reuse from $K_i$ to $K_j$, we say that there is a kernel reuse from $K_i$ to $K_j$ (denoted as $K_i \,\delta\, K_j$). The reuse distance $d(i, j)$ is defined as the total length of all streams in the two kernels, namely

$$d(i, j) = \sum \{\, len(D_x) \mid in(D_x, K_i) \lor in(D_x, K_j) \,\}$$

where $len(D_x)$ denotes the length of data stream $D_x$.

Based on the kernel reuses of a given program, we can obtain a kernel reuse graph, which abstracts the reuse relationships between kernels. For the example in Figure 3, we can conclude that $K_1 \,\delta\, K_2$ and $K_1 \,\delta\, K_3$.

K1(a, b)
K2(a, c)
K3(b, d)

[Kernel reuse graph with nodes K1, K2, and K3.]

Figure 3. Example program and its kernel reuse graph
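Definition 1 can be illustrated with a minimal Python sketch that builds a kernel reuse graph for the example program of Figure 3 and evaluates the reuse distance d(i, j). The function names (`build_reuse_graph`, `reuse_distance`) and the stream lengths are hypothetical choices made for this illustration; they are not part of SCompiler.

```python
from itertools import combinations


def build_reuse_graph(kernels):
    """Return {(Ki, Kj): shared streams} for each pair, in program order, that shares a stream."""
    edges = {}
    for (name_i, streams_i), (name_j, streams_j) in combinations(kernels, 2):
        shared = streams_i & streams_j
        if shared:
            edges[(name_i, name_j)] = shared
    return edges


def reuse_distance(streams_i, streams_j, length):
    """d(i, j): total length of every stream that appears in Ki or in Kj."""
    return sum(length[s] for s in streams_i | streams_j)


# Example program of Figure 3, listed in program order.
program = [("K1", {"a", "b"}), ("K2", {"a", "c"}), ("K3", {"b", "d"})]
# Hypothetical stream lengths in words (assumed, not taken from the paper).
length = {"a": 1024, "b": 1024, "c": 512, "d": 512}

graph = build_reuse_graph(program)
d_12 = reuse_distance(program[0][1], program[1][1], length)
print(sorted(graph))  # edges K1 -> K2 and K1 -> K3
print(d_12)           # len(a) + len(b) + len(c)
```

Representing each kernel by the set of streams it touches keeps the reuse test to a set intersection; a real compiler would also track whether each stream is read or written.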

Secondly, ordering constraints are proposed based on the definition of exchangeable kernels.

Definition 2. Neighboring kernels $K_i$ and $K_j$ are defined as exchangeable kernels if there is no stream reuse between them. Exchangeable kernels are denoted as $K_i \leftrightarrow K_j$.

As shown in Figure 3, $K_2 \leftrightarrow K_3$. Reordering exchangeable kernels cannot change the reuse order of the original program. This is the primary safety condition for kernel reordering, but it is not sufficient to guarantee profitability. Profitability requires that reordering not increase the original reuse distance; otherwise the kernel reordering is not profitable. Thus the definition of potentially adjacent kernels is proposed.

Definition 3. Suppose kernels $K_i$, $K_j$, and $K_k$ are executed in series. If $K_i \,\delta\, K_k$ and $K_j \leftrightarrow K_k$, then $K_i$ and $K_k$ are defined as potentially adjacent kernels, and the reuse between $K_i$ and $K_k$ is defined as a potentially adjacent reuse, denoted as $K_i \,\delta^{\leftrightarrow}\, K_k$.

Kernel reordering transforms the data reuses in the original program into potentially adjacent reuses, so as to improve the computational intensity and the data reuse between kernels. That is, we perform kernel reordering guided by potentially adjacent reuse on the original program. In the above example, $K_1 \,\delta^{\leftrightarrow}\, K_3$. Thus, we need to reorder the kernels in the program.
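The exchangeability test and the search for potentially adjacent reuses can be sketched as follows. This is only an illustration under two assumptions: kernels "executed in series" are taken to mean three consecutive kernels in program order, and a potentially adjacent reuse is reported for the first and third kernel of such a triple. All function names are hypothetical.

```python
def has_reuse(k_x, k_y):
    """True if the two kernels share at least one stream."""
    return bool(k_x[1] & k_y[1])


def exchangeable(k_x, k_y):
    """Neighboring kernels are exchangeable when no stream is reused between them."""
    return not has_reuse(k_x, k_y)


def potentially_adjacent_reuses(kernels):
    """Scan consecutive triples (Ki, Kj, Kk) and collect the pairs (Ki, Kk)
    for which Ki reuses into Kk while Kj and Kk are exchangeable."""
    found = []
    for k_i, k_j, k_k in zip(kernels, kernels[1:], kernels[2:]):
        if has_reuse(k_i, k_k) and exchangeable(k_j, k_k):
            found.append((k_i[0], k_k[0]))
    return found


# Example program of Figure 3, listed in program order.
program = [("K1", {"a", "b"}), ("K2", {"a", "c"}), ("K3", {"b", "d"})]
candidates = potentially_adjacent_reuses(program)
print(candidates)  # the pair (K1, K3)
```

The sketch only detects candidates; deciding whether a swap is profitable would additionally require comparing reuse distances before and after reordering.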

The kernel set for strip-mining may therefore be the maximal set that consists of potentially adjacent kernels. We define the kernel set for strip-mining as follows.

Definition 4. Any set of potentially adjacent kernels can be used as a kernel set for strip-mining. If the intersection of two kernel sets for strip-mining is not empty, the union of the two sets is also a kernel set for strip-mining.

After data-centric kernel reordering, the kernel sets for strip-mining are relatively intensive. To determine the optimum kernel set for strip-mining, reuse distance is used as the metric because it reflects the possibility of reuse in the SRF. The maximal reuse distance in the optimum kernel set for strip-mining must be smaller than the SRF size. Thus, the optimum kernel set is the maximal one in which all reuses can be optimized by strip-mining.

3.2. Optimal strip size selection

Based on the kernel reuse graph, the optimum object set for strip-mining can be obtained. We then make a conservative decision about partitioning the object set: we perform strip-mining when the total size of all the streams in the object set is larger than the SRF, namely

$$\sum_{\forall D_i \in S(OS)} (D_i) > C$$

where $OS$ denotes the object set, $S(OS)$ denotes the stream set of $OS$, $(D_i)$ stands for the size of stream $D_i$, and $C$ is the SRF size.
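The conservative decision above amounts to a single comparison. A minimal sketch, assuming the 128 KB SRF of Section 2 and 4-byte stream elements (the word size is an assumption made here, and `needs_strip_mining` is a hypothetical name):

```python
SRF_BYTES = 128 * 1024  # SRF capacity stated in Section 2
WORD_BYTES = 4          # assumption: 32-bit stream elements
SRF_WORDS = SRF_BYTES // WORD_BYTES


def needs_strip_mining(stream_lengths, capacity=SRF_WORDS):
    """Strip-mine only when the streams of the object set exceed the SRF size C."""
    return sum(stream_lengths) > capacity


print(needs_strip_mining([20000, 20000]))      # 40000 words do not fit in 32768
print(needs_strip_mining([8000, 8000, 8000]))  # 24000 words fit
```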

Selecting the optimal strip size is a crucial optimization for the strip-mining technique. Since Imagine is an access/execute decoupled processor, the strip size depends on the tradeoff between the benefits and overheads of stream reuse and stream prefetching [8]. Figure 4 shows strip-mining combined with stream reuse and stream prefetching.

On the one hand, stream prefetching is a latency tolerance method. Its benefit comes from overlapping memory access time with computation time. But prefetched streams occupy space that could hold reused streams, which may increase the number of kernels and thus the overhead of kernel switching. In particular, if the memory access time is not comparable with the computation time, stream prefetching cannot deliver its benefit. Figure 4(a) shows the strip-mining optimization when prefetching is considered.

On the other hand, stream reuse is a latency avoidance method. Its benefit comes from reducing the absolute number of memory accesses. But as the strip size increases, there is not enough space left for stream prefetching, which lowers performance. Figure 4(b) shows the strip-mining optimization when stream reuse is considered.

In conclusion, stream reuse should be the first criterion for strip size selection; that is, the optimal strip size must guarantee that all exploitable reuses are actually exploited. Then, if there is little stream reuse, stream prefetching should be taken into account in sizing the strips, as shown in Figure 4(c).

[Figure 4 panels: (a) Strip-mining only with prefetching, showing input, kernel, and output activity for strips 1 to 5 over time; (b) Strip-mining only with reuse; (c) Strip-mining with both prefetching and reuse. Panels (b) and (c) label input reuse, flow reuse, output reuse, and no reuse.]

Figure 4. Strip-mining with reuse and prefetching

Strip-mining aims at reuse or prefetching. When double-buffering is not used, we should both increase stream reuse and reduce the number of kernels. Given a kernel in the object set for strip-mining, there is certainly stream reuse with its adjacent kernels, but the reuse does not always cover whole streams. The reuse region must therefore be identified in order to select the strip size more precisely. Since different reuse regions are consumed by successive kernels after strip-mining, we should guarantee that the reuse distance between kernels is smaller than the SRF size. To explain our technique, we first define the stream reuse region.

Definition 5. The stream reuse region is defined as the stream region that is reused by adjacent kernels, and for which the reuse distance between the kernels cannot be reduced to a certain size that is smaller than the SRF. The number of successive reuses of the stream reuse region is defined as the reuse degree of the stream reuse region.

Thus, we can obtain a certain stream reuse region $B$. To obtain the maximal partition of kernels without double-buffering, we determine the critical region $B'$ at which double-buffering occurs for the maximal stream. The size of $B'$ depends on whether stream prefetching is performed. When the reuse degree of stream reuse region $B$ (denoted as $M$) reaches a certain threshold $TH$, the priority of stream prefetching is decreased, so a large reuse region can be loaded into the SRF to achieve high SRF bandwidth utilization. Otherwise, we must reserve enough SRF space for the input streams of the next kernel. The size of $B'$ can be formulated as follows, where $f_i(B')$ is a mapping function that gives the reuse region of stream $D_i$ corresponding to $B'$.

$$M \geq TH \;\rightarrow\; B' + \sum f_i(B') < C \;\&\; \lfloor B'/8 \rfloor \in N$$

$$M < TH \;\rightarrow\; 2B' + 2\sum f_i(B') < C \;\&\; \lfloor B'/8 \rfloor \in N$$

Then the maximal kernel partition is produced by solving for the size of $B' = \lfloor B' \rfloor$, so as to ensure effective SRF reuse. Thus, the optimal strip size can be selected as follows, where $S_0$ denotes the optimal strip size.

$$B \leq B' \;\rightarrow\; S_0 = B$$

$$B > B' \;\rightarrow\; S_0 = B'$$

With the optimal strip size $S_0$, we can partition the kernels more efficiently, so as to achieve high SRF locality and effective overlap between computation and memory access.
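The selection rule can be sketched in Python under explicit assumptions. The constraint is taken in the form B' + Σ f_i(B') < C when M ≥ TH, with both terms doubled when M < TH to leave room for prefetched streams. The condition on B'/8 is read here as "B' is a multiple of the 8 clusters", which is an interpretation rather than something the text states. The mapping functions f_i are assumed to be non-decreasing so that a linear search is valid, and the two example mappings are purely hypothetical.

```python
def critical_region(capacity, reuse_degree, threshold, mappings, clusters=8):
    """Largest B' (a multiple of `clusters`) that satisfies the SRF constraint.

    factor = 1 when reuse has priority (M >= TH), otherwise 2 so that half of
    the SRF stays free for prefetching the next kernel's input streams.
    """
    factor = 1 if reuse_degree >= threshold else 2
    best = 0
    b = clusters
    while factor * (b + sum(f(b) for f in mappings)) < capacity:
        best = b
        b += clusters
    return best


def optimal_strip_size(reuse_region, critical):
    """S0 = B when B <= B', otherwise S0 = B'."""
    return reuse_region if reuse_region <= critical else critical


C = 32768  # SRF size in words (128 KB, assuming 4-byte words)
# Hypothetical mappings: one stream reused in full, one reused by half.
mappings = [lambda b: b, lambda b: b // 2]

b_reuse = critical_region(C, reuse_degree=4, threshold=2, mappings=mappings)
b_prefetch = critical_region(C, reuse_degree=1, threshold=2, mappings=mappings)
print(b_reuse, b_prefetch)
print(optimal_strip_size(4096, b_reuse), optimal_strip_size(20000, b_reuse))
```

A linear search is used for clarity; with monotone mappings a binary search over multiples of 8 would give the same B' in fewer steps.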

3.3. Implementation

[Figure 5 diagram. SCompiler pipeline labels: Fortran code, Parser, Streamization, Stream Intermediate Code Generation, Code Optimization, Resource Allocation, Low-level Code Generation, Low-level Compiler, Imagine. Strip-mining steps: Kernel Reordering, Kernel Set Determination, Optimal Strip Size Selection, Stripmining.]

Figure 5. Framework of the SCompiler.

Figure 5 gives the framework of the SCompiler developed for the Fortran language. The work of this paper focuses on code optimization (the grey part of the framework). The right part of Figure 5 shows the steps of the strip-mining optimization. These steps run after front-end parsing, streamization, and generation of the stream-level intermediate code. Taking the stream-level intermediate code as input, the compiler constructs the kernel reuse graph. Based on the graph, it reorders the kernels to achieve high stream reuse between kernels. Next, the optimal kernel set for strip-mining is obtained, and the optimal strip size is selected based on the optimized stream program. Finally, the strip-mining optimization is performed to enhance SRF bandwidth utilization. The detailed strip-mining algorithm is shown in Figure 6.

For the example program shown in Figure 3, there is stream reuse between kernels according to the analysis in Section 3.1. When the streams of a kernel are larger than the SRF, the strip-mining algorithm is applied to partition long streams into optimal strips, so as to achieve high SRF reuse. The optimized program is given in Figure 7, and the corresponding SRF reuse result is shown in Figure 8. Streams a and b in kernel K1 can be fully reused after strip-mining.

ALGORITHM Strip-mining()
Input: the stream intermediate code
Output: the optimized code
Invariants: TH = reuse threshold
            S_0 = the optimal strip size
            L = \sum l_i
// obtain the optimal kernel set
Build the kernel reuse graph
for kernels K_i and K_j
    if \neg (K_i \delta^{\leftrightarrow} K_k)
        Reorder K_i and K_j
Determine the kernel set OS for strip-mining
// select the optimal strip size
Obtain the reuse degree M of each reuse region B
if M \geq TH
    B' + \sum f_i(B') < C
else
    B'/2 + \sum f_i(B') < C/2
Solve for B' = \lfloor B' \rfloor
if B < B'
    S_0 = B
else
    S_0 = B'
Partition the kernels in OS in terms of S_0
Reorder the new subkernels

Figure 6. Strip-mining algorithm.

for (int i = 0; i < N; i = i + strip) {
    K1(a(i, (i + strip)), b(i, (i + strip)));
    K2(a(i, (i + strip)), c(i, (i + strip)));
    K3(b(i, (i + strip)), d(i, (i + strip)));
}

Figure 7. Example program after strip-mining.
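The loop structure of Figure 7 can be mimicked in Python to check that strip-mining does not change the result of a kernel sequence. The three kernel bodies below are hypothetical element-wise stand-ins (the paper does not give them), and, unlike the loop in Figure 7, the last strip is clamped to the stream length so that N need not be a multiple of the strip size.

```python
def strips(n, strip):
    """Yield (lo, hi) bounds that cover range(n) in strips of at most `strip` elements."""
    for lo in range(0, n, strip):
        yield lo, min(lo + strip, n)


# Hypothetical element-wise kernels standing in for K1, K2, and K3.
def k1(lo, hi):
    """Produce one strip of streams a and b."""
    a = [float(i) for i in range(lo, hi)]
    b = [2.0 * i for i in range(lo, hi)]
    return a, b


def k2(a):
    """Consume a strip of a and produce the matching strip of c."""
    return [x + 1.0 for x in a]


def k3(b):
    """Consume a strip of b and produce the matching strip of d."""
    return [x * x for x in b]


def run_strip_mined(n, strip):
    """Run K1, K2, K3 strip by strip; the strips of a and b never leave the loop body."""
    c, d = [], []
    for lo, hi in strips(n, strip):
        a, b = k1(lo, hi)
        c.extend(k2(a))
        d.extend(k3(b))
    return c, d


print(list(strips(10, 4)))
print(run_strip_mined(10, 4) == run_strip_mined(10, 10))
```

Keeping the strips of a and b local to one loop iteration is the software analogue of holding them in the SRF between kernels instead of writing them back to memory.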

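[Figure 8 diagram: SRF space versus time for one strip, showing streams a, b, c, and d across kernels K1, K2, and K3, with the reused strips marked.]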

Figure 8. SRF reuse result.

4. Experimental Results and Analysis

To evaluate the effectiveness of our strip-mining algorithm, we select the 9 scientific applications listed in Table 1. NLAG-5 is a nonlinear algebra solver for two-dimensional nonlinear diffusion in hydrodynamics [17], and Transp comprises the time-consuming subroutines of Capao, an optics application. All programs are Fortran versions, and each is compiled in three ways: with Intel's compiler ifort (version 9.0) using the optimization option -O3, with SCompiler without any strip-mining optimization, and with SCompiler using the proposed strip-mining algorithm. The first results (denoted as Seri) are executed on a single-core Itanium 2 server. The Itanium 2 runs at 1.6 GHz, with a 16 KB L1 cache, a 256 KB L2 cache, and a 6 MB L3 cache, and has 4 GB of off-chip memory with a bandwidth of 6.4 GB/s. The latter two results (denoted as Orig and SMA) are executed on ISIM [18, 19], a cycle-accurate simulator of Imagine. ISIM runs at 500 MHz.

The execution time is obtained by inserting clock-fetch assembly instructions. If the data size of a program is small, we eliminate extra overheads (such as system calls) by executing it multiple times and averaging the time. As I/O overheads are hidden in our experiments, the CPU time is nearly equal to the wall-clock time.

Table 1. Specifications of 9 benchmarks

Name Swim EP MG DFFT Laplace Jacobi GEMM NLAG-5 Transp

Source Spec2000 NPB NPB - NCSA - BLAS -

#Arrays 14 1 3 1 1 4 2 2 5

Prob. Size 513×513 131072 64×64×64 4096 256×256 128×128 256×256 256×256 512×512

[Figure 11 panel (b) SRF to memory. Vertical axis: throughput ratio, 0 to 8; horizontal axis: Swim, EP, MG, DFFT, Laplace, Jacobi, GEMM, NLAG-5, Transp; series: Orig and SMA.]

Figure 11. SRF- and LRF-to-memory throughput ratios

Figure 12 gives the proportions of computation time and memory access time in the total execution time. The totals can exceed 100% because computation and memory access overlap. As shown in Figure 12, memory access time occupies nearly 100% of the total execution time for most Orig versions, which means that memory access overhead dominates the execution time. For the SMA versions, the total proportions of most programs are over 100%, which means that memory access and computation are overlapped by the strip-mining optimization. The memory proportion of EP, however, changes only a little. This is because EP is a computation-intensive kernel, so overlapping computation time with memory access time is ineffective.

[Figure 12 bar chart. Vertical axis: proportion of total execution time (%), 0 to 100; horizontal axis: Swim, EP, MG, DFFT, Laplace, Jacobi, GEMM, NLAG-5, Transp; series: Mem_Orig, Ker_Orig, Mem_SMA, Ker_SMA.]

Figure 12. Percentages of memory access and kernel execution.

5. Conclusion and Future Work

The stream architecture is a novel microprocessor

architecture with wide application potential. Despite

the presence of an efficient bandwidth hierarchy,

performance of applications may still be constrained

by bandwidth if the computation performed per unit of

data accessed at any level of the memory hierarchy is

less than what can be sustained by the arithmetic units.

The Stream Register File (SRF) is a large on-chip

memory of the stream processor. Thus, applying

compiler techniques for SRF optimization is essential

for good performance.

In this paper, we propose an efficient strip-mining

algorithm for SRF bandwidth utilization optimization

on Imagine. The contributions of our work include the

following aspects. We first present the determination

of optimal kernel set for strip-mining based on a novel

kernel reuse graph. Then we address the key technique of strip size selection, aiming at achieving the maximal performance from stream reuse and stream prefetching.

Lastly, we propose the efficient strip-mining

implementation. The experimental results on some

typical application kernels show that our strip-mining

algorithm is a practical and promising solution to

improve SRF bandwidth utilization on Imagine.

In the future, our efforts will mainly focus on two

aspects. Firstly, we plan to apply our algorithm to more

applications to evaluate the optimal strip size selection

algorithm on Imagine. Secondly, we would like to

exploit more compiler optimizations for memory

hierarchy on the stream processor.

Acknowledgements. We gratefully thank the Stanford

Imagine team for the use of their compilers and

simulators. We also acknowledge the reviewers for

their insightful comments. This work was supported by

NSFC (60621003).

References

[1] U. J. Kapasi, S. Rixner, W. J. Dally, et al. Programmable Stream Processors. IEEE Computer, pp. 54-62, 2003.

[2] B. Khailany. The VLSI Implementation and Evaluation of Area- and Energy-Efficient Streaming Media Processors. Ph.D. thesis, Stanford University, 2003.

[3] B. Khailany, W. J. Dally, et al. VLSI Design and

Verification of the Imagine Processor. In Proceedings of

the IEEE International Conference on Computer Design,

pp. 289-294, 2002.

[4] A. A. Lamb, W. Thies, and S. Amarasinghe. Linear Analysis and Optimization of Stream Programs. In

Proceedings of the SIGPLAN '03 Conference on

Programming Language Design and Implementation,

San Diego, CA, 2003.

[5] Jung Ho Ahn, William J. Dally, et al. Evaluating the Imagine Stream Architecture. In Proceedings of the Annual International Symposium on Computer Architecture 2004, 2004.

[6] A. Das, W. J. Dally, and P. Mattson. Compiling for Stream Processing. In PACT '06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pp. 33-42. ACM Press, New York, 2006.

[7] U. J. Kapasi, P. Mattson, et al. Stream Scheduling. In Proceedings of the 3rd Workshop on Media and Streaming Processors, pp. 101-106, 2001.

[8] P. Mattson. A Programming System for the Imagine

Media Processor. Ph.D. thesis, Dept. of Electrical

Engineering, Stanford University, 2002.

[9] M. Erez. Merrimac – High-Performance, Highly-

Efficient Scientific Computing with Streams. Ph.D.

thesis, Dept. of Electrical Engineering, Stanford

University, 2007.

[10] J. D. Owens, S. Rixner, et al. Media Processing Applications on the Imagine Stream Processor. In Proceedings of the 2002 International Conference on Computer Design, 2002.

[11] S. Amarasinghe, et al. Stream Languages and

Programming Models. In Proceedings of the

International Conference on Parallel Architectures and

Compilation Techniques 2003, 2003.

[12] Nuwan S. Jayasena. Memory Hierarchy Design for Stream Computing. Ph.D. thesis, Stanford University, 2005.

[13] J. Du, X. Yang, et al. Architecture-Based Optimization for Mapping Scientific Applications to Imagine. In ISPA'07: Proceedings of the 2007 International Symposium on Parallel and Distributed Processing with Applications, Ontario, Canada, 2007.

[14] J. Du, X. Yang, et al. Scientific Computing Applications

on the Imagine Stream Processor. In Proceedings of the

11th Asia-Pacific Computer Systems Architecture

Conference, Shanghai, China, 2006.

[15] O. Johnsson, M. Stenemo, Z. ul-Abdin. Programming & Implementation of Streaming Applications. Master's thesis, Computer and Electrical Engineering, Halmstad University, 2005.

[16] M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. In Proceedings of ASPLOS'06, California, USA, 2006.

[17] X. Yang, X. Yan, Z. Xing, et al. A 64-bit Stream

Processor Architecture for Scientific Applications. In

ISCA'07: Proceedings of the 34th Annual International

Symposium on Computer Architecture, pp. 210-219.

ACM Press, New York, 2007.

[18] J. Suh, E.G. Kim, et al. A Performance Analysis of PIM,

Stream Processing, and Tiled Processing on Memory-

Intensive Signal Processing Kernels. In Proceedings of

the international symposium on Computer Architecture

2003, 2003.

[19] A. Das, et al. Imagine Programming System User’s

Guide 2.0. June 2004.
