An Efficient Strip-mining Algorithm for Improving SRF Bandwidth
Utilization on Imagine
Wenjing Yang†, Jing Du‡, Fujiang Ao‡, Xuejun Yang‡
† School of Computer, Beijing University of Aeronautics and Astronautics, Beijing, 100083, China
‡ PDL, School of Computer, National University of Defense Technology, Changsha, 410073, China
Abstract
Strip-mining is a crucial technique for memory hierarchy optimization. In this
paper, we propose an efficient strip-mining algorithm for improving SRF
bandwidth utilization on Imagine. First, we show how to determine the optimal
kernel set for strip-mining, based on a novel structure that we call the kernel
reuse graph. Second, we select the optimal strip size so as to balance stream
reuse against stream prefetching. Finally, we present the strip-mining
algorithm itself, which is implemented in SCompiler. The experimental results
show that our strip-mining algorithm is a practical and promising way to
improve SRF locality and hide memory access overhead effectively on Imagine.
Keywords. Imagine, SRF, Strip-mining, Stream reuse,
Stream prefetching
1. Introduction
The Imagine processor is designed to address the processor-memory gap through
streaming technology at low cost and low power [1-5]. It integrates a large
on-chip memory called the Stream Register File (SRF) to buffer streams [6, 7].
Because memory latencies are high and still increasing, optimizing SRF usage is
crucial to achieving high performance on Imagine. Stream applications on
Imagine are structured as computation kernels that operate on sequences of data
records called streams [8-11]. Most applications face the same memory access
bottleneck: they cannot fit all of their streams in the SRF, which harms SRF
locality. To address the problem, long streams must be partitioned into
segments known as strips, such that all of the intermediate state for the
computation on a single strip fits in the SRF [6-9], thus avoiding the memory
transfers. Figure 1 diagrams the strip-mining technique.
In order to implement this optimization, this paper
proposes an efficient strip-mining algorithm to achieve
high SRF bandwidth utilization on Imagine. The
contributions of our work are as follows.
• The optimal kernel set for strip-mining is determined. It exhibits high
  stream reuse between kernels by kernel reordering, which is based on a novel
  kernel reuse graph.
• The optimal strip size is selected. The technique makes a tradeoff between
  stream reuse and stream prefetching, so as to achieve high SRF reuse and hide
  memory latency effectively.
• An efficient strip-mining algorithm is proposed. It is implemented in
  SCompiler, which is a compiler used to map Fortran programs to Imagine.
To evaluate the effectiveness of the proposed strip-mining algorithm, we apply
it to 9 application kernels on the ISIM simulator of Imagine. The experimental
results show that our strip-mining optimization is a practical and promising
way to improve SRF locality and hide memory access overhead effectively on
Imagine.
The remainder of this paper is organized as follows:
Section 2 overviews the Imagine processing system.
Section 3 presents the key techniques of the proposed
strip-mining optimization. The experiments are shown
in Section 4. In Section 5, we conclude the paper and
discuss future work.
Third International IEEE Conference on Signal-Image technologies and Internet-Based System
978-0-7695-3122-9/08 $25.00 © 2008 IEEE
DOI 10.1109/SITIS.2007.23
Figure 1. Strip-mining technique
2. The Imagine Stream Processing System
Imagine, developed at Stanford University, is a single-chip stream processor.
It consists of 48 ALUs arranged as 8 SIMD clusters and a three-level memory
hierarchy designed to keep the functional units saturated during stream
processing. The memory hierarchy consists of several local register files
(LRFs), a 128 KB stream register file (SRF), and off-chip DRAM [12]. The
architecture is centered around the SRF, which reads data from off-chip DRAM
through a memory system interface and sequentially feeds the 8 arithmetic
clusters. Figure 2 shows the Imagine stream architecture.
[Figure 2 block diagram labels: Host Processor; Imagine Stream Processor with
Stream Controller, Micro-controller, arithmetic clusters (cluster 0 to
cluster 7) with their LRFs, SRF, Network Interface, and DRAM Controller;
four SDRAM chips; Network.]
Figure 2. The Imagine stream architecture
Programming on Imagine is divided into two levels: the stream level (using
StreamC) and the kernel level (using KernelC). Logically, these levels
correspond to stream scheduling and stream processing, respectively. In this
model, an application is composed of a collection of data streams passing
through a series of computation kernels. However, programmers must manage
stream organization and communication explicitly in this stream model, which
increases programming complexity [8-11]. Compiler optimization is therefore
important for achieving significant performance improvements on the stream
processor [13-16].
3. Strip-mining Technique
The strip-mining technique is used to address the problem of long streams. In
this section, we explore the key factors of the technique: the optimal kernel
set for strip-mining, the optimal strip size, and the strip-mining
implementation.
3.1. Optimal kernel set for strip-mining
To improve the degree of reuse in the SRF, we first need to determine the
kernel set for strip-mining. This kernel set must benefit from the
strip-mining operation.
Producer-consumer locality in the SRF is exposed by forwarding the streams
produced by one kernel to the subsequent kernels [12]. To supply neighboring
kernels with reused streams, the kernels need to be reordered for high
computational intensity and good locality in the SRF. Kernel reordering must
satisfy safety and profitability considerations. Safety means that reordering
kernels cannot violate the ordering constraints implied by reuses.
Profitability depends on the data reuse between kernels after reordering. To
guarantee safety and profitability, some ordering constraints need to be
defined.
First, we identify stream reuse between kernels, which is the basis of kernel
reordering. We therefore propose the following definitions.
Definition 1. For kernels $K_i$ and $K_j$, if there is a stream reuse from
$K_i$ to $K_j$, we say that $K_i$ reuses $K_j$ (denoted as
$K_i \, \delta \, K_j$). The reuse distance $d(i, j)$ is defined as the total
length of all streams in the two kernels, namely
$$d(i, j) = \sum_{D_x \text{ in } K_i \,\vee\, D_x \text{ in } K_j} \mathrm{len}(D_x),$$
where $\mathrm{len}(D_x)$ denotes the length of data stream $D_x$.
Based on the kernel reuse of a given program, we can obtain a kernel reuse
graph, which abstracts the reuse relationships between kernels. For the example
in Figure 3, we can conclude that $K_1 \, \delta \, K_2$ and
$K_1 \, \delta \, K_3$.
K1(a, b)
K2(a, c)
K3(b, d)
[Kernel reuse graph: nodes K1, K2, and K3.]
Figure 3. Example program and its kernel reuse graph
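The construction of the kernel reuse graph and the reuse distance of Definition 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not SCompiler's implementation: the names `program`, `stream_len`, `build_reuse_graph`, and `reuse_distance` are hypothetical, the stream lengths are made-up values, and the sketch assumes that a stream shared by both kernels is counted once in the distance.

```python
from itertools import combinations

# Example program from Figure 3: each kernel and the streams it accesses.
program = {"K1": ["a", "b"], "K2": ["a", "c"], "K3": ["b", "d"]}

# Hypothetical stream lengths in words (not taken from the paper).
stream_len = {"a": 1024, "b": 1024, "c": 512, "d": 512}


def build_reuse_graph(program):
    """Map each kernel pair (Ki, Kj), Ki earlier in program order, to the
    set of streams they share. A pair appears only if it shares a stream."""
    order = list(program)
    edges = {}
    for i, j in combinations(range(len(order)), 2):
        shared = set(program[order[i]]) & set(program[order[j]])
        if shared:
            edges[(order[i], order[j])] = shared
    return edges


def reuse_distance(program, stream_len, ki, kj):
    """d(i, j): total length of the streams that appear in Ki or Kj.
    Assumption: a stream used by both kernels is counted once."""
    streams = set(program[ki]) | set(program[kj])
    return sum(stream_len[s] for s in streams)


graph = build_reuse_graph(program)
print(sorted(graph))                                    # [('K1', 'K2'), ('K1', 'K3')]
print(reuse_distance(program, stream_len, "K1", "K2"))  # 2560
```

For the example program, the only reuse edges are K1 to K2 (through stream a) and K1 to K3 (through stream b), which matches the relations stated in the text.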
Second, ordering constraints are proposed based on the definition of
exchangeable kernels.
Definition 2. Neighboring kernels $K_i$ and $K_j$ are defined as exchangeable
kernels if there is no stream reuse between them. Exchangeable kernels are
denoted as $K_i \leftrightarrow K_j$.
In Figure 3, $K_2 \leftrightarrow K_3$. Reordering exchangeable kernels cannot
change the reuse order of the original program; this is the primary safety
condition for kernel reordering. It is not sufficient, however, to guarantee
profitability. Profitability requires that reordering not increase the
original reuse distance; otherwise the reordering is not profitable. We
therefore define potentially adjacent kernels.
Definition 3. Suppose kernels $K_i$, $K_j$, and $K_k$ are executed serially. If
$K_i \, \delta \, K_k$ and $K_j \leftrightarrow K_k$, then $K_i$ and $K_k$ are
defined as potentially adjacent kernels, and the reuse between $K_i$ and $K_k$
is defined as a potentially adjacent reuse, denoted as
$K_i \overset{\leftrightarrow}{\delta} K_k$.
Kernel reordering transforms the data reuses in the original program into
potentially adjacent reuses, so as to improve the computational intensity and
data reuse between kernels. That is, we perform kernel reordering on the
original program guided by potentially adjacent reuse. In the above example,
$K_1 \overset{\leftrightarrow}{\delta} K_3$, so the kernels in the program need
to be reordered.
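The reordering step can be illustrated with a small Python sketch. It reads the potentially adjacent relation the way the worked example uses it: K1 reuses a stream with K3, and K2 and K3 are exchangeable, so K3 can be moved next to K1. The helper names (`reuses`, `potentially_adjacent`, `reorder_once`) are hypothetical, and the single-swap policy is an assumption made for brevity rather than the compiler's actual strategy.

```python
# Example program from Figure 3: each kernel and the streams it accesses.
program = {"K1": ["a", "b"], "K2": ["a", "c"], "K3": ["b", "d"]}


def reuses(program, ki, kj):
    """True if the two kernels share at least one stream."""
    return bool(set(program[ki]) & set(program[kj]))


def potentially_adjacent(program, order):
    """Return pairs (Ki, Kk) where Ki and Kk share a stream and the kernel Kj
    between them is exchangeable with Kk (no stream shared by Kj and Kk)."""
    found = []
    for p in range(len(order) - 2):
        ki, kj, kk = order[p:p + 3]
        if reuses(program, ki, kk) and not reuses(program, kj, kk):
            found.append((ki, kk))
    return found


def reorder_once(program, order):
    """Swap Kj and Kk for the first potentially adjacent reuse found, so
    that Ki and Kk become neighbors. Returns a new list."""
    order = list(order)
    for p in range(len(order) - 2):
        ki, kj, kk = order[p:p + 3]
        if reuses(program, ki, kk) and not reuses(program, kj, kk):
            order[p + 1], order[p + 2] = kk, kj
            break
    return order


original = ["K1", "K2", "K3"]
print(potentially_adjacent(program, original))  # [('K1', 'K3')]
print(reorder_once(program, original))          # ['K1', 'K3', 'K2']
```

The swap is safe in the sense of Definition 2 because only exchangeable kernels change places; whether it is profitable still depends on the reuse distances, which this sketch does not check.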
The kernel set for strip-mining may therefore be the maximal set that consists
of potentially adjacent kernels. We define the kernel set for strip-mining as
follows.
Definition 4. An arbitrary set of potentially adjacent kernels can be used as
a kernel set for strip-mining. If the intersection of two kernel sets for
strip-mining is not empty, the union of the two sets is also a kernel set for
strip-mining.
After data-centric kernel reordering, the kernel sets for strip-mining are
relatively intensive. To determine the optimum kernel set for strip-mining,
reuse distance is used as the metric because it reflects the possibility of
reuse in the SRF. The maximal reuse distance in the optimum kernel set must be
smaller than the SRF size. Thus, the optimum kernel set is the maximal one in
which all reuse can be optimized by strip-mining.
3.2. Optimal strip size selection
Based on the kernel reuse graph, the optimum object set for strip-mining can
be obtained. We then make a conservative decision about partitioning the
object set: we perform strip-mining when the total size of all the streams in
the object set is larger than the SRF, namely
$\sum_{\forall D_i \in S(OS)} D_i > C$, where $OS$ denotes the object set,
$S(OS)$ denotes the stream set of $OS$, and $C$ is the SRF size.
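As a minimal sketch of this conservative test, the check below sums the stream sizes of an object set and compares the total with the SRF capacity. The 128 KB SRF size comes from Section 2; the 4-byte word size, the function name `needs_strip_mining`, and the example stream sizes are assumptions for illustration only.

```python
SRF_BYTES = 128 * 1024   # SRF size stated in Section 2
WORD_BYTES = 4           # assumption: 32-bit stream elements
SRF_WORDS = SRF_BYTES // WORD_BYTES


def needs_strip_mining(stream_words, srf_words=SRF_WORDS):
    """Return True when the streams of the object set, taken together,
    are larger than the SRF (sizes given in words)."""
    return sum(stream_words.values()) > srf_words


# Hypothetical object sets: stream name -> length in words.
small_set = {"a": 4096, "b": 4096, "c": 4096}
large_set = {"a": 65536, "b": 65536}

print(needs_strip_mining(small_set))  # False
print(needs_strip_mining(large_set))  # True
```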
Selecting the optimal strip size is a crucial optimization for the
strip-mining technique. Since Imagine is an access/execute decoupled
processor, the strip size depends on the tradeoff between the benefits and
overheads of stream reuse and stream prefetching [8]. Figure 4 shows
strip-mining with stream reuse and stream prefetching. On one side, stream
prefetching is a latency tolerance method. Its benefit comes from overlapping
memory access time with computation time. However, prefetched streams occupy
space that could hold reused streams, which may increase the number of kernels
and hence the overhead of kernel switching. In particular, if the memory
access time is not comparable with the computation time, stream prefetching
cannot deliver its benefit. Figure 4(a) shows strip-mining optimized for
prefetching. On the other side, stream reuse is a latency avoidance method.
Its benefit comes from reducing the absolute number of memory accesses.
However, as the strip size increases, there is not enough space left for
stream prefetching, resulting in low performance. Figure 4(b) shows
strip-mining optimized for stream reuse. In conclusion, stream reuse should be
the first consideration in strip size selection: the optimal strip size must
guarantee that all exploitable reuses are actually realized. Then, if there is
little stream reuse, stream prefetching should be introduced into the
strip-mining technique, as shown in Figure 4(c).
[Figure 4 panels: (a) strip-mining only with prefetching, showing input,
kernel, and output of strips 1 to 5 over time; (b) strip-mining only with
reuse, with regions labeled input reuse, flow reuse, output reuse, and no
reuse; (c) strip-mining with both prefetching and reuse.]
Figure 4. Strip-mining with reuse and prefetching
Strip-mining aims at either reuse or prefetching. Assuming that
double-buffering is not used, we should both increase stream reuse and reduce
the number of kernels. Given a kernel in the object set for strip-mining,
there certainly exists stream reuse with its adjacent kernels, but the reuse
does not always cover the whole streams. The reuse region must therefore be
determined in order to select the strip size more precisely. Since different
reuse regions are consumed by successive kernels after strip-mining, we must
guarantee that the reuse distance between kernels is smaller than the SRF
size. To explain our technique, we first define the stream reuse region.
Definition 5. The stream reuse region is defined as the stream region that is
reused by adjacent kernels and for which the reuse distance between the
kernels cannot be reduced to a size smaller than the SRF. The number of
successive reuses of the stream reuse region is defined as the reuse degree of
the stream reuse region.
Thus, we can obtain a certain stream reuse region $B$. To obtain the maximal
partition of kernels without double-buffering, we determine the critical
region $B'$ at which double-buffering would occur for the maximal stream. The
size of $B'$ depends on whether stream prefetching is performed. When the
reuse degree of stream reuse region $B$ (denoted as $M$) is no smaller than a
certain threshold $TH$, the priority of stream prefetching is decreased, so a
large reuse region can be loaded into the SRF to achieve high SRF bandwidth
utilization. Otherwise, we must reserve enough SRF space for the input streams
of the next kernel. The size of $B'$ can be formulated as follows, where
$f_i(B')$ is a mapping function that gives the corresponding reuse region of
stream $D_i$ from $B'$.
$$M \ge TH \;\rightarrow\; B' + \sum f_i(B') < C \;\;\&\;\; \lfloor B'/8 \rfloor \in N$$
$$M < TH \;\rightarrow\; \frac{B'}{2} + \sum f_i(B') < \frac{C}{2} \;\;\&\;\; \lfloor B'/8 \rfloor \in N$$
Then the maximal kernel partition is produced by solving for the size of
$B' = \lfloor B' \rfloor$, so as to ensure effective SRF reuse. The optimal
strip size can then be selected as follows, where $S_0$ denotes the optimal
strip size.
$$B \le B' \;\rightarrow\; S_0 = B$$
$$B > B' \;\rightarrow\; S_0 = B'$$
With the optimal strip size $S_0$ selected in this way, the kernels can be
partitioned more efficiently, so as to achieve high SRF locality and effective
overlapping between computation and memory access.
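The strip size selection above can be sketched as a small search. The code below is a hedged illustration and not the SCompiler implementation: `critical_region` and `select_strip_size` are hypothetical names, the mapping functions are passed in as plain Python callables assumed to be non-negative and non-decreasing, and the condition on B'/8 is interpreted here as "B' is a multiple of the 8 clusters", which is an assumption.

```python
def critical_region(capacity, reuse_degree, threshold, f_list, lanes=8):
    """Largest B' (a multiple of `lanes`) that satisfies the capacity
    constraint of Section 3.2. `f_list` holds the mapping functions f_i,
    assumed non-negative and non-decreasing in B'."""
    best = 0
    b = lanes
    while True:
        others = sum(f(b) for f in f_list)
        if reuse_degree >= threshold:
            # High reuse degree: the reuse region may fill the SRF.
            ok = b + others < capacity
        else:
            # Low reuse degree: the tighter constraint leaves SRF space
            # for the input streams of the next kernel.
            ok = b / 2 + others < capacity / 2
        if not ok:
            break
        best = b
        b += lanes
    return best


def select_strip_size(b, b_prime):
    """S0 = B when B <= B', otherwise S0 = B'."""
    return b if b <= b_prime else b_prime


C = 32768                     # hypothetical SRF capacity in words
same_size = [lambda b: b]     # hypothetical: one other stream, same region size

b_hi = critical_region(C, reuse_degree=3, threshold=2, f_list=same_size)
b_lo = critical_region(C, reuse_degree=1, threshold=2, f_list=same_size)
print(b_hi, b_lo)                      # 16376 10920
print(select_strip_size(20000, b_hi))  # 16376
print(select_strip_size(8000, b_hi))   # 8000
```

With a reuse degree below the threshold the admissible region shrinks, which reflects the space reserved for the next kernel's input streams; a linear scan is used here only because it is easy to check, and a closed-form or binary search would serve equally well.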
3.3. Implementation
[Figure 5 diagram labels: Fortran code; SCompiler stages Parser,
Streamization, Stream Intermediate Code Generation, Code Optimization (Kernel
Reordering, Kernel Set Determination, Optimal Strip Size Selection,
Stripmining), Resource Allocation, and Low-level Code Generation; Low-level
Compiler; Imagine.]
Figure 5. Framework of the SCompiler.
Figure 5 gives the framework of SCompiler, which was developed for the Fortran
language. The work in this paper focuses on code optimization (the grey part
of the framework). The right part of Figure 5 shows the steps of the
strip-mining optimization. After front-end parsing, streamization, and
stream-level intermediate code generation, the optimizations shown in Figure 5
can be applied. Taking the stream-level intermediate code as input, the
compiler constructs the kernel reuse graph. Based on the graph, it then
reorders the kernels to achieve high stream reuse between kernels. Next, the
optimal kernel set for strip-mining is obtained, and the optimal strip size is
selected based on the optimized stream program. Finally, the strip-mining
optimization is performed to enhance SRF bandwidth utilization. The detailed
algorithm for strip-mining is shown in Figure 6.
For the example program shown in Figure 3, stream reuse exists between kernels
according to the analysis in Section 3.1. When the streams of a kernel are
larger than the SRF, the strip-mining algorithm is applied to partition the
long streams into optimal strips, so as to achieve high SRF reuse. The
optimized program is given in Figure 7, and the corresponding SRF reuse result
is shown in Figure 8. Streams a and b in kernel $K_1$ can be fully reused
after strip-mining.
ALGORITHM Strip-mining()
Input: The stream intermediate code
Output: The optimized code
Invariants: TH = reuse threshold
            S_0 = the optimal strip size
            L = ∑ l_i
// obtain the optimal kernel set
Build the kernel reuse graph
for kernels K_i and K_j
    if ¬(K_i ↔δ K_k)
        Reorder K_i and K_j
Determine the kernel set OS for strip-mining
// select the optimal strip size
Obtain the reuse degree M of each reuse region B
if M ≥ TH
    B' + ∑ f_i(B') < C
else
    B'/2 + ∑ f_i(B') < C/2
Solve for B' = ⌊B'⌋
if B < B'
    S_0 = B
else
    S_0 = B'
Partition the kernels in OS in terms of S_0
Reorder the new subkernels
Figure 6. Strip-mining algorithm.
for (int i = 0; i < N; i = i + strip) {
    K1(a(i, (i + strip)), b(i, (i + strip)));
    K2(a(i, (i + strip)), c(i, (i + strip)));
    K3(b(i, (i + strip)), d(i, (i + strip)));
}
Figure 7. Example program after strip-mining.
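To make the effect of the strip-mined loop concrete, the following Python sketch runs the three kernels once over whole streams and once strip by strip, and checks that both give the same results. The kernel bodies and the input/output roles of the streams are assumptions (the paper gives only the kernel signatures): here K1 is taken to produce b from a, K2 to produce c from a, and K3 to produce d from b.

```python
def k1(a):
    """Hypothetical kernel K1: reads stream a, produces stream b."""
    return [x + 1 for x in a]


def k2(a):
    """Hypothetical kernel K2: reads stream a, produces stream c."""
    return [2 * x for x in a]


def k3(b):
    """Hypothetical kernel K3: reads stream b, produces stream d."""
    return [x * x for x in b]


def run_whole(a):
    """Run each kernel once over the full-length streams."""
    b = k1(a)
    c = k2(a)
    d = k3(b)
    return b, c, d


def run_strip_mined(a, strip):
    """Run K1, K2, K3 back to back on one strip at a time, so the strip of
    a (reused by K2) and the strip of b (reused by K3) stay resident
    between kernels."""
    b, c, d = [], [], []
    for i in range(0, len(a), strip):
        a_strip = a[i:i + strip]      # slicing clamps the final short strip
        b_strip = k1(a_strip)
        c.extend(k2(a_strip))
        d.extend(k3(b_strip))
        b.extend(b_strip)
    return b, c, d


a = list(range(10))
print(run_whole(a) == run_strip_mined(a, strip=4))  # True
```

In the strip-mined version only one strip of each stream is live at a time, which is what allows the working set to fit in the SRF when the full streams do not.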
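[Figure 8 diagram: SRF space against time for one strip, showing the strips of
streams a, b, c, and d used by K1, K2, and K3, with the reused strips marked.]
Figure 8. SRF reuse result.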
4. Experimental Results and Analysis
To evaluate the effectiveness of our strip-mining algorithm, we select the 9
scientific applications listed in Table 1. NLAG-5 is a nonlinear algebra
solver for two-dimensional nonlinear diffusion in hydrodynamics [17], and
Transp consists of the time-consuming subroutines of Capao, an optics
application. All programs are Fortran versions, and each is compiled in three
ways: with Intel's compiler ifort (version 9.0) using the optimization option
-O3, with SCompiler without any strip-mining optimization, and with SCompiler
using the proposed strip-mining algorithm. The first set of results (denoted
as Seri) is executed on a single-core Itanium 2 server. The Itanium 2 runs at
1.6 GHz with a 16 KB L1 cache, a 256 KB L2 cache, and a 6 MB L3 cache, and has
4 GB of off-chip memory with a bandwidth of 6.4 GB/s. The latter two sets of
results (denoted as Orig and SMA) are executed on ISIM [18, 19], a
cycle-accurate simulator of Imagine. ISIM runs at 500 MHz.
The execution time is obtained by inserting clock-fetch assembly instructions.
If the data size of the program is small, we eliminate extra overheads (such
as system calls) by executing it multiple times and averaging the time
consumed. As I/O overheads are hidden in our experiments, the CPU time is
nearly equal to the wall-clock time.
Table 1. Specifications of 9 benchmarks
Name      Source     #Arrays   Prob. Size
Swim      Spec2000   14        513×513
EP        NPB        1         131072
MG        NPB        3         64×64×64
DFFT      -          1         4096
Laplace   NCSA       1         256×256
Jacobi    -          4         128×128
GEMM      BLAS       2         256×256
NLAG-5    -          2         256×256
Transp               5         512×512
[Figure 11(b), SRF to memory: throughput ratio (vertical axis, 0 to 8) of the
Orig and SMA versions for Swim, EP, MG, DFFT, Laplace, Jacobi, GEMM, NLAG-5,
and Transp.]
Figure 11. SRF- and LRF-to-memory throughput ratios
Figure 12 gives the proportions of computation time and memory access time in
the total execution time. The totals can exceed 100% because computation and
memory access overlap. As shown in Figure 12, the memory access time occupies
nearly 100% of the total execution time for most Orig versions, which means
that memory access overhead is the dominant part of the whole execution time.
For the SMA versions, the total proportions of most programs are over 100%,
which means that memory access and computation are overlapped by the
strip-mining optimization. The memory proportion of EP, however, changes only
a little. This is because EP is a computation-intensive kernel, so overlapping
computation time with memory access time is ineffective.
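The reading of totals above 100% can be illustrated with a short calculation. The sketch below assumes that every cycle of the run is spent on memory access, kernel execution, or both, so the part of the combined percentage above 100 is the overlapped share. The function name `overlap_percent` and the sample percentages are hypothetical and are not measurements from the paper.

```python
def overlap_percent(mem_pct, ker_pct):
    """Share of execution time (in %) during which memory access and
    kernel execution overlap, assuming every cycle does at least one
    of the two."""
    return max(0.0, mem_pct + ker_pct - 100.0)


# Hypothetical percentages, for illustration only.
print(overlap_percent(95.0, 40.0))  # 35.0
print(overlap_percent(60.0, 30.0))  # 0.0
```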
[Figure 12 chart: proportions of total execution time (%, vertical axis, 0 to
100) for Mem_Orig, Ker_Orig, Mem_SMA, and Ker_SMA across Swim, EP, MG, DFFT,
Laplace, Jacobi, GEMM, NLAG-5, and Transp.]
Figure 12. Percentages of memory access and kernel execution.
5. Conclusion and Future Work
The stream architecture is a novel microprocessor
architecture with wide application potential. Despite
the presence of an efficient bandwidth hierarchy,
performance of applications may still be constrained
by bandwidth if the computation performed per unit of
data accessed at any level of the memory hierarchy is
less than what can be sustained by the arithmetic units.
The Stream Register File (SRF) is a large on-chip
memory of the stream processor. Thus, applying
compiler techniques for SRF optimization is essential
for good performance.
In this paper, we propose an efficient strip-mining algorithm for optimizing
SRF bandwidth utilization on Imagine. The contributions of our work are as
follows. We first show how to determine the optimal kernel set for
strip-mining based on a novel kernel reuse graph. We then address the key
technique of strip size selection, aiming to achieve the maximal performance
from stream reuse and stream prefetching. Lastly, we present an efficient
strip-mining implementation. The experimental results on some typical
application kernels show that our strip-mining algorithm is a practical and
promising way to improve SRF bandwidth utilization on Imagine.
In the future, our efforts will mainly focus on two
aspects. Firstly, we plan to apply our algorithm to more
applications to evaluate the optimal strip size selection
algorithm on Imagine. Secondly, we would like to
exploit more compiler optimizations for memory
hierarchy on the stream processor.
Acknowledgements. We gratefully thank the Stanford
Imagine team for the use of their compilers and
simulators. We also acknowledge the reviewers for
their insightful comments. This work was supported by
NSFC (60621003).
References
[1] U. J. Kapasi, S. Rixner, W. J. Dally, et al. Programmable Stream
Processors. IEEE Computer, pp. 54-62, 2003.
[2] B. Khailany. The VLSI Implementation and Evaluation
of Area- and Energy-Efficient Streaming Media
Processors. Ph.D. thesis, Stanford University, 2003.
[3] B. Khailany, W. J. Dally, et al. VLSI Design and
Verification of the Imagine Processor. In Proceedings of
the IEEE International Conference on Computer Design,
pp. 289-294, 2002.
[4] A. L. Andrew, W. Thies, and S. Amarasinghe. Linear
Analysis and Optimization of Stream Programs. In
Proceedings of the SIGPLAN '03 Conference on
Programming Language Design and Implementation,
San Diego, CA, 2003.
[5] Jung Ho Ahn, William J. Dally, et al. Evaluating the
Imagine Stream Architecture. In Proceedings of the
annual international symposium on Computer
Architecture 2004, 2004.
[6] A. Das, W. J. Dally, and P. Mattson. Compiling for
Stream Processing. In PACT '06: Proceedings of the
15th international conference on Parallel Architectures
and Compilation Techniques, pp. 33-42. ACM Press,
New York, 2006.
[7] U. J. Kapasi, P. Mattson, et al. Stream Scheduling. In
Proceedings of the 3rd Workshop on Media and
Streaming Processors, pp. 101-106, 2001.
[8] P. Mattson. A Programming System for the Imagine
Media Processor. Ph.D. thesis, Dept. of Electrical
Engineering, Stanford University, 2002.
[9] M. Erez. Merrimac – High-Performance, Highly-
Efficient Scientific Computing with Streams. Ph.D.
thesis, Dept. of Electrical Engineering, Stanford
University, 2007.
[10] J. D. Owens, S. Rixner, et al. Media Processing
Applications on the Imagine Stream Processor, In
Proceedings of the 2002 International Conference on
Computer Design, 2002.
[11] S. Amarasinghe, et al. Stream Languages and
Programming Models. In Proceedings of the
International Conference on Parallel Architectures and
Compilation Techniques 2003, 2003.
[12] Nuwan S. Jayasena. Memory Hierarchy Design for
Stream Computing. Ph.D. thesis, Stanford University,
2005.
[13] J. Du, X. Yang, et al. Architecture-Based Optimization
for Mapping Scientific Applications to Imagine. In
ISPA'07: Proceedings of the 2007 International
Symposium on Parallel and Distributed Processing with
Applications, Ontario, Canada, 2007.
[14] J. Du, X. Yang, et al. Scientific Computing Applications
on the Imagine Stream Processor. In Proceedings of the
11th Asia-Pacific Computer Systems Architecture
Conference, Shanghai, China, 2006.
[15] O. Johnsson, M. Stenemo, Z. ul-Abdin. Programming &
Implementation of Streaming Applications. Master's
thesis, Computer and Electrical Engineering, Halmstad
University, 2005.
[16] M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting
Coarse-Grained Task, Data, and Pipeline Parallelism in
Stream Programs. In Proceedings of ASPLOS'06,
California, USA, 2006.
[17] X. Yang, X. Yan, Z. Xing, et al. A 64-bit Stream
Processor Architecture for Scientific Applications. In
ISCA'07: Proceedings of the 34th Annual International
Symposium on Computer Architecture, pp. 210-219.
ACM Press, New York, 2007.
[18] J. Suh, E.G. Kim, et al. A Performance Analysis of PIM,
Stream Processing, and Tiled Processing on Memory-
Intensive Signal Processing Kernels. In Proceedings of
the international symposium on Computer Architecture
2003, 2003.
[19] A. Das, et al. Imagine Programming System User’s
Guide 2.0. June 2004.