Optimizing Data Intensive Window-based Image
Processing on Reconfigurable Hardware Boards
A Dissertation Presented
by
Haiqian Yu
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
December 2006
© Copyright 2007 by Haiqian Yu
All Rights Reserved
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Optimizing Data Intensive Window-based Image Processing on Reconfigurable Hardware Boards.
Author: Haiqian Yu.
Department: Electrical and Computer Engineering.
Approved for Thesis Requirements of the Doctor of Philosophy Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Reader: Prof. Jennifer Dy Date
Thesis Reader: Prof. Eric Miller Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Dean: Prof. Allen L. Soyster Date
Copy Deposited in Library:
Reference Librarian Date
Abstract
FPGA-based computing boards are frequently used as hardware accelerators
for image processing algorithms with large amounts of computation and data
accesses. The current design process requires that, for each specific image pro-
cessing application, a detailed design must be completed before a realistic es-
timate of the achievable speedup can be obtained. However, users need the
speedup information to decide whether or not they want to use FPGA hardware
to accelerate the application. Quickly providing an accurate speedup estimation
becomes increasingly important for designers to make the decision without going
through the lengthy design process.
We present an automated tool, Sliding Window Operation OPtimization
(SWOOP), that generates an estimate of speedup for a high performance design
before detailed implementation is complete. SWOOP targets Sliding Window
Operations (SWOs). SWOOP provides a system block diagram of the final
design as well as an optimal memory hierarchy.
One of the contributions of this research is the automatic design of the on-
chip memory as a managed cache. The hardware setup we target can be viewed
as exploiting a cached memory system with the on-chip memory acting as an
L1 cache and the on-board memory acting as an L2 cache. However, unlike
most processors, no support for caching is provided. To minimize the number of
off-chip data accesses, the memory has to be carefully managed by the designer.
Our approach automatically determines the way the data should be accessed and
buffered in the on-chip memory to minimize the off-chip memory traffic. This
approach is applicable to any hardware architecture that contains a hierarchy
of memory outside of the normal caching structure.
SWOOP takes both the application parameters and FPGA board parameters
as input. The achievable speedup is determined by the area of the FPGA, or,
more often, the memory bandwidth to the processing elements. The memory
bandwidth to each processing element is a combination of bandwidth to the
FPGA and the efficient use of on-chip RAM as a data cache. SWOOP uses
analytic techniques to automatically determine the number of parallel processing
elements to implement on the FPGA, the assignment of input and output data
to on-board memory, and the organization of data in on-chip memory to most
effectively keep the processing elements busy. The result is a block layout of the
final design, its memory architecture, the estimated usage of different resources
and a measure of the achievable speedup.
Several manually designed applications including simple 2-D high-pass and
low-pass filters, template matching and 2-D cross-correlation have been used
to test the performance of SWOOP. Our experiments show that SWOOP can
quickly (less than a second) and accurately (less than 10% difference from man-
ual designs) estimate the maximum parallelism according to the applications
and constraints. The block layout of the final designs together with the mem-
ory architecture are near optimal with regard to performance. Moreover, since
SWOOP identifies where the tightest constraint to parallelism is found in a de-
sign, it can tell designers where to focus their efforts for further optimization.
Acknowledgements
I am glad to have this opportunity to thank all those who made this dissertation possible. First of all, I would like to express my deep gratitude to my advisor, Professor Miriam Leeser, for her stimulating suggestions and continuous guidance over the past five years. I have benefited a lot from her both technically and personally. I would like to thank Dr. Gilead Tadmor and Dr. Stefan Siegel, who gave me insightful suggestions on research and have been very patient in revising the papers. I also would like to thank Dr. Jennifer Dy and Dr. Eric Miller, who served on both my master's and doctoral committees.
I would like to thank all the members of the Reconfigurable Computing Laboratory; they built a friendly environment and made research much more enjoyable. I would like to thank Shawn Miller for letting me use some of his results in this dissertation.
I have been truly blessed with good friends. I would like to thank my friends at NEU, Sophia, Yiheng, Ping, Xiaojun, Wang, Janice, and Ting, for the interesting discussions during lunch breaks. I would like to thank Liying and Haidan for their long-lasting friendship over the past 10 years. I would like to thank Chuwei, Huajie, and their lovely son, Kaishu, for bringing me surprises all the time.
A special thanks goes to my family. I would like to thank my husband,
Mengxi, for his continuous encouragement which accompanied me through the
ups and downs of life. I would like to thank my son, Ethan; his lovely smile never failed to lift me up when I felt frustrated. Many thanks go to my parents and parents-in-law; without their help with the babysitting, this dissertation would have taken much longer to finish.
Contents
Abstract i
Acknowledgements iii
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 FPGA Based Computing Boards . . . . . . . . . . . . . 3
1.1.2 FPGA Structure . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . 8
1.1.4 Sliding Window Operations (SWOs) . . . . . . . . . . . . 9
1.2 Our Automated Tool . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . 13
2 RELATED WORK 14
2.1 General-purpose Processors . . . . . . . . . . . . . . . . . . . . 15
2.2 Application-specific Processors . . . . . . . . . . . . . . . . . . . 16
2.3 Special-purpose Processors . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Target Independent Optimization . . . . . . . . . . . . . 19
2.3.2 Targeting ASICs . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Targeting FPGAs . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Implementing SWOs on COTS FPGA boards . . . . . . . . . . 31
3 DESIGN TRADEOFFS 36
3.1 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Example Description . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Area Availability . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Principle Blocks . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 High-Pass Filter Example . . . . . . . . . . . . . . . . . 44
3.4 Memory Bandwidth Limitation . . . . . . . . . . . . . . . . . . 45
3.4.1 Upper Bound with No Buffering and Full Row Buffering 46
3.4.2 High-Pass Filter Example . . . . . . . . . . . . . . . . . 49
3.5 On-chip Memory Availability . . . . . . . . . . . . . . . . . . . 50
3.5.1 Block Buffering Method . . . . . . . . . . . . . . . . . . 51
3.5.2 Selecting p And q . . . . . . . . . . . . . . . . . . . . . . 54
3.5.3 High-Pass Filter Example . . . . . . . . . . . . . . . . . 56
3.6 Example Summary . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 SWOOP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 EXPERIMENTS 62
4.1 3× 3 High-Pass Filter and 5× 5 Low-Pass Filter . . . . . . . . 64
4.1.1 Algorithm and Parameters . . . . . . . . . . . . . . . . . 64
4.1.2 Results from SWOOP . . . . . . . . . . . . . . . . . . . 68
4.1.3 Comparison and Analysis . . . . . . . . . . . . . . . . . 69
4.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Retinal Vascular Tracing Algorithm . . . . . . . . . . . . . . . . 72
4.2.1 Algorithm and Parameters . . . . . . . . . . . . . . . . . 72
4.2.2 SWOOP Results . . . . . . . . . . . . . . . . . . . . . . 76
4.2.3 Comparison and Analysis . . . . . . . . . . . . . . . . . 78
4.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 Particle Image Velocimetry Algorithm . . . . . . . . . . . . . . 84
4.3.1 Algorithm and Parameters . . . . . . . . . . . . . . . . . 87
4.3.2 SWOOP Results . . . . . . . . . . . . . . . . . . . . . . 91
4.3.3 Comparison and Analysis . . . . . . . . . . . . . . . . . 92
4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 CONCLUSIONS 98
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography 103
Appendices 111
A Glossary 112
B Equation Proof 115
List of Tables
3.1 Parameter Definition . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Area Usage of High Pass Filter Blocks . . . . . . . . . . . . . . 45
4.1 Board Parameters and Application Parameters for HPF and LPF
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Area Usage of Low Pass Filter Blocks . . . . . . . . . . . . . . . 68
4.3 Duplication Factors According to Different Constraints . . . . . 68
4.4 Bound Dependent Variables . . . . . . . . . . . . . . . . . . . . 69
4.5 Comparison between Automatic and Manual Results for HPF and
LPF Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Board Parameters and Application Parameters for RVT Applica-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Enumeration of AIF and BIF for FireBird . . . . . . . . . . . . 77
4.8 Duplication Factor for Different Usage of External Memory Banks 78
4.9 Comparison between Automatic and Manual Results for RVT Al-
gorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.10 Comparison between Automatic and Manual Results for RVT Al-
gorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.11 Board Parameters and Application Parameters for PIV Applications 92
4.12 Duplication Factor for Different Usage of External Memory Banks 93
4.13 Comparison between Automatic and Manual Results for PIV Al-
gorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
List of Figures
1.1 Block Diagram of a Commercial FPGA Computing Engine . . 4
1.2 FPGA Structure based on Xilinx Virtex E . . . . . . . . . . . . 6
1.3 Example of Sliding Windowing Operation (window size = 3× 4) 10
3.1 Highpass Filter Example . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Block Diagram of a Commercial FPGA Computing Engine . . 42
3.3 Sequencing Graph of High Pass Filter . . . . . . . . . . . . . . . 44
3.4 Using Full Row Buffering Scheme . . . . . . . . . . . . . . . . . 47
3.5 Block Buffering Method Example . . . . . . . . . . . . . . . . . 52
3.6 Overlapping Loading and Processing . . . . . . . . . . . . . . . 55
3.7 SWOOP Flowchart . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 FireBird Block Diagram . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Highpass Filter Coefficients . . . . . . . . . . . . . . . . . . . . 64
4.3 Sequencing Graph of High Pass Filter . . . . . . . . . . . . . . . 65
4.4 Low-pass Filter Coefficients . . . . . . . . . . . . . . . . . . . . 66
4.5 Low-pass Filter MPE . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6 Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 Response Module . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Direction Module . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.9 Shifting the Window . . . . . . . . . . . . . . . . . . . . . . . . 80
4.10 Creating New Neighborhoods by Shifting Data on the FPGA . . 81
4.11 PIV System Overview (From Dantec Dynamics) . . . . . . . . . 85
4.12 Cross Correlation Plane . . . . . . . . . . . . . . . . . . . . . . 86
4.13 Velocity Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.14 Pipelined Structure . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.15 An Example of Sub-pixel Interpolation . . . . . . . . . . . . . . 89
Chapter 1
INTRODUCTION
Digital hardware has been used for speeding up computationally and data inten-
sive applications for decades due to its high degree of parallelism. Reconfigurable
digital hardware, specifically FPGAs, has become more and more popular be-
cause of its flexibility and short design cycle. A good candidate for reconfigurable
hardware speedup is digital image processing. Widely used either to improve pictorial quality for human interpretation or to extract information for autonomous machine perception, digital image processing is very computationally and data intensive, and software processing cannot meet the real-time requirements of many applications, especially high resolution image processing. Sliding window operations (SWOs) are among the most popular algorithms in digital image processing, and we focus on this class of algorithms in this dissertation.
Commercial Off-The-Shelf (COTS) FPGA based boards, with both FPGAs
and memory banks integrated on them, are often used to improve performance
of image processing algorithms. It would be extremely useful for designers to
be able to quickly estimate the maximum performance once the COTS board and the algorithm are known. Different methods [1, 2, 3, 4, 5, 6, 7] have been proposed to fulfill this task. Some are specific to a given algorithm [5, 6, 7] and cannot be extended to other algorithms, while others are too generic and thus cannot give an accurate estimate of maximum performance.
balances between these two extremes by providing an automated tool which
can quickly find the upper bound of the performance according to the available
resources of the COTS boards. Moreover, this method can be used for most
sliding window operation (SWO) applications, which are commonly found in
image processing.
In this chapter, a brief background introduction explains why we use FPGA-based computing boards to implement SWO applications. The contributions of our automated tool, Sliding Window Operation OPtimization (SWOOP), are also covered in this chapter.
1.1 Background
In this section, we first introduce a typical FPGA board and its structure as well
as why FPGAs can be used to implement different algorithms and gain speedup.
Then we cover the concept of memory hierarchy, which has usually been
ignored in FPGA design. SWOs are introduced and a small example
is given later in this section.
1.1.1 FPGA Based Computing Boards
With the development of charge-coupled devices (CCDs), high resolution images
can easily be acquired from digital cameras. Processing these images involves a
large amount of data. For most SWO applications, the algorithms are both data
intensive and computationally intensive. Software processing is slow and cannot
meet the speed requirements of some applications. Fortunately, SWOs are inher-
ently highly parallelizable and hardware implementations are favored in delay
sensitive applications. Using Application Specific Integrated Circuits (ASICs)
to accelerate these algorithms proves to be efficient and can greatly reduce the
processing time. However, the cost of an ASIC is extremely high. Reconfigurable
hardware based coprocessor boards can be used to flexibly implement similar al-
gorithms with a much shorter design cycle compared to ASICs. Reconfigurable
hardware can greatly reduce the Non-Recurring Engineering (NRE) costs and
the hardware can be adapted when the system architecture or application re-
quirements change. Another factor driving the surge of reconfigurable hardware
computing in image processing applications is the availability of standard system
solutions. Significant growth in the availability of COTS (commercial off-the-
shelf) reconfigurable computing engines, along with improvements in design-tool
capability make reconfigurable hardware more and more popular.
For these reasons, we focus on Field Programmable Gate Arrays (FPGAs).
Commercial reconfigurable coprocessor boards are composed of FPGA chip(s),
external memory bank(s), interfaces between memory banks and FPGA chips,
and a connection to a host processor. Figure 1.1 shows a typical computing
board with two FPGAs and five memory banks on it (based on the Annapolis
Micro Systems Inc. WildStar [8]). The interfaces and the number of memory
banks determine the maximum data flow between FPGA and memory. This
is called memory bandwidth. The main task for a hardware designer is to
maximally use the resources in the FPGA chips, intelligently organize the logical
memory structures according to their physical interconnections, and fully utilize
the memory bandwidth between the memory banks and FPGA chips for optimal
performance.
Figure 1.1: Block Diagram of a Commercial FPGA Computing Engine
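To make the notion of memory bandwidth concrete, the short calculation below multiplies the number of banks by the port width and the clock rate. The bank count, port width, and clock frequency are illustrative assumptions, not the WildStar's actual specifications.

    #include <stdio.h>

    /* Peak memory bandwidth = banks * (port width in bytes) * clock rate.
     * All numbers below are illustrative assumptions, not board specs. */
    int main(void) {
        const int banks = 5;          /* memory banks on the board   */
        const int port_bits = 32;     /* width of each bank's port   */
        const double clk_mhz = 50.0;  /* memory interface clock      */

        double bytes_per_cycle = banks * (port_bits / 8.0);
        double peak_mb_per_s = bytes_per_cycle * clk_mhz;  /* MB/s */

        printf("peak bandwidth: %.0f MB/s\n", peak_mb_per_s); /* 1000 MB/s */
        return 0;
    }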
One challenging aspect of reconfigurable hardware-based co-processor design
lies in the fact that different FPGA computing engines have different memory architectures, different FPGA chips, etc., which can lead to very different hardware designs. Traditionally, given an FPGA-based co-processor board and an image
processing algorithm, a hardware engineer analyzes the low level parallelism and
creates a design that meets the timing and resource constraints. A small change,
such as a change in the word length of the external memory bank on the board,
may result in a very different design. This process has to be repeated for each
different algorithm and each different board architecture. To avoid the time
consuming re-design process, we propose a generalized design method which can
lead to near optimal designs by defining the upper bounds of the design based
on different resource constraints. By picking the most critical constraint of the
design, we can maximally utilize the available resources. Moreover, the same ap-
proach can be used to modify the design when the resources change. Using this
design method for the FPGA implementation of SWO algorithms can greatly
reduce the time spent on hardware design, while at the same time obtaining a
high performance design with near optimal memory allocation and intelligent
memory hierarchy usage.
1.1.2 FPGA Structure
The key to the popularity of FPGAs is their ability to implement any circuit sim-
ply by being appropriately programmed. The three basic elements of an FPGA
are configurable logic blocks (CLBs), I/O blocks and programmable routing.
Figure 1.2 [9] shows the Xilinx Virtex FPGA architecture [10].
Each CLB has one or more lookup tables (LUTs) and several Flip-Flops
(FFs) as shown at the bottom right of Figure 1.2. Several LUTs and FFs are
grouped together to form a logic slice. The number of slices is often used to
Figure 1.2: FPGA Structure based on Xilinx Virtex E (panels: route map of a real FPGA chip; structure of a 2-slice Virtex-E CLB)
indicate the size of the FPGA area. Arbitrary logic functions can be imple-
mented by appropriately configuring the LUTs and connecting them through
programmable routing. Each I/O block can act as either an input pad or an
output pad as required by the circuit. An FPGA as a whole can therefore im-
plement digital circuits by mapping the functional units in the design onto logic
blocks. The final processing frequency of an FPGA depends on the depth of
the computation between Flip-Flops (FFs) and on routing wire delay.
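As a conceptual model of how a LUT implements arbitrary logic (a sketch, not a description of any particular Xilinx primitive), a k-input LUT is simply a 2^k-entry truth table addressed by its inputs:

    #include <stdint.h>
    #include <stdio.h>

    /* Conceptual model of a 4-input LUT: the 16-bit configuration word is
     * a truth table, and the four inputs select one bit of it.  Programming
     * the FPGA amounts to choosing the configuration bits. */
    static int lut4(uint16_t config, int a, int b, int c, int d) {
        int index = (d << 3) | (c << 2) | (b << 1) | a;  /* 0..15 */
        return (config >> index) & 1;
    }

    int main(void) {
        /* Config 0x8000 sets only entry 15, i.e. a 4-input AND gate. */
        printf("%d\n", lut4(0x8000, 1, 1, 1, 1));  /* prints 1 */
        printf("%d\n", lut4(0x8000, 1, 0, 1, 1));  /* prints 0 */
        return 0;
    }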
Modern FPGAs can implement million gate equivalent circuits. In addition
to CLBs, I/O blocks and programmable routing, modern FPGAs usually have
on-chip memory in the form of RAM blocks, and some may have embedded
multipliers or even complete RISC processors on the chip. These extra available
resources can further boost the performance of the implemented digital circuits.
In our research, we consider the structure shown in Figure 1.2 with additional
on-chip memory as our available FPGA hardware resources.
In addition to on-chip memory, FPGAs usually use external memory banks
(also called off-chip memory) to store extra data. External memory is par-
ticularly important for image processing applications because on-chip memory,
although faster than off-chip memory, is relatively small and cannot store all the
data. On-chip memory can be used as a buffer to store temporary data to avoid
slower data transfers from/to external memory. Memory bandwidth is defined
as the maximum data transfer speed between an FPGA and the external mem-
ory banks. In some cases, if we cannot transfer the data from/to the external
memory banks quickly enough, the system performance may be degraded.
1.1.3 Memory Hierarchy
Memory hierarchy is often discussed in computer architecture and is a useful
way to increase performance. The memory in a system can usually be arranged
in a hierarchy from the fastest (and lowest capacity) to the slowest (and highest
capacity) with the fastest staying closest to the central processor. By utilizing
the locality of reference property of memory access, effective management of the
memory hierarchy can greatly improve system performance and give the appear-
ance that the system has the fastest memory with the highest capacity [11].
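The standard average memory access time (AMAT) model from the computer architecture literature quantifies this effect; the cycle counts and miss rate below are illustrative assumptions.

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty (standard two-level model).
     * The cycle counts and miss rate below are illustrative assumptions. */
    int main(void) {
        double hit_time = 1.0;       /* cycles to access the fast memory  */
        double miss_penalty = 20.0;  /* extra cycles to reach slow memory */
        double miss_rate = 0.05;     /* fraction of accesses that miss    */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.2f cycles\n", amat);  /* 2.00: close to the fast memory */
        return 0;
    }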
Memory hierarchy management had not been considered by FPGA designers until on-chip memory became available on FPGAs. Now it attracts more and more interest, especially for data-intensive applications. FPGA chips are usually connected to external memory banks, but if the data had to be read from or written to external memory banks on every access, the performance of the design would be greatly degraded because accessing external memory is usually slow.
hierarchy becomes very important. The smaller but faster on-chip memory, if
organized properly, can be used as a buffer to store temporary or repeatedly
used data so that off-chip memory accesses are reduced. The ability to reduce
off-chip memory accesses depends on the on-chip memory size and the algorithm
itself. FPGA designers need to carefully analyze data access patterns for each
application to get a specific optimized memory architecture. In our research, we
find that for most SWO applications, there exists a common data access pattern.
Therefore, we can build a memory architecture for all instances of this type of
application. The details will be discussed in Section 3.5.
1.1.4 Sliding Window Operations (SWOs)
Many spatial domain methods for image processing can be summarized as fol-
lows:
g(x, y) = T [f(x, y)] (1.1)
where f(x, y) is the input image, g(x, y) is the processed image, and T is
an operator on f, defined over some neighborhood of (x, y) [12]. This spatial
domain processing is widely used in image processing. Image averaging, smooth-
ing, sharpening and convolution all belong to this category. Figure 1.3 shows
an example of spatial domain processing. In this example, the T operator is
moving in raster-scan order. This type of operation is widely used in digital image processing and is called a sliding window operation (SWO). Others have studied 1-D SWOs [13], concentrating on 1-D FIR/IIR signal processing.
We are interested in the more general case of 2-D digital signal processing.
Figure 1.3: Example of Sliding Windowing Operation (window size = 3× 4)
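To make Equation (1.1) and the raster-scan order of Figure 1.3 concrete, the following sketch slides a 3 × 4 window over a small image. The averaging operator and image size are illustrative choices, not an application from this dissertation.

    #include <stdio.h>

    #define ROWS 8
    #define COLS 8
    #define WR   3   /* window rows    */
    #define WC   4   /* window columns */

    /* Slide a WRxWC window over f in raster-scan order and write one output
     * pixel per window position (here T is a simple average; any operator
     * defined over the neighborhood could be substituted). */
    static void swo(int f[ROWS][COLS], int g[ROWS - WR + 1][COLS - WC + 1]) {
        for (int y = 0; y <= ROWS - WR; y++) {     /* window moves row by row */
            for (int x = 0; x <= COLS - WC; x++) { /* ...then column by column */
                int sum = 0;
                for (int i = 0; i < WR; i++)
                    for (int j = 0; j < WC; j++)
                        sum += f[y + i][x + j];
                g[y][x] = sum / (WR * WC);         /* g(x, y) = T[f(x, y)] */
            }
        }
    }

    int main(void) {
        int f[ROWS][COLS] = {{0}};
        int g[ROWS - WR + 1][COLS - WC + 1];
        f[4][4] = 120;                      /* a single bright pixel */
        swo(f, g);
        printf("g[2][1] = %d\n", g[2][1]);  /* windows covering (4,4) see it */
        return 0;
    }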
1.2 Our Automated Tool
Sliding Window Operation OPtimization, the automated tool we present in this
dissertation, provides a good solution to quickly implement SWO applications
on COTS FPGA boards. SWOOP takes FPGA board parameters as well as
SWO application parameters as inputs. The outputs are:
1. A block diagram of the SWO implementation.
2. Three upper bounds according to different resource constraints. The tightest upper bound will be selected for the actual implementation, which gives the implementation near-optimal performance given the available resources (see the sketch after this list).
3. A near-optimal memory structure tailored to the current SWO application and the given FPGA board. The memory organization maximally reduces redundant external memory accesses by fully utilizing the available internal memory of the FPGA.
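The following sketch models only the second output, the selection of the tightest bound. The struct fields, names, and numbers are illustrative assumptions, not SWOOP's actual interface.

    #include <stdio.h>

    /* Hypothetical sketch of how SWOOP's three bounds combine: one
     * parallelism bound per resource, and the tightest one determines the
     * duplication factor that is actually implemented.  The field names
     * and example numbers are illustrative assumptions. */
    typedef struct {
        int bound_area;        /* bound from available FPGA area       */
        int bound_bandwidth;   /* bound from external memory bandwidth */
        int bound_onchip_ram;  /* bound from on-chip memory size       */
    } swoop_bounds;

    static int tightest(swoop_bounds b) {
        int m = b.bound_area;
        if (b.bound_bandwidth < m) m = b.bound_bandwidth;
        if (b.bound_onchip_ram < m) m = b.bound_onchip_ram;
        return m;  /* number of parallel processing elements to implement */
    }

    int main(void) {
        swoop_bounds b = { 12, 8, 10 };  /* made-up example bounds */
        printf("duplication factor = %d\n", tightest(b));  /* 8: bandwidth-bound */
        return 0;
    }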
After applying SWOOP to a specific application, the designer still has control
over how to implement the algorithm. SWOOP tells the designer what the near-optimal design is, but the designer can make adjustments depending on other factors such as cost, time-to-market, etc.
1.3 Contributions
The contributions of this research are:
1. An automated tool, Sliding Window Operation OPtimization (SWOOP).
SWOOP takes both the size of the SWO and the board parameters as
input. Using analytic techniques, SWOOP automatically determines the
number of parallel processing elements to implement on the FPGA, the
assignment of input and output data to on-board memory, and the orga-
nization of data in on-chip memory to most effectively keep the processing
elements busy. The result is a block layout of the final design, its memory
architecture, the estimated usage of different resources, and a measure of
the achievable speedup.
2. The analytical representation of three upper bounds. Knowing the possible
speedup is very important for a hardware designer. The upper bounds
provided by our research can be used to quickly estimate whether or not
the COTS implementation can meet the application’s requirements before
the actual implementation is complete.
3. A new buffering method suitable for SWO based applications. The block
buffering method can maximally use the available on-chip memory to re-
duce the external memory accesses. Using this buffering method can help
designers build an efficient memory hierarchy architecture for SWO based
applications.
4. The calculation of the upper-bound of the parallelism subject to on-chip
memory size. Once the size of the block is determined by using the block
buffering method, an analytical representation of the upper bound subject to on-chip memory size is given. This upper bound can be combined
with the other two upper bounds to determine the maximum performance.
Several manually designed applications have been used to test the performance
of SWOOP. Our experiments show that SWOOP can quickly (less than a second)
and accurately (less than 10% difference from manual designs) estimate the
maximum parallelism according to the applications and constraints.
Our approach automatically determines the way data should be accessed and
buffered in the on-chip memory to minimize the off-chip memory traffic. This
approach is applicable to any hardware architecture that contains a hierarchy
of memory outside of the normal caching structure. This is true of FPGA
architectures as well as tiled architectures such as the Cell multiprocessor from
IBM [14].
Note that this design flow can be applied not only to FPGA designs, but also
to other hardware designs with area, internal and external memory constraints.
By using different libraries, the method we present here can be modified so that it
can be used for other styles of hardware. The block buffering method we present
here has been previously explored in the context of motion estimation [15, 16].
In these approaches the size of the buffer is the size of the search area. Our block
buffering method differs in that the block size is selected based on the available
on-chip memory in order to reduce the total external memory accesses.
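To illustrate why buffering a block and reusing its interior pixels cuts off-chip traffic, the sketch below counts external reads under a simple model: a W × W window over an N × N image, with each p × q block loaded exactly once and adjacent blocks overlapping by W − 1 rows or columns. The numbers and the cost model are illustrative, not the dissertation's exact formulation (Chapter 3 derives the actual bounds).

    #include <stdio.h>

    /* Illustrative model: a p x q on-chip block yields (p-W+1)*(q-W+1)
     * output windows for a WxW SWO, so blocks overlap by W-1 and each
     * block costs p*q external reads. */
    static long ceil_div(long a, long b) { return (a + b - 1) / b; }

    static long block_buffer_reads(long n, long w, long p, long q) {
        long out = n - w + 1;               /* output size per axis */
        long bx = ceil_div(out, p - w + 1); /* blocks per column    */
        long by = ceil_div(out, q - w + 1); /* blocks per row       */
        return bx * by * p * q;             /* one load per block   */
    }

    int main(void) {
        long n = 512, w = 5;
        long windows = (n - w + 1) * (n - w + 1);
        printf("no buffering:       %ld reads\n", windows * w * w); /* 6451600 */
        printf("64x64 block buffer: %ld reads\n",
               block_buffer_reads(n, w, 64, 64));                   /* 331776 */
        printf("full row buffering: %ld reads\n", n * n);           /* 262144 */
        return 0;
    }

Block buffering approaches the one-read-per-pixel ideal of full row buffering while needing only a block, rather than full image rows, to fit in on-chip memory.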
1.4 Dissertation Structure
A summary of the related work in this area is presented in Chapter 2. Chap-
ter 3 presents a detailed explanation of how we estimate the three upper bounds
subject to different constraints. Four different applications are described in
Chapter 4 and a detailed analysis of the results is provided to prove the cor-
rectness and efficacy of our method. Chapter 5 concludes the dissertation and
closes with thoughts about future work.
Chapter 2
RELATED WORK
Co-processor boards embedded within larger computer systems are very popular
in embedded computing. Compared to other non-embedded computing systems,
they are usually tightly constrained, reactive and real-time [17]. Therefore,
when designing embedded systems, we need to pay much more attention to
performance. Moreover, the time-to-market constraint has become more and
more demanding and can influence a design process dramatically. Of course, there are other design metrics to consider, including NRE cost, size, and power.
Three types of processor technology are commonly used when implementing
an embedded system: general-purpose processors (GPPs), application-specific
processors (ASPs) and special-purpose processors (SPPs). A GPP is a pro-
grammable device that is suitable for a variety of applications to maximize the
number of devices sold. An SPP is defined as a digital circuit designed to ex-
ecute exactly one program. An ASP is a compromise between a GPP and an
SPP; it is a programmable processor optimized for a particular class of applications having common characteristics [17]. The design processes for different target technologies are very different. Although the design cycles for GPPs and ASPs are relatively shorter than for SPPs, one disadvantage of these technologies is processing speed. We present a brief discussion of GPPs and ASPs and concentrate most of our attention on SPPs, the class that includes FPGAs.
2.1 General-purpose Processors
The basic architecture of a GPP consists of a data-path, a control unit and a
memory interface. For each instruction, GPPs typically need to go through 5
steps: instruction fetch, operand fetch, execute, memory access and operand
store. The generalized steps make sure that GPPs can be used for different
applications, but at the same time greatly reduce their efficiency. The designer
of a GPP usually builds a programmable device associated with an instruction
set architecture and usually does not know what kind of application will run
on the GPP. So GPP designers try to construct a proper computer architecture
by deciding the optimal number of stages of pipeline and building an efficient
memory hierarchy so that for most applications the data can flow smoothly
and consistently. A lot has been published about computer architecture design;
[11, 18, 19] give a good overview.
An embedded system designer has a totally different task than a GPP de-
signer. Usually, given a specific GPP, embedded system designers are only con-
cerned with writing efficient code for the GPP. Code rewriting techniques, con-
sisting of loop transformations or data flow conversion, are an essential part
of modern optimizing and parallelizing compilers. By exploring the inher-
ent parallelism and enhancing the temporal and spatial locality of the algo-
rithms [20, 21, 22], a designer can improve the performance of an application by
more efficiently using the available CPU pipeline stages and cache space. Code
rewriting can also be done automatically through compiler optimization. Much work has been done in this area, with the aim of finding an efficient translation mechanism so that compiled code best fits the GPP structure [23, 24, 25].
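As a simple instance of such a rewrite, assuming C's row-major array layout, loop interchange turns a cache-hostile traversal into one with good spatial locality:

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    /* Naive version: the inner loop strides down columns, touching a new
     * cache line on almost every access. */
    static double sum_column_order(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    /* After loop interchange: the inner loop walks consecutive addresses,
     * so each cache line fetched is fully used (spatial locality). */
    static double sum_row_order(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        /* Same result either way; only the access pattern differs. */
        printf("%.0f %.0f\n", sum_column_order(), sum_row_order());
        return 0;
    }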
A GPP is an economical and quick solution for applications without compli-
cated data processing. But when the computational requirements of an application increase, a GPP's disadvantages, especially slow processing speed due to
limited parallelism, become more and more intolerable. An alternative solution
is using ASPs or SPPs.
2.2 Application-specific Processors
Micro-controllers and Digital-signal Processors (DSPs) are two types of com-
monly used ASPs. Unlike GPPs, their structures and instruction sets are op-
timized for a particular class of applications. However, the design process for
ASPs is similar to GPPs. A designer’s task is to write code which can run ef-
ficiently on the available ASP. Micro-controllers are widely used in embedded
control applications such as monitoring or single-bit control signal setting. They
are not suitable for performing large amounts of data computation. Modern sig-
nal processing applications involve a lot of data transfer and computation, thus
DSPs become a good candidate for this type of algorithm. DSPs are widely used
in application domains including communications [26, 27], image processing [28],
audio and video processing [29], power control [30], etc. The important differ-
ence between a DSP and a GPP is that a DSP processor is designed to support
high-performance, repetitive, numerically intensive tasks. Different DSPs from
different manufacturers or even from the same manufacturer may have different
specific features. Four major DSP chip manufacturers are Texas Instruments,
with the TMS320C2000, TMS320C5000, and TMS320C6000 series of chips [31];
Motorola, with the DSP56300, DSP56800, and MSC8100 (StarCore) series [32];
Agere Systems (formerly Lucent Technologies), with the DSP16000 series [33];
and Analog Devices, with the ADSP-2100 and ADSP-21000 ("SHARC") se-
ries [34]. Most DSPs have common features which make their high performance
in data processing applications possible [35]:
1. DSPs can complete a multiply-accumulate operation in one clock cycle.
High-performance DSPs often have two or more multipliers that enable
two multiply-accumulate operations per instruction cycle (see the FIR sketch after this list).
2. DSPs can provide specialized addressing modes, such as pre- and post-
modification of address pointers, circular addressing, and bit-reversed ad-
dressing.
3. Most DSPs provide various configurations of on-chip memory and periph-
erals tailored for DSP applications. DSPs generally feature multiple-access
memory architectures that enable DSPs to complete several accesses to
memory in a single instruction cycle.
4. Usually, DSP processors provide a loop instruction that allows tight loops
to be repeated without spending any instruction cycles for updating and
testing the loop counter or for jumping back to the top of the loop.
5. DSP processors are known for their irregular instruction sets, which gen-
erally allow several operations to be encoded in a single instruction. For
example, a processor that uses 32-bit instructions may encode two addi-
tions, two multiplications, and four 16-bit data moves into a single instruc-
tion. In general, DSP processor instruction sets allow a data move to be
performed in parallel with an arithmetic operation. GPPs, in contrast,
usually specify a single operation per instruction.
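To see what these features buy, consider the inner loop of an FIR filter, a canonical DSP kernel. The C version below is illustrative: on a DSP, features 1, 4, and 5 let each iteration's multiply-accumulate, operand fetches, and loop bookkeeping retire in a single cycle, while a GPP issues each operation separately.

    #include <stdio.h>

    /* Inner loop of an FIR filter: one multiply-accumulate (MAC) per tap. */
    static int fir(const short *x, const short *h, int taps) {
        int acc = 0;
        for (int k = 0; k < taps; k++)
            acc += x[k] * h[k];   /* the MAC a DSP performs in one cycle */
        return acc;
    }

    int main(void) {
        short x[4] = { 1, 2, 3, 4 };   /* input samples */
        short h[4] = { 1, 1, 1, 1 };   /* filter taps   */
        printf("%d\n", fir(x, h, 4));  /* prints 10 */
        return 0;
    }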
Compared to GPPs, ASPs are more suitable for digital signal processing
applications. However, ASPs essentially still provide sequential processing of
each instruction and cannot fully exploit the parallelism inherent in the
algorithm. For delay sensitive applications, SPPs may be a better choice.
2.3 Special-purpose Processors
An SPP is a digital circuit designed for a specific application. Compared to a
GPP or an ASP, it is much more efficient because it is tailored specifically for the
application and much overhead can be avoided [36]. However, the design cycle
for an SPP is much longer because all the details need to be determined by the
designer. Unlike GPP and ASP design which involves mostly software coding,
SPP design is a hardware design process that can be implemented in full-custom
hardware or programmable logic depending on the application requirements.
For SPP designers, a critical task is to maximize the parallelism while at the
same time keeping the design cycle short. To fulfill this task, different design
methodologies have been proposed and they are catalogued in the following
sections.
2.3.1 Target Independent Optimization
High Level Synthesis (HLS) is a useful tool for target-independent optimization
of SPP designs. It can take a description of a design in a Hardware Descrip-
tion Language (HDL) or other high level language (Matlab, C, C++ etc.) and
synthesize it into a low level HDL description or intermediate form. High level
language compilers usually capture the behavior using a control/data flowgraph
(CDFG) [37]. Starting from these CDFGs, compilers construct the macroscopic
structure of a digital circuit; this process is called architectural synthesis and op-
timization. The output of architectural synthesis is sent to logic level synthesis
and optimization.
SystemC [38] is a popular open source high level description language. Its
community consists of a growing number of system design companies, semicon-
ductor companies, IP providers and Electronic Design Automation (EDA) tool
vendors. SystemC is built on standard C++; the core language consists of mod-
ules and ports for representing structure. A set of data types with various data
widths are provided by the library for hardware modeling. Designers use C++
as an input language and can develop and change the system-level models very
fast. To some extent, hardware design is more like software coding with the
help of HLS tools, except that the designer must be aware that all the modules
work in parallel instead of sequentially. Once the system-level module is built, a
designer can use tools that support SystemC to simulate the design, verify the
functional correctness, and translate the C++ language into Register-Transfer-
Level (RTL) VHDL code. The lower level VHDL code can be further combined
with different libraries for different target implementations.
An example of an HLS project is SPARK [39] by Sumit Gupta et al. It is a C-
based HLS framework which employs a set of parallelizing compiler techniques
and synthesis transformations to improve the quality of high-level synthesis re-
sults. The SPARK methodology is particularly targeted to control-intensive
microprocessor functional blocks and multimedia applications. Gupta claims
that “SPARK takes behavioral ANSI-C code as input, schedules it using spec-
ulative code motion and loop transformations, runs an interconnect-minimizing
resource binding pass and generates a finite state machine for the scheduled
design graph. Finally, a backend code generation pass outputs synthesizable
register-transfer level (RTL) VHDL. This VHDL can then be synthesized using
logic synthesis tools into an ASIC or be mapped onto a FPGA.”[40].
The advantage of using HLS tools is that they can expedite the design pro-
cess by shifting the burden of hardware design to the tools. A good HLS tool can
provide a platform for easy simulation, verification and synthesis. The problem
with using a high level input language is its inefficiency and the poor memory
architecture of the resulting design. The efficiency of HLS tools is highly de-
pendent on the compiler used and currently no compiler can achieve the same
performance as implementations written directly in the VHDL or Verilog lan-
guage. Moreover, being target independent means the tools are not aware of
the overall physical or logical layout of the hardware and their I/O or memory
connections. High level language compilers, in most cases, cannot optimally al-
locate memory and build the memory hierarchy. Consequently, the performance
of designs generated with HLS tools tends to degrade for data intensive appli-
cations. These problems can be solved with target specific optimization, where,
for different target technologies, there are different optimization method. These
will be introduced in the following sections.
2.3.2 Targeting ASICs
Once the target technology is chosen, we can further optimize the design for
high performance or use target-specific automated tools to speed up the design process. For ASIC design, additional design metrics are considered, such as power consumption, delay constraints, and area limitations, or a combination of these. Depending on the different design metrics, designers may choose different
optimization methods; we list some of these in this section.
Optimization for High Performance
One of the primary reasons for using ASICs to implement algorithms instead of
using GPPs or ASPs is the high performance of ASICs. By exploiting parallelism
at different granularities and customizing at different layers, a full custom ASIC
design can yield a high density and high performance product at the cost of a
much longer design process. ASIC’s high performance comes from combining
both fine-grained parallelism and coarse-grained parallelism to achieve a high
performance implementation. In order to explore the inherited fine-grained par-
allelism of the application, designers need to trace the data flow step by step to
figure out a corresponding efficient hardware structure. Performance optimiza-
tion can be further done iteratively at different layers under different constraints
such as area, power, timing, etc. Although design tools can help designers to
automatically explore the trade-offs, often it is still a designer's task to manu-
ally optimize the circuits for each application especially when some constraints
cannot be met by the tools. It is usually a tedious and tricky job and requires
a lot of experience. It is difficult to get an early estimate of the quality of the RTL, whether the timing constraints are achievable, and whether the netlist is feasible. Mistakes made early in the process do not appear until after
place and route.
To summarize, ASIC optimization is not a trivial task and most of the opti-
mizations are application specific. All the constraints have to be considered and
balanced to get a final viable implementation. ASIC design is very expensive
and is typically used for very high volume design or designs where low power is
essential.
Design Automation
ASIC design has very long turnaround times and high NRE costs. High NRE
costs are inevitable due to the complexity of the ASIC design and manufacturing
process. Using design automation tools can greatly help the designer to shorten
the time-to-market cycle. The HLS tools we mentioned in Section 2.3.1 can
be used to implement designs targeting ASICs by manually adding annotations
and constraints during the automatic synthesis process. Moreover, HLS needs
to be combined with libraries provided from different ASIC vendors to get the
final design. Modern ASIC design involves placement and physical optimization,
clock tree synthesis, signal integrity and routing. Without good EDA tools, it
is almost impossible to get a working design [41]. Most of these tools, such as
those from Cadence [42] and Synopsys [43], are very expensive, which is another reason for the high NRE cost of ASIC design.
Optimization for Memory Access
Memory allocation and assignment becomes a critical issue when the applica-
tion involves a large amount of data transfer. ASIC designers need to choose the
optimal memory size and the optimal memory port width so that these will not
be a bottleneck to high speed processing. This results in a large design space
to explore. Panda et al. give a good summary of research in this area [44].
It summarizes the automatic memory allocation and assignment (MAA) meth-
ods in High Level Synthesis (HLS) and also discusses the memory hierarchy
related optimizations of embedded systems by taking advantage of spatial and
temporal locality [45]. Miranda et al. propose a methodology which can au-
tomatically generate the memory addresses for data-transfer-intensive applica-
tions [46]. Other researchers [47] introduce several optimization strategies for
application-specific systems using different memory architecture: data cache,
scratch-pad memory, custom memory architecture and dynamic random-access
memory (DRAM). These methods can improve the overall performance for some
types of applications, but the algorithms for deciding memory size and optimiz-
ing the memory hierarchy are complicated. ASIC designers need to adapt these
methods for specific applications.
2.3.3 Targeting FPGAs
An alternative for hardware implementation is using programmable logic devices
(PLDs). One type of complex PLD, which has grown very rapidly in popularity
over the past decade, is the field programmable gate array (FPGA). Compared
to ASIC design, the back-end design for FPGA devices is very simple and the
time-to-market is much shorter than for ASICs. Designers are therefore often
opting to use FPGA devices, either for the entire life of a product, if applications
require only a few tens of thousands of devices, or for prototyping and volume
ramp-up. Once volume production shows that a design is stable, engineers can
port the design to an ASIC device.
Design Automation
One major feature of FPGA technology is that it is re-programmable and gives
the user the option to develop an electronic design with ease, update their system
in the field, and test and verify the implementation quickly. Design automation
can further shorten the design cycle and is widely used in FPGA design.
The HLS tools mentioned in Section 2.3.1 can also be used for FPGA de-
sign with the aid of low level FPGA targeted synthesis tools. There also exist
some HLS tools specifically targeting FPGA designs. By knowing the target
technology at an earlier stage, tools can direct the optimization process more
efficiently.
JHDL [1, 2] is a set of FPGA CAD tools developed at Brigham Young Univer-
sity’s Configurable Computing Laboratory. It is a “structurally based Hardware
Description Language (HDL) implemented with JAVA”. The JHDL project is
an exploratory attempt to identify the key features and functionality of good
FPGA tools. Its aim is to provide a simulation and debug environment for both
the host code and FPGA design so that the design and debug process is easier.
Handel-C [3] is the language used by the commercial HLS tools from Celoxica.
It is based on ANSI-C and simple constructs are added which make it possible for
the design suite to compile algorithms directly into EDIF netlists. The output is
optimized for a target FPGA device. Celoxica also provides a design suite called
DK which can provide multiple language co-simulation for C, C++, SystemC,
SpecC, Handel-C, VHDL and Verilog.
SA-C (Single Assignment C) [4] takes a C-like programming language as in-
put and uses its own compiling tools to translate C into VHDL. In SA-C, the code
is automatically separated into inner loops, which are run on the FPGA, and the
remainder of the code, which runs on the host. The SA-C compiler assumes the
on-chip processing clock is the same as the I/O clock. The designer can ignore
timing information and the design process is closer to software programming.
The SA-C compiler can successfully map image processing algorithms onto an
FPGA, achieving up to an 800 fold speed-up over a Pentium III for complicated
image processing applications. But for simpler image operators, such as SWOs,
only a speedup of 10 times or less has been achieved.
Streams-C [48] is an intermediate approach between HLS and low level syn-
thesis tools. The target machine is an attached FPGA based computing board to
the host. The compiler can translate the C-like input code into clock-cycle level
design of hardware circuits. The compiler also pipelines computation and man-
ages stream synchronization. Although it cannot achieve the same performance
as hand-crafted designs, the final speedup is about a factor of 10.
The MATCH (MATlab Compiler for distributed Heterogeneous computing
systems) compiler project [49, 50] at Northwestern University aims to make it
easier for users to develop efficient code for configurable computing systems.
It is an experimental prototype of a software system that maps Matlab functions
into RTL VHDL for FPGA synthesis. MATCH was later commercialized by Ac-
celChip and the name was changed to AccelFPGA synthesis tools. AccelFPGA
reads Matlab and Simulink files and outputs synthesizable VHDL and Verilog
in RTL that has been optimized for an FPGA. The tool also creates simulation
models for bit-true verification, eliminating the need to create test benches for
DSP algorithms [51].
DEFACTO (Design Environment For Adaptive Computing TechnOlogy) [52,
53] is another compilation and synthesis system targeting FPGAs. The tool can
automatically do the design space exploration and select an implementation that
closely matches the performance of the fastest design in the design space. More-
over, DEFACTO defines a balance metric for guiding design space exploration so
that both memory bandwidth and hardware resources can be maximally utilized.
All of these design automation tools aim to raise the level of abstraction in
hardware design to simplify the design process and therefore shorten the design
cycle. However, compared to hand-written VHDL/Verilog design, the efficiency
of the designs from these automation tools can still be improved. Furthermore,
these automated design tools cannot efficiently use the on-chip memory of the
FPGA and therefore performance is degraded for data intensive applications.
Optimization for High Performance
Similar to ASIC design, FPGA design also combines both fine-grained paral-
lelism and coarse-grained parallelism to get speedup over GPPs or ASPs. Un-
like ASIC’s full custom design, FPGA configures built-in logic blocks and switch
boxes to implement algorithms. For the same algorithms, the FPGA implemen-
tation is usually larger and slower than the ASIC implementation. Recently,
the availability of multi-million gate FPGAs with clock speed in the hundreds
of MHz enable FPGA designers to implement very complicate algorithms in
FPGAs with performance comparable to ASICs.
Optimization of FPGA design performance can be divided into two cate-
gories: application-independent optimization and application-dependent opti-
mization. Application-independent optimization includes optimizing commonly
used functional units or macros such as FIFOs, adders, multipliers, FIR filters,
FFT etc. [54, 55]. Application-dependent optimization needs to consider not
only the design of the FPGAs, but also the layout of the whole system including
sensors, memories etc. [5, 6, 7]. For application-dependent optimization, there
are no general rules and designers must carefully select optimization methods
according to the application itself as well as the system structure in order to get
high performance.
Memory Hierarchy
In the mid 1980s, FPGA manufacturers offered only a few thousand gates per
chip and clock speeds of a few MHz, but now, manufacturers not only build
multi-million-gate FPGAs, but also integrate on-chip memories (block RAM)
into FPGAs. These small but fast on-chip memories can be used to build a
memory hierarchy that reduces the number of slower external memory accesses.
Data-intensive applications, which require frequent and large amounts of data
transfer, require an intelligent memory architecture to achieve high throughput.
A lot of research has been done on building an efficient memory architecture
according to the analysis of the algorithms to be implemented.
Gokhale et al. [56] present an algorithm to assign data automatically to mem-
ories to produce minimum overall execution time of the loops in the algorithm.
Instead of searching the exponential search space, this algorithm uses an implicit
enumeration method to reduce the search space. The limitation of this paper
is it only considers external memory bank allocations and ignores the on-chip
memories of the FPGA.
Weinhardt and Luk [57, 58] discuss a memory access optimization method
for FPGA-based reconfigurable systems with a hierarchy of on-chip and off-chip
(external) memory to speed up applications limited by memory access speed.
They focus on loop nests and map all data processing in the inner loop to a
data path. Data dependence analysis is used to find legal loop unrollings. The
optimization method used here is quite general and therefore it can only achieve
around 10 times speedup over software. This speedup is not sufficient for some
applications.
Andersson and Kuchcinski [59] present an automatic local memory archi-
tecture generation method for FPGAs with embedded CPUs, called System on
Programmable Chip (SoPC). They exploit data reuse and duplicate commonly used data in memories close to the data path. The optimization algorithm is
rather complicated because it has to first partition the task between the em-
bedded CPU and custom logic. Moreover, similar to [57, 58], performance is
sacrificed because it is a high-level hardware compiler.
Another approach is to optimize the memory interface and controller between
FPGAs and external memory banks. In [60], Park et al. describe a set of
parameterizable memory interface designs for both SRAM and SDRAM memory
technologies. The interface modules are suitable for a wide variety of designs with pipelining and page-mode memory operations. Lee et al. [61] present a
new methodology that can automate the connection of an Intellectual Property
(IP) block to a wide variety of interface architectures with low overhead. This
research was still at an early stage without many results.
2.3.4 Summary
There are other target technologies that can be used to implement SPPs such
as CPLDs, structured ASICs etc. Full custom ASICs and FPGAs are the most
popular ones. As we have discussed, they have their own advantages and dis-
advantages for implementing an algorithm. Designers can use automated tools
to speed up the design cycle or optimize the design or both, but a good design
still requires great effort from the designer to explore the design space. Our
research attempts to reduce that effort by providing a optimal and fast design
flow which can be used for most SWO-based applications. The memory hier-
archy is carefully designed so that the performance can be guaranteed for this
type of data-intensive application. Moreover, although currently we are target-
ing FPGA based COTS board, the ideas we present here can extend to any
architecture with multiple hardware cores, internal buffer memory and external
memory.
2.4 Implementing SWOs on COTS FPGA boards
Our previous discussion showed that there are no optimization rules that fit all
applications independent of the target technology chosen. However, we can find
some common rules for sets of similar applications. We are interested in SWO
applications because they are common operations implemented on SPPs.
In our case, the COTS boards are built before we implement any algorithm
on them, which means all the implementations are subject to the board's
constraints. We are concerned with improving performance and shortening the
design cycle when we implement SWOs on these COTS boards.
There is not much related work in this specific area, except for a series of
publications from Liang et al. We will show the details of how they implement
their SWOs and the differences between their method and ours.
Liang et al. [62] proposed a method for mapping generalized windowing
operations called Generalized Template Matching (GTM) onto reconfigurable
hardware using basic building blocks. First, all the possible non-dominant Mem-
ory Access Patterns (MAPs) are enumerated according to different combinations
of the Packing Factor (PF) and Initiation Interval (II) [63, 64, 65]. PF is defined
as the number of image pixels in one memory location. II is the constant time
(in clock cycles) between initiating the processing of two consecutive windows.
MAPs, which determine when and how data is transferred between FPGA and
external memory banks, are dependent on the PF and on the II. Liang et al.
enumerate all MAPs for different combinations of PFs and IIs. Once the MAPs are
listed, the corresponding data allocation and buffering schemes are determined.
In [66], three data allocation buffering strategies are proposed.
1. Full Image Row Buffering. Enough rows of pixels are buffered so that only
one new pixel datum is needed from an external memory bank for a new
window operation. This method will be discussed in detail in Section 3.4.
2. Small Internal Buffering. When storing rows of pixels becomes too expensive,
the image can only be stored in external memory. By using a small
internal buffer, data that has already been read in but not used in the
current window operation can be stored temporarily for the windowing
operation it is used in. This method introduces extra delay compared to
full image row buffering since it needs more external memory accesses. To
fully utilize the available memory bandwidth, two methods are used:
(a) Pixel Packing. When pixel data width is less than the memory width,
several pixels are packed in one memory address.
(b) Redundant External Data Storage. By storing several copies of data
in different external memory banks, the memory bandwidth can be
increased.
3. Partial Buffering of Image Rows. This is a combination of 1 and 2. When
there is not enough buffer space for full image row buffering, this method
can be used to maximally reduce the external memory accesses thus re-
ducing delay.
Once the MAPs are decided for each packing factor (PF) and Initiation
Interval (II), basic GTM building blocks called Region Functions (RFs) are
allocated to minimize the area and buffer size under constraints of an II, a PF
and latency. Finally, one or more RFs are implemented on the FPGA chip
so that the total execution time is minimized under the FPGA board resource
constraints.
Compared to a generic high level language hardware design process, Liang
et al. [62] take memory access and data buffering into consideration, which
is essential for data intensive applications. However, their approach has the
following problems:
1. Their design process begins with enumerating MAPs according to different
PFs and IIs. Although some dominant MAPs are pruned, the design space
is still too large especially when the size of the window or the number of
memory banks is large. Moreover, for each window operation, temporal
schedule and spatial binding of the window processing can greatly affect
the memory access requirements.
2. Their buffering method is not efficient when the available buffer size is not
enough for full image row buffering. This will be discussed in detail in
Section 3.5.
3. The routing area required in the FPGA is not considered. It is very difficult
to estimate routing area for FPGAs even though in many cases, it occupies
a large percentage of the total area.
Compared to Liang's method, the design flow we propose is simple, near
optimal, and adaptive to the availability of hardware resources. We determine
the duplication factor, defined as the number of copies of the function units for
one SWO, at an early stage of the implementation. Specifically, we:
1. Estimate the total area of functional units required for one SWO’s pro-
cessing and get the upper-bound of the duplication factor subject to area
constraints.
2. Calculate both the upper-bound and lower-bound of the duplication factor
subject to memory bandwidth constraints.
3. Calculate the upper-bound of the duplication factor subject to on-chip
memory size.
According to these three upper-bounds, we find the tightest upper-bound and
generate the corresponding design.
Chapter 3
DESIGN TRADEOFFS
The method we present here targets FPGA based coprocessor boards. We as-
sume that the FPGA chips and their connections to the external memory banks
have already been built on the board. Our goal is to design an implementation
which can fully utilize the available resources for optimal performance. By es-
timating the upper bounds to parallelism according to different constraints, we
can decide which constraint is the most critical one and then produce a design
accordingly.
There are three constraints we need to consider. First, the number of slices
in the FPGA chip limits how many copies of processing elements we can put
on the chip. Second, memory bandwidth defines the maximum data transfer
speed between the FPGAs and the external memory banks. Third, the size
of the on-chip memory which can be used as buffers reduces redundant data
transfer. Our goal is to maximize the usage of all the available resources for
maximum parallelism. For now we ignore timing constraints for each pipeline
stage; they can be optimized later after we determine the design structure. The
following sections go into the details of how these three constraints influence our
final implementation. Although most of our analysis is based on a coprocessor
board with a single FPGA chip, the same analysis can hold for boards with
multiple FPGA chips because most COTS systems have a symmetrical layout
of the interconnection between FPGA chips and memory banks. We can divide
the problem into several equal size sub-problems by exploiting coarse-grained
parallelism. After finding solutions for the sub-problems, we can combine these
solutions to achieve a final solution for the whole problem. Even if the intercon-
nection is different for different FPGA chips, we can still use the same analysis
process iteratively until we get an optimal division of the problem and allocate
the sub-problems to different chips.
The rest of this chapter is organized as follows. Section 3.1 lists the param-
eters we will use later in this dissertation. Section 3.2 describes the example we
will use as we discuss the design tradeoffs for the area constraint in section 3.3,
the memory bandwidth constraint in section 3.4 and the on-chip memory con-
straint in section 3.5.
3.1 Some Definitions
We define the parameters we use in Table 3.1. They are grouped into application
parameters, board parameters and other parameters. Application parameters
define the size of the SWO application. Board parameters include the board
resource information. Other parameters are intermediate and final results de-
termined from SWOOP. These parameters will be used later without further
explanation.
3.2 Example Description
In this chapter, we use an example to demonstrate our method. After we present
each constraint, we show how it applies to this example. The example has the
following properties:
• Input image size is 1024× 1024, where each pixel has 8 bits.
• The window applied to the image is a 3 × 3 high pass filter, as shown in
Figure 3.1. It can be used to sharpen images by highlighting fine detail
in an image or enhancing detail that has been blurred [12]. Sliding this
window in raster scan order throughout the whole image, we can get an
enhanced image where each output pixel POi,j takes the value of
(1/9) × [8 × PIi,j − Σ PIneighbor]; a software sketch of this operation follows
Figure 3.1. Each output pixel also has 8 bits.
Table 3.1: Parameter Definition

Application Parameters                                    Symbol    Explanation
Image size                                                M × N     M rows, N columns
Window size                                               m × n     m rows, n columns
Pixel value of input image                                PIi,j     row i, column j
Pixel value of output image                               POi,j     row i, column j
Bits per input pixel                                      Wpi
Bits per output pixel                                     Wpo

Board Parameters                                          Symbol    Explanation
Total area size                                           Atotal
No. of total memory banks                                 k
Memory bit width                                          Wmj       Wm1, Wm2, ..., Wmk
Total buffer size                                         Btotal

Other Parameters                                          Symbol    Explanation
No. of input memory banks                                 kin
No. of output memory banks                                kout
On-chip memory used for memory interface as FIFO          BIF
Total available buffer size for buffering                 Bavail
Block buffer size (in pixels)                             p × q     p rows, q columns
Duplication factor under area constraints                 Da
Upper bound of duplication factor with no buffering       Dml
Upper bound of duplication factor with full row buffering Dmu
Upper bound of duplication factor under buffer size constraints   Db
Tightest upper bound of duplication factor                D

• The board we have only contains one FPGA chip with 12,288 slices and
81,920 bits of on-chip memory.
• 4 memory banks are connected to this FPGA chip, with data widths of
32, 32, 64 and 64 bits respectively.
• We assume the processing clock is the same as the memory read/write
clock so that we do not have to consider cross clock boundary issues.
Extra buffers and converting the data transfer rate into bits/processing
cycle are needed if these two clocks are different.
[Figure 3.1: Highpass Filter Example. The 3 × 3 mask is
(1/9) × [−1 −1 −1; −1 8 −1; −1 −1 −1].]
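To make the window operation concrete, the following minimal software sketch (ours, using NumPy; the dissertation's actual implementations are in hardware, and the border handling here is a simplification) applies the mask of Figure 3.1 in raster scan order:

```python
import numpy as np

def highpass_3x3(image):
    """Apply PO[i,j] = (1/9) * (8*PI[i,j] - sum of the 8 neighbors)
    in raster scan order; border pixels are skipped for simplicity."""
    M, N = image.shape
    out = np.zeros((M, N))
    for i in range(1, M - 1):
        for j in range(1, N - 1):
            window = image[i - 1:i + 2, j - 1:j + 2].astype(np.int32)
            neighbors = window.sum() - window[1, 1]
            out[i, j] = (8 * int(window[1, 1]) - neighbors) / 9.0
    # Clip to the 8-bit output range, since each output pixel has 8 bits
    return np.clip(out, 0, 255).astype(np.uint8)
```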
3.3 Area Availability
Every functional unit (FU), e.g. adders, multipliers, registers etc., consumes one
or more slices of the FPGA. The number of slices in an FPGA is proportional
to the area in an ASIC; we use area here for simplicity.
3.3.1 Principal Blocks
Each FPGA has a fixed area (fixed number of slices) when it is manufactured;
the actual area depends on the model of the FPGA. Designs implemented in
a specific FPGA chip are limited by the available area, which means the total
area of all the functional units and routing cannot exceed the total area of that
FPGA chip. It is extremely difficult to estimate the routing area, so we reserve
20% of the total chip area for routing purposes, which leaves 80% of the total
area for FUs.
SWOs involve a set of repeated operations at different image locations. We
define a Micro Processing Element (MPE) as a set of pipelined function units
which can process one window. Once we build the MPE, we can sequentially
feed pixel data from different windows into this pipeline and get the outputs one
by one. Since each window is independent, if we have Da duplicate copies of the
same MPE, we can process Da windows at different locations simultaneously
and get a speedup of Da times. This is called spatial parallelism. Here D stands
for the duplication factor and the subscript a means this duplication factor is
based on area constraints. Given the total available area, we can estimate the
maximum Da based on the area consumed by the MPEs and the overhead which
coordinates these MPEs. Da is the upper bound on parallelism subject to the
area limitations.
Maximizing Da requires an accurate area estimate of the MPE, the control
logic and the interfacing logic. Figure 3.2 shows the block diagram of the differ-
ent blocks that are included in the area estimate. We define AMPEs as the sum
of all function unit area for processing one window; this includes the address
generation for one window. AIF and Actrl are the area overhead for interfacing
logic and control logic. AIF is easy to determine because it does not change
much when the design changes. In most cases, the interface modules are pre-
defined and their area can be estimated before we start the design process if we
know how many external memory banks will be used. Actrl is the area consumed
by the control logic. The control logic coordinates between MPEs so that they
can share the memory interface without conflict. Avalid is defined as 80% × Atotal
because we reserve 20% of the total area for routing overhead.
[Figure 3.2: Block Diagram of a Commercial FPGA Computing Engine. Inside the FPGA, duplicated MPEs (function units plus address generation, AMPEs) connect through the on-chip memory and control part (Actrl) to the memory and I/O interface (AIF), which links to the external memory banks and the host.]
Equation 3.1 shows the constraint for maximizing Da.
Da × AMPEs + Actrl + AIF ≤ Avalid (3.1)
Actrl depends on the complexity of the control state machine. Higher levels
of parallelism, corresponding to a larger Da, normally mean a more complicated
controller; therefore, Actrl can change for different Da. It is difficult to
estimate the control logic area before it is actually implemented. Others have
proposed a controller estimation method that requires the number of control
states to be known before estimation [67], using multiple linear regression to
model controller area as a function of complexity. Based on our previous design
experience, we simplify the estimate by assuming that each MPE needs an extra
15∼20% of its area for control. This overhead may be a little pessimistic in
some cases, and we can always adjust this parameter later for further
optimization. We can therefore rewrite Equation 3.1 as Equation 3.2:
Da × (AMPEs + 20%× AMPEs) + AIF ≤ Avalid (3.2)
Once scheduling and binding of the operations for one MPE is fixed, the
area can be determined. Getting an optimal schedule and binding given area
constraints is an NP-hard problem [37]. Many heuristic algorithms have been
explored to solve this problem so that sub-optimal solutions can be found with
less computation. We further simplify our method by assuming there is no
area limitation for processing one window. This assumption is reasonable for
small size SWOs and current FPGA technology. Da can be easily modified in
Equation 3.2 when AMPEs changes because a different implementation is chosen.
3.3.2 High-Pass Filter Example
Figure 3.3 shows a schedule using the ASAP scheduling algorithm [37] for our
high pass filter example. The dotted horizontal lines correspond to clock cycle
boundaries. We assume the multipliers and adders have the same unit delay;
this assumption can be modified depending on the library binding.
[Figure 3.3: Sequencing Graph of High Pass Filter. The eight neighbor pixels of PIi,j are summed by an adder tree, the center pixel is multiplied by 8, and the difference is scaled by 1/9.]
Once the original neighboring pixels are available, we get the output pixel
five clock cycles later. A filled pipeline can provide one result every clock cycle
thereafter. All these conclusions are based on the assumption that neighboring
pixels can constantly be fed into the pipeline. In the following sections, we will
discuss the situation when this assumption does not hold.
Table 3.2 shows the area usage of different function units, memory interfaces
and control logic. We use 20% of one MPE’s area as the estimate for controller
area.
Table 3.2: Area Usage of High Pass Filter Blocks

Blocks                          Slices Consumed
Memory Interfaces               1768
MPEs: Multiplier                69
MPEs: Adders                    42
MPEs: Registers                 52
MPEs: Counters                  24
MPEs: Total                     187
Control (20% of MPEs total)     38
According to Equation 3.2 and the data we get from Table 3.2, we can derive
the value of Da as follows:
1768 + Da × (187 + 38) ≤ 80%× 12288
Solving this inequality, we get the maximum value Da = 35. If we only consider
area constraints for this example, we can have at most 35 copies of one MPE
and thus have 35 windows being processed simultaneously.
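As a quick check, Equation 3.2 can be evaluated mechanically; this one-off sketch (ours) reproduces Da = 35 for the example:

```python
A_total, A_IF, A_MPE = 12288, 1768, 187   # slices (Table 3.2)
A_valid = 0.8 * A_total                   # 20% reserved for routing
A_ctrl = 0.2 * A_MPE                      # per-MPE control estimate
D_a = int((A_valid - A_IF) // (A_MPE + A_ctrl))
print(D_a)  # 35
```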
3.4 Memory Bandwidth Limitation
In section 3.3, we only consider the area constraints of the design. The feasibility
of this implementation depends on whether or not we can have all the data ready
for the functional units. If memory bandwidth becomes a bottleneck, then even
if we have Da MPEs on chip, some of them will be idle until the data arrives. In
this case, it is not useful to have Da copies of MPEs. We derive another upper
bound according to the memory bandwidth so that we can make sure all copies
of the MPE are fully working at all times.
3.4.1 Upper Bound with No Buffering and Full Row Buffering
As shown in Figure 1.3, each time the m × n window moves from left to right,
m new data are needed from memory and one output is generated. If there
is no buffering, we discard the m pixels which have moved out of the window.
However, of these discarded pixels, m−1 pixels will be reused when the window
moves to the next row. In this case, we load the pixels redundantly and increase
the memory bandwidth requirements. If we have enough buffer space, we can
keep the m− 1 pixels in the buffer and only discard 1 pixel which will never be
used for the following windows. When the window moves to the next row, it
requires m new data, m−1 of which are already in the buffer. Thus we only need
to read one new data from external memory. By this means we can minimize
the loading redundancy. Figure 3.4 shows a full row buffering method which
can achieve this goal [64]. These two cases (loading m pixels per window and
loading 1 pixel per window) can be defined as the lower bound and upper bound
of the parallelism factor, Dml and Dmu, respectively, according to the memory
bandwidth limitation.
[Figure 3.4: Using Full Row Buffering Scheme. The pixels of (m − 1) full image rows plus m additional pixels are buffered on chip; moving from the current processing window to the next requires fetching only one new pixel from external memory.]
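To make the scheme concrete, here is a minimal software model of a full row buffer (our sketch; the capacity (m − 1) × N + n used below is the general m × n form, which coincides with (m − 1) × N + m for the square windows of our examples):

```python
from collections import deque

def sliding_windows(image, m, n):
    """Stream pixels in raster order through a line buffer holding only
    the last (m - 1)*N + n pixels; after the fill phase, each emitted
    window costs exactly one new off-chip pixel read."""
    M, N = len(image), len(image[0])
    buf = deque(maxlen=(m - 1) * N + n)
    for i in range(M):
        for j in range(N):
            buf.append(image[i][j])         # the single new pixel
            if i >= m - 1 and j >= n - 1:   # a full window is buffered
                window = [[buf[r * N + c] for c in range(n)]
                          for r in range(m)]
                yield (i - m + 1, j - n + 1), window
```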
Upper Bound with Minimal Buffering Dml: If there is minimal buffer space
available to store m× (n− 1) pixels, then we need to load data from the
external memory banks each time. We also need to consider the output
memory data transfer which will share the memory bandwidth with the in-
put memory data transfer. Although the same memory port can be shared
between input and output data transfer by interleaving the data transfers,
we allocate input and output memory ports separately to simplify the
control and to achieve high throughput designs.
Equation 3.3 shows how we define Dml.
Dml × (m × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.3)
Dml is subject to the constraints shown in Equation 3.4 because input and
output memory ports are allocated separately.
Dml × (m × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Dml × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.4)
Upper Bound with Full Row Buffering Dmu: By assuming enough buffer space
is available, we can reduce the input memory data transfer to 1 pixel per
new window using the full row buffering method shown in Figure 3.4. In
the case where we have Dmu windows processed simultaneously, memory
bandwidth limits the value of Dmu according to Equation 3.5.
Dmu × (1 × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.5)
Dmu is subject to the constraints shown in Equation 3.6 because input and
output memory ports are allocated separately.
Dmu × (1 × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Dmu × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.6)
The assignment of input and output memory may differ between these two
cases where memory bandwidth is the constraint.
3.4.2 High-Pass Filter Example
We consider again the 2-D filter example from Section 3.3.2. We have already
shown that if we only consider area constraints, we can have 35 MPEs in parallel.
This means that once the pipeline is full, we can produce 35 output pixels in one
clock cycle. These output pixels need to be stored in external memory so the
host can get the processed image. However, 35 output pixels/cycle means we
need at least 35 × 8 bits/cycle of memory bandwidth to store the data. According
to our board parameters, we only have (64 + 64 + 32 + 32) bits/cycle of memory
bandwidth, which is not even enough to transfer the output data to external
memory, let alone the input data transfer from external memory. Obviously, for
this example, area is not the critical constraint of the design, and we need to
consider the upper bound on the duplication factor subject to the memory
bandwidth constraint.
Without any buffering, three new pixel data must be loaded from external
memory and one processed pixel data stored to external memory each time for
each window processed in parallel. To determine Dml, we need to allocate the
input memory ports and output memory ports so that the maximum number for
Dml meets the constraints of Equation 3.4. Based on the fact that the memory
port number k is not a large number, we can enumerate all the input/output
memory port allocation possibilities to obtain the maximum Dml. In this ex-
ample, the optimal allocation would be to assign the two 64 bit width memory
ports as input memory while the two 32 bit width memory ports are output
memory. By substituting the numerical values into Equations 3.3 and 3.4,
we get Dml = 5, which is much smaller than Da = 35.
Using a similar strategy to calculate Dmu, we find that allocating one 64
bit and one 32 bit width memory as input memories and the others as output
memories, we can get the maximum value of Dmu = 12 under the constraints of
Equation 3.6.
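Because k is small, the optimal split of memory ports between input and output can be found by brute force; the sketch below (our illustration, not SWOOP's actual code) enumerates all allocations for both bounds:

```python
from itertools import product

def best_duplication(widths, pixels_in, w_pi, w_po):
    """Maximize D subject to D*pixels_in*w_pi <= input bandwidth and
    D*w_po <= output bandwidth over all port splits; pixels_in is m
    for no buffering (Eq. 3.4) and 1 for full row buffering (Eq. 3.6)."""
    best = 0
    for assign in product((True, False), repeat=len(widths)):
        bw_in = sum(w for w, a in zip(widths, assign) if a)
        bw_out = sum(w for w, a in zip(widths, assign) if not a)
        if bw_in and bw_out:
            best = max(best, min(bw_in // (pixels_in * w_pi),
                                 bw_out // w_po))
    return best

widths = [64, 64, 32, 32]
print(best_duplication(widths, pixels_in=3, w_pi=8, w_po=8))  # Dml = 5
print(best_duplication(widths, pixels_in=1, w_pi=8, w_po=8))  # Dmu = 12
```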
Dml and Dmu define the upper bound of the duplication factor for two ex-
treme cases: no buffering and full row buffering. Obviously, the number of
simultaneously processed windows also depends on the buffering scheme we se-
lect. When on-chip memory is limited and not enough for full row buffering,
the actual upper bound may be less than Dmu. This analysis is presented in the
next section.
3.5 On-chip Memory Availability
Section 3.4 gives the two upper bounds of the duplication factor under memory
bandwidth constraints. As we mentioned before, if we have enough buffer space,
we can reduce the external memory data transfer rate from m pixels per window
to 1 pixel per window. Figure 3.4 shows a full row buffering method which
requires the buffer size to be at least ((m − 1) × N + m) × Wpi bits. This may
be larger than the buffer space available. When buffer size becomes a critical
constraint of the duplication factor, our goal is to optimize the buffer size while
still keeping the number of external memory data accesses as small as possible.
3.5.1 Block Buffering Method
We present a new method we call the block buffering method. It can greatly
reduce the buffer size while still keeping the data loading redundancy low.
Figure 3.5 gives a simple example of the block buffering method. Before
processing is begun, a p × q block of pixel data is buffered, where p ≥ m and
q ≥ n. In this example the window size is m = 3, n = 4 and the block
size is p = 4, q = 6. With a p × q block buffer, we can process a total of
(p−m + 1)× (q− n + 1) windows without loading any new data. While we are
processing the current block, we can at the same time load the data for the next
p× q block buffer. As shown in Figure 3.5, each time when we load a new block,
p×(q−n+1) pixels will be loaded from off-chip memory. So the average number
of off-chip memory accesses for this type of move is p×(q−n+1)(p−m+1)×(q−n+1)
= pp−m+1
pixels per window operation. This is a significant savings for p > m when
compared to off-chip memory access without buffering, which requires m pixels
per window operation. When the block moves to the next row block, the number
of pixels needed from off-chip memory to initiate the new windowing processing
is also p × q. We are more interested in the horizontal movement of the block
because it occurs more frequent than the vertical movement.
Equation (3.7) shows a simple proof that, after introducing the p × q buffer,
the memory requirement p/(p − m + 1) is at most m, the data transfer rate
without the buffer. Moreover, the larger p is, the less memory bandwidth is
required. The proof requires p ≥ m, which we have already assumed.
[Figure 3.5: Block Buffering Method Example. A 3 × 4 window slides inside a 4 × 6 block buffer; the 1st block buffer, the 2nd block buffer, and the next row block buffer are outlined on the pixel grid.]
p ≥ m
⇒ p(m − 1) ≥ m(m − 1)
⇒ p ≤ pm − m² + m
⇒ p/(p − m + 1) ≤ m        (3.7)
Once we decide the values of p and q, we can determine the corresponding
duplication factor Db subject to buffer availability. Equation 3.8 shows how
we can determine the value of Db if we are only concerned with the memory
requirement when blocks move from left to right.
Db × (p/(p − m + 1) × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.8)
Again, Db is subject to the constraints shown in Equation 3.9 because the
input and the output memory ports are allocated separately.
Db × (p/(p − m + 1) × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Db × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.9)
The full row buffering method can be viewed as a special case of the block
buffering method with some modifications. If we let p = M, q = n using the
block buffering method, then n lines of pixels are buffered before we start sliding
window processing. It is similar to the full row buffering method. However, for
the full row buffering method, processing starts even before all n lines of pixels
are buffered. Once the pixels for the first window are in the buffer, processing
can start. After each window is processed, one datum in the buffer is discarded
and a new datum is loaded. In this manner, the full row buffering method can
save some buffer space by updating the position of the start and end pixels
after each window. This also means that the address generation part of full row
buffering is extremely complicated because we are using a “moving” buffer. The
block buffering method, on the other hand, can provide much simpler address
generation.
3.5.2 Selecting p and q
Section 3.5.1 shows the theoretical analysis of using the block buffering method.
For actual implementations, instead of loading p× (q − n + 1) new pixels from
off-chip memory when the block moves horizontally, we load a new block of p×q
pixels from off-chip memory even though p × (n − 1) pixels are already in the
on-chip memory. Our consideration here is that usually p >> m and q >> n
and these extra redundant off-chip memory accesses can greatly simplify the
control signals, and allow us to overlap loading and processing data.
Therefore, Equation 3.8 needs to be modified as follows:
Db × (p × q / ((p − m + 1) × (q − n + 1)) × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.10)
Correspondingly, the constraints should change to those in Equation 3.11.
Db × (p × q / ((p − m + 1) × (q − n + 1)) × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Db × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.11)
Our goal is to minimize the total number of off-chip accesses. Under the previous
assumptions, that is equivalent to minimizing the average number of off-chip
accesses per window, p × q / ((p − m + 1) × (q − n + 1)). When we increase p or
q, we reduce the external memory accesses per window from m toward this ratio,
and the larger p and q are, the closer the ratio gets to 1. Since p and q are
constrained by Equation (3.12), we need to balance p and q to obtain the minimal
number of external memory accesses.
p × q × Wpi ≤ Bavail/2        (3.12)
Bavail is the on-chip memory available for data buffering, defined as
Btotal − Bother, since some on-chip memory may be used for other purposes, e.g.,
as a FIFO for the memory interface. The available on-chip memory is used to
build two block buffers so that while we are processing the current windows,
we can load the next block of data in parallel. Building two block buffers
instead of one increases the average number of external memory accesses, but
overlapping the loading and the computation saves overall processing time.
Figure 3.6 shows that by alternately using two small buffers we can greatly
reduce the waiting time and increase the fraction of time spent processing.
[Figure 3.6: Overlapping Loading and Processing. With a single block buffer of size p × q, load and process phases alternate with waits; with two buffers of size popt × qopt used alternately, loading the next block overlaps processing the current one, leaving almost no waiting.]
The optimal values for p and q minimize the expression
p × q / ((p − m + 1) × (q − n + 1))
under the constraint shown in Equation (3.12). Solving this constrained
minimization, we find that by selecting:
popt = √(Bavail × (m − 1) / (2 × (n − 1) × Wpi))
qopt = √(Bavail × (n − 1) / (2 × (m − 1) × Wpi))        (3.13)
we can minimize the total number of external memory loading accesses. The
detailed proof can be found in Appendix B.
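As an illustrative check (our own sketch), the optimal block shape, the resulting average loading rate, and the Db bound for the running example can be computed directly:

```python
import math

def optimal_block(B_avail, m, n, w_pi):
    """Block shape per Equation 3.13; the factor of 2 reflects the two
    alternating (double-buffered) block buffers sharing B_avail."""
    p = math.sqrt(B_avail * (m - 1) / (2 * (n - 1) * w_pi))
    q = math.sqrt(B_avail * (n - 1) / (2 * (m - 1) * w_pi))
    return p, q

m = n = 3
p, q = optimal_block(16384, m, n, 8)     # 16,384 bits, 8-bit pixels
print(p, q)                              # 32.0 32.0
rate = (p * q) / ((p - m + 1) * (q - n + 1))
print(round(rate, 3))                    # ~1.138 loads/window (vs. m = 3)
print(int((64 + 32) // (rate * 8)))      # Db = 10 for 96 bits/cycle input
```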
3.5.3 High-Pass Filter Example
According to the above analysis, we know that Db satisfies the following condi-
tion:
Dml ≤ Db ≤ Dmu
because the number of external memory accesses per window using the block
buffering method is between 1 (full row buffering) and m (minimal buffering). The
value of Db depends on the buffer size. In our example, we have a total of 81,920
bits of on-chip memory, where 65,536 bits are used for the memory interface and
the remaining 16,384 can be used as the buffer. The full row buffering method
shown in Figure 3.4 requires (m − 1) × N + m pixels to be buffered, which means
it needs 16,408 bits of on-chip memory. This is more than the available buffer size
and, in this case, the full row buffering method is not feasible and we need to
use the block buffering method instead.
Our computation shows that, given Bavail = 16,384 bits, the value of popt is
32 and qopt is 32. After we determine popt and qopt, we can get the data transfer
rate requirement subject to Equation 3.11. By allocating one 64-bit and one 32-bit
width memory port as input memory and the other 64- and 32-bit ports as
output memory, we get Db = 10, which doubles the duplication factor Dml and
is close to Dmu.
3.6 Example Summary
This chapter presents a detailed discussion of how to determine the different up-
per bounds of parallelism according to the different resource constraints. The ex-
ample given in this chapter shows that different resource constraints can greatly
influence the value of the upper bounds. According to our analysis, the max-
imum value of Da is 35 if we only consider area constraints. Further analysis
shows that the duplication factor should be in the range of [5, 12] when mem-
ory bandwidth is considered. Finally, if we take buffer space into account, the
maximum duplication factor should be 10. Clearly, we will select 10 as our final
duplication factor. This also means that the tightest constraint for this example
is buffer size. Once we select the tightest constraint, the corresponding hardware
block structure can be determined. In this example, we will use the block buffer-
ing method and let p = 32, q = 32. 10 copies of MPEs will be generated and the
corresponding address generation part will be built. Moreover, each memory
port is assigned either as input memory or output memory at this time.
3.7 SWOOP
SWOOP is a tool that automates the process of finding the upper bounds and
choosing the tightest one. It also reports the optimal memory port allocation
according to the tightest upper bound. For on-chip memory constrained
applications, SWOOP also gives the optimal value of the block size. SWOOP
is written in MATLAB and implements the method discussed earlier in this
chapter. The algorithm SWOOP uses is summarized as follows:
Step 1: Enumerate all possible memory bank usages and get the values of
different AIF and BIF.

For each set of AIF and BIF:

    Step 2: Get the value of Da according to board parameters and
    application parameters.

    Step 3: Calculate Dml and Dmu using optimal memory allocation.

    IF Bavail is enough for full row buffering:
        Step 4: The tightest constraint could be either area or memory
        bandwidth, and D = min{Da, Dmu}.
    ELSE:
        Step 4: Calculate popt and qopt and the corresponding Db using
        optimal memory allocation, and D = min{Da, Db}.

End For.

Step 5: Find the maximal value of D over all the cases. This D is the final
maximal duplication factor.
Figure 3.7 gives the flowchart of the algorithm.
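A compact, self-contained software rendering of this loop (our own Python sketch; SWOOP itself is written in MATLAB, and the per-bank interface costs below are simplifying assumptions rather than board data) reproduces the running example's result:

```python
import math
from itertools import product

def swoop(widths, A_total, B_total, A_MPE, A_IF_bank, B_IF_bank,
          N, m, n, W_pi, W_po):
    """Sketch of the SWOOP search: interface cost is modeled as fixed
    per-bank amounts, and m, n >= 2 is assumed."""
    best = (0, None)
    # Step 1: for every bank, choose input, output, or unused
    for assign in product(('in', 'out', 'unused'), repeat=len(widths)):
        bw_in = sum(w for w, a in zip(widths, assign) if a == 'in')
        bw_out = sum(w for w, a in zip(widths, assign) if a == 'out')
        if bw_in == 0 or bw_out == 0:
            continue
        banks = sum(a != 'unused' for a in assign)
        B_avail = B_total - banks * B_IF_bank
        if B_avail <= 0:
            continue
        # Step 2: area bound (20% routing reserve, 20% control per MPE)
        D_a = int((0.8 * A_total - banks * A_IF_bank) // (1.2 * A_MPE))
        # Steps 3 and 4: pixels loaded per window under the buffering
        # scheme that fits in B_avail
        if B_avail >= ((m - 1) * N + m) * W_pi:    # full row buffering
            rate = 1.0
        else:                                      # block buffering
            p = math.sqrt(B_avail * (m - 1) / (2 * (n - 1) * W_pi))
            q = math.sqrt(B_avail * (n - 1) / (2 * (m - 1) * W_pi))
            rate = (p * q) / ((p - m + 1) * (q - n + 1))
        D = min(D_a, int(bw_in // (rate * W_pi)), bw_out // W_po)
        if D > best[0]:
            best = (D, assign)                     # Step 5: keep max D
    return best

# HPF example; 442 slices and 16,384 FIFO bits per bank are assumptions
print(swoop([64, 64, 32, 32], 12288, 81920, 187, 442, 16384,
            1024, 3, 3, 8, 8)[0])  # -> 10
```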
SWOOP takes only seconds to run, so it can be used as a tool to get a rough
estimate of an architecture for a given design. Later on, the designer can change
the parameters and run SWOOP multiple times to get a near-optimal solution
if specific requirements change.
In our previous discussion, we assume that the three constraints are indepen-
dent of each other. This assumption is not always true and some modifications
are necessary when estimating the maximal duplication factor. For example, a
different allocation of memory ports may lead to changes in AIF and Bavail. As
a result, the values of Da and Db may be different. In SWOOP, we enumerate all
the possible memory allocations and the corresponding values of AIF and Bavail
to find the optimal memory architecture. This enumeration, although exponen-
[Figure 3.7: SWOOP Flowchart. Input the board and application parameters and enumerate all possible AIF and BIF. For each case, calculate Da, Dml, and Dmu with their corresponding memory allocations. If Da ≤ Dml, set D = Da; otherwise, if Bavail is enough for full row buffering, set D = min(Da, Dmu); else calculate the optimal p and q, calculate Db with its optimal memory allocation, and set D = min(Da, Db). Once all possible AIF and BIF are considered, select the maximal D and output it with its corresponding memory allocation.]
tial in the number of the off-chip memory banks, is still affordable because the
number of the off-chip memory banks is usually small, i.e., less than 10. Here
Bavail is defined as the total available buffer size for SWO buffering. In this
dissertation, Bavail equals Btotal − BIF where BIF is the buffer size used as a
FIFO for the memory interface.
3.8 Summary
This chapter discussed the design tradeoffs in detail. Three upper bounds,
corresponding to three different resources, were determined using analytical
methods, and a simple example was presented to illustrate the tradeoffs. The
detailed SWOOP algorithm was also presented. In the next chapter, we present
four different SWO applications implemented using both SWOOP and manual
design. The results from SWOOP and the manual designs are compared, and a
detailed analysis is given for each application.
Chapter 4
EXPERIMENTS
In Chapter 3, we have discussed how to determine the different upper bounds
according to different resource constraints. We consider the three upper bounds
as relatively independent of each other, but practically each will have some
influence on the others. Since our aim is to quickly estimate the maximum
performance given the co-processor board, we ignore these second order effects
at this estimation stage and are only interested in defining the block structure.
The value of the maximum performance may change a little if the design is
further optimized based on the block structure we get. The small variation
between the estimate and actual implementation can be ignored in most cases.
Four examples in very different areas have been implemented using our
method to prove its effectiveness. One is the high-pass filter we have intro-
duced as our example in Chapter 3; one is a low-pass filter commonly used in
digital image processing for smoothing or blurring; one is the Retinal Vascular
Tracing (RVT) algorithm used in medical imaging [68]; and the last one is the
Particle Image Velocimetry (PIV) algorithm used in fluid dynamics [69]. For
the high-pass filter, we use the constraints we assumed previously. For the other
three algorithms, we will apply our tool, SWOOP, to constraints based on Fire-
Bird [70], a commercial FPGA-based computing board from Annapolis Micro
Systems Inc. Figure 4.1 shows the block diagram of the board; it contains one
Xilinx Virtex 2000E FPGA chip and 5 external memory banks.
[Figure 4.1: FireBird Block Diagram. A Xilinx Virtex XCV2000E FPGA connects over a 66 MHz, 64-bit LAD bus to the PCI controller (64-bit PCI bus) and to five ZBT SRAM banks: Mem_0 through Mem_3 (8 MB, 64 bits each) and Mem_4 (4 MB, 32 bits).]
The estimates of the performance for each application will be compared to
a hand-written design implementation to show the effectiveness of our method.
The handwritten designs showed significant speedup over the same algorithm
implemented in software.
4.1 3 × 3 High-Pass Filter and 5 × 5 Low-Pass Filter
Two 2-D filter applications are presented in this section with different image
sizes, window sizes and board parameters. Both algorithms have relatively small
window sizes and the area required for processing a single window is relatively
low. Therefore, the tightest constraint is either memory bandwidth or on-chip
memory size, depending on the board parameters.
4.1.1 Algorithm and Parameters
The 3 × 3 High-Pass Filter (HPF) has already been described in Chapter 3; we
summarize it here for completeness. Figures 4.2 and 4.3 show the HPF
mask and the corresponding sequencing graph.
[Figure 4.2: Highpass Filter Coefficients. The 3 × 3 mask is
(1/9) × [−1 −1 −1; −1 8 −1; −1 −1 −1].]
A 5 × 5 Low-Pass Filter (LPF) is also presented in this section because it
[Figure 4.3: Sequencing Graph of High Pass Filter, identical to Figure 3.3.]
is similar to the HPF but uses different board parameters. The 5 × 5 low-
pass filter, sometimes called a smoothing or blurring algorithm, is widely used
in image preprocessing. It can be used to reduce noise, bridge small gaps in
lines or curves etc. [12]. In our experiment, we implement a 5×5 LPF using the
coefficients shown in Figure 4.4. To save area, instead of using multipliers, we use
shift registers combined with adders to implement the constant multiplications.
Figure 4.5 shows the sequencing graph for the LPF. Because the graph is too
large to fit on one page, we break it into six parts according to the coefficients.
Basically we group the pixels that multiply the same coefficient value together.
From Figure 4.4 we know that there are in total 6 different coefficients for the
LPF mask. In Figure 4.5, part A shows the center pixel with coefficient
value 15; part B stands for the four pixels with coefficient value 12; part C for
coefficient 9; part D for coefficient 5; part E for coefficient 2; and part F for
coefficient 4. The complete MPE uses an adder tree to add these six parts
together to get the output for one window; a software sketch of the
shift-and-add decompositions follows Figure 4.4.
[Figure 4.4: Low-pass Filter Coefficients. The 5 × 5 mask is
 2  4  5  4  2
 4  9 12  9  4
 5 12 15 12  5
 4  9 12  9  4
 2  4  5  4  2
scaled by 1/115.]
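The shift-and-add decompositions behind the MPE are straightforward; a small sketch (ours) of the six constant multipliers with a self-check:

```python
def times_15(x): return (x << 3) + (x << 2) + (x << 1) + x  # 8+4+2+1
def times_12(x): return (x << 3) + (x << 2)                 # 8+4
def times_9(x):  return (x << 3) + x                        # 8+1
def times_5(x):  return (x << 2) + x                        # 4+1
def times_4(x):  return x << 2
def times_2(x):  return x << 1

assert all(f(7) == c * 7 for f, c in
           [(times_15, 15), (times_12, 12), (times_9, 9),
            (times_5, 5), (times_4, 4), (times_2, 2)])
```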
Table 4.1 shows the board parameters and application parameters for these
two 2-D filter applications.
Table 4.1: Board Parameters and Application Parameters for HPF and LPF Applications

Parameter   HPF                 LPF
Atotal      12288               19200
k           4                   5
Wmj         {64, 64, 32, 32}    {64, 64, 64, 64, 32}
Btotal      81920               655360
M × N       1024 × 1024         512 × 512
m × n       3 × 3               5 × 5
Wpi         8                   8
Wpo         8                   8
AMPE        187                 400
Table 4.2 shows the slice usage of one MPE for the LPF algorithm.
[Figure 4.5: Low-pass Filter MPE. Parts A through F compute the constant multiplications for the coefficient groups 15, 12, 9, 5, 2, and 4 using shift-and-add networks (shifts of 1, 2, and 3 bits) over the corresponding pixel groups.]
Table 4.2: Area Usage of Low Pass Filter Blocks

Blocks                          Slices Consumed
MPEs: Multiplier                107
MPEs: Adders and Counters       174
MPEs: Registers                 119
MPEs: Total                     400
Control (20% of MPEs total)     80
4.1.2 Results from SWOOP
We can use our automated tool, SWOOP, with the board parameters in Table 4.1
to estimate which constraint will be the tightest for these two applications.
According to our calculation, the 3 × 3 high-pass filter algorithm is on-chip
memory bound while the 5 × 5 low-pass filter algorithm is memory bandwidth
bound. Table 4.3 gives the different upper bound results and the final maximal
duplication factor from SWOOP according to the different constraints. For
LPF, because it is memory bandwidth constrained, SWOOP will use the full
row buffering scheme and will not calculate the values of p, q, and Db.
Table 4.3: Duplication Factors According to Different Constraints

          HPF        LPF
Da        35         28
Dml       5          5
Dmu       12         16
Db        10         N/A
p × q     32 × 32    N/A
D         10         16
The optimal external memory allocation differs for the calculation of the
different bounds. Table 4.4 shows how we allocate the external memory bank
for optimal implementation and the corresponding values of AIF and Bavail.
Table 4.4: Bound Dependent Variables

                               HPF         LPF
kin (input memory banks)       {64, 32}    {64, 64}
kout (output memory banks)     {64, 32}    {64, 64}
AIF (slices)                   1768        1835
Bavail (bits)                  16384       589824
4.1.3 Comparison and Analysis
We manually implemented these two applications on the FPGA board; the
results are compared with the SWOOP results. We define the following
parameters for the comparison.
• Dall: The maximum duplication factor.
• Ball: Total on-chip memory usage (in bits).
• Aall: Total area usage (in slices).
• Wmj: Memory bank used.
Table 4.5 shows the differences between the manual designs and the results
from our automated tool.
Table 4.5: Comparison between Automatic and Manual Results for HPF and LPF Algorithms

             Automatic         Manual            Difference (%)
HPF  Dall    10                9                 11%
     Ball    16384             16128             1.6%
     Aall    4460              4236              5.3%
     Wmj     {64,32,64,32}     {64,32,64,32}     same
LPF  Dall    16                16                0%
     Ball    16416             20480             -19.8%
     Aall    9515              8827              7.8%
     Wmj     {64,64,64,64}     {64,64,64,64}     same
For the HPF, the three parameters differ up to 11% from the manual design,
which means that SWOOP gives a good estimate of the speedup as well as
of the usage of on-chip memory and area. The largest difference lies in the
duplication factor; for the manual design this is Dall = 9. This is different
from the automatic design because the manual design considers the memory
packing factor. The packing factor is defined as the number of pixels grouped
together and stored in one memory address. In our case, one 64 bit and one 32
bit memory bank are allocated as input memory; for one external memory read
access, a total of 12 pixels ((64 + 32)/8 = 12) can be read. Thus the packing
factor is 12. According to the automatic tool, the optimal block buffer size is 32
by 32. If we assume that 12 pixels are stored in one memory address and the
image is stored line by line, loading 32 pixels of one row would require three off-
chip memory accesses. In fact, we can load 36 pixels in three memory accesses.
As a result, some of the memory bandwidth is wasted. In our manual design,
we choose the block buffer size to be 24 by 42 instead. By adjusting the block
buffer size, we can avoid wasting memory bandwidth and at the same time we
simplify the control part of the design. In the future, SWOOP will be extended
to take the packing factor into account.
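The wasted bandwidth is simple arithmetic; in the sketch below (ours), a block row whose width is a multiple of the packing factor, such as the 24 used in the manual design, wastes none of the fetched pixels:

```python
import math

pf = (64 + 32) // 8      # packing factor: 12 pixels per combined access
for width in (32, 24):   # SWOOP's block row vs. the manual design's
    accesses = math.ceil(width / pf)
    print(width, accesses, accesses * pf - width)
# 32 -> 3 accesses with 4 fetched pixels unused; 24 -> 2 accesses, 0 unused
```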
For the LPF, our automatic tool gives the exact value of speedup. The
difference in Ball between the automatic and the manual results is −19.8% because
in our manual design, instead of buffering (m − 1) × N + n − 1 pixels for full
row buffering, m×N pixels are buffered. By buffering one extra row, each MPE
can load data from on-chip memory instead of loading data from both on-chip
memory and off-chip memory. We greatly simplified the control signals of our
design by doing so. Moreover, the automatic tool provides a guide to how much
on-chip memory we can use for buffering and in this case, we have enough on-
chip memory to buffer m×N pixels. In the case of limited on-chip memory, we
can reduce on-chip memory usage to (m − 1) × N + n − 1 and the difference
in Ball between the automatic and manual results would be 0% at the cost of a
more complicated controller.
4.1.4 Summary
For these two 2-D filters, SWOOP gives a very accurate estimate of the optimal
design. Moreover, since SWOOP only gives the block structure of the design,
designers have the flexibility to further optimize the design. The results of
SWOOP can serve as a guide for the designer to focus their efforts since SWOOP
identifies where the tightest constraint to parallelism is found in a design.
4.2 Retinal Vascular Tracing Algorithm
The RVT algorithm was designed by colleagues at Rensselaer Polytechnic In-
stitute (RPI) [71]. It is used in the biomedical field for tracing blood vessels in
retinal fundus images. The speed of motion of the human eye, and the desire
for these traces to assist during laser surgery, require the traces to be
available immediately after the retinal image is acquired. The algorithm is
split into two parts. The first part of the algorithm detects and validates several
seed points that are known to be on blood vessels. The tracing algorithm uses
these seed points as initial points. The second part of the RVT algorithm is the
basic tracing algorithm which starts at one point on a blood vessel, finds the
next point on that same vessel, and continues to follow its path until it reaches
the end. All of the background information on the RVT algorithm in this sec-
tion is from [68]. We are interested in the tracing algorithm because it is very
computationally intensive and it falls into our SWO category.
4.2.1 Algorithm and Parameters
In order to begin tracing, an initial starting point, ~pk, and orientation, sk, are
necessary. The starting point is in the center of a blood vessel, and the orien-
tation gives the direction that the vessel is pointing. The initialization points
are found in a pre-processing step, which is not related to this dissertation and
is not covered here. Vector ~uk is the unit vector along the blood vessel at point
~pk, defined by Equation 4.1 [71].
~uk = (ukx, uky)ᵀ = (cos(2πsk/16), sin(2πsk/16))ᵀ        (4.1)
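In software this direction encoding is one line of trigonometry; the sketch below (ours) maps a discrete orientation sk in {0, ..., 15} to its unit vector:

```python
import math

def unit_vector(s_k):
    """Unit vector along the vessel for discrete orientation s_k (0-15),
    i.e. an angle of 22.5 degrees times s_k, per Equation 4.1."""
    angle = 2 * math.pi * s_k / 16
    return (math.cos(angle), math.sin(angle))

print(unit_vector(2))  # NE, direction 2, 45 degrees: (0.707..., 0.707...)
```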
The next point on the vessel, ~pk+1, and its orientation, sk+1, can be found by
investigating the orientation of the vessel’s boundaries. This is accomplished by
calculating the results from a set of two-dimensional correlation kernels, designed
by researchers at RPI [71] and shown in Figure 4.6.
[Figure 4.6: Templates. Sixteen 11 × 11 correlation kernels with coefficients 0, ±1, and ±2 centered on the target pixel, one per direction: E (0, 0°), ENE (1, 22.5°), NE (2, 45°), NNE (3, 67.5°), N (4, 90°), NNW (5, 112.5°), NW (6, 135°), WNW (7, 157.5°), W (8, 180°), WSW (9, 202.5°), SW (10, 225°), SSW (11, 247.5°), S (12, 270°), SSE (13, 292.5°), SE (14, 315°), ESE (15, 337.5°).]
Each of the 16 templates represents a unique direction. The angles for these
directions are discrete values separated by 22.5◦. Given a target pixel and the
correct neighborhood of pixels, each kernel will give a scalar result by multiplying
the gray scale values of the pixels by 0, ±1, or ±2 as defined by the 11 × 11
template. This result is referred to as the template response. The template with
the greatest response for the same target pixel represents the orientation of the
vessel at that pixel.
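Expressed in software (an illustrative sketch, ours rather than the RPI implementation; `templates` is assumed to hold the sixteen 11 × 11 coefficient arrays of Figure 4.6):

```python
import numpy as np

def best_direction(image, i, j, templates):
    """Correlate the 11 x 11 neighborhood of interior target pixel (i, j)
    with each template (entries 0, +/-1, +/-2) and return the index of
    the strongest response, i.e. the vessel orientation at that pixel."""
    window = image[i - 5:i + 6, j - 5:j + 6].astype(np.int64)
    responses = [(window * t).sum() for t in templates]
    return int(np.argmax(responses))
```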
We name the hardware module for individual template matching “response”.
By comparing results from 16 different templates, we select the maximal value
and the corresponding template that gives the actual direction of the blood ves-
sel. We call this comparison module “direction”. Figures 4.7 and 4.8 show the
detailed structure of these two modules.
[Figure 4.7: Response Module. Pipeline registers feed adder trees for the +1, +2, −1, and −2 coefficient groups; the ±2 sums are shifted left by one bit, the positive and negative totals are subtracted, and the result is registered as the template response.]
[Figure 4.8: Direction Module. Registers hold the 11 × 11 window of pixels, which feeds eight response modules through an interconnection network; a tree of template comparators selects the maximal response and outputs its label (000 through 111), identifying the direction.]
Note that in Figure 4.8, instead of using 16 response modules, 8 response
modules are used because the other 8 response values can be obtained by chang-
ing the sign of the current 8 response values (Figure 4.6). We define one MPE
as one direction module and its inputs are the 11 × 11 pixels while its out-
put is the 4 bit label used to identify the direction. The FPGA board we used
for this application is the same one we use for our LPF. Table 4.6 lists all the
parameters.
Table 4.6: Board Parameters and Application Parameters for RVT Applications

Board Parameters
Atotal      19200
k           5
Wmj         {64, 64, 64, 64, 32}
Btotal      655360

Application Parameters
M × N       512 × 512
m × n       11 × 11
Wpi         12
Wpo         4
AMPE        3826
4.2.2 SWOOP Results
Table 4.7 lists all the possible values of AIF and BIF for the different
combinations of the external memory banks. AIF and BIF are the same
independent of whether the external memory bank is used as input memory or
output memory. Since we assume input memory and output memory are in-
dividually assigned and there are at least two memory banks available, we can
ignore the case with exactly one 64-bit memory and the case with only one 32-
bit memory. For all the other cases, we use SWOOP to check the values of Da,
Dml, Dmu and Db to get the maximal duplication factor D. The final maximal
duplication factor Dall is determined as discussed in Section 3.6.
Table 4.7: Enumeration of AIF and BIF for FireBird

Memory used                       AIF (slices)   BIF (bits)
1 64-bit memory                   604            16,384
2 64-bit memories                 1,031          32,768
3 64-bit memories                 1,456          49,152
4 64-bit memories                 1,835          65,536
1 32-bit memory                   564            16,384
1 64-bit + 1 32-bit memory        990            32,768
2 64-bit + 1 32-bit memory        1,397          49,152
3 64-bit + 1 32-bit memory        1,829          65,536
4 64-bit + 1 32-bit memory        2,239          81,920

Table 4.8 lists all the duplication factors for the different usages of external
memory banks. From this table, we show that using a different number of
external memory banks can change the value of the maximal duplication factor.
The final duplication factor Dall we choose is the maximal value of all the D and
in our case Dall = 3. Note here that only Da and Dmu are listed and compared
to decide the duplication factor D because the on-chip memory is large enough
to use the full row buffering method for all cases.
According to SWOOP, the RVT algorithm is area constrained. The maximal
number of MPEs we can put on the board is 3. However, according to Table 4.8,
there are several cases where the maximal duplication factor is 3 (cases 1, 2,
and 4). We select case 4 because, compared to cases 1 and 2, it uses the
smallest area for the same performance. By adjusting the weight of different
resource constraints, we can easily change our selection according to the
user's requirements.
Table 4.8: Duplication Factor for Different Usage of External Memory Banks

Case   Memory used                  AIF (slices)   BIF (bits)   Da     Dmu   D
1      2 64-bit memories            1,031          32,768       3.12   5     3
2      3 64-bit memories            1,456          49,152       3.03   10    3
3      4 64-bit memories            1,835          65,536       2.95   16    2
4      1 64-bit + 1 32-bit memory   990            32,768       3.13   5     3
5      2 64-bit + 1 32-bit memory   1,397          49,152       3.04   8     2
6      3 64-bit + 1 32-bit memory   1,829          65,536       2.95   13    2
7      4 64-bit + 1 32-bit memory   2,239          81,920       2.86   16    2

For the final design, we select one 64-bit external memory as input memory
and one 32-bit external memory as output memory. The maximum number of
MPEs we can put on the FireBird is 3.
4.2.3 Comparison and Analysis
The RVT algorithm has been designed and implemented on the FireBird using
a hand-crafted design described in VHDL [68]. Both coarse-grained and fine-
grained parallelism were investigated to ensure high performance of the final
design. Table 4.9 gives the results from both SWOOP and the manual design.
Here the percentage differences are not listed because the manual design takes
other factors into account and is not a good reference for comparison. A more
reasonable comparison is given in Table 4.10.
For the RVT algorithm, the results from SWOOP are quite different from
Table 4.9: Comparison between Automatic and Manual Results for RVT Algorithm

         Automatic   Manual
Dall     3           1
Ball     90,112      180,224
Aall     14,764      10,035
Wmj      {64, 32}    {64, 64, 32}
the manual design. According to SWOOP, we can put at most 3 MPEs on
the board, while the manual implementation puts only 1 MPE. The differences
can be explained as follows:
• In addition to the template matching, the manual design also includes a
module which interfaces with the frame grabber. This module needs extra
slices to implement. We can easily re-run SWOOP with the area altered to
accommodate this interface to get a more accurate estimate.
• For this specific hand-crafted implementation, the processing time is limited
by the image capture rate from the frame grabber. As long as the
processing is faster than the incoming data rate, there is no need to
further speed up the processing part. A modified SWOOP result is shown in
Table 4.10, where only 1 MPE is implemented on the board and the results
are very close to the manual design regarding area usage.
• The designer of the manual design used a different buffering method which
is not optimal with respect to the on-chip memory usage and redundant
external data accesses. Although manual designs are usually more efficient
than an automated tool design, this is not always true. In this application,
SWOOP uses the full row buffering method described in Section 3.4.1. The
buffering method used by the manual design is described below.
For the manual design, 5 pixel data are grouped in one memory address
and correspondingly, the processing part takes 5 windows as a unit and feeds
them into the pipeline. As shown in Figure 4.9, adjacent neighborhoods contain
ten columns of common data. Every time the windows shift 5 pixels, five new
columns of pixels will be loaded from external memory.
[Figure 4.9: Shifting the Window. An 11 × 15 pixel window surrounds the target pixel; each shift brings in five new columns of pixels just read from memory.]
A mix of BlockRAM and registers is used to store the 11 x 15 pixel neighbor-
hoods, so the data can be quickly accessed by the FPGA logic. A neighborhood
is stored in three 11 x 5 pixel sections. The idea is that to shift the neigh-
borhood, we can throw away one section, move the data from the other two
sections, which is common to the adjacent neighborhood, and read in 11 new
words from memory to fill the third section. Figure 4.10 shows how the shifting
of data works.
[Figure 4.10: Creating New Neighborhoods by Shifting Data on the FPGA. Each 11 × 15 neighborhood is stored as three 11 × 5 blocks (A1 through A3, B1 through B3); shifting to the next neighborhood discards one block, reuses the two blocks shared with the adjacent neighborhood, and fills the third with 11 new words from off-chip memory.]
By using this buffering method, the data in the other two sections can be
reused to save external memory access. However, 10 lines of pixels in the dis-
carded section need to be reloaded when the templates move to the next row.
Redundant external memory accesses cannot be avoided by using this buffer-
ing method. SWOOP uses the full row buffering method which can completely
avoid the redundant external memory accesses with even less on-chip memory
usage.
Table 4.10 uses the adjusted parameters and compares the SWOOP results
with the manual design. Instead of putting the maximal allowable MPEs into
the board, we limit the design to one MPE on the board because in this specific
case, one MPE is enough to process the data in real-time. Moreover, the extra
area used to interfacing with the frame grabber is also taken into account by
SWOOP.
Table 4.10: Comparison between Automatic and Manual Results for RVT Al-gorithm
        Automatic   Manual      Difference (%)
Dall    1           1           0%
Ball    90,112      180,224     -50%
Aall    8,091       10,035      -19.37%
Wmj     {64,32}     {64,64,32}  N/A
According to Table 4.10, the biggest difference is the on-chip memory usage. SWOOP can find an optimal data access pattern for different SWO-based applications; it efficiently uses the available on-chip memory and reduces the external memory accesses. Compared to the manual design, SWOOP gives a better possible implementation for the RVT algorithm. If we follow the SWOOP results, we can implement this algorithm using less on-chip memory while at the same time reducing the external memory accesses.
The area difference between SWOOP and the manual design can be explained as follows:

• The manual design uses a different buffering method which requires many registers and more complicated control circuitry. These registers and control circuits need extra area.

• Because of the registers and complicated control, routing is more complicated and needs more area too.

• The manual design uses one more 64-bit memory bank, and the memory interface to this extra bank needs more slices.
The memory allocation is also different between SWOOP and the manual design. The manual design uses two 64-bit memory banks alternately to store the data from the frame grabber and to load the data onto the FPGA chip. This saves some control circuitry, but at the cost of wasting a portion of the memory bandwidth.
4.2.4 Summary
From this experiment, we can see that a real implementation needs to take not only the three constraints but also other factors into account. In this case, we need to modify the input parameters to SWOOP to get a more accurate estimate. Moreover, we can easily change the weights of different resources to get a near optimal design with the same performance according to the designer's requirements.
Another interesting result from the RVT algorithm is that hand-crafted designs are not necessarily the most efficient. In some cases, SWOOP can give the designer a guide as to how to fully utilize the available resources, especially the on-chip memory resources.
4.3 Particle Image Velocimetry Algorithm
Particle image velocimetry is a useful tool in fluid dynamics due to its non-intrusive, concurrent estimation of fluid movement. In a conventional PIV system [72, 73], small particles are added to the fluid and their movements are measured by comparing pairs of images of the flow taken in rapid succession. The local fluid velocity is estimated by dividing the images into small interrogation areas and cross-correlating the areas recorded in the two frames. Such systems are called double frame/single exposure systems. Figure 4.11 shows a typical PIV system [74]. There are two different SWOs in a PIV application. One is the whole image SWO, which selects the interrogation area in raster scan order. Unlike the SWO applications mentioned before, instead of moving one pixel to the next position, the window skips a fixed number of pixels to the next position; the number of pixels skipped depends on user requirements. The other SWO is the cross-correlation computation for one interrogation area. The cross-correlation part of the algorithm is very computationally intensive, and it limits PIV usage almost exclusively to off-line processing, analysis and modelling. For this application, we ignore the whole image SWO because, in most cases, the overlap of two neighboring interrogation areas is 50% or less, which means the data reuse rate is much lower than for the cross-correlation SWO. Moreover, we are more interested in the cross-correlation SWO because it is performed for every interrogation area, much more frequently than the whole image SWO. We will implement the cross-correlation part on an FPGA board using the method presented in this research to determine the speedup. The sizes of the interrogation areas are 40 × 40 and 32 × 32 for the two consecutive images.
Figure 4.11: PIV System Overview (From Dantec Dynamics)
Figure 4.12 shows an example of how to get the values of the cross-correlation plane. By moving the smaller area (Area B with n × n pixels) throughout the larger area (Area A with m × m pixels), we get (m − n + 1) × (m − n + 1) values which form the cross-correlation plane. The position of the peak value on the cross-correlation plane indicates the direction and the distance, in pixels, that the particles have moved. A sketch of this computation is given after the figure.
Figure 4.12: Cross Correlation Plane (Area B slides over Area A; the labeled example shifts are (x = −1, y = 1), (x = 0, y = 0), (x = 1, y = 1), (x = −1, y = −1), and (x = 1, y = 0), plotted on a cross-correlation plane with x and y axes running from −1 to 1)
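The following Python sketch mirrors this computation: Area B slides over Area A, producing one correlation value per shift, and the peak location is read off the resulting plane. The random arrays stand in for real interrogation-area pixels and are assumptions for illustration only.

```python
# Sketch of the PIV cross-correlation plane computation described above.
import numpy as np

def cross_correlation_plane(area_a, area_b):
    m, n = area_a.shape[0], area_b.shape[0]
    size = m - n + 1                       # plane is (m-n+1) x (m-n+1)
    plane = np.empty((size, size), dtype=np.int64)
    for dy in range(size):
        for dx in range(size):
            window = area_a[dy:dy + n, dx:dx + n]
            plane[dy, dx] = np.sum(window * area_b)
    return plane

rng = np.random.default_rng(0)
a = rng.integers(0, 256, (40, 40))         # 8-bit pixels, Area A
b = rng.integers(0, 256, (32, 32))         # Area B
plane = cross_correlation_plane(a, b)      # 9 x 9 plane
peak = np.unravel_index(np.argmax(plane), plane.shape)
print(plane.shape, peak)
```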
Figure 4.13 shows an example of what the velocity plane looks like after the cross-correlation of the different interrogation areas. The velocity information for each interrogation area comes from the location of the best match of the two sub-images (the peak value of the cross-correlation).

Similar to the RVT algorithm, we first estimate the performance using our method and then compare it to the hand-written design results [69, 75].
Figure 4.13: Velocity Plane (one velocity vector per interrogation area)
4.3.1 Algorithm and Parameters
The core of the PIV algorithm is 2-D cross-correlation. Figure 4.12 shows how we calculate the cross-correlation plane and then take the peak value to get the estimate of the velocity. In this example, the size of Area A is 4 × 4 and the size of Area B is 2 × 2. Area B moves pixel by pixel throughout Area A, which is similar to moving a window throughout an image. In our PIV application, Area A is 40 × 40 and Area B is 32 × 32; therefore the size of the cross-correlation plane is 9 × 9. The input images are 8-bit grey scale images. To meet the accuracy requirement for the final velocity value, we keep the full accumulated bit width for the correlation plane, which is 26 bits for each value: the 16-bit products of two 8-bit pixels are accumulated over 32 × 32 = 1024 terms, requiring 16 + log2(1024) = 26 bits. The final velocity value uses two 9-bit values, one for the horizontal direction of the velocity and one for the vertical direction. We take Area B as a window and Area A as an image to define the application parameters. One big difference in this application compared to previous applications is that the size of the window is much larger. Instead of putting the window values in registers, we use on-chip memory to store the window.
Some modifications are needed before we define the MPE. According to our calculation, one multiplier with two 8-bit inputs and one 16-bit output needs 47 slices. Therefore it is impossible to implement 32 × 32 = 1024 multipliers in parallel. Instead, we use 32 multipliers and several adder stages as part of one MPE. Figure 4.14 shows the detailed structure of one MPE [75]. The narrow rectangles indicate the registers that divide the processing into several pipeline stages, and the digits denote the bit width at each stage; a behavioral sketch follows the figure.
Figure 4.14: Pipelined Structure (32 multipliers in parallel take 8-bit inputs and produce 16-bit products; a 5-stage accumulation for each line grows the bit width from 16 to 21; accumulation over the whole interrogation area produces a 26-bit value; a comparator tracks the peak value)
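As a behavioral sketch of the datapath in Figure 4.14 (arithmetic only; the pipeline registers are not modeled), the Python below forms 32 parallel products for one 32-pixel row per step, reduces them through a 5-stage adder tree (32 → 16 → 8 → 4 → 2 → 1), and accumulates the 32 row sums into one correlation value.

```python
# Behavioral model of one MPE's arithmetic (not the VHDL implementation).
def mpe_row(pixels_a, pixels_b):
    assert len(pixels_a) == len(pixels_b) == 32
    stage = [pa * pb for pa, pb in zip(pixels_a, pixels_b)]  # 32 multipliers
    while len(stage) > 1:                                    # 5 adder stages
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

def mpe_correlation(win_a, win_b):
    # 32 rows -> 32 clock cycles per correlation value in the real pipeline
    return sum(mpe_row(ra, rb) for ra, rb in zip(win_a, win_b))

a = [[1] * 32 for _ in range(32)]
b = [[2] * 32 for _ in range(32)]
print(mpe_correlation(a, b))   # 2048
```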
Another modification we need to make is to the output. As mentioned before, the location of the peak value of the cross-correlation gives the velocity information of the particle movement. For an interrogation area pair with 40 × 40 and 32 × 32 pixels, the output is the position of the peak value. To increase the accuracy of the velocity estimate, a technique called sub-pixel interpolation is used in this application. We use the parabolic peak fit (Equation (4.2)) because it is more suitable for hardware implementation than the Gaussian peak fit [73].
\[
\begin{aligned}
p_x &= x + \frac{R(x-1,y) - R(x+1,y)}{2R(x-1,y) - 4R(x,y) + 2R(x+1,y)}\\[4pt]
p_y &= y + \frac{R(x,y-1) - R(x,y+1)}{2R(x,y-1) - 4R(x,y) + 2R(x,y+1)}
\end{aligned}
\tag{4.2}
\]
Figure 4.15 shows an example of how sub-pixel interpolation works. In this figure, the peak value appears at location (0, 0). When we apply sub-pixel interpolation, the adjusted location becomes (-0.1, -0.4) after taking the neighboring cross-correlation values into account. A sketch of this computation follows the figure.
Figure 4.15: An Example of Sub-pixel Interpolation (cross-correlation values over X and Y, in pixels, before and after interpolation)
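A direct transcription of Equation (4.2) into Python is shown below as a sketch; the handling of peaks on the border of the plane is an assumption added here, not something specified in the text.

```python
# Parabolic peak fit (Equation (4.2)): R is the cross-correlation plane
# (indexed R[y][x]) and (x, y) the integer peak location.
def subpixel_peak(R, x, y):
    # Assumed simplification: leave border peaks unadjusted.
    if not (0 < x < len(R[0]) - 1 and 0 < y < len(R) - 1):
        return float(x), float(y)
    px = x + (R[y][x-1] - R[y][x+1]) / (2*R[y][x-1] - 4*R[y][x] + 2*R[y][x+1])
    py = y + (R[y-1][x] - R[y+1][x]) / (2*R[y-1][x] - 4*R[y][x] + 2*R[y+1][x])
    return px, py

# e.g. a peak at (4, 4) with an asymmetric neighborhood shifts toward the
# larger neighbor:
R = [[0] * 9 for _ in range(9)]
R[4][4], R[4][3], R[4][5] = 100, 80, 60
R[3][4], R[5][4] = 70, 70
print(subpixel_peak(R, 4, 4))   # (~3.83, 4.0)
```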
Clearly, it is not necessary to store the entire cross-correlation plane in the external memory. Instead, for each cross-correlation plane, we only need to store the peak value position after sub-pixel interpolation. That means for every (40 − 32 + 1) × (40 − 32 + 1) windows processed, only two 9-bit results need to be stored to the off-chip memory. However, we do need extra on-chip memory to store the values of the cross-correlation plane before we get the final estimate of the velocity.
To summarize, the modifications we made before applying this application to SWOOP are as follows:

• A window needs to be loaded for each interrogation area. Loading the window from external memory each time introduces an extra input memory bandwidth requirement. Moreover, as mentioned before, we only take the cross-correlation SWO into account; the SWO for the whole image is ignored.

• Extra on-chip memory is needed to store the values of the window and the cross-correlation plane. In this application, the window size is 32 × 32 with 8 bits per value and the cross-correlation plane size is 9 × 9 with 26 bits per value. Therefore, we need 32 × 32 × 8 bits of on-chip memory to store the window and 9 × 9 × 26 bits to store the correlation plane.
• For each value on the correlation plane, 32 clock cycles are needed to compute the correlation. Therefore, to calculate the whole cross-correlation plane, 9 × 9 × 32 clock cycles are needed. If we assume there is no redundant data loading from the external memory banks, there are in total 40 × 40 + 32 × 32 pixels to load for one interrogation area. Averaging the data loading over the processing time, the input memory bandwidth requirement is (40 × 40 + 32 × 32) × 8/(9 × 9 × 32), or about 8 bits per cycle (see the sketch after this list).

• The output memory bandwidth is averaged over the entire cross-correlation plane too. For every 32 × 9 × 9 clock cycles, two 9-bit results need to be stored.
• The area used to implement sub-pixel interpolation is considered part of one MPE. The cross-correlation part requires 2092 slices and sub-pixel interpolation needs 559 slices, so the total number of slices needed for one MPE is 2092 + 559 = 2651.
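The bandwidth arithmetic in the list above can be checked with a few lines of Python; this is a sketch of the calculation, not SWOOP output.

```python
# Back-of-the-envelope bandwidth check for the modified PIV parameters.
M = N = 40          # Area A
m = n = 32          # Area B / window
WPI = 8             # input pixel width (bits)
WPO = 9             # width of each output component (bits)

cycles = (M - m + 1) * (N - n + 1) * m    # 9 * 9 * 32 = 2592 per area
in_bits = (M * N + m * n) * WPI           # load A and B once, no redundancy
out_bits = 2 * WPO                        # peak position (p_x, p_y)

print(in_bits / cycles)    # ~8.1 input bits per cycle
print(out_bits / cycles)   # ~0.007 output bits per cycle
```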
After taking all these modifications into account, we obtain the modified
parameters listed in Table 4.11.
4.3.2 SWOOP Results
Similar to the RVT algorithm, all the possible values of AIF and BIF are enumerated for the different possible combinations of the external memory banks. Table 4.12 lists the duplication factors for different usages of the external memory banks.
Table 4.11: Board Parameters and Application Parameters for PIV Applications

Board Parameters:
  Atotal   19,200
  k        5
  Wmj      {64, 64, 64, 64, 32}
  Btotal   655,360

Application Parameters:
  M × N    40 × 40
  m × n    32 × 32
  Wpi      8
  Wpo      9
  AMPE     2,651
For all of these combinations, the PIV algorithm is always area constrained; therefore, only Da and Dmu are listed.

According to Table 4.12, different usages of the external memory banks do not change the duplication factor: the maximal number of MPEs is always 4. We select case 4 as our final implementation because it uses the least area, as the selection sketch below illustrates.
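The sketch below encodes the selection rule implied by Table 4.12: since every case reaches the same duplication factor D = 4, the case with the smallest interface area AIF wins.

```python
# Tuples are (case number, AIF in slices, BIF in bits, D), from Table 4.12.
cases = [
    (1, 1031, 32768, 4), (2, 1456, 49152, 4), (3, 1835, 65536, 4),
    (4,  990, 32768, 4), (5, 1397, 49152, 4), (6, 1829, 65532, 4),
    (7, 2239, 81920, 4),
]
best_d = max(c[3] for c in cases)
chosen = min((c for c in cases if c[3] == best_d), key=lambda c: c[1])
print(chosen)   # (4, 990, 32768, 4): case 4, least area at maximal D
```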
4.3.3 Comparison and Analysis
The PIV algorithm has been designed and implemented by hand for the Annapolis FireBird using the VHDL language [69]. Table 4.13 compares the results from SWOOP and the manual design.
Interestingly, if we follow the SWOOP design, we can double the performance of the manual design with only about half of the on-chip memory. The fact that the SWOOP implementation is better than the manual design relies on the following factors:
Table 4.12: Duplication Factor for Different Usage of External Memory Banks

Case No.  Memory used                        AIF (slices)  BIF (bits)  Da    Dmu  D
1         2 64-bit memory banks              1031          32,768      4.50  8    4
2         3 64-bit memory banks              1456          49,152      4.37  16   4
3         4 64-bit memory banks              1835          65,536      4.25  24   4
4         1 64-bit + 1 32-bit memory bank     990          32,768      4.51  8    4
5         2 64-bit + 1 32-bit memory banks   1397          49,152      4.38  16   4
6         3 64-bit + 1 32-bit memory banks   1829          65,532      4.25  24   4
7         4 64-bit + 1 32-bit memory banks   2239          81,920      4.12  32   4
Table 4.13: Comparison between Automatic and Manual Results for PIV Algorithm

        Automatic   Manual
Dall    4           2
Ball    71,738      137,332
Aall    17,148      10,109
Wmj     {64,32}     {64,64,32}
• The input data requirement has been greatly reduced because of the selection of the MPE. As mentioned above, we use 32 multipliers and an adder tree to do the cross-correlation. This means that for every 32 clock cycles, one window is processed. When the window moves to the next position, only one new pixel is needed if the full row buffering method is used, which keeps the MPE busy for another 32 clock cycles. Therefore, the input memory bandwidth requirement is very small (one pixel per 32 clock cycles). The manual design did not take advantage of this, thus introducing a lot of redundant data accesses.

• The output data bandwidth requirement is very small too. To get the result for one interrogation area, we need all the cross-correlation results for the entire cross-correlation plane, which, in our case, is of size 9 × 9. As mentioned before, one cross-correlation result needs 32 cycles, which means that for each result of one interrogation area, one MPE computes for 32 × 9 × 9 cycles. For each interrogation area computation, only two 9-bit results need to be stored. The memory bandwidth requirement for storing the output data is negligible for this application.
The goal of this manual design is to meet the real-time requirements using the available board. According to [69], the manual design can process 15 pairs of images, which is good enough for the real-time system. However, according to our analysis, there are several places that can be further optimized to improve the performance.
• The PIV manual design uses two buffers to store the entire Area A (40 × 40) and Area B (32 × 32), while SWOOP uses the line buffering method, which stores 32 lines of Area A (32 × 40) and the entire Area B (32 × 32). Without degrading performance, the SWOOP buffering method saves some on-chip memory.

• To overlap the loading and processing time, the PIV manual design uses two copies of the buffers, which is not necessary because, in this application, loading data takes much less time than processing.

• Memory port allocation in the manual design is not optimal. For the reasons above (data loading is no longer a bottleneck), it is not necessary to allocate two 64-bit input memory banks; one 64-bit memory bank is enough.
Another big difference between the design from SWOOP and the manual design is the level of parallelism. In the manual design, two copies of the MPE process two pairs of interrogation areas separately. In SWOOP, four copies of one MPE process the same pair of interrogation areas. The advantage of processing the same pair of interrogation areas is that it minimizes the on-chip memory usage by loading one interrogation area at a time.
4.3.4 Summary
The PIV application is not a strict SWO application compared to the previous experiments; therefore, some modifications are needed to fit the PIV algorithm into our SWOOP model. Moreover, the window size of PIV is very large (32 × 32) and we can no longer put everything in parallel for one MPE. We reduced the size of one MPE at the cost of a longer processing time. This shows that SWOOP can be quite flexible for complicated window processing. Users can always further optimize the MPE and use SWOOP to estimate how much extra performance they can gain from this optimization.

Again, in this application, the SWOOP result is better than the manual design due to its optimized memory organization. Our experiment shows that if we follow the method provided by SWOOP, we can at least double the performance of the PIV application.
4.4 Summary
The four algorithms presented in this chapter were selected because they not only show the wide application area of SWOs, but also demonstrate how the three upper bounds used by SWOOP can affect the achievable performance. Most SWO applications fall in the range of window sizes (from 3 × 3 to 32 × 32) covered here. The analysis of these four algorithms can be used as a starting point for other SWO applications with different window sizes.
For the HPF and LPF algorithms, SWOOP gives relatively accurate estimates of resource usage and maximal performance. The SWOOP results can be used as a guideline for implementation or further optimization.
For the RVT and PIV algorithms, the SWOOP results are even better than the manual designs due to the optimized memory architecture. These two examples show that SWOOP can give an optimized block diagram according to different resource constraints while at the same time being applicable to different applications.
Currently, some of the modifications (for RVT and PIV) are manually determined. A more sophisticated tool could handle these modifications automatically by adding or adjusting parameters so that it can deal with more complicated applications.
In the next chapter, we will wrap up this dissertation with the conclusions
and future work.
Chapter 5
CONCLUSIONS
In this dissertation, we presented a new tool, SWOOP, for automatically implementing SWOs on COTS FPGA boards. Most currently available high-level synthesis tools target converting high-level languages into synthesizable VHDL/Verilog code; the final implementations are usually not very efficient in terms of speed and resource utilization. SWOOP targets only SWO-based applications, and thus can produce a near optimal design. Currently, SWOOP cannot generate VHDL or Verilog code automatically, but it can analyze three different upper bounds according to different constraints: area, memory bandwidth and on-chip memory size. SWOOP selects the tightest upper bound to determine the maximal possible performance. A block diagram of the design and a near optimal memory hierarchy for each specific application are also given by SWOOP at the same time.
Four different SWO applications in different areas are used to verify the correctness of SWOOP. The applications are selected such that the sizes of the windows vary from 3 × 3 up to 32 × 32. This range of window sizes covers most SWO applications and can be used as a starting point for more complicated problems. Moreover, these four SWO applications have shown designs that are constrained by each possible bound: area, memory bandwidth and on-chip memory.
The first two applications, the 3 × 3 High-Pass Filter and the 5 × 5 Low-Pass Filter, are very commonly used for image pre-processing. Because the size of the windows is small, these types of SWO applications are usually memory bandwidth constrained or on-chip memory size constrained. The SWOOP results are very close to the manual design results: the resource usage is accurately estimated and a block diagram with a near optimal memory structure is obtained by running SWOOP.
The third application, RVT, is a typical area constrained application because its template matching is very complicated and requires a lot of area for parallel computing. Interestingly, the results from SWOOP are better than the manual results with respect to performance and resource utilization.
The last application, PIV, is an SWO application with a very large window size (32 × 32). For applications with large windows, some modifications are needed to fit the SWOOP model. After the adjustments, SWOOP again gives a design with better performance than the manual design.
For all the above applications, SWOOP can quickly estimate the near optimal performance based on the board parameters and application parameters. For the RVT and PIV applications, the suggested implementations derived by SWOOP are even better than the hand-crafted HDL designs because of their near optimal memory architecture and intelligent buffering method. Compared to most currently available automated tools, SWOOP gives very good results for SWO applications.
5.1 Future Work
There are a number of directions in which this research can be continued:

• SWOOP is good at estimating the performance and resource usage for simple window operations. For more complicated applications, some manual intervention is necessary to make the application fit into the SWOOP model. One of our future directions is to make our tool more adaptive. We plan to have adjustable parameters that can be changed to modify SWOOP for different applications; examples include a shared memory port for both read and write when memory bandwidth is very limited, and breaking a parallel MPE into a sequential MPE to save area. Moreover, we could put weights on the different constraints and let the user decide which is the most important parameter for the specific application. For example, if the user is more concerned about memory bandwidth, we can put a small weight on designs with a high demand for memory bandwidth. The final decision can be made after considering all these weights, and the user can select the optimal implementation with great flexibility.
• The Packing Factor plays an important role in memory port allocation. In some cases, improper data packing can waste area and memory bandwidth resources. Currently, manual intervention is needed to make the optimal decision. By building a model for the optimal packing factor, we can integrate it into SWOOP and let SWOOP automatically select the packing factor that maximally utilizes the available resources.
• SWOOP can estimate the maximal performance much faster than most HLS tools, but one of the differences is that HLS tools can generate HDL designs while SWOOP can only give a block diagram; a designer still has to write the HDL code. Another future direction is, instead of having the user write the HDL code from scratch, to build an HDL-based library beforehand. Since we already know the block diagram, we can build parameterized modules in an HDL so that once the block diagram is decided by SWOOP, the user can select modules from the library that match the block diagram.
• SWOOP cannot yet estimate the control logic area because control logic area is very complicated to estimate. There is related research on how to estimate the area of control logic; if we can integrate these results into SWOOP, the final estimate will be more accurate.
5.2 Conclusions
In summary, we have presented SWOOP, an automated tool which takes the FPGA board parameters and SWO application parameters as inputs and then generates a block diagram of the FPGA implementation with near optimal performance and resource utilization. SWOOP gives a very accurate estimate according to three different constraints: area, memory bandwidth and on-chip memory. It maximally uses the available resources and estimates the achievable performance according to the tightest constraint. Four different experiments are presented in this dissertation using both SWOOP and manual design. The results show that SWOOP can obtain better performance than manual design in some cases.
Bibliography
[1] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting,
“A CAD Suite for High-Performance FPGA Design,” Seventh Annual IEEE
Symposium on Field-Programmable Custom Computing Machines, pp. 12–
24, April 1999.
[2] “JHDL: FPGA CAD Tools,” http://www.jhdl.org, Last accessed Dec 30,
2004.
[3] “Handel-C, Software-Compiled System Design,”
http://www.celoxica.com/products/c_to_fpga.asp, Last accessed Oct
29, 2006.
[4] B. A. Draper, J. R. Beveridge, A. P. W. Bohm, C. Ross, and M. Chawathe,
“Accelerated Image Processing On FPGAs,” IEEE Transactions on Image
Processing, vol. 12, no. 12, pp. 1543–1551, December 2003.
[5] A. Benedetti, A. Prati, and N. Scarabottolo, “Image Convolution on FP-
GAs: the Implementation of a Multi-FPGA FIFO Structure,” Proceedings
of the 24th Euromicro Conference, pp. 123–130, 1998.
[6] A. J. Elbirt and C. Paar, “An FPGA Implementation and Performance
Evaluation of the Serpent Block Cipher,” in FPGA ’00: Proceedings of the
2000 ACM/SIGDA eighth international symposium on Field programmable
gate arrays. ACM Press, 2000, pp. 33–40.
[7] M. Leeser, S. Miller, and H. Yu, “Smart Camera Based on Reconfigurable
Hardware Enables Diverse Real-time Applications,” FCCM’04, 2004.
[8] Annapolis Micro Systems Inc., WildStar™ Reference Manual, revision 3.0, 2000.
[9] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. USA: Kluwer Academic Publishers, February 1999.
[10] “Virtex™-E 1.8V Field Programmable Gate Arrays,”
http://direct.xilinx.com/bvdocs/publications/ds022.pdf, Last accessed
Jan 12, 2004.
[11] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative
Approach, 3rd ed. Morgan Kaufmann, 2002.
[12] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Addison
Wesley, 1993.
[13] C. Thibeault and G. Begin, “A Scan-Based Configurable, Programmable
and Scalable Architecture for Sliding Window-Based Operations,” IEEE
Transactions on Computers, vol. 48, no. 6, pp. 615–627, June 1999.
[14] J. A. Kahle, M. N. Day, et al., “Introduction to the Cell multiprocessor,”
IBM Journal of Research and Development, vol. 49, no. 4/5, 2005.
[15] Y.-H. Yeh and C.-Y. Kee, “Cost-Effective VLSI Architectures and Buffer
Size Optimization for Full-Search Block Matching Algorithms,” IEEE
Transactions on Very Large Scale Integration Systems, vol. 7, no. 3, pp.
345–358, September 1999.
[16] M. Kim, I. Hwang, and S.-I. Chae, “A Fast VLSI Architecture for Full-
Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264,”
in Asia and South Pacific Design Automation Conference, Jan 2005, pp.
631–634.
[17] F. Vahid and T. Givargis, Embedded System Design: A Unified Hard-
ware/Software Introduction. John Wiley & Sons, Inc., 2002.
[18] H. S. Stone, High-Performance Computer Architecture, 3rd ed. Addison-
Wesley, 1993.
[19] D. A. Patterson and J. L. Hennessy, Computer Organization and Design:
The Hardware/Software Interface, 2nd ed. Morgan Kaufmann, 1998.
[20] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C.-W. Tseng, “An
Overview of the SUIF Compiler for Scalable Parallel Machines,” Proceed-
ings of the 7th SIAM Conference on Parallel Processing for Scientific Com-
puting, 1995.
[21] U. Banerjee, R. Eigenmann, and A. Nicolau, “Automatic Program Paral-
lelization,” Proceedings of the IEEE, vol. 81, no. 2, pp. 211–243, Feb 1993.
[22] M. J. Wolfe, High-Performance Compilers for Parallel Computing. Addi-
son Wesley, 1996.
[23] “Total Solutions for Embedded Development,”
http://www.ghs.com/products/compiler.html, Last accessed December
14, 2004.
[24] A. S. Huang and J. P. Shen, “A Limit Study of Local Memory Require-
ments Using Value Reuse Profiles,” Proceedings of MICRO-28, pp. 71–81,
December 1995.
[25] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August,
“Compiler Optimization-Space Exploration,” Proceeding of the Inter-
national Symposium on Code Generation and Optimization: Feedback-
Directed and Runtime Optimization, pp. 204–215, 2003.
[26] Y. Ahmed, F. Jawed, S. Zia, and M. S. Aga, “Real-Time Implementation
of Adaptive Channel Equalization Algorithms on TMS320C6x DSP Pro-
cessors,” E-Tech, pp. 101–108, July 2004.
[27] J. H. Lee, J. H. Moon, K. L. Heo, M. H. Sunwoo, S. K. Oh, and I. H.
Kim, “Implementation of Application-Specific DSP for OFDM Systems,”
Proceedings of the 2004 International Symposium on Circuits and Systems,
vol. 3, pp. 665–668, May 2004.
[28] A. R. Silva and V. I. Ponomaryov, “Color Imaging by Using of DSP Imple-
mentation of Different Filters,” 14th International Conference on Electron-
ics, Communications and Computers, pp. 293–298, Feb 2004.
[29] L.-H. Chen, O. T.-C. Chen, T.-Y. Wang, and C.-L. Wang, “An Adaptive
DSP Processor for High-Efficiency Computing MPEG-4 Video Encoder,”
Proceedings of the 2004 International Symposium on Circuits and Systems,
vol. 2, pp. 157–160, May 2004.
[30] J. M. Guerrero, L. G. de Vicuna, J. Matas, J. Miret, and M. Castilla, “A
High-Performance DSP-Controller for Parallel Operation of Online UPS
Systems,” Applied Power Electronics Conference and Exposition, vol. 1,
pp. 463–469, 2004.
[31] “Texas Instruments,” http://www.ti.com, Last accessed Dec 20, 2004.
[32] “Motorola,” http://www.motorola.com, Last accessed Dec 20, 2004.
[33] “Agere Systems,” http://www.agere.com, Last accessed Dec 20, 2004.
[34] “Analog Devices,” http://www.analog.com, Last accessed Dec 20, 2004.
[35] “Programmable DSP chips and their software,”
http://www.bdti.com/faq/3.htm, Last accessed Dec 20, 2004.
[36] Z. Guo, W. Najjar, F. Vahid, and K. Vissers, “A Quantitative Analysis of
the Speedup Factors of FPGAs over Processors,” Proceeding of the 2004
ACM/SIGDA 12th International Symposium on Field Programmable Gate
Arrays, pp. 162–170, February 2004.
[37] G. D. Micheli, Synthesis and Optimization of Digital Circuits. McGraw-
Hill, Inc., 1994.
[38] “SystemC Community,” http://www.systemc.org, Last accessed Dec 14,
2004.
[39] S. Gupta, “Coordinated Coarse-Grain and Fine-Grain Optimizations for
High-Level Synthesis,” Ph.D. dissertation, University of California, Irvine,
School of Information and Computer Science, June 2003.
[40] “SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital
Circuits,” http://www.cecs.uci.edu/~sumitg/, Last accessed Dec 15, 2004.
[41] Altera, ASIC to FPGA Design Methodology & Guidelines, Application Note
311, Ver. 1.0, July 2003.
[42] “Cadence,” http://www.cadence.com, Last accessed Dec 24, 2004.
[43] “Synopsys,” http://www.synopsys.com, Last accessed Dec 24, 2004.
[44] P. R. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulka-
rni, A. Vandecappelle, and P. G. Kjeldsberg, “Data and Memory Optimiza-
tion Techniques for Embedded Systems,” ACM Transactions on Design
Automation for Embedded Systems (TODAES), vol. 6, no. 2, pp. 149–206,
April 2001.
[45] P. R. Panda, N. D. Dutt, and A. Nicolau, “Memory Data Organization for
Improved Cache Performance in Embedded Processor Applications,” ACM
Transactions on Design Automation for Embedded Systems (TODAES),
vol. 2, no. 4, pp. 384–409, April 1997.
[46] M. A. Miranda, F. V. M. Catthoor, M. Janssen, and H. J. D. Man, “High-
Level Address Optimizations and Synthesis Techniques for Data-Transfer-
Intensive Applications,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 6, no. 4, pp. 677–686, December 1998.
[47] P. R. Panda, N. D. Dutt, A. Nicolau, F. Catthoor, A. Vandecappelle,
E. Brockmeyer, C. Kulkarni, and E. D. Greef, “Data Memory Organiza-
tion and Optimizations in Application-Specific Systems,” IEEE Design &
Test of Computers, vol. 18, pp. 56–68, May 2001.
[48] M. Gokhale, J. Stone, and J. Arnold, “Stream-Oriented FPGA Computing
in the Streams-C High Level Language,” FCCM’00, pp. 49–56, April 2000.
[49] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, M. Haldar,
P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, and M. Walk-
den, “MATCH: A Matlab Compiler for Configurable Computing Systems,”
Northwestern University, Tech. Rep. CPDC-TR-9908-013, 1999.
[50] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, M. Haldar,
P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, M. Walkden,
and D. Zaretsky, “A Matlab Compiler for Distributed, Heterogeneous, Re-
configurable Computing Systems,” FCCM’00, pp. 39–48, April 2000.
[51] “AccelChip,” http://www.accelchip.com, Last accessed Jan 31, 2005.
[52] K. Bondalapati, P. C. Diniz, P. Duncan, J. Granacki, M. W. Hall, R. Jain,
and H. Ziegler, “DEFACTO: Design Environment For Adaptive Computing
TechnOlogy,” IPPS/SPDP Workshops, pp. 570–578, April 1999.
[53] B. So, M. W. Hall, and P. C. Diniz, “A Compiler Approach to Design Space
Exploration in FPGA-Based Systems,” Proceedings of the ACM Conference
on Programming Language Design and Implementation, pp. 165–176, June
2002.
[54] P. Kollig, B. M. Al-Hashimi, and K. M. Abbott, “FPGA Implementation
of High Performance FIR Filters,” Proceedings of 1997 IEEE International
Symposium on Circuits and Systems (ISCAS’97), vol. 14, pp. 2240–2243,
June 1997.
[55] R. H. Turner and R. F. Woods, “Highly Efficient, Limited Range Multipliers
for LUT-Based FPGA Architectures,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 12, no. 10, pp. 1113–1117, October
2004.
[56] M. Gokhale and J. Stone, “Automatic Allocation of Arrays to Memories
in FPGA Processors with Multiple Memory Banks,” FCCM’99, pp. 63–69,
April 1999.
[57] M. Weinhardt and W. Luk, “Memory Access Optimization and RAM
Inference for Pipeline Vectorization,” in Field-Programmable Logic
and Applications, P. Lysaght, J. Irvine, and R. W. Hartenstein,
Eds. Springer-Verlag, Berlin, 1999, pp. 61–70. [Online]. Available:
citeseer.ist.psu.edu/weinhardt99memory.html
[58] ——, “Memory Access Optimization for Reconfigurable Systems,” IEE Pro-
ceedings Computers and Digital Techniques, vol. 148, no. 3, pp. 105–112,
May 2001.
[59] P. Andersson and K. Kuchcinski, “Automatic Local Memory Architecture
Generation for Data Reuse in Custom Data Paths,” Proceeding of Engi-
neering of Reconfigurable System and Algorithm, 2004.
[60] J. Park and P. C. Diniz, “Synthesis and Estimation of Memory Interfaces for
FPGA-based Reconfigurable Computing Engines,” Proc. of the 2003 IEEE
Symposium on FPGAs for Custom Computing Machines (FCCM’03), pp.
297–299, April 2003.
[61] T.-L. Lee and N. W. Bergmann, “An Interface Methodology for Retar-
gettable FPGA Peripherals,” Engineering of Reconfigurable Systems and
Algorithms, pp. 167–173, July 2003.
[62] X. Liang and J. S.-N. Jean, “Mapping of Generalized Template Matching
onto Reconfigurable Computers,” IEEE Transactions on VLSI Systems,
vol. 11, no. 3, pp. 485–498, 2003.
[63] ——, “Memory Access Pattern Enumeration in GTM Mapping on Recon-
figurable Computers,” The International Conference on Engineering of Re-
configurable Systems Algorithms, pp. 8–14, June 2001, Las Vegas, Nevada,
USA.
[64] X. Liang and J. Jean, “Memory Access Scheduling and Loop Pipelining,”
IEEE Transactions on Very Large Scale Integration(VLSI) Systems, vol. 11,
no. 3, pp. 485–498, June 2003.
[65] X. Liang and Q. Malluhi, “Combinatorial Optimization in Mapping Gen-
eralized Template Matching onto Reconfigurable Computers,” Proceedings
of International Conference on Engineering of Reconfigurable Systems and
Algorithms, pp. 223–226, June 2006.
[66] X. Liang, J. Jean, and K. Tomko, “Data Buffering and Allocation in Map-
ping Generalized Template Matching on Reconfigurable Systems,” The
Journal of Supercomputing, Special Issue on Engineering of Reconfigurable
Hardware/Software Objects, pp. 77–91, 2001.
[67] C. Menn, O. Bringmann, and W. Rosenstiel, “Controller Estimation for
FPGA Target Architectures During High-Level Synthesis,” Proceedings of
the 15th international symposium on System Synthesis, pp. 56–61, 2002.
[68] S. Miller, “Enabling a Real-time Solution to Retinal Vascular Tracing Using
FPGAs,” Master’s thesis, Northeastern University, April 2004.
[69] H. Yu, M. Leeser, G. Tadmor, and S. Siegel, “Real-time Particle Image
Velocimetry for Feedback Loops Using FPGA Implementation,” Journal of Aerospace Computing, Information, and Communication, no. 2, pp. 52–62, 2006.
Annapolis Micro Systems Inc., “FireBird™ Hardware Reference Manual,”
in www.annapmicro.com, 2000.
[71] A. Can, H. Shen, J. N. Turner, H. L. Tanenbaum, and B. Roysam, “Rapid
Automated Tracing and Feature Extraction from Retinal Fundus Images
Using Direct Exploratory Algorithms,” IEEE Transactions on Information
Technology in Biomedicine, vol. 3, no. 1, March 1999.
[72] R. Adrian, “Particle-Imaging Technique for Experimental Fluid Mechan-
ics,” Annual Reviews in Fluid Mechanics, vol. 23, pp. 261–304, 1991.
[73] M. Raffel, C. Willert, and J. Kompenehans, Particle Image Velocimetry.
Berlin, Germany: Springer-Verlag, 1998.
[74] Dantec Dynamics A/S, “Principle of Particle Image Velocimetry,”
http://www.dantecdynamics.com/piv/Princip/Index.html, Last accessed
on May 09, 2006.
[75] H. Yu and M. Leeser, “Automatic Sliding Window Operation Optimization for FPGA-Based Computing Boards,” 14th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM’06), pp. 76–88, April 2006.
Appendix A
Glossary
SWO Sliding Window Operation
FPGA Field Programmable Gate Array
COTS Commercial Off-The-Shelf
SWOOP Sliding Window Operation Optimization
CCD Charge-Coupled Device
ASIC Application Specific Integrated Circuit
NRE Non-Recurring Engineering
CLB Configurable Logic Block
LUT Look Up Table
FF Flip-Flop
RISC Reduced Instruction Set Computer
HLS High Level Synthesis
GPP General Purpose Processor
ASP Application Specific Processor
SPP Special Purpose Processor
DSP Digital Signal Processor
CDFG Control/Data Flow Graph
EDA Electronic Design Automation
RTL Register Transfer Level
HDL Hardware Description Language
VHDL Very-High-Speed Integrated Circuit Hardware Description Language
MAA Memory Allocation and Assignment
DRAM Dynamic Random Access Memory
PLD Programmable Logic Device
EDIF Electronic Design Interchange Format
FIFO First In, First Out
FIR Finite Impulse Response
FFT Fast Fourier Transform
SoPC System on Programmable Chip
GTM Generalized Template Matching
MAP Memory Access Pattern
PF Packing Factor
II Initiation Interval
RF Region Function
FU Functional Unit
MPE Micro Processing Element
RVT Retinal Vascular Tracing
PIV Particle Image Velocimetry
HPF High-Pass Filter
LPF Low-Pass Filter
IF Interface
BlockRAM Block Random Access Memory
Appendix B
Equation Proof
The total external memory loading is defined as $L_{total}$:

\[
L_{total} =
\underbrace{\frac{N-q}{q-n+1}\left(\frac{M-p}{p-m+1}+1\right)(p \times q)}_{\text{when block moves left to right}}
+ \underbrace{\frac{M-p}{p-m+1}(p \times q)}_{\text{when block moves top to bottom}}
= \left(\frac{(N-q)(M-p)}{(p-m+1)(q-n+1)} + \frac{M-p}{p-m+1}\right)(p \times q)
= \frac{(N-n+1)(M-p)}{(p-m+1)(q-n+1)}(p \times q)
\tag{B.1}
\]

Compared to $M$ and $N$, $p$ and $q$ are small, and we assume $(M-p) \approx M$ and $(N-n+1) \approx N$. Therefore, the problem simplifies to minimizing $\frac{p \times q}{(p-m+1)(q-n+1)}$, where $p > m$, $q > n$, and $m, n > 1$. To minimize this quantity, we rewrite it as follows:

\[
\frac{p \times q}{(p-m+1)(q-n+1)}
= \frac{p \times q}{\left[p \times q + (m-1)(n-1)\right] - \left[p(n-1) + q(m-1)\right]}
\tag{B.2}
\]

The second part of the divisor satisfies

\[
p(n-1) + q(m-1) \geq 2\sqrt{p \times q \times (m-1)(n-1)}
\tag{B.3}
\]

where equality holds only when $p(n-1) = q(m-1)$. Now Equation (B.2) can be rewritten as

\[
\frac{p \times q}{(p-m+1)(q-n+1)}
\geq \frac{p \times q}{\left[p \times q + (m-1)(n-1)\right] - 2\sqrt{p \times q \times (m-1)(n-1)}}
= \left(\frac{\sqrt{pq}}{\sqrt{pq} - \sqrt{(m-1)(n-1)}}\right)^2
\tag{B.4}
\]

According to the constraint

\[
p \times q \times W_{pi} \leq \frac{B_{avail}}{2}
\tag{B.5}
\]

where $W_{pi}$ and $B_{avail}$ are defined in Table 3.1, we can further derive

\[
\frac{\sqrt{pq}}{\sqrt{pq} - \sqrt{(m-1)(n-1)}}
\geq \frac{\sqrt{\frac{B_{avail}}{2W_{pi}}}}{\sqrt{\frac{B_{avail}}{2W_{pi}}} - \sqrt{(m-1)(n-1)}}
\tag{B.6}
\]

where equality holds only when $p \times q = \frac{B_{avail}}{2W_{pi}}$. Putting Equations (B.4) and (B.6) together, we get

\[
\frac{p \times q}{(p-m+1)(q-n+1)}
\geq \left\{\frac{\sqrt{\frac{B_{avail}}{2W_{pi}}}}{\sqrt{\frac{B_{avail}}{2W_{pi}}} - \sqrt{(m-1)(n-1)}}\right\}^2
\tag{B.7}
\]

where equality holds only when $p(n-1) = q(m-1)$ and $p \times q = \frac{B_{avail}}{2W_{pi}}$.

To summarize, the optimal values of $p$ and $q$ are those that make both equality conditions true. Let $q_{opt} = \frac{B_{avail}}{2 W_{pi} \, p_{opt}}$ and substitute this into the other condition $p(n-1) = q(m-1)$:

\[
p_{opt}(n-1) = \frac{B_{avail}}{2 W_{pi} \, p_{opt}}(m-1)
\;\Rightarrow\;
p_{opt}^2 = \frac{B_{avail}(m-1)}{2 W_{pi}(n-1)}
\;\Rightarrow\;
p_{opt} = \sqrt{\frac{B_{avail}(m-1)}{2W_{pi}(n-1)}},\quad
q_{opt} = \sqrt{\frac{B_{avail}(n-1)}{2W_{pi}(m-1)}}
\tag{B.8}
\]

By selecting $p_{opt}$ and $q_{opt}$ according to Equation (B.8), we minimize $\frac{p \times q}{(p-m+1)(q-n+1)}$ and therefore minimize $L_{total}$, the total external memory access. A numerical sketch follows.
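The closed-form optimum of Equation (B.8) is straightforward to evaluate. The sketch below computes p_opt and q_opt for one set of parameters; the values used (a BlockRAM budget as in Table 4.11, 8-bit pixels, a 32 × 32 window) are illustrative assumptions, and in practice p and q would still be rounded and clipped to the image size.

```python
# Sketch: evaluate the optimal block dimensions from Equation (B.8),
# subject to p * q * Wpi <= Bavail / 2.
from math import sqrt

def optimal_block(b_avail, w_pi, m, n):
    p_opt = sqrt(b_avail * (m - 1) / (2 * w_pi * (n - 1)))
    q_opt = sqrt(b_avail * (n - 1) / (2 * w_pi * (m - 1)))
    return p_opt, q_opt

# e.g. 655,360 bits of on-chip memory, 8-bit pixels, a 32 x 32 window:
print(optimal_block(655360, 8, 32, 32))   # (~202.4, ~202.4) before rounding
```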