Optimizing Data Intensive Window-based Image
Processing on Reconfigurable Hardware Boards
A Dissertation Presented
by
Haiqian Yu
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
December 2006
© Copyright 2007 by Haiqian Yu
All Rights Reserved
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Optimizing Data Intensive Window-based Image Processing on Reconfigurable Hardware Boards.
Author: Haiqian Yu.
Department: Electrical and Computer Engineering.
Approved for Thesis Requirements of the Doctor of Philosophy Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Reader: Prof. Jennifer Dy Date
Thesis Reader: Prof. Eric Miller Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Dean: Prof. Allen L. Soyster Date
Copy Deposited in Library:
Reference Librarian Date
Abstract
FPGA-based computing boards are frequently used as hardware accelerators
for image processing algorithms with large amounts of computation and data
accesses. The current design process requires that, for each specific image pro-
cessing application, a detailed design must be completed before a realistic es-
timate of the achievable speedup can be obtained. However, users need the
speedup information to decide whether or not they want to use FPGA hardware
to accelerate the application. Quickly providing an accurate speedup estimation
becomes increasingly important for designers to make the decision without going
through the lengthy design process.
We present an automated tool, Sliding Window Operation OPtimization
(SWOOP), that generates an estimate of speedup for a high performance design
before detailed implementation is complete. SWOOP targets Sliding Window
Operations (SWOs). SWOOP provides a system block diagram of the final
design as well as an optimal memory hierarchy.
One of the contributions of this research is the automatic design of the on-
chip memory as a managed cache. The hardware setup we target can be viewed
as exploiting a cached memory system with the on-chip memory acting as an
L1 cache and the on-board memory acting as an L2 cache. However, unlike
most processors, no support for caching is provided. To minimize the number of
off-chip data accesses, the memory has to be carefully managed by the designer.
Our approach automatically determines the way the data should be accessed and
buffered in the on-chip memory to minimize the off-chip memory traffic. This
approach is applicable to any hardware architecture that contains a hierarchy
of memory outside of the normal caching structure.
SWOOP takes both the application parameters and FPGA board parameters
as input. The achievable speedup is determined by the area of the FPGA, or,
more often, the memory bandwidth to the processing elements. The memory
bandwidth to each processing element is a combination of bandwidth to the
FPGA and the efficient use of on-chip RAM as a data cache. SWOOP uses
analytic techniques to automatically determine the number of parallel processing
elements to implement on the FPGA, the assignment of input and output data
to on-board memory, and the organization of data in on-chip memory to most
effectively keep the processing elements busy. The result is a block layout of the
final design, its memory architecture, the estimated usage of different resources
and a measure of the achievable speedup.
Several manually designed applications including simple 2-D high-pass and
low-pass filters, template matching and 2-D cross-correlation have been used
to test the performance of SWOOP. Our experiments show that SWOOP can
quickly (less than a second) and accurately (less than 10% difference from man-
ual designs) estimate the maximum parallelism according to the applications
and constraints. The block layout of the final designs together with the mem-
ory architecture are near optimal with regard to performance. Moreover, since
SWOOP identifies where the tightest constraint to parallelism is found in a de-
sign, it can tell designers where to focus their efforts for further optimization.
Acknowledgements
I am glad to have this opportunity to thank all those who made this dissertation possible. First of all, I would like to express my deep gratitude to my advisor, Professor Miriam Leeser, for her stimulating suggestions and continuous guidance over the past five years. I have benefited a lot from her both technically and personally. I would like to thank Dr. Gilead Tadmor and Dr. Stefan Siegel, who gave me insightful suggestions on research and have been very patient in revising the papers. I also would like to thank Dr. Jennifer Dy and Dr. Eric Miller, who served on both my master's and doctoral committees.
I would like to thank all the members of the Reconfigurable Computing Laboratory; they built a friendly environment and made research much more enjoyable. I would like to thank Shawn Miller for letting me use some of his results in this dissertation.
I have been truly blessed with good friends. I would like to thank my friends at NEU, Sophia, Yiheng, Ping, Xiaojun, Wang, Janice, and Ting, for the interesting discussions during lunch breaks. I would like to thank Liying and Haidan for their long-lasting friendship over the past 10 years. I would like to thank Chuwei, Huajie, and their lovely son, Kaishu, for bringing me surprises all the time.
A special thanks goes to my family. I would like to thank my husband,
Mengxi, for his continuous encouragement which accompanied me through the
ups and downs of life. I would like to thank my son, Ethan; his lovely smile never failed to lift me up when I felt frustrated. Many thanks go to my parents and parents-in-law; without their help with the babysitting, this dissertation would have taken much longer to finish.
Contents
Abstract i
Acknowledgements iii
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 FPGA Based Computing Boards . . . . . . . . . . . . . 3
1.1.2 FPGA Structure . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . 8
1.1.4 Sliding Window Operations (SWOs) . . . . . . . . . . . . 9
1.2 Our Automated Tool . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . 13
2 RELATED WORK 14
2.1 General-purpose Processors . . . . . . . . . . . . . . . . . . . . 15
2.2 Application-specific Processors . . . . . . . . . . . . . . . . . . . 16
2.3 Special-purpose Processors . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Target Independent Optimization . . . . . . . . . . . . . 19
2.3.2 Targeting ASICs . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Targeting FPGAs . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Implementing SWOs on COTS FPGA boards . . . . . . . . . . 31
3 DESIGN TRADEOFFS 36
3.1 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Example Description . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Area Availability . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Principle Blocks . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 High-Pass Filter Example . . . . . . . . . . . . . . . . . 44
3.4 Memory Bandwidth Limitation . . . . . . . . . . . . . . . . . . 45
3.4.1 Upper Bound with No Buffering and Full Row Buffering 46
3.4.2 High-Pass Filter Example . . . . . . . . . . . . . . . . . 49
3.5 On-chip Memory Availability . . . . . . . . . . . . . . . . . . . 50
3.5.1 Block Buffering Method . . . . . . . . . . . . . . . . . . 51
3.5.2 Selecting p And q . . . . . . . . . . . . . . . . . . . . . . 54
3.5.3 High-Pass Filter Example . . . . . . . . . . . . . . . . . 56
3.6 Example Summary . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 SWOOP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 EXPERIMENTS 62
4.1 3× 3 High-Pass Filter and 5× 5 Low-Pass Filter . . . . . . . . 64
4.1.1 Algorithm and Parameters . . . . . . . . . . . . . . . . . 64
4.1.2 Results from SWOOP . . . . . . . . . . . . . . . . . . . 68
4.1.3 Comparison and Analysis . . . . . . . . . . . . . . . . . 69
4.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Retinal Vascular Tracing Algorithm . . . . . . . . . . . . . . . . 72
4.2.1 Algorithm and Parameters . . . . . . . . . . . . . . . . . 72
4.2.2 SWOOP Results . . . . . . . . . . . . . . . . . . . . . . 76
4.2.3 Comparison and Analysis . . . . . . . . . . . . . . . . . 78
4.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 Particle Image Velocimetry Algorithm . . . . . . . . . . . . . . 84
4.3.1 Algorithm and Parameters . . . . . . . . . . . . . . . . . 87
4.3.2 SWOOP Results . . . . . . . . . . . . . . . . . . . . . . 91
4.3.3 Comparison and Analysis . . . . . . . . . . . . . . . . . 92
4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 CONCLUSIONS 98
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography 103
Appendices 111
A Glossary 112
B Equation Proof 115
List of Tables
3.1 Parameter Definition . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Area Usage of High Pass Filter Blocks . . . . . . . . . . . . . . 45
4.1 Board Parameters and Application Parameters for HPF and LPF
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Area Usage of Low Pass Filter Blocks . . . . . . . . . . . . . . . 68
4.3 Duplication Factors According to Different Constraints . . . . . 68
4.4 Bound Dependent Variables . . . . . . . . . . . . . . . . . . . . 69
4.5 Comparison between Automatic and Manual Results for HPF and
LPF Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Board Parameters and Application Parameters for RVT Applica-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Enumeration of AIF and BIF for FireBird . . . . . . . . . . . . 77
4.8 Duplication Factor for Different Usage of External Memory Banks 78
4.9 Comparison between Automatic and Manual Results for RVT Al-
gorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.10 Comparison between Automatic and Manual Results for RVT Al-
gorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.11 Board Parameters and Application Parameters for PIV Applications 92
4.12 Duplication Factor for Different Usage of External Memory Banks 93
4.13 Comparison between Automatic and Manual Results for PIV Al-
gorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
List of Figures
1.1 Block Diagram of a Commercial FPGA Computing Engine . . 4
1.2 FPGA Structure based on Xilinx Virtex E . . . . . . . . . . . . 6
1.3 Example of Sliding Windowing Operation (window size = 3× 4) 10
3.1 Highpass Filter Example . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Block Diagram of a Commercial FPGA Computing Engine . . 42
3.3 Sequencing Graph of High Pass Filter . . . . . . . . . . . . . . . 44
3.4 Using Full Row Buffering Scheme . . . . . . . . . . . . . . . . . 47
3.5 Block Buffering Method Example . . . . . . . . . . . . . . . . . 52
3.6 Overlapping Loading and Processing . . . . . . . . . . . . . . . 55
3.7 SWOOP Flowchart . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 FireBird Block Diagram . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Highpass Filter Coefficients . . . . . . . . . . . . . . . . . . . . 64
4.3 Sequencing Graph of High Pass Filter . . . . . . . . . . . . . . . 65
4.4 Low-pass Filter Coefficients . . . . . . . . . . . . . . . . . . . . 66
4.5 Low-pass Filter MPE . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6 Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 Response Module . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Direction Module . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.9 Shifting the Window . . . . . . . . . . . . . . . . . . . . . . . . 80
4.10 Creating New Neighborhoods by Shifting Data on the FPGA . . 81
4.11 PIV System Overview (From Dantec Dynamics) . . . . . . . . . 85
4.12 Cross Correlation Plane . . . . . . . . . . . . . . . . . . . . . . 86
4.13 Velocity Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.14 Pipelined Structure . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.15 An Example of Sub-pixel Interpolation . . . . . . . . . . . . . . 89
Chapter 1
INTRODUCTION
Digital hardware has been used for speeding up computationally and data inten-
sive applications for decades due to its high degree of parallelism. Reconfigurable
digital hardware, specifically FPGAs, has become more and more popular be-
cause of its flexibility and short design cycle. A good candidate for reconfigurable
hardware speedup is digital image processing. Widely used either to improve pictorial quality for human interpretation or to extract information for autonomous machine perception, digital image processing is very computationally and data intensive, and software processing cannot meet the real-time requirements of many applications, especially high resolution image processing. Sliding window operations (SWOs) are among the most popular algorithms in digital image processing, and we focus on this class of algorithms in this dissertation.
Commercial Off-The-Shelf (COTS) FPGA based boards, with both FPGAs
and memory banks integrated on them, are often used to improve performance
of image processing algorithms. It would be extremely useful for designers to
be able to quickly estimate the maximum performance once the COTS board and the algorithm are known. Different methods [1, 2, 3, 4, 5, 6, 7] have been proposed to fulfill this task. Some are specific to a given algorithm [5, 6, 7] and cannot be extended to other algorithms, while others are too generic and thus cannot give an accurate estimate of maximum performance.
balances between these two extremes by providing an automated tool which
can quickly find the upper bound of the performance according to the available
resources of the COTS boards. Moreover, this method can be used for most
sliding window operation (SWO) applications, which are commonly found in
image processing.
In this chapter, a brief background introduction explains why we use FPGA-based computing boards to implement SWO applications. The contributions of our automated tool, Sliding Window Operation OPtimization (SWOOP), are also covered in this chapter.
1.1 Background
In this section, we first introduce a typical FPGA board and its structure as well
as why FPGAs can be used to implement different algorithms and gain speedup.
Then we cover the concept of memory hierarchy, which has usually been
ignored in FPGA design. SWOs are introduced and a small example
is given later in this section.
1.1.1 FPGA Based Computing Boards
With the development of charge-coupled devices (CCDs), high resolution images
can easily be acquired from digital cameras. Processing these images involves a
large amount of data. For most SWO applications, the algorithms are both data
intensive and computationally intensive. Software processing is slow and cannot
meet the speed requirements of some applications. Fortunately, SWOs are inher-
ently highly parallelizable and hardware implementations are favored in delay
sensitive applications. Using Application Specific Integrated Circuits (ASICs)
to accelerate these algorithms proves to be efficient and can greatly reduce the
processing time. However, the cost of an ASIC is extremely high. Reconfigurable
hardware based coprocessor boards can be used to flexibly implement similar al-
gorithms with a much shorter design cycle compared to ASICs. Reconfigurable
hardware can greatly reduce the Non-Recurring Engineering (NRE) costs and
the hardware can be adapted when the system architecture or application re-
quirements change. Another factor driving the surge of reconfigurable hardware
computing in image processing applications is the availability of standard system
solutions. Significant growth in the availability of COTS (commercial off-the-
shelf) reconfigurable computing engines, along with improvements in design-tool
capability make reconfigurable hardware more and more popular.
For these reasons, we focus on Field Programmable Gate Arrays (FPGAs).
Commercial reconfigurable coprocessor boards are composed of FPGA chip(s),
external memory bank(s), interfaces between memory banks and FPGA chips,
and a connection to a host processor. Figure 1.1 shows a typical computing
board with two FPGAs and five memory banks on it (based on the Annapolis
Micro Systems Inc. WildStar [8]). The interfaces and the number of memory
banks determine the maximum data flow between FPGA and memory. This
is called memory bandwidth. The main task for a hardware designer is to
maximally use the resources in the FPGA chips, intelligently organize the logical
memory structures according to their physical interconnections, and fully utilize
the memory bandwidth between the memory banks and FPGA chips for optimal
performance.
Figure 1.1: Block Diagram of a Commercial FPGA Computing Engine
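To make the notion of memory bandwidth concrete, the short calculation below multiplies the number of banks by the port width and the clock rate. The bank count, port width, and clock frequency are illustrative assumptions, not the WildStar's actual specifications.

    #include <stdio.h>

    /* Peak memory bandwidth = banks * (port width in bytes) * clock rate.
     * All numbers below are illustrative assumptions, not board specs. */
    int main(void) {
        const int banks = 5;          /* memory banks on the board   */
        const int port_bits = 32;     /* width of each bank's port   */
        const double clk_mhz = 50.0;  /* memory interface clock      */

        double bytes_per_cycle = banks * (port_bits / 8.0);
        double peak_mb_per_s = bytes_per_cycle * clk_mhz;  /* MB/s */

        printf("peak bandwidth: %.0f MB/s\n", peak_mb_per_s); /* 1000 MB/s */
        return 0;
    }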
One challenging aspect of reconfigurable hardware-based co-processor design
lies in the fact that different FPGA computing engines have different memory architectures, different FPGA chips, etc., which can lead to very different hardware designs. Traditionally, given an FPGA-based co-processor board and an image
processing algorithm, a hardware engineer analyzes the low level parallelism and
creates a design that meets the timing and resource constraints. A small change,
such as a change in the word length of the external memory bank on the board,
may result in a very different design. This process has to be repeated for each
different algorithm and each different board architecture. To avoid the time
consuming re-design process, we propose a generalized design method which can
lead to near optimal designs by defining the upper bounds of the design based
on different resource constraints. By picking the most critical constraint of the
design, we can maximally utilize the available resources. Moreover, the same ap-
proach can be used to modify the design when the resources change. Using this
design method for the FPGA implementation of SWO algorithms can greatly
reduce the time spent on hardware design, while at the same time obtaining a
high performance design with near optimal memory allocation and intelligent
memory hierarchy usage.
1.1.2 FPGA Structure
The key to the popularity of FPGAs is their ability to implement any circuit sim-
ply by being appropriately programmed. The three basic elements of an FPGA
are configurable logic blocks (CLBs), I/O blocks and programmable routing.
Figure 1.2 [9] shows the Xilinx Virtex FPGA architecture [10].
Each CLB has one or more lookup tables (LUTs) and several Flip-Flops
(FFs) as shown at the bottom right of Figure 1.2. Several LUTs and FFs are
grouped together to form a logic slice. The number of slices is often used to
Figure 1.2: FPGA Structure based on Xilinx Virtex E (panels: route map of a real FPGA chip; structure of a 2-slice Virtex-E CLB)
indicate the size of the FPGA area. Arbitrary logic functions can be imple-
mented by appropriately configuring the LUTs and connecting them through
programmable routing. Each I/O block can act as either an input pad or an
output pad as required by the circuit. An FPGA as a whole can therefore im-
plement digital circuits by mapping the functional units in the design onto logic
blocks. The final processing frequency of an FPGA depends on the depth of
the computation between Flip-Flops (FFs) and on routing wire delay.
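As a conceptual model of how a LUT implements arbitrary logic (a sketch, not a description of any particular Xilinx primitive), a k-input LUT is simply a 2^k-entry truth table addressed by its inputs:

    #include <stdint.h>
    #include <stdio.h>

    /* Conceptual model of a 4-input LUT: the 16-bit configuration word is
     * a truth table, and the four inputs select one bit of it.  Programming
     * the FPGA amounts to choosing the configuration bits. */
    static int lut4(uint16_t config, int a, int b, int c, int d) {
        int index = (d << 3) | (c << 2) | (b << 1) | a;  /* 0..15 */
        return (config >> index) & 1;
    }

    int main(void) {
        /* Config 0x8000 sets only entry 15, i.e. a 4-input AND gate. */
        printf("%d\n", lut4(0x8000, 1, 1, 1, 1));  /* prints 1 */
        printf("%d\n", lut4(0x8000, 1, 0, 1, 1));  /* prints 0 */
        return 0;
    }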
Modern FPGAs can implement million gate equivalent circuits. In addition
to CLBs, I/O blocks and programmable routing, modern FPGAs usually have
on-chip memory in the form of RAM blocks, and some may have embedded
multipliers or even complete RISC processors on the chip. These extra available
resources can further boost the performance of the implemented digital circuits.
In our research, we consider the structure shown in Figure 1.2 with additional
on-chip memory as our available FPGA hardware resources.
In addition to on-chip memory, FPGAs usually use external memory banks
(also called off-chip memory) to store extra data. External memory is par-
ticularly important for image processing applications because on-chip memory,
although faster than off-chip memory, is relatively small and cannot store all the
data. On-chip memory can be used as a buffer to store temporary data to avoid
slower data transfers from/to external memory. Memory bandwidth is defined
as the maximum data transfer speed between an FPGA and the external mem-
ory banks. In some cases, if we cannot transfer the data from/to the external
memory banks quickly enough, the system performance may be degraded.
1.1.3 Memory Hierarchy
Memory hierarchy is often discussed in computer architecture and is a useful
way to increase performance. The memory in a system can usually be arranged
in a hierarchy from the fastest (and lowest capacity) to the slowest (and highest
capacity) with the fastest staying closest to the central processor. By utilizing
the locality of reference property of memory access, effective management of the
memory hierarchy can greatly improve system performance and give the appear-
ance that the system has the fastest memory with the highest capacity [11].
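The standard average memory access time (AMAT) model from the computer architecture literature quantifies this effect; the cycle counts and miss rate below are illustrative assumptions.

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty (standard two-level model).
     * The cycle counts and miss rate below are illustrative assumptions. */
    int main(void) {
        double hit_time = 1.0;       /* cycles to access the fast memory  */
        double miss_penalty = 20.0;  /* extra cycles to reach slow memory */
        double miss_rate = 0.05;     /* fraction of accesses that miss    */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.2f cycles\n", amat);  /* 2.00: close to the fast memory */
        return 0;
    }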
Memory hierarchy management had not been considered by FPGA designers until on-chip memory became available on FPGAs. Now it attracts more and more interest, especially for data-intensive applications. FPGA chips are usually connected to external memory banks, but if the data had to be read from or written to external memory banks on every access, the performance of the design would be greatly degraded because accessing external memory is usually slow.
hierarchy becomes very important. The smaller but faster on-chip memory, if
organized properly, can be used as a buffer to store temporary or repeatedly
used data so that off-chip memory accesses are reduced. The ability to reduce
off-chip memory accesses depends on the on-chip memory size and the algorithm
itself. FPGA designers need to carefully analyze data access patterns for each
application to get a specific optimized memory architecture. In our research, we
find that for most SWO applications, there exists a common data access pattern.
Therefore, we can build a memory architecture for all instances of this type of
application. The details will be discussed in Section 3.5.
1.1.4 Sliding Window Operations (SWOs)
Many spatial domain methods for image processing can be summarized as fol-
lows:
g(x, y) = T [f(x, y)] (1.1)
where f(x, y) is the input image, g(x, y) is the processed image, and T is
an operator on f, defined over some neighborhood of (x, y) [12]. This spatial
domain processing is widely used in image processing. Image averaging, smooth-
ing, sharpening and convolution all belong to this category. Figure 1.3 shows
an example of spatial domain processing. In this example, the T operator is
moving in raster-scan order. This type of operation is widely used in digital image processing and is called a sliding window operation (SWO). Others have studied 1-D SWOs [13], concentrating on 1-D FIR/IIR signal processing.
We are interested in the more general case of 2-D digital signal processing.
Figure 1.3: Example of Sliding Windowing Operation (window size = 3× 4)
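To make Equation (1.1) and the raster-scan order of Figure 1.3 concrete, the following sketch slides a 3 × 4 window over a small image. The averaging operator and image size are illustrative choices, not an application from this dissertation.

    #include <stdio.h>

    #define ROWS 8
    #define COLS 8
    #define WR   3   /* window rows    */
    #define WC   4   /* window columns */

    /* Slide a WRxWC window over f in raster-scan order and write one output
     * pixel per window position (here T is a simple average; any operator
     * defined over the neighborhood could be substituted). */
    static void swo(int f[ROWS][COLS], int g[ROWS - WR + 1][COLS - WC + 1]) {
        for (int y = 0; y <= ROWS - WR; y++) {     /* window moves row by row */
            for (int x = 0; x <= COLS - WC; x++) { /* ...then column by column */
                int sum = 0;
                for (int i = 0; i < WR; i++)
                    for (int j = 0; j < WC; j++)
                        sum += f[y + i][x + j];
                g[y][x] = sum / (WR * WC);         /* g(x, y) = T[f(x, y)] */
            }
        }
    }

    int main(void) {
        int f[ROWS][COLS] = {{0}};
        int g[ROWS - WR + 1][COLS - WC + 1];
        f[4][4] = 120;                      /* a single bright pixel */
        swo(f, g);
        printf("g[2][1] = %d\n", g[2][1]);  /* windows covering (4,4) see it */
        return 0;
    }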
1.2 Our Automated Tool
Sliding Window Operation OPtimization, the automated tool we present in this
dissertation, provides a good solution to quickly implement SWO applications
on COTS FPGA boards. SWOOP takes FPGA board parameters as well as
SWO application parameters as inputs. The outputs are:
1. A block diagram of the SWO implementation.
2. Three upper bounds according to different resource constraints. The tightest upper bound will be selected for the actual implementation, which gives the implementation near-optimal performance given the available resources (see the sketch after this list).
3. A near-optimal memory structure tailored to the current SWO application and the given FPGA board. The memory organization maximally reduces redundant external memory accesses by fully utilizing the available internal memory of the FPGA.
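The following sketch models only the second output, the selection of the tightest bound. The struct fields, names, and numbers are illustrative assumptions, not SWOOP's actual interface.

    #include <stdio.h>

    /* Hypothetical sketch of how SWOOP's three bounds combine: one
     * parallelism bound per resource, and the tightest one determines the
     * duplication factor that is actually implemented.  The field names
     * and example numbers are illustrative assumptions. */
    typedef struct {
        int bound_area;        /* bound from available FPGA area       */
        int bound_bandwidth;   /* bound from external memory bandwidth */
        int bound_onchip_ram;  /* bound from on-chip memory size       */
    } swoop_bounds;

    static int tightest(swoop_bounds b) {
        int m = b.bound_area;
        if (b.bound_bandwidth < m) m = b.bound_bandwidth;
        if (b.bound_onchip_ram < m) m = b.bound_onchip_ram;
        return m;  /* number of parallel processing elements to implement */
    }

    int main(void) {
        swoop_bounds b = { 12, 8, 10 };  /* made-up example bounds */
        printf("duplication factor = %d\n", tightest(b));  /* 8: bandwidth-bound */
        return 0;
    }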
After applying SWOOP to a specific application, the designer still has control
over how to implement the algorithm. SWOOP tells the designer what the near-optimal design is, but the designer can make adjustments depending on other factors such as cost, time-to-market, etc.
1.3 Contributions
The contributions of this research are:
1. An automated tool, Sliding Window Operation OPtimization (SWOOP).
SWOOP takes both the size of the SWO and the board parameters as
input. Using analytic techniques, SWOOP automatically determines the
number of parallel processing elements to implement on the FPGA, the
assignment of input and output data to on-board memory, and the orga-
nization of data in on-chip memory to most effectively keep the processing
elements busy. The result is a block layout of the final design, its memory
architecture, the estimated usage of different resources, and a measure of
the achievable speedup.
2. The analytical representation of three upper bounds. Knowing the possible
speedup is very important for a hardware designer. The upper bounds
provided by our research can be used to quickly estimate whether or not
the COTS implementation can meet the application’s requirements before
the actual implementation is complete.
3. A new buffering method suitable for SWO based applications. The block
buffering method can maximally use the available on-chip memory to re-
duce the external memory accesses. Using this buffering method can help
designers build an efficient memory hierarchy architecture for SWO based
applications.
4. The calculation of the upper-bound of the parallelism subject to on-chip
memory size. Once the size of the block is determined by using the block
buffering method, an analytical representation of the upper bound subject to on-chip memory size is given. This upper bound can be combined
with the other two upper bounds to determine the maximum performance.
Several manually designed applications have been used to test the performance
of SWOOP. Our experiments show that SWOOP can quickly (less than a second)
and accurately (less than 10% difference from manual designs) estimate the
maximum parallelism according to the applications and constraints.
Our approach automatically determines the way data should be accessed and
buffered in the on-chip memory to minimize the off-chip memory traffic. This
approach is applicable to any hardware architecture that contains a hierarchy
of memory outside of the normal caching structure. This is true of FPGA
architectures as well as tiled architectures such as the Cell multiprocessor from
IBM [14].
Note that this design flow can be applied not only to FPGA designs, but also
to other hardware designs with area, internal and external memory constraints.
By using different libraries, the method we present here can be modified so that it
can be used for other styles of hardware. The block buffering method we present
here has been previously explored in the context of motion estimation [15, 16].
In these approaches the size of the buffer is the size of the search area. Our block
buffering method differs in that the block size is selected based on the available
on-chip memory in order to reduce the total external memory accesses.
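To illustrate why buffering a block and reusing its interior pixels cuts off-chip traffic, the sketch below counts external reads under a simple model: a W × W window over an N × N image, with each p × q block loaded exactly once and adjacent blocks overlapping by W − 1 rows or columns. The numbers and the cost model are illustrative, not the dissertation's exact formulation (Chapter 3 derives the actual bounds).

    #include <stdio.h>

    /* Illustrative model: a p x q on-chip block yields (p-W+1)*(q-W+1)
     * output windows for a WxW SWO, so blocks overlap by W-1 and each
     * block costs p*q external reads. */
    static long ceil_div(long a, long b) { return (a + b - 1) / b; }

    static long block_buffer_reads(long n, long w, long p, long q) {
        long out = n - w + 1;               /* output size per axis */
        long bx = ceil_div(out, p - w + 1); /* blocks per column    */
        long by = ceil_div(out, q - w + 1); /* blocks per row       */
        return bx * by * p * q;             /* one load per block   */
    }

    int main(void) {
        long n = 512, w = 5;
        long windows = (n - w + 1) * (n - w + 1);
        printf("no buffering:       %ld reads\n", windows * w * w); /* 6451600 */
        printf("64x64 block buffer: %ld reads\n",
               block_buffer_reads(n, w, 64, 64));                   /* 331776 */
        printf("full row buffering: %ld reads\n", n * n);           /* 262144 */
        return 0;
    }

Block buffering approaches the one-read-per-pixel ideal of full row buffering while needing only a block, rather than full image rows, to fit in on-chip memory.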
1.4 Dissertation Structure
A summary of the related work in this area is presented in Chapter 2. Chap-
ter 3 presents a detailed explanation of how we estimate the three upper bounds
subject to different constraints. Four different applications are described in
Chapter 4 and a detailed analysis of the results is provided to prove the cor-
rectness and efficacy of our method. Chapter 5 concludes the dissertation and
closes with thoughts about future work.
Chapter 2
RELATED WORK
Co-processor boards embedded within larger computer systems are very popular
in embedded computing. Compared to other non-embedded computing systems,
they are usually tightly constrained, reactive and real-time [17]. Therefore,
when designing embedded systems, we need to pay much more attention to
performance. Moreover, the time-to-market constraint has become more and
more demanding and can influence a design process dramatically. Of course, there are other design metrics to consider, including NRE cost, size, and power.
Three types of processor technology are commonly used when implementing
an embedded system: general-purpose processors (GPPs), application-specific
processors (ASPs) and special-purpose processors (SPPs). A GPP is a pro-
grammable device that is suitable for a variety of applications to maximize the
number of devices sold. An SPP is defined as a digital circuit designed to ex-
ecute exactly one program. An ASP is a compromise between a GPP and an
SPP; it is a programmable processor optimized for a particular class of applications having common characteristics [17]. The design processes for different target technologies are very different. Although the design cycles for GPPs and ASPs are relatively shorter than for SPPs, one disadvantage of these technologies is processing speed. We present a brief discussion of GPPs and ASPs and concentrate most of our attention on SPPs, the class that includes FPGAs.
2.1 General-purpose Processors
The basic architecture of a GPP consists of a data-path, a control unit and a
memory interface. For each instruction, GPPs typically need to go through 5
steps: instruction fetch, operand fetch, execute, memory access and operand
store. The generalized steps make sure that GPPs can be used for different
applications, but at the same time greatly reduce their efficiency. The designer
of a GPP usually builds a programmable device associated with an instruction
set architecture and usually does not know what kind of application will run
on the GPP. So GPP designers try to construct a proper computer architecture
by deciding the optimal number of stages of pipeline and building an efficient
memory hierarchy so that for most applications the data can flow smoothly
and consistently. A lot has been published about computer architecture design;
[11, 18, 19] give a good overview.
An embedded system designer has a totally different task than a GPP de-
signer. Usually, given a specific GPP, embedded system designers are only con-
cerned with writing efficient code for the GPP. Code rewriting techniques, con-
sisting of loop transformations or data flow conversion, are an essential part
of modern optimizing and parallelizing compilers. By exploring the inher-
ent parallelism and enhancing the temporal and spatial locality of the algo-
rithms [20, 21, 22], a designer can improve the performance of an application by
more efficiently using the available CPU pipeline stages and cache space. Code
rewriting can also be done automatically through compiler optimization. Much work has been done in this area, with the aim of finding an efficient translation mechanism so that compiled code best fits the GPP structure [23, 24, 25].
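As a simple instance of such a rewrite, assuming C's row-major array layout, loop interchange turns a cache-hostile traversal into one with good spatial locality:

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    /* Naive version: the inner loop strides down columns, touching a new
     * cache line on almost every access. */
    static double sum_column_order(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    /* After loop interchange: the inner loop walks consecutive addresses,
     * so each cache line fetched is fully used (spatial locality). */
    static double sum_row_order(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        /* Same result either way; only the access pattern differs. */
        printf("%.0f %.0f\n", sum_column_order(), sum_row_order());
        return 0;
    }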
A GPP is an economical and quick solution for applications without compli-
cated data processing. But when the computational requirements of an application increase, a GPP's disadvantages, especially slow processing speed due to
limited parallelism, become more and more intolerable. An alternative solution
is using ASPs or SPPs.
2.2 Application-specific Processors
Micro-controllers and Digital-signal Processors (DSPs) are two types of com-
monly used ASPs. Unlike GPPs, their structures and instruction sets are op-
timized for a particular class of applications. However, the design process for
ASPs is similar to GPPs. A designer’s task is to write code which can run ef-
ficiently on the available ASP. Micro-controllers are widely used in embedded
control applications such as monitoring or single-bit control signal setting. They
are not suitable for performing large amounts of data computation. Modern sig-
nal processing applications involve a lot of data transfer and computation, thus
DSPs become a good candidate for this type of algorithm. DSPs are widely used
in application domains including communications [26, 27], image processing [28],
audio and video processing [29], power control [30], etc. The important differ-
ence between a DSP and a GPP is that a DSP processor is designed to support
high-performance, repetitive, numerically intensive tasks. Different DSPs from
different manufacturers or even from the same manufacturer may have different
specific features. Four major DSP chip manufacturers are Texas Instruments,
with the TMS320C2000, TMS320C5000, and TMS320C6000 series of chips [31];
Motorola, with the DSP56300, DSP56800, and MSC8100 (StarCore) series [32];
Agere Systems (formerly Lucent Technologies), with the DSP16000 series [33];
and Analog Devices, with the ADSP-2100 and ADSP-21000 ("SHARC") se-
ries [34]. Most DSPs have common features which make their high performance
in data processing applications possible [35]:
1. DSPs can complete a multiply-accumulate operation in one clock cycle.
High-performance DSPs often have two or more multipliers that enable
two multiply-accumulate operations per instruction cycle (see the FIR sketch after this list).
2. DSPs can provide specialized addressing modes, such as pre- and post-
modification of address pointers, circular addressing, and bit-reversed ad-
dressing.
3. Most DSPs provide various configurations of on-chip memory and periph-
erals tailored for DSP applications. DSPs generally feature multiple-access
memory architectures that enable DSPs to complete several accesses to
memory in a single instruction cycle.
4. Usually, DSP processors provide a loop instruction that allows tight loops
to be repeated without spending any instruction cycles for updating and
testing the loop counter or for jumping back to the top of the loop.
5. DSP processors are known for their irregular instruction sets, which gen-
erally allow several operations to be encoded in a single instruction. For
example, a processor that uses 32-bit instructions may encode two addi-
tions, two multiplications, and four 16-bit data moves into a single instruc-
tion. In general, DSP processor instruction sets allow a data move to be
performed in parallel with an arithmetic operation. GPPs, in contrast,
usually specify a single operation per instruction.
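To see what these features buy, consider the inner loop of an FIR filter, a canonical DSP kernel. The C version below is illustrative: on a DSP, features 1, 4, and 5 let each iteration's multiply-accumulate, operand fetches, and loop bookkeeping retire in a single cycle, while a GPP issues each operation separately.

    #include <stdio.h>

    /* Inner loop of an FIR filter: one multiply-accumulate (MAC) per tap. */
    static int fir(const short *x, const short *h, int taps) {
        int acc = 0;
        for (int k = 0; k < taps; k++)
            acc += x[k] * h[k];   /* the MAC a DSP performs in one cycle */
        return acc;
    }

    int main(void) {
        short x[4] = { 1, 2, 3, 4 };   /* input samples */
        short h[4] = { 1, 1, 1, 1 };   /* filter taps   */
        printf("%d\n", fir(x, h, 4));  /* prints 10 */
        return 0;
    }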
Compared to GPPs, ASPs are more suitable for digital signal processing
applications. However, ASPs essentially still provide sequential processing of
each instruction and cannot fully exploit the parallelism inherent in the
algorithm. For delay sensitive applications, SPPs may be a better choice.
2.3 Special-purpose Processors
An SPP is a digital circuit designed for a specific application. Compared to a
GPP or an ASP, it is much more efficient because it is tailored specifically for the
application and much overhead can be avoided [36]. However, the design cycle
for an SPP is much longer because all the details need to be determined by the
designer. Unlike GPP and ASP design which involves mostly software coding,
SPP design is a hardware design process that can be implemented in full-custom
hardware or programmable logic depending on the application requirements.
For SPP designers, a critical task is to maximize the parallelism while at the
same time keeping the design cycle short. To fulfill this task, different design
methodologies have been proposed and they are catalogued in the following
sections.
2.3.1 Target Independent Optimization
High Level Synthesis (HLS) is a useful tool for target-independent optimization
of SPP designs. It can take a description of a design in a Hardware Descrip-
tion Language (HDL) or other high level language (Matlab, C, C++ etc.) and
synthesize it into a low level HDL description or intermediate form. High level
language compilers usually capture the behavior using a control/data flowgraph
(CDFG) [37]. Starting from these CDFGs, compilers construct the macroscopic
structure of a digital circuit; this process is called architectural synthesis and op-
timization. The output of architectural synthesis is sent to logic level synthesis
and optimization.
SystemC [38] is a popular open source high level description language. Its
community consists of a growing number of system design companies, semicon-
ductor companies, IP providers and Electronic Design Automation (EDA) tool
vendors. SystemC is built on standard C++; the core language consists of mod-
ules and ports for representing structure. A set of data types with various data
widths are provided by the library for hardware modeling. Designers use C++
as an input language and can develop and change the system-level models very
fast. To some extent, hardware design is more like software coding with the
help of HLS tools, except that the designer must be aware that all the modules
work in parallel instead of sequentially. Once the system-level module is built, a
designer can use tools that support SystemC to simulate the design, verify the
functional correctness, and translate the C++ language into Register-Transfer-
Level (RTL) VHDL code. The lower level VHDL code can be further combined
with different libraries for different target implementations.
An example of an HLS project is SPARK [39] by Sumit Gupta et al. It is a C-
based HLS framework which employs a set of parallelizing compiler techniques
and synthesis transformations to improve the quality of high-level synthesis re-
sults. The SPARK methodology is particularly targeted to control-intensive
microprocessor functional blocks and multimedia applications. Gupta claims
that “SPARK takes behavioral ANSI-C code as input, schedules it using spec-
ulative code motion and loop transformations, runs an interconnect-minimizing
resource binding pass and generates a finite state machine for the scheduled
design graph. Finally, a backend code generation pass outputs synthesizable
register-transfer level (RTL) VHDL. This VHDL can then be synthesized using
logic synthesis tools into an ASIC or be mapped onto a FPGA.”[40].
The advantage of using HLS tools is that they can expedite the design pro-
cess by shifting the burden of hardware design to the tools. A good HLS tool can
provide a platform for easy simulation, verification and synthesis. The problem
with using a high level input language is its inefficiency and the poor memory
architecture of the resulting design. The efficiency of HLS tools is highly de-
pendent on the compiler used and currently no compiler can achieve the same
performance as implementations written directly in the VHDL or Verilog lan-
guage. Moreover, being target independent means the tools are not aware of
the overall physical or logical layout of the hardware and their I/O or memory
connections. High level language compilers, in most cases, cannot optimally al-
locate memory and build the memory hierarchy. Consequently, the performance
of designs generated with HLS tools tends to degrade for data intensive appli-
cations. These problems can be solved with target specific optimization, where,
for different target technologies, there are different optimization method. These
will be introduced in the following sections.
2.3.2 Targeting ASICs
Once the target technology is chosen, we can further optimize the design for
high performance or use target-specific automated tools to speed up the design process. For ASIC design, additional design metrics are considered, such as power consumption, delay constraints, and area limitations, or a combination of these. Depending on the different design metrics, designers may choose different
optimization methods; we list some of these in this section.
Optimization for High Performance
One of the primary reasons for using ASICs to implement algorithms instead of
using GPPs or ASPs is the high performance of ASICs. By exploiting parallelism
at different granularities and customizing at different layers, a full custom ASIC
design can yield a high density and high performance product at the cost of a
much longer design process. ASIC’s high performance comes from combining
both fine-grained parallelism and coarse-grained parallelism to achieve a high
performance implementation. In order to explore the inherited fine-grained par-
allelism of the application, designers need to trace the data flow step by step to
figure out a corresponding efficient hardware structure. Performance optimiza-
tion can be further done iteratively at different layers under different constraints
such as area, power, timing, etc. Although design tools can help designers to
automatically explore the trade-offs, often it is still a designer's task to manu-
ally optimize the circuits for each application especially when some constraints
cannot be met by the tools. It is usually a tedious and tricky job and requires
a lot of experience. It is difficult to get an early estimate of the quality of the RTL, whether the timing constraints are achievable, and whether the netlist is feasible. Mistakes made early in the process do not appear until after
place and route.
To summarize, ASIC optimization is not a trivial task and most of the opti-
mizations are application specific. All the constraints have to be considered and
balanced to get a final viable implementation. ASIC design is very expensive
and is typically used for very high volume design or designs where low power is
essential.
Design Automation
ASIC design has very long turnaround times and high NRE costs. High NRE
costs are inevitable due to the complexity of the ASIC design and manufacturing
process. Using design automation tools can greatly help the designer to shorten
the time-to-market cycle. The HLS tools we mentioned in Section 2.3.1 can
be used to implement designs targeting ASICs by manually adding annotations
and constraints during the automatic synthesis process. Moreover, HLS needs
to be combined with libraries provided from different ASIC vendors to get the
final design. Modern ASIC design involves placement and physical optimization,
clock tree synthesis, signal integrity and routing. Without good EDA tools, it
is almost impossible to get a working design [41]. Most of these tools, such as
those from Cadence [42] and Synopsys [43], are very expensive, which is another reason for the high NRE cost of ASIC design.
Optimization for Memory Access
Memory allocation and assignment becomes a critical issue when the applica-
tion involves a large amount of data transfer. ASIC designers need to choose the
optimal memory size and the optimal memory port width so that these will not
be a bottleneck to high speed processing. This results in a large design space
to explore. Panda et al. give a good summary of research in this area [44].
It summarizes the automatic memory allocation and assignment (MAA) meth-
ods in High Level Synthesis (HLS) and also discusses the memory hierarchy
related optimizations of embedded systems by taking advantage of spatial and
temporal locality [45]. Miranda et al. propose a methodology which can au-
tomatically generate the memory addresses for data-transfer-intensive applica-
tions [46]. Other researchers [47] introduce several optimization strategies for
application-specific systems using different memory architecture: data cache,
scratch-pad memory, custom memory architecture and dynamic random-access
memory (DRAM). These methods can improve the overall performance for some
types of applications, but the algorithms for deciding memory size and optimiz-
ing the memory hierarchy are complicated. ASIC designers need to adapt these
methods for specific applications.
2.3.3 Targeting FPGAs
An alternative for hardware implementation is using programmable logic devices
(PLDs). One type of complex PLD, which has grown very rapidly in popularity
over the past decade, is the field programmable gate array (FPGA). Compared
to ASIC design, the back-end design for FPGA devices is very simple and the
time-to-market is much shorter than for ASICs. Designers are therefore often
opting to use FPGA devices, either for the entire life of a product, if applications
require only a few tens of thousands of devices, or for prototyping and volume
ramp-up. Once volume production shows that a design is stable, engineers can
port the design to an ASIC device.
Design Automation
One major feature of FPGA technology is that it is re-programmable and gives
the user the option to develop an electronic design with ease, update their system
in the field, and test and verify the implementation quickly. Design automation
can further shorten the design cycle and is widely used in FPGA design.
The HLS tools mentioned in Section 2.3.1 can also be used for FPGA de-
sign with the aid of low level FPGA targeted synthesis tools. There also exist
some HLS tools specifically targeting FPGA designs. By knowing the target
technology at an earlier stage, tools can direct the optimization process more
efficiently.
JHDL [1, 2] is a set of FPGA CAD tools developed at Brigham Young Univer-
sity’s Configurable Computing Laboratory. It is a “structurally based Hardware
Description Language (HDL) implemented with JAVA”. The JHDL project is
an exploratory attempt to identify the key features and functionality of good
FPGA tools. Its aim is to provide a simulation and debug environment for both
the host code and FPGA design so that the design and debug process is easier.
Handel-C [3] is the language used by the commercial HLS tools from Celoxica.
It is based on ANSI-C and simple constructs are added which make it possible for
the design suite to compile algorithms directly into EDIF netlists. The output is
optimized for a target FPGA device. Celoxica also provides a design suite called
DK which can provide multiple language co-simulation for C, C++, SystemC,
SpecC, Handel-C, VHDL and Verilog.
SA-C (Single Assignment C) [4] takes a C-like programming language as in-
put and uses its own compiling tools to translate C into VHDL. In SA-C, the code
is automatically separated into inner loops, which are run on the FPGA, and the
remainder of the code, which runs on the host. The SA-C compiler assumes the
on-chip processing clock is the same as the I/O clock. The designer can ignore
timing information and the design process is closer to software programming.
The SA-C compiler can successfully map image processing algorithms onto an
FPGA, achieving up to an 800 fold speed-up over a Pentium III for complicated
image processing applications. But for simpler image operators, such as SWOs,
only a speedup of 10 times or less has been achieved.
Streams-C [48] is an intermediate approach between HLS and low level syn-
thesis tools. The target machine is an attached FPGA based computing board to
the host. The compiler can translate the C-like input code into clock-cycle level
design of hardware circuits. The compiler also pipelines computation and man-
ages stream synchronization. Although it cannot achieve the same performance
as hand-crafted designs, the final speedup is about a factor of 10.
The MATCH (MATlab Compiler for distributed Heterogeneous computing
systems) compiler project [49, 50] at Northwestern University aims to make it
easier for users to develop efficient code for configurable computing systems.
It is an experimental prototype of a software system that maps Matlab functions
into RTL VHDL for FPGA synthesis. MATCH was later commercialized by Ac-
celChip and the name was changed to AccelFPGA synthesis tools. AccelFPGA
reads Matlab and Simulink files and outputs synthesizable VHDL and Verilog
in RTL that has been optimized for an FPGA. The tool also creates simulation
models for bit-true verification, eliminating the need to create test benches for
DSP algorithms [51].
DEFACTO (Design Environment For Adaptive Computing TechnOlogy) [52,
53] is another compilation and synthesis system targeting FPGAs. The tool can
automatically do the design space exploration and select an implementation that
closely matches the performance of the fastest design in the design space. More-
over, DEFACTO defines a balance metric for guiding design space exploration so
that both memory bandwidth and hardware resources can be maximally utilized.
All of these design automation tools aim to raise the level of abstraction in
hardware design to simplify the design process and therefore shorten the design
cycle. However, compared to hand-written VHDL/Verilog design, the efficiency
of the designs from these automation tools can still be improved. Furthermore,
these automated design tools cannot efficiently use the on-chip memory of the
FPGA and therefore performance is degraded for data intensive applications.
Optimization for High Performance
Similar to ASIC design, FPGA design also combines both fine-grained paral-
lelism and coarse-grained parallelism to get speedup over GPPs or ASPs. Un-
like ASIC’s full custom design, FPGA configures built-in logic blocks and switch
boxes to implement algorithms. For the same algorithms, the FPGA implemen-
tation is usually larger and slower than the ASIC implementation. Recently,
the availability of multi-million gate FPGAs with clock speed in the hundreds
of MHz enable FPGA designers to implement very complicate algorithms in
FPGAs with performance comparable to ASICs.
Optimization of FPGA design performance can be divided into two cate-
gories: application-independent optimization and application-dependent opti-
mization. Application-independent optimization includes optimizing commonly
used functional units or macros such as FIFOs, adders, multipliers, FIR filters,
FFT etc. [54, 55]. Application-dependent optimization needs to consider not
only the design of the FPGAs, but also the layout of the whole system including
sensors, memories etc. [5, 6, 7]. For application-dependent optimization, there
are no general rules and designers must carefully select optimization methods
according to the application itself as well as the system structure in order to get
high performance.
Memory Hierarchy
In the mid 1980s, FPGA manufacturers offered only a few thousand gates per
chip and clock speeds of a few MHz, but now, manufacturers not only build
multi-million-gate FPGAs, but also integrate on-chip memories (block RAM)
into FPGAs. These small but fast on-chip memories can be used to build a
memory hierarchy that reduces the number of slower external memory accesses.
Data-intensive applications, which require frequent and large amounts of data
transfer, require an intelligent memory architecture to achieve high throughput.
A lot of research has been done on building an efficient memory architecture
according to the analysis of the algorithms to be implemented.
Gokhale et al. [56] present an algorithm to assign data automatically to mem-
ories to produce minimum overall execution time of the loops in the algorithm.
Instead of searching the exponential search space, this algorithm uses an implicit
enumeration method to reduce the search space. The limitation of this paper
is it only considers external memory bank allocations and ignores the on-chip
memories of the FPGA.
Weinhardt and Luk [57, 58] discuss a memory access optimization method
for FPGA-based reconfigurable systems with a hierarchy of on-chip and off-chip
(external) memory to speed up applications limited by memory access speed.
They focus on loop nests and map all data processing in the inner loop to a
data path. Data dependence analysis is used to find legal loop unrollings. The
optimization method used here is quite general and therefore it can only achieve
around 10 times speedup over software. This speedup is not sufficient for some
applications.
Andersson and Kuchcinski [59] present an automatic local memory archi-
tecture generation method for FPGAs with embedded CPUs, called System on
Programmable Chip (SoPC). They exploit data reuse and duplicate commonly used data in memories close to the data path. The optimization algorithm is
rather complicated because it has to first partition the task between the em-
bedded CPU and custom logic. Moreover, similar to [57, 58], performance is
sacrificed because it is a high-level hardware compiler.
Another approach is to optimize the memory interface and controller between
FPGAs and external memory banks. In [60], Park et al. describe a set of
parameterizable memory interface designs for both SRAM and SDRAM memory
technologies. The interface modules are suitable for a wide variety of designs with pipelining and page-mode memory operations. Lee et al. [61] present a
new methodology that can automate the connection of an Intellectual Property
(IP) block to a wide variety of interface architectures with low overhead. This
research was still at an early stage without many results.
2.3.4 Summary
There are other target technologies that can be used to implement SPPs such
as CPLDs, structured ASICs etc. Full custom ASICs and FPGAs are the most
popular ones. As we have discussed, they have their own advantages and dis-
advantages for implementing an algorithm. Designers can use automated tools
to speed up the design cycle or optimize the design or both, but a good design
still requires great effort from the designer to explore the design space. Our
research attempts to reduce that effort by providing a optimal and fast design
flow which can be used for most SWO-based applications. The memory hier-
archy is carefully designed so that the performance can be guaranteed for this
type of data-intensive application. Moreover, although currently we are target-
ing FPGA based COTS board, the ideas we present here can extend to any
architecture with multiple hardware cores, internal buffer memory and external
memory.
2.4 Implementing SWOs on COTS FPGA boards
Our previous discussion showed that there are no optimization rules that fit all
applications independent of the target technology chosen. However, we can find
some common rules for sets of similar applications. We are interested in SWO
applications because they are common operations implemented on SPPs.
In our case, the COTS boards are built before we implement any algorithm
on them, which means all the implementations are subject to the board's
constraints. We are concerned with improving performance and shortening the
design cycle when we implement SWOs on these COTS boards.
There is not much related work in this specific area, except for a series of
publications from Liang et al. We will show the details of how they implement
their SWOs and the differences between their method and ours.
Liang et al. [62] proposed a method for mapping generalized windowing
operations called Generalized Template Matching (GTM) onto reconfigurable
hardware using basic building blocks. First, all the possible non-dominant Mem-
ory Access Patterns (MAPs) are enumerated according to different combinations
of the Packing Factor (PF) and Initiation Interval (II) [63, 64, 65]. PF is defined
as the number of image pixels in one memory location. II is the constant time
(in clock cycles) between initiating the processing of two consecutive windows.
MAPs, which determine when and how data is transferred between FPGA and
external memory banks, are dependent on the PF and on the II. Liang et al.
enumerate all MAPs for different combinations of PFs and IIs. Once the MAPs are
listed, the corresponding data allocation and buffering schemes are determined.
In [66], three data allocation buffering strategies are proposed.
1. Full Image Row Buffering. Enough rows of pixels are buffered so that only
one new pixel datum is needed from an external memory bank for a new
window operation. This method will be discussed in detail in Section 3.4.
2. Small Internal Buffering. When storing rows of pixels becomes too expensive,
the image can only be stored in external memory. By using a small
internal buffer, data that has already been read in but not used in the
current window operation can be stored temporarily for the windowing
operation it is used in. This method introduces extra delay compared to
full image row buffering since it needs more external memory accesses. To
fully utilize the available memory bandwidth, two methods are used:
(a) Pixel Packing. When pixel data width is less than the memory width,
several pixels are packed in one memory address.
(b) Redundant External Data Storage. By storing several copies of data
in different external memory banks, the memory bandwidth can be
increased.
3. Partial Buffering of Image Rows. This is a combination of 1 and 2. When
there is not enough buffer space for full image row buffering, this method
can be used to maximally reduce the external memory accesses thus re-
ducing delay.
Once the MAPs are decided for each packing factor (PF) and Initiation
Interval (II), basic GTM building blocks called Region Functions (RFs) are
allocated to minimize the area and buffer size under constraints of an II, a PF
and latency. Finally, one or more RFs are implemented on the FPGA chip
so that the total execution time is minimized under the FPGA board resource
constraints.
Compared to a generic high level language hardware design process, Liang
et al. [62] take memory access and data buffering into consideration, which
is essential for data intensive applications. However, their approach has the
following problems:
1. Their design process begins with enumerating MAPs according to different
PFs and IIs. Although some dominant MAPs are pruned, the design space
is still too large especially when the size of the window or the number of
memory banks is large. Moreover, for each window operation, temporal
schedule and spatial binding of the window processing can greatly affect
the memory access requirements.
2. Their buffering method is not efficient when the available buffer size is not
enough for full image row buffering. This will be discussed in detail in
Section 3.5.
3. The routing area required in the FPGA is not considered. It is very difficult
to estimate routing area for FPGAs even though in many cases, it occupies
a large percentage of the total area.
Compared to Liang's method, the design flow we propose is simple, near
optimal, and adaptive to the availability of hardware resources. We determine
the duplication factor, defined as the number of copies of the function units for
one SWO, at an early stage of the implementation. Specifically, we:
1. Estimate the total area of functional units required for one SWO’s pro-
cessing and get the upper-bound of the duplication factor subject to area
constraints.
2. Calculate both the upper-bound and lower-bound of the duplication factor
subject to memory bandwidth constraints.
3. Calculate the upper-bound of the duplication factor subject to on-chip
memory size.
According to these three upper-bounds, we find the tightest upper-bound and
generate the corresponding design.
Chapter 3
DESIGN TRADEOFFS
The method we present here targets FPGA based coprocessor boards. We as-
sume that the FPGA chips and their connections to the external memory banks
have already been built on the board. Our goal is to design an implementation
which can fully utilize the available resources for optimal performance. By es-
timating the upper bounds to parallelism according to different constraints, we
can decide which constraint is the most critical one and then produce a design
accordingly.
There are three constraints we need to consider. First, the number of slices
in the FPGA chip limits how many copies of processing elements we can put
on the chip. Second, memory bandwidth defines the maximum data transfer
speed between the FPGAs and the external memory banks. Third, the size
of the on-chip memory which can be used as buffers reduces redundant data
transfer. Our goal is to maximize the usage of all the available resources for
maximum parallelism. For now we ignore timing constraints for each pipeline
stage; they can be optimized later after we determine the design structure. The
following sections go into the details of how these three constraints influence our
final implementation. Although most of our analysis is based on a coprocessor
board with a single FPGA chip, the same analysis can hold for boards with
multiple FPGA chips because most COTS systems have a symmetrical layout
of the interconnection between FPGA chips and memory banks. We can divide
the problem into several equal size sub-problems by exploiting coarse-grained
parallelism. After finding solutions for the sub-problems, we can combine these
solutions to achieve a final solution for the whole problem. Even if the intercon-
nection is different for different FPGA chips, we can still use the same analysis
process iteratively until we get an optimal division of the problem and allocate
the sub-problems to different chips.
The rest of this chapter is organized as follows. Section 3.1 lists the param-
eters we will use later in this dissertation. Section 3.2 describes the example we
will use as we discuss the design tradeoffs for the area constraint in section 3.3,
the memory bandwidth constraint in section 3.4 and the on-chip memory con-
straint in section 3.5.
3.1 Some Definitions
We define the parameters we use in Table 3.1. They are grouped into application
parameters, board parameters and other parameters. Application parameters
define the size of the SWO application. Board parameters include the board
resource information. Other parameters are intermediate and final results de-
termined from SWOOP. These parameters will be used later without further
explanation.
3.2 Example Description
In this chapter, we use an example to demonstrate our method. After we present
each constraint, we show how it applies to this example. The example has the
following properties:
• Input image size is 1024× 1024, where each pixel has 8 bits.
• The window applied to the image is a 3 × 3 high pass filter, as shown in
Figure 3.1. It can be used to sharpen images by highlighting fine detail
in an image or enhancing detail that has been blurred [12]. Sliding this
window in raster scan order throughout the whole image, we can get an
enhanced image where each output pixel POi,j takes the value of
(1/9) × [8 × PIi,j − Σ PIneighbor]; a software sketch of this operation follows
Figure 3.1. Each output pixel also has 8 bits.
Table 3.1: Parameter Definition

Application Parameters                                    Symbol    Explanation
Image size                                                M × N     M rows, N columns
Window size                                               m × n     m rows, n columns
Pixel value of input image                                PIi,j     row i, column j
Pixel value of output image                               POi,j     row i, column j
Bits per input pixel                                      Wpi
Bits per output pixel                                     Wpo

Board Parameters                                          Symbol    Explanation
Total area size                                           Atotal
No. of total memory banks                                 k
Memory bit width                                          Wmj       Wm1, Wm2, ..., Wmk
Total buffer size                                         Btotal

Other Parameters                                          Symbol    Explanation
No. of input memory banks                                 kin
No. of output memory banks                                kout
On-chip memory used for memory interface as FIFO          BIF
Total available buffer size for buffering                 Bavail
Block buffer size (in pixels)                             p × q     p rows, q columns
Duplication factor under area constraints                 Da
Upper bound of duplication factor with no buffering       Dml
Upper bound of duplication factor with full row buffering Dmu
Upper bound of duplication factor under buffer size constraints   Db
Tightest upper bound of duplication factor                D

• The board we have only contains one FPGA chip with 12,288 slices and
81,920 bits of on-chip memory.
• 4 memory banks are connected to this FPGA chip, with data widths of
32, 32, 64 and 64 bits respectively.
• We assume the processing clock is the same as the memory read/write
clock so that we do not have to consider cross clock boundary issues.
Extra buffers and converting the data transfer rate into bits/processing
cycle are needed if these two clocks are different.
[Figure 3.1: Highpass Filter Example. The 3 × 3 mask is
(1/9) × [−1 −1 −1; −1 8 −1; −1 −1 −1].]
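To make the window operation concrete, the following minimal software sketch (ours, using NumPy; the dissertation's actual implementations are in hardware, and the border handling here is a simplification) applies the mask of Figure 3.1 in raster scan order:

```python
import numpy as np

def highpass_3x3(image):
    """Apply PO[i,j] = (1/9) * (8*PI[i,j] - sum of the 8 neighbors)
    in raster scan order; border pixels are skipped for simplicity."""
    M, N = image.shape
    out = np.zeros((M, N))
    for i in range(1, M - 1):
        for j in range(1, N - 1):
            window = image[i - 1:i + 2, j - 1:j + 2].astype(np.int32)
            neighbors = window.sum() - window[1, 1]
            out[i, j] = (8 * int(window[1, 1]) - neighbors) / 9.0
    # Clip to the 8-bit output range, since each output pixel has 8 bits
    return np.clip(out, 0, 255).astype(np.uint8)
```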
3.3 Area Availability
Every functional unit (FU), e.g. adders, multipliers, registers etc., consumes one
or more slices of the FPGA. The number of slices in an FPGA is proportional
to the area in an ASIC; we use area here for simplicity.
3.3.1 Principal Blocks
Each FPGA has a fixed area (fixed number of slices) when it is manufactured;
the actual area depends on the model of the FPGA. Designs implemented in
a specific FPGA chip are limited by the available area, which means the total
area of all the functional units and routing cannot exceed the total area of that
FPGA chip. It is extremely difficult to estimate the routing area, so we reserve
20% of the total chip area for routing purposes, which leaves 80% of the total
area for FUs.
SWOs involve a set of repeated operations at different image locations. We
define a Micro Processing Element (MPE) as a set of pipelined function units
which can process one window. Once we build the MPE, we can sequentially
feed pixel data from different windows into this pipeline and get the outputs one
by one. Since each window is independent, if we have Da duplicate copies of the
same MPE, we can process Da windows at different locations simultaneously
and get a speedup of Da times. This is called spatial parallelism. Here D stands
for the duplication factor and the subscript a means this duplication factor is
based on area constraints. Given the total available area, we can estimate the
maximum Da based on the area consumed by the MPEs and the overhead which
coordinates these MPEs. Da is the upper bound on parallelism subject to the
area limitations.
Maximizing Da requires an accurate area estimate of the MPE, the control
logic and the interfacing logic. Figure 3.2 shows the block diagram of the differ-
ent blocks that are included in the area estimate. We define AMPEs as the sum
of all function unit area for processing one window; this includes the address
generation for one window. AIF and Actrl are the area overhead for interfacing
logic and control logic. AIF is easy to determine because it does not change
much when the design changes. In most cases, the interface modules are pre-
defined and their area can be estimated before we start the design process if we
know how many external memory banks will be used. Actrl is the area consumed
by the control logic. The control logic coordinates between MPEs so that they
can share the memory interface without conflict. Avalid is defined as 80% × Atotal
because we reserve 20% of the total area for routing overhead.
[Figure 3.2: Block Diagram of a Commercial FPGA Computing Engine. Inside the FPGA, duplicated MPEs (function units plus address generation, AMPEs) connect through the on-chip memory and control part (Actrl) to the memory and I/O interface (AIF), which links to the external memory banks and the host.]
Equation 3.1 shows the constraint for maximizing Da.
Da × AMPEs + Actrl + AIF ≤ Avalid (3.1)
Actrl depends on the complexity of the control state machine. Higher levels
of parallelism, corresponding to a larger Da, normally mean a more complicated
controller; therefore, Actrl can change for different Da. It is difficult to
estimate the control logic area before it is actually implemented. Others have
proposed a controller estimation method that requires the number of control
states to be known before estimation [67], using multiple linear regression to
model controller area as a function of complexity. Based on our previous design
experience, we simplify the estimate by assuming that each MPE needs an extra
15∼20% of its area for control. This overhead may be a little pessimistic in
some cases, and we can always adjust this parameter later for further
optimization. We can therefore rewrite Equation 3.1 as Equation 3.2:
Da × (AMPEs + 20%× AMPEs) + AIF ≤ Avalid (3.2)
Once scheduling and binding of the operations for one MPE is fixed, the
area can be determined. Getting an optimal schedule and binding given area
constraints is an NP-hard problem [37]. Many heuristic algorithms have been
explored to solve this problem so that sub-optimal solutions can be found with
less computation. We further simplify our method by assuming there is no
area limitation for processing one window. This assumption is reasonable for
small size SWOs and current FPGA technology. Da can be easily modified in
Equation 3.2 when AMPEs changes because a different implementation is chosen.
3.3.2 High-Pass Filter Example
Figure 3.3 shows a schedule using the ASAP scheduling algorithm [37] for our
high pass filter example. The dotted horizontal lines correspond to clock cycle
boundaries. We assume the multipliers and adders have the same unit delay;
this assumption can be modified depending on the library binding.
[Figure 3.3: Sequencing Graph of High Pass Filter. The eight neighbor pixels of PIi,j are summed by an adder tree, the center pixel is multiplied by 8, and the difference is scaled by 1/9.]
Once the original neighboring pixels are available, we get the output pixel
five clock cycles later. A filled pipeline can provide one result every clock cycle
thereafter. All these conclusions are based on the assumption that neighboring
pixels can constantly be fed into the pipeline. In the following sections, we will
discuss the situation when this assumption does not hold.
Table 3.2 shows the area usage of different function units, memory interfaces
and control logic. We use 20% of one MPE’s area as the estimate for controller
area.
Table 3.2: Area Usage of High Pass Filter Blocks

Blocks                          Slices Consumed
Memory Interfaces               1768
MPEs: Multiplier                69
MPEs: Adders                    42
MPEs: Registers                 52
MPEs: Counters                  24
MPEs: Total                     187
Control (20% of MPEs total)     38
According to Equation 3.2 and the data we get from Table 3.2, we can derive
the value of Da as follows:
1768 + Da × (187 + 38) ≤ 80%× 12288
Solving this inequality, we get the maximum value Da = 35. If we only consider
area constraints for this example, we can have at most 35 copies of one MPE
and thus have 35 windows being processed simultaneously.
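As a quick check, Equation 3.2 can be evaluated mechanically; this one-off sketch (ours) reproduces Da = 35 for the example:

```python
A_total, A_IF, A_MPE = 12288, 1768, 187   # slices (Table 3.2)
A_valid = 0.8 * A_total                   # 20% reserved for routing
A_ctrl = 0.2 * A_MPE                      # per-MPE control estimate
D_a = int((A_valid - A_IF) // (A_MPE + A_ctrl))
print(D_a)  # 35
```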
3.4 Memory Bandwidth Limitation
In section 3.3, we only consider the area constraints of the design. The feasibility
of this implementation depends on whether or not we can have all the data ready
for the functional units. If memory bandwidth becomes a bottleneck, then even
if we have Da MPEs on chip, some of them will be idle until the data arrives. In
this case, it is not useful to have Da copies of MPEs. We derive another upper
bound according to the memory bandwidth so that we can make sure all copies
of the MPE are fully working at all times.
3.4.1 Upper Bound with No Buffering and Full Row Buffering
As shown in Figure 1.3, each time the m × n window moves from left to right,
m new data are needed from memory and one output is generated. If there
is no buffering, we discard the m pixels which have moved out of the window.
However, of these discarded pixels, m−1 pixels will be reused when the window
moves to the next row. In this case, we load the pixels redundantly and increase
the memory bandwidth requirements. If we have enough buffer space, we can
keep the m− 1 pixels in the buffer and only discard 1 pixel which will never be
used for the following windows. When the window moves to the next row, it
requires m new data, m−1 of which are already in the buffer. Thus we only need
to read one new data from external memory. By this means we can minimize
the loading redundancy. Figure 3.4 shows a full row buffering method which
can achieve this goal [64]. These two cases (loading m pixels per window and
loading 1 pixel per window) can be defined as the lower bound and upper bound
of the parallelism factor, Dml and Dmu, respectively, according to the memory
bandwidth limitation.
[Figure 3.4: Using Full Row Buffering Scheme. The pixels of (m − 1) full image rows plus m additional pixels are buffered on chip; moving from the current processing window to the next requires fetching only one new pixel from external memory.]
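To make the scheme concrete, here is a minimal software model of a full row buffer (our sketch; the capacity (m − 1) × N + n used below is the general m × n form, which coincides with (m − 1) × N + m for the square windows of our examples):

```python
from collections import deque

def sliding_windows(image, m, n):
    """Stream pixels in raster order through a line buffer holding only
    the last (m - 1)*N + n pixels; after the fill phase, each emitted
    window costs exactly one new off-chip pixel read."""
    M, N = len(image), len(image[0])
    buf = deque(maxlen=(m - 1) * N + n)
    for i in range(M):
        for j in range(N):
            buf.append(image[i][j])         # the single new pixel
            if i >= m - 1 and j >= n - 1:   # a full window is buffered
                window = [[buf[r * N + c] for c in range(n)]
                          for r in range(m)]
                yield (i - m + 1, j - n + 1), window
```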
Upper Bound with Minimal Buffering Dml: If there is minimal buffer space
available to store m× (n− 1) pixels, then we need to load data from the
external memory banks each time. We also need to consider the output
memory data transfer which will share the memory bandwidth with the in-
put memory data transfer. Although the same memory port can be shared
between input and output data transfer by interleaving the data transfers,
we allocate input and output memory ports separately to simplify the
control and to achieve high throughput designs.
Equation 3.3 shows how we define Dml.
Dml × (m × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.3)
Dml is subject to the constraints shown in Equation 3.4 because input and
output memory ports are allocated separately.
Dml × (m × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Dml × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.4)
Upper Bound with Full Row Buffering Dmu: By assuming enough buffer space
is available, we can reduce the input memory data transfer to 1 pixel per
new window using the full row buffering method shown in Figure 3.4. In
the case where we have Dmu windows processed simultaneously, memory
bandwidth limits the value of Dmu according to Equation 3.5.
Dmu × (1 × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.5)
Dmu is subject to the constraints shown in Equation 3.6 because input and
output memory ports are allocated separately.
Dmu × (1 × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Dmu × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.6)
The assignment of input and output memory may differ between these two
cases where memory bandwidth is the constraint.
3.4.2 High-Pass Filter Example
We consider again the 2-D filter example from Section 3.3.2. We have already
shown that if we only consider area constraints, we can have 35 MPEs in parallel.
This means that once the pipeline is full, we can produce 35 output pixels in one
clock cycle. These output pixels need to be stored in external memory so the
host can get the processed image. However, 35 output pixels/cycle means we
need at least 35 × 8 bits/cycle of memory bandwidth to store the data. According
to our board parameters, we only have (64 + 64 + 32 + 32) bits/cycle of memory
bandwidth, which is not even enough to transfer the output data to external
memory, let alone the input data transfer from external memory. Obviously, for
this example, area is not the critical constraint of the design, and we need to
consider the upper bound on the duplication factor subject to the memory
bandwidth constraint.
Without any buffering, three new pixel data must be loaded from external
memory and one processed pixel data stored to external memory each time for
each window processed in parallel. To determine Dml, we need to allocate the
input memory ports and output memory ports so that the maximum number for
Dml meets the constraints of Equation 3.4. Based on the fact that the memory
port number k is not a large number, we can enumerate all the input/output
memory port allocation possibilities to obtain the maximum Dml. In this ex-
ample, the optimal allocation would be to assign the two 64 bit width memory
ports as input memory while the two 32 bit width memory ports are output
memory. By substituting the numerical values into Equations 3.3 and 3.4,
we get Dml = 5, which is much smaller than Da = 35.
Using a similar strategy to calculate Dmu, we find that allocating one 64
bit and one 32 bit width memory as input memories and the others as output
memories, we can get the maximum value of Dmu = 12 under the constraints of
Equation 3.6.
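Because k is small, the optimal split of memory ports between input and output can be found by brute force; the sketch below (our illustration, not SWOOP's actual code) enumerates all allocations for both bounds:

```python
from itertools import product

def best_duplication(widths, pixels_in, w_pi, w_po):
    """Maximize D subject to D*pixels_in*w_pi <= input bandwidth and
    D*w_po <= output bandwidth over all port splits; pixels_in is m
    for no buffering (Eq. 3.4) and 1 for full row buffering (Eq. 3.6)."""
    best = 0
    for assign in product((True, False), repeat=len(widths)):
        bw_in = sum(w for w, a in zip(widths, assign) if a)
        bw_out = sum(w for w, a in zip(widths, assign) if not a)
        if bw_in and bw_out:
            best = max(best, min(bw_in // (pixels_in * w_pi),
                                 bw_out // w_po))
    return best

widths = [64, 64, 32, 32]
print(best_duplication(widths, pixels_in=3, w_pi=8, w_po=8))  # Dml = 5
print(best_duplication(widths, pixels_in=1, w_pi=8, w_po=8))  # Dmu = 12
```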
Dml and Dmu define the upper bound of the duplication factor for two ex-
treme cases: no buffering and full row buffering. Obviously, the number of
simultaneously processed windows also depends on the buffering scheme we se-
lect. When on-chip memory is limited and not enough for full row buffering,
the actual upper bound may be less than Dmu. This analysis is presented in the
next section.
3.5 On-chip Memory Availability
Section 3.4 gives the two upper bounds of the duplication factor under memory
bandwidth constraints. As we mentioned before, if we have enough buffer space,
we can reduce the external memory data transfer rate from m pixels per window
to 1 pixel per window. Figure 3.4 shows a full row buffering method which
requires the buffer size to be at least ((m − 1) × N + m) × Wpi bits. This may
be larger than the buffer space available. When buffer size becomes a critical
constraint of the duplication factor, our goal is to optimize the buffer size while
still keeping the number of external memory data accesses as small as possible.
3.5.1 Block Buffering Method
We present a new method we call the block buffering method. It can greatly
reduce the buffer size while still keeping the data loading redundancy low.
Figure 3.5 gives a simple example of the block buffering method. Before
processing is begun, a p × q block of pixel data is buffered, where p ≥ m and
q ≥ n. In this example the window size is m = 3, n = 4 and the block
size is p = 4, q = 6. With a p × q block buffer, we can process a total of
(p−m + 1)× (q− n + 1) windows without loading any new data. While we are
processing the current block, we can at the same time load the data for the next
p× q block buffer. As shown in Figure 3.5, each time when we load a new block,
p×(q−n+1) pixels will be loaded from off-chip memory. So the average number
of off-chip memory accesses for this type of move is p×(q−n+1)(p−m+1)×(q−n+1)
= pp−m+1
pixels per window operation. This is a significant savings for p > m when
compared to off-chip memory access without buffering, which requires m pixels
per window operation. When the block moves to the next row block, the number
of pixels needed from off-chip memory to initiate the new windowing processing
is also p × q. We are more interested in the horizontal movement of the block
because it occurs more frequent than the vertical movement.
Equation (3.7) shows a simple proof that, after introducing the p × q buffer,
the memory requirement p/(p − m + 1) is at most m, the data transfer rate
without the buffer. Moreover, the larger p is, the less memory bandwidth is
required. The proof requires p ≥ m, which we have already assumed.
[Figure 3.5: Block Buffering Method Example. A 3 × 4 window slides inside a 4 × 6 block buffer; the 1st block buffer, the 2nd block buffer, and the next row block buffer are outlined on the pixel grid.]
p ≥ m
⇒ p(m − 1) ≥ m(m − 1)
⇒ p ≤ pm − m² + m
⇒ p/(p − m + 1) ≤ m        (3.7)
Once we decide the values of p and q, we can determine the corresponding
duplication factor Db subject to buffer availability. Equation 3.8 shows how
we can determine the value of Db if we are only concerned with the memory
requirement when blocks move from left to right.
Db × (p/(p − m + 1) × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.8)
Again, Db is subject to the constraints shown in Equation 3.9 because the
input and the output memory ports are allocated separately.
Db × (p/(p − m + 1) × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Db × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.9)
The full row buffering method can be viewed as a special case of the block
buffering method with some modifications. If we let p = M, q = n using the
block buffering method, then n lines of pixels are buffered before we start sliding
window processing. It is similar to the full row buffering method. However, for
the full row buffering method, processing starts even before all n lines of pixels
are buffered. Once the pixels for the first window are in the buffer, processing
can start. After each window is processed, one datum in the buffer is discarded
and a new datum is loaded. In this manner, the full row buffering method can
save some buffer space by updating the position of the start and end pixels
after each window. This also means that the address generation part of full row
buffering is extremely complicated because we are using a “moving” buffer. The
block buffering method, on the other hand, can provide much simpler address
generation.
3.5.2 Selecting p and q
Section 3.5.1 shows the theoretical analysis of using the block buffering method.
For actual implementations, instead of loading p× (q − n + 1) new pixels from
off-chip memory when the block moves horizontally, we load a new block of p×q
pixels from off-chip memory even though p × (n − 1) pixels are already in the
on-chip memory. Our consideration here is that usually p >> m and q >> n
and these extra redundant off-chip memory accesses can greatly simplify the
control signals, and allow us to overlap loading and processing data.
Therefore, Equation 3.8 needs to be modified as follows:
Db × (p × q / ((p − m + 1) × (q − n + 1)) × Wpi + 1 × Wpo) ≤ Σ_{i=1}^{k} Wmi        (3.10)
Correspondingly, the constraints should change to those in Equation 3.11.
Db × (p × q / ((p − m + 1) × (q − n + 1)) × Wpi) ≤ Σ_{i=1}^{kin} Wmi
Db × (1 × Wpo) ≤ Σ_{i=1}^{kout} Wmi
kin + kout ≤ k        (3.11)
Our goal is to minimize the total number of off-chip accesses. Under the previous
assumptions, that is equivalent to minimizing the average number of off-chip
accesses per window, p × q / ((p − m + 1) × (q − n + 1)). When we increase p or
q, we reduce the external memory accesses per window from m toward this ratio,
and the larger p and q are, the closer the ratio gets to 1. Since p and q are
constrained by Equation (3.12), we need to balance p and q to obtain the minimal
number of external memory accesses.
p × q × Wpi ≤ Bavail/2        (3.12)
Bavail is the on-chip memory available for data buffering, defined as
Btotal − Bother, since some on-chip memory may be used for other purposes, e.g.,
as a FIFO for the memory interface. The available on-chip memory is used to
build two block buffers so that while we are processing the current windows,
we can load the next block of data in parallel. Building two block buffers
instead of one increases the average number of external memory accesses, but
overlapping the loading and the computation saves overall processing time.
Figure 3.6 shows that by alternately using two small buffers we can greatly
reduce the waiting time and increase the fraction of time spent processing.
[Figure 3.6: Overlapping Loading and Processing. With a single block buffer of size p × q, load and process phases alternate with waits; with two buffers of size popt × qopt used alternately, loading the next block overlaps processing the current one, leaving almost no waiting.]
The optimal values for p and q minimize the expression
p × q / ((p − m + 1) × (q − n + 1))
under the constraint shown in Equation (3.12). Solving this constrained
minimization, we find that by selecting:
popt = √(Bavail × (m − 1) / (2 × (n − 1) × Wpi))
qopt = √(Bavail × (n − 1) / (2 × (m − 1) × Wpi))        (3.13)
we can minimize the total number of external memory loading accesses. The
detailed proof can be found in Appendix B.
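As an illustrative check (our own sketch), the optimal block shape, the resulting average loading rate, and the Db bound for the running example can be computed directly:

```python
import math

def optimal_block(B_avail, m, n, w_pi):
    """Block shape per Equation 3.13; the factor of 2 reflects the two
    alternating (double-buffered) block buffers sharing B_avail."""
    p = math.sqrt(B_avail * (m - 1) / (2 * (n - 1) * w_pi))
    q = math.sqrt(B_avail * (n - 1) / (2 * (m - 1) * w_pi))
    return p, q

m = n = 3
p, q = optimal_block(16384, m, n, 8)     # 16,384 bits, 8-bit pixels
print(p, q)                              # 32.0 32.0
rate = (p * q) / ((p - m + 1) * (q - n + 1))
print(round(rate, 3))                    # ~1.138 loads/window (vs. m = 3)
print(int((64 + 32) // (rate * 8)))      # Db = 10 for 96 bits/cycle input
```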
3.5.3 High-Pass Filter Example
According to the above analysis, we know that Db satisfies the following condi-
tion:
Dml ≤ Db ≤ Dmu
because the number of external memory accesses per window using the block
buffering method is between 1 (full row buffering) and m (minimal buffering). The
value of Db depends on the buffer size. In our example, we have a total of 81,920
bits of on-chip memory, where 65,536 bits are used for the memory interface and
the remaining 16,384 can be used as the buffer. The full row buffering method
shown in Figure 3.4 requires (m − 1) × N + m pixels to be buffered, which means
it needs 16,408 bits of on-chip memory. This is more than the available buffer size
and, in this case, the full row buffering method is not feasible and we need to
use the block buffering method instead.
Our computation shows that, given Bavail = 16,384 bits, the value of popt is
32 and qopt is 32. After we determine popt and qopt, we can get the data transfer
rate requirement subject to Equation 3.11. By allocating one 64-bit and one 32-bit
width memory port as input memory and the other 64- and 32-bit ports as
output memory, we get Db = 10, which doubles the duplication factor Dml and
is close to Dmu.
3.6 Example Summary
This chapter presents a detailed discussion of how to determine the different up-
per bounds of parallelism according to the different resource constraints. The ex-
ample given in this chapter shows that different resource constraints can greatly
influence the value of the upper bounds. According to our analysis, the max-
imum value of Da is 35 if we only consider area constraints. Further analysis
shows that the duplication factor should be in the range of [5, 12] when mem-
ory bandwidth is considered. Finally, if we take buffer space into account, the
maximum duplication factor should be 10. Clearly, we will select 10 as our final
duplication factor. This also means that the tightest constraint for this example
is buffer size. Once we select the tightest constraint, the corresponding hardware
block structure can be determined. In this example, we will use the block buffer-
ing method and let p = 32, q = 32. 10 copies of MPEs will be generated and the
corresponding address generation part will be built. Moreover, each memory
port is assigned either as input memory or output memory at this time.
3.7 SWOOP
SWOOP is a tool that automates the process of finding the upper bounds and
choosing the tightest one. It also reports the optimal memory port allocation
according to the tightest upper bound. For on-chip memory constrained
applications, SWOOP also gives the optimal value of the block size. SWOOP
is written in MATLAB and implements the method discussed earlier in this
chapter. The algorithm SWOOP uses is summarized as follows:
Step 1: Enumerate all possible memory bank usages and get the values of
different AIF and BIF.

For each set of AIF and BIF:

    Step 2: Get the value of Da according to board parameters and
    application parameters.

    Step 3: Calculate Dml and Dmu using optimal memory allocation.

    IF Bavail is enough for full row buffering:
        Step 4: The tightest constraint could be either area or memory
        bandwidth, and D = min{Da, Dmu}.
    ELSE:
        Step 4: Calculate popt and qopt and the corresponding Db using
        optimal memory allocation, and D = min{Da, Db}.

End For.

Step 5: Find the maximal value of D over all the cases. This D is the final
maximal duplication factor.
Figure 3.7 gives the flowchart of the algorithm.
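A compact, self-contained software rendering of this loop (our own Python sketch; SWOOP itself is written in MATLAB, and the per-bank interface costs below are simplifying assumptions rather than board data) reproduces the running example's result:

```python
import math
from itertools import product

def swoop(widths, A_total, B_total, A_MPE, A_IF_bank, B_IF_bank,
          N, m, n, W_pi, W_po):
    """Sketch of the SWOOP search: interface cost is modeled as fixed
    per-bank amounts, and m, n >= 2 is assumed."""
    best = (0, None)
    # Step 1: for every bank, choose input, output, or unused
    for assign in product(('in', 'out', 'unused'), repeat=len(widths)):
        bw_in = sum(w for w, a in zip(widths, assign) if a == 'in')
        bw_out = sum(w for w, a in zip(widths, assign) if a == 'out')
        if bw_in == 0 or bw_out == 0:
            continue
        banks = sum(a != 'unused' for a in assign)
        B_avail = B_total - banks * B_IF_bank
        if B_avail <= 0:
            continue
        # Step 2: area bound (20% routing reserve, 20% control per MPE)
        D_a = int((0.8 * A_total - banks * A_IF_bank) // (1.2 * A_MPE))
        # Steps 3 and 4: pixels loaded per window under the buffering
        # scheme that fits in B_avail
        if B_avail >= ((m - 1) * N + m) * W_pi:    # full row buffering
            rate = 1.0
        else:                                      # block buffering
            p = math.sqrt(B_avail * (m - 1) / (2 * (n - 1) * W_pi))
            q = math.sqrt(B_avail * (n - 1) / (2 * (m - 1) * W_pi))
            rate = (p * q) / ((p - m + 1) * (q - n + 1))
        D = min(D_a, int(bw_in // (rate * W_pi)), bw_out // W_po)
        if D > best[0]:
            best = (D, assign)                     # Step 5: keep max D
    return best

# HPF example; 442 slices and 16,384 FIFO bits per bank are assumptions
print(swoop([64, 64, 32, 32], 12288, 81920, 187, 442, 16384,
            1024, 3, 3, 8, 8)[0])  # -> 10
```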
SWOOP takes only seconds to run, so it can be used as a tool to get a rough
estimate of an architecture for a given design. Later on, the designer can change
the parameters and run SWOOP multiple times to get a near-optimal solution
if specific requirements change.
In our previous discussion, we assume that the three constraints are indepen-
dent of each other. This assumption is not always true and some modifications
are necessary when estimating the maximal duplication factor. For example, a
different allocation of memory ports may lead to changes in AIF and Bavail. As
a result, the values of Da and Db may be different. In SWOOP, we enumerate all
the possible memory allocations and the corresponding values of AIF and Bavail
to find the optimal memory architecture. This enumeration, although exponen-
[Figure 3.7: SWOOP Flowchart. Input the board and application parameters and enumerate all possible AIF and BIF. For each case, calculate Da, Dml, and Dmu with their corresponding memory allocations. If Da ≤ Dml, set D = Da; otherwise, if Bavail is enough for full row buffering, set D = min(Da, Dmu); else calculate the optimal p and q, calculate Db with its optimal memory allocation, and set D = min(Da, Db). Once all possible AIF and BIF are considered, select the maximal D and output it with its corresponding memory allocation.]
tial in the number of the off-chip memory banks, is still affordable because the
number of the off-chip memory banks is usually small, i.e., less than 10. Here
Bavail is defined as the total available buffer size for SWO buffering. In this
dissertation, Bavail equals Btotal − BIF where BIF is the buffer size used as a
FIFO for the memory interface.
3.8 Summary
This chapter discussed the design tradeoffs in detail. Three upper bounds,
corresponding to three different resources, were determined using analytical
methods, and a simple example was presented to illustrate the tradeoffs. The
detailed SWOOP algorithm was also presented. In the next chapter, we present
four different SWO applications implemented using both SWOOP and manual
design. The results from SWOOP and the manual designs are compared, and a
detailed analysis is given for each application.
Chapter 4
EXPERIMENTS
In Chapter 3, we have discussed how to determine the different upper bounds
according to different resource constraints. We consider the three upper bounds
as relatively independent of each other, but practically each will have some
influence on the others. Since our aim is to quickly estimate the maximum
performance given the co-processor board, we ignore these second order effects
at this estimation stage and are only interested in defining the block structure.
The value of the maximum performance may change a little if the design is
further optimized based on the block structure we get. The small variation
between the estimate and actual implementation can be ignored in most cases.
Four examples in very different areas have been implemented using our
method to prove its effectiveness. One is the high-pass filter we have intro-
duced as our example in Chapter 3; one is a low-pass filter commonly used in
digital image processing for smoothing or blurring; one is the Retinal Vascular
Tracing (RVT) algorithm used in medical imaging [68]; and the last one is the
Particle Image Velocimetry (PIV) algorithm used in fluid dynamics [69]. For
the high-pass filter, we use the constraints we assumed previously. For the other
three algorithms, we will apply our tool, SWOOP, to constraints based on Fire-
Bird [70], a commercial FPGA-based computing board from Annapolis Micro
Systems Inc. Figure 4.1 shows the block diagram of the board; it contains one
Xilinx Virtex 2000E FPGA chip and 5 external memory banks.
[Figure 4.1: FireBird Block Diagram. A Xilinx Virtex XCV2000E FPGA connects over a 66 MHz, 64-bit LAD bus to the PCI controller (64-bit PCI bus) and to five ZBT SRAM banks: Mem_0 through Mem_3 (8 MB, 64 bits each) and Mem_4 (4 MB, 32 bits).]
The estimates of the performance for each application will be compared to
a hand-written design implementation to show the effectiveness of our method.
The handwritten designs showed significant speedup over the same algorithm
implemented in software.
4.1 3 × 3 High-Pass Filter and 5 × 5 Low-Pass Filter
Two 2-D filter applications are presented in this section with different image
sizes, window sizes and board parameters. Both algorithms have relatively small
window sizes and the area required for processing a single window is relatively
low. Therefore, the tightest constraint is either memory bandwidth or on-chip
memory size, depending on the board parameters.
4.1.1 Algorithm and Parameters
The 3 × 3 High-Pass Filter (HPF) has already been described in Chapter 3; we
summarize it here for completeness. Figures 4.2 and 4.3 show the HPF
mask and the corresponding sequencing graph.
[Figure 4.2: Highpass Filter Coefficients. The 3 × 3 mask is
(1/9) × [−1 −1 −1; −1 8 −1; −1 −1 −1].]
A 5 × 5 Low-Pass Filter (LPF) is also presented in this section because it
[Figure 4.3: Sequencing Graph of High Pass Filter, identical to Figure 3.3.]
is similar to the HPF but uses different board parameters. The 5 × 5 low-
pass filter, sometimes called a smoothing or blurring algorithm, is widely used
in image preprocessing. It can be used to reduce noise, bridge small gaps in
lines or curves etc. [12]. In our experiment, we implement a 5×5 LPF using the
coefficients shown in Figure 4.4. To save area, instead of using multipliers, we use
shift registers combined with adders to implement the constant multiplications.
Figure 4.5 shows the sequencing graph for the LPF. Because the graph is too
large to fit on one page, we break it into six parts according to the coefficients.
Basically we group the pixels that multiply the same coefficient value together.
From Figure 4.4 we know that there are in total 6 different coefficients for the
LPF mask. In Figure 4.5, part A shows the center pixel with coefficient
value 15; part B stands for the four pixels with coefficient value 12; part C for
coefficient 9; part D for coefficient 5; part E for coefficient 2; and part F for
coefficient 4. The complete MPE uses an adder tree to add these six parts
together to get the output for one window; a software sketch of the
shift-and-add decompositions follows Figure 4.4.
[Figure 4.4: Low-pass Filter Coefficients. The 5 × 5 mask is
 2  4  5  4  2
 4  9 12  9  4
 5 12 15 12  5
 4  9 12  9  4
 2  4  5  4  2
scaled by 1/115.]
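The shift-and-add decompositions behind the MPE are straightforward; a small sketch (ours) of the six constant multipliers with a self-check:

```python
def times_15(x): return (x << 3) + (x << 2) + (x << 1) + x  # 8+4+2+1
def times_12(x): return (x << 3) + (x << 2)                 # 8+4
def times_9(x):  return (x << 3) + x                        # 8+1
def times_5(x):  return (x << 2) + x                        # 4+1
def times_4(x):  return x << 2
def times_2(x):  return x << 1

assert all(f(7) == c * 7 for f, c in
           [(times_15, 15), (times_12, 12), (times_9, 9),
            (times_5, 5), (times_4, 4), (times_2, 2)])
```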
Table 4.1 shows the board parameters and application parameters for these
two 2-D filter applications.
Table 4.1: Board Parameters and Application Parameters for HPF and LPF Applications

Parameter   HPF                 LPF
Atotal      12288               19200
k           4                   5
Wmj         {64, 64, 32, 32}    {64, 64, 64, 64, 32}
Btotal      81920               655360
M × N       1024 × 1024         512 × 512
m × n       3 × 3               5 × 5
Wpi         8                   8
Wpo         8                   8
AMPE        187                 400
Table 4.2 shows the slice usage of one MPE for the LPF algorithm.
[Figure 4.5: Low-pass Filter MPE. Parts A through F compute the constant multiplications for the coefficient groups 15, 12, 9, 5, 2, and 4 using shift-and-add networks (shifts of 1, 2, and 3 bits) over the corresponding pixel groups.]
Table 4.2: Area Usage of Low Pass Filter Blocks

Blocks                          Slices Consumed
MPEs: Multiplier                107
MPEs: Adders and Counters       174
MPEs: Registers                 119
MPEs: Total                     400
Control (20% of MPEs total)     80
4.1.2 Results from SWOOP
We can use our automated tool, SWOOP, with the board parameters in Table 4.1
to estimate which constraint will be the tightest for these two applications.
According to our calculation, the 3 × 3 high-pass filter algorithm is on-chip
memory bound while the 5 × 5 low-pass filter algorithm is memory bandwidth
bound. Table 4.3 gives the different upper bound results and the final maximal
duplication factor from SWOOP according to the different constraints. For
LPF, because it is memory bandwidth constrained, SWOOP will use the full
row buffering scheme and will not calculate the values of p, q, and Db.
Table 4.3: Duplication Factors According to Different Constraints

          HPF        LPF
Da        35         28
Dml       5          5
Dmu       12         16
Db        10         N/A
p × q     32 × 32    N/A
D         10         16
The optimal external memory allocation differs for the calculation of the
different bounds. Table 4.4 shows how we allocate the external memory bank
for optimal implementation and the corresponding values of AIF and Bavail.
Table 4.4: Bound Dependent Variables

                               HPF         LPF
kin (input memory banks)       {64, 32}    {64, 64}
kout (output memory banks)     {64, 32}    {64, 64}
AIF (slices)                   1768        1835
Bavail (bits)                  16384       589824
4.1.3 Comparison and Analysis
We manually implemented these two applications on the FPGA board; the
results are compared with the SWOOP results. We define the following
parameters for the comparison.
• Dall: The maximum duplication factor.
• Ball: Total on-chip memory usage (in bits).
• Aall: Total area usage (in slices).
• Wmj: Memory bank used.
Table 4.5 shows the differences between the manual designs and the results
from our automated tool.
Table 4.5: Comparison between Automatic and Manual Results for HPF and LPF Algorithms

             Automatic         Manual            Difference (%)
HPF  Dall    10                9                 11%
     Ball    16384             16128             1.6%
     Aall    4460              4236              5.3%
     Wmj     {64,32,64,32}     {64,32,64,32}     same
LPF  Dall    16                16                0%
     Ball    16416             20480             -19.8%
     Aall    9515              8827              7.8%
     Wmj     {64,64,64,64}     {64,64,64,64}     same
For the HPF, the three parameters differ up to 11% from the manual design,
which means that SWOOP gives a good estimate of the speedup as well as
of the usage of on-chip memory and area. The largest difference lies in the
duplication factor; for the manual design this is Dall = 9. This is different
from the automatic design because the manual design considers the memory
packing factor. The packing factor is defined as the number of pixels grouped
together and stored in one memory address. In our case, one 64 bit and one 32
bit memory bank are allocated as input memory; for one external memory read
access, a total of 12 pixels ((64 + 32)/8 = 12) can be read. Thus the packing
factor is 12. According to the automatic tool, the optimal block buffer size is 32
by 32. If we assume that 12 pixels are stored in one memory address and the
image is stored line by line, loading 32 pixels of one row would require three off-
chip memory accesses. In fact, we can load 36 pixels in three memory accesses.
As a result, some of the memory bandwidth is wasted. In our manual design,
we choose the block buffer size to be 24 by 42 instead. By adjusting the block
buffer size, we can avoid wasting memory bandwidth and at the same time we
simplify the control part of the design. In the future, SWOOP will be extended
to take the packing factor into account.
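The wasted bandwidth is simple arithmetic; in the sketch below (ours), a block row whose width is a multiple of the packing factor, such as the 24 used in the manual design, wastes none of the fetched pixels:

```python
import math

pf = (64 + 32) // 8      # packing factor: 12 pixels per combined access
for width in (32, 24):   # SWOOP's block row vs. the manual design's
    accesses = math.ceil(width / pf)
    print(width, accesses, accesses * pf - width)
# 32 -> 3 accesses with 4 fetched pixels unused; 24 -> 2 accesses, 0 unused
```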
For the LPF, our automatic tool gives the exact value of speedup. The
difference in Ball between the automatic and the manual results is −19.8% because
in our manual design, instead of buffering (m − 1) × N + n − 1 pixels for full
row buffering, m×N pixels are buffered. By buffering one extra row, each MPE
can load data from on-chip memory instead of loading data from both on-chip
memory and off-chip memory. We greatly simplified the control signals of our
design by doing so. Moreover, the automatic tool provides a guide to how much
on-chip memory we can use for buffering and in this case, we have enough on-
chip memory to buffer m×N pixels. In the case of limited on-chip memory, we
can reduce on-chip memory usage to (m − 1) × N + n − 1 and the difference
in Ball between the automatic and manual results would be 0% at the cost of a
more complicated controller.
4.1.4 Summary
For these two 2-D filters, SWOOP gives a very accurate estimate of the optimal
design. Moreover, since SWOOP only gives the block structure of the design,
designers have the flexibility to further optimize the design. The results of
SWOOP can serve as a guide for the designer to focus their efforts since SWOOP
identifies where the tightest constraint to parallelism is found in a design.
4.2 Retinal Vascular Tracing Algorithm
The RVT algorithm was designed by colleagues at Rensselaer Polytechnic In-
stitute (RPI) [71]. It is used in the biomedical field for tracing blood vessels in
retinal fundus images. The speed of motion of the human eye, and the desire
for these traces to assist during laser surgery, require the traces to be
available immediately after the retinal image is acquired. The algorithm is
split into two parts. The first part of the algorithm detects and validates several
seed points that are known to be on blood vessels. The tracing algorithm uses
these seed points as initial points. The second part of the RVT algorithm is the
basic tracing algorithm which starts at one point on a blood vessel, finds the
next point on that same vessel, and continues to follow its path until it reaches
the end. All of the background information on the RVT algorithm in this sec-
tion is from [68]. We are interested in the tracing algorithm because it is very
computationally intensive and it falls into our SWO category.
4.2.1 Algorithm and Parameters
In order to begin tracing, an initial starting point, ~pk, and orientation, sk, are
necessary. The starting point is in the center of a blood vessel, and the orien-
tation gives the direction that the vessel is pointing. The initialization points
are found in a pre-processing step, which is not related to this dissertation and
is not covered here. Vector ~uk is the unit vector along the blood vessel at point
~pk, defined by Equation 4.1 [71].
~uk = (ukx, uky)ᵀ = (cos(2πsk/16), sin(2πsk/16))ᵀ        (4.1)
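In software this direction encoding is one line of trigonometry; the sketch below (ours) maps a discrete orientation sk in {0, ..., 15} to its unit vector:

```python
import math

def unit_vector(s_k):
    """Unit vector along the vessel for discrete orientation s_k (0-15),
    i.e. an angle of 22.5 degrees times s_k, per Equation 4.1."""
    angle = 2 * math.pi * s_k / 16
    return (math.cos(angle), math.sin(angle))

print(unit_vector(2))  # NE, direction 2, 45 degrees: (0.707..., 0.707...)
```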
The next point on the vessel, ~pk+1, and its orientation, sk+1, can be found by
investigating the orientation of the vessel’s boundaries. This is accomplished by
calculating the results from a set of two-dimensional correlation kernels, designed
by researchers at RPI [71] and shown in Figure 4.6.
[Figure 4.6: Templates. Sixteen 11 × 11 correlation kernels with coefficients 0, ±1, and ±2 centered on the target pixel, one per direction: E (0, 0°), ENE (1, 22.5°), NE (2, 45°), NNE (3, 67.5°), N (4, 90°), NNW (5, 112.5°), NW (6, 135°), WNW (7, 157.5°), W (8, 180°), WSW (9, 202.5°), SW (10, 225°), SSW (11, 247.5°), S (12, 270°), SSE (13, 292.5°), SE (14, 315°), ESE (15, 337.5°).]
Each of the 16 templates represents a unique direction. The angles for these
directions are discrete values separated by 22.5◦. Given a target pixel and the
correct neighborhood of pixels, each kernel will give a scalar result by multiplying
the gray scale values of the pixels by 0, ±1, or ±2 as defined by the 11 × 11
template. This result is referred to as the template response. The template with
the greatest response for the same target pixel represents the orientation of the
vessel at that pixel.
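Expressed in software (an illustrative sketch, ours rather than the RPI implementation; `templates` is assumed to hold the sixteen 11 × 11 coefficient arrays of Figure 4.6):

```python
import numpy as np

def best_direction(image, i, j, templates):
    """Correlate the 11 x 11 neighborhood of interior target pixel (i, j)
    with each template (entries 0, +/-1, +/-2) and return the index of
    the strongest response, i.e. the vessel orientation at that pixel."""
    window = image[i - 5:i + 6, j - 5:j + 6].astype(np.int64)
    responses = [(window * t).sum() for t in templates]
    return int(np.argmax(responses))
```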
We name the hardware module for individual template matching “response”.
By comparing results from 16 different templates, we select the maximal value
and the corresponding template that gives the actual direction of the blood ves-
sel. We call this comparison module “direction”. Figures 4.7 and 4.8 show the
detailed structure of these two modules.
[Figure 4.7: Response Module. Pipeline registers feed adder trees for the +1, +2, −1, and −2 coefficient groups; the ±2 sums are shifted left by one bit, the positive and negative totals are subtracted, and the result is registered as the template response.]
[Figure 4.8: Direction Module. Registers hold the 11 × 11 window of pixels, which feeds eight response modules through an interconnection network; a tree of template comparators selects the maximal response and outputs its label (000 through 111), identifying the direction.]
Note that in Figure 4.8, instead of using 16 response modules, 8 response
modules are used because the other 8 response values can be obtained by chang-
ing the sign of the current 8 response values (Figure 4.6). We define one MPE
as one direction module and its inputs are the 11 × 11 pixels while its out-
put is the 4 bit label used to identify the direction. The FPGA board we used
for this application is the same one we use for our LPF. Table 4.6 lists all the
parameters.
Table 4.6: Board Parameters and Application Parameters for RVT Applications

Board Parameters
Atotal      19200
k           5
Wmj         {64, 64, 64, 64, 32}
Btotal      655360

Application Parameters
M × N       512 × 512
m × n       11 × 11
Wpi         12
Wpo         4
AMPE        3826
4.2.2 SWOOP Results
Table 4.7 lists all the possible values of AIF and BIF for the different
combinations of the external memory banks. AIF and BIF are the same
independent of whether the external memory bank is used as input memory or
output memory. Since we assume input memory and output memory are in-
dividually assigned and there are at least two memory banks available, we can
ignore the case with exactly one 64-bit memory and the case with only one 32-
bit memory. For all the other cases, we use SWOOP to check the values of Da,
Dml, Dmu and Db to get the maximal duplication factor D. The final maximal
duplication factor Dall is determined as discussed in Section 3.6.
Table 4.7: Enumeration of AIF and BIF for FireBird

Memory used                       AIF (slices)   BIF (bits)
1 64-bit memory                   604            16,384
2 64-bit memories                 1,031          32,768
3 64-bit memories                 1,456          49,152
4 64-bit memories                 1,835          65,536
1 32-bit memory                   564            16,384
1 64-bit + 1 32-bit memory        990            32,768
2 64-bit + 1 32-bit memory        1,397          49,152
3 64-bit + 1 32-bit memory        1,829          65,536
4 64-bit + 1 32-bit memory        2,239          81,920

Table 4.8 lists all the duplication factors for the different usages of external
memory banks. From this table, we show that using a different number of
external memory banks can change the value of the maximal duplication factor.
The final duplication factor Dall we choose is the maximal value of all the D and
in our case Dall = 3. Note here that only Da and Dmu are listed and compared
to decide the duplication factor D because the on-chip memory is large enough
to use the full row buffering method for all cases.
According to SWOOP, the RVT algorithm is area constrained. The maximal
number of MPEs we can put on the board is 3. However, according to Table 4.8,
there are several cases where the maximal duplication factor is 3 (cases 1, 2,
and 4). We select case 4 because, compared to cases 1 and 2, it uses the
smallest area for the same performance. By adjusting the weight of different
resource constraints, we can easily change our selection according to the
user's requirements.
Table 4.8: Duplication Factor for Different Usage of External Memory Banks

Case   Memory used                  AIF (slices)   BIF (bits)   Da     Dmu   D
1      2 64-bit memories            1,031          32,768       3.12   5     3
2      3 64-bit memories            1,456          49,152       3.03   10    3
3      4 64-bit memories            1,835          65,536       2.95   16    2
4      1 64-bit + 1 32-bit memory   990            32,768       3.13   5     3
5      2 64-bit + 1 32-bit memory   1,397          49,152       3.04   8     2
6      3 64-bit + 1 32-bit memory   1,829          65,536       2.95   13    2
7      4 64-bit + 1 32-bit memory   2,239          81,920       2.86   16    2

For the final design, we select one 64-bit external memory as input memory
and one 32-bit external memory as output memory. The maximum number of
MPEs we can put on the FireBird is 3.
4.2.3 Comparison and Analysis
The RVT algorithm has been designed and implemented on the FireBird using
a hand-crafted design described in VHDL [68]. Both coarse-grained and fine-
grained parallelism were investigated to ensure high performance of the final
design. Table 4.9 gives the results from both SWOOP and the manual design.
Here the percentage differences are not listed because the manual design takes
other factors into account and is not a good reference for comparison. A more
reasonable comparison is given in Table 4.10.
For the RVT algorithm, the results from SWOOP are quite different from
Table 4.9: Comparison between Automatic and Manual Results for RVT Algorithm

         Automatic   Manual
Dall     3           1
Ball     90,112      180,224
Aall     14,764      10,035
Wmj      {64, 32}    {64, 64, 32}
the manual design. According to SWOOP, we can put at most 3 MPEs on
the board, while the manual implementation puts only 1 MPE. The differences
can be explained as follows:
• In addition to the template matching, the manual design also includes a
module which interfaces with the frame grabber. This module needs extra
slices to implement. We can easily re-run SWOOP with the area altered to
accommodate this interface to get a more accurate estimate.
• For this specific hand-crafted implementation, the processing time is limited
by the image capture rate from the frame grabber. As long as the
processing is faster than the incoming data rate, there is no need to
further speed up the processing part. A modified SWOOP result is shown in
Table 4.10, where only 1 MPE is implemented on the board and the results
are very close to the manual design regarding area usage.
• The designer of the manual design used a different buffering method which
is not optimal with respect to the on-chip memory usage and redundant
external data accesses. Although manual designs are usually more efficient
than an automated tool design, this is not always true. In this application,
SWOOP uses the full row buffering method described in Section 3.4.1. The
buffering method used by the manual design is described below.
For the manual design, 5 pixel data are grouped in one memory address
and correspondingly, the processing part takes 5 windows as a unit and feeds
them into the pipeline. As shown in Figure 4.9, adjacent neighborhoods contain
ten columns of common data. Every time the windows shift 5 pixels, five new
columns of pixels will be loaded from external memory.
[Figure 4.9: Shifting the Window. An 11 × 15 pixel window surrounds the target pixel; each shift brings in five new columns of pixels just read from memory.]
A mix of BlockRAM and registers is used to store the 11 x 15 pixel neighbor-
hoods, so the data can be quickly accessed by the FPGA logic. A neighborhood
is stored in three 11 x 5 pixel sections. The idea is that to shift the neigh-
borhood, we can throw away one section, move the data from the other two
sections, which is common to the adjacent neighborhood, and read in 11 new
words from memory to fill the third section. Figure 4.10 shows how the shifting
of data works.
[Figure 4.10: Creating New Neighborhoods by Shifting Data on the FPGA. Each 11 × 15 neighborhood is stored as three 11 × 5 blocks (A1 through A3, B1 through B3); shifting to the next neighborhood discards one block, reuses the two blocks shared with the adjacent neighborhood, and fills the third with 11 new words from off-chip memory.]
By using this buffering method, the data in the other two sections can be
reused to save external memory access. However, 10 lines of pixels in the dis-
carded section need to be reloaded when the templates move to the next row.
Redundant external memory accesses cannot be avoided by using this buffer-
ing method. SWOOP uses the full row buffering method which can completely
avoid the redundant external memory accesses with even less on-chip memory
usage.
Table 4.10 uses the adjusted parameters and compares the SWOOP results
with the manual design. Instead of putting the maximal allowable MPEs into
the board, we limit the design to one MPE on the board because in this specific
case, one MPE is enough to process the data in real-time. Moreover, the extra
area used to interfacing with the frame grabber is also taken into account by
SWOOP.
Table 4.10: Comparison between Automatic and Manual Results for RVT Al-gorithm
        Automatic   Manual      Difference (%)
Dall    1           1           0%
Ball    90,112      180,224     -50%
Aall    8,091       10,035      -19.37%
Wmj     {64,32}     {64,64,32}  N/A
According to Table 4.10, the biggest difference is the on-chip memory usage. SWOOP can find an optimal data access pattern for different SWO-based applications; it efficiently uses the available on-chip memory and reduces the external memory accesses. Compared to the manual design, SWOOP gives a better possible implementation for the RVT algorithm. If we follow the SWOOP results, we can implement this algorithm using less on-chip memory while at the same time reducing the external memory accesses.
The area difference between SWOOP and the manual design can be explained as follows:

• The manual design uses a different buffering method which requires many registers and more complicated control circuitry. These registers and control circuits need extra area.

• Because of the registers and complicated control, routing is more complicated and needs more area too.

• The manual design uses one more 64-bit memory bank, and the memory interface to this extra bank needs more slices.
The memory allocation is also different between SWOOP and the manual design. The manual design uses two 64-bit memory banks alternately to store the data from the frame grabber and to load the data onto the FPGA chip. This saves some control circuitry, but at the cost of wasting a portion of the memory bandwidth.
4.2.4 Summary
From this experiment, we can see that a real implementation needs to take not only the three constraints but also other factors into account. In this case, we need to modify the input parameters to SWOOP to get a more accurate estimate. Moreover, we can easily change the weights of different resources to get a near optimal design with the same performance according to the designer's requirements.
Another interesting result from the RVT algorithm is that hand-crafted designs are not necessarily the most efficient. In some cases, SWOOP can give the designer a guide as to how to fully utilize the available resources, especially the on-chip memory resources.
4.3 Particle Image Velocimetry Algorithm
Particle image velocimetry is a useful tool in fluid dynamics due to its non-intrusive, concurrent estimation of fluid movement. In a conventional PIV system [72, 73], small particles are added to the fluid and their movements are measured by comparing pairs of images of the flow taken in rapid succession. The local fluid velocity is estimated by dividing the images into small interrogation areas and cross-correlating the areas recorded in the two frames. Such systems are called double frame/single exposure systems. Figure 4.11 shows a typical PIV system [74]. There are two different SWOs in a PIV application. One is the whole image SWO, which selects the interrogation area in raster scan order. Unlike the SWO applications mentioned before, instead of moving one pixel to the next position, the window skips a fixed number of pixels to the next position; the number of pixels skipped depends on user requirements. The other SWO is the cross-correlation computation for one interrogation area. The cross-correlation part of the algorithm is very computationally intensive, and it limits PIV usage almost exclusively to off-line processing, analysis and modelling. For this application, we ignore the whole image SWO because, in most cases, the overlap of two neighboring interrogation areas is 50% or less, which means the data reuse rate is much lower than for the cross-correlation SWO. Moreover, we are more interested in the cross-correlation SWO because it is performed for every interrogation area, much more frequently than the whole image SWO. We will implement the cross-correlation part on an FPGA board using the method presented in this research to determine the speedup. The sizes of the interrogation areas are 40 × 40 and 32 × 32 for the two consecutive images.
Figure 4.11: PIV System Overview (From Dantec Dynamics)
Figure 4.12 shows an example of how to get the values of the cross-correlation plane. By moving the smaller area (Area B with n × n pixels) throughout the larger area (Area A with m × m pixels), we get (m − n + 1) × (m − n + 1) values which form the cross-correlation plane. The position of the peak value on the cross-correlation plane indicates the direction and the distance, in pixels, that the particles have moved. A sketch of this computation is given after the figure.
Figure 4.12: Cross Correlation Plane (Area B slides over Area A; the labeled example shifts are (x = −1, y = 1), (x = 0, y = 0), (x = 1, y = 1), (x = −1, y = −1), and (x = 1, y = 0), plotted on a cross-correlation plane with x and y axes running from −1 to 1)
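The following Python sketch mirrors this computation: Area B slides over Area A, producing one correlation value per shift, and the peak location is read off the resulting plane. The random arrays stand in for real interrogation-area pixels and are assumptions for illustration only.

```python
# Sketch of the PIV cross-correlation plane computation described above.
import numpy as np

def cross_correlation_plane(area_a, area_b):
    m, n = area_a.shape[0], area_b.shape[0]
    size = m - n + 1                       # plane is (m-n+1) x (m-n+1)
    plane = np.empty((size, size), dtype=np.int64)
    for dy in range(size):
        for dx in range(size):
            window = area_a[dy:dy + n, dx:dx + n]
            plane[dy, dx] = np.sum(window * area_b)
    return plane

rng = np.random.default_rng(0)
a = rng.integers(0, 256, (40, 40))         # 8-bit pixels, Area A
b = rng.integers(0, 256, (32, 32))         # Area B
plane = cross_correlation_plane(a, b)      # 9 x 9 plane
peak = np.unravel_index(np.argmax(plane), plane.shape)
print(plane.shape, peak)
```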
Figure 4.13 shows an example of what the velocity plane looks like after the cross-correlation of the different interrogation areas. The velocity information for each interrogation area comes from the location of the best match of the two sub-images (the peak value of the cross-correlation).

Similar to the RVT algorithm, we first estimate the performance using our method and then compare it to the hand-written design results [69, 75].
Figure 4.13: Velocity Plane (one velocity vector per interrogation area)
4.3.1 Algorithm and Parameters
The core of the PIV algorithm is 2-D cross-correlation. Figure 4.12 shows how we calculate the cross-correlation plane and then take the peak value to get the estimate of the velocity. In this example, the size of Area A is 4 × 4 and the size of Area B is 2 × 2. Area B moves pixel by pixel throughout Area A, which is similar to moving a window throughout an image. In our PIV application, Area A is 40 × 40 and Area B is 32 × 32; therefore the size of the cross-correlation plane is 9 × 9. The input images are 8-bit grey scale images. To meet the accuracy requirement for the final velocity value, we keep the full accumulated bit width for the correlation plane, which is 26 bits for each value: the 16-bit products of two 8-bit pixels are accumulated over 32 × 32 = 1024 terms, requiring 16 + log2(1024) = 26 bits. The final velocity value uses two 9-bit values, one for the horizontal direction of the velocity and one for the vertical direction. We take Area B as a window and Area A as an image to define the application parameters. One big difference in this application compared to previous applications is that the size of the window is much larger. Instead of putting the window values in registers, we use on-chip memory to store the window.
Some modifications are needed before we define the MPE. According to our calculation, one multiplier with two 8-bit inputs and one 16-bit output needs 47 slices. Therefore it is impossible to implement 32 × 32 = 1024 multipliers in parallel. Instead, we use 32 multipliers and several adder stages as part of one MPE. Figure 4.14 shows the detailed structure of one MPE [75]. The narrow rectangles indicate the registers that divide the processing into several pipeline stages, and the digits denote the bit width at each stage; a behavioral sketch follows the figure.
Figure 4.14: Pipelined Structure (32 multipliers in parallel take 8-bit inputs and produce 16-bit products; a 5-stage accumulation for each line grows the bit width from 16 to 21; accumulation over the whole interrogation area produces a 26-bit value; a comparator tracks the peak value)
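As a behavioral sketch of the datapath in Figure 4.14 (arithmetic only; the pipeline registers are not modeled), the Python below forms 32 parallel products for one 32-pixel row per step, reduces them through a 5-stage adder tree (32 → 16 → 8 → 4 → 2 → 1), and accumulates the 32 row sums into one correlation value.

```python
# Behavioral model of one MPE's arithmetic (not the VHDL implementation).
def mpe_row(pixels_a, pixels_b):
    assert len(pixels_a) == len(pixels_b) == 32
    stage = [pa * pb for pa, pb in zip(pixels_a, pixels_b)]  # 32 multipliers
    while len(stage) > 1:                                    # 5 adder stages
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

def mpe_correlation(win_a, win_b):
    # 32 rows -> 32 clock cycles per correlation value in the real pipeline
    return sum(mpe_row(ra, rb) for ra, rb in zip(win_a, win_b))

a = [[1] * 32 for _ in range(32)]
b = [[2] * 32 for _ in range(32)]
print(mpe_correlation(a, b))   # 2048
```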
Another modification we need to make is to the output. As mentioned before, the location of the peak value of the cross-correlation gives the velocity information of the particle movement. For an interrogation area pair with 40 × 40 and 32 × 32 pixels, the output is the position of the peak value. To increase the accuracy of the velocity estimate, a technique called sub-pixel interpolation is used in this application. We use the parabolic peak fit (Equation (4.2)) because it is more suitable for hardware implementation than the Gaussian peak fit [73].
\[
\begin{aligned}
p_x &= x + \frac{R(x-1,y) - R(x+1,y)}{2R(x-1,y) - 4R(x,y) + 2R(x+1,y)}\\[4pt]
p_y &= y + \frac{R(x,y-1) - R(x,y+1)}{2R(x,y-1) - 4R(x,y) + 2R(x,y+1)}
\end{aligned}
\tag{4.2}
\]
Figure 4.15 shows an example of how sub-pixel interpolation works. In this figure, the peak value appears at location (0, 0). When we apply sub-pixel interpolation, the adjusted location becomes (-0.1, -0.4) after taking the neighboring cross-correlation values into account. A sketch of this computation follows the figure.
Figure 4.15: An Example of Sub-pixel Interpolation (cross-correlation values over X and Y, in pixels, before and after interpolation)
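A direct transcription of Equation (4.2) into Python is shown below as a sketch; the handling of peaks on the border of the plane is an assumption added here, not something specified in the text.

```python
# Parabolic peak fit (Equation (4.2)): R is the cross-correlation plane
# (indexed R[y][x]) and (x, y) the integer peak location.
def subpixel_peak(R, x, y):
    # Assumed simplification: leave border peaks unadjusted.
    if not (0 < x < len(R[0]) - 1 and 0 < y < len(R) - 1):
        return float(x), float(y)
    px = x + (R[y][x-1] - R[y][x+1]) / (2*R[y][x-1] - 4*R[y][x] + 2*R[y][x+1])
    py = y + (R[y-1][x] - R[y+1][x]) / (2*R[y-1][x] - 4*R[y][x] + 2*R[y+1][x])
    return px, py

# e.g. a peak at (4, 4) with an asymmetric neighborhood shifts toward the
# larger neighbor:
R = [[0] * 9 for _ in range(9)]
R[4][4], R[4][3], R[4][5] = 100, 80, 60
R[3][4], R[5][4] = 70, 70
print(subpixel_peak(R, 4, 4))   # (~3.83, 4.0)
```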
Clearly, it is not necessary to store the entire cross-correlation plane in the external memory. Instead, for each cross-correlation plane, we only need to store the peak value position after sub-pixel interpolation. That means for every (40 − 32 + 1) × (40 − 32 + 1) windows processed, only two 9-bit results need to be stored to the off-chip memory. However, we do need extra on-chip memory to store the values of the cross-correlation plane before we get the final estimate of the velocity.
To summarize, the modifications we made before applying this application to SWOOP are as follows:

• A window needs to be loaded for each interrogation area. Loading the window from external memory each time introduces an extra input memory bandwidth requirement. Moreover, as mentioned before, we only take the cross-correlation SWO into account; the SWO for the whole image is ignored.

• Extra on-chip memory is needed to store the values of the window and the cross-correlation plane. In this application, the window size is 32 × 32 with 8 bits per value and the cross-correlation plane size is 9 × 9 with 26 bits per value. Therefore, we need 32 × 32 × 8 bits of on-chip memory to store the window and 9 × 9 × 26 bits to store the correlation plane.
• For each value on the correlation plane, 32 clock cycles are needed to compute the correlation. Therefore, to calculate the whole cross-correlation plane, 9 × 9 × 32 clock cycles are needed. If we assume there is no redundant data loading from the external memory banks, there are in total 40 × 40 + 32 × 32 pixels to load for one interrogation area. Averaging the data loading over the processing time, the input memory bandwidth requirement is (40 × 40 + 32 × 32) × 8/(9 × 9 × 32), or about 8 bits per cycle (see the sketch after this list).

• The output memory bandwidth is averaged over the entire cross-correlation plane too. For every 32 × 9 × 9 clock cycles, two 9-bit results need to be stored.
• The area used to implement sub-pixel interpolation is considered part of one MPE. The cross-correlation part requires 2092 slices and sub-pixel interpolation needs 559 slices, so the total number of slices needed for one MPE is 2092 + 559 = 2651.
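The bandwidth arithmetic in the list above can be checked with a few lines of Python; this is a sketch of the calculation, not SWOOP output.

```python
# Back-of-the-envelope bandwidth check for the modified PIV parameters.
M = N = 40          # Area A
m = n = 32          # Area B / window
WPI = 8             # input pixel width (bits)
WPO = 9             # width of each output component (bits)

cycles = (M - m + 1) * (N - n + 1) * m    # 9 * 9 * 32 = 2592 per area
in_bits = (M * N + m * n) * WPI           # load A and B once, no redundancy
out_bits = 2 * WPO                        # peak position (p_x, p_y)

print(in_bits / cycles)    # ~8.1 input bits per cycle
print(out_bits / cycles)   # ~0.007 output bits per cycle
```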
After taking all these modifications into account, we obtain the modified
parameters listed in Table 4.11.
4.3.2 SWOOP Results
Similar to the RVT algorithm, all the possible values of AIF and BIF are enumerated for the different possible combinations of the external memory banks. Table 4.12 lists the duplication factors for different usages of the external memory banks.
Table 4.11: Board Parameters and Application Parameters for PIV Applications

Board Parameters:
  Atotal   19,200
  k        5
  Wmj      {64, 64, 64, 64, 32}
  Btotal   655,360

Application Parameters:
  M × N    40 × 40
  m × n    32 × 32
  Wpi      8
  Wpo      9
  AMPE     2,651
For all of these combinations, the PIV algorithm is always area constrained; therefore, only Da and Dmu are listed.

According to Table 4.12, different usages of the external memory banks do not change the duplication factor: the maximal number of MPEs is always 4. We select case 4 as our final implementation because it uses the least area, as the selection sketch below illustrates.
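The sketch below encodes the selection rule implied by Table 4.12: since every case reaches the same duplication factor D = 4, the case with the smallest interface area AIF wins.

```python
# Tuples are (case number, AIF in slices, BIF in bits, D), from Table 4.12.
cases = [
    (1, 1031, 32768, 4), (2, 1456, 49152, 4), (3, 1835, 65536, 4),
    (4,  990, 32768, 4), (5, 1397, 49152, 4), (6, 1829, 65532, 4),
    (7, 2239, 81920, 4),
]
best_d = max(c[3] for c in cases)
chosen = min((c for c in cases if c[3] == best_d), key=lambda c: c[1])
print(chosen)   # (4, 990, 32768, 4): case 4, least area at maximal D
```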
4.3.3 Comparison and Analysis
The PIV algorithm has been designed and implemented by hand for the Annapolis FireBird using the VHDL language [69]. Table 4.13 compares the results from SWOOP and the manual design.
Interestingly, if we follow the SWOOP design, we can double the performance of the manual design with only about half of the on-chip memory. The fact that the SWOOP implementation is better than the manual design relies on the following factors:
Table 4.12: Duplication Factor for Different Usage of External Memory Banks

Case No.  Memory used                        AIF (slices)  BIF (bits)  Da    Dmu  D
1         2 64-bit memory banks              1031          32,768      4.50  8    4
2         3 64-bit memory banks              1456          49,152      4.37  16   4
3         4 64-bit memory banks              1835          65,536      4.25  24   4
4         1 64-bit + 1 32-bit memory bank     990          32,768      4.51  8    4
5         2 64-bit + 1 32-bit memory banks   1397          49,152      4.38  16   4
6         3 64-bit + 1 32-bit memory banks   1829          65,532      4.25  24   4
7         4 64-bit + 1 32-bit memory banks   2239          81,920      4.12  32   4
Table 4.13: Comparison between Automatic and Manual Results for PIV Algorithm

        Automatic   Manual
Dall    4           2
Ball    71,738      137,332
Aall    17,148      10,109
Wmj     {64,32}     {64,64,32}
• The input data requirement has been greatly reduced because of the selection of the MPE. As mentioned above, we use 32 multipliers and an adder tree to do the cross-correlation. This means that for every 32 clock cycles, one window is processed. When the window moves to the next position, only one new pixel is needed if the full row buffering method is used, which keeps the MPE busy for another 32 clock cycles. Therefore, the input memory bandwidth requirement is very small (one pixel per 32 clock cycles). The manual design did not take advantage of this, thus introducing a lot of redundant data accesses.

• The output data bandwidth requirement is very small too. To get the result for one interrogation area, we need all the cross-correlation results for the entire cross-correlation plane, which, in our case, is of size 9 × 9. As mentioned before, one cross-correlation result needs 32 cycles, which means that for each result of one interrogation area, one MPE computes for 32 × 9 × 9 cycles. For each interrogation area computation, only two 9-bit results need to be stored. The memory bandwidth requirement for storing the output data is negligible for this application.
The goal of this manual design is to meet the real-time requirements using the available board. According to [69], the manual design can process 15 pairs of images, which is good enough for the real-time system. However, according to our analysis, there are several places that can be further optimized to improve the performance.
• The PIV manual design uses two buffers to store the entire Area A (40 × 40) and Area B (32 × 32), while SWOOP uses the line buffering method, which stores 32 lines of Area A (32 × 40) and the entire Area B (32 × 32). Without degrading performance, the SWOOP buffering method saves some on-chip memory.

• To overlap the loading and processing time, the PIV manual design uses two copies of the buffers, which is not necessary because, in this application, loading data takes much less time than processing.

• Memory port allocation in the manual design is not optimal. For the reasons above (data loading is no longer a bottleneck), it is not necessary to allocate two 64-bit input memory banks; one 64-bit memory bank is enough.
Another big difference between the design from SWOOP and the manual design is the level of parallelism. In the manual design, two copies of the MPE process two pairs of interrogation areas separately. In SWOOP, four copies of one MPE process the same pair of interrogation areas. The advantage of processing the same pair of interrogation areas is that it minimizes the on-chip memory usage by loading one interrogation area at a time.
4.3.4 Summary
The PIV application is not a strict SWO application compared to the previous experiments; therefore, some modifications are needed to fit the PIV algorithm into our SWOOP model. Moreover, the window size of PIV is very large (32 × 32) and we can no longer put everything in parallel for one MPE. We reduced the size of one MPE at the cost of a longer processing time. This shows that SWOOP can be quite flexible for complicated window processing. Users can always further optimize the MPE and use SWOOP to estimate how much extra performance they can gain from this optimization.

Again, in this application, the SWOOP result is better than the manual design due to its optimized memory organization. Our experiment shows that if we follow the method provided by SWOOP, we can at least double the performance of the PIV application.
4.4 Summary
The four algorithms presented in this chapter were selected because they not only show the wide application area of SWOs, but also demonstrate how the three upper bounds used by SWOOP can affect the achievable performance. Most SWO applications fall in the range of window sizes (from 3 × 3 to 32 × 32) covered here. The analysis of these four algorithms can be used as a starting point for other SWO applications with different window sizes.
For the HPF and LPF algorithms, SWOOP gives relatively accurate estimates of resource usage and maximal performance. The SWOOP results can be used as a guideline for implementation or further optimization.
For the RVT and PIV algorithms, the SWOOP results are even better than the manual designs due to the optimized memory architecture. These two examples show that SWOOP can give an optimized block diagram according to different resource constraints while at the same time being applicable to different applications.
Currently, some of the modifications (for RVT and PIV) are manually determined. A more sophisticated tool could handle these modifications automatically by adding or adjusting parameters so that it can deal with more complicated applications.
In the next chapter, we will wrap up this dissertation with the conclusions
and future work.
Chapter 5
CONCLUSIONS
In this dissertation, we presented a new tool, SWOOP, for automatically implementing SWOs on COTS FPGA boards. Most currently available high-level synthesis tools target converting high-level languages into synthesizable VHDL/Verilog code; the final implementations are usually not very efficient in terms of speed and resource utilization. SWOOP targets only SWO-based applications, and thus can produce a near optimal design. Currently, SWOOP cannot generate VHDL or Verilog code automatically, but it can analyze three different upper bounds according to different constraints: area, memory bandwidth and on-chip memory size. SWOOP selects the tightest upper bound to determine the maximal possible performance. A block diagram of the design and a near optimal memory hierarchy for each specific application are also given by SWOOP at the same time.
Four different SWO applications in different areas are used to verify the correctness of SWOOP. The applications are selected such that the sizes of the windows vary from 3 × 3 up to 32 × 32. This range of window sizes covers most SWO applications and can be used as a starting point for more complicated problems. Moreover, these four SWO applications have shown designs that are constrained by each possible bound: area, memory bandwidth and on-chip memory.
The first two applications, the 3 × 3 High-Pass Filter and the 5 × 5 Low-Pass Filter, are very commonly used for image pre-processing. Because the size of the windows is small, these types of SWO applications are usually memory bandwidth constrained or on-chip memory size constrained. The SWOOP results are very close to the manual design results: the resource usage is accurately estimated and a block diagram with a near optimal memory structure is obtained by running SWOOP.
The third application, RVT, is a typical area constrained application because its template matching is very complicated and requires a lot of area for parallel computing. Interestingly, the results from SWOOP are better than the manual results with respect to performance and resource utilization.
The last application, PIV, is an SWO application with a very large window size (32 × 32). For applications with large windows, some modifications are needed to fit the SWOOP model. After the adjustments, SWOOP again gives a design with better performance than the manual design.
For all the above applications, SWOOP can quickly estimate the near optimal performance based on the board parameters and application parameters. For the RVT and PIV applications, the suggested implementations derived by SWOOP are even better than the hand-crafted HDL designs because of their near optimal memory architecture and intelligent buffering method. Compared to most currently available automated tools, SWOOP gives very good results for SWO applications.
5.1 Future Work
There are a number of directions in which this research can be continued:

• SWOOP is good at estimating the performance and resource usage for simple window operations. For more complicated applications, some manual intervention is necessary to make the application fit into the SWOOP model. One of our future directions is to make our tool more adaptive. We plan to have adjustable parameters that can be changed to modify SWOOP for different applications; examples include a shared memory port for both read and write when memory bandwidth is very limited, and breaking a parallel MPE into a sequential MPE to save area. Moreover, we could put weights on the different constraints and let the user decide which is the most important parameter for the specific application. For example, if the user is more concerned about memory bandwidth, we can put a small weight on designs with a high demand for memory bandwidth. The final decision can be made after considering all these weights, and the user can select the optimal implementation with great flexibility.
• The Packing Factor plays an important role in memory port allocation. In some cases, improper data packing can waste area and memory bandwidth resources. Currently, manual intervention is needed to make the optimal decision. By building a model for the optimal packing factor, we can integrate it into SWOOP and let SWOOP automatically select the packing factor that maximally utilizes the available resources.
• SWOOP can estimate the maximal performance much faster than most HLS tools, but one of the differences is that HLS tools can generate HDL designs while SWOOP can only give a block diagram; a designer still has to write the HDL code. Another future direction is, instead of having the user write the HDL code from scratch, to build an HDL-based library beforehand. Since we already know the block diagram, we can build parameterized modules in an HDL so that once the block diagram is decided by SWOOP, the user can select modules from the library that match the block diagram.
• SWOOP cannot yet estimate the control logic area because control logic area is very complicated to estimate. There is related research on how to estimate the area of control logic; if we can integrate these results into SWOOP, the final estimate will be more accurate.
5.2 Conclusions
In summary, we have presented SWOOP, an automated tool which takes the FPGA board parameters and SWO application parameters as inputs and then generates a block diagram of the FPGA implementation with near optimal performance and resource utilization. SWOOP gives a very accurate estimate according to three different constraints: area, memory bandwidth and on-chip memory. It maximally uses the available resources and estimates the achievable performance according to the tightest constraint. Four different experiments are presented in this dissertation using both SWOOP and manual design. The results show that SWOOP can obtain better performance than manual design in some cases.
Bibliography
[1] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting,
“A CAD Suite for High-Performance FPGA Design,” Seventh Annual IEEE
Symposium on Field-Programmable Custom Computing Machines, pp. 12–
24, April 1999.
[2] “JHDL: FPGA CAD Tools,” http://www.jhdl.org, Last accessed Dec 30,
2004.
[3] “Handel-C, Software-Compiled System Design,”
http://www.celoxica.com/products/c_to_fpga.asp, Last accessed Oct
29, 2006.
[4] B. A. Draper, J. R. Beveridge, A. P. W. Bohm, C. Ross, and M. Chawathe,
“Accelerated Image Processing On FPGAs,” IEEE Transactions on Image
Processing, vol. 12, no. 12, pp. 1543–1551, December 2003.
[5] A. Benedetti, A. Prati, and N. Scarabottolo, “Image Convolution on FP-
GAs: the Implementation of a Multi-FPGA FIFO Structure,” Proceedings
of the 24th Euromicro Conference, pp. 123–130, 1998.
[6] A. J. Elbirt and C. Paar, “An FPGA Implementation and Performance
Evaluation of the Serpent Block Cipher,” in FPGA ’00: Proceedings of the
2000 ACM/SIGDA eighth international symposium on Field programmable
gate arrays. ACM Press, 2000, pp. 33–40.
[7] M. Leeser, S. Miller, and H. Yu, “Smart Camera Based on Reconfigurable
Hardware Enables Diverse Real-time Applications,” FCCM’04, 2004.
[8] Annapolis Micro Systems Inc., WildStar™ Reference Manual, revision 3.0, 2000.
[9] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. USA: Kluwer Academic Publishers, February 1999.
[10] “Virtex™-E 1.8V Field Programmable Gate Arrays,”
http://direct.xilinx.com/bvdocs/publications/ds022.pdf, Last accessed
Jan 12, 2004.
[11] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative
Approach, 3rd ed. Morgan Kaufmann, 2002.
[12] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Addison
Wesley, 1993.
[13] C. Thibeault and G. Begin, “A Scan-Based Configurable, Programmable
and Scalable Architecture for Sliding Window-Based Operations,” IEEE
Transactions on Computers, vol. 48, no. 6, pp. 615–627, June 1999.
[14] J. A. Kahle, M. N. Day, et al., “Introduction to the Cell multiprocessor,”
IBM Journal of Research and Development, vol. 49, no. 4/5, 2005.
[15] Y.-H. Yeh and C.-Y. Kee, “Cost-Effective VLSI Architectures and Buffer
Size Optimization for Full-Search Block Matching Algorithms,” IEEE
Transactions on Very Large Scale Integration Systems, vol. 7, no. 3, pp.
345–358, September 1999.
[16] M. Kim, I. Hwang, and S.-I. Chae, “A Fast VLSI Architecture for Full-
Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264,”
in Asia and South Pacific Design Automation Conference, Jan 2005, pp.
631–634.
[17] F. Vahid and T. Givargis, Embedded System Design: A Unified Hard-
ware/Software Introduction. John Wiley & Sons, Inc., 2002.
[18] H. S. Stone, High-Performance Computer Architecture, 3rd ed. Addison-
Wesley, 1993.
[19] D. A. Patterson and J. L. Hennessy, Computer Organization and Design:
The Hardware/Software Interface, 2nd ed. Morgan Kaufmann, 1998.
[20] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C.-W. Tseng, “An
Overview of the SUIF Compiler for Scalable Parallel Machines,” Proceed-
ings of the 7th SIAM Conference on Parallel Processing for Scientific Com-
puting, 1995.
[21] U. Banerjee, R. Eigenmann, and A. Nicolau, “Automatic Program Paral-
lelization,” Proceedings of the IEEE, vol. 81, no. 2, pp. 211–243, Feb 1993.
[22] M. J. Wolfe, High-Performance Compilers for Parallel Computing. Addi-
son Wesley, 1996.
[23] “Total Solutions for Embedded Development,”
http://www.ghs.com/products/compiler.html, Last accessed December
14, 2004.
[24] A. S. Huang and J. P. Shen, “A Limit Study of Local Memory Require-
ments Using Value Reuse Profiles,” Proceedings of MICRO-28, pp. 71–81,
December 1995.
[25] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August,
“Compiler Optimization-Space Exploration,” Proceeding of the Inter-
national Symposium on Code Generation and Optimization: Feedback-
Directed and Runtime Optimization, pp. 204–215, 2003.
[26] Y. Ahmed, F. Jawed, S. Zia, and M. S. Aga, “Real-Time Implementation
of Adaptive Channel Equalization Algorithms on TMS320C6x DSP Pro-
cessors,” E-Tech, pp. 101–108, July 2004.
[27] J. H. Lee, J. H. Moon, K. L. Heo, M. H. Sunwoo, S. K. Oh, and I. H.
Kim, “Implementation of Application-Specific DSP for OFDM Systems,”
Proceedings of the 2004 International Symposium on Circuits and Systems,
vol. 3, pp. 665–668, May 2004.
[28] A. R. Silva and V. I. Ponomaryov, “Color Imaging by Using of DSP Imple-
mentation of Different Filters,” 14th International Conference on Electron-
ics, Communications and Computers, pp. 293–298, Feb 2004.
[29] L.-H. Chen, O. T.-C. Chen, T.-Y. Wang, and C.-L. Wang, “An Adaptive
DSP Processor for High-Efficiency Computing MPEG-4 Video Encoder,”
Proceedings of the 2004 International Symposium on Circuits and Systems,
vol. 2, pp. 157–160, May 2004.
[30] J. M. Guerrero, L. G. de Vicuna, J. Matas, J. Miret, and M. Castilla, “A
High-Performance DSP-Controller for Parallel Operation of Online UPS
Systems,” Applied Power Electronics Conference and Exposition, vol. 1,
pp. 463–469, 2004.
[31] “Texas Instruments,” http://www.ti.com, Last accessed Dec 20, 2004.
[32] “Motorola,” http://www.motorola.com, Last accessed Dec 20, 2004.
[33] “Agere Systems,” http://www.agere.com, Last accessed Dec 20, 2004.
[34] “Analog Devices,” http://www.analog.com, Last accessed Dec 20, 2004.
[35] “Programmable DSP chips and their software,”
http://www.bdti.com/faq/3.htm, Last accessed Dec 20, 2004.
[36] Z. Guo, W. Najjar, F. Vahid, and K. Vissers, “A Quantitative Analysis of
the Speedup Factors of FPGAs over Processors,” Proceeding of the 2004
ACM/SIGDA 12th International Symposium on Field Programmable Gate
Arrays, pp. 162–170, February 2004.
[37] G. D. Micheli, Synthesis and Optimization of Digital Circuits. McGraw-
Hill, Inc., 1994.
[38] “SystemC Community,” http://www.systemc.org, Last accessed Dec 14,
2004.
[39] S. Gupta, “Coordinated Coarse-Grain and Fine-Grain Optimizations for
High-Level Synthesis,” Ph.D. dissertation, University of California, Irvine,
School of Information and Computer Science, June 2003.
[40] “SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital
Circuits,” http://www.cecs.uci.edu/~sumitg/, Last accessed Dec 15, 2004.
[41] Altera, ASIC to FPGA Design Methodology & Guidelines, Application Note
311, Ver. 1.0, July 2003.
[42] “Cadence,” http://www.cadence.com, Last accessed Dec 24, 2004.
[43] “Synopsys,” http://www.synopsys.com, Last accessed Dec 24, 2004.
[44] P. R. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulka-
rni, A. Vandecappelle, and P. G. Kjeldsberg, “Data and Memory Optimiza-
tion Techniques for Embedded Systems,” ACM Transactions on Design
Automation for Embedded Systems (TODAES), vol. 6, no. 2, pp. 149–206,
April 2001.
[45] P. R. Panda, N. D. Dutt, and A. Nicolau, “Memory Data Organization for
Improved Cache Performance in Embedded Processor Applications,” ACM
Transactions on Design Automation for Embedded Systems (TODAES),
vol. 2, no. 4, pp. 384–409, April 1997.
[46] M. A. Miranda, F. V. M. Catthoor, M. Janssen, and H. J. D. Man, “High-
Level Address Optimizations and Synthesis Techniques for Data-Transfer-
Intensive Applications,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 6, no. 4, pp. 677–686, December 1998.
[47] P. R. Panda, N. D. Dutt, A. Nicolau, F. Catthoor, A. Vandecappelle,
E. Brockmeyer, C. Kulkarni, and E. D. Greef, “Data Memory Organiza-
tion and Optimizations in Application-Specific Systems,” IEEE Design &
Test of Computers, vol. 18, pp. 56–68, May 2001.
[48] M. Gokhale, J. Stone, and J. Arnold, “Stream-Oriented FPGA Computing
in the Streams-C High Level Language,” FCCM’00, pp. 49–56, April 2000.
[49] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, M. Haldar,
P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, and M. Walk-
den, “MATCH: A Matlab Compiler for Configurable Computing Systems,”
Northwestern University, Tech. Rep. CPDC-TR-9908-013, 1999.
[50] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, M. Haldar,
P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, M. Walkden,
and D. Zaretsky, “A Matlab Compiler for Distributed, Heterogeneous, Re-
configurable Computing Systems,” FCCM’00, pp. 39–48, April 2000.
[51] “AccelChip,” http://www.accelchip.com, Last accessed Jan 31, 2005.
[52] K. Bondalapati, P. C. Diniz, P. Duncan, J. Granacki, M. W. Hall, R. Jain,
and H. Ziegler, “DEFACTO: Design Environment For Adaptive Computing
TechnOlogy,” IPPS/SPDP Workshops, pp. 570–578, April 1999.
[53] B. So, M. W. Hall, and P. C. Diniz, “A Compiler Approach to Design Space
Exploration in FPGA-Based Systems,” Proceedings of the ACM Conference
on Programming Language Design and Implementation, pp. 165–176, June
2002.
[54] P. Kollig, B. M. Al-Hashimi, and K. M. Abbott, “FPGA Implementation
of High Performance FIR Filters,” Proceedings of 1997 IEEE International
Symposium on Circuits and Systems (ISCAS’97), vol. 14, pp. 2240–2243,
June 1997.
[55] R. H. Turner and R. F. Woods, “Highly Efficient, Limited Range Multipliers
for LUT-Based FPGA Architectures,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 12, no. 10, pp. 1113–1117, October
2004.
[56] M. Gokhale and J. Stone, “Automatic Allocation of Arrays to Memories
in FPGA Processors with Multiple Memory Banks,” FCCM’99, pp. 63–69,
April 1999.
[57] M. Weinhardt and W. Luk, “Memory Access Optimization and RAM
Inference for Pipeline Vectorization,” in Field-Programmable Logic
and Applications, P. Lysaght, J. Irvine, and R. W. Hartenstein,
Eds. Springer-Verlag, Berlin, 1999, pp. 61–70. [Online]. Available:
citeseer.ist.psu.edu/weinhardt99memory.html
[58] ——, “Memory Access Optimization for Reconfigurable Systems,” IEE Pro-
ceedings Computers and Digital Techniques, vol. 148, no. 3, pp. 105–112,
May 2001.
[59] P. Andersson and K. Kuchcinski, “Automatic Local Memory Architecture
Generation for Data Reuse in Custom Data Paths,” Proceeding of Engi-
neering of Reconfigurable System and Algorithm, 2004.
[60] J. Park and P. C. Diniz, “Synthesis and Estimation of Memory Interfaces for
FPGA-based Reconfigurable Computing Engines,” Proc. of the 2003 IEEE
Symposium on FPGAs for Custom Computing Machines (FCCM’03), pp.
297–299, April 2003.
[61] T.-L. Lee and N. W. Bergmann, “An Interface Methodology for Retar-
gettable FPGA Peripherals,” Engineering of Reconfigurable Systems and
Algorithms, pp. 167–173, July 2003.
[62] X. Liang and J. S.-N. Jean, “Mapping of Generalized Template Matching
onto Reconfigurable Computers,” IEEE Transactions on VLSI Systems,
vol. 11, no. 3, pp. 485–498, 2003.
[63] ——, “Memory Access Pattern Enumeration in GTM Mapping on Recon-
figurable Computers,” The International Conference on Engineering of Re-
configurable Systems Algorithms, pp. 8–14, June 2001, Las Vegas, Nevada,
USA.
[64] X. Liang and J. Jean, “Memory Access Scheduling and Loop Pipelining,”
IEEE Transactions on Very Large Scale Integration(VLSI) Systems, vol. 11,
no. 3, pp. 485–498, June 2003.
[65] X. Liang and Q. Malluhi, “Combinatorial Optimization in Mapping Gen-
eralized Template Matching onto Reconfigurable Computers,” Proceedings
of International Conference on Engineering of Reconfigurable Systems and
Algorithms, pp. 223–226, June 2006.
[66] X. Liang, J. Jean, and K. Tomko, “Data Buffering and Allocation in Map-
ping Generalized Template Matching on Reconfigurable Systems,” The
Journal of Supercomputing, Special Issue on Engineering of Reconfigurable
Hardware/Software Objects, pp. 77–91, 2001.
[67] C. Menn, O. Bringmann, and W. Rosenstiel, “Controller Estimation for
FPGA Target Architectures During High-Level Synthesis,” Proceedings of
the 15th international symposium on System Synthesis, pp. 56–61, 2002.
[68] S. Miller, “Enabling a Real-time Solution to Retinal Vascular Tracing Using
FPGAs,” Master’s thesis, Northeastern University, April 2004.
[69] H. Yu, M. Leeser, G. Tadmor, and S. Siegel, “Real-time Particle Image
Velocimetry for Feedback Loops Using FPGA Implementation,” Journal of Aerospace Computing, Information, and Communication, no. 2, pp. 52–62, 2006.
Annapolis Micro Systems Inc., “FireBird™ Hardware Reference Manual,”
in www.annapmicro.com, 2000.
[71] A. Can, H. Shen, J. N. Turner, H. L. Tanenbaum, and B. Roysam, “Rapid
Automated Tracing and Feature Extraction from Retinal Fundus Images
Using Direct Exploratory Algorithms,” IEEE Transactions on Information
Technology in Biomedicine, vol. 3, no. 1, March 1999.
[72] R. Adrian, “Particle-Imaging Technique for Experimental Fluid Mechan-
ics,” Annual Reviews in Fluid Mechanics, vol. 23, pp. 261–304, 1991.
[73] M. Raffel, C. Willert, and J. Kompenehans, Particle Image Velocimetry.
Berlin, Germany: Springer-Verlag, 1998.
[74] Dantec Dynamics A/S, “Principle of Particle Image Velocimetry,”
http://www.dantecdynamics.com/piv/Princip/Index.html, Last accessed
on May 09, 2006.
[75] H. Yu and M. Leeser, “Automatic Sliding Window Operation Optimization for FPGA-Based Computing Boards,” 14th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM’06), pp. 76–88, April 2006.
Appendix A
Glossary
SWO Sliding Window Operation
FPGA Field Programmable Gate Array
COTS Commercial Off-The-Shelf
SWOOP Sliding Window Operation Optimization
CCD Charge-Coupled Device
ASIC Application Specific Integrated Circuit
NRE Non-Recurring Engineering
CLB Configurable Logic Block
LUT Look Up Table
FF Flip-Flop
RISC Reduced Instruction Set Computer
HLS High Level Synthesis
GPP General Purpose Processor
ASP Application Specific Processor
SPP Special Purpose Processor
DSP Digital Signal Processor
CDFG Control/Data Flow Graph
EDA Electronic Design Automation
RTL Register Transfer Level
HDL Hardware Description Language
VHDL Very-High-Speed Integrated Circuit Hardware Description Language
MAA Memory Allocation and Assignment
DRAM Dynamic Random Access Memory
PLD Programmable Logic Device
EDIF Electronic Design Interchange Format
FIFO First In, First Out
FIR Finite Impulse Response
FFT Fast Fourier Transform
SoPC System on Programmable Chip
GTM Generalized Template Matching
MAP Memory Access Pattern
PF Packing Factor
II Initiation Interval
RF Region Function
FU Functional Unit
MPE Micro Processing Element
RVT Retinal Vascular Tracing
PIV Particle Image Velocimetry
HPF High-Pass Filter
LPF Low-Pass Filter
IF Interface
BlockRAM Block Random Access Memory
Appendix B
Equation Proof
The total external memory loading is defined as $L_{total}$:

\[
L_{total} =
\underbrace{\frac{N-q}{q-n+1}\left(\frac{M-p}{p-m+1}+1\right)(p \times q)}_{\text{when block moves left to right}}
+ \underbrace{\frac{M-p}{p-m+1}(p \times q)}_{\text{when block moves top to bottom}}
= \left(\frac{(N-q)(M-p)}{(p-m+1)(q-n+1)} + \frac{M-p}{p-m+1}\right)(p \times q)
= \frac{(N-n+1)(M-p)}{(p-m+1)(q-n+1)}(p \times q)
\tag{B.1}
\]

Compared to $M$ and $N$, $p$ and $q$ are small, and we assume $(M-p) \approx M$ and $(N-n+1) \approx N$. Therefore, the problem simplifies to minimizing $\frac{p \times q}{(p-m+1)(q-n+1)}$, where $p > m$, $q > n$, and $m, n > 1$. To minimize this quantity, we rewrite it as follows:

\[
\frac{p \times q}{(p-m+1)(q-n+1)}
= \frac{p \times q}{\left[p \times q + (m-1)(n-1)\right] - \left[p(n-1) + q(m-1)\right]}
\tag{B.2}
\]

The second part of the divisor satisfies

\[
p(n-1) + q(m-1) \geq 2\sqrt{p \times q \times (m-1)(n-1)}
\tag{B.3}
\]

where equality holds only when $p(n-1) = q(m-1)$. Now Equation (B.2) can be rewritten as

\[
\frac{p \times q}{(p-m+1)(q-n+1)}
\geq \frac{p \times q}{\left[p \times q + (m-1)(n-1)\right] - 2\sqrt{p \times q \times (m-1)(n-1)}}
= \left(\frac{\sqrt{pq}}{\sqrt{pq} - \sqrt{(m-1)(n-1)}}\right)^2
\tag{B.4}
\]

According to the constraint

\[
p \times q \times W_{pi} \leq \frac{B_{avail}}{2}
\tag{B.5}
\]

where $W_{pi}$ and $B_{avail}$ are defined in Table 3.1, we can further derive

\[
\frac{\sqrt{pq}}{\sqrt{pq} - \sqrt{(m-1)(n-1)}}
\geq \frac{\sqrt{\frac{B_{avail}}{2W_{pi}}}}{\sqrt{\frac{B_{avail}}{2W_{pi}}} - \sqrt{(m-1)(n-1)}}
\tag{B.6}
\]

where equality holds only when $p \times q = \frac{B_{avail}}{2W_{pi}}$. Putting Equations (B.4) and (B.6) together, we get

\[
\frac{p \times q}{(p-m+1)(q-n+1)}
\geq \left\{\frac{\sqrt{\frac{B_{avail}}{2W_{pi}}}}{\sqrt{\frac{B_{avail}}{2W_{pi}}} - \sqrt{(m-1)(n-1)}}\right\}^2
\tag{B.7}
\]

where equality holds only when $p(n-1) = q(m-1)$ and $p \times q = \frac{B_{avail}}{2W_{pi}}$.

To summarize, the optimal values of $p$ and $q$ are those that make both equality conditions true. Let $q_{opt} = \frac{B_{avail}}{2 W_{pi} \, p_{opt}}$ and substitute this into the other condition $p(n-1) = q(m-1)$:

\[
p_{opt}(n-1) = \frac{B_{avail}}{2 W_{pi} \, p_{opt}}(m-1)
\;\Rightarrow\;
p_{opt}^2 = \frac{B_{avail}(m-1)}{2 W_{pi}(n-1)}
\;\Rightarrow\;
p_{opt} = \sqrt{\frac{B_{avail}(m-1)}{2W_{pi}(n-1)}},\quad
q_{opt} = \sqrt{\frac{B_{avail}(n-1)}{2W_{pi}(m-1)}}
\tag{B.8}
\]

By selecting $p_{opt}$ and $q_{opt}$ according to Equation (B.8), we minimize $\frac{p \times q}{(p-m+1)(q-n+1)}$ and therefore minimize $L_{total}$, the total external memory access. A numerical sketch follows.
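The closed-form optimum of Equation (B.8) is straightforward to evaluate. The sketch below computes p_opt and q_opt for one set of parameters; the values used (a BlockRAM budget as in Table 4.11, 8-bit pixels, a 32 × 32 window) are illustrative assumptions, and in practice p and q would still be rounded and clipped to the image size.

```python
# Sketch: evaluate the optimal block dimensions from Equation (B.8),
# subject to p * q * Wpi <= Bavail / 2.
from math import sqrt

def optimal_block(b_avail, w_pi, m, n):
    p_opt = sqrt(b_avail * (m - 1) / (2 * w_pi * (n - 1)))
    q_opt = sqrt(b_avail * (n - 1) / (2 * w_pi * (m - 1)))
    return p_opt, q_opt

# e.g. 655,360 bits of on-chip memory, 8-bit pixels, a 32 x 32 window:
print(optimal_block(655360, 8, 32, 32))   # (~202.4, ~202.4) before rounding
```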