Tokyo Institute of Technology
Department of Computer Science
Master Thesis
The 100-FPGA Stencil Computation Accelerator
(A Dedicated Stencil-Computation Machine Built from 100 FPGAs)
Supervisor: Associate Prof. Kenji Kise
January, 2013
Submitter
Department of Computer Science
Graduate School of Information Science and Engineering
11M38217 Ryohei Kobayashi
Abstract

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation, seismic imaging for the oil and gas exploration industry, digital signal processing, and fluid calculation. Various hardware accelerators that solve stencil computation at high speed have been designed using multiple high-end FPGAs.

I have proposed a high-performance architecture for 2D stencil computation and implemented it on the ScalableCore system, which employs multiple small FPGAs. The ScalableCore system is a 2D-mesh-connected FPGA array that our laboratory has developed as a high-speed simulation environment for many-core processor research; it uses multiple small-capacity FPGAs connected in a 2D mesh. In this thesis, I use the hardware components of the ScalableCore system as an infrastructure for HPC hardware accelerators.

To achieve high performance, the pipelines of the execution units should be kept operating effectively. In stencil computation, the whole data set is divided into multiple blocks and each block is assigned to one FPGA. The boundary data of each block is shared with the adjacent FPGAs. In my system, the computation order is customized in each FPGA to increase the acceptable latency of communication between the FPGAs.

I develop the system that realizes the proposed architecture in stages. First, I implement a software simulator in C++ that emulates stencil computation with cycle-level accuracy on multiple FPGA nodes. Its execution results are verified by comparison with those of a function-level stencil computation program coded in C. Second, I implement circuits based on the software simulator in Verilog HDL, simulate them with iverilog, and verify them by comparing their results with those of the software simulator. Finally, I implement the circuits on the FPGA array and verify the array by comparing its computation results with those of the C reference program.

I evaluated the performance, scalability, and power consumption of the developed FPGA array. The FPGA array operated successfully, which establishes the validity of the proposed architecture. The 100-FPGA array achieved about 0.6 GFlops/W, a performance-per-watt value about four times better than that of a typical GPU card.
Contents

1 Introduction
  1.1 Contribution
    1.1.1 Invention of the High Performance Architecture for 2D Stencil Computation
    1.1.2 Development of the System which Realizes the Proposed Architecture
    1.1.3 Proof of the Validity of the Proposed Architecture by Evaluation Using 100 FPGAs
2 Parallel Stencil Computation by Using Multi-FPGAs
  2.1 Block Division
  2.2 Computation Order
  2.3 Clock Period Variation
3 Architecture and Implementation
  3.1 Development Flow
  3.2 Acquisition of the Position Information
  3.3 Design and Implementation of Synchronization Mechanism
  3.4 System Architecture
4 Evaluation
  4.1 Environment
  4.2 Hardware Resource Consumption
  4.3 Performance of FPGA Array
5 Related Work
6 Conclusion
Acknowledgement
Bibliography
Publication
List of Figures

2.1 Parallel Stencil Computation by Using Multi-FPGA.
2.2 Pseudo Code of 2D Stencil Computation.
2.3 The Computation Order of Grid-points on FPGA. (b) is the proposed method.
2.4 Computing Order Applied Proposed Method.
2.5 Clock Period Variation (Measurement Time: 20 sec).
2.6 Relative Clock Period Variation.
3.1 Development Flow.
3.2 Verification of FPGA Array.
3.3 Acquisition of the Position Information.
3.4 Design of Synchronization Mechanism.
3.5 Implementation of Synchronization Mechanism.
3.6 System Architecture.
3.7 Relationship between the Grid-points and BlockRAM.
3.8 MADD Pipeline Operation.
4.1 Configuration Diagram of the Mesh Connected FPGA Array.
4.2 FPGA Array (100 Nodes).
4.3 Peak and Effective Performance of Stencil Computation in FPGA Array with 16 Nodes.
4.4 Effective Performance per Watt of Stencil Computation in FPGA Array with 16 Nodes.
4.5 Peak and Effective Performance of Stencil Computation in FPGA Array with Operation Frequency 0.06 GHz.
4.6 Effective Performance per Watt of Stencil Computation in FPGA Array with Operation Frequency 0.06 GHz.

List of Tables

2.1 Measurement Result.
4.1 Hardware Resource Consumption.
4.2 Design Parameters.
Chapter 1
Introduction
Stencil computation is one of the typical scientific computing kernels [1]. Various accelerators that solve stencil computation at high speed have been designed using multiple high-end FPGAs [2][3].

I have proposed a high-performance architecture for 2D stencil computation and implemented it on the ScalableCore system, which employs multiple small FPGAs [4].

The ScalableCore system is a 2D-mesh-connected FPGA array that our laboratory has developed as a high-speed simulation environment for many-core processor research [5]; it uses multiple small-capacity FPGAs connected in a 2D mesh. In this thesis, I use the hardware components of the ScalableCore system as an infrastructure for HPC hardware accelerators. The next section shows the contributions of this study.
1.1 Contribution
1.1.1 Invention of the High Performance Architecture for 2D Stencil Computation
To achieve high performance, the pipelines of the execution units should be kept operating effectively. In stencil computation, the whole data set is divided into multiple blocks and each block is assigned to one FPGA. The boundary data of each block is shared with the adjacent FPGAs. In my system, the computation order is customized in each FPGA to increase the acceptable latency of communication between the FPGAs.
1.1.2 Development of the System which Realizes the Proposed Architecture
I develop the system that realizes the proposed architecture in stages. First, I implement a software simulator that emulates stencil computation with cycle-level accuracy on multiple FPGA nodes. Second, I implement circuits based on the software simulator in Verilog HDL. Finally, I implement the circuits on the FPGA array and verify the array. During development, however, the system generated an illegal computation result when multiple FPGA nodes were used. The cause was clock period variation. I therefore evaluate the clock variation of every FPGA node quantitatively, and then design and implement a mechanism that synchronizes the data.
1.1.3 Proof of the Validity of the Proposed Architecture by Evaluation Using 100 FPGAs
I evaluate the performance, scalability, and power consumption of the developed FPGA array. The FPGA array operates successfully, which establishes the validity of the proposed architecture. The 100-FPGA array achieves 0.57 GFlops/W, a performance-per-watt value about four times better than that of an NVIDIA GTX 280 GPU card.

This thesis is organized as follows. Chapter 2 describes parallel stencil computation using multiple FPGAs. Chapter 3 describes the architecture, design, and implementation of the accelerator. In Chapter 4, I show the evaluation results of the developed FPGA array. Chapter 5 gives a survey of related work. Chapter 6 concludes this thesis.
Chapter 2
Parallel Stencil Computation by Using
Multi-FPGAs
Fig. 2.1 shows a typical pattern of 2D stencil computation. In the figure, each circle represents the value of a grid-point, and each grid-point value at the next time-step is computed from the values of its four adjacent grid-points at the current time-step.

Fig. 2.2 shows pseudo code for the 2D stencil computation of Fig. 2.1. In the figure, k represents the time-step and (i, j) the coordinates of a grid-point. Two buffers, v0 and v1, are used for the computation. The value of grid-point (i, j) is represented as vn[i][j], where n is the buffer number (0 or 1). As shown in the fourth line of Fig. 2.2, vn[i][j] is updated with the sum of four values, each obtained by multiplying a weighting factor by one adjacent grid-point (vn[i-1][j], vn[i][j-1], vn[i][j+1], vn[i+1][j]). As shown in the seventh and eighth lines of Fig. 2.2, every grid-point is updated for the next time-step.
2.1 Block Division

As shown in Fig. 2.1, the data set of the stencil computation is divided into several blocks according to the number of vertical and horizontal FPGAs in the array, and each data block is assigned to one FPGA. The computation on each FPGA uses the assigned data together with the shared boundary data of its block, so the necessary boundary data must be sent to the adjacent FPGAs. In Fig. 2.1, a circle represents a grid-point, a group of 4×4 grid-points is assigned to one FPGA, and an arrow represents communication with a neighboring FPGA. Gray regions represent the data subsets communicated to other FPGAs.

This data sharing incurs some data-transfer overhead. To eliminate it, I customized the computation order for each FPGA, as described in the next section.
Fig. 2.1 Parallel Stencil Computation by Using Multi-FPGA.
Fig. 2.2 Pseudo Code of 2D Stencil Computation.
2.2 Computation Order

Fig. 2.3 shows two cases of computation order. Fig. 2.3 (a) shows the case in which FPGA (A) and FPGA (B) compute in the same order. A dotted square shows the data subset assigned to one FPGA. In fact, the computations also use extra boundary data that is not shared, but this extra data is omitted from the figure for simplicity. I define a sequential process that computes all the grid-points of one time-step as an “Iteration”. Each circle represents one grid-point; the letter in a circle is the ID of the FPGA and the number is the computing order within that FPGA, so the computations of each FPGA proceed in the direction of the arrow.
2.2 Computation Order 5
Fig. 2.3 The Computation Order of Grid-points on FPGA. (b) is proposed method.
In this example, each FPGA updates its assigned data of sixteen grid-points (from 0 to 15) during an Iteration. For simplicity, I assume that updating the value of one grid-point takes exactly one cycle, and that several FIFOs are used to avoid illegal modification of the data. In FPGA (A), the value of A0 is computed at the 0th cycle and the value of A1 at the 1st cycle; similarly, in FPGA (B), the value of B0 is computed at the 0th cycle and the value of B1 at the 1st cycle. All the computations proceed in this order. I assume that each FPGA can use data obtained from the other FPGA in a single cycle. After the computations of each Iteration complete, the process proceeds to the next time-step. In this case, an Iteration takes sixteen cycles: the first Iteration begins at the 0th cycle, the second at the 16th cycle, and the third at the 32nd cycle.
Fig. 2.4 Computing Order Applied Proposed Method.

In Fig. 2.3 (a), the computation of grid-point B1 uses the values of the vertically and horizontally adjacent grid-points A13, B5, B0, and B2. The value of grid-point A13 must be communicated between FPGA (A) and FPGA (B) because it is shared by these FPGAs; the other values need no inter-FPGA communication. In this computation order, the value of grid-point A13 is computed at the 13th cycle and the value of grid-point B1 at the 17th cycle. Since the computation of B1 uses the computed value of A13, the value of A13 must be communicated within three cycles (14, 15, 16) after its computation in order not to stall the computation of B1. The values of grid-points A12, A14, and A15 must likewise be communicated within three cycles to avoid stalls. By this argument, if N × M grid-points are assigned to a single FPGA, every shared value must be communicated within N − 1 cycles.
Fig. 2.3 (b) models the case in which FPGA (C) and FPGA (D) compute in reversed orders; this is the proposed method [6]. The computation order of FPGA (C) is the reverse of that of FPGA (A) in Fig. 2.3 (a), while FPGA (B) and FPGA (D) use the same computation order. In this case, in order not to stall the computation of D1 in Iteration 2 (17th cycle), the margin for sending the value of C1 (computed at the 1st cycle) is 15 cycles (2∼16). By this argument, if N × M grid-points are assigned to a single FPGA, the communication latency between vertically adjacent FPGAs may be as large as N × M − 1 cycles. In this way, changing the computation order increases the acceptable communication latency.
Fig. 2.4 shows the computation order of the proposed method in each FPGA. In Fig. 2.4, each square represents an FPGA and each arrow the computation order within it. The FPGAs in the 1st and 3rd rows compute in the same order as FPGA (C) in Fig. 2.3 (b); the FPGAs in the 2nd and 4th rows compute in the same order as FPGA (D) in Fig. 2.3 (b).
As discussed for Fig. 2.3, the proposed method guarantees a communication margin of about one Iteration between FPGAs. Communication between FPGAs facing each other in the direction of the arrows is covered as well: when FPGA (C) and FPGA (D) of Fig. 2.3 (b) are placed one above the other, the compute cycles of their adjacent sides coincide.

Fig. 2.5 Clock Period Variation (Measurement Time: 20 sec).
Next, consider communication between the left and right sides of an FPGA. If two copies of FPGA (C) in Fig. 2.3 (b) are placed side by side, C3 and C0 become adjacent. In this case, the acceptable communication latency for the right FPGA is 12 cycles: the number of cycles of one Iteration minus the number of cycles for the grid-points of one side.

In this way, the proposed method increases the acceptable communication latency by computing vertically adjacent FPGAs in reverse order; in other words, it ensures a margin of about one Iteration. So far I have assumed that the computation of one grid-point takes one cycle. If it takes k cycles instead, the acceptable communication latency between the left FPGA and the right FPGA is (N × M − M) × k cycles.
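As a numerical check on the bound above, the following helper (my own illustration, not code from the thesis) evaluates (N × M − M) × k:

```c
/* Acceptable communication latency between horizontally adjacent
   FPGAs, per the discussion above: (N*M - M) * k cycles, where each
   FPGA holds an N x M block of grid-points and one grid-point update
   takes k cycles. */
int acceptable_latency_lr(int n, int m, int k) {
    return (n * m - m) * k;
}
```

For the 4×4 example with k = 1, this gives (16 − 4) × 1 = 12 cycles, matching the left/right margin derived in the text.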
2.3 Clock Period Variation

Each FPGA node in our system is equipped with a clock oscillator [CSX-750PB(B)] with an operation frequency of 40 MHz and a frequency stability of ±50 ppm. A frequency stability of ±50 ppm means that the clock period variation per 1,000,000 cycles is within ±50 cycles.
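The ppm figure converts directly into a drift bound in cycles. This small helper is my own illustration of that arithmetic, not part of the thesis:

```c
/* Convert a frequency-stability bound in ppm (parts per million)
   into the maximum clock period variation, in cycles, over a given
   number of cycles: drift = cycles * ppm / 1e6. */
double max_drift_cycles(double cycles, double ppm) {
    return cycles * ppm / 1e6;
}
```

For example, ±50 ppm over 1,000,000 cycles bounds the variation to ±50 cycles, exactly as stated above.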
As a preliminary evaluation, I measured the clock period variation of each FPGA node. The clock period variation is the suspected cause of the trouble in which the system generates an illegal computation result when multiple FPGA nodes are used. I used the ScalableCore system to measure the clock period variation. The ScalableCore system emulates an environment for many-core processors, and the emulated many-core processors are synchronized by a concept named the virtual cycle. To examine the frequency stability using this mechanism, I coded an application in C that runs on the many-core processors of the ScalableCore system.

Fig. 2.6 Relative Clock Period Variation.
In this thesis, I measured the clock period variation of each FPGA node by running the implemented program, varying the measurement time for each measurement. I targeted an 8 × 8 FPGA array; the results are shown in Fig. 2.5, Fig. 2.6, and Table 2.1.
Fig. 2.5 shows the clock period variation for a measurement time of 20 sec. The Z-axis represents the clock period variation per 1,000,000 cycles, and the X- and Y-axes represent the position coordinates of each FPGA node. As shown in Fig. 2.5, the clock period variation per 1,000,000 cycles stays within ±50 cycles. This result confirms that the frequency stability of the oscillator is within ±50 ppm.
Table 2.1 shows the worst clock period variation and the standard deviation of the clock variation for each measurement time. The Time, Worst Value, and Standard Deviation columns in Table 2.1 represent the measurement time, the worst clock period variation, and the standard deviation of the clock variation over the 8 × 8 FPGA array, respectively.

Table 2.1 Measurement Result.

Time [sec]   Worst Value [ppm]   Standard Deviation
20           20.47 (x=3, y=5)    4.73
40           20.47 (x=3, y=5)    4.68
80           20.47 (x=3, y=5)    4.73
160          20.59 (x=3, y=5)    4.77
320          20.66 (x=3, y=5)    4.79

As shown in Table 2.1, the Worst Value and the Standard Deviation hardly changed even though the measurement time changed. This result means that the clock period variation does not depend on the measurement time.
Fig. 2.6 shows the clock period variation measured over 320 sec divided by that measured over 20 sec. The X- and Y-axes in Fig. 2.6 are the same as in Fig. 2.5, and the Z-axis represents the clock period variation per 1,000,000 cycles relative to the 20-sec result. As shown in Fig. 2.6, this ratio is about 1 for almost all nodes, which again shows that the clock period variation does not depend on the measurement time.

From these results, it is clear that the system must be designed to tolerate a clock period variation that is almost constant over time.
Chapter 3
Architecture and Implementation
There are two ways to build a large system that cannot fit on a single FPGA chip: connecting a small number of high-end FPGAs, or connecting many small-capacity FPGAs. It is of course also possible to connect many high-end FPGAs, but this approach is unattractive because it wastes more FPGA capacity than necessary.

Using a small number of high-end FPGAs has the merit that the system runs fast. On the other hand, the larger the logic circuits mounted on a single FPGA chip, the more difficult the implementation and verification of the system become.

Using many small-capacity FPGAs makes implementation and verification simpler because the logic circuits mounted on each FPGA chip are smaller. Even if an FPGA node breaks, the system can continue operating after the broken node is replaced with a new one. The operation speed of this method is inferior to that of a small number of high-end FPGAs, but it is attractive in terms of the cost per FPGA node.

For these reasons, I employ the implementation method that connects many small-capacity FPGAs.
3.1 Development Flow

I develop the system that realizes the proposed architecture in stages. Fig. 3.1 shows the development flow.

First, I implement a software simulator (stencil_sim.cc) in C++ that emulates stencil computation with cycle-level accuracy on multiple FPGA nodes (Fig. 3.1 (1)). Its execution results are verified by comparison with those of a function-level stencil computation program (stencil_base.c) coded in C.
Second, I implement circuits (stencil.v) based on the software simulator in Verilog HDL (Fig. 3.1 (2)). I simulate the circuits using iverilog and verify them by comparing their results with those of the software simulator.

Fig. 3.1 Development Flow.
Finally, I implement the circuits on the FPGA array and verify the array by comparing its computation results with those of the function-level program (stencil_base.c) coded in C (Fig. 3.1 (3)). Fig. 3.2 shows the verification of the FPGA array. The red grid-points represent the computation results. The computation result of each FPGA is output to a PC connected by USB at each Iteration.

Fig. 3.2 Verification of FPGA Array.

3.2 Acquisition of the Position Information

As described in Chapter 2, I customized the computation order in each FPGA to increase the acceptable latency of communication between the FPGAs. To determine its computation order, every FPGA uses its position information in the system. I implemented a mechanism to provide this position information.

Fig. 3.3 shows how the position information is provided to all FPGAs; the triangle in Fig. 3.3 represents a NOT circuit.
Every FPGA node can detect whether adjacent FPGA nodes exist. By this mechanism, the FPGA nodes in the fourth (bottom) row recognize that they are located at the bottom. In this implementation, the initial value 0 is set in the bottom FPGA nodes. Each node with a value sends it to the FPGA node above it, and the value is inverted on the way because a NOT circuit is connected between each upper and lower FPGA. Thus 1 is set in the FPGA nodes of the third row, and as the value propagates upward, 0 is set in the nodes of the second row, and so on. The circuit that realizes this mechanism needs no register to latch the value, so it can be implemented at small scale.
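The propagation can be modeled in a few lines of C. This is a sketch of the mechanism described above (row 0 is the top row and the last row is the bottom row; the function name is my own):

```c
/* Model of the position-information mechanism: the bottom row is
   assigned the initial value 0, and the NOT circuit between each
   vertical pair of FPGAs inverts the value as it propagates upward,
   so the rows alternate 0/1 from the bottom up. */
void propagate_position(int rows, int out[]) {
    out[rows - 1] = 0;             /* bottom row: initial value 0 */
    for (int r = rows - 2; r >= 0; r--)
        out[r] = !out[r + 1];      /* inverted by the NOT circuit */
}
```

Each node thereby learns its row parity without any registers, which is enough to choose between the two computation orders of Fig. 2.4.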
Fig. 3.3 Acquisition of the Position Information.

3.3 Design and Implementation of Synchronization Mechanism

I designed and implemented a mechanism that synchronizes all FPGA nodes so that the system operates successfully. The mechanism absorbs the clock period variation.

Fig. 3.4 shows the design of the synchronization mechanism. I define FPGA node (A) in Fig. 3.4 as the Master node. Each FPGA node executes stencil computation in synchronism with a signal transmitted from the Master node; the nodes other than the Master node stall their stencil computation until they receive the synchronization signal.

Fig. 3.4 Design of Synchronization Mechanism.
The synchronization signal is generated by the Master node with a period of α + β cycles and sent to the other nodes. α is the number of cycles required to execute the stencil computation of one Iteration, and β is the margin that absorbs the clock period variation of the FPGA nodes; it must be large enough to hide the worst clock period variation over α cycles.

Fig. 3.5 Implementation of Synchronization Mechanism.
Fig. 3.5 shows the implementation of the synchronization mechanism. The α and β in Fig. 3.5 are the same as those in Fig. 3.4. The FPGA nodes other than the Master node forward the synchronization signal they receive to the FPGA nodes to their right and below; therefore, all FPGA nodes are synchronized to the Master node. An FPGA node that receives the synchronization signal restarts its stencil computation after waiting several cycles; this wait prevents chattering.
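Sizing the period α + β can be sketched as follows. The thesis does not give an explicit formula, so the conservative round-up below is my own assumption, combining the α + β design with the worst-case ppm figures of Table 2.1:

```c
/* Sketch of sizing the synchronization period: beta must hide the
   worst clock period variation over alpha cycles. With a worst-case
   variation of worst_ppm, the drift over alpha cycles is
   alpha * worst_ppm / 1e6; adding 1 rounds up conservatively.
   Illustrative only, not the thesis's exact sizing. */
long sync_margin(long alpha, double worst_ppm) {
    return (long)(alpha * worst_ppm / 1e6) + 1;
}

long sync_period(long alpha, double worst_ppm) {
    return alpha + sync_margin(alpha, worst_ppm);
}
```

Since the measured worst variation (about 20.5 ppm, Table 2.1) is well under the oscillator's ±50 ppm rating, a β computed from the 50 ppm bound leaves comfortable headroom.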
3.4 System Architecture

Fig. 3.6 shows the system architecture with eight multiply-adder units. Each square in the figure represents a BlockRAM*1, and Sync is the module that synchronizes all FPGA nodes. The DES in the figure is a deserializer that receives data from the adjacent FPGAs, and the SER is a serializer that sends data to the adjacent FPGAs. A FIFO is placed at the input of the serializer; only the MADD computation results that must be sent to adjacent FPGAs are written into this FIFO. The Ser/Des implementation uses data recovery and NRZI coding. MADD is the multiply-adder unit, and the squares inside it represent registers. Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754, each with seven pipeline stages. Since two additional registers are included in the MADD, the data path of the MADD becomes a sixteen-stage pipeline.

*1 BlockRAM is the low-latency SRAM that each FPGA has.
Fig. 3.6 System Architecture.
Therefore, the data path can be regarded as an eight-stage adder connected to an eight-stage multiplier. This pipeline scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder; for this reason, I chose eight stages for both. The reason is explained later in this thesis.
Fig. 3.7 shows the relationship between the BlockRAMs in Fig. 3.6 and the grid-points; the numbers written on the BlockRAMs in Fig. 3.6 correspond to the numbers in Fig. 3.7. In Fig. 3.7, the data set assigned to each FPGA is split in the vertical direction and stored in BlockRAMs 0∼7, which are surrounded by the dashed line. If a data set of 64×128 grid-points is assigned to one FPGA, each BlockRAM stores a split data set of 8×128. In addition, the data of the communication region is stored in one or more further BlockRAMs outside BlockRAMs 0∼7; the communication region is the set of data transferred to the adjacent nodes.
Fig. 3.7 Relationship between the Grid-points and BlockRAM.
Fig. 3.8 MADD Pipeline Operation.
Fig. 3.8 shows the pipelined operation of the MADD. A circle in the figure represents the value of a grid-point, and a square represents a computation result: the value of a grid-point multiplied by a weighting factor. Both the multiplier and the adder have eight pipeline stages. Fig. 3.8 (a) shows the numbering of the grid-points.
I explain the computation of grid-points 11∼18. First, grid-points 1∼8 are loaded from BlockRAM and input to the multiplier in cycles 0∼7. Next, while the multiplier outputs those results, grid-points 10∼17 are input to it in cycles 8∼15. Then grid-points 12∼19 are input to the multiplier while the weighted values of grid-points 1∼8 and 10∼17 are summed in cycles 16∼23. Finally, the results in which the upper, lower, left, and right grid-points have been multiplied by the weighting factors and summed are output in cycles 41∼48.
The data of a grid-point that will still be used must not be overwritten when computation results are written back to BlockRAM. The general approach therefore stores the results in a temporary buffer, such as a FIFO, before writing them to BlockRAM. The proposed architecture, however, needs no additional temporary buffer because the MADD pipeline provides the same functionality. In the case of Fig. 3.8, the data of grid-points 11∼18 are updated in cycles 40∼48, while the old data of grid-points 11∼18 are input to the multiplier in cycles 32∼40 and are not used afterwards. Therefore, for the computation within a single FPGA, the update order of the data is preserved without using a FIFO. As explained above, this scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder; since both have eight stages, the grid width that one MADD processes is eight.
This architecture keeps the pipelines filled almost 100% of the time. The filling rate of the pipelines is ((N − 8)/N) × 100%, where N is the number of cycles the computation takes. In addition, this architecture uses no extra temporary buffer for updating data. Therefore, it achieves high computation performance with a small circuit area.
Chapter 4
Evaluation
4.1 Environment

Fig. 4.1 shows the hardware configuration of the FPGA array*1. The array can be scaled freely according to the grid size of the stencil computation by connecting FPGAs in a mesh. Therefore, the larger the number of FPGA nodes, the larger the grid size that can be computed in the same time.
Each node in the FPGA array is equipped with an FPGA (Xilinx Spartan-6 XC6SLX16), and the BlockRAM capacity of each FPGA is 64KB. The MADD units are implemented with IP cores generated by the CORE Generator. A single MADD consumes four of the 32 DSP blocks that a Spartan-6 FPGA provides. Therefore, up to eight MADD units can be implemented in a single FPGA.
I coded these circuits in Verilog HDL and used Xilinx ISE 14.2 to generate the circuit configurations.
As described in Section 3.1, I used a stencil computation program coded in C to verify the implemented circuits and to compare execution speed. The verification program uses the SoftFloat library, whose computation precision matches the floating-point arithmetic of the FPGA. The program for the speed comparison does not use the library, because it is important for that version to run as fast as possible.
Section 4.3 presents the performance evaluation of the FPGA array in which eight MADD units are implemented per FPGA. The grid size*2 is a 2D data set of 64×128 and the number of iterations is 5,800,000. The computation result is output to a PC via USB at each iteration and compared with the execution result of the C program; the data matched at every iteration.
*1 The SRAM in Fig. 4.1 is not used.
*2 The total number of grid-points that can be stored is 64KB (BlockRAM capacity) ÷ 4B (data size of a grid-point, single-precision floating-point) = 16K. However, the grid width is 64 because of the number of MADD units and the scheduling condition.
Fig. 4.1 Configuration Diagram of the Mesh Connected FPGA Array.
Table 4.1 Hardware Resource Consumption.
Device Utilization Summary
Slice Logic Utilization Used / Available Utilization
LUTs 7,805 / 9,112 85%
Slices 2,271 / 2,278 99%
BlockRAM 28 / 32 87%
DSP48A1 32 / 32 100%
4.2 Hardware Resource Consumption

Table 4.1 shows the hardware resource consumption of a single FPGA. The utilization of the DSP blocks is 100%, because implementing eight MADD units consumes all of them. Slice utilization is 99%; this figure includes, in addition to the MADD units, the communication module for output to the PC, the Ser/Des, and the module that synchronizes all FPGA nodes.
Fig. 4.2 FPGA Array (100 Nodes).
4.3 Performance of FPGA Array

Fig. 4.2 and Table 4.2 show the 100-node FPGA array and the design parameters used to analyze its performance, respectively.
The operating frequency is F GHz, the number of MADD units implemented in each FPGA is NMADD, and the number of FPGAs is NFPGA. Each MADD can perform an addition and a multiplication simultaneously on every cycle. For this reason, the hardware peak performance of a single MADD is 2F GFlop/s, and that of a single FPGA is 2F × NMADD GFlop/s. Therefore, the hardware peak performance of an FPGA array with NFPGA FPGAs is given below.
Table 4.2 Design Parameters.
operating frequency F [GHz]
number of FPGAs NFPGA
number of MADD units NMADD
hardware peak performance Ppeak [GFlop/s]
number of computations per grid-point OP
total number of grid-points GRID
number of iterations ITER
Ppeak = 2 × F × NFPGA × NMADD (4.1)
When the operating frequency is 0.06GHz, the hardware peak performance Ppeak of the 100-node FPGA array is 96GFlop/s, since NMADD is 8 and NFPGA is 100. However, as shown in Fig. 2.2, the average utilization of the MADD units is ((4 + 3)/8) × 100 = 87.5%, because the computation of a single grid-point consists of seven floating-point operations*3. Therefore, the achievable peak performance at 0.06GHz is 96 × 0.875 = 84GFlop/s.
Fig. 4.3 shows the peak and effective performance of the stencil computation in the 16-node FPGA array depending on the operating frequency. The effective performance is the total number of floating-point operations divided by the execution time, which I measured with a stopwatch. The total number of floating-point operations per FPGA node is given below, using OP, GRID, and ITER from Table 4.2*4.
OP × GRID × ITER = 7 × 64 × 128 × 5,800,000
As shown in Fig. 4.3, the peak and effective performance of the stencil computation are almost the same, which shows that the overhead of the proposed computation method is small. Moreover, I compiled the stencil computation program coded in C for comparison with the "-O3" option. Its effective performance is 3.31GFlop/s when running on a single thread of an Intel Core i7-2700K with operation frequency 3.5GHz; this is comparable to the 2.8GFlop/s reported in [7]. The effective performance of the 16-node FPGA array at 0.06GHz is 13.42GFlop/s, while that of the Intel Core i7-2700K with operation frequency 3.5GHz
*3 Four multiplications and three additions.
*4 OP is the total number of computations required to update the data of one grid-point from time-step k to k+1. In this case, OP is seven: four multiplications and three additions.
Fig. 4.3 Peak and Effective Performance of Stencil Computation in FPGA Array with 16 Nodes.
Fig. 4.4 Effective Performance per Watt of Stencil Computation in FPGA Array with 16 Nodes.
is 3.31GFlop/s. Therefore, the 16-node FPGA array achieves about four times the performance of the Intel Core i7-2700K (single thread).
Fig. 4.4 shows the effective performance per watt of the stencil computation in the 16-node FPGA array depending on the operating frequency. As shown in Fig. 4.4, the effective performance per watt improves as the operating frequency increases.
Fig. 4.5 shows the peak and effective performance of the stencil computation in the FPGA array with operating frequency 0.06GHz depending on the number of nodes. This system has the characteristic that the larger the number of FPGA nodes, the larger the grid size that can be computed in the same time. A single FPGA node computes a 2D data set of 64×128; thus, a 10×10-node FPGA
Fig. 4.5 Peak and Effective Performance of Stencil Computation in FPGA Array with Operation
Frequency 0.06GHz.
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
!#*"
$" %" '" +" $)" &%" )'" $!!"
!"#$%#&
'()"*+"#*,'-*./01%+2345!
67&8"#*%$*0!/9*(%:"3!
Fig. 4.6 Effective Performance per Watt of Stencil Computation in FPGA Array with Operation
Frequency 0.06GHz.
array can compute a 2D data set of 640×1280.
As shown in Fig. 4.5, I confirmed that the peak and effective performance of the stencil computation are almost the same. The 100-node FPGA array computed the 640×1280 data set in 396sec, whereas a single thread of an Intel Core i7-2700K with operation frequency 3.5GHz computed it in 10,053sec. Therefore, the 100-node FPGA array is about 25 times faster than the Intel Core i7-2700K (single thread).
Fig. 4.6 shows the effective performance per watt of the stencil computation in the FPGA array with operation frequency 0.06GHz depending on the number of nodes. As shown in the figure, the effective performance per watt improves as the number of nodes increases. The effective performance per watt of the 100-node FPGA array is 0.57GFlop/sW. This performance/W value is 3.8 times better than the 0.15GFlop/sW of the NVIDIA GTX280 GPU card reported in [1].
Chapter 5
Related Work
Many works that optimize stencil computation for multi-core processors and GPUs have been reported.
Augustin et al. [7] report the execution of stencil computation on an Intel Xeon E5220 quad-core processor running at 2.26GHz. A single core of the processor achieves 2.8GFlop/s, just 31% of the peak performance. Moreover, two E5220 processors achieve 15.9GFlop/s with 8 cores, 21.8% of the peak.
Phillips et al. [8] report the execution of stencil computation on an NVIDIA Tesla C1060 GPU. A single GPU achieves 51.2GFlop/s, 65.6% of the peak performance in double-precision arithmetic. A GPU cluster reduces this performance further: for a 256×256×512 grid, the computation performance is 42.2% of the peak.
Several studies have designed hardware for stencil computation using FPGAs [2][9]. Sano et al. [2] propose hardware for stencil computation composed of a systolic array of programmable processing elements and implement a prototype using multiple FPGAs (Altera Stratix family). They achieve performance scalability with constant memory bandwidth by applying a pipeline-scheduling method originally proposed for cellular automata. However, this work differs from ours in its architecture and in the type of FPGA used. Sato et al. [9] implement circuits that solve Poisson's equation on an FPGA array.
Moreover, array systems connecting multiple FPGAs have been reported. [10] presents CUBE, an accelerator for scientific computation composed of 512 linearly connected FPGAs. [11] studies the string Levenshtein distance algorithm, an integer-arithmetic-intensive application, on CUBE.
Chapter 6
Conclusion
I proposed a high-performance architecture for 2D stencil computation and developed a system that realizes the proposed architecture in stages. First, I implemented a software simulator that emulates the stencil computation on multiple FPGA nodes with cycle-level accuracy. Second, I implemented circuits based on the software simulator in Verilog HDL. Finally, I implemented the circuits on the FPGA array and verified its operation.
Moreover, I quantitatively evaluated the clock variation among the FPGA nodes, and then designed and implemented a mechanism to synchronize the data so that the system works correctly.
As a result, I established the validity of the proposed architecture, since the FPGA array operated successfully. The FPGA array with 100 FPGAs achieved 0.57GFlop/sW; this performance/W value was about four times better than that of the NVIDIA GTX280 GPU card.
In future work, I will develop a system that allows operation from the host PC, and will design and implement the system for lower power consumption.
Acknowledgement
I would like to express my deep gratitude to Associate Prof. Kenji Kise. He has been my supervisor and has supported me for the two years of my master's course. His constant support, guidance, and encouragement have been essential for me to complete my thesis. In addition, he provided me with a comfortable research environment in Kise Laboratory.
I also would like to thank all the members at Kise Laboratory, particularly Mr. Shinya Takamaeda-
Yamazaki and Mr. Shintaro Sano who graduated from Tokyo Tech, for active discussions and helpful
advice.
I would like to thank Lecturer Ryotaro Kobayashi and all the members of Kobayashi Laboratory at Toyohashi University of Technology, as well as Associate Prof. Tomoaki Tsumura and all the members of Tsumura Laboratory at Nagoya Institute of Technology, for active discussions and helpful advice.
I would like to sincerely thank Prof. Haruo Yokota, Associate Prof. Suguru Saito, Associate Prof. Takuo Watanabe, and Lecturer Haruhiko Kaneko for their careful review and fruitful suggestions as members of the thesis committee.
Finally, I am deeply grateful to my parents for a wide variety of support during the years of my
studies.
This work is supported in part by Core Research for Evolutional Science and Technology (CREST),
JST.
Bibliography
[1] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pp. 4:1–4:12, Piscataway, NJ, USA, 2008. IEEE Press.
[2] K. Sano, Y. Hatsuda, and S. Yamamoto. Scalable streaming-array of simple soft-processors for stencil computations with constant memory-bandwidth. In Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, pp. 234–241, May 2011.
[3] M. Shafiq, M. Pericas, R. de la Cruz, M. Araya-Polo, N. Navarro, and E. Ayguade. Exploiting memory customization in FPGA for 3D stencil computations. In Field-Programmable Technology, 2009. FPT 2009. International Conference on, pp. 38–45, Dec. 2009.
[4] R. Kobayashi et al. Towards a low-power accelerator of many FPGAs for stencil computations. In Networking and Computing (ICNC), 2012 Third International Conference on, pp. 343–349, Dec. 2012.
[5] Shinya Takamaeda-Yamazaki, Shintaro Sano, Yoshito Sakaguchi, Naoki Fujieda, and Kenji Kise. In International Symposium on Applied Reconfigurable Computing (ARC 2012), March 2012.
[6] Ryohei Kobayashi, Shintaro Sano, Shinya Takamaeda-Yamazaki, and Kenji Kise. High performance stencil computation on mesh connected FPGA arrays. In Transactions on Symposium on Advanced Computing Systems and Infrastructures, Vol. 2012, pp. 142–149, May 2012.
[7] Werner Augustin, Vincent Heuveline, and Jan-Philipp Weiss. Optimized stencil computation using in-place calculation on modern multicore systems. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pp. 772–784, Berlin, Heidelberg, 2009. Springer-Verlag.
[8] E. H. Phillips and M. Fatica. Implementing the Himeno benchmark with CUDA on GPU clusters. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1–10, April 2010.
[9] Kazuki Sato, Li Jiang, Kenichi Takahashi, Hakaru Tamukoh, Yuichi Kobayashi, and Masatoshi Sekine. Performance evaluation of Poisson equation and CIP method implemented on FPGA array. IEICE Technical Report, Circuits and Systems, Vol. 109, No. 396, pp. 19–24, Jan. 2010.
[10] O. Mencer, Kuen Hung Tsoi, S. Craimer, T. Todman, Wayne Luk, Ming Yee Wong, and Philip Leong. Cube: A 512-FPGA cluster. In Programmable Logic, 2009. SPL. 5th Southern Conference on, pp. 51–57, April 2009.
[11] Masato Yoshimi, Yuri Nishikawa, Hideharu Amano, Mitsunori Miki, Tomoyuki Hiroyasu, and Oskar Mencer. Performance evaluation of one-dimensional FPGA-cluster Cube for stream applications. IPSJ Transactions on Advanced Computing Systems, Vol. 3, No. 3, pp. 209–220, Sep. 2010.
Publication
International Symposium (Peer-reviewed)

1. Ryohei Kobayashi, Shinya Takamaeda-Yamazaki, and Kenji Kise: Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations, Workshop on Challenges on Massively Parallel Processors (CMPP 2012) held in conjunction with ICNC'12, pp. 343-349 (December 2012).

National Symposium (Peer-reviewed)

2. Ryohei Kobayashi, Shintaro Sano, Shinya Takamaeda-Yamazaki, and Kenji Kise: High Performance Stencil Computation on Mesh Connected FPGA Arrays, Transactions on Symposium on Advanced Computing Systems and Infrastructures (SACSIS2012), pp. 142-149 (May 2012).

Oral Presentation

3. Ryohei Kobayashi, Shinya Takamaeda-Yamazaki, and Kenji Kise: Design and Implementation of High Performance Stencil Computer by using Mesh Connected FPGA Arrays, IEICE Technical Report (RECONF), pp. 159-164 (January 2013).

4. Ryohei Kobayashi, Shintaro Sano, Shinya Takamaeda-Yamazaki, and Kenji Kise: Examination of the Stencil Computation onto Mesh Arrays of FPGAs, The 74th National Convention of IPSJ, Vol. 1, No. 4J-4, pp. 107-108 (March 2012).