Tokyo Institute of Technology
Department of Computer Science
Master Thesis
The 100-FPGA Stencil Computation Accelerator
(A Dedicated Stencil-Computation Machine Built from 100 FPGAs)
Supervisor: Associate Prof. Kenji Kise
January, 2013
Submitter
Department of Computer Science
Graduate School of Information Science and Engineering
11M38217 Ryohei Kobayashi
Abstract

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation, seismic imaging for the oil and gas exploration industry, digital signal processing, and fluid calculation. Various hardware accelerators that solve stencil computation at high speed have been designed using multiple high-end FPGAs.

I have proposed a high-performance architecture for 2D stencil computation and implemented it on the ScalableCore system, which employs multiple small FPGAs. The ScalableCore system is a 2D-mesh-connected FPGA array that our laboratory has developed as a high-speed simulation environment for many-core processor research; it uses multiple small-capacity FPGAs connected in a 2D mesh. In this thesis, I use the hardware components of the ScalableCore system as an infrastructure for HPC hardware accelerators.

To achieve high performance, the pipelines of the execution units should be kept operating effectively. In stencil computation, the whole data set is divided into multiple blocks and each block is assigned to one FPGA. The boundary data of each block is shared with the adjacent FPGAs. In my system, the computation order is customized in each FPGA to increase the acceptable latency of communication between the FPGAs.

I develop the system that realizes the proposed architecture in stages. First, I implement a software simulator in C++ that emulates stencil computation with cycle-level accuracy on multiple FPGA nodes. Its execution results are verified by comparison with those of a function-level stencil computation program coded in C. Second, I implement circuits based on the software simulator in Verilog HDL, simulate them with iverilog, and verify them by comparing their results with those of the software simulator. Finally, I implement the circuits on the FPGA array and verify the array by comparing its computation results with those of the C reference program.

I evaluated the performance, scalability, and power consumption of the developed FPGA array. The FPGA array operated successfully, which establishes the validity of the proposed architecture. The 100-FPGA array achieved about 0.6 GFlops/W, a performance-per-watt value about four times better than that of a typical GPU card.
Contents

1 Introduction
  1.1 Contribution
    1.1.1 Invention of the High Performance Architecture for 2D Stencil Computation
    1.1.2 Development of the System which Realizes the Proposed Architecture
    1.1.3 Proof of the Validity of the Proposed Architecture by Evaluation Using 100 FPGAs
2 Parallel Stencil Computation by Using Multi-FPGAs
  2.1 Block Division
  2.2 Computation Order
  2.3 Clock Period Variation
3 Architecture and Implementation
  3.1 Development Flow
  3.2 Acquisition of the Position Information
  3.3 Design and Implementation of Synchronization Mechanism
  3.4 System Architecture
4 Evaluation
  4.1 Environment
  4.2 Hardware Resource Consumption
  4.3 Performance of FPGA Array
5 Related Work
6 Conclusion
Acknowledgement
Bibliography
Publication
List of Figures

2.1 Parallel Stencil Computation by Using Multi-FPGA.
2.2 Pseudo Code of 2D Stencil Computation.
2.3 The Computation Order of Grid-points on FPGA. (b) is the proposed method.
2.4 Computing Order Applied Proposed Method.
2.5 Clock Period Variation (Measurement Time: 20 sec).
2.6 Relative Clock Period Variation.
3.1 Development Flow.
3.2 Verification of FPGA Array.
3.3 Acquisition of the Position Information.
3.4 Design of Synchronization Mechanism.
3.5 Implementation of Synchronization Mechanism.
3.6 System Architecture.
3.7 Relationship between the Grid-points and BlockRAM.
3.8 MADD Pipeline Operation.
4.1 Configuration Diagram of the Mesh Connected FPGA Array.
4.2 FPGA Array (100 Nodes).
4.3 Peak and Effective Performance of Stencil Computation in FPGA Array with 16 Nodes.
4.4 Effective Performance per Watt of Stencil Computation in FPGA Array with 16 Nodes.
4.5 Peak and Effective Performance of Stencil Computation in FPGA Array with Operation Frequency 0.06 GHz.
4.6 Effective Performance per Watt of Stencil Computation in FPGA Array with Operation Frequency 0.06 GHz.

List of Tables

2.1 Measurement Result.
4.1 Hardware Resource Consumption.
4.2 Design Parameters.
Chapter 1
Introduction
Stencil computation is one of the typical scientific computing kernels [1]. Various accelerators that solve stencil computation at high speed have been designed using multiple high-end FPGAs [2][3].

I have proposed a high-performance architecture for 2D stencil computation and implemented it on the ScalableCore system, which employs multiple small FPGAs [4].

The ScalableCore system is a 2D-mesh-connected FPGA array that our laboratory has developed as a high-speed simulation environment for many-core processor research [5]; it uses multiple small-capacity FPGAs connected in a 2D mesh. In this thesis, I use the hardware components of the ScalableCore system as an infrastructure for HPC hardware accelerators. The next section shows the contributions of this study.
1.1 Contribution
1.1.1 Invention of the High Performance Architecture for 2D Stencil Computation
To achieve high performance, the pipelines of the execution units should be kept operating effectively. In stencil computation, the whole data set is divided into multiple blocks and each block is assigned to one FPGA. The boundary data of each block is shared with the adjacent FPGAs. In my system, the computation order is customized in each FPGA to increase the acceptable latency of communication between the FPGAs.
1.1.2 Development of the System which Realizes the Proposed Architecture
I develop the system that realizes the proposed architecture in stages. First, I implement a software simulator that emulates stencil computation with cycle-level accuracy on multiple FPGA nodes. Second, I implement circuits based on the software simulator in Verilog HDL. Finally, I implement the circuits on the FPGA array and verify the array. During development, however, the system generated an illegal computation result when multiple FPGA nodes were used. The cause was clock period variation. I therefore evaluate the clock variation of every FPGA node quantitatively, and then design and implement a mechanism that synchronizes the data.
1.1.3 Proof of the Validity of the Proposed Architecture by Evaluation Using 100 FPGAs
I evaluate the performance, scalability, and power consumption of the developed FPGA array. The FPGA array operates successfully, which establishes the validity of the proposed architecture. The 100-FPGA array achieves 0.57 GFlops/W, a performance-per-watt value about four times better than that of an NVIDIA GTX 280 GPU card.

This thesis is organized as follows. Chapter 2 describes parallel stencil computation using multiple FPGAs. Chapter 3 describes the architecture, design, and implementation of the accelerator. In Chapter 4, I show the evaluation results of the developed FPGA array. Chapter 5 gives a survey of related work. Chapter 6 concludes this thesis.
Chapter 2
Parallel Stencil Computation by Using
Multi-FPGAs
Fig. 2.1 shows a typical pattern of 2D stencil computation. In the figure, each circle represents the value of a grid-point, and each grid-point value at the next time-step is computed from the values of its four adjacent grid-points at the current time-step.

Fig. 2.2 shows pseudo code for the 2D stencil computation of Fig. 2.1. In the figure, k represents the time-step and (i, j) the coordinates of a grid-point. Two buffers, v0 and v1, are used for the computation. The value of grid-point (i, j) is represented as vn[i][j], where n is the buffer number (0 or 1). As shown in the fourth line of Fig. 2.2, vn[i][j] is updated with the sum of four values, each obtained by multiplying a weighting factor by one adjacent grid-point (vn[i-1][j], vn[i][j-1], vn[i][j+1], vn[i+1][j]). As shown in the seventh and eighth lines of Fig. 2.2, every grid-point is updated for the next time-step.
2.1 Block Division

As shown in Fig. 2.1, the data set of the stencil computation is divided into several blocks according to the number of vertical and horizontal FPGAs in the array, and each data block is assigned to one FPGA. The computation on each FPGA uses the assigned data together with the shared boundary data of its block, so the necessary boundary data must be sent to the adjacent FPGAs. In Fig. 2.1, a circle represents a grid-point, a group of 4×4 grid-points is assigned to one FPGA, and an arrow represents communication with a neighboring FPGA. Gray regions represent the data subsets communicated to other FPGAs.

This data sharing incurs some data-transfer overhead. To eliminate it, I customized the computation order for each FPGA, as described in the next section.
Fig. 2.1 Parallel Stencil Computation by Using Multi-FPGA.
Fig. 2.2 Pseudo Code of 2D Stencil Computation.
2.2 Computation Order

Fig. 2.3 shows two cases of computation order. Fig. 2.3 (a) shows the case in which FPGA (A) and FPGA (B) compute in the same order. A dotted square shows the data subset assigned to one FPGA. In fact, the computations also use extra boundary data that is not shared, but this extra data is omitted from the figure for simplicity. I define a sequential process that computes all the grid-points of one time-step as an “Iteration”. Each circle represents one grid-point; the letter in a circle is the ID of the FPGA and the number is the computing order within that FPGA, so the computations of each FPGA proceed in the direction of the arrow.
2.2 Computation Order 5
Fig. 2.3 The Computation Order of Grid-points on FPGA. (b) is proposed method.
In this example, each FPGA updates its assigned data of sixteen grid-points (from 0 to 15) during an Iteration. For simplicity, I assume that updating the value of one grid-point takes exactly one cycle, and that several FIFOs are used to avoid illegal modification of the data. In FPGA (A), the value of A0 is computed at the 0th cycle and the value of A1 at the 1st cycle; similarly, in FPGA (B), the value of B0 is computed at the 0th cycle and the value of B1 at the 1st cycle. All the computations proceed in this order. I assume that each FPGA can use data obtained from the other FPGA in a single cycle. After the computations of each Iteration complete, the process proceeds to the next time-step. In this case, an Iteration takes sixteen cycles: the first Iteration begins at the 0th cycle, the second at the 16th cycle, and the third at the 32nd cycle.
Fig. 2.4 Computing Order Applied Proposed Method.

In Fig. 2.3 (a), the computation of grid-point B1 uses the values of the vertically and horizontally adjacent grid-points A13, B5, B0, and B2. The value of grid-point A13 must be communicated between FPGA (A) and FPGA (B) because it is shared by these FPGAs; the other values need no inter-FPGA communication. In this computation order, the value of grid-point A13 is computed at the 13th cycle and the value of grid-point B1 at the 17th cycle. Since the computation of B1 uses the computed value of A13, the value of A13 must be communicated within three cycles (14, 15, 16) after its computation in order not to stall the computation of B1. The values of grid-points A12, A14, and A15 must likewise be communicated within three cycles to avoid stalls. By this argument, if N × M grid-points are assigned to a single FPGA, every shared value must be communicated within N − 1 cycles.
Fig. 2.3 (b) models the case in which FPGA (C) and FPGA (D) compute in reversed orders; this is the proposed method [6]. The computation order of FPGA (C) is the reverse of that of FPGA (A) in Fig. 2.3 (a), while FPGA (B) and FPGA (D) use the same computation order. In this case, in order not to stall the computation of D1 in Iteration 2 (17th cycle), the margin for sending the value of C1 (computed at the 1st cycle) is 15 cycles (2∼16). By this argument, if N × M grid-points are assigned to a single FPGA, the communication latency between vertically adjacent FPGAs may be as large as N × M − 1 cycles. In this way, changing the computation order increases the acceptable communication latency.
Fig. 2.4 shows the computation order of the proposed method in each FPGA. In Fig. 2.4, each square represents an FPGA and each arrow the computation order within it. The FPGAs in the 1st and 3rd rows compute in the same order as FPGA (C) in Fig. 2.3 (b); the FPGAs in the 2nd and 4th rows compute in the same order as FPGA (D) in Fig. 2.3 (b).
As discussed for Fig. 2.3, the proposed method guarantees a communication margin of about one Iteration between FPGAs. Communication between FPGAs facing each other in the direction of the arrows is covered as well: when FPGA (C) and FPGA (D) of Fig. 2.3 (b) are placed one above the other, the compute cycles of their adjacent sides coincide.

Fig. 2.5 Clock Period Variation (Measurement Time: 20 sec).
Next, consider communication between the left and right sides of an FPGA. If two copies of FPGA (C) in Fig. 2.3 (b) are placed side by side, C3 and C0 become adjacent. In this case, the acceptable communication latency for the right FPGA is 12 cycles: the number of cycles of one Iteration minus the number of cycles for the grid-points of one side.

In this way, the proposed method increases the acceptable communication latency by computing vertically adjacent FPGAs in reverse order; in other words, it ensures a margin of about one Iteration. So far I have assumed that the computation of one grid-point takes one cycle. If it takes k cycles instead, the acceptable communication latency between the left FPGA and the right FPGA is (N × M − M) × k cycles.
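As a numerical check on the bound above, the following helper (my own illustration, not code from the thesis) evaluates (N × M − M) × k:

```c
/* Acceptable communication latency between horizontally adjacent
   FPGAs, per the discussion above: (N*M - M) * k cycles, where each
   FPGA holds an N x M block of grid-points and one grid-point update
   takes k cycles. */
int acceptable_latency_lr(int n, int m, int k) {
    return (n * m - m) * k;
}
```

For the 4×4 example with k = 1, this gives (16 − 4) × 1 = 12 cycles, matching the left/right margin derived in the text.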
2.3 Clock Period Variation

Each FPGA node in our system is equipped with a clock oscillator [CSX-750PB(B)] with an operation frequency of 40 MHz and a frequency stability of ±50 ppm. A frequency stability of ±50 ppm means that the clock period variation per 1,000,000 cycles is within ±50 cycles.
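The ppm figure converts directly into a drift bound in cycles. This small helper is my own illustration of that arithmetic, not part of the thesis:

```c
/* Convert a frequency-stability bound in ppm (parts per million)
   into the maximum clock period variation, in cycles, over a given
   number of cycles: drift = cycles * ppm / 1e6. */
double max_drift_cycles(double cycles, double ppm) {
    return cycles * ppm / 1e6;
}
```

For example, ±50 ppm over 1,000,000 cycles bounds the variation to ±50 cycles, exactly as stated above.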
As a preliminary evaluation, I measured the clock period variation of each FPGA node. The clock period variation is the suspected cause of the trouble in which the system generates an illegal computation result when multiple FPGA nodes are used. I used the ScalableCore system to measure the clock period variation. The ScalableCore system emulates an environment for many-core processors, and the emulated many-core processors are synchronized by a concept named the virtual cycle. To examine the frequency stability using this mechanism, I coded an application in C that runs on the many-core processors of the ScalableCore system.

Fig. 2.6 Relative Clock Period Variation.
In this thesis, I measured the clock period variation of each FPGA node by running the implemented program, varying the measurement time for each measurement. I targeted an 8 × 8 FPGA array; the results are shown in Fig. 2.5, Fig. 2.6, and Table 2.1.
Fig. 2.5 shows the clock period variation for a measurement time of 20 sec. The Z-axis represents the clock period variation per 1,000,000 cycles, and the X- and Y-axes represent the position coordinates of each FPGA node. As shown in Fig. 2.5, the clock period variation per 1,000,000 cycles stays within ±50 cycles. This result confirms that the frequency stability of the oscillator is within ±50 ppm.
Table 2.1 shows the worst clock period variation and the standard deviation of the clock variation for each measurement time. The Time, Worst Value, and Standard Deviation columns in Table 2.1 represent the measurement time, the worst clock period variation, and the standard deviation of the clock variation over the 8 × 8 FPGA array, respectively.

Table 2.1 Measurement Result.

Time [sec]   Worst Value [ppm]   Standard Deviation
20           20.47 (x=3, y=5)    4.73
40           20.47 (x=3, y=5)    4.68
80           20.47 (x=3, y=5)    4.73
160          20.59 (x=3, y=5)    4.77
320          20.66 (x=3, y=5)    4.79

As shown in Table 2.1, the Worst Value and the Standard Deviation hardly changed even though the measurement time changed. This result means that the clock period variation does not depend on the measurement time.
Fig. 2.6 shows the clock period variation measured over 320 sec divided by that measured over 20 sec. The X- and Y-axes in Fig. 2.6 are the same as in Fig. 2.5, and the Z-axis represents the clock period variation per 1,000,000 cycles relative to the 20-sec result. As shown in Fig. 2.6, this ratio is about 1 for almost all nodes, which again shows that the clock period variation does not depend on the measurement time.

From these results, it is clear that the system must be designed to tolerate a clock period variation that is almost constant over time.
Chapter 3
Architecture and Implementation
There are two ways to build a large system that cannot fit on a single FPGA chip: connecting a small number of high-end FPGAs, or connecting many small-capacity FPGAs. It is of course also possible to connect many high-end FPGAs, but this approach is unattractive because it wastes more FPGA capacity than necessary.

Using a small number of high-end FPGAs has the merit that the system runs fast. On the other hand, the larger the logic circuits mounted on a single FPGA chip, the more difficult the implementation and verification of the system become.

Using many small-capacity FPGAs makes implementation and verification simpler because the logic circuits mounted on each FPGA chip are smaller. Even if an FPGA node breaks, the system can continue operating after the broken node is replaced with a new one. The operation speed of this method is inferior to that of a small number of high-end FPGAs, but it is attractive in terms of the cost per FPGA node.

For these reasons, I employ the implementation method that connects many small-capacity FPGAs.
3.1 Development Flow

I develop the system that realizes the proposed architecture in stages. Fig. 3.1 shows the development flow.

First, I implement a software simulator (stencil_sim.cc) in C++ that emulates stencil computation with cycle-level accuracy on multiple FPGA nodes (Fig. 3.1 (1)). Its execution results are verified by comparison with those of a function-level stencil computation program (stencil_base.c) coded in C.
Second, I implement circuits (stencil.v) based on the software simulator in Verilog HDL (Fig. 3.1 (2)). I simulate the circuits using iverilog and verify them by comparing their results with those of the software simulator.

Fig. 3.1 Development Flow.
Finally, I implement the circuits on the FPGA array and verify the array by comparing its computation results with those of the function-level program (stencil_base.c) coded in C (Fig. 3.1 (3)). Fig. 3.2 shows the verification of the FPGA array. The red grid-points represent the computation results. The computation result of each FPGA is output to a PC connected by USB at each Iteration.

Fig. 3.2 Verification of FPGA Array.

3.2 Acquisition of the Position Information

As described in Chapter 2, I customized the computation order in each FPGA to increase the acceptable latency of communication between the FPGAs. To determine its computation order, every FPGA uses its position information in the system. I implemented a mechanism to provide this position information.

Fig. 3.3 shows how the position information is provided to all FPGAs; the triangle in Fig. 3.3 represents a NOT circuit.
Every FPGA node can detect whether adjacent FPGA nodes exist. By this mechanism, the FPGA nodes in the fourth (bottom) row recognize that they are located at the bottom. In this implementation, the initial value 0 is set in the bottom FPGA nodes. Each node with a value sends it to the FPGA node above it, and the value is inverted on the way because a NOT circuit is connected between each upper and lower FPGA. Thus 1 is set in the FPGA nodes of the third row, and as the value propagates upward, 0 is set in the nodes of the second row, and so on. The circuit that realizes this mechanism needs no register to latch the value, so it can be implemented at small scale.
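The propagation can be modeled in a few lines of C. This is a sketch of the mechanism described above (row 0 is the top row and the last row is the bottom row; the function name is my own):

```c
/* Model of the position-information mechanism: the bottom row is
   assigned the initial value 0, and the NOT circuit between each
   vertical pair of FPGAs inverts the value as it propagates upward,
   so the rows alternate 0/1 from the bottom up. */
void propagate_position(int rows, int out[]) {
    out[rows - 1] = 0;             /* bottom row: initial value 0 */
    for (int r = rows - 2; r >= 0; r--)
        out[r] = !out[r + 1];      /* inverted by the NOT circuit */
}
```

Each node thereby learns its row parity without any registers, which is enough to choose between the two computation orders of Fig. 2.4.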
Fig. 3.3 Acquisition of the Position Information.

3.3 Design and Implementation of Synchronization Mechanism

I designed and implemented a mechanism that synchronizes all FPGA nodes so that the system operates successfully. The mechanism absorbs the clock period variation.

Fig. 3.4 shows the design of the synchronization mechanism. I define FPGA node (A) in Fig. 3.4 as the Master node. Each FPGA node executes stencil computation in synchronism with a signal transmitted from the Master node; the nodes other than the Master node stall their stencil computation until they receive the synchronization signal.

Fig. 3.4 Design of Synchronization Mechanism.
The synchronization signal is generated by the Master node with a period of α + β cycles and sent to the other nodes. α is the number of cycles required to execute the stencil computation of one Iteration, and β is the margin that absorbs the clock period variation of the FPGA nodes; it must be large enough to hide the worst clock period variation over α cycles.

Fig. 3.5 Implementation of Synchronization Mechanism.
Fig. 3.5 shows the implementation of the synchronization mechanism. The α and β in Fig. 3.5 are the same as those in Fig. 3.4. The FPGA nodes other than the Master node forward the synchronization signal they receive to the FPGA nodes to their right and below; therefore, all FPGA nodes are synchronized to the Master node. An FPGA node that receives the synchronization signal restarts its stencil computation after waiting several cycles; this wait prevents chattering.
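Sizing the period α + β can be sketched as follows. The thesis does not give an explicit formula, so the conservative round-up below is my own assumption, combining the α + β design with the worst-case ppm figures of Table 2.1:

```c
/* Sketch of sizing the synchronization period: beta must hide the
   worst clock period variation over alpha cycles. With a worst-case
   variation of worst_ppm, the drift over alpha cycles is
   alpha * worst_ppm / 1e6; adding 1 rounds up conservatively.
   Illustrative only, not the thesis's exact sizing. */
long sync_margin(long alpha, double worst_ppm) {
    return (long)(alpha * worst_ppm / 1e6) + 1;
}

long sync_period(long alpha, double worst_ppm) {
    return alpha + sync_margin(alpha, worst_ppm);
}
```

Since the measured worst variation (about 20.5 ppm, Table 2.1) is well under the oscillator's ±50 ppm rating, a β computed from the 50 ppm bound leaves comfortable headroom.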
3.4 System Architecture

Fig. 3.6 shows the system architecture with eight multiply-adder units. Each square in the figure represents a BlockRAM*1, and Sync is the module that synchronizes all FPGA nodes. The DES in the figure is a deserializer that receives data from the adjacent FPGAs, and the SER is a serializer that sends data to the adjacent FPGAs. A FIFO is placed at the input of the serializer; only the MADD computation results that must be sent to adjacent FPGAs are written into this FIFO. The Ser/Des implementation uses data recovery and NRZI coding. MADD is the multiply-adder unit, and the squares inside it represent registers. Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754, each with seven pipeline stages. Since two additional registers are included in the MADD, the data path of the MADD becomes a sixteen-stage pipeline.

*1 BlockRAM is the low-latency SRAM that each FPGA has.
Fig. 3.6 System Architecture.
Therefore, the data path can be regarded as an eight-stage adder connected to an eight-stage multiplier. This pipeline scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder; for this reason, I chose eight stages for both. The reason is explained later in this thesis.
Fig. 3.7 shows the relationship between the BlockRAMs in Fig. 3.6 and the grid-points; the numbers written on the BlockRAMs in Fig. 3.6 correspond to the numbers in Fig. 3.7. In Fig. 3.7, the data set assigned to each FPGA is split in the vertical direction and stored in BlockRAMs 0∼7, which are surrounded by the dashed line. If a data set of 64×128 grid-points is assigned to one FPGA, each BlockRAM stores a split data set of 8×128. In addition, the data of the communication region is stored in one or more further BlockRAMs outside BlockRAMs 0∼7; the communication region is the set of data transferred to the adjacent nodes.
Fig. 3.7 Relationship between the Grid-points and BlockRAM.
Fig. 3.8 MADD Pipeline Operation.
Fig. 3.8 shows the pipelined operation of the MADD. A circle in the figure represents the value of a grid-point, and a square represents a computation result: the value of a grid-point multiplied by a weighting factor. Both the multiplier and the adder have eight pipeline stages. Fig. 3.8 (a) shows the numbering of the grid-points.
I explain the computation of grid-points 11∼18. First, grid-points 1∼8 are loaded from BlockRAM and input to the multiplier in cycles 0∼7. Next, while the multiplier outputs those results, grid-points 10∼17 are input to it in cycles 8∼15. Then grid-points 12∼19 are input to the multiplier while the weighted values of grid-points 1∼8 and 10∼17 are summed in cycles 16∼23. Finally, the results in which the upper, lower, left, and right grid-points have been multiplied by the weighting factors and summed are output in cycles 41∼48.
The data of a grid-point that will still be used must not be overwritten when computation results are written back to BlockRAM. The general approach therefore stores the results in a temporary buffer, such as a FIFO, before writing them to BlockRAM. The proposed architecture, however, needs no additional temporary buffer because the MADD pipeline provides the same functionality. In the case of Fig. 3.8, the data of grid-points 11∼18 are updated in cycles 40∼48, while the old data of grid-points 11∼18 are input to the multiplier in cycles 32∼40 and are not used afterwards. Therefore, for the computation within a single FPGA, the update order of the data is preserved without using a FIFO. As explained above, this scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder; since both have eight stages, the grid width that one MADD processes is eight.
This architecture keeps the pipelines filled almost 100% of the time. The filling rate of the pipelines is ((N − 8)/N) × 100%, where N is the number of cycles the computation takes. In addition, this architecture uses no extra temporary buffer for updating data. Therefore, it achieves high computation performance with a small circuit area.
Chapter 4
Evaluation
4.1 Environment

Fig. 4.1 shows the hardware configuration of the FPGA array*1. The array can be scaled freely according to the grid size of the stencil computation by connecting FPGAs in a mesh. Therefore, the larger the number of FPGA nodes, the larger the grid size that can be computed in the same time.
Each node in the FPGA array is equipped with an FPGA (Xilinx Spartan-6 XC6SLX16), and the BlockRAM capacity of each FPGA is 64KB. The MADD units are implemented with IP cores generated by the CORE Generator. A single MADD consumes four of the 32 DSP blocks that a Spartan-6 FPGA provides. Therefore, up to eight MADD units can be implemented in a single FPGA.
I coded these circuits in Verilog HDL and used Xilinx ISE 14.2 to generate the circuit configurations.
As described in Section 3.1, I used a stencil computation program coded in C to verify the implemented circuits and to compare execution speed. The verification program uses the SoftFloat library, whose computation precision matches the floating-point arithmetic of the FPGA. The program for the speed comparison does not use the library, because it is important for that version to run as fast as possible.
Section 4.3 presents the performance evaluation of the FPGA array in which eight MADD units are implemented per FPGA. The grid size*2 is a 2D data set of 64×128 and the number of iterations is 5,800,000. The computation result is output to a PC via USB at each iteration and compared with the execution result of the C program; the data matched at every iteration.
*1 The SRAM in Fig. 4.1 is not used.
*2 The total number of grid-points that can be stored is 64KB (BlockRAM capacity) ÷ 4B (data size of a grid-point, single-precision floating-point) = 16K. However, the grid width is 64 because of the number of MADD units and the scheduling condition.
Fig. 4.1 Configuration Diagram of the Mesh Connected FPGA Array.
Table 4.1 Hardware Resource Consumption.
Device Utilization Summary
Slice Logic Utilization Used / Available Utilization
LUTs 7,805 / 9,112 85%
Slices 2,271 / 2,278 99%
BlockRAM 28 / 32 87%
DSP48A1 32 / 32 100%
4.2 Hardware Resource Consumption

Table 4.1 shows the hardware resource consumption of a single FPGA. The utilization of the DSP blocks is 100%, because implementing eight MADD units consumes all of them. Slice utilization is 99%; this figure includes, in addition to the MADD units, the communication module for output to the PC, the Ser/Des, and the module that synchronizes all FPGA nodes.
Fig. 4.2 FPGA Array (100 Nodes).
4.3 Performance of FPGA Array

Fig. 4.2 and Table 4.2 show the 100-node FPGA array and the design parameters used to analyze its performance, respectively.
The operating frequency is F GHz, the number of MADD units implemented in each FPGA is NMADD, and the number of FPGAs is NFPGA. Each MADD can perform an addition and a multiplication simultaneously on every cycle. For this reason, the hardware peak performance of a single MADD is 2F GFlop/s, and that of a single FPGA is 2F × NMADD GFlop/s. Therefore, the hardware peak performance of an FPGA array with NFPGA FPGAs is given below.
Table 4.2 Design Parameters.
operating frequency F [GHz]
number of FPGAs NFPGA
number of MADD units NMADD
hardware peak performance Ppeak [GFlop/s]
number of computations per grid-point OP
total number of grid-points GRID
number of iterations ITER
Ppeak = 2 × F × NFPGA × NMADD (4.1)
When the operating frequency is 0.06GHz, the hardware peak performance Ppeak of the 100-node FPGA array is 96GFlop/s, since NMADD is 8 and NFPGA is 100. However, as shown in Fig. 2.2, the average utilization of the MADD units is ((4 + 3)/8) × 100 = 87.5%, because the computation of a single grid-point consists of seven floating-point operations*3. Therefore, the achievable peak performance at 0.06GHz is 96 × 0.875 = 84GFlop/s.
Fig. 4.3 shows the peak and effective performance of the stencil computation in the 16-node FPGA array depending on the operating frequency. The effective performance is the total number of floating-point operations divided by the execution time, which I measured with a stopwatch. The total number of floating-point operations per FPGA node is given below, using OP, GRID, and ITER from Table 4.2*4.
OP × GRID × ITER = 7 × 64 × 128 × 5,800,000
As shown in Fig. 4.3, the peak and effective performance of the stencil computation are almost the same, which shows that the overhead of the proposed computation method is small. Moreover, I compiled the stencil computation program coded in C for comparison with the "-O3" option. Its effective performance is 3.31GFlop/s when running on a single thread of an Intel Core i7-2700K with operation frequency 3.5GHz; this is comparable to the 2.8GFlop/s reported in [7]. The effective performance of the 16-node FPGA array at 0.06GHz is 13.42GFlop/s, while that of the Intel Core i7-2700K with operation frequency 3.5GHz
*3 Four multiplications and three additions.
*4 OP is the total number of computations required to update the data of one grid-point from time-step k to k+1. In this case, OP is seven: four multiplications and three additions.
Fig. 4.3 Peak and Effective Performance of Stencil Computation in FPGA Array with 16 Nodes.
Fig. 4.4 Effective Performance per Watt of Stencil Computation in FPGA Array with 16 Nodes.
is 3.31GFlop/s. Therefore, the 16-node FPGA array achieves about four times the performance of the Intel Core i7-2700K (single thread).
Fig. 4.4 shows the effective performance per watt of the stencil computation in the 16-node FPGA array depending on the operating frequency. As shown in Fig. 4.4, the effective performance per watt improves as the operating frequency increases.
Fig. 4.5 shows the peak and effective performance of the stencil computation in the FPGA array with operating frequency 0.06GHz depending on the number of nodes. This system has the characteristic that the larger the number of FPGA nodes, the larger the grid size that can be computed in the same time. A single FPGA node computes a 2D data set of 64×128; thus, a 10×10-node FPGA
Fig. 4.5 Peak and Effective Performance of Stencil Computation in FPGA Array with Operation
Frequency 0.06GHz.
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
!#*"
$" %" '" +" $)" &%" )'" $!!"
!"#$%#&
'()"*+"#*,'-*./01%+2345!
67&8"#*%$*0!/9*(%:"3!
Fig. 4.6 Effective Performance per Watt of Stencil Computation in FPGA Array with Operation
Frequency 0.06GHz.
array can compute a 2D data set of 640×1280.
As shown in Fig. 4.5, I confirmed that the peak and effective performance of the stencil computation are almost the same. The 100-node FPGA array computed the 640×1280 data set in 396sec, whereas a single thread of an Intel Core i7-2700K with operation frequency 3.5GHz computed it in 10,053sec. Therefore, the 100-node FPGA array is about 25 times faster than the Intel Core i7-2700K (single thread).
Fig. 4.6 shows the effective performance per watt of the stencil computation in the FPGA array with operation frequency 0.06GHz depending on the number of nodes. As shown in the figure, the effective performance per watt improves as the number of nodes increases. The effective performance per watt of the 100-node FPGA array is 0.57GFlop/sW. This performance/W value is 3.8 times better than the 0.15GFlop/sW of the NVIDIA GTX280 GPU card reported in [1].
Chapter 5
Related Work
Many works that optimize stencil computation for multi-core processors and GPUs have been reported.
Augustin et al. [7] report the execution of stencil computation on an Intel Xeon E5220 quad-core processor running at 2.26GHz. A single core of the processor achieves 2.8GFlop/s, just 31% of the peak performance. Moreover, two E5220 processors achieve 15.9GFlop/s with 8 cores, 21.8% of the peak.
Phillips et al. [8] report the execution of stencil computation on an NVIDIA Tesla C1060 GPU. A single GPU achieves 51.2GFlop/s, 65.6% of the peak performance in double-precision arithmetic. A GPU cluster reduces this performance further: for a 256×256×512 grid, the computation performance is 42.2% of the peak.
Several studies have designed hardware for stencil computation using FPGAs [2][9]. Sano et al. [2] propose hardware for stencil computation composed of a systolic array of programmable processing elements and implement a prototype using multiple FPGAs (Altera Stratix family). They achieve performance scalability with constant memory bandwidth by applying a pipeline-scheduling method originally proposed for cellular automata. However, this work differs from ours in its architecture and in the type of FPGA used. Sato et al. [9] implement circuits that solve Poisson's equation on an FPGA array.
Moreover, array systems connecting multiple FPGAs have been reported. [10] presents CUBE, an accelerator for scientific computation composed of 512 linearly connected FPGAs. [11] studies the string Levenshtein distance algorithm, an integer-arithmetic-intensive application, on CUBE.
Chapter 6
Conclusion
I proposed a high-performance architecture for 2D stencil computation and developed a system that realizes the proposed architecture in stages. First, I implemented a software simulator that emulates the stencil computation on multiple FPGA nodes with cycle-level accuracy. Second, I implemented circuits based on the software simulator in Verilog HDL. Finally, I implemented the circuits on the FPGA array and verified its operation.
Moreover, I quantitatively evaluated the clock variation among the FPGA nodes, and then designed and implemented a mechanism to synchronize the data so that the system works correctly.
As a result, I established the validity of the proposed architecture, since the FPGA array operated successfully. The FPGA array with 100 FPGAs achieved 0.57GFlop/sW; this performance/W value was about four times better than that of the NVIDIA GTX280 GPU card.
In future work, I will develop a system that allows operation from the host PC, and will design and implement the system for lower power consumption.
Acknowledgement
I would like to express my deep gratitude to Associate Prof. Kenji Kise. He has been my supervisor and has supported me for the two years of my master's course. His constant support, guidance, and encouragement have been essential for me to complete my thesis. In addition, he provided me with a comfortable research environment in Kise Laboratory.
I also would like to thank all the members at Kise Laboratory, particularly Mr. Shinya Takamaeda-
Yamazaki and Mr. Shintaro Sano who graduated from Tokyo Tech, for active discussions and helpful
advice.
I would like to thank Lecturer Ryotaro Kobayashi and all the members of Kobayashi Laboratory at Toyohashi University of Technology, as well as Associate Prof. Tomoaki Tsumura and all the members of Tsumura Laboratory at Nagoya Institute of Technology, for active discussions and helpful advice.
I would like to sincerely thank Prof. Haruo Yokota, Associate Prof. Suguru Saito, Associate Prof. Takuo Watanabe, and Lecturer Haruhiko Kaneko for their careful review and fruitful suggestions as members of the thesis committee.
Finally, I am deeply grateful to my parents for a wide variety of support during the years of my
studies.
This work is supported in part by Core Research for Evolutional Science and Technology (CREST),
JST.
Bibliography
[1] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pp. 4:1–4:12, Piscataway, NJ, USA, 2008. IEEE Press.
[2] K. Sano, Y. Hatsuda, and S. Yamamoto. Scalable streaming-array of simple soft-processors for stencil computations with constant memory-bandwidth. In Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, pp. 234–241, May 2011.
[3] M. Shafiq, M. Pericas, R. de la Cruz, M. Araya-Polo, N. Navarro, and E. Ayguade. Exploiting memory customization in FPGA for 3D stencil computations. In Field-Programmable Technology, 2009. FPT 2009. International Conference on, pp. 38–45, Dec. 2009.
[4] R. Kobayashi et al. Towards a low-power accelerator of many FPGAs for stencil computations. In Networking and Computing (ICNC), 2012 Third International Conference on, pp. 343–349, Dec. 2012.
[5] Shinya Takamaeda-Yamazaki, Shintaro Sano, Yoshito Sakaguchi, Naoki Fujieda, and Kenji Kise. In International Symposium on Applied Reconfigurable Computing (ARC 2012), March 2012.
[6] Ryohei Kobayashi, Shintaro Sano, Shinya Takamaeda-Yamazaki, and Kenji Kise. High performance stencil computation on mesh connected FPGA arrays. In Transactions on Symposium on Advanced Computing Systems and Infrastructures, Vol. 2012, pp. 142–149, May 2012.
[7] Werner Augustin, Vincent Heuveline, and Jan-Philipp Weiss. Optimized stencil computation using in-place calculation on modern multicore systems. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pp. 772–784, Berlin, Heidelberg, 2009. Springer-Verlag.
[8] E. H. Phillips and M. Fatica. Implementing the Himeno benchmark with CUDA on GPU clusters. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1–10, April 2010.
[9] Kazuki Sato, Li Jiang, Kenichi Takahashi, Hakaru Tamukoh, Yuichi Kobayashi, and Masatoshi Sekine. Performance evaluation of Poisson equation and CIP method implemented on FPGA array. IEICE Technical Report, Circuits and Systems, Vol. 109, No. 396, pp. 19–24, Jan. 2010.
[10] O. Mencer, Kuen Hung Tsoi, S. Craimer, T. Todman, Wayne Luk, Ming Yee Wong, and Philip Leong. Cube: A 512-FPGA cluster. In Programmable Logic, 2009. SPL. 5th Southern Conference on, pp. 51–57, April 2009.
[11] Masato Yoshimi, Yuri Nishikawa, Hideharu Amano, Mitsunori Miki, Tomoyuki Hiroyasu, and Oskar Mencer. Performance evaluation of one-dimensional FPGA-cluster Cube for stream applications. IPSJ Transactions on Advanced Computing Systems, Vol. 3, No. 3, pp. 209–220, Sep. 2010.
Publication
International Symposium (Peer-reviewed)

1. Ryohei Kobayashi, Shinya Takamaeda-Yamazaki, and Kenji Kise: Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations, Workshop on Challenges on Massively Parallel Processors (CMPP 2012) held in conjunction with ICNC'12, pp. 343-349 (December 2012).

National Symposium (Peer-reviewed)

2. Ryohei Kobayashi, Shintaro Sano, Shinya Takamaeda-Yamazaki, and Kenji Kise: High Performance Stencil Computation on Mesh Connected FPGA Arrays, Transactions on Symposium on Advanced Computing Systems and Infrastructures (SACSIS2012), pp. 142-149 (May 2012).

Oral Presentation

3. Ryohei Kobayashi, Shinya Takamaeda-Yamazaki, and Kenji Kise: Design and Implementation of High Performance Stencil Computer by using Mesh Connected FPGA Arrays, IEICE Technical Report (RECONF), pp. 159-164 (January 2013).

4. Ryohei Kobayashi, Shintaro Sano, Shinya Takamaeda-Yamazaki, and Kenji Kise: Examination of the Stencil Computation onto Mesh Arrays of FPGAs, The 74th National Convention of IPSJ, Vol. 1, No. 4J-4, pp. 107-108 (March 2012).