Building Custom Arithmetic Operators with the FloPoCo Generator
Florent de Dinechin
√(x² + y² + z²)    πx    sin(x + y)    ∑_{i=0}^{n} x_i    √x · log x
Outline
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 2
Introduction: custom arithmetic
Two different ways of wasting silicon
Here are two universally programmable chips.
Who’s best for (insert your computation here)?
Are FPGAs any good at floating-point ?
Long ago (1995), people ported the basic operations: +, −, ×.
Versus the highly optimized FPU in the processor, each operator was 10× slower in an FPGA.
This is the unavoidable overhead of programmability.
If you lose according to a metric, change the metric.
Peak figures for the double-precision floating-point exponential:
Pentium core: 20 cycles / DPExp @ 4 GHz = 200 MDPExp/s
FPExp in FPGA: 1 DPExp/cycle @ 400 MHz = 400 MDPExp/s
Chip vs. chip: 6 Pentium cores vs. 150 FPExp per FPGA
Power consumption is also better; single-precision figures are better still.
(Intel MKL vector libm vs. FPExp in FloPoCo version 2.0.0)
Dura Amdahl lex, sed lex
SPICE Model-Evaluation, cut from Kapre and DeHon (FPL 2009)
Custom arithmetic (not your Pentium’s)
Never compute 1 bit more accurately than needed!
[Figure: architecture of a custom floating-point exponential: shift to fixed-point, constant multipliers by 1/log(2) and log(2), a truncated multiplier, a precomputed ROM and a generic polynomial evaluator computing e^Z − Z − 1, then normalize/round. Datapath widths are annotated in terms of wE, wF, the number of guard bits g and a split parameter k.]
Hence: we need a generator.
Useful operators that make sense in a processor
Should a processor include elementary functions? Yes (Paul & Wilson, 1976); No since the transition to RISC.
Should a processor include a divider and square root? Yes (Oberman et al., Arith 1997); No since the transition to FMA (IBM, then HP, then Intel).
Should a processor include decimal hardware? Yes, says IBM; No, says Intel.
Should a processor include a multiplier by log(2)? No, of course.
Useful operators that make sense in an FPGA or ASIC
Elementary functions? Yes, iff your application needs it.
Divider or square root? Yes, iff your application needs it.
Decimal hardware? Yes, iff your application needs it.
A multiplier by log(2)? Yes, iff your application needs it.
In FPGAs, useful means: useful to one application.
Enough work to keep me busy to retirement
Arithmetic operators useful to at least one application :
Elementary functions (sine, exponential, logarithm...)
Algebraic functions (x/√(x² + y²), polynomials, ...)
Compound functions (log₂(1 ± 2ˣ), e^(−Kt²), ...)
Floating-point sums, dot products, sums of squares
Specialized operators : constant multipliers, squarers, ...
Complex arithmetic
LNS arithmetic
Decimal arithmetic
Interval arithmetic
...
Oh yes, basic operations, too.
What do we call arithmetic operators ?
An arithmetic operation is a function (in the mathematical sense):
few well-typed inputs and outputs
no memory or side effects (usually)
An operator is the implementation of such a function
IEEE-754 FP standard: operator(x) = rounding(operation(x))
→ Clean mathematical definition (even for floating-point arithmetic)
The operator as a circuit...
... is a directed acyclic graph (DAG):
easy to build and pipeline
easy to test against its mathematical specification
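The specification operator(x) = rounding(operation(x)) is easy to play with numerically. Here is a minimal Python sketch; the helper names and the choice of a 23-bit fixed-point rounding are illustrative assumptions, not FloPoCo code:

```python
import math

def round_to_bits(x: float, p: int) -> float:
    """Round x to p fractional bits; Python's round() breaks
    ties to even, like the IEEE-754 default rounding."""
    scale = 2.0 ** p
    return round(x * scale) / scale

def sqrt_operator(x: float, p: int = 23) -> float:
    """Operator = rounding(operation): the mathematical square
    root followed by a single final rounding to the target
    format, so the result is within half an ulp of exact."""
    return round_to_bits(math.sqrt(x), p)

# Exact results pass through unchanged...
assert sqrt_operator(4.0) == 2.0
# ...and inexact ones are off by at most 2^-(p+1):
assert abs(sqrt_operator(2.0) - math.sqrt(2.0)) <= 2.0 ** -24
```

This single-final-rounding contract is what makes an operator easy to test against its mathematical specification.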
The benefits of custom computing
Example: a floating-point sum of squares
x² + y² + z²
(not a toy example but a useful building block)
A square is simpler than a multiplication: half the hardware is required
x², y², and z² are positive: half of your FP adder is useless
Accuracy can be improved: 5 rounding errors in the floating-point version; (x² + y²) + z² is asymmetrical
The FloPoCo Recipe
Floating-point interface for convenience
Clear accuracy specification for computing just right
Fixed-point internal architecture for efficiency
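The three recipe ingredients can be sketched as an executable model. Below is a Python sketch of the fixed-point core; the parameters wf and g and the alignment scheme are illustrative assumptions, not FloPoCo's exact datapath:

```python
import math

def sum_of_squares(x, y, z, wf=23, g=3):
    """Fixed-point model of a custom x^2 + y^2 + z^2 operator:
    the three squares are aligned to the largest exponent and
    accumulated as integers on wf + g bits (g guard bits), with
    a single rounding at the end.  wf, g and the alignment
    scheme are illustrative, not FloPoCo's actual datapath."""
    squares = [v * v for v in (x, y, z)]
    if not any(squares):
        return 0.0
    e_max = max(math.frexp(s)[1] for s in squares if s)
    scale = 1 << (wf + g)
    # One rounding per aligned square, then one exact integer sum:
    acc = sum(round(s * scale / 2.0 ** e_max) for s in squares)
    return acc * 2.0 ** e_max / scale

# Integer accumulation makes the result symmetrical in x, y, z,
# unlike the floating-point expression (x*x + y*y) + z*z:
assert sum_of_squares(1.0, 1e-4, 3.0) == sum_of_squares(3.0, 1.0, 1e-4)
```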
A floating-point adder
[Figure: classical floating-point adder: exponent difference and operand swap, a close path (1-bit shift, leading-zero count/shift) and a far path (alignment shift, sticky bit), a 2-bit prenormalization shift, then rounding, normalization and exception handling. Most datapaths are p or p + 1 bits wide.]
A fixed-point architecture
[Figure: the corresponding fixed-point architecture for x² + y² + z²: unpack, sort by exponent, two shifters aligning the smaller squares, three squarers, a single fixed-point add, then normalize/pack. Internal widths are of the form 2 + wF + g, where g is the number of guard bits.]
The benefits of custom computing
A few results for floating-point sum-of-squares on Virtex-4:

Single precision     area                   performance
LogiCore classic     1282 slices, 20 DSP    43 cycles @ 353 MHz
FloPoCo classic      1188 slices, 12 DSP    29 cycles @ 289 MHz
FloPoCo custom        453 slices,  9 DSP    11 cycles @ 368 MHz

Double precision     area                   performance
FloPoCo classic      4480 slices, 27 DSP    46 cycles @ 276 MHz
FloPoCo custom       1845 slices, 18 DSP    16 cycles @ 362 MHz
All performance metrics improved; FLOP/s per unit area more than doubled.
Plus: the custom operator is more accurate, and symmetrical.
Custom also means : custom pipeline
[Figure: the same fixed-point sum-of-squares architecture as before; the pipeline registers are placed differently depending on the target frequency.]
One operator does not fit all
Low frequency, low resource consumption
Faster but larger (more registers)
Combinatorial
History of the FloPoCo project
Initial goal: FPGA arithmetic the way it should be
that is: open-ended, unlike processor arithmetic
an open-ended list of custom operators
open-ended data formats: all operators fully parameterized
open-ended performance trade-offs: a flexible pipeline
General philosophy: computing just right
A generator framework (written in C++, outputting VHDL)
Objectives:
portable, target-agnostic (more on this later)
the VHDL should be simulatable and synthesizable using free tools
Design philosophy:
VHDL generation by printing arbitrary VHDL from C++
automate repetitive and error-prone tasks (such as pipelining)
Beyond the plan
Custom arithmetic for ASIC and HLS
the FloPoCo framework was successfully used to design the FPU of the Kalray processor
FloPoCo provides the floating-point back-end to the PandA project (Politecnico di Milano)
Still not “custom” enough for my taste
FloPoCo for application developers
Here should come a demo
FloPoCo is freely available from
http://flopoco.gforge.inria.fr/
Command-line syntax: a sequence of operator specifications
Options: target frequency, target hardware, ...
Output: synthesizable VHDL.
First something classical
A single-precision floating-point adder (8-bit exponent and 23-bit mantissa):
flopoco -pipeline=no FPAdder 8 23
Final report:
|---Entity FPAdder_8_23_uid2_RightShifter
|---Entity IntAdder_27_f400_uid7
|---Entity LZCShifter_28_to_28_counting_32_uid14
|---Entity IntAdder_34_f400_uid17
Entity FPAdder_8_23_uid2
Output file: flopoco.vhdl
To probe further:
flopoco -pipeline=no FPAdder 11 52    (double precision)
flopoco -pipeline=no FPAdder 9 34     (just right for you)
Classical floating-point, continued
A complete single-precision FPU in a single VHDL file:
flopoco -pipeline=no FPAdder 8 23 FPMultiplier 8 23 23 FPDiv 8 23 FPSqrt 8 23
Final report:
|---Entity FPAdder_8_23_uid2_RightShifter
|---Entity IntAdder_27_f400_uid7
|---Entity LZCShifter_28_to_28_counting_32_uid14
|---Entity IntAdder_34_f400_uid17
Entity FPAdder_8_23_uid2
Entity Compressor_2_2
Entity Compressor_3_2
| |---Entity IntAdder_49_f400_uid39
|---Entity IntMultiplier_UsingDSP_24_24_48_unsigned_uid26
|---Entity IntAdder_33_f400_uid47
Entity FPMultiplier_8_23_8_23_8_23_uid24
Entity FPDiv_8_23
Entity FPSqrt_8_23
Output file: flopoco.vhdl
Frequency-directed pipelining
The same FPAdder, pipelined for 300 MHz:
flopoco -pipeline=yes -frequency=300 FPAdder 8 23
FloPoCo interface to pipeline construction
“Please pipeline this operator to work at 200MHz”
Not the choice made by other core generators...
... but better because compositional
When you assemble components working at frequency f, you obtain a component working at frequency f.
Examples of pipeline
flopoco -pipeline=yes -frequency=400 FPAdder 8 23
Final report:
|---Entity FPAdder_8_23_uid2_RightShifter
| Pipeline depth = 1
|---Entity IntAdder_27_f400_uid7
| Pipeline depth = 1
|---Entity LZCShifter_28_to_28_counting_32_uid14
| Pipeline depth = 4
|---Entity IntAdder_34_f400_uid17
| Pipeline depth = 1
Entity FPAdder_8_23_uid2
Pipeline depth = 10
flopoco -pipeline=yes -frequency=200 FPAdder 8 23
Final report:
(...)
Pipeline depth = 4
Of course the frequency depends on the target FPGA
flopoco -pipeline=yes -frequency=200 -target=Spartan3 FPAdder 8 23
Final report:
(...)
Pipeline depth = 14
flopoco -pipeline=yes -frequency=200 -target=Virtex6 FPAdder 8 23
Final report:
(...)
Pipeline depth = 3
Altera and Xilinx targets are currently supported (at various levels of accuracy): Spartan3, Virtex4, Virtex5, Virtex6, StratixII, StratixIII, StratixIV, StratixV, CycloneII, CycloneIII, CycloneIV, CycloneV.
Also match the architecture to the target FPGA
Compare:
flopoco -pipeline=no -target=Virtex4 IntConstDiv 16 3 -1
flopoco -pipeline=no -target=Virtex6 IntConstDiv 16 3 -1
[Figure: division by 3 as a chain of 4 LUTs: each LUT consumes one 4-bit digit of x (x3 ... x0, MSB first) together with the 2-bit incoming remainder, and produces one 4-bit quotient digit (q3 ... q0) and the next remainder; r4 = 0 comes in, and the last remainder r0 = r comes out.]
Architecture specificities
LUTs
DSP blocks
memory blocks
And FloPoCo attempts to model the delays of these blocks for efficient pipelining.
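The LUT chain of the divider-by-3 figure is a plain digit-recurrence division, MSB first. A Python model of it, using radix-16 digits to match the 4-bit chunks in the figure (function and parameter names are mine):

```python
def const_div3_lut(x: int, n_digits: int = 4, k: int = 4):
    """Division by the constant 3 as a chain of small LUTs,
    one per radix-2^k digit of x, MSB first.  Each stage maps
    (incoming remainder, digit) -> (quotient digit, next
    remainder); its input is only k + 2 bits wide, which is
    why each stage fits in a row of FPGA LUTs."""
    d = 3
    radix = 1 << k
    # The per-stage table: (r, xi) -> (q_digit, r')
    lut = {(r, xi): divmod(r * radix + xi, d)
           for r in range(d) for xi in range(radix)}
    digits = [(x >> (k * i)) & (radix - 1)
              for i in reversed(range(n_digits))]  # MSB-first
    q, r = 0, 0
    for xi in digits:
        q_digit, r = lut[(r, xi)]
        q = (q << k) | q_digit
    return q, r          # invariant: x == 3 * q + r

assert const_div3_lut(12345) == (4115, 0)
```

The same recipe works for any small constant divisor d: only the table contents change, and the remainder stays below d.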
Frequency-directed pipelining in practice
We do our best, but we know it’s hopeless
The actual frequency obtained will depend on the whole application (placement, routing pressure, etc.)...
best-effort philosophy,
aiming to be within 10% for an operator synthesized alone
asking for a higher frequency yields a deeper pipeline
And a big TODO: VLSI targets.
Non-standard operators
Correctly rounded divider by 3: flopoco FPConstDiv 8 23 3
Floating-point exponential: flopoco FPExp 8 23
Multiplication of a 32-bit signed integer by the constant 1234567 (two algorithms, your mileage may vary):
flopoco IntIntKCM 32 1234567 0
flopoco IntConstMult 32 1234567
Full list in the documentation, or by typing just flopoco. Sorry for the sometimes incomplete or inconsistent interface.
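The shift-and-add idea behind constant multipliers can be sketched with canonical signed-digit (CSD) recoding. This is a generic illustration of the principle, not FloPoCo's actual (more sophisticated) IntConstMult algorithm:

```python
def csd(c: int):
    """Canonical signed-digit (non-adjacent form) recoding of a
    positive constant: a list of (weight, sign) pairs, sign in
    {+1, -1}, with c == sum(sign * 2**weight).  No two nonzero
    digits are adjacent, so an n-bit constant needs at most
    about n/2 additions/subtractions."""
    digits, w = [], 0
    while c:
        if c & 1:
            s = 1 if (c & 3) == 1 else -1  # keep the next digit zero
            digits.append((w, s))
            c -= s
        c >>= 1
        w += 1
    return digits

def const_mult(x: int, c: int) -> int:
    """Multiply by the constant c with shifts and adds/subs only,
    as a hardware shift-and-add constant multiplier would."""
    return sum(s * (x << w) for w, s in csd(c))

assert const_mult(42, 1234567) == 42 * 1234567
# CSD never uses more terms than the plain binary expansion:
assert len(csd(1234567)) <= bin(1234567).count("1")
```

Each (weight, sign) pair costs one shifted add or subtract, which is why sparse recodings translate directly into smaller circuits.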
Don’t trust us
Two operators, TestBench and TestBenchFile, generate test benches for the operator preceding them on the command line:
flopoco FPExp 8 23 TestBenchFile 10000 generates 10,000 random tests
flopoco IntConstDiv 16 3 -1 TestBenchFile -2 generates an exhaustive test
Specification-based test bench generation
Not by simulation of the generated architecture!
Helper functions for encoding/decoding the FP format, if you want to check the testbench...
fp2bin 9 36 3.1415926
bin2fp 9 36
010100000000100100100001111110110100110100010011
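The decoding that bin2fp performs can be sketched in Python. This assumes FloPoCo's FP format as documented: 2 exception bits (00 zero, 01 normal, 10 infinity, 11 NaN), a sign bit, a wE-bit biased exponent and a wF-bit fraction; the function is my reimplementation, to be checked against the real tool:

```python
def bin2fp(bits: str, wE: int, wF: int) -> float:
    """Decode FloPoCo's FP format: 2 exception bits, sign,
    wE-bit biased exponent, wF-bit fraction with an implicit
    leading 1 (inf/NaN deliberately not handled here)."""
    assert len(bits) == 2 + 1 + wE + wF
    exn, sign = bits[0:2], bits[2]
    if exn == "00":
        return 0.0
    assert exn == "01", "inf/NaN not handled in this sketch"
    bias = (1 << (wE - 1)) - 1
    e = int(bits[3:3 + wE], 2) - bias
    frac = int(bits[3 + wE:], 2)
    m = 1.0 + frac / (1 << wF)           # implicit leading 1
    return (-m if sign == "1" else m) * 2.0 ** e

# The 48-bit string from the slide decodes back to ~3.1415926:
s = "010100000000100100100001111110110100110100010011"
assert abs(bin2fp(s, 9, 36) - 3.1415926) < 1e-9
```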
Open-ended operators
A polynomial evaluator for arbitrary functions
Example:
flopoco FunctionEvaluator "(sin(x*Pi/2))^2" 32 32 4
4 is the degree of the polynomial; it allows expressing a memory/multiplier trade-off
Works for the set of functions for which it works
Another one is HOTBM, all this is still work in progress
FPPipeline
Example:
flopoco FPPipeline pipeline.txt 8 23
where pipeline.txt contains something like:
r = sqrt(sqr(x0-x1)+sqr(y0-y1)); output r;
⊕ Properly synchronizes the various subcomponents
⊕ Makes it easy to explore various precisions
⊖ Using this is against the philosophy of FloPoCo!
which is: let’s build a custom operator instead
Quick-and-dirty implementation:
not all operators supported, constants not always specialized;
no TestBench support (not a trivial thing to add)
Anybody in the audience doing serious HLS?
Computing just right
A bit of plain common sense
Common wisdom
The more accurately you compute, the more expensive it gets
In practice
We (hopefully) notice it when our computation is not accurate enough.
But do we notice it when it is too accurate for our needs?
Reconciling performance and accuracy?
Or rather: regaining performance by computing just right?
Double precision spoils us
The standard binary64 format (formerly known as double precision) provides roughly 16 decimal digits.
Why should anybody need such accuracy?
Count the digits in the following:
Definition of the second: the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium-133 atom.
Definition of the metre: the distance travelled by light in vacuum in 1/299,792,458 of a second.
Most accurate measurement ever (another atomic frequency): to 14 decimal places.
Most accurate measurement of the Planck constant to date: to 7 decimal places.
The gravitational constant G is known to 3 decimal places only.
Parenthesis : then why binary64 ?
This PC computes 10⁹ operations per second (1 gigaflop/s)
An allegory due to Kulisch
Print the numbers in 100 lines of 5 columns, double-sided: 1000 numbers per sheet
1000 sheets ≈ a heap of 10 cm
10⁹ flop/s ≈ the heap grows at 100 m/s, or 360 km/h
A teraflop/s (10¹² op/s) prints to the moon in one second
Current top-500 computers reach the petaflop/s (10¹⁵ op/s)
Each operation may involve a relative error of 10⁻¹⁶, and these errors accumulate.
Doesn’t this sound wrong?
We would use these 16 digits just to accumulate garbage in them?
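The 10⁻¹⁶ relative error is not an abstraction: binary64 carries a 53-bit significand, so any addend more than 2⁵³ times smaller than the running sum is silently dropped. A two-assertion Python demonstration:

```python
# binary64 has a 53-bit significand: relative rounding error is
# about 1.1e-16.  Once an accumulator reaches 2^53, adding 1.0
# changes nothing at all -- the addend is entirely absorbed:
acc = 2.0 ** 53
assert acc + 1.0 == acc       # absorbed: below one ulp
assert acc + 2.0 != acc       # 2.0 is exactly one ulp here
```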
Back to the point
Mastering accuracy for performance
When implementing a “computing core”:
A goal: never compute more accurately than needed
A tool: flexible arithmetic
Qualitatively: use FPConstDiv instead of a divider
Quantitatively: tailor your data formats to your problem
Sub-goals:
Know what accuracy you need
Know how accurate you compute
“Computing cores” considered so far: elementary functions, sums of products, linear algebra, Euclidean lattice algorithms, ...
Hopelessly difficult in the general case... but at least ask this question systematically.
F. de Dinechin A FloPoCo tutorial 36
My current crusade
The evil (at least on FPGAs)
A fast implementation of single-precision floating-point exponential
(but accurate to 2⁻⁸ only)

Do you see why it is wrong?

A line I shall have in each of my talks until the world is saved

Save routing! Save power! Don't move useless bits around!

Or maybe this one

Do you really need to compute this bit?
F. de Dinechin A FloPoCo tutorial 37
FloPoCo for developers of HLSsoftware
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 38
Using FloPoCoLib
From file src/main_minimal.cpp:

#include "FloPoCo.hpp"
using namespace std;
using namespace flopoco;

int main(int argc, char* argv[]) {
	Target* target = new Virtex4();

	int wE = 9;
	int wF = 33;
	Operator *op = new FPAdderSinglePath(target, wE, wF, wE, wF, wE, wF);

	ofstream file;
	file.open("FPAdder.vhdl", ios::out);
	op->outputVHDLToFile(file);
	file.close();

	cerr << endl << "Final report:" << endl;
	op->outputFinalReport(0);
}
F. de Dinechin A FloPoCo tutorial 39
FloPoCo class hierarchy
[Class diagram. Signal: width, cycle, lifeSpan. Operator: signalList, vhdl, outputVHDL(), emulate(), buildStandardTestCases(). Subclasses of Operator: FPAdder (wE, wF), IntAdder (size), Shifters, Collision (wE, wF), TestBench. Targets: Virtex4, StratixII.]
F. de Dinechin A FloPoCo tutorial 40
Custom arithmetic for HLS (an outsider point of view)
All the optimizations you do when compiling for a processor, plus
a+a should be replaced with 2*a, not the opposite!

fewer optimizations prevented by bit-for-bit or corner-case issues

a C compiler is not allowed to replace x-0.0 with x
or 2*x/2 with x
both issues can be solved by (very small) corner-case-fix logic

more opportunities of operator specialization:

multiplication and division by a constant
a squarer for a*a,
a sine on one period only...

endless opportunities of operator fusion

did I mention √(x² + y²)?
bit-for-bit compatibility lost, replaced with guaranteed accuracy
Eventually we will need new languages.
F. de Dinechin A FloPoCo tutorial 41
FloPoCo for developers of customarithmetic
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 42
A modestly object-oriented approach
FloPoCo is not a C++-based HDL
VHDL generation is “print-based”
1 vhdl << "SoS <= EA(wE -1 downto 0) & Fraction;" ;
easy to port existing work (FPLibrary)
easy learning curve for the VHDL-literate
at least the expressive power of VHDL!

Many helper functions help doing the prints
Example: VHDL signal declaration

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE -1 downto 0) & Fraction;" ;
F. de Dinechin A FloPoCo tutorial 43
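The declare() helper can be sketched in plain C++ (a simplified assumption for illustration only — the real FloPoCo method also records cycle and lifespan information used later for pipelining):

```cpp
#include <string>
#include <vector>

// Sketch (assumption, not FloPoCo's real code) of declare(): it records the
// signal in the operator's signal list, so that the VHDL "signal ..."
// declarations can be generated later, and returns the name so it can be
// printed inline in the vhdl stream.
struct SignalInfo { std::string name; int width; };

struct OperatorSketch {
    std::vector<SignalInfo> signalList;

    std::string declare(const std::string& name, int width) {
        signalList.push_back({name, width});
        return name;  // usable directly inside vhdl << ...
    }

    // Later pass: emit one VHDL declaration per recorded signal.
    std::string declarations() const {
        std::string s;
        for (const auto& sig : signalList)
            s += "signal " + sig.name + " : std_logic_vector("
               + std::to_string(sig.width - 1) + " downto 0);\n";
        return s;
    }
};
```

This is why declare() can sit in the middle of a stream print: it has the side effect of recording the signal, then evaluates to its name.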
Workflow of designing a new operator in FloPoCo
1. Create an operator skeleton
copy src/UserDefinedOperator.cpp (heavily documented)
add your file in CMakeLists.txt and src/FloPoCo.hpp
add command-line interface + doc in src/main.cpp
2. Define the specification : write the emulate() method
syntax is quite OOgly but imitation is your friend
a few lines to change from a type-matching operator
... then TestBench and TestBenchFile will work
3. Write the code generating combinatorial VHDL
Basically embed your VHDL in C++ prints
4. Add code for self-pipelining
Roughly the same effort as it would take to get one pipeline instance
... but you get a frequency-directed pipeline for all the FloPoCo targets
F. de Dinechin A FloPoCo tutorial 44
My personal record
Two weeks from the first intuition of the algorithm
to complete pipelined FloPoCo implementation + paper submission
(Division by small integer constants, ARC 2012)
Implementation time
10 minutes to obtain a testbench generator
1/2 day for the integer Euclidean division
20 min for its flexible pipeline
1/2 day for the FP divider by 3
and again 20 min
Time savers :
Test bench generator for correct combinatorial operation
Pipeline framework (for filling result tables in the paper)
(Time wasters : maintaining all this instead of writing new operators)
F. de Dinechin A FloPoCo tutorial 45
Pipeline made easy
[Figure: pipelined datapath of the floating-point √(x² + y²) operator: unpack; exponent sort; alignment shifters; squarers MA², MB², MC²; multi-operand add; normalize/pack. Signal widths are annotated: 1+wF for the input fractions, 2+wF+g inside, 4+wF+g for the sum, wE+wF+g for the result R. Cycle numbers 0 to 6 mark the register levels.]

F. de Dinechin A FloPoCo tutorial 46
Notion of current cycle during VHDL output
Each signal has a cycle attribute (at which cycle it is defined)
vhdl << declare("EA", wE) << " <= ... ;" ;   // cycle(EA)=0
nextCycle();
vhdl << declare("EB", wE) << " <= ... ;" ;   // cycle(EB)=1
setCycle(0);
vhdl << ... ;
FloPoCo looks for signal names on the right-hand side
and delays them by (currentCycle − defCycle) cycles:

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE -1 downto 0) & Fraction;" ;
Generated VHDL:

SoS <= EA_d6(wE -1 downto 0) & Fraction_d1;

(FloPoCo transparently declares and builds the needed shift registers.)
Managing the current cycle:

n = getCycle();
setCycle(n);
nextCycle();
syncCycleWithSignal("EA");

Frequency-directed pipelining:

manageCriticalPath( target->adderDelay(n) );
F. de Dinechin A FloPoCo tutorial 46
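The right-hand-side delaying can be sketched as a string rewrite over the vhdl stream (a deliberately simplified assumption: the real FloPoCo second pass is a proper parse of the stream, not a per-signal regex):

```cpp
#include <map>
#include <regex>
#include <string>

// Sketch of the second pass: every right-hand-side occurrence of a signal
// defined at an earlier cycle is rewritten name -> name_dN, where
// N = currentCycle - defCycle; the _dN signals are backed by shift registers.
std::string delayRHSSignals(const std::string& vhdl,
                            const std::map<std::string, int>& defCycle,
                            int currentCycle) {
    std::string out = vhdl;
    for (const auto& [name, cycle] : defCycle) {
        int n = currentCycle - cycle;
        if (n > 0)  // this signal must cross n register levels
            out = std::regex_replace(out, std::regex("\\b" + name + "\\b"),
                                     name + "_d" + std::to_string(n));
    }
    return out;
}
```

With EA defined at cycle 0, Fraction at cycle 5, and the assignment emitted at cycle 6, this sketch reproduces the kind of generated line shown in the slide.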
manageCriticalPath(delay)
A currentCriticalPath variable is maintained during VHDL generation
manageCriticalPath(delay)
adds delay to currentCriticalPath;
if the sum is larger than 1/targetFrequency,
currentCycle++
and currentCriticalPath is reset to delay
delay is an estimation of the delay of the following block of VHDL
you (the developer) have to perform this estimation
... typically computed using timing methods of the Target class
adderDelay(int n)
localRoutingDelay(int fanout)
This is how the pipeline adapts to the actual target.
F. de Dinechin A FloPoCo tutorial 47
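The essential bookkeeping fits in a few lines; the sketch below is a simplified assumption of the logic just described (the real method also handles sub-component delays and warnings):

```cpp
// Sketch (simplified assumption) of FloPoCo's frequency-directed pipelining:
// accumulate combinatorial delay, and insert a register level whenever the
// accumulated critical path would exceed the target clock period.
struct PipelineState {
    double period;                      // 1 / targetFrequency, e.g. in ns
    int    currentCycle = 0;
    double currentCriticalPath = 0.0;

    void manageCriticalPath(double delay) {
        if (currentCriticalPath + delay > period) {
            ++currentCycle;                // insert a pipeline register
            currentCriticalPath = delay;   // new cycle starts with this delay
        } else {
            currentCriticalPath += delay;
        }
    }
};
```

Lowering the target frequency (raising period) makes registers rarer; raising it makes them denser — the same generator code yields both pipelines.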
The magic explained
Pipeline adapts to random insertions of pipeline levels anywhere
during development, tinkering, fine-tuning
or during frequency-directed, target-optimized pipelining
Graceful degradation to a combinatorial operator
in which case the vhdl stream is just output untouched
Suits the “print VHDL” philosophy
FloPoCo code is much shorter than the VHDL it generates.
F. de Dinechin A FloPoCo tutorial 48
From the developer point of view
Clean separation of functionality and performance issues
You write to the vhdl stream only combinatorial VHDL
FloPoCo parses it to insert registers (in two passes)
Timing is managed out of the VHDL code
Subcomponents also delay their outputs WRT their inputs, etc
Correct-by-construction pipelining
You cannot obtain a non-functional operator, starting from a working combinatorial one
(at least not without very loud warnings)
This part of the framework can be trusted
nothing more complicated than integer subtractions and comparisons
and of course, test bench generators adapt to the pipeline
You can obtain a bad pipeline, though
unbalanced, over-pipelined, missing the target frequency, ...
F. de Dinechin A FloPoCo tutorial 49
[Figure: VHDL generation flow. The C++ constructor (recursively calling the constructors of all sub-components to know their pipeline depth) produces the functional information: the vhdl stream (combinatorial description in structural VHDL syntax), the signal list (bitwidth, cycle, lifeSpan for each signal), the subcomponent list, and the pipeline information. Operator.outputVHDL() then uses this information to emit the output file: entity and port declarations, component and signal declarations, the clocked register process, and the architecture code, in a second pass on the vhdl stream that delays right-hand-side signals.]
F. de Dinechin A FloPoCo tutorial 50
We could have done it better
History of development shows a bit
Do we really need an interface with explicit cycle management ?
In FloPoCo, time=(cycle, critical path within the cycle)
but the user interface should consist only of delay information
A delay should be associated systematically to signal declaration
Too much code to change it now...
F. de Dinechin A FloPoCo tutorial 51
Matching the architecture to the target
Perfectly valid code:

if (target == "StratixII") {
	// output some VHDL
}
else if (target == "Virtex4") {
	// output other VHDL
}
else ...
Try to avoid it :
Model architecture details
LUTs
DSP blocks
memory blocks
Model delays
Definitely an endless effort.
F. de Dinechin A FloPoCo tutorial 52
One slide on testing
emulate() performs bit-accurate emulation
Not by architecture simulation!
By expressing the mathematical specification...
in typically a few lines of MPFR (www.mpfr.org)
buildStandardTestCases() for corner cases and regression tests
buildRandomTestCases() to bias the random generator in an operation-specific way

default is uniform on the bits.
Example 1: the useful range of the exponential is very small; bias the random generator towards it
Example 2: cancellation cases in floating-point additions
may (should) be overloaded by each Operator
The special TestBench and TestBenchFile operators invoke these methods
F. de Dinechin A FloPoCo tutorial 53
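The "several acceptable values" mechanism can be sketched outside FloPoCo: for a faithfully rounded operator, emulate() computes the exact result and registers the two surrounding representable values (only one value when the exact result is itself representable). A simplified sketch, using double in place of MPFR:

```cpp
#include <cmath>
#include <set>

// Sketch (assumption: double instead of MPFR, for illustration) of how an
// emulate() method specifies faithful rounding: from the exact mathematical
// result it registers the surrounding representable values; a test case
// passes if the hardware output is any of them.
std::set<long long> faithfulOutputs(double exact, int lsbWeight) {
    // Representable values are the multiples of 2^lsbWeight.
    double scaled = std::ldexp(exact, -lsbWeight);  // exact / 2^lsbWeight
    std::set<long long> allowed;
    allowed.insert((long long)std::floor(scaled));  // round down
    allowed.insert((long long)std::ceil(scaled));   // round up
    return allowed;  // a singleton when exact is representable
}
```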
A tutorial : building an integerdivider by 3
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 54
Anybody here remembers how we compute divisions ?
Worked example: 776 ÷ 3 = 258, remainder 2

7 ÷ 3 = 2, remainder 1
17 ÷ 3 = 5, remainder 2
26 ÷ 3 = 8, remainder 2
iteration body : Euclidean division of a 2-digit decimal number by 3
The first digit is a remainder from the previous iteration:
its value is 0, 1 or 2

Possible implementation as a look-up table that, for each value from 00 to 29, gives the quotient and the remainder of its division by 3.
F. de Dinechin A FloPoCo tutorial 55
Mathematical obfuscation, in binary-friendly radix
Representation of x in radix β = 2^α
(splitting the binary decomposition of x into k chunks of α bits)

procedure ConstantDiv(x, d)
    r_k ← 0
    for i = k − 1 down to 0 do
        y_i ← x_i + 2^α · r_{i+1}   (this + is a concatenation)
        (q_i, r_i) ← (⌊y_i / d⌋, y_i mod d)   (read from a table)
    end for
    return q = ∑_{i=0}^{k−1} q_i · 2^{αi}, and r_0
end procedure

Each iteration

consumes α bits of x, and a remainder of size γ = ⌈log₂ d⌉
produces α bits of q, and a remainder of size γ

Implemented as a table with α + γ bits in, α + γ bits out
F. de Dinechin A FloPoCo tutorial 56
Sequential implementation
[Figure: sequential implementation. One clocked LUT (clk, reset) consumes an α-bit chunk xᵢ per cycle, outputs the α-bit quotient chunk qᵢ, and feeds the γ-bit remainder rᵢ back to its input rᵢ₊₁.]
F. de Dinechin A FloPoCo tutorial 57
Unrolled implementation
[Figure: unrolled implementation. A chain of four LUTs; LUT i consumes the 4-bit chunk xᵢ and the 2-bit remainder rᵢ₊₁ (with r₄ = 0), and produces the 4-bit quotient chunk qᵢ and the remainder rᵢ; r₀ = r is the final remainder.]
F. de Dinechin A FloPoCo tutorial 58
Setting the parameters for LUTs
For instance, assuming 6-input LUTs (e.g. LUT6)

A table with 6 bits in and 6 bits out consumes 6 LUT6
Size of the remainder is γ = ⌈log₂ d⌉
If d < 2⁵, very efficient architecture: α = 6 − γ
The smaller d, the better
Easy to pipeline (one register behind each LUT for free)
F. de Dinechin A FloPoCo tutorial 59
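The sizing rules above translate into a back-of-the-envelope cost estimate; the helper below is a sketch under the stated assumptions (one LUT6 per output bit of each (α+γ)-in, (α+γ)-out table, no packing tricks), not a synthesis result:

```cpp
// Back-of-the-envelope LUT6 cost of the unrolled divider by a small constant
// d on a w-bit input, following the sizing rules above.
struct DivCost { int gamma, alpha, tables, lut6; };

DivCost constDivCost(int d, int w) {
    int gamma = 0;
    while ((1 << gamma) < d) ++gamma;       // gamma = ceil(log2 d)
    int alpha = 6 - gamma;                  // table inputs fit one LUT6 level
    int tables = (w + alpha - 1) / alpha;   // k = ceil(w / alpha) iterations
    return { gamma, alpha, tables, tables * (alpha + gamma) };
}
```

For d = 3 and a 32-bit input this gives γ = 2, α = 4, 8 tables: the whole divider costs a few dozen LUT6, far cheaper than a generic divider.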
Step 0 : get a working FloPoCo distribution
Google FloPoCo, then
first use the 1-line install to get all dependencies
optionally do an svn checkout to live on the edge
Tutorial files in a tgz on the tutorial webpage.
F. de Dinechin A FloPoCo tutorial 60
Step 1 : build a skeleton
We want to implement flopoco IntConstDiv w d

cd src
cp UserDefinedOperator.cpp IntConstDiv.cpp
cp UserDefinedOperator.hpp IntConstDiv.hpp
Inside the newly created files,
global replace UserDefinedOperator with IntConstDiv
empty the cpp of all flesh except initial declarations
Edit CMakeLists.txt (the file that defines the building process),
add src/IntConstDiv somewhere
Edit src/FloPoCo.hpp and do similar imitation
Edit src/main.cpp and do similar imitation
Compile and fix
Check that ./flopoco IntConstDiv 32 3 does nothing
F. de Dinechin A FloPoCo tutorial 61
Step 2 : overload emulate()
Your d should be stored in the IntConstDiv somewhere
What emulate() does
a TestCase is a pair (input, allowed outputs)
emulate() inputs a TestCase with only the input defined
and defines the allowed outputs – here two
Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert floating-point from / to arbitrary-precision floating-point (MPFR)
no need here
You may define several acceptable values for an input
typically for faithful rounding (two values)
After this step, wonder at:
./flopoco IntConstDiv 10 3 TestBench 1000
F. de Dinechin A FloPoCo tutorial 62
Step 3 : Write the combinatorial VHDL
Use the abstract Table operator
All you need is to overload the function() method
Other useful generic operators :
Shifters and leading zero counters for floating-point
Pipelined adders, multipliers, ...
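For division by a small constant, the function to tabulate is one step of an MSB-first digit recurrence. A sketch of the idea with plain integers and illustrative names (FloPoCo's actual Table subclass would tabulate divTableFunction in its overloaded function() method):

```cpp
#include <cassert>
#include <cstdint>

// One LUT step of MSB-first division by a small constant d:
//   input  = r * 2^k + y   (r: previous remainder, 0 <= r < d; y: next k bits)
//   output = (q, r') with q = input / d, r' = input mod d.
// Since input < d * 2^k, q always fits in k bits.
struct DivStep { uint64_t q, r; };

DivStep divTableFunction(uint64_t r, uint64_t y, int k, uint64_t d) {
    uint64_t z = (r << k) | y;
    return { z / d, z % d };
}

// Iterating the table over the k-bit chunks of a w-bit x (assuming
// k divides w) reproduces long division: q = x / d.
uint64_t divideByTable(uint64_t x, int w, int k, uint64_t d) {
    uint64_t q = 0, r = 0;
    for (int pos = w - k; pos >= 0; pos -= k) {     // MSB-first chunks
        uint64_t y = (x >> pos) & ((1ULL << k) - 1);
        DivStep s = divTableFunction(r, y, k, d);
        q = (q << k) | s.q;
        r = s.r;
    }
    return q;
}
```

In hardware, each step becomes one small LUT-based table and the steps chain through the remainder bits only.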
Step 3.5 : overload BuildStandardTestCases
First debug until VHDL works for all xi = 0
... then for all xi = 1
Step 4 : pipeline
using nextCycle() and sync*() only in this case
A tutorial : building a faithful FIR
Specification
Function
r = ∑_{i=1}^{n} c_i x_i
The c_i are constants provided with arbitrary accuracy on the command line
the x_i are signed fixed-point inputs in (1, p) format
Accuracy : faithful rounding of the exact result
x_i and c_i considered exact
return one of the two numbers surrounding the exact result :
next-best after correct rounding (which is too expensive)
last-bit accurate (all returned bits are useful)
more accurate than naive assembly of multipliers and adders
Algorithm and error budget
sum the products truncated to precision 2^(−p−g), i.e. g guard bits,
add one bit at weight 2^(−p−1),
and truncate to precision 2^(−p), i.e. to the result format
... where g = 1 + log2(n − 1)
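The budget works out because each of the n−1 truncations loses less than 2^(−p−g), so faithful rounding needs (n−1)·2^(−g) ≤ 1/2. A quick sketch of the formula (assuming the ceiling is implied on the slide when n−1 is not a power of two):

```cpp
#include <cassert>
#include <cmath>

// Guard bits for an n-tap faithful FIR, as on the slide
// (with a ceiling, since g must be an integer):
// (n-1) truncation errors of at most 2^(-p-g) each must fit
// within half an ulp, i.e. (n-1) * 2^(-g) <= 1/2.
int guardBits(int n) {
    return 1 + (int)std::ceil(std::log2((double)(n - 1)));
}
```

Equivalently: n − 1 ≤ 2^(g−1), which this g satisfies by construction.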
Design space exploration
The sum can be implemented as
a rake of adders (simplest, let’s start with that)
a tree of adders (slightly more complex, good we have C++)
a single BitHeap object (on the edge)
Step 0 : get a working FloPoCo distribution
Google FloPoCo, then
first use the 1-line install to get all dependencies
optionally do an svn checkout to live on the edge
Step 1 : build a skeleton
We want to implement flopoco FixedPointFIR n w a1 ... an
cd src
cp UserDefinedOperator.cpp FixedPointFIR.cpp
cp UserDefinedOperator.hpp FixedPointFIR.hpp
Inside the newly created files,
global replace UserDefinedOperator with FixedPointFIR
empty the cpp of all flesh except initial declarations
Edit CMakeLists.txt (the file that defines the building process),
add src/FixedPointFIR somewhere
Edit src/FloPoCo.hpp and do similar imitation
Edit src/main.cpp and do similar imitation
NewCompressorTree has an arbitrary-size parameter list
Compile and fix
Check that ./flopoco FixedPointFIR 4 8 1 1 1 1 does nothing
Step 2 : overload emulate()
Your a_i should be stored in the FixedPointFIR somewhere
What emulate() does
a TestCase is a pair (input, allowed outputs)
emulate() inputs a TestCase with only the input defined
and defines the allowed outputs – here two
Inputs and outputs are semantically bit vectors, held in a GMP integer.
OOgly methods to convert floating-point from / to arbitrary-precision floating-point (MPFR)
no need here
After this step, wonder at : ./flopoco FixedPointFIR 4 8 1 1 1 1 TestBench 1000
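A sketch of the faithful-FIR reference computation with plain 64-bit integers and non-negative data (FloPoCo's emulate() does the same with mpz_class and proper sign handling): the exact sum of products is formed at full precision, then both surrounding values at the output precision are registered as allowed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Faithful-rounding reference: exact accumulation, then the two
// neighbours at the output precision (extraBits = bits dropped at
// the end).  Sketch for non-negative values only.
std::pair<int64_t, int64_t> faithfulPair(const std::vector<int64_t>& c,
                                         const std::vector<int64_t>& x,
                                         int extraBits) {
    int64_t acc = 0;                               // exact sum of c_i * x_i
    for (std::size_t i = 0; i < c.size(); i++) acc += c[i] * x[i];
    int64_t lo = acc >> extraBits;                 // truncate = round down
    bool exact = (acc & ((1LL << extraBits) - 1)) == 0;
    return { lo, exact ? lo : lo + 1 };            // both accepted by the testbench
}
```

When the exact sum falls on a representable value, the two allowed outputs collapse to one.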
Step 3 : Write the combinatorial VHDL
use one of the existing constant multipliers in FloPoCo
Half the room should use FixRealKCM
The other half should use IntConstMult
First write the additions as “+” in standard VHDL
not very efficient, but tutorially good.
(several better alternatives to discuss later)
All datapath on 1 + w + g bits
here also, a small optimization is possible
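The KCM idea behind FixRealKCM, sketched with an integer constant so the tables are exact (with a real constant, each table entry is rounded to the target precision and the rounding errors join the guard-bit budget); names are illustrative, not FloPoCo's interface:

```cpp
#include <cassert>
#include <cstdint>

// KCM principle: split the w-bit input into k-bit chunks; each chunk
// addresses a small table holding chunk * C at the right weight; the
// product is recovered by summing the table outputs.
uint64_t kcmMultiply(uint64_t x, int w, int k, uint64_t C) {
    uint64_t sum = 0;
    for (int j = 0; j * k < w; j++) {
        uint64_t chunk = (x >> (j * k)) & ((1ULL << k) - 1);
        // table T_j[chunk] = C * chunk, added at weight 2^(j*k)
        sum += (C * chunk) << (j * k);
    }
    return sum;   // equals C * x when the tables are exact
}
```

With k matched to the FPGA LUT input count, each table costs one LUT per output bit, which is why KCM suits LUT-based targets.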
Step 3.5 : overload BuildStandardTestCases
First debug until VHDL works for all xi = 0
... then for all xi = 1
Step 4 : pipeline
using nextCycle() and sync*() only in this case
A tutorial : building a √(x² + y²) operator
Specification
We are going to build a floating-point operator for √(x² + y²)
Black box : two inputs, one output, just like an adder or a multiplier.
Mathematical specification : faithful rounding of the exact result
x and y considered exact
return one of the FP numbers surrounding the exact result
less accurate than correct rounding, but
last-bit accurate (all returned bits are useful), and
more accurate than a combination of two multipliers, one adder and a square root (4 rounding errors)
Algorithm : sum of squares, followed by a square root of the mantissa
with some guard bits for faithful rounding
expecting savings mostly in the sum of squares
Disclaimer (for hard-core arithmeticians)
It should be possible to fuse the whole computation in a single digit recurrence
more efficient for ASIC and DSP-less FPGAs
see works by Ercegovac, Muller, Lang, Bruguera, Takagi ...
... but beyond the scope of this work
and less interesting from a tutorial point of view : no component reuse
Step 0 : get a working FloPoCo distribution
Google FloPoCo, then
first use the 1-line install to get all dependencies
optionally do an svn checkout to live on the edge
Step 1 : build a skeleton
We want to implement flopoco FP2DNorm wE wF
cd src
cp UserDefinedOperator.cpp FP2DNorm.cpp
cp UserDefinedOperator.hpp FP2DNorm.hpp
Inside the newly created files,
global replace UserDefinedOperator with FP2DNorm
empty the cpp of all flesh except initial declarations
Edit CMakeLists.txt (the file that defines the building process),
look for src/FPSqrt (because the interface is similar) and add src/FP2DNorm close to it
Edit src/FloPoCo.hpp and do similar imitation
Edit src/main.cpp and do similar imitation
Compile and fix
Check that ./flopoco FP2DNorm 8 23 does nothing
Step 2 : overload emulate()
Here, get inspiration from src/FPSumOfSquares or src/FPAdderSinglePath.
a TestCase is a pair (input, allowed outputs)
emulate() inputs a TestCase with only the input defined
and defines the allowed outputs
Inputs and outputs are semantically bit vectors, held in a GMP integer.
OOgly methods to convert from / to arbitrary-precision floating-point (MPFR)
After this step, test : ./flopoco FP2DNorm 5 10 TestBench 1000
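A sketch of what emulate() must compute, with long double standing in for MPFR's arbitrary precision and float as the target format: evaluate √(x²+y²) well beyond the target precision, then accept the two consecutive floats that surround the exact result.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Faithful-rounding reference for a float sqrt(x^2 + y^2):
// evaluate in extended precision, then return the pair of
// consecutive target-format values surrounding the exact result.
// (Sketch: FloPoCo's emulate() does this with MPFR, not long double.)
std::pair<float, float> norm2Faithful(float x, float y) {
    long double exact = sqrtl((long double)x * x + (long double)y * y);
    float lo = (float)exact;               // nearest float
    // nudge to get both neighbours when the cast was not exact
    if ((long double)lo > exact) return { std::nextafterf(lo, 0.0f), lo };
    if ((long double)lo < exact) return { lo, std::nextafterf(lo, INFINITY) };
    return { lo, lo };                     // exactly representable
}
```

Pythagorean inputs like (3, 4) give one allowed value; almost every other input gives two.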
Step 3 : Write the combinatorial VHDL
use an exponent difference, a shifter, two squarers, and a square root
exponent difference is a subtraction, but a small one
followed by a swap
Shift before or after the square ?
before enables a single bit-heap implementation... but requires a slightly larger squarer
after requires a slightly larger shifter
Hack the floating-point square root
no fixed-point version available
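The shift-before versus shift-after trade-off rests on an alignment identity: shifting the smaller mantissa by d before squaring matches shifting its square by 2d afterwards, up to truncation. A plain-integer sketch (illustrative names) showing that the "before" path, which truncates earlier, can only come out smaller:

```cpp
#include <cassert>
#include <cstdint>

// Two ways to align the smaller operand of x^2 + y^2 by d = ex - ey:
// shift the mantissa before the squarer (smaller squarer, bit-heap
// friendly) or shift the square afterwards (larger shifter, more
// bits kept).  Both agree when the d dropped bits are zero.
uint64_t shiftThenSquare(uint64_t my, int d) { return (my >> d) * (my >> d); }
uint64_t squareThenShift(uint64_t my, int d) { return (my * my) >> (2 * d); }
```

The difference between the two paths is what the guard bits of the faithful-rounding budget must absorb.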
Conclusion
The “no killer app” theorem
For 20 years, the FPGA community has been waiting for the “killer application”.
(The widely useful application on which the FPGA is so much better.)
Theorem : we’ll wait forever.
Proof : When such an application pops up,
either it is indeed widely useful, and next year’s Pentium will do it in hardware 10x faster than the FPGA, so it won’t be an FPGA killer app next year,
or the FPGA remains competitive next year, but it means that it was not a killer app.
The killer feature of FPGAs is flexibility
To exploit it, we do need infinitely many arithmetic operators.
Computing just right
In a Pentium
the choice is between
an integer SUV, or
a floating-point SUV.
In an FPGA
If all I need is a bicycle, I have the possibility to build a bicycle
(and I’m usually faster to destination)
Save routing ! Save power ! Don’t move useless bits around !
An almost virgin land
Most of the arithmetic literature addresses the construction of SUVs.
My personal record
Division by 3 (ARC 2012) : two weeks from the first intuition of the algorithm to the complete pipelined FloPoCo implementation + paper submission.
Implementation time
10 minutes to obtain a testbench generator
2 days for the fully parametric combinatorial operator
(less than VHDL)
20 min for its fully flexible pipeline
So when do we have an FPGA in every PC ?
When they become as easy to program as processors ?
(now that’s a challenge)
(or do we quietly wait for processors to become as messy to program asFPGAs ?)
Thanks for your attention
The following people have contributed to FloPoCo :
S. Banescu, N. Brunie, S. Collange, J. Detrey, P. Echeverría, F. Ferrandi, M. Grad, K. Illyes, M. Istoan, M. Joldes, C. Klein, D. Mastrandrea, B. Pasca, B. Popa, X. Pujol, D. Thomas, R. Tudoran, A. Vasquez.
http://flopoco.gforge.inria.fr/