Building Custom Arithmetic Operators with the FloPoCo Generator
Florent de Dinechin
√(x² + y² + z²)    πx    sin(x + y)    ∑_{i=0}^{n} x_i    √x · log x
Outline
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 2
Introduction: custom arithmetic
Two different ways of wasting silicon
Here are two universally programmable chips.
Who’s best for (insert your computation here)?
Are FPGAs any good at floating-point ?
Long ago (1995), people ported the basic operations: +, −, ×.
Versus the highly optimized FPU in the processor, each operator was 10× slower in an FPGA.
This is the unavoidable overhead of programmability.
If you lose according to a metric, change the metric.
Peak figures for the double-precision floating-point exponential:
Pentium core: 20 cycles / DPExp @ 4 GHz = 200 MDPExp/s
FPExp in FPGA: 1 DPExp/cycle @ 400 MHz = 400 MDPExp/s
Chip vs. chip: 6 Pentium cores vs. 150 FPExp per FPGA
Power consumption is also better; single-precision figures are better still.
(Intel MKL vector libm vs. FPExp in FloPoCo version 2.0.0)
Dura Amdahl lex, sed lex
SPICE Model-Evaluation, cut from Kapre and DeHon (FPL 2009)
Custom arithmetic (not your Pentium’s)
Never compute 1 bit more accurately than needed!
[Figure: architecture of a custom floating-point exponential: shift to fixed-point, constant multipliers by 1/log(2) and log(2), a truncated multiplier, a precomputed ROM and a generic polynomial evaluator computing e^Z − Z − 1, then normalize/round. Datapath widths are annotated in terms of wE, wF, the number of guard bits g and a split parameter k.]
Hence: we need a generator.
Useful operators that make sense in a processor
Should a processor include elementary functions? Yes (Paul & Wilson, 1976); No since the transition to RISC.
Should a processor include a divider and square root? Yes (Oberman et al., Arith 1997); No since the transition to FMA (IBM, then HP, then Intel).
Should a processor include decimal hardware? Yes, says IBM; No, says Intel.
Should a processor include a multiplier by log(2)? No, of course.
Useful operators that make sense in an FPGA or ASIC
Elementary functions? Yes, iff your application needs it.
Divider or square root? Yes, iff your application needs it.
Decimal hardware? Yes, iff your application needs it.
A multiplier by log(2)? Yes, iff your application needs it.
In FPGAs, useful means: useful to one application.
Enough work to keep me busy to retirement
Arithmetic operators useful to at least one application :
Elementary functions (sine, exponential, logarithm...)
Algebraic functions (x/√(x² + y²), polynomials, ...)
Compound functions (log₂(1 ± 2ˣ), e^(−Kt²), ...)
Floating-point sums, dot products, sums of squares
Specialized operators : constant multipliers, squarers, ...
Complex arithmetic
LNS arithmetic
Decimal arithmetic
Interval arithmetic
...
Oh yes, basic operations, too.
What do we call arithmetic operators ?
An arithmetic operation is a function (in the mathematical sense):
few well-typed inputs and outputs
no memory or side effects (usually)
An operator is the implementation of such a function
IEEE-754 FP standard: operator(x) = rounding(operation(x))
→ Clean mathematical definition (even for floating-point arithmetic)
The operator as a circuit...
... is a directed acyclic graph (DAG):
easy to build and pipeline
easy to test against its mathematical specification
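The specification operator(x) = rounding(operation(x)) is easy to play with numerically. Here is a minimal Python sketch; the helper names and the choice of a 23-bit fixed-point rounding are illustrative assumptions, not FloPoCo code:

```python
import math

def round_to_bits(x: float, p: int) -> float:
    """Round x to p fractional bits; Python's round() breaks
    ties to even, like the IEEE-754 default rounding."""
    scale = 2.0 ** p
    return round(x * scale) / scale

def sqrt_operator(x: float, p: int = 23) -> float:
    """Operator = rounding(operation): the mathematical square
    root followed by a single final rounding to the target
    format, so the result is within half an ulp of exact."""
    return round_to_bits(math.sqrt(x), p)

# Exact results pass through unchanged...
assert sqrt_operator(4.0) == 2.0
# ...and inexact ones are off by at most 2^-(p+1):
assert abs(sqrt_operator(2.0) - math.sqrt(2.0)) <= 2.0 ** -24
```

This single-final-rounding contract is what makes an operator easy to test against its mathematical specification.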
The benefits of custom computing
Example: a floating-point sum of squares
x² + y² + z²
(not a toy example but a useful building block)
A square is simpler than a multiplication: half the hardware is required
x², y², and z² are positive: half of your FP adder is useless
Accuracy can be improved: 5 rounding errors in the floating-point version; (x² + y²) + z² is asymmetrical
The FloPoCo Recipe
Floating-point interface for convenience
Clear accuracy specification for computing just right
Fixed-point internal architecture for efficiency
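The three recipe ingredients can be sketched as an executable model. Below is a Python sketch of the fixed-point core; the parameters wf and g and the alignment scheme are illustrative assumptions, not FloPoCo's exact datapath:

```python
import math

def sum_of_squares(x, y, z, wf=23, g=3):
    """Fixed-point model of a custom x^2 + y^2 + z^2 operator:
    the three squares are aligned to the largest exponent and
    accumulated as integers on wf + g bits (g guard bits), with
    a single rounding at the end.  wf, g and the alignment
    scheme are illustrative, not FloPoCo's actual datapath."""
    squares = [v * v for v in (x, y, z)]
    if not any(squares):
        return 0.0
    e_max = max(math.frexp(s)[1] for s in squares if s)
    scale = 1 << (wf + g)
    # One rounding per aligned square, then one exact integer sum:
    acc = sum(round(s * scale / 2.0 ** e_max) for s in squares)
    return acc * 2.0 ** e_max / scale

# Integer accumulation makes the result symmetrical in x, y, z,
# unlike the floating-point expression (x*x + y*y) + z*z:
assert sum_of_squares(1.0, 1e-4, 3.0) == sum_of_squares(3.0, 1.0, 1e-4)
```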
A floating-point adder
[Figure: classical floating-point adder: exponent difference and operand swap, a close path (1-bit shift, leading-zero count/shift) and a far path (alignment shift, sticky bit), a 2-bit prenormalization shift, then rounding, normalization and exception handling. Most datapaths are p or p + 1 bits wide.]
A fixed-point architecture
[Figure: the corresponding fixed-point architecture for x² + y² + z²: unpack, sort by exponent, two shifters aligning the smaller squares, three squarers, a single fixed-point add, then normalize/pack. Internal widths are of the form 2 + wF + g, where g is the number of guard bits.]
The benefits of custom computing
A few results for floating-point sum-of-squares on Virtex-4:

Single precision     area                   performance
LogiCore classic     1282 slices, 20 DSP    43 cycles @ 353 MHz
FloPoCo classic      1188 slices, 12 DSP    29 cycles @ 289 MHz
FloPoCo custom        453 slices,  9 DSP    11 cycles @ 368 MHz

Double precision     area                   performance
FloPoCo classic      4480 slices, 27 DSP    46 cycles @ 276 MHz
FloPoCo custom       1845 slices, 18 DSP    16 cycles @ 362 MHz
All performance metrics improved; FLOP/s per unit area more than doubled.
Plus: the custom operator is more accurate, and symmetrical.
Custom also means : custom pipeline
[Figure: the same fixed-point sum-of-squares architecture as before; the pipeline registers are placed differently depending on the target frequency.]
One operator does not fit all
Low frequency, low resource consumption
Faster but larger (more registers)
Combinatorial
History of the FloPoCo project
Initial goal: FPGA arithmetic the way it should be
that is: open-ended, unlike processor arithmetic
an open-ended list of custom operators
open-ended data formats: all operators fully parameterized
open-ended performance trade-offs: a flexible pipeline
General philosophy: computing just right
A generator framework (written in C++, outputting VHDL)
Objectives:
portable, target-agnostic (more on this later)
the VHDL should be simulatable and synthesizable using free tools
Design philosophy:
VHDL generation by printing arbitrary VHDL from C++
automate repetitive and error-prone tasks (such as pipelining)
Beyond the plan
Custom arithmetic for ASIC and HLS
the FloPoCo framework was successfully used to design the FPU of the Kalray processor
FloPoCo provides the floating-point back-end to the PandA project (Politecnico di Milano)
Still not “custom” enough for my taste
FloPoCo for application developers
Here should come a demo
FloPoCo is freely available from
http://flopoco.gforge.inria.fr/
Command-line syntax: a sequence of operator specifications
Options: target frequency, target hardware, ...
Output: synthesizable VHDL.
First something classical
A single-precision floating-point adder (8-bit exponent and 23-bit mantissa):
flopoco -pipeline=no FPAdder 8 23
Final report:
|---Entity FPAdder_8_23_uid2_RightShifter
|---Entity IntAdder_27_f400_uid7
|---Entity LZCShifter_28_to_28_counting_32_uid14
|---Entity IntAdder_34_f400_uid17
Entity FPAdder_8_23_uid2
Output file: flopoco.vhdl
To probe further:
flopoco -pipeline=no FPAdder 11 52    (double precision)
flopoco -pipeline=no FPAdder 9 34     (just right for you)
Classical floating-point, continued
A complete single-precision FPU in a single VHDL file:
flopoco -pipeline=no FPAdder 8 23 FPMultiplier 8 23 23 FPDiv 8 23 FPSqrt 8 23
Final report:
|---Entity FPAdder_8_23_uid2_RightShifter
|---Entity IntAdder_27_f400_uid7
|---Entity LZCShifter_28_to_28_counting_32_uid14
|---Entity IntAdder_34_f400_uid17
Entity FPAdder_8_23_uid2
Entity Compressor_2_2
Entity Compressor_3_2
| |---Entity IntAdder_49_f400_uid39
|---Entity IntMultiplier_UsingDSP_24_24_48_unsigned_uid26
|---Entity IntAdder_33_f400_uid47
Entity FPMultiplier_8_23_8_23_8_23_uid24
Entity FPDiv_8_23
Entity FPSqrt_8_23
Output file: flopoco.vhdl
Frequency-directed pipelining
The same FPAdder, pipelined for 300 MHz:
flopoco -pipeline=yes -frequency=300 FPAdder 8 23
FloPoCo interface to pipeline construction
“Please pipeline this operator to work at 200MHz”
Not the choice made by other core generators...
... but better because compositional
When you assemble components working at frequency f, you obtain a component working at frequency f.
Examples of pipeline
flopoco -pipeline=yes -frequency=400 FPAdder 8 23
Final report:
|---Entity FPAdder_8_23_uid2_RightShifter
| Pipeline depth = 1
|---Entity IntAdder_27_f400_uid7
| Pipeline depth = 1
|---Entity LZCShifter_28_to_28_counting_32_uid14
| Pipeline depth = 4
|---Entity IntAdder_34_f400_uid17
| Pipeline depth = 1
Entity FPAdder_8_23_uid2
Pipeline depth = 10
flopoco -pipeline=yes -frequency=200 FPAdder 8 23
Final report:
(...)
Pipeline depth = 4
Of course the frequency depends on the target FPGA
flopoco -pipeline=yes -frequency=200 -target=Spartan3 FPAdder 8 23
Final report:
(...)
Pipeline depth = 14
flopoco -pipeline=yes -frequency=200 -target=Virtex6 FPAdder 8 23
Final report:
(...)
Pipeline depth = 3
Altera and Xilinx targets are currently supported (at various levels of accuracy): Spartan3, Virtex4, Virtex5, Virtex6, StratixII, StratixIII, StratixIV, StratixV, CycloneII, CycloneIII, CycloneIV, CycloneV.
Also match the architecture to the target FPGA
Compare:
flopoco -pipeline=no -target=Virtex4 IntConstDiv 16 3 -1
flopoco -pipeline=no -target=Virtex6 IntConstDiv 16 3 -1
[Figure: division by 3 as a chain of 4 LUTs: each LUT consumes one 4-bit digit of x (x3 ... x0, MSB first) together with the 2-bit incoming remainder, and produces one 4-bit quotient digit (q3 ... q0) and the next remainder; r4 = 0 comes in, and the last remainder r0 = r comes out.]
Architecture specificities
LUTs
DSP blocks
memory blocks
And FloPoCo attempts to model the delays of these blocks for efficient pipelining.
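The LUT chain of the divider-by-3 figure is a plain digit-recurrence division, MSB first. A Python model of it, using radix-16 digits to match the 4-bit chunks in the figure (function and parameter names are mine):

```python
def const_div3_lut(x: int, n_digits: int = 4, k: int = 4):
    """Division by the constant 3 as a chain of small LUTs,
    one per radix-2^k digit of x, MSB first.  Each stage maps
    (incoming remainder, digit) -> (quotient digit, next
    remainder); its input is only k + 2 bits wide, which is
    why each stage fits in a row of FPGA LUTs."""
    d = 3
    radix = 1 << k
    # The per-stage table: (r, xi) -> (q_digit, r')
    lut = {(r, xi): divmod(r * radix + xi, d)
           for r in range(d) for xi in range(radix)}
    digits = [(x >> (k * i)) & (radix - 1)
              for i in reversed(range(n_digits))]  # MSB-first
    q, r = 0, 0
    for xi in digits:
        q_digit, r = lut[(r, xi)]
        q = (q << k) | q_digit
    return q, r          # invariant: x == 3 * q + r

assert const_div3_lut(12345) == (4115, 0)
```

The same recipe works for any small constant divisor d: only the table contents change, and the remainder stays below d.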
Frequency-directed pipelining in practice
We do our best, but we know it’s hopeless
The actual frequency obtained will depend on the whole application (placement, routing pressure, etc.)...
best-effort philosophy,
aiming to be within 10% for an operator synthesized alone
asking for a higher frequency yields a deeper pipeline
And a big TODO: VLSI targets.
Non-standard operators
Correctly rounded divider by 3: flopoco FPConstDiv 8 23 3
Floating-point exponential: flopoco FPExp 8 23
Multiplication of a 32-bit signed integer by the constant 1234567 (two algorithms, your mileage may vary):
flopoco IntIntKCM 32 1234567 0
flopoco IntConstMult 32 1234567
Full list in the documentation, or by typing just flopoco. Sorry for the sometimes incomplete or inconsistent interface.
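The shift-and-add idea behind constant multipliers can be sketched with canonical signed-digit (CSD) recoding. This is a generic illustration of the principle, not FloPoCo's actual (more sophisticated) IntConstMult algorithm:

```python
def csd(c: int):
    """Canonical signed-digit (non-adjacent form) recoding of a
    positive constant: a list of (weight, sign) pairs, sign in
    {+1, -1}, with c == sum(sign * 2**weight).  No two nonzero
    digits are adjacent, so an n-bit constant needs at most
    about n/2 additions/subtractions."""
    digits, w = [], 0
    while c:
        if c & 1:
            s = 1 if (c & 3) == 1 else -1  # keep the next digit zero
            digits.append((w, s))
            c -= s
        c >>= 1
        w += 1
    return digits

def const_mult(x: int, c: int) -> int:
    """Multiply by the constant c with shifts and adds/subs only,
    as a hardware shift-and-add constant multiplier would."""
    return sum(s * (x << w) for w, s in csd(c))

assert const_mult(42, 1234567) == 42 * 1234567
# CSD never uses more terms than the plain binary expansion:
assert len(csd(1234567)) <= bin(1234567).count("1")
```

Each (weight, sign) pair costs one shifted add or subtract, which is why sparse recodings translate directly into smaller circuits.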
Don’t trust us
Two operators, TestBench and TestBenchFile, generate test benches for the operator preceding them on the command line:
flopoco FPExp 8 23 TestBenchFile 10000 generates 10,000 random tests
flopoco IntConstDiv 16 3 -1 TestBenchFile -2 generates an exhaustive test
Specification-based test bench generation
Not by simulation of the generated architecture!
Helper functions for encoding/decoding the FP format, if you want to check the testbench...
fp2bin 9 36 3.1415926
bin2fp 9 36
010100000000100100100001111110110100110100010011
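The decoding that bin2fp performs can be sketched in Python. This assumes FloPoCo's FP format as documented: 2 exception bits (00 zero, 01 normal, 10 infinity, 11 NaN), a sign bit, a wE-bit biased exponent and a wF-bit fraction; the function is my reimplementation, to be checked against the real tool:

```python
def bin2fp(bits: str, wE: int, wF: int) -> float:
    """Decode FloPoCo's FP format: 2 exception bits, sign,
    wE-bit biased exponent, wF-bit fraction with an implicit
    leading 1 (inf/NaN deliberately not handled here)."""
    assert len(bits) == 2 + 1 + wE + wF
    exn, sign = bits[0:2], bits[2]
    if exn == "00":
        return 0.0
    assert exn == "01", "inf/NaN not handled in this sketch"
    bias = (1 << (wE - 1)) - 1
    e = int(bits[3:3 + wE], 2) - bias
    frac = int(bits[3 + wE:], 2)
    m = 1.0 + frac / (1 << wF)           # implicit leading 1
    return (-m if sign == "1" else m) * 2.0 ** e

# The 48-bit string from the slide decodes back to ~3.1415926:
s = "010100000000100100100001111110110100110100010011"
assert abs(bin2fp(s, 9, 36) - 3.1415926) < 1e-9
```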
Open-ended operators
A polynomial evaluator for arbitrary functions
Example:
flopoco FunctionEvaluator "(sin(x*Pi/2))^2" 32 32 4
4 is the degree of the polynomial; it allows expressing a memory/multiplier trade-off
Works for the set of functions for which it works
Another one is HOTBM, all this is still work in progress
FPPipeline
Example:
flopoco FPPipeline pipeline.txt 8 23
where pipeline.txt contains something like:
r = sqrt(sqr(x0-x1)+sqr(y0-y1)); output r;
⊕ Properly synchronizes the various subcomponents
⊕ Makes it easy to explore various precisions
⊖ Using this is against the philosophy of FloPoCo!
which is: let’s build a custom operator instead
Quick-and-dirty implementation:
not all operators supported, constants not always specialized;
no TestBench support (not a trivial thing to add)
Anybody in the audience doing serious HLS?
Computing just right
A bit of plain common sense
Common wisdom
The more accurately you compute, the more expensive it gets
In practice
We (hopefully) notice it when our computation is not accurate enough.
But do we notice it when it is too accurate for our needs?
Reconciling performance and accuracy?
Or rather: regaining performance by computing just right?
Double precision spoils us
The standard binary64 format (formerly known as double precision) provides roughly 16 decimal digits.
Why should anybody need such accuracy?
Count the digits in the following:
Definition of the second: the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the cesium-133 atom.
Definition of the metre: the distance travelled by light in vacuum in 1/299,792,458 of a second.
Most accurate measurement ever (another atomic frequency): to 14 decimal places.
Most accurate measurement of the Planck constant to date: to 7 decimal places.
The gravitational constant G is known to 3 decimal places only.
Parenthesis : then why binary64 ?
This PC computes 10⁹ operations per second (1 gigaflop/s)
An allegory due to Kulisch
Print the numbers in 100 lines of 5 columns, double-sided: 1000 numbers per sheet
1000 sheets ≈ a heap of 10 cm
10⁹ flop/s ≈ the heap grows at 100 m/s, or 360 km/h
A teraflop/s (10¹² op/s) prints to the moon in one second
Current top-500 computers reach the petaflop/s (10¹⁵ op/s)
Each operation may involve a relative error of 10⁻¹⁶, and these errors accumulate.
Doesn’t this sound wrong?
We would use these 16 digits just to accumulate garbage in them?
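The 10⁻¹⁶ relative error is not an abstraction: binary64 carries a 53-bit significand, so any addend more than 2⁵³ times smaller than the running sum is silently dropped. A two-assertion Python demonstration:

```python
# binary64 has a 53-bit significand: relative rounding error is
# about 1.1e-16.  Once an accumulator reaches 2^53, adding 1.0
# changes nothing at all -- the addend is entirely absorbed:
acc = 2.0 ** 53
assert acc + 1.0 == acc       # absorbed: below one ulp
assert acc + 2.0 != acc       # 2.0 is exactly one ulp here
```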
Back to the point
Mastering accuracy for performance
When implementing a “computing core”:
A goal: never compute more accurately than needed
A tool: flexible arithmetic
Qualitatively: use FPConstDiv instead of a divider
Quantitatively: tailor your data formats to your problem
Sub-goals:
Know what accuracy you need
Know how accurate you compute
“Computing cores” considered so far: elementary functions, sums of products, linear algebra, Euclidean lattice algorithms, ...
Hopelessly difficult in the general case... but at least ask this question systematically.
F. de Dinechin A FloPoCo tutorial 36
My current crusade
The evil (at least on FPGAs)
A fast implementation of single-precision floating-point exponential
(but accurate to 2⁻⁸ only)

Do you see why it is wrong?

A line I shall have in each of my talks until the world is saved

Save routing! Save power! Don't move useless bits around!

Or maybe this one

Do you really need to compute this bit?
F. de Dinechin A FloPoCo tutorial 37
FloPoCo for developers of HLSsoftware
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 38
Using FloPoCoLib
From file src/main_minimal.cpp:

#include "FloPoCo.hpp"
using namespace std;
using namespace flopoco;

int main(int argc, char* argv[]) {
	Target* target = new Virtex4();

	int wE = 9;
	int wF = 33;
	Operator *op = new FPAdderSinglePath(target, wE, wF, wE, wF, wE, wF);

	ofstream file;
	file.open("FPAdder.vhdl", ios::out);
	op->outputVHDLToFile(file);
	file.close();

	cerr << endl << "Final report:" << endl;
	op->outputFinalReport(0);
}
F. de Dinechin A FloPoCo tutorial 39
FloPoCo class hierarchy
[Class diagram. Signal: width, cycle, lifeSpan. Operator: signalList, vhdl, outputVHDL(), emulate(), buildStandardTestCases(). Subclasses of Operator: FPAdder (wE, wF), IntAdder (size), Shifters, Collision (wE, wF), TestBench. Targets: Virtex4, StratixII.]
F. de Dinechin A FloPoCo tutorial 40
Custom arithmetic for HLS (an outsider point of view)
All the optimizations you do when compiling for a processor, plus
a+a should be replaced with 2*a, not the opposite!

fewer optimizations prevented by bit-for-bit or corner-case issues

a C compiler is not allowed to replace x-0.0 with x
or 2*x/2 with x
both issues can be solved by (very small) corner-case-fix logic

more opportunities of operator specialization:

multiplication and division by a constant
a squarer for a*a,
a sine on one period only...

endless opportunities of operator fusion

did I mention √(x² + y²)?
bit-for-bit compatibility lost, replaced with guaranteed accuracy
Eventually we will need new languages.
F. de Dinechin A FloPoCo tutorial 41
FloPoCo for developers of customarithmetic
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 42
A modestly object-oriented approach
FloPoCo is not a C++-based HDL
VHDL generation is “print-based”
1 vhdl << "SoS <= EA(wE -1 downto 0) & Fraction;" ;
easy to port existing work (FPLibrary)
easy learning curve for the VHDL-literate
at least the expressive power of VHDL!

Many helper functions help doing the prints
Example: VHDL signal declaration

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE -1 downto 0) & Fraction;" ;
F. de Dinechin A FloPoCo tutorial 43
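The declare() helper can be sketched in plain C++ (a simplified assumption for illustration only — the real FloPoCo method also records cycle and lifespan information used later for pipelining):

```cpp
#include <string>
#include <vector>

// Sketch (assumption, not FloPoCo's real code) of declare(): it records the
// signal in the operator's signal list, so that the VHDL "signal ..."
// declarations can be generated later, and returns the name so it can be
// printed inline in the vhdl stream.
struct SignalInfo { std::string name; int width; };

struct OperatorSketch {
    std::vector<SignalInfo> signalList;

    std::string declare(const std::string& name, int width) {
        signalList.push_back({name, width});
        return name;  // usable directly inside vhdl << ...
    }

    // Later pass: emit one VHDL declaration per recorded signal.
    std::string declarations() const {
        std::string s;
        for (const auto& sig : signalList)
            s += "signal " + sig.name + " : std_logic_vector("
               + std::to_string(sig.width - 1) + " downto 0);\n";
        return s;
    }
};
```

This is why declare() can sit in the middle of a stream print: it has the side effect of recording the signal, then evaluates to its name.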
Workflow of designing a new operator in FloPoCo
1. Create an operator skeleton
copy src/UserDefinedOperator.cpp (heavily documented)
add your file in CMakeLists.txt and src/FloPoCo.hpp
add command-line interface + doc in src/main.cpp
2. Define the specification : write the emulate() method
syntax is quite OOgly but imitation is your friend
a few lines to change from a type-matching operator
... then TestBench and TestBenchFile will work
3. Write the code generating combinatorial VHDL
Basically embed your VHDL in C++ prints
4. Add code for self-pipelining
Roughly the same effort as it would take to get one pipeline instance
... but you get a frequency-directed pipeline for all the FloPoCo targets
F. de Dinechin A FloPoCo tutorial 44
My personal record
Two weeks from the first intuition of the algorithm
to complete pipelined FloPoCo implementation + paper submission
(Division by small integer constants, ARC 2012)
Implementation time
10 minutes to obtain a testbench generator
1/2 day for the integer Euclidean division
20 min for its flexible pipeline
1/2 day for the FP divider by 3
and again 20 min
Time savers :
Test bench generator for correct combinatorial operation
Pipeline framework (for filling result tables in the paper)
(Time wasters : maintaining all this instead of writing new operators)
F. de Dinechin A FloPoCo tutorial 45
Pipeline made easy
[Figure: pipelined datapath of the floating-point √(x² + y²) operator: unpack; exponent sort; alignment shifters; squarers MA², MB², MC²; multi-operand add; normalize/pack. Signal widths are annotated: 1+wF for the input fractions, 2+wF+g inside, 4+wF+g for the sum, wE+wF+g for the result R. Cycle numbers 0 to 6 mark the register levels.]

F. de Dinechin A FloPoCo tutorial 46
Notion of current cycle during VHDL output
Each signal has a cycle attribute (at which cycle it is defined)
vhdl << declare("EA", wE) << " <= ... ;" ;   // cycle(EA)=0
nextCycle();
vhdl << declare("EB", wE) << " <= ... ;" ;   // cycle(EB)=1
setCycle(0);
vhdl << ... ;
FloPoCo looks for signal names on the right-hand side
and delays them by (currentCycle − defCycle) cycles:

vhdl << declare("SoS", wE+wF+g)
     << " <= EA(wE -1 downto 0) & Fraction;" ;
Generated VHDL:

SoS <= EA_d6(wE -1 downto 0) & Fraction_d1;

(FloPoCo transparently declares and builds the needed shift registers.)
Managing the current cycle:

n = getCycle();
setCycle(n);
nextCycle();
syncCycleWithSignal("EA");

Frequency-directed pipelining:

manageCriticalPath( target->adderDelay(n) );
F. de Dinechin A FloPoCo tutorial 46
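The right-hand-side delaying can be sketched as a string rewrite over the vhdl stream (a deliberately simplified assumption: the real FloPoCo second pass is a proper parse of the stream, not a per-signal regex):

```cpp
#include <map>
#include <regex>
#include <string>

// Sketch of the second pass: every right-hand-side occurrence of a signal
// defined at an earlier cycle is rewritten name -> name_dN, where
// N = currentCycle - defCycle; the _dN signals are backed by shift registers.
std::string delayRHSSignals(const std::string& vhdl,
                            const std::map<std::string, int>& defCycle,
                            int currentCycle) {
    std::string out = vhdl;
    for (const auto& [name, cycle] : defCycle) {
        int n = currentCycle - cycle;
        if (n > 0)  // this signal must cross n register levels
            out = std::regex_replace(out, std::regex("\\b" + name + "\\b"),
                                     name + "_d" + std::to_string(n));
    }
    return out;
}
```

With EA defined at cycle 0, Fraction at cycle 5, and the assignment emitted at cycle 6, this sketch reproduces the kind of generated line shown in the slide.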
manageCriticalPath(delay)
A currentCriticalPath variable is maintained during VHDL generation
manageCriticalPath(delay)
adds delay to currentCriticalPath;
if the sum is larger than 1/targetFrequency,
currentCycle++
and currentCriticalPath is reset to delay
delay is an estimation of the delay of the following block of VHDL
you (the developer) have to perform this estimation
... typically computed using timing methods of the Target class
adderDelay(int n)
localRoutingDelay(int fanout)
This is how the pipeline adapts to the actual target.
F. de Dinechin A FloPoCo tutorial 47
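The essential bookkeeping fits in a few lines; the sketch below is a simplified assumption of the logic just described (the real method also handles sub-component delays and warnings):

```cpp
// Sketch (simplified assumption) of FloPoCo's frequency-directed pipelining:
// accumulate combinatorial delay, and insert a register level whenever the
// accumulated critical path would exceed the target clock period.
struct PipelineState {
    double period;                      // 1 / targetFrequency, e.g. in ns
    int    currentCycle = 0;
    double currentCriticalPath = 0.0;

    void manageCriticalPath(double delay) {
        if (currentCriticalPath + delay > period) {
            ++currentCycle;                // insert a pipeline register
            currentCriticalPath = delay;   // new cycle starts with this delay
        } else {
            currentCriticalPath += delay;
        }
    }
};
```

Lowering the target frequency (raising period) makes registers rarer; raising it makes them denser — the same generator code yields both pipelines.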
The magic explained
Pipeline adapts to random insertions of pipeline levels anywhere
during development, tinkering, fine-tuning
or during frequency-directed, target-optimized pipelining
Graceful degradation to a combinatorial operator
in which case the vhdl stream is just output untouched
Suits the “print VHDL” philosophy
FloPoCo code is much shorter than the VHDL it generates.
F. de Dinechin A FloPoCo tutorial 48
From the developer point of view
Clean separation of functionality and performance issues
You write to the vhdl stream only combinatorial VHDL
FloPoCo parses it to insert registers (in two passes)
Timing is managed out of the VHDL code
Subcomponents also delay their outputs WRT their inputs, etc
Correct-by-construction pipelining
You cannot obtain a non-functional operator, starting from a working combinatorial one
(at least not without very loud warnings)
This part of the framework can be trusted
nothing more complicated than integer subtractions and comparisons
and of course, test bench generators adapt to the pipeline
You can obtain a bad pipeline, though
unbalanced, over-pipelined, missing the target frequency, ...
F. de Dinechin A FloPoCo tutorial 49
[Figure: VHDL generation flow. The C++ constructor (recursively calling the constructors of all sub-components to know their pipeline depth) produces the functional information: the vhdl stream (combinatorial description in structural VHDL syntax), the signal list (bitwidth, cycle, lifeSpan for each signal), the subcomponent list, and the pipeline information. Operator.outputVHDL() then uses this information to emit the output file: entity and port declarations, component and signal declarations, the clocked register process, and the architecture code, in a second pass on the vhdl stream that delays right-hand-side signals.]
F. de Dinechin A FloPoCo tutorial 50
We could have done it better
History of development shows a bit
Do we really need an interface with explicit cycle management ?
In FloPoCo, time=(cycle, critical path within the cycle)
but the user interface should consist only of delay information
A delay should be associated systematically to signal declaration
Too much code to change it now...
F. de Dinechin A FloPoCo tutorial 51
Matching the architecture to the target
Perfectly valid code:

if (target == "StratixII") {
	// output some VHDL
}
else if (target == "Virtex4") {
	// output other VHDL
}
else ...
Try to avoid it :
Model architecture details
LUTs
DSP blocks
memory blocks
Model delays
Definitely an endless effort.
F. de Dinechin A FloPoCo tutorial 52
One slide on testing
emulate() performs bit-accurate emulation
Not by architecture simulation!
By expressing the mathematical specification...
in typically a few lines of MPFR (www.mpfr.org)
buildStandardTestCases() for corner cases and regression tests
buildRandomTestCases() to bias the random generator in an operation-specific way

default is uniform on the bits.
Example 1: the useful range of the exponential is very small; bias the random generator towards it
Example 2: cancellation cases in floating-point additions
may (should) be overloaded by each Operator
The special TestBench and TestBenchFile operators invoke these methods
F. de Dinechin A FloPoCo tutorial 53
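The "several acceptable values" mechanism can be sketched outside FloPoCo: for a faithfully rounded operator, emulate() computes the exact result and registers the two surrounding representable values (only one value when the exact result is itself representable). A simplified sketch, using double in place of MPFR:

```cpp
#include <cmath>
#include <set>

// Sketch (assumption: double instead of MPFR, for illustration) of how an
// emulate() method specifies faithful rounding: from the exact mathematical
// result it registers the surrounding representable values; a test case
// passes if the hardware output is any of them.
std::set<long long> faithfulOutputs(double exact, int lsbWeight) {
    // Representable values are the multiples of 2^lsbWeight.
    double scaled = std::ldexp(exact, -lsbWeight);  // exact / 2^lsbWeight
    std::set<long long> allowed;
    allowed.insert((long long)std::floor(scaled));  // round down
    allowed.insert((long long)std::ceil(scaled));   // round up
    return allowed;  // a singleton when exact is representable
}
```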
A tutorial : building an integerdivider by 3
Introduction: custom arithmetic
FloPoCo for application developers
Computing just right
FloPoCo for developers of HLS software
FloPoCo for developers of custom arithmetic
A tutorial: building an integer divider by 3
A tutorial: building a faithful FIR
A tutorial: building a √(x² + y²) operator
Conclusion
F. de Dinechin A FloPoCo tutorial 54
Anybody here remembers how we compute divisions ?
Worked example: 776 ÷ 3 = 258, remainder 2

7 ÷ 3 = 2, remainder 1
17 ÷ 3 = 5, remainder 2
26 ÷ 3 = 8, remainder 2
iteration body : Euclidean division of a 2-digit decimal number by 3
The first digit is a remainder from the previous iteration:
its value is 0, 1 or 2

Possible implementation as a look-up table that, for each value from 00 to 29, gives the quotient and the remainder of its division by 3.
F. de Dinechin A FloPoCo tutorial 55
Mathematical obfuscation, in binary-friendly radix
Representation of x in radix β = 2^α
(splitting the binary decomposition of x into k chunks of α bits)

procedure ConstantDiv(x, d)
    r_k ← 0
    for i = k − 1 down to 0 do
        y_i ← x_i + 2^α · r_{i+1}   (this + is a concatenation)
        (q_i, r_i) ← (⌊y_i / d⌋, y_i mod d)   (read from a table)
    end for
    return q = ∑_{i=0}^{k−1} q_i · 2^{αi}, and r_0
end procedure

Each iteration

consumes α bits of x, and a remainder of size γ = ⌈log₂ d⌉
produces α bits of q, and a remainder of size γ

Implemented as a table with α + γ bits in, α + γ bits out
F. de Dinechin A FloPoCo tutorial 56
Sequential implementation
[Figure: sequential implementation. One clocked LUT (clk, reset) consumes an α-bit chunk xᵢ per cycle, outputs the α-bit quotient chunk qᵢ, and feeds the γ-bit remainder rᵢ back to its input rᵢ₊₁.]
F. de Dinechin A FloPoCo tutorial 57
Unrolled implementation
[Figure: unrolled implementation. A chain of four LUTs; LUT i consumes the 4-bit chunk xᵢ and the 2-bit remainder rᵢ₊₁ (with r₄ = 0), and produces the 4-bit quotient chunk qᵢ and the remainder rᵢ; r₀ = r is the final remainder.]
F. de Dinechin A FloPoCo tutorial 58
Setting the parameters for LUTs
For instance, assuming 6-input LUTs (e.g. LUT6)

A table with 6 bits in and 6 bits out consumes 6 LUT6
Size of the remainder is γ = ⌈log₂ d⌉
If d < 2⁵, very efficient architecture: α = 6 − γ
The smaller d, the better
Easy to pipeline (one register behind each LUT for free)
F. de Dinechin A FloPoCo tutorial 59
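The sizing rules above translate into a back-of-the-envelope cost estimate; the helper below is a sketch under the stated assumptions (one LUT6 per output bit of each (α+γ)-in, (α+γ)-out table, no packing tricks), not a synthesis result:

```cpp
// Back-of-the-envelope LUT6 cost of the unrolled divider by a small constant
// d on a w-bit input, following the sizing rules above.
struct DivCost { int gamma, alpha, tables, lut6; };

DivCost constDivCost(int d, int w) {
    int gamma = 0;
    while ((1 << gamma) < d) ++gamma;       // gamma = ceil(log2 d)
    int alpha = 6 - gamma;                  // table inputs fit one LUT6 level
    int tables = (w + alpha - 1) / alpha;   // k = ceil(w / alpha) iterations
    return { gamma, alpha, tables, tables * (alpha + gamma) };
}
```

For d = 3 and a 32-bit input this gives γ = 2, α = 4, 8 tables: the whole divider costs a few dozen LUT6, far cheaper than a generic divider.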
Step 0 : get a working FloPoCo distribution
Google FloPoCo, then
first use the 1-line install to get all dependencies
optionally do an svn checkout to live on the edge
Tutorial files in a tgz on the tutorial webpage.
F. de Dinechin A FloPoCo tutorial 60
Step 1 : build a skeleton
We want to implement flopoco IntConstDiv w d

cd src
cp UserDefinedOperator.cpp IntConstDiv.cpp
cp UserDefinedOperator.hpp IntConstDiv.hpp
Inside the newly created files,
global replace UserDefinedOperator with IntConstDiv
empty the cpp of all flesh except initial declarations
Edit CMakeLists.txt (the file that defines the building process),
add src/IntConstDiv somewhere
Edit src/FloPoCo.hpp and do similar imitation
Edit src/main.cpp and do similar imitation
Compile and fix
Check that ./flopoco IntConstDiv 32 3 does nothing
F. de Dinechin A FloPoCo tutorial 61
Step 2 : overload emulate()
Your d should be stored in the IntConstDiv somewhere
What emulate() does
a TestCase is a pair (input, allowed outputs)
emulate() inputs a TestCase with only the input defined
and defines the allowed outputs – here two
Inputs and outputs are semantically bit vectors, held in a GMP integer.

OOgly methods to convert floating-point from / to arbitrary-precision floating-point (MPFR)
no need here
You may define several acceptable values for an input
typically for faithful rounding (two values)
After this step, wonder at:
./flopoco IntConstDiv 10 3 TestBench 1000
F. de Dinechin A FloPoCo tutorial 62
Step 3 : Write the combinatorial VHDL
Use the abstract Table operator
All you need is to overload the function() method
Other useful generic operators :
Shifters and leading zero counters for floating-point
Pipelined adders, multipliers, ...
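For division by a small constant, the function to tabulate is one step of an MSB-first digit recurrence. A sketch of the idea with plain integers and illustrative names (FloPoCo's actual Table subclass would tabulate divTableFunction in its overloaded function() method):

```cpp
#include <cassert>
#include <cstdint>

// One LUT step of MSB-first division by a small constant d:
//   input  = r * 2^k + y   (r: previous remainder, 0 <= r < d; y: next k bits)
//   output = (q, r') with q = input / d, r' = input mod d.
// Since input < d * 2^k, q always fits in k bits.
struct DivStep { uint64_t q, r; };

DivStep divTableFunction(uint64_t r, uint64_t y, int k, uint64_t d) {
    uint64_t z = (r << k) | y;
    return { z / d, z % d };
}

// Iterating the table over the k-bit chunks of a w-bit x (assuming
// k divides w) reproduces long division: q = x / d.
uint64_t divideByTable(uint64_t x, int w, int k, uint64_t d) {
    uint64_t q = 0, r = 0;
    for (int pos = w - k; pos >= 0; pos -= k) {     // MSB-first chunks
        uint64_t y = (x >> pos) & ((1ULL << k) - 1);
        DivStep s = divTableFunction(r, y, k, d);
        q = (q << k) | s.q;
        r = s.r;
    }
    return q;
}
```

In hardware, each step becomes one small LUT-based table and the steps chain through the remainder bits only.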
Step 3.5 : overload BuildStandardTestCases
First debug until VHDL works for all xi = 0
... then for all xi = 1
Step 4 : pipeline
using nextCycle() and sync*() only in this case
A tutorial : building a faithful FIR
Specification
Function
r = ∑_{i=1}^{n} c_i x_i
The c_i are constants provided with arbitrary accuracy on the command line
the x_i are signed fixed-point inputs in (1, p) format
Accuracy : faithful rounding of the exact result
x_i and c_i considered exact
return one of the two numbers surrounding the exact result :
next-best after correct rounding (which is too expensive)
last-bit accurate (all returned bits are useful)
more accurate than naive assembly of multipliers and adders
Algorithm and error budget
sum the products truncated to precision 2^(−p−g), i.e. g guard bits,
add one bit at weight 2^(−p−1),
and truncate to precision 2^(−p), i.e. to the result format
... where g = 1 + log2(n − 1)
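The budget works out because each of the n−1 truncations loses less than 2^(−p−g), so faithful rounding needs (n−1)·2^(−g) ≤ 1/2. A quick sketch of the formula (assuming the ceiling is implied on the slide when n−1 is not a power of two):

```cpp
#include <cassert>
#include <cmath>

// Guard bits for an n-tap faithful FIR, as on the slide
// (with a ceiling, since g must be an integer):
// (n-1) truncation errors of at most 2^(-p-g) each must fit
// within half an ulp, i.e. (n-1) * 2^(-g) <= 1/2.
int guardBits(int n) {
    return 1 + (int)std::ceil(std::log2((double)(n - 1)));
}
```

Equivalently: n − 1 ≤ 2^(g−1), which this g satisfies by construction.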
Design space exploration
The sum can be implemented as
a rake of adders (simplest, let’s start with that)
a tree of adders (slightly more complex, good we have C++)
a single BitHeap object (on the edge)
Step 0 : get a working FloPoCo distribution
Google FloPoCo, then
first use the 1-line install to get all dependencies
optionally do an svn checkout to live on the edge
Step 1 : build a skeleton
We want to implement flopoco FixedPointFIR n w a1 ... an
cd src
cp UserDefinedOperator.cpp FixedPointFIR.cpp
cp UserDefinedOperator.hpp FixedPointFIR.hpp
Inside the newly created files,
global replace UserDefinedOperator with FixedPointFIR
empty the cpp of all flesh except initial declarations
Edit CMakeLists.txt (the file that defines the building process),
add src/FixedPointFIR somewhere
Edit src/FloPoCo.hpp and do similar imitation
Edit src/main.cpp and do similar imitation
NewCompressorTree has an arbitrary-size parameter list
Compile and fix
Check that ./flopoco FixedPointFIR 4 8 1 1 1 1 does nothing
Step 2 : overload emulate()
Your a_i should be stored in the FixedPointFIR somewhere
What emulate() does
a TestCase is a pair (input, allowed outputs)
emulate() inputs a TestCase with only the input defined
and defines the allowed outputs – here two
Inputs and outputs are semantically bit vectors, held in a GMP integer.
OOgly methods to convert floating-point from / to arbitrary-precision floating-point (MPFR)
no need here
After this step, wonder at : ./flopoco FixedPointFIR 4 8 1 1 1 1 TestBench 1000
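A sketch of the faithful-FIR reference computation with plain 64-bit integers and non-negative data (FloPoCo's emulate() does the same with mpz_class and proper sign handling): the exact sum of products is formed at full precision, then both surrounding values at the output precision are registered as allowed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Faithful-rounding reference: exact accumulation, then the two
// neighbours at the output precision (extraBits = bits dropped at
// the end).  Sketch for non-negative values only.
std::pair<int64_t, int64_t> faithfulPair(const std::vector<int64_t>& c,
                                         const std::vector<int64_t>& x,
                                         int extraBits) {
    int64_t acc = 0;                               // exact sum of c_i * x_i
    for (std::size_t i = 0; i < c.size(); i++) acc += c[i] * x[i];
    int64_t lo = acc >> extraBits;                 // truncate = round down
    bool exact = (acc & ((1LL << extraBits) - 1)) == 0;
    return { lo, exact ? lo : lo + 1 };            // both accepted by the testbench
}
```

When the exact sum falls on a representable value, the two allowed outputs collapse to one.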
Step 3 : Write the combinatorial VHDL
use one of the existing constant multipliers in FloPoCo
Half the room should use FixRealKCM
The other half should use IntConstMult
First write the additions as “+” in standard VHDL
not very efficient, but tutorially good.
(several better alternatives to discuss later)
All datapath on 1 + w + g bits
here also, a small optimization is possible
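The KCM idea behind FixRealKCM, sketched with an integer constant so the tables are exact (with a real constant, each table entry is rounded to the target precision and the rounding errors join the guard-bit budget); names are illustrative, not FloPoCo's interface:

```cpp
#include <cassert>
#include <cstdint>

// KCM principle: split the w-bit input into k-bit chunks; each chunk
// addresses a small table holding chunk * C at the right weight; the
// product is recovered by summing the table outputs.
uint64_t kcmMultiply(uint64_t x, int w, int k, uint64_t C) {
    uint64_t sum = 0;
    for (int j = 0; j * k < w; j++) {
        uint64_t chunk = (x >> (j * k)) & ((1ULL << k) - 1);
        // table T_j[chunk] = C * chunk, added at weight 2^(j*k)
        sum += (C * chunk) << (j * k);
    }
    return sum;   // equals C * x when the tables are exact
}
```

With k matched to the FPGA LUT input count, each table costs one LUT per output bit, which is why KCM suits LUT-based targets.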
Step 3.5 : overload BuildStandardTestCases
First debug until VHDL works for all xi = 0
... then for all xi = 1
Step 4 : pipeline
using nextCycle() and sync*() only in this case
A tutorial : building a √(x² + y²) operator
Specification
We are going to build a floating-point operator for √(x² + y²)
Black box : two inputs, one output, just like an adder or a multiplier.
Mathematical specification : faithful rounding of the exact result
x and y considered exact
return one of the FP numbers surrounding the exact result
less accurate than correct rounding, but
last-bit accurate (all returned bits are useful), and
more accurate than a combination of two multipliers, one adder and a square root (4 rounding errors)
Algorithm : sum of squares, followed by a square root of the mantissa
with some guard bits for faithful rounding
expecting savings mostly in the sum of squares
Disclaimer (for hard-core arithmeticians)
It should be possible to fuse the whole computation in a single digit recurrence
more efficient for ASIC and DSP-less FPGAs
see works by Ercegovac, Muller, Lang, Bruguera, Takagi ...
... but beyond the scope of this work
and less interesting from a tutorial point of view : no component reuse
Step 0 : get a working FloPoCo distribution
Google FloPoCo, then
first use the 1-line install to get all dependencies
optionally do an svn checkout to live on the edge
Step 1 : build a skeleton
We want to implement flopoco FP2DNorm wE wF
cd src
cp UserDefinedOperator.cpp FP2DNorm.cpp
cp UserDefinedOperator.hpp FP2DNorm.hpp
Inside the newly created files,
global replace UserDefinedOperator with FP2DNorm
empty the cpp of all flesh except initial declarations
Edit CMakeLists.txt (the file that defines the building process),
look for src/FPSqrt (because the interface is similar) and add src/FP2DNorm close to it
Edit src/FloPoCo.hpp and do similar imitation
Edit src/main.cpp and do similar imitation
Compile and fix
Check that ./flopoco FP2DNorm 8 23 does nothing
Step 2 : overload emulate()
Here, get inspiration from src/FPSumOfSquares or src/FPAdderSinglePath.
a TestCase is a pair (input, allowed outputs)
emulate() inputs a TestCase with only the input defined
and defines the allowed outputs
Inputs and outputs are semantically bit vectors, held in a GMP integer.
OOgly methods to convert from / to arbitrary-precision floating-point (MPFR)
After this step, test : ./flopoco FP2DNorm 5 10 TestBench 1000
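A sketch of what emulate() must compute, with long double standing in for MPFR's arbitrary precision and float as the target format: evaluate √(x²+y²) well beyond the target precision, then accept the two consecutive floats that surround the exact result.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Faithful-rounding reference for a float sqrt(x^2 + y^2):
// evaluate in extended precision, then return the pair of
// consecutive target-format values surrounding the exact result.
// (Sketch: FloPoCo's emulate() does this with MPFR, not long double.)
std::pair<float, float> norm2Faithful(float x, float y) {
    long double exact = sqrtl((long double)x * x + (long double)y * y);
    float lo = (float)exact;               // nearest float
    // nudge to get both neighbours when the cast was not exact
    if ((long double)lo > exact) return { std::nextafterf(lo, 0.0f), lo };
    if ((long double)lo < exact) return { lo, std::nextafterf(lo, INFINITY) };
    return { lo, lo };                     // exactly representable
}
```

Pythagorean inputs like (3, 4) give one allowed value; almost every other input gives two.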
Step 3 : Write the combinatorial VHDL
use an exponent difference, a shifter, two squarers, and a square root
exponent difference is a subtraction, but a small one
followed by a swap
Shift before or after the square ?
before enables a single bit-heap implementation... but requires a slightly larger squarer
after requires a slightly larger shifter
Hack the floating-point square root
no fixed-point version available
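The shift-before versus shift-after trade-off rests on an alignment identity: shifting the smaller mantissa by d before squaring matches shifting its square by 2d afterwards, up to truncation. A plain-integer sketch (illustrative names) showing that the "before" path, which truncates earlier, can only come out smaller:

```cpp
#include <cassert>
#include <cstdint>

// Two ways to align the smaller operand of x^2 + y^2 by d = ex - ey:
// shift the mantissa before the squarer (smaller squarer, bit-heap
// friendly) or shift the square afterwards (larger shifter, more
// bits kept).  Both agree when the d dropped bits are zero.
uint64_t shiftThenSquare(uint64_t my, int d) { return (my >> d) * (my >> d); }
uint64_t squareThenShift(uint64_t my, int d) { return (my * my) >> (2 * d); }
```

The difference between the two paths is what the guard bits of the faithful-rounding budget must absorb.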
Conclusion
The “no killer app” theorem
For 20 years, the FPGA community has been waiting for the “killer application”.
(The widely useful application on which the FPGA is so much better.)
Theorem : we’ll wait forever.
Proof : When such an application pops up,
either it is indeed widely useful, and next year’s Pentium will do it in hardware 10x faster than the FPGA, so it won’t be an FPGA killer app next year,
or the FPGA remains competitive next year, but it means that it was not a killer app.
The killer feature of FPGAs is flexibility
To exploit it, we do need infinitely many arithmetic operators.
Computing just right
In a Pentium
the choice is between
an integer SUV, or
a floating-point SUV.
In an FPGA
If all I need is a bicycle, I have the possibility to build a bicycle
(and I’m usually faster to destination)
Save routing ! Save power ! Don’t move useless bits around !
An almost virgin land
Most of the arithmetic literature addresses the construction of SUVs.
My personal record
Division by 3 (ARC 2012) : two weeks from the first intuition of the algorithm to the complete pipelined FloPoCo implementation + paper submission.
Implementation time
10 minutes to obtain a testbench generator
2 days for the fully parametric combinatorial operator
(less than VHDL)
20 min for its fully flexible pipeline
So when do we have an FPGA in every PC ?
When they become as easy to program as processors ?
(now that’s a challenge)
(or do we quietly wait for processors to become as messy to program asFPGAs ?)
Thanks for your attention
The following people have contributed to FloPoCo :
S. Banescu, N. Brunie, S. Collange, J. Detrey, P. Echeverría, F. Ferrandi, M. Grad, K. Illyes, M. Istoan, M. Joldes, C. Klein, D. Mastrandrea, B. Pasca, B. Popa, X. Pujol, D. Thomas, R. Tudoran, A. Vasquez.
http://flopoco.gforge.inria.fr/