Resource Saving in Micro-Computer Software & FPGA Firmware Designs Wu, Jinyuan Fermilab Nov. 2006


Resource Saving in FPGA

From: “CompactFPGAdesign.pdf”

• Glue Logic

• Digitization

– TDC, (ADC), etc.

• Communication

– C5, Digital Phase Follower, etc.

• Data Organization

– Zero-Suppression, Parasitic Event Building, etc.

• Reconfigurable Computing

– Hash Sorter, TTF, ELMS, etc.

Software -- Firmware

Computer Is Fast

• This is the first impression of many beginners.

• “FPGA is big.”

• Program Creation Time > Execution Time

How to Slow Down Computers?

• Single Layer Loop:

– 256 passes x 3 instructions x 4 clock cycles x 0.25 us = 0.77 ms

• Nested Loops:

– 256 x 0.77 ms = 0.20 s

[Figure: Z80 CPU (4 MHz) driving a square-wave generator; note sequence 5 56 | 2 - | 1 16 | 2 - |]

“LD A,A” = “NOOP”

1 NOOP takes 1 µs

1,000,000 NOOPs take 1 s

Single layer loop:

        LD A,#255
BACKA:  NOOP
        DEC A
        JP NZ, BACKA

Nested loops:

        LD B,#255
BACKB:  LD A,#255
BACKA:  NOOP
        DEC A
        JP NZ, BACKA
        LD A,B
        DEC B
        DEC A
        JP NZ, BACKB
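The timing estimates for the two delay loops can be reproduced with a few lines of C. This is only a sketch of the slide's arithmetic: the function names are made up for illustration, every instruction is assumed to take the same number of clock cycles, and the few instructions of outer-loop overhead are ignored.

```c
#include <assert.h>

/* Estimated run time, in microseconds, of a counted delay loop.
 * Assumption: every instruction takes the same number of clock cycles. */
double delay_loop_us(long passes, int instr_per_pass,
                     int cycles_per_instr, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)passes * instr_per_pass * cycles_per_instr / clock_mhz;
}

/* Nested case: the outer loop repeats the whole inner loop.
 * The outer loop's own instructions are neglected. */
double nested_delay_us(long outer, long inner, int instr_per_pass,
                       int cycles_per_instr, double clock_mhz)
{
    return (double)outer *
           delay_loop_us(inner, instr_per_pass, cycles_per_instr, clock_mhz);
}
```

With 256 passes, 3 instructions per pass, 4 cycles per instruction and a 4 MHz clock, `delay_loop_us` gives 768 us, and nesting it 256 times gives roughly 0.2 s.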


Knowing Slow, Knowing Fast

Where Resources Can Be Saved

• For micro-computer software:

– Pay attention to loops and frequently called subroutines,

– Especially inner-most nested loops.

• For FPGA firmware:

– Algorithms rooted in micro-computer software.

– Reusable blocks.

– Occasionally used functions.

Example: Inner-Product

• Avoid using a conditional branch for loop control (ELMS):

– Saves 25% of the execution time in this case.

        LD   R1, #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
        DEC  R1
        BRNZ BckA1

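$$Y = \sum_{i=0}^{n} a_i X_i$$

[Figure: data flow of the loop. R2++ addresses a and R3++ addresses X; R4 and R5 feed the multiplier (x) giving R6, which is added (+) into R7; R1-- counts the passes.]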

• Multiplier-less algorithms.

• Reuse computations: Using fast algorithms like FFT.

• Avoid entering the loop: Using early constraints.
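The 25% figure for the inner-product example follows from counting instructions per pass: six do the work (two loads, two pointer increments, multiply, add) and two (DEC, BRNZ) only control the loop. A small C sketch, with hypothetical function names and the assumption that all instructions cost the same time, makes the count explicit next to the loop itself.

```c
#include <assert.h>

/* Inner product Y = sum of a[i] * x[i] over n terms.
 * On a conventional processor the loop test below typically becomes a
 * decrement plus a conditional branch on every pass. */
long inner_product(const int *a, const int *x, int n)
{
    long y = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        y += (long)a[i] * x[i];
    return y;
}

/* Fraction of the instructions in one pass that only control the loop.
 * Assumption: every instruction takes the same time. */
double loop_control_fraction(int body_instr, int control_instr)
{
    assert(body_instr + control_instr > 0);
    return (double)control_instr / (body_instr + control_instr);
}
```

`loop_control_fraction(6, 2)` returns 0.25, which is the share of execution time that disappears when the loop control is handled outside the instruction stream.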

Computing Module in Micro-processor & FPGA

• Micro-processors use a fully sequential approach: one operation is performed in each clock cycle.

• In an FPGA, flattened logic is allowed and is fast, but takes a large silicon area.

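(100+3-4)*5+7 = ?

[Figure: flattened logic with inputs 100, 3, 4, 5, 7, compared with a sequenced unit driven by a control stream and a data stream]

Control: LD, (+), (-), (*), (+)

Data: 100, 3, 4, 5, 7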
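The sequenced evaluation of (100+3-4)*5+7 can be mimicked in software with one accumulator that receives a control code and a data word per step. The opcode names and the function below are hypothetical, chosen only to mirror the control and data streams of the example.

```c
#include <assert.h>

/* Hypothetical opcodes for a one-accumulator sequenced unit. */
enum op { OP_LD, OP_ADD, OP_SUB, OP_MUL };

/* Run n (control, data) pairs through a single accumulator,
 * one operation per step, as a sequenced design does per clock cycle. */
long run_sequence(const enum op *control, const long *data, int n)
{
    long acc = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        switch (control[i]) {
        case OP_LD:  acc  = data[i]; break;
        case OP_ADD: acc += data[i]; break;
        case OP_SUB: acc -= data[i]; break;
        case OP_MUL: acc *= data[i]; break;
        }
    }
    return acc;
}
```

Feeding it the control stream LD, +, -, *, + with the data 100, 3, 4, 5, 7 takes five steps and yields 502. A flattened design would produce the same value in one step, at the cost of four separate operators.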

Sequencing in FPGA for Resource Control

• Sequencing is a very efficient means of resource control in FPGA.

• Reuse processing resources for similar functions and/or different channels.

• Pay attention to occasionally-used functions like initialization.

[Figure: timing charts of Sum1, Sum2, Sum3, Sum4 for channels CH0 to CH3, with the initialization steps (Initialization, Initialization1, Initialization2)]

Suggestion (1)

Use partially flattened and partially sequential logic to balance speed and size.

ELMS – Enclosed Loop Micro-Sequencer

• A PC+ROM structure can be a very good sequencer in FPGA.

• The Conditional Branch Logic is added to support regular conditional branches, as in micro-processors.

• The Loop & Return Logic + Stack are added to support FOR loops with a pre-defined number of iterations at machine code level.

• The resource usage of ELMS in FPGA is very small.

[Figure: two block diagrams. First: a Program Counter (inputs Reset, CLK) drives address A of a 128 x 36-bit ROM, whose outputs are the control signals. Second: the same structure with the Conditional Branch Logic and the Loop & Return Logic + Stack added.]

        FOR  BckA1 EndA1 #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6

ELMS – Detailed Block Diagram

[Figure: detailed block diagram. Blocks: PC with +1, CondJMP and Reset (0x04) inputs; ROM 128 x 36 bits; Loop & Return Registers + Stack (128 words) holding CNT, endA, bckA and desA; Compare; signals Push, Pop, LoopBack, DEC, RTN, LastPass, JMP, JMPIF; user control signals. Instruction shown: RUNat04 cnt EndA BckA.]

LoopBack = DEC = (PC==endA) && (CNT!=0)

LastPass = (PC==endA) && (CNT==1)

FOR Loops at Machine Code Level

• The looping sequence is known in this example before entering the loop.

• A regular micro-processor treats the sequence as unknown.

• ELMS supports FOR loops with pre-defined iterations at machine code level.

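With a conditional branch:

        LD   R1, #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
        DEC  R1
        BRNZ BckA1

With the ELMS FOR loop:

        FOR  BckA1 EndA1 #n
        LD   R2, #addr_a
        LD   R3, #addr_X
        LD   R7, #0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6

$$Y = \sum_{i=0}^{n} a_i X_i$$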

Suggestion (2)

Eliminate unnecessary instructions, functions, time slots, etc. whenever possible.

Do You SUDOKU?

• Fill in 1-9 so that:

– Each column contains 1-9 without repeating.

– Each row contains 1-9 without repeating.

– Each 3x3 box contains 1-9 without repeating.

• It is fun to solve by hand.

• It is also fun to write a solver program, or read a good one.

[Figure: Sudoku puzzle grid with 28 given digits]

A Possible SUDOKU Solver?

• For all empty boxes, assign 1-9 to each.

• Check correct or not.

• If not, repeat.


81 - 28 = 53 empty boxes.

9 possibilities for each box.

Total possibilities: $9^{53}$.

Assume a computer checks $10^{10}$ possibilities/sec.

A year = $3 \times 10^{7}$ sec.

Total time to solve:

$9^{53} / (10^{10} \times 3 \times 10^{7}) \gg 1000$ years
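To see how far beyond 1000 years the brute-force estimate lands, the division can be carried out with logarithms. The rounding below is approximate and uses only the figures assumed above.

```latex
\[
9^{53} = 10^{53 \log_{10} 9} \approx 10^{50.57} \approx 3.76 \times 10^{50},
\qquad
\frac{3.76 \times 10^{50}}{10^{10} \times 3 \times 10^{7}}
\approx 1.25 \times 10^{33}\ \text{years}.
\]
```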

A Real SUDOKU Solver

• Eliminate impossible values for each empty box.

• Assign a possible value to the box.

• Repeat.

Total time to solve: < 1 sec

sudoku.c

#include <stdio.h>

/* show_board() -- print the board; empty cells are shown as blanks */

void show_board(int b[9][9])
{
    int i, j;

    printf("+-------+-------+-------+\n");
    for (i = 0; i < 9; i++) {
        printf("|");
        for (j = 0; j < 9; j++) {
            if (b[i][j] == 0)
                printf("  ");    /* two blanks keep the columns aligned */
            else
                printf(" %d", b[i][j]);
            if (j % 3 == 2)
                printf(" |");
        }
        printf("\n");
        if (i % 3 == 2)
            printf("+-------+-------+-------+\n");
    }
}

/* init_board() -- initialize the board with all 0 */

void init_board(int b[9][9])
{
    int i, j;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b[i][j] = 0;
}

/* read_board() -- read the board from input file */

void read_board(FILE *fp, int b[9][9])
{
    int i, j, c;

    i = 0;
    j = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            i++;
            j = 0;
        } else {
            /* store digits only, and never write outside the 9x9 board */
            if (i < 9 && j < 9 && c >= '1' && c <= '9')
                b[i][j] = c - '0';
            j++;
        }
    }
}

/* check_row() -- check the row */

int check_row(int b[9][9], int x, int y, int v)
{
    int i;

    /* v is allowed at [x, y] if no other cell in row x holds it */
    for (i = 0; i < 9; i++)
        if (i != y)
            if (b[x][i] == v)
                return 0;
    return v;
}

/* check_column() -- check the column */

int check_column(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != x)
            if (b[i][y] == v)
                return 0;
    return v;
}

/* check_square() -- check the square */

int check_square(int b[9][9], int x, int y, int v)
{
    int i, j, x0, y0;

    x0 = x / 3;
    y0 = y / 3;

    for (i = x0 * 3; i < x0 * 3 + 3; i++)
        for (j = y0 * 3; j < y0 * 3 + 3; j++)
            if (!((x == i) && (y == j)))
                if (b[i][j] == v)
                    return 0;
    return v;
}

/* unique_solution() -- find the unique solution for [x, y], or 0 if none */

int unique_solution(int b[9][9], int x, int y)
{
    int s = 0, n = 0, v;

    for (v = 1; v < 10; v++) {
        if (check_row(b, x, y, v) && check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s = v;
            n++;
        }
    }
    if (n == 1)
        return s;
    else
        return 0;
}

/* possible_solutions() -- find the possible solutions for [x, y] */

int possible_solutions(int b[9][9], int x, int y, int s[])
{
    int n = 0, v;

    for (v = 1; v < 10; v++) {
        if (check_row(b, x, y, v) && check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s[n++] = v;
        }
    }
    return n;
}

int solve(int b[9][9]);    /* defined below */

int main(int argc, char **argv)
{
    int board[9][9];
    FILE *fp;

    if (argc > 1) {
        fp = fopen(argv[1], "r");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }
    } else {
        fp = stdin;
    }

    init_board(board);
    read_board(fp, board);
    show_board(board);

    solve(board);
    return 0;
}

/* solve1() -- one pass to solve the puzzle */

int solve1(int b[9][9])
{
    int i, j;
    int solved = 0;

    /* fill every empty cell that has exactly one possible value */
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            if (b[i][j] == 0) {
                b[i][j] = unique_solution(b, i, j);
                if (b[i][j])
                    solved++;
            }
    return (solved);
}

int solve(int b[9][9])
{
    int b2[9][9], i, j, k, n;
    int ps[9], s[9], pn, x = 0, y = 0;

    /* copy the board for recursion */
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b2[i][j] = b[i][j];

    /* fill in forced values until a pass makes no progress */
    while (solve1(b2)) {
        show_board(b2);
    }

    /* find the unknown cell with the fewest possible solutions */
    pn = 10;
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++) {
            if (b2[i][j] == 0) {
                for (k = 0; k < 9; k++)
                    s[k] = 0;
                n = possible_solutions(b2, i, j, s);
                if (n < pn) {
                    pn = n;
                    for (k = 0; k < n; k++)
                        ps[k] = s[k];
                    x = i;
                    y = j;
                }
            }
        }

    if (pn == 10) /* no unknown cell left: that's it */
    {
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b2[i][j] == 0)
                    return 0;
        return 1;
    }

    /* try each candidate for that cell and recurse */
    for (i = 0; i < pn; i++) {
        b2[x][y] = ps[i];
        show_board(b2);
        if (solve(b2)) {
            return 1;
        }
    }

    return 0;
}

A Possible Track Finder?

• Choose a hit for each layer.

• Fit and calculate $\chi^2$.

• Cut on $\chi^2$.

10 layers: $O(n^{10})$

100 hits/layer.

Total possibilities: $10^{20}$.

Assume a computer checks $10^{10}$ possibilities/sec.

A year = $3 \times 10^{7}$ sec.

Total time to check all possibilities:

$10^{20} / (10^{10} \times 3 \times 10^{7}) > 300$ years

A Better Track Finder

• Choose a hit for each of layer 1 and 2.

• Choose only compatible hits on layers 3 to 10.

• Calculate $\chi^2$.

• Cut on $\chi^2$.

First constraint at layer 3: $O(n^3)$

100 hits/layer.

Total possibilities: $10^{6}$+.

Assume a computer checks $10^{10}$ possibilities/sec.

Total time to check all possibilities:

$10^{6} / 10^{10} > 0.1$ ms

Suggestion (2)

Use early constraints to reduce number of iterations.

• Evaluate the first constraint as simply as possible. (e.g. Offset, rather than $\chi^2$)

• Apply the first constraint as early as possible. (e.g. At layer 3, not until 10)
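The effect of an early constraint can be tried out on a toy model. The sketch below is not the track finder of any real experiment: it assumes straight tracks, equally spaced layers, integer hit positions, and a simple offset window `tol` as the first constraint. The type and function names are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_HITS 128   /* assumed upper bound on hits per layer */

/* Hits on one detector layer: integer positions along the layer. */
struct layer {
    int n;
    int pos[MAX_HITS];
};

/* Return 1 if the layer has a hit within +-tol of the predicted position. */
static int has_hit_near(const struct layer *ly, int predicted, int tol)
{
    int i;
    for (i = 0; i < ly->n; i++)
        if (abs(ly->pos[i] - predicted) <= tol)
            return 1;
    return 0;
}

/* Count straight-line track candidates.
 * A seed is one hit on layer 0 and one on layer 1. The seed is dropped at
 * the first later layer with no hit near the extrapolated line, so most
 * seeds are rejected at layer 2 and never reach the deeper layers.
 * *checks returns the number of layer look-ups actually performed. */
int find_tracks(const struct layer *layers, int n_layers, int tol, long *checks)
{
    int i, j, k, found = 0;

    assert(n_layers >= 3);
    *checks = 0;
    for (i = 0; i < layers[0].n; i++) {
        for (j = 0; j < layers[1].n; j++) {
            int p0 = layers[0].pos[i];
            int slope = layers[1].pos[j] - p0;   /* step per layer */
            for (k = 2; k < n_layers; k++) {
                (*checks)++;
                if (!has_hit_near(&layers[k], p0 + slope * k, tol))
                    break;                       /* early constraint fails */
            }
            if (k == n_layers)
                found++;
        }
    }
    return found;
}
```

The offset test is a subtraction and a comparison, much cheaper than a full fit, and it is applied at the third layer rather than after all layers have been combined. A full fit and cut would then be run only on the seeds that survive.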

Triplets

• Triplet:

– Data item with 2 free parameters.

– # of measurements - # of constraints = 2.

– A triplet is not necessarily a straight track segment.

– A triplet may have more than 3 measurements.

• A circular track with a known interaction point is a triplet since it has 2 free parameters. (Otherwise it has 3 parameters.)

Triplet Finding

• Triplet finding can be done in software or in firmware.

• Tiny Triplet Finder (TTF) is a firmware implementation developed for BTeV at Fermilab.

• Tiny = small silicon usage.

• For more info on TTF, see handout.

Triplet Finding

• Software processes: $O(n^3)$

• FPGA firmware functions: $O(n)$

– $O(N^2)$ implementations: CAM, Hough Trans., etc.

– $O(N \log N)$ implementation: Tiny Triplet Finder

DFT and FFT

• Why log(N)?

– Information propagation

– Multiplication reuse of rotational factors

DFT: $O(N^2)$, FFT: $O(N \log N)$

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}$$

FFT for Arbitrary Precision Multiplications

• Multiplication of two very long integers consumes $O(N^2)$ computation.

• It can be viewed as a convolution.

• Convolutions can be computed using FFT with $O(N \log N)$ computation.

$$(x * h)(n) = \sum_{j=0}^{N-1} x(n-j)\, h(j)$$

$$F(x * h) = F(x) \cdot F(h)$$

$$x * h = F^{-1}\big(F(x) \cdot F(h)\big)$$
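The view of long multiplication as a convolution can be shown directly on digit arrays. The sketch below computes the convolution the slow way, which is the step an FFT-based method would replace, and then propagates carries. The function name and the base-10, least-significant-digit-first layout are choices made for this example only.

```c
#include <assert.h>

/* Multiply two non-negative integers given as base-10 digit arrays,
 * least significant digit first.
 * Step 1 is a plain linear convolution of the digit sequences, O(na * nb).
 * Step 2 propagates carries so every output digit is 0..9.
 * out must have room for na + nb digits. Returns the digit count written. */
int bigmul(const int *a, int na, const int *b, int nb, int *out)
{
    int i, j, n = na + nb;
    long carry = 0;

    assert(na > 0 && nb > 0);
    for (i = 0; i < n; i++)
        out[i] = 0;

    /* convolution: out[k] = sum over i + j == k of a[i] * b[j] */
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            out[i + j] += a[i] * b[j];

    /* carry propagation */
    for (i = 0; i < n; i++) {
        long v = out[i] + carry;
        out[i] = (int)(v % 10);
        carry = v / 10;
    }
    assert(carry == 0);   /* the product never needs more than na + nb digits */
    return n;
}
```

For 123 x 456 the convolution gives the raw sums 18, 27, 28, 13, 4, 0, and carry propagation turns them into the digits of 56088. Only the double loop changes when an FFT is used; the carry step stays the same.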

Suggestion (3)

Take advantage of fast (like FFT) or tiny (like Tiny Triplet Finder) algorithms.

Multiplier-less (ML) Approaches

• Canonic signed digit (CSD) and sum of powers of two (SOPOT) representations:

– 5xA = 4xA + A, 248xA = 256xA - 8xA

• Recursive implementation of finite impulse response (FIR) filters:

– Sliding sum, sinc$^2$, etc.

• CORDIC or similar algorithms:

– ML FFT, rotators, etc.

• Distributed Arithmetic (DA) designs:

– Look-up tables.

• Single-bit sinc$^3$ FIR decimation filter:

– In delta-sigma ADC
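Two of the approaches in the list are short enough to sketch in C: constant multiplication by shift-and-add, using the 5xA and 248xA decompositions given above, and a recursive sliding sum. The function names are illustrative, and in an FPGA the shifts would be wiring rather than operations.

```c
#include <assert.h>
#include <stdint.h>

/* 5 * a as a sum of powers of two: 4a + a. */
uint32_t times5(uint32_t a)
{
    return (a << 2) + a;
}

/* 248 * a in signed-digit form: 256a - 8a, one subtraction instead of
 * adding the five shifted terms of 128 + 64 + 32 + 16 + 8. */
uint32_t times248(uint32_t a)
{
    return (a << 8) - (a << 3);
}

/* Recursive sliding sum of the last `len` samples:
 * y[n] = y[n-1] + x[n] - x[n-len], one add and one subtract per sample
 * regardless of len. Samples before the start are taken as zero. */
void sliding_sum(const int32_t *x, int32_t *y, int n, int len)
{
    int32_t acc = 0;
    int i;

    assert(len > 0);
    for (i = 0; i < n; i++) {
        acc += x[i];
        if (i >= len)
            acc -= x[i - len];
        y[i] = acc;
    }
}
```

A direct FIR implementation of the same sliding sum would need `len` additions per sample; the recursive form needs two operations and no multiplier.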

Least-Square (LS) Track Fitter

$$y = y_0 + h\,(z - z_0) + \eta\,(z - z_0)^2$$

• Standard least square fitting uses large amount of multiplications and possibly divisions.

[Figure: track sampled at planes (z-z0) = -4, -2, 0, +2, +4 (labels 4h, y0) and at planes (z-z0) = -3, -1, +1, +3 (labels 2h, y0)]

$$y_0 = \sum_i c_i y_i \Big/ \sum_i c_i$$

$$h = \sum_i d_i y_i \Big/ \sum_i d_i (z_i - z_0)$$

$$\eta = \sum_i e_i y_i \Big/ \sum_i e_i (z_i - z_0)^2$$

Multiplier-less (ML) Track Fitter

• The coefficients are scaled to avoid using dividers.

• The coefficients for the ML approximate fitting algorithm are “two-bit” integers. The full multiplications are replaced by two integer shift-additions.

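$$\mathrm{yy32} = \sum_i c[i]\, y[i] \approx 32\, y_0$$

$$\mathrm{hh512} = \sum_i d[i]\, y[i] \approx 512\, h$$

$$\mathrm{eta4096} = \sum_i e[i]\, y[i] \approx 4096\, \eta$$

With each coefficient split into two shift amounts:

$$\mathrm{yy32} = \sum_i \{ (y[i] \ll c_1[i]) \pm (y[i] \ll c_2[i]) \} \approx 32\, y_0$$

$$\mathrm{hh512} = \sum_i \{ (y[i] \ll d_1[i]) \pm (y[i] \ll d_2[i]) \} \approx 512\, h$$

$$\mathrm{eta4096} = \sum_i \{ (y[i] \ll e_1[i]) \pm (y[i] \ll e_2[i]) \} \approx 4096\, \eta$$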

Errors of LS and ML Track Fitters

• The errors of the ML approximate fitting algorithm are only slightly larger than the LS fitting errors.

[Figure: relative error (0.00 to 5.00) versus half-length of the track (0 to 18). Curves: eta4096 Least Square, eta4096 FPGA fitter, hh512 Least Square, hh512 FPGA fitter, yy32 Least Square, yy32 FPGA fitter.]

Errors of Several Track Fitters

• Generally speaking, more computation yields better-quality results.

• However, beyond a certain point, the quality of the results no longer improves as rapidly.

• It is common for a large amount of computation to bring only a small improvement in mathematically perfect algorithms.

[Figure: relative errors (0.00 to 20.00) versus track half length (0 to 18). Curves: 3-point, next planes; 3-point, full length; FPGA fitter; Least Square.]

Suggestion (4)

Consider resource- and power-friendly algorithms, such as multiplier-less and divider-less algorithms.

Why Save Resources?

• ?

• ?

• ?

• ?

• ?

• ?

• ?

Moore’s Law

• Number of transistors in a package: x2 / 18 months

Taken from www.intel.com

The Fever of Moore’s Law vs. Maxwell Equations

• During the fever of Moore’s law, saving computing resources became non-critical, if not impossible.

• From basic principles such as Maxwell’s equations, it was known that the fever would not last.

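$$\nabla \times \mathbf{H} = \mathbf{J} + \frac{\partial \mathbf{D}}{\partial t}$$

$$\nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}$$

$$\nabla \cdot \mathbf{D} = \rho$$

$$\nabla \cdot \mathbf{B} = 0$$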

[Figure: Op/sec versus year, 1998 to 2010. Source: MIT, 2002.]

Moore’s Law Today

• # of transistors

– Yes, via multi-core.

• Clock Speed

– ?

Taken from www.intel.com

Total Useful Work = (Clock Frequency) x (Silicon Size) x (Efficiency)

• There is much room for improvement in computational efficiency in both micro-computer software and FPGA firmware.

• Resource saving helps today, when technology stalls.

• Resource saving helps in the future, as technology progresses.

[Figure: two blocks with dimensions E (efficiency), F (clock frequency) and S (silicon size)]

Resource Saving Helps Future

• Today’s subroutines or FPGA blocks will be reused thousands of times in the future:

– If today’s design is slightly too slow, too big…

• Today’s students, as well as older people, gain experience from today’s work and become bosses, reviewers, etc. in the future:

– The “experience” (?)

– E.g.: Is a wedding with a $20K budget possible? (Given the “experience” of $1000/pizza?)

The End

Thanks

Triplet Finding

• Three layers of nested loops are needed if the process is implemented in software.

• A total of $n^3$ combinations must be checked (e.g. 5x5x5=125).

• In FPGA, to “unroll” 2 layers of loops, large silicon resource may be needed without careful planning: $O(N^2)$

[Figure: hits on Plane A, Plane B and Plane C]

for (i=0; i<N_A; i++) {
    for (j=0; j<N_B; j++) {
        for (k=0; k<N_C; k++) {
        }
    }
}

Circular Tracks from Collision Point on Cylindrical Detectors

• For a given hit on layer 3, a coincidence between a layer 2 hit and a layer 1 hit that satisfies the coincidence map signifies a valid circular track.

• A track segment has 2 free parameters, i.e., it is a triplet.

• The coincidence map is invariant under rotation.

[Figure: left, detector hit plot with both axes 0 to 100; right, coincidence map with axes (φ1-φ3)+64 and (φ2-φ3)+64, each 0 to 128]

Tiny Triplet Finder

Reuse Coincidence Logic via Shifting Hit Patterns

[Figure: detector layers C1, C2, C3]

One set of coincidence logic is implemented.

For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.

Tiny Triplet Finder for Circular Tracks

[Figure: block diagram. Inputs scaled by *R1/R3 and *R2/R3 fill two bit arrays, each followed by a shifter; the shifted patterns feed the bit-wise coincidence logic (triplet map, axes 0 to 128), whose output goes to the decoder.]

1. Fill the C1 and C2 bit arrays. (n1 clock cycles)

2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)

Also works with more than 3 layers
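The fill-then-shift idea in the two steps above can be imitated in software with 64-bit words standing in for the bit arrays. This is a toy, not the BTeV firmware: it assumes straight tracks on three equally spaced planes (so a valid triplet satisfies a - b = b - c), hit positions 0 to 63, and the loop runs over the middle plane. All names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of a 64-bit word (bit k <-> bit 63-k). */
static uint64_t reverse64(uint64_t v)
{
    uint64_t r = 0;
    int k;
    for (k = 0; k < 64; k++)
        if (v & ((uint64_t)1 << k))
            r |= (uint64_t)1 << (63 - k);
    return r;
}

/* Count the set bits of a 64-bit word. */
static int popcount64(uint64_t v)
{
    int n = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Count straight-line triplets (a, b, c) with a - b == b - c.
 * plane_a and plane_c are bit arrays (bit p set = hit at position p);
 * hits_b lists the plane-B hits. One shift-and-AND per B hit replaces
 * the two inner loops over the A and C hits. */
int count_triplets(uint64_t plane_a, uint64_t plane_c,
                   const int *hits_b, int n_b)
{
    uint64_t rev_a = reverse64(plane_a);
    uint64_t rev_c = reverse64(plane_c);
    int i, total = 0;

    for (i = 0; i < n_b; i++) {
        int b = hits_b[i];
        uint64_t up, down;

        assert(b >= 0 && b < 64);
        /* bit d of up:   hit at a = b + d and hit at c = b - d */
        up = (plane_a >> b) & (rev_c >> (63 - b));
        /* bit d of down: hit at c = b + d and hit at a = b - d */
        down = (plane_c >> b) & (rev_a >> (63 - b));
        down &= ~(uint64_t)1;   /* d = 0 is already counted in up */
        total += popcount64(up) + popcount64(down);
    }
    return total;
}
```

The work per B hit is a fixed number of word operations, so the total time grows with the number of hits rather than with its cube, while a single set of AND gates is reused for every hit.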
