Resource Saving in Micro-Computer Software &
FPGA Firmware Designs
Wu, Jinyuan
Fermilab
Nov. 2006
Resource Saving in FPGA
From: “CompactFPGAdesign.pdf”
• Glue Logic
• Digitization
– TDC, (ADC), etc.
• Communication
– C5, Digital Phase Follower, etc.
• Data Organization
– Zero-Suppression, Parasitic Event Building, etc.
• Reconfigurable Computing
– Hash Sorter, TTF, ELMS, etc.
Software -- Firmware
Computer Is Fast
• This is the first impression of many beginners.
• “FPGA is big.”
• Program Creation Time > Execution Time
How to Slow Down Computers?
• Single Layer Loop:
– 256 x 3 x 4 x 0.25 us ≈ 0.77 ms
• Nested Loops:
– 256 x 0.77 ms ≈ 0.2 s
[Figure: a 4 MHz Z80 CPU used as a square-wave generator with period T]
“LD A,A” = “NOOP”
1 NOOP spends 1 us
1,000,000 NOOPs spend 1 s

Single loop:
        LD A,#255
BACKA:  NOOP
        DEC A
        JP NZ, BACKA

Nested loops:
        LD B,#255
BACKB:  LD A,#255
BACKA:  NOOP
        DEC A
        JP NZ, BACKA
        LD A,B
        DEC B
        DEC A
        JP NZ, BACKB
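The delay estimate above can be checked with a few lines of C. This is only a sketch of the slide's arithmetic: the function names are made up for illustration, and it assumes, as the estimate does, that every instruction takes 4 clock periods, which is not exact Z80 timing.

```c
#include <assert.h>

/* Estimated run time, in microseconds, of a counted delay loop.
 * Assumption (taken from the slide's estimate): every instruction
 * takes the same number of clock periods. */
double loop_time_us(long passes, int instr_per_pass,
                    int clocks_per_instr, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)passes * instr_per_pass * clocks_per_instr / clock_mhz;
}

/* Single loop: 256 passes of NOOP / DEC A / JP NZ on a 4 MHz CPU. */
double single_loop_us(void)
{
    return loop_time_us(256, 3, 4, 4.0);
}

/* Nested loops: the inner loop repeated 256 times
 * (the few outer-loop instructions are ignored). */
double nested_loop_us(void)
{
    return 256.0 * single_loop_us();
}
```

With these assumptions the single loop takes 768 us and the nested loops take about 0.2 s, which is how a fast CPU is slowed down to audible frequencies.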
Knowing Slow, Knowing Fast
Where Resources Can Be Saved
• For micro-computer software:
– Pay attention to loops and frequently called subroutines,
– Especially inner-most nested loops.
• For FPGA firmware:
– Algorithms rooted in micro-computer software.
– Reusable blocks.
– Occasionally used functions.
Example: Inner-Product
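• Avoid using conditional branch for loop control: ELMS
– Saves 25% execution time in this case: 2 of the 8 instructions in each pass (DEC R1 and BRNZ) only control the loop.

        LD R1, #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
        DEC R1
        BRNZ BckA1

$$Y = \sum_{i=0}^{n} a_i X_i$$

[Figure: data path of the inner product. R2++ and R3++ address a and X, R4 and R5 hold the two operands, the multiplier output R6 is added into R7, and R1-- counts the passes.]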
• Multiplier-less algorithms.
• Reuse computations: Using fast algorithms like FFT.
• Avoid entering the loop: Using early constraints.
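The 25% figure for the inner-product loop can be reproduced by counting instructions. The sketch below is an illustration, not the author's code: the function names are hypothetical, and it assumes one clock cycle per instruction and 6 working instructions per pass (two LD, two INC, MUL, ADD).

```c
#include <assert.h>
#include <stddef.h>

/* Inner product: the sum of a[i] * X[i] over n element pairs. */
long inner_product(const int *a, const int *X, size_t n)
{
    long y = 0;                     /* plays the role of R7 */
    for (size_t i = 0; i < n; i++)
        y += (long)a[i] * X[i];     /* MUL followed by ADD */
    return y;
}

/* Instructions executed inside the loop for n passes.
 * control_per_pass is 2 with DEC + BRNZ and 0 when the loop is
 * controlled by hardware outside the instruction stream. */
unsigned long loop_instructions(unsigned long n, unsigned work_per_pass,
                                unsigned control_per_pass)
{
    return n * (work_per_pass + control_per_pass);
}

/* Percentage of loop time saved by dropping the two control instructions. */
double percent_saved(unsigned long n)
{
    assert(n > 0);
    unsigned long with_branch = loop_instructions(n, 6, 2);
    unsigned long without     = loop_instructions(n, 6, 0);
    return 100.0 * (double)(with_branch - without) / (double)with_branch;
}
```

The saving is independent of n because the control overhead is paid once per pass.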
Computing Module in Micro-processor & FPGA
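• Micro-processors use a fully sequenced approach: one operation is performed in each clock cycle.
• In FPGA, flattened logic is allowed and is fast, but it takes a large silicon area.
(100+3-4)*5+7 = ?
[Figure: the expression drawn as flattened logic with inputs 100, 3, 4, 5, 7, and as a sequenced computing module driven by control and data words]
Control: LD (+) (-) (*) (+)
Data: 100, 3, 4, 5, 7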
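The two styles can be contrasted in C. This is a minimal sketch under assumptions: the opcode names and the `step_t` type are invented for illustration, and "flattened" is represented simply by a single expression that hardware could map onto four separate operators.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_LD, OP_ADD, OP_SUB, OP_MUL } op_t;

/* One control word and one data word per clock cycle. */
typedef struct { op_t op; long data; } step_t;

/* Fully sequenced evaluation: one accumulator and one ALU are
 * reused in every step, so n_steps clock cycles are needed. */
long run_sequence(const step_t *prog, size_t n_steps)
{
    assert(prog != NULL || n_steps == 0);
    long acc = 0;
    for (size_t i = 0; i < n_steps; i++) {
        switch (prog[i].op) {
        case OP_LD:  acc  = prog[i].data; break;
        case OP_ADD: acc += prog[i].data; break;
        case OP_SUB: acc -= prog[i].data; break;
        case OP_MUL: acc *= prog[i].data; break;
        }
    }
    return acc;
}

/* The slide's example as a 5-step program. */
long sequenced_example(void)
{
    static const step_t prog[] = {
        { OP_LD, 100 }, { OP_ADD, 3 }, { OP_SUB, 4 },
        { OP_MUL, 5 },  { OP_ADD, 7 }
    };
    return run_sequence(prog, sizeof prog / sizeof prog[0]);
}

/* Flattened form: every operator exists separately, so the whole
 * expression could be evaluated in a single pass through the logic. */
long flattened_example(void)
{
    return (100 + 3 - 4) * 5 + 7;
}
```

Both forms give 502. The sequenced form spends 5 steps on one shared ALU, while the flattened form spends one step on four operators.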
Sequencing in FPGA for Resource Control
• Sequencing is a very efficient means of resource control in FPGA.
• Reuse processing resource for similar function and/or different channels.
• Pay attention to occasionally-used functions like initialization.
[Figure: two layouts of the Initialization function and the Sum1 to Sum4 steps for channels CH0 to CH3, labeled Initialization1 and Initialization2]
Suggestion (1)
Use partially flattened and partially sequential logic to balance speed and size.
ELMS: Enclosed Loop Micro-Sequencer
• A PC+ROM structure can be a very good sequencer in FPGA.
• The Conditional Branch Logic is added to support regular conditional branch as in micro-processors.
• The Loop & Return Logic + Stack are added to support FOR loops with pre-defined iterations at machine code level.
• The resource usage of ELMS in FPGA is very small.
[Figure: a basic sequencer made of a Program Counter and a ROM (128 x 36 bits) with Reset and CLK inputs and Control Signals outputs, and the same sequencer with Conditional Branch Logic and Loop & Return Logic + Stack added]

        FOR BckA1 EndA1 #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
ELMS: Detailed Block Diagram
[Figure: detailed block diagram. Blocks: ROM (128 x 36 bits), PC with +1 increment and Reset, Cond JMP, Compare, and Loop & Return Registers + Stack (128 words) holding CNT, endA and bckA with Push/Pop. Signals: LoopBack, DEC, RTN, LastPass, JMP, JMPIF, desA and User Control Signals. Other labels: 0x04, RUNat04, cnt, EndA, BckA.]
LoopBack = DEC = (PC==endA) && (CNT!=0)
LastPass = (PC==endA) && (CNT==1)
FOR Loops at Machine Code Level
• The looping sequence is known in this example before entering the loop.
• A regular micro-processor treats the sequence as unknown.
• ELMS supports FOR loops with pre-defined iterations at machine code level.

Regular micro-processor:
        LD R1, #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
        DEC R1
        BRNZ BckA1

ELMS:
        FOR BckA1 EndA1 #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6

$$Y = \sum_{i=0}^{n} a_i X_i$$
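A FOR loop with a pre-defined count can be modeled as a program counter plus three loop registers. The C sketch below is a simplified model, not the ELMS design itself: the register names follow the slides, but the exit test (`cnt > 1`), the one-cycle-per-instruction timing and the function names are assumptions made for illustration.

```c
#include <assert.h>

/* Loop registers of a simplified sequencer (names follow the slides). */
typedef struct {
    int bckA;   /* address of the first instruction of the loop body */
    int endA;   /* address of the last instruction of the loop body  */
    int cnt;    /* passes still to run, loaded by the FOR instruction */
} loop_regs_t;

/* Run a program of prog_len one-cycle instructions containing one FOR
 * loop and return the number of clock cycles used.
 * Assumption: the loop-back decision is made in parallel with the
 * instruction at endA, so it costs no extra cycle. */
long run_for_loop(int prog_len, loop_regs_t lr)
{
    assert(lr.cnt >= 1);
    assert(0 <= lr.bckA && lr.bckA <= lr.endA && lr.endA < prog_len);
    long cycles = 0;
    int pc = 0;
    while (pc < prog_len) {
        cycles++;                       /* one instruction per clock */
        if (pc == lr.endA && lr.cnt > 1) {
            lr.cnt--;                   /* count down the remaining passes */
            pc = lr.bckA;               /* loop back to the body start */
        } else {
            pc++;                       /* normal PC + 1 */
        }
    }
    return cycles;
}

/* The same loop on a regular processor: DEC and BRNZ add two
 * instructions to the end of every pass. */
long run_branch_loop(int setup_len, int body_len, int n)
{
    return (long)setup_len + (long)n * (body_len + 2);
}
```

For the inner-product listing, the FOR version has 10 instructions with the body at addresses 4 to 9, so 100 passes take 4 + 6 x 100 = 604 cycles in this model, against 4 + 8 x 100 = 804 cycles with DEC and BRNZ.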
Suggestion (2)
Eliminate unnecessary instructions, functions, time slots, etc. whenever it is possible.
Do You SUDOKU?
• Fill in 1-9 so that:
– Each column contains 1-9 without repeating.
– Each row contains 1-9 without repeating.
– Each 3x3 box contains 1-9 without repeating.
• It is fun to solve by hand.
• It is also fun to write a solver program, or read a good one.
[Figure: a 9x9 SUDOKU puzzle with some boxes already filled in]
A Possible SUDOKU Solver?
• For all empty boxes, assign 1-9 to each.
• Check correct or not.
• If not, repeat.
[Figure: the SUDOKU puzzle]
81-28=53 empty boxes.
9 possibilities for each box.
Total possibilities: 9^53.
Assume a computer checks 10^10 possibilities/sec.
A year = 3x10^7 sec.
Total time to solve:
9^53 / (10^10 x 3x10^7) >> 1000 years
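The size of the brute-force search can be worked out directly. This short C sketch only repeats the slide's arithmetic; the function name is made up, and the checking rate and the 3x10^7 seconds per year are the slide's round figures.

```c
#include <assert.h>

/* Years needed to try every assignment of 1-9 to the empty boxes
 * at a given checking rate, using 3e7 seconds per year. */
double brute_force_years(int empty_boxes, double checks_per_sec)
{
    assert(empty_boxes >= 0 && checks_per_sec > 0.0);
    double candidates = 1.0;
    for (int i = 0; i < empty_boxes; i++)
        candidates *= 9.0;              /* 9 choices per empty box */
    return candidates / (checks_per_sec * 3e7);
}
```

For 53 empty boxes and 10^10 checks per second the result is on the order of 10^33 years, far beyond the ">> 1000 years" stated on the slide.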
A Real SUDOKU Solver
• Eliminate impossible values for each empty box.
• Assign a possible value to the box.
• Repeat.
[Figure: the SUDOKU puzzle]
Total time to solve: < 1 sec
sudoku.c

#include <stdio.h>

/* show_board() -- print the board, leaving empty boxes blank */
void show_board(int b[9][9])
{
    int i, j;

    printf("+-------+-------+-------+\n");
    for (i = 0; i < 9; i++) {
        printf("|");
        for (j = 0; j < 9; j++) {
            if (b[i][j] == 0)
                printf("  ");
            else
                printf(" %d", b[i][j]);
            if (j % 3 == 2)
                printf(" |");
        }
        printf("\n");
        if (i % 3 == 2)
            printf("+-------+-------+-------+\n");
    }
}

/* init_board() -- initialize the board with all 0 */
void init_board(int b[9][9])
{
    int i, j;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b[i][j] = 0;
}

/* read_board() -- read the board from input file:
   one text line per row, a digit for a given box, a blank for an empty one */
void read_board(FILE *fp, int b[9][9])
{
    int i, j, c;

    i = 0;
    j = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            i++;
            j = 0;
        } else {
            /* store digits only, and never write outside the board */
            if (i < 9 && j < 9 && c >= '1' && c <= '9')
                b[i][j] = c - '0';
            j++;
        }
    }
}
/* check_row() -- return v if v does not yet appear elsewhere in row x */
int check_row(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != y)
            if (b[x][i] == v)
                return 0;
    return v;
}

/* check_column() -- return v if v does not yet appear elsewhere in column y */
int check_column(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != x)
            if (b[i][y] == v)
                return 0;
    return v;
}

/* check_square() -- return v if v does not yet appear elsewhere in the 3x3 box */
int check_square(int b[9][9], int x, int y, int v)
{
    int i, j, x0, y0;

    x0 = x / 3;
    y0 = y / 3;

    for (i = x0 * 3; i < x0 * 3 + 3; i++)
        for (j = y0 * 3; j < y0 * 3 + 3; j++)
            if (!((x == i) && (y == j)))
                if (b[i][j] == v)
                    return 0;
    return v;
}

/* unique_solution() -- find the unique solution for [x, y], or 0 if none */
int unique_solution(int b[9][9], int x, int y)
{
    int s = 0, n = 0, v;

    for (v = 1; v < 10; v++) {
        if (check_row(b, x, y, v) &&
            check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s = v;
            n++;
        }
    }
    if (n == 1)
        return s;
    else
        return 0;
}

/* possible_solutions() -- find the possible solutions for [x, y] */
int possible_solutions(int b[9][9], int x, int y, int s[])
{
    int n = 0, v;

    for (v = 1; v < 10; v++) {
        if (check_row(b, x, y, v) &&
            check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s[n++] = v;
        }
    }
    return n;
}
int solve(int b[9][9]);   /* defined below, used by main() */

int main(int argc, char **argv)
{
    int board[9][9];
    FILE *fp;

    if (argc > 1) {
        fp = fopen(argv[1], "r");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }
    } else {
        fp = stdin;
    }

    init_board(board);
    read_board(fp, board);
    show_board(board);

    solve(board);
    return 0;
}

/* solve1() -- one pass to solve the puzzle; returns the number of boxes filled */
int solve1(int b[9][9])
{
    int i, j;
    int solved = 0;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            if (b[i][j] == 0) {
                b[i][j] = unique_solution(b, i, j);
                if (b[i][j])
                    solved++;
            }
    return (solved);
}

/* solve() -- fill forced boxes, then guess on the box with the fewest choices */
int solve(int b[9][9])
{
    int b2[9][9], i, j, k, n;
    int ps[9], s[9], pn, x = 0, y = 0;

    /* copy the board for recursion */
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b2[i][j] = b[i][j];

    while (solve1(b2)) {
        show_board(b2);
    }

    /* figure out possible solution for unknown */
    pn = 10;
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++) {
            if (b2[i][j] == 0) {
                for (k = 0; k < 9; k++)
                    s[k] = 0;
                n = possible_solutions(b2, i, j, s);
                if (n < pn) {
                    pn = n;
                    for (k = 0; k < n; k++)
                        ps[k] = s[k];
                    x = i;
                    y = j;
                }
            }
        }

    if (pn == 10) /* that's it: no empty box is left */
    {
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b2[i][j] == 0)
                    return 0;
        return 1;
    }

    /* try each candidate; pn == 0 means a dead end and falls through */
    for (i = 0; i < pn; i++) {
        b2[x][y] = ps[i];
        show_board(b2);
        if (solve(b2)) {
            return 1;
        }
    }

    return 0;
}
A Possible Track Finder?
• Choose a hit for each layer.
• Fit and calculate χ².
• Cut on χ².
10 layers: O(n^10).
100 hits/layer.
Total possibilities: 10^20.
Assume a computer checks 10^10 possibilities/sec.
A year = 3x10^7 sec.
Total time to check all possibilities:
10^20 / (10^10 x 3x10^7) > 300 years
A Better Track Finder
• Choose a hit for each of layer 1 and 2.
• Choose only compatible hits on layers 3 to 10.
• Calculate χ².
• Cut on χ².
First constraint at layer 3: O(n^3).
100 hits/layer.
Total possibilities: 10^6+.
Assume a computer checks 10^10 possibilities/sec.
Total time to check all possibilities:
10^6 / (10^10) > 0.1 ms
Suggestion (2)
Use early constraints to reduce the number of iterations.
• Evaluate the first constraint as simply as possible. (e.g. Offset, rather than χ²)
• Apply the first constraint as early as possible. (e.g. At layer 3, not until 10)
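The effect of an early constraint can be seen on a toy detector. The C sketch below is an illustration under assumptions: straight tracks, equally spaced layers, integer hit positions, and a hypothetical detector of 5 layers with 4 hits each. The cut is a simple offset from a straight-line extrapolation, standing in for a cheap first constraint.

```c
#include <assert.h>
#include <stdlib.h>

#define N_LAYERS 5      /* hypothetical small detector */
#define N_HITS   4      /* hits per layer */

/* Count the hit combinations that reach the last layer.
 * With window < 0 no constraint is applied (brute force). Otherwise a
 * hit on the third or a later layer is kept only if it lies within
 * `window` of the line extrapolated from the two previous hits. */
static long extend(int hits[N_LAYERS][N_HITS], int layer,
                   int prev2, int prev1, int window)
{
    if (layer == N_LAYERS)
        return 1;                           /* a full-length candidate */
    long count = 0;
    int predicted = 2 * prev1 - prev2;      /* equal layer spacing assumed */
    for (int i = 0; i < N_HITS; i++) {
        int h = hits[layer][i];
        if (window >= 0 && abs(h - predicted) > window)
            continue;                       /* rejected early */
        count += extend(hits, layer + 1, prev1, h, window);
    }
    return count;
}

long count_candidates(int hits[N_LAYERS][N_HITS], int window)
{
    long count = 0;
    for (int i = 0; i < N_HITS; i++)        /* first layer: free choice */
        for (int j = 0; j < N_HITS; j++)    /* second layer: free choice */
            count += extend(hits, 2, hits[0][i], hits[1][j], window);
    return count;
}
```

Without the cut all 4^5 = 1024 combinations reach the final fit. With the cut, a wrong pair of hits on the first two layers is dropped at the third layer and never reaches the inner loops.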
Triplets
• Triplet:
– Data item with 2 free parameters.
– # of measurements - # of constraints = 2.
– A triplet is not necessarily a straight track segment.
– A triplet may have more than 3 measurements.
• A circular track with a known interaction point is a triplet since it has 2 free parameters. (Otherwise it has 3 parameters.)
Triplet Finding
• Triplet finding can be done in software or in firmware.
• Tiny Triplet Finder (TTF) is a firmware implementation developed at Fermilab for BTeV.
• Tiny = small silicon usage.
• For more info on TTF, see handout.
Triplet Finding
[Figure: triplet finding options. Software processes: O(n^3). FPGA firmware functions: O(n), built either as O(N^2) implementations (CAM, Hough Trans., etc.) or as an O(N*log(N)) implementation (Tiny Triplet Finder).]
DFT and FFT
• Why log(N)?
– Information propagation
– Multiplication reuse of rotational factors
DFT: O(N^2)   FFT: O(N*log(N))
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N} = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$
FFT for Arbitrary Precision Multiplications
• Multiplication of two very long integers consumes O(N^2) computation.
• It can be viewed as a convolution.
• Convolutions can be computed using FFT with O(N*log(N)) computation.
$$(x*h)(n) = \sum_{j=0}^{N-1} x(n-j)\, h(j)$$
$$F(x*h) = F(x)\, F(h)$$
$$x*h = F^{-1}\big(F(x)\, F(h)\big)$$
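Why long multiplication is a convolution can be seen from the schoolbook method written on digit arrays. The C sketch below deliberately uses the direct O(N^2) convolution, not an FFT. The function name and the base-10, least-significant-digit-first layout are choices made for illustration. An FFT-based multiplier would replace only the double loop.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two non-negative integers stored as base-10 digit arrays,
 * least significant digit first. out must have room for nx + nh digits.
 * Suitable for moderate lengths only: the column sums are kept in int. */
void multiply_digits(const int *x, size_t nx,
                     const int *h, size_t nh, int *out)
{
    assert(nx > 0 && nh > 0);
    for (size_t n = 0; n < nx + nh; n++)
        out[n] = 0;

    /* Convolution: out[n] collects every product x[i] * h[j] with i + j = n. */
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j < nh; j++)
            out[i + j] += x[i] * h[j];

    /* Carry propagation turns the convolution sums back into digits. */
    int carry = 0;
    for (size_t n = 0; n < nx + nh; n++) {
        int v = out[n] + carry;
        out[n] = v % 10;
        carry  = v / 10;
    }
}
```

For example, the digit arrays {3, 2, 1} and {6, 5, 4} stand for 123 and 456, and the result array reads 56088 from its most significant digit down.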
Suggestion (3)
Take advantage of fast (like FFT) or tiny (like Tiny Triplet Finder) algorithms.
Multiplier-less (ML) Approaches
• Canonic signed digit (CSD) and sum of powers of two (SOPOT) representations:
– 5xA = 4xA + A, 248xA = 256xA - 8xA
• Recursive implementation of finite impulse response (FIR) filter:
– Sliding sum, sinc^2, etc.
• CORDIC or similar algorithms:
– ML FFT, rotators, etc.
• Distributed Arithmetic (DA) designs:
– Look-up tables.
• Single-bit sinc^3 FIR decimation filter
– In delta-sigma ADC
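The first two items in the list can be written out in C. This is a minimal sketch with made-up function names: the constant multiplications use the slide's two decompositions, and the sliding sum is the simplest recursive FIR filter, where each output reuses the previous one.

```c
#include <assert.h>
#include <stddef.h>

/* 5 x A = 4 x A + A: one shift and one add. */
unsigned long times5(unsigned long a)
{
    return (a << 2) + a;
}

/* 248 x A = 256 x A - 8 x A: two shifts and one subtract. */
unsigned long times248(unsigned long a)
{
    return (a << 8) - (a << 3);
}

/* Recursive sliding sum over `width` samples: each output is the
 * previous sum plus the newest sample minus the sample that leaves
 * the window, so neither a multiplier nor an inner loop is needed. */
void sliding_sum(const long *in, long *out, size_t n, size_t width)
{
    assert(width > 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += in[i];
        if (i >= width)
            sum -= in[i - width];
        out[i] = sum;
    }
}
```

In an FPGA the shifts are only wiring, so each constant multiplication costs one adder or subtractor.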
Least-Square (LS) Track Fitter
$$y = y_0 + h\,(z - z_0) + \eta\,(z - z_0)^2$$
• Standard least-square fitting uses a large number of multiplications and possibly divisions.
[Figure: measurement planes at (z-z0) = -4, -2, 0, +2, +4 and at (z-z0) = -3, -1, +1, +3, with y0, 2h and 4h marked]
$$y_0 = \frac{\sum_i c_i y_i}{\sum_i c_i}, \qquad h = \frac{\sum_i d_i y_i}{\sum_i d_i (z_i - z_0)}, \qquad \eta = \frac{\sum_i e_i y_i}{\sum_i e_i (z_i - z_0)^2}$$
Multiplier-less (ML) Track Fitter
• The coefficients are scaled to avoid using dividers.
• The coefficients for the ML approximate fitting algorithm are “two-bit” integers. The full multiplications are replaced by two integer shift-additions.
$$\mathrm{yy32} = \sum_i c[i]\, y[i] \approx 32\, y_0$$
$$\mathrm{hh512} = \sum_i d[i]\, y[i] \approx 512\, h$$
$$\mathrm{eta4096} = \sum_i e[i]\, y[i] \approx 4096\, \eta$$
$$\mathrm{yy32} = \sum_i \{(y[i] \ll c_1[i]) + (y[i] \ll c_2[i])\} \approx 32\, y_0$$
$$\mathrm{hh512} = \sum_i \{(y[i] \ll d_1[i]) + (y[i] \ll d_2[i])\} \approx 512\, h$$
$$\mathrm{eta4096} = \sum_i \{(y[i] \ll e_1[i]) + (y[i] \ll e_2[i])\} \approx 4096\, \eta$$
Errors of LS and ML Track Fitters
• The errors of the ML approximate fitting algorithm are only slightly larger than the LS fitting errors.
[Figure: relative error (0 to 5) versus half-length of the track (0 to 18) for eta4096, hh512 and yy32, each computed with the least-square fitter and with the FPGA fitter]
Errors of Several Track Fitters
• Generally speaking, more computation yields better quality results.
• However, after a certain point, the quality of the results does not improve as rapidly as before.
• It is common that a large amount of computation brings only a small improvement in mathematically perfect algorithms.
[Figure: relative errors (0 to 20) versus track half length (0 to 18) for four fitters: 3-point, next planes; 3-point, full length; FPGA fitter; Least Square]
Suggestion (4)
Consider resource/power friendly algorithms such as multiplier-less, divider-less algorithms.
Why Saving Resource
• ?
• ?
• ?
• ?
• ?
• ?
• ?
Moore’s Law
• Number of transistors in a package:
x2 /18months
Taken from www.intel.com
The Fever of Moore’s Law vs. Maxwell Equations
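• During the fever of Moore’s law, saving computing resources became non-critical, if not impossible.
• From basic principles like the Maxwell Equations, it was known the fever would not last.
$$\nabla \times \mathbf{H} = \mathbf{J} + \frac{\partial \mathbf{D}}{\partial t}, \quad \nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \quad \nabla \cdot \mathbf{B} = 0, \quad \nabla \cdot \mathbf{D} = \rho$$
[Figure: Op/sec versus year, 1998 to 2010 (MIT, 2002)]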
Moore’s Law Today
• # of transistors
– Yes, via multi-core.
• Clock Speed
– ?
Taken from www.intel.com
Total Useful Work = (Clock Frequency) x (Silicon Size) x (Efficiency)
• There is big room for improvement in computation efficiency in both micro-computer software and FPGA firmware.
• Resource saving helps today when technology stalls.
• Resource saving helps in the future when technology progresses.
[Figure: two diagrams with sides labeled E, F and S (Efficiency, Frequency, Size)]
Resource Saving Helps Future
• Today’s subroutines or FPGA blocks are to be reused thousands of times in the future:
– If today’s design is slightly too slow, too big…
• Today’s students as well as old people gain experience from today’s work and become bosses, reviewers, etc. in the future:
– The “experience” (?)
– E. g.: Is a wedding with $20K budget possible? (Given the “experience” of $1000/pizza?).
The End
Thanks
Triplet Finding
• Three layers of nested loops are needed if the process is implemented in software.
• A total of n^3 combinations must be checked (e.g. 5x5x5=125).
• In FPGA, to “unroll” 2 layers of loops, large silicon resource may be needed without careful planning: O(N^2)
[Figure: hits on Plane A, Plane B and Plane C]
for (i=0; i<N_A; i++){
    for (j=0; j<N_B; j++){
        for (k=0; k<N_C; k++){
            /* check whether hits i, j and k form a triplet */
        }
    }
}
Circular Tracks from Collision Point on Cylindrical Detectors
• For a given hit on layer 3, the coincidence between a layer 2 and a layer 1 hit satisfying the coincidence map signifies a valid circular track.
• A track segment has 2 free parameters, i.e., it is a triplet.
• The coincidence map is invariant under rotation.
[Figure: a plot with both axes from 0 to 100, and the coincidence map with axes (φ1-φ3)+64 and (φ2-φ3)+64, each from 0 to 128]
Tiny Triplet Finder
Reuse Coincident Logic via Shifting Hit Patterns
[Figure: three detector layers C1, C2 and C3]
One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.
Tiny Triplet Finder for Circular Tracks
[Figure: block diagram. The C1 and C2 bit arrays, scaled by *R1/R3 and *R2/R3, each feed a shifter. The shifted patterns enter the bit-wise coincident logic built from the triplet map (axes 0 to 128), whose output goes to a decoder.]
1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)
Also works with more than 3 layers
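The two steps can be imitated in software with one machine word per bit array. The C sketch below is a toy model, not the TTF firmware: the 64-bin size, the function names and the coincidence map (phi1 - phi3 = 2k and phi2 - phi3 = k for curvature index k, as for equally spaced layers) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define N_BINS 64   /* phi bins per layer, one bit per bin (assumed size) */
#define MAX_K  4    /* largest curvature index in the toy coincidence map */

/* Rotate a 64-bin hit pattern so that bin `p` moves to bin 0. */
static uint64_t rotate_to_zero(uint64_t bits, unsigned p)
{
    p %= N_BINS;
    return p ? (bits >> p) | (bits << (N_BINS - p)) : bits;
}

/* Test bin `b`; negative bins wrap around the cylinder. */
static int bin_set(uint64_t bits, int b)
{
    unsigned idx = (unsigned)(b + N_BINS) % N_BINS;
    return (int)((bits >> idx) & 1u);
}

/* Count triplets for the C3 hits listed in c3_hits[].
 * c1 and c2 are the bit arrays filled in step 1. */
int count_triplets(uint64_t c1, uint64_t c2,
                   const unsigned *c3_hits, int n3)
{
    assert(n3 >= 0);
    int found = 0;
    for (int i = 0; i < n3; i++) {              /* step 2: one pass per C3 hit */
        uint64_t r1 = rotate_to_zero(c1, c3_hits[i]);
        uint64_t r2 = rotate_to_zero(c2, c3_hits[i]);
        /* In firmware these tests would be one fixed block of AND gates
         * working in parallel, so each C3 hit costs a single clock cycle. */
        for (int k = -MAX_K; k <= MAX_K; k++)
            found += bin_set(r1, 2 * k) && bin_set(r2, k);
    }
    return found;
}
```

Because the map is the same after every rotation, only one copy of the coincidence logic is needed, and the work grows with the number of hits rather than with the number of hit combinations.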