A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion


Professor Shiyan Hu, Ph.D.

Department of Electrical and Computer Engineering

Michigan Technological University

Moore’s law


Twice the number of transistors, approximately every two years

Interconnect Delay Dominates Gate Delay


Technology Scaling

[Figure: 130nm versus 65nm]

• Global interconnect lengths do not shrink
• Local interconnect lengths shrink
• Delay ∝ RC
• Resistance R = rL/S, where the wire cross-section S is reduced
• Capacitance C changes only slightly

Interconnect Delay Scaling

• Scaling factor s = 0.7 per generation
• Elmore delay of a wire of length l: $t_{int} = (rl)(cl)/2 = rcl^2/2$ (first order)
• Local interconnects: $t_{int} = (r/s^2)(c)(ls)^2/2 = rcl^2/2$
– Local interconnect delay is roughly unchanged
• Global interconnects: $t_{int} = (r/s^2)(c)(l)^2/2 = rcl^2/(2s^2) \approx rcl^2$, since $s^2 \approx 0.5$
– Global interconnect delay doubles, which is unsustainable
• Interconnect delay becomes increasingly dominant
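As a quick numerical check of the scaling argument, the sketch below evaluates the first-order Elmore delay rcl²/2 before and after one generation with s = 0.7. The unit values r = c = l = 1 and the function name `wire_delay` are illustrative assumptions, not process data from the slides.

```python
def wire_delay(r, c, l):
    """First-order Elmore delay of a wire of length l: r*c*l^2 / 2."""
    return r * c * l * l / 2.0

s = 0.7                  # scaling factor per generation (from the slide)
r, c, l = 1.0, 1.0, 1.0  # assumed unit values, for illustration only

base = wire_delay(r, c, l)
# Resistance per unit length grows as r/s^2; capacitance per unit length
# is treated as unchanged, following the slide.
local_scaled = wire_delay(r / s**2, c, l * s)  # local wires shrink with s
global_scaled = wire_delay(r / s**2, c, l)     # global wires keep their length

local_ratio = local_scaled / base
global_ratio = global_scaled / base
print(f"local delay ratio:  {local_ratio:.2f}")
print(f"global delay ratio: {global_ratio:.2f}")
```

Under these assumptions the local ratio stays at about 1 while the global ratio is 1/s², roughly 2, which is the doubling the slide refers to.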

Timing Driven Buffer Insertion

6

Buffers Reduce RC Wire Delay

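[Figure: a wire of length x between a driver with resistance R and a load with capacitance C, shown with and without a buffer at x/2; each half-wire is modeled with resistance rx/2 and capacitances cx/4 at both ends. A plot shows ∆t versus x.]

$\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$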

Intuitive Analysis

• The delay of a wire of length L is $T = rcL^2/2$ (interconnect Elmore delay)
• With buffers at spacing $l = 2$, there are $L/2$ buffers, so

$$\text{Interconnect delay} = \frac{L}{2}\cdot\frac{1}{2}rc\,l^2 = \frac{L}{2}\cdot\frac{1}{2}rc\cdot 2^2 = rcL$$

(Of course, we need to consider buffer delay)

Detailed Analysis

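Assume N identical buffers with equal inter-buffer length l (L = Nl).

$$T = N\left[R_d(C_g + cl) + rl\left(C_g + \frac{cl}{2}\right)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l}\right]$$

To minimize delay, set $dT/dl = 0$:

$$L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \quad\Rightarrow\quad l_{opt} = \sqrt{\frac{2R_dC_g}{rc}}$$

• r, c: resistance and capacitance per unit length
• $R_d$: on-resistance of the inverter
• $C_g$: gate input capacitance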

Quadratic Delay -> Linear Delay

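Substituting $l_{opt}$ back into the interconnect delay expression:

$$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_dc) + \frac{R_dC_g}{l_{opt}}\right] = L\left[\frac{rc}{2}\sqrt{\frac{2R_dC_g}{rc}} + (rC_g + R_dc) + R_dC_g\sqrt{\frac{rc}{2R_dC_g}}\right]$$

$$T_{opt} = L\left(\sqrt{2R_dC_g\,rc} + rC_g + R_dc\right)$$

Delay grows linearly with L instead of quadratically. This is why buffer insertion is highly effective and thus widely used for reducing circuit delay.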
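The quadratic-to-linear claim can be checked numerically. The sketch below assumes the uniform-buffer delay model T = N[R_d(C_g + cl) + rl(C_g + cl/2)] with L = Nl, and uses unit values for r, c, R_d and C_g purely for illustration; all function names are hypothetical.

```python
import math

def unbuffered_delay(L, r, c):
    """Elmore delay of an unbuffered wire of length L."""
    return r * c * L * L / 2.0

def buffered_delay(L, l, r, c, Rd, Cg):
    """Delay of a wire of length L cut into L/l identical buffered segments."""
    n_segments = L / l
    per_segment = Rd * (Cg + c * l) + r * l * (Cg + c * l / 2.0)
    return n_segments * per_segment

def optimal_spacing(r, c, Rd, Cg):
    """Spacing that zeroes dT/dl: sqrt(2*Rd*Cg / (r*c))."""
    return math.sqrt(2.0 * Rd * Cg / (r * c))

def optimal_delay(L, r, c, Rd, Cg):
    """Closed form after substituting the optimal spacing."""
    return L * (math.sqrt(2.0 * Rd * Cg * r * c) + r * Cg + Rd * c)

r, c, Rd, Cg = 1.0, 1.0, 1.0, 1.0  # assumed unit values
l_opt = optimal_spacing(r, c, Rd, Cg)
for L in (10.0, 20.0, 40.0):
    print(L, unbuffered_delay(L, r, c), optimal_delay(L, r, c, Rd, Cg))
```

Doubling L quadruples the unbuffered delay but only doubles the optimally buffered delay in this model.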

25% Gates are Buffers


Saxena, et al. [TCAD 2004]

ITRS Projections


Problem Formulation

Minimal cost (area/power) solution

1. Steiner Tree

2. n candidate buffer locations

[Figure: Steiner tree with candidate buffer locations; T marks the timing target]

Solution Characterization

To model the effect on the downstream subtree, a candidate solution is associated with

• v: a node
• C: downstream capacitance
• Q: required arrival time
• W: cumulative buffer cost

Candidate Buffering Solutions


Dynamic Programming (DP)

• Start from the sinks
• Candidate solutions are generated
• Candidate solutions are propagated toward the source
• Three operations
– Add Wire
– Insert Buffer
– Merge
• Solution pruning

Solution Propagation: Add Wire

Propagating (v1, c1, w1, q1) across a wire of length x gives (v2, c2, w2, q2):

• $c_2 = c_1 + cx$
• $q_2 = q_1 - (rcx^2/2 + rxc_1)$
• r: wire resistance per unit length
• c: wire capacitance per unit length

Solution Propagation: Insert Buffer

Inserting buffer b at v1 turns (v1, c1, w1, q1) into (v1, c1b, w1b, q1b):

• $q_{1b} = q_1 - d(b)$
• $c_{1b} = C(b)$
• $w_{1b} = w_1 + w(b)$
• d(b): buffer delay

Solution Propagation: Merge

Merging the left solution (v, cl, wl, ql) and the right solution (v, cr, wr, qr) at node v:

• $c_{merge} = c_l + c_r$
• $w_{merge} = w_l + w_r$
• $q_{merge} = \min(q_l, q_r)$
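The merge rule can be sketched in a few lines. The `Sol` record and the helper names below are hypothetical, chosen only for illustration; the numeric values are made up.

```python
from typing import List, NamedTuple

class Sol(NamedTuple):
    c: float  # downstream capacitance
    q: float  # required arrival time
    w: float  # cumulative buffer cost

def merge(left: Sol, right: Sol) -> Sol:
    """Combine one left-branch and one right-branch candidate at a branch node."""
    return Sol(c=left.c + right.c,       # capacitances add
               q=min(left.q, right.q),   # the tighter branch sets the timing
               w=left.w + right.w)       # costs add

def merge_all(lefts: List[Sol], rights: List[Sol]) -> List[Sol]:
    """Naive merge: every left candidate paired with every right candidate."""
    return [merge(l, r) for l in lefts for r in rights]

merged = merge(Sol(c=3, q=16, w=0), Sol(c=1, q=12, w=1))
print(merged)
```

Pairing every left candidate with every right candidate is the quadratic behavior that pruning and rounding are meant to tame.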

Example of Solution Propagation

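Solutions are written as (v, C, Q, W).

• r = 1, c = 1
• Rb = 1, Cb = 1, tb = 1
• Rd = 1

[Figure: a sink at v1 connected through two wire segments of length 2 (v1 to v2, v2 to v3) to the driver]

• Sink: (v1, 1, 20, 0)
• Add wire: (v2, 3, 16, 0)
• Insert buffer: (v2, 1, 12, 1)
• Add wire (no buffer): (v3, 5, 8, 0)
• Add wire (with buffer): (v3, 3, 8, 1)
• Add driver (no buffer): slack = 3
• Add driver (with buffer): slack = 5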
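The numbers in this example can be reproduced with a short sketch. It makes three assumptions that the flattened slide does not spell out: each wire segment has length 2, the buffer delay is d(b) = tb + Rb·C, and each buffer has unit cost. The `Sol` record and function names are hypothetical.

```python
from typing import NamedTuple

class Sol(NamedTuple):
    node: str
    c: float  # downstream capacitance
    q: float  # required arrival time
    w: float  # cumulative buffer cost

R_WIRE, C_WIRE = 1.0, 1.0      # r, c per unit length (from the slide)
R_B, C_B, T_B = 1.0, 1.0, 1.0  # buffer resistance, input cap, intrinsic delay
W_B = 1.0                      # assumed unit buffer cost
R_D = 1.0                      # driver resistance

def add_wire(s: Sol, node: str, x: float) -> Sol:
    delay = R_WIRE * C_WIRE * x * x / 2 + R_WIRE * x * s.c
    return Sol(node, s.c + C_WIRE * x, s.q - delay, s.w)

def insert_buffer(s: Sol) -> Sol:
    # Assumed delay model: d(b) = T_B + R_B * (load seen by the buffer)
    return Sol(s.node, C_B, s.q - (T_B + R_B * s.c), s.w + W_B)

def slack_at_driver(s: Sol) -> float:
    return s.q - R_D * s.c

sink = Sol("v1", 1, 20, 0)
at_v2 = add_wire(sink, "v2", 2)
at_v2_buf = insert_buffer(at_v2)
unbuf_v3 = add_wire(at_v2, "v3", 2)
buf_v3 = add_wire(at_v2_buf, "v3", 2)
print(at_v2, at_v2_buf, unbuf_v3, buf_v3)
print(slack_at_driver(unbuf_v3), slack_at_driver(buf_v3))
```

With these assumptions the buffered candidate costs one buffer more but has the larger slack at the driver, so neither candidate dominates the other.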

Solution Propagation

[Figure: solution propagation, steps (1) to (3)]

Exponential Runtime

[Figure: candidate counts growing along the net: 2 solutions, 4 solutions, 8 solutions, 16 solutions]

n candidate buffer locations lead to $2^n$ solutions

Too Many Solutions

• Solution pruning is needed for acceleration
• Two candidate solutions
– (v, c1, q1, w1)
– (v, c2, q2, w2)
• Solution 1 is inferior to Solution 2 if
– c1 ≥ c2: larger load
– and q1 ≤ q2: tighter timing
– and w1 ≥ w2: larger cost

Car Race - Speed


Car Speed <=> RAT

Car Race - Load


Load <=> Load Capacitance

Faster & Smaller Load

• Faster & smaller load (larger RAT, smaller capacitance): good
• Slower & larger load (smaller RAT, larger capacitance): inferior

Faster & Larger Load: Result 1


Faster & Larger Load: Result 2

Who will be the winner? Cannot tell at this moment, so keep both of them.

Pruning

• (Q1, C1, W1) is inferior to (dominated by) (Q2, C2, W2) if C1 ≥ C2, W1 ≥ W2 and Q1 ≤ Q2
• Non-dominated solutions are maintained: for the same Q and W, pick the minimum C
• The number of solutions depends on the number of distinct W and Q values, but not on the values themselves
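The dominance test translates directly into code. The sketch below is a simple quadratic-time filter written for clarity rather than speed; the function names and the sample tuples are illustrative assumptions.

```python
def dominates(a, b):
    """True if a = (q, c, w) is at least as good as b in all three and differs from b."""
    qa, ca, wa = a
    qb, cb, wb = b
    return qa >= qb and ca <= cb and wa <= wb and a != b

def prune(solutions):
    """Keep only non-dominated (q, c, w) candidates; exact duplicates collapse to one."""
    unique = list(dict.fromkeys(solutions))  # drop duplicates, keep order
    return [s for s in unique if not any(dominates(t, s) for t in unique)]

candidates = [(8, 5, 0), (8, 3, 1), (7, 5, 1), (8, 3, 1)]
kept = prune(candidates)
print(kept)
```

Here (7, 5, 1) is dropped because (8, 5, 0) has better timing, no larger load and lower cost, while (8, 5, 0) and (8, 3, 1) are both kept since each is better in one dimension.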

Generating Candidates

[Figure: candidate solutions generated at steps (1), (2) and (3)]

Pruning Candidates

[Figure: candidate sets (3) and (4), with candidates (a) and (b) marked]

Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost.

Candidate Example Continued

[Figure: candidate sets (4) and (5)]

Candidate Example Continued

[Figure: candidate set (5) after pruning]

At the driver, select the candidate solution that satisfies the timing target with minimum cost. The result is optimal.

Branch Merge


Right Candidates

Left Candidates

Pruning During Branch Merge

With pruning, $O(n_1 n_2)$ solutions after each branch merge. Worst case: $O((n/m)^m)$ solutions.

Selected Milestone Works on Timing Buffering

[Figure: timeline 1990, 1991, …, 1996, …, 2003, 2004, …, 2008, 2009 with milestones in order: van Ginneken’s algorithm, Lillis’ algorithm, Shi and Li’s algorithm, NP-hardness proof]

Is it possible to design a provably good algorithm running in polynomial time with theoretical guarantee on the error to the optimal solution?

This has been a major open problem for a decade!

Bridging The Gap

We are bridging the gap!

A Fully Polynomial Time Approximation Scheme (FPTAS)

• Provably good
• Computes a solution with cost at most (1+ɛ) times the optimal cost for any ɛ > 0
• Runs in time polynomial in n (nodes), b (buffer types) and 1/ɛ
• Best solution for an NP-hard problem in theory
• Highly practical

The Rough Picture

W*: the cost of the optimal solution

1. Make a guess on W*
2. Check it
3. If the guess is good (close to W*), return the solution; if not, make another guess

Key 1: Efficient checking

Key 2: Smart guess

Key 1: Efficient Checking

Benefit of the guess

• Only maintain the solutions with cost no greater than the guessed cost
• This is the first reason for acceleration

The Oracle

Oracle(x): the checker, able to decide whether x > W* or not

– Without knowing W*
– Answers efficiently

Construction of Oracle(x)

• Only interested in whether there is a solution with cost up to x satisfying the timing constraint
• Scale and round each buffer cost
• Dynamic programming: perform DP on the scaled problem with cost upper bound n/ɛ. Time is polynomial in n/ɛ

Scaling and Rounding

[Figure: buffer cost axis with grid points 0, xɛ/n, 2xɛ/n, 3xɛ/n, 4xɛ/n, which become 0, 1, 2, 3, 4 after scaling and rounding]

• The number of distinct buffer costs is at most O(n/ɛ), since only solutions with W bounded by n/ɛ are propagated.
• Rounding error at each buffer is xɛ/n; total rounding error is xɛ.
• Larger xɛ/n: larger error, fewer distinct costs and faster
• Smaller xɛ/n: smaller error, more distinct costs and slower
• Rounding is the second reason for acceleration
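A minimal sketch of the scale-and-round step follows. It assumes costs are rounded down to multiples of xɛ/n, which is one reading consistent with the stated per-buffer error of xɛ/n; the function name and the sample numbers are illustrative.

```python
import math

def scale_and_round(costs, x, eps, n):
    """Map each buffer cost to an integer number of units of size x*eps/n (rounded down)."""
    unit = x * eps / n
    return [math.floor(w / unit) for w in costs]

x, eps, n = 100.0, 0.5, 10     # guessed cost, accuracy, number of locations (made up)
unit = x * eps / n             # rounding granularity, 5.0 here
costs = [3.0, 7.0, 12.0, 20.0]
scaled = scale_and_round(costs, x, eps, n)
errors = [w - s * unit for w, s in zip(costs, scaled)]
cost_cap = n / eps             # DP only keeps scaled costs up to n/eps
print(scaled, errors, cost_cap)
```

Each buffer loses less than one unit xɛ/n, so a solution with at most n buffers loses less than xɛ in total, while the DP only has to track integer costs up to n/ɛ.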

Oracle Construction

Run dynamic programming with cost bound n/ɛ

• Yes, there is a solution satisfying the timing constraint: with cost rounded and scaled back, the solution has cost at most (n/ɛ) · (xɛ/n) + xɛ = (1+ɛ)x, so (1+ɛ)x > W*
• No, there is no such solution: with cost rounded and scaled back, any solution satisfying the timing constraint has cost at least (n/ɛ) · (xɛ/n) = x, so x ≤ W*

Rounding on Q

• The number of solutions is bounded by the number of distinct W and Q values
• # W = O(n/ɛ1), where ɛ1 is used for W
– Rounding before DP
• # Q = O(m/ɛ2), where ɛ2 is used for Q
– Round up Q to the nearest value in {0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, …, T} in branch merge (m is the number of sinks)
– Rounding during DP
– Rounding error bounded by ɛ2T/m per branch merge, and by ɛ2T for the whole tree
• The number of non-dominated solutions is O(mn/(ɛ1ɛ2))

[Figure: Q axis with grid points 0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, 4ɛ2T/m]
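Rounding Q onto the grid {0, ɛ2T/m, 2ɛ2T/m, …, T} can be sketched as below. The function name and the sample values are assumptions for illustration, and the sketch assumes 0 ≤ Q ≤ T.

```python
import math

def round_up_q(q, T, eps2, m):
    """Round q up to the nearest multiple of eps2*T/m, capped at T."""
    step = eps2 * T / m
    return min(T, math.ceil(q / step) * step)

T, eps2, m = 100.0, 0.5, 5      # timing target, accuracy, number of sinks (made up)
step = eps2 * T / m             # 10.0 here
rounded = [round_up_q(q, T, eps2, m) for q in (0.0, 23.0, 30.0, 95.0)]
grid_size = int(m / eps2) + 1   # number of distinct Q values on the grid
print(rounded, grid_size)
```

Each rounding moves Q by less than one step ɛ2T/m, which is why the error accumulated over at most m branch merges stays within ɛ2T.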

Q-W Rounding Before Branch Merge

[Figure: grid of candidate solutions over W (0, 1, 2, 3, 4, …, n/ɛ1) and Q (ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, 4ɛ2T/m, …, T)]

Buffer Insertion Runtime

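• $O(n/\epsilon_1)$ W-bins
• $O\left(\frac{mn}{\epsilon_1\epsilon_2}\right)$ solutions after a branch merge
• A buffer insertion introduces $O\left(\frac{nb}{\epsilon_1}\right)$ non-dominated solutions with b buffer types
• At most $O\left(\frac{mn}{\epsilon_1\epsilon_2} + \frac{n^2b}{\epsilon_1}\right)$ non-dominated solutions in a single branch
• $O\left(\frac{mnb}{\epsilon_1\epsilon_2} + \frac{n^2b^2}{\epsilon_1}\right)$ time for each node. No cross-W-bin pruning.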

Branch Merge Runtime - 1

Target Q = 0

When merging Wl = 2 with Wr = 1, previously we needed to try a quadratic number of combinations; now only a linear number of combinations is needed.

Branch Merge Runtime - 2


Target Q= ɛ2T/m

Branch Merge Runtime - 3


Target Q= 2ɛ2T/m

Branch Merge Runtime - 4

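• For merged W = a, try all $W_l + W_r = a$, where each takes $O(m/\epsilon_2)$ time
• For a = 0, 1, ..., $n/\epsilon_1$, the total runtime is $O\left(\frac{mn^2}{\epsilon_1^2\epsilon_2}\right)$
• Including the time for putting solutions into bins, it is $O\left(\frac{mn}{\epsilon_1\epsilon_2} + \frac{n^2b}{\epsilon_1} + \frac{mn^2}{\epsilon_1^2\epsilon_2}\right)$
• $O\left(\frac{mn}{\epsilon_1\epsilon_2}\right)$ solutions after a branch merge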

Timing-Cost Approximate DP

Lemma: a buffering solution with cost at most (1+ɛ1)W* and with timing at most (1+ɛ2)T can be computed in time

[Equation: an O(·) bound polynomial in m, n, b, 1/ɛ1 and 1/ɛ2]

Key 2: Geometric Sequence Based Guess

• U (L): upper (lower) bound on W*
• Naive binary search style approach
1. Set U and L on W*
2. x = (U+L)/2
3. Query Oracle(x): if W* < (1+ɛ)x, set U = (1+ɛ)x; if W* ≥ x, set L = x
• Runtime (# iterations) depends on the initial bounds U and L

Adapt ɛ1

• Rounding factor xɛ1/n for W
• Larger ɛ1: faster with rough estimation
• Smaller ɛ1: slower with accurate estimation
• Adapt ɛ1 according to U and L

U/L Related Scale and Round

[Figure: buffer cost axis from 0 to U/L, divided into steps of xɛ/n]

Conceptually

• Begin with a large ɛ1 and progressively reduce it (towards ɛ) according to U/L as x approaches W*
• Fix ɛ2 = ɛ in rounding T to limit the timing violation
• Set ɛ1 as a geometric sequence …, 8, 4, 2, 1, 1/2, …, ɛ
• Suppose that one run of DP takes O(n/ɛ1) time. The total runtime is dominated by the last run: O(… + n/8 + n/4 + n/2 + … + n/ɛ) = O(n/ɛ).
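The geometric-series argument can be checked numerically. The sketch below uses the simplified per-run cost n/ɛ1 from the slide, starts the sequence at 8 and halves until it reaches ɛ; the starting value and the function names are assumptions for illustration.

```python
def epsilon_schedule(eps, start=8.0):
    """Halve eps1 from `start` while it stays above eps, then finish with eps itself."""
    seq = []
    e = start
    while e > eps:
        seq.append(e)
        e /= 2.0
    seq.append(eps)
    return seq

def total_cost(n, eps, start=8.0):
    """Sum of the simplified per-run costs n/eps1 over the whole schedule."""
    return sum(n / e for e in epsilon_schedule(eps, start))

n = 1000
for eps in (0.5, 0.125, 0.1, 0.01):
    print(eps, total_cost(n, eps), n / eps)
```

In this model the halving runs together cost less than 2n/ɛ and the final run costs n/ɛ, so the total stays below 3n/ɛ, a constant factor above the last run alone.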

Oracle Query Till U/L<2

In iteration i, let $W^*_{u,i}$ and $W^*_{l,i}$ denote the upper and lower bounds on W*:

$$\epsilon'_i = \sqrt{\frac{W^*_{u,i}}{W^*_{l,i}}} - 1, \qquad x_i = \sqrt{\frac{W^*_{l,i}\,W^*_{u,i}}{1+\epsilon'_i}}$$

$$\frac{W^*_{u,i+1}}{W^*_{l,i+1}} = \left(\frac{W^*_{u,i}}{W^*_{l,i}}\right)^{3/4}$$

[Equation: the runtimes of the oracle queries, summed over all iterations, form a geometric series, giving a total O(·) bound in m, n, b, ɛ1 and ɛ2]

Mathematically


When U/L<2

1. Once U/L < 2, set the cost bound W = 2n/ɛ
2. Scale and round each cost by Lɛ/n
3. Run DP
4. Pick the minimum cost solution satisfying timing at the driver

• At least one feasible solution is found; otherwise there would be no solution with cost (2n/ɛ) · (Lɛ/n) = 2L ≥ U
• Rounding error is Lɛ/n per buffer and Lɛ in a solution
• The runtime is that of a single DP

The Algorithmic Flow

1. Set U and L of W*
2. Adapt $\epsilon_1 = \sqrt{U/L} - 1$
3. Set $x = \sqrt{UL/(1+\epsilon_1)}$
4. Query Oracle(x)
5. Update U or L
6. Repeat steps 2 to 5 until U/L < 2, then compute the final solution
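The bound-narrowing loop of this flow can be sketched with a stand-in oracle. Two assumptions are made here: the flattened slide formula is read as ɛ1 = √(U/L) − 1, which shrinks U/L to (U/L)^(3/4) per query, and the oracle is a mock that peeks at a known W*, whereas a real oracle would run the rounded DP without knowing W*. All names are hypothetical.

```python
import math

def make_mock_oracle(w_star):
    """Stand-in for Oracle(x). A real oracle runs the rounded DP and never sees W*."""
    def oracle(x, eps1):
        # True  -> a solution with cost below (1 + eps1) * x exists
        # False -> W* >= x
        return w_star < x
    return oracle

def narrow_bounds(lower, upper, oracle):
    """Query the oracle until upper/lower < 2; return the final bounds and query count."""
    steps = 0
    while upper / lower >= 2.0:
        eps1 = math.sqrt(upper / lower) - 1.0
        x = math.sqrt(upper * lower / (1.0 + eps1))
        if oracle(x, eps1):
            upper = (1.0 + eps1) * x   # W* < (1 + eps1) * x
        else:
            lower = x                  # W* >= x
        steps += 1
    return lower, upper, steps

w_star = 37.0
lower, upper, steps = narrow_bounds(1.0, 1024.0, make_mock_oracle(w_star))
print(lower, upper, steps)
```

Because the ratio U/L is raised to the power 3/4 on every query, the number of queries grows only with log log(U/L); the final accurate DP is then run once on the narrowed range.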

Main Theorem

Theorem: a (1+ɛ) approximation to the timing constrained minimum cost buffering problem can be computed in $O(m^2n^2b/\epsilon^3 + n^3b^2/\epsilon)$ time for 0 < ɛ < 1 and in $O(m^2n^2b/\epsilon + mn^2b + n^3b)$ time for ɛ ≥ 1

Experiments

Experimental setup

– 1000 industrial nets
– 48 industrial buffer types, including non-inverting buffers and inverting buffers

Compared to dynamic programming, which is the state-of-the-art technique and is widely used in industry

Cost Ratio Compared to DP

[Figure: FPTAS buffer cost ratio (vertical axis, 0 to 0.14) versus approximation (horizontal axis)]

Speedup Compared to DP

[Figure: FPTAS speedup (vertical axis, 0 to 6) versus approximation (horizontal axis: 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5)]

Observations

• FPTAS always achieves the theoretical guarantee
• Larger ɛ leads to more speedup
• On average about 5x faster than dynamic programming
• Can run 4.6x faster with 0.57% solution degradation
• <5% of nets have timing violations, which can be fixed by a simple timing recovery procedure

Our Bridge


NP-Hardness Complexity

Exponential Time Algorithm

Conclusion

• Propose a (1+ɛ) approximation for timing constrained minimum cost buffering for any ɛ > 0 (DAC’09)
– Runs in $O(m^2n^2b/\epsilon^3 + n^3b^2/\epsilon)$ time
– Timing-cost approximate dynamic programming
– Double-ɛ geometric sequence based oracle search
– 5x speedup in experiments
– Few percent additional buffers, as guaranteed theoretically
• The first provably good approximation algorithm for this problem, which was a major open problem in the field

Thanks
