A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion

Professor Shiyan Hu, Ph.D.

Department of Electrical and Computer Engineering

Michigan Technological University

Moore’s law

Twice the number of transistors, approximately every two years

Interconnect Delay Dominates Gate Delay

Technology Scaling

130nm 65nm

Global interconnect lengths do not shrink

Local interconnect lengths shrink

Delay ∝ RC

Resistance R = rL/S, where S is reduced

Capacitance C changes only slightly

Interconnect Delay Scaling

Scaling factor s = 0.7 per generation

Elmore delay of a wire of length l

$t_{int} = (rl)(cl)/2 = rcl^2/2$ (first order)

Local interconnects: $t_{int} = (r/s^2)(c)(ls)^2/2 = rcl^2/2$

– Local interconnect delay is roughly unchanged

Global interconnects: $t_{int} = (r/s^2)(c)(l)^2/2 = rcl^2/(2s^2) \approx rcl^2$ (since $s^2 \approx 0.5$)

– Global interconnect delay doubles, which is unsustainable
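The scaling argument can be checked numerically. The sketch below assumes, as on the slide, that the per-unit-length resistance grows by 1/s² per generation while the per-unit-length capacitance stays fixed. The function names and sample values are illustrative only.

```python
def wire_delay(r, c, length):
    """First-order Elmore delay of a uniform wire: r * c * length^2 / 2."""
    return r * c * length ** 2 / 2


def scaled_delays(r, c, length, s=0.7):
    """Delay of a local and a global wire after one technology generation."""
    r_new = r / s ** 2                          # smaller cross-section, higher resistance
    local = wire_delay(r_new, c, length * s)    # local wires shrink by s
    global_ = wire_delay(r_new, c, length)      # global wires keep their length
    return local, global_


base = wire_delay(1.0, 1.0, 10.0)
local, global_ = scaled_delays(1.0, 1.0, 10.0)
print(f"before: {base:.2f}  local after: {local:.2f}  global after: {global_:.2f}")
```

With s = 0.7 the local delay is unchanged and the global delay grows by 1/s² ≈ 2.04, which is the "doubling" the slide refers to.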

Interconnect delay increasingly more dominant

Timing Driven Buffer Insertion

Buffers Reduce RC Wire Delay

[Figure: π-model of each wire segment, with series resistance rx/2 and capacitance cx/4 at each end]

$\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$

Intuitive Analysis

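Interconnect Elmore delay = $rcL^2/2$

Insert a buffer every l = 2 units of wire

Interconnect delay per segment = $\frac{1}{2} rc \cdot 2^2 = 2rc$

Since there are L/2 buffers (one per segment), total interconnect delay = $\frac{L}{2} \cdot 2rc = rcL$

(Of course, we need to consider buffer delay)

Without buffers, the delay of a wire of length L is $T = rcL^2/2$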

Detailed Analysis

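Assume N identical buffers with equal inter-buffer length l (L = Nl).

$T = N\left[R_d(C_g + cl) + rl\left(C_g + \frac{cl}{2}\right)\right] = L\left[\frac{rcl}{2} + \frac{R_d C_g}{l} + (rC_g + R_d c)\right]$

To minimize T, set $dT/dl = 0$:

$l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}$

r, c – Resistance, cap. per unit length

R_d – On resistance of inverter

C_g – Gate input capacitance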

Quadratic Delay -> Linear Delay

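Substituting $l_{opt}$ back into the interconnect delay expression:

$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + \frac{R_d C_g}{l_{opt}} + (rC_g + R_d c)\right] = L\left(\sqrt{2 R_d C_g rc} + rC_g + R_d c\right)$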

Delay grows linearly with L instead of quadratically. This is why buffer insertion is highly effective and thus widely used for reducing circuit delay.
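A small numeric check of the quadratic-to-linear claim. It assumes the usual repeater delay model T = N[R_d(C_g + cl) + rl(C_g + cl/2)] with L = Nl, and treats N as a continuous quantity. The function names and unit parameter values are illustrative, not taken from the slides.

```python
import math


def buffered_delay(L, l, r, c, Rd, Cg):
    """Delay of a wire of length L with identical buffers spaced l apart."""
    n_segments = L / l                       # treated as continuous in this sketch
    per_segment = Rd * (Cg + c * l) + r * l * (Cg + c * l / 2)
    return n_segments * per_segment


def optimal_spacing(r, c, Rd, Cg):
    """Spacing that minimizes buffered_delay (from dT/dl = 0)."""
    return math.sqrt(2 * Rd * Cg / (r * c))


def optimal_delay(L, r, c, Rd, Cg):
    """Closed form of buffered_delay evaluated at the optimal spacing."""
    return L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)


params = dict(r=1.0, c=1.0, Rd=1.0, Cg=1.0)
l_opt = optimal_spacing(**params)
for L in (10.0, 20.0, 40.0):
    unbuffered = params["r"] * params["c"] * L ** 2 / 2
    print(L, unbuffered, round(buffered_delay(L, l_opt, **params), 3))
```

Doubling L quadruples the unbuffered delay but only doubles the optimally buffered one.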

25% Gates are Buffers

Saxena, et al. [TCAD 2004]

ITRS Projections

Problem Formulation

Minimal cost (area/power) solution

1. Steiner Tree

2. n candidate buffer locations

Solution Characterization

To model the downstream effect, a candidate solution is associated with

• v: a node

• C: downstream capacitance

• Q: required arrival time

• W: cumulative buffer cost

Candidate Buffering Solutions

Dynamic Programming (DP)

Candidate solutions are propagated toward the source

Start from sinks

Candidate solutions are generated

Three operations

– Add Wire

– Insert Buffer

– Merge

Solution Pruning

Solution Propagation: Add Wire

(v2, c2, w2, q2) is obtained from (v1, c1, w1, q1) by adding a wire of length x

c2 = c1 + cx

q2 = q1 - (rcx^2/2 + rxc1)

w2 = w1 (a wire adds no buffer cost)

r: wire resistance per unit length

c: wire capacitance per unit length
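The add-wire update can be written as a small function. This is a sketch only: the `Candidate` tuple and the function name are illustrative, and the cost is assumed to pass through unchanged because a wire contains no buffer.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    node: str   # v: node the solution is attached to
    c: float    # downstream capacitance
    w: float    # cumulative buffer cost
    q: float    # required arrival time


def add_wire(cand, upstream_node, x, r, c):
    """Propagate a candidate across a wire of length x toward the source."""
    wire_delay = r * c * x ** 2 / 2 + r * x * cand.c   # Elmore delay of the wire
    return Candidate(upstream_node, cand.c + c * x, cand.w, cand.q - wire_delay)


print(add_wire(Candidate("v1", 1.0, 0, 20.0), "v2", x=2.0, r=1.0, c=1.0))
```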

Solution Propagation: Insert Buffer

(v1, c1, w1, q1) becomes (v1, c1b, w1b, q1b) after inserting buffer b

q1b = q1 - d(b)

c1b = C(b)

w1b = w1 + w(b)

d(b): buffer delay
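A sketch of the insert-buffer update. The slide only names d(b) as the buffer delay. The linear model d(b) = t_b + R_b·c1 used here is an assumption, as are the `Buffer` and `Candidate` types.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    node: str
    c: float    # downstream capacitance
    w: float    # cumulative buffer cost
    q: float    # required arrival time


class Buffer(NamedTuple):
    r_b: float   # output resistance
    c_b: float   # input capacitance, C(b)
    t_b: float   # intrinsic delay
    cost: float  # w(b)


def insert_buffer(cand, buf):
    """Place buf at the candidate's node; the load seen upstream becomes C(b)."""
    delay = buf.t_b + buf.r_b * cand.c      # assumed linear delay model for d(b)
    return Candidate(cand.node, buf.c_b, cand.w + buf.cost, cand.q - delay)


print(insert_buffer(Candidate("v2", 3.0, 0, 16.0), Buffer(1.0, 1.0, 1.0, 1)))
```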

Solution Propagation: Merge

cmerge = cl + cr

wmerge = wl + wr

qmerge = min(ql , qr)

(v, cl , wl , ql) (v, cr, wr, qr)
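The merge rule combines one candidate from each branch at the same node. A minimal sketch, with plain (c, w, q) tuples and an illustrative function name:

```python
def merge(left, right):
    """Merge a left-branch and a right-branch candidate, each given as (c, w, q)."""
    c_l, w_l, q_l = left
    c_r, w_r, q_r = right
    # Loads and costs add up; the tighter required arrival time wins.
    return (c_l + c_r, w_l + w_r, min(q_l, q_r))


print(merge((2.0, 1, 10.0), (3.0, 2, 7.0)))
```

In a full branch merge this function would be applied to pairs of left and right candidates, followed by pruning.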

Example of Solution Propagation

Solutions are written (v, C, Q, W)

• r = 1, c = 1

• Rb = 1, Cb = 1, tb = 1

• Rd = 1

Sink: (v1, 1, 20, 0)

Add wire (length 2): (v2, 3, 16, 0)

Insert buffer: (v2, 1, 12, 1)

Add wire (length 2): (v3, 5, 8, 0) and (v3, 3, 8, 1)

Add driver: slack = 3 for (v3, 5, 8, 0), slack = 5 for (v3, 3, 8, 1)
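The numbers in this example can be reproduced with a few lines of code. The sketch assumes two wire segments of length 2, a buffer cost of 1, a buffer delay of tb + Rb·C, and a driver delay of Rd·C. These values are inferred from the figures on the slide rather than stated on it. Solutions are kept as (C, Q, W) tuples.

```python
R, C = 1.0, 1.0                  # wire resistance / capacitance per unit length
R_B, C_B, T_B = 1.0, 1.0, 1.0    # buffer output resistance, input cap, intrinsic delay
R_D = 1.0                        # driver resistance
BUFFER_COST = 1                  # assumption: one buffer costs 1


def add_wire(sol, length):
    c, q, w = sol
    return (c + C * length, q - (R * C * length ** 2 / 2 + R * length * c), w)


def insert_buffer(sol):
    c, q, w = sol
    return (C_B, q - (T_B + R_B * c), w + BUFFER_COST)


def slack_at_driver(sol):
    c, q, _ = sol
    return q - R_D * c


sink = (1.0, 20.0, 0)                        # at v1
at_v2 = [add_wire(sink, 2.0)]                # unbuffered candidate at v2
at_v2.append(insert_buffer(at_v2[0]))        # buffered candidate at v2
at_v3 = [add_wire(sol, 2.0) for sol in at_v2]
slacks = [slack_at_driver(sol) for sol in at_v3]
print(at_v2, at_v3, slacks)
```

The unbuffered candidate ends with slack 3 at cost 0 and the buffered one with slack 5 at cost 1, so neither dominates the other.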

Solution Propagation

Exponential Runtime

2 solutions

4 solutions

8 solutions

16 solutions

n candidate buffer locations lead to 2^n solutions

Too Many Solutions

Solution pruning is needed for acceleration

Two candidate solutions

– (v, c1, q1, w1)

– (v, c2, q2, w2)

Solution 1 is inferior to Solution 2 if

– c1 ≥ c2: larger load

– and q1 ≤ q2: tighter timing

– and w1 ≥ w2: larger cost
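The dominance test translates directly into code. The sketch below uses a simple quadratic-time filter for clarity. It is not the pruning data structure of any particular published algorithm, and the names are illustrative.

```python
def is_inferior(s1, s2):
    """True if s1 = (c, q, w) is no better than s2 in load, timing and cost."""
    c1, q1, w1 = s1
    c2, q2, w2 = s2
    return c1 >= c2 and q1 <= q2 and w1 >= w2


def prune(solutions):
    """Keep only non-dominated candidates at one node (order preserved)."""
    unique = list(dict.fromkeys(solutions))          # drop exact duplicates
    return [s for s in unique
            if not any(is_inferior(s, t) for t in unique if t != s)]


candidates = [(3.0, 16.0, 0), (1.0, 12.0, 1), (4.0, 10.0, 1)]
print(prune(candidates))
```

Here (4, 10, 1) is dropped because (1, 12, 1) has a smaller load, a larger required arrival time, and the same cost. The other two survive because each is better in at least one dimension.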

Car Race - Speed

Car Speed <=> RAT

Car Race - Load

Load <=> Load Capacitance

Faster & Smaller Load

Faster & smaller load (larger RAT, smaller capacitance): Good

Slower & larger load (smaller RAT, larger capacitance): Inferior

Faster & Larger Load: Result 1

Faster & Larger Load: Result 2

Who will be the winner? Cannot tell at this moment, so keep both of them.

Pruning

(Q1, C1, W1) is inferior to (dominated by) (Q2, C2, W2) if C1 ≥ C2, W1 ≥ W2 and Q1 ≤ Q2

Non-dominated solutions are maintained: for the same Q and W, pick min C

# of solutions depends on # of distinct W and Q, but not their values

Generating Candidates

Pruning Candidates

(a) (b)

Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost.

Candidate Example Continued

After pruning

At driver, compute the candidate solution satisfying the timing target with minimum cost. The result is optimal.

Branch Merge

Right Candidates

Left Candidates

Pruning During Branch Merge

With pruning, O(n1n2) solutions after each branch merge. Worst-case O((n/m)^m) solutions.

Selected Milestone Works on Timing Buffering

[Figure: timeline of milestone works, 1990 to 2009. Surviving labels: …ken’s, Lillis’, …nd Li’s, NP-har…]

Is it possible to design a provably good algorithm running in polynomial time with theoretical guarantee on the error to the optimal solution?

This is a major open problem for a decade!

Bridging The Gap

We are bridging the gap!

A Fully Polynomial Time Approximation Scheme (FPTAS)

Provably good

Computes a solution with cost at most (1+ɛ) times the optimal cost for any ɛ > 0

Runs in time polynomial in n (nodes), b (buffer types) and 1/ɛ

Best solution for an NP-hard problem in theory

Highly practical

The Rough Picture

W*: the cost of optimal solution

Make guess on W*

Good (close to W*)

Not Good

Key 1: Efficient checking

Key 2: Smart guess

Check it

Return the solution

Key 1: Efficient Checking

Benefit of guess

Only maintain the solutions with cost no greater than the guessed cost

This is the first reason for acceleration

The Oracle

Oracle (x): the checker, able to decide whether x>W* or not

– Without knowing W*

– Answer efficiently

Construction of Oracle(x)

Scale and round each buffer cost

Only interested in whether there is a solution with cost up to x satisfying timing constraint

Dynamic Programming

Perform DP on the scaled problem with cost upper bound n/ɛ. Time polynomial in n/ɛ

Scaling and Rounding

[Figure: buffer cost axis with ticks at 0, xɛ/n, 2xɛ/n, 3xɛ/n, 4xɛ/n. After scaling and rounding, the ticks become 0, 1, 2, 3, 4]

# distinct buffer costs is at most O(n/ε) since only solutions with W bounded by n/ɛ are propagated.

Rounding error at each buffer xɛ/n, total rounding error xɛ.

• Larger xɛ/n: larger error, fewer distinct costs and faster

• Smaller xɛ/n: smaller error, more distinct costs and slower

• Rounding is the second reason for acceleration

Oracle Construction

Run dynamic programming with cost bound n/ɛ

Yes, there is a solution satisfying the timing constraint:

With cost rounded and scaled back, the solution has cost at most n/ɛ • xɛ/n + xɛ = (1+ɛ)x > W*

No, no such solution:

With cost rounded and scaled back, any solution has cost at least n/ɛ • xɛ/n = x ≤ W*
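To make the oracle concrete, here is a minimal sketch for the simplest case, a two-pin net (a single path from sink to driver). Its assumptions are not from the slides: costs are rounded down to multiples of xɛ/n, there is one candidate buffer location between consecutive wire segments, buffer delay is t_b + R_b·C, and timing is met when the slack Q − R_d·C at the driver is non-negative. Rounding of Q is left out.

```python
import math


def prune(sols):
    """Keep (C, Q) pairs that no other pair beats in both load and timing."""
    kept = []
    for cap, rat in sorted(set(sols), key=lambda s: (s[0], -s[1])):
        if not kept or rat > kept[-1][1]:
            kept.append((cap, rat))
    return kept


def oracle(x, eps, sink, wires, buffers, r, c, r_drv):
    """True if some solution with scaled cost <= n/eps meets timing.

    sink: (C, Q) at the sink; wires: segment lengths from sink to driver;
    buffers: list of (R_b, C_b, t_b, cost).
    """
    n = max(len(wires) - 1, 1)           # candidate buffer locations
    unit = x * eps / n                   # rounding granularity for costs
    bound = math.floor(n / eps)          # cost upper bound in scaled units
    frontier = {0: [sink]}               # scaled cost W -> non-dominated (C, Q)
    for i, length in enumerate(wires):
        nxt = {}
        for w, sols in frontier.items():
            for cap, rat in sols:
                cap2 = cap + c * length                                   # add wire
                rat2 = rat - (r * c * length ** 2 / 2 + r * length * cap)
                nxt.setdefault(w, []).append((cap2, rat2))
                if i < len(wires) - 1:                                    # insert buffer
                    for r_b, c_b, t_b, cost in buffers:
                        w2 = w + math.floor(cost / unit)
                        if w2 <= bound:        # discard anything above the guess
                            nxt.setdefault(w2, []).append((c_b, rat2 - t_b - r_b * cap2))
        frontier = {w: prune(s) for w, s in nxt.items()}
    return any(rat - r_drv * cap >= 0
               for sols in frontier.values() for cap, rat in sols)


net = dict(sink=(1.0, 16.0), wires=[2.0, 2.0], buffers=[(1.0, 1.0, 1.0, 1.0)],
           r=1.0, c=1.0, r_drv=1.0)
print(oracle(1.0, 0.5, **net), oracle(0.4, 0.5, **net))
```

In this toy net the unbuffered path misses timing by 1 and a single buffer of cost 1 fixes it, so W* = 1. The oracle answers yes for x = 1 (so W* ≤ (1+ɛ)x) and no for x = 0.4 (so W* ≥ x).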

Rounding on Q

# solutions bounded by # distinct W and Q

# W = O(n/ɛ1), ɛ1 is used for W

– Rounding before DP

# Q = O(m/ɛ2), ɛ2 is used for Q

– Round up Q to nearest value in {0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, …, T} in branch merge (m is # sinks)

– Rounding during DP

– Rounding error bounded by ɛ2T/m per branch merge, by ɛ2T for the whole tree

# non-dominated solutions is O(mn/(ɛ1ɛ2))

[Figure: Q axis with ticks at 0, ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, 4ɛ2T/m]
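The rounding of Q can be sketched as follows. The function name is illustrative, and Q is assumed to lie in [0, T].

```python
import math


def round_up_q(q, T, eps2, m):
    """Round q up to the nearest multiple of eps2*T/m, capped at T."""
    step = eps2 * T / m
    return min(math.ceil(q / step) * step, T)


# With T = 100, eps2 = 0.5 and m = 10 sinks the grid step is 5.
print(round_up_q(7.3, 100.0, 0.5, 10))
```

Each rounding moves Q by less than one step, ɛ2T/m. That is where the per-merge error bound comes from, and the grid has only m/ɛ2 + 1 distinct values.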

Q-W Rounding Before Branch Merge

[Figure: Q-W grid, with W bins 0, 1, 2, 3, 4 on one axis and Q levels ɛ2T/m, 2ɛ2T/m, 3ɛ2T/m, 4ɛ2T/m on the other]

Buffer Insertion Runtime

O(n/ɛ1) W-bins

A buffer insertion introduces O(nb/ɛ1) non-dominated solutions with b buffer types

O(mn/(ɛ1ɛ2)) solutions after a branch merge

At most O(…) non-dominated solutions in a single branch

O(…) time for each node. No cross W-bin pruning.

Branch Merge Runtime - 1

Target Q=0

When merging Wl=2 with Wr=1, previously we need to try quadratic # of combinations, now only linear # of combinations.

Target Q= ɛ2T/m

Target Q= 2ɛ2T/m

For merged W = a, try all Wl + Wr = a, where each takes O(…) time

For a = 0, 1, …, total runtime is O(…)

Including time for putting solutions into bins, it is O(…)

O(mn/(ɛ1ɛ2)) solutions after a branch merge

Timing-Cost Approximate DP

Lemma: a buffering solution with cost at most (1+ɛ1)W* and with timing at most (1+ɛ2)T can be computed in time O(…)

U (L): upper (lower) bound on W*

Naive binary search style approach

Runtime (# iterations) depends on the initial bounds U and L

Key 2: Geometric Sequence Based Guess

Oracle (x)

x=(U+L)/2

Set U and L on W*

W* < (1+ɛ)x: set U = (1+ɛ)x

W* ≥ x: set L = x

Adapt ɛ1

Rounding factor xɛ1/n for W

Larger ɛ1: faster with rough estimation

Smaller ɛ1: slower with accurate estimation

Adapt ɛ1 according to U and L

U/L Related Scale and Round

Buffer cost

Conceptually

Begin with large ɛ1 and progressively reduce it (towards ɛ) according to U/L as x approaches W*

Fix ɛ2=ɛ in rounding T for limiting timing violation

• Set ɛ1 as a geometric sequence of …, 8, 4, 2, 1, 1/2, …, ɛ

• Suppose that one run of DP takes O(n/ɛ1) time. Total runtime is bounded by the last run as O(… + n/8 + n/4 + n/2 + … + n/ɛ) = O(n/ɛ).
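The reason the last run dominates is the geometric series. As a worked bound, assume for illustration that the values of ɛ1 are exactly ɛ·2^k for k = K, …, 1, 0:

```latex
\sum_{k=0}^{K} \frac{n}{2^{k}\epsilon}
  \;=\; \frac{n}{\epsilon} \sum_{k=0}^{K} 2^{-k}
  \;<\; \frac{n}{\epsilon} \sum_{k=0}^{\infty} 2^{-k}
  \;=\; \frac{2n}{\epsilon}
```

So all the coarse runs together cost less than one extra run at the final accuracy ɛ.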

Oracle Query Till U/L<2

[Equations lost in extraction: runtime bounds for the oracle queries issued until U/L < 2]

Mathematically

When U/L<2

At least one feasible solution, otherwise no solution with cost 2n/ɛ • Lɛ/n = 2L ≥ U

Lɛ/n rounding error per buffer and Lɛ in a solution

A single DP runtime

Pick min cost solution satisfying timing at driver

W=2n/ɛ

Scale and round each cost by Lɛ/n

Run DP

The Algorithmic Flow

Oracle (x)

Adapt ɛ1 = (U/L)^{1/2} - 1

Set U and L of W*

Set x = [UL/(1+ɛ1)]^{1/2}

Update U or L

Compute final solution
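The search loop of this flow can be sketched as below. Two assumptions are made. First, the adaptation rule is read as ɛ1 = (U/L)^{1/2} − 1, under which every query shrinks the ratio U/L to (U/L)^{3/4}. Second, the real DP-based oracle is replaced by a mock that knows W*, purely to exercise the loop. The final DP run after U/L < 2 is not included.

```python
import math


def narrow_bounds(oracle, lower, upper):
    """Shrink [lower, upper] around W* until upper / lower < 2.

    oracle(x, eps1) must return True only if W* <= (1 + eps1) * x
    and False only if W* >= x.
    """
    queries = 0
    while upper / lower >= 2:
        eps1 = math.sqrt(upper / lower) - 1          # coarse first, finer later
        x = math.sqrt(upper * lower / (1 + eps1))    # geometric-mean style guess
        if oracle(x, eps1):
            upper = (1 + eps1) * x
        else:
            lower = x
        queries += 1
    return lower, upper, queries


W_STAR = 37.0                                        # hidden optimum, for the mock only
lower, upper, queries = narrow_bounds(lambda x, eps1: W_STAR <= x, 1.0, 1000.0)
print(lower, upper, queries)
```

Because the ratio is raised to the power 3/4 on every query, the number of queries grows only doubly-logarithmically in the initial U/L, unlike a plain binary search on (U+L)/2.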

Main Theorem

Theorem: a (1+ɛ) approximation to the timing constrained minimum cost buffering problem can be computed in O(m^2 n^2 b/ɛ^3 + n^3 b^2/ɛ) time for 0 < ɛ < 1 and in O(m^2 n^2 b/ɛ + mn^2 b + n^3 b) time for ɛ ≥ 1

Experiments

Experimental Setup

– 1000 industrial nets

– 48 industrial buffer types including non-inverting buffers and inverting buffers

Compared to Dynamic Programming which is the state of the art technique and is widely used in industry

Cost Ratio Compared to DP

[Figure: cost ratio of FPTAS compared to DP, plotted against the approximation parameter]

Speedup Compared to DP

[Figure: speedup of FPTAS compared to DP, plotted against the approximation parameter, with ticks at 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

Observations

FPTAS always achieves the theoretical guarantee

Larger ɛ leads to more speedup

On average about 5x faster than dynamic programming

Can run 4.6x faster with 0.57% solution degradation

<5% nets with timing violations, which can be fixed by a simple timing recovery procedure

Our Bridge

NP-Hardness Complexity

Exponential Time Algorithm

Conclusion

Propose a (1+ ɛ) approximation for timing constrained minimum cost buffering for any ɛ > 0 (DAC’09)

– Runs in O(m^2 n^2 b/ɛ^3 + n^3 b^2/ɛ) time

– Timing-cost approximate dynamic programming

– Double-ɛ geometric sequence based oracle search

– 5x speedup in experiments

– Few percent additional buffers as guaranteed theoretically

The first provably good approximation algorithm on this problem which is a major open problem in the field

Thanks
