Low-power Clock Trees for CPUs

Dong-Jin Lee, Myung-Chul Kim and Igor L. Markov
Dept. of EECS, University of Michigan

ICCAD 2010



Outline

■Motivation and challenges
■Modeling and objectives
− Local skew with variation
− Local-skew slack
− Modeling process variation
■Proposed methodology and techniques
− Initial tree construction and buffer insertion
− Robustness improvements
− Wire snaking and delay buffer insertion
■Empirical validation
■Summary

Motivation

■Clock networks
− Contribute a significant fraction of dynamic power
− A limiting factor in high-performance CPUs and SoCs
■Challenges
− Interconnect is lagging in performance while transistors continue scaling
− Multi-objective optimization
– Traditional clock network synthesis constraints
– The increasing impact of process variation
– Power-performance-cost trade-offs

Tree vs Mesh

■Objectives
− Minimize skew of a high-performance clock tree
− Minimize the impact of PVT variations
− Clock trees vs meshes, subject to skew < 7.5ps

[Figure: robustness vs. power efficiency, locating trees, meshes, and ideal clock networks]

Our Contributions

■The notion of local-skew slack for clock trees

■A tabular technique to estimate the impact of variations

■A path-based technique to enhance the robustness

■A time-budgeting algorithm for clock-tree tuning with minimal power resources

■Fine tuning of clock trees: accurate, fast, power efficient

■Implementation: Contango2.0

■Strong empirical results: low skew, robustness, low power

Modeling and Objectives


Local Skew

■Main objective (concept)
− Minimize local skew in the presence of variation
■Definition: Skew
− $\Psi$: clock tree
− $\lambda(s_i)$: the clock latency (insertion delay) at sink $s_i \in \Psi$
− $\mathrm{skew}(s_i,s_j) = |\lambda(s_i) - \lambda(s_j)|$
■Definition: Global Skew ($\omega_\Psi$)
− $\omega_\Psi = \max_{s_i,s_j \in \Psi} \mathrm{skew}(s_i,s_j)$

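■Definition: The worst nominal local skew ($\omega_\Psi^{\Delta}$)
− $\Delta$: local skew distance bound
− $\mathrm{dist}(s_i,s_j)$: Manhattan distance between $s_i, s_j \in \Psi$
− $\omega_\Psi^{\Delta} = \max \mathrm{skew}(s_i,s_j)$ over all sink pairs within the distance bound $\Delta$
■Definition: The worst local skew with variation ($\omega_\Psi^{\Delta,\nu,y}$)
− $\nu$: variation model
− $y$: yield ($0 < y \le 1$)
− $f(t)$: the cumulative distribution function of $\omega_\Psi^{\Delta,\nu}$
− $\omega_\Psi^{\Delta,\nu,y} = f^{-1}(y)$, the skew value that is not exceeded with probability $y$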
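The nominal skew definitions can be illustrated with a minimal Python sketch. The sink coordinates, latencies and function names are hypothetical, and treating the distance bound as inclusive is an assumption of this sketch rather than something the slides state.

```python
from itertools import combinations


def manhattan(p, q):
    """Manhattan distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def global_skew(lat):
    """Largest latency difference over all sink pairs."""
    return max(lat) - min(lat)


def local_skew(pos, lat, delta):
    """Largest latency difference over sink pairs within distance delta.

    The bound is treated as inclusive here (an assumption).
    """
    pairs = [(i, j) for i, j in combinations(range(len(lat)), 2)
             if manhattan(pos[i], pos[j]) <= delta]
    return max((abs(lat[i] - lat[j]) for i, j in pairs), default=0.0)


# Hypothetical sinks: positions in um, latencies in ps.
pos = [(0, 0), (100, 0), (0, 150), (900, 900)]
lat = [100.0, 103.0, 101.0, 110.0]

print(global_skew(lat))           # 10.0
print(local_skew(pos, lat, 300))  # 3.0
```

The far-away fourth sink dominates the global skew but does not affect the local skew, which is why a local-skew objective can be much less restrictive than a global one.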

Modeling and Objectives - Example

■Worst local skew with variation ($\omega_\Psi^{\Delta,\nu,y}$)
− Probability density function of $\omega_\Psi^{\Delta,\nu}$
− $\Omega^{\Delta}$ = 7.5ps, $y$ = 95%, $\omega_\Psi^{\Delta,\nu,y} < \Omega^{\Delta}$
− $\omega_\Psi^{\Delta,\nu,y}$ = 6.05ps

[Figure: PDF, CDF and inverse CDF of the local skew (in ps), with $\Omega^{\Delta}$ and $\omega_\Psi^{\Delta,\nu,y}$ marked; $y$ = 0.95 maps to $\omega_\Psi^{\Delta,\nu,y}$ = 6.05ps]
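Reading a skew value off the inverse CDF at a given yield can be sketched as an empirical quantile over Monte Carlo samples. This is a minimal illustration, not the authors' code: the function name is hypothetical and the sample values are synthetic, not data from the slides.

```python
import math
import random


def skew_at_yield(samples, y):
    """Empirical inverse CDF: smallest sample t such that at least a
    fraction y of the samples are <= t."""
    if not 0 < y <= 1:
        raise ValueError("yield must satisfy 0 < y <= 1")
    ordered = sorted(samples)
    # Small epsilon guards against floating-point noise in y * n.
    k = max(1, math.ceil(y * len(ordered) - 1e-9))
    return ordered[k - 1]


random.seed(0)
# Synthetic stand-in for Monte Carlo outcomes of the worst local skew (ps).
samples = [abs(random.gauss(4.5, 1.0)) for _ in range(1000)]

limit = 7.5  # local skew limit in ps, as in the example
w = skew_at_yield(samples, 0.95)
print("meets limit at 95% yield:", w < limit)
```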

Optimization Objectives

■Building variation-tolerant clock trees
− such that $\omega_\Psi^{\Delta,\nu,y} < \Omega^{\Delta}$ ($\Omega^{\Delta}$: local skew limit)
− subject to slew constraints
■Minimizing clock-tree power

[Figure: PDFs of the local skew (in ps), with $\Omega^{\Delta}$ and $\omega_\Psi^{\Delta,\nu,y}$ marked]

Local-skew Slack $\sigma(s)$ for sink $s \in \Psi$

■Definition
− $\sigma(s)$ is the minimum amount of additional delay for $s$, so that the tree satisfies $\omega_\Psi^{\Delta} < \Omega^{\Delta}$
■Example ($\Omega^{\Delta}$ = 5ps)

[Figure: local-skew slack example]
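One simple way to compute such slacks for nominal skew is a fixed-point relaxation: a sink that lags a nearby sink by more than the limit is delayed just enough, and the process repeats until nothing changes. This is an illustrative sketch only, not necessarily the slack algorithm used in Contango2.0; the function name, the inclusive distance bound and the non-strict skew limit are assumptions.

```python
def local_skew_slacks(pos, lat, delta, limit):
    """Smallest per-sink delay increases such that every pair of sinks
    within Manhattan distance delta ends up with skew <= limit."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    n = len(lat)
    near = [[j for j in range(n) if j != i and
             abs(pos[i][0] - pos[j][0]) + abs(pos[i][1] - pos[j][1]) <= delta]
            for i in range(n)]
    tuned = list(lat)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            # A sink may lag its slowest nearby sink by at most `limit`.
            need = max((tuned[j] - limit for j in near[i]), default=tuned[i])
            if need > tuned[i]:
                tuned[i] = need
                changed = True
    return [t - l for t, l in zip(tuned, lat)]


# Hypothetical sinks on a line; only adjacent sinks are within delta.
pos = [(0, 0), (10, 0), (20, 0)]
lat = [0, 4, 12]                      # latencies in ps
print(local_skew_slacks(pos, lat, delta=15, limit=5))  # [2, 3, 0]
```

The middle sink must gain 3ps to come within 5ps of the slowest sink, which in turn forces 2ps onto the first sink. Delay is only ever added, never removed, which matches the definition of slack as additional delay.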

Modeling Process Variation

■Impact of variation on $\mathrm{skew}(s_i,s_j)$ depends on the tree path between $s_i$ and $s_j$: its length, its number of buffers and its buffer type
■Notation
− $T$: technology node
− $B$: buffer and wire library
− $\nu$: variation model
■Variation-estimation table $\Xi_{T,B,\nu,y}[w,b,t]$
− worst-case increase in skew (with probability $y$) between two sinks connected by a tree path of length $w$ with $b$ buffers of buffer type $t$

[Figure: example tree path with labels A, B, C, D, annotated with $w$ (tree path length), $b$ (number of buffers, 2 in the figure) and $t$ (buffer type)]

Modeling Process Variation

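■$\mathrm{varEst}(s_i,s_j)$
− the worst-case variational skew between $s_i$ and $s_j$
− $\mathrm{varEst}(s_i,s_j) = \Xi_{T,B,\nu,y}[w,b,t]$, with $w$, $b$ and $t$ taken from the tree path between $s_i$ and $s_j$
■Key constraint
− $\mathrm{skew}(s_i,s_j) + \mathrm{varEst}(s_i,s_j) < \Omega^{\Delta}$ for every sink pair within the distance bound $\Delta$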
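A table-driven variation estimate can be sketched as a plain lookup keyed by path length, buffer count and buffer type. Everything concrete here is hypothetical: the bin edges, the table entries, the buffer-type names and the function names are made up for illustration, and the constraint check assumes that nominal skew plus the estimate must stay below the local skew limit.

```python
import bisect

# Hypothetical table; real entries would come from SPICE-based characterization.
LENGTH_BINS = [500, 1000, 2000]   # upper edges of path-length bins (um)
XI = {                            # XI[buffer_type][num_buffers][length_bin], in ps
    "small": {2: [2.0, 3.5, 5.0], 4: [2.5, 4.0, 6.0]},
    "large": {2: [1.0, 2.0, 3.0], 4: [1.5, 2.5, 3.5]},
}


def var_est(w, b, t):
    """Look up the estimated worst-case skew increase for a tree path of
    length w with b buffers of type t."""
    k = bisect.bisect_left(LENGTH_BINS, w)
    if k == len(LENGTH_BINS):
        raise ValueError("path length outside the characterized range")
    return XI[t][b][k]


def meets_constraint(nominal_skew, w, b, t, limit):
    """Assumed form of the constraint: nominal skew + estimate < limit."""
    return nominal_skew + var_est(w, b, t) < limit


print(var_est(800, 2, "small"))                       # 3.5
print(meets_constraint(4.5, 800, 2, "small", 7.5))    # False
print(meets_constraint(4.5, 800, 2, "large", 7.5))    # True
```

The appeal of a table is that it is filled once per technology and library, so checking a sink pair during optimization costs a lookup instead of a simulation.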

Initial Tree Construction

■ZST-DME algorithm* based on Elmore delay
■A simple and robust technique for obstacle avoidance**
■Initial buffer insertion
− $t_0$: the initial buffer type for initial buffer insertion
− Use variation-estimation table with path lengths from initial tree
− Once $t_0$ is determined, we adapt the fast variant of van Ginneken’s algorithm*** for initial buffer insertion
− Minimize insertion delay, reliable slew rate

*: J.-H. Huang et al., “On Bounded-Skew Routing Tree Problem,” DAC ’95
**: D.-J. Lee et al., “Contango: Integrated Optimization of SoC Clock Networks,” DATE ’10
***: W. Shi et al., “A Fast Algorithm for Optimal Buffer Insertion,” Trans. on CAD 24(6), 2005
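The core step of zero-skew tree construction under Elmore delay is choosing where to tap a wire that joins two subtrees so that both sides see the same delay. The sketch below shows that balance-point calculation for a uniform wire; the function names and numeric values are hypothetical, and it is not taken from the Contango2.0 implementation.

```python
import math


def elmore_wire_delay(length, load_cap, r, c):
    """Elmore delay of a uniform wire (r, c per unit length) driving load_cap."""
    return r * length * (c * length / 2.0 + load_cap)


def zero_skew_split(t1, c1, t2, c2, length, r, c):
    """Fraction x of the wire assigned to subtree 1 so that
    t1 + delay(x * length) equals t2 + delay((1 - x) * length).

    t1, t2: subtree delays; c1, c2: subtree capacitances.
    A result outside [0, 1] means the wire must be lengthened (detoured).
    """
    num = (t2 - t1) + r * length * (c2 + c * length / 2.0)
    den = r * length * (c1 + c2 + c * length)
    return num / den


# Hypothetical subtrees in consistent (arbitrary) units.
t1, c1, t2, c2 = 0.0, 10.0, 5.0, 30.0
length, r, c = 100.0, 1.0, 1.0

x = zero_skew_split(t1, c1, t2, c2, length, r, c)
left = t1 + elmore_wire_delay(x * length, c1, r, c)
right = t2 + elmore_wire_delay((1 - x) * length, c2, r, c)
print(math.isclose(left, right))  # True
```

The tap point shifts toward the slower, heavier subtree so that the faster side receives more wire delay.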

Robustness Improvement

■Improve robustness after initial buffer insertion so that $\omega_\Psi^{\Delta,\nu,y} < \Omega^{\Delta}$ holds after skew optimization
■The target buffer type for a tree path between sinks $s_i$ and $s_j$, $t(s_i,s_j)$, is defined as the smallest $t$ such that the variation estimate $\Xi_{T,B,\nu,y}[w,b,t]$ for that path stays below $\Omega^{\Delta}$
− choosing smaller buffers reduces capacitance
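Picking the smallest adequate buffer type for a path amounts to scanning the types in increasing size and stopping at the first one that is robust enough. The sketch below assumes the acceptance test is "variation estimate below the local skew limit"; that condition, the type names and the numbers are illustrative assumptions.

```python
def target_buffer_type(estimates, types_by_size, limit):
    """Return the smallest buffer type whose estimated variational skew
    for this path is below limit, or None if no type qualifies.

    estimates: maps buffer type -> estimated variational skew (ps)
    types_by_size: buffer types ordered from smallest to largest
    """
    for t in types_by_size:
        if estimates[t] < limit:
            return t
    return None


# Hypothetical estimates for one sink-to-sink tree path.
types_by_size = ["x1", "x2", "x4"]
estimates = {"x1": 9.0, "x2": 6.8, "x4": 4.1}

print(target_buffer_type(estimates, types_by_size, 7.5))  # x2
```

Stopping at the first qualifying type is what keeps capacitance low: larger buffers would also pass the test but cost more power.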

Local Skew Optimization: Wire Snaking

■Local-skew optimization techniques
− based on the optimal tuning amount from the slack computation algorithms with $\mathrm{varEst}(s_i,s_j)$
■Improved wire snaking algorithm
− speed, accuracy and routing resources

[Figure: iterative wire snaking on edge $e$, keeping $T^i_{\mathrm{target}}(e) \ge T^i_{\mathrm{actual}}(e)$ in every iteration $i$]
− Iteration 1: $T^1_{\mathrm{target}}(e)$ = 11ps, $T^1_{\mathrm{actual}}(e)$ = 7ps
− Iteration 2: $T^2_{\mathrm{target}}(e)$ = 4ps, $T^2_{\mathrm{actual}}(e)$ = 3ps
− Iteration 3: $T^3_{\mathrm{target}}(e)$ = 1ps, $T^3_{\mathrm{actual}}(e)$ = 1ps
− Cumulative: $T_{\mathrm{target}}(e)$ = 11ps; $T_{\mathrm{actual}}(e)$ = 7ps, then 10ps, then 11ps

Delay Model for Wire Snaking

■$\alpha$: to keep $T^i_{\mathrm{actual}}(e) \le T^i_{\mathrm{target}}(e)$ efficiently
■Delay model for wire snaking aims for $T^i_{\mathrm{actual}}(e)$ to satisfy the above inequality with the highest $\alpha$ possible
■Look-up tables for length estimation
− to enhance the quality of estimation by wire snaking
− a set of SPICE simulations for each technology environment, which includes technology model, types of buffers and wires, variation specification
■We achieved $\alpha$ values between 60% and 70% for the ISPD 2010 CNS contest benchmarks
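The iterative, never-overshoot character of wire snaking can be sketched with a toy loop. It assumes that $\alpha$ denotes the fraction of the requested delay that is actually realized in each round, which the slides do not state explicitly, and it uses a constant $\alpha$ as a stand-in for SPICE-measured delays. The function name and tolerance are hypothetical.

```python
def snake_iteratively(target, alpha, tol, max_rounds=20):
    """Add delay in rounds without overshooting the target.

    Each round requests the remaining delay; the toy model realizes only
    a fraction alpha of the request (0 < alpha <= 1).
    Returns (achieved delay, number of rounds).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    achieved = 0.0
    rounds = 0
    while target - achieved > tol and rounds < max_rounds:
        remaining = target - achieved
        achieved += alpha * remaining   # stand-in for a measured delay
        rounds += 1
    return achieved, rounds


achieved, rounds = snake_iteratively(target=11.0, alpha=0.65, tol=0.5)
print(rounds)            # 3
print(achieved <= 11.0)  # True
```

With a constant $\alpha$ the remaining error shrinks by a factor of $1 - \alpha$ per round, so a higher $\alpha$ means fewer rounds and fewer simulator calls, while $\alpha \le 1$ guarantees the target is approached from below.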

Optimal Node Selection for Wire Snaking

■Wire snaking at buffer outputs is more accurate than at other nodes
■Limiting wire snaking to buffer outputs reduces the number of SPICE calls
■Example

[Figure: node selection example]

Delay Buffer Insertion

■Highly unbalanced sink capacitances or layout obstacles may result in significant local skew
■Delay buffer insertion
− Skew can be reduced by the delay of the inserted buffer
− Further precise wire snaking is possible because the inserted buffer isolates the target node
■Example

[Figure: delay buffer insertion example]

ISPD’10 Clock Network Synthesis Contest

■45nm 2GHz CPU benchmarks from IBM and Intel

■Evaluation
− Monte-Carlo SPICE simulations with PVT variations
− Skew and slew constraints (7.5ps, 100ps)
− Objective: total capacitance (proxy for dynamic power)

■A rare opportunity to compare multiple strategies for clock-network synthesis
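Total capacitance can stand in for dynamic power because of the standard CMOS switching-power relation. It is recalled here as background rather than as part of the contest rules, and the symbols $a$, $C$, $V_{DD}$ and $f$ are generic ones, not taken from the slides:

```latex
\[
P_{\mathrm{dyn}} = a \, C \, V_{DD}^{2} \, f
\]
```

Here $a$ is the activity factor (1 for a free-running clock net), $C$ the switched capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency. With supply voltage and frequency fixed by the benchmark, $P_{\mathrm{dyn}}$ is proportional to $C$.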

Example of Our Clock Tree

■ispd10cns07

[Figure: clock tree for ispd10cns07]

Empirical Validation

■ISPD 2010 benchmarks
− 2.6ps nominal local skew
− Smaller capacitance than CNSrouter and NTUclock by 4.22× and 4.13× resp.
− Our clock trees yield > 95%, while CNSrouter violates yield constraints on 3 benchmarks and NTUclock on 7

ICCAD 2010 Proceedings

■Local skew constraints are all cleared
■Smaller capacitance than NTU and CUHK by 2.09× and 4.24× resp.
■More robust with smaller capacitance

Skew: $\omega_\Psi^{\Delta,\nu,y}$ in ps. Cap.: total capacitance; the Avg. row gives capacitance relative to Contango2.

         NTU             CUHK            Contango2
Bench    Skew    Cap.    Skew    Cap.    Skew    Cap.
cns01    7.16    445     7.23    1168    7.01    198
cns02    7.33    934     7.35    2100    7.34    376
cns03    4.88    184     3.95    94      4.18    56
cns04    4.09    196     7.25    125     4.46    72
cns05    3.81    89      7.27    74      4.41    38
cns06    7.49    16      6.79    87      6.05    48
cns07    6.24    23      5.97    128     4.58    73
cns08    5.47    23      5.37    97      5.15    52
Avg.     5.81    2.09    6.40    4.24    5.40    1.0
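The Avg. row can be cross-checked against the per-benchmark rows. In the sketch below the numbers are copied from the table. Only the function and variable names are made up. The capacitance entries in the Avg. row match the ratio of summed capacitances to the Contango2 total, and the skew entries are plain means.

```python
# (skew in ps, capacitance) per tool, copied from the table rows.
results = {
    "cns01": {"NTU": (7.16, 445), "CUHK": (7.23, 1168), "Contango2": (7.01, 198)},
    "cns02": {"NTU": (7.33, 934), "CUHK": (7.35, 2100), "Contango2": (7.34, 376)},
    "cns03": {"NTU": (4.88, 184), "CUHK": (3.95, 94), "Contango2": (4.18, 56)},
    "cns04": {"NTU": (4.09, 196), "CUHK": (7.25, 125), "Contango2": (4.46, 72)},
    "cns05": {"NTU": (3.81, 89), "CUHK": (7.27, 74), "Contango2": (4.41, 38)},
    "cns06": {"NTU": (7.49, 16), "CUHK": (6.79, 87), "Contango2": (6.05, 48)},
    "cns07": {"NTU": (6.24, 23), "CUHK": (5.97, 128), "Contango2": (4.58, 73)},
    "cns08": {"NTU": (5.47, 23), "CUHK": (5.37, 97), "Contango2": (5.15, 52)},
}


def summarize(results, baseline="Contango2"):
    """Mean skew per tool and total capacitance relative to the baseline tool."""
    tools = list(next(iter(results.values())))
    n = len(results)
    mean_skew = {t: sum(r[t][0] for r in results.values()) / n for t in tools}
    total_cap = {t: sum(r[t][1] for r in results.values()) for t in tools}
    cap_ratio = {t: total_cap[t] / total_cap[baseline] for t in tools}
    return mean_skew, cap_ratio


mean_skew, cap_ratio = summarize(results)
for tool in mean_skew:
    print(tool, "mean skew (ps):", mean_skew[tool], "cap ratio:", cap_ratio[tool])
```

Because the ratio is taken over totals, the two largest benchmarks (cns01 and cns02) dominate it; on cns06 to cns08 the NTU trees actually use less capacitance than Contango2.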

Skew Profiles for Contango2 & CNSrouter

■Probability density functions (PDF) for skew on ISPD’10 benchmarks

[Figure: skew PDFs for Contango2 and CNSrouter]

Trade-off - Power vs Robustness to Variations

■When local skew constraints are tight, large buffers ensure robustness but increase capacitance
− Much capacitance can be saved when local skew constraints are loose
■Experiments on ispd10cns08

[Figure: experimental results on ispd10cns08]

Summary

■A tree solution for CPU clock routing
− Improves power consumption under tight skew constraints in the presence of variation
− Clock trees can be tuned to have nominal skew below 5ps and low total skew in the presence of variation
− 4× capacitance improvement on average over mesh structures
■Our clock trees have a higher yield than meshes
− meshes are not as easy to tune for nominal skew

Thank you!!

Questions?
