L17: Logic Level Design

Prof. Jun-Dong Cho, Sungkyunkwan University, http://vlsicad.skku.ac.kr

Peak Power Reduction

• Peak power is related to EMI.

• Reducing concurrent switching reduces peak power.

– Adjust the delay in the bus/port drivers, within the limit set by the system clock speed.

– Account for the power consumed by the delay elements.

– With total power consumption unchanged, reducing peak power improves EMI.

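• Before Peak Power Reduction

[Figure: I_total versus t for an n-bit-wide bus, all bits switching together; energy E1]

• After Peak Power Reduction

[Figure: I_total versus t for the same n-bit-wide bus with the switching spread over an interval labelled (n-1)/… in the source; energy E2]

$$E = \int_t V_{dd}\, I_{total}\, dt$$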
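The effect of spreading the switching can be sketched numerically. This is a minimal illustration, not the author's model: the pulse shape, the supply voltage `VDD`, the time step `DT` and the one-step skew per bit are all assumed values, and the power of the delay elements is ignored.

```python
def total_current(n_bits, pulse, offsets):
    """Sum identical driver current pulses, each delayed by its own offset."""
    length = max(offsets) + len(pulse)
    total = [0.0] * length
    for k in range(n_bits):
        for t, i in enumerate(pulse):
            total[offsets[k] + t] += i
    return total

VDD = 3.3                 # assumed supply voltage (V)
DT = 1.0                  # assumed time step (arbitrary units)
PULSE = [1.0, 2.0, 1.0]   # assumed current pulse of one driver
N = 8                     # bus width

def energy(i_total):
    """E = sum over time of VDD * I_total * dt (discrete form of the integral)."""
    return VDD * sum(i_total) * DT

before = total_current(N, PULSE, [0] * N)        # all bits switch at once
after = total_current(N, PULSE, list(range(N)))  # one time step of skew per bit

print("peak before/after:", max(before), max(after))
print("energy before/after:", energy(before), energy(after))
```

In this sketch the peak of I_total drops while the energy stays the same, which is the trade the slide describes.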

Factoring Example

Function: f = ab + bc + cd

• The function f is not on the critical path.

• The signals a, b, c and d all have the same bit width.

• Signal b is a high-activity net.

• The two factorings below are equivalent in both timing and area.

f = b(a + c) + cd

f = ab + c(b + d)

Net result: network toggling and power are reduced.

[Figure: gate networks for the two factorings, with inputs a, b, c, d and output f]

Block diagram of a low-voltage, high-speed LSI

• The Power Management Processor (PMP) controls the low-Vt circuit using the sleep signal.

• Extend the sleep period as much as possible, because leakage power is reduced during this time.

Operations of the low-Vt LSI

The LSI receives a request signal from an I/O device, outputs the results, and waits for the next request signal. During the waiting period, the low-Vt circuit can sleep.

Waking/Sleeping operation

[Figure: waking operation and sleeping operation]

Creating a sleep period: operation during calculation

• Heavy operations such as voice CODEC and light operations such as data collection can be distributed between the low-Vt circuit and the PMP; the low-Vt circuit can sleep while the PMP executes the light operations.

• This reduces power by 10%.

Interconnection power optimization

• Coding for reduced switching activity: introduce sample-to-sample correlation so that the total number of transitions is reduced.

• Coding scheme (n-bit data transmitted over m wires)

– Non-redundant: m = n. With knowledge of the data statistics, other non-redundant schemes are possible.

– Redundant: m > n. A one-to-one or one-to-many mapping can be used.

One-Hot Coding

• The two chips are connected by m = 2^n wires.

• Place ‘H’ on the i-th wire, where 0 <= i <= 2^n - 1; all other wires are ‘L’.

• Both the encoder and decoder are memoryless.

• Power reduction

– assume n-bit data words are independent

– uniformly randomly distributed

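$$\frac{P_{\text{one-hot}}}{P_{\text{no coding}}} = \begin{cases} 1 & \text{for } n = 1 \\ \dfrac{4\,(1 - 2^{-n})}{n} & \text{for } n \ge 2 \end{cases}$$

Both powers are proportional to $f C V^2$ times the expected number of transitions.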
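The one-hot power ratio for independent, uniformly distributed n-bit words can be checked by brute-force enumeration. The sketch below assumes equal capacitance on every wire, so power is proportional to the expected number of toggling wires; the function names are illustrative. For n >= 2 it compares the enumerated ratio with 4(1 - 2^-n)/n.

```python
from itertools import product

def expected_transitions_binary(n):
    """Average Hamming distance between two independent uniform n-bit words."""
    words = range(2 ** n)
    total = sum(bin(a ^ b).count("1") for a, b in product(words, repeat=2))
    return total / (2 ** n) ** 2

def expected_transitions_one_hot(n):
    """Average number of toggling wires when each word drives one of 2^n wires high."""
    words = range(2 ** n)
    total = sum(bin((1 << a) ^ (1 << b)).count("1")
                for a, b in product(words, repeat=2))
    return total / (2 ** n) ** 2

def one_hot_ratio(n):
    return expected_transitions_one_hot(n) / expected_transitions_binary(n)

for n in (2, 4, 8):
    print(n, one_hot_ratio(n), 4 * (1 - 2 ** -n) / n)
```

The ratio is above 1 for n = 2 and falls below 1 as n grows, at the cost of 2^n wires.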

Gray-Coding

• Adjacent numbers differ in only one bit.

• Useful when the transmitted data is sequential and highly correlated.

• Conversion

– B = <b_{n-1}, b_{n-2}, …, b_1, b_0> (binary number)

– G = <g_{n-1}, g_{n-2}, …, g_1, g_0> (Gray-coded number)

– binary to Gray-code conversion

• g_{n-1} = b_{n-1}, g_i = b_{i+1} ⊕ b_i (i = n-2, …, 0)

– Gray-code to binary conversion

• b_{n-1} = g_{n-1}, b_i = b_{i+1} ⊕ g_i (i = n-2, …, 0)

– example

• B = <1,1,0,1> → G = <b3, b3 ⊕ b2, b2 ⊕ b1, b1 ⊕ b0> = <1,0,1,1>

• G = <1,0,1,1> → B = <g3, g3 ⊕ g2, g3 ⊕ g2 ⊕ g1, g3 ⊕ g2 ⊕ g1 ⊕ g0> = <1,1,0,1>
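The two conversions can be written compactly on integers. This is a small sketch using the usual shift-and-XOR form, which is equivalent to the bitwise rules g_i = b_{i+1} XOR b_i and b_i = b_{i+1} XOR g_i; the function names are illustrative.

```python
def binary_to_gray(b: int) -> int:
    """g_i = b_{i+1} XOR b_i for every bit, with the MSB copied unchanged."""
    return b ^ (b >> 1)

def gray_to_binary(g: int) -> int:
    """b_i is the XOR of g_i and all higher Gray bits."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# The slide's example: B = 1101 <-> G = 1011
print(format(binary_to_gray(0b1101), "04b"))
print(format(gray_to_binary(0b1011), "04b"))

# Consecutive values differ in exactly one bit after Gray coding
one_bit_steps = all(
    bin(binary_to_gray(i) ^ binary_to_gray(i + 1)).count("1") == 1
    for i in range(15)
)
print(one_bit_steps)
```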

Gray-Coding

• Application

– Coding the address lines for instruction access: the accesses are sequential, so switching activity is reduced.

– If used for data address lines, the number of transitions is about equal to that of the binary representation.

[Figure: bar charts of BPI (bit transitions per instruction), axis 1.0 to 3.0, comparing binary and Gray code for the benchmarks qsort, reducer, circuit and semigroup. Value pairs as extracted, in source order: 2.64/1.33, 2.57/1.71, 2.33/1.47, 2.68/1.99, then 1.32/1.25, 1.47/1.40, 1.33/1.18, 1.38/1.34.]

Bus-Invert Coding for Low Power I/O

An eight-bit bus on which all eight lines toggle at the same time has a high peak (worst-case) power dissipation.

• There are 16 transitions over 16 clock cycles (an average of 1 transition per clock cycle).

Peak Power Dissipation

An eight-bit bus on which the eight lines toggle at different moments has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles, and thus the same average power dissipation.

Bus-Invert - Coding for low power

• The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value, and when invert = 1 the bus value is the inverted data value. The peak power dissipation can then be halved by coding the I/O as follows:

• 1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.

• 2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).

• 3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).

• 4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
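The four steps can be sketched as a small encoder and decoder. This is one reading of the rule, not a reference implementation: "counting the present invert line" is taken to mean that the current invert bit is added to the Hamming distance, the bus is assumed to start at all zeros with invert = 0, and the function names are illustrative.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def bus_invert_encode(data, n):
    """Return a list of (bus_value, invert) pairs for a sequence of n-bit words."""
    mask = (1 << n) - 1
    bus, invert = 0, 0                        # assumed initial bus state
    coded = []
    for word in data:
        # Step 1: distance from the present bus (plus invert line) to (word, invert=0)
        dist = hamming(bus, word) + invert
        if dist > n / 2:                      # step 2: send the inverted word
            bus, invert = ~word & mask, 1
        else:                                 # step 3: send the word unchanged
            bus, invert = word, 0
        coded.append((bus, invert))
    return coded

def bus_invert_decode(coded, n):
    """Step 4: conditionally invert the bus contents at the receiver."""
    mask = (1 << n) - 1
    return [(~bus & mask) if inv else bus for bus, inv in coded]

def transitions_per_slot(coded):
    """Toggling wires per time slot, including the invert line."""
    prev_bus, prev_inv = 0, 0
    counts = []
    for bus, inv in coded:
        counts.append(hamming(prev_bus, bus) + (prev_inv ^ inv))
        prev_bus, prev_inv = bus, inv
    return counts

data = [0x00, 0xFF, 0x00, 0x0F, 0xF0, 0xAA]
coded = bus_invert_encode(data, 8)
print(transitions_per_slot(coded))
print(bus_invert_decode(coded, 8) == data)
```

With n = 8, no time slot toggles more than 4 of the 9 wires in this sketch, whereas the uncoded sequence toggles all 8 data lines in several slots.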

Bus-Inversion Coding

• Redundant coding with m = n + 1.

• If data word S is to be transmitted, either S or S’, the bit-wise inversion of S, can be transmitted.

• An extra wire P is used to indicate the polarity.

• The decoder is memoryless, and the encoder uses only the current state of the wires.

• Power saving

$$\frac{P_{\text{coding}}}{P_{\text{no coding}}} = \left(1 + \frac{1}{n}\right)\left(1 - \frac{1}{2^{n}}\binom{n}{n/2}\right)$$

Both powers are proportional to $f C V^2$ times the expected number of transitions.

Example

A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions over a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.

Bus encoding

The same sequence of data coded using the Bus-Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot. The maximum number of transitions for any time slot is now 4.

Bus-Inversion Coding

• For large values of n, the efficacy of coding technique disappears as the ratio converges to 1.

• Dividing large bit groups to smaller groups is better.

[Figure: (bit transitions with encoding) / (bit transitions without encoding), axis 0.7 to 1.0, versus data word width from 4 to 32 bits]
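The trend with word width can be reproduced by computing the expected number of toggling wires directly from the encoding rule. This sketch assumes independent, uniformly distributed data words, even n, and equal capacitance on all n + 1 wires; the present invert line is taken as 0, and the invert = 1 case gives the same expectation by symmetry.

```python
from math import comb

def bus_invert_ratio(n: int) -> float:
    """Expected transitions with bus-invert coding divided by the uncoded n/2."""
    # h = Hamming distance between present bus and next word, Binomial(n, 1/2).
    # The encoder toggles min(h, n + 1 - h) wires (invert line included).
    expected_coded = sum(comb(n, h) * min(h, n + 1 - h)
                         for h in range(n + 1)) / 2 ** n
    expected_uncoded = n / 2
    return expected_coded / expected_uncoded

for n in (2, 4, 8, 16, 32):
    print(n, round(bus_invert_ratio(n), 3))
```

The ratio creeps towards 1 as n grows, which is why splitting a wide bus into smaller groups, each with its own invert line, helps.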

Comparisons

Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time-slot, per bus-line per time-slot, and I/O power dissipation for different bus-widths.

Remarks

• The increase in the delay of the data-path: looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolutely lower number of transitions. It is also relatively easy to pipeline the bus activity; the extra pipeline stage and the extra latency must then be considered.

• The increased number of I/O pins: as mentioned before, ground bounce is a big problem for simultaneous switching in high-speed designs, which is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Circuits using the Bus-Invert method can therefore use fewer Vdd and GND pins, and the total number of pins might even decrease.

• The Bus-Invert method decreases the total power dissipation even though both the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly: more on the low-capacitance side and fewer on the high-capacitance side.

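Lower Power Data Encoding

• S.S.Chun and J.D.Cho’97

• A method that restructures the Huffman code to reduce the number of switching events while keeping the compression ratio produced by the Huffman coding algorithm.

• For substreams that share many common subsequences, adopt an encoding with few switching events, such as Gray code.

• Among RISC instruction addressing schemes, Gray-code addressing gives up to 50% power reduction compared with binary-code addressing.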

Gray Code

• Define the Hamming distance between two n-dimensional (n-bit) vectors U = u_1, u_2, …, u_n and V = v_1, v_2, …, v_n as h(U,V) = Σ_{i=1}^{n} (u_i ⊕ v_i), where (u_i ⊕ v_i) is 1 if the corresponding bits of U and V differ and 0 otherwise. This can also be expressed as the distance travelled along the edges of an n-dimensional hypercube G. Gray code = shortest path in G

• Because Huffman codewords may have different lengths and must remain prefix-free, conversion to an exact Gray code is impossible, so a compression coding that minimizes the number of bit changes is needed.

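2-D Traveling Salesman Problem

• The proposed problem assigns code pairs with a small Hamming distance to character pairs with a high adjacency frequency, so it is expressed as a new problem that handles two or more TSPs simultaneously.

• Using a heuristic: 10% reduction in switching activity for random uncorrelated data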

Data Representation

• 2’s complement

– Most signal processing uses 2’s complement.

– Significant switching activity occurs when the signal changes from negative to positive (the MSBs toggle).

• Sign-magnitude

– Only one bit toggles when the signal switches sign, if the dynamic range of the signal does not span the entire bit width.
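The difference between the two representations shows up on a small signal that keeps crossing zero. The sketch below assumes 8-bit words and an illustrative sample sequence; sample magnitudes must fit in the 7 magnitude bits.

```python
def twos_complement(x: int, bits: int) -> int:
    """Bit pattern of x in two's complement."""
    return x & ((1 << bits) - 1)

def sign_magnitude(x: int, bits: int) -> int:
    """Bit pattern of x with a sign bit (MSB) and an unsigned magnitude."""
    sign = 1 << (bits - 1) if x < 0 else 0
    return sign | abs(x)

def toggles(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def total_toggles(samples, encode, bits):
    words = [encode(s, bits) for s in samples]
    return sum(toggles(a, b) for a, b in zip(words, words[1:]))

BITS = 8
samples = [1, -1, 2, -2, 1, -1]   # assumed small signal oscillating around zero

print("2's complement:", total_toggles(samples, twos_complement, BITS))
print("sign-magnitude:", total_toggles(samples, sign_magnitude, BITS))
```

Going from +1 to -1 flips 7 of 8 bits in 2's complement but only the sign bit in sign-magnitude.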

Data Representation

• ρ is the correlation coefficient of the data

– A large ρ implies that the signal changes slowly and switches sign very infrequently

– A negative ρ implies that the signal changes frequently from positive to negative

[Figure: two panels of transition probability (0.0 to 1.0) versus bit number (0 to 14), each with curves for ρ = -0.99, -0.50, 0, 0.50 and 0.99]

DesignPower: inputs & outputs

Inputs

• Gate-level netlist

• Library

• Switching activity information, from VHDL or Verilog RTL simulation or from VHDL or Verilog gate-level simulation

Output

• Power report: total design, modules, individual nets, individual cells

Switching Activity Generation - RTL

• Activity of the synthesis-invariant nodes is captured during RTL simulation

– sequential outputs, hierarchical boundaries, black-box pins

• Utilizes a zero-delay cycle-based propagation engine

• Same activity is used for both analysis and optimization

• New switching activity is required when the synthesis invariant behavior is changed

Switching Activity: RTL vs. Gate-Level

• RTL Switching Activity:

– Available early in the design process

– Fast

– Accurate

– Does not account for glitches

– Does not fully support state- and path-dependency

• Gate-Level Switching Activity:

– Very accurate

– Accounts for glitches

– State- and path-dependency support

– Requires lengthy gate-level simulation

– Usually done at the later stages of the design process

[Figure: mega cells: memory, µP, µC, A/D, D/A, DMA, S/P, P/S and control logic]

PowerGate for Detailed Power

• Power verification at the later stages of the design cycle

– Ensure that power budget and constraints are satisfied

– Time-based, peak power and time-average power at user-defined intervals

– Identify power-hungry vectors / instructions

– Isolate power problems in time

[Figure: flow from RTL Design through Design Compiler, Power Compiler (including RTL clock gating) and Place & Route to a power-optimized design, with DesignPower and PowerGate shown alongside]

• The average power consumption looks OK, yet is there a problem with the memory?

• Is the memory cycle valid? (address collision)

• Is there data contention? (are both ports in read mode?)

[Figure: dual-port RAM with Address 1 and Address 2, driven by Control Logic 1 and Control Logic 2, on a common data bus]

Identify Excessive Power In Time

[Figure: power versus time, with the average power marked]

Power Compiler @ RTL

Push-button reduction in power at the RT-Level

[Figure: RTL source, then Power Compiler clock-gating (elaborate -gate_clock), then un-mapped netlist + constraints, then Design Compiler]

RTL Clock-Gating

• No changes required to the RTL code

• Can deliver significant reduction in power

• Power reduction is design dependent

• We have seen 30% - 60% power reduction in some designs

Downstream Dependencies

• Logic Synthesis

• Testability

• Clock Tree Synthesis

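Automatic Clock-Gating @ RTL

RTL source (the same for both implementations):

always @(posedge CLK) if (EN) D_out <= D_in;

Synchronous-load-enable implementation (elaborate)

[Figure: the FSM drives EN; the register bank is clocked directly by CLK and loads D_in to D_out when EN is high]

Gated clock implementation (elaborate -gate_clock)

[Figure: the FSM drives EN through a latch; EN gates CLK to produce G_CLK, which clocks the register bank between D_in and D_out]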

Clock-Gating @ RTL - Power Savings

Power savings by clock-gating

• Reduced internal power consumption at the clock-gated flip-flops

• No need for muxes to re-circulate the data for these flip-flops (saves power & area)

• Reduced power consumption by the clock network

Power saving dependency

• # of load-enable registers

• % of disabled cycles

[Figure: gated-clock circuit (FSM, latch, EN, G_CLK, register bank with D_in and D_out) with callouts 1, 2 and 3]
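The dependence on the percentage of disabled cycles can be illustrated with a cycle-level behavioural model. This is a rough sketch, not a power estimate: it only counts how many clock edges reach the register bank, and the enable pattern and data values are assumed.

```python
def simulate(en_seq, d_seq, gated):
    """Return (D_out per cycle, number of clock edges seen by the register bank)."""
    q = 0
    clock_events = 0
    outputs = []
    for en, d in zip(en_seq, d_seq):
        if gated:
            if en:                    # G_CLK pulses only when EN is high
                clock_events += 1
                q = d
        else:
            clock_events += 1         # CLK reaches the registers every cycle
            q = d if en else q        # mux re-circulates the old value
        outputs.append(q)
    return outputs, clock_events

EN = [1, 0, 0, 0, 1, 0, 0, 1]         # assumed enable pattern (5 of 8 cycles disabled)
D = list(range(8))                    # assumed input data

out_load_enable, edges_load_enable = simulate(EN, D, gated=False)
out_gated, edges_gated = simulate(EN, D, gated=True)

print(out_load_enable == out_gated)   # same behaviour
print(edges_load_enable, edges_gated) # clock edges at the register bank
```

Both styles produce the same D_out sequence; the gated version clocks the registers only in enabled cycles, so the saving grows with the share of disabled cycles and the number of gated registers.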

Clock-Gating Styles

Extensive user control

Latch-based or latch-free gating style

Which register banks to gate or exclude from gating

Positive (AND) or negative (OR) gating logic

Minimal bit-width of gated registers

[Figure: gating circuits combining EN and CLK into GCLK for three styles: latch-free {OR}, latch-free {INV NAND BUF}, and latch-based {NAND INV}]

Clock-Gating @ RTL - Dependencies

• Logic Synthesis

– Power Compiler automatically generates set-up and hold constraints on the gating element

– Combinatorial set-up and hold checks are performed by DC

• Testability

– Medium and high testability options for controllability & observability of the enable signal

– Test Compiler and DC XP can handle the gating circuitry during rule-checking and ATPG

• Clock-Tree-Synthesis

– Supported by many ASIC vendors and tool providers

– Contact your vendor for details

Clock-Gating - Medium Testability

• TEST_MODE enables override of clock-gating during scan-in and scan-out

• Asserting TEST_MODE during the parallel mode will make FSM faults un-testable

[Figure: gated-clock circuit (FSM, latch, EN, G_CLK, register bank with D_in and D_out) with TEST_MODE added to the gating logic]

Clock-Gating - High Testability

[Figure: gated-clock circuit (FSM, latch, EN, G_CLK, register bank with D_in and D_out) with TEST_MODE and an observability register clocked by CLK, which also collects other observability nodes]

• All FSM faults are testable

• Testability logic does not consume power

• Higher area cost

Power Compiler @ Gate-Level

[Figure: Power Compiler within Design Compiler (dc_shell> compile -incremental). Inputs: gate-level netlist, switching activity, constraints (timing, power, area), parasitics (capacitance) and the technology library. Output: power-optimized gate-level netlist.]

Power Compiler @ Gate-Level

• Optimizes power simultaneously with area and timing

• New optimization technologies added for power

– Activity-based optimizations minimize power subject to power constraints

– Power added to the synthesis optimization cost function

– 10% - 20% push-button reduction in power

• Works within timing constraints

– no increase in negative slack

• Requires synthesis libraries updated for power

• Completely integrated with Links-to-Layout methodology

Optimization Priorities

Power Compiler works within the specified timing constraints

Cost Type        Constraints
Design Rule      Max Trans, Max Fanout
Delay            Clock Period, Max_delay, Min_delay
Dynamic Power    Max Dynamic Power
Leakage Power    Max Leakage Power
Area             Max Area

Priority: rows are listed from highest (design rule) to lowest (area).

• The optimization priorities are hard coded

• Try tightening/loosening the constraints to get the required speed/power/area trade-offs

Cell Sizing Example

First configuration

• Delay (a, f): required = 4, actual = 3.5

• Cload: f = 3; n1 = 2.5, n2 = 1.5

• TR: a, b = .25; c, d = .5 => n1 = .125, n2 = .25, f = .56

• Power = 3.69

Second configuration

• Delay (a, f): required = 4, actual = 3.3

• Cload: f = 4; n1, n2 = 2

• TR: a, b = .25; c, d = .5 => n1 = .125, n2 = .25, f = .56

• Power = 4.125

Note: Internal power effects (i.e. edge rate) also considered

[Figure: the two gate networks (inputs a, b, c, d; internal nets n1, n2; output f; cells an2a and an2c), annotated with the critical path, the low-activity net, and the cells that are sized up or sized down]

Factoring Example

Function: f = ab + bc + cd

• The function f is not on the critical path

• The signals a, b, c and d are all the same bit width

• Signal b is a high activity net

• The two implementations below are equivalent from both timing and area criteria

f = b(a + c) + cd

f = ab + c(b + d)

Net result: network toggling and power are reduced

[Figure: gate networks for the two implementations, with inputs a, b, c, d and output f]
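That the two factorings compute the same function as f = ab + bc + cd can be checked exhaustively. The sketch below also counts how often the high-activity signal b appears in each expression, as a crude stand-in for the number of gate inputs it drives; that counting shortcut is an illustrative assumption, not part of the source.

```python
from itertools import product

def f_sop(a, b, c, d):
    return (a & b) | (b & c) | (c & d)

def f_factoring_1(a, b, c, d):      # f = b(a + c) + cd
    return (b & (a | c)) | (c & d)

def f_factoring_2(a, b, c, d):      # f = ab + c(b + d)
    return (a & b) | (c & (b | d))

equivalent = all(
    f_sop(*v) == f_factoring_1(*v) == f_factoring_2(*v)
    for v in product((0, 1), repeat=4)
)
print(equivalent)

# Loads on net b, read off the expressions by counting occurrences of "b"
expressions = ["b(a + c) + cd", "ab + c(b + d)"]
b_loads = {expr: expr.count("b") for expr in expressions}
print(b_loads)
```

In the first factoring b feeds a single gate input, while in the second it feeds two, so the first form generally puts less capacitance on the high-activity net.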

Pin Swapping Example

[Figure: the same gate before and after pin swapping, with inputs a, b, c, d and output f. One input pin has Cpin = 1.5 C1 and another has Cpin = C1; the nets with toggle rates .4 and .8 exchange pins between the two versions.]

Move high-toggle nets to lower-capacitance pins
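The benefit of the swap can be estimated from the sum of toggle rate times pin capacitance. This is a minimal sketch using the slide's values (toggle rates .8 and .4, pin capacitances 1.5 C1 and C1); which assignment is "before" is an assumption, and internal cell power is ignored.

```python
def switched_capacitance(assignment):
    """Sum of toggle_rate * pin_capacitance over (toggle_rate, cpin) pairs."""
    return sum(toggle * cpin for toggle, cpin in assignment)

C1 = 1.0  # unit capacitance

# Assumed starting point: the high-toggle net sits on the high-capacitance pin
before = [(0.8, 1.5 * C1), (0.4, C1)]
# After swapping: the high-toggle net drives the low-capacitance pin
after = [(0.8, C1), (0.4, 1.5 * C1)]

print(switched_capacitance(before), switched_capacitance(after))
```

In units of C1 the switched capacitance drops from about 1.6 to about 1.4 with no change in logic function.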

Phase Assignment Example

Implementation tradeoff criteria:

• toggle rates of inputs and outputs

• pin capacitance of library cell

Solution requires:

• dynamic power cost function

• actual toggle rates

• accurate cell libraries

[Figure: two implementations built around a 2:1 mux with inputs A and B, labelled TR = .7 and TR = .3; one has area = 7 and the other area = 6. Other labels in the source: ?1, 6, 1, 5.]

Push-Button Power Reduction

Intel Success (presented by Intel at SNUG 1998)

• A graphics chip for which both power and area are critical, synthesized to a 0.35 µm library at 3.3 V.

• Achieved 12%, 21% and 24% reduction in power on 3 blocks with 2% or less area increase.

Lucent Success

• An ISDN transceiver ASIC, 40K-gate block, synthesized to a 0.35 µm library.

• Achieved 12% push-button power reduction with 3.3% area increase.

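ASIC Methodology

[Figure: ASIC flow from RTL Design through Design Compiler, Power Compiler (including RTL clock gating) and Place & Route to a power-optimized design. RTL simulation supplies RTL switching activity (RTL SA) and gate simulation supplies switching activity (SA); capacitance (Cap.) and the library (SNPS.db) are further inputs. DesignPower and PowerGate are shown as analysis tools. Phases: design exploration, design implementation, physical design. Axis labels: speed, accuracy, diagnosis.]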

Links-to-Layout for Power

• Before: timing constraints not met

• After: timing constraints met, lowest power implementation

[Figure: loop between Power Compiler, Floorplan Manager and physical design, exchanging PDEF, set_load and SDF data, with a "Met constraints?" decision (Yes / No)]

The lowest power silicon within your timing constraints
