L17: Logic Level Design
Sungkyunkwan University, Prof. Jun-Dong Cho, http://vlsicad.skku.ac.kr
Peak Power Reduction
• Peak power is related to EMI
• Reducing concurrent switching reduces peak power
– Adjust the delay in the bus/port drivers, staying within the system clock period
– Consider the power consumption of the delay elements
– Total power consumption is maintained; EMI improves because peak power is reduced
[Figure: total current Itotal versus time t for an n-bit-wide bus, before peak power reduction (energy E1) and after peak power reduction (energy E2).]
$E = \int V_{dd}\, I_{total}\, dt$
Factoring Example
• Function: f = ab + bc + cd
• The function f is not on the critical path.
• The signals a, b, c and d all have the same bit width.
• Signal b is a high-activity net.
• The two factorings below are equivalent in both timing and area.
• Net result: network toggling and power are reduced.
[Figure: gate-level implementations of the two factorings, f = b(a + c) + cd and f = ab + c(b + d).]
Block diagram of a low-voltage, high-speed LSI
• The Power Management Processor (PMP) controls the low-Vt circuit using the sleep signal.
• Extend the sleep period as much as possible, because leakage power is reduced during this time.
Operation of the low-Vt LSI
On a request signal from an I/O device, the LSI outputs the results and then waits for the next request signal. During the waiting period, the low-Vt circuit can sleep.
Waking/Sleeping operation
[Figure: waking operation and sleeping operation.]
Creating sleep period: Operation during calculation
• Heavy operations such as the voice CODEC and light operations such as data collection can be distributed between the low-Vt circuit and the PMP, so the low-Vt circuit can sleep while the PMP executes the light operations.
• This reduces power by 10%.
Interconnection power optimization
• Coding for reduced switching activity: introduce sample-to-sample correlation so that the total number of transitions is reduced.
• Coding scheme (n-bit data transmitted over m wires)
– Non-redundant: m = n. If the statistics of the data are known, other non-redundant schemes are possible.
– Redundant: m > n. A one-to-one or one-to-many mapping can be used.
One-Hot Coding
• The two chips are connected by m = 2^n wires.
• To send the value i, place ‘H’ on the i-th wire, where 0 ≤ i ≤ 2^n − 1; all other wires are ‘L’.
• Both the encoder and decoder are memoryless.
• Power reduction
– assume the n-bit data words are independent
– and uniformly randomly distributed
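With both powers proportional to $C V^2 f$ times the average number of transitions:
$$\frac{P_{\text{one-hot}}}{P_{\text{no coding}}} = \begin{cases} 1 & \text{for } n = 1 \\ \dfrac{4\,(1 - 2^{-n})}{n} & \text{for } n \ge 2 \end{cases}$$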
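The one-hot versus uncoded comparison can be checked by brute force for small n. The following is a minimal sketch, not part of the original slides: it assumes independent, uniformly distributed words and that every wire toggle costs the same energy, and it averages the number of toggles over all ordered pairs of consecutive words. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def popcount(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")


def avg_toggles_binary(n: int) -> Fraction:
    """Average wire toggles per transfer on an n-wire uncoded bus."""
    words = range(1 << n)
    total = sum(popcount(a ^ b) for a, b in product(words, repeat=2))
    return Fraction(total, (1 << n) ** 2)


def avg_toggles_one_hot(n: int) -> Fraction:
    """Average wire toggles per transfer on a 2**n-wire one-hot bus."""
    words = range(1 << n)
    total = sum(popcount((1 << a) ^ (1 << b)) for a, b in product(words, repeat=2))
    return Fraction(total, (1 << n) ** 2)


def one_hot_ratio(n: int) -> Fraction:
    """Toggle ratio (one-hot / uncoded), assumed proportional to the power ratio."""
    return avg_toggles_one_hot(n) / avg_toggles_binary(n)
```

Under these assumptions the ratio matches 4(1 − 2^−n)/n for n ≥ 2 and drops below 1 from n = 4 upward, at the cost of 2^n wires.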
Gray-Coding
• Adjacent numbers differ in only one bit.
• Useful when the transmitted data is sequential and highly correlated.
• Conversion
– B = <b_{n-1}, b_{n-2}, …, b_1, b_0> (binary number)
– G = <g_{n-1}, g_{n-2}, …, g_1, g_0> (Gray-coded number)
– binary to Gray-code conversion
• g_{n-1} = b_{n-1}, g_i = b_{i+1} ⊕ b_i (i = n-2, …, 0)
– Gray-code to binary conversion
• b_{n-1} = g_{n-1}, b_i = b_{i+1} ⊕ g_i (i = n-2, …, 0)
– example
• B = <1,1,0,1> gives G = <b_3, b_3 ⊕ b_2, b_2 ⊕ b_1, b_1 ⊕ b_0> = <1,0,1,1>
• G = <1,0,1,1> gives B = <g_3, g_3 ⊕ g_2, g_3 ⊕ g_2 ⊕ g_1, g_3 ⊕ g_2 ⊕ g_1 ⊕ g_0> = <1,1,0,1>
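The two conversions can be written compactly on integers. This is a minimal sketch with hypothetical function names: the shift-and-XOR form is the word-level equivalent of the per-bit rules, with ⊕ as bitwise XOR.

```python
def binary_to_gray(b: int) -> int:
    """g_i = b_{i+1} XOR b_i for every bit position, with the MSB copied."""
    return b ^ (b >> 1)


def gray_to_binary(g: int) -> int:
    """b_i is the XOR of all Gray bits at positions i and above."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def bit_distance(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")
```

For example, `binary_to_gray(0b1101)` gives `0b1011` and `gray_to_binary(0b1011)` gives `0b1101`, matching the example on the slide.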
Gray-Coding
• Application
– Coding the address lines for instruction access: the accesses are mostly sequential, so switching activity is reduced.
– If used for data address lines, the number of transitions is about the same as with the binary representation.
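[Figure: two bar charts of BPI (bit transitions per instruction), axis from 1.0 to 3.0, comparing binary and Gray-code addressing for the benchmarks qsort, reducer, circuit and semigroup. Bar values (binary / Gray code), first panel: 2.64 / 1.33, 2.57 / 1.71, 2.33 / 1.47, 2.68 / 1.99; second panel: 1.32 / 1.25, 1.47 / 1.40, 1.33 / 1.18, 1.38 / 1.34.]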
Bus-Invert Coding for Low Power I/O
An eight-bit bus on which all eight lines toggle at the same time and which has a high peak (worst-case) power dissipation.
• There are 16 transitions over 16 clock cycles (an average of 1 transition per clock cycle).
Peak Power Dissipation
An eight-bit bus on which the eight lines toggle at different moments and which has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles and thus the same average power dissipation.
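The difference between the two buses can be made concrete by counting transitions per clock cycle. The sketch below is illustrative only: the two toggle schedules are assumptions chosen to give 16 transitions over 16 cycles, and the function name is hypothetical.

```python
def transitions_per_cycle(toggle_cycles, n_cycles):
    """toggle_cycles[i] lists the clock cycles at which bus line i toggles."""
    counts = [0] * n_cycles
    for line in toggle_cycles:
        for cycle in line:
            counts[cycle] += 1
    return counts


N_LINES, N_CYCLES = 8, 16

# Assumed schedule 1: all eight lines toggle together, at cycles 0 and 8.
concurrent = [[0, 8] for _ in range(N_LINES)]
# Assumed schedule 2: line i toggles at cycles i and i + 8.
staggered = [[i, i + 8] for i in range(N_LINES)]

concurrent_counts = transitions_per_cycle(concurrent, N_CYCLES)
staggered_counts = transitions_per_cycle(staggered, N_CYCLES)
```

Both schedules total 16 transitions (the same average), but the peak per cycle is 8 in the first and 1 in the second.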
Bus-Invert - Coding for low power
• The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value; when invert = 1 the bus value is the inverted data value. The peak power dissipation can then be decreased by half by coding the I/O as follows:
• 1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.
• 2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).
• 3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).
• 4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
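The four steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the bus and the invert line are assumed to start at 0, n is assumed even, and all function names are hypothetical.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def bus_invert_encode(words, n=8):
    """Return a list of (bus_value, invert) pairs for a sequence of n-bit words."""
    mask = (1 << n) - 1
    bus, invert = 0, 0  # assumed initial state of the n + 1 lines
    encoded = []
    for data in words:
        # Step 1: distance to the candidate (data, invert = 0), counting the invert line.
        dist = popcount((bus ^ data) & mask) + invert
        if dist > n / 2:          # Step 2: send the inverted word
            bus, invert = ~data & mask, 1
        else:                     # Step 3: send the word unchanged
            bus, invert = data & mask, 0
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, n=8):
    """Step 4: conditionally invert at the receiver."""
    mask = (1 << n) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


def transitions(encoded):
    """Line toggles per transfer over all n + 1 lines, starting from all zeros."""
    prev_bus, prev_inv, counts = 0, 0, []
    for bus, inv in encoded:
        counts.append(popcount(bus ^ prev_bus) + (inv ^ prev_inv))
        prev_bus, prev_inv = bus, inv
    return counts
```

For an 8-bit bus, sending 0xFF after 0x00 toggles only the invert line, and no transfer toggles more than n/2 = 4 lines.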
Bus-Inversion Coding
• Redundant coding with m = n + 1
• If data word S is to be transmitted, either S or S’, the bit-wise inversion of S, can be transmitted.
• An extra wire P is used to indicate the polarity.
• The decoder is memoryless, and the encoder uses only the current state of the wires.
• Power saving
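With both powers proportional to $C V^2 f$ times the average number of transitions, for even n:
$$\frac{P_{\text{coding}}}{P_{\text{no coding}}} = \left(1 + \frac{1}{n}\right)\left(1 - \frac{1}{2^{n}}\binom{n}{n/2}\right)$$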
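The power-saving ratio for bus-inversion coding can be cross-checked against a direct expectation. This sketch assumes independent, uniformly distributed n-bit words and even n, and compares the closed form (1 + 1/n)(1 − C(n, n/2)/2^n) with an enumeration over the Hamming distance d between consecutive data words; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def bus_invert_ratio_formula(n: int) -> Fraction:
    """Closed-form transition ratio (coded / uncoded) for even n."""
    return (1 + Fraction(1, n)) * (1 - Fraction(comb(n, n // 2), 2 ** n))


def bus_invert_ratio_enumerated(n: int) -> Fraction:
    """Same ratio from first principles.

    d is Binomial(n, 1/2); the encoder toggles min(d, n + 1 - d) of the
    n + 1 wires (present invert line at 0; the other case is symmetric).
    """
    expected = sum(
        Fraction(comb(n, d), 2 ** n) * min(d, n + 1 - d) for d in range(n + 1)
    )
    return expected / Fraction(n, 2)  # the uncoded bus toggles n/2 wires on average
```

For n = 8 both functions give 837/1024, about 0.82, and the ratio approaches 1 as n grows.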
Example
A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions over a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.
Bus encoding
The same sequence of data coded using the Bus-Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot. The maximum number of transitions for any time slot is now 4.
Bus-Inversion Coding
• For large values of n, the benefit of the coding technique disappears as the ratio converges to 1.
• Dividing a wide bus into smaller groups of bits, each coded separately, works better.
[Figure: (bit transitions with encoding) / (bit transitions without encoding), on an axis from 0.7 to 1.0, versus data word width from 4 to 32 bits.]
Comparisons
Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time-slot, per bus-line per time-slot, and I/O power dissipation for different bus-widths.
Remarks
• The increase in the delay of the data path: looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolute lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered.
• The increased number of I/O pins: as mentioned before, ground bounce is a big problem for simultaneous switching in high-speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Circuits using the Bus-Invert method can therefore use fewer Vdd and GND pins, and the total number of pins might even decrease.
• The Bus-Invert method decreases the total power dissipation although both the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly, more on the low-capacitance side and fewer on the high-capacitance side.
Lower Power Data Encoding
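• S. S. Chun and J. D. Cho, ’97
• A method that restructures the Huffman code to reduce the number of switching events while keeping the compression ratio produced by the Huffman coding algorithm.
• For substreams that share many common subsequences, a coding scheme with few switching events, such as Gray code, is adopted.
• For RISC instruction addressing, Gray-code addressing gives up to 50% power reduction compared with binary-code addressing.
Gray Code
• Define the Hamming distance between two n-dimensional (n-bit) vectors U = u_1, u_2, … , u_n and V = v_1, v_2, … , v_n as $h(U,V) = \sum_{i=1}^{n} (u_i \oplus v_i)$, where (u_i ⊕ v_i) is 1 if the bits u_i and v_i differ and 0 otherwise. This can also be expressed as the distance travelled along the edges of the n-dimensional hypercube G. Gray code = shortest path in G.
• Because Huffman codewords can differ in length and must remain prefix-free, converting them into an exact Gray code is impossible; a compression coding that minimizes the number of bit changes is needed instead.
2-D Traveling Salesman Problem
• The proposed problem assigns code pairs with a small Hamming distance to character pairs with a high adjacency frequency, so it is formulated as a new problem that handles two or more TSPs simultaneously.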
• Using heuristic: 10% reduction in switching activity for random un-correlated data
Data Representation
• 2’s complement
– Most signal processing uses 2’s complement.
– Significant switching activity when the signal changes from negative to positive (the MSBs toggle).
• Sign-magnitude
– Only one bit toggles when the signal switches sign, if the dynamic range of the signal does not span the entire bit width.
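The contrast between the two representations shows up on a small signal that oscillates around zero. The following is a minimal sketch with hypothetical function names and an assumed 8-bit word; the sample sequence is made up for illustration.

```python
def twos_complement(x: int, bits: int) -> int:
    """Encode x as a bits-wide 2's complement word."""
    return x & ((1 << bits) - 1)


def sign_magnitude(x: int, bits: int) -> int:
    """Encode x as a sign bit (MSB) followed by the magnitude."""
    sign = (1 << (bits - 1)) if x < 0 else 0
    return sign | abs(x)


def toggles(samples, encode, bits=8):
    """Total bit toggles between consecutive encoded samples."""
    codes = [encode(x, bits) for x in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


samples = [1, -1, 2, -2, 1, -1]  # assumed small signal crossing zero every sample
```

Going from 1 to -1 toggles 7 bits in 2's complement but only the sign bit in sign-magnitude; over the whole sequence the totals are 35 and 9.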
Data Representation
• ρ is the correlation coefficient of the data
– large ρ implies that the signal changes slowly and switches sign very infrequently
– negative ρ implies that the signal changes frequently from positive to negative
[Figure: two panels of transition probability (0.0 to 1.0) versus bit number (0 to 14), each with curves for correlation coefficients -0.99, -0.50, 0, 0.50 and 0.99.]
DesignPower: inputs & outputs
[Figure: DesignPower takes a gate-level netlist, a library, and switching activity information (from VHDL or Verilog RTL simulation or gate-level simulation) and produces a power report covering the total design, modules, individual nets and individual cells.]
Switching Activity Generation - RTL
• Activity of the synthesis invariant nodes is captured during RTL simulation
– sequential outputs, hierarchical boundaries, black-box pins
• Utilizes a zero-delay cycle-based propagation engine
• Same activity is used for both analysis and optimization
• New switching activity is required when the synthesis invariant behavior is changed
Switching Activity: RTL vs. Gate-Level
• RTL Switching Activity:
– Available early in the design process
– Fast
– Accurate
– Does not account for glitches
– Does not fully support state- and path-dependency
• Gate-Level Switching Activity:
– Very accurate
– Accounts for glitches
– State- and path-dependency support
– Requires lengthy gate-level simulation
– Usually done at the later stages of the design process
[Figure: chip block diagram with mega cells: memory, µP, µC, A/D, D/A, DMA, S/P, P/S and control logic.]
PowerGate for Detailed Power
• Power verification at the later stages of the design cycle
– Ensure that power budget and constraints are satisfied
– Time-based, peak power and time-average power at user-defined intervals
– Identify power-hungry vectors / instructions
– Isolate power problems in time
[Figure: design flow with RTL design, Power Compiler (RTL clock gating), Design Compiler, Power Compiler, DesignPower, PowerGate and place & route, leading to a power-optimized design.]
• The average power consumption looks OK, yet is there a problem with the memory?
• Is the memory cycle valid? (address collision)
• Is there data contention? (are both ports in the read mode?)
[Figure: a dual-port RAM on a common data bus, with Address 1 driven by Control Logic 1 and Address 2 driven by Control Logic 2.]
Identify Excessive Power In Time
[Figure: power versus time, with the average power level marked.]
Power Compiler @ RTL
Push-button reduction in power at the RT-Level
[Figure: flow with RTL source, Power Compiler clock gating (elaborate -gate_clock), un-mapped netlist + constraints, and Design Compiler.]
RTL Clock-Gating
• No changes required to the RTL code
• Can deliver significant reduction in power
• Power reduction is design dependent
• We have seen 30% - 60% power reduction in some designs
Downstream Dependencies
• Logic Synthesis
• Testability
• Clock Tree Synthesis
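Automatic Clock-Gating @ RTL
RTL source, identical for both implementations: always @(posedge CLK) if (EN) D_out <= D_in;
• Synchronous-load-enable implementation (elaborate)
[Figure: the FSM drives EN; the register bank is clocked by CLK and loads D_in into D_out only when EN is set.]
• Gated clock implementation (elaborate -gate_clock)
[Figure: the FSM drives EN through a latch; the latched enable gates CLK to produce G_CLK, which clocks the register bank from D_in to D_out.]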
Clock-Gating @ RTL - Power Savings
Power savings by clock-gating
• Reduced internal power consumption at the clock-gated flip-flops
• No need for muxes to re-circulate the data for these flip-flops (saves power and area)
• Reduced power consumption by the clock network
Power saving dependency
• # of load-enable registers
• % of disabled cycles
[Figure: gated-clock implementation (FSM, latch, EN, G_CLK, register bank, D_in, D_out) with callouts 1, 2 and 3.]
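The dependence of the savings on the percentage of disabled cycles can be illustrated with a cycle-level behavioural model. This is a sketch under simplifying assumptions (no latch timing, one register, clock edges used as a proxy for clock power); the function name and the input traces are hypothetical.

```python
def simulate(d_in, en, gated):
    """Return (outputs, clock_edges_seen_by_register) for one register bank.

    gated=False models the synchronous-load-enable style: the clock arrives
    every cycle and the old value is recirculated when EN is 0.
    gated=True models the gated-clock style: the register only sees a clock
    edge in cycles where EN is 1.
    """
    q, edges, outputs = 0, 0, []
    for d, e in zip(d_in, en):
        if gated:
            if e:
                edges += 1
                q = d
        else:
            edges += 1
            q = d if e else q
        outputs.append(q)
    return outputs, edges


d_in = [1, 2, 3, 4, 5, 6, 7, 8]
en = [1, 0, 0, 1, 0, 0, 0, 1]  # assumed enable trace: 5 of 8 cycles disabled

out_enable, edges_enable = simulate(d_in, en, gated=False)
out_gated, edges_gated = simulate(d_in, en, gated=True)
```

Both styles produce the same outputs, but the gated register sees 3 clock edges instead of 8, which is where the flip-flop and clock-network savings come from in this model.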
Clock-Gating Styles
Extensive user control
• Latch-based or latch-free gating style
• Which register banks to gate or exclude from gating
• Positive (AND) or negative (OR) gating logic
• Minimal bit-width of gated registers
[Figure: three gating circuits with inputs EN and CLK and output GCLK: latch-free {OR}, latch-free {INV NAND BUF}, and latch-based {NAND INV}.]
Clock-Gating @ RTL - Dependencies
• Logic Synthesis
– Power Compiler automatically generates set-up and hold constraints on the gating element
– Combinatorial set-up and hold checks are performed by DC
• Testability
– Medium and high testability options for controllability & observability of the enable signal
– Test Compiler and DC XP can handle the gating circuitry during rule-checking and ATPG
• Clock-Tree-Synthesis
– Supported by many ASIC vendors and tool providers
– Contact your vendor for details
Clock-Gating - Medium Testability
• TEST_MODE enables override of clock-gating during scan-in and scan-out
• Asserting TEST_MODE during the parallel mode will make FSM faults un-testable
[Figure: gated-clock circuit (FSM, latch, EN, G_CLK, register bank, D_in, D_out) with a TEST_MODE input.]
Clock-Gating - High Testability
[Figure: gated-clock circuit (FSM, latch, EN, G_CLK, register bank, D_in, D_out) with a TEST_MODE input and an observability register fed by EN and other observability nodes.]
• All FSM faults are testable
• Testability logic does not consume power
• Higher area cost
Power Compiler @ Gate-Level
[Figure: Power Compiler, used with Design Compiler (dc_shell> compile -incremental), takes a gate-level netlist, switching activity, constraints (timing, power, area), parasitics (capacitance) and the technology library, and produces a power-optimized gate-level netlist.]
Power Compiler @ Gate-Level
• Optimizes power simultaneously with area and timing
• New optimization technologies added for power
– Activity-based optimizations minimize power subject to power constraints
– Power added to the synthesis optimization cost function
– 10% - 20% push-button reduction in power
• Works within timing constraints
– no increase in negative slack
• Requires synthesis libraries updated for power
• Completely integrated with Links-to-Layout methodology
Optimization Priorities
Power Compiler works within the specified timing constraints
• Cost type (constraints), from highest to lowest priority
– Design Rule (Max Trans, Max Fanout)
– Delay (Clock Period, Max_delay, Min_delay)
– Dynamic Power (Max Dynamic Power)
– Leakage Power (Max Leakage Power)
– Area (Max Area)
• The optimization priorities are hard coded
• Try tightening/loosening the constraints to get the required speed/power/area trade-offs
Cell Sizing Example
Delay (a,f) : reqd = 4, actual = 3.5
Cload: f = 3; n1 = 2.5, n2 = 1.5
TR: a, b = .25, c, d = .5
=> n1 = .125, n2 = .25, f = .56
Power = 3.69
Delay (a,f) : reqd = 4, actual = 3.3
Cload: f = 4; n1, n2 = 2
TR: a, b = .25, c, d = .5
=> n1 = .125, n2 = .25, f = .56
Power = 4.125
Note: Internal power effects (i.e. edge rate) also considered
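[Figure: two versions of the same netlist (inputs a, b, c, d; internal nets n1, n2; output f) built from an2a and an2c cells, annotated with the critical path, the low activity net, and the cells that are sized up and sized down.]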
Factoring Example
Function: f = ab + bc + cd
• The function f is not on the critical path
• The signals a, b, c and d are all the same bit width
• Signal b is a high activity net
• The two implementations below are equivalent from both timing and area criteria
• Net result: network toggling and power are reduced
[Figure: gate-level implementations of f = b(a + c) + cd and f = ab + c(b + d).]
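That the two factorings compute the same function can be confirmed with an exhaustive truth-table check. This is a small sketch with hypothetical function names, treating each signal as a single bit.

```python
from itertools import product


def f_sum_of_products(a, b, c, d):
    """f = ab + bc + cd"""
    return (a & b) | (b & c) | (c & d)


def f_factoring_1(a, b, c, d):
    """f = b(a + c) + cd: signal b feeds a single gate input."""
    return (b & (a | c)) | (c & d)


def f_factoring_2(a, b, c, d):
    """f = ab + c(b + d): signal b feeds two gate inputs."""
    return (a & b) | (c & (b | d))


def equivalent(g, h):
    """True if g and h agree on all 16 input combinations."""
    return all(g(*bits) == h(*bits) for bits in product((0, 1), repeat=4))
```

Since the logic is identical, the choice between the factorings can be made on how many gate inputs the high-activity net b has to drive.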
Pin Swapping Example
[Figure: a gate with output f and input pins of capacitance Cpin = 1.5·C1 and Cpin = C1, driven by nets with toggle rates .4 and .8, shown before and after the nets are swapped between the pins.]
Move high toggle nets to lower capacitance pins
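The effect of the swap can be estimated by summing pin capacitance times toggle rate over the input pins. The sketch below uses the capacitances and toggle rates from the figure; which assignment is the starting point is an assumption, and the function name is hypothetical.

```python
def switched_capacitance(assignment):
    """Sum of (pin capacitance in units of C1) * (toggle rate) over all pins."""
    return sum(cpin * rate for cpin, rate in assignment)


# Assumed starting point: the high-toggle net (.8) sits on the 1.5*C1 pin.
before_swap = [(1.5, 0.8), (1.0, 0.4)]
# After swapping: the high-toggle net (.8) drives the C1 pin instead.
after_swap = [(1.5, 0.4), (1.0, 0.8)]

cost_before = switched_capacitance(before_swap)
cost_after = switched_capacitance(after_swap)
```

With these numbers the switched capacitance per cycle drops from about 1.6·C1 to 1.4·C1, with no change in logic function.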
Phase Assignment Example
Implementation tradeoff criteria:
toggle rates of inputs and outputs
pin capacitance of library cell
Solution requires:
dynamic power cost function
actual toggle rates
accurate cell libraries
[Figure: two implementations built around a 2:1 mux with inputs A (TR = .7) and B (TR = .3), one with area = 7 and one with area = 6.]
Push-Button Power Reduction
Intel Success (Presented by Intel at SNUG 1998)
• A graphics chip for which both power and area are critical, synthesized to 0.35 library at 3.3 Volts.
• Achieved 12%, 21% and 24% reduction in power on 3 blocks with 2% or less area increase.
Lucent Success
• An ISDN Transceiver ASIC, 40K gates block, synthesized to 0.35 library
• Achieved 12% push-button power reduction with 3.3% area increase
ASIC Methodology
[Figure: ASIC flow in three stages: design exploration, design implementation and physical design. Blocks: RTL design, RTL simulation, Power Compiler (RTL clock gating), Design Compiler, Power Compiler, gate simulation, DesignPower, PowerGate and place & route, leading to a power-optimized design. Data passed between blocks: RTL SA, SA, Cap. and SNPS.db. Side annotations: speed, accuracy, diagnosis.]
Links-to-Layout for Power
• Before: timing constraints not met
• After: timing constraints met
• Lowest power implementation
[Figure: loop between Power Compiler, Floorplan Manager and physical design, exchanging PDEF, set_load and SDF, with a "Met constraints?" check (yes / no).]
The lowest power silicon within your timing constraints