Page 1: Lower Power Design Guide

Lower Power Design Guide

1998. 6.7

Prof. Jun Dong Cho, Sungkyunkwan University

http://vlsicad.skku.ac.kr

Page 2: Lower Power Design Guide

Contents

1. Introduction: Trends for High-Level Lower Power Design

2. Power Management: Clock/Cache/Memory Management

3. Architecture Level Design: Architecture Trade-offs, Transformation

4. RTL Level Design: Retiming, Loop Unrolling, Clock Selection, Scheduling, Resource Sharing, Register Allocation

5. Partitioning

6. Logic Level Design

7. Circuit Level Design

8. Quarter Sub-Micron Layout Design: Lower Power Clock Designs

9. CAD Tools

10. References

Page 3: Lower Power Design Guide

1. Introduction

Page 4: Lower Power Design Guide

Motivation

• Portable Mobile (=ubiquitous =nomadic)

• Systems with limited room for heat sinks

• Lowering power with fixed performance: DSPs in modems and cellular phones

• Reliability: increasing power → increasing electromigration; 40-year reliability guarantee (product life cycle of the telecommunication industry)

• Adding fans to remove heat causes reliability to plummet.

• Higher power leads to higher packaging costs: a 2-watt package can cost four times as much as a 1-watt package

• Myriad constraints: timing, power, testability, area, packaging, time-to-market.

• Ad-hoc design: lacks a systematic process leading to universal applicability.

Page 5: Lower Power Design Guide

Power!Power!Power!

Page 6: Lower Power Design Guide

Power Dissipation in VLSI’s

[Figure: pie charts of power dissipation by component (clock, memory, I/O, logic) for the chips defined below]

MPU1: low-end microprocessor for embedded use

MPU2: high-end CPU with large amount of cache

ASSP1: MPEG2 decoder

ASSP2: ATM switch

Page 7: Lower Power Design Guide

Current Design Issues in Lower Power Problem

Energy-hungry Function by Network Server:

• InfoPad (University of California, Berkeley), weight < 1 pound,

• 0.5 W (reflective color display) + 0.5 W (computation, communication, I/O support) = 1 W (for comparison, Alpha chip: 25 W; StrongARM at 215 MHz and 2.0 V: 0.3 W)

• runtime 50 hours, target: 100 MIPS/mW.

• Deep sub-micron (0.35 - 0.18 µm) with low voltage for a portable full-motion video terminal; 0.5 µm: 40 AA NiMH; 1 µm: 1 AA NiMH

• System-on-a-chip to reduce external interconnection capacitances

• Power management: shut down idle units

• Power optimization techniques in the software, architecture, logic/circuit, and layout phases to reduce operations, frequency, capacitance, and switching activity while maintaining the same throughput.

Page 8: Lower Power Design Guide

Battery Trends

Page 9: Lower Power Design Guide

Road-Map in Semiconductor Device Integration

Page 10: Lower Power Design Guide

Road-Map in Semiconductor Device Complexity

Page 11: Lower Power Design Guide

Power Component

• Static: leakage current (<< 1%)

• Dynamic:

– Short-circuit power (10-30%): short-circuit current flow during transitions

– Switching (or capacitive) power (70-90%): charging/discharging of capacitive loads during transitions

Page 12: Lower Power Design Guide

Vdd vs Delay

• Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.

• Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant).

Page 13: Lower Power Design Guide

Good Design Methodologies

Page 14: Lower Power Design Guide

Synthesis and Optimization

Pareto point

Page 15: Lower Power Design Guide

2. Power Management

Page 16: Lower Power Design Guide

Power Consumption in Multimedia Systems

• LCD: 54.1%, HDD 16.8%, CPU 10.7%, VGA/VRAM 9.6%, SysLogic 4.5%, DRAM 1.1%, Others: 3.2%

• 5-55 mode:
– Display mode (55 minutes): CPU is in sleep mode; LCD (VRAM + LCDC) active
– CPU mode (5 minutes): display is idle; look-up and data retrieval

• Handwriting recognition: biggest power consumer (memory and system bus active)

Page 17: Lower Power Design Guide

Power Management

• DPM (Dynamic Power Management): stops the clock switching of a specific unit. The clocks are produced by clock regenerators, which generate two clocks, C1 and C2. Logic overhead: 0.3%; power savings: 10-20%.

• SPM (Static Power Management): saves power dissipation in the steady mode. When the system (or subsystem) remains idle for a significant period of time, the entire chip (or subsystem) is shut down.

• Identify power hungry modules and look for opportunities to reduce power

• If f is increased, one has to increase the transistor size or Vdd.

Page 18: Lower Power Design Guide

Power Management

• Use the right supply and the right frequency for each part of the system. If one has to wait for the occurrence of some input, only a small circuit needs to wait; it wakes up the main circuit when the input occurs.

• Another technique is to reduce the basic frequency for tasks that can be executed slowly.

• PowerPC 603 is a 2-issue machine (2 instructions read at a time) with 5 parallel execution units. 4 modes:
– Full-on mode for full speed
– Doze mode, in which the execution units are not running
– Nap mode, which also stops the bus clocking
– Sleep mode, which stops the clock generator with or without the PLL (20-100 mW)

• Superpipelined MIPS R4200: 5-stage pipeline; MIPS R4400: 8 stages; 2 execution units; f/2 in reduced mode.

Page 19: Lower Power Design Guide

TI

• Two DSPs, the TMS320C541 and TMS320C542, reduce power, chip count, and system cost for wireless communication applications.

• C54x DSPs (2.7 V, 5 V), Low-Power Enhanced Architecture DSP (LEAD) family: with three different power-down modes, these devices are well suited for wireless communication products such as digital cellular phones, personal digital assistants, and wireless modems; low power for voice coding and decoding.

• The TMS320LC548 features:
– 15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
– 3.0- and 3.3-V operation
– 32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip
– Integrated Viterbi accelerator that reduces the Viterbi butterfly update to four instruction cycles for GSM channel decoding
– Powerful single-cycle instructions (dual operand, parallel instructions, conditional instructions)
– Low-power standby modes

Page 20: Lower Power Design Guide

Power Estimation Techniques

• Circuit simulation (SPICE): uses a set of input vectors; accurate, but limited by memory and time constraints

• Monte Carlo: randomly generated input patterns; normally distributed power per time interval T, measured with a simulator

• Switch-level simulation (IRSIM): activity defined as the number of rising and falling transitions over the total number of inputs

• PowerMill (transistor level): steady-state transitions, hazards and glitches, transient short-circuit current and leakage current; measures current density and voltage drop in the power net and identifies reliability problems caused by EM failures, ground bounce, and excessive voltage drops.

• DesignPower (Synopsys): simulation-based analysis is within 8-15% of SPICE in terms of percentage difference (probability-based analysis is within 15-20% of SPICE).
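The Monte Carlo approach can be illustrated with a toy example. This is only a sketch under stated assumptions and does not reproduce any of the tools named above: the three-node circuit `toy_circuit`, the node capacitance, the supply voltage, and the clock frequency are all hypothetical values chosen for illustration. Each node transition is charged an energy of 0.5 · C · Vdd².

```python
import random

def toy_circuit(a, b, c):
    """Hypothetical 3-input circuit: returns the values of its internal nodes."""
    n1 = a & b
    n2 = n1 ^ c
    n3 = ~(n2 | a) & 1          # NOR of n2 and a
    return (n1, n2, n3)

def monte_carlo_power(num_vectors, c_node=50e-15, vdd=3.3, f_clk=10e6, seed=0):
    """Estimate average switching power (watts) from random input vectors.

    c_node, vdd and f_clk are assumed values, not taken from the slides.
    """
    rng = random.Random(seed)
    prev = toy_circuit(0, 0, 0)
    transitions = 0
    for _ in range(num_vectors):
        inputs = [rng.randint(0, 1) for _ in range(3)]
        cur = toy_circuit(*inputs)
        # Count rising and falling transitions on every node.
        transitions += sum(p != q for p, q in zip(prev, cur))
        prev = cur
    avg_transitions_per_cycle = transitions / num_vectors
    return 0.5 * c_node * vdd ** 2 * f_clk * avg_transitions_per_cycle

estimate = monte_carlo_power(10_000)
print(f"estimated switching power: {estimate * 1e6:.2f} uW")
```

In practice the number of vectors would be chosen from the observed variance of the per-interval power rather than fixed in advance.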

Page 21: Lower Power Design Guide

Cache/Memory Management

• Clock and memory consume between 15% and 45% of the total power in digital computers.

• As block size increases, the energy required to service a miss increases due to increased memory access.

• External-memory access (530 mA) vs. on-chip access (300 mA): replace excessive accesses to background memory with accesses to foreground memory.

• Cache vertical partitioning (buffering): multi-level variable-size caches. Caches are powered down when idle.

• Cache horizontal partitioning (subarray access): several segments can be powered individually. Only the cache sub-bank where the requested data is located consumes power in each cache access.

• Use distributed memory instead of a single centralized memory.

• Exploit locality of reference to eliminate expensive data transfers across high-capacitance busses.

• Cache misses consume more energy (direct-mapped or k-way set-associative mapping?); page faults consume more energy.

Page 22: Lower Power Design Guide

Power Management

• Block power management (sleep, standby mode) scheme by enabling the clock

• Clock power management scheme by adding a clock generation block

[Figure: two block diagrams. In the first, clk reaches each block through its own enable signal (enable 1 to enable 3). In the second, a clock management block takes clk and enable 1 to enable 3 and drives the blocks.]

Page 23: Lower Power Design Guide

3. Architectural Level Design

Page 24: Lower Power Design Guide

Architectural-level Synthesis

• Translate HDL models into sequencing graphs.

• Behavioral-level optimization:
– Optimize abstract models independently from the implementation parameters.

• Architectural synthesis and optimization:
– Create macroscopic structure: data-path and control-unit.
– Consider area and delay information of the implementation.

• Hardware compilation:
– Compile HDL model into sequencing graph.
– Optimize sequencing graph.
– Generate gate-level interconnection for a cell library.

Page 25: Lower Power Design Guide

Power Measure of P

Page 26: Lower Power Design Guide

System-Level Solutions

• Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity

• Temporal locality: short average lifetimes of variables (less temporary storage; data referenced in the recent past is likely to be accessed again).

• Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses)

• Architecture-driven voltage scaling: choose a more parallel architecture

• Supply voltage scaling: lowering Vdd reduces energy but increases delay

Page 27: Lower Power Design Guide

Software Power Issues

Up to 40% of the on-chip power is dissipated on the buses!

• System software: OS, BIOS, compilers

• Software can affect energy consumption at various levels

Inter-Instruction Effects

• The energy cost of an instruction varies depending on the previous instruction

• For example, XOR BX, 1; ADD AX, DX

• I_est = (319.2 + 313.6)/2 = 316.4 mA; I_obs = 323.2 mA

• The difference is defined as the circuit state overhead

• Need to specify the overhead as a function of pairs of instructions

• Further overhead is due to pipeline stalls and cache misses

• Instruction reordering to improve cache hit ratio
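The circuit state overhead in the example above is simple arithmetic: the estimate is the average of the two per-instruction base currents (319.2 mA and 313.6 mA), and the overhead is the observed current minus that estimate. A minimal sketch, with a hypothetical helper name:

```python
def circuit_state_overhead(base_currents_ma, observed_ma):
    """Return (estimated current, overhead) in mA for an instruction sequence.

    The estimate is the mean of the per-instruction base currents; the
    overhead is whatever the measurement shows on top of that.
    """
    estimated = sum(base_currents_ma) / len(base_currents_ma)
    return estimated, observed_ma - estimated

i_est, overhead = circuit_state_overhead([319.2, 313.6], 323.2)
print(f"I_est = {i_est:.1f} mA, overhead = {overhead:.1f} mA")
# prints: I_est = 316.4 mA, overhead = 6.8 mA
```

A table of such overheads, indexed by instruction pair, is what the "function of pairs of instructions" bullet calls for.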

Page 28: Lower Power Design Guide

Avoiding Wasteful Computation

• Preservation of data correlation

• Distributed computing / locality of reference

• Application-specific processing

• Demand-driven operation

• Bus-Inverted Coding

• Transformation for memory size reduction
– Consider arrays A and C already available in memory.
– When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced.
– Memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays.
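Bus-invert coding, listed among the techniques above, can be sketched as follows. The slides give no details, so this follows the usual formulation as an assumption: if sending a word would toggle more than half of the data lines, its complement is sent instead and one extra invert line is raised. The function names and the 8-bit width are illustrative.

```python
def bus_invert_encode(words, width=8):
    """Return (bus_value, invert_flag) pairs for a sequence of words."""
    mask = (1 << width) - 1
    prev_bus = 0
    encoded = []
    for word in words:
        toggles = bin((word ^ prev_bus) & mask).count("1")
        if toggles > width // 2:
            bus, invert = ~word & mask, 1   # send the complement
        else:
            bus, invert = word & mask, 0
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded

def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]

def data_line_toggles(bus_values, width=8):
    """Count data-line transitions for a sequence of bus values, starting from 0."""
    mask = (1 << width) - 1
    prev, total = 0, 0
    for value in bus_values:
        total += bin((value ^ prev) & mask).count("1")
        prev = value
    return total

words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words)
plain = data_line_toggles(words)                        # 20 data-line toggles
coded = data_line_toggles([bus for bus, _ in encoded])  # 4 data-line toggles
```

The invert line itself also toggles (three times in this example), so the net saving is slightly smaller than the data-line counts suggest.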

Page 29: Lower Power Design Guide

Avoiding Wasteful Computation

Page 30: Lower Power Design Guide

Architecture Lower Power Design

• Optimum supply voltage architecture through hardware duplication (trading area for lower power) and/or pipelining
– Fewer, more complex instructions require less encoding but larger decode logic!

• Use small complex instructions with shorter instruction length (e.g., Hitachi SH: 16-bit fixed-length, arithmetic instructions use only two operands; NEC V800: variable-length instructions, with decoding overhead)

• Superscalar: CPI < 1, parallel instruction execution. VLIW architecture.

Page 31: Lower Power Design Guide

Variable Supply Voltage Block Diagram

• Computational work varies with time. An approach to reduce the energy consumption of such systems beyond shut down involves the dynamic adjustment of supply voltage based on computational workload.

• The basic idea is to lower the power supply when a design with a fixed supply would be idle for some fraction of the time.

• The supply voltage and clock rate are increased during high-workload periods.

Page 32: Lower Power Design Guide

Power Reduction using Variable Supply

• Circuits with a fixed supply voltage work at a fixed speed and idle if the data sample requires less than the maximum amount of computation. Power is reduced in a linear fashion since the energy per operation is fixed.

• If the workload for a given sample period is less than peak, then the delay of the processing element can be increased by a factor of 1/workload without loss in throughput, allowing the processor to operate at a lower supply voltage. Thus, energy per operation varies.
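The difference between the two schemes can be sketched numerically. The code below is an illustration under assumptions, not the author's model: it assumes a first-order delay model, delay ∝ Vdd / (Vdd − Vt)², an energy per operation proportional to Vdd², and hypothetical values Vdd,max = 3.3 V and Vt = 0.7 V. It finds by bisection the lowest supply that still stretches the delay by exactly 1/workload.

```python
def relative_delay(vdd, vt):
    """Assumed first-order delay model: delay ~ Vdd / (Vdd - Vt)^2."""
    return vdd / (vdd - vt) ** 2

def scaled_supply(workload, vdd_max=3.3, vt=0.7, iterations=60):
    """Lowest supply that completes `workload` (0 < workload <= 1) in one sample period."""
    target = relative_delay(vdd_max, vt) / workload   # allowed slowdown is 1/workload
    lo, hi = vt + 1e-6, vdd_max
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if relative_delay(mid, vt) > target:
            lo = mid      # too slow at this supply: raise it
        else:
            hi = mid      # fast enough: try a lower supply
    return hi

def energy_per_sample(workload, vdd_max=3.3, vt=0.7):
    """Energy relative to a full-workload sample at vdd_max: (fixed, variable)."""
    fixed = workload                                  # fixed energy per op, fewer ops
    vdd = scaled_supply(workload, vdd_max, vt)
    variable = workload * (vdd / vdd_max) ** 2        # energy per op scales as Vdd^2
    return fixed, variable

fixed_half, variable_half = energy_per_sample(0.5)
print(f"workload 0.5: fixed supply {fixed_half:.2f}, variable supply {variable_half:.2f}")
```

Under these assumptions the fixed-supply energy falls linearly with workload, while the variable-supply energy falls faster because each operation also becomes cheaper.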

Page 33: Lower Power Design Guide

Data Driven Signal Processing

The basic idea of averaging: two samples are buffered and their workloads are averaged.

The averaged workload is then used as the effective workload to drive the power supply.

Using a ping-pong buffering scheme, data samples I_{n+2}, I_{n+3} are being buffered while I_n, I_{n+1} are being processed.

Page 34: Lower Power Design Guide

Architecture of Microcoded Instruction Set Processor

Page 35: Lower Power Design Guide

Power and Area

At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.

Page 36: Lower Power Design Guide

Datapath Parallelization

Page 37: Lower Power Design Guide

Memory Parallelization

To first order, P = C · (f/2) · Vdd²

Page 38: Lower Power Design Guide

Pipelined Micro-P

Page 39: Lower Power Design Guide

Architecture Trade-Off

Non-pipelined (reference) implementation: P = C V² f

Pipelined implementation: P_pipeline = (1.15C)(0.58V)²(f) = 0.39P

Parallel implementation: P_parallel = (2.15C)(0.58V)²(0.5f) = 0.36P
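The two ratios on this slide can be checked directly. The sketch below assumes the reference design dissipates P = C · V² · f and simply multiplies the scaling factors quoted above; the function name is illustrative.

```python
def relative_power(cap_factor, voltage_factor, freq_factor):
    """Power relative to the reference design, using P = C * V^2 * f."""
    return cap_factor * voltage_factor ** 2 * freq_factor

# Pipelined: 15% more capacitance, 0.58x supply, same clock frequency.
p_pipeline = relative_power(1.15, 0.58, 1.0)
# Parallel: 2.15x capacitance, 0.58x supply, half the clock frequency.
p_parallel = relative_power(2.15, 0.58, 0.5)

print(f"pipelined: {p_pipeline:.2f} P, parallel: {p_parallel:.2f} P")
# prints: pipelined: 0.39 P, parallel: 0.36 P
```

Both options reach a similar power level; the parallel version pays for it with roughly twice the capacitance (area), the pipelined one with extra registers.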

Page 40: Lower Power Design Guide

Through WAVE PIPELINING

Page 41: Lower Power Design Guide

Different Classes of RISC Micro-P

Page 42: Lower Power Design Guide

Application Specific Coprocessor

• DSP's are increasingly called upon to perform tasks for which they are not ideally suited, for example, Viterbi decoding.

• They may also take considerably more energy than a custom solution.

• Use the DSP for portions of algorithms for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for other tasks.

• This is an example of the difference between power and energy

• The application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy savings.

• Power consumption varies dramatically with the instruction being executed.

Page 43: Lower Power Design Guide

Clocks per Instruction (CPI)

Page 44: Lower Power Design Guide

SUPERPIPELINE micro-P

Page 45: Lower Power Design Guide

VLIW Architecture

The compiler takes the responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations. VLIW instruction decoding is easier than superscalar instruction decoding because of the fixed format and the absence of instruction dependencies. The fixed format could, however, place more limitations on the combination of operations. Intel P6: CISC instructions are combined on chip to provide a set of micro-operations (i.e., a long instruction word) that can be executed in parallel. As power becomes a major issue in the design of fast microprocessors, the simpler the architecture, the better. VLIW architectures, being simpler than N-issue machines, could be considered promising architectures for achieving high speed and low power simultaneously.

Page 46: Lower Power Design Guide

Synchronous VS. Asynchronous SYSTEMS

• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization. As technology advances, systems tend to get bigger and bigger, and as a result the delay on the clock wires can no longer be ignored. Clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have new inputs to process. The biggest gate is the clock driver itself, which must always switch.

• Asynchronous system (self-timed): an input signal (request) starts the computation in a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.

Page 47: Lower Power Design Guide

Asynchronous SYSTEMS

• More difficult to implement, requiring explicit synchronization between communicating blocks without clocks

• If the signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.

• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.

• Asynchronous designs are not widely used because designers cannot find the supporting design tools and methodologies they need.

• The DCC error corrector of a compact cassette player saves 80% of the power compared with its synchronous counterpart.

• Offers more architectural options and freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage

Page 48: Lower Power Design Guide

Asynchronous Modules

Page 49: Lower Power Design Guide

Example: ABCS protocol

6% more logic

Page 50: Lower Power Design Guide

Control Synthesis Flow

Page 51: Lower Power Design Guide

PIPELINED SELF-TIMED micro P

Page 52: Lower Power Design Guide

Programming Style

Page 53: Lower Power Design Guide

Speed vs. Power Optimization

Page 54: Lower Power Design Guide

VON NEUMANN VERSUS HARVARD

Page 55: Lower Power Design Guide

Low Vdd Main Memories

Page 56: Lower Power Design Guide

CACHE MEMORIES

Page 57: Lower Power Design Guide

Low Power Memory

• Hierarchical word line: divide the memory into different blocks and access only the bit cells of the desired block.

• Selective precharge: many bit lines are discharged even when these locations are not accessed. Only the bit lines that will be accessed are precharged.

• Minimization of non-zero terms in the ROM table: zero terms do not switch bit lines and reduce capacitance in both bit lines and row lines.
– Inverted ROM: if the number of ones is very high, the whole ROM core can be inverted.
– Inverted row: a given row is inverted if more than half of its bits are non-zero terms. An extra bit is required to perform the encoding.
– Sign-magnitude representation: when a ROM is used to store the coefficients of a digital filter, a significant share of the non-zero terms is due to the sign extension of the negative coefficients. The main drawback of this representation is that a conversion to two's complement is required at the end of a cycle, which slows down the ROM.
– Sign magnitude and inverted block:

• Difference encoding: reduces the size of the ROM core. If the values of adjacent data do not differ significantly, the ROM core stores the difference between the data.
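The inverted-row scheme from the list above can be sketched in a few lines. This is an illustration with hypothetical function names and an assumed 8-bit row width: each row is stored inverted when more than half of its bits are ones, and one flag bit per row records the choice.

```python
def encode_rows(rows, width):
    """Return (flag, stored_row) pairs; flag = 1 means the row is stored inverted."""
    mask = (1 << width) - 1
    encoded = []
    for row in rows:
        ones = bin(row & mask).count("1")
        if ones > width / 2:
            encoded.append((1, ~row & mask))
        else:
            encoded.append((0, row & mask))
    return encoded

def decode_row(flag, stored, width):
    """Undo the inversion when the flag bit is set."""
    mask = (1 << width) - 1
    return (~stored & mask) if flag else stored

def count_ones(values):
    """Total number of non-zero bits in a list of words."""
    return sum(bin(v).count("1") for v in values)

rows = [0b11111110, 0b00000011, 0b11110111]
encoded = encode_rows(rows, width=8)
ones_before = count_ones(rows)                                   # 16
ones_after = count_ones([stored for _, stored in encoded])       # 4
flag_bits_set = sum(flag for flag, _ in encoded)                 # 2
```

Even counting the flag bits, the stored table holds far fewer non-zero terms than the original in this example.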

Page 58: Lower Power Design Guide

Low Power Memory

• Smaller ROMs: in a 102-tap filter, more than 70% of the coefficients are below 18 bits, yet the largest coefficient goes up to 24 bits. A better implementation can be achieved if the large coefficients are stored in a wide ROM with fewer addresses and the small coefficients are stored in a narrow ROM with many addresses. A similar approach is applied to ROM locations that are accessed often: locations that are accessed frequently are stored in a small, fast ROM, while the other locations are stored in a larger ROM.

• NMOS precharge: bit lines are precharged to Vdd - Vt. A drawback of this technique is degradation of noise margins and the body bias effect.

• Buffer sizing: a large set of buffers is required in the control logic to drive the address lines through the decoder, generate the control signals for the column multiplexers, drive the row lines, and drive the precharge signals.

• Voltage scaling: $T_{delay} = \frac{C_L V_{dd}}{I} = \frac{C_L V_{dd}}{\frac{\mu C_{ox}}{2} (W/L) (V_{dd} - V_t)^2}$

Page 59: Lower Power Design Guide

Memory Architecture

Page 60: Lower Power Design Guide

Exploiting Locality for Low-Power Design

• Power consumption (mW) in the maximally time-shared and fully parallel versions of the QMF sub-band coder filter

• Improvement by a factor of 10.5 at the expense of a 20% increase in area

• The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.

• A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flow graph representation.

• Two nodes are tightly connected to each other in the flow graph representation if the shortest distance between them, in terms of the number of edges traversed, is low.

Page 61: Lower Power Design Guide

Cascade filter layouts

(a) Non-local implementation from Hyper (b) Local implementation from Hyper-LP

Page 62: Lower Power Design Guide

Stage-Skip Pipeline

• The power savings are achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except for the first iteration.

• DIB = Decoded Instruction Buffer

• 40% power savings using a DSP or RISC processor.

Page 63: Lower Power Design Guide

Stage-Skip Pipeline

• Selector: selects the output from either the instruction decoder or the DIB.

• The decoded instruction signals for a loop are temporarily stored in the DIB and are reused in each iteration of the loop.

• The power wasted in a conventional pipeline is saved in this pipeline by stopping the instruction fetching and decoding for each loop execution.

Page 64: Lower Power Design Guide

Stage-Skip Pipeline

The majority of execution cycles in signal processing programs are used for loop execution: 40% reduction in power with a 2% area increase.

Page 65: Lower Power Design Guide

Parallel LIFO Scenario

Page 66: Lower Power Design Guide

Parallel-serial Converter

Page 67: Lower Power Design Guide

D Flip-Flop Parallelization

Page 68: Lower Power Design Guide

State Machine

Page 69: Lower Power Design Guide

Frequency Multipliers and Dividers

Page 70: Lower Power Design Guide

Data Reuse Exploration

• MH(memory hierarchy) introduces copies of data from larger to smaller memories in DFG.

• Power consumption is decreased because data is now read mostly from smaller memories, while it is increased because extra memory transfers are introduced.

• Moreover, adding another layer of hierarchy has a negative effect on the area and interconnect cost.

Page 71: Lower Power Design Guide

State/Instruction Encoding

• Architecture of control logic in a microprocessor
– State transition diagram
– Binary code mapping
– Hardware implementation

[Figure: state transition diagram with states S0 to Sn and an edge e, and its hardware implementation: combinational logic with primary inputs and primary outputs, plus a state register that feeds the present state back and latches the next state]

If edge e has a higher switching probability (e.g., S0 = branch, S1 = compare), then encode S0 and S1 in Gray-code style.
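The effect of the encoding choice can be estimated by weighting the Hamming distance between state codes with the transition probabilities. The sketch below is illustrative only: the three-state machine, its probabilities, and the two encodings are hypothetical.

```python
def hamming(a, b):
    """Number of state-register bits that toggle between codes a and b."""
    return bin(a ^ b).count("1")

def expected_toggles(encoding, transition_probs):
    """Average state-register bit toggles per cycle for a given encoding."""
    return sum(p * hamming(encoding[src], encoding[dst])
               for (src, dst), p in transition_probs.items())

# Hypothetical transition probabilities; the S0 <-> S1 edge dominates.
probs = {("S0", "S1"): 0.4, ("S1", "S0"): 0.4,
         ("S1", "S2"): 0.1, ("S2", "S0"): 0.1}

naive = {"S0": 0b00, "S1": 0b11, "S2": 0b01}        # S0 and S1 differ in two bits
gray_style = {"S0": 0b00, "S1": 0b01, "S2": 0b11}   # S0 and S1 differ in one bit

naive_cost = expected_toggles(naive, probs)         # 1.8 toggles per cycle
gray_cost = expected_toggles(gray_style, probs)     # 1.1 toggles per cycle
```

Giving adjacent codes to the states joined by the most frequent edge lowers the average switching in the state register, at the cost of more toggles on rarer edges.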

Page 72: Lower Power Design Guide

Optimizing Power using Transformation

[Figure: optimization flow from input flowgraph to output flowgraph]

• Local transformation primitives: associativity, distributivity, retiming, common sub-expression

• Global transformation primitives: retiming, pipelining, look-ahead, associativity

• Search mechanism: simulated rejectionless, steepest descent, heuristics

• Power estimation

Page 73: Lower Power Design Guide

Summary of Results

EXAMPLE      POWER REDUCTION   AREA INCREASE   OPTIMUM VOLTAGE
FIR11        11                1.1             1.5 V
DCT          8                 5               1.5 V
IIR7         7.5               6.4             1.4 V
VOLTERRA2    8.6               1               1.7 V

Optimum voltage for low power is around 1.5 V

Page 74: Lower Power Design Guide

Data- flow based transformations

• Tree-height reduction.
• Constant and variable propagation.
• Common subexpression elimination.
• Code motion.
• Dead-code elimination.
• The application of algebraic laws such as commutativity, distributivity, and associativity.
• Most of the parallelism in an algorithm is embodied in the loops.
• Loop jamming, partial and complete loop unrolling, strength reduction, loop retiming, and software pipelining.
• Retiming: maximize the resource utilization.

Page 75: Lower Power Design Guide

Tree-height reduction

• Example of tree-height reduction using commutativity and associativity

• Example of tree-height reduction with distributivity

Page 76: Lower Power Design Guide

Sub-expression elimination

• Logic expressions:
– Performed by logic optimization.
– Kernel-based methods.

• Arithmetic expressions:
– Search isomorphic patterns in the parse trees.
– Example:
– Before: a = x + y; b = a + 1; c = x + y;
– After: a = x + y; b = a + 1; c = a;

Page 77: Lower Power Design Guide

Examples of other transformations

• Dead-code elimination:
– a = x; b = x + 1; c = 2 * x;
– a = x; can be removed if a is not referenced.

• Operator-strength reduction:
– a = x²; b = 3 * x;
– a = x * x; t = x << 1; b = x + t;

• Code motion:
– for (i = 1; i < a * b; i++) { }
– t = a * b; for (i = 1; i < t; i++) { }

Page 78: Lower Power Design Guide

Strength reduction

[Figure: data-flow graphs before and after strength reduction. X² + AX + B is restructured as X(X + A) + B; a second example involving the coefficients A, B, and C is restructured in the same way.]
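The rewrite of X² + AX + B as X(X + A) + B trades one multiplication for nothing: both forms use two additions, but the factored form needs only one multiplier operation. A minimal check that the two forms agree, with illustrative function names:

```python
def direct(x, a, b):
    """X^2 + A*X + B evaluated term by term: two multiplications, two additions."""
    return x * x + a * x + b

def factored(x, a, b):
    """X*(X + A) + B: one multiplication, two additions."""
    return x * (x + a) + b

# Compare the two forms over a small grid of integer inputs.
forms_agree = all(
    direct(x, a, b) == factored(x, a, b)
    for x in range(-5, 6)
    for a in (-3, 0, 7)
    for b in (-1, 4)
)
```

Since a multiplier typically switches far more capacitance than an adder, removing one multiplication reduces the energy per evaluation.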

Page 79: Lower Power Design Guide

Strength Reduction

Page 80: Lower Power Design Guide

Control- flow based transformations

• Model expansion
– Expand subroutines to flatten the hierarchy.
– Useful to expand the scope of other optimization techniques.
– Problematic when a routine is called more than once.
– Example:
– x = a + b; y = a * b; z = foo(x, y);
– foo(p, q) { t = q - p; return (t); }
– By expanding foo:
– x = a + b; y = a * b; z = y - x;

• Conditional expansion
– Transform a conditional into parallel execution with the test at the end.
– Useful when the test depends on late signals.
– May preclude hardware sharing.
– Always useful for logic expressions.
– Example:
– y = ab; if (a) x = b + d; else x = bd; can be expanded to: x = a(b + d) + a'bd;
– and simplified to: y = ab; x = y + d(a + b);
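The conditional-expansion example can be verified exhaustively, since it involves only three Boolean signals. In the sketch below "+" in the slide is read as OR and juxtaposition as AND; the function names are illustrative.

```python
from itertools import product

def original(a, b, d):
    """if (a) x = b + d; else x = b d;"""
    return (b | d) if a else (b & d)

def expanded(a, b, d):
    """x = a(b + d) + a'bd"""
    return (a & (b | d)) | ((1 - a) & b & d)

def shared(a, b, d):
    """y = ab; x = y + d(a + b)   (reuses y, which is needed anyway)"""
    y = a & b
    return y | (d & (a | b))

# Check all eight input combinations.
equivalent = all(
    original(a, b, d) == expanded(a, b, d) == shared(a, b, d)
    for a, b, d in product((0, 1), repeat=3)
)
```

The last form is the cheapest of the three because it shares the product ab that is already computed for y.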

Page 81: Lower Power Design Guide

Pipelining

Page 82: Lower Power Design Guide

Associativity Transformation

Page 83: Lower Power Design Guide

FIR Parallelization

Page 84: Lower Power Design Guide

FIR PARALLELIZATION

Page 85: Lower Power Design Guide

FIR Filter Parallelization

Page 86: Lower Power Design Guide

FIR parallelization: two working phases

Page 87: Lower Power Design Guide

IIR filter recursive function

Page 88: Lower Power Design Guide

Recursive Function

Page 89: Lower Power Design Guide

Interlaced Accumulation Programming for Low Power

Page 90: Lower Power Design Guide

4. Register Transfer Level Design

Page 91: Lower Power Design Guide

FIR3 Block Diagram and Flow Graph

Page 92: Lower Power Design Guide

High-Level Power Estimation

• $P_{core} = P_{DP} + P_{MEM} + P_{CNTR} + P_{PROC}$

• $P_{DP} = P_{REG} + P_{MUX} + P_{FU} + P_{INT}$, where $P_{REG}$ is the power of the registers

• $P_{MUX}$ is the power of the multiplexers

• $P_{FU}$ is the power of the functional units

• $P_{INT}$ is the power of the physical interconnect capacitance:

$C_{int} = \alpha \, C_{total} / N$

where $\alpha$ is the average activity (the total number of interconnect accesses multiplied by an average signal transition probability), $C_{total}$ is the total estimated capacitance of the chip, and $N$ is an estimate of the number of physical interconnects (HYPER).

Page 93: Lower Power Design Guide

High-Level Power Estimation: PREG

• Compute the lifetimes of all the variables in the given VHDL code.

• Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable v_j.

• Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.

• Estimate the minimal number N of sets of registers necessary to implement the code by using register sharing. Register sharing has to be applied within each group of variables with the same bit-width b_i.

• Select a possible mapping of variables into registers by using register sharing.

• Compute the number w_i of writes to the variables mapped to the same set of registers. Estimate the activity n_i of each set of registers by dividing w_i by the number of statements S: n_i = w_i/S; hence TR_i,max = n_i f_clk.

• The power of latches and flip-flops is consumed not only during output transitions, but also during all clock edges by the internal clock buffers.

• The non-switching power P_NSK dissipated by the internal clock buffers accounts for 30% of the average power at 0.38 micron and 3.3 V operation.

• In total,

[Equation: P_REG as a sum over the register sets k = 1, ..., N of per-set powers P_k, expressed in terms of n_k, P_NSK, P_t, TR_k, and f_clk]
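The cut-line step described above, finding the maximum number of overlapping lifetimes, can be sketched as a sweep over statement numbers. The sketch assumes all variables share one bit-width, and the lifetimes in `example` are hypothetical.

```python
def min_registers(lifetimes):
    """Maximum number of simultaneously live variables.

    `lifetimes` maps a variable name to (first_statement, last_statement),
    both inclusive, matching the vertical-line picture of lifetimes.
    """
    events = []
    for start, end in lifetimes.values():
        events.append((start, 1))       # variable becomes live
        events.append((end + 1, -1))    # variable is dead after its last statement
    live = peak = 0
    for _, delta in sorted(events):     # at equal positions, -1 sorts before +1
        live += delta
        peak = max(peak, live)
    return peak

# Hypothetical lifetimes, in statement numbers.
example = {"v1": (1, 4), "v2": (2, 3), "v3": (4, 6), "v4": (5, 6)}
needed = min_registers(example)   # 2 registers suffice for these four variables
```

Here v2 can share a register with v3 or v4 because their lifetimes never overlap, which is the register sharing the estimate relies on.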

Page 94: Lower Power Design Guide

PCNTR

• After scheduling, the control is defined and optimized by the hardware mapper and further by the logic synthesis process before mapping to layout.

• Like interconnect, therefore, the control needs to be estimated statistically.

• Global control model:

$C_{FSM} = \alpha_1 N_{states} + \alpha_2$

For a 1.2 µm technology, $\alpha_1$ is 4.9 fF and $\alpha_2$ is 22.1 fF. The total number of transitions is strongly dependent on the number of states.

• Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.

$C_{lc} = \beta_0 + \beta_1 N_{trans} + \beta_2 N_{states} + \beta_3 B_f$

For a 1.2 µm technology, $\beta_0 = 72$, $\beta_1 = 0.15$, $\beta_2 = 8.3$, $\beta_3 = 0.55$.

Here $N_{trans}$ is the number of transitions, $N_{states}$ is the number of states, $B_f$ is the bus factor, and $C_{lc}$ is the capacitance switched in any local controller in one sample period. $B_f$ is the ratio of the number of bus accesses to the number of busses.

Page 95: Lower Power Design Guide

Ntrans

• The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching, and the statistics of the inputs.

[Equation: N_trans as a function of N_nodes, N_edges, S, and N_Exu with three fitted coefficients]

where N_trans is the number of transitions on the outputs of the local controllers, S is the number of control cycles per sample period, N_edges and N_nodes are the number of edges and nodes in the CDFG, and N_Exu is an estimate of the total number of execution units. For a 1.2 µm technology, the three coefficients are 178.7, 7.2, and 2.0.

Page 96: Lower Power Design Guide

Behavioral Synthesis

• Loop unrolling: localizes the data to reduce the activity of the inputs of the functional units; alternatively, two output samples are computed in parallel based on two input samples.

$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$
$Y_n = X_n + A\,Y_{n-1} = X_n + A\,(X_{n-1} + A\,Y_{n-2})$

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:

$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$
$Y_n = X_n + A\,X_{n-1} + A^2\,Y_{n-2}$

The transformation yields a critical path of 3, thus the voltage can be dropped.

• Clock selection: choose the optimal system clock period; eliminate slack, improve resource utilization, and enable greater voltage scaling.

• Module selection: for each operation, choose a library template.

• Flow graph restructuring: pull operations out of the critical cycle.
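The unrolling step can be checked with a short sketch. It assumes the first-order recurrence Y[n] = X[n] + A·Y[n−1]; after unrolling by two and applying distributivity and constant propagation, Y[n] is computed from Y[n−2] directly, so it no longer waits for Y[n−1]. Function names and sample data are illustrative.

```python
def direct_form(xs, a, y0=0):
    """Y[n] = X[n] + A * Y[n-1], one output per iteration."""
    ys, prev = [], y0
    for x in xs:
        prev = x + a * prev
        ys.append(prev)
    return ys

def unrolled_form(xs, a, y0=0):
    """Two outputs per iteration; both depend only on Y[n-2]."""
    assert len(xs) % 2 == 0, "sketch assumes an even number of samples"
    a2 = a * a                                   # constant propagation: A^2 computed once
    ys, y_nm2 = [], y0
    for i in range(0, len(xs), 2):
        x_nm1, x_n = xs[i], xs[i + 1]
        y_nm1 = x_nm1 + a * y_nm2                # Y[n-1] = X[n-1] + A*Y[n-2]
        y_n = x_n + a * x_nm1 + a2 * y_nm2       # Y[n] = X[n] + A*X[n-1] + A^2*Y[n-2]
        ys += [y_nm1, y_n]
        y_nm2 = y_n
    return ys

samples = [1, 2, 3, 4, 5, 6]
same_output = direct_form(samples, 3) == unrolled_form(samples, 3)
```

The two outputs of each unrolled iteration can be evaluated in parallel, which is what shortens the critical path per sample and leaves room to lower the supply voltage.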

Page 97: Lower Power Design Guide

High-Level Power Estimation: PMUX and PFU

Page 98: Lower Power Design Guide

Critical Path

• Longest delayed path from input to output in combinational logic

• Determines the operating clock frequency

• Resizing of non-critical-path transistors (in-place optimization)

• Critical path in synchronous sequential logic:

$t_{cycle,min} = t_{ff,max} + t_{logic,max} + t_{setup,max} + t_{skew,max}$

– $t_{cycle,min}$: minimum value of the clock period
– $t_{ff,max}$: maximum value of the flip-flop delay
– $t_{logic,max}$: maximum value of the critical path delay
– $t_{setup,max}$: maximum value of the flip-flop setup time
– $t_{skew,max}$: maximum value of the clock skew

[Figure: D flip-flops clocked by clk on either side of a combinational logic block, with two register-to-register paths, path A and path B]
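As a worked example of the timing constraint, the minimum clock period is the sum of the worst-case flip-flop delay, critical-path logic delay, setup time, and clock skew. All delay values below are hypothetical and given in nanoseconds.

```python
def min_clock_period(t_ff, t_logic, t_setup, t_skew):
    """Worst-case register-to-register timing; all arguments are maximum values in ns."""
    return t_ff + t_logic + t_setup + t_skew

# Hypothetical logic delays (ns) for two register-to-register paths.
paths = {"path A": 6.0, "path B": 8.5}

# The slower path sets the clock period.
t_cycle = min_clock_period(t_ff=0.5, t_logic=max(paths.values()),
                           t_setup=0.3, t_skew=0.2)
f_max_mhz = 1e3 / t_cycle

print(f"t_cycle,min = {t_cycle:.1f} ns, f_max = {f_max_mhz:.1f} MHz")
```

Path A has 2.5 ns of slack in this example, which is the margin that in-place optimization can spend on smaller, lower-power transistors.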

Page 99: Lower Power Design Guide

Loop Unrolling for Low Power

Page 100: Lower Power Design Guide

Retiming

Flip-flop insertion to minimize hazard activity: moving a flip-flop in a circuit.

Page 101: Lower Power Design Guide

Exploiting spatial locality for interconnect power reduction

[Figure: Adder1 and Adder2 with local and global data transfers]

Page 102: Lower Power Design Guide

Balancing maximal time-sharing and fully-parallel implementation

A fourth-order parallel-form IIR filter: (a) local assignment (2 global transfers), (b) non-local assignment (20 global transfers)

Page 103: Lower Power Design Guide

Retiming/pipelining for Critical path

[Figure: cascade of biquad sections and one first-order section between in and out, shown in two versions. First, the minimal-area time-multiplexed solution (meeting the throughput constraint). Second, after retiming and pipelining, with delay elements D inserted between the sections: the "fastest" solution.]

Supply voltage can be reduced keeping throughput fixed.

Page 104: Lower Power Design Guide

Effective Resource Utilization

[Figure: flow graph with four adders, delay elements D, and a scaling block S, with numbered operations, shown before and after retiming, together with the per-cycle schedule of multiplier and adder operations for each version]

Can reduce interconnect capacitance.

Page 105: Lower Power Design Guide

Hazard propagation elimination by clocked sampling

By sampling a steady-state signal at a register input, no more glitches are propagated through the next combinational logic.

Page 106: Lower Power Design Guide

Latched Retiming

Page 107: Lower Power Design Guide

Latched retiming

Page 108: Lower Power Design Guide

Regularity

• Common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.

[Figure: flow graph of additions (+), multiplications (*), and shifts (<<) bound to adders A1 and A2, multiplier M1, and shifter S1. (a) Regular module assignment; (b) irregular module assignment, with MUXes in the datapath.]

Page 109: Lower Power Design Guide

Module Selection

• Select the clock period, choose proper hardware modules for all operations(e.g., Wallace or Booth Multiplier), determine where to pipeline (or where to put registers), such that a minimal hardware cost is obtained under given timing and throughput constraints.

• Full pipelining is ineffective when the clock period is mismatched with the execution times of the operators. Performing operations in sequence without intermediate buffering can reduce the critical path.

• When clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph should be maximized.

• During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints

Page 110: Lower Power Design Guide

Estimation

• Estimate min and max bounds on the required resources to

– delimit the design space (the min bounds serve as an initial solution)

– serve as entries in a resource utilization table which guides the transformation, assignment and scheduling operations

• Max bound on execution time is tmax: topological ordering of the DFG using ASAP and ALAP

• Minimum bounds on the number of resources for each resource class, where

NRi: the number of resources of class Ri

dRi: the duration of a single operation

ORi: the number of operations
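The bound itself did not survive in the slide text. The definitions of NRi, dRi, ORi and tmax suggest the commonly used form NRi ≥ ⌈ORi · dRi / tmax⌉ (total busy time of a resource class divided by the available time); this form is an assumption here, not a quotation from the slide. A minimal sketch with hypothetical numbers:

```python
def min_resources(num_ops, op_duration, t_max):
    """Lower bound on the number of resources of one class.

    Assumed form: ceil(num_ops * op_duration / t_max), computed with
    integer arithmetic. All arguments are positive integers (cycles).
    """
    busy_cycles = num_ops * op_duration
    return (busy_cycles + t_max - 1) // t_max


# Hypothetical example: 6 multiplications of 2 cycles each, 4 cycles available.
min_multipliers = min_resources(num_ops=6, op_duration=2, t_max=4)
# Hypothetical example: 5 additions of 1 cycle each, 4 cycles available.
min_adders = min_resources(num_ops=5, op_duration=1, t_max=4)
```

The bound ignores data dependencies, so a feasible schedule may need more units than it reports; it is only a starting point for the search.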

Page 111: Lower Power Design Guide

Exploring the Design Space

• Find the minimal-area solution subject to the timing constraints

• By checking the critical paths, it determines whether the proposed graph violates the timing constraints. If so, retiming, pipelining and tree height reduction can be applied.

• After an acceptable graph is obtained, the resource allocation process is initiated:

– change the available hardware (FU's, registers, busses)

– redistribute the time allocation over the sub-graphs

– transform the graph to reduce the hardware requirements.

• Use a rejectionless probabilistic iterative search technique (a variant of Simulated Annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.

Page 112: Lower Power Design Guide

Data path Synthesis

Page 113: Lower Power Design Guide

Scheduling and Binding

• The scheduling task selects the control step in which a given operation will happen, i.e., assigns each operation to an execution cycle

• Sharing: Bind a resource to more than one operation.

– Operations must not execute concurrently.

• Graph scheduled hierarchically in a bottom-up fashion

• Power tradeoffs

– Shorter schedules enable supply voltage (Vdd) scaling

– Schedule directly impacts resource sharing

– Energy consumption depends on what the previous instruction was

– Reordering to minimize the switching on the control path

• Clock selection

– Eliminate slacks

– Choose optimal system clock period

Page 114: Lower Power Design Guide

ASAP Scheduling

• Algorithm • HAL Example
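The algorithm figure for this slide is not available in the text. As a rough illustration of ASAP (as soon as possible) scheduling, the sketch below places every operation in the earliest control step allowed by its predecessors. It assumes unit-delay operations and an acyclic graph, and the small data-flow graph is hypothetical (it is not the HAL example).

```python
def asap_schedule(preds):
    """Map each operation to its earliest control step (steps start at 1).

    preds: {operation: list of predecessor operations}; unit delays assumed.
    """
    start = {}

    def visit(op):
        if op not in start:
            # One step after the latest predecessor, or step 1 for inputs.
            start[op] = max((visit(p) + 1 for p in preds[op]), default=1)
        return start[op]

    for op in preds:
        visit(op)
    return start


# Hypothetical DFG: m3 = f(m1, m2), s1 = g(m3), a1 is independent.
dfg = {"m1": [], "m2": [], "m3": ["m1", "m2"], "s1": ["m3"], "a1": []}
asap = asap_schedule(dfg)
```

The largest step in the result (3 here) is the minimum latency of the graph when resources are unlimited.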

Page 115: Lower Power Design Guide

ALAP Scheduling

• Algorithm • HAL Example
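ALAP (as late as possible) scheduling mirrors ASAP: each operation is pushed to the latest control step that still lets its successors finish within a given latency. The sketch below assumes unit delays, an acyclic graph, and a latency that is at least the ASAP latency; the graph is hypothetical, not the HAL example.

```python
def alap_schedule(preds, latency):
    """Map each operation to its latest control step within `latency` steps.

    preds: {operation: list of predecessor operations}; unit delays assumed.
    """
    succs = {op: [] for op in preds}
    for op, op_preds in preds.items():
        for p in op_preds:
            succs[p].append(op)

    start = {}

    def visit(op):
        if op not in start:
            # One step before the earliest successor, or the last step.
            start[op] = min((visit(s) - 1 for s in succs[op]), default=latency)
        return start[op]

    for op in preds:
        visit(op)
    return start


# Hypothetical DFG: m3 = f(m1, m2), s1 = g(m3), a1 is independent.
dfg_alap = {"m1": [], "m2": [], "m3": ["m1", "m2"], "s1": ["m3"], "a1": []}
alap = alap_schedule(dfg_alap, latency=3)
```

The difference between an operation's ALAP and ASAP steps is its mobility; here only a1 can move (steps 1 to 3), while the other operations are on the critical path.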

Page 116: Lower Power Design Guide

Force Directed Scheduling

• Used as priority function.

• Force is related to concurrency.

• Sort operations for least force.

• Mechanical analogy: Force = constant × displacement.

– constant = operation-type distribution

– displacement = change in probability
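The analogy above can be made concrete with a small calculation. In the sketch below, the "constant" is the distribution graph of one operation type (the sum of scheduling probabilities per control step) and the "displacement" is the change in an operation's probability when it is fixed to a step. Only the self-force is computed; predecessor and successor forces are omitted. The time frames are hypothetical, and the uniform probability over a time frame is the usual assumption in force-directed scheduling.

```python
def distribution_graph(frames, num_steps):
    """Sum of scheduling probabilities per control step (index 0 unused).

    frames: {operation: (earliest_step, latest_step)} for one operation type.
    """
    dg = [0.0] * (num_steps + 1)
    for lo, hi in frames.values():
        prob = 1.0 / (hi - lo + 1)  # uniform over the time frame
        for step in range(lo, hi + 1):
            dg[step] += prob
    return dg


def self_force(op, step, frames, dg):
    """Force = sum over the frame of distribution * change in probability."""
    lo, hi = frames[op]
    prob = 1.0 / (hi - lo + 1)
    return sum(
        dg[i] * ((1.0 if i == step else 0.0) - prob) for i in range(lo, hi + 1)
    )


# Hypothetical time frames of three multiplications over three control steps.
mult_frames = {"m1": (1, 1), "m2": (1, 2), "m3": (2, 3)}
mult_dg = distribution_graph(mult_frames, num_steps=3)
force_step1 = self_force("m2", 1, mult_frames, mult_dg)
force_step2 = self_force("m2", 2, mult_frames, mult_dg)
```

Placing m2 in step 2 gives the lower force, since step 1 is already crowded by m1; choosing the least force tends to balance concurrency across steps.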

Page 117: Lower Power Design Guide

Force Directed Scheduling

Page 118: Lower Power Design Guide

Example : Operation V6

Page 119: Lower Power Design Guide

Force-Directed Scheduling

• Algorithm (Paulin)

Page 120: Lower Power Design Guide

Force-Directed Scheduling Example

• Probability of scheduling operations into control steps

• Probability of scheduling operations into control steps after operation o3 is scheduled to step s2

• Operator cost for multiplications in (a)

• Operator cost for multiplications in (c)

Page 121: Lower Power Design Guide

List Scheduling

• The scheduled DFG

• DFG with mobility labeling (inside <>)

• Ready operation list/resource constraint
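A resource-constrained list scheduler can be sketched as follows: at each control step, the ready operations are sorted by priority (here, mobility, lowest first) and issued while units of the required type remain. This is a simplified illustration under stated assumptions: unit delays, an acyclic graph, at least one unit of every type, and a hypothetical graph with hypothetical mobilities.

```python
def list_schedule(preds, op_type, limits, priority):
    """Return {operation: control step} under per-type resource limits.

    preds:    {operation: list of predecessor operations}
    op_type:  {operation: resource type name}
    limits:   {resource type name: units available per step}, each >= 1
    priority: {operation: value}; lower values are scheduled first
    """
    schedule = {}
    step = 0
    while len(schedule) < len(preds):
        step += 1
        # Ready: not yet scheduled, all predecessors done in an earlier step.
        ready = sorted(
            (
                op
                for op in preds
                if op not in schedule
                and all(schedule.get(p, step) < step for p in preds[op])
            ),
            key=lambda op: (priority[op], op),
        )
        used = {}
        for op in ready:
            kind = op_type[op]
            if used.get(kind, 0) < limits[kind]:
                schedule[op] = step
                used[kind] = used.get(kind, 0) + 1
    return schedule


# Hypothetical DFG with one multiplier and one ALU available.
graph = {"m1": [], "m2": [], "m3": ["m1", "m2"], "s1": ["m3"], "a1": []}
kinds = {"m1": "mul", "m2": "mul", "m3": "mul", "s1": "alu", "a1": "alu"}
mobility = {"m1": 1, "m2": 1, "m3": 1, "s1": 1, "a1": 3}
result = list_schedule(graph, kinds, {"mul": 1, "alu": 1}, mobility)
```

With a single multiplier, m1 and m2 cannot share step 1, so the schedule stretches to four steps.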

Page 122: Lower Power Design Guide

Static-List Scheduling

• DFG

• Partial schedule of five nodes

• Priority list

• The final schedule

Page 123: Lower Power Design Guide

Loop folding

• Reduce execution delay of a loop.

• Pipeline operations inside a loop.

• Overlap execution of operations.

• Need a prologue and epilogue.

• Use pipeline scheduling for loop graph model.

Page 124: Lower Power Design Guide

DFG Restructuring

• DFG2

• DFG2 after redundant operation insertion

Page 125: Lower Power Design Guide

Minimizing the bit transitions for constants during Scheduling

Page 126: Lower Power Design Guide

Control Synthesis

• Synthesize a circuit that:

– Executes scheduled operations.

– Provides synchronization.

– Supports iteration, branching, hierarchy, and interfaces.

Page 127: Lower Power Design Guide

Allocation

• Bind a resource to more than one operation.

Page 128: Lower Power Design Guide

Optimum binding

Page 129: Lower Power Design Guide

Example

Page 130: Lower Power Design Guide

RESOURCE SHARING

• Parallel vs. time-sharing buses (or execution units)

• Resource sharing can destroy signal correlations and increase switching activity; it should be done between operations that are strongly connected.

• Map operations with correlated input signals to the same units

• Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify interconnect (busses, multiplexers, buffers)

Page 131: Lower Power Design Guide

Datapath interconnections

• Multiplexer-oriented datapath

• Bus-oriented datapath

Page 132: Lower Power Design Guide

Sequential Execution

• Example of three micro-operations in the same clock period

Page 133: Lower Power Design Guide

Insertion of Latch (out)

• Insertion of latches at the output ports of the functional units

Page 134: Lower Power Design Guide

Insertion of Latch (in/out)

• Insertion of latches at both the input and output ports of the functional units

Page 135: Lower Power Design Guide

Overlapping Data Transfer(in)

• Overlapping read and write data transfers

Page 136: Lower Power Design Guide

Overlapping of Data Transfer (in/out)

• Overlapping data transfer with functional-unit execution

Page 137: Lower Power Design Guide

Register Allocation Using Clique Partitioning

• Scheduled DFG

• Graph model

• Lifetime intervals of variable

• Clique-partitioning solution

Page 138: Lower Power Design Guide

Left-Edge Algorithm

• Register allocation using Left-Edge Algorithm

Page 139: Lower Power Design Guide

Register Allocation: Left-Edge Algorithm

• Sorted variable lifetime intervals • Five-register allocation result
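The left-edge algorithm can be sketched in a few lines: variables are sorted by the start of their lifetime, and each register is filled greedily with the next variable whose lifetime starts after the previous one ends. The lifetimes below are hypothetical (they are not the five-register example from the slide), and each lifetime is treated as a half-open interval [start, end).

```python
def left_edge(lifetimes):
    """Pack variables into registers; returns a list of variable lists.

    lifetimes: {variable: (start, end)}, live in the half-open range
    [start, end), so a variable may reuse a register at the step where
    the previous occupant ends.
    """
    remaining = sorted(lifetimes, key=lambda v: (lifetimes[v][0], v))
    registers = []
    while remaining:
        track, skipped, last_end = [], [], None
        for var in remaining:
            start, end = lifetimes[var]
            if last_end is None or start >= last_end:
                track.append(var)  # fits after the previous variable
                last_end = end
            else:
                skipped.append(var)  # overlaps; try the next register
        registers.append(track)
        remaining = skipped
    return registers


# Hypothetical variable lifetimes in control steps.
lifetimes = {"v1": (0, 2), "v2": (0, 1), "v3": (1, 3), "v4": (2, 4), "v5": (3, 4)}
allocation = left_edge(lifetimes)
```

For interval lifetimes like these, the number of registers used equals the maximum number of variables alive at the same time (two here).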

Page 140: Lower Power Design Guide

Register Allocation

• Allocation : bind registers and functional modules to variables and operations in the CDFG and specify the interconnection among modules and registers in terms of MUX or BUS.

• Reduce capacitance during allocation by minimizing the number of functional modules, registers, and multiplexers.

• A composite weight with respect to transition activity and capacitance loads is incorporated into the CDFG.

• Find the edge with the highest composite weight and merge the two nodes it joins, i.e., map the corresponding variables to the same register.

• Allocation continues till no edges are left in the CDFG while updating the composite weight values.

• Set the maximum # of operations alive in any control step to be one.

• Sequence operations/variables to enhance signal correlations

Page 141: Lower Power Design Guide

Exploiting spatial locality for interconnect power reduction

• A spatially local cluster: group of algorithm operations that are tightly connected to each other in the flowgraph representation.

• Two nodes are tightly connected to each other on the flowgraph representation if the shortest distance between them, in terms of number of edges traversed, is low.

• A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.

• Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (with local bus) and relatively few occur between clusters (with global bus).

• The partitioning information is passed to the architecture netlist and floorplanning tools.

• Local: a given adder outputs data to its own inputs

• Global: a given adder outputs data to the other adder's inputs
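The local/global distinction can be counted directly from a flow graph and an assignment of operations to hardware units: a transfer is local when producer and consumer are mapped to the same unit, and global otherwise. The sketch below uses a hypothetical four-operation graph and two hypothetical assignments onto Adder1 and Adder2.

```python
def count_transfers(edges, unit_of):
    """Return (local, global) transfer counts for one assignment.

    edges:   list of (producer_op, consumer_op) data transfers
    unit_of: {operation: hardware unit name}
    """
    local = sum(1 for src, dst in edges if unit_of[src] == unit_of[dst])
    return local, len(edges) - local


# Hypothetical flow graph: four additions with four data transfers.
transfers = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
clustered = {"a": "Adder1", "b": "Adder1", "c": "Adder2", "d": "Adder2"}
interleaved = {"a": "Adder1", "b": "Adder2", "c": "Adder1", "d": "Adder2"}

clustered_counts = count_transfers(transfers, clustered)
interleaved_counts = count_transfers(transfers, interleaved)
```

The clustered assignment keeps half of the transfers on local buses, while the interleaved one sends every transfer over the global bus.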

Page 142: Lower Power Design Guide

Hardware Mapping

• The last step in the synthesis process maps the allocated, assigned and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks.

• The result of this process is a structural description of the processor architecture, (e.g., sdl input to the Lager IV silicon assembly environment).

• The mapping process transforms the flow graph into three structural sub-graphs:

– the data path structure graph

– the controller state machine graph

– the interface graph (between data path control inputs and the controller output signals)

Page 143: Lower Power Design Guide

Spectral Partitioning in High-Level Synthesis

• The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together.

• The relative distance is a measure of the tightness of connections.

• Use the eigenvector ordering to generate several partitioning solutions

• The area estimates are based on distribution graphs.

• A distribution graph displays the expected number of operations executed in each time slot.

• Local bus power: the number of local data transfers times the area of the cluster

• Global bus power: the number of global data transfers times the total area

Page 144: Lower Power Design Guide

Finding a good Partition

Page 145: Lower Power Design Guide

Interconnection Estimation

• For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height.

• Average global bus length: square root of the estimated chip area.

• The three terms represent white space, active area of the components, and wiring area. The coefficients are derived statistically.

Page 146: Lower Power Design Guide

Experiments

Page 147: Lower Power Design Guide

Datapath Generation

• Register file recognition and multiplexer reduction:

– Individual registers are merged as much as possible into register files

– This reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses) and the number of control signals (since a register file uses a local decoder).

• Minimize the multiplexers and I/O busses simultaneously (clique partitioning is NP-complete, thus Simulated Annealing is used)

• Data path partitioning is used to optimize the processor floorplan

• The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.

Page 148: Lower Power Design Guide

Hardware Mapper

Page 149: Lower Power Design Guide

Test Example

Page 150: Lower Power Design Guide

Control Signal Assignment

Page 151: Lower Power Design Guide

Incorporating into HYPER-LP