Lower Power Architecture Design. VADA Lab., SungKyunKwan Univ., 1999.8.2. Prof. Jun-Dong Cho (조준동), Sungkyunkwan University. http://vada.skku.ac.kr


Page 1:

Lower Power Architecture Design

1999. 8.2

Prof. Jun-Dong Cho, Sungkyunkwan University. http://vada.skku.ac.kr

Page 2:

Architectural-level Synthesis

• Translate HDL models into sequencing graphs.
• Behavioral-level optimization:
– Optimize abstract models independently of the implementation parameters.
• Architectural synthesis and optimization:
– Create the macroscopic structure: data path and control unit.
– Consider area and delay information of the implementation.
• Hardware compilation:
– Compile the HDL model into a sequencing graph.
– Optimize the sequencing graph.
– Generate gate-level interconnection for a cell library.

Page 3:

Architecture-Level Solutions

• Architecture-Driven Voltage Scaling: choose a more parallel architecture; lowering Vdd reduces energy but increases delays.
• Regularity: minimize the power in the control hardware and the interconnection network.
• Modularity: exploit data locality through distributed processing units, memories, and control.
– Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
– Temporal locality: average lifetimes of variables (less temporary storage; data referenced in the recent past has a high probability of future accesses).
• Few memory references: references to memories are expensive in terms of power.
• Precompute the physical capacitance of interconnect and the switching activity (number of bus accesses).
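Architecture-driven voltage scaling can be sketched numerically. The following is a minimal illustration, not from the slides: it assumes the usual first-order models P = C·f·Vdd² for dynamic power and d(V) ∝ V/(V - Vt)² for gate delay, duplicates the datapath (capacitance roughly doubles, plus an assumed routing overhead), halves the rate per copy, and lowers Vdd until the slowed gates just meet the relaxed timing.

```python
# Assumed first-order models (not from the slides):
# dynamic power P = C_eff * f * Vdd^2, gate delay d(V) = K * V / (V - Vt)^2.

VT = 0.7          # threshold voltage (V), assumed
K = 1.0           # delay-model constant, arbitrary units

def delay(vdd):
    """First-order CMOS gate-delay model (grows as Vdd falls toward VT)."""
    return K * vdd / (vdd - VT) ** 2

def power(c_eff, f, vdd):
    """Dynamic switching power."""
    return c_eff * f * vdd ** 2

# Reference datapath: Vdd = 5 V, unit frequency and capacitance.
c_ref, f_ref, v_ref = 1.0, 1.0, 5.0
p_ref = power(c_ref, f_ref, v_ref)

# Parallel datapath: two copies, each at f/2; gates may be 2x slower.
target = 2 * delay(v_ref)
v = v_ref
while v > VT + 0.05 and delay(v - 0.01) <= target:
    v -= 0.01                      # lower Vdd while timing still met
c_par = 2.2 * c_ref                # 2 copies + ~10% muxing overhead (assumed)
p_par = power(c_par, f_ref / 2, v)

print(f"Vdd lowered to {v:.2f} V; power ratio = {p_par / p_ref:.2f}")
```

Under these assumptions the parallel version ends up near Vdd ≈ 3 V at roughly 40% of the reference power, despite the doubled capacitance, which is the trade the slide describes.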

Page 4:

Power Measure of P

Page 5:

Architecture Trade-off: Reference Data Path

Page 6:

Parallel Data Path

Page 7:

Pipelined Data Path

Page 8:

A Simple Data Path, Result

Page 9:

Uni-processor Implementation

Page 10:

Multi-Processor Implementation

Page 11:

Datapath Parallelization

Page 12:

FIR Parallelization

M. Mehendale, S. D. Sherlekar, and G. Venkatesh, "Low-Power Realization of FIR Filters on Programmable DSPs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, December 1998.

Page 13:

Memory Parallelization

To first order, P = C · (f/2) · Vdd².

Page 14:

VLIW Architecture

Page 15:

VLIW - cont.

• The compiler takes the responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing them. VLIW instruction decoding is easier than superscalar decoding due to the fixed format and the absence of instruction dependencies.
• The fixed format can place more limitations on how operations are combined.
• Intel P6: CISC instructions are combined on chip into a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.
• As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one.
• VLIW architectures, being simpler than N-issue superscalar machines, can be considered promising for achieving high speed and low power simultaneously.

Page 16:

Synchronous vs. Asynchronous

• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization purposes. As technology advances, systems get bigger and bigger, and as a result the delay on the clock wires can no longer be ignored. Clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs. The biggest gate is the clock driver itself, which must switch.

• Asynchronous system (self-timed): an input signal (request) starts the computation on a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems can potentially respond to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.

Page 17:

Asynchronous - Cont.

• More difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
• If a signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used; designers cannot find the supporting design tools and methodologies they need.
• The DCC error corrector of a compact-cassette player saves 80% of the power compared to its synchronous counterpart.
• Offers more architectural options/freedom; encourages distributed, localized control; offers more freedom to adapt the supply voltage.

S. Furber, M. Edwards. “Asynchronous Design Methodologies”. 1993

Page 18:

Asynchronous design with adaptive scaling of the supply voltage

(a) Synchronous system

(b) Asynchronous system with adaptive scaling of the supply voltage
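The adaptive-scaling idea in (b) can be sketched as a feedback loop. This is a hypothetical toy simulation (models and constants are mine, not from the slides): a self-timed block has no clock period to meet, so a regulator can lower Vdd until throughput just matches the input rate, using the occupancy of the input FIFO as the error signal.

```python
# Toy model: a self-timed block processes items at a rate that depends on Vdd.
def throughput(vdd, vt=0.7):
    # assumed rate model ~ (Vdd - Vt)^2 / Vdd, items per tick
    return (vdd - vt) ** 2 / vdd

fifo, vdd = 0.0, 5.0
arrival_rate = 1.0                     # items per tick (assumed constant load)
for tick in range(2000):
    fifo += arrival_rate
    fifo = max(0.0, fifo - throughput(vdd))
    # simple bang-bang regulator on FIFO occupancy
    if fifo > 4:
        vdd = min(5.0, vdd + 0.01)     # falling behind: raise Vdd
    elif fifo < 2:
        vdd = max(1.0, vdd - 0.01)     # idle margin: lower Vdd, save C*Vdd^2

print(f"settled Vdd ~ {vdd:.2f} V")
```

With these assumed models the supply settles near the minimum voltage that sustains the arrival rate (about 2.2 V here), which is the power advantage the slide attributes to asynchronous operation.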

Page 19:

Asynchronous Pipeline

Page 20:

PIPELINED SELF-TIMED MICROPROCESSOR

Page 21:

Hazard-free Circuits

6% more logic

Page 22:

Through WAVE PIPELINING

Page 23:

Wave-pipelining on FPGA

• Problems with pipelining:
– Balanced partitioning
– Delay-element overhead
– Tclk > Tmax - Tmin + clock skew + setup/hold time
– Increase in area, power, and total delay
– Clock distribution problem
• Wave pipelining = high throughput without such overhead = ideal pipelining
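The clock-period bound above can be made concrete with a small calculation. This is a sketch with assumed margin values: in wave pipelining several "waves" of data coexist in the same combinational logic, so the clock period is limited by the spread of path delays (Tmax - Tmin) plus skew and setup/hold margins, rather than by Tmax itself.

```python
def conventional_tclk(tmax, t_skew, t_su_hold):
    # clock must cover the longest combinational path
    return tmax + t_skew + t_su_hold

def wave_tclk(tmax, tmin, t_skew, t_su_hold):
    # clock must only cover the delay *spread* between waves
    return (tmax - tmin) + t_skew + t_su_hold

# assumed numbers (ns), loosely shaped like the FPGA experiment on a later slide
tmax, tmin, skew, su_hold = 68.9, 52.3, 1.0, 2.0
print("conventional:", conventional_tclk(tmax, skew, su_hold))
print("wave:        ", wave_tclk(tmax, tmin, skew, su_hold))
```

Balancing the layout so that Tmin approaches Tmax (the next slides' theme) directly shrinks the achievable wave-pipelined clock period.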

Page 24:

FPGA on WavePipeline

• LUT delay is similar across different logic functions.
• Equal delays can therefore be constructed.
• FPGA element delays (wire, LUT, interconnection)
• Powerful layout editor
• Fast design cycle

Page 25:

WP advantages

• Area efficient: registers, the clock distribution network, and clock buffers are not needed.
• Low power dissipation
• Higher throughput
• Low latency

Page 26:

Disadvantages

• Degraded performance in certain cases
• Difficult to achieve rise and fall times as sharp as in synchronous design
• Layout is critical for balancing the delays
• Parameter variation: power-supply and temperature dependence

Page 27:

Experimental Results

                   Conventional    Pipeline                     Wavepipeline
Registers          0               286                          28
Max path delay     74.188 ns       12.730 ns                    68.969 ns
Min path delay     n/a             9.0 ns                       52.356 ns
Max freq.          13.5 MHz        78.6 MHz                     50 MHz
CLB count          49              143                          148
Latency            75 ns           169 ns (13 clk)              80 ns
Power              19.6 mW/MHz     76.8 mW/MHz + clock driver   64.8 mW/MHz

By Jae-Hyung Lee (이재형), SKKU

Page 28:

Observation

• The WP multiplier did not gain much in power because many LUTs had to be added to adjust the delays.
• On an FPGA, using dedicated delay elements to adjust delay, rather than LUTs or net delays, would be more effective.
• Also, designing a multiplier whose paths all have the same number of levels would make WP implementation easy and should yield large gains in power and area over a pipelined structure.

Page 29:

VON NEUMANN VERSUS HARVARD

Page 30:

Power vs Area of Micro-coded Microprocessor

At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.

Page 31:

Memory Architecture

Page 32:

Exploiting Locality for Low-Power Design

• Power consumption (mW) in the maximally time-shared and fully parallel versions of the QMF sub-band coder filter
• Improvement of a factor of 10.5 at the expense of a 20% increase in area
• The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.

• A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flow-graph representation.
• Two nodes are tightly connected on the flow graph if the shortest distance between them, in terms of number of edges traversed, is low.
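The "tightly connected" test above is just a shortest-path threshold on the flow graph. A minimal sketch (graph and threshold are made up for illustration):

```python
from collections import deque

def shortest_dist(adj, src, dst):
    """Unweighted shortest-path distance via BFS; None if unreachable."""
    seen, q = {src}, deque([(src, 0)])
    while q:
        node, d = q.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, d + 1))
    return None

# toy flow graph: two natural clusters {a,b,c} and {x,y,z} joined by one edge
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
       "x": ["c", "y", "z"], "y": ["x", "z"], "z": ["x", "y"]}

def tight(u, v, threshold=1):
    return shortest_dist(adj, u, v) <= threshold

print(tight("a", "b"), tight("a", "z"))
```

Grouping nodes whose pairwise distances stay below the threshold recovers the two clusters, which a partitioner could then map to separate local datapaths to cut interconnect power.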

Page 33:

Cascade filter layouts

(a) Non-local implementation from Hyper (b) Local implementation from Hyper-LP

Page 34:

Frequency Multipliers and Dividers

Page 35:

Low Power DSP

• Most of the execution time is spent in DO-LOOPs:
– VSELP vocoder: 83.4%
– 2D 8x8 DCT: 98.3%
– LPC computation: 98.0%

Power minimization of DO-LOOPs ==> power minimization of the DSP

VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding

Page 36:

Low Power DSP

• Instruction buffer (or cache): exploit locality to reduce program-memory accesses.
• Decoded Instruction Buffer:
– Store the decoding result of a LOOP's first iteration in RAM and reuse it.
– Removes the fetch/decode steps.
– 30~40% power saving.

Page 37:

Stage-Skip Pipeline

• The power saving is achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except for the first iteration.
• DIB = Decoded Instruction Buffer
• 40% power savings using a DSP or RISC processor.

Page 38:

Stage-Skip Pipeline

• Selector: selects the output from either the instruction decoder or the DIB.
• The decoded instruction signals for a loop are temporarily stored in the DIB and reused in each iteration of the loop.
• The power wasted in a conventional pipeline is saved by stopping instruction fetching and decoding during each loop execution.

Page 39:

Stage-Skip Pipeline

The majority of execution cycles in signal-processing programs are used for loop execution: 40% reduction in power with a 2% area increase.
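A back-of-the-envelope model shows where the saving comes from. The per-stage energy weights below are my own assumptions, not measured values: with a DIB, fetch and decode energy is paid only on the first loop iteration.

```python
# Relative per-instruction stage energies (assumed for illustration).
E_FETCH, E_DECODE, E_EXEC = 1.0, 1.0, 2.0

def loop_energy(n_instr, n_iter, use_dib):
    """Total energy of a loop of n_instr instructions over n_iter iterations."""
    first = n_instr * (E_FETCH + E_DECODE + E_EXEC)
    if use_dib:
        rest = (n_iter - 1) * n_instr * E_EXEC          # fetch/decode skipped
    else:
        rest = (n_iter - 1) * n_instr * (E_FETCH + E_DECODE + E_EXEC)
    return first + rest

base = loop_energy(n_instr=8, n_iter=100, use_dib=False)
dib = loop_energy(n_instr=8, n_iter=100, use_dib=True)
print(f"saving: {1 - dib / base:.1%}")
```

With these weights the saving lands near the 40% figure quoted above; the real number depends on how much of the pipeline energy the fetch and decode stages actually consume.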

Page 40:

Optimizing Power using Transformations

INPUT FLOWGRAPH --> OUTPUT FLOWGRAPH, driven by:
• LOCAL TRANSFORMATION PRIMITIVES: associativity, distributivity, retiming, common sub-expression
• GLOBAL TRANSFORMATION PRIMITIVES: retiming, pipelining, look-ahead, associativity
• SEARCH MECHANISM: simulated rejectionless, steepest descent, heuristics
• POWER ESTIMATION

Page 41:

Data-flow based transformations

• Tree-height reduction
• Constant and variable propagation
• Common subexpression elimination
• Code motion
• Dead-code elimination
• Application of algebraic laws such as commutativity, distributivity, and associativity.
• Most of the parallelism in an algorithm is embodied in its loops.
• Loop jamming, partial and complete loop unrolling, strength reduction, loop retiming, and software pipelining.
• Retiming: maximize resource utilization.

Page 42:

Tree-height reduction

• Example of tree-height reduction using commutativity and associativity
• Example of tree-height reduction using distributivity
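The effect of tree-height reduction can be counted directly. A minimal sketch: a + b + c + d evaluated as ((a + b) + c) + d needs 3 sequential adder levels, while the associatively rebalanced (a + b) + (c + d) needs only 2, shortening the critical path (which can then be traded for a lower Vdd).

```python
import math

def chain_depth(n_operands):
    """Adder levels for a left-to-right chain of 2-input adds."""
    return n_operands - 1

def tree_depth(n_operands):
    """Adder levels for a balanced binary tree of 2-input adds."""
    return math.ceil(math.log2(n_operands))

print(chain_depth(4), tree_depth(4))   # chain: 3 levels, tree: 2 levels
```

The operation count is unchanged (three adds either way); only the depth, and hence the minimum clock period, improves.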

Page 43:

Sub-expression elimination

• Logic expressions:
– Performed by logic optimization.
– Kernel-based methods.

• Arithmetic expressions:
– Search for isomorphic patterns in the parse trees.
– Example: a = x + y; b = a + 1; c = x + y;
– becomes: a = x + y; b = a + 1; c = a;
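The arithmetic case above can be sketched as a tiny value-numbering pass (a simplified illustration of pattern matching on canonicalized right-hand sides, not a full parse-tree isomorphism search):

```python
def cse(assignments):
    """assignments: list of (lhs, rhs-as-canonical-string) pairs.
    Replaces a repeated rhs with the variable that already holds it."""
    seen, out = {}, []
    for lhs, rhs in assignments:
        if rhs in seen:
            out.append((lhs, seen[rhs]))     # reuse the earlier result
        else:
            seen[rhs] = lhs
            out.append((lhs, rhs))
    return out

prog = [("a", "x+y"), ("b", "a+1"), ("c", "x+y")]
print(cse(prog))   # [('a', 'x+y'), ('b', 'a+1'), ('c', 'a')]
```

A real pass must also invalidate entries when x or y is reassigned; that bookkeeping is omitted here.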

Page 44:

Examples of other transformations

• Dead-code elimination:
– a = x; b = x + 1; c = 2 * x;
– a = x; can be removed if a is not referenced.

• Operator-strength reduction:
– a = x^2; b = 3 * x;
– becomes: a = x * x; t = x << 1; b = x + t;

• Code motion:
– for (i = 1; i < a * b; ) { }
– becomes: t = a * b; for (i = 1; i < t; ) { }
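The strength reductions above can be checked numerically, a small sketch: x^2 becomes a single multiply, and 3 * x becomes a shift plus an add (no multiplier needed in hardware).

```python
def three_times(x):
    """3*x without a multiplier: x + 2x, where 2x is a left shift."""
    return x + (x << 1)

for x in range(-8, 9):
    assert x * x == x ** 2              # squaring as one multiply
    assert three_times(x) == 3 * x      # constant multiply as shift + add
print("ok")
```

In hardware, shift-and-add networks for constant multiplications switch far less capacitance than a general multiplier, which is the power motivation here.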

Page 45:

Control-flow based transformations

• Model expansion:
– Expand subroutines to flatten the hierarchy.
– Useful to expand the scope of other optimization techniques.
– Problematic when the routine is called more than once.
– Example: x = a + b; y = a * b; z = foo(x, y); with foo(p, q) { t = q - p; return t; }
– By expanding foo: x = a + b; y = a * b; z = y - x;

• Conditional expansion:
– Transform a conditional into parallel execution with the test at the end.
– Useful when the test depends on late signals.
– May preclude hardware sharing.
– Always useful for logic expressions.
– Example: y = ab; if (a) x = b + d; else x = bd; can be expanded to x = a(b + d) + a'bd;
– Reusing y = ab: y = ab; x = y + d(a + b);
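The conditional-expansion identities above can be verified exhaustively over all Boolean inputs, a quick sanity-check sketch:

```python
from itertools import product

# Check:  "if (a) x = b + d; else x = b d"  ==  a(b + d) + a'(b d)
# and, reusing y = ab, also  x = y + d(a + b).
for a, b, d in product([0, 1], repeat=3):
    mux = (b | d) if a else (b & d)            # original if/else
    flat = (a & (b | d)) | ((1 - a) & b & d)   # expanded (test at the end)
    y = a & b
    shared = y | (d & (a | b))                 # form reusing y = ab
    assert mux == flat == shared
print("equivalent")
```

The flat forms compute both branches in parallel and resolve the choice combinationally, which is why the transformation helps when the test signal a arrives late.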

Page 46:

Strength reduction

[Figure: expression trees showing X^2 + AX + B factored as X(X + A) + B, replacing the squaring with a single multiply; a second pair of trees applies the same factoring to a larger polynomial with an additional coefficient C.]
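The factoring in the figure is Horner's rule; a short sketch makes the operation count explicit: a degree-n polynomial needs only n multiplications in Horner form.

```python
def horner(coeffs, x):
    """Evaluate c0*x^n + c1*x^(n-1) + ... + cn with n multiplies and n adds."""
    acc = 0
    for c in coeffs:
        acc = acc * x + c
    return acc

A, B, C, X = 3, 5, 7, 11
assert horner([1, A, B], X) == X * X + A * X + B            # X(X + A) + B
assert horner([1, A, B, C], X) == X**3 + A * X**2 + B * X + C
print("ok")
```

Fewer multiplications means less switched capacitance per evaluation, which is the strength-reduction payoff the slide is after.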

Page 47:

Strength Reduction

Page 48:

DIGLOG multiplier

Writing A = 2^j + A_res and B = 2^k + B_res, with j and k the leading-one positions (and n the word length in bits), one iteration approximates the product as

A x B ~ (B << j) + (A << k) - 2^(j+k)

so the multiplication reduces to shifts and adds; the dropped term A_res x B_res is exactly the error, and it can be fed back for further iterations.

                     1st Iter   2nd Iter   3rd Iter
Worst-case error       -25%       -6%       -1.6%
Prob. of error < 1%     10%       70%       99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
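The iterative scheme can be sketched directly (my reconstruction of the recurrence, assuming the residue-feedback reading of the garbled equations above):

```python
def diglog(a, b, iters):
    """Shift-and-add approximate multiply: each iteration handles the
    leading-one terms and recurses on the residues below them."""
    if a == 0 or b == 0 or iters == 0:
        return 0
    j, k = a.bit_length() - 1, b.bit_length() - 1
    approx = (b << j) + (a << k) - (1 << (j + k))
    a_res, b_res = a - (1 << j), b - (1 << k)
    return approx + diglog(a_res, b_res, iters - 1)

# One iteration underestimates by at most 25%; enough iterations are exact.
for a in range(1, 256):
    for b in range(1, 256):
        err = (diglog(a, b, 1) - a * b) / (a * b)
        assert -0.25 <= err <= 0
        assert diglog(a, b, 8) == a * b   # 8-bit inputs: exact within 8 steps
print("ok")
```

The identity A·B = 2^j·B + 2^k·A - 2^(j+k) + A_res·B_res holds exactly, so the only error at each step is the deferred residue product; this is why the worst-case error shrinks geometrically with the iteration count in the table above.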

Page 49:

Logarithmic Number System

With L_A = log2|A|, arithmetic strength is reduced:

L_{A·B} = L_A + L_B
L_{A/B} = L_A - L_B
L_{A^2} = 2·L_A (a left shift)
L_{A^(1/2)} = L_A / 2 (a right shift)

--> Significant Strength Reduction
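A quick numerical sketch of the identities above (using floating-point logs for illustration; a real LNS unit stores fixed-point exponents and needs lookup tables for addition, which is not shown here):

```python
import math

def to_lns(x):
    """Encode magnitude as a base-2 logarithm (sign handled separately)."""
    return math.log2(abs(x))

def from_lns(l):
    return 2.0 ** l

a, b = 6.0, 1.5
assert math.isclose(from_lns(to_lns(a) + to_lns(b)), a * b)       # multiply -> add
assert math.isclose(from_lns(to_lns(a) - to_lns(b)), a / b)       # divide -> subtract
assert math.isclose(from_lns(2 * to_lns(a)), a * a)               # square -> shift left
assert math.isclose(from_lns(to_lns(a) / 2), math.sqrt(a))        # sqrt -> shift right
print("ok")
```

The catch is that addition and subtraction become the hard operations in LNS, so the representation pays off for multiply/divide-dominated signal-processing kernels.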

Page 50:

Switching Activity Reduction

(a) Average activity in a multiplier as a function of the constant value

(b) Parallel and serial implementations of an adder tree.

Page 51:

Pipelining

Page 52:

Associativity Transformation

Page 53:

Interlaced Accumulation Programming for Low Power

Page 54:

Two’s complement implementation of an accumulator

Page 55:

Sign-magnitude implementation of an accumulator

Page 56:

Number representation trade-off for arithmetic

Page 57:

Signal statistics for the sign-magnitude implementation of the accumulator datapath, assuming random inputs.
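The representation trade-off can be sketched by counting bit transitions. This is an illustrative experiment with assumed parameters: for a small-amplitude signal that changes sign often, two's complement toggles all the sign-extension bits on every sign change, while sign-magnitude toggles only the sign bit.

```python
import random

WIDTH = 16
MASK = (1 << WIDTH) - 1

def twos(x):
    """16-bit two's-complement encoding."""
    return x & MASK

def signmag(x):
    """16-bit sign-magnitude encoding: MSB is the sign bit."""
    return (abs(x) & (MASK >> 1)) | ((x < 0) << (WIDTH - 1))

def transitions(vals, enc):
    """Total bit flips on the bus when the values stream through."""
    bits = [enc(v) for v in vals]
    return sum(bin(p ^ q).count("1") for p, q in zip(bits, bits[1:]))

random.seed(0)
seq = [random.randint(-3, 3) for _ in range(10000)]   # small, sign-varying signal
t2, tsm = transitions(seq, twos), transitions(seq, signmag)
print(f"two's complement: {t2} flips, sign-magnitude: {tsm} flips")
```

For such inputs sign-magnitude switches far less, at the cost of more complex add/subtract logic, which is exactly the trade-off the preceding slides weigh.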