Power/Temperature analysis of register file architecture for superscalar processor

Hardware/Software co-design term-end project

R92922128 水沼 仁志 2004/06/08

Contents

1. Introduction
2. Motivation & background
3. Register File architecture study
4. Simulation methodology
5. HotSpot introduction
6. Experimental result
7. Conclusion

1. Introduction

“Temperature” has become a serious headache for modern microprocessor designers.

Quoted from “Temperature-aware microarchitecture presentation slide”

[Figure: heat density (W/cm2, logarithmic axis from 1 to 1000) versus year (1990 to 2010), with reference levels labeled Hotplate (2003), Nuclear plant, Rocket nozzle, and Solar surface.]

Quoted from Intel developer’s forum

If we do not take action now, heat density will reach an unbearable level within a decade.

2. Motivation & background

The register file has become a hot spot on modern processors.

[Figure: hot spots. Quoted from “Thermal Modeling and Measurement of Large High Power Silicon Devices with Asymmetric Power Distribution”, Jeffrey Deeney]

[Figure: hot spots. Quoted from “Temperature-aware micro-architecture”, Kevin Skadron et al., 2003]

[Figure: a 4-way pipeline (Way0 to Way3, each with IF and ID stages and operand MUXes) sharing one RF labeled “8-reg”, next to a 2-way pipeline (Way0, Way1) sharing one RF labeled “4-reg”.]

The register file also becomes the critical path that determines cycle time as the issue width increases.

Two major schemes were proposed to reduce RF delay.

1. RF duplicating

2. RF banking


“RF duplicating” reduces port density by duplicating the RF (from 8 ports to 4 ports).

[Figure: (Original) Way0 to Way3 sharing a single RF, versus (RF duplicating) two clusters, Cluster0 and Cluster1, each with its own copy of the RF and its own MUXes.]

“RF banking” reduces port density by splitting the RF into a multiple-bank structure (from 8 ports to 2 ports).

[Figure: (Original) Way0 to Way3 sharing a single RF, versus (RF banking) the RF split into banks labeled “2-reg RF”, each reached through MUXes from Way0 to Way3.]

Both schemes also save power, because the total power needed to drive each line/port is reduced.

PROs of both RF duplicating and RF banking:
- Higher clock frequency (clock-up) thanks to the decrease in port density
- Power reduction thanks to the decrease in port density

Drawback 1 of the RF duplicating scheme: additional power is required to keep the contents of the two RFs synchronized.

[Figure: RF read and RF write events over time for Way0 to Way3, in the (Original) design and in the (RF duplicating) design with Cluster0 and Cluster1.]

Drawback 2 of the RF duplicating scheme: the inter-cluster bypass path becomes a performance bottleneck.

[Figure: Cluster0 and Cluster1, each with its own RF copy and MUXes, connected to the Renaming, Window, Instruction Decode, Instruction Fetch, and Data cache blocks.]

Drawback of the RF banking scheme: performance is lost to bank conflicts when too many global ports try to access the same local port/bank.

[Figure: banked RF (banks labeled “2-reg RF”) with Decoders, Arbiters, and MUXes between the global ports and the local ports, connected to the Renaming, Window, Instruction Decode, Instruction Fetch, and Data cache blocks.]
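The bank-conflict drawback can be illustrated with a minimal sketch. It only counts how many register requests in one cycle exceed the local ports of a bank; the function name, the bank size of 16 registers, and the number of local ports per bank are assumptions made for illustration and are not taken from the slides or from any simulator.

```python
from collections import Counter


def count_bank_conflicts(requested_regs, regs_per_bank=16, ports_per_bank=1):
    """Return how many register requests cannot be served in this cycle.

    requested_regs: register numbers requested by the global ports in one cycle.
    regs_per_bank, ports_per_bank: hypothetical bank geometry (assumptions).
    """
    # Map each request to its bank and count the demand per bank.
    demand = Counter(reg // regs_per_bank for reg in requested_regs)
    # Requests beyond the number of local ports of a bank have to wait.
    return sum(max(0, n - ports_per_bank) for n in demand.values())
```

For example, `count_bank_conflicts([3, 7, 20, 5], ports_per_bank=2)` sends three requests to bank 0 and one to bank 1, so one request has to wait. An arbiter in a real design would also decide which request waits, which this sketch does not model.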

3. Research goal

The pros and cons of the two schemes are as follows.

RF duplicating
PROs:
- Higher clock frequency (clock-up) thanks to the decrease in port density
- Power reduction thanks to the decrease in port density
CONs:
- Power increase due to RF synchronization overhead
- CPI degradation due to longer bypass overhead in case of inter-cluster communication

RF banking
PROs:
- Higher clock frequency (clock-up) thanks to the decrease in port density
- Power reduction thanks to the decrease in port density
CONs:
- CPI degradation due to instruction stalls in case of port/bank conflict

From the power/temperature viewpoint, we need to analyze and quantify the power overhead caused by these schemes.

RF duplicating
PROs:
- Higher clock frequency (clock-up) thanks to the decrease in port density
- Power reduction thanks to the decrease in port density
CONs:
- Power increase due to RF synchronization overhead
- CPI degradation due to longer bypass overhead in case of inter-cluster communication

RF banking
PROs:
- Higher clock frequency (clock-up) thanks to the decrease in port density
- Power reduction thanks to the decrease in port density
CONs:
- CPI degradation due to instruction stalls in case of port/bank conflict

My research goal is to use power and temperature as metrics to evaluate two clock-up schemes for the register file, “RF duplicating” and “RF banking”.

[Figure: Performance, Power, and Temperature, paired with Architectural simulation, Power simulation, and Temperature simulation.]

My experimental procedure is as follows:

1. Modify the architectural simulator (SimpleScalar) and the power simulator (Wattch) to imitate the “RF duplicating” and “RF banking” schemes.

2. Study the temperature simulator (HotSpot) and combine it with the architectural/power simulators.

3. Evaluate the power/temperature impact of both clock-up schemes.

4. Simulation Methodology

[Figure: simulation flow. Inputs: Alpha 21364 configuration and SPEC 2000 benchmark. Tools: SimpleScalar, Wattch, HotSpot. Quantities shown: FU (functional unit) access pattern, CPI, active power per FU, net performance calculation, functional unit temperature.]

Simulated processor configuration

Frequency: 600 MHz
Pipeline width: 4
# of RUUs: 16
Load/Store queue depth: 8
# of integer ALU (Mult/Div): 4 (1)
# of FP ALU (Mult/Div): 1 (1)
TLB I/D: 64 entries, 4way, 30 cycles / 64 entries, 4way, 30 cycles
Instruction length: 32 bit
L2 (64B Blocks): 32KB, 4way, 6 cycles (Unified)
L1 I$/D$ (32KB Blocks): 16KB, direct, LRU / 16KB, 4way, LRU
Branch prediction: Bimod (BTB size: 2048)
Mis-prediction latency: 3 cycles
Memory latency: first 18 cycles, next 2 cycles
Program execution parameter: GCC, Fastfwd: 100M cycles, Duration: 100M cycles

The RF configuration was changed as follows:

- RF duplicating: the original 32-register RF becomes two RFs of 32 registers each.
- RF banking: the original 32-register RF becomes two banks of 16 registers each.

I modified ruu_dispatch() in SimpleScalar to emulate the read event for the “RF duplicating” scheme:

1. Set dispatch width = 0.
2. Fetch an instruction from the buffer.
3. If dispatch width > 1, dispatch the instruction to Cluster 1; otherwise dispatch it to Cluster 0.
4. If the instruction needs an RF read, increment the counter of that cluster's RF (Regfile0_access++ for Cluster 0, Regfile1_access++ for Cluster 1).
5. Increment dispatch width.
6. If there are more instructions in the buffer, go back to step 2; otherwise exit.

I modified ruu_commit() in SimpleScalar to emulate the write event for the “RF duplicating” scheme:

1. An instruction is committed.
2. If the instruction does not need an RF write, exit.
3. Otherwise, check whether the instruction belongs to Cluster 0 or Cluster 1.
4. In either case, both copies are written: Regfile0_access++ and Regfile1_access++.
5. Exit.
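The two flowcharts for “RF duplicating” can be sketched as a small access counter. This is not the SimpleScalar code itself: the `Inst` record and the function name are hypothetical, and the rule that the first two dispatch slots go to Cluster 0 is my reading of the “Dispatch width > 1?” decision in the flowchart.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    reads_rf: bool   # the instruction needs an RF read
    writes_rf: bool  # the instruction needs an RF write


def count_duplicated_rf_accesses(dispatch_group):
    """Count accesses to both RF copies for one group of instructions."""
    counts = {"Regfile0_access": 0, "Regfile1_access": 0}
    clusters = []
    # Dispatch (read events): slots 0 and 1 go to Cluster 0, the rest to Cluster 1.
    for dispatch_width, inst in enumerate(dispatch_group):
        cluster = 1 if dispatch_width > 1 else 0
        clusters.append(cluster)
        if inst.reads_rf:
            counts[f"Regfile{cluster}_access"] += 1
    # Commit (write events): every write updates both copies to keep them in sync.
    for inst in dispatch_group:
        if inst.writes_rf:
            counts["Regfile0_access"] += 1
            counts["Regfile1_access"] += 1
    return counts, clusters
```

The doubled write count is where the synchronization overhead of the scheme shows up in the power estimate: reads are split between the copies, while writes are paid twice.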

I modified ruu_dispatch() in SimpleScalar to emulate the read event for the “RF banking” scheme:

1. Set dispatch width = 0.
2. Fetch an instruction from the buffer.
3. If the instruction needs an RF read, check each source register: if source reg # > 15, access Bank1 (Bank1_access++); otherwise access Bank0 (Bank0_access++). Repeat while there are more source registers.
4. Increment dispatch width.
5. If there are more instructions in the buffer, go back to step 2; otherwise exit.

I modified ruu_commit() in SimpleScalar to emulate the write event for the “RF banking” scheme:

1. An instruction is committed.
2. If the instruction does not need an RF write, exit.
3. If destination reg # > 15, the write goes to Bank 1 (Bank1_access++); otherwise it goes to Bank 0 (Bank0_access++).
4. Exit.
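The bank selection used in both “RF banking” flowcharts can be sketched as follows. The instruction representation (a tuple of source register numbers plus an optional destination register) and the function names are assumptions for illustration, not SimpleScalar data structures.

```python
def bank_of(reg):
    """Registers 0-15 map to Bank0, registers 16-31 to Bank1."""
    return 1 if reg > 15 else 0


def count_banked_rf_accesses(instructions):
    """Count per-bank accesses for (source_regs, dest_reg_or_None) tuples."""
    counts = [0, 0]  # [Bank0_access, Bank1_access]
    for source_regs, dest_reg in instructions:
        # Read events: one access per source register, in the bank that holds it.
        for reg in source_regs:
            counts[bank_of(reg)] += 1
        # Write event: a single access to the bank that holds the destination.
        if dest_reg is not None:
            counts[bank_of(dest_reg)] += 1
    return counts
```

Unlike the duplicated RF, each write touches only one bank, which is consistent with the lower RF power reported for banking later in the slides.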

5. HotSpot introduction

A simplistic dynamic compact thermal model (a.k.a. RC model) was used. This model uses the electrical-thermal duality below:

- V <-> temperature (T)
- I <-> power (P)
- R <-> thermal resistance (Rth)
- C <-> thermal capacitance (Cth)
- RC = time constant

For a block of area A and thickness t:

Rth = t / (k · A)
Cth = c · t · A

k = thermal conductivity of the material (W/(m·K))
c = thermal capacitance per unit volume (J/(m3·K))
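As a worked example of the two formulas, the sketch below evaluates Rth and Cth for one block. The material constants (k = 100 W/(m·K), c = 1.75e6 J/(m3·K), typical figures for silicon) and the block dimensions are illustrative assumptions and are not values from these slides.

```python
def thermal_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """Rth = t / (k * A), in K/W."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


def thermal_capacitance(heat_capacity_j_per_m3k, thickness_m, area_m2):
    """Cth = c * t * A, in J/K."""
    return heat_capacity_j_per_m3k * thickness_m * area_m2


# Assumed example block: 2 mm x 2 mm, 0.5 mm thick (hypothetical dimensions).
K_SILICON = 100.0   # W/(m*K), assumed
C_SILICON = 1.75e6  # J/(m^3*K), assumed
AREA = 2e-3 * 2e-3  # m^2
THICKNESS = 0.5e-3  # m

r_th = thermal_resistance(THICKNESS, K_SILICON, AREA)
c_th = thermal_capacitance(C_SILICON, THICKNESS, AREA)
time_constant_s = r_th * c_th  # RC time constant in seconds
```

A thinner or larger block lowers Rth, while Cth grows with the volume t · A, so the time constant Rth · Cth depends only on the thickness and the material for this simple one-block view.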

- P: total wattage generated inside the block
- Cth · dT/dt: wattage consumed to heat up the block
- T / Rth: wattage passing through the block

From the power balance among the three, P = Cth · dT/dt + T / Rth. Hence, dT/dt = (Rth · P - T) / (Rth · Cth).

[Figure: a block at temperature T (℃) that generates P and passes T / Rth onward; one unit of time later its temperature is T + dT/dt (℃).]

The differential equation above is solved using a fourth-order Runge-Kutta method.

1. Solve dy/dx = f(x, y) with the initial state y(x0) = y0.
2. Partition the interval into n steps of width dx.
3. At x0, x1 = x0 + dx, x2 = x0 + 2dx, ..., xn = x0 + n·dx, the approximate increment k is calculated from k1, k2, k3, and k4 as follows:

   k = (k1 + 2·k2 + 2·k3 + k4) / 6

   where

   k1 = f(x0, y0)·dx
   k2 = f(x0 + dx/2, y0 + k1/2)·dx
   k3 = f(x0 + dx/2, y0 + k2/2)·dx
   k4 = f(x0 + dx, y0 + k3)·dx

4. y1 is then calculated as y1 = y0 + (k1 + 2·k2 + 2·k3 + k4) / 6.
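The Runge-Kutta steps can be sketched in a few lines and applied to the thermal equation dT/dt = (Rth · P - T) / (Rth · Cth). This is a minimal single-block sketch rather than HotSpot's solver; the function names are hypothetical, and T is treated here as the temperature rise above ambient, which is my assumption about the model.

```python
def rc_rhs(temp, power, r_th, c_th):
    """Right-hand side of dT/dt = (Rth*P - T) / (Rth*Cth)."""
    return (r_th * power - temp) / (r_th * c_th)


def rk4_step(f, x, y, dx):
    """Advance y by one step of width dx using k = (k1 + 2*k2 + 2*k3 + k4) / 6."""
    k1 = f(x, y) * dx
    k2 = f(x + dx / 2, y + k1 / 2) * dx
    k3 = f(x + dx / 2, y + k2 / 2) * dx
    k4 = f(x + dx, y + k3) * dx
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6


def simulate_temperature(power, r_th, c_th, dt, steps, temp0=0.0):
    """Integrate the single-block RC model for `steps` steps of width dt."""
    def f(_time, temp):
        return rc_rhs(temp, power, r_th, c_th)

    temp, time = temp0, 0.0
    for _ in range(steps):
        temp = rk4_step(f, time, temp, dt)
        time += dt
    return temp
```

For constant power the equation also has the closed-form solution T(t) = Rth · P · (1 - exp(-t / (Rth · Cth))), which is a convenient check on the step size.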

I simulated register file temperature over time…

P = 10.0 W
Rth = 1.25 K/W
Cth = 0.005 J/K

The values of P, Rth, and Cth are those computed in “Reducing Power Density through Activity Migration”.

[Figure: the register file block on the silicon die.]
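With these three values the behaviour of the single-block RC model can be worked out directly. The sketch below assumes constant power and a block that starts at ambient temperature, and reports the temperature rise above ambient; the function names are hypothetical.

```python
import math

P = 10.0      # W
R_TH = 1.25   # K/W
C_TH = 0.005  # J/K


def steady_state_rise(power, r_th):
    """Temperature rise reached when dT/dt = 0, i.e. T = Rth * P."""
    return power * r_th


def time_constant(r_th, c_th):
    """RC time constant in seconds."""
    return r_th * c_th


def rise_at(t, power, r_th, c_th):
    """Closed-form rise T(t) = Rth*P*(1 - exp(-t / (Rth*Cth))) from a cold start."""
    return power * r_th * (1.0 - math.exp(-t / (r_th * c_th)))
```

Under these assumptions the block settles 12.5 K above ambient with a time constant of 6.25 ms, so it reaches about 63% of the final rise after 6.25 ms.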

The Alpha 21364 floorplan was used after slight modification.

I assume that the area of the integer register file functional unit remains the same after implementing the “RF duplicating” and “RF banking” schemes.

The heat sink and heat spreader dimensions remain the same as in the HotSpot default settings.

Simulated thermal environment factors

Die thickness: 50 um
Die area: 16 mm x 16 mm
Convection capacitance: 140.4 J/K
Convection resistance: 0.2 K/W
Heat-sink side: 60 mm
Heat-sink thickness: 6.9 mm
Spreader side: 30 mm
Spreader thickness: 1.0 mm
Interface material thickness: 0.075 mm
Ambient temperature: 40 C
Sampling interval: 10K cycles (= 16.67 usec at 600 MHz)
Activity factor (for Wattch): static activity factors (power value does NOT depend on FU access status)
Clock gating method (for Wattch): ideal, aggressive (zero power consumed when powered off)

6. Experimental result

Temperature simulation result

[Figure: temperature of the register file (C, axis from 46 to 60) versus cycle (in units of 10K cycles, 1000 to 9000) for Original, RF duplicating, and RF banking.]

Power/Temperature simulation results

Metric: Original / RF duplicating / RF banking
RF power average: 0.286 W / 0.218 W (-23.8%) / 0.104 W (-63.6%)
Peak temperature: 59.4 C / 57.6 C (-3.1%) / 54.9 C (-7.6%)
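The percentages in the table can be recomputed from the rounded entries with a short sketch. The last column of the sketch, the change in temperature rise above the 40 C ambient, is not in the slides; it is added here only as an alternative way to read the same numbers, and the function names are hypothetical.

```python
AMBIENT_C = 40.0  # ambient temperature from the thermal environment table

RF_POWER_W = {"Original": 0.286, "RF duplicating": 0.218, "RF banking": 0.104}
PEAK_TEMP_C = {"Original": 59.4, "RF duplicating": 57.6, "RF banking": 54.9}


def percent_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


def summarize():
    """Return rounded percentage changes versus the original RF for each scheme."""
    rows = {}
    for scheme in ("RF duplicating", "RF banking"):
        rows[scheme] = {
            "power_pct": round(percent_change(RF_POWER_W[scheme], RF_POWER_W["Original"]), 1),
            "temp_pct": round(percent_change(PEAK_TEMP_C[scheme], PEAK_TEMP_C["Original"]), 1),
            # Assumption: compare the rise above ambient instead of the Celsius value.
            "rise_pct": round(
                percent_change(PEAK_TEMP_C[scheme] - AMBIENT_C,
                               PEAK_TEMP_C["Original"] - AMBIENT_C), 1),
        }
    return rows
```

On the rounded table entries the peak-temperature change for RF duplicating comes out at about 3.0%; the 3.1% in the table presumably reflects unrounded simulation values. Measured as rise above ambient, the temperature change is larger than the Celsius percentages suggest but still clearly smaller than the power saving.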

7. Conclusion

1. The clock-up schemes “RF duplicating” and “RF banking” also have a positive effect on power saving.

2. “RF banking” reduces RF power by 63.6%, while “RF duplicating” reduces it by 23.8%.

3. The peak temperature remains almost the same, despite the large power saving above.

4. Other temperature-reduction schemes must be devised to tackle the hot-spot problem.
