Power/Temperature analysis of register file architecture for superscalar processor

Hardware/Software co-design term-end project

R92922128 水沼 仁志 2004/06/08



Contents

1. Introduction

2. Motivation & background

3. Register File architecture study

4. Simulation methodology

5. HotSpot introduction

6. Experimental result

7. Conclusion

1. Introduction

“Temperature” has become a serious headache for modern microprocessor designers.

Quoted from the “Temperature-aware microarchitecture” presentation slides.

[Figure: heat density (W/cm2, logarithmic axis from 1 to 1000) versus year (1990 to 2010), with reference levels marked for a hotplate (2003), a nuclear plant, a rocket nozzle, and the solar surface. Quoted from Intel developer’s forum.]

If we don’t take action right now, the heat density will reach an unbearable level within a decade.

2. Motivation & background

The register file has become a hot spot on modern processors.

[Figure: hot spots on a die. Quoted from “Thermal Modeling and Measurement of Large High Power Silicon Devices with Asymmetric Power Distribution”, Jeffrey Deeney.]

[Figure: hot spots. Quoted from “Temperature-aware micro-architecture”, Kevin Skadron et al., 2003.]

The register file also becomes a critical path that determines the cycle time as the issue width increases.

[Figure: a 2-way pipeline (Way0, Way1) whose IF/ID stages and MUXes share a register file labeled “4-reg”, next to a 4-way pipeline (Way0 to Way3) sharing a register file labeled “8-reg”.]

Two major schemes have been proposed to reduce RF delay:

1. RF duplicating

2. RF banking

“RF duplicating” reduces port density by duplicating the RF (from 8 ports to 4 ports).

[Figure: (Original) Way0 to Way3, each with IF/ID stages and MUXes, sharing one “8-reg” RF, versus (RF duplicating) the same four ways served by two “8-reg RF” copies, one in Cluster0 and one in Cluster1.]

“RF banking” reduces port density by splitting the RF into a multiple-bank structure (from 8 ports to 2 ports).

[Figure: (Original) Way0 to Way3, each with IF/ID stages and MUXes, sharing one “8-reg” RF, versus (RF banking) the same four ways connected through MUXes to four “2-reg RF” banks.]

These two schemes also save power, because the total power needed to drive each line/port is reduced.

PROs (RF duplicating and RF banking):

- Clock-up thanks to the decrease in port density

- Power reduction thanks to the decrease in port density

Drawback 1 of the RF duplicating scheme: additional power is required to synchronize the contents of the two RFs.

[Figure: RF read and RF write activity over time for Way0 to Way3, (Original) versus (RF duplicating) with the ways split into Cluster0 and Cluster1.]

Drawback 2 of the RF duplicating scheme: the inter-cluster bypass path becomes a performance bottleneck.

[Figure: block diagram with instruction fetch, instruction decode, renaming window, data cache, and two “8-reg RF” copies with their MUXes in Cluster0 and Cluster1.]

Drawback of the RF banking scheme: performance is lost to bank conflicts when too many global ports try to access the same local port/bank.

[Figure: block diagram with instruction fetch, instruction decode, renaming window, data cache, and four “2-reg RF” banks; decoders and arbiters connect the global ports to the local ports of each bank.]

3. Research goal

The pros and cons of the two schemes are as follows.

PROs (RF duplicating and RF banking):

- Clock-up thanks to the decrease in port density

- Power reduction thanks to the decrease in port density

CONs (RF duplicating):

- Power increase due to RF synchronization overhead

- CPI degradation (more cycles per instruction) due to longer bypass overhead in case of inter-cluster communication

CONs (RF banking):

- CPI degradation (more cycles per instruction) due to instruction stalls in case of port/bank conflicts

From the power/temperature viewpoint, we need to analyze and quantify the power overhead caused by these schemes.

My research goal is to use power and temperature as metrics to evaluate two clock-up schemes for the register file, “RF duplicating” and “RF banking”.

[Figure: three metrics and the simulation that produces each: performance (architectural simulation), power (power simulation), and temperature (temperature simulation).]

My experimental procedure is as follows:

1. Modify the architectural simulator (SimpleScalar) and the power simulator (Wattch) to imitate the “RF duplicating” and “RF banking” schemes.

2. Study the temperature simulator (HotSpot) and combine it with the architectural/power simulators.

3. Evaluate the power/temperature impact of both clock-up schemes.

4. Simulation methodology

[Figure: simulation flow. SimpleScalar takes the Alpha 21364 configuration and the SPEC 2000 benchmark as inputs. It outputs the CPI, which feeds the net performance calculation, and the FU (functional unit) access pattern, which feeds Wattch. Wattch outputs the active power per FU, which feeds HotSpot to obtain the functional unit temperature.]

Simulated processor configuration:

Frequency: 600 MHz
Pipeline width: 4
# of RUUs: 16
Load/store queue depth: 8
# of integer ALUs (Mult/Div): 4 (1)
# of FP ALUs (Mult/Div): 1 (1)
TLB (I/D): 64 entries, 4-way, 30 cycles / 64 entries, 4-way, 30 cycles
Instruction length: 32 bit
L2 (64B blocks): 32KB, 4-way, 6 cycles (unified)
L1 I$/D$ (32KB blocks): 16KB, direct, LRU / 16KB, 4-way, LRU
Branch prediction: Bimod (BTB size: 2048)
Mis-prediction latency: 3 cycles
Memory latency: first 18 cycles, next 2 cycles
Program execution parameters: GCC, fast-forward: 100M cycles, duration: 100M cycles

The RF configuration was changed as follows:

Original: one RF of 32 registers

RF duplicating: two RFs of 32 registers each

RF banking: two banks of 16 registers each

4. Simulation methodology (RF duplicating)

I modified ruu_dispatch() in SimpleScalar to emulate the read event for the “RF duplicating” scheme:

1. Set dispatch width = 0.

2. Fetch an instruction from the buffer.

3. If dispatch width > 1, dispatch the instruction to Cluster 1; otherwise dispatch it to Cluster 0.

4. If the instruction needs an RF read, increment Regfile0_access (Cluster 0) or Regfile1_access (Cluster 1).

5. Increment dispatch width.

6. If there are more instructions in the buffer, go back to step 2; otherwise exit.

I modified ruu_commit() in SimpleScalar to emulate the write event for the “RF duplicating” scheme:

1. An instruction is committed.

2. If the instruction does not need an RF write, exit.

3. Otherwise, check whether the instruction belongs to Cluster 0 or Cluster 1.

4. Cluster 0: increment Regfile0_access, then Regfile1_access. Cluster 1: increment Regfile1_access, then Regfile0_access. Both counters are incremented because the write must be reflected in both RF copies.

5. Exit.
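The dispatch and commit counting described above can be sketched in Python rather than in SimpleScalar's C. This is only an illustration under stated assumptions: `Inst` and `count_duplicating_accesses` are hypothetical names, not SimpleScalar identifiers, and reads and writes are counted in one pass even though the real ruu_dispatch() and ruu_commit() run at different pipeline stages.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    """Hypothetical instruction record (not a SimpleScalar structure)."""
    needs_rf_read: bool
    needs_rf_write: bool


def count_duplicating_accesses(dispatch_groups):
    """Return [Regfile0_access, Regfile1_access] for the RF duplicating scheme.

    dispatch_groups is a list of per-cycle instruction lists. Assumption:
    the slot index inside a group plays the role of "dispatch width".
    """
    regfile_access = [0, 0]
    for group in dispatch_groups:
        for dispatch_width, inst in enumerate(group):
            # Slots 0 and 1 go to Cluster 0, later slots to Cluster 1.
            cluster = 1 if dispatch_width > 1 else 0
            if inst.needs_rf_read:
                # A read touches only the RF copy of the instruction's cluster.
                regfile_access[cluster] += 1
            if inst.needs_rf_write:
                # A write updates the local copy and then the other copy.
                regfile_access[cluster] += 1
                regfile_access[1 - cluster] += 1
    return regfile_access
```

The write path is where the synchronization overhead shows up: every write costs one access in each RF copy, while a read costs one access in a single copy.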

4. Simulation methodology (RF banking)

I modified ruu_dispatch() in SimpleScalar to emulate the read event for the “RF banking” scheme:

1. Set dispatch width = 0.

2. Fetch an instruction from the buffer.

3. If the instruction needs an RF read, check each source register: if source reg # > 15, access Bank1 and increment Bank1_access; otherwise access Bank0 and increment Bank0_access. Repeat while there are more source registers.

4. Increment dispatch width.

5. If there are more instructions in the buffer, go back to step 2; otherwise exit.

I modified ruu_commit() in SimpleScalar to emulate the write event for the “RF banking” scheme:

1. An instruction is committed.

2. If the instruction does not need an RF write, exit.

3. If destination reg # > 15, the write goes to Bank 1 and Bank1_access is incremented; otherwise it goes to Bank 0 and Bank0_access is incremented.

4. Exit.
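The bank selection rule above (register numbers 0 to 15 in Bank 0, 16 to 31 in Bank 1) can be sketched as follows. The function names and the tuple representation of an instruction are assumptions made for this illustration; they do not come from SimpleScalar.

```python
def bank_of(reg):
    """Select the bank for a register number: > 15 means Bank 1, else Bank 0."""
    return 1 if reg > 15 else 0


def count_banking_accesses(insts):
    """Return [Bank0_access, Bank1_access] for the RF banking scheme.

    Each instruction is a hypothetical (source_regs, dest_reg) pair, where
    source_regs is a tuple of register numbers (empty if no RF read) and
    dest_reg is a register number or None (no RF write).
    """
    bank_access = [0, 0]
    for source_regs, dest_reg in insts:
        # Read event: one access per source register, in that register's bank.
        for reg in source_regs:
            bank_access[bank_of(reg)] += 1
        # Write event: one access in the destination register's bank only.
        if dest_reg is not None:
            bank_access[bank_of(dest_reg)] += 1
    return bank_access
```

Unlike RF duplicating, a write here touches a single bank, so no synchronization accesses are counted.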

5. HotSpot introduction

A simplistic dynamic compact thermal model (a.k.a. RC model) was used. This model uses the electrical-thermal duality below:

V <-> temperature ($T$)

I <-> power ($P$)

R <-> thermal resistance ($R_{th}$)

C <-> thermal capacitance ($C_{th}$)

RC = time constant

For a block of area $A$ and thickness $t$:

$R_{th} = \frac{t}{k \cdot A}$

$C_{th} = c \cdot t \cdot A$

$k$ = thermal conductivity of the material (W/(m·K))

$c$ = thermal capacitance per unit volume (J/(m^3·K))
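As a small worked example of the two formulas above, the sketch below computes the thermal resistance, the thermal capacitance, and their product (the time constant) for one block. The material constants and the block dimensions are assumed values chosen for illustration (roughly silicon-like); they are not taken from this presentation.

```python
def thermal_rc(t, area, k, c):
    """Return (R_th, C_th) for a block of thickness t [m] and area [m^2].

    k is the thermal conductivity [W/(m K)] and c is the thermal
    capacitance per unit volume [J/(m^3 K)].
    """
    r_th = t / (k * area)   # R_th = t / (k * A), in K/W
    c_th = c * t * area     # C_th = c * t * A, in J/K
    return r_th, c_th


# Assumed, illustrative values (not from the slides).
K_ASSUMED = 100.0      # W/(m K)
C_ASSUMED = 1.75e6     # J/(m^3 K)

r_th, c_th = thermal_rc(t=0.5e-3, area=1.0e-6, k=K_ASSUMED, c=C_ASSUMED)
tau = r_th * c_th       # RC time constant in seconds
print(f"R_th = {r_th:.3f} K/W, C_th = {c_th:.3e} J/K, tau = {tau:.3e} s")
```

Note that the area cancels in the product, so the time constant of a block depends only on its thickness and material: tau = c · t² / k.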

For each block:

$P$: total wattage generated inside the block

$C_{th} \cdot \frac{dT}{dt}$: wattage consumed to heat up the block

$\frac{T}{R_{th}}$: wattage passing through the block

From the power balance among the three, we know that $P = C_{th} \cdot \frac{dT}{dt} + \frac{T}{R_{th}}$. Hence, $\frac{dT}{dt} = \frac{R_{th} P - T}{R_{th} C_{th}}$.

[Figure: a block at temperature T (℃) generates power P and passes T / Rth; one unit of time later its temperature is T + dT/dt (℃).]
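For a single block with constant power, the equation above also has a closed-form solution, which is a convenient check on any numerical result. The derivation below assumes that $T$ is measured relative to the reference (ambient) temperature and that $T(0) = T_0$; both are assumptions made here, not statements from the slides.

```latex
\frac{dT}{dt} = \frac{R_{th} P - T}{R_{th} C_{th}}
\quad\Longrightarrow\quad
T(t) = R_{th} P + \left(T_0 - R_{th} P\right) e^{-t / (R_{th} C_{th})}
```

As $t \to \infty$ the exponential term vanishes and the block settles at $T = R_{th} P$, with time constant $R_{th} C_{th}$.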

The differential equation above is solved using a fourth-order Runge-Kutta method.

1. Solve $dy/dx = f(x, y)$ with the initial state $y(x_0) = y_0$.

2. Partition the interval into $n$ steps of width $dx$.

3. At the points $x_0$, $x_1 = x_0 + dx$, $x_2 = x_0 + 2dx$, ..., $x_n = x_0 + n\,dx$, the approximate increment is calculated from $k_1$, $k_2$, $k_3$, and $k_4$ as

$k = \frac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4)$

where

$k_1 = f(x_0, y_0)\,dx$

$k_2 = f(x_0 + dx/2,\ y_0 + k_1/2)\,dx$

$k_3 = f(x_0 + dx/2,\ y_0 + k_2/2)\,dx$

$k_4 = f(x_0 + dx,\ y_0 + k_3)\,dx$

4. $y_1$ is calculated as $y_1 = y_0 + k = y_0 + (k_1 + 2k_2 + 2k_3 + k_4)/6$.
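The Runge-Kutta steps above translate almost line by line into code. The sketch below is a generic fourth-order Runge-Kutta integrator, not HotSpot's own solver; the names `rk4_step`, `rk4_solve`, and `thermal_rhs` are hypothetical and chosen for this illustration.

```python
def rk4_step(f, x0, y0, dx):
    """Advance y by one step of width dx for dy/dx = f(x, y)."""
    k1 = f(x0, y0) * dx
    k2 = f(x0 + dx / 2, y0 + k1 / 2) * dx
    k3 = f(x0 + dx / 2, y0 + k2 / 2) * dx
    k4 = f(x0 + dx, y0 + k3) * dx
    # Weighted increment: the two midpoint slopes count twice.
    return y0 + (k1 + 2 * k2 + 2 * k3 + k4) / 6


def rk4_solve(f, x0, y0, dx, n):
    """Apply n steps of width dx starting from (x0, y0); return y at x0 + n*dx."""
    x, y = x0, y0
    for _ in range(n):
        y = rk4_step(f, x, y, dx)
        x += dx
    return y


def thermal_rhs(p, r_th, c_th):
    """Build f(t, T) = (R_th * P - T) / (R_th * C_th) for a single RC block."""
    def f(_t, temp):
        return (r_th * p - temp) / (r_th * c_th)
    return f
```

A call such as `rk4_solve(thermal_rhs(p, r_th, c_th), 0.0, 0.0, dx, n)` then gives the temperature of one block after `n` steps, with `p`, `r_th`, and `c_th` supplied by the caller.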

I simulated the register file temperature over time with the following values:

$P$ = 10.0 W

$R_{th}$ = 1.25 K/W

$C_{th}$ = 0.005 J/K

The values of $P$, $R_{th}$, and $C_{th}$ are computed according to “Reducing Power Density through Activity Migration”.

[Figure: the register file block on the silicon die.]
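Plugging the three values above into the RC model gives a quick sanity check on what the simulation should show. The arithmetic below assumes a single block with constant power and a temperature measured relative to the reference temperature; setting $dT/dt = 0$ in the power balance gives the steady state.

```latex
\tau = R_{th} C_{th} = 1.25\ \mathrm{K/W} \times 0.005\ \mathrm{J/K} = 6.25\ \mathrm{ms}
\qquad
T_{steady} = R_{th} P = 1.25\ \mathrm{K/W} \times 10.0\ \mathrm{W} = 12.5\ \mathrm{K}
```

Under these assumptions the block would settle about 12.5 K above the reference temperature, approaching that value with a time constant of 6.25 ms.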

The Alpha 21364 floor-plan was used after slight modification.

I assume that the area of the integer register file functional unit remains the same after implementing the “RF duplicating” and “RF banking” schemes.

The heat sink and heat spreader dimensions remain the same as in the HotSpot default settings.

Simulated thermal environment factors:

Die thickness: 50 um
Die area: 16 mm x 16 mm
Convection capacitance: 140.4 J/K
Convection resistance: 0.2 K/W
Heat-sink side: 60 mm
Heat-sink thickness: 6.9 mm
Spreader side: 30 mm
Spreader thickness: 1.0 mm
Interface material thickness: 0.075 mm
Ambient temperature: 40 C
Sampling interval: 10K cycles (= 16.67 usec at 600 MHz)
Activity factor (for Wattch): static activity factors (power value does NOT depend on FU access status)
Clock gating method (for Wattch): ideal, aggressive (zero power consumed when powered off)

6. Experimental result

Temperature simulation result

[Figure: temperature of the register file (C), axis from 46 to 60, versus time in units of 10K cycles (1000 to 9000), for the Original, RF duplicating, and RF banking configurations.]

Power/Temperature simulation results

RF power average: Original 0.286 W; RF duplicating 0.218 W (-23.8%); RF banking 0.104 W (-63.6%)

Peak temperature: Original 59.4 C; RF duplicating 57.6 C (-3.1%); RF banking 54.9 C (-7.6%)
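The percentages in the table can be recomputed from the rounded values with a few lines of Python. The sketch also reports the change in temperature rise above the 40 C ambient temperature listed in the thermal settings; that second view is an added interpretation, not a figure from the slides. Recomputing from the rounded table values gives about -3.0% for the RF duplicating peak temperature rather than -3.1%, presumably because the original figure was derived from unrounded data.

```python
def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Values copied from the results table.
rf_power_w = {"Original": 0.286, "RF duplicating": 0.218, "RF banking": 0.104}
peak_temp_c = {"Original": 59.4, "RF duplicating": 57.6, "RF banking": 54.9}
AMBIENT_C = 40.0  # ambient temperature from the thermal environment settings

for scheme in ("RF duplicating", "RF banking"):
    d_power = percent_change(rf_power_w[scheme], rf_power_w["Original"])
    d_peak = percent_change(peak_temp_c[scheme], peak_temp_c["Original"])
    # Same comparison, but on the rise above ambient instead of absolute Celsius.
    d_rise = percent_change(peak_temp_c[scheme] - AMBIENT_C,
                            peak_temp_c["Original"] - AMBIENT_C)
    print(f"{scheme}: power {d_power:+.1f}%, peak {d_peak:+.1f}%, "
          f"rise above ambient {d_rise:+.1f}%")
```

Either way, the temperature reduction is much smaller than the power reduction, which is the point the conclusion draws on.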

7. Conclusion

1. The clock-up schemes “RF duplicating” and “RF banking” also have a positive effect on power saving.

2. “RF banking” reduces RF power by 63.6%, while “RF duplicating” reduces it by 23.8%.

3. The peak temperature remains almost the same (reductions of only 7.6% and 3.1%, respectively), despite the large power savings above.

4. Other temperature-reduction schemes must be invented to tackle the hot-spot problem.