Power/Temperature analysis of register file architecture for superscalar processor
Hardware/Software co-design term-end project
R92922128 水沼 仁志 2004/06/08
Contents
1. Introduction
2. Motivation & background
3. Register File architecture study
4. Simulation methodology
5. HotSpot introduction
6. Experimental result
7. Conclusion
1. Introduction
“Temperature” has become a serious headache for modern microprocessor designers.
Quoted from the “Temperature-aware microarchitecture” presentation slides.
[Figure: heat density (W/cm2, logarithmic scale from 1 to 1000) versus year (1990 to 2010), with reference levels labeled Hotplate (2003), Nuclear plant, Rocket nozzle, and Solar surface. Quoted from Intel developer’s forum.]
If we don’t take action right now, heat density will reach an unbearable level within a decade.
2. Motivation & background
The register file has become a hot spot on modern processors.
[Figure: thermal maps with the hot spots marked. Quoted from “Thermal Modeling and Measurement of Large High Power Silicon Devices with Asymmetric Power Distribution”, Jeffrey Deeney, and from “Temperature-aware micro-architecture”, Kevin Skadron et al., 2003.]
The register file also becomes a critical path that determines cycle time as issue width increases.
[Figure: register file, MUX, and IF/ID stages for a 4-way design (Way0 to Way3, labeled “8-reg”) and a 2-way design (Way0 and Way1, labeled “4-reg”).]
Two major schemes have been proposed to reduce RF delay:
1. RF duplicating
2. RF banking
“RF duplicating” reduces port density by duplicating the RF (from 8 ports to 4 ports).
[Figure: the original 4-way design (Way0 to Way3) with a single 8-reg RF, compared with RF duplicating, where two copies of the 8-reg RF serve Cluster0 and Cluster1.]
“RF banking” reduces port density by splitting the RF into a multiple-bank structure (from 8 ports to 2 ports).
[Figure: the original 4-way design (Way0 to Way3) with a single 8-reg RF, compared with RF banking, where four 2-reg RF banks are connected to the ways through MUXes.]
These two schemes also save power, because the total power needed to drive each line/port is reduced.
PROs of both schemes (RF duplicating and RF banking):
- Clock up thanks to port density decrease
- Power reduction thanks to port density decrease
RF duplicating scheme’s drawback 1: additional power is required to keep the contents of the two RFs synchronized.
[Figure: RF read and RF write activity over time for Way0 to Way3, in the original design and in RF duplicating (Cluster0 and Cluster1).]
RF duplicating scheme’s drawback 2: the inter-cluster bypass path becomes a performance bottleneck.
[Figure: Cluster0 and Cluster1, each with its own 8-reg RF and MUXes, placed between the Instruction Fetch, Instruction Decode, and Renaming Window stages and the Data cache.]
RF banking scheme’s drawback: performance is lost to bank conflicts when too many global ports try to access the same local port/bank.
[Figure: four 2-reg RF banks, each with MUXes, decoders, and arbiters that connect the global ports to the local ports, placed between the Instruction Fetch, Instruction Decode, and Renaming Window stages and the Data cache.]
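The bank-conflict drawback can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `count_bank_conflicts`, the mapping of a register to a bank by integer division, and the rule that requests beyond a bank's port count in one cycle are the ones that stall. The arbiter's actual policy is not described in the slides and is not modeled.

```python
from collections import Counter


def count_bank_conflicts(requested_regs, regs_per_bank, ports_per_bank):
    """Count the requests in one cycle that exceed a bank's local ports.

    Hypothetical model: register r lives in bank r // regs_per_bank, and
    each bank can serve at most ports_per_bank requests per cycle.
    """
    requests_per_bank = Counter(reg // regs_per_bank for reg in requested_regs)
    return sum(max(0, n - ports_per_bank) for n in requests_per_bank.values())


# Four requests spread over four different banks: nothing has to wait.
spread = count_bank_conflicts([0, 2, 4, 6], regs_per_bank=2, ports_per_bank=2)

# Four requests that all fall into bank 0: two of them exceed its two ports.
clustered = count_bank_conflicts([0, 1, 0, 1], regs_per_bank=2, ports_per_bank=2)

print(spread, clustered)
```

In this toy model, the same number of requests costs nothing when spread across banks and causes stalls when concentrated in one bank, which is the effect the drawback describes.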
The pros and cons of the two schemes are as follows.
RF duplicating
PROs:
- Clock up thanks to port density decrease
- Power reduction thanks to port density decrease
CONs:
- Power increase due to RF synchronization overhead
- CPI degradation due to longer bypass overhead in case of inter-cluster communication
RF banking
PROs:
- Clock up thanks to port density decrease
- Power reduction thanks to port density decrease
CONs:
- CPI degradation due to instruction stall in case of port/bank conflict
3. Research goal
From the power/temperature viewpoint, we need to analyze and quantify the power overhead caused by these schemes.
My research goal is to use power and temperature as metrics to evaluate two clock-up schemes for the register file, “RF duplicating” and “RF banking”.
[Figure: performance, power, and temperature are obtained from architectural simulation, power simulation, and temperature simulation, respectively.]
My experimental procedure is as follows:
1. Modify the architectural simulator (SimpleScalar) and the power simulator (Wattch) to imitate the “RF duplicating” and “RF banking” schemes.
2. Study the temperature simulator (HotSpot) and combine it with the architectural/power simulators.
3. Evaluate the power/temperature impact of both clock-up schemes.
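4. Simulation methodology
[Figure: simulation flow. SimpleScalar, with the Alpha 21364 configuration and a SPEC 2000 benchmark as inputs, provides the FU (functional unit) access pattern and the CPI. Wattch provides the active power per FU. HotSpot provides the functional unit temperature. The CPI is used for the net performance calculation.]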
Simulated processor configuration
Frequency: 600 MHz
Pipeline width: 4
# of RUUs: 16
Load/Store queue depth: 8
# of integer ALU (Mult/Div): 4 (1)
# of FP ALU (Mult/Div): 1 (1)
TLB I/D: 64 entries, 4way, 30 cycles / 64 entries, 4way, 30 cycles
Instruction length: 32 bit
L2 (64B Blocks): 32KB, 4way, 6 cycles (Unified)
L1 I$/D$ (32KB Blocks): 16KB, direct, LRU / 16KB, 4way, LRU
Branch prediction: Bimod (BTB size: 2048)
Mis-prediction latency: 3 cycles
Memory latency: first 18 cycles, next 2 cycles
Program execution parameter: GCC, Fastfwd: 100M cycles, Duration: 100M cycles
The RF configuration was changed as follows:
- Original: one RF with 32 registers
- RF duplicating: two RFs with 32 registers each
- RF banking: two banks with 16 registers each
4. Simulation methodology (RF duplicating)
I modified ruu_dispatch() in SimpleScalar to emulate the read event for the “RF duplicating” scheme. The modified flow is:
1. Set dispatch width = 0.
2. Fetch an instruction from the buffer.
3. If dispatch width > 1, dispatch the instruction to Cluster 1; otherwise dispatch it to Cluster 0.
4. If the instruction needs an RF read, increment Regfile0_access (Cluster 0) or Regfile1_access (Cluster 1).
5. Increment dispatch width.
6. If there are more instructions in the buffer, go back to step 2; otherwise exit.
I modified ruu_commit() in SimpleScalar to emulate the write event for the “RF duplicating” scheme. The modified flow is:
1. An instruction is committed.
2. If the instruction does not need an RF write, exit.
3. Otherwise, whether the instruction belongs to Cluster 0 or Cluster 1, increment both Regfile0_access and Regfile1_access, because the result is written to both RF copies.
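The two RF duplicating flowcharts can be sketched as a small counting model. This is an illustrative Python sketch, not the actual SimpleScalar C code: the `Inst` class and the function names `dispatch` and `commit` are hypothetical, and only the counter updates follow the flowcharts (a read touches the local copy, a write touches both copies).

```python
from dataclasses import dataclass


@dataclass
class Inst:
    """Hypothetical instruction record used only for this sketch."""
    needs_rf_read: bool
    needs_rf_write: bool
    cluster: int = 0  # assigned at dispatch


def dispatch(buffer, counters):
    """Read side: count RF reads as in the modified dispatch flow."""
    dispatch_width = 0
    for inst in buffer:
        # Dispatch width > 1 sends the instruction to Cluster 1.
        inst.cluster = 1 if dispatch_width > 1 else 0
        if inst.needs_rf_read:
            counters[inst.cluster] += 1  # only the local RF copy is read
        dispatch_width += 1


def commit(inst, counters):
    """Write side: count RF writes as in the modified commit flow."""
    if inst.needs_rf_write:
        # Both copies are written so that their contents stay synchronized.
        counters[0] += 1
        counters[1] += 1


counters = [0, 0]  # [Regfile0_access, Regfile1_access]
group = [Inst(True, True), Inst(True, False), Inst(False, True), Inst(True, True)]
dispatch(group, counters)
for inst in group:
    commit(inst, counters)
print(counters)
```

The write path is where the synchronization overhead shows up: every committed write is charged to both register file copies, regardless of which cluster produced it.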
4. Simulation methodology (RF banking)
I modified ruu_dispatch() in SimpleScalar to emulate the read event for the “RF banking” scheme. The modified flow is:
1. Set dispatch width = 0.
2. Fetch an instruction from the buffer.
3. If the instruction needs an RF read, then for each source register: if the source register number is greater than 15, access Bank1 and increment Bank1_access; otherwise access Bank0 and increment Bank0_access.
4. Increment dispatch width.
5. If there are more instructions in the buffer, go back to step 2; otherwise exit.
I modified ruu_commit() in SimpleScalar to emulate the write event for the “RF banking” scheme. The modified flow is:
1. An instruction is committed.
2. If the instruction does not need an RF write, exit.
3. Otherwise, if the destination register number is greater than 15, increment Bank1_access (Bank 1); otherwise increment Bank0_access (Bank 0).
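The RF banking flowcharts reduce to one rule: a register number above 15 selects Bank 1, anything else selects Bank 0. The sketch below counts accesses under that rule. It is illustrative Python rather than the modified SimpleScalar source, and the trace format (a tuple of source registers plus an optional destination register) is a hypothetical one chosen for brevity.

```python
LAST_BANK0_REG = 15  # registers 0..15 live in Bank 0, 16..31 in Bank 1


def bank_of(reg):
    """Select the bank with the same test as the flowcharts (reg # > 15?)."""
    return 1 if reg > LAST_BANK0_REG else 0


def count_bank_accesses(trace):
    """Return [Bank0_access, Bank1_access] for a list of instructions.

    Each instruction is (source_regs, dest_reg); dest_reg is None when the
    instruction does not need an RF write.
    """
    counters = [0, 0]
    for source_regs, dest_reg in trace:
        for reg in source_regs:           # dispatch: one read per source register
            counters[bank_of(reg)] += 1
        if dest_reg is not None:          # commit: one write to a single bank
            counters[bank_of(dest_reg)] += 1
    return counters


trace = [((1, 17), 3), ((20,), None), ((), 30), ((4, 5), 18)]
accesses = count_bank_accesses(trace)
print(accesses)
```

Unlike RF duplicating, each write is charged to exactly one bank, so no synchronization accesses appear in the counters.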
5. HotSpot introduction
A simplistic dynamic compact thermal model (a.k.a. RC model) was used. This model uses the electrical-thermal duality below:
V <-> temperature (T)
I <-> power (P)
R <-> thermal resistance (Rth)
C <-> thermal capacitance (Cth)
RC = time constant
For a block of material with area A and thickness t:
Rth = t / (k · A)
Cth = c · t · A
k = thermal conductivity of the material (W/(m·K))
c = thermal capacitance per unit volume (J/(m3·K))
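As a worked instance of the two formulas, the sketch below computes Rth and Cth for one block. The numbers are assumptions for illustration only: the silicon constants (k = 100 W/(m·K), c = 1.75e6 J/(m3·K)) are typical values and are not given in the slides, and the block geometry (1 mm x 1 mm, 50 um thick) is hypothetical.

```python
def thermal_rc(thickness_m, area_m2, k, c):
    """Return (Rth, Cth) for a block, using Rth = t/(k*A) and Cth = c*t*A."""
    r_th = thickness_m / (k * area_m2)   # K/W
    c_th = c * thickness_m * area_m2     # J/K
    return r_th, c_th


K_SILICON = 100.0    # W/(m*K), assumed typical value for silicon
C_SILICON = 1.75e6   # J/(m^3*K), assumed typical value for silicon

# Hypothetical block: 1 mm x 1 mm area, 50 um thick.
r_th, c_th = thermal_rc(thickness_m=50e-6, area_m2=1e-6, k=K_SILICON, c=C_SILICON)
time_constant = r_th * c_th  # seconds

print(f"Rth = {r_th:.3g} K/W, Cth = {c_th:.3g} J/K, RC = {time_constant:.3g} s")
```

A thinner or larger block lowers Rth, while Cth grows with the block volume, so the time constant RC depends only on the thickness and the material constants.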
Three power terms act on a block:
P: total wattage generated inside the block
Cth · dT/dt: wattage consumed to heat up the block
T / Rth: wattage passing through the block
From the power balance among the three, P = Cth · dT/dt + T / Rth. Hence, dT/dt = (Rth · P - T) / (Rth · Cth).
[Figure: a block at temperature T (℃) reaches T + dT/dt (℃) one unit of time later, with P generated inside and T / Rth passing through.]
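For constant P, the power balance can also be solved in closed form, which gives a useful check on any numerical solution. The derivation below assumes that T is measured as a rise above the starting temperature, so that T(0) = 0; this initial condition is an assumption and is not stated in the slides.

```latex
\begin{align*}
P &= C_{th}\,\frac{dT}{dt} + \frac{T}{R_{th}} \\
\frac{dT}{dt} &= \frac{R_{th}P - T}{R_{th}C_{th}} \\
T(t) &= R_{th}P\left(1 - e^{-t/(R_{th}C_{th})}\right), \qquad T(0) = 0 \\
\lim_{t \to \infty} T(t) &= R_{th}P
\end{align*}
```

The temperature rise therefore approaches Rth · P exponentially, with time constant Rth · Cth.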
The differential equation above is solved using a fourth-order Runge-Kutta method.
1. Solve dy/dx = f(x, y) with the initial state y(x0) = y0.
2. Partition the interval into n steps of width dx.
3. At x0, x1 = x0 + dx, x2 = x0 + 2dx, ..., xn = x0 + n·dx, the approximate increment k is calculated from k1, k2, k3, and k4 as follows:
   k = (k1 + 2k2 + 2k3 + k4) / 6
   k1 = f(x0, y0) dx
   k2 = f(x0 + dx/2, y0 + k1/2) dx
   k3 = f(x0 + dx/2, y0 + k2/2) dx
   k4 = f(x0 + dx, y0 + k3) dx
4. y1 is calculated as y1 = y0 + (k1 + 2k2 + 2k3 + k4) / 6.
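The fourth-order Runge-Kutta steps can be written as a short routine. This is a minimal sketch and not HotSpot's own solver; the function names `rk4_step` and `integrate` are hypothetical. It is checked here against dy/dx = -y with y(0) = 1, a test equation chosen because its exact solution exp(-x) is known.

```python
import math


def rk4_step(f, x0, y0, dx):
    """Advance y by one step of width dx using fourth-order Runge-Kutta."""
    k1 = f(x0, y0) * dx
    k2 = f(x0 + dx / 2, y0 + k1 / 2) * dx
    k3 = f(x0 + dx / 2, y0 + k2 / 2) * dx
    k4 = f(x0 + dx, y0 + k3) * dx
    # Weighted average of the four slopes: the middle two count twice.
    return y0 + (k1 + 2 * k2 + 2 * k3 + k4) / 6


def integrate(f, x0, y0, dx, n):
    """Apply n RK4 steps starting from (x0, y0) and return the final y."""
    x, y = x0, y0
    for _ in range(n):
        y = rk4_step(f, x, y, dx)
        x += dx
    return y


# Test equation dy/dx = -y, y(0) = 1, integrated from x = 0 to x = 1.
approx = integrate(lambda x, y: -y, 0.0, 1.0, 0.1, 10)
exact = math.exp(-1.0)
print(f"RK4: {approx:.8f}, exact: {exact:.8f}")
```

For the thermal model, f would be the right-hand side of the temperature equation, with x as time and y as the temperature T.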
I simulated the register file temperature over time with the following values:
P = 10.0 W
Rth = 1.25 K/W
Cth = 0.005 J/K
The values of P, Rth, and Cth are computed following “Reducing Power Density through Activity Migration”.
[Figure: the register file block on the silicon die.]
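With these three values, the behaviour of the RC model can be checked by hand. The sketch below uses the closed-form step response of the RC equation, T(t) = P · Rth · (1 - exp(-t / (Rth · Cth))). It assumes constant power and a temperature rise of zero at t = 0; both are assumptions made for this example, and the function name `temperature_rise` is hypothetical.

```python
import math

P = 10.0      # W, total power generated in the block (from the text)
R_TH = 1.25   # K/W, thermal resistance (from the text)
C_TH = 0.005  # J/K, thermal capacitance (from the text)

TAU = R_TH * C_TH        # time constant in seconds
STEADY_RISE = P * R_TH   # final temperature rise in kelvin


def temperature_rise(t_seconds):
    """Temperature rise after t_seconds of constant power, starting from 0."""
    return STEADY_RISE * (1.0 - math.exp(-t_seconds / TAU))


for multiple in (1, 3, 5):
    t = multiple * TAU
    print(f"t = {t * 1e3:.2f} ms: rise = {temperature_rise(t):.2f} K")
```

Under these assumptions the block settles at a rise of 12.5 K with a time constant of 6.25 ms, so it is close to its steady state after a few tens of milliseconds.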
The Alpha 21364 floor plan was used after slight modification.
I assume that the area of the integer register file functional unit remains the same after implementing the “RF duplicating” and “RF banking” schemes.
The heat sink and heat spreader dimensions remain the same as the HotSpot default settings.
Simulated thermal environment factors
Die thickness: 50 um
Die area: 16 mm x 16 mm
Convection capacitance: 140.4 J/K
Convection resistance: 0.2 K/W
Heat-sink side: 60 mm
Heat-sink thickness: 6.9 mm
Spreader side: 30 mm
Spreader thickness: 1.0 mm
Interface material thickness: 0.075 mm
Ambient temperature: 40 C
Sampling interval: 10K cycles (= 16.67 usec at 600 MHz)
Activity factor (for Wattch): static activity factors (the power value does NOT depend on FU access status)
Clock gating method (for Wattch): ideal, aggressive (zero power consumed when powered off)
6. Experimental result
Temperature simulation result
[Figure: temperature of the register file (C, axis from 46 to 60) versus time (in units of 10K cycles, axis from 1000 to 9000) for Original, RF duplicating, and RF banking.]
Power/Temperature simulation results
Original: RF power average 0.286 W, peak temperature 59.4 C
RF duplicating: RF power average 0.218 W (-23.8%), peak temperature 57.6 C (-3.1%)
RF banking: RF power average 0.104 W (-63.6%), peak temperature 54.9 C (-7.6%)
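The power-saving percentages follow directly from the average RF power values. The short check below recomputes them; the helper name `reduction_percent` is hypothetical, and the inputs are the rounded figures from the results table.

```python
def reduction_percent(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return (baseline - value) / baseline * 100.0


# Average RF power in watts, taken from the results table.
rf_power_w = {"Original": 0.286, "RF duplicating": 0.218, "RF banking": 0.104}

savings = {
    scheme: reduction_percent(rf_power_w["Original"], power)
    for scheme, power in rf_power_w.items()
    if scheme != "Original"
}

for scheme, percent in savings.items():
    print(f"{scheme}: {percent:.1f}% less RF power than the original")
```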
7. Conclusion
1. The clock-up schemes “RF duplicating” and “RF banking” also have a positive effect on power saving.
2. “RF banking” reduces RF power by 63.6%, while “RF duplicating” reduces it by 23.8%.
3. The peak temperature remains almost the same (3.1% and 7.6% lower), despite the large power savings above.
4. Other temperature-reduction schemes must be invented to tackle the hot-spot problem.