1
Zvika Guz
Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez
Out Of Order Execution
2
Out Of Order Execution
• Goal:
  – Performance (IPC > 1)
• How?
  – Wide machine
  – Speculation (branch prediction)
  – Out-of-order execution
    » Essentially a data-flow execution model: operations execute as soon as their operands are available
  – Eliminate name dependencies (aka false dependencies)
    • WAW (output dependence), WAR (anti-dependence)
    » Via register renaming
• But we still want a precise interrupt model
  – In-order commit
    » Via Reorder Buffer (ROB)
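As a rough illustration of how renaming removes WAW and WAR dependencies while keeping the true (RAW) ones, here is a minimal Python sketch. The `rename` helper, the three-instruction program, and the `ROBn` tag naming are assumptions made for this example; they are not taken from the slides.

```python
def rename(instrs):
    """Rename a list of (dest, src1, src2) triples.

    Every result gets a fresh tag, so no two in-flight instructions
    share a destination name (illustrative sketch only).
    """
    rat = {}       # architectural register -> tag of its newest in-flight producer
    renamed = []
    for n, (dest, src1, src2) in enumerate(instrs, start=1):
        # A source reads the newest producer's tag, or the register itself
        # when no in-flight instruction writes it.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        tag = f"ROB{n}"
        rat[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed


program = [
    ("F0", "F2", "F4"),  # 1: writes F0
    ("F2", "F0", "F6"),  # 2: RAW on F0 (from 1), WAR on F2 (read by 1)
    ("F0", "F8", "F6"),  # 3: WAW on F0 (with 1), WAR on F0 (read by 2)
]
renamed = rename(program)
for entry in renamed:
    print(entry)
```

After renaming, every destination is unique, so the WAW and WAR conflicts are gone; only instruction 2 still waits, on the tag produced by instruction 1.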
4
Tomasulo With Reorder buffer:
[Figure: block diagram. FP Op Queue feeds the Reorder Buffer (entries ROB1 = oldest to ROB7 = newest; fields Dest, Value, Instruction, Done?). The RAT and Registers cover F0, F2, F4, F10. Reservation Stations feed the FP adders and FP multipliers. Loads arrive from Memory; stores go To Memory.]
5
4 Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result. Executing only when both operands are in the reservation station checks RAW (this stage is sometimes called "issue").
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result
   When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation").
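The interplay between out-of-order write-back and in-order commit can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `ReorderBuffer` class and its method names are hypothetical, step 2 (execution in the functional units) is left out, and results are simply handed to `write_result` as if they had arrived on the CDB.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: allocate in order, complete in any order, commit in order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # oldest entry at the left (the head)
        self.regs = {}          # architectural register file
        self.next_tag = 1

    def issue(self, dest):
        """Step 1: allocate a ROB slot; return its tag, or None when the ROB is full."""
        if len(self.entries) == self.size:
            return None
        entry = {"tag": self.next_tag, "dest": dest, "value": None, "done": False}
        self.next_tag += 1
        self.entries.append(entry)
        return entry["tag"]

    def write_result(self, tag, value):
        """Step 3: a result arrives and is recorded in the matching ROB entry."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def commit(self):
        """Step 4: retire finished entries from the head only; return their tags."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.regs[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


rob = ReorderBuffer(size=4)
t1 = rob.issue("F0")
t2 = rob.issue("F4")
rob.write_result(t2, 3.0)   # the younger instruction finishes first
print(rob.commit())         # nothing retires: the head (t1) is not done yet
rob.write_result(t1, 1.0)
print(rob.commit())         # now both retire, in program order
```

Keeping the register file update inside `commit` is the point of the design: a result that is already computed stays invisible to architectural state until every older instruction has retired.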
6
Code Example
1. LD F0, 10(R2)
2. ADDD F10, F4, F0
3. DIVD F2, F10, F6
4. BNE F2, <…>
5. LD F4, 0(R3)
6. ADDD F0, F4, F6
7. ADDD F0, F4, F6
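To see which dependencies this sequence contains, the following Python sketch classifies them as RAW, WAR, or WAW. The `find_hazards` helper and the `(dest, sources)` encoding of each instruction are assumptions for illustration; the branch target is elided in the slides and is not modeled, so BNE appears only as a reader of F2.

```python
def find_hazards(program):
    """Return (raw, war, waw) lists of (earlier, later, register) tuples.

    program is a list of (dest, sources); instructions are numbered from 1.
    """
    last_writer = {}  # register -> number of its most recent writer
    readers = {}      # register -> instructions that read it since that write
    raw, war, waw = [], [], []
    for j, (dest, srcs) in enumerate(program, start=1):
        for reg in srcs:
            if reg in last_writer:
                raw.append((last_writer[reg], j, reg))
            readers.setdefault(reg, []).append(j)
        if dest is not None:
            for i in readers.get(dest, []):
                if i != j:
                    war.append((i, j, dest))
            if dest in last_writer:
                waw.append((last_writer[dest], j, dest))
            last_writer[dest] = j
            readers[dest] = []
    return raw, war, waw


program = [
    ("F0", ("R2",)),         # 1. LD   F0, 10(R2)
    ("F10", ("F4", "F0")),   # 2. ADDD F10, F4, F0
    ("F2", ("F10", "F6")),   # 3. DIVD F2, F10, F6
    (None, ("F2",)),         # 4. BNE  F2, <...>
    ("F4", ("R3",)),         # 5. LD   F4, 0(R3)
    ("F0", ("F4", "F6")),    # 6. ADDD F0, F4, F6
    ("F0", ("F4", "F6")),    # 7. ADDD F0, F4, F6
]
raw, war, waw = find_hazards(program)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```

The WAR pairs (2 to 5 on F4, 2 to 6 on F0) and the WAW pairs (1 to 6 and 6 to 7 on F0) are the name dependencies that renaming through ROB entries removes; the RAW chain 1 to 2 to 3 to 4 is what actually serializes execution.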
7
Tomasulo With Reorder buffer:
Reorder Buffer (ROB1 = oldest ... ROB7 = newest): all entries empty
Reservation Stations (FP adders, FP multipliers): empty
Load buffer (from Memory): empty
RAT: F0, F2, F4, F10 unmapped
Registers: F0, F2, F4, F10
8
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
Reservation Stations: empty
Load buffer (from Memory): 1 10+R2
RAT: F0 -> ROB1
Registers: F0, F2, F4, F10
9
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
Reservation Stations (FP adders): 2 ADDD R(F4),ROB1
Load buffer (from Memory): 1 10+R2
RAT: F0 -> ROB1, F10 -> ROB2
Registers: F0, F2, F4, F10
10
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
Reservation Stations (FP adders): 2 ADDD R(F4),ROB1
Reservation Stations (FP multipliers): 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2
RAT: F0 -> ROB1, F2 -> ROB3, F10 -> ROB2
Registers: F0, F2, F4, F10
11
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | -      | LD F4,0(R3)     | N
  ROB6 | F0  | -      | ADDD F0,F4,F6   | N
Reservation Stations (FP adders): 2 ADDD R(F4),ROB1; 6 ADDD ROB5,R(F6)
Reservation Stations (FP multipliers): 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2; 5 0+R3
RAT: F0 -> ROB6, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10
12
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | -      | LD F4,0(R3)     | N
  ROB6 | F0  | -      | ADDD F0,F4,F6   | N
  ROB7 | F0  | -      | ADDD F0,F4,F6   | N
Reservation Stations (FP adders): 2 ADDD R(F4),ROB1; 6 ADDD ROB5,R(F6); 7 ADDD ROB5,R(F6)
Reservation Stations (FP multipliers): 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2; 5 0+R3
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10
13
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | -      | ADDD F0,F4,F6   | N
  ROB7 | F0  | -      | ADDD F0,F4,F6   | N
Reservation Stations (FP adders): 2 ADDD R(F4),ROB1; 6 ADDD M[10],R(F6); 7 ADDD M[10],R(F6)
Reservation Stations (FP multipliers): 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10
14
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | -      | LD F0,10(R2)    | N
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations (FP adders): 2 ADDD R(F4),ROB1
Reservation Stations (FP multipliers): 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10
15
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | Y
  ROB2 | F10 | -      | ADDD F10,F4,F0  | N
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations (FP adders): 2 ADDD R(F4),M[2]
Reservation Stations (FP multipliers): 3 DIVD ROB2,R(F6)
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0, F2, F4, F10
16
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | Y
  ROB3 | F2  | -      | DIVD F2,F10,F6  | N
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations (FP adders): empty
Reservation Stations (FP multipliers): 3 DIVD <val4>,R(F6)
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2
Registers: F0 = M[2], F2, F4, F10
17
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | C
  ROB3 | F2  | <val5> | DIVD F2,F10,F6  | Y
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5
Registers: F0 = M[2], F2, F4, F10 = <val4>
18
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | C
  ROB3 | F2  | <val5> | DIVD F2,F10,F6  | C
  ROB4 | --  | -      | BNE F2,<…>      | N
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F4 -> ROB5
Registers: F0 = M[2], F2 = <val5>, F4, F10 = <val4>
19
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | C
  ROB3 | F2  | <val5> | DIVD F2,F10,F6  | C
  ROB4 | --  | -      | BNE F2,<…>      | C
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | Y
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7, F4 -> ROB5
Registers: F0 = M[2], F2 = <val5>, F4, F10 = <val4>
20
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | C
  ROB3 | F2  | <val5> | DIVD F2,F10,F6  | C
  ROB4 | --  | -      | BNE F2,<…>      | C
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | C
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | Y
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7
Registers: F0 = M[2], F2 = <val5>, F4 = M[10], F10 = <val4>
21
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | C
  ROB3 | F2  | <val5> | DIVD F2,F10,F6  | C
  ROB4 | --  | -      | BNE F2,<…>      | C
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | C
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | C
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | Y
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0 -> ROB7
Registers: F0 = <val2>, F2 = <val5>, F4 = M[10], F10 = <val4>
22
Tomasulo With Reorder buffer:
Reorder Buffer (Entry | Dest | Value | Instruction | Done?), oldest first:
  ROB1 | F0  | M[2]   | LD F0,10(R2)    | C
  ROB2 | F10 | <val4> | ADDD F10,F4,F0  | C
  ROB3 | F2  | <val5> | DIVD F2,F10,F6  | C
  ROB4 | --  | -      | BNE F2,<…>      | C
  ROB5 | F4  | M[10]  | LD F4,0(R3)     | C
  ROB6 | F0  | <val2> | ADDD F0,F4,F6   | C
  ROB7 | F0  | <val3> | ADDD F0,F4,F6   | C
Reservation Stations: empty
Load buffer (from Memory): empty
RAT: F0, F2, F4, F10 unmapped
Registers: F0 = <val3>, F2 = <val5>, F4 = M[10], F10 = <val4>
23
Remarks
• What about timing?
  – What happens on what cycle? The figures show no cycle counts
    » How many fetches/commits in a cycle?
    » How many execution units?
      • Homework assignment
• Preserving the precise interrupt model
  – When an interrupt occurs, we can flush everything
    » Instructions that were not committed have no effect
      • Commit happens in order
  – Exceptions are taken on commit
• What happens if the ROB is full?
  – Fetch is stopped until some instruction commits
    » A committed instruction frees its ROB entry
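The "exceptions are taken on commit" rule can be sketched as follows. This is a minimal Python model under assumptions of our own: the `retire` function, the dictionary layout of a ROB entry, and the choice to discard the faulting instruction together with all younger ones are illustrative, not taken from the slides.

```python
def retire(rob, regs):
    """Commit finished entries from the head of the ROB, in order.

    rob is a list of entry dicts, oldest first; regs is the architectural
    register file. Returns the tag of the faulting instruction when an
    exception reaches the head, otherwise None.
    """
    while rob:
        head = rob[0]
        if not head["done"]:
            return None          # head still executing: commit waits
        if head["exception"]:
            rob.clear()          # flush the faulting and all younger instructions
            return head["tag"]
        regs[head["dest"]] = head["value"]
        rob.pop(0)               # committed instruction frees its ROB entry
    return None


regs = {}
rob = [
    {"tag": 1, "dest": "F0", "value": 1.0, "done": True, "exception": False},
    {"tag": 2, "dest": "F2", "value": None, "done": True, "exception": True},
    {"tag": 3, "dest": "F4", "value": 9.0, "done": True, "exception": False},
]
faulting = retire(rob, regs)
print(faulting, regs, rob)
```

Instruction 3 had already produced its result, yet F4 is never written: because the register file changes only at commit, the state seen by the handler reflects exactly the instructions older than the faulting one.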
24
Memory Hazards
• When is memory updated?
  – On commit (or later)
    » Relevant for store instructions only
• WAR/WAW hazards?
  – Handled by the ROB
• RAW hazards?
  – Must ensure that no in-flight store is targeting the same address
  – What about memory disambiguation?
    » Simple answer: before starting the load we must know the addresses of all other in-flight stores
    » In real life we speculate on this

ST 0(R2), F1
LD F2, 0(R2)

ST 0(R2), F1
LD F2, 0(R4)   // What if R4 = R2?
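The "simple answer" above amounts to a conservative check before a load starts. A minimal Python sketch, assuming a hypothetical `load_may_issue` helper and using `None` to mark a store whose address has not been computed yet:

```python
def load_may_issue(load_addr, store_addrs):
    """Conservative memory disambiguation (no speculation).

    store_addrs holds the addresses of the in-flight stores ahead of the
    load; None means the address is still unknown.
    """
    if any(addr is None for addr in store_addrs):
        return False                      # unknown store address: wait
    return load_addr not in store_addrs   # wait on a matching store (RAW through memory)


R2, R4 = 0x100, 0x100                     # the "what if R4 = R2?" case
print(load_may_issue(0 + R4, [0 + R2]))   # store and load alias
print(load_may_issue(0 + R4, [None]))     # store address not yet known
print(load_may_issue(0x200, [0 + R2]))    # different addresses
```

A speculative design would instead let the load go ahead in the second case and recover if the store address later turns out to match.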