View
37
Download
0
Category
Preview:
DESCRIPTION
Computer Structure The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor. Lihu Rappoport and Adi Yoaz. The P6 Family. Features Out Of Order execution Register renaming Speculative Execution Multiple Branch prediction Super pipeline: 12 pipe stages. *off die. External - PowerPoint PPT Presentation
Citation preview
Computer Structure 2013 – P6 uArch 1
Computer Structure
The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor
Lihu Rappoport and Adi Yoaz
Computer Structure 2013 – P6 uArch 2
The P6 Family· Features
– Out Of Order execution– Register renaming– Speculative Execution– Multiple Branch prediction– Super pipeline: 12 pipe stages
Processor Year Freq (MHz) Bus (MHz) L2 cache Process (μ)
Pentium® Pro 1995 150~200 60/66 256/512K* 0.5, 0.35
Pentium® II 1997 233~450 66/100 512K* 0.35, 0.25
Pentium® III 1999 450~1400 100/133 256/512K 0.25, 0.18, 0.13
Pentium® M 2003 900~2260 400/533 1M / 2M 0.13, 90nm
CoreTM 2005 1660~2330 667 2M 65nm
CoreTM 2 2006 1800~2930 800/1066 2/4/8M 65nm
*off die
Computer Structure 2013 – P6 uArch 3
· In-Order Front End– BIU: Bus Interface Unit– IFU: Instruction Fetch Unit (includes IC)– BPU: Branch Prediction Unit– ID: Instruction Decoder– MS: Micro-Instruction Sequencer– RAT: Register Alias Table
· Out-of-order Core– ROB: Reorder Buffer– RRF: Real Register File– RS: Reservation Stations– IEU: Integer Execution Unit– FEU: Floating-point Execution Unit – AGU: Address Generation Unit– MIU: Memory Interface Unit– DCU: Data Cache Unit– MOB: Memory Order Buffer– L2: Level 2 cache
· In-Order Retire
P6 Arch
MS
AGU
MOB
External Bus
IEU
MIU
FEU
BPU
BIU
IFU
ID
RAT
RS
L2
DCU
ROB
Computer Structure 2013 – P6 uArch 4
O1 O3
R1 R2
Ex
I1 I2 I3 I4 I5 I6 I7 I8
NextIP
RegRen
RSWr
Icache Decode
RS disp
Retirement
In-Order Front End
Out-of-order Core
In-order Retirement
1: Next IP2: ICache lookup3: ILD (instruction length decode)4: rotate5: ID16: ID27: RAT- rename sources, ALLOC-assign destinations8: ROB-read sources RS-schedule data-ready uops for dispatch9: RS-dispatch uops10:EX11:Retirement
P6 Pipeline
Computer Structure 2013 – P6 uArch 5
In-Order Front End
· BPU – Branch Prediction Unit – predict next fetch address
· IFU – Instruction Fetch Unit– iTLB translates virtual to physical address (access PMH on miss)– IC supplies 16byte/cyc (access L2 cache on miss)
· ILD – Induction Length Decode – split bytes to instructions
· IQ – Instruction Queue – buffer the instructions
· ID – Instruction Decode – decode instructions into uops
· MS – Micro-Sequencer – provides uops for complex instructions
Next IPMux
BPU
ID
MS
ILD IQ IDQIFU
Bytes Instructions uops
Computer Structure 2013 – P6 uArch 6
Branch Prediction
· Implementation– Use local history to predict direction– Need to predict multiple branches Þ Need to predict branches before previous branches are resolvedÞ Branch history updated first based on prediction, later based on
actual execution (speculative history)– Target address taken from BTB
· Prediction rate: ~92%– ~60 instructions between mispredictions– High prediction rate is very crucial for long pipelines– Especially important for OOOE, speculative execution:
On misprediction all instructions following the branch in the instruction window are flushed
Effective size of the window is determined by prediction accuracy
· RSB used for Call/Return pairs
Computer Structure 2013 – P6 uArch 7
Branch Prediction – Clustering· Need to provide predictions for the entire fetch line each cycle
– Predict the first taken branch in the line, following the fetch IP
· Implemented by – Splitting IP into offset within line, set, and tag– If the tag of more than one way matches the fetch IP
The offsets of the matching ways are ordered Ways with offset smaller than the fetch IP offset are discarded The first branch that is predicted taken is chosen as the predicted branch
Jump into the fetch line
Jump out of the line
jmp jmpjmpjmp
Predictnot taken
Predicttaken
Predicttaken
Predicttaken
Computer Structure 2013 – P6 uArch 8
The P6 BTB
· 2-level, local histories, per-set counters· 4-way set associative: 512 entries in 128 sets
IP Tag Hist
1001
Pred=msb of counter
9
0
15
Way 0
Target
Way
2W
ay 3
9 432
counters
128sets
PTV
21 1 32
LRR
2
Per-Set
Branch Type00- cond01- ret10- call11- uncond
ReturnStackBuffer
Way
1
Prediction bit
4
ofst
· Up to 4 branches can have a tag match
Computer Structure 2013 – P6 uArch 9
Determine where each IA instruction starts
In-Order Front End: Decoder
Buffer Instructions
Convert instructions Into uops• If inst aligned with dec1/2/3
decodes into >1 uops, defer it to next cycle
IQ
Instruction Length Decode
16 Instruction bytes from IFU
1 uop
D1
IDQ
D2
1 uop
D0
4 uops
Buffers uops• Smooth decoder’s variable
throughput
D3
1 uop
Computer Structure 2013 – P6 uArch 10
Micro Operations (Uops)
· Each CISC inst is broken into one or more RISC uops– Simplicity:
Each uop is (relatively) simple Canonical representation of src/dest (2 src, 1 dest)
– Increased ILP e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0]
· Simple instructions translate to a few uops– Typical uop count (it is not necessarily cycle count!)
Reg-Reg ALU/Mov inst: 1 uopMem-Reg Mov (load) 1 uopMem-Reg ALU (load + op) 2 uopsReg-Mem Mov (store) 2 uops (st addr, st data)Reg-Mem ALU (ld + op + st) 4 uops
· Complex instructions need ucode
Computer Structure 2013 – P6 uArch 11
Out-of-order Core
MIS
AGU
MOB
External Bus
IEU
MIU
FEU
BTB
BIU
IFU
ID
RAT
RS
L2
DCU
ROB
· Reorder Buffer (ROB):– Holds “not yet retired” instructions– 40 entries, in-order
· Reservation Stations (RS): – Holds “not yet executed” instructions– 20 entries
· Execution Units– IEU: Integer Execution Unit– FEU: Floating-point Execution Unit
· Memory related units– AGU: Address Generation Unit MIU:
Memory Interface Unit– DCU: Data Cache Unit– MOB: Orders Memory operations– L2: Level 2 cache
Computer Structure 2013 – P6 uArch 12
Alloc & Rat· Perform register allocation and renaming for ≤4 uops/cyc· The Register Alias Table (RAT)
– Maps architectural registers into physical registers For each arch reg, holds the number of latest phy reg that updates it
– When a new uop that writes to a arch reg R is allocated Record phy reg allocated to the uop as the latest reg that updates R
· The Allocator (Alloc) – Assigns each uop an entry number in the ROB / RS– For each one of the sources (architectural registers) of the uop
Lookup the RAT to find out the latest phy reg updating it Write it up in the RS entry
– Allocate Load & Store buffers in the MOB
Arch reg #reg Location
EAX 0 RRF
EBX 19 ROB
ECX 23 ROB
Computer Structure 2013 – P6 uArch 13
Re-order Buffer (ROB)· Hold 40 uops which are not yet committed
– At the same order as in the program
· Provide a large physical register space for register renaming– One physical register per each ROB entry
physical register number = entry number Each uop has only one destination Buffer the execution results until retirement
– Valid data is set after uop executed and result written to physical reg
#entry EntryValid
DataValid
Physical Reg Data
Architectural dest. reg
0 1 1 12H EBX
1 1 1 33H ECX
2 1 0 xxx ESI
39 0 0 xxx XXX
Computer Structure 2013 – P6 uArch 14
RRF – Real Register File
· Holds the Architectural Register File– Architectural Register are numbered: 0 – EAX, 1 – EBX, …
· The value of an architectural register – is the value written to it by the last instruction committed
which writes to this register
RRF: #entry Arch Reg Data
0 (EAX) 9AH
1 (EBX) F34H
Computer Structure 2013 – P6 uArch 15
Uop flow through the ROB· Uops are entered in order
– Registers renamed by the entry number
· Once assigned: execution order unimportant· After execution:
– Entries marked “executed” and wait for retirement– Executed entry can be “retired” once all prior instruction have retired– Commit architectural state only after speculation (branch, exception)
has resolved
· Retirement– Detect exceptions and mispredictions
Initiate repair to get machine back on right track
– Update “real registers” with value of renamed registers– Update memory– Leave the ROB
Computer Structure 2013 – P6 uArch 16
Reservation station (RS)· Pool of all “not yet executed” uops
– Holds the uop attributes and the uop source data until it is dispatched
· When a uop is allocated in RS, operand values are updated– If operand is from an architectural register, value is taken from the
RRF– If operand is from a phy reg, with data valid set, value taken from ROB– If operand is from a phy reg, with data valid not set, wait for value
· The RS maintains operands status “ready/not-ready”– Each cycle, executed uops make more operands “ready”
The RS arbitrate the WB busses between the units The RS monitors the WB bus to capture data needed by awaiting uops Data can be bypassed directly from WB bus to execution unit
– Uops whose all operands are ready can be dispatched for execution Dispatcher chooses which of the ready uops to execute next Dispatches chosen uops to functional units
Computer Structure 2013 – P6 uArch 17
Register Renaming example
RS
RAT / Alloc
IDQ
Add EAX, EBX, EAX
#reg
EAX 0 RRF
EBX 19 ROB
ECX 23 ROB
Add ROB37, ROB19, RRF0
ROB
# Data
Valid Data DST
19 V V 12H EBX
23 V V 33H ECX
37 I x xxx XXX
38 I x xxx XXX
v src1 v src2 Pdst
add 1 97H 1 12H 37
RRF: 0 EAX 97H
# Data Valid Data DST
19 V V 12H EBX
23 V V 33H ECX
37 V I xxx EAX
38 I x xxx XXX
#reg
EAX 37 ROB
EBX 19 ROB
ECX 23 ROB
Computer Structure 2013 – P6 uArch 18
Register Renaming example (2)
RS
RAT / Alloc
IDQ
sub EAX, ECX, EAX
#reg
EAX 37 ROB
EBX 19 ROB
ECX 23 ROB
sub ROB38, ROB23, ROB37
ROB
# Data
Valid Data DST
19 V V 12H EBX
23 V V 33H ECX
37 I x xxx XXX
38 I x xxx XXX
v src1 v src2 Pdst
add 1 97H 1 12H 37
sub 0 rob37 1 33H 38
RRF: 0 EAX 97H
# Data Valid Data DST
19 V V 12H EBX
23 V V 33H ECX
37 V I xxx EAX
38 V I xxx EAX
#reg
EAX 38 ROB
EBX 19 ROB
ECX 23 ROB
Computer Structure 2013 – P6 uArch 19
MIU
Port 0
Port 1
Port 2
Port 3,4
SHFFMU
FDIVIDIV
FAUIEU
JEUIEU
AGU
AGU
Load Address
Store Address
Out-of-order Core: Execution Unitsinternal 0-dealy bypass within each EU
2nd bypass in RS
RS
1st bypass in MIU
DCUSDB
Computer Structure 2013 – P6 uArch 20
In-Order Retire
MIS
AGU
MOB
External Bus
IEU
MIU
FEU
BTB
BIU
IFU
ID
RAT
RS
L2
DCU
ROB
· ROB:– Retires up to 4 uops per clock– Copies the values to the RRF– Retirement is done In-order– Performs exception checking
Computer Structure 2013 – P6 uArch 21
In-order Retirement
· The process of committing the results to the architectural state of the processor
· Retire up to 4 uops per clock· Copy the values to the RRF· Retirement is done In Order· Perform exception checking· An instruction is retired after the following checks
– Instruction has executed– All previous instructions have retired– Instruction isn’t mis-predicted– no exceptions
Computer Structure 2013 – P6 uArch 22
Pipeline: Fetch
· Fetch 16B from I$· Length-decode instructions within 16B· Write instructions into IQ
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule EX Retire
Computer Structure 2013 – P6 uArch 23
Pipeline: Decode
· Read 4 instructions from IQ· Translate instructions into uops
– Asymmetric decoders (4-1-1-1)
· Write resulting uops into IDQ
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule EX Retire
Computer Structure 2013 – P6 uArch 24
Pipeline: Allocate
· Allocate, port bind and rename 4 uops· Allocate ROB/RS entry per uop
– If source data is available from ROB or RRF, write data to RS– Otherwise, mark data not ready in RS
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule EX Retire
Computer Structure 2013 – P6 uArch 25
Pipeline: EXE
· Ready/Schedule– Check for data-ready uops if needed functional unit available – Select and dispatch ≤6 ready uops/clock to EXE– Reclaim RS entries
· Write back results into RS/ROB– Write results into result buffer– Snoop write-back ports for results that are sources to uops in RS– Update data-ready status of these uops in the RS
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule EX Retire
Computer Structure 2013 – P6 uArch 26
Pipeline: Retire
· Retire ≤4 oldest uops in ROB– Uop may retire if
its ready bit is set it does not cause an exception all preceding candidates are eligible for retirement
– Commit results from result buffer to RRF– Reclaim ROB entry
· In case of exception– Nuke and restart
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule EX Retire
Computer Structure 2013 – P6 uArch 27
Jump Misprediction – Flush at Retire
· Flush the pipe when the mispredicted jump retires– retirement is in-order
Þ all the instructions preceding the jump have already retired
Þ all the instructions remaining in the pipe follow the jump
Þ flush all the instructions in the pipe
· Disadvantage– Misprediction known after the jump was executed
– Continue fetching instructions from wrong path until the branch retires
Computer Structure 2013 – P6 uArch 28
Jump Misprediction – Flush at Execute· When the JEU detects jump misprediction it
– Flush the in-order front-end– Instructions already in the OOO part continue to execute
Including instructions following the wrong jump, which take execution resource, and waste power, but will never be committed
– Start fetching and decoding from the “correct” path The “correct” path still be wrong
A preceding uop that hasn’t executed may cause an exception A preceding jump executed OOO can also mispredict
– The “correct” instruction stream is stalled at the RAT The RAT was wrongly updated also by wrong path instruction
· When the mispredicted branch retires– Resets all state in the Out-of-Order Engine (RAT, RS, RB, MOB, etc.)
Only instruction following the jump are left – they must all be flushed Reset the RAT to point only to architectural registers
– Un-stalls the in-order machine– RS gets uops from RAT and starts scheduling and dispatching them
Computer Structure 2013 – P6 uArch 29
Fetch DecodeIQ IDQ
AllocROBRS
RetireSchedule JEU
Pipeline: Branch gets to EXE
Computer Structure 2013 – P6 uArch 30
· Flush front-end and re-steer it to correct path· RAT state already updated by wrong path
– Block further allocation· Update BPU· OOO not flushed: Instructions already in the OOO continue to execute
– Including instructions following the wrong jump, which take execution resource, and waste power, but will never be committed
· Block younger branches from clearing
Fetch DecodeIQ IDQ
AllocROBRS
RetireSchedule JEU
Flush
Pipeline: Mispredicted Branch EXE
Computer Structure 2013 – P6 uArch 31
· When mispredicted branch retires– Flush OOO
Only instruction following the jump are left – they must all be flushed Resets all state in the OOO (RAT, RS, RB, MOB, etc.) Reset the RAT to point only to architectural registers
– Allow allocation of uops from correct path
Fetch DecodeIQ IDQ
AllocROBRS
RetireSchedule JEU
Pipeline: Mispredicted Branch Retires
Clear
Computer Structure 2013 – P6 uArch 32
Instant Reclamation
· Allow a faster recovery after jump misprediction– Allow execution/allocation of uops from correct path before
mispredicted jump retires
· Every few cycles take a checkpoint of the RAT
· In case of misprediction– Flush the frontend and re-steer it to the correct path– Recover RAT to latest checkpoint taken prior to misprediction– Recover RAT to exact state at misprediction
Rename 4 uops/cycle from checkpoint and until branch– Flush all uops younger than the branch in the OOO
Computer Structure 2013 – P6 uArch 33
Instant Reclamation Mispredicted Branch EXE
· JEClear raised on mispredicted macro-branches
Clear
AllocROBRS
Schedule
JEU RetirePredict/Fetch Decode
IQ IDQ
Computer Structure 2013 – P6 uArch 34
Instant Reclamation Mispredicted Branch EXE
· JEClear raised on mispredicted macro-branches– Flush frontend and re-steer it to the correct path– Flush all younger uops in OOO– Update BPU– Block further allocation
Clear
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule Retire
BPU UpdateJE
U
Computer Structure 2013 – P6 uArch 35
Pipeline: Instant Reclamation: EXE
· Restore RAT from latest check-point before branch· Recover RAT to its states just after the branch
– Before any instruction on the wrong path· Meanwhile front-end starts fetching and decoding instructions
from the correct path
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule Retire
JEU
Computer Structure 2013 – P6 uArch 36
Pipeline: Instant Reclamation: EXE
· Once done restoring the RAT – allow allocation of uops from correct path
Predict/Fetch DecodeIQ IDQ
AllocROBRS
Schedule Retire
JEU
Computer Structure 2013 – P6 uArch 37
Large ROB and RS are Important
· Large RS– Increases the window in which looking for impendent instructions
Exposes more parallelism potential
· Large ROB– The ROB is a superset of the RS ROB size ≥ RS size– Allows for of covering long latency operations (cache miss, divide)
· Example– Assume there is a Load that misses the L1 cache
Data takes ~10 cycles to return ~30 new instrs get into pipeline
– Instructions following the Load cannot commit Pile up in the ROB– Instructions independent of the load are executed, and leave the RS
As long as the ROB is not full, we can keep executing instructions
– A 40 entry ROB can cover for an L1 cache miss Cannot cover for an L2 cache miss, which is hundreds of cycles
Computer Structure 2013 – P6 uArch 38
OOO Execution of Memory Operations
Computer Structure 2013 – P6 uArch 39
P6 Caches· Blocking caches severely hurt OOO
– A cache miss prevents from other cache requests (which could possibly be hits) to be served
– Hurts one of the main gains from OOO – hiding caches misses
· Both L1 and L2 cache in the P6 are non-blocking– Initiate the actions necessary to return data to cache miss while
they respond to subsequent cached data requests– Support up to 4 outstanding misses
Misses translate into outstanding requests on the P6 bus The bus can support up to 8 outstanding requests
· Squash subsequent requests for the same missed cache line– Squashed requests not counted in number of outstanding requests– Once the engine has executed beyond the 4 outstanding requests
subsequent load requests are placed in the load buffer
Computer Structure 2013 – P6 uArch 40
OOO Execution of Memory Operations· The RS operates based on register dependencies
– RS cannot detect memory dependencies movl -4(%ebp), %ebx # MEM[ebp-4] ← ebxmovl %eax, -4(%ebp) # eax ← MEM[ebp-4]
– RS dispatches memory uops when data for address calculation is ready,
and the MOB and Address Generation Unit (AGU) are free– AGU computes the linear address
Segment-Base + Base-Address + (Scale*Index) + Displacement Sends linear address to MOB, to be stored in Load Buffer or Store Buffer
· MOB resolves memory dependencies and enforces memory ordering– Some memory dependencies can be resolved statically
store r1,a load r2,b
– Problem: some cannotstore r1,[r3]; load r2,b
can advance load before store
load must wait till r3 is known
Computer Structure 2013 – P6 uArch 41
Load and Store Ordering· x86 has small register set uses memory often
– Preventing Stores from passing Stores/Loads: 3%~5% perf. loss P6 chooses not allow Stores to pass Stores/Loads
– Preventing Loads from passing Loads/Stores: big perf. loss P6 allows Loads to pass Stores, and Loads to pass Loads
· Stores are not executed OOO– Stores are never performed speculatively
there is no transparent way to undo them
– Stores are also never re-ordered among themselves The Store Buffer dispatches a store only when
the store has both its address and its data, and there are no older stores awaiting dispatch
– Store commits its write to memory (DCU) at retirement
Computer Structure 2013 – P6 uArch 42
Store Implemented as 2 Uops
· Store decoded as two independent uops– STA (store-address): calculates the address of the store– STD (store-data): stores the data into the Store Data buffer
The actual write to memory is done when the store retires
· Separating STA & STD is important for memory OOO– Allows STA to dispatch earlier, even before the data is known– Address conflicts resolved earlier
opens memory pipeline for other loads
· STA and STD can be issued to execution units in parallel– STA dispatched to AGU when its sources (base+index) are ready– STD dispatched to SDB when its source operand is available
Computer Structure 2013 – P6 uArch 43
Memory Order Buffer (MOB)· Store Coloring
– Each Store allocated in-order in Store Buffer, and gets a SBID– Each load allocated in-order in Load Buffer,
and gets LBID + current SBID
· Load is checked against all previous stores– Stored with SBID ≤ store’s SBID
· Load blocked if– Unresolved address of a relevant STAs– STA to same address, but data not ready– Missing resources (DTLB miss, DCU miss)
· MOB writes blocking info into load buffer– Re-dispatches load when wake-up signal received
· If Load is not blocked executed (bypassed)
LBID SBID
Store
- 0
Store
- 1
Load 0 1
Store
- 2
Load 1 2
Load 2 2
Load 3 2
Store
- 3
Load 4 3
Computer Structure 2013 – P6 uArch 44
MOB (Cont.)
· If a Load misses in the DCU– The DCU marks the write-back data as invalid – Assigns a fill buffer to the load, and issues an L2 request– When critical chunk is returned, wakeup and re-dispatch the load
· Store → Load Forwarding– Older STA with same address as load and data ready
Load gets its data directly from the SB (no DCU access)
· Memory Disambiguation– MOB predicts if a load can proceed despite unknown STAs
Predict colliding block Load if there is unknown STA (as usual) Predict non colliding execute even if there are unknown STAs
– In case of wrong prediction The entire pipeline is flushed when the load retires
Computer Structure 2013 – P6 uArch 45
Pipeline: Load: Allocate
· Allocate ROB/RS, MOB entries· Assign Store ID (SBID) to enable ordering
IDQ
Alloc
ROBRS
RetireSchedule
LB
AGULB
Write
DTLB DCU WBMOB
Computer Structure 2013 – P6 uArch 46
Pipeline: Bypassed Load: EXE
· RS checks when data used for address calculation is ready· AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp. · Write load into Load Buffer· DTLB Virtual → Physical + DCU set access· MOB checks blocking and forwarding· DCU read / Store Data Buffer read (Store → Load forwarding)· Write back data / write block code
IDQ
Alloc
ROBRS
RetireSchedule
LB
AGULB
Write
DTLB DCU WBMOB
Computer Structure 2013 – P6 uArch 47
Pipeline: Blocked Load Re-dispatch
· MOB determines which loads are ready, and schedules one· Load arbitrates for MEU · DTLB Virtual → Physical + DCU set access· MOB checks blocking/forwarding· DCU way select / Store Data Buffer read· write back data / write block code
IDQ
Alloc
ROBRS
RetireSchedule
LB
AGULB
Write
DTLB DCU WBMOB
Computer Structure 2013 – P6 uArch 48
Pipeline: Load: Retire
· Reclaim ROB, LB entries· Commit results to RRF
IDQ
Alloc
ROBRS
RetireSchedule
LB
AGULB
Write
DTLB DCU WBMOB
Computer Structure 2013 – P6 uArch 49
Pipeline: Store: Allocate
· Allocate ROB/RS· Allocate Store Buffer entry
IDQ RS
Alloc Schedule AGU SB
DTLB
ROB
SB
Retire
Computer Structure 2013 – P6 uArch 50
Pipeline: Store: STA EXE
· RS checks when data used for address calculation is ready– dispatches STA to AGU
· AGU calculates linear address· Write linear address to Store Buffer· DTLB Virtual → Physical · Load Buffer Memory Disambiguation verification· Write physical address to Store Buffer
IDQ RS
ScheduleAlloc AGUSBV.A.
ROB
DTLBSBP.A.
SB
Retire
Computer Structure 2013 – P6 uArch 51
Pipeline: Store: STD EXE
· RS checks when data for STD is ready– dispatches STD
· Write data to Store Buffer
IDQ RS
ScheduleAllocSB
dataROB
SB
Retire
Computer Structure 2013 – P6 uArch 52
Pipeline: Senior Store Retirement
· When STA (and thus STD) retires– Store Buffer entry marked as senior
· When DCU idle MOB dispatches senior store· Read senior entry
– Store Buffer sends data and physical address· DCU writes data· Reclaim SB entry
SB
IDQ RS
ScheduleAlloc
ROB
Retire
SB DCUMOB
Recommended