Upload
chaz
View
47
Download
7
Embed Size (px)
DESCRIPTION
Δ ιασωλήνωση - Pipelining. Pedro Trancoso Appendix B (και διαφάνειες από Prof. David Culler, Berkeley). Βιομηχανία Αυτοκινήτων. Βιομηχανία Αυτοκινήτων. Βιομηχανία Αυτοκινήτων. Βιομηχανία Αυτοκινήτων. Βιομηχανία Αυτοκινήτων. 1. Βιομηχανία Αυτοκινήτων. 2. 1. Βιομηχανία Αυτοκινήτων. 3. - PowerPoint PPT Presentation
Citation preview
Διασωλήνωση - Pipelining
Pedro TrancosoAppendix B
(και διαφάνειες από Prof. David Culler, Berkeley)
Βιομηχανία Αυτοκινήτων
Βιομηχανία Αυτοκινήτων
Βιομηχανία Αυτοκινήτων
Βιομηχανία Αυτοκινήτων
Βιομηχανία Αυτοκινήτων
1
Βιομηχανία Αυτοκινήτων
2 1
Βιομηχανία Αυτοκινήτων
3 2 1
Βιομηχανία Αυτοκινήτων
4 3 2 1
Βιομηχανία Αυτοκινήτων
5 4 3 2
Ας πλύνουμε τα ρούχα! Σειριακή Μεθόδος
Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?
A
B
C
D
30 40 2030 40 2030 40 2030 40 20
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
Χρησιμοποιώντας Διασωλήνωση...
Pipelined laundry takes 3.5 hours for 4 loads
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20
Μάθαμε ότι... Pipelining doesn’t help latency (χρόνος αναμονής) of single task, it helps throughput (ρυθμοαπόδοση) of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
A
B
C
D
6 PM 7 8 9
Task
Order
Time
30 40 40 40 40 20
Διασωλήνωση Εντολών Execute billions of instructions, so
throughput is what matters What is desirable in the ISA (ΑΣΕ) for
pipelining? Variable length instructions vs.
all instructions same length? Memory operands part of any operation vs.
memory operands only in loads or stores? Register operand (τελεστέος) many places in
instruction format vs. registers located in same place?
Κύκλος Εκτέλεσης Instruction
Fetch
Instruction
Decode
Operand
Fetch
Execute
Result
Store
Next
Instruction
Obtain instruction from program storage
Determine required actions and instruction size
Locate and obtain operand data
Compute result value or status
Deposit results in storage for later use
Determine successor instruction
Στάδια Διασωλήνωσης του DLX (1)
1. Instruction Fetch (IF)2. Instruction Decode / Register Fetch
(ID)3. Execution / Effective Address (EX)4. Memory Access / Branch Completion
(MEM)5. Write-back (WB)(go to 1!)
Στάδια Διασωλήνωσης του DLX (2)
1. Instruction Fetch (IF)IR Mem[PC]NPC PC+4
2. Instruction Decode / Register Fetch (ID)A Regs[IR6..10]B Regs[IR11..15]Imm ((IR16)^16##IR16..31)
3. Execution / Effective Address (EX)Mem ref: ALU output A+ImmReg-reg (ALUop): ALUoutput A op BReg-imm (ALU op): ALU output A op ImmBranch: ALU output NPC+Imm
cond (A op 0)
Στάδια Διασωλήνωσης του DLX (3)
4. Memory Access / Branch Completion (MEM)
Mem access: LMD Mem[ALU output] Mem[ALU output] B
Branch: if (cond) PC ALU output else PC NPC
5. Write-back (WB)Reg-reg ALU instr: Regs[IR16..20] ALU outputReg-imm ALU instr: Regs[IR11..15] ALU outputLoad instr: Regs[IR11..15] LMD
(go to 1!)
Διασωλήνωση του DLX
Cycle number 1
Instruction I IF
Instruction I+1
Instruction I+2
Instruction I+3
Instruction I+4
DLX Pipeline
Cycle number 1 2
Instruction I IF ID
Instruction I+1 IF
Instruction I+2
Instruction I+3
Instruction I+4
DLX Pipeline
Cycle number 1 2 3
Instruction I IF ID EX
Instruction I+1 IF ID
Instruction I+2 IF
Instruction I+3
Instruction I+4
DLX Pipeline
Cycle number 1 2 3 4
Instruction I IF ID EX MEM
Instruction I+1 IF ID EX
Instruction I+2 IF ID
Instruction I+3 IF
Instruction I+4
DLX Pipeline
Cycle number 1 2 3 4 5
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM
Instruction I+2 IF ID EX
Instruction I+3 IF ID
Instruction I+4 IF
DLX Pipeline
Cycle number 1 2 3 4 5 6
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM WB
Instruction I+2 IF ID EX MEM
Instruction I+3 IF ID EX
Instruction I+4 IF ID
DLX Pipeline
Cycle number 1 2 3 4 5 6 7 8 9
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM WB
Instruction I+2 IF ID EX MEM WB
Instruction I+3 IF ID EX MEM WB
Instruction I+4 IF ID EX MEM WB
Παράδειγμα: MIPS
Op
31 26 01516202125
Rs1 Rd immediate
Op
31 26 025
Op
31 26 01516202125
Rs1 Rs2
target
Rd Opx
Register-Register
561011
Register-Immediate
Op
31 26 01516202125
Rs1 Rs2/Opx immediate
Branch
Jump / Call
5 Βήματα της Διόδου Δεδομένων (Datapath) του MIPS
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MU
X
Mem
ory
Reg File
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
4
Ad
der Zero?
Next SEQ PC
Addre
ss
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
XM
UX
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/E
X
MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC Next SEQ PC
RD RD RD WB
Data
Next PC
Addre
ss
RS1
RS2
Imm
MU
X
Datapath
Control Path
5 Βήματα της Διόδου Δεδομένων (Datapath) του MIPS
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
XM
UX
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/E
X
MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC Next SEQ PC
RD RD RD WB
Data
Next PC
Addre
ss
RS1
RS2
Imm
MU
X
Datapath
Control Path
Inst
1
Inst
1
Inst
2
Inst
1
Inst
2
Inst
3
5 Βήματα της Διόδου Δεδομένων (Datapath) του MIPS
Εκτέλεση με Διασωλήνωση
Instr.
Order
Time (clock cycles)
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
Όρια της Διασωλήνωσης
Hazards (Κίνδυνοι): circumstances that would cause incorrect execution if next instruction were launched Structural hazards (κίνδυνος δομής):
Attempting to use the same hardware to do two different things at the same time
Data hazards (κίνδυνος δεδομένων): Instruction depends on result of prior instruction still in the pipeline
Control hazards (κίνδυνος ελέγχου): Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Παράδειγμα: μια πόρτα μνήμης
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
DMem
Structural Hazard
Λύσεις για κίνδυνους δομής Defn: attempt to use same
hardware for two different things at the same time
Solution 1: Wait must detect the hazard must have mechanism to stall
Solution 2: Throw more hardware at the problem
Εύρεση και λύση ενός κινδύνου δομής
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
Reg
ALU
DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
Λύση των Κίνδυνων Δομών στο σχεδιασμό
ALU
Instr
Cach
e
Reg File
MU
XM
UX
Data
Cach
e
MU
X
SignExtend
Zero?
IF/ID
ID/E
X
MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC Next SEQ PC
RD RD RD WB
Data
Next PC
Addre
ss
RS1
RS2
Imm
MU
X
Datapath
Control Path
Σημασία του ΑΣΕ στην λύση των Κινδύνων Δομής Simple to determine the sequence
of resources used by an instruction opcode tells it all
Uniformity in the resource usage Compare MIPS to IA32? MIPS approach => all instructions
flow through same 5-stage pipeling
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Κίνδυνοι ΔεδομένωνTime (clock cycles)
IF ID/RF EX MEM WB
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Read After Write (RAW) (Διάβασμα μετά από γράψιμο) InstrJ tries to read operand before InstrI writes it
Caused by a “Data Dependence” (εξάρτηση δεδομένων). This hazard results from an actual need for communication.
Τρεις Είδους Κίνδυνοι Δεδομένων
I: add r1,r2,r3J: sub r4,r1,r3
Write After Read (WAR) (Γράψιμο μετά από διάβασμα) InstrJ writes operand before InstrI reads it
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
Can it happen in the MIPS 5 stage pipeline? Can’t happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
Τρεις Είδους Κίνδυνοι Δεδομένων
Write After Write (WAW) (Γράψιμο μετά από γράψιμο) InstrJ writes operand before InstrI writes it.
Called an “output dependence” (εξάρτηση εξόδου) by compiler writers. This also results from the reuse of name “r1”.
Can it happen in the MIPS 5 stage pipeline? Can’t happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and Writes are always in stage 5
Will see WAR and WAW in later more complicated pipes
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
Τρεις Είδους Κίνδυνοι Δεδομένων
Time (clock cycles)
Μεταβίβαση (Forwarding) για αποφυγή Κινδύνου Δεδομένων
Inst
r.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Αλλαγές στο υλικό για Forwarding
MEM
/WR
ID/E
X
EX
/MEM
DataMemory
ALU
mux
mux
Registe
rs
NextPC
Immediate
mux
Time (clock cycles)
Instr.
Order
lw r1,0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Κίνδυνοι Δεδομένων και με Forwarding
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Λύσεις…
Adding hardware? ... not Detection? Compilation techniques?
What is the cost of load delays?
Time (clock cycles)
or r8,r1,r9
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg
ALU
DMemIfetch Reg
RegIfetch
ALU
DMem RegBubble
Ifetch
ALU
DMem RegBubble Reg
Ifetch
ALU
DMemBubble Reg
How is this different from the instruction issue stall?
Λύσεις του κινδύνου της εντολής φόρτωσης
Χρονοδρομολόγηση Λογισμικού για αποφυγή κίνδυνου δεδομένωνSoftware Scheduling to Avoid Load Hazards
Fast code:LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
STALL
STALL
Σχέση με τη ΕΣΑ What is exposed about this organizational
hazard in the instruction set? k cycle delay?
bad, CPI is not part of ISA k instruction slot delay
load should not be followed by use of the value in the next k instructions
Nothing, but code can reduce run-time delays MIPS did the transformation in the assembler
Κίνδυνος Ελέγχου από εντολές διακλάδωσης (Branches) => Three Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Παράδειγμα: Branch Stall Impact If 30% branch, Stall 3 cycles significant
(CPI=?) Two part solution:
Determine branch taken or not sooner, AND Compute taken branch address earlier
MIPS branch tests if register = 0 or 0 MIPS Solution:
Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3
Ad
der
IF/ID
Διασωλήνωση της διόδου δεδομένων του MIPS
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
SignExtend
Zero?
MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC
RD RD RD WB
Data
• Data stationary control– local decode for each instruction phase / pipeline stage
Next PC
Addre
ss
RS1
RS2
ImmM
UX
ID/E
X
Τέσσερις Επιλογές για Κίνδυνους Ελέγχου
#1: Stall until branch direction is clear#2: Predict Branch Not Taken
Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average (CPI=?) PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome
#4: Delayed Branch Define branch to take place AFTER a
following instruction
branch instructionsequential successor1
sequential successor2........sequential successorn
........branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this but…
Branch delay of length n
Τέσσερις Επιλογές για Κίνδυνους Ελέγχου
Καθυστερημένη Εντολή Διακλάδωσης (Delayed Branch)
Where to get instructions to fill branch delay slot? Before branch instruction From the target address: only valuable when branch taken From fall through: only valuable when branch not taken Canceling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots
useful in computation About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Καθυστερημένη Εντολή Διακλάδωσης (Delayed Branch)
Recall: Speed Up συνάρτηση για διασωλήνωση
pipelined
dunpipeline
TimeCycle
TimeCycle
CPI stall Pipeline CPI Idealdepth Pipeline CPI Ideal
Speedup
pipelined
dunpipeline
TimeCycle
TimeCycle
CPI stall Pipeline 1depth Pipeline
Speedup
Instper cycles Stall Average CPI Ideal CPIpipelined
For simple RISC pipeline, Ideal CPI = 1:
Παράδειγμα: Αξιολόγηση της Διαφορετικής Υλοποίησης της Εντολής Διακλάδωσης
Assume: Conditional & Unconditional = 14%, 65% change PC
Scheduling Branch CPI speedup v. scheme penalty stall
Stall pipeline 3 1.42 1.0Predict taken 1 1.14 1.26Predict not taken 1 1.09 1.29Delayed branch 0.5 1.07 1.31
Pipeline speedup = Pipeline depth1 +Branch frequencyBranch penalty
Άλλες Δυσκολίες... Χρόνος αναμονής της κρυφής
μνήμης Διακοπές και Εξαιρέσεις
(Interrupts and Exceptions) ΑΣΕ (π.χ. ADD3 42(R1),56(R1)+,@(R1)) Λειτουργίες Πολύ-κύκλων
(Multicycle Operations)
Σχέση μεταξύ Κρυφής Μνήμης και Διασωλήνωσης
WB
Data
Ad
der
IF/ID
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
SignExtend
Zero? MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC
RD RD RD
Next PC
Addre
ss
RS1
RS2
Imm
MU
X
ID/E
X
I-$ D-$
Memory
Λειτουργίες Πολύ-κύκλων
Λειτουργίες Πολύ-κύκλων
Παράδειγμα: MIPS R4000
Στάδια (8): IF – IS – RF – EX – DF – DS – TC - WB
Άλλα Παραδείγματα PowerPC G4 (4):
PowerPC G4e (7):
Pentium 4 (20):
FETCH DECODE / DISPATCH
EXECUTE COMPLETE / WRITE-BACK
FETCH FETCH DECODE / DISP
ISSUE EXE COMPL WRITE - BACK
TCNIP
TCNIP
TCF
TCF
DRIVE
A&R
A&R
A&R
Q SCHE
SCHE
SCHE
DISP
DISP
REG
REG
EXE
FLAGS
BRAN
DRIVE
Άρθρα ISCA 2002
The Optimum Pipeline Depth for a MicroprocessorA. Hartstein, T. R. Puzak
The Optimal Useful Logic Depth Per Pipeline Stage is 6-8 FO4
M.S. Hrishikesh , Norman P. Jouppi , Keith I. Farkas , Doug Burger , Stephen W. Keckler, Premkishore Shivakumar
Increasing Processor Performance by Implementing Deeper Pipelines
Eric Sprangle , Doug Carmean
Ασκήσεις H&P:
(1) a, b, c (2) a, b, c