Δ ιασωλήνωση - Pipelining

Διασωλήνωση - Pipelining

Pedro TrancosoAppendix B

(και διαφάνειες από Prof. David Culler, Berkeley)

Βιομηχανία Αυτοκινήτων





1


2 1


3 2 1


4 3 2 1


5 4 3 2

Ας πλύνουμε τα ρούχα! Σειριακή Μεθόδος

Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

A

B

C

D

30 40 2030 40 2030 40 2030 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Χρησιμοποιώντας Διασωλήνωση...

Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

Μάθαμε ότι... Pipelining doesn’t help latency (χρόνος αναμονής) of single task, it helps throughput (ρυθμοαπόδοση) of entire workload

Pipeline rate limited by slowest pipeline stage

Multiple tasks operating simultaneously

Potential speedup = Number pipe stages

Unbalanced lengths of pipe stages reduces speedup

Time to “fill” pipeline and time to “drain” it reduces speedup

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20

Διασωλήνωση Εντολών Execute billions of instructions, so

throughput is what matters What is desirable in the ISA (ΑΣΕ) for

pipelining? Variable length instructions vs.

all instructions same length? Memory operands part of any operation vs.

memory operands only in loads or stores? Register operand (τελεστέος) many places in

instruction format vs. registers located in same place?

Κύκλος Εκτέλεσης Instruction

Fetch

Instruction

Decode

Operand

Fetch

Execute

Result

Store

Next

Instruction

Obtain instruction from program storage

Determine required actions and instruction size

Locate and obtain operand data

Compute result value or status

Deposit results in storage for later use

Determine successor instruction

Στάδια Διασωλήνωσης του DLX (1)

1. Instruction Fetch (IF)2. Instruction Decode / Register Fetch

(ID)3. Execution / Effective Address (EX)4. Memory Access / Branch Completion

(MEM)5. Write-back (WB)(go to 1!)


1. Instruction Fetch (IF)IR Mem[PC]NPC PC+4

2. Instruction Decode / Register Fetch (ID)A Regs[IR6..10]B Regs[IR11..15]Imm ((IR16)^16##IR16..31)

3. Execution / Effective Address (EX)Mem ref: ALU output A+ImmReg-reg (ALUop): ALUoutput A op BReg-imm (ALU op): ALU output A op ImmBranch: ALU output NPC+Imm

cond (A op 0)


4. Memory Access / Branch Completion (MEM)

Mem access: LMD Mem[ALU output] Mem[ALU output] B

Branch: if (cond) PC ALU output else PC NPC

5. Write-back (WB)Reg-reg ALU instr: Regs[IR16..20] ALU outputReg-imm ALU instr: Regs[IR11..15] ALU outputLoad instr: Regs[IR11..15] LMD

(go to 1!)

Διασωλήνωση του DLX

Cycle number 1

Instruction I IF

Instruction I+1

Instruction I+2

Instruction I+3

Instruction I+4

DLX Pipeline

Cycle number 1 2

Instruction I IF ID

Instruction I+1 IF

Instruction I+2

Instruction I+3

Instruction I+4

DLX Pipeline

Cycle number 1 2 3

Instruction I IF ID EX

Instruction I+1 IF ID

Instruction I+2 IF

Instruction I+3

Instruction I+4

DLX Pipeline

Cycle number 1 2 3 4

Instruction I IF ID EX MEM

Instruction I+1 IF ID EX


Instruction I+3 IF

Instruction I+4

DLX Pipeline

Cycle number 1 2 3 4 5

Instruction I IF ID EX MEM WB

Instruction I+1 IF ID EX MEM



Instruction I+4 IF

DLX Pipeline

Cycle number 1 2 3 4 5 6


Instruction I+1 IF ID EX MEM WB

Instruction I+2 IF ID EX MEM



DLX Pipeline

Cycle number 1 2 3 4 5 6 7 8 9






Παράδειγμα: MIPS

Op

31 26 01516202125

Rs1 Rd immediate

Op

31 26 025

Op

31 26 01516202125

Rs1 Rs2

target

Rd Opx

Register-Register

561011

Register-Immediate

Op

31 26 01516202125

Rs1 Rs2/Opx immediate

Branch

Jump / Call

5 Βήματα της Διόδου Δεδομένων (Datapath) του MIPS

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

LMD

ALU

MU

X

Mem

ory

Reg File

MU

XM

UX

Data

Mem

ory

MU

X

SignExtend

4

Ad

der Zero?

Next SEQ PC

Addre

ss

Next PC

WB Data

Inst

RD

RS1

RS2

Imm

MemoryAccess

Write

Back

InstructionFetch


ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

XM

UX

Mem

ory

MU

X

SignExtend

Zero?

IF/ID

ID/E

X

MEM

/WB

EX

/MEM

4

Ad

der

Next SEQ PC Next SEQ PC

RD RD RD WB

Data

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

Datapath

Control Path


MemoryAccess

Write

Back

InstructionFetch


ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

XM

UX

Mem

ory

MU

X

SignExtend

Zero?

IF/ID

ID/E

X

MEM

/WB

EX

/MEM

4

Ad

der


RD RD RD WB

Data

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

Datapath

Control Path

Inst

1

Inst

1

Inst

2

Inst

1

Inst

2

Inst

3


Εκτέλεση με Διασωλήνωση

Instr.

Order

Time (clock cycles)

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

Όρια της Διασωλήνωσης

Hazards (Κίνδυνοι): circumstances that would cause incorrect execution if next instruction were launched Structural hazards (κίνδυνος δομής):

Attempting to use the same hardware to do two different things at the same time

Data hazards (κίνδυνος δεδομένων): Instruction depends on result of prior instruction still in the pipeline

Control hazards (κίνδυνος ελέγχου): Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

Παράδειγμα: μια πόρτα μνήμης

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg


DMem

Structural Hazard

Λύσεις για κίνδυνους δομής Defn: attempt to use same

hardware for two different things at the same time

Solution 1: Wait must detect the hazard must have mechanism to stall

Solution 2: Throw more hardware at the problem

Εύρεση και λύση ενός κινδύνου δομής

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Stall

Instr 3

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg


Reg

ALU

DMemIfetch Reg

Bubble Bubble Bubble BubbleBubble

Λύση των Κίνδυνων Δομών στο σχεδιασμό

ALU

Instr

Cach

e

Reg File

MU

XM

UX

Data

Cach

e

MU

X

SignExtend

Zero?

IF/ID

ID/E

X

MEM

/WB

EX

/MEM

4

Ad

der


RD RD RD WB

Data

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

Datapath

Control Path

Σημασία του ΑΣΕ στην λύση των Κινδύνων Δομής Simple to determine the sequence

of resources used by an instruction opcode tells it all

Uniformity in the resource usage Compare MIPS to IA32? MIPS approach => all instructions

flow through same 5-stage pipeling

Instr.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Κίνδυνοι ΔεδομένωνTime (clock cycles)

IF ID/RF EX MEM WB

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Read After Write (RAW) (Διάβασμα μετά από γράψιμο) InstrJ tries to read operand before InstrI writes it

Caused by a “Data Dependence” (εξάρτηση δεδομένων). This hazard results from an actual need for communication.

Τρεις Είδους Κίνδυνοι Δεδομένων

I: add r1,r2,r3J: sub r4,r1,r3

Write After Read (WAR) (Γράψιμο μετά από διάβασμα) InstrJ writes operand before InstrI reads it

Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

Can it happen in the MIPS 5 stage pipeline? Can’t happen in MIPS 5 stage pipeline because:

All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5

I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7


Write After Write (WAW) (Γράψιμο μετά από γράψιμο) InstrJ writes operand before InstrI writes it.

Called an “output dependence” (εξάρτηση εξόδου) by compiler writers. This also results from the reuse of name “r1”.

Can it happen in the MIPS 5 stage pipeline? Can’t happen in MIPS 5 stage pipeline because:

All instructions take 5 stages, and Writes are always in stage 5

Will see WAR and WAW in later more complicated pipes

I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7


Time (clock cycles)

Μεταβίβαση (Forwarding) για αποφυγή Κινδύνου Δεδομένων

Inst

r.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Αλλαγές στο υλικό για Forwarding

MEM

/WR

ID/E

X

EX

/MEM

DataMemory

ALU

mux

mux

Registe

rs

NextPC

Immediate

mux

Time (clock cycles)

Instr.

Order

lw r1,0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

Κίνδυνοι Δεδομένων και με Forwarding

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Λύσεις…

Adding hardware? ... not Detection? Compilation techniques?

What is the cost of load delays?

Time (clock cycles)

or r8,r1,r9

Instr.

Order

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

Reg

ALU

DMemIfetch Reg

RegIfetch

ALU

DMem RegBubble

Ifetch

ALU

DMem RegBubble Reg

Ifetch

ALU

DMemBubble Reg

How is this different from the instruction issue stall?

Λύσεις του κινδύνου της εντολής φόρτωσης

Χρονοδρομολόγηση Λογισμικού για αποφυγή κίνδυνου δεδομένωνSoftware Scheduling to Avoid Load Hazards

Fast code:LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd

Try producing fast code for

a = b + c;

d = e – f;

assuming a, b, c, d ,e, and f in memory. Slow code:

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

STALL

STALL

Σχέση με τη ΕΣΑ What is exposed about this organizational

hazard in the instruction set? k cycle delay?

bad, CPI is not part of ISA k instruction slot delay

load should not be followed by use of the value in the next k instructions

Nothing, but code can reduce run-time delays MIPS did the transformation in the assembler

Κίνδυνος Ελέγχου από εντολές διακλάδωσης (Branches) => Three Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

Reg ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Παράδειγμα: Branch Stall Impact If 30% branch, Stall 3 cycles significant

(CPI=?) Two part solution:

Determine branch taken or not sooner, AND Compute taken branch address earlier

MIPS branch tests if register = 0 or 0 MIPS Solution:

Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3

Ad

der

IF/ID

Διασωλήνωση της διόδου δεδομένων του MIPS

MemoryAccess

Write

Back

InstructionFetch


ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

X

Data

Mem

ory

MU

X

SignExtend

Zero?

MEM

/WB

EX

/MEM

4

Ad

der

Next SEQ PC

RD RD RD WB

Data

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Addre

ss

RS1

RS2

ImmM

UX

ID/E

X

Τέσσερις Επιλογές για Κίνδυνους Ελέγχου

#1: Stall until branch direction is clear#2: Predict Branch Not Taken

Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average (CPI=?) PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS

MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome

#4: Delayed Branch Define branch to take place AFTER a

following instruction

branch instructionsequential successor1

sequential successor2........sequential successorn

........branch target if taken

1 slot delay allows proper decision and branch target address in 5 stage pipeline

MIPS uses this but…

Branch delay of length n

Τέσσερις Επιλογές για Κίνδυνους Ελέγχου

Καθυστερημένη Εντολή Διακλάδωσης (Delayed Branch)

Where to get instructions to fill branch delay slot? Before branch instruction From the target address: only valuable when branch taken From fall through: only valuable when branch not taken Canceling branches allow more slots to be filled

Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots

useful in computation About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Καθυστερημένη Εντολή Διακλάδωσης (Delayed Branch)

Recall: Speed Up συνάρτηση για διασωλήνωση

pipelined

dunpipeline

TimeCycle

TimeCycle

CPI stall Pipeline CPI Idealdepth Pipeline CPI Ideal

Speedup

pipelined

dunpipeline

TimeCycle

TimeCycle

CPI stall Pipeline 1depth Pipeline

Speedup

Instper cycles Stall Average CPI Ideal CPIpipelined

For simple RISC pipeline, Ideal CPI = 1:

Παράδειγμα: Αξιολόγηση της Διαφορετικής Υλοποίησης της Εντολής Διακλάδωσης

Assume: Conditional & Unconditional = 14%, 65% change PC

Scheduling Branch CPI speedup v. scheme penalty stall

Stall pipeline 3 1.42 1.0Predict taken 1 1.14 1.26Predict not taken 1 1.09 1.29Delayed branch 0.5 1.07 1.31

Pipeline speedup = Pipeline depth1 +Branch frequencyBranch penalty

Άλλες Δυσκολίες... Χρόνος αναμονής της κρυφής

μνήμης Διακοπές και Εξαιρέσεις

(Interrupts and Exceptions) ΑΣΕ (π.χ. ADD3 42(R1),56(R1)+,@(R1)) Λειτουργίες Πολύ-κύκλων

(Multicycle Operations)

Σχέση μεταξύ Κρυφής Μνήμης και Διασωλήνωσης

WB

Data

Ad

der

IF/ID

ALU

Mem

ory

Reg File

MU

X

Data

Mem

ory

MU

X

SignExtend

Zero? MEM

/WB

EX

/MEM

4

Ad

der

Next SEQ PC

RD RD RD

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

ID/E

X

I-$ D-$

Memory

Λειτουργίες Πολύ-κύκλων

Λειτουργίες Πολύ-κύκλων

Παράδειγμα: MIPS R4000

Στάδια (8): IF – IS – RF – EX – DF – DS – TC - WB

Άλλα Παραδείγματα PowerPC G4 (4):

PowerPC G4e (7):

Pentium 4 (20):

FETCH DECODE / DISPATCH

EXECUTE COMPLETE / WRITE-BACK

FETCH FETCH DECODE / DISP

ISSUE EXE COMPL WRITE - BACK

TCNIP

TCNIP

TCF

TCF

DRIVE

A&R

A&R

A&R

Q SCHE

SCHE

SCHE

DISP

DISP

REG

REG

EXE

FLAGS

BRAN

DRIVE

Άρθρα ISCA 2002

The Optimum Pipeline Depth for a MicroprocessorA. Hartstein, T. R. Puzak

The Optimal Useful Logic Depth Per Pipeline Stage is 6-8 FO4

M.S. Hrishikesh , Norman P. Jouppi , Keith I. Farkas , Doug Burger , Stephen W. Keckler, Premkishore Shivakumar

Increasing Processor Performance by Implementing Deeper Pipelines

Eric Sprangle , Doug Carmean

Ασκήσεις H&P:

(1) a, b, c (2) a, b, c

Documents

Δ ιασωλήνωση - Pipelining