CS4100: 計算機結構 Pipelining

CS4100: CS4100: 計算機結構計算機結構

PipeliningPipelining

國立清華大學資訊工程學系一零零學年度第二學期

Computer ArchitecturePipelining-2

Outline An overview of pipelining A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch hazards Exceptions Superscalar and dynamic pipelining


Laundry example:

Ann, Brian, Cathy, Dave each have one load ofclothes to wash, dry,and fold

Washer takes 30 minutes

Dryer takes 40 minutes

“Folder” takes 20 minutes

A B C D

Pipelining Is Natural!


Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would it take?

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Sequential Laundry


Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

Pipelined Laundry: Start ASAP


Pipelining LessonsDoesn’t help latency of single task, but throughput of entire

Pipeline rate limited by slowest stage

Multiple tasks working at same time using different resources

Potential speedup = Number pipe stages

Unbalanced stage length; time to “fill” & “drain” the pipeline reduce speedup

Stall for dependences

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20


Single cycle vs. Pipeline

Clk

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10

Load

Pipeline Implementation:

Clk

Single Cycle Implementation:

Load Store Waste

Ifetch Reg Exec Mem Wr

Ifetch Reg Exec Mem WrStore

Ifetch Reg Exec Mem WrR-type

Cycle 1 Cycle 2


Pipeline PerformanceSingle-cycle (Tc= 800ps)

Pipelined (Tc= 200ps)


Instr.

Order

Time (clock cycles)

Inst 0

Inst 1

Inst 2

Inst 4

Inst 3

AL

UIm Reg Dm Reg

AL

UIm Reg Dm Reg

AL

UIm Reg Dm Reg

AL

UIm Reg Dm Reg

AL

UIm Reg Dm Reg

Why Pipeline? Because the Resources Are There!

Single-cycle

Datapath




Designing a Pipelined Processor

Examine the datapath and control diagramStarting with single cycle datapathSingle cycle control?

Partition datapath into stages:IF (instruction fetch), ID (instruction decode

and register file read), EX (execution or address calculation), MEM (data memory access), WB (write back)

Associate resources with stages Ensure that flows do not conflict, or figure

out how to resolve Assert control in appropriate stage


Multi-Execution Steps

Step nameAction for R-type

instructionsAction for memory-reference

instructionsAction for branches

Action for jumps

Instruction fetch IR = Memory[PC]PC = PC + 4

Instruction A = Reg [IR[25-21]]decode/register fetch B = Reg [IR[20-16]]

ALUOut = PC + (sign-extend (IR[15-0]) << 2)

Execution, address ALUOut = A op B ALUOut = A + sign-extend if (A ==B) then PC = PC [31-28] IIcomputation, branch/ (IR[15-0]) PC = ALUOut (IR[25-0]<<2)jump completion

Memory access or R-type Reg [IR[15-11]] = Load: MDR = Memory[ALUOut]completion ALUOut or

Store: Memory [ALUOut] = B

Memory read completion Load: Reg[IR[20-16]] = MDR

But, use single-cycle datapath ...


Split Single-cycle Datapath

What to add to split the datapath into stages?

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Instruction

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataAddress

Datamemory

1

ALUresult

Mux

ALUZero

IF: Instruction fetch ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access WB: Write back

FeedbackPath


Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Pipeline registers (latches)

Add Pipeline Registers

Use registers between stages to carry data and control


IF: Instruction Fetch Fetch the instruction from the Instruction

Memory ID: Instruction Decode

Registers fetch and instruction decode EX: Calculate the memory address MEM: Read the data from the Data Memory WB: Write the data back to the register file

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Ifetch Reg/Dec Exec Mem WrLoad

Consider load


Clock

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Ifetch Reg/Dec Exec Mem Wr1st lw

Ifetch Reg/Dec Exec Mem Wr2nd lw

Ifetch Reg/Dec Exec Mem Wr3rd lw

Pipelining load

5 functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File’s Read ports (busA and busB) for

the Reg/Dec stage ALU for the Exec stage Data Memory for the MEM stage Register File’s Write port (busW) for the WB

stage


IR = mem[PC]; PC = PC + 4

IF Stage of load

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetch

lw

Address

Datamemory

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Instruction decode

lw

Address

Datamemory

IR, PC+4


ID Stage of load A = Reg[IR[25-21]]; B = Reg[IR[20-16]];

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetch

lw

Address

Datamemory

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Instruction decode

lw

Address

Datamemory


EX Stage of load ALUout = A + sign-ext(IR[15-0])

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Execution

lw

Address

Datamemory


MEM State of load MDR = mem[ALUout]

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Memory

lw

Address

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Write back

lw

Writeregister

Address

97108/Patterson Figure 06.15


WB Stage of load Reg[IR[20-16]] = MDR

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Memory

lw

Address

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Write back

lw

Writeregister

Address


Who will supply

this address?


Cycle 1 Cycle 2 Cycle 3 Cycle 4

Ifetch Reg/Dec Exec WrR-type

The Four Stages of R-type

IF: fetch the instruction from the Instruction Memory

ID: registers fetch and instruction decode EX: ALU operates on the two register

operands WB: write ALU output back to the register

file


We have a structural hazard: Two instructions try to write to the register file

at the same time! Only one write port

Clock

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9






Ops! We have a problem!

Pipelining R-type and load


Important Observation


1 2 3 4 5


1 2 3 4

Each functional unit can only be used once per instruction

Each functional unit must be used at the same stage for all instructions: Load uses Register File’s write port during its

5th stage

R-type uses Register File’s write port during its 4th stage

Several ways to solve: forwarding, adding pipeline bubble, making instructions same length


Clock

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9

Ifetch Reg/Dec Mem WrR-type





Ifetch Reg/Dec Exec WrR-type Mem

Exec

Exec

Exec

Exec

1 2 3 4 5

Solution: Delay R-type’s Write Delay R-type’s register write by one cycle:

R-type also use Reg File’s write port at Stage 5 MEM is a NOP stage: nothing is being done.

R-type also has 5 stages



Ifetch Reg/Dec Exec MemStore Wr

The Four Stages of store


ID: registers fetch and instruction decode EX: calculate the memory address MEM: write the data into the Data Memory

Add an extra stage: WB: NOP



ID: registers fetch and instruction decode EX:

compares the two register operandselect correct branch target addresslatch into PC

Add two extra stages: MEM: NOP WB: NOP


Ifetch Reg/Dec Exec MemBeq Wr

The Three Stages of beq


Pipelined Datapath

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0

Address

Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX


Graphically Representing Pipelines

Can help with answering questions like: How many cycles to execute this code? What is the ALU doing during cycle 4? Help understand datapaths

IM Reg DM Reg

IM Reg DM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $10, 20($1)

Programexecutionorder(in instructions)

sub $11, $2, $3

ALU

ALU


Example 1: Cycle 1

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction decode

lw $10, 20($1)

Instruction fetch

sub $11, $2, $3

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetch

lw $10, 20($1)

Address

Datamemory

Address

Datamemory

Clock 1

Clock 2


Example 1: Cycle 2

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction decode

lw $10, 20($1)

Instruction fetch

sub $11, $2, $3

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetch

lw $10, 20($1)

Address

Datamemory

Address

Datamemory

Clock 1

Clock 2


Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

3216Sign

extend

Writeregister

Writedata

Memory

lw $10, 20($1)

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Execution

sub $11, $2, $3

Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Execution

lw $10, 20($1)

Instruction decode

sub $11, $2, $3

3216Sign

extend

Address

Datamemory

Datamemory

Address

Clock 3

Clock 4

Example 1: Cycle 3


Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

3216Sign

extend

Writeregister

Writedata

Memory

lw $10, 20($1)

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Execution

sub $11, $2, $3

Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Execution

lw $10, 20($1)

Instruction decode

sub $11, $2, $3

3216Sign

extend

Address

Datamemory

Datamemory

Address

Clock 3

Clock 4

Example 1: Cycle 4


Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEMID/EX MEM/WB

Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

lw $10, 20($1)

Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Inst

ruct

ion


Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

sub $11, $2, $3

Memory

sub $11, $2, $3

Address

Datamemory

Address

Datamemory

Clock 6

Clock 5

Example 1: Cycle 5


Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Inst

ruct

ion


Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

lw $10, 20($1)

Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Inst

ruct

ion


Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

sub $11, $2, $3

Memory

sub $11, $2, $3

Address

Datamemory

Address

Datamemory

Clock 6

Clock 5Example 1: Cycle 6




Pipeline Control: Control Signals

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead


6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

AddAdd

result

Shiftleft 2

ALUresult

ALU

Zero

Add

0

1

Mux

0

1

Mux


Execution/Address Calculationstage control lines

Memory access stagecontrol lines

Write-back stagecontrol lines

RegDst

ALUOp1

ALUOp0

ALUSrc Branch

MemRead

MemWrite

Regwrite

Mem toReg

1 1 0 0 0 0 0 1 00 0 0 1 0 1 0 1 1X 0 0 1 0 0 1 0 XX 0 1 0 1 0 0 0 X

Fig. 4.22

Group Signals According to Stages

Can use control signals of single-cycle CPU


Pass control signals along just like the data Main control generates control signals during

ID

Data Stationary Control

Control

EX

M

WB

M

WB

WB


Instruction

Fig. 4.50


IF/ID

Register

ID/E

x Register

Ex/M

EM

Register

ME

M/W

B R

egister

ID EX MEM

ExtOp

ALUOp

RegDst

ALUSrc

Branch

MemWr

MemtoReg

RegWr

MainControl

ExtOp

ALUOp

RegDst

ALUSrc

MemtoReg

RegWr

MemtoReg

RegWr

MemtoReg

RegWr

Branch

MemWr

Branch

MemW

WB

Data Stationary Control (cont.) Signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle

later Signals for MEM (MemWr, Branch) are used 2 cycles

later Signals for WB (MemtoReg, MemWr) are used 3 cycles

later


WB Stage of load Reg[IR[20-16]] = MDR

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Memory

lw

Address

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Write back

lw

Writeregister

Address


Who will supply

this address?


Datapath with Control

PC

Instructionmemory

Inst

ruct

ion

Add


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU


6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Write

AddressData

memory

Address


lw $10, 20($1)

sub $11, $2, $3

and $12, $4, $5

or $13, $6, $7

add $14, $8, $9

Let’s Try it Out


Example 2: Cycle 1

Instructionmemory


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

Mux

0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU


EX

M

WB

M

WB

WBIn

stru

ctio

n

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

RegW

rite

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1


Instruction[15– 0] Sign

extend


20

$X

$1

10

X

Mem

Write

MemRead

Mem

Writ

e

Datamemory

Address

Address

Address

Clock 2

Clock 1


Example 2: Cycle 2

Instructionmemory


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

Mux

0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU


EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

RegW

rite

ALU

M

WB

WBIn

stru

ctio

n

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1



extend


20

$X

$1

10

X

Mem

Write

MemRead

Mem

Writ

e

Datamemory

Address

Address

Address

Clock 2

Clock 1


Instructionmemory

Address


Mem

toR

eg

Branch

ALUSrc

4


0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

RegW

rite

MemRead

Control

ALU


EX

M

WB

WBIn

stru

ctio

n

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

RegW

rite

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

0

0

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4




X

$5

$4

X

12

Mem

Write

MemRead

Mem

Writ

e

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

Example 2: Cycle 3


Instructionmemory

Address


Mem

toR

eg

Branch

ALUSrc

4


0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

RegW

rite

MemRead

Control

ALU


EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

RegW

rite

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

0

0

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4




X

$5

$4

X

12

Mem

Write

MemRead

Mem

Writ

e

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

Example 2: Cycle 4


Example 2: Cycle 5

Instructionmemory

Address


Branch

ALUSrc

4


0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

RegW

rite

MemRead

Control

ALU


EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

ALUcontrol

Shiftleft 2

RegW

rite

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

14

X

X

9

8




X

$9

$8

X

14

Mem

Write

MemRead

Mem

Writ

e

or

13

X

X

7

6

X

$7

$6

X

13

$4

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$7

$6

13

Mux

Mux

ALUReaddata

12

WB

11 10

10$5

12

WB

Mem

toR

eg

1

1

11

11

Writeregister

Writedata

Zero

Zero

Datamemory

Address

Datamemory

Address

Signextend

Signextend

Clock 5

Clock 6


Example 2: Cycle 6

Instructionmemory

Address


Branch

ALUSrc

4


0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

RegW

rite

MemRead

Control

ALU


EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

ALUcontrol

Shiftleft 2

RegW

rite

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

14

X

X

9

8




X

$9

$8

X

14

Mem

Write

MemRead

Mem

Writ

e

or

13

X

X

7

6

X

$7

$6

X

13

$4

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$7

$6

13

Mux

Mux

ALUReaddata

12

WB

11 10

10$5

12

WB

Mem

toR

eg

1

1

11

11

Writeregister

Writedata

Zero

Zero

Datamemory

Address

Datamemory

Address

Signextend

Signextend

Clock 5

Clock 6


Example 2: Cycle 7

Fig. 6.34

Instructionmemory

Address


Branch

ALUSrc

4


0

1

Add Addresult


Writedata

ALUresult

Shiftleft 2

RegW

rite

MemRead

Control

ALU


Signextend

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

MEM/WB

IF: after<2>

000

00

0000

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

RegW

rite

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

MEM/WB

IF: after<3>

000

00

0000

000

00

000

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2



extend


Mem

Write

MemRead

Mem

Writ

e

$8

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

Mux

Mux

ALUReaddata

14

WB

13 12

12$9

14

WB

Mem

toR

eg

1

0

13

13

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2 Zero

Datamemory

Address

Datamemory

Address

Clock 7

Clock 8


Example 2: Cycle 8

Fig. 6.34

Instructionmemory

Address


Branch

ALUSrc

4


0

1

Add Addresult


Writedata

ALUresult

Shiftleft 2

RegW

rite

MemRead

Control

ALU


Signextend

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

MEM/WB

IF: after<2>

000

00

0000

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

RegW

rite

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

MEM/WB

IF: after<3>

000

00

0000

000

00

000

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2



extend


Mem

Write

MemRead

Mem

Writ

e

$8

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

Mux

Mux

ALUReaddata

14

WB

13 12

12$9

14

WB

Mem

toR

eg

1

0

13

13

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2 Zero

Datamemory

Address

Datamemory

Address

Clock 7

Clock 8


WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .

MEM/WB

IF: after<4>

000

00

0000

000

00

000

0

00

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2



extend


MemRead

Mem

Write

Mux

Mux

ALUReaddata

WB

14

14

Writeregister

Writedata

Datamemory

Address

Clock 9

Example 2: Cycle 9


Summary of Pipeline Basics Pipelining is a fundamental concept

Multiple steps using distinct resources Utilize capabilities of datapath by pipelined

instruction processing Start next instruction while working on the

current one Limited by length of longest stage (plus

fill/flush) Need to detect and resolve hazards

What makes it easy in MIPS? All instructions are of the same length Just a few instruction formats Memory operands only in loads and stores

What makes pipelining hard? hazards


Outline An overview of pipelining A pipelined datapath Pipelined control Data hazards and forwarding (R-Type and R-

Type) Data hazards and stalls (Load and R-type) Branch hazards Exceptions Superscalar and dynamic pipelining


Pipeline Hazards Pipeline Hazards:

Structural hazards: attempt to use the same resource in two different ways at the same time

Ex.: combined washer/dryer or folder busy doing something else (watching TV)

Data hazards: attempt to use item before ready Instruction depends on result of prior instruction

still in the pipeline Control hazards: attempt to make decision before

condition is evaluated Ex.: wash football uniforms and need to see result

of previous load to get proper detergent level Branch instructions

Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards


Mem

Instr.

Order

Time

Load

Instr 1

Instr 2

Instr 3

Instr 4A

LUMem Reg Mem Reg

AL

UMem Reg Mem Reg

AL

UMem Reg Mem RegA

LUReg Mem Reg

AL

UMem Reg Mem Reg

Use 2 memory: data memory and instruction memory

Structural Hazard: Single Memory


Pipeline Hazards Illustrated

IF ID EX MEM WB

StructuralHazard

IF ID ….


Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruc

tion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Feedback Path


Data Hazards

IM Reg

IM Reg



sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/ -2 0 -20 -2 0 -20 -2 0

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM


Types of Data Hazards

Three types: (inst. i1 followed by inst. i2) RAW (read after write):

i2 tries to read operand before i1 writes it WAR (write after read):

i2 tries to write operand before i1 reads it Gets wrong operand, e.g., autoincrement addr. Can’t happen in MIPS 5-stage pipeline because:

All instructions take 5 stages, and reads are always in stage 2, and writes are always in stage 5

WAW (write after write): i2 tries to write operand before i1 writes it Leaves wrong result ( i1’s not i2’s); occur only

in pipelines that write in more than one stage Can’t happen in MIPS 5-stage pipeline because:

All instructions take 5 stages, and writes are always in stage 5


Pipeline Hazards Illustrated

IF ID EX MEM WB

IF ID EX Mem

RAW (read after write) Data Hazard

WAW Data Hazard (write after write)

IF ID EX MEM WB WAR Data Hazard (write after read)

IF ID EX MEM WB

IF ID EX MEM WB


Handling Data Hazards Use simple, fixed designs

Eliminate WAR by always fetching operands early (ID) in pipeline

Eliminate WAW by doing all write backs in order (last stage, static)

These features have a lot to do with ISA design Internal forwarding in register file:

Write in first half of clock and read in second half

Read delivers what is written, resolve hazard between sub and add

Detect and resolve remaining ones Compiler inserts NOP (software solution) Forward (hardware solution) Stall (hardware solution)


Software Solution Have compiler guarantee no hazards Where do we insert the NOPs?

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Problem: this really slows us down!


Data Hazards

IM Reg

IM Reg



sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/ -2 0 -20 -2 0 -20 -2 0

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)


DM Reg

Reg

Reg

Reg

DM

Insert two nops


Data Hazards : Forwarding

IM Reg

IM Reg



sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/ -2

0 -20 -20 -20 -20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)


DM Reg

Reg

Reg

Reg

DM


Datapath with Forwarding


Control: Detecting Data Hazards

Hazard conditions: 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

Two optimizations: Don’t forward if instruction does not write

register=> check if RegWrite is asserted

Don’t forward if destination register is $0=> check if RegisterRd = 0


Detecting Data Hazards (cont.) Hazard conditions using control signals:

At EX stage:EX/MEM.RegWrite and (EX/MEM.RegRd0) and (EX/MEM.RegRd=ID/EX.RegRs)

At MEM stage:MEM/WB.RegWrite and (MEM/WB.RegRd0) and (MEM/WB.RegRd=ID/EX.RegRs)

(replace ID/EX.RegRt for ID/EX.RegRs for the other two conditions)


Use temporary results, e.g., those in pipeline registers, don’t wait for them to be written

Resolving Hazards: Forwarding

IM Reg

IM Reg



sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM


Datapath with Forwarding


Forwarding Logic Forwarding: input to ALU from any pipe reg.

Add multiplexors to ALU input Control forwarding in EX => carry Rs in ID/EX

Control signals for forwarding: If both WB and MEM forward, e.g., add $1,$1,$2; add $1,$1,$3; add $1,$1,$4; => let MEM forward

EX hazard: if (EX/MEM.RegWrite and (EX/MEM.RegRd0)

and (EX/MEM.RegRd=ID/EX.RegRs)) ForwardA=10

MEM hazard: if (MEM/WB.RegWrite and (MEM/WB.RegRd0)

and (EX/MEM.RegRd ID/EX.Reg.Rs)and (MEM/WB.RegRd=ID/EX.RegRs))

ForwardA=01(ID/EX.RegRt<->ID/EX.RegRs, ForwardB<-> ForwardA)


Example 3: Cycle 3

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5 sub $2, $1, $3

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

10 10

$2

$5

5

2

4

$1

$3

3

1

2

Control

ALU

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

or $4, $4, $2 and $4, $2, $5

ID/EX

sub $2, . . .

EX/MEM

before<1>

MEM/WB

add $9, $4, $2

Clock 4

4

6

10 10

$4

$2

6

2

4

$2

$5

5

2

4

Control

ALU

10

2

WB

M

WB


Example 3: Cycle 4

Fig. 6.41

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5 sub $2, $1, $3

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

10 10

$2

$5

5

2

4

$1

$3

3

1

2

Control

ALU

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

or $4, $4, $2 and $4, $2, $5

ID/EX

sub $2, . . .

EX/MEM

before<1>

MEM/WB

add $9, $4, $2

Clock 4

4

6

10 10

$4

$2

6

2

4

$2

$5

5

2

4

Control

ALU

10

2

WB

M

WB


PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PCInstruction

memory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control

Example 3: Cycle 5


Example 3: Cycle 6

Fig. 6.42

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PCInstruction

memory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control


Outline An overview of pipelining A pipelined datapath Pipelined control Data hazards and forwarding (R-Type and R-

Type) Data hazards and stalls (Load and R-type) Branch hazards Exceptions Superscalar and dynamic pipelining


lw can still cause a hazard: if is followed by an instruction to read the

loaded reg.

Reg

IM

Reg

Reg

IM



lw $2, 20($1)


and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

Can't Always Forward

Use stalling or compiler to resolve


Stalling Stall pipeline by keeping instructions in same

stage and inserting an NOP instead

lw $2, 20($1)


and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble


PCInstruction

memory

Registers

Mux

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

0

Mux

IF/ID

Inst

ruct

ion

ID/EX.MemRead

IF/ID

Write

PC

Write

ID/EX.RegisterRt

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

RtRs

Rd

Rt EX/MEM.RegisterRd

MEM/WB.RegisterRd

Datapath with Stalling Unit Forwarding controls ALU inputs, hazard

detection controls PC, IF/ID, control signals


Control: Handling Stalls Hazard detection unit in ID to insert stall

between a load instruction and its use: if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.registerRt)) stall the pipeline for one cycle(ID/EX.MemRead=1 indicates a load instruction)

How to stall? Stall instruction in IF and ID: not change PC

and IF/ID=> the stages re-execute the instructions

What to move into EX: insert an NOP by changing EX, MEM, WB control fields of ID/EX pipeline register to 0

as control signals propagate, all control signals to EX, MEM, WB are deasserted and no registers or memories are written


Example 4: Cycle 2

Hazarddetection

unit

0

MuxIF

/ID

Write

PC

Write

ID/EX.RegisterRt

lw $2, 20($1)

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

2

500 11

$2

$5

5

2

4

$1

$X

X

1

2

Control

ALU

M

WB

Hazarddetection

unit

0

MuxIF

/ID

Write

PC

Write

ID/EX.RegisterRt

ID/EX.MemRead

ID/EX.MemRead

M

WB

$1

$X

X

1

2

before<3>

PCInstruction

memory

Registers

Mux

Mux

Mux

EX WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

ID/EX

EX/MEM

MEM/WB

and $4, $2, $5 lw $2, 20($1) before<1> before<2>

Clock 2

1

1

X

X11

Control

ALU

M

WB


Example 4: Cycle 3

Hazarddetection

unit

0

MuxIF/ID

Write

PC

Write

ID/EX.RegisterRt

lw $2, 20($1)

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

2

500 11

$2

$5

5

2

4

$1

$X

X

1

2

Control

ALU

M

WB

Hazarddetection

unit

0

MuxIF

/ID

Write

PC

Write

ID/EX.RegisterRt

ID/EX.MemRead

ID/EX.MemRead

M

WB

$1

$X

X

1

2

before<3>

PCInstruction

memory

Registers

Mux

Mux

Mux

EX WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

ID/EX

EX/MEM

MEM/WB

and $4, $2, $5 lw $2, 20($1) before<1> before<2>

Clock 2

1

1

X

X11

Control

ALU

M

WB

Computer ArchitecturePipelining-83Hazard

detectionunit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

$2

$5

5

2

2

2

4

WB

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

and $4, $2, $5 bubble

ID/EX

lw $2, . . .

EX/MEM

before<1>

MEM/WB

Clock 4

2

2

5

510

11

00

$2

$5

5

2

4

Control

ALU

M

WB

bubble lw $2, . . .

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

EX/MEM

MEM/WB

add $9, $4, $2

Clock 5

2

210 10

11

$4

$2

2

4

4

4

2

4

$2

$5

5

2

4

Control

ALU

0

WB

ID/EX.MemRead

ID/EX.MemRead

or $4, $4, $2

or $4, $4, $2

Example 4: Cycle 4


Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

$2

$5

5

2

2

2

4

WB

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

eID/EX.RegisterRt

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

and $4, $2, $5 bubble

ID/EX

lw $2, . . .

EX/MEM

before<1>

MEM/WB

Clock 4

2

2

5

510

11

00

$2

$5

5

2

4

Control

ALU

M

WB

bubble lw $2, . . .

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

EX/MEM

MEM/WB

add $9, $4, $2

Clock 5

2

210 10

11

$4

$2

2

4

4

4

2

4

$2

$5

5

2

4

Control

ALU

0

WB

ID/EX.MemRead

ID/EX.MemRead

or $4, $4, $2

or $4, $4, $2

Example 4: Cycle 5


Example 4: Cycle 6

Fig. 6.49

Registers

Inst

ruct

ion

ID/EX

4

Control

PCInstruction

memory

PCInstruction

memory

Hazarddetection

unit

0

Mux

IF/ID

Writ

e

PC

Write

IF/ID

Write

PC

Write

ID/EX.RegisterRt

bubble

Registers

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

add $9, $4, $2

ID/EX

and $4, . . .

EX/MEM

MEM/WB

Clock 6

4

4

2

210 10

$4

$2

2

4

49

$2

2

Control

ALU

10

WB0

add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>

after<1>

Clock 7

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

EX/MEM

MEM/WB

10 10

$4

$4

$2

2

4

4

9

ALU

10

WB

44

4

1

Hazarddetection

unit

0

Mux

ID/EX.RegisterRt

or $4, $4, $2

ID/EX.MemRead

ID/EX.MemRead

Mux

IF/ID


Registers

Inst

ruct

ion

ID/EX

4

Control

PCInstruction

memory

PCInstruction

memory

Hazarddetection

unit

0

Mux

IF/ID

Writ

e

PC

Write

IF/ID

Write

PC

Write

ID/EX.RegisterRt

bubble

Registers

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

add $9, $4, $2

ID/EX

and $4, . . .

EX/MEM

MEM/WB

Clock 6

4

4

2

210 10

$4

$2

2

4

49

$2

2

Control

ALU

10

WB0

add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>

after<1>

Clock 7

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

EX/MEM

MEM/WB

10 10

$4

$4

$2

2

4

4

9

ALU

10

WB

44

4

1

Hazarddetection

unit

0

Mux

ID/EX.RegisterRt

or $4, $4, $2

ID/EX.MemRead

ID/EX.MemRead

Mux

IF/ID

Example 4: Cycle 7




Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruc

tion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Feedback Path


Pipeline Datapath with Control Signals

PC

Instructionmemory

Address

Inst

ruct

ion


MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4


0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead


6


MemWrite

Address

Datamemory

PCSrc

Zero

AddAdd

result

Shiftleft 2

ALUresult

ALU

Zero

Add

0

1

Mux

0

1

Mux


When decide to branch, other inst. are in pipeline!

Reg

Reg

CC 1


40 beq $1, $3, 7


IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Branch Hazards


Handling Branch Hazard Predict branch always not taken

Need to add hardware for flushing inst. if wrong Branch decision made at MEM => need to flush

instruction in IF/ID, ID/EX by changing control values to 0

Reduce delay of taken branch by moving branch execution earlier in the pipeline Move up branch address calculation to ID Check branch equality at ID (using XOR) by

comparing the two registers read during ID Branch decision made at ID => one instruction

to flush Add a control signal, IF.Flush, to zero

instruction field of IF/ID => making the instruction an NOP

Dynamic branch prediction Compiler rescheduling, delay branch


Pipeline with Flushing

PC Instructionmemory

4

Registers

Mux

Mux

Mux

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

Signextend

Control

Mux

=

Shiftleft 2

Mux


Example 5: Cycle 3

PCInstruction

memory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

Mux

IF.Flush

IF/ID

and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8

MEM/WB

EX/MEM

ID/EX

Clock 3

72 44

48 44

28

7

$1

$3

10

48

72

72

0

Mux

0

$4

$8

ALUData

memory

bubble (nop)lw $4, 50($7)

Clock 4

Mux

Shiftleft 2

before<1>

beq $1, $3, 7 sub $10, . . . before<1>

before<2>

=


4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

MEM/WB

EX/MEM

ID/EX

76 72

76 72

$1

$3

10

76

ALUData

memory

Mux

Shiftleft 2

=


Example 5: Cycle 4

PCInstruction

memory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

Mux

IF.Flush

IF/ID

and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8

MEM/WB

EX/MEM

ID/EX

Clock 3

72 44

48 44

28

7

$1

$3

10

48

72

72

0

Mux

0

$4

$8

ALUData

memory

bubble (nop)lw $4, 50($7)

Clock 4

Mux

Shiftleft 2

before<1>

beq $1, $3, 7 sub $10, . . . before<1>

before<2>

=


4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

MEM/WB

EX/MEM

ID/EX

76 72

76 72

$1

$3

10

76

ALUData

memory

Mux

Shiftleft 2

=


Dynamic Branch Prediction In deeper and superscalar pipelines, branch

penalty is more significant Use dynamic prediction (e.g. loop)

Branch prediction buffer (aka branch history table)

Indexed by recent branch instruction addresses

Stores outcome (taken/not taken) To execute a branch

Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction


Predict-not-taken + branch decision at ID=> the following instruction is always executed=> branches take effect 1 cycle later

0 clock cycle penalty per branch instruction if can find instruction to put in slot (50% of time)

Instr.

Order

Time (clock cycles)

add

beq

misc

AL

UMem Reg Mem Reg

AL

UMem Reg Mem Reg

Mem

AL

UReg Mem Reg

lw Mem

AL

UReg Mem Reg

Delayed Branch




Exceptions and Interrupts “ Unexpected” events requiring change

in flow of control Different ISAs use the terms differently

Exception Arises within the CPU e.g., undefined opcode, overflow, syscall,

… Interrupt

From an external I/O controller Dealing with them without sacrificing

performance is hard

§4.9 Exceptions


Handling Exceptions In MIPS, exceptions managed by a System

Control Coprocessor (CP0) Save PC of offending (or interrupted)

instruction In MIPS: Exception Program Counter (EPC)

Save indication of the problem In MIPS: Cause register We’ll assume 1-bit

0 for undefined opcode, 1 for overflow Jump to handler at 8000 00180


An Alternate Mechanism Vectored Interrupts

Handler address determined by the cause Example:

Undefined opcode: C000 0000 Overflow: C000 0020 …: C000 0040

Instructions either Deal with the interrupt, or Jump to real handler


Handler Actions Read cause, and transfer to relevant handler Determine action required If restartable

Take corrective action use EPC to return to program

Otherwise Terminate program Report error using EPC, cause, …


Exceptions in a Pipeline Another form of control hazard Consider overflow on add in EX stage

add $1, $2, $1 Prevent $1 from being clobbered Complete previous instructions Flush add and subsequent instructions Set Cause and EPC register values Transfer control to handler

Similar to mispredicted branch Use much of the same hardware


Pipeline with Exceptions


Exception Properties Restartable exceptions

Pipeline can flush the instruction Handler executes, then returns to the

instruction Refetched and executed from scratch

PC saved in EPC register Identifies causing instruction Actually PC + 4 is saved

Handler must adjust


Exception Example Exception on add in

40 sub $11, $2, $444 and $12, $2, $548 or $13, $2, $64C add $1, $2, $150 slt $15, $6, $754 lw $16, 50($7)…

Handler80000180 sw $25, 1000($0)80000184 sw $26, 1004($0)…


Exception Example


Exception Example


Multiple Exceptions Pipelining overlaps multiple instructions

Could have multiple exceptions at once Simple approach: deal with exception from

earliest instruction Flush subsequent instructions “Precise” exceptions

In complex pipelines Multiple instructions issued per cycle Out-of-order completion Maintaining precise exceptions is difficult!


Imprecise Exceptions Just stop pipeline and save state

Including exception cause(s) Let the handler work out

Which instruction(s) had exceptions Which to complete or flush

May require “manual” completion Simplifies hardware, but more complex

handler software Feasible for complex multiple-issue

out-of-order pipelines




Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel

To increase ILP Deeper pipeline

Less work per stage shorter clock cycle Multiple issue

Replicate pipeline stages multiple pipelines Start multiple instructions per clock cycle CPI < 1, so use Instructions Per Cycle (IPC) E.g., 4GHz 4-way multiple-issue

16 BIPS, peak CPI = 0.25, peak IPC = 4 But dependencies reduce this in practice

§4.10 Parallelism

and Advanced Instruction Level P

arallelism


Multiple Issue Static multiple issue

Compiler groups instructions to be issued together

Packages them into “issue slots” Compiler detects and avoids hazards

Dynamic multiple issue CPU examines instruction stream and chooses

instructions to issue each cycle Compiler can help by reordering instructions CPU resolves hazards using advanced

techniques at runtime


Speculation “ Guess” what to do with an instruction

Start operation as soon as possible Check whether guess was right

If so, complete the operation If not, roll-back and do the right thing

Common to static and dynamic multiple issue Examples

Speculate on branch outcome Roll back if path taken is different

Speculate on load Roll back if location is updated


Compiler/Hardware Speculation

Compiler can reorder instructions e.g., move load before branch Can include “fix-up” instructions to recover

from incorrect guess Hardware can look ahead for instructions to

execute Buffer results until it determines they are

actually needed Flush buffers on incorrect speculation


Speculation and Exceptions What if exception occurs on a speculatively

executed instruction? e.g., speculative load before null-pointer check

Static speculation Can add ISA support for deferring exceptions

Dynamic speculation Can buffer exceptions until instruction

completion (which may not occur)


Static Multiple Issue Compiler groups instructions into “issue

packets” Group of instructions that can be issued on a

single cycle Determined by pipeline resources required

Think of an issue packet as a very long instruction Specifies multiple concurrent operations Very Long Instruction Word (VLIW)


Scheduling Static Multiple Issue

Compiler must remove some/all hazards Reorder instructions into issue packets No dependencies with a packet Possibly some dependencies between packets

Varies between ISAs; compiler must know! Pad with nop if necessary


MIPS with Static Dual Issue Two-issue packets

One ALU/branch instruction One load/store instruction 64-bit aligned

ALU/branch, then load/store Pad an unused instruction with nop

Address

Instruction type

Pipeline Stages

n ALU/branch IF ID EX MEM WB

n + 4 Load/store IF ID EX MEM WB

n + 8 ALU/branch IF ID EX MEM WB


n + 16 ALU/branch IF ID EX MEM WB



MIPS with Static Dual Issue


Hazards in the Dual-Issue MIPS

More instructions executing in parallel EX data hazard

Forwarding avoided stalls with single-issue Now can’t use ALU result in load/store in same

packet add $t0, $s0, $s1load $s2, 0($t0)

Split into two packets, effectively a stall Load-use hazard

Still one cycle use latency, but now two instructions

More aggressive scheduling required


Scheduling Example Schedule this for dual-issue MIPS

Loop: lw $t0, 0($s1) # $t0=array element addu $t0, $t0, $s2 # add scalar in $s2 sw $t0, 4($s1) # store result addi $s1, $s1,–4 # decrement pointer bne $s1, $zero, Loop # branch $s1!=0

ALU/branch Load/store cycle

Loop: nop lw $t0, 0($s1) 1

addi $s1, $s1,–4 nop 2

addu $t0, $t0, $s2 nop 3

bne $s1, $zero, Loop

sw $t0, 4($s1) 4

IPC = 5/4 = 1.25 (c.f. peak IPC = 2)


Loop Unrolling Replicate loop body to expose more

parallelism Reduces loop-control overhead

Use different registers per replication Called “register renaming” Avoid loop-carried “anti-dependencies”

Store followed by a load of the same register Aka “name dependence”

Reuse of a register name


Loop Unrolling Example

IPC = 14/8 = 1.75 Closer to 2, but at cost of registers and code size

ALU/branch Load/store cycle

Loop: addi $s1, $s1,–16 lw $t0, 0($s1) 1

nop lw $t1, 4($s1) 2

addu $t0, $t0, $s2 lw $t2, 8($s1) 3

addu $t1, $t1, $s2 lw $t3, 12($s1) 4

addu $t2, $t2, $s2 sw $t0, 4($s1) 5

addu $t3, $t4, $s2 sw $t1, 8($s1) 6

nop sw $t2, 12($s1) 7

bne $s1, $zero, Loop

sw $t3, 16($s1) 8


Dynamic Multiple Issue “ Superscalar” processors CPU decides whether to issue 0, 1, 2, … each

cycle Avoiding structural and data hazards

Avoids the need for compiler scheduling Though it may still help Code semantics ensured by the CPU


Dynamic Pipeline Scheduling Allow the CPU to execute instructions out of

order to avoid stalls But commit result to registers in order

Examplelw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3slti $t5, $s4, 20

Can start sub while addu is waiting for lw


Dynamically Scheduled CPU

Results also sent to any waiting

reservation stations

Reorders buffer for register writes

Can supply operands for issued

instructions

Preserves dependencies

Hold pending operands


Register Renaming Reservation stations and reorder buffer

effectively provide register renaming On instruction issue to reservation station

If operand is available in register file or reorder buffer

Copied to reservation station No longer required in the register; can be

overwritten If operand is not yet available

It will be provided to the reservation station by a function unit

Register update may not be required


Speculation Predict branch and continue issuing

Don’t commit until branch outcome determined

Load speculation Avoid load and cache miss delay

Predict the effective address Predict loaded value Load before completing outstanding stores Bypass stored values to load unit

Don’t commit load until speculation cleared


Why Do Dynamic Scheduling? Why not just let the compiler schedule code? Not all stalls are predicable

e.g., cache misses Can’t always schedule around branches

Branch outcome is dynamically determined Different implementations of an ISA have

different latencies and hazards


Does Multiple Issue Work?

Yes, but not as much as we’d like Programs have real dependencies that limit

ILP Some dependencies are hard to eliminate

e.g., pointer aliasing Some parallelism is hard to expose

Limited window size during instruction issue

Memory delays and limited bandwidth Hard to keep pipelines full

Speculation can help if done well

The BIG Picture


Power Efficiency Complexity of dynamic scheduling and

speculations requires power Multiple simpler cores may be better

Microprocessor

Year Clock Rate

Pipeline

Stages

Issue width

Out-of-order/

Speculation

Cores Power

i486 1989 25MHz 5 1 No 1 5W

Pentium 1993 66MHz 5 2 No 1 10W

Pentium Pro 1997 200MHz 10 3 Yes 1 29W

P4 Willamette

2001 2000MHz

22 3 Yes 1 75W

P4 Prescott 2004 3600MHz

31 3 Yes 1 103W

Core 2006 2930MHz

14 4 Yes 2 75W

UltraSparc III 2003 1950MHz

14 4 No 1 90W

UltraSparc T1

2005 1200MHz

6 1 No 8 70W


Fallacies

Pipelining is easy (!) The basic idea is easy The devil is in the details

e.g., detecting data hazards Pipelining is independent of technology

So why haven’t we always done pipelining? More transistors make more advanced

techniques feasible Pipeline-related ISA design needs to take

account of technology trends e.g., predicted instructions

§4.13 Fallacies and P

itfalls


Pitfalls Poor ISA design can make pipelining harder

e.g., complex instruction sets (VAX, IA-32) Significant overhead to make pipelining work IA-32 micro-op approach

e.g., complex addressing modes Register update side effects, memory indirection

e.g., delayed branches Advanced pipelines have long delay slots


Concluding Remarks ISA influences design of datapath and control Datapath and control influence design of ISA Pipelining improves instruction throughput

using parallelism More instructions completed per second Latency for each instruction not reduced

Hazards: structural, data, control Multiple issue and dynamic scheduling (ILP)

Dependencies limit achievable parallelism Complexity leads to the power wall

§4.14 Concluding R

emarks

Documents

CS4100: 計算機結構 Pipelining