GAS STATION Pipelining & Hazards IILecture 4 EECS 470 Slide 6 © Wenisch 2016 -- Portions ©...

Lecture 4 Slide 1 EECS 470

EECS 470

Lecture 4

Pipelining & Hazards II Winter 2021

Jon Beaumont

http://www.eecs.umich.edu/courses/eecs470

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

Lecture 4 Slide 2 EECS 470

Class Question

Which of the following best explains why pipelining results in speedup?

a) Instructions are executed with shorter latency

b) Clock period is reduced

c) More instructions are executed at the same time

d) Magnets

Lecture 4 Slide 3 EECS 470

Announcements

• Reminder
  Lab #1 due tomorrow by 12:30p (get checked off by GSI/IA)
  Verilog assignment #1 due tomorrow (submit to autograder by 11:59p)
  HW #1 due Thursday 2/4 (submit through Gradescope by 11:59p)

• I have OH today from 3-4
  OH format for all staff: join the Zoom link and put yourself on the Office Hour Queue
  You will be let into a breakout room when you are at the head of the queue

Lecture 4 Slide 4 EECS 470

Last Time

• Baseline processor discussion Review 5-stage pipeline from EECS 370

Lecture 4 Slide 5 EECS 470

Today

• Hazards
  Detection
  Resolution
    Software (avoidance)
    Hardware (stalling, forwarding)

Lecture 4 Slide 6 EECS 470

Lingering Questions

• "How recent was the pipeline method developed? What will be the next best method?" Basic pipelines have been used since the very early days of

computing (1930s)
Deep pipelines became very popular with vector processors in the 1970s
Less popular now; we'll discuss why
Recent trends have been not towards better performance, but better reliability and power efficiency
EECS 573 (Microarchitectures) covers a lot of these interesting topics

• Remember, you can submit lingering questions to cover next lecture at: https://bit.ly/3oSr5FD

Lecture 4 Slide 7 EECS 470

Balancing Pipeline Stages

Stages: IF, ID, EX, MEM, WB

T_IF = 6 units
T_ID = 2 units
T_EX = 9 units
T_MEM = 5 units
T_WB = 8 units

Can we do better in terms of either performance or efficiency?
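As a rough check on the question above, the arithmetic can be sketched in Python. The function names are illustrative, and the sketch assumes an ideal pipeline: no hazards and no pipeline-register overhead.

```python
# Stage latencies from the slide, in arbitrary time units.
STAGE_LATENCY = {"IF": 6, "ID": 2, "EX": 9, "MEM": 5, "WB": 8}

def clock_period(latencies):
    """The clock can be no faster than the slowest stage."""
    return max(latencies.values())

def unpipelined_latency(latencies):
    """One instruction with no pipelining: the sum of all stage latencies."""
    return sum(latencies.values())

def pipelined_latency(latencies):
    """One instruction through the pipeline: number of stages times the clock period."""
    return len(latencies) * clock_period(latencies)

def ideal_speedup(latencies):
    """Steady-state throughput gain over the unpipelined design (assumes no stalls)."""
    return unpipelined_latency(latencies) / clock_period(latencies)

print(clock_period(STAGE_LATENCY))             # 9
print(unpipelined_latency(STAGE_LATENCY))      # 30
print(pipelined_latency(STAGE_LATENCY))        # 45
print(round(ideal_speedup(STAGE_LATENCY), 2))  # 3.33
```

With five unbalanced stages the ideal speedup is only about 3.33x instead of 5x, and each instruction's own latency grows from 30 to 45 units, because every stage waits for the 9-unit EX stage.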

Lecture 4 Slide 8 EECS 470

Balancing Pipeline Stages

Two Methods for Stage Quantization:
• Merging of multiple stages
• Further subdividing a stage

Recent Trends:
• Deeper pipelines (more and more stages)
  Pipeline depth growing more slowly since Pentium 4. Why?
• Multiple pipelines
• Pipelined memory/cache accesses (tricky)
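The two quantization methods can be tried on the stage latencies from the previous slide (6, 2, 9, 5, 8 units). This is a minimal sketch, not a design procedure: `merge_adjacent` and `subdivide` are hypothetical helpers, the target periods are chosen only for illustration, and the sketch assumes a stage can always be split into equal pieces, which real logic rarely allows.

```python
import math

STAGES = [("IF", 6), ("ID", 2), ("EX", 9), ("MEM", 5), ("WB", 8)]

def merge_adjacent(stages, period):
    """Greedily merge neighbouring stages while the combined latency fits in `period`."""
    merged = []
    for name, latency in stages:
        if merged and merged[-1][1] + latency <= period:
            prev_name, prev_latency = merged.pop()
            merged.append((prev_name + "+" + name, prev_latency + latency))
        else:
            merged.append((name, latency))
    return merged

def subdivide(stages, period):
    """Split every stage slower than `period` into equal pieces (assumes even splits exist)."""
    result = []
    for name, latency in stages:
        pieces = math.ceil(latency / period)
        for i in range(pieces):
            suffix = str(i + 1) if pieces > 1 else ""
            result.append((name + suffix, latency / pieces))
    return result

# Merging: same 9-unit clock, one fewer stage (fewer pipeline registers).
print(merge_adjacent(STAGES, 9))   # [('IF+ID', 8), ('EX', 9), ('MEM', 5), ('WB', 8)]

# Subdividing: a 5-unit clock, at the cost of a deeper pipeline.
deep = subdivide(STAGES, 5)
print(len(deep))                   # 8
```

Merging trades nothing in throughput for less hardware, while subdividing buys a faster clock but lengthens the pipeline, which raises the cost of every hazard.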

Lecture 4 Slide 9 EECS 470

The Cost of Deeper Pipelines

Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies

Suppose:
add 1 2 3
nand 3 4 5

Ideal pipeline:
         t0  t1  t2  t3  t4  t5
Inst0    F   D   E   M   W
Inst1        F   D   E   M   W

With the dependency on R3:
         t0  t1  t2     t3     t4  t5
add      F   D   E      M      W
nand         F   Stall  Stall  D   E

RAW!! (read-after-write dependency)
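The RAW check on the add/nand pair can be sketched as follows. The `Insn` tuple and both helpers are hypothetical names for this illustration, and the decoder handles only the `op regA regB dest` form used by add and nand.

```python
from collections import namedtuple

# Hypothetical decoded form: the registers an instruction reads and the one it writes.
Insn = namedtuple("Insn", ["text", "reads", "writes"])

def decode(text):
    """Decode 'op regA regB dest' for the two ALU ops used on the slide."""
    op, reg_a, reg_b, dest = text.split()
    if op not in ("add", "nand"):
        raise ValueError("this sketch only handles add and nand")
    return Insn(text, {int(reg_a), int(reg_b)}, int(dest))

def raw_dependences(program):
    """Return (producer index, consumer index, register) for each RAW dependence.

    Only the most recent earlier writer of a register counts as its producer.
    """
    last_writer = {}
    found = []
    for index, insn in enumerate(program):
        for reg in sorted(insn.reads):
            if reg in last_writer:
                found.append((last_writer[reg], index, reg))
        last_writer[insn.writes] = index
    return found

program = [decode("add 1 2 3"), decode("nand 3 4 5")]
print(raw_dependences(program))   # [(0, 1, 3)]
```

The single result says instruction 1 (nand) reads register 3 after instruction 0 (add) writes it, which is the dependency the pipeline must respect.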

Lecture 4 Slide 10 EECS 470

Terminology

Pipeline Hazards:
  Potential violations of program dependences
  Must ensure program dependences are not violated

Hazard Resolution:
  Static Method: Performed at compile time in software
  Dynamic Method: Performed at run time using hardware

Pipeline Interlock:
  Hardware mechanisms for dynamic hazard resolution
  Must detect and enforce dependences at run time

Lecture 4 Slide 11 EECS 470

Handling Data Hazards

Avoidance (static)
  Make sure there are no hazards in the code

Detect and Stall (dynamic)
  Stall until earlier instructions finish

Detect and Forward (dynamic)
  Get correct value from elsewhere in pipeline

Lecture 4 Slide 12 EECS 470

Handling Data Hazards: Avoidance

Programmer/compiler must know implementation details
Insert noops between dependent instructions

add 1 2 3      ; write R3 in cycle 5
noop
noop
nand 3 4 5     ; read R3 in cycle 6
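A compiler-side version of this idea might look like the sketch below. It is an illustration only: `insert_noops` is a hypothetical helper, instructions are `(op, regA, regB, dest)` tuples for add/nand, and the `gap` of 2 is an assumption taken from the two noops in the example above.

```python
NOOP = ("noop",)

def insert_noops(program, gap=2):
    """Pad `program` so every reader sits at least `gap` slots after its producer.

    `gap` is an implementation detail of one specific pipeline; the default of 2
    mirrors the two noops in the slide's example.
    """
    out = []
    last_write_pos = {}   # register -> index in `out` of its most recent writer
    for insn in program:
        _, reg_a, reg_b, dest = insn
        needed = 0
        for reg in (reg_a, reg_b):
            if reg in last_write_pos:
                between = len(out) - last_write_pos[reg] - 1
                needed = max(needed, gap - between)
        out.extend([NOOP] * needed)
        last_write_pos[dest] = len(out)
        out.append(insn)
    return out

padded = insert_noops([("add", 1, 2, 3), ("nand", 3, 4, 5)])
print(padded)
# [('add', 1, 2, 3), ('noop',), ('noop',), ('nand', 3, 4, 5)]
```

Because `gap` is baked into the emitted code, a binary padded for one pipeline is not guaranteed to be correct on another that needs a larger gap.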

Lecture 4 Slide 13 EECS 470

Problems with Avoidance

Binary compatibility
  New implementations may require more noops

Code size
  Higher instruction cache footprint
  Longer binary load times
  Worse in machines that execute multiple instructions / cycle
    Intel Itanium: 25-40% of instructions are noops

Slower execution
  CPI=1, but many instructions are noops

Lecture 4 Slide 14 EECS 470

Handling Data Hazards: Detect & Stall

Detection
  Compare regA & regB with DestReg of preceding insn.
    3 bit comparators

Stall
  Do not advance pipeline register for Fetch/Decode
  Pass noop to Execute

Which of the "Avoidance" issues does "Detect & Stall" fix? (select all)

a) Binary compatibility

b) Code size

c) Slower execution
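The detection and stall logic described on this slide can be sketched in Python. The function names and the control-signal dictionary are hypothetical, and the sketch assumes the instruction in Decode is compared against the destinations held in ID/EX and EX/Mem, and that it really uses both regA and regB as sources (not true of every opcode).

```python
def regs_equal(x, y):
    """3-bit equality comparator: every bit pair must match."""
    return all(((x >> bit) & 1) == ((y >> bit) & 1) for bit in range(3))

def hazard_detected(reg_a, reg_b, id_ex_dest, ex_mem_dest):
    """True if the instruction in Decode reads a register that an earlier,
    still in-flight instruction will write.

    A dest of None means that stage holds a noop or writes no register.
    """
    for dest in (id_ex_dest, ex_mem_dest):
        if dest is None:
            continue
        if regs_equal(reg_a, dest) or regs_equal(reg_b, dest):
            return True
    return False

def stall_controls(hazard):
    """On a hazard: freeze the PC and IF/ID, and send a noop to Execute."""
    return {
        "pc_enable": not hazard,
        "if_id_enable": not hazard,
        "insert_noop": hazard,
    }

# nand 3 4 5 in Decode while add (dest 3) is in Execute.
hazard = hazard_detected(3, 4, id_ex_dest=3, ex_mem_dest=None)
print(hazard)                  # True
print(stall_controls(hazard))
```

Since the hardware inserts the bubbles itself, the program binary carries no noops and does not depend on the pipeline's depth.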

[Slide 15, figure: the five-stage pipelined datapath (Fetch, Decode, Execute, Memory, WB) with PC, instruction memory, register file (R0-R7), ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers. Instruction fields are labeled bits 22-24, bits 16-18, and bits 0-2; the pipeline registers carry op, dest, offset, valA, valB, PC+1, target, ALU result, and mdata.]

[Slide 16, figure: the same datapath without the bit-field labels.]

[Slide 17, figure: end of cycle 1. add 1 2 3 is in IF/ID.]

[Slide 18, figure: end of cycle 2. add is in ID/EX with dest 3; nand 3 4 5 is in IF/ID.]

[Slide 19, figure: "Hazard detection", first half of cycle 3. nand is in Decode and reads R3 while add, which writes R3, is in Execute.]

[Slide 20, figure: hazard detection logic. regA and regB from IF/ID are compared against the dest fields of later pipeline registers; a match on register 3 raises "Hazard detected".]

[Slide 21, figure: one 3-bit comparator. 011 compared with 011 differs in no bit (000), so the comparator outputs 1.]

[Slide 22, figure: "Hazard", first half of cycle 3. Two enable (en) signals are marked, holding nand in Decode.]

[Slide 23, figure: end of cycle 3. A noop has been passed to Execute; add is in EX/Mem with ALU result 21; nand is still in IF/ID.]

[Slide 24, figure: "Hazard", first half of cycle 4. nand is still held in Decode; the en signals are marked again.]

[Slide 25, figure: end of cycle 4. A second noop has been passed to Execute; add is in Mem/WB with result 21.]

[Slide 26, figure: "No Hazard", first half of cycle 5. add is in WB, so nand can read R3.]

[Slide 27, figure: end of cycle 5. nand is in ID/EX, ahead of it are the two noops, and add has completed.]

Lecture 4 Slide 28 EECS 470

Problems with Detect & Stall

CPI increases on every hazard

Are these stalls necessary? Not always!
  The new value for R3 is in the EX/Mem register
  Reroute the result to the nand
    Called “forwarding” or “bypassing”
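The cost of detect and stall can be put into numbers with a small sketch. It assumes the five-stage pipeline from the walk-through, where the consumer stalls in Decode until the producer reaches WB (a hazard in cycles 3 and 4, none in cycle 5), and it treats forwarding as removing all stalls for ALU results only. `stall_cycles` and `cpi` are hypothetical helpers.

```python
def stall_cycles(distance, forwarding=False):
    """Stall cycles for a consumer `distance` instructions after an ALU producer.

    distance=1 means the consumer immediately follows the producer.
    Assumes the register file can be written and then read in the same cycle.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if forwarding:
        return 0               # ALU result is rerouted from a pipeline register
    return max(0, 3 - distance)

def cpi(num_insns, total_stalls):
    """Cycles per instruction, ignoring pipeline fill and drain."""
    return (num_insns + total_stalls) / num_insns

print(stall_cycles(1))                    # 2  (add followed directly by nand)
print(stall_cycles(1, forwarding=True))   # 0
print(cpi(2, stall_cycles(1)))            # 2.0
```

For the two-instruction add/nand example the stalls double the CPI, which is why rerouting the value out of EX/Mem is worth the extra hardware.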

Lecture 4 Slide 29 EECS 470

Handling Data Hazards: Detect & Forward

Detection
  Same as detect and stall, but each possible hazard requires different forwarding paths

Forward
  Add data paths for all possible sources
  Add mux in front of ALU to select source

“Bypassing logic” is often a critical path in wide-issue (i.e., superscalar) machines
  Number of paths grows quadratically with machine width
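The mux select in front of the ALU can be sketched as below. This is a simplified illustration with only two forwarding sources (EX/Mem and Mem/WB); the actual datapath may have more paths, and the function names and tuple layout are assumptions.

```python
def forward_select(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where one ALU operand comes from.

    A dest of None means that pipeline register holds no register write.
    The newest in-flight value (EX/Mem) takes priority over the older one.
    """
    if ex_mem_dest is not None and src_reg == ex_mem_dest:
        return "EX/Mem"
    if mem_wb_dest is not None and src_reg == mem_wb_dest:
        return "Mem/WB"
    return "regfile"

def alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Mux in front of the ALU; ex_mem and mem_wb are (dest, value) tuples."""
    choice = forward_select(src_reg, ex_mem[0], mem_wb[0])
    if choice == "EX/Mem":
        return ex_mem[1]
    if choice == "Mem/WB":
        return mem_wb[1]
    return regfile_value

# Illustrative values: the register file still holds a stale R3 (10),
# while the add result (21) sits in EX/Mem.
print(alu_operand(3, 10, ex_mem=(3, 21), mem_wb=(None, 0)))   # 21
```

Each source operand needs its own copy of this mux, and every extra producer stage or issue slot adds inputs to it, which is where the growth in paths comes from.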

Lecture 4 Slide 30 EECS 470

Sample Code Reminder

Run the following code on a pipelined datapath:

nand 3 4 5     ; reg 5 = reg 3 ~& reg 4
add 6 3 7      ; reg 7 = reg 6 + reg 3
lw 3 6 10      ; reg 6 = Mem[reg 3 + 10]
sw 6 2 12      ; Mem[reg 6 + 12] = reg 2

Poll: How many data dependencies are here? How many stalls will we see?

[Slide 31, figure: "Hazard", first half of cycle 3, on the datapath with forwarding (fwd) paths added. nand 3 4 5 is in Decode and add, which writes R3, is in Execute.]

[Slide 32, figure: end of cycle 3. nand has moved into ID/EX without stalling and the hazard is recorded as H1; add is in EX/Mem with result 21; add 6 3 7 is in IF/ID.]

[Slide 33, figure: "New Hazard", first half of cycle 4. A mux in front of the ALU selects the forwarded value 21 for nand; add 6 3 7 in Decode also reads R3.]

[Slide 34, figure: end of cycle 4. nand is in EX/Mem with result -2; add 6 3 7 is in ID/EX with hazard H2 recorded; lw 3 6 10 is in IF/ID.]

[Slide 35, figure: first half of cycle 5. "No Hazard" is marked for the instruction in Decode; the forwarded 21 is selected for add 6 3 7.]

[Slide 36, figure: end of cycle 5. lw is in ID/EX; add 6 3 7 is in EX/Mem with result 22; nand is in Mem/WB; sw 6 2 12 is in IF/ID.]

[Slide 37, figure: "Hazard", first half of cycle 6. sw in Decode reads R6, which lw (in Execute) has not yet loaded; the en signals are marked, so sw stalls.]

[Slide 38, figure: end of cycle 6. A noop has been passed to Execute; lw is in EX/Mem with address 31; sw is still in IF/ID.]

[Slide 39, figure: "Hazard", first half of cycle 7. sw in Decode still depends on R6; lw is in Memory.]

[Slide 40, figure: end of cycle 7. sw is in ID/EX with hazard H3 recorded; the noop is in EX/Mem; lw is in Mem/WB with loaded value 99.]

[Slide 41, figure: first half of cycle 8. The loaded value 99 is forwarded to sw, which adds its offset 12.]

[Slide 42, figure: end of cycle 8. sw is in EX/Mem with address 111.]

Lecture 4 Slide 43 EECS 470

Control Hazards

beq 1 1 10
sub 3 4 5

         t0  t1  t2  t3  t4  t5
beq      F   D   E   M   W
sub          F   D   E   M   W    (squash)

Lecture 4 Slide 44 EECS 470

Handling Control Hazards

Avoidance (static)
  No branches?
  Convert branches to predication
    Control dependence becomes data dependence

Detect and Stall (dynamic)
  Stop fetch until branch resolves

Speculate and squash (dynamic)
  Keep going past branch, throw away instructions if wrong

Lecture 4 Slide 45 EECS 470

Avoidance: if-conversion

if (a == b) {

x++;

y = n / d;

}

Branch version:
sub t1 a, b
jnz t1, PC+2
add x x, #1
div y n, d

Predicated version:
sub t1 a, b
add(t1) x x, #1
div(t1) y n, d

Conditional-move version:
sub t1 a, b
add t2 x, #1
div t3 n, d
cmov(t1) x t2
cmov(t1) y t3

If you're interested:

https://en.wikipedia.org/wiki/Predication_(computer_architecture)
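The conditional-move form of the if-conversion above can be mimicked in Python to see that it computes the same result as the branch. This is only a behavioral sketch: the function names are made up, `//` stands in for the slide's `div`, and the conditional expressions play the role of `cmov`.

```python
def branchy(a, b, x, y, n, d):
    """The original code: the body runs only when a == b."""
    if a == b:
        x += 1
        y = n // d
    return x, y

def if_converted(a, b, x, y, n, d):
    """Branch-free version: compute both results, then conditionally commit them."""
    t1 = a - b                  # t1 == 0 exactly when a == b
    t2 = x + 1                  # executes unconditionally
    t3 = n // d                 # executes unconditionally
    x = t2 if t1 == 0 else x    # cmov
    y = t3 if t1 == 0 else y    # cmov
    return x, y

print(branchy(2, 2, 0, 0, 7, 2), if_converted(2, 2, 0, 0, 7, 2))   # (1, 3) (1, 3)
print(branchy(1, 2, 0, 0, 7, 2), if_converted(1, 2, 0, 0, 7, 2))   # (0, 0) (0, 0)
```

One caveat the sketch makes visible: the division now runs even when `a != b`, so with `d == 0` the if-converted version raises an error that the branch version would have skipped. Hardware that executes instructions unconditionally has to make sure they cannot fault.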

Lecture 4 Slide 46 EECS 470

Handling Control Hazards: Detect & Stall

Detection
  In decode, check if opcode is branch or jump

Stall
  Hold next instruction in Fetch
  Pass noop to Decode

Lecture 4 Slide 47 EECS 470

Problems with Detect & Stall

CPI increases on every branch

Are these stalls necessary? Not always!
  Branch is only taken half the time

Assume branch is NOT taken
  Keep fetching, treat branch as noop
  If wrong, make sure bad instructions don’t complete
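The benefit of assuming not-taken can be estimated with a back-of-the-envelope CPI model. The numbers are assumptions for illustration only: a hypothetical 20% branch frequency, a 3-cycle penalty for both stalling and a wrong guess, and the 50% taken rate mentioned above.

```python
def cpi_stall(branch_fraction, penalty):
    """Every branch pays the full penalty while fetch waits for it to resolve."""
    return 1 + branch_fraction * penalty

def cpi_assume_not_taken(branch_fraction, taken_fraction, penalty):
    """Only taken branches (wrong guesses) pay the penalty."""
    return 1 + branch_fraction * taken_fraction * penalty

BRANCH_FRACTION = 0.2   # hypothetical: 1 in 5 instructions is a branch
TAKEN_FRACTION = 0.5    # from the slide: taken half the time
PENALTY = 3             # assumed cycles lost per stalled or mispredicted branch

print(round(cpi_stall(BRANCH_FRACTION, PENALTY), 2))                             # 1.6
print(round(cpi_assume_not_taken(BRANCH_FRACTION, TAKEN_FRACTION, PENALTY), 2))  # 1.3
```

Under these assumptions, guessing not-taken halves the branch overhead, and a better guess would shrink it further.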

Lecture 4 Slide 48 EECS 470

Handling Control Hazards: Speculate & Squash

Speculate
  Assume branch is not taken

Squash
  Overwrite opcodes in Fetch, Decode, Execute with noop
  Pass target to Fetch
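The squash step can be sketched as a pure function over a toy pipeline state. The dictionary layout and the function name are hypothetical; the sketch assumes the branch has just been resolved as taken and that the three younger instructions sit in Fetch, Decode, and Execute.

```python
def squash_on_taken_branch(pipeline, target):
    """Return the pipeline state after a taken branch is resolved.

    `pipeline` maps 'pc', 'fetch', 'decode', and 'execute' to their contents.
    The wrong-path instructions become noops and fetch restarts at `target`.
    """
    flushed = dict(pipeline)
    for stage in ("fetch", "decode", "execute"):
        flushed[stage] = "noop"
    flushed["pc"] = target
    return flushed

before = {"pc": 4, "fetch": "nand", "decode": "add", "execute": "sub"}
print(squash_on_taken_branch(before, target=12))
# {'pc': 12, 'fetch': 'noop', 'decode': 'noop', 'execute': 'noop'}
```

The squashed instructions have not yet written registers or memory, so turning them into noops is enough to undo the wrong guess.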

[Slide 49, figure: pipelined datapath with sign-extend, control, and branch-equality logic. A taken beq is followed in the pipeline by sub, add, and nand; those three instructions are overwritten with noops.]

Lecture 4 Slide 50 EECS 470

Problems with Speculate & Squash

Always assumes branch is not taken

Can we do better? Yes.
  Predict branch direction and target!
  Why possible? Program behavior repeats.

More on branch prediction to come...

Lecture 4 Slide 51 EECS 470

Next Time

• Going one step beyond pipelining: dynamic scheduling (a.k.a. out-of-order processing)
  Introduce a specific algorithm: scoreboard scheduling

• Lingering questions / feedback?
  I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD
