
ACA Chapter2


Data Hazard on R1

[Pipeline diagram: the instructions flow through the IF, ID/RF, EX, MEM, WB stages over time (clock cycles). add r1,r2,r3 writes r1 in WB, while the following instructions sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11 read r1 in earlier cycles, creating hazards on r1.]

Three Generic Data Hazards

• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

• Caused by a "dependence". This hazard results from an actual need for communication.

Three Generic Data Hazards

• Write After Read (WAR): InstrJ writes an operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".

• Can't happen in the MIPS 5-stage pipeline because:
 –  All instructions take 5 stages, and
 –  Reads are always in stage 2, and
 –  Writes are always in stage 5

Three Generic Data Hazards

• Write After Write (WAW): InstrJ writes an operand before InstrI writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".

• Can't happen in the MIPS 5-stage pipeline because:
 –  All instructions take 5 stages, and
 –  Writes are always in stage 5

• Will see WAR and WAW in more complicated pipes
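The three hazard types come down to comparing the destination and source registers of an instruction pair. The sketch below is a minimal illustration, not from the course material; the instruction model (one destination, two sources) and the function names are assumptions.

    #include <stdio.h>

    /* Minimal instruction model: one destination and two source registers. */
    typedef struct {
        int dest;
        int src1, src2;
    } Instr;

    /* Classify the hazard that instruction J (later) creates with respect to
     * instruction I (earlier) when the two are overlapped in a pipeline.
     * If several relations hold, the first match in this order is reported. */
    const char *classify_hazard(Instr i, Instr j) {
        if (j.src1 == i.dest || j.src2 == i.dest)
            return "RAW (true dependence: J reads what I writes)";
        if (j.dest == i.src1 || j.dest == i.src2)
            return "WAR (anti-dependence: J writes what I reads)";
        if (j.dest == i.dest)
            return "WAW (output dependence: J writes what I writes)";
        return "no data hazard";
    }

    int main(void) {
        Instr add = { 1, 2, 3 };   /* add r1,r2,r3 */
        Instr sub = { 4, 1, 3 };   /* sub r4,r1,r3 */
        printf("%s\n", classify_hazard(add, sub));   /* prints the RAW case */
        return 0;
    }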


Control Hazard on Branches: Three-Stage Stall

[Pipeline diagram: 10: beq r1,r3,36 is followed by 14: and r2,r3,r5, 18: or r6,r1,r7, 22: add r8,r1,r9, and the branch target 36: xor r10,r1,r11. The three instructions after the branch enter the pipeline before the branch outcome is known.]

What do you do with the 3 instructions in between?

How do you do it?

Where is the "commit"?

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
 –  Execute successor instructions in sequence
 –  "Squash" instructions in pipeline if branch actually taken
 –  Advantage of late pipeline state update
 –  47% of MIPS branches not taken on average
 –  PC+4 already calculated, so use it to get the next instruction

#3: Predict Branch Taken
 –  53% of MIPS branches taken on average
 –  But haven't calculated branch target address in MIPS
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch
 –  Define branch to take place AFTER a following instruction

        branch instruction
        sequential successor1
        sequential successor2
        ........
        sequential successorn
        branch target if taken

    (Branch delay of length n)

 –  1 slot delay allows proper decision and branch target address in 5-stage pipeline
 –  MIPS uses this

Scheduling Branch Delay Slots

A. From before branch:

        add $1,$2,$3
        if $2=0 then
          delay slot

    becomes

        if $2=0 then
          add $1,$2,$3

B. From branch target:

        sub $4,$5,$6
        ...
        add $1,$2,$3
        if $1=0 then
          delay slot

    becomes

        add $1,$2,$3
        if $1=0 then
          sub $4,$5,$6

C. From fall through:

        add $1,$2,$3
        if $1=0 then
          delay slot
        sub $4,$5,$6

    becomes

        add $1,$2,$3
        if $1=0 then
          sub $4,$5,$6

• A is the best choice: it fills the delay slot and reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, it must be okay to execute sub when the branch fails


Problems with Pipelining

• Exception: An unusual event happens to an instruction during its execution
 –  Examples: divide by zero, undefined opcode

• Interrupt: Hardware signal to switch the processor to a new instruction stream
 –  Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)

• Problem: the exception or interrupt must appear to occur between 2 instructions (Ii and Ii+1)
 –  The effect of all instructions up to and including Ii is totally complete
 –  No effect of any instruction after Ii can take place

• The interrupt (exception) handler either aborts the program or restarts at instruction Ii+1

Instruction Level Parallelism

• Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance

• 2 approaches to exploit ILP:
 1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
 2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2)

Instruction-Level Parallelism (ILP)

• Basic Block (BB) ILP is quite small
 –  BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
 –  average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches
 –  Plus, instructions in a BB are likely to depend on each other

• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks

• Simplest: loop-level parallelism to exploit parallelism among iterations of a loop. E.g.,

    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];

Loop-Level Parallelism

• Exploit loop-level parallelism by "unrolling the loop", either
 1. dynamically, via branch prediction, or
 2. statically, via loop unrolling by the compiler
 (Another way is vectors, to be covered later)

• Determining instruction dependence is critical to loop-level parallelism

• If 2 instructions are
 –  parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
 –  dependent, they are not parallel and must be executed in order, although they may often be partially overlapped
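As a minimal illustration (my sketch, not from the slides; the function and array names are arbitrary), the two loops below contrast independent iterations with a loop-carried dependence:

    #include <stddef.h>

    #define N 1000

    /* Every iteration reads and writes only its own elements, so the
     * iterations are independent and can be overlapped or run in parallel. */
    void parallel_loop(double x[N + 1], const double y[N + 1]) {
        for (size_t i = 1; i <= N; i++)
            x[i] = x[i] + y[i];
    }

    /* Each iteration reads the value produced by the previous one (x[i-1]),
     * a loop-carried dependence: iterations must execute in order, although
     * they may still be partially overlapped in a pipeline. */
    void dependent_loop(double x[N + 1], const double y[N + 1]) {
        for (size_t i = 1; i <= N; i++)
            x[i] = x[i - 1] + y[i];
    }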


Data Dependence and Hazards

• InstrJ is data dependent (aka truly dependent) on InstrI:
 1. InstrJ tries to read an operand before InstrI writes it

        I: add r1,r2,r3
        J: sub r4,r1,r3

 2. or InstrJ is data dependent on InstrK, which is dependent on InstrI

• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped

• Data dependence in the instruction sequence => data dependence in the source code => the effect of the original data dependence must be preserved

• If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

ILP and Data Dependencies, Hazards

• HW/SW must preserve program order: the order instructions would execute in if executed sequentially, as determined by the original source program
 –  Dependences are a property of programs

• Presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline

• Importance of the data dependencies:
 1) indicates the possibility of a hazard
 2) determines the order in which results must be calculated
 3) sets an upper bound on how much parallelism can possibly be exploited

• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

Name Dependence #1: Anti-dependence

• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence

• InstrJ writes an operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"

• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence

• InstrJ writes an operand before InstrI writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1"

• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard

• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict
 –  Register renaming resolves name dependence for registers
 –  Either by the compiler or by HW
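A minimal sketch (my illustration, not from the slides) of how renaming removes the name dependences above: each architectural destination is mapped to a fresh physical register, while sources are read through the current map. The function name and register counts are assumptions.

    #include <stdio.h>

    #define NUM_ARCH_REGS 32

    /* Current mapping from architectural register -> physical register. */
    static int map[NUM_ARCH_REGS];
    static int next_phys = NUM_ARCH_REGS;   /* next free physical register */

    /* Rename one instruction: sources read the current mapping, the
     * destination gets a brand-new physical register. This removes WAR and
     * WAW name dependences while preserving true (RAW) dependences. */
    static void rename_instr(const char *op, int dest, int src1, int src2) {
        int p1 = map[src1], p2 = map[src2];
        map[dest] = next_phys++;
        printf("%s p%d, p%d, p%d\n", op, map[dest], p1, p2);
    }

    int main(void) {
        for (int r = 0; r < NUM_ARCH_REGS; r++) map[r] = r;   /* identity map */

        rename_instr("sub", 4, 1, 3);   /* I: sub r4,r1,r3                   */
        rename_instr("add", 1, 2, 3);   /* J: add r1,r2,r3 -> new physical r1 */
        rename_instr("mul", 6, 1, 7);   /* K: mul r6,r1,r7 reads J's new r1   */
        return 0;
    }

After renaming, I reads the old physical register for r1 while J writes a different one, so the WAR conflict on r1 disappears; K still reads J's result, preserving the true dependence.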


Control Dependencies

• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

    if p1 {
        S1;
    };
    if p2 {
        S2;
    }

• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Control Dependence Ignored

• Control dependence need not be preserved
 –  willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program

• Instead, the 2 properties critical to program correctness are
 1) exception behavior, and
 2) data flow

FP Loop Showing Stalls

    1 Loop: L.D    F0,0(R1)   ;F0 = vector element
    2       stall
    3       ADD.D  F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       S.D    0(R1),F4   ;store result
    7       DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
    8       stall             ;assumes can't forward to branch
    9       BNEZ   R1,Loop    ;branch if R1 != zero

• 9 clock cycles: rewrite the code to minimize stalls?

    Instruction          Instruction          Latency in
    producing result     using result         clock cycles
    FP ALU op            Another FP ALU op    3
    FP ALU op            Store double         2
    Load double          FP ALU op            1
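Reading the listing above: 5 instructions (L.D, ADD.D, S.D, DADDUI, BNEZ) plus 4 stall cycles (one after L.D, two after ADD.D, one after DADDUI) account for the 9 clock cycles per iteration.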

Revised FP Loop Minimizing Stalls

    1 Loop: L.D    F0,0(R1)
    2       DADDUI R1,R1,-8
    3       ADD.D  F4,F0,F2
    4       stall
    5       stall
    6       S.D    8(R1),F4   ;altered offset since DADDUI was moved up
    7       BNEZ   R1,Loop

Swap DADDUI and S.D by changing the address of S.D.

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; how to make it faster?

    Instruction          Instruction          Latency in
    producing result     using result         clock cycles
    FP ALU op            Another FP ALU op    3
    FP ALU op            Store double         2
    Load double          FP ALU op            1


Unroll Loop Four Times (straightforward way)

Rewrite the loop to minimize stalls?

    1  Loop: L.D    F0,0(R1)
    3        ADD.D  F4,F0,F2
    6        S.D    0(R1),F4      ;drop DADDUI & BNEZ
    7        L.D    F6,-8(R1)
    9        ADD.D  F8,F6,F2
    12       S.D    -8(R1),F8     ;drop DADDUI & BNEZ
    13       L.D    F10,-16(R1)
    15       ADD.D  F12,F10,F2
    18       S.D    -16(R1),F12   ;drop DADDUI & BNEZ
    19       L.D    F14,-24(R1)
    21       ADD.D  F16,F14,F2
    24       S.D    -24(R1),F16
    25       DADDUI R1,R1,#-32    ;alter to 4*8
    26       BNEZ   R1,LOOP

(The skipped cycle numbers are stalls: 1 cycle stall after each L.D, 2 cycles stall after each ADD.D.)

27 clock cycles, or 6.75 per iteration

(Assumes R1 is a multiple of 4)
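In C terms, the unrolled loop corresponds to the sketch below (my illustration, not from the slides), assuming the element count n is a multiple of 4, matching the slide's assumption about R1:

    /* x[i] = x[i] + s for n elements, unrolled by 4; assumes n % 4 == 0. */
    void add_scalar_unrolled4(double *x, long n, double s) {
        for (long i = 0; i < n; i += 4) {
            x[i]     = x[i]     + s;   /* corresponds to L.D/ADD.D/S.D #1 */
            x[i + 1] = x[i + 1] + s;   /* #2 */
            x[i + 2] = x[i + 2] + s;   /* #3 */
            x[i + 3] = x[i + 3] + s;   /* #4: one loop overhead (i += 4, branch)
                                          is paid per 4 elements */
        }
    }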

Unrolled Loop Detail

• Do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
 –  1st executes (n mod k) times and has a body that is the original loop
 –  2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
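A minimal sketch of this split (my illustration; function name assumed), using the same scalar-add pattern: a remainder loop runs n mod k times, then the unrolled loop covers the rest.

    /* x[i] = x[i] + s for n elements, n not necessarily a multiple of k = 4. */
    void add_scalar_stripmined(double *x, long n, double s) {
        const long k = 4;
        long i = 0;

        /* 1st loop: executes (n mod k) times; its body is the original loop. */
        for (; i < n % k; i++)
            x[i] = x[i] + s;

        /* 2nd loop: unrolled body, iterates (n / k) times. */
        for (; i < n; i += k) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }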

Unrolled Loop That Minimizes Stalls

    1  Loop: L.D    F0,0(R1)
    2        L.D    F6,-8(R1)
    3        L.D    F10,-16(R1)
    4        L.D    F14,-24(R1)
    5        ADD.D  F4,F0,F2
    6        ADD.D  F8,F6,F2
    7        ADD.D  F12,F10,F2
    8        ADD.D  F16,F14,F2
    9        S.D    0(R1),F4
    10       S.D    -8(R1),F8
    11       S.D    -16(R1),F12
    12       DADDUI R1,R1,#-32
    13       S.D    8(R1),F16    ; 8-32 = -24
    14       BNEZ   R1,LOOP

14 clock cycles, or 3.5 per iteration

5 Loop Unrolling Decisions

• Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:

 1. Determine that unrolling the loop is useful by finding that the loop iterations were independent (except for maintenance code)
 2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
 3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
 4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
    • Transformation requires analyzing memory addresses and finding that they do not refer to the same address
 5. Schedule the code, preserving any dependences needed to yield the same result as the original code


3 Limits to Loop Unrolling

1. Decrease in the amount of overhead amortized with each extra unrolling
   • Amdahl's Law

2. Growth in code size
   • For larger loops, the concern is that it increases the instruction cache miss rate

3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
   • If it is not possible to allocate all live values to registers, unrolling may lose some or all of its advantage

• Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction

Static Branch Prediction

• Lecture 3 showed scheduling code around a delayed branch

• To reorder code around branches, need to predict branch behavior statically

• Simplest scheme is to predict a branch as taken
 –  Average misprediction = untaken branch frequency = 34% for SPEC

• A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run

[Bar chart: misprediction rates of profile-based static prediction for SPEC integer benchmarks (compress, eqntott, espresso, gcc, li) and floating-point benchmarks (doduc, ear, hydro2d, mdljdp, su2cor), roughly in the 4% to 22% range and generally lower for the floating-point programs.]

Dynamic Branch Prediction

• Why does prediction work?
 –  Underlying algorithm has regularities
 –  Data that is being operated on has regularities
 –  Instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems

• Is dynamic branch prediction better than static branch prediction?
 –  Seems to be
 –  There are a small number of important branches in programs which have dynamic behavior

Dynamic Branch Prediction

• Performance = f(accuracy, cost of misprediction)

• Branch History Table: lower bits of the PC address index a table of 1-bit values
 –  Says whether or not the branch was taken last time
 –  No address check

• Problem: in a loop, a 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
 –  End-of-loop case, when it exits instead of looping as before
 –  First time through the loop on the next time through the code, when it predicts exit instead of looping


Dynamic Branch Prediction

• Solution: 2-bit scheme where the prediction is changed only if it mispredicts twice

[State diagram: two "Predict Taken" states and two "Predict Not Taken" states; a Taken outcome moves toward the Predict Taken side, a Not Taken outcome moves toward the Predict Not Taken side, and only two consecutive mispredictions flip the prediction.]

• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
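A minimal sketch (my illustration, not from the slides) of a branch history table of 2-bit saturating counters indexed by the low bits of the branch PC; the table size and function names are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024               /* table size is an assumption */

    /* One 2-bit saturating counter per entry:
     * 0 = strongly not taken, 1 = weakly not taken,
     * 2 = weakly taken,       3 = strongly taken.     */
    static uint8_t bht[BHT_ENTRIES];       /* starts at 0: predict not taken */

    static unsigned index_of(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);   /* low bits of a word-aligned PC */
    }

    /* Predict taken when the counter is in one of the two "taken" states. */
    bool predict(uint32_t pc) {
        return bht[index_of(pc)] >= 2;
    }

    /* Update with the actual outcome; the counter saturates at 0 and 3, so a
     * single misprediction does not flip a "strong" prediction (hysteresis). */
    void update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[index_of(pc)];
        if (taken) {
            if (*c < 3) (*c)++;
        } else {
            if (*c > 0) (*c)--;
        }
    }

For the loop branch discussed above, a 1-bit table mispredicts both the exit and the first iteration of the next execution; a 2-bit counter stays on the "taken" side across the single exit, so only the exit itself is mispredicted.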

A Dynamic Algorithm: Tomasulo's

• For the IBM 360/91 (before caches!)
 –  long memory latency

• Goal: high performance without special compilers

• The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
 –  This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!

• The descendants of this have flourished!
 –  Alpha 21264, Pentium 4, AMD Opteron, Power 5, …

Tomasulo Algorithm

• Control & buffers distributed with Function Units (FU)
 –  FU buffers called "reservation stations"; they hold pending operands

• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
 –  Renaming avoids WAR, WAW hazards
 –  More reservation stations than registers, so can do optimizations compilers can't

• Results go to FUs from the RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
 –  Avoids RAW hazards by executing an instruction only when its operands are available

• Loads and Stores treated as FUs with RSs as well

• Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue

Tomasulo Organization

[Block diagram: the FP Op Queue and Load Buffers (Load1-Load6) feed the Reservation Stations (Add1-Add3 in front of the FP adders, Mult1-Mult2 in front of the FP multipliers); the FP Registers and Store Buffers connect to memory; results travel over the Common Data Bus (CDB) to all units.]


Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Values of the source operands
 –  Store buffers have a V field, the result to be stored

Qj, Qk: Reservation stations producing the source registers (value to be written)
 –  Note: Qj,Qk = 0 => ready
 –  Store buffers only have Qi, for the RS producing the result

Busy: Indicates the reservation station or FU is busy

Register result status — Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
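These fields map naturally onto a small data structure. The sketch below is a minimal illustration, not the algorithm's definitive implementation; the type names, operation set, and table sizes are assumptions.

    #include <stdbool.h>

    #define NUM_RS   8     /* reservation stations (assumed count)        */
    #define NUM_REGS 16    /* architectural FP registers (assumed count)  */

    typedef enum { OP_NONE, OP_ADD, OP_SUB, OP_MUL, OP_DIV } Op;

    /* One reservation station entry, mirroring the fields on the slide. */
    typedef struct {
        bool   busy;       /* Busy: station (and its FU) is in use        */
        Op     op;         /* Op: operation to perform in the unit        */
        double vj, vk;     /* Vj, Vk: values of the source operands       */
        int    qj, qk;     /* Qj, Qk: RS numbers producing the sources;   */
                           /*         0 => the value is already in Vj/Vk  */
    } ReservationStation;

    /* Register result status: which RS (functional unit) will write each
     * register; 0 means no pending instruction will write that register. */
    typedef struct {
        double value[NUM_REGS];
        int    qi[NUM_REGS];
    } RegisterFile;

    static ReservationStation rs[NUM_RS + 1];  /* entry 0 unused: tag 0 = none */
    static RegisterFile regs;

With this convention, an operand is ready exactly when its Q field is 0, matching the slide's "Qj,Qk = 0 => ready".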

Three Stages of Tomasulo Algorithm

1. Issue — get an instruction from the FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers)

2. Execute — operate on the operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result

3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available

• Normal data bus: data + destination ("go to" bus)

• Common data bus: data + source ("come from" bus)
 –  64 bits of data + 4 bits of Functional Unit source address
 –  Write if it matches the expected Functional Unit (which produces the result)
 –  Does the broadcast

• Example speed: 3 clocks for FP +,–; 10 for *; 40 clocks for /
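The write-result stage is essentially a broadcast compare. Below is a minimal sketch (my illustration; the structure and names are assumptions, simplified from the reservation-station fields above) of what happens when a functional unit puts its source tag and data on the CDB.

    #include <stdbool.h>

    #define NUM_RS   8    /* reservation stations (assumed count)       */
    #define NUM_REGS 16   /* architectural FP registers (assumed count) */

    /* Minimal state needed to show the broadcast (tag 0 means "no producer"). */
    static struct { bool busy; int qj, qk; double vj, vk; } rs[NUM_RS + 1];
    static struct { int qi; double value; } reg[NUM_REGS];

    /* Write-result stage: the finishing unit puts (source tag, data) on the
     * Common Data Bus; every waiting reservation station and the register
     * result status compare the tag and capture the value if it matches. */
    void cdb_broadcast(int tag, double data) {
        for (int i = 1; i <= NUM_RS; i++) {
            if (!rs[i].busy) continue;
            if (rs[i].qj == tag) { rs[i].vj = data; rs[i].qj = 0; }
            if (rs[i].qk == tag) { rs[i].vk = data; rs[i].qk = 0; }
        }
        for (int r = 0; r < NUM_REGS; r++) {
            if (reg[r].qi == tag) { reg[r].value = data; reg[r].qi = 0; }
        }
        if (tag > 0 && tag <= NUM_RS)
            rs[tag].busy = false;   /* the producing station becomes available */
    }

Because consumers identify values by the producing station's tag rather than by register name, the broadcast delivers each result directly to whoever is waiting for it, which is how RAW hazards are resolved without reading the register file.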