RHK.F95 1
Lecture 6
• Software pipelining
• Speculative Execution
• Introduction to memory hierarchy design
2
Administration
• Project Assignments
– 張弘義: Evaluate MMX instruction set for MPEG
– 徐紹文 Literature Survey: Power Aware Design for Reconfigurable Computing System
– 蔣宗哲 Evaluating Write-policy for Spec2000
– 黃志源 Flash memory
– 何丞世 &陳銘堂 Low power high level synthesis example by implementing DLX with FPGA
• Proposal presentation on 11/1
• Midterm on 11/8
3
Review: Summary
• Tomasulo – out-of-order execution
• Branch Prediction
– Branch History Table: 2 bits for loop accuracy
– Correlation: recently executed branches are correlated with the next branch
– Branch Target Buffer: includes branch address & prediction
• SuperScalar and VLIW
– CPI < 1
– Dynamic issue vs. static issue
– The more instructions issued at the same time, the larger the penalty of hazards
4
Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)
[Figure: iterations 0–4 overlap in time; one software-pipelined iteration takes one instruction from each of several original iterations]
5
SW Pipelining Example
Before: Unrolled 3 times
  1  LD   F0,0(R1)
  2  ADDD F4,F0,F2
  3  SD   0(R1),F4
  4  LD   F6,-8(R1)
  5  ADDD F8,F6,F2
  6  SD   -8(R1),F8
  7  LD   F10,-16(R1)
  8  ADDD F12,F10,F2
  9  SD   -16(R1),F12
  10 SUBI R1,R1,#24
  11 BNEZ R1,LOOP

After: Software Pipelined
     LD   F0,0(R1)     ; prologue
     ADDD F4,F0,F2
     LD   F0,-8(R1)
  1  SD   0(R1),F4     ; stores M[i]
  2  ADDD F4,F0,F2     ; adds to M[i-1]
  3  LD   F0,-16(R1)   ; loads M[i-2]
  4  SUBI R1,R1,#8
  5  BNEZ R1,LOOP
     SD   0(R1),F4     ; epilogue
     ADDD F4,F0,F2
     SD   -8(R1),F4

[Pipeline diagram: SD reads F4, ADDD writes F4, LD writes F0 in successive pipeline stages, so the reordered instructions do not conflict]
6
SW Pipelining Example
Symbolic Loop Unrolling
– Less code space
– Overhead paid only once, vs. each iteration in loop unrolling
[Figure: software pipelining pays prologue/epilogue overhead once, while loop unrolling pays loop overhead per unrolled body; e.g., 100 iterations = 25 loops with 4 unrolled iterations each]
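The stage overlap above can be sketched in a few lines of Python (an illustration, not from the slides): each software-pipelined iteration k issues the SD that finishes original iteration k, the ADDD that is mid-way through iteration k+1, and the LD that starts iteration k+2.

```python
# Which original loop iteration each instruction of one
# software-pipelined body works on (3-stage body: LD -> ADDD -> SD).
BODY = ["SD", "ADDD", "LD"]  # order within one pipelined iteration

def pipelined_body(k):
    """Return (op, source_iteration) pairs for pipelined iteration k:
    SD finishes iteration k, ADDD continues k+1, LD starts k+2."""
    return [(op, k + j) for j, op in enumerate(BODY)]

# pipelined iteration 0 mixes work from original iterations 0, 1, 2
print(pipelined_body(0))
```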
7
Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
• Two steps:
– Trace Selection
» Find a likely sequence of basic blocks (trace) forming a (statically predicted) long sequence of straight-line code
– Trace Compaction
» Squeeze the trace into a few VLIW instructions
» Need bookkeeping code in case the prediction is wrong
8
[Figure: a trace through A[i] = A[i] + B[i], the test A[i]=0?, and the assignments B[i]= and C[i]=; the off-trace (F) path leaves the selected trace]
9
Fix up instructions in case we are wrong
10
HW support for More ILP
• Speculation: allow an instruction to issue that is dependent on a branch predicted taken, without any consequences (including exceptions) if the branch is not actually taken (“HW undo”)
• Often combined with dynamic scheduling
• Tomasulo: separate speculative bypassing of results from real bypassing of results
– When an instruction is no longer speculative, write results (instruction commit)
– Execute out-of-order but commit in order
11
Hardware Speculation
[Figure: five in-flight instructions; instructions 1 and 2 have completed execution before the branch resolves, instructions 4 and 5 are after it]
• Instructions 1 & 2 are allowed to change the machine state
• Instructions 4 & 5 should not change the machine state
12
Hardware Speculation
• Tomasulo without speculation
– Issue
– Execution
– Write Result
» write results to the CDB & update the register file or memory
• Tomasulo with speculation
– Issue
– Execution
– Write Result
» write results to the CDB & store results in a HW buffer (Reorder Buffer)
– Commit
» update register file or memory
13
HW support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer
  MULTD F0,F2,F4
  DIVD  F10,F0,F6

[Figure 4.34, page 311: the FP Op Queue issues to the reservation stations of the FP adders and multipliers; results flow through the Reorder Buffer before updating the FP Regs]

Reservation Stations
        Busy  Op    Vj    Vk    Qj   Qk   Dest
Mult1   Yes   mult  [F2]  [F4]            #1
Mult2   Yes   div         [F6]  #1        #2

Reorder Buffer
     Busy  Instruction      State         Destination  Value
#1   yes   multd f0,f2,f4   Write Result  F0           x
#2   yes   divd f10,f0,f6   Execute       F10
14
Four Steps of Speculative Tomasulo Algorithm
1. Issue — get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer no. for the destination.
2. Execution — operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute.
3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available.
4. Commit — update register with reorder result
   When the instruction is at the head of the reorder buffer & its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
15
Commit Stage
[Figure: reorder buffer contents as instructions commit in order —
1 2 3 4 5 → instruction 1 commits → 2 3 4 5 6 → instruction 2 commits →
3 4 5 6 7 → wait for instruction 3 to complete → instruction 3 commits →
4 5 6 7 8 → instruction 4 commits → 5 6 7 8 9]
16
Speculative Execution
• How to handle a mispredicted branch?
• How to handle a precise exception?
– Handle the exception when the instruction reaches the head of the reorder buffer
[Figure: instruction 3 (reorder buffer entries 3 4 5 6 7) is a mispredicted branch → flush the reorder buffer → fetch instruction 11 from the right path]
17
How to Measure Available ILP?
Initial HW Model here; MIPS compilers
1. Register renaming – infinite virtual registers, so all WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
   => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided the addresses are not equal
1-cycle latency for all instructions
18
Upper Limit to ILP
[Chart: instruction issues per cycle under the perfect model —
gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1]
19
More Realistic HW: Branch Impact
Change from an infinite instruction window to a 2000-entry window and a maximum issue of 64 instructions per clock cycle
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv under different branch predictors — Perfect, Selective predictor (pick correlating or BHT), Standard 2-bit BHT (512 entries), Static (profile-based), None]
20
More Realistic HW: Register Impact
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers]
21
More Realistic HW: Alias Impact
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv under different alias analyses — Perfect, Global/stack perfect (heap conflicts), Inspection, None]
22
Realistic HW: Window Impact
Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many instructions as the window allows
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, 4]
23
• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe)vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)
[Chart: SPEC performance of the two machines across espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]
24
Summary of Exploiting Instruction Level Parallelism
• Why is exploiting ILP important?
• Techniques to exploit ILP
– Instruction scheduling
» Static method: loop unrolling, software pipelining
» Dynamic method: scoreboard and Tomasulo
– Branch prediction
» How to handle a mispredicted branch?
• Achieving CPI < 1
– Superscalar processor
– VLIW
25
Importance of Exploiting ILP
• CPU utilization is low because of hazards
– Structural hazard
– Data hazard
– Control hazard
• The CPU stalls if hazards cannot be removed
• Exploit ILP to reduce the number of hazards
[Chart: CPI breakdown (Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv]
26
Static instruction scheduling: loop unrolling
Unrolled 4 times, scheduled:
  1  Loop: LD   F0,0(R1)
  2        LD   F6,-8(R1)
  3        LD   F10,-16(R1)
  4        LD   F14,-24(R1)
  5        ADDD F4,F0,F2
  6        ADDD F8,F6,F2
  7        ADDD F12,F10,F2
  8        ADDD F16,F14,F2
  9        SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SD   -16(R1),F12
  12       SUBI R1,R1,#32
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16   ; 8-32 = -24

Original loop, scheduled:
  1  Loop: LD   F0,0(R1)
  2        stall
  3        ADDD F4,F0,F2
  4        SUBI R1,R1,8
  5        BNEZ R1,Loop     ; delayed branch
  6        SD   8(R1),F4
27
Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)
[Figure: iterations 0–4 overlap in time; one software-pipelined iteration takes one instruction from each of several original iterations]
28
HW instruction scheduling: out-of-order execution
• Key idea: allow instructions behind a stall to proceed

  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F8,F8,F14

– In-order execution
» SUBD is not issued until the dependence is cleared
– Dynamic scheduling enables out-of-order execution => out-of-order completion
29
Tomasulo Organization
[Figure: the FP Op Queue and Load Buffers (Load1–Load6) feed the reservation stations (Add1–Add3 for the FP adders, Mult1–Mult2 for the FP multipliers); the Common Data Bus (CDB) broadcasts results to the FP Registers, the Store Buffers, and waiting reservation stations; loads come from memory and stores go to memory]
30
Tomasulo
• Prevents registers from becoming the bottleneck
• Avoids the WAR and WAW hazards of the scoreboard
• Allows loop unrolling in HW
• Not limited to basic blocks
– branch prediction
• Lasting contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
31
Branch Prediction
• Static approach
– Predict taken, predict not taken, branch delay slot, profile-based prediction
• Dynamic approach
– Branch History Table
» 1 bit vs. 2 bits
– Correlating branch prediction
– Branch target buffer
32
Dynamic Branch Prediction
• Solution: 2-bit scheme where the prediction changes only after two successive mispredictions (Figure 4.13, p. 264)
[Figure: 2-bit state machine — Predict Taken (strong) ↔ Predict Taken (weak) ↔ Predict Not Taken (weak) ↔ Predict Not Taken (strong); a taken outcome (T) moves toward Predict Taken, a not-taken outcome (NT) moves toward Predict Not Taken]
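The 2-bit scheme is just a saturating counter per branch. A minimal sketch (class name and starting state are illustrative): states 0–3, predict taken when the counter is 2 or more, so a strongly-taken branch tolerates one misprediction before the prediction flips.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):      # 3 = strongly taken
        self.state = state

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor(state=3)
p.update(False)   # first misprediction: still predicts taken
p.update(False)   # second misprediction: now predicts not taken
```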
33
Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch's own history)
– The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
• (2,2) predictor: 2-bit global history, 2-bit local counters
[Figure: the branch address (4 bits) indexes a table of 2-bit per-branch local predictors; the 2-bit global branch history (01 = not taken then taken) selects which of the four predictions to use]
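The (2,2) predictor above can be sketched as four 2-bit counters per table entry, selected by a 2-bit global history. Table size and the PC hash are illustrative assumptions, not from the slide:

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2-bit global history selects one of four
    2-bit saturating counters in each branch-table entry."""

    def __init__(self, entries=16):               # table size: illustrative
        self.history = 0                          # 2-bit global history
        self.table = [[2] * 4 for _ in range(entries)]  # start weakly taken

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % len(self.table)]
        if taken:
            ctrs[self.history] = min(3, ctrs[self.history] + 1)
        else:
            ctrs[self.history] = max(0, ctrs[self.history] - 1)
        # shift the outcome into the global history (keep low 2 bits)
        self.history = ((self.history << 1) | int(taken)) & 0b11

# An alternating T/NT branch is learned because each history pattern
# gets its own counter, which a single 2-bit counter cannot do.
p = CorrelatingPredictor()
for taken in [True, False, True, False, True, False]:
    p.update(4, taken)
```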
34
Need Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch match now, since we can't use a wrong branch's address (Figure 3.19, p. 262)
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored branch PC; extra prediction state bits are kept with each entry — Yes: the instruction is a branch, use the predicted PC as the next PC; No: branch not predicted, proceed normally (next PC = PC+4)]
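A BTB lookup can be sketched as a map from branch PC to predicted target; using the full PC as the key is the "tag match" simplified away (a real BTB is a finite tagged table, and this class and its 4-byte fall-through are illustrative assumptions):

```python
class BTB:
    """Toy branch target buffer: branch PC -> predicted target PC."""

    def __init__(self):
        self.entries = {}

    def lookup(self, pc):
        """Predicted next PC at fetch time; on a miss, fall through."""
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        # Insert taken branches; drop an entry once the branch falls through.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BTB()
btb.update(0x40, taken=True, target=0x100)
# fetching 0x40 now predicts 0x100; any other PC predicts PC+4
```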
35
Hardware Speculation
• Tomasulo without speculation
– Issue
– Execution
– Write Result
» write results to the CDB & update the register file or memory
– Out-of-order completion
» causes problems for mispredicted branches & maintaining precise exceptions
• Tomasulo with speculation
– Issue
– Execution
– Write Result
» write results to the CDB & store results in a HW buffer (Reorder Buffer)
– Commit
» update register file or memory (in-order commit)
36
Getting CPI < 1: Multiple Instructions/Cycle
• Two variations:
– Superscalar: varying no. of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
» IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100
– Very Long Instruction Word (VLIW): fixed number of instructions (e.g., 16) scheduled by the compiler
» Joint HP/Intel agreement in 1998?
37
Loop Unrolling in SuperScalar
      Integer instruction   FP instruction     Clock cycle
Loop: LD   F0,0(R1)                            1
      LD   F6,-8(R1)                           2
      LD   F10,-16(R1)      ADDD F4,F0,F2      3
      LD   F14,-24(R1)      ADDD F8,F6,F2      4
      LD   F18,-32(R1)      ADDD F12,F10,F2    5
      SD   0(R1),F4         ADDD F16,F14,F2    6
      SD   -8(R1),F8        ADDD F20,F18,F2    7
      SD   -16(R1),F12                         8
      SD   -24(R1),F16                         9
      SUBI R1,R1,#40                           10
      BNEZ R1,LOOP                             11
      SD   -32(R1),F20                         12

Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration
38
Loop Unrolling in VLIW
Memory          Memory          FP               FP               Int. op/        Clock
reference 1     reference 2     operation 1      operation 2      branch
LD F0,0(R1)     LD F6,-8(R1)                                                      1
LD F10,-16(R1)  LD F14,-24(R1)                                                    2
LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2    ADDD F8,F6,F2                    3
LD F26,-48(R1)                  ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                ADDD F20,F18,F2  ADDD F24,F22,F2                  5
SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                   6
SD -16(R1),F12  SD -24(R1),F16                                                    7
SD -32(R1),F20  SD -40(R1),F24                                    SUBI R1,R1,#48  8
SD -0(R1),F28                                                     BNEZ R1,LOOP    9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW
39
Limits to Multi-Issue Machines
• Inherent limitations of ILP
– 1 branch in 5: how to keep a 5-way VLIW busy?
– Latencies of units: many operations must be scheduled
– Need about (pipeline depth × no. of functional units) independent operations to keep the machine busy
• Difficulties in building HW
– Duplicate FUs to get parallel execution
– Increase ports to the register file
– Increase ports to memory
– Decoding SS and its impact on clock rate, pipeline depth
40
Limits to Multi-Issue Machines
• Limitations specific to either the SS or VLIW implementation
– Decode issue in SS
– VLIW code size: unrolled loops + wasted fields in VLIW
– VLIW lock step => 1 hazard & all instructions stall
– VLIW & binary compatibility is a practical weakness
41
Chapter 5: Memory Hierarchy
42
Recap: Who Cares About the Memory Hierarchy?
[Chart: Processor-DRAM memory gap (latency), 1980–2000, log scale —
CPU performance grows ~60%/yr (2X/1.5 yr), DRAM ~9%/yr (2X/10 yrs);
the processor-memory performance gap grows ~50% per year]
43
Levels of the Memory Hierarchy
Level      Capacity, Access Time, Cost                      Staging / Xfer Unit
Registers  100s bytes, <1s ns                               instr. operands — prog./compiler, 1-8 bytes
Cache      10s-100s KBytes, 1-10 ns, $10/MByte              blocks — cache cntl, 8-128 bytes
Memory     MBytes, 100-300 ns, $1/MByte                     pages — OS, 512-4K bytes
Disk       10s GBytes, 10 ms (10,000,000 ns), $0.0031/MByte files — user/operator, MBytes
Tape       infinite, sec-min, $0.0014/MByte

(Upper levels are faster; lower levels are larger)
44
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW (hardware) has relied on locality for speed
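Spatial locality is why access pattern matters even for the same number of references. A small sketch (the 8-word block size is an illustrative assumption): sequential accesses reuse each fetched block, while a block-sized stride touches a new block on nearly every access.

```python
BLOCK = 8   # words per cache block (illustrative)

def blocks_touched(n, stride):
    """Number of distinct cache blocks referenced by n accesses
    to word addresses 0, stride, 2*stride, ..."""
    return len({(i * stride) // BLOCK for i in range(n)})

# 64 sequential accesses hit only 8 blocks, but the same 64 accesses
# at stride 8 touch 64 different blocks: far worse spatial locality.
print(blocks_touched(64, 1), blocks_touched(64, 8))
```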
45
Memory Hierarchy: Terminology
• Hit: the data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)
[Figure: blocks Blk X (upper-level memory) and Blk Y (lower-level memory) move between levels, to/from the processor]
46
Cache Measures
• Hit rate: fraction found in that level
– So high that we usually talk about the miss rate
– Miss rate fallacy: miss rate is to average memory access time what MIPS is to CPU performance (an incomplete measure)
• Average memory-access time
  = Hit time + Miss rate × Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
– access time: time to the lower level = f(latency to lower level)
– transfer time: time to transfer the block = f(BW between upper & lower levels)
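The average memory-access time formula above as a one-line check (the numbers in the example are illustrative, not from the slide):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g., a 1-cycle hit, 5% miss rate, and 100-cycle penalty average
# out to 6 cycles per access: misses dominate despite being rare.
print(amat(1, 0.05, 100))
```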
47
Simplest Cache: Direct Mapped
[Figure: a 16-location memory (addresses 0–F) mapping into a 4-byte direct-mapped cache (cache indexes 0–3)]
• Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?
48
1 KB Direct Mapped Cache, 32B blocks
• For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: the 32-bit address splits into Cache Tag <31:10> (example: 0x50), Cache Index <9:5> (ex: 0x01), and Byte Select <4:0> (ex: 0x00); the tag is stored as part of the cache “state” along with a Valid Bit; the data array holds Bytes 0–31 per block, Bytes 0–1023 in all]
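The address split on this slide can be checked with a short sketch; the constants match the slide's 1 KB cache with 32-byte blocks (the helper function itself is illustrative):

```python
BLOCK_BITS = 5   # 32-byte blocks  -> byte select = address<4:0>
INDEX_BITS = 5   # 1 KB / 32 B = 32 blocks -> index = address<9:5>

def split_address(addr):
    """Return (tag, index, byte_select) for a 32-bit address."""
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, byte_select

# The slide's example (tag 0x50, index 0x01, byte select 0x00)
# corresponds to address (0x50 << 10) | (0x01 << 5) = 0x14020.
print(split_address(0x14020))
```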
49
Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct-mapped caches operate in parallel (N typically 2 to 4)
• Example: two-way set associative cache
– The Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result
[Figure: the Cache Index selects one set from two banks of (Valid, Cache Tag, Cache Data) arrays; the address tag (Adr Tag) is compared against both stored tags, the compare results are ORed into Hit, and Sel1/Sel0 drive a mux that picks the Cache Block]
50
Disadvantage of Set Associative Cache
• N-way set associative cache vs. direct-mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct-mapped cache, the Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later on a miss.
[Figure: same two-way structure as the previous slide — parallel tag compares ORed into Hit, with a mux selecting the Cache Block]
51
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level?(Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
52
Q1: Where can a block be placed in the upper level?
• Block 12 placed in an 8-block cache:
– Fully associative, direct mapped, 2-way set associative
– S.A. mapping = block number modulo number of sets
[Figure: memory blocks 0–31; in the 8-block cache, fully associative = anywhere, direct mapped => (12 mod 8) = 4, 2-way set assoc => set (12 mod 4) = 0]
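The modulo mapping above can be computed directly; this helper (an illustration, not from the slide) returns the cache block frames a memory block may occupy:

```python
def placement(block, num_blocks, assoc):
    """Cache frames that `block` may occupy in a cache of
    `num_blocks` frames with associativity `assoc`."""
    num_sets = num_blocks // assoc
    s = block % num_sets                     # set = block mod number of sets
    return list(range(s * assoc, (s + 1) * assoc))

# Block 12 in an 8-block cache:
print(placement(12, 8, 1))   # direct mapped: frame 4 only
print(placement(12, 8, 2))   # 2-way: set 0, frames 0-1
print(placement(12, 8, 8))   # fully associative: any frame
```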
53
Q2: How is a block found if it is in the upper level?
• Tag on each block
– No need to check the index or block offset
• Increasing associativity shrinks the index, expands the tag
[Figure: Block Address = Tag | Index, followed by the Block Offset]
54
Q3: Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
– Random
– LRU (Least Recently Used)

Miss rates:
Assoc:    2-way           4-way           8-way
Size      LRU     Ran     LRU     Ran     LRU     Ran
16 KB     5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64 KB     1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256 KB    1.15%   1.17%   1.13%   1.13%   1.12%   1.12%
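LRU within one set, the policy compared against Random in the table above, can be sketched with an ordered map (class name and interface are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an N-way cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None, least recent first

    def access(self, tag):
        """Return True on hit; on a miss, evict the LRU block if full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return False

s = LRUSet(2)
s.access("A"); s.access("B"); s.access("A")
s.access("C")   # set is full: evicts B, the least recently used, not A
```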
55
Q4: What happens on a write?
• Write through — the information is written to both the block in the cache and the block in lower-level memory.
• Write back — the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
– Is the block clean or dirty?
• Pros and cons of each?
– WT: read misses cannot result in writes
– WB: no repeated writes to the same location
• WT is always combined with write buffers so that the processor doesn't wait for lower-level memory
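The dirty bit is what makes write back work: a write only marks the cached block modified, and memory sees one write at eviction time no matter how many writes hit the block. A minimal sketch (class and memory model are illustrative):

```python
class WriteBackBlock:
    """One cache block under a write-back policy."""

    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

    def write(self, data):
        self.data = data
        self.dirty = True          # defer the memory write

    def evict(self, memory):
        if self.dirty:             # write back only if modified
            memory[self.tag] = self.data
        self.dirty = False

memory = {}
blk = WriteBackBlock(tag=0x50, data=0)
blk.write(42)
blk.write(99)        # repeated writes touch only the cache (WB advantage)
blk.evict(memory)    # a single memory write happens on replacement
```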
56
Write Buffer for Write Through
• A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
– Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– Write buffer saturation
[Figure: Processor <-> Cache, with the Write Buffer between them and DRAM]
57
A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: Processor (Control, Datapath, Registers, On-Chip Cache) -> Second-Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Disk/Tape);
Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec);
Size (bytes): 100s, Ks, Ms, Gs, Ts]