RHK.F95 1
Lecture 6
• Software pipelining
• Speculative Execution
• Introduction to memory hierarchy design
2
Administration
• Project Assignments
– 張弘義: Evaluate MMX instruction set for MPEG
– 徐紹文 Literature Survey: Power Aware Design for Reconfigurable Computing System
– 蔣宗哲 Evaluating Write-policy for Spec2000
– 黃志源 Flash memory
– 何丞世 &陳銘堂 Low power high level synthesis example by implementing DLX with FPGA
• Proposal presentation on 11/1
• Midterm on 11/8
3
Review: Summary
• Tomasulo – out-of-order execution
• Branch Prediction
– Branch History Table: 2 bits for loop accuracy
– Correlation: recently executed branches are correlated with the next branch
– Branch Target Buffer: includes branch address & prediction
• SuperScalar and VLIW
– CPI < 1
– Dynamic issue vs. static issue
– The more instructions issued at the same time, the larger the penalty of hazards
4
Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)
[Figure: iterations 0–4 overlap in time; one software-pipelined iteration takes one instruction from each of several original iterations]
5
SW Pipelining Example
Before: Unrolled 3 times
  1  LD   F0,0(R1)
  2  ADDD F4,F0,F2
  3  SD   0(R1),F4
  4  LD   F6,-8(R1)
  5  ADDD F8,F6,F2
  6  SD   -8(R1),F8
  7  LD   F10,-16(R1)
  8  ADDD F12,F10,F2
  9  SD   -16(R1),F12
  10 SUBI R1,R1,#24
  11 BNEZ R1,LOOP

After: Software Pipelined
     LD   F0,0(R1)     ; prologue
     ADDD F4,F0,F2
     LD   F0,-8(R1)
  1  SD   0(R1),F4     ; stores M[i]
  2  ADDD F4,F0,F2     ; adds to M[i-1]
  3  LD   F0,-16(R1)   ; loads M[i-2]
  4  SUBI R1,R1,#8
  5  BNEZ R1,LOOP
     SD   0(R1),F4     ; epilogue
     ADDD F4,F0,F2
     SD   -8(R1),F4

[Pipeline diagram: SD reads F4, ADDD writes F4, LD writes F0 in successive pipeline stages, so the reordered instructions do not conflict]
6
SW Pipelining Example
Symbolic Loop Unrolling
– Less code space
– Overhead paid only once, vs. each iteration in loop unrolling
[Figure: software pipelining pays prologue/epilogue overhead once, while loop unrolling pays loop overhead per unrolled body; e.g., 100 iterations = 25 loops with 4 unrolled iterations each]
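The stage overlap above can be sketched in a few lines of Python (an illustration, not from the slides): each software-pipelined iteration k issues the SD that finishes original iteration k, the ADDD that is mid-way through iteration k+1, and the LD that starts iteration k+2.

```python
# Which original loop iteration each instruction of one
# software-pipelined body works on (3-stage body: LD -> ADDD -> SD).
BODY = ["SD", "ADDD", "LD"]  # order within one pipelined iteration

def pipelined_body(k):
    """Return (op, source_iteration) pairs for pipelined iteration k:
    SD finishes iteration k, ADDD continues k+1, LD starts k+2."""
    return [(op, k + j) for j, op in enumerate(BODY)]

# pipelined iteration 0 mixes work from original iterations 0, 1, 2
print(pipelined_body(0))
```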
7
Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
• Two steps:
– Trace Selection
» Find a likely sequence of basic blocks (trace) forming a (statically predicted) long sequence of straight-line code
– Trace Compaction
» Squeeze the trace into a few VLIW instructions
» Need bookkeeping code in case the prediction is wrong
8
[Figure: a trace through A[i] = A[i] + B[i], the test A[i]=0?, and the assignments B[i]= and C[i]=; the off-trace (F) path leaves the selected trace]
9
Fix up instructions in case we are wrong
10
HW support for More ILP
• Speculation: allow an instruction to issue that is dependent on a branch predicted taken, without any consequences (including exceptions) if the branch is not actually taken (“HW undo”)
• Often combined with dynamic scheduling
• Tomasulo: separate speculative bypassing of results from real bypassing of results
– When an instruction is no longer speculative, write results (instruction commit)
– Execute out-of-order but commit in order
11
Hardware Speculation
[Figure: five in-flight instructions; instructions 1 and 2 have completed execution before the branch resolves, instructions 4 and 5 are after it]
• Instructions 1 & 2 are allowed to change the machine state
• Instructions 4 & 5 should not change the machine state
12
Hardware Speculation
• Tomasulo without speculation
– Issue
– Execution
– Write Result
» write results to the CDB & update the register file or memory
• Tomasulo with speculation
– Issue
– Execution
– Write Result
» write results to the CDB & store results in a HW buffer (Reorder Buffer)
– Commit
» update register file or memory
13
HW support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer
  MULTD F0,F2,F4
  DIVD  F10,F0,F6

[Figure 4.34, page 311: the FP Op Queue issues to the reservation stations of the FP adders and multipliers; results flow through the Reorder Buffer before updating the FP Regs]

Reservation Stations
        Busy  Op    Vj    Vk    Qj   Qk   Dest
Mult1   Yes   mult  [F2]  [F4]            #1
Mult2   Yes   div         [F6]  #1        #2

Reorder Buffer
     Busy  Instruction      State         Destination  Value
#1   yes   multd f0,f2,f4   Write Result  F0           x
#2   yes   divd f10,f0,f6   Execute       F10
14
Four Steps of Speculative Tomasulo Algorithm
1. Issue — get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer no. for the destination.
2. Execution — operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute.
3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available.
4. Commit — update register with reorder result
   When the instruction is at the head of the reorder buffer & its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
15
Commit Stage
[Figure: reorder buffer contents as instructions commit in order —
1 2 3 4 5 → instruction 1 commits → 2 3 4 5 6 → instruction 2 commits →
3 4 5 6 7 → wait for instruction 3 to complete → instruction 3 commits →
4 5 6 7 8 → instruction 4 commits → 5 6 7 8 9]
16
Speculative Execution
• How to handle a mispredicted branch?
• How to handle a precise exception?
– Handle the exception when the instruction reaches the head of the reorder buffer
[Figure: instruction 3 (reorder buffer entries 3 4 5 6 7) is a mispredicted branch → flush the reorder buffer → fetch instruction 11 from the right path]
17
How to Measure Available ILP?
Initial HW Model here; MIPS compilers
1. Register renaming – infinite virtual registers, so all WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
   => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided the addresses are not equal
1-cycle latency for all instructions
18
Upper Limit to ILP
[Chart: instruction issues per cycle under the perfect model —
gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1]
19
More Realistic HW: Branch Impact
Change from an infinite instruction window to a 2000-entry window and a maximum issue of 64 instructions per clock cycle
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv under different branch predictors — Perfect, Selective predictor (pick correlating or BHT), Standard 2-bit BHT (512 entries), Static (profile-based), None]
20
More Realistic HW: Register Impact
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers]
21
More Realistic HW: Alias Impact
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv under different alias analyses — Perfect, Global/stack perfect (heap conflicts), Inspection, None]
22
Realistic HW: Window Impact
Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many instructions as the window allows
[Chart: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, 4]
23
• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe)vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)
[Chart: SPEC performance of the two machines across espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]
24
Summary of Exploiting Instruction Level Parallelism
• Why is exploiting ILP important?
• Techniques to exploit ILP
– Instruction scheduling
» Static method: loop unrolling, software pipelining
» Dynamic method: scoreboard and Tomasulo
– Branch prediction
» How to handle a mispredicted branch?
• Achieving CPI < 1
– Superscalar processor
– VLIW
25
Importance of Exploiting ILP
• CPU utilization is low because of hazards
– Structural hazard
– Data hazard
– Control hazard
• The CPU stalls if hazards cannot be removed
• Exploit ILP to reduce the number of hazards
[Chart: CPI breakdown (Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv]
26
Static instruction scheduling: loop unrolling
Unrolled 4 times, scheduled:
  1  Loop: LD   F0,0(R1)
  2        LD   F6,-8(R1)
  3        LD   F10,-16(R1)
  4        LD   F14,-24(R1)
  5        ADDD F4,F0,F2
  6        ADDD F8,F6,F2
  7        ADDD F12,F10,F2
  8        ADDD F16,F14,F2
  9        SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SD   -16(R1),F12
  12       SUBI R1,R1,#32
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16   ; 8-32 = -24

Original loop, scheduled:
  1  Loop: LD   F0,0(R1)
  2        stall
  3        ADDD F4,F0,F2
  4        SUBI R1,R1,8
  5        BNEZ R1,Loop     ; delayed branch
  6        SD   8(R1),F4
27
Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)
[Figure: iterations 0–4 overlap in time; one software-pipelined iteration takes one instruction from each of several original iterations]
28
HW instruction scheduling: out-of-order execution
• Key idea: allow instructions behind a stall to proceed

  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F8,F8,F14

– In-order execution
» SUBD is not issued until the dependence is cleared
– Dynamic scheduling enables out-of-order execution => out-of-order completion
29
Tomasulo Organization
[Figure: the FP Op Queue and Load Buffers (Load1–Load6) feed the reservation stations (Add1–Add3 for the FP adders, Mult1–Mult2 for the FP multipliers); the Common Data Bus (CDB) broadcasts results to the FP Registers, the Store Buffers, and waiting reservation stations; loads come from memory and stores go to memory]
30
Tomasulo
• Prevents registers from becoming the bottleneck
• Avoids the WAR and WAW hazards of the scoreboard
• Allows loop unrolling in HW
• Not limited to basic blocks
– branch prediction
• Lasting contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
31
Branch Prediction
• Static approach
– Predict taken, predict not taken, branch delay slot, profile-based prediction
• Dynamic approach
– Branch History Table
» 1 bit vs. 2 bits
– Correlating branch prediction
– Branch target buffer
32
Dynamic Branch Prediction
• Solution: 2-bit scheme where the prediction changes only after two successive mispredictions (Figure 4.13, p. 264)
[Figure: 2-bit state machine — Predict Taken (strong) ↔ Predict Taken (weak) ↔ Predict Not Taken (weak) ↔ Predict Not Taken (strong); a taken outcome (T) moves toward Predict Taken, a not-taken outcome (NT) moves toward Predict Not Taken]
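The 2-bit scheme is just a saturating counter per branch. A minimal sketch (class name and starting state are illustrative): states 0–3, predict taken when the counter is 2 or more, so a strongly-taken branch tolerates one misprediction before the prediction flips.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):      # 3 = strongly taken
        self.state = state

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor(state=3)
p.update(False)   # first misprediction: still predicts taken
p.update(False)   # second misprediction: now predicts not taken
```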
33
Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as that branch's own history)
– The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
• (2,2) predictor: 2-bit global history, 2-bit local counters
[Figure: the branch address (4 bits) indexes a table of 2-bit per-branch local predictors; the 2-bit global branch history (01 = not taken then taken) selects which of the four predictions to use]
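The (2,2) predictor above can be sketched as four 2-bit counters per table entry, selected by a 2-bit global history. Table size and the PC hash are illustrative assumptions, not from the slide:

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2-bit global history selects one of four
    2-bit saturating counters in each branch-table entry."""

    def __init__(self, entries=16):               # table size: illustrative
        self.history = 0                          # 2-bit global history
        self.table = [[2] * 4 for _ in range(entries)]  # start weakly taken

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc % len(self.table)]
        if taken:
            ctrs[self.history] = min(3, ctrs[self.history] + 1)
        else:
            ctrs[self.history] = max(0, ctrs[self.history] - 1)
        # shift the outcome into the global history (keep low 2 bits)
        self.history = ((self.history << 1) | int(taken)) & 0b11

# An alternating T/NT branch is learned because each history pattern
# gets its own counter, which a single 2-bit counter cannot do.
p = CorrelatingPredictor()
for taken in [True, False, True, False, True, False]:
    p.update(4, taken)
```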
34
Need Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for a branch match now, since we can't use a wrong branch's address (Figure 3.19, p. 262)
[Figure: the PC of the instruction in FETCH is compared (=?) against the stored branch PC; extra prediction state bits are kept with each entry — Yes: the instruction is a branch, use the predicted PC as the next PC; No: branch not predicted, proceed normally (next PC = PC+4)]
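A BTB lookup can be sketched as a map from branch PC to predicted target; using the full PC as the key is the "tag match" simplified away (a real BTB is a finite tagged table, and this class and its 4-byte fall-through are illustrative assumptions):

```python
class BTB:
    """Toy branch target buffer: branch PC -> predicted target PC."""

    def __init__(self):
        self.entries = {}

    def lookup(self, pc):
        """Predicted next PC at fetch time; on a miss, fall through."""
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        # Insert taken branches; drop an entry once the branch falls through.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BTB()
btb.update(0x40, taken=True, target=0x100)
# fetching 0x40 now predicts 0x100; any other PC predicts PC+4
```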
35
Hardware Speculation
• Tomasulo without speculation
– Issue
– Execution
– Write Result
» write results to the CDB & update the register file or memory
– Out-of-order completion
» causes problems for mispredicted branches & maintaining precise exceptions
• Tomasulo with speculation
– Issue
– Execution
– Write Result
» write results to the CDB & store results in a HW buffer (Reorder Buffer)
– Commit
» update register file or memory (in-order commit)
36
Getting CPI < 1: Multiple Instructions/Cycle
• Two variations:
– Superscalar: varying no. of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
» IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100
– Very Long Instruction Word (VLIW): fixed number of instructions (e.g., 16) scheduled by the compiler
» Joint HP/Intel agreement in 1998?
37
Loop Unrolling in SuperScalar
      Integer instruction   FP instruction     Clock cycle
Loop: LD   F0,0(R1)                            1
      LD   F6,-8(R1)                           2
      LD   F10,-16(R1)      ADDD F4,F0,F2      3
      LD   F14,-24(R1)      ADDD F8,F6,F2      4
      LD   F18,-32(R1)      ADDD F12,F10,F2    5
      SD   0(R1),F4         ADDD F16,F14,F2    6
      SD   -8(R1),F8        ADDD F20,F18,F2    7
      SD   -16(R1),F12                         8
      SD   -24(R1),F16                         9
      SUBI R1,R1,#40                           10
      BNEZ R1,LOOP                             11
      SD   -32(R1),F20                         12

Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration
38
Loop Unrolling in VLIW
Memory          Memory          FP               FP               Int. op/        Clock
reference 1     reference 2     operation 1      operation 2      branch
LD F0,0(R1)     LD F6,-8(R1)                                                      1
LD F10,-16(R1)  LD F14,-24(R1)                                                    2
LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2    ADDD F8,F6,F2                    3
LD F26,-48(R1)                  ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                ADDD F20,F18,F2  ADDD F24,F22,F2                  5
SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                   6
SD -16(R1),F12  SD -24(R1),F16                                                    7
SD -32(R1),F20  SD -40(R1),F24                                    SUBI R1,R1,#48  8
SD -0(R1),F28                                                     BNEZ R1,LOOP    9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW
39
Limits to Multi-Issue Machines
• Inherent limitations of ILP
– 1 branch in 5: how to keep a 5-way VLIW busy?
– Latencies of units: many operations must be scheduled
– Need about (pipeline depth × no. of functional units) independent operations to keep the machine busy
• Difficulties in building HW
– Duplicate FUs to get parallel execution
– Increase ports to the register file
– Increase ports to memory
– Decoding SS and its impact on clock rate, pipeline depth
40
Limits to Multi-Issue Machines
• Limitations specific to either the SS or VLIW implementation
– Decode issue in SS
– VLIW code size: unrolled loops + wasted fields in VLIW
– VLIW lock step => 1 hazard & all instructions stall
– VLIW & binary compatibility is a practical weakness
41
Chapter 5: Memory Hierarchy
42
Recap: Who Cares About the Memory Hierarchy?
[Chart: Processor-DRAM memory gap (latency), 1980–2000, log scale —
CPU performance grows ~60%/yr (2X/1.5 yr), DRAM ~9%/yr (2X/10 yrs);
the processor-memory performance gap grows ~50% per year]
43
Levels of the Memory Hierarchy
Level      Capacity, Access Time, Cost                      Staging / Xfer Unit
Registers  100s bytes, <1s ns                               instr. operands — prog./compiler, 1-8 bytes
Cache      10s-100s KBytes, 1-10 ns, $10/MByte              blocks — cache cntl, 8-128 bytes
Memory     MBytes, 100-300 ns, $1/MByte                     pages — OS, 512-4K bytes
Disk       10s GBytes, 10 ms (10,000,000 ns), $0.0031/MByte files — user/operator, MBytes
Tape       infinite, sec-min, $0.0014/MByte

(Upper levels are faster; lower levels are larger)
44
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW (hardware) has relied on locality for speed
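Spatial locality is why access pattern matters even for the same number of references. A small sketch (the 8-word block size is an illustrative assumption): sequential accesses reuse each fetched block, while a block-sized stride touches a new block on nearly every access.

```python
BLOCK = 8   # words per cache block (illustrative)

def blocks_touched(n, stride):
    """Number of distinct cache blocks referenced by n accesses
    to word addresses 0, stride, 2*stride, ..."""
    return len({(i * stride) // BLOCK for i in range(n)})

# 64 sequential accesses hit only 8 blocks, but the same 64 accesses
# at stride 8 touch 64 different blocks: far worse spatial locality.
print(blocks_touched(64, 1), blocks_touched(64, 8))
```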
45
Memory Hierarchy: Terminology
• Hit: the data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)
[Figure: blocks Blk X (upper-level memory) and Blk Y (lower-level memory) move between levels, to/from the processor]
46
Cache Measures
• Hit rate: fraction found in that level
– So high that we usually talk about the miss rate
– Miss rate fallacy: miss rate is to average memory access time what MIPS is to CPU performance (an incomplete measure)
• Average memory-access time
  = Hit time + Miss rate × Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
– access time: time to the lower level = f(latency to lower level)
– transfer time: time to transfer the block = f(BW between upper & lower levels)
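The average memory-access time formula above as a one-line check (the numbers in the example are illustrative, not from the slide):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g., a 1-cycle hit, 5% miss rate, and 100-cycle penalty average
# out to 6 cycles per access: misses dominate despite being rare.
print(amat(1, 0.05, 100))
```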
47
Simplest Cache: Direct Mapped
[Figure: a 16-location memory (addresses 0–F) mapping into a 4-byte direct-mapped cache (cache indexes 0–3)]
• Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?
48
1 KB Direct Mapped Cache, 32B blocks
• For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: the 32-bit address splits into Cache Tag <31:10> (example: 0x50), Cache Index <9:5> (ex: 0x01), and Byte Select <4:0> (ex: 0x00); the tag is stored as part of the cache “state” along with a Valid Bit; the data array holds Bytes 0–31 per block, Bytes 0–1023 in all]
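The address split on this slide can be checked with a short sketch; the constants match the slide's 1 KB cache with 32-byte blocks (the helper function itself is illustrative):

```python
BLOCK_BITS = 5   # 32-byte blocks  -> byte select = address<4:0>
INDEX_BITS = 5   # 1 KB / 32 B = 32 blocks -> index = address<9:5>

def split_address(addr):
    """Return (tag, index, byte_select) for a 32-bit address."""
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, byte_select

# The slide's example (tag 0x50, index 0x01, byte select 0x00)
# corresponds to address (0x50 << 10) | (0x01 << 5) = 0x14020.
print(split_address(0x14020))
```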
49
Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct-mapped caches operate in parallel (N typically 2 to 4)
• Example: two-way set associative cache
– The Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result
[Figure: the Cache Index selects one set from two banks of (Valid, Cache Tag, Cache Data) arrays; the address tag (Adr Tag) is compared against both stored tags, the compare results are ORed into Hit, and Sel1/Sel0 drive a mux that picks the Cache Block]
50
Disadvantage of Set Associative Cache
• N-way set associative cache vs. direct-mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct-mapped cache, the Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later on a miss.
[Figure: same two-way structure as the previous slide — parallel tag compares ORed into Hit, with a mux selecting the Cache Block]
51
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level?(Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
52
Q1: Where can a block be placed in the upper level?
• Block 12 placed in an 8-block cache:
– Fully associative, direct mapped, 2-way set associative
– S.A. mapping = block number modulo number of sets
[Figure: memory blocks 0–31; in the 8-block cache, fully associative = anywhere, direct mapped => (12 mod 8) = 4, 2-way set assoc => set (12 mod 4) = 0]
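The modulo mapping above can be computed directly; this helper (an illustration, not from the slide) returns the cache block frames a memory block may occupy:

```python
def placement(block, num_blocks, assoc):
    """Cache frames that `block` may occupy in a cache of
    `num_blocks` frames with associativity `assoc`."""
    num_sets = num_blocks // assoc
    s = block % num_sets                     # set = block mod number of sets
    return list(range(s * assoc, (s + 1) * assoc))

# Block 12 in an 8-block cache:
print(placement(12, 8, 1))   # direct mapped: frame 4 only
print(placement(12, 8, 2))   # 2-way: set 0, frames 0-1
print(placement(12, 8, 8))   # fully associative: any frame
```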
53
Q2: How is a block found if it is in the upper level?
• Tag on each block
– No need to check the index or block offset
• Increasing associativity shrinks the index, expands the tag
[Figure: Block Address = Tag | Index, followed by the Block Offset]
54
Q3: Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
– Random
– LRU (Least Recently Used)

Miss rates:
Assoc:    2-way           4-way           8-way
Size      LRU     Ran     LRU     Ran     LRU     Ran
16 KB     5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64 KB     1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256 KB    1.15%   1.17%   1.13%   1.13%   1.12%   1.12%
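LRU within one set, the policy compared against Random in the table above, can be sketched with an ordered map (class name and interface are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an N-way cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None, least recent first

    def access(self, tag):
        """Return True on hit; on a miss, evict the LRU block if full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return False

s = LRUSet(2)
s.access("A"); s.access("B"); s.access("A")
s.access("C")   # set is full: evicts B, the least recently used, not A
```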
55
Q4: What happens on a write?
• Write through — the information is written to both the block in the cache and the block in lower-level memory.
• Write back — the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
– Is the block clean or dirty?
• Pros and cons of each?
– WT: read misses cannot result in writes
– WB: no repeated writes to the same location
• WT is always combined with write buffers so that the processor doesn't wait for lower-level memory
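The dirty bit is what makes write back work: a write only marks the cached block modified, and memory sees one write at eviction time no matter how many writes hit the block. A minimal sketch (class and memory model are illustrative):

```python
class WriteBackBlock:
    """One cache block under a write-back policy."""

    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

    def write(self, data):
        self.data = data
        self.dirty = True          # defer the memory write

    def evict(self, memory):
        if self.dirty:             # write back only if modified
            memory[self.tag] = self.data
        self.dirty = False

memory = {}
blk = WriteBackBlock(tag=0x50, data=0)
blk.write(42)
blk.write(99)        # repeated writes touch only the cache (WB advantage)
blk.evict(memory)    # a single memory write happens on replacement
```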
56
Write Buffer for Write Through
• A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
– Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– Write buffer saturation
[Figure: Processor <-> Cache, with the Write Buffer between them and DRAM]
57
A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: Processor (Control, Datapath, Registers, On-Chip Cache) -> Second-Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Disk/Tape);
Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec);
Size (bytes): 100s, Ks, Ms, Gs, Ts]