ACA Unit-5


    UNIT-V

    Memory Hierarchy Design


    Memory Hierarchy Design

    5.1 Introduction

    5.2 Review of the ABCs of Caches

    5.3 Cache Performance

    5.4 Reducing Cache Miss Penalty

5.5 Reducing Cache Miss Rate

    5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

    5.7 Reducing Hit Time

    5.8 Main Memory and Organizations for Improving Performance

5.9 Memory Technology

    5.10 Virtual Memory

    5.11 Protection and Examples of Virtual Memory


The five classic components of a computer:

        Control
        Datapath    (Control + Datapath = Processor)
        Memory
        Input
        Output

    Where do we fetch instructions to execute?

    Build a memory hierarchy that includes main memory & caches (internal
    memory) and a hard disk (external memory).

    Instructions are first fetched from external storage, such as a hard
    disk, and are kept in main memory. Before they go to the CPU, they are
    typically brought into the caches.


    Technology Trends

    DRAM

    Year Size Cycle Time

    1980 64 Kb 250 ns

    1983 256 Kb 220 ns

    1986 1 Mb 190 ns

    1989 4 Mb 165 ns

    1992 16 Mb 145 ns

    1995 64 Mb 120 ns

    2000 256 Mb 100 ns

Capacity           Speed (latency)
    CPU:     2x in 1.5 years    2x in 1.5 years
    DRAM:    4x in 3 years      2x in 10 years
    Disk:    4x in 3 years      2x in 10 years

    DRAM capacity vs. speed improvement over the period: 4000:1 vs. 2.5:1!


Performance Gap between CPUs and Memory

    [Figure: performance (improvement ratio) over time. CPU performance
    improves 1.35x/year, later 1.55x/year, while memory improves only
    about 7%/year. The gap (latency) grows about 50% per year!]


Levels of the Memory Hierarchy

        Level           Capacity     Access Time
        CPU Registers   500 bytes    0.25 ns
        Cache           64 KB        1 ns
        Main Memory     512 MB       100 ns
        Disk            100 GB       5 ms

    Moving from the upper level (registers) toward the lower level (disk),
    each level is larger but slower; moving up, each level is faster but
    smaller. Data moves between the levels in different units: blocks
    between cache and memory, pages between memory and disk, and files
    between disk and I/O devices.


ABCs of Caches

    Cache:
        In this textbook it mainly means the first level of the memory
        hierarchy encountered once the address leaves the CPU.
        The term is applied whenever buffering is employed to reuse
        commonly occurring items, e.g. file caches, name caches, and so on.

    Principle of Locality:
        Programs access a relatively small portion of the address space at
        any instant of time.

    Two different types of locality:
        Temporal Locality (locality in time): if an item is referenced, it
        will tend to be referenced again soon (e.g., loops, reuse).
        Spatial Locality (locality in space): if an item is referenced,
        items whose addresses are close by tend to be referenced soon
        (e.g., straight-line code, array access).


Memory Hierarchy: Terminology

    Hit: the data appears in some block in the cache (example: Block X).
        Hit Rate: the fraction of cache accesses found in the cache.
        Hit Time: the time to access the upper level, which consists of
            RAM access time + time to determine hit/miss.
    Miss: the data needs to be retrieved from a block in the main memory
    (Block Y).
        Miss Rate = 1 - (Hit Rate).
        Miss Penalty: the time to replace a block in the cache
            + the time to deliver the block to the processor.
    Miss Penalty is much larger than Hit Time.


Example (textbook p. 395): Assume we have a computer where the CPI is
    1.0 when all memory accesses hit in the cache. The only data accesses
    are loads and stores, and these total 50% of the instructions. If the
    miss penalty is 25 clock cycles and the miss rate is 2%, how much
    faster would the computer be if all memory accesses were cache hits?

    Answer:
    (A) If accesses always hit in the cache, CPI = 1.0 and there are no
    memory stalls:
        CPU time(A) = (IC x CPI + 0) x Clock cycle time
                    = IC x Clock cycle time
    (B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
        Memory stalls = IC x (Memory accesses/Instruction) x Miss rate
                        x Miss penalty
                      = IC x (1 + 50%) x 2% x 25 = IC x 0.75
        CPU time(B) = (IC + IC x 0.75) x Clock cycle time
                    = 1.75 x IC x Clock cycle time

    The performance ratio is the inverse of the ratio of the CPU execution
    times:
        CPU time(B) / CPU time(A) = 1.75
    The computer with no cache misses is 1.75 times faster.


    Four Memory Hierarchy Questions

Q1 (block placement): Where can a block be placed in the upper level?

    Q2 (block identification): How is a block found if it is in the upper level?

    Q3 (block replacement): Which block should be replaced on a miss?

    Q4 (write strategy): What happens on a write?


Q1 (block placement): Where can a block be placed?

    Direct mapped: set = (Block number) mod (Number of blocks in cache)
    Set associative: set = (Block number) mod (Number of sets in cache)
        # of sets = (# of blocks) / n
        n-way: n blocks in a set; 1-way = direct mapped
    Fully associative: # of sets = 1 (a block can go anywhere)

    Example: block 12 placed in an 8-block cache.
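As a quick illustration (a sketch of my own, not from the slides; the
    names are invented), the set index for each policy can be computed
    like this in C:

        #include <stdio.h>

        /* set_index: which set a block maps to, for an n-way cache of
           num_blocks blocks. Fully associative means assoc == num_blocks. */
        unsigned set_index(unsigned block_addr, unsigned num_blocks,
                           unsigned assoc)
        {
            unsigned num_sets = num_blocks / assoc; /* n-way: n blocks/set */
            return block_addr % num_sets;           /* fully assoc.: 1 set */
        }

        int main(void)
        {
            /* Block 12 in an 8-block cache, as in the slide's example: */
            printf("direct mapped (1-way): set %u\n", set_index(12, 8, 1)); /* 4 */
            printf("2-way set associative: set %u\n", set_index(12, 8, 2)); /* 0 */
            printf("fully associative:     set %u\n", set_index(12, 8, 8)); /* 0 */
            return 0;
        }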


Simplest Cache: Direct Mapped (1-way)

    [Figure: memory blocks 0 through F map onto a 4-block direct-mapped
    cache. The cache index is the block number mod 4, so blocks 0, 4, 8, C
    share index 0; blocks 1, 5, 9, D share index 1; and so on.]

    Each block has only one place it can appear in the cache. The mapping
    is usually

        (Block address) MOD (Number of blocks in cache)


Example: 1 KB Direct Mapped Cache, 32 B Blocks

    For a 2^N-byte cache:
        The uppermost (32 - N) bits are always the Cache Tag.
        The lowest M bits are the Byte Select (Block Size = 2^M).

    Here N = 10 and M = 5, so a 32-bit address splits as:
        bits 31..10: Cache Tag    (ex: 0x50, stored as part of the cache
                                   state along with a Valid bit)
        bits  9..5:  Cache Index  (ex: 0x01)
        bits  4..0:  Byte Select  (ex: 0x00)

    [Figure: the cache data array holds 32 blocks of 32 bytes (Byte 0 ...
    Byte 1023); the index selects a block, whose stored tag (e.g. 0x50) is
    compared with the tag of the incoming address.]


Q2 (block identification): How is a block found?

    Three portions of an address in a set-associative or direct-mapped
    cache:

        | Tag | Cache/Set Index | Block Offset |
          (Block Address)         (Block Size)

    The Block Offset selects the desired data from the block, the Index
    field selects the set, and the Tag field is compared against the CPU
    address for a hit:
        Use the Cache Index to select the cache set.
        Check the Tag on each block in that set.
            No need to check the index or block offset.
        A valid bit is added to the tag to indicate whether or not the
        entry contains a valid address.
        Select the desired bytes using the Block Offset.

    Increasing associativity shrinks the index and expands the tag.
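As an illustration (my own sketch, using the 1 KB direct-mapped,
    32-byte-block parameters from the earlier example), the three fields
    can be extracted with shifts and masks:

        #include <stdint.h>
        #include <stdio.h>

        #define OFFSET_BITS 5   /* 32-byte blocks -> 5 offset bits        */
        #define INDEX_BITS  5   /* 1 KB / 32 B = 32 blocks -> 5 index bits */

        int main(void)
        {
            uint32_t addr   = 0x00014020u;  /* example address */
            uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
            uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
            uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

            /* Prints tag=0x50 index=0x1 offset=0x0, matching the slide. */
            printf("tag=0x%x index=0x%x offset=0x%x\n",
                   (unsigned)tag, (unsigned)index, (unsigned)offset);
            return 0;
        }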


Example: Two-Way Set Associative Cache

        The Cache Index selects a set from the cache.
        The two tags in the set are compared in parallel.
        Data is selected based on the tag comparison result.

    [Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache
    Block 0) arrays. The Cache Index selects one entry from each bank; the
    address tag (e.g. 0x50) is compared against both stored tags; the two
    compare results are ORed to form Hit and drive the mux (Sel1/Sel0)
    that picks the matching Cache Block.]


Disadvantage of Set Associative Cache

    N-way set associative vs. direct mapped:
        N comparators vs. 1.
        Extra MUX delay for the data.
        Data comes AFTER Hit/Miss is determined.

    In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
        Possible to assume a hit and continue; recover later if it was a
        miss.


Q3 (block replacement): Which block should be replaced on a cache miss?

    Easy for direct mapped: hardware decisions are simplified.
        Only one block frame is checked, and only that block can be replaced.
    Set associative or fully associative:
        There are many blocks to choose from on a miss to replace.

    Three primary strategies for selecting a block to be replaced:
        Random: a block is randomly selected.
        LRU: the Least Recently Used block is removed.
        FIFO: First In, First Out.

    Data cache misses per 1000 instructions for various replacement strategies:

        Associativity:  2-way               4-way               8-way
        Size            LRU   Random FIFO   LRU   Random FIFO   LRU   Random FIFO
        16 KB           114.1 117.3  115.5  111.7 115.1  113.3  109.0 111.8  110.4
        64 KB           103.4 104.3  103.9  102.4 102.3  103.1   99.7 100.5  100.3
        256 KB           92.2  92.1   92.5   92.1  92.1   92.5   92.1  92.1   92.5

    There is little difference between LRU and random for the largest
    cache size, with LRU outperforming the others for smaller caches.
    FIFO generally outperforms random for the smaller cache sizes.


Q4 (write strategy): What happens on a write?

    Reads dominate processor cache accesses: e.g., writes are about 7% of
    the overall memory traffic but about 21% of data cache accesses.

    Two options when writing to the cache:
        Write through: the information is written to both the block in
        the cache and the block in the lower-level memory.
        Write back: the information is written only to the block in the
        cache. The modified cache block is written to main memory only
        when it is replaced.
            To reduce the frequency of writing back blocks on replacement,
            a dirty bit indicates whether the block was modified in the
            cache (dirty) or not (clean). If clean, no write back is
            needed, since the lower level already holds the same
            information.

    Pros and cons:
        WT: simple to implement. The cache is always clean, so read misses
        cannot result in writes.
        WB: writes occur at the speed of the cache, and multiple writes
        within a block require only one write to the lower-level memory.
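A minimal sketch (an invented structure, not the book's code) of the
    write-back bookkeeping described above:

        typedef struct {
            int      valid;
            int      dirty;        /* set on write, cleared on refill */
            unsigned tag;
            /* ... data bytes ... */
        } cache_line_t;

        void write_hit(cache_line_t *line)
        {
            /* ... update the data in the cached block ... */
            line->dirty = 1;       /* defer the memory update (write back) */
        }

        void evict(cache_line_t *line)
        {
            if (line->valid && line->dirty) {
                /* ... write the block back to lower-level memory ... */
            }
            line->valid = line->dirty = 0;  /* clean blocks drop for free */
        }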


Write-Miss Policy: Write Allocate vs. No-Write Allocate

    Two options on a write miss:
        Write allocate: the block is allocated on a write miss, followed
        by the write-hit actions above.
            Write misses act like read misses.
        No-write allocate: write misses do not affect the cache; the block
        is modified only in the lower-level memory.

    Blocks stay out of the cache under no-write allocate until the program
    tries to read them, whereas with write allocate even blocks that are
    only written will still be in the cache.


Write-Miss Policy Example

    Example: Assume a fully associative write-back cache with many cache
    entries that starts empty. Below is a sequence of five memory
    operations:

        Write Mem[100];
        Write Mem[100];
        Read  Mem[200];
        Write Mem[200];
        Write Mem[100].

    What are the numbers of hits and misses (reads and writes included)
    when using no-write allocate versus write allocate?

    Answer:
        Operation          No-write allocate    Write allocate
        Write Mem[100]     1 write miss         1 write miss
        Write Mem[100]     1 write miss         1 write hit
        Read  Mem[200]     1 read miss          1 read miss
        Write Mem[200]     1 write hit          1 write hit
        Write Mem[100]     1 write miss         1 write hit
        Total              4 misses; 1 hit      2 misses; 3 hits
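These counts can be reproduced with a tiny simulation (a sketch of my
    own, not from the slides; the "cache" is just a list of resident
    addresses):

        #include <stdio.h>

        static int cache[16], n;

        static int lookup(int addr)              /* 1 = hit, 0 = miss */
        {
            for (int i = 0; i < n; i++)
                if (cache[i] == addr) return 1;
            return 0;
        }

        static void do_access(const char *op, int addr, int is_write,
                              int write_alloc)
        {
            int hit = lookup(addr);
            if (!hit && (!is_write || write_alloc)) /* reads always allocate */
                cache[n++] = addr;
            printf("%-5s Mem[%d]: %s\n", op, addr, hit ? "hit" : "miss");
        }

        int main(void)
        {
            for (int wa = 0; wa <= 1; wa++) {
                n = 0;
                printf("-- %s --\n", wa ? "write allocate" : "no-write allocate");
                do_access("Write", 100, 1, wa);
                do_access("Write", 100, 1, wa);
                do_access("Read",  200, 0, wa);
                do_access("Write", 200, 1, wa);
                do_access("Write", 100, 1, wa); /* 4 misses/1 hit, then 2/3 */
            }
            return 0;
        }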


Cache Performance

    Example: Split Cache vs. Unified Cache

    Which has the better average memory access time?
        A 16 KB instruction cache with a 16 KB data cache (split), or
        a 32 KB unified cache?

    Miss rates:
        Size     Instruction Cache    Data Cache    Unified Cache
        16 KB    0.4%                 11.4%         -
        32 KB    -                    -             3.18%

    Assume:
        A hit takes 1 clock cycle and the miss penalty is 100 cycles.
        A load or store takes 1 extra clock cycle on a unified cache,
        since there is only one cache port.
        36% of the instructions are data transfer instructions, so about
        74% of the memory accesses are instruction references.

    Answer:
    Average memory access time (split)
        = % instructions x (Hit time + Instruction miss rate x Miss penalty)
        + % data x (Hit time + Data miss rate x Miss penalty)
        = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24

    Average memory access time (unified)
        = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44


Impact of Memory Access on CPU Performance

    Example: Suppose a processor with
        Ideal CPI = 1.0 (ignoring memory stalls)
        Average miss rate of 2%
        Average memory references per instruction of 1.5
        Miss penalty of 100 cycles

    What is the impact on performance when the behavior of the cache is
    included?

    Answer:
    CPI = CPU execution cycles per instr. + memory stall cycles per instr.
        = CPI_execution + Miss rate x Memory accesses per instr. x Miss penalty

    CPI with cache    = 1.0 + 2% x 1.5 x 100 = 4.0
    CPI without cache = 1.0 + 1.5 x 100 = 151

    CPU time with cache    = IC x CPI x Clock cycle time
                           = IC x 4.0 x Clock cycle time
    CPU time without cache = IC x 151 x Clock cycle time

    Without a cache, the CPI of the processor increases from 1 to 151!
    Even with the cache, 75% of the time the processor is stalled waiting
    for memory (CPI went from 1.0 to 4.0).


Impact of Cache Organizations on CPU Performance

    Example: What is the impact of two different cache organizations
    (direct mapped vs. 2-way set associative) on the performance of a CPU?
        Ideal CPI = 2.0 (ignoring memory stalls)
        Clock cycle time is 1.0 ns
        Average memory references per instruction: 1.5
        Cache size: 64 KB, block size: 64 bytes
        For the set-associative cache, assume the clock cycle time is
        stretched 1.25 times to accommodate the selection multiplexer
        Cache miss penalty is 75 ns
        Hit time is 1 clock cycle
        Miss rate: direct mapped 1.4%; 2-way set associative 1.0%

    Answer:
    Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
    Avg. memory access time (2-way) = 1.0 x 1.25 + (0.010 x 75) = 2.00 ns

    CPU time = IC x (CPI_execution x Clock cycle time
               + Miss rate x Memory accesses per instruction x Miss penalty)
    CPU time (1-way) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC
    CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC


    Summary of Performance Equations


Improving Cache Performance

    The next few sections in the textbook look at ways to improve cache
    and memory access times:

        CPU time = IC x (CPI_execution
                   + Memory accesses/Instruction x Miss rate x Miss penalty)
                   x Clock cycle time

        Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
                                     (Sec. 5.7)  (Sec. 5.5)   (Sec. 5.4)
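These equations drop straight into code; a minimal sketch (my own
    helper functions, reusing numbers from the earlier examples):

        #include <stdio.h>

        /* AMAT = hit time + miss rate x miss penalty */
        double amat(double hit, double miss_rate, double penalty)
        {
            return hit + miss_rate * penalty;
        }

        /* CPI including memory stall cycles */
        double cpi_total(double cpi_exec, double accesses_per_instr,
                         double miss_rate, double penalty)
        {
            return cpi_exec + accesses_per_instr * miss_rate * penalty;
        }

        int main(void)
        {
            /* CPI example above: 1.0 CPI, 1.5 accesses/instr, 2% misses,
               100-cycle penalty => 4.0 */
            printf("CPI with cache = %.1f\n", cpi_total(1.0, 1.5, 0.02, 100));
            /* 1-way example above: 1-cycle hit, 1.4% misses, 75 ns
               penalty => 2.05 ns */
            printf("AMAT (1-way)   = %.2f ns\n", amat(1.0, 0.014, 75));
            return 0;
        }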


Reducing Cache Miss Penalty

    Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

    The time to handle a miss is becoming more and more the controlling
    factor, because of the great improvement in the speed of processors
    compared with the speed of memory.

    Five optimizations:
    1. Multilevel caches
    2. Critical word first and early restart
    3. Giving priority to read misses over writes
    4. Merging write buffer
    5. Victim caches


O1: Multilevel Caches

    Approaches:
        Make the cache faster, to keep pace with the speed of CPUs.
        Make the cache larger, to overcome the widening gap.
        L1: fast hits; L2: fewer misses.

    L2 equations:

        Average Memory Access Time
            = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
        Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

        Average Memory Access Time = Hit Time_L1
            + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
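Plugging made-up numbers into the L2 equations (a sketch for
    illustration only; all values are assumed):

        #include <stdio.h>

        int main(void)
        {
            double hit_l1 = 1.0,  miss_rate_l1 = 0.05;  /* in clock cycles */
            double hit_l2 = 10.0, miss_rate_l2 = 0.20;
            double mem_penalty = 100.0;

            double miss_penalty_l1 =
                hit_l2 + miss_rate_l2 * mem_penalty;           /* 30.0 */
            double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1; /* 2.5 */

            printf("L1 miss penalty = %.1f cycles\n", miss_penalty_l1);
            printf("AMAT            = %.1f cycles\n", amat);
            return 0;
        }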


Design of L2 Cache

    Size:
        Since everything in the L1 cache is likely to be in the L2 cache,
        L2 should be much bigger than L1.

    Whether data in L1 is also in L2:
        Novice approach: design L1 and L2 independently.
        Multilevel inclusion: L1 data are always present in L2.
            Advantage: easy consistency between I/O and cache (check L2 only).
            Drawback: L2 must invalidate all L1 blocks that map onto a
            replaced 2nd-level block => slightly higher 1st-level miss rate.
            e.g. Intel Pentium 4: 64-byte blocks in L1, 128-byte in L2.
        Multilevel exclusion: L1 data is never found in L2.
            A cache miss in L1 results in a swap of blocks between L1 and L2.
            Advantage: prevents wasting space in L2.
            e.g. AMD Athlon: 64 KB L1 and 256 KB L2.


O2: Critical Word First and Early Restart

    Don't wait for the full block to be loaded before restarting the CPU:
        Critical word first: request the missed word first from memory and
        send it to the CPU as soon as it arrives; the CPU continues
        execution while the rest of the words in the block are filled in.
        Also called wrapped fetch and requested word first.
        Early restart: as soon as the requested word of the block arrives,
        send it to the CPU and let the CPU continue execution.
            Given spatial locality, the CPU tends to want the next
            sequential word anyway, so the benefit of early restart is
            not clear-cut.

    Generally useful only with large blocks.


O3: Giving Priority to Read Misses over Writes

    Serve reads before outstanding writes have completed.

    With write through and write buffers, a read miss may need data still
    sitting in the write buffer. (The slide's example is cut off; the
    classic sequence, for a direct-mapped write-through cache in which
    512 and 1024 map to the same index, is:)

        SW R3, 512(R0)   ; M[512] <- R3   (cache index 0)
        LW R1, 1024(R0)  ; R1 <- M[1024]  (cache index 0)
        LW R2, 512(R0)   ; R2 <- M[512]   (cache index 0)

    If the write buffer has not yet updated M[512] when the last load
    misses, R2 may not receive the value of R3: a read-after-write hazard
    through memory. Solutions: wait for the write buffer to drain on a
    read miss, or check the write buffer contents and let the read miss
    continue if there is no conflict.


O4: Merging Write Buffer

    If the write buffer is empty, the data and the full address are
    written into the buffer, and the write is finished from the CPU's
    perspective. Usually a write buffer holds multiple words per entry.

    Write merging: the addresses in the write buffer are checked to see
    whether the address of the new data matches the address of a valid
    write buffer entry. If so, the new data are combined with that entry.

    [Figure: a write buffer with 4 entries, each holding four 64-bit
    words. Left: without merging, four sequential writes occupy four
    entries. Right: the four writes are merged into a single entry.]

    Writing multiple words at the same time is faster than writing
    multiple times.
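A minimal sketch of the merging check (invented sizes and names; each
    entry covers an aligned four-word region, as in the figure):

        #include <stdint.h>

        #define ENTRIES     4
        #define ENTRY_WORDS 4                   /* four 64-bit words/entry */

        typedef struct {
            int      valid;
            uint64_t base;                      /* 32-byte-aligned address */
            uint64_t word[ENTRY_WORDS];
            int      word_valid[ENTRY_WORDS];
        } wbuf_entry;

        static wbuf_entry wbuf[ENTRIES];

        /* Returns 1 if buffered (merged or new entry), 0 if the buffer is
           full and the CPU must stall. */
        int buffer_write(uint64_t addr, uint64_t data)
        {
            uint64_t base = addr & ~(uint64_t)31;
            int      w    = (int)((addr >> 3) & (ENTRY_WORDS - 1));

            for (int i = 0; i < ENTRIES; i++)   /* write-merging check */
                if (wbuf[i].valid && wbuf[i].base == base) {
                    wbuf[i].word[w] = data;
                    wbuf[i].word_valid[w] = 1;
                    return 1;
                }
            for (int i = 0; i < ENTRIES; i++)   /* otherwise, a free entry */
                if (!wbuf[i].valid) {
                    wbuf[i] = (wbuf_entry){ .valid = 1, .base = base };
                    wbuf[i].word[w] = data;
                    wbuf[i].word_valid[w] = 1;
                    return 1;
                }
            return 0;
        }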


O5: Victim Caches

    Idea of recycling: remember what was most recently discarded on a
    cache miss, in case it is needed again, rather than simply discarding
    it or swapping it into L2.

    Victim cache: a small, fully associative cache between a cache and its
    refill path.
        It contains only blocks that are discarded from the cache because
        of a miss ("victims").
        It is checked on a miss before going to the next lower-level memory.
        Victim caches of 1 to 5 entries are effective at reducing misses,
        especially for small, direct-mapped data caches.
        AMD Athlon: 8 entries.


Reducing Miss Rate

    3 Cs of Cache Misses:

    Compulsory: the first access to a block cannot be in the cache, so the
    block must be brought into the cache. Also called cold-start misses or
    first-reference misses.
        (Misses in even an infinite cache.)

    Capacity: if the cache cannot contain all the blocks needed during
    execution of a program, capacity misses will occur because blocks are
    discarded and later retrieved.
        (Misses in a fully associative cache of size X.)

    Conflict: if the block-placement strategy is set associative or direct
    mapped, conflict misses (in addition to compulsory and capacity
    misses) will occur, because a block can be discarded and later
    retrieved if too many blocks map to its set. Also called collision
    misses or interference misses.
        (Misses in an N-way associative cache that would hit in a fully
        associative cache of the same size X.)


3Cs: Absolute Miss Rates (SPEC92)

    [Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to
    128 KB) for 1-way, 2-way, 4-way, and 8-way caches. The conflict
    component shrinks with associativity, the capacity component with
    size, and the compulsory component is vanishingly small.]

    2:1 Cache Rule:
        miss rate of a 1-way associative cache of size X
        = miss rate of a 2-way associative cache of size X/2


3Cs: Relative Miss Rates

    [Figure: the same data plotted as a percentage of the total miss rate
    (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way
    caches, showing the conflict, capacity, and compulsory shares.]

    Flaw: assumes a fixed block size.
    Good: the insight leads to invention.


Five Techniques to Reduce Miss Rate

    1. Larger block size
    2. Larger caches
    3. Higher associativity
    4. Way prediction and pseudoassociative caches
    5. Compiler optimizations


O1: Larger Block Size

    [Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for
    cache sizes from 1 KB to 256 KB.]

    Take advantage of spatial locality:
        The larger the block, the greater the chance parts of it will be
        used again.
        The number of blocks is reduced for a cache of the same size
        => increased miss penalty.
        It may increase conflict misses, and even capacity misses if the
        cache is small.
        High memory latency and high bandwidth encourage large block sizes.


O2: Larger Caches

    Increasing the capacity of the cache reduces capacity misses
    (Figures 5.14 and 5.15).
        Cost: potentially longer hit time and higher price.
        Trend: larger L2 or L3 off-chip caches.

    [Figure: the absolute miss-rate plot again; the capacity component
    shrinks as cache size grows from 1 KB to 128 KB.]


O3: Higher Associativity

    Figures 5.14 and 5.15 show how miss rates improve with higher
    associativity:
        8-way set associative is as effective as fully associative for
        practical purposes.
        2:1 Cache Rule: the miss rate of a direct-mapped cache of size N
        equals the miss rate of a 2-way set-associative cache of size N/2.

    Tradeoff: a more associative cache complicates the circuit and may
    lengthen the clock cycle.
        Beware: execution time is the only final measure!
        Will the clock cycle time increase as a result of a more
        complicated cache? Hill [1988] suggested the hit time for 2-way
        vs. 1-way is about +10% for an external cache and +2% for an
        internal one.


O4: Way Prediction & Pseudoassociative Caches

    Way prediction: extra bits are kept in the cache to predict the way,
    or block within the set, of the next cache access.
        Example: the 2-way I-cache of the Alpha 21264.
            If the predictor is correct, I-cache latency is 1 clock cycle.
            If incorrect, it tries the other block, changes the way
            predictor, and has a latency of 3 clock cycles.
        Prediction accuracy is in excess of 85%.
        Reduces conflict misses while maintaining the hit speed of a
        direct-mapped cache.

    Pseudoassociative (column associative) caches:
        On a miss, a second cache entry is checked before going to the
        next lower level: one fast hit and one slow hit.
        Invert the most significant bit of the index to find the other
        block in the "pseudoset".
        The miss penalty may become slightly longer.


O5: Compiler Optimizations

    Improve the hit rate by compile-time optimization.

    Reordering instructions using profiling information (McFarling [1989]):
        Reduced misses by 50% for a 2 KB direct-mapped I-cache with 4-byte
        blocks, and by 75% for an 8 KB cache.
        Best performance when it was possible to prevent some instructions
        from entering the cache at all.

    Aligning basic blocks: the entry point is placed at the beginning of a
    cache block.
        Decreases the chance of a cache miss for sequential code.

    Loop interchange: exchanging the nesting of loops.
        Improves spatial locality => reduces misses.
        Makes data be accessed in the order they are stored, maximizing
        use of the data in a cache block before it is discarded. (The
        slide's code is cut off here; it is completed following the
        classic textbook example.)

    /* Before: the inner loop strides down a column, touching a new
       cache block on nearly every access */
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];
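Interchanging the loops makes the inner loop sweep along a row, so
    successive accesses hit adjacent words (the bounds below again follow
    the classic textbook version of this example):

    /* After: row-first traversal; unit-stride access through x */
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];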


Blocking: operating on submatrices or blocks.
        Maximizes accesses to the data loaded into the cache before it is
        replaced => improves temporal locality.

    Example: X = Y * Z (the slide's code is cut off here; it is completed
    following the classic textbook example)

    /* Before: each row of z is swept N times and may be evicted
       between uses */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }
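A sketch of the blocked version with blocking factor B (reconstructed
    along the lines of the classic textbook example; it assumes x starts
    zeroed and a MIN macro such as
    #define MIN(a,b) ((a) < (b) ? (a) : (b))):

    /* After: operate on B x B submatrices so the touched pieces of x, y,
       and z fit in the cache together */
    for (jj = 0; jj < N; jj += B)
        for (kk = 0; kk < N; kk += B)
            for (i = 0; i < N; i++)
                for (j = jj; j < MIN(jj + B, N); j++) {
                    r = 0;
                    for (k = kk; k < MIN(kk + B, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }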


5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

    Three techniques that overlap instruction execution with memory
    accesses:

    1. Nonblocking caches to reduce stalls on cache misses
       (to match out-of-order processors)
    2. Hardware prefetching of instructions and data
    3. Compiler-controlled prefetching


O2: Hardware Prefetching of Instructions and Data

    Prefetch instructions or data before they are requested by the CPU:
        either directly into the caches, or into an external buffer that
        is faster to access than main memory.

    Instruction prefetch is frequently done in hardware outside the cache:
        Fetch two blocks on a miss:
            the requested block is placed in the I-cache when it returns;
            the prefetched block is placed in an instruction stream
            buffer (ISB).
        A single ISB would catch 15% to 25% of the misses from a 4 KB
        direct-mapped I-cache with 16-byte blocks; four stream buffers
        increased the data hit rate to 43% (Jouppi 1990).

    UltraSPARC III: data prefetch.
        If a load hits in the prefetch cache:
            the block is read from the prefetch cache;
            the next prefetch request is issued, calculating the stride of
            the next prefetched block from the difference between the
            current and previous addresses.
        Up to 8 simultaneous prefetches.

    Prefetching may interfere with demand misses, lowering performance.


O3: Compiler-Controlled Prefetching

    Register prefetch: load the value into a register.
    Cache prefetch: load the data only into the cache (not a register).

    Faulting vs. nonfaulting: the prefetch address does or does not cause
    an exception for virtual address faults and protection violations.
        A normal load instruction is, in effect, a faulting register
        prefetch instruction.

    The most effective prefetch is semantically invisible to the program:
        it doesn't change the contents of registers or memory, and
        it cannot cause virtual memory faults.
        Nonbinding prefetch = nonfaulting cache prefetch.
    Overlapping execution: the CPU proceeds while the prefetched data are
    being fetched.

    Advantage: the compiler can avoid the unnecessary prefetches that pure
    hardware schemes may issue.
    Drawback: prefetch instructions incur instruction overhead.
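As an illustration (not from the slides), GCC and Clang expose a
    nonbinding, nonfaulting cache prefetch as the __builtin_prefetch
    intrinsic; the prefetch distance of 8 below is a made-up tuning
    parameter:

    /* Prefetch a few iterations ahead while scaling an array. */
    void scale(double *x, int n, double a)
    {
        for (int i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&x[i + 8], 1, 1); /* rw=write, low locality */
            x[i] = a * x[i];
        }
    }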



5.7 Reducing Hit Time

    Importance of cache hit time:

        Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

        More importantly, cache access time limits the clock cycle rate in
        many processors today!

    Fast hit time: quickly and efficiently find out whether the data is in
    the cache, and if it is, get it out of the cache.

    Four techniques:
    1. Small and simple caches
    2. Avoiding address translation during indexing of the cache
    3. Pipelined cache access
    4. Trace caches


O1: Small and Simple Caches

    A time-consuming portion of a cache hit is using the index portion of
    the address to read the tag memory and then compare it to the address.

    Guideline: smaller hardware is faster.
        Why does the Alpha 21164 have an 8 KB instruction cache and an
        8 KB data cache plus a 96 KB second-level cache? A small data
        cache allows a fast clock rate.
    Guideline: simpler hardware is faster.
        Direct mapped, on chip.

    General design: a small, simple first-level cache; for second-level
    caches, keep the tags on chip and the data off chip.
    The recent emphasis is on fast clock time while hiding L1 misses with
    dynamic execution and using L2 caches to avoid going to memory.


O2: Avoiding Address Translation: Virtually Indexed, Physically Tagged Cache

    [Figure: three organizations of the CPU, translation buffer (TB),
    cache ($), and memory (MEM):
        Conventional organization: the VA goes through the TB first; the
        cache is indexed and tagged with the PA.
        Virtually addressed cache: the cache is accessed with the VA and
        translation happens only on a miss; suffers from the synonym
        problem.
        Virtually indexed, physically tagged: the cache is indexed with
        the VA in parallel with translation in the TB, and the PA is
        compared against the physical tags; this overlap requires the
        cache index to remain invariant across translation. An L2 cache
        sits below.]


O3: Pipelined Cache Access

    Simply pipeline the cache access:
        a first-level cache hit then takes multiple clock cycles.
        Benefit: a fast cycle time, at the cost of slow hits.

    Example: clock cycles to access instructions from the I-cache:
        Pentium: 1 clock cycle
        Pentium Pro through Pentium III: 2 clocks
        Pentium 4: 4 clocks

    Drawbacks: increasing the number of pipeline stages leads to
        a greater penalty on mispredicted branches, and
        more clock cycles between the issue of a load and the use of its
        data.

    Note that pipelining increases the bandwidth of instruction fetch
    rather than decreasing the actual latency of a cache hit.


O4: Trace Caches

    A trace cache for instructions finds a dynamic sequence of
    instructions, including taken branches, to load into a cache block.
        The cache blocks contain dynamic traces of the executed
        instructions as determined by the CPU, rather than static
        sequences of instructions as laid out in memory.
        Branch prediction is folded into the cache: predictions are
        validated along with the addresses for a fetch to be valid.
        e.g. the Intel NetBurst microarchitecture.

    Advantage: better utilization of cache space.
        Trace caches store instructions only from the branch entry point
        to the exit of the trace.
        In a conventional I-cache, the unused portions of a long block
        that a taken branch enters or exits mid-block still occupy space.

    Downside: the same instructions may be stored multiple times, in
    different traces.