ACA Unit-5


    UNIT-V

    Memory Hierarchy Design


    Memory Hierarchy Design

    5.1 Introduction

    5.2 Review of the ABCs of Caches

    5.3 Cache Performance

    5.4 Reducing Cache Miss Penalty

5.5 Reducing Cache Miss Rate

    5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

    5.7 Reducing Hit Time

    5.8 Main Memory and Organizations for Improving Performance

5.9 Memory Technology

    5.10 Virtual Memory

    5.11 Protection and Examples of Virtual Memory


The five classic components of a computer:

        Control
        Datapath    (Control + Datapath = Processor)
        Memory
        Input
        Output

    Where do we fetch instructions to execute?

    Build a memory hierarchy that includes main memory & caches (internal
    memory) and a hard disk (external memory).

    Instructions are first fetched from external storage, such as a hard
    disk, and are kept in main memory. Before they go to the CPU, they are
    typically brought into the caches.


    Technology Trends

    DRAM

    Year Size Cycle Time

    1980 64 Kb 250 ns

    1983 256 Kb 220 ns

    1986 1 Mb 190 ns

    1989 4 Mb 165 ns

    1992 16 Mb 145 ns

    1995 64 Mb 120 ns

    2000 256 Mb 100 ns

Capacity           Speed (latency)
    CPU:     2x in 1.5 years    2x in 1.5 years
    DRAM:    4x in 3 years      2x in 10 years
    Disk:    4x in 3 years      2x in 10 years

    DRAM capacity vs. speed improvement over the period: 4000:1 vs. 2.5:1!


Performance Gap between CPUs and Memory

    [Figure: performance (improvement ratio) over time. CPU performance
    improves 1.35x/year, later 1.55x/year, while memory improves only
    about 7%/year. The gap (latency) grows about 50% per year!]


Levels of the Memory Hierarchy

        Level           Capacity     Access Time
        CPU Registers   500 bytes    0.25 ns
        Cache           64 KB        1 ns
        Main Memory     512 MB       100 ns
        Disk            100 GB       5 ms

    Moving from the upper level (registers) toward the lower level (disk),
    each level is larger but slower; moving up, each level is faster but
    smaller. Data moves between the levels in different units: blocks
    between cache and memory, pages between memory and disk, and files
    between disk and I/O devices.


ABCs of Caches

    Cache:
        In this textbook it mainly means the first level of the memory
        hierarchy encountered once the address leaves the CPU.
        The term is applied whenever buffering is employed to reuse
        commonly occurring items, e.g. file caches, name caches, and so on.

    Principle of Locality:
        Programs access a relatively small portion of the address space at
        any instant of time.

    Two different types of locality:
        Temporal Locality (locality in time): if an item is referenced, it
        will tend to be referenced again soon (e.g., loops, reuse).
        Spatial Locality (locality in space): if an item is referenced,
        items whose addresses are close by tend to be referenced soon
        (e.g., straight-line code, array access).


Memory Hierarchy: Terminology

    Hit: the data appears in some block in the cache (example: Block X).
        Hit Rate: the fraction of cache accesses found in the cache.
        Hit Time: the time to access the upper level, which consists of
            RAM access time + time to determine hit/miss.
    Miss: the data needs to be retrieved from a block in the main memory
    (Block Y).
        Miss Rate = 1 - (Hit Rate).
        Miss Penalty: the time to replace a block in the cache
            + the time to deliver the block to the processor.
    Miss Penalty is much larger than Hit Time.


Example (textbook p. 395): Assume we have a computer where the CPI is
    1.0 when all memory accesses hit in the cache. The only data accesses
    are loads and stores, and these total 50% of the instructions. If the
    miss penalty is 25 clock cycles and the miss rate is 2%, how much
    faster would the computer be if all memory accesses were cache hits?

    Answer:
    (A) If accesses always hit in the cache, CPI = 1.0 and there are no
    memory stalls:
        CPU time(A) = (IC x CPI + 0) x Clock cycle time
                    = IC x Clock cycle time
    (B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
        Memory stalls = IC x (Memory accesses/Instruction) x Miss rate
                        x Miss penalty
                      = IC x (1 + 50%) x 2% x 25 = IC x 0.75
        CPU time(B) = (IC + IC x 0.75) x Clock cycle time
                    = 1.75 x IC x Clock cycle time

    The performance ratio is the inverse of the ratio of the CPU execution
    times:
        CPU time(B) / CPU time(A) = 1.75
    The computer with no cache misses is 1.75 times faster.


    Four Memory Hierarchy Questions

Q1 (block placement): Where can a block be placed in the upper level?

    Q2 (block identification): How is a block found if it is in the upper level?

    Q3 (block replacement): Which block should be replaced on a miss?

    Q4 (write strategy): What happens on a write?


Q1 (block placement): Where can a block be placed?

    Direct mapped: set = (Block number) mod (Number of blocks in cache)
    Set associative: set = (Block number) mod (Number of sets in cache)
        # of sets = (# of blocks) / n
        n-way: n blocks in a set; 1-way = direct mapped
    Fully associative: # of sets = 1 (a block can go anywhere)

    Example: block 12 placed in an 8-block cache.
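As a quick illustration (a sketch of my own, not from the slides; the
    names are invented), the set index for each policy can be computed
    like this in C:

        #include <stdio.h>

        /* set_index: which set a block maps to, for an n-way cache of
           num_blocks blocks. Fully associative means assoc == num_blocks. */
        unsigned set_index(unsigned block_addr, unsigned num_blocks,
                           unsigned assoc)
        {
            unsigned num_sets = num_blocks / assoc; /* n-way: n blocks/set */
            return block_addr % num_sets;           /* fully assoc.: 1 set */
        }

        int main(void)
        {
            /* Block 12 in an 8-block cache, as in the slide's example: */
            printf("direct mapped (1-way): set %u\n", set_index(12, 8, 1)); /* 4 */
            printf("2-way set associative: set %u\n", set_index(12, 8, 2)); /* 0 */
            printf("fully associative:     set %u\n", set_index(12, 8, 8)); /* 0 */
            return 0;
        }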


Simplest Cache: Direct Mapped (1-way)

    [Figure: memory blocks 0 through F map onto a 4-block direct-mapped
    cache. The cache index is the block number mod 4, so blocks 0, 4, 8, C
    share index 0; blocks 1, 5, 9, D share index 1; and so on.]

    Each block has only one place it can appear in the cache. The mapping
    is usually

        (Block address) MOD (Number of blocks in cache)


Example: 1 KB Direct Mapped Cache, 32 B Blocks

    For a 2^N-byte cache:
        The uppermost (32 - N) bits are always the Cache Tag.
        The lowest M bits are the Byte Select (Block Size = 2^M).

    Here N = 10 and M = 5, so a 32-bit address splits as:
        bits 31..10: Cache Tag    (ex: 0x50, stored as part of the cache
                                   state along with a Valid bit)
        bits  9..5:  Cache Index  (ex: 0x01)
        bits  4..0:  Byte Select  (ex: 0x00)

    [Figure: the cache data array holds 32 blocks of 32 bytes (Byte 0 ...
    Byte 1023); the index selects a block, whose stored tag (e.g. 0x50) is
    compared with the tag of the incoming address.]


Q2 (block identification): How is a block found?

    Three portions of an address in a set-associative or direct-mapped
    cache:

        | Tag | Cache/Set Index | Block Offset |
          (Block Address)         (Block Size)

    The Block Offset selects the desired data from the block, the Index
    field selects the set, and the Tag field is compared against the CPU
    address for a hit:
        Use the Cache Index to select the cache set.
        Check the Tag on each block in that set.
            No need to check the index or block offset.
        A valid bit is added to the tag to indicate whether or not the
        entry contains a valid address.
        Select the desired bytes using the Block Offset.

    Increasing associativity shrinks the index and expands the tag.
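As an illustration (my own sketch, using the 1 KB direct-mapped,
    32-byte-block parameters from the earlier example), the three fields
    can be extracted with shifts and masks:

        #include <stdint.h>
        #include <stdio.h>

        #define OFFSET_BITS 5   /* 32-byte blocks -> 5 offset bits        */
        #define INDEX_BITS  5   /* 1 KB / 32 B = 32 blocks -> 5 index bits */

        int main(void)
        {
            uint32_t addr   = 0x00014020u;  /* example address */
            uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
            uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
            uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

            /* Prints tag=0x50 index=0x1 offset=0x0, matching the slide. */
            printf("tag=0x%x index=0x%x offset=0x%x\n",
                   (unsigned)tag, (unsigned)index, (unsigned)offset);
            return 0;
        }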


Example: Two-Way Set Associative Cache

        The Cache Index selects a set from the cache.
        The two tags in the set are compared in parallel.
        Data is selected based on the tag comparison result.

    [Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache
    Block 0) arrays. The Cache Index selects one entry from each bank; the
    address tag (e.g. 0x50) is compared against both stored tags; the two
    compare results are ORed to form Hit and drive the mux (Sel1/Sel0)
    that picks the matching Cache Block.]


Disadvantage of Set Associative Cache

    N-way set associative vs. direct mapped:
        N comparators vs. 1.
        Extra MUX delay for the data.
        Data comes AFTER Hit/Miss is determined.

    In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
        Possible to assume a hit and continue; recover later if it was a
        miss.


Q3 (block replacement): Which block should be replaced on a cache miss?

    Easy for direct mapped: hardware decisions are simplified.
        Only one block frame is checked, and only that block can be replaced.
    Set associative or fully associative:
        There are many blocks to choose from on a miss to replace.

    Three primary strategies for selecting a block to be replaced:
        Random: a block is randomly selected.
        LRU: the Least Recently Used block is removed.
        FIFO: First In, First Out.

    Data cache misses per 1000 instructions for various replacement strategies:

        Associativity:  2-way               4-way               8-way
        Size            LRU   Random FIFO   LRU   Random FIFO   LRU   Random FIFO
        16 KB           114.1 117.3  115.5  111.7 115.1  113.3  109.0 111.8  110.4
        64 KB           103.4 104.3  103.9  102.4 102.3  103.1   99.7 100.5  100.3
        256 KB           92.2  92.1   92.5   92.1  92.1   92.5   92.1  92.1   92.5

    There is little difference between LRU and random for the largest
    cache size, with LRU outperforming the others for smaller caches.
    FIFO generally outperforms random for the smaller cache sizes.


Q4 (write strategy): What happens on a write?

    Reads dominate processor cache accesses: e.g., writes are about 7% of
    the overall memory traffic but about 21% of data cache accesses.

    Two options when writing to the cache:
        Write through: the information is written to both the block in
        the cache and the block in the lower-level memory.
        Write back: the information is written only to the block in the
        cache. The modified cache block is written to main memory only
        when it is replaced.
            To reduce the frequency of writing back blocks on replacement,
            a dirty bit indicates whether the block was modified in the
            cache (dirty) or not (clean). If clean, no write back is
            needed, since the lower level already holds the same
            information.

    Pros and cons:
        WT: simple to implement. The cache is always clean, so read misses
        cannot result in writes.
        WB: writes occur at the speed of the cache, and multiple writes
        within a block require only one write to the lower-level memory.
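A minimal sketch (an invented structure, not the book's code) of the
    write-back bookkeeping described above:

        typedef struct {
            int      valid;
            int      dirty;        /* set on write, cleared on refill */
            unsigned tag;
            /* ... data bytes ... */
        } cache_line_t;

        void write_hit(cache_line_t *line)
        {
            /* ... update the data in the cached block ... */
            line->dirty = 1;       /* defer the memory update (write back) */
        }

        void evict(cache_line_t *line)
        {
            if (line->valid && line->dirty) {
                /* ... write the block back to lower-level memory ... */
            }
            line->valid = line->dirty = 0;  /* clean blocks drop for free */
        }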


Write-Miss Policy: Write Allocate vs. No-Write Allocate

    Two options on a write miss:
        Write allocate: the block is allocated on a write miss, followed
        by the write-hit actions above.
            Write misses act like read misses.
        No-write allocate: write misses do not affect the cache; the block
        is modified only in the lower-level memory.

    Blocks stay out of the cache under no-write allocate until the program
    tries to read them, whereas with write allocate even blocks that are
    only written will still be in the cache.


Write-Miss Policy Example

    Example: Assume a fully associative write-back cache with many cache
    entries that starts empty. Below is a sequence of five memory
    operations:

        Write Mem[100];
        Write Mem[100];
        Read  Mem[200];
        Write Mem[200];
        Write Mem[100].

    What are the numbers of hits and misses (reads and writes included)
    when using no-write allocate versus write allocate?

    Answer:
        Operation          No-write allocate    Write allocate
        Write Mem[100]     1 write miss         1 write miss
        Write Mem[100]     1 write miss         1 write hit
        Read  Mem[200]     1 read miss          1 read miss
        Write Mem[200]     1 write hit          1 write hit
        Write Mem[100]     1 write miss         1 write hit
        Total              4 misses; 1 hit      2 misses; 3 hits
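These counts can be reproduced with a tiny simulation (a sketch of my
    own, not from the slides; the "cache" is just a list of resident
    addresses):

        #include <stdio.h>

        static int cache[16], n;

        static int lookup(int addr)              /* 1 = hit, 0 = miss */
        {
            for (int i = 0; i < n; i++)
                if (cache[i] == addr) return 1;
            return 0;
        }

        static void do_access(const char *op, int addr, int is_write,
                              int write_alloc)
        {
            int hit = lookup(addr);
            if (!hit && (!is_write || write_alloc)) /* reads always allocate */
                cache[n++] = addr;
            printf("%-5s Mem[%d]: %s\n", op, addr, hit ? "hit" : "miss");
        }

        int main(void)
        {
            for (int wa = 0; wa <= 1; wa++) {
                n = 0;
                printf("-- %s --\n", wa ? "write allocate" : "no-write allocate");
                do_access("Write", 100, 1, wa);
                do_access("Write", 100, 1, wa);
                do_access("Read",  200, 0, wa);
                do_access("Write", 200, 1, wa);
                do_access("Write", 100, 1, wa); /* 4 misses/1 hit, then 2/3 */
            }
            return 0;
        }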


Cache Performance

    Example: Split Cache vs. Unified Cache

    Which has the better average memory access time?
        A 16 KB instruction cache with a 16 KB data cache (split), or
        a 32 KB unified cache?

    Miss rates:
        Size     Instruction Cache    Data Cache    Unified Cache
        16 KB    0.4%                 11.4%         -
        32 KB    -                    -             3.18%

    Assume:
        A hit takes 1 clock cycle and the miss penalty is 100 cycles.
        A load or store takes 1 extra clock cycle on a unified cache,
        since there is only one cache port.
        36% of the instructions are data transfer instructions, so about
        74% of the memory accesses are instruction references.

    Answer:
    Average memory access time (split)
        = % instructions x (Hit time + Instruction miss rate x Miss penalty)
        + % data x (Hit time + Data miss rate x Miss penalty)
        = 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24

    Average memory access time (unified)
        = 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44


Impact of Memory Access on CPU Performance

    Example: Suppose a processor with
        Ideal CPI = 1.0 (ignoring memory stalls)
        Average miss rate of 2%
        Average memory references per instruction of 1.5
        Miss penalty of 100 cycles

    What is the impact on performance when the behavior of the cache is
    included?

    Answer:
    CPI = CPU execution cycles per instr. + memory stall cycles per instr.
        = CPI_execution + Miss rate x Memory accesses per instr. x Miss penalty

    CPI with cache    = 1.0 + 2% x 1.5 x 100 = 4.0
    CPI without cache = 1.0 + 1.5 x 100 = 151

    CPU time with cache    = IC x CPI x Clock cycle time
                           = IC x 4.0 x Clock cycle time
    CPU time without cache = IC x 151 x Clock cycle time

    Without a cache, the CPI of the processor increases from 1 to 151!
    Even with the cache, 75% of the time the processor is stalled waiting
    for memory (CPI went from 1.0 to 4.0).


Impact of Cache Organizations on CPU Performance

    Example: What is the impact of two different cache organizations
    (direct mapped vs. 2-way set associative) on the performance of a CPU?
        Ideal CPI = 2.0 (ignoring memory stalls)
        Clock cycle time is 1.0 ns
        Average memory references per instruction: 1.5
        Cache size: 64 KB, block size: 64 bytes
        For the set-associative cache, assume the clock cycle time is
        stretched 1.25 times to accommodate the selection multiplexer
        Cache miss penalty is 75 ns
        Hit time is 1 clock cycle
        Miss rate: direct mapped 1.4%; 2-way set associative 1.0%

    Answer:
    Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
    Avg. memory access time (2-way) = 1.0 x 1.25 + (0.010 x 75) = 2.00 ns

    CPU time = IC x (CPI_execution x Clock cycle time
               + Miss rate x Memory accesses per instruction x Miss penalty)
    CPU time (1-way) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC
    CPU time (2-way) = IC x (2.0 x 1.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC


    Summary of Performance Equations


Improving Cache Performance

    The next few sections in the textbook look at ways to improve cache
    and memory access times:

        CPU time = IC x (CPI_execution
                   + Memory accesses/Instruction x Miss rate x Miss penalty)
                   x Clock cycle time

        Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
                                     (Sec. 5.7)  (Sec. 5.5)   (Sec. 5.4)
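These equations drop straight into code; a minimal sketch (my own
    helper functions, reusing numbers from the earlier examples):

        #include <stdio.h>

        /* AMAT = hit time + miss rate x miss penalty */
        double amat(double hit, double miss_rate, double penalty)
        {
            return hit + miss_rate * penalty;
        }

        /* CPI including memory stall cycles */
        double cpi_total(double cpi_exec, double accesses_per_instr,
                         double miss_rate, double penalty)
        {
            return cpi_exec + accesses_per_instr * miss_rate * penalty;
        }

        int main(void)
        {
            /* CPI example above: 1.0 CPI, 1.5 accesses/instr, 2% misses,
               100-cycle penalty => 4.0 */
            printf("CPI with cache = %.1f\n", cpi_total(1.0, 1.5, 0.02, 100));
            /* 1-way example above: 1-cycle hit, 1.4% misses, 75 ns
               penalty => 2.05 ns */
            printf("AMAT (1-way)   = %.2f ns\n", amat(1.0, 0.014, 75));
            return 0;
        }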


Reducing Cache Miss Penalty

    Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

    The time to handle a miss is becoming more and more the controlling
    factor, because of the great improvement in the speed of processors
    compared with the speed of memory.

    Five optimizations:
    1. Multilevel caches
    2. Critical word first and early restart
    3. Giving priority to read misses over writes
    4. Merging write buffer
    5. Victim caches


O1: Multilevel Caches

    Approaches:
        Make the cache faster, to keep pace with the speed of CPUs.
        Make the cache larger, to overcome the widening gap.
        L1: fast hits; L2: fewer misses.

    L2 equations:

        Average Memory Access Time
            = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
        Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

        Average Memory Access Time = Hit Time_L1
            + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
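Plugging made-up numbers into the L2 equations (a sketch for
    illustration only; all values are assumed):

        #include <stdio.h>

        int main(void)
        {
            double hit_l1 = 1.0,  miss_rate_l1 = 0.05;  /* in clock cycles */
            double hit_l2 = 10.0, miss_rate_l2 = 0.20;
            double mem_penalty = 100.0;

            double miss_penalty_l1 =
                hit_l2 + miss_rate_l2 * mem_penalty;           /* 30.0 */
            double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1; /* 2.5 */

            printf("L1 miss penalty = %.1f cycles\n", miss_penalty_l1);
            printf("AMAT            = %.1f cycles\n", amat);
            return 0;
        }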


Design of L2 Cache

    Size:
        Since everything in the L1 cache is likely to be in the L2 cache,
        L2 should be much bigger than L1.

    Whether data in L1 is also in L2:
        Novice approach: design L1 and L2 independently.
        Multilevel inclusion: L1 data are always present in L2.
            Advantage: easy consistency between I/O and cache (check L2 only).
            Drawback: L2 must invalidate all L1 blocks that map onto a
            replaced 2nd-level block => slightly higher 1st-level miss rate.
            e.g. Intel Pentium 4: 64-byte blocks in L1, 128-byte in L2.
        Multilevel exclusion: L1 data is never found in L2.
            A cache miss in L1 results in a swap of blocks between L1 and L2.
            Advantage: prevents wasting space in L2.
            e.g. AMD Athlon: 64 KB L1 and 256 KB L2.


O2: Critical Word First and Early Restart

    Don't wait for the full block to be loaded before restarting the CPU:
        Critical word first: request the missed word first from memory and
        send it to the CPU as soon as it arrives; the CPU continues
        execution while the rest of the words in the block are filled in.
        Also called wrapped fetch and requested word first.
        Early restart: as soon as the requested word of the block arrives,
        send it to the CPU and let the CPU continue execution.
            Given spatial locality, the CPU tends to want the next
            sequential word anyway, so the benefit of early restart is
            not clear-cut.

    Generally useful only with large blocks.


O3: Giving Priority to Read Misses over Writes

    Serve reads before outstanding writes have completed.

    With write through and write buffers, a read miss may need data still
    sitting in the write buffer. (The slide's example is cut off; the
    classic sequence, for a direct-mapped write-through cache in which
    512 and 1024 map to the same index, is:)

        SW R3, 512(R0)   ; M[512] <- R3   (cache index 0)
        LW R1, 1024(R0)  ; R1 <- M[1024]  (cache index 0)
        LW R2, 512(R0)   ; R2 <- M[512]   (cache index 0)

    If the write buffer has not yet updated M[512] when the last load
    misses, R2 may not receive the value of R3: a read-after-write hazard
    through memory. Solutions: wait for the write buffer to drain on a
    read miss, or check the write buffer contents and let the read miss
    continue if there is no conflict.


O4: Merging Write Buffer

    If the write buffer is empty, the data and the full address are
    written into the buffer, and the write is finished from the CPU's
    perspective. Usually a write buffer holds multiple words per entry.

    Write merging: the addresses in the write buffer are checked to see
    whether the address of the new data matches the address of a valid
    write buffer entry. If so, the new data are combined with that entry.

    [Figure: a write buffer with 4 entries, each holding four 64-bit
    words. Left: without merging, four sequential writes occupy four
    entries. Right: the four writes are merged into a single entry.]

    Writing multiple words at the same time is faster than writing
    multiple times.
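A minimal sketch of the merging check (invented sizes and names; each
    entry covers an aligned four-word region, as in the figure):

        #include <stdint.h>

        #define ENTRIES     4
        #define ENTRY_WORDS 4                   /* four 64-bit words/entry */

        typedef struct {
            int      valid;
            uint64_t base;                      /* 32-byte-aligned address */
            uint64_t word[ENTRY_WORDS];
            int      word_valid[ENTRY_WORDS];
        } wbuf_entry;

        static wbuf_entry wbuf[ENTRIES];

        /* Returns 1 if buffered (merged or new entry), 0 if the buffer is
           full and the CPU must stall. */
        int buffer_write(uint64_t addr, uint64_t data)
        {
            uint64_t base = addr & ~(uint64_t)31;
            int      w    = (int)((addr >> 3) & (ENTRY_WORDS - 1));

            for (int i = 0; i < ENTRIES; i++)   /* write-merging check */
                if (wbuf[i].valid && wbuf[i].base == base) {
                    wbuf[i].word[w] = data;
                    wbuf[i].word_valid[w] = 1;
                    return 1;
                }
            for (int i = 0; i < ENTRIES; i++)   /* otherwise, a free entry */
                if (!wbuf[i].valid) {
                    wbuf[i] = (wbuf_entry){ .valid = 1, .base = base };
                    wbuf[i].word[w] = data;
                    wbuf[i].word_valid[w] = 1;
                    return 1;
                }
            return 0;
        }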


O5: Victim Caches

    Idea of recycling: remember what was most recently discarded on a
    cache miss, in case it is needed again, rather than simply discarding
    it or swapping it into L2.

    Victim cache: a small, fully associative cache between a cache and its
    refill path.
        It contains only blocks that are discarded from the cache because
        of a miss ("victims").
        It is checked on a miss before going to the next lower-level memory.
        Victim caches of 1 to 5 entries are effective at reducing misses,
        especially for small, direct-mapped data caches.
        AMD Athlon: 8 entries.


Reducing Miss Rate

    3 Cs of Cache Misses:

    Compulsory: the first access to a block cannot be in the cache, so the
    block must be brought into the cache. Also called cold-start misses or
    first-reference misses.
        (Misses in even an infinite cache.)

    Capacity: if the cache cannot contain all the blocks needed during
    execution of a program, capacity misses will occur because blocks are
    discarded and later retrieved.
        (Misses in a fully associative cache of size X.)

    Conflict: if the block-placement strategy is set associative or direct
    mapped, conflict misses (in addition to compulsory and capacity
    misses) will occur, because a block can be discarded and later
    retrieved if too many blocks map to its set. Also called collision
    misses or interference misses.
        (Misses in an N-way associative cache that would hit in a fully
        associative cache of the same size X.)


3Cs: Absolute Miss Rates (SPEC92)

    [Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to
    128 KB) for 1-way, 2-way, 4-way, and 8-way caches. The conflict
    component shrinks with associativity, the capacity component with
    size, and the compulsory component is vanishingly small.]

    2:1 Cache Rule:
        miss rate of a 1-way associative cache of size X
        = miss rate of a 2-way associative cache of size X/2


3Cs: Relative Miss Rates

    [Figure: the same data plotted as a percentage of the total miss rate
    (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way
    caches, showing the conflict, capacity, and compulsory shares.]

    Flaw: assumes a fixed block size.
    Good: the insight leads to invention.


Five Techniques to Reduce Miss Rate

    1. Larger block size
    2. Larger caches
    3. Higher associativity
    4. Way prediction and pseudoassociative caches
    5. Compiler optimizations


O1: Larger Block Size

    [Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for
    cache sizes from 1 KB to 256 KB.]

    Take advantage of spatial locality:
        The larger the block, the greater the chance parts of it will be
        used again.
        The number of blocks is reduced for a cache of the same size
        => increased miss penalty.
        It may increase conflict misses, and even capacity misses if the
        cache is small.
        High memory latency and high bandwidth encourage large block sizes.


O2: Larger Caches

    Increasing the capacity of the cache reduces capacity misses
    (Figures 5.14 and 5.15).
        Cost: potentially longer hit time and higher price.
        Trend: larger L2 or L3 off-chip caches.

    [Figure: the absolute miss-rate plot again; the capacity component
    shrinks as cache size grows from 1 KB to 128 KB.]


O3: Higher Associativity

    Figures 5.14 and 5.15 show how miss rates improve with higher
    associativity:
        8-way set associative is as effective as fully associative for
        practical purposes.
        2:1 Cache Rule: the miss rate of a direct-mapped cache of size N
        equals the miss rate of a 2-way set-associative cache of size N/2.

    Tradeoff: a more associative cache complicates the circuit and may
    lengthen the clock cycle.
        Beware: execution time is the only final measure!
        Will the clock cycle time increase as a result of a more
        complicated cache? Hill [1988] suggested the hit time for 2-way
        vs. 1-way is about +10% for an external cache and +2% for an
        internal one.


O4: Way Prediction & Pseudoassociative Caches

    Way prediction: extra bits are kept in the cache to predict the way,
    or block within the set, of the next cache access.
        Example: the 2-way I-cache of the Alpha 21264.
            If the predictor is correct, I-cache latency is 1 clock cycle.
            If incorrect, it tries the other block, changes the way
            predictor, and has a latency of 3 clock cycles.
        Prediction accuracy is in excess of 85%.
        Reduces conflict misses while maintaining the hit speed of a
        direct-mapped cache.

    Pseudoassociative (column associative) caches:
        On a miss, a second cache entry is checked before going to the
        next lower level: one fast hit and one slow hit.
        Invert the most significant bit of the index to find the other
        block in the "pseudoset".
        The miss penalty may become slightly longer.


O5: Compiler Optimizations

    Improve the hit rate by compile-time optimization.

    Reordering instructions using profiling information (McFarling [1989]):
        Reduced misses by 50% for a 2 KB direct-mapped I-cache with 4-byte
        blocks, and by 75% for an 8 KB cache.
        Best performance when it was possible to prevent some instructions
        from entering the cache at all.

    Aligning basic blocks: the entry point is placed at the beginning of a
    cache block.
        Decreases the chance of a cache miss for sequential code.

    Loop interchange: exchanging the nesting of loops.
        Improves spatial locality => reduces misses.
        Makes data be accessed in the order they are stored, maximizing
        use of the data in a cache block before it is discarded. (The
        slide's code is cut off here; it is completed following the
        classic textbook example.)

    /* Before: the inner loop strides down a column, touching a new
       cache block on nearly every access */
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];
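Interchanging the loops makes the inner loop sweep along a row, so
    successive accesses hit adjacent words (the bounds below again follow
    the classic textbook version of this example):

    /* After: row-first traversal; unit-stride access through x */
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];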


Blocking: operating on submatrices or blocks.
        Maximizes accesses to the data loaded into the cache before it is
        replaced => improves temporal locality.

    Example: X = Y * Z (the slide's code is cut off here; it is completed
    following the classic textbook example)

    /* Before: each row of z is swept N times and may be evicted
       between uses */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }
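A sketch of the blocked version with blocking factor B (reconstructed
    along the lines of the classic textbook example; it assumes x starts
    zeroed and a MIN macro such as
    #define MIN(a,b) ((a) < (b) ? (a) : (b))):

    /* After: operate on B x B submatrices so the touched pieces of x, y,
       and z fit in the cache together */
    for (jj = 0; jj < N; jj += B)
        for (kk = 0; kk < N; kk += B)
            for (i = 0; i < N; i++)
                for (j = jj; j < MIN(jj + B, N); j++) {
                    r = 0;
                    for (k = kk; k < MIN(kk + B, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }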


5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

    Three techniques that overlap instruction execution with memory
    accesses:

    1. Nonblocking caches to reduce stalls on cache misses
       (to match out-of-order processors)
    2. Hardware prefetching of instructions and data
    3. Compiler-controlled prefetching


O2: Hardware Prefetching of Instructions and Data

    Prefetch instructions or data before they are requested by the CPU:
        either directly into the caches, or into an external buffer that
        is faster to access than main memory.

    Instruction prefetch is frequently done in hardware outside the cache:
        Fetch two blocks on a miss:
            the requested block is placed in the I-cache when it returns;
            the prefetched block is placed in an instruction stream
            buffer (ISB).
        A single ISB would catch 15% to 25% of the misses from a 4 KB
        direct-mapped I-cache with 16-byte blocks; four stream buffers
        increased the data hit rate to 43% (Jouppi 1990).

    UltraSPARC III: data prefetch.
        If a load hits in the prefetch cache:
            the block is read from the prefetch cache;
            the next prefetch request is issued, calculating the stride of
            the next prefetched block from the difference between the
            current and previous addresses.
        Up to 8 simultaneous prefetches.

    Prefetching may interfere with demand misses, lowering performance.


O3: Compiler-Controlled Prefetching

    Register prefetch: load the value into a register.
    Cache prefetch: load the data only into the cache (not a register).

    Faulting vs. nonfaulting: the prefetch address does or does not cause
    an exception for virtual address faults and protection violations.
        A normal load instruction is, in effect, a faulting register
        prefetch instruction.

    The most effective prefetch is semantically invisible to the program:
        it doesn't change the contents of registers or memory, and
        it cannot cause virtual memory faults.
        Nonbinding prefetch = nonfaulting cache prefetch.
    Overlapping execution: the CPU proceeds while the prefetched data are
    being fetched.

    Advantage: the compiler can avoid the unnecessary prefetches that pure
    hardware schemes may issue.
    Drawback: prefetch instructions incur instruction overhead.
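As an illustration (not from the slides), GCC and Clang expose a
    nonbinding, nonfaulting cache prefetch as the __builtin_prefetch
    intrinsic; the prefetch distance of 8 below is a made-up tuning
    parameter:

    /* Prefetch a few iterations ahead while scaling an array. */
    void scale(double *x, int n, double a)
    {
        for (int i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&x[i + 8], 1, 1); /* rw=write, low locality */
            x[i] = a * x[i];
        }
    }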



5.7 Reducing Hit Time

    Importance of cache hit time:

        Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

        More importantly, cache access time limits the clock cycle rate in
        many processors today!

    Fast hit time: quickly and efficiently find out whether the data is in
    the cache, and if it is, get it out of the cache.

    Four techniques:
    1. Small and simple caches
    2. Avoiding address translation during indexing of the cache
    3. Pipelined cache access
    4. Trace caches


O1: Small and Simple Caches

    A time-consuming portion of a cache hit is using the index portion of
    the address to read the tag memory and then compare it to the address.

    Guideline: smaller hardware is faster.
        Why does the Alpha 21164 have an 8 KB instruction cache and an
        8 KB data cache plus a 96 KB second-level cache? A small data
        cache allows a fast clock rate.
    Guideline: simpler hardware is faster.
        Direct mapped, on chip.

    General design: a small, simple first-level cache; for second-level
    caches, keep the tags on chip and the data off chip.
    The recent emphasis is on fast clock time while hiding L1 misses with
    dynamic execution and using L2 caches to avoid going to memory.


O2: Avoiding Address Translation: Virtually Indexed, Physically Tagged Cache

    [Figure: three organizations of the CPU, translation buffer (TB),
    cache ($), and memory (MEM):
        Conventional organization: the VA goes through the TB first; the
        cache is indexed and tagged with the PA.
        Virtually addressed cache: the cache is accessed with the VA and
        translation happens only on a miss; suffers from the synonym
        problem.
        Virtually indexed, physically tagged: the cache is indexed with
        the VA in parallel with translation in the TB, and the PA is
        compared against the physical tags; this overlap requires the
        cache index to remain invariant across translation. An L2 cache
        sits below.]


O3: Pipelined Cache Access

    Simply pipeline the cache access:
        a first-level cache hit then takes multiple clock cycles.
        Benefit: a fast cycle time, at the cost of slow hits.

    Example: clock cycles to access instructions from the I-cache:
        Pentium: 1 clock cycle
        Pentium Pro through Pentium III: 2 clocks
        Pentium 4: 4 clocks

    Drawbacks: increasing the number of pipeline stages leads to
        a greater penalty on mispredicted branches, and
        more clock cycles between the issue of a load and the use of its
        data.

    Note that pipelining increases the bandwidth of instruction fetch
    rather than decreasing the actual latency of a cache hit.


O4: Trace Caches

    A trace cache for instructions finds a dynamic sequence of
    instructions, including taken branches, to load into a cache block.
        The cache blocks contain dynamic traces of the executed
        instructions as determined by the CPU, rather than static
        sequences of instructions as laid out in memory.
        Branch prediction is folded into the cache: predictions are
        validated along with the addresses for a fetch to be valid.
        e.g. the Intel NetBurst microarchitecture.

    Advantage: better utilization of cache space.
        Trace caches store instructions only from the branch entry point
        to the exit of the trace.
        In a conventional I-cache, the unused portions of a long block
        that a taken branch enters or exits mid-block still occupy space.

    Downside: the same instructions may be stored multiple times, in
    different traces.