Advanced Computer Architecture - Peking University Microprocessor ...mprc.pku.edu.cn/courses/architecture/spring2017/chx16_arch07_cache… · Memory Hierarchy: Apple iMac G5 iMac G5 1.6 GHz

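Department of Computer Science and Technology, Peking University
Microprocessor Research and Development Center, Peking University

Advanced Computer Architecture

Cache Memory

April 25, 2016

Cheng Xu (程旭)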


Memory Hierarchy

Take advantage of the principle of locality to:

• Present as much memory as is available in the cheapest technology

• Provide access at the speed offered by the fastest technology

[Figure: the memory hierarchy, from fastest to slowest: processor (control, datapath, registers, on-chip cache); second-level cache (SRAM); main memory (DRAM/FLASH/PCM); secondary storage (disk/FLASH/PCM); tertiary storage (tape/cloud storage).]


The microprocessor-to-main-memory (DRAM) latency gap

[Figure: performance (1/latency) of processors and DRAM over time. The gap grew 50% per year.]

How do architects address this gap?

• Put small, fast “cache” memories between CPU and DRAM.

• Create a “memory hierarchy”.


1977: DRAM faster than microprocessors

Apple (1977)

Steve Wozniak, Steve Jobs

CPU: 1000 ns

DRAM: 400 ns


Since then: Technology scaling ...

Circuit in 250 nm technology (introduced in 2000): L nanometers long.

Same circuit in 180 nm technology (introduced in 2003): 0.7 x L nm.

Each dimension is 30% smaller, so the area is 50% smaller.

Logic circuits use smaller C’s, lower Vdd, and higher kn and kp to speed up clock rates.


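Reducing the processor-memory performance gap

| Processor | Area share (cost) | Transistor share (power) |
|---|---|---|
| Alpha 21164 | 37% | 77% |
| StrongArm SA110 | 61% | 94% |
| Pentium Pro | 64% | 88% |

2 dies per package: Proc/I$/D$ + L2$

Caches have no inherent value of their own; they are only a means of narrowing the processor-memory performance gap.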


Floorplan of the Alpha 21264 (1999)


Alpha microprocessors

Time of a full cache miss, in instructions executed:

1st Alpha: 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions

2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions

3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions

1/2X latency x 3X clock rate x 3X Instr/clock => ?X


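Four questions for memory hierarchy design

Q1: Where can a block be placed in the upper level? (Block placement)

Fully associative, set associative, direct mapped

Q2: How is a block found if it is in the upper level? (Block identification)

Tag/block

Q3: Which block should be replaced on a miss? (Block replacement)

Random, LRU, FIFO

Q4: What happens on a write? (Write strategy)

Write back or write through (with a write buffer)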


Memory Hierarchy: Apple iMac G5

iMac G5, 1.6 GHz, $1299.00

| | Reg | L1 Inst | L1 Data | L2 | DRAM | Disk |
|---|---|---|---|---|---|---|
| Size | 1K | 64K | 32K | 512K | 256M | 80G |
| Latency (cycles) | 1 | 3 | 3 | 11 | 160 | 10M |

Managed by compiler: registers. Managed by hardware: L1 and L2 caches. Managed by OS, hardware, application: DRAM and disk.

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

Goal: Illusion of large, fast, cheap memory


Power 7 On-Chip Caches [IBM 2009]

• 32KB L1 I$/core, 32KB L1 D$/core: 3-cycle latency

• 256KB Unified L2$/core: 8-cycle latency

• 32MB Unified Shared L3$, Embedded DRAM (eDRAM): 25-cycle latency to local slice



Latency: A closer look

| | Reg | L1 Inst | L1 Data | L2 | DRAM | Disk |
|---|---|---|---|---|---|---|
| Size | 1K | 64K | 32K | 512K | 256M | 80G |
| Latency (cycles) | 1 | 3 | 3 | 11 | 160 | 1E+07 |
| Latency (sec) | 0.6 ns | 1.9 ns | 1.9 ns | 6.9 ns | 100 ns | 12.5 ms |
| Hz | 1.6G | 533M | 533M | 145M | 10M | 80 |

Read latency: Time to return the first byte of a random access.

Architect’s latency toolkit:

(1) Parallelism. Request data from N 1-bit-wide memories at the same time. This overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth.

(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle and receive it N cycles later.


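Cache performance

$$\text{CPU time} = (\text{CPU execution clock cycles} + \text{Memory stall clock cycles}) \times \text{Clock cycle time}$$

$$\text{Memory stall clock cycles} = \text{Reads} \times \text{Read miss rate} \times \text{Read miss penalty} + \text{Writes} \times \text{Write miss rate} \times \text{Write miss penalty}$$

$$\text{Memory stall clock cycles} = \text{Memory accesses} \times \text{Miss rate} \times \text{Miss penalty}$$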


Recall: The Performance Equation

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

We need all three terms, and only these terms, to compute CPU time!

“CPI”: the average number of clock cycles per instruction for the program.

What factors make different programs have different CPIs?

• Instruction mix varies.

• Cache behavior varies.

• Branch prediction varies.


Recall: CPI as a tool to guide design

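Machine CPI (throughput, not latency):

$$\text{CPI} = \frac{5 \times 30 + 1 \times 20 + 2 \times 20 + 2 \times 10 + 2 \times 20}{100} = 2.7 \text{ cycles/instruction}$$

[Figure: two charts for the program: its instruction mix, and where the program spends its time.]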


AMAT: Average Memory Access Time

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

The last slide computed the machine CPI assuming a constant memory access time.

True CPI depends on the Average Memory Access Time (AMAT) for instructions and data.

AMAT = Hit Time + (Miss Rate x Miss Penalty)

Goal: Reduce AMAT

Beware! Improving one term may hurt other terms, and increase AMAT!

True CPI = Ideal CPI + Memory Stall Cycles. See Appendix B.2 of CA-AQA for details.


Programs with locality cache well ...

[Figure: memory address (one dot per access) plotted against time, with regions of spatial locality, temporal locality, and bad locality marked. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]

Q. Point out bad locality behavior ...


The caching algorithm in one slide

Temporal locality: Keep most recently accessed data closer to processor.

Spatial locality: Move contiguous blocks in the address space to upper levels.


Caching terminology

Hit: Data appears in an upper-level block (ex: Blk X).

Miss: Data retrieval from the lower level is needed (ex: Blk Y).

Hit Rate: The fraction of memory accesses found in the upper level.

Miss Rate: 1 - Hit Rate

Hit Time: Time to access the upper level. Includes the hit/miss check.

Miss Penalty: Time to replace a block in the upper level + deliver it to the CPU.

Hit Time << Miss Penalty


Example: A Direct Mapped Cache

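Address fields: Cache Tag (25 bits, address bits 31..7) | Index (bits 6..5) | Byte Select (bits 4..0)

[Figure: a direct-mapped cache. Each entry holds a valid bit, a cache tag (bits 24..0), and 32 bytes of cache data (Byte 31 ... Byte 1, Byte 0). The index (ex: 0x01) selects one entry, the stored tag is compared (=) with the address tag to produce Hit, and the byte select (ex: 0x00) returns the byte(s) of a “hit” cache line.]

PowerPC 970: 64K direct-mapped Level-1 I-cache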


Hybrid Design: Set Associative Cache

Address fields: Cache Tag (26 bits) | Index (2 bits) | Byte Select (4 bits)

Cache block halved (to 16 bytes) to keep # of cached bits constant.

[Figure: two ways, each with valid bits, cache tags, and 16-byte cache blocks. The index (ex: 0x01) selects one entry in each way; both stored tags are compared (=) in parallel to produce HitLeft and HitRight, and the bytes of the “hit” set member are returned.]

“N-way” set associative: N is the number of blocks for each color.

PowerPC 970: 32K 2-way set associative L1 D-cache


Separate instruction and data caches?

[Figure: misses per 1000 instructions. Figure B.6 from CA-AQA. Data for a 2-way set associative cache with 64-byte blocks for DEC Alpha.]

Note: The extraordinary effectiveness of large instruction caches ...

Compare 2^k separate I & D to 2^(k+1) unified ... arrows mark crossover.


Unified vs Split Caches

• Unified vs Separate I&D

[Figure: (left) Proc with I-Cache-1 and D-Cache-1, backed by Unified Cache-2; (right) Proc with Unified Cache-1, backed by Unified Cache-2.]

• Example:

– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%

– 32KB unified: Aggregate miss rate=1.99%

• Which is better (ignore L2 cache)?

– Assume 33% data ops, so 75% of accesses are from instructions (1.0/1.33)

– hit time=1, miss time=50

– Note that a data hit has 1 stall for the unified cache (only one port)

AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05

AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24


Effect of Cache Parameters on Performance

Larger cache size
+ reduces capacity and conflict misses
- hit time will increase

Higher associativity
+ reduces conflict misses
- may increase hit time

Larger line size
+ reduces compulsory and capacity (reload) misses
- increases conflict misses and miss penalty


6 basic cache optimizations

1. Larger block size to reduce miss rate

2. Bigger caches to reduce miss rate

3. Higher associativity to reduce miss rate

4. Multilevel caches to reduce miss penalty

5. Giving priority to read misses over writes to reduce miss penalty

6. Avoiding address translation during indexing of the cache to reduce hit time


Ten Advanced Optimizations of Cache Performance (5E)

• Reducing the hit time: Small and simple first-level caches and way prediction. Both techniques also generally decrease power consumption.

• Increasing cache bandwidth: Pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.

• Reducing the miss penalty: Critical word first and merging write buffers. These optimizations have little impact on power.

• Reducing the miss rate: Compiler optimizations. Obviously any improvement at compile time improves power consumption.

• Reducing the miss penalty or miss rate via parallelism: Hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily due to prefetched data that are unused.


Improving cache performance

Average memory access time = Hit time + Miss rate x Miss penalty

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache


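Reducing misses

Classifying misses: the 3 Cs

Compulsory: The first access to a block that is not in the cache, so the block must be brought in. Also called cold start misses or first reference misses. (Misses that occur even in an infinite cache)

Capacity: If the cache cannot contain all the blocks needed during execution of a program, blocks are discarded and later retrieved, which causes capacity misses. (Misses that also occur in a fully associative cache of finite size)

Conflict: If the block placement strategy is set associative or direct mapped, then in addition to compulsory and capacity misses, blocks are discarded and later retrieved because too many blocks map to the same set, which causes conflict misses. Also called collision misses or interference misses. (Misses that occur in an N-way set associative cache of finite size)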


3Cs absolute miss rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 to 128 KB), showing conflict misses for 1-way, 2-way, 4-way, and 8-way associativity together with the capacity and compulsory components.]

The compulsory miss rate is very low.


2:1 cache rule

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 to 128 KB), showing conflict misses for 1-way, 2-way, 4-way, and 8-way associativity together with the capacity and compulsory components.]

Miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2


3Cs relative miss rate

[Figure: share of misses per type (0% to 100%) vs. cache size (1 to 128 KB), showing conflict misses for 1-way, 2-way, 4-way, and 8-way associativity together with the capacity and compulsory components.]

Flaws: for fixed block size

Good: insight => invention


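How can misses be reduced?

3 Cs: Compulsory, Capacity, Conflict

In all cases, assume the total cache size is unchanged.

What happens if we:

1) Change block size: which of the 3Cs is obviously affected?

2) Change associativity: which of the 3Cs is obviously affected?

3) Change compiler: which of the 3Cs is obviously affected?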


1. Reducing misses via larger block size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes), with one curve per cache size: 1K, 4K, 16K, 64K, 256K.]


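2. Reducing misses via higher associativity

In practice, 8-way set associative is nearly as effective as fully associative at reducing the miss rate.

2:1 cache rule:

Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way cache of size N/2

Beware: execution time is the only final measure!

Will the clock cycle time increase?

Hill [1988] suggested that the hit time of a 2-way cache, compared with a 1-way cache, is +10% for an external cache and +2% for an internal cache.

Misses can also be reduced by increasing cache capacity.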


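Example: average memory access time vs. miss rate

Assume the clock cycle time, relative to direct mapped, is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way. Assume a miss penalty of 10 cycles.

| Cache size (KB) | 1-way | 2-way | 4-way | 8-way |
|---|---|---|---|---|
| 1 | 2.33 | 2.15 | 2.07 | 2.01 |
| 2 | 1.98 | 1.86 | 1.76 | 1.68 |
| 4 | 1.72 | 1.67 | 1.61 | 1.53 |
| 8 | 1.46 | 1.48 | 1.47 | 1.43 |
| 16 | 1.29 | 1.32 | 1.32 | 1.32 |
| 32 | 1.20 | 1.24 | 1.25 | 1.27 |
| 64 | 1.14 | 1.20 | 1.21 | 1.23 |
| 128 | 1.10 | 1.17 | 1.18 | 1.20 |

(Red on the slide marks average memory access times that are not improved by higher associativity.)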


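3. Reducing misses via a victim cache

How to combine the fast hit time of direct mapped and still avoid conflict misses?

Add a buffer (a small fully associative cache) that holds data discarded from the cache.

Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache.

Used in Alpha and HP machines.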


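[Figure: victim cache organization. A small fully associative buffer, each entry holding a tag with comparator and one cache line of data (TAGS, DATA), sits between the cache and the next lower level in the hierarchy.]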


First idea: Miss cache

Miss Cache: Small. Fully-associative. LRU replacement. Checked in parallel with the L1 cache.

If L1 and miss cache both miss, the data block returned by the next-lower cache is placed in L1 and miss cache.

[Figure: L1 cache with a “miss cache” beside it.]


Second idea: Victim cache

Victim Cache: Small. Fully-associative. LRU replacement. Checked in parallel with the L1 cache.

If L1 and victim cache both miss, the data block removed from the L1 cache is placed into the victim cache.

[Figure: L1 cache with a “victim cache” beside it.]


[Figure: two panels, Miss Cache and Victim Cache. % of conflict misses removed, plotted vs. number of {miss, victim} cache entries. Each symbol is a benchmark. {Solid, dashed} line is L1 {I, D}.]


Third idea: Streaming prefetch buffer (FIFO)

Prefetch buffer: Small FIFO of cache lines and tags.

Check head of FIFO in parallel with L1 cache access.

If both miss, fetch missed block “k” for L1. Clear FIFO.

Prefetch blocks “k + 1”, “k + 2”, ... and place in FIFO tail.

[Figure: L1 cache with a prefetch FIFO beside it.]


Fourth idea: Multi-way streaming prefetch buffer

Multi-way buffer: 4-FIFO version of the original design.

Allows block streams for 4 misses to proceed in parallel.

If an access misses L1 and all FIFOs, clear LRU FIFO.

[Figure: L1 cache with four FIFOs beside it.]


[Figure: two panels, Single-Way Buffer and Multi-Way Buffer. % of all misses removed, plotted vs. number of prefetches that follow a miss. Each symbol is a benchmark. {Solid, dashed} line is L1 {I, D}.]


Complete memory system

4-way data victim cache + 4-way data stream buffer. Also: instruction stream buffer.

[Figure: performance of the baseline system and of the enhanced system (purple line); the green line shows how a system with perfect caching performs.]


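4. Reducing misses via pseudo-associativity

How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set associative cache?

Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, it is a pseudo-hit (slow hit).

[Figure: time line showing hit time, then pseudo-hit time, then miss penalty.]

Drawback: the CPU pipeline is hard to design if a hit can take either 1 or 2 cycles.

Better for caches not tied directly to the processor (L2).

Used in the L2 caches of the MIPS R10000 and UltraSPARC.

Related: way prediction.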




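5. Reducing misses by hardware prefetching of instructions and data

E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss.

The extra block is placed in a stream buffer.

On a miss, check the stream buffer.

The same strategy works for data blocks:

Jouppi [1990]: for a 4 KB cache, 1 data stream buffer removed 25% of the misses; 4 stream buffers removed 43%.

Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers removed 50% to 70% of the misses from two 64 KB, 4-way set associative caches.

Prefetching relies on extra memory bandwidth that can be used without penalty.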


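6. Reducing misses by software prefetching of data

Data prefetch: load the data into a register (HP PA-RISC loads).

Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9).

Special prefetching instructions cannot cause faults; a form of speculative execution.

Issuing prefetch instructions takes time: is the cost of issuing prefetches less than the savings from reduced misses?

Wider superscalar issue reduces the difficulty of finding issue bandwidth.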


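7. Reducing misses by compiler optimizations

McFarling [1989]: on an 8 KB direct-mapped cache with 4-byte blocks, software reduced misses by 75%.

Instructions

• Reorder memory accesses so as to reduce conflict misses

• Profiling to look at conflicts (using tools they developed)

Data

• Merging arrays: improve spatial locality by combining two independent arrays into a single array of compound elements

• Loop interchange: change the nesting of loops to access data in the order it is stored in memory

• Loop fusion: combine two independent loops that have the same looping and some overlapping variables

• Blocking: improve temporal locality by accessing “blocks” of data repeatedly instead of going down whole rows or columns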


Merging arrays example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.


Loop interchange example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.


Loop fusion example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves temporal locality, since a[i][j] and c[i][j] are reused while still in the cache.


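Blocking example

The two inner loops:

• Read all N x N elements of z[]

• Read the N elements of one row of y[] repeatedly

• Write the N elements of one row of x[]

Capacity misses are a function of N and the cache size: if 3 x N x N x 4 bytes fit in the cache => no capacity misses; otherwise ...

Idea: compute on a B x B submatrix that fits

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }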


Blocking example (continued)

/* After; x[][] must be zero-initialized because partial sums accumulate */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

B is called the blocking factor.

Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2.

Does blocking also reduce conflict misses?


Reducing conflict misses by blocking

Conflict misses in caches that are not fully associative vs. blocking size:

Lam et al [1991]: a blocking factor of 24 had a fifth the misses of a factor of 48, despite both fitting in the cache.

[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache.]

Summary of compiler optimizations to reduce cache misses

[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).]


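Summary

3 Cs: Compulsory, Capacity, Conflict

Reducing the miss rate:

1. Reducing misses via larger block size

2. Reducing misses via higher associativity

3. Reducing misses via a victim cache

4. Reducing misses via pseudo-associativity

5. Reducing misses by hardware prefetching of instructions or data

6. Reducing misses by software prefetching of data

7. Reducing misses by compiler optimizations

Note: focusing on only one parameter when evaluating performance is dangerous.

$$\text{CPU time} = IC \times \left( CPI_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$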


Improving cache performance (continued)

Average memory access time = Hit time + Miss rate x Miss penalty

1. Reduce the miss rate




1. Reducing miss penalty: read priority over write on miss

Write through with write buffers creates RAW conflicts with main memory reads on cache misses.

• Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000).

• Check the write buffer contents before the read; if there are no conflicts, let the memory access continue.

Write back?

• Read miss replacing a dirty block.

• Normal: write the dirty block to memory, and then do the read.

• Instead: copy the dirty block to a write buffer, then do the read, and then do the write.

• The CPU stalls less since it restarts as soon as the read is done.


1.1 Read priority over write on miss

1.2 Merging write buffer

[Figure: a write buffer sits between the CPU (in) and DRAM or lower memory (out).]


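2. Reducing miss penalty: subblock placement

Don’t have to load the full block on a miss.

Have valid bits per subblock to indicate which subblocks are valid.

(Originally invented to reduce tag storage.)

[Figure: a cache block divided into subblocks, each with its own valid bit.]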


3. Reducing miss penalty: early restart and critical word first

Don’t wait for the full block to be loaded before restarting the CPU.

Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

Critical word first: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Generally useful only in large blocks.

Spatial locality is a problem: programs tend to want the next sequential word, so it is not clear whether early restart helps.


4. Reducing miss penalty: non-blocking caches to reduce stalls on misses

A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss.

• Requires an out-of-order execution CPU.

“Hit under miss” reduces the effective miss penalty by working during a miss instead of ignoring CPU requests.

“Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses.

• Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.

• Requires multiple memory banks (otherwise it cannot be supported).

• Pentium Pro allows 4 outstanding memory misses.


Hit under miss for SPEC

[Figure: average memory access time (0 to 2) for hit under i misses (series 0->1, 1->2, 2->64, and Base), grouped into integer and floating-point benchmarks: eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora.]

Floating-point programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26

Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19

8 KB data cache, direct mapped, 32-byte blocks, 16-cycle miss penalty


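5. Second-level cache

L2 equations:

AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1

Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:

• Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)

• Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)

• The global miss rate is what matters.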


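Comparing local and global miss rates

L1 cache: 32 KB; increasing L2 cache size.

[Figure: local and global miss rates vs. L2 cache size, shown once with a linear and once with a log cache-size axis.]

• The global miss rate is close to the single-level cache miss rate of the L2, provided L2 >> L1 (in size).

• Don’t use the local miss rate for the L2 cache.

• The L2 cache is not tied to the CPU clock cycle!

• Cost and average memory access time

• Generally: fast hit times and fewer misses.

• Since hits are few, target miss reduction.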


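Reducing miss penalty: which techniques apply to the L2 cache?

Reducing the miss rate:

1. Reducing misses via larger block size

2. Reducing misses via higher associativity

3. Reducing misses via a victim cache

4. Reducing misses via pseudo-associativity

5. Reducing misses by hardware prefetching of instructions or data

6. Reducing misses by software prefetching of data

7. Reducing misses by compiler optimizations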


L2 cache block size and average memory access time

| L2 block size (bytes) | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|
| Relative CPU time | 1.36 | 1.28 | 1.27 | 1.34 | 1.54 | 1.95 |

32 KB L1; 8-byte path to memory


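Reducing miss penalty: summary

$$\text{CPU time} = IC \times \left( CPI_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

Five techniques:

• Read priority over write on miss; merging write buffer

• Subblock placement

• Early restart and critical word first on miss

• Non-blocking caches (hit under miss, miss under miss)

• Second-level cache

These can be applied to multilevel caches.

• Problem: the time to DRAM may grow with the number of cache levels.

• An out-of-order CPU can hide an L1 data cache miss, but stalls on an L2 cache miss.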


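Cache optimization summary

| Technique | MR | MP | HT | Complexity |
|---|---|---|---|---|
| Miss rate | | | | |
| Larger block size | + | - | | 0 |
| Higher associativity | + | | - | 2 |
| Victim caches | + | | | 2 |
| Pseudo-associative caches | + | | | 2 |
| HW prefetching of instr/data | + | | | 2 |
| Compiler-controlled prefetching | + | | | 3 |
| Compiler reduce misses | + | | | 0 |
| Miss penalty | | | | |
| Priority to read misses | | + | | 1 |
| Subblock placement | | + | + | 1 |
| Early restart & critical word first | | + | | 2 |
| Non-blocking caches | | + | | 3 |
| Second-level caches | | + | | 2 |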


Improving cache performance

Average memory access time = Hit time + Miss rate x Miss penalty

1. Reduce the miss rate




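1. Fast hit times via small and simple caches

Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache + a 96 KB second-level cache?

Small data cache and clock rate

Direct mapped, on chip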


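2. Fast hits by avoiding address translation

Send the virtual address to the cache? This is called a virtually addressed cache, or just virtual cache, as opposed to a physical cache.

• Every time the process is switched, the cache logically must be flushed; otherwise there will be false hits. The cost is the time to flush + the “compulsory” misses from the empty cache.

• Aliases (also called synonyms) must be handled: two different virtual addresses map to the same physical address.

• I/O must interact with the cache, so it needs virtual addresses.

Solution to aliases: the hardware guarantees that the index field is covered and the cache is direct mapped, so the aliased block is unique. Called “page coloring”.

Solution to cache flushes: add a process identifier tag that identifies the process as well as the address within the process; there can be no hit if the process is wrong.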


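Virtually addressed caches

[Figure: three organizations.

(1) Conventional organization: CPU -> (VA) -> TB -> (PA) -> $ -> (PA) -> MEM.

(2) Virtually addressed cache: CPU -> (VA) -> $ -> (VA) -> TB -> (PA) -> MEM. Translate only on a miss; alias (synonym) problem.

(3) Overlap cache access with virtual address translation: the CPU sends the VA to the $ (with PA tags) and the TB in parallel, backed by an L2 $ and MEM. Requires the cache index to remain invariant across translation.]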


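2. Fast cache hits by avoiding translation: effect of process identifiers

[Figure: miss rate (y-axis, up to 20%) vs. cache size (x-axis, 2 KB to 1024 KB). Black: single process. Light gray: multiple processes when flushing the cache. Dark gray: multiple processes when using process identifier tags.]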


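2. Fast cache hits by avoiding translation: index with the physical part of the address

Address layout: Page Address | Page Offset, aligned against Address Tag | Index | Block Offset

If the index is taken from the physical part of the address (the page offset), tag access can start in parallel with translation, and the result can then be compared with the physical tag.

This limits the cache to the page size. What if a bigger cache is needed?

• Higher associativity moves the boundary between TAG and INDEX to the right

• Page coloring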


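3. Fast hit time by pipelining writes

Split the tag check and the cache update into separate pipeline stages: the tag check of the current write overlaps the cache update of the previous write.

Only STOREs are in the pipeline; it is emptied during a miss.

Store r2, (r1)    Check r1
Add               --
Sub               --
Store r4, (r3)    M[r1] <- r2 & check r3

The shaded part is the “delayed write buffer”; it must be checked on reads.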


4. Trace Cache in Pentium 4

How to supply enough instructions every cycle without dependencies?

Instead of limiting the instructions in a static cache block to spatial locality, a trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block.

This needs much more complicated address mapping mechanisms.

Trace caches store the same instructions multiple times in the instruction cache.




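Cache optimization summary

| Technique | MR | MP | HT | Complexity |
|---|---|---|---|---|
| Miss rate | | | | |
| Larger Block Size | + | - | | 0 |
| Higher Associativity | + | | - | 1 |
| Victim Caches | + | + | | 2 |
| Pseudo-Associative Caches | + | | | 2 |
| HW Prefetching of Instr/Data | + | + | | 2 |
| Compiler Controlled Prefetching | + | + | | 3 |
| Compiler Reduce Misses | + | | | 0 |
| Miss penalty | | | | |
| Priority to Read Misses | | + | | 1 |
| Subblock Placement | | + | + | 1 |
| Early Restart & Critical Word 1st | | + | | 2 |
| Non-Blocking Caches | | + | | 3 |
| Second Level Caches | | + | | 2 |
| Hit time | | | | |
| Small & Simple Caches | - | | + | 0 |
| Avoiding Address Translation | | | + | 2 |
| Pipelining Writes | | | + | 1 |
| Trace Cache | | | + | 3 |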




What this material means for cache performance

[Figure: processor vs. DRAM performance (log scale, 1 to 1000), 1980 to 2000. 1960-1985: Speed = f(no. operations). 1990: pipelined execution & fast clock rate, out-of-order execution, superscalar instruction issue. 1998: Speed = f(non-cached memory accesses).]

What does this mean for compilers, operating systems, algorithms, and data structures?


4th E



(1) Small and Simple First-Level Caches to Reduce Hit Time and Power

• Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time.

• Lower levels of associativity will usually reduce power because fewer cache lines must be accessed.

• In recent designs, there are three other factors that have led to the use of higher associativity in first-level caches:

– Many processors take at least two clock cycles to access the cache, and thus the impact of a longer hit time may not be critical.

– To keep the TLB out of the critical path (a delay that would be larger than that associated with increased associativity), almost all L1 caches should be virtually indexed.

– With the introduction of multithreading, conflict misses can increase, making higher associativity more attractive.


Figure 2.4 Energy consumption per read increases as cache size and associativity are increased. The large penalty for eight-way set associative caches is due to the cost of reading out eight tags and the corresponding data in parallel.


(2) Way Prediction to Reduce Hit Time

• Way prediction reduces conflict misses and yet maintains the hit speed of a direct-mapped cache.

• The prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle, in parallel with reading the cache data.

• A miss results in checking the other blocks for matches in the next clock cycle.

• Simulations suggest that set prediction accuracy is in excess of 90% for a two-way set associative cache and 80% for a four-way set associative cache, with better accuracy on I-caches than D-caches.


Way selection to save power

• An extended form of way prediction can also be used to reduce power consumption by using the way prediction bits to decide which cache block to actually access (the way prediction bits are essentially extra address bits).

• This approach, which might be called way selection, saves power when the way prediction is correct but adds significant time on a way misprediction, since the access, not just the tag match and selection, must be repeated.

• Such an optimization is likely to make sense only in low-power processors.


(3) Pipelined Cache Access to Increase Cache Bandwidth

• The idea is simply to pipeline cache access so that the effective latency of a first-level cache hit can be multiple clock cycles, giving fast clock cycle time and high bandwidth but slow hits.

• The pipeline for instruction cache access in Intel x86 processors:

– Pentium processors (mid-1990s): 1 clock cycle

– Pentium Pro through Pentium III (mid-1990s through 2000): 2 clocks

– Pentium 4 through Core i7: 4 clocks

• Pipelining increases the number of pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles between issuing the load and using the data, but it does make it easier to incorporate high degrees of associativity.


(4) Nonblocking Caches to Increase Cache Bandwidth

Figure 2.5 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system, modeled after the Intel i7, consists of a 32KB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KB with a 10-clock-cycle access latency. The L3 is 2 MB with a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.


(5) Multibanked Caches to Increase Cache Bandwidth

• The Arm Cortex-A8 supports 1-4 banks in its L2 cache; the Intel Core i7 has 4 banks in L1 (to support up to 2 memory accesses per clock), and the L2 has 8 banks.

• A simple mapping that works well is to spread the addresses of the block sequentially across the banks, called sequential interleaving.

• Multiple banks are also a way to reduce power consumption, both in caches and DRAM.

Figure 2.6 Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get byte addressing.


(6) Critical Word First and Early Restart to Reduce Miss Penalty

• Critical word first: Request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block.

• Early restart: Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution.

• The benefits of critical word first and early restart depend on the size of the block and the likelihood of another access to the portion of the block that has not yet been fetched.


(7) Merging Write Buffer to Reduce Miss Penalty

Figure 2.7 To illustrate write merging, the write buffer on top does not use it while the write buffer on the bottom does. The four writes are merged into a single buffer entry with write merging; without it, the buffer is full even though three-fourths of each entry is wasted. The buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with a valid bit (V) indicating whether the next sequential 8 bytes in this entry are occupied. (Without write merging, the words to the right in the upper part of the figure would only be used for instructions that wrote multiple words at the same time.)


(8) Compiler Optimizations to Reduce Miss Rate

Loop Interchange

Blocking


(9) Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate

Figure 2.10 Speedup due to hardware prefetching on Intel Pentium 4 with hardware prefetching turned on for 2 of 12 SPECint2000 benchmarks and 9 of 14 SPECfp2000 benchmarks. Only the programs that benefit the most from prefetching are shown; prefetching speeds up the missing 15 SPEC benchmarks by less than 15% [Singhal 2004].


(10) Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate

• Register prefetch loads the value into a register.

• Cache prefetch loads data only into the cache and not the register.

• Either of these can be faulting or nonfaulting; that is, the address does or does not cause an exception for virtual address faults and protection violations.

• The most effective prefetch is “semantically invisible” to a program: it doesn’t change the contents of registers and memory, and it cannot cause virtual memory faults.

• Most processors today offer nonfaulting cache prefetches. A nonfaulting cache prefetch is also called a nonbinding prefetch.

