Advanced Computer Architecture
Cache Memory
程旭 (Cheng Xu)
April 25, 2016
Department of Computer Science and Technology, Peking University
Microprocessor Research and Development Center, Peking University
Memory Hierarchy
Take advantage of the principle of locality to:
- Present as much memory as in the cheapest technology
- Provide access at the speed offered by the fastest technology
[Figure: the memory hierarchy — Processor (control, datapath, registers) → on-chip cache → second-level cache (SRAM) → main memory (DRAM/FLASH/PCM) → secondary storage (disk/FLASH/PCM) → tertiary storage (tape/cloud storage).]
The Processor–Main Memory (DRAM) Latency Gap
Performance (1/latency): the gap grew 50% per year.
How do architects address this gap?
- Put small, fast "cache" memories between CPU and DRAM.
- Create a "memory hierarchy".
1977: DRAM faster than microprocessors
Apple (1977) — Steve Wozniak, Steve Jobs:
CPU: 1000 ns
DRAM: 400 ns
Since then: technology scaling ...
A circuit that is L nanometers long in 250 nm technology (introduced in 2000) shrinks to 0.7 × L nm in 180 nm technology (introduced in 2003): each dimension is 30% smaller, and area is 50% smaller.
Logic circuits use smaller C's, lower Vdd, and higher kn and kp to speed up clock rates.
Closing the Processor–Memory Performance Gap

Processor        Area ratio (cost)  Transistor ratio (power)
Alpha 21164      37%                77%
StrongArm SA110  61%                94%
Pentium Pro      64%                88%

Pentium Pro: two dies per package — Proc/I$/D$ + L2$.
A cache has no special intrinsic value of its own; it is merely a means of narrowing the processor–memory performance gap.
Floorplan of the Alpha 21264 (1999)
Alpha Microprocessors
Time of a full cache miss, in instructions executed:
1st Alpha: 340 ns / 5.0 ns = 68 clks × 2-issue ⇒ 136 instructions
2nd Alpha: 266 ns / 3.3 ns = 80 clks × 4-issue ⇒ 320 instructions
3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6-issue ⇒ 648 instructions
1/2× the latency × 3× the clock rate × 3× instructions/clock ⇒ ?× the miss cost
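That arithmetic is easy to script. A minimal sketch in C (the nanosecond figures come from the slide; the helper name is ours; note the slide rounds the clock counts before multiplying):

#include <stdio.h>

/* Instructions forgone on a full miss = (miss ns / cycle ns) x issue width. */
static double miss_cost_slots(double miss_ns, double cycle_ns, int issue_width) {
    return miss_ns / cycle_ns * issue_width;
}

int main(void) {
    printf("1st Alpha: %.0f\n", miss_cost_slots(340, 5.0, 2)); /* 136 */
    printf("2nd Alpha: %.0f\n", miss_cost_slots(266, 3.3, 4)); /* ~322; slide rounds to 320 */
    printf("3rd Alpha: %.0f\n", miss_cost_slots(180, 1.7, 6)); /* ~635; slide rounds to 648 */
    return 0;
}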
Four Questions for Memory Hierarchy Design
Q1: Where can a block be placed in the upper level? (Block placement)
- Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the upper level? (Block identification)
- Tag per block
Q3: Which block should be replaced on a miss? (Block replacement)
- Random, LRU, FIFO
Q4: What happens on a write? (Write strategy)
- Write back or write through (with a write buffer)
Memory Hierarchy: Apple iMac G5
iMac G5: 1.6 GHz, $1299.00

                  Reg   L1 Inst  L1 Data  L2     DRAM   Disk
Size              1K    64K      32K      512K   256M   80G
Latency (cycles)  1     3        3        11     160    10M

Managed by the compiler (registers); by hardware (L1, L2); by OS, hardware, and application (DRAM, disk).
Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Goal: the illusion of large, fast, cheap memory.
Power 7 On-Chip Caches [IBM 2009]
- 32KB L1 I$/core and 32KB L1 D$/core, 3-cycle latency
- 256KB unified L2$/core, 8-cycle latency
- 32MB unified shared L3$ in embedded DRAM (eDRAM), 25-cycle latency to the local slice
Latency: A closer look

                  Reg   L1 Inst  L1 Data  L2     DRAM   Disk
Size              1K    64K      32K      512K   256M   80G
Latency (cycles)  1     3        3        11     160    1E+07
Latency (sec)     0.6n  1.9n     1.9n     6.9n   100n   12.5m
Hz                1.6G  533M     533M     145M   10M    80

Read latency: time to return the first byte of a random access.
Architect's latency toolkit:
(1) Parallelism. Request data from N 1-bit-wide memories at the same time: this overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) likewise have the potential of N times the bandwidth.
(2) Pipelined memory. If memory has N cycles of latency, issue a request each cycle and receive the data N cycles later.
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
Memory stall clock cycles =
  Reads × Read miss rate × Read miss penalty
  + Writes × Write miss rate × Write miss penalty
or, combining reads and writes:
Memory stall clock cycles = Memory accesses × Miss rate × Miss penalty
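These formulas translate directly into code. A minimal sketch in C, with illustrative numbers of our own (not from the slide):

#include <stdio.h>

/* Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty */
static double memory_stall_cycles(double accesses, double miss_rate,
                                  double miss_penalty) {
    return accesses * miss_rate * miss_penalty;
}

int main(void) {
    double exec_cycles = 1e9;                    /* CPU execution clock cycles */
    double stalls = memory_stall_cycles(4e8, 0.02, 100);
    double cycle_time = 1e-9;                    /* 1 GHz clock */
    printf("CPU time = %.3f s\n", (exec_cycles + stalls) * cycle_time); /* 1.800 s */
    return 0;
}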
Recall: The Performance Equation
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
We need all three terms, and only these terms, to compute CPU time!
"CPI" is the average number of clock cycles per instruction for the program.
What factors make different programs have different CPIs?
- Instruction mix varies.
- Cache behavior varies.
- Branch prediction varies.
Recall: CPI as a tool to guide design
Machine CPI (throughput, not latency), weighted by the program's instruction mix — where the program spends its time:
CPI = (5 × 30 + 1 × 20 + 2 × 20 + 2 × 10 + 2 × 20) / 100 = 2.7 cycles/instruction
AMAT: Average Memory Access Time
Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
The true CPI depends on the Average Memory Access Time (AMAT) for instructions and data:
AMAT = Hit Time + (Miss Rate × Miss Penalty)
The last slide (the 5, 1, 2, 2, 2 machine-CPI example) assumed a constant memory access time.
True CPI = Ideal CPI + Memory stall cycles. See Appendix B.2 of CA-AQA for details.
Goal: reduce AMAT.
Beware! Improving one term may hurt other terms, and increase AMAT!
Programs with locality cache well ...
[Figure: memory address (one dot per access) vs. time, from Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971). Regions of spatial locality, temporal locality, and bad locality are marked.]
Q. Point out bad locality behavior ...
The caching algorithm in one slide
Temporal locality: Keep most recently accessed
data closer to processor.
Spatial locality: Move contiguous blocks in the
address space to upper levels.
Caching terminology
Hit: data appears in some block in the upper level (e.g., Block X).
- Hit rate: the fraction of memory accesses found in the upper level.
- Hit time: time to access the upper level, including the hit/miss check.
Miss: data must be retrieved from the lower level (e.g., Block Y).
- Miss rate: 1 − hit rate.
- Miss penalty: time to replace a block in the upper level + time to deliver the data to the CPU.
Hit time << miss penalty.
Example: A Direct-Mapped Cache
[Figure: a 32-bit address is split into a Cache Tag (25 bits, bits 31–7), an Index (bits 6–5, e.g., 0x01), and a Byte Select (bits 4–0, e.g., 0x00). Each cache entry holds a valid bit, a cache tag, and a 32-byte data block (Byte 31 ... Byte 1, Byte 0). The tag of the indexed entry is compared (=) with the address tag to produce Hit, and the byte select returns the byte(s) of a "hit" cache line.]
PowerPC 970: 64K direct-mapped Level-1 I-cache.
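A sketch of that address split in C, using the toy geometry of the figure (32-byte lines indexed by 2 bits; the names and the sample address are ours):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5             /* 32-byte cache line: byte select, bits 4-0 */
#define INDEX_BITS  2             /* index, bits 6-5 */

int main(void) {
    uint32_t addr   = 0x12345678; /* arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);   /* 25-bit tag */
    /* A hit means: valid[index] && tags[index] == tag */
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}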
Hybrid Design: Set-Associative Cache
[Figure: the address is split into a Cache Tag (26 bits), an Index (2 bits, e.g., 0x01), and a Byte Select (4 bits). The cache block is halved to 16 bytes to keep the number of cached bits constant. Two banks of {valid bit, cache tag, cache block} entries are indexed in parallel; two comparators (=) produce HitLeft and HitRight, and the bytes of the "hit" set member are returned.]
"N-way" set associative: N is the number of blocks for each index (color).
PowerPC 970: 32K 2-way set-associative L1 D-cache.
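A minimal model of the lookup in C, with the 2-way geometry above (the struct and names are ours; hardware performs the per-way comparisons in parallel, which this loop only models):

#include <stdbool.h>
#include <stdint.h>

#define WAYS        2
#define SETS        4             /* 2-bit index */
#define OFFSET_BITS 4             /* 16-byte blocks */
#define INDEX_BITS  2

struct line { bool valid; uint32_t tag; uint8_t data[1 << OFFSET_BITS]; };
static struct line cache[SETS][WAYS];

/* Returns the way that hits, or -1 on a miss. */
int lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int w = 0; w < WAYS; w++)                /* HitLeft / HitRight */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return w;
    return -1;
}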
Separate instruction and data caches?
[Figure B.6 from CA-AQA: misses per 1000 instructions for a 2-way set-associative cache with 64-byte blocks, for the DEC Alpha.]
Note the extraordinary effectiveness of large instruction caches.
Compare separate I- and D-caches of size 2^k to a unified cache of size 2^(k+1); arrows mark the crossover.
Unified vs. Split Caches
[Figure: split (Harvard) design — Proc with I-Cache-1 and D-Cache-1, backed by Unified Cache-2; unified design — Proc with Unified Cache-1, backed by Unified Cache-2.]
Example:
- 16KB I & 16KB D: instruction miss rate = 0.64%, data miss rate = 6.47%
- 32KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
- Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
- Hit time = 1, miss penalty = 50
- Note that a data hit costs 1 extra stall in the unified cache (it has only one port)
AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
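The same arithmetic, checked in C (all numbers come from the example above):

#include <stdio.h>

int main(void) {
    double penalty = 50.0, hit = 1.0;
    double amat_split   = 0.75 * (hit + 0.0064 * penalty)        /* I-side */
                        + 0.25 * (hit + 0.0647 * penalty);       /* D-side */
    double amat_unified = 0.75 * (hit + 0.0199 * penalty)        /* I-side */
                        + 0.25 * (hit + 1 + 0.0199 * penalty);   /* +1: one port */
    printf("AMAT Harvard = %.2f\n", amat_split);    /* ~2.05 */
    printf("AMAT Unified = %.2f\n", amat_unified);  /* ~2.24 */
    return 0;
}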
Effect of Cache Parameters on Performance
- Larger cache size: + reduces capacity and conflict misses; − hit time will increase.
- Higher associativity: + reduces conflict misses; − may increase hit time.
- Larger line size: + reduces compulsory and capacity (reload) misses; − increases conflict misses and miss penalty.
6 basic cache optimizations
- Larger block size to reduce miss rate
- Bigger caches to reduce miss rate
- Higher associativity to reduce miss rate
- Multilevel caches to reduce miss penalty
- Giving priority to read misses over writes to reduce miss penalty
- Avoiding address translation during indexing of the cache to reduce hit time
Ten Advanced Optimizations of Cache Performance (5E)
Reducing the hit time—Small and simple first-level caches and way
prediction. Both techniques also generally decrease power
consumption.
Increasing cache bandwidth—Pipelined caches, multibanked
caches, and nonblocking caches. These techniques have varying
impacts on power consumption.
Reducing the miss penalty—Critical word first and merging write
buffers. These optimizations have little impact on power.
Reducing the miss rate—Compiler optimizations. Obviously any
improvement at compile time improves power consumption.
Reducing the miss penalty or miss rate via parallelism—Hardware
prefetching and compiler prefetching. These optimizations generally
increase power consumption, primarily due to prefetched data that
are unused.
Improving Cache Performance
Average memory access time = Hit time + Miss rate × Miss penalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses — Classifying Misses: the 3 Cs
Compulsory: the first access to a block cannot find it in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (They occur even in an infinite cache.)
Capacity: if the cache cannot contain all the blocks a program needs, capacity misses occur as blocks are discarded and later retrieved. (They occur even in a fully associative cache of finite size.)
Conflict: under set-associative or direct-mapped placement, conflict misses occur (in addition to compulsory and capacity misses) when too many blocks map to the same set, so blocks are discarded and later retrieved. Also called collision misses or interference misses. (They occur in a finite N-way set-associative cache.)
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0–0.14) vs. cache size (1–128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, decomposed into conflict, capacity, and compulsory components. The compulsory miss rate is very low.]
The 2:1 Cache Rule
[Figure: the same miss-rate-per-type plot (miss rate 0–0.14 vs. cache size 1–128 KB, 1-way through 8-way), annotated with the rule.]
miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2
3Cs Relative Miss Rate
[Figure: miss rate per type, normalized to 100%, vs. cache size (1–128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, decomposed into conflict, capacity, and compulsory components.]
Flaw: assumes a fixed block size.
Good: insight ⇒ invention.
How can we reduce misses? (3 Cs: Compulsory, Capacity, Conflict)
In all cases, assume the total cache size is held constant. What happens if we:
1) Change the block size — which of the 3Cs are obviously affected?
2) Change the associativity — which of the 3Cs are obviously affected?
3) Change the compiler — which of the 3Cs are obviously affected?
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0–25%) vs. block size (16–256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K.]
2. Reduce Misses via Higher Associativity
An 8-way set-associative cache is, in practice, nearly as effective at reducing the miss rate as a fully associative one.
2:1 cache rule: miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2.
Beware: execution time is the only final measure! Does the clock cycle time increase? Hill [1988] found the hit time of a 2-way cache to be about +10% versus 1-way for external caches, and +2% for internal caches.
(Increasing the cache capacity also reduces misses.)
Example: Average Memory Access Time vs. Miss Rate
Assume clock cycle times, relative to direct-mapped, of 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, and a miss penalty of 10 cycles.

Cache Size (KB)  1-way  2-way  4-way  8-way
1                2.33   2.15   2.07   2.01
2                1.98   1.86   1.76   1.68
4                1.72   1.67   1.61   1.53
8                1.46   1.48   1.47   1.43
16               1.29   1.32   1.32   1.32
32               1.20   1.24   1.25   1.27
64               1.14   1.20   1.21   1.23
128              1.10   1.17   1.18   1.20

(Red in the original marks entries where AMAT is not improved by higher associativity.)
3. Reduce Misses via a Victim Cache
How do we combine the fast hit time of direct mapping while still avoiding conflict misses?
Add a small buffer — itself a fully associative cache — that holds blocks discarded from the cache.
Jouppi [1990]: for a 4KB direct-mapped data cache, a 4-entry victim cache removed 20% to 95% of conflict misses.
Used in Alpha, HP machines.
[Figure: a victim cache of four entries, each holding one cache line of data with its own tag and comparator, placed between the cache and the next lower level in the hierarchy.]
First idea: miss cache
Miss cache: small, fully associative, LRU replacement, checked in parallel with the L1 cache. If the L1 and the miss cache both miss, the data block returned by the next-lower cache is placed in both the L1 and the miss cache.
[Figure: L1 cache with a small "miss cache" beside it.]
Second idea: victim cache
Victim cache: small, fully associative, LRU replacement, checked in parallel with the L1 cache. If the L1 and the victim cache both miss, the data block removed (evicted) from the L1 cache is placed into the victim cache.
[Figure: L1 cache with a small "victim cache" beside it.]
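A sketch in C of the victim-cache bookkeeping just described — a small, fully associative tag store with LRU replacement (tags only, no data; all names are ours):

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

struct vc_line { bool valid; uint32_t tag; unsigned last_used; };
static struct vc_line vc[VC_ENTRIES];
static unsigned now;                       /* LRU clock */

/* Probed in parallel with the L1; returns the entry that hits, or -1. */
int vc_lookup(uint32_t block_tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc[i].valid && vc[i].tag == block_tag) {
            vc[i].last_used = ++now;       /* touch for LRU */
            return i;
        }
    return -1;
}

/* Called when the L1 evicts a block: the victim lands on the LRU entry. */
void vc_insert(uint32_t victim_tag) {
    int lru = 0;
    for (int i = 1; i < VC_ENTRIES; i++)
        if (!vc[i].valid || vc[i].last_used < vc[lru].last_used)
            lru = i;
    vc[lru] = (struct vc_line){ .valid = true, .tag = victim_tag,
                                .last_used = ++now };
}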
[Figure: % of conflict misses removed, plotted vs. the number of {miss, victim} cache entries; the victim cache removes substantially more. Each symbol is one benchmark; {solid, dashed} lines are the L1 {I, D} caches.]
Third idea: streaming prefetch buffer (FIFO)
Prefetch buffer: a small FIFO of cache lines and tags. The head of the FIFO is checked in parallel with the L1 cache access. If both miss, fetch the missed block "k" for the L1 and clear the FIFO; then prefetch blocks "k + 1", "k + 2", ... into the FIFO tail.
[Figure: L1 cache with a prefetch FIFO beside it.]
Fourth idea: multi-way streaming prefetch buffer
Multi-way buffer: a 4-FIFO version of the original design, allowing block streams for 4 misses to proceed in parallel. If an access misses the L1 and all FIFOs, the LRU FIFO is cleared and reused.
[Figure: L1 cache with four prefetch FIFOs beside it.]
[Figure: % of all misses removed, plotted vs. the number of prefetches that follow a miss, for multi-way and single-way buffers. Each symbol is one benchmark; {solid, dashed} lines are the L1 {I, D} caches.]
4-way data victim cache + 4-way data stream buffer (plus an instruction stream buffer)
[Figure: memory-system performance from the baseline up to the complete memory system. The green line shows how a system with perfect caching performs; the purple line shows the performance of the enhanced system.]
4. Reduce Misses via Pseudo-Associativity
How do we combine the fast hit time of direct mapping with the lower conflict misses of a 2-way set-associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, this is a pseudo-hit (a slow hit).
[Figure: access time divided into hit time, pseudo-hit time, and miss penalty.]
Drawback: if hits can take 1 or 2 cycles, the CPU pipeline gets complicated. Better suited to caches not tied directly to the processor (L2 caches). Used in the L2 caches of the MIPS R10000 and UltraSPARC.
Related technique: way prediction.
5. Reduce Misses via Hardware Prefetching of Instructions and Data
Instruction prefetch example: on a miss, the Alpha 21064 fetches 2 blocks. The extra block is placed in a stream buffer, which is checked on the next miss.
The same strategy works for data blocks. Jouppi [1990]: for a 4KB cache, 1 data stream buffer removed 25% of the misses; 4 stream buffers removed 43%. Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers removed 50% to 70% of the misses of two 64KB 4-way set-associative caches.
Prefetching presumes spare memory bandwidth that can be used without "collateral" cost.
6. Reduce Misses via Software Prefetching of Data
Register prefetch: load the data into a register (HP PA-RISC loads).
Cache prefetch: load into the cache only (MIPS IV, PowerPC, SPARC v9).
Special prefetch instructions that cannot fault — a form of speculative execution.
Issuing prefetch instructions takes time: is the cost of issuing them lower than the savings from reduced misses? The wider the superscalar machine, the easier it is to absorb the issue bandwidth.
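As an illustration, a nonfaulting cache prefetch in C via the GCC/Clang __builtin_prefetch intrinsic; the distance of 16 elements ahead is an illustrative tuning choice, not a figure from the slide:

/* Prefetch b[i+16] while working on b[i]; the second argument (0) means
   "for reading", the third (1) means low temporal reuse. */
void scale(double *a, const double *b, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&b[i + 16], 0, 1);
        a[i] = 2.0 * b[i];
    }
}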
7. Reduce Misses via Compiler Optimizations
McFarling [1989]: software removed 75% of the misses of an 8KB direct-mapped cache with 4-byte blocks.
Instructions:
- Reorder procedures in memory to reduce conflict misses.
- Profile to detect conflicts (using tools they developed).
Data:
- Merging arrays: improve spatial locality by combining two independent arrays into one array of compound elements.
- Loop interchange: change the loop nesting to access the data in the order it is stored in memory.
- Loop fusion: combine two independent loops that have the same looping structure and some overlapping variables.
- Blocking: improve temporal locality by working repeatedly on sub-blocks of the data, rather than making full passes over whole rows and columns.
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

Two misses per access to a and c versus one miss per access; the fused loop reuses a[i][j] and c[i][j] while they are still in the cache, improving locality.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };

The two inner loops:
- read all N×N elements of z[],
- read the N elements of one row of y[] repeatedly,
- write the N elements of one row of x[].
Capacity misses are a function of N and the cache size: 3 × N × N × 4 bytes that fit ⇒ no capacity misses; otherwise ...
Idea: compute on B×B submatrices that do fit.
Blocking Example (continued)
/* After; assumes e.g. #define min(a,b) ((a) < (b) ? (a) : (b)) */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
  for (j = jj; j < min(jj+B,N); j = j+1)
  { r = 0;
    for (k = kk; k < min(kk+B,N); k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = x[i][j] + r;
  };

B is called the blocking factor.
Capacity misses fall from 2N^3 + N^2 to 2N^3/B + N^2.
Does blocking also reduce conflict misses?
Reducing Conflict Misses via Blocking
[Figure: miss rate (0–0.1) vs. blocking factor (0–150) for a fully associative cache and a direct-mapped cache: in caches that are not fully associative, conflict misses vary with the blocking factor.]
Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of 48, even though both fit in the cache.
Summary: Compiler Optimizations to Reduce Cache Misses
[Figure: performance improvement (1× to 3×) for compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7), broken down by merged arrays, loop interchange, loop fusion, and blocking.]
Summary
3 Cs: Compulsory, Capacity, Conflict misses.
Reducing the miss rate:
1. Larger block size
2. Higher associativity
3. Victim caches
4. Pseudo-associativity
5. Hardware prefetching of instructions and data
6. Software prefetching of data
7. Compiler optimizations
Note: it is dangerous to focus on just one parameter when evaluating performance:
CPU time = IC × (CPI_execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
Improving Cache Performance (continued)
Average memory access time = Hit time + Miss rate × Miss penalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Reducing Miss Penalty: Read Priority over Write on Miss
Write-through with write buffers can create RAW conflicts between buffered writes and main-memory reads on cache misses. Simply waiting for the write buffer to empty may increase the read miss penalty (by 50% on the old MIPS 1000). Instead, check the write-buffer contents before the read; if there are no conflicts, let the memory access continue.
What about write back? A read miss may replace a dirty block:
- Normal: write the dirty block to memory, then do the read.
- Instead: copy the dirty block to a write buffer, do the read first, then do the write.
The CPU stalls less, since it restarts as soon as the read is done.
1.1 Read priority over write on miss / 1.2 Merging write buffer
[Figure: the CPU writes into a write buffer sitting between it and DRAM (or lower memory); reads check the buffer on their way to memory.]
2. Reducing Miss Penalty: Subblock Placement
Don't load the full block on a miss; keep a valid bit per subblock to indicate what is valid. (Originally invented to reduce tag storage.)
[Figure: cache entries with one tag, several valid bits, and the corresponding subblocks.]
3. Reducing Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU.
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first.
Generally useful only with large blocks. Spatial locality is a problem: the program tends to want the next sequential word, so it is not clear how much early restart helps.
4. Reducing Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses
A non-blocking cache (lockup-free cache) allows the data cache to continue to supply cache hits during a miss; this requires an out-of-order-execution CPU.
"Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
"Hit under multiple misses" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
This significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses, and it requires multiple memory banks (otherwise the misses cannot be overlapped).
The Pentium Pro allows 4 outstanding memory misses.
Hit Under Miss for SPEC
[Figure: average memory access time for "hit under i misses" (0→1, 1→2, 2→64, and base) for the SPEC benchmarks eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, and ora (integer on the left, floating point on the right).]
Floating-point average: AMAT = 0.68 → 0.52 → 0.34 → 0.26. Integer average: AMAT = 0.24 → 0.20 → 0.19 → 0.19.
8 KB data cache, direct-mapped, 32B blocks, 16-cycle miss penalty.
5. Second-Level Caches
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
so
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
- Local miss rate: misses in this cache divided by the total number of accesses to this cache (Miss Rate_L2).
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).
The global miss rate is what we really care about.
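The two-level formula in C, with illustrative numbers of our own (not from the slide):

#include <stdio.h>

int main(void) {
    double hit_l1 = 1, miss_rate_l1 = 0.04;     /* illustrative numbers  */
    double hit_l2 = 10, miss_rate_l2 = 0.25;    /* local L2 miss rate    */
    double penalty_l2 = 100;

    double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;
    double amat       = hit_l1 + miss_rate_l1 * penalty_l1;
    double global_l2  = miss_rate_l1 * miss_rate_l2;

    printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n",
           amat, global_l2);                    /* 2.40 cycles, 0.010 */
    return 0;
}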
Comparing Local and Global Miss Rates
First-level cache of 32 KBytes; now add a second-level cache.
The global miss rate approaches the miss rate a single-level cache of the L2's size would have — provided the L2 is much larger than the L1.
Do not use the local miss rate to judge a second-level cache: the L2 is independent of the CPU clock cycle, and what matters is its effect on cost and average memory access time.
In general we want fast hit times and fewer misses; since hits are few at L2, the target there is fewer misses.
[Figure: local and global miss rates vs. cache size, on linear and log scales.]
Reducing Miss Penalty via L2: Which Miss-Rate Techniques Apply to Second-Level Caches?
1. Larger block size
2. Higher associativity
3. Victim caches
4. Pseudo-associativity
5. Hardware prefetching of instructions and data
6. Software prefetching of data
7. Compiler optimizations
L2 Cache Block Size and Average Memory Access Time
[Figure: relative CPU time vs. L2 block size — 16B: 1.36, 32B: 1.28, 64B: 1.27, 128B: 1.34, 256B: 1.54, 512B: 1.95.]
32KB first-level cache; 8-byte-wide path to memory.
Summary: Reducing Miss Penalty
CPU time = IC × (CPI_execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
Five techniques:
- Read priority over writes on a miss; merging write buffers
- Subblock placement
- Early restart and critical word first on a miss
- Non-blocking caches (hit under miss, miss under miss)
- Second-level caches
These techniques apply to multi-level caches as well. A concern: the time to DRAM may grow with the number of cache levels. An out-of-order CPU can hide a first-level data-cache miss, but it stalls on a second-level miss.
Summary of Cache Optimizations
Technique                           MR  MP  HT  Complexity
Larger block size                   +   −       0
Higher associativity                +       −   2
Victim caches                       +           2
Pseudo-associative caches           +           2
HW prefetching of instr/data        +           2
Compiler-controlled prefetching     +           3
Compiler-reduced misses             +           0
Read priority over writes               +       1
Subblock placement                      +   +   1
Early restart & critical word 1st       +       2
Non-blocking caches                     +       3
Second-level caches                     +       2
(MR = miss rate, MP = miss penalty, HT = hit time; + improves, − hurts.)
Improving Cache Performance
Average memory access time = Hit time + Miss rate × Miss penalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Fast Hits via Small and Simple Caches
Why does the Alpha 21164 have an 8KB instruction cache and an 8KB data cache plus a 96KB second-level cache? Small data caches go with fast clock rates, and on-chip caches can be direct mapped.
2. Fast Hits by Avoiding Address Translation
Send the virtual address to the cache? This is called a virtually addressed cache or virtual cache, as opposed to a physical cache.
- Every logical switch between processes would require flushing the cache; otherwise false hits occur. The cost is the flush time plus the compulsory misses of an empty cache.
- Aliases (also called synonyms) must be handled: two different virtual addresses that map to the same physical address.
- I/O inevitably interacts with the cache, and therefore would need virtual addresses.
Alias solution: guarantee that the index bits are the same in the virtual and the physical address, so that with direct mapping each block has a unique location; this is called "page coloring".
Cache-flush solution: add a process-identifier tag, so an entry identifies the process as well as the address within the process; entries of the wrong process simply do not hit.
Virtually Addressed Caches
[Figure: three organizations.
- Conventional organization: CPU → TB → $ → MEM; the VA is translated to a PA before the cache access.
- Virtually addressed cache: CPU → $ (VA tags) → TB → MEM; translation happens only on a miss; raises the alias problem.
- Overlapped access: the CPU accesses the $ and the TB in parallel (indexing with VA bits, comparing PA tags), backed by an L2 $; requires the cache index to be invariant under translation.]
2. Fast Hits by Avoiding Address Translation: Effect of Process Identifiers
[Figure: miss rate (up to 20%) vs. cache size (2 KB to 1024 KB). Black: single process. Light gray: multiprocess with cache flushing. Dark gray: multiprocess with process-identifier tags.]
2. Fast Hits by Avoiding Address Translation: Index with the Physical Part of the Address
If the index is drawn from the physical part of the address (the page offset), the tag read can start in parallel with translation, and the tag is then compared against the physical address.
This limits the cache to the page size (times the associativity): what if we need a bigger cache? Increasing associativity moves the tag/index boundary to the right; page coloring also helps.
[Figure: Page Address | Page Offset, aligned over Address Tag | Index | Block Offset.]
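The resulting size constraint is easy to state in code. A sketch with illustrative numbers (the 32KB, 8-way case happens to match a 4KB page exactly):

#include <stdio.h>

int main(void) {
    unsigned page_size  = 4096;   /* bytes */
    unsigned ways       = 8;
    unsigned cache_size = 32768;  /* 32 KB */
    /* Index + block offset must fit inside the untranslated page offset: */
    if (cache_size / ways <= page_size)
        printf("OK: index bits come from page-offset bits\n");
    else
        printf("Too big: needs page coloring or higher associativity\n");
    return 0;
}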
3. Fast Hits via Pipelined Writes
Split the tag check and the cache update into separate pipeline stages: the tag check of the current write proceeds in parallel with the cache update of the previous write. Only STOREs are in this pipeline; flush it on a miss.
Store r2, (r1)   — check r1
Add              —
Sub              —
Store r4, (r3)   — M[r1] <- r2 & check r3
The shaded element is a delayed write buffer; reads must check it.
4. Trace Cache in Pentium 4
How can we supply enough instructions every cycle without dependencies?
Instead of limiting the instructions in a static cache block to spatially local ones, a trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
Costs: much more complicated address-mapping mechanisms, and the trace cache may store the same instructions multiple times.
Summary of Cache Optimizations
Technique                           MR  MP  HT  Complexity
Larger Block Size                   +   −       0
Higher Associativity                +       −   1
Victim Caches                       +   +       2
Pseudo-Associative Caches           +           2
HW Prefetching of Instr/Data        +   +       2
Compiler-Controlled Prefetching     +   +       3
Compiler Reduce Misses              +           0
Priority to Read Misses                 +       1
Subblock Placement                      +   +   1
Early Restart & Critical Word 1st       +       2
Non-Blocking Caches                     +       3
Second-Level Caches                     +       2
Small & Simple Caches               −       +   0
Avoiding Address Translation                +   2
Pipelining Writes                           +   1
Trace Cache                                 +   3
(MR = miss rate, MP = miss penalty, HT = hit time; + improves, − hurts.)
What We've Learned: Impact on Cache Performance
What does this mean for compilers, operating systems, algorithms, and data structures?
[Figure: relative performance (log scale, 1–1000) vs. year (1980–2000) for CPU and DRAM. 1960–1985: speed = f(number of operations). Around 1990: pipelined execution and fast clock rates, then out-of-order execution and superscalar instruction issue. By 1998: speed = f(non-cached memory accesses).]
(1) Small and Simple First-Level Caches to Reduce Hit Time and Power
Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time. Lower levels of associativity usually reduce power as well, because fewer cache lines must be accessed.
In recent designs, three other factors have led to the use of higher associativity in first-level caches:
- Many processors take at least two clock cycles to access the cache, so the impact of a longer hit time may not be critical.
- To keep the TLB out of the critical path (a delay that would be larger than that associated with increased associativity), almost all L1 caches should be virtually indexed.
- With the introduction of multithreading, conflict misses can increase, making higher associativity more attractive.
Figure 2.4 Energy consumption per read increases as cache size and
associativity are increased. The large penalty for eight-way set associative
caches is due to the cost of reading out eight tags and the corresponding
data in parallel.
(2) Way Prediction to Reduce Hit Time
Way prediction reduces conflict misses and yet maintains the hit speed of a direct-mapped cache: the prediction sets the multiplexor early to select the desired block, and only a single tag comparison is performed that clock cycle, in parallel with reading the cache data. A miss results in checking the other blocks for matches in the next clock cycle.
Simulations suggest set-prediction accuracy above 90% for a two-way set-associative cache and 80% for a four-way one, with better accuracy on I-caches than D-caches.
Way selection to save power
An extended form of way prediction can also be used to
reduce power consumption by using the way prediction
bits to decide which cache block to actually access (the
way prediction bits are essentially extra address bits);
This approach, which might be called way selection, saves
power when the way prediction is correct but adds
significant time on a way misprediction, since the access,
not just the tag match and selection, must be repeated.
Such an optimization is likely to make sense only in low-
power processors.
(3) Pipelined Cache Access to Increase Cache Bandwidth
The idea is simply to pipeline cache access, so that the effective latency of a first-level cache hit can be multiple clock cycles — giving a fast clock cycle time and high bandwidth, but slow hits.
The instruction-cache access pipeline of the Intel x86 line:
- Pentium (mid-1990s): 1 clock cycle,
- Pentium Pro through Pentium III (mid-1990s to 2000): 2 clocks,
- Pentium 4 through Core i7: 4 clocks.
Pipelining increases the number of pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles between issuing a load and using the data, but it does make it easier to incorporate high degrees of associativity.
(4) Nonblocking Caches to Increase Cache Bandwidth
Figure 2.5 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left)
and 9 SPECFP (on the right) benchmarks. The data memory system modeled after the Intel i7 consists of a 32KB L1 cache with a four cycle
access latency. The L2 cache (shared with instructions) is 256 KB with a 10 clock cycle access latency. The L3 is 2 MB and a 36-cycle access
latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for
the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in
little additional improvement.
(5) Multibanked Caches to Increase Cache Bandwidth
The ARM Cortex-A8 supports 1 to 4 banks in its L2 cache; the Intel Core i7 has 4 banks in L1 (to support up to 2 memory accesses per clock), and its L2 has 8 banks.
A simple mapping that works well is to spread the block addresses sequentially across the banks, called sequential interleaving. Multiple banks are also a way to reduce power consumption, both in caches and in DRAM.
Figure 2.6 Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get the byte address.
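Sequential interleaving reduces to modular arithmetic on the block address. A sketch using the 4-bank, 64-byte-block geometry of Figure 2.6:

#include <stdint.h>

#define BLOCK_BYTES 64   /* per Figure 2.6 */
#define BANKS       4

/* Sequential interleaving: consecutive block addresses hit consecutive banks. */
unsigned bank_of(uint64_t byte_addr) {
    uint64_t block = byte_addr / BLOCK_BYTES;   /* block addressing */
    return (unsigned)(block % BANKS);
}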
(6) Critical Word First and Early Restart to Reduce Miss Penalty
Critical word first—Request the missed word first from
memory and send it to the processor as soon as it arrives;
let the processor continue execution while filling the rest
of the words in the block.
Early restart—Fetch the words in normal order, but as
soon as the requested word of the block arrives send it to
the processor and let the processor continue execution.
The benefits of critical word first and early restart depend
on the size of the block and the likelihood of another
access to the portion of the block that has not yet been
fetched.
(7) Merging Write Buffer to Reduce Miss Penalty
Figure 2.7 To illustrate write merging, the write buffer on top does not use it while the write buffer on the
bottom does. The four writes are merged into a single buffer entry with write merging; without it, the buffer is full
even though three-fourths of each entry is wasted. The buffer has four entries, and each entry holds four 64-bit words.
The address for each entry is on the left, with a valid bit (V) indicating whether the next sequential 8 bytes in this
entry are occupied. (Without write merging, the words to the right in the upper part of the figure would only be used
for instructions that wrote multiple words at the same time.)
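A sketch in C of the merging logic Figure 2.7 depicts — an incoming store first tries to merge into an existing entry covering its block, and only otherwise takes a fresh entry (entry geometry from the figure; all names are ours):

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4        /* per the figure: 4 entries ...          */
#define WORDS   4        /* ... each holding four 64-bit words     */

struct wb_entry { bool used; uint64_t base; bool v[WORDS]; uint64_t w[WORDS]; };
static struct wb_entry wb[ENTRIES];

/* Returns false when the buffer is full and the CPU must stall. */
bool wb_write(uint64_t addr, uint64_t data) {
    uint64_t base = addr & ~(uint64_t)(WORDS * 8 - 1);  /* 32-byte entry base */
    unsigned word = (unsigned)((addr >> 3) & (WORDS - 1));
    for (int i = 0; i < ENTRIES; i++)        /* 1) try to merge */
        if (wb[i].used && wb[i].base == base) {
            wb[i].v[word] = true; wb[i].w[word] = data;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)        /* 2) else take a free entry */
        if (!wb[i].used) {
            wb[i].used = true; wb[i].base = base;
            for (int k = 0; k < WORDS; k++) wb[i].v[k] = false;
            wb[i].v[word] = true; wb[i].w[word] = data;
            return true;
        }
    return false;                            /* 3) full: stall */
}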
(8) Compiler Optimizations to Reduce Miss Rate
Loop Interchange
Blocking
(9) Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate
Figure 2.10 Speedup due to hardware prefetching on Intel Pentium 4 with hardware prefetching turned on for 2 of 12
SPECint2000 benchmarks and 9 of 14 SPECfp2000 benchmarks. Only the programs that benefit the most from prefetching are
shown; prefetching speeds up the missing 15 SPEC benchmarks by less than 15% [Singhal 2004].
(10) Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
A register prefetch loads the value into a register; a cache prefetch loads the data only into the cache, not a register.
Either kind can be faulting or nonfaulting; that is, the address does or does not cause an exception for virtual-address faults and protection violations.
The most effective prefetch is "semantically invisible" to a program: it doesn't change the contents of registers and memory, and it cannot cause virtual-memory faults.
Most processors today offer nonfaulting cache prefetches, also called nonbinding prefetches.