
Jayesh Gaur*, Alaa Alameldeen**, Sreenivas Subramoney*

* Microarchitecture Research Lab (MRL), Intel India

** Memory Architecture Lab (MAL), Intel Oregon

ISCA 2016, Seoul, Korea

Motivation

• Memory continues to be the bottleneck for modern CPUs

• Larger Last Level Cache (LLC) capacity improves performance and power

• Higher hit rates → better performance

• Fewer off-chip DRAM accesses → lower power and energy

• However, this comes at a cost: area and leakage

• Cache compression is an attractive option

• Increased Capacity at lower area


But compression can interfere with replacement policies.

We present a new compression architecture to address this.


How Cache Compression Works

[Diagram: Tag Array and Data Array, with Compression Logic on the fill path and De-Compression Logic on the read path; ~8% increase in area]

Prior works have tried to change the SRAM layout to allow compression. Changing a dense, timing-sensitive SRAM layout is difficult.

[Diagram: a set whose ways hold compressed data, with a Tag Array holding 2X tags]

Data is fragmented across the set. How do we associate tags with data?

Agenda

• Practical Architecture for Compressed Cache

• Interaction between Compression and Replacement Policies

• Base-Victim Proposal

• Results

• Performance

• Power

• Conclusions


Creating a Compressed LLC


• Tags per Set are doubled

• Exactly two tags are associated with each way

• Tag hit to data fetch is optimized

• Only 64B of data for every two tags

• Data corresponding to the tags is compressed

[Diagram: Ways 0-3, each with Tag0 and Tag1 sharing a single 64B data entry (Data 0, Data 1)]

Data-0-size + Data-1-size <= 64B
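The fit constraint above can be sketched as a one-line check (a minimal illustration; `fits` and `WAY_BYTES` are our names, not the paper's):

```python
# Two-tags-per-way invariant: both compressed lines mapped to a way must
# share that way's single 64B data entry.
WAY_BYTES = 64

def fits(size0: int, size1: int) -> bool:
    """True if two compressed lines can share one 64B data entry."""
    return size0 + size1 <= WAY_BYTES

print(fits(24, 40))  # 24 + 40 = 64B exactly -> True
print(fits(48, 24))  # 48 + 24 = 72B -> False
```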

[Figure: IPC ratio and DRAM read ratio over baseline, per trace]

Performance gain from compression

Average 12% loss! Hit rates are lower in general. Why did larger LLC capacity lower performance?

Compression Friendly / Compression Unfriendly

Issues with Compressed Cache

Partner line victimization

• Replacement policy is broken because of size limitations

• Performs worse than baseline, with many negative outliers


Example (NRU age 1 = old, 0 = recently used, X = invalid):

            Way 0   Way 1   Way 2   Way 3
Data size   48 16 | 24 40 | 64  0 | 32 24
NRU age      1  0 |  1  1 |  1  X |  1  1

Incoming request size: 24

The LRU candidate way does not have space. Allocating into the LRU way will victimize the partner MRU line! We need to increase capacity but also preserve the replacement policy.
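The failure mode can be illustrated with a toy check (our numbers and helper name, not the slide's): even after the victim leaves, the fill may not fit beside the victim's partner in the shared 64B entry, so the recently used partner must be evicted too.

```python
WAY_BYTES = 64

def partner_must_be_evicted(partner_size: int, incoming_size: int) -> bool:
    """After the victim leaves its way, does the incoming line still
    not fit beside the victim's partner in the shared 64B entry?"""
    return incoming_size + partner_size > WAY_BYTES

# An 8B victim with a 56B MRU partner: a 24B fill cannot fit, so the
# recently used partner line is victimized too, breaking the policy.
print(partner_must_be_evicted(56, 24))  # True
# A 16B partner leaves room: the fill coexists with the partner.
print(partner_must_be_evicted(16, 24))  # False
```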

• Extra capacity (Tag-1 in each way) logically belongs to a Victim Cache

• Tag 0 victims are cached in Tag 1 space

• Base replacement policy strictly maintained in Tag 0

• Guarantees baseline cache hit behavior and performance

• Victim cache is always clean

• Partner line victimization is easy

[Diagram: Ways 0-3, each split into a Base (B) entry in Tag 0 and a Victim (V) entry in Tag 1]


Opportunistic Victim Cache
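The cleanliness invariant behind easy partner victimization can be sketched as follows (hypothetical helper; the slides do not show the controller logic):

```python
def insert_into_victim_cache(victim_cache: list, way: int, line, dirty: bool):
    """Only clean lines enter Tag1, so any victim-cache entry can later
    be dropped silently: partner victimization never forces a writeback."""
    if dirty:
        return "writeback"          # dirty victims go to memory instead
    victim_cache[way] = line        # clean: cache opportunistically
    return "cached"

vc = [None] * 4
print(insert_into_victim_cache(vc, 0, ("B", 24), dirty=False))  # cached
print(insert_into_victim_cache(vc, 1, ("X", 32), dirty=True))   # writeback
```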

Compressed LLC: Miss

A read miss allocates into the Tag0 LRU way.

If the size exceeds what is available in Tag0, victimize the partner line in the Tag1 victim cache. In the baseline that line was never there, so this cannot be worse than the baseline.

If a victim is created, insert it into the Tag1 victim cache.

Example (<Tag, Size>), Ways 0-3. Baseline cache (Tag0): D,24 | C,8 | A,48 | B,24 (LRU). Victim cache (Tag1): F,32 | E,8 | X,16 | Y,32. Incoming request Z,48 replaces the LRU line B in Tag0; victim-cache lines that no longer fit (Y, then E) are silently dropped, since the victim cache is clean, and B is opportunistically inserted into Tag1.
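The miss flow above can be sketched as a small simulation (assumed names: `tag0` is the baseline array, `tag1` the victim cache; entries are `(tag, size)` or `None`; a real design would also run replacement inside Tag1, which is omitted here):

```python
WAY_BYTES = 64

def handle_miss(tag0, tag1, lru_way, new_tag, new_size):
    """Allocate a missed line into the Tag0 LRU way of one set."""
    victim = tag0[lru_way]                 # baseline replacement victim
    tag0[lru_way] = (new_tag, new_size)
    # If the fill no longer fits beside its Tag1 partner, drop the
    # partner: the victim cache is always clean, so no writeback.
    partner = tag1[lru_way]
    if partner is not None and new_size + partner[1] > WAY_BYTES:
        tag1[lru_way] = None
    # Opportunistically insert the Tag0 victim into a Tag1 slot with room.
    if victim is not None:
        for way, (base, vic) in enumerate(zip(tag0, tag1)):
            used = base[1] if base is not None else 0
            if vic is None and used + victim[1] <= WAY_BYTES:
                tag1[way] = victim
                break                      # otherwise the victim is dropped

# Toy version of the slide's example: Z,48 replaces LRU line B,24 in
# way 3; the clean victim B is then cached in a Tag1 slot that has room.
tag0 = [("D", 24), ("C", 8), ("A", 48), ("B", 24)]
tag1 = [None, None, None, None]
handle_miss(tag0, tag1, lru_way=3, new_tag="Z", new_size=48)
print(tag0[3], tag1[0])  # ('Z', 48) ('B', 24)
```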


Compressed LLC: Hit in Victim Cache

The victim cache is always clean; dirty lines are written to memory. Data is returned to the core after decompression.

On a hit in the Tag1 victim cache, the data moves to Tag0. The cache behaves as if the read miss were served from memory, but the latency is just an LLC lookup, which gives performance.

Management of the victim cache is critical and needs more analysis.

Example (<Tag, Size>), Ways 0-3. Baseline cache (Tag0): D,24 | C,56 | A,48 | B,24 (LRU). Victim cache (Tag1): F,32 | E,8 | X,16 | Y,32. Incoming request E,8 hits in the victim cache; E is promoted into the Tag0 LRU way, and the evicted line B is inserted into the victim cache.
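A victim-cache hit can be sketched the same way (assumed names continuing the `tag0`/`tag1` convention; re-inserting the evicted Tag0 line into Tag1 would follow the miss flow and is left out):

```python
WAY_BYTES = 64

def handle_victim_hit(tag0, tag1, hit_way, lru_way):
    """Promote a Tag1 (victim cache) hit back into the Tag0 LRU way,
    as if the read miss had been served from memory, but at the
    latency of a single LLC lookup."""
    line = tag1[hit_way]
    tag1[hit_way] = None                   # line leaves the victim cache
    evicted = tag0[lru_way]                # baseline would evict this line
    tag0[lru_way] = line
    partner = tag1[lru_way]
    if partner is not None and line[1] + partner[1] > WAY_BYTES:
        tag1[lru_way] = None               # clean, so silently dropped
    return evicted

# Toy version of the slide's example: E,8 hits in Tag1 and is promoted
# into the Tag0 LRU way; B,24 is evicted (and would be re-inserted into
# Tag1 by the miss-flow logic).
tag0 = [("D", 24), ("C", 56), ("A", 48), ("B", 24)]
tag1 = [None, ("E", 8), None, None]
evicted = handle_victim_hit(tag0, tag1, hit_way=1, lru_way=3)
print(tag0[3], evicted)  # ('E', 8) ('B', 24)
```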



Configuration

• x86 Core running at 4GHz

• 2MB, 16 way Inclusive LLC per core

• Not Recently Used (NRU) replacement

• DDR3-1600 15-15-15-34

• Base Delta Immediate (BDI) Compression

• Decompression Latency of 2 cycles

Category     | Traces
SPECFP 06    | 30
SPECINT 06   | 29
Productivity | 14
Client       | 27
Overall      | 100

On average, each cache line compresses to 55% of its size, so doubling the tags should capture most of the gains.
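A minimal sketch of the Base-Delta-Immediate idea (one base-plus-delta configuration only; real BDI tries several delta widths and a zero-immediate base; `bdi_compressed_size` is our name, not the BDI paper's):

```python
def bdi_compressed_size(words, base_bytes=8, delta_bytes=1):
    """Size in bytes of a line stored as one base value plus small
    signed per-word deltas, or the uncompressed size if any delta
    does not fit in the chosen delta width."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)      # signed delta range
    if all(-limit <= w - base < limit for w in words):
        return base_bytes + delta_bytes * len(words)
    return 8 * len(words)                   # incompressible 64B line

# Eight 8B words clustered near 1000: 64B -> 8 + 8*1 = 16B.
print(bdi_compressed_size([1000, 1001, 1005, 990, 1000, 999, 1002, 1003]))
# Widely spread values do not compress: the line stays 64B.
print(bdi_compressed_size([0, 2**40, 5, 7, 9, 11, 13, 15]))
```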


Results: IPC Gain

[Figure: IPC ratio over baseline for SPECFP, SPECINT, Productivity, Client, and Average, for Compression Friendly traces and Overall; series: 3M Uncompressed LLC vs. Opportunistic Compression; ratios range from 1.03 to 1.13]

An 8% area addition gives performance equal to a 50% area increase. Good gains across various categories of workloads.


Results: Correlation with Hit Rate Improvement

[Figure: IPC ratio and DRAM read ratio over baseline, per trace]

Hit rate >= baseline hit rate, with no negative outliers. Memory traffic reduces by 16% on average.

Compression Friendly / Compression Unfriendly


Effect of Baseline Replacement Policy

[Figure: IPC ratio over the NRU baseline for SPECFP, SPECINT, Productivity, Client, and Average (Compression Friendly and Overall); series: SRRIP, SRRIP + Compression, CHAR, CHAR + Compression]

Good gains with various state-of-the-art replacement policies. Base-Victim increases capacity while retaining the benefits of good replacement!


Energy Savings

Power saved in DRAM compensates for the increased power in the LLC: overall 6.5% energy savings.

[Figure: DRAM read ratio and energy ratio over baseline, per trace; series: DRAM Read Ratio, Energy Ratio with Word Enables]


Conclusions

• Cache Compression increases capacity with low area impact

• But compression interferes with replacement policies

• We propose Base-Victim compression

• Opportunistic Victim cache created by compression

• Preserves gains from replacement policies

• No costly SRAM layout changes

• All changes in the cache controller

• ~50% increase in capacity with 8% area addition