Adaptive Insertion Policies for Managing Shared Caches

Page 1: Adaptive Insertion Policies  for Managing Shared Caches

Adaptive Insertion Policies for Managing Shared Caches

Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi,

Julien Sebot, Simon Steely Jr., Joel Emer

Intel Corporation, VSSAD

[email protected]

International Conference on Parallel Architectures and Compilation Techniques (PACT)

Page 2: Adaptive Insertion Policies  for Managing Shared Caches


Paper Motivation

• Shared caches are common, and more so with increasing # of cores

• # of concurrent applications → contention for the shared cache

• High performance → manage the shared cache efficiently

[Diagram: three cache hierarchies — Single Core (SMT): Core 0 with FLC and LLC; Dual Core (ST/SMT): Cores 0-1, each with an FLC, sharing the LLC; Quad-Core (ST/SMT): Cores 0-3, each with an FLC and MLC, sharing the LLC]

Page 3: Adaptive Insertion Policies  for Managing Shared Caches


Problems with LRU-Managed Shared Caches

• Conventional LRU policy allocates resources based on rate of demand
  – Applications that do not benefit from the cache cause destructive cache interference

[Chart: Misses Per 1000 Instr (under LRU) for soplex and h264ref, and Cache Occupancy Under LRU Replacement (2MB Shared Cache), 0-100%]

Page 4: Adaptive Insertion Policies  for Managing Shared Caches


Addressing Shared Cache Performance

• Conventional LRU policy allocates resources based on rate of demand
  – Applications that do not benefit from the cache cause destructive cache interference

• Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
  – Requires HW to detect cache benefit
  – Requires changes to the existing cache structure
  – Not scalable to a large # of applications

[Chart: Misses Per 1000 Instr (under LRU) for soplex and h264ref, and Cache Occupancy Under LRU Replacement (2MB Shared Cache), 0-100%]

Eliminate Drawbacks of Cache Partitioning

Page 5: Adaptive Insertion Policies  for Managing Shared Caches


Paper Contributions

• Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit

• Goals: Design a dynamic hardware mechanism that:
  1. Provides high performance by allocating cache on a benefit basis
  2. Is robust across different concurrently executing applications
  3. Scales to a large number of competing applications
  4. Requires low design overhead

• Solution: Thread-Aware Dynamic Insertion Policy (TADIP), which improves average throughput by 12-18% for 2-, 4-, 8-, and 16-core systems with two bytes of storage per HW-thread

TADIP, Unlike Cache Partitioning, DOES NOT Attempt to Reserve Cache Space

Page 6: Adaptive Insertion Policies  for Managing Shared Caches


“Adaptive Insertion Policies for High-Performance Caching”

Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer

Appeared in ISCA’07

Review Insertion Policies

Page 7: Adaptive Insertion Policies  for Managing Shared Caches


Cache Replacement 101 – ISCA’07

Two components of cache replacement:

• Victim Selection:
  – Which line to replace for the incoming line? (e.g., LRU, Random, etc.)

• Insertion Policy:
  – With what priority is the new line placed in the replacement list? (e.g., insert the new line at the MRU position)

Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads

Page 8: Adaptive Insertion Policies  for Managing Shared Caches


Static Insertion Policies – ISCA’07

• Conventional (MRU Insertion) Policy:
  – Choose victim, promote to MRU

• LRU Insertion Policy (LIP):
  – Choose victim, DO NOT promote to MRU
  – Unless reused, lines stay at the LRU position

• Bimodal Insertion Policy (BIP):
  – LIP does not age older lines
  – Infrequently insert some misses at MRU
  – Bimodal throttle ε: we used ε ≈ 3%

Reference to ‘i’ with conventional LRU policy:
  MRU → LRU:  a b c d e f g h  becomes  i a b c d e f g

Reference to ‘i’ with LIP:
  MRU → LRU:  a b c d e f g h  becomes  a b c d e f g i

Reference to ‘i’ with BIP:
  if( rand() < ε )
      insert at MRU position
  else
      insert at LRU position

Applications Prefer Either Conventional LRU or BIP…
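As a rough illustration of these static policies, here is a minimal sketch of how LRU, LIP, and BIP inserts act on one set's recency stack (the function name and the epsilon parameter standing in for the bimodal throttle ε are illustrative, not from the paper):

```python
import random

def insert_line(stack, line, policy, epsilon=1/32):
    """Insert a missing line into one set's recency stack (index 0 = MRU).

    'lru': conventional policy - evict the LRU victim, promote new line to MRU.
    'lip': evict the LRU victim, leave the new line at the LRU position.
    'bip': like LIP, but with small probability epsilon insert at MRU,
           so that older lines still age out (the bimodal throttle).
    """
    stack.pop()                          # victim selection: the LRU line
    if policy == 'lru' or (policy == 'bip' and random.random() < epsilon):
        stack.insert(0, line)            # MRU insertion
    else:                                # 'lip', or BIP most of the time
        stack.append(line)               # LRU insertion
    return stack

# The slide's example: reference to 'i' in a set holding a..h (MRU -> LRU)
print(insert_line(list('abcdefgh'), 'i', 'lru'))  # ['i','a','b','c','d','e','f','g']
print(insert_line(list('abcdefgh'), 'i', 'lip'))  # ['a','b','c','d','e','f','g','i']
```

Under LIP, ‘i’ is evicted on the very next miss unless it is reused first, which is what protects the working set from a thrashing access stream.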

Page 9: Adaptive Insertion Policies  for Managing Shared Caches

Dynamic Insertion Policy (DIP) via “Set-Dueling” – ISCA’07

• Set Dueling Monitors (SDMs): dedicated sets that estimate the performance of a pre-defined policy

• Divide the cache in three:
  – SDM-LRU: dedicated LRU sets
  – SDM-BIP: dedicated BIP sets
  – Follower sets

• PSEL: n-bit saturating counter
  – Misses to SDM-LRU: PSEL++
  – Misses to SDM-BIP: PSEL--

• Follower sets insertion policy:
  – Use LRU if PSEL MSB = 0
  – Use BIP if PSEL MSB = 1

• Based on analytical and empirical studies:
  – 32 sets per SDM
  – 10-bit PSEL counter

HW Required: 10 bits + Combinational Logic
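The PSEL mechanics can be sketched as a toy model (not the hardware: the SDM set-index layout, counter seeding at zero, and method names are illustrative assumptions):

```python
class DIP:
    """Toy model of DIP set-dueling; layout and naming are illustrative."""
    def __init__(self, num_sets=1024, sdm_sets=32, psel_bits=10):
        self.psel = 0                        # n-bit saturating counter
        self.psel_max = (1 << psel_bits) - 1
        self.msb_shift = psel_bits - 1
        # For simplicity, dedicate the first blocks of sets to each SDM;
        # a real design spreads the SDMs across the cache.
        self.lru_sdm = range(0, sdm_sets)
        self.bip_sdm = range(sdm_sets, 2 * sdm_sets)

    def on_miss(self, set_index):
        if set_index in self.lru_sdm:
            self.psel = min(self.psel + 1, self.psel_max)  # LRU SDM miss: PSEL++
        elif set_index in self.bip_sdm:
            self.psel = max(self.psel - 1, 0)              # BIP SDM miss: PSEL--

    def insertion_policy(self, set_index):
        if set_index in self.lru_sdm:
            return 'lru'      # dedicated sets always use their own policy
        if set_index in self.bip_sdm:
            return 'bip'
        # Follower sets: MSB = 0 -> LRU is winning, MSB = 1 -> BIP is winning
        return 'bip' if (self.psel >> self.msb_shift) & 1 else 'lru'
```

With the counter seeded at zero, followers start on LRU; once misses in the LRU SDM push PSEL's MSB to 1, all follower sets switch to BIP.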

Page 10: Adaptive Insertion Policies  for Managing Shared Caches


Extending DIP to Shared Caches

• DIP uses a single policy (LRU or BIP) for all applications competing for the cache

• DIP cannot distinguish between apps that benefit from the cache and those that do not

• Example: soplex + h264ref with a 2MB cache
  – DIP learns LRU for both apps
  – soplex causes destructive interference
  – Desirable that only h264ref follow LRU and soplex follow BIP

[Chart: Misses Per 1000 Instr (under LRU) for soplex and h264ref]

Need a Thread-Aware Dynamic Insertion Policy (TADIP)

Page 11: Adaptive Insertion Policies  for Managing Shared Caches


Thread-Aware Dynamic Insertion Policy (TADIP)

• Assume an N-core CMP running N apps; what is the best insertion policy for each app? (LRU=0, BIP=1)

• The insertion policy decision can be thought of as an N-bit binary string:
  < P0, P1, P2 … PN-1 >
  – If Px = 1, application x uses BIP, else it uses LRU
  – e.g., 0000 → always use conventional LRU; 1111 → always use BIP

• With an N-bit string there are 2^N possible combinations. How to find the best one?
  – Offline Profiling: input-set/system dependent & impractical with large N
  – Brute-Force Search using SDMs: infeasible with large N

Need a PRACTICAL and SCALABLE Implementation of TADIP

Page 12: Adaptive Insertion Policies  for Managing Shared Caches


Using Set-Dueling As a Practical Approach to TADIP

• Unnecessary to exhaustively search all 2^N combinations

• Some bits of the best binary insertion string can be learned independently
  – Example: always use BIP for applications that create interference

• Exponential Search Space → Linear Search Space
  – Learn the best policy (BIP or LRU) for each app in the presence of all other apps

Use Per-Application SDMs To Decide: in the presence of other apps, does an app cause destructive interference? If so, use BIP for this app, else use the LRU policy

Page 13: Adaptive Insertion Policies  for Managing Shared Caches

TADIP Using Set-Dueling Monitors (SDMs)

• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3

• Set-level view: each APPx owns a pair of SDMs that force its own insertion bit to 0 (LRU) or 1 (BIP) while the other apps follow their PSELs:
  < 0, P1, P2, P3 > and < 1, P1, P2, P3 >
  < P0, 0, P2, P3 > and < P0, 1, P2, P3 >
  < P0, P1, 0, P3 > and < P0, P1, 1, P3 >
  < P0, P1, P2, 0 > and < P0, P1, P2, 1 >

• High-level view: the remaining sets are Follower Sets using < P0, P1, P2, P3 >, where Px = MSB( PSELx )

• Per-app counters PSEL0-PSEL3, incremented/decremented on SDM misses, answer: in the presence of other apps, does APP0 doing LRU or BIP improve cache performance?

Page 14: Adaptive Insertion Policies  for Managing Shared Caches

TADIP Using Set-Dueling Monitors (SDMs)

• Assume a cache shared by 4 applications: APP0, APP1, APP2, APP3
  – LRU SDMs for each APP (e.g., < 0, P1, P2, P3 > for APP0)
  – BIP SDMs for each APP (e.g., < 1, P1, P2, P3 > for APP0)
  – Follower sets

• Per-APP PSEL saturating counters:
  – Misses to an app’s LRU SDM: PSEL++
  – Misses to its BIP SDM: PSEL--

• Follower sets insertion policy:
  – SDMs of one thread are follower sets of another thread
  – Let Px = MSB[ PSELx ]
  – Fill decision: < P0, P1, P2, P3 >

• 32 sets per SDM, 10-bit PSEL

HW Required: (10*T) bits + Combinational Logic
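The per-thread set-dueling above can be sketched as a toy model (the contiguous SDM layout, set counts, and method names are assumptions for illustration, not the hardware design):

```python
class TADIP:
    """Toy model of TADIP set-dueling for T threads; layout is illustrative.

    Thread k owns one LRU SDM (its bit forced to 0) and one BIP SDM (its
    bit forced to 1); in every other set, thread k is a follower and uses
    Pk = MSB(PSELk). SDMs of one thread are follower sets for the others.
    """
    def __init__(self, threads=4, sdm_sets=32, psel_bits=10):
        self.threads = threads
        self.sdm_sets = sdm_sets
        self.psel = [0] * threads                  # one counter per HW thread
        self.psel_max = (1 << psel_bits) - 1
        self.msb_shift = psel_bits - 1

    def _owner(self, set_index):
        """(owner thread, forced bit) if this set is an SDM, else None.
        Assumed layout: 2*T consecutive blocks of sdm_sets sets each."""
        group = set_index // self.sdm_sets
        if group < 2 * self.threads:
            return group // 2, group % 2           # bit 0 = LRU, 1 = BIP
        return None

    def on_miss(self, set_index):
        sdm = self._owner(set_index)
        if sdm is not None:
            k, forced = sdm
            if forced == 0:                        # miss in thread k's LRU SDM
                self.psel[k] = min(self.psel[k] + 1, self.psel_max)
            else:                                  # miss in thread k's BIP SDM
                self.psel[k] = max(self.psel[k] - 1, 0)

    def insertion_bit(self, set_index, thread):
        """Fill decision for one thread: 0 -> LRU insertion, 1 -> BIP."""
        sdm = self._owner(set_index)
        if sdm is not None and sdm[0] == thread:
            return sdm[1]                          # forced bit in own SDM
        return (self.psel[thread] >> self.msb_shift) & 1   # Px = MSB(PSELx)
```

With all counters at zero, every follower set starts on LRU; an app whose LRU SDM keeps missing (a thrashing, soplex-like app) drives its PSEL's MSB to 1 and flips that app, and only that app, to BIP.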

Page 15: Adaptive Insertion Policies  for Managing Shared Caches

Summarizing Insertion Policies

Policy           | Insertion Policy Search Space                    | # of SDMs | # Counters
-----------------|--------------------------------------------------|-----------|-----------
LRU Replacement  | < 0, 0, 0, … 0 >                                 | 0         | 0
DIP              | < 0, 0, 0, … 0 > and < 1, 1, 1, … 1 >            | 2         | 1
Brute Force      | < 0, 0, 0, … 0 > … < 1, 1, 1, … 1 >              | 2^N       | 2^N
TADIP            | < P0, P1, P2, … PN-1 > and Hamming distance of 1 | 2N        | N

TADIP is SCALABLE with Large N

Page 16: Adaptive Insertion Policies  for Managing Shared Caches


Experimental Setup

• Simulator and Benchmarks:
  – CMP$im: a Pin-based multi-core performance simulator
  – 17 representative SPEC CPU2006 benchmarks

• Baseline Study:
  – 4-core CMP with in-order cores (assuming L1-hit IPC of 1)
  – Three-level cache hierarchy: 32KB L1, 256KB L2, 4MB L3
  – 15 workload mixes of four different SPEC CPU2006 benchmarks

• Scalability Study:
  – 2-core, 4-core, 8-core, and 16-core systems
  – 50 workload mixes of 2, 4, 8, & 16 different SPEC CPU2006 benchmarks

Page 17: Adaptive Insertion Policies  for Managing Shared Caches

[Charts: MPKI, % MRU insertions, Cache Usage, and APKI for SOPLEX and H264REF — Baseline LRU Policy / DIP vs. TADIP]

soplex + h264ref Sharing 2MB Cache
APKI: accesses per 1000 inst; MPKI: misses per 1000 inst

TADIP Improves Throughput by 27% over LRU and DIP

Page 18: Adaptive Insertion Policies  for Managing Shared Caches


TADIP Results – Throughput

[Chart: Throughput Normalized to LRU (1.00-1.60) for DIP and TADIP across MIX_0 … MIX_14 and GEOMEAN; several mixes show no gains from DIP]

DIP and TADIP are ROBUST and Do Not Degrade Performance over LRU
Making Thread-Aware Decisions is 2x Better than DIP

Page 19: Adaptive Insertion Policies  for Managing Shared Caches


TADIP Compared to Offline Best Static Policy

[Chart: Throughput Normalized to LRU (1.00-1.60) for DIP, TADIP, and BEST STATIC across MIX_0 … MIX_14 and GEOMEAN]

TADIP is within 85% of the Best Offline-Determined Insertion Policy Decision

Static Best is almost always better because the insertion string with the best IPC is chosen as “Best Static”, while TADIP optimizes for fewer misses. TADIP can be used to optimize other metrics (e.g., IPC).

TADIP Better Due to Phase Adaptation

Page 20: Adaptive Insertion Policies  for Managing Shared Caches


TADIP Vs. UCP ( MICRO’06 )

[Chart: Throughput Normalized to LRU (1.00-1.60) for TADIP and UCP across MIX_0 … MIX_14 and GEOMEAN]

TADIP Out-Performs UCP (Utility-Based Cache Partitioning, MICRO’06) Without Requiring Any Cache Partitioning Hardware

Cost Per Thread (bytes):  UCP: 1920, TADIP: 2

Unlike Cache Partitioning Schemes, TADIP Does NOT Reserve Cache Space
TADIP Does Efficient CACHE MANAGEMENT by Changing the Insertion Policy

Page 21: Adaptive Insertion Policies  for Managing Shared Caches


TADIP Results – Sensitivity to Cache Size

[Chart: Throughput Normalized to 4MB LRU (1.00-2.00) for TADIP-4MB, LRU-8MB, TADIP-8MB, and LRU-16MB across MIX_0 … MIX_14 and GEOMEAN]

TADIP Provides Performance Equivalent to Doubling Cache Size

Page 22: Adaptive Insertion Policies  for Managing Shared Caches


TADIP Results – Scalability

[Chart: Throughput Normalized to the LRU of the Respective Baseline System (1.00-2.00) for 2-, 4-, 8-, and 16-thread systems across 50 workloads]

TADIP Scales to Large Number of Concurrently Executing Applications

Page 23: Adaptive Insertion Policies  for Managing Shared Caches


Summary

• The Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit

• Solution: Thread-Aware Dynamic Insertion Policy (TADIP)

  1. Provides High Performance by Allocating Cache on a Benefit Basis
     – Up to 94%, 64%, 26%, and 16% performance improvement on 2-, 4-, 8-, and 16-core CMPs
  2. Is Robust Across Different Workload Mixes
     – Does not significantly hurt performance when LRU works well
  3. Scales to Large Number of Competing Applications
     – Evaluated up to 16 cores in our study
  4. Requires Low Design Overhead
     – < 2 bytes per HW-thread and NO CHANGES to the existing cache structure

Page 24: Adaptive Insertion Policies  for Managing Shared Caches


Q&A

Page 25: Adaptive Insertion Policies  for Managing Shared Caches

Journal of Instruction-Level Parallelism

1st Data Prefetching Championship (DPC-1)
Sponsored by: Intel, JILP, IEEE TC-uARCH
In conjunction with: HPCA-15

Paper & Abstract Due: December 12th, 2008
Notification: January 16th, 2009
Final Version: January 30th, 2009

More Information and Prefetch Download Kit At: http://www.jilp.org/dpc/

Page 26: Adaptive Insertion Policies  for Managing Shared Caches

TADIP Results – Weighted Speedup

[Chart: Weighted Speedup Normalized to LRU (1.00-1.20) for DIP and TADIP across MIX_0 … MIX_14 and GEOMEAN]

TADIP Provides More Than Two Times the Performance Benefit of DIP
TADIP Improves Performance over LRU by 18%

Page 27: Adaptive Insertion Policies  for Managing Shared Caches


TADIP Results – Fairness Metric

[Chart: Harmonic Mean of Normalized IPCs (0.00-1.00) for LRU, DIP, and TADIP across MIX_0 … MIX_14 and GEOMEAN]

TADIP Improves the Fairness

Page 28: Adaptive Insertion Policies  for Managing Shared Caches

TADIP In The Presence of Prefetching (4-core CMP)

[Chart: Throughput Normalized to LRU + Prefetching (0.90-1.70) across 50 workloads]

TADIP Improves Performance Even In Presence of HW Prefetching

Page 29: Adaptive Insertion Policies  for Managing Shared Caches


Insertion Policy to Control Cache Occupancy (16-Cores)

• Changing the insertion policy directly controls the amount of cache resources provided to an application

• The figure shows only the TADIP-selected insertion policy for xalancbmk & sphinx3

• TADIP improves performance by 28%

Sixteen-Core Mix with 16MB LLC

[Charts: MPKI, % MRU insertions, Cache Usage, and APKI]

Insertion Policy Directly Controls Cache Occupancy

Page 30: Adaptive Insertion Policies  for Managing Shared Caches

TADIP Using Set-Dueling Monitors (SDMs)

• Assume a cache shared by 2 applications: APP0 and APP1

• Set-level view: APP0’s SDMs use < 0, P1 > and < 1, P1 >; APP1’s SDMs use < P0, 0 > and < P0, 1 >

• High-level view: the remaining sets are Follower Sets using < P0, P1 >, where P = MSB( PSEL )

• PSEL0, PSEL1: misses to an app’s LRU SDM increment its PSEL; misses to its BIP SDM decrement it
  – In the presence of the other app, should APP0 do LRU or BIP?
  – In the presence of the other app, should APP1 do LRU or BIP?

• 32 sets per SDM, 9-bit PSEL

Page 31: Adaptive Insertion Policies  for Managing Shared Caches

TADIP Using Set-Dueling Monitors (SDMs)

• Assume a cache shared by 2 applications: APP0 and APP1
  – LRU SDMs for each APP: < 0, P1 > and < P0, 0 >
  – BIP SDMs for each APP: < 1, P1 > and < P0, 1 >
  – Follower sets use < P0, P1 >

• PSEL0, PSEL1: per-APP PSEL counters
  – Misses to LRU SDM: PSEL++
  – Misses to BIP SDM: PSEL--

• Follower sets insertion policy:
  – SDMs of one thread are follower sets of the other thread
  – Let Px = MSB[ PSELx ]
  – Fill decision: < P0, P1 >

• 32 sets per SDM, 9-bit PSEL counter

HW Required: (9*T) bits + Combinational Logic