
CH20. Optimization for the Memory Hierarchy 2006.5.3 이병현


Page 1:

CH20. Optimization for the Memory Hierarchy
2006.5.3 이병현

Page 2:

Outline

Introduction
Instruction-Cache Optimization
Scalar Replacement of Array Elements
Data-Cache Optimization

Page 3:

Introduction

[Figure: the processor-memory performance gap, 1980-2000, on a logarithmic performance scale (1 to 1000). CPU (µProc) performance grows about 60% per year (2x every 1.5 years), while DRAM performance grows about 9% per year (2x every 10 years); the gap grows about 50% per year.]

Page 4:

Using Hardware Assists: Instruction Prefetching

Hardware provides sequential prefetching of code.

Some newer 64-bit RISCs (e.g., SPARC-V9, Alpha) also provide software support
→ Fetching hints to the system's I-cache and instruction-fetch unit
→ Example for SPARC-V9: iprefetch address

Page 5:

Procedure Sorting

Sort the statically linked routines that make up an object module at link time, according to their calling relationships and frequency of use.

Objective
→ To place routines near their callers in virtual memory, so as to reduce paging traffic
→ To place frequently used and related routines where they are less likely to collide with each other in the I-cache

Page 6:

Procedure Sorting

Begin with the weighted, undirected static call graph.

Select the arc with the highest weight and merge the nodes it connects (a sketch of this greedy loop follows).
→ Coalesce their corresponding arcs
→ Add the weights of the coalesced arcs to compute the label for each coalesced arc

Nodes that are merged are placed next to each other in the final ordering of the procedures.
→ The weights of their connections are used to determine their relative order
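As an illustration, here is a minimal C sketch of the greedy merging loop, using an adjacency-matrix call graph. The graph and its weights are assumptions for demonstration, not the example on the next slide, and a fuller implementation would also use the connection weights to choose the relative order of procedures inside a merged node.

#include <stdio.h>
#include <string.h>

#define NP 8                      /* number of procedures */

static int  w[NP][NP];            /* symmetric arc weights */
static int  alive[NP];            /* node still present after merging? */
static char order[NP][128];       /* procedure ordering held by each node */

/* One step: pick the heaviest remaining arc, merge its endpoints,
 * and coalesce parallel arcs by summing their weights. */
static int merge_step(void)
{
    int best = 0, bi = -1, bj = -1;
    for (int i = 0; i < NP; i++)
        for (int j = i + 1; j < NP; j++)
            if (alive[i] && alive[j] && w[i][j] > best) {
                best = w[i][j]; bi = i; bj = j;
            }
    if (bi < 0) return 0;         /* no arcs left to merge */
    strcat(order[bi], " ");       /* bj's procedures follow bi's */
    strcat(order[bi], order[bj]);
    for (int k = 0; k < NP; k++)
        if (k != bi && k != bj) {
            w[bi][k] += w[bj][k];         /* coalesce arc weights */
            w[k][bi] = w[bi][k];
            w[bj][k] = w[k][bj] = 0;
        }
    w[bi][bj] = w[bj][bi] = 0;
    alive[bj] = 0;
    return 1;
}

int main(void)
{
    for (int i = 0; i < NP; i++) {
        alive[i] = 1;
        sprintf(order[i], "P%d", i + 1);
    }
    /* illustrative weights only (assumed), not the slide's graph */
    w[0][1] = w[1][0] = 50;       /* P1-P2 */
    w[1][3] = w[3][1] = 100;      /* P2-P4 */
    w[2][5] = w[5][2] = 90;       /* P3-P6 */
    w[6][7] = w[7][6] = 3;        /* P7-P8 */
    while (merge_step())
        ;
    for (int i = 0; i < NP; i++)  /* print the final layout order(s) */
        if (alive[i])
            printf("%s\n", order[i]);
    return 0;
}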

Page 7:

Procedure Sorting

[Figure: worked example on a weighted static call graph over P1…P8. At each step the highest-weight arc is chosen and its endpoints merged, coalescing parallel arcs and summing their weights: first [P2,P4], then [P3,P6], then [P5,[P2,P4]], then [[P3,P6],[P5,[P2,P4]]], and finally [P1,[[P3,P6],[P5,[P2,P4]]]] and [P7,P8].]

Resulting order: P1, P3, P6, P5, P2, P4, P7, P8

Page 8:

Procedure and Block Placement

Another approach to I-cache optimization, which can be combined with procedure sorting: modify the system linker to put each routine on an I-cache block boundary, allowing the later phases of the compilation process to position frequently executed code segments.

If most basic blocks are short, this helps keep the beginnings of basic blocks away from the ends of cache blocks.

The compiler can be instrumented to collect statistics, and profiling feedback can be used to guide placement.

Page 9:

Intraprocedural Code Positioning

Objective
→ Move infrequently executed code out of the main body of the code
→ Straighten the code, so that a higher fraction of the instructions fetched into the I-cache are actually executed

Page 10:

Intraprocedural Code Positioning

Build the procedure's flowgraph, with edges annotated with their execution frequencies, then perform a bottom-up search of the flowgraph:
a. Build chains of basic blocks that should be placed as straight-line code
b. Repeatedly merge the two chains whose respective tail and head are connected by the edge with the highest execution frequency (see the sketch below)
c. Select the entry chain and proceed through the other chains according to the weights of their connections
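A minimal C sketch of the chain-building step (b), under assumed data structures: chains as arrays of block numbers, edges as src/dst/frequency triples. The flowgraph below is illustrative, not the example on the next slide.

#include <stdio.h>

#define NB 6                       /* number of basic blocks */
#define NE 5                       /* number of flowgraph edges */

static int chain_of[NB];           /* which chain each block is in */
static int chains[NB][NB], chain_len[NB];

struct edge { int src, dst, freq; };
/* assumed edges: (source block, destination block, execution frequency) */
static struct edge edges[NE] = {
    {0,1,40}, {0,2,5}, {1,3,40}, {2,3,5}, {3,4,30}
};

int main(void)
{
    for (int b = 0; b < NB; b++) { /* start with one chain per block */
        chain_of[b] = b;
        chains[b][0] = b;
        chain_len[b] = 1;
    }
    for (;;) {
        int best = -1, bi = -1;
        /* find the heaviest edge joining the tail of one chain
         * to the head of another */
        for (int e = 0; e < NE; e++) {
            int cs = chain_of[edges[e].src];
            int cd = chain_of[edges[e].dst];
            if (cs == cd) continue;
            if (chains[cs][chain_len[cs]-1] != edges[e].src) continue;
            if (chains[cd][0] != edges[e].dst) continue;
            if (edges[e].freq > best) { best = edges[e].freq; bi = e; }
        }
        if (bi < 0) break;         /* no more mergeable chains */
        int cs = chain_of[edges[bi].src], cd = chain_of[edges[bi].dst];
        for (int k = 0; k < chain_len[cd]; k++) {  /* append cd to cs */
            int b = chains[cd][k];
            chains[cs][chain_len[cs]++] = b;
            chain_of[b] = cs;
        }
        chain_len[cd] = 0;
    }
    for (int c = 0; c < NB; c++) { /* print the resulting chains */
        if (chain_len[c] == 0) continue;
        for (int k = 0; k < chain_len[c]; k++)
            printf("B%d ", chains[c][k] + 1);
        printf("\n");
    }
    return 0;
}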

Page 11:

Intraprocedural Code Positioning

[Figure: flowgraph with blocks entry, B1…B9, exit, and each edge annotated with its execution frequency]

< Example flowgraph >

Page 12:

Intraprocedural Code Positioning

[Figure: the chains laid out in the order entry, B1, B2, B3, B4, B5, B8, B9, B6, B7, exit]

< Resulting Arrangement >

Page 13:

Procedure Splitting

Divides each procedure into a primary and a secondary component
→ Primary: frequently executed basic blocks
→ Secondary: rarely executed ones, such as exception-handling code

Then collects the secondary components of a series of procedures into a separate secondary section, packing the primary components more tightly together. (A toy sketch of the split decision follows.)
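A toy C sketch of the split decision, assuming per-block profile counts and a hypothetical hotness threshold; a real implementation must also rewrite branches between the primary and secondary components.

#include <stdio.h>

#define NB 6
#define HOT_THRESHOLD 10   /* assumed cutoff between primary and secondary */

struct block { const char *name; int exec_count; };

int main(void)
{
    /* illustrative profile for one procedure's basic blocks */
    struct block blocks[NB] = {
        {"entry", 500}, {"loop", 4000}, {"body", 4000},
        {"error_path", 1}, {"cleanup", 0}, {"exit", 500},
    };
    printf("primary:   ");
    for (int i = 0; i < NB; i++)
        if (blocks[i].exec_count >= HOT_THRESHOLD)
            printf("%s ", blocks[i].name);
    printf("\nsecondary: ");
    for (int i = 0; i < NB; i++)
        if (blocks[i].exec_count < HOT_THRESHOLD)
            printf("%s ", blocks[i].name);
    printf("\n");
    return 0;
}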

Page 14:

Procedure Splitting

[Figure: a group of procedure bodies P1…P4, each split into primary (P) and secondary (S) components, followed by the result of collecting each type of component into its own section]

Page 15:

Combining Intra- and Interprocedural Methods

McFarling's work
- Focuses on optimization of entire programs for direct-mapped I-caches
- Works on object modules
- Depends on rearranging instructions in memory and segregating some instructions so that they are not cached at all

Page 16:

Scalar Replacement of Array Elements

Scalar Replacement
→ Replaces subscripted variables by scalars, making them available for register allocation
→ Finds opportunities to reuse array elements and replaces the reuses with references to scalar temporaries
→ Improves speed
→ Decreases the need for D-cache optimization

Page 17:

Scalar Replacement of Array Elements

Example: matrix multiplication

do i = 1,N
  do j = 1,N
    do k = 1,N
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

becomes

do i = 1,N
  do j = 1,N
    ct = C(i,j)
    do k = 1,N
      ct = ct + A(i,k)*B(k,j)
    enddo
    C(i,j) = ct
  enddo
enddo

Carrying C(i,j) in the scalar ct reduces the number of memory accesses by 2(N³ − N²).

Example: loop-carried reuse

for i ← 1 to n do
  b[i+1] ← b[i] + 1.0
  a[i] ← 2*b[i] + c[i]
endfor

After peeling the first iteration and carrying b[i] in scalars:

if n >= 1 then
  t0 ← b[1]
  t1 ← t0 + 1.0
  b[2] ← t1
  a[1] ← 2 * t0 + c[1]
endif
t0 ← t1
for i ← 2 to n do
  t1 ← t0 + 1.0
  b[i+1] ← t1
  a[i] ← 2 * t0 + c[i]
  t0 ← t1
endfor

This reduces the number of memory accesses by 40%.

Page 18:

Scalar Replacement of Array Elements

Loop interchange may increase opportunities by making loop-carried dependences be carried by the innermost loop:

for i ← 1 to n do
  for j ← 1 to n do
    a[i,j] ← b[i] + 0.5
    a[i+1,j] ← b[i] - 0.5
  endfor
endfor

after interchange:

for j ← 1 to n do
  for i ← 1 to n do
    a[i,j] ← b[i] + 0.5
    a[i+1,j] ← b[i] - 0.5
  endfor
endfor

Loop fusion can create opportunities for scalar replacement by bringing together in one loop multiple uses of a single array element:

for i ← 1 to n do
  a[i] ← a[i] + 1.0
endfor
for j ← 1 to n do
  b[j] ← a[j] * 0.618
endfor

after fusion:

for i ← 1 to n do
  a[i] ← a[i] + 1.0
  b[i] ← a[i] * 0.618
endfor

Page 19:

Data Cache Optimization

Loop Transformations
Locality and Tiling
Data Prefetching

Page 20:

Loop Transformations

Do things like
→ Interchanging two nested loops
→ Reversing the order of a loop's iterations
→ Fusing two loop bodies

If chosen properly, the semantics of the program are preserved and its performance is improved.

Three general types
→ Unimodular transformations
→ Loop fusion and distribution
→ Tiling

Page 21:

Unimodular Loop Transformations

Definition
→ A loop transformation whose effect can be represented by the product of a unimodular matrix with each distance vector

Unimodular matrix
→ A square matrix with all integral components and a determinant of 1 or -1

Lexicographically positive vector
→ Has at least one nonzero element, and its first nonzero element is positive (see the sketch below)
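Both definitions are mechanical to check. Below is a small C sketch (the helper names are assumptions) that applies a unimodular matrix to a distance vector and tests the legality condition used on the following slides: a transformation is legal for a dependence if the transformed distance vector is still lexicographically positive.

#include <stdio.h>

#define DIM 2

/* first nonzero element positive => lexicographically positive */
static int lex_positive(const int v[DIM])
{
    for (int i = 0; i < DIM; i++) {
        if (v[i] > 0) return 1;
        if (v[i] < 0) return 0;
    }
    return 0;   /* all zero: not positive */
}

/* r = U * d: apply the unimodular matrix to a distance vector */
static void apply(const int U[DIM][DIM], const int d[DIM], int r[DIM])
{
    for (int i = 0; i < DIM; i++) {
        r[i] = 0;
        for (int j = 0; j < DIM; j++)
            r[i] += U[i][j] * d[j];
    }
}

int main(void)
{
    int interchange[DIM][DIM] = { {0,1}, {1,0} };
    int d[DIM] = {1, 0};   /* distance vector carried by the i loop */
    int r[DIM];
    apply(interchange, d, r);
    printf("(%d,%d) -> (%d,%d): %s\n", d[0], d[1], r[0], r[1],
           lex_positive(r) ? "legal" : "illegal");
    return 0;
}

Run on the interchange example of the next slide, it prints (1,0) -> (0,1): legal.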

Page 22:

Unimodular Loop Transformations

Loop interchange
→ Reverses the order of two adjacent loops in a loop nest

for i ← 1 to n do
  for j ← 1 to n do
    a[i,j] ← (a[i-1,j] + a[i+1,j])/2.0
  endfor
endfor

after interchange:

for j ← 1 to n do
  for i ← 1 to n do
    a[i,j] ← (a[i-1,j] + a[i+1,j])/2.0
  endfor
endfor

• Loop interchange matrix:
  [ 0 1 ]
  [ 1 0 ]
• Product with the distance vector (1, 0) is (0, 1)
• The result is lexicographically positive, so the interchange is legal

Page 23:

Unimodular Loop Transformations

Loop permutation
→ Generalizes loop interchange by allowing more than two loops to be moved at once and by not requiring them to be adjacent

• Example: interchanging the first loop with the third and the second with the fourth:
  [ 0 0 1 0 ]
  [ 0 0 0 1 ]
  [ 1 0 0 0 ]
  [ 0 1 0 0 ]

Page 24:

Unimodular Loop Transformations

Loop reversal
→ Reverses the order in which a particular loop's iterations are performed

• Reversing the direction of the middle loop of a triply nested loop:
  [ 1  0  0 ]
  [ 0 -1  0 ]
  [ 0  0  1 ]

• Applied to the distance vector (0, 1, 0), the product is (0, -1, 0)
• Illegal transformation: the result is lexicographically negative

Page 25:

Unimodular Loop Transformations

Loop skewing
→ Changes the shape of a loop's iteration space

for i ← 1 to n do
  for j ← 1 to n do
    a[i,j] ← a[i+j] + 1.0
  endfor
endfor

after skewing with the matrix
  [ 1 0 ]
  [ 1 1 ]

for i ← 1 to n do
  for j ← i+1 to i+n do
    a[i,j] ← a[j] + 1.0
  endfor
endfor

Page 26:

Loop Fusion and Distribution

Loop fusion
→ Takes two adjacent loops that have the same iteration-space traversal and combines their bodies into a single loop
→ Legal as long as the loops have the same bounds and as long as there are no flow, anti-, or output dependences in the fused loop

for i ← 1 to n do
  a[i] ← a[i] + 1.0
endfor
for j ← 1 to n do
  b[j] ← a[j] * 0.618
endfor

after fusion:

for i ← 1 to n do
  a[i] ← a[i] + 1.0
  b[i] ← a[i] * 0.618
endfor

Page 27:

Loop Fusion and Distribution

Loop distribution
→ Takes a loop that contains multiple statements and splits it into two loops with the same iteration space
→ Legal if it does not result in breaking any cycles in the dependence graph of the original loop

for i ← 1 to n do
  a[i] ← a[i] + 1.0
  b[i] ← a[i] * 0.618
endfor

after distribution:

for i ← 1 to n do
  a[i] ← a[i] + 1.0
endfor
for j ← 1 to n do
  b[j] ← a[j] * 0.618
endfor

Page 28:

Data Prefetching

Software Data Prefetching
Hardware Data Prefetching
→ Sequential prefetching
→ Prefetching with arbitrary strides
Integrating Hardware and Software Prefetching
Appendix: Data cache prefetching using a global history buffer

Page 29:

Data Prefetching

What is it?
→ A request for a future data need is initiated
→ Useful execution continues during the access
→ Data moves from slow/far memory to fast/near cache
→ Data is ready in the cache when needed (load/store)

When can it be used?
→ When future data needs are (somewhat) predictable

How is it implemented?
→ In hardware: history-based prediction of future accesses
→ In software: compiler-inserted prefetch instructions

Page 30:

Data Prefetching

[Figure: execution timelines under (a) no prefetching, (b) perfect prefetching, and (c) degraded prefetching]

Page 31:

Software Data Prefetching

Most contemporary microprocessors support some form of fetch instruction, which can be used to implement prefetching.

Fetch instructions
→ Added by the programmer or by the compiler
→ Can often be placed effectively by the programmer
→ Common characteristics: nonblocking memory operation; require a lockup-free cache

Loops with large array calculations provide excellent prefetching opportunities
→ Common in scientific codes
→ Exhibit poor cache utilization
→ Predictable array-referencing patterns

Page 32:

Software Data Prefetching

• Example code for loop-based prefetching; assume a four-word cache block:

for (i=0; i<N; i++)
    ip = ip + a[i]*b[i];

• Causes a cache miss every fourth iteration.

• Simple prefetching:

for (i=0; i<N; i++) {
    fetch(&a[i+1]);
    fetch(&b[i+1]);
    ip = ip + a[i]*b[i];
}

→ Several problems: prefetches need not be issued on every iteration; the redundant ones are unnecessary and degrade performance.

Page 33:

Software Data Prefetching

• Prefetching with loop unrolling:

for (i=0; i<N; i+=4) {
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}

→ Unroll the loop by a factor of the number of words to be prefetched per cache block.
→ There is still room for improvement: a cache miss occurs during the first iteration, and unnecessary prefetches occur in the last iteration of the unrolled loop.

Page 34:

Software Data Prefetching

• Software pipelining
→ Assumption: prefetching one iteration ahead of the data's actual use is sufficient to hide the latency of a main memory access.

fetch(&ip);
fetch(&a[0]);
fetch(&b[0]);
for (i=0; i<N-4; i+=4) {
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}
for ( ; i<N; i++)
    ip = ip + a[i]*b[i];

• Generalized for loops that contain small computational bodies, the loop is split into three sections:
→ prolog: prefetching only
→ main loop: prefetching and computation
→ epilog: computation only

fetch(&ip);
for (i=0; i<12; i+=4) {
    fetch(&a[i]);
    fetch(&b[i]);
}
for (i=0; i<N-12; i+=4) {
    fetch(&a[i+12]);
    fetch(&b[i+12]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}
for ( ; i<N; i++)
    ip = ip + a[i]*b[i];

→ Initiate prefetches δ = ⌈l/s⌉ unrolled iterations ahead of the use, where
    l : average memory latency
    s : estimated cycle time of the shortest possible execution path through one unrolled iteration
→ With l = 100 cycles and s = 45 cycles, δ = ⌈100/45⌉ = 3, so the code above prefetches three unrolled iterations (12 array elements) ahead.

Page 35:

Software Data Prefetching

The loop transformations are fairly mechanical, with some refinements, but the performance penalty must be considered:
→ Adds processor overhead: requires extra execution cycles, and prefetch source addresses must be calculated and stored
→ Increases register pressure, which leads to additional spill code
→ Significant code expansion
→ Unable to detect when a prefetched block has been prematurely evicted and needs to be refetched

Page 36:

Hardware Data Prefetching

Adds prefetching capability to a system without the need for programmer or compiler intervention.
→ No changes to existing executables
→ Instruction overhead is completely eliminated
→ Can take advantage of run-time information to make prefetching more effective

Page 37:

Sequential Prefetching

By grouping consecutive memory words into single units, caches already exploit the principle of spatial locality to implicitly prefetch data that is likely to be referenced in the near future. Simply enlarging blocks raises two concerns:
→ Cache pollution: as the cache block size increases, so does the amount of potentially useful data displaced from the cache to make room for the new block
→ False sharing: increasing the cache block size increases the likelihood of two processors sharing data from the same block

Sequential prefetching can take advantage of spatial locality without introducing these problems.

Page 38:

Sequential Prefetching

One-block-lookahead (OBL) implementation
→ Initiates a prefetch for block b+1 when block b is accessed
→ Differs from simply doubling the block size
→ Variants differ in what type of access to block b initiates the prefetch of b+1:

→ Prefetch-on-miss: initiates a prefetch for block b+1 whenever an access to block b results in a cache miss; if b+1 is already cached, no memory access is initiated
→ Tagged prefetch: associates a tag bit with every memory block; the tag bit detects when a block is demand-fetched or when a prefetched block is referenced for the first time, and in either case the next sequential block is prefetched

Page 39:

Sequential Prefetching

• Tagged prefetch is more effective than prefetch-on-miss.
• Why? With prefetch-on-miss, a strictly sequential access pattern results in a cache miss for every other cache block; with tagged prefetch, just one cache miss occurs (see the simulation sketch below).
• One shortcoming
→ Prefetches may not be initiated far enough in advance of the actual use to avoid a processor stall
→ To address this, increase the number of blocks prefetched after a demand fetch
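A tiny C simulation, under the idealized assumption of a cache large enough that nothing is evicted, reproduces the miss counts just described for a strictly sequential block stream.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 16

int main(void)
{
    int cached[NBLOCKS], tag[NBLOCKS];
    int misses;

    /* prefetch-on-miss: prefetch b+1 only when the access to b misses */
    memset(cached, 0, sizeof cached);
    misses = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        if (!cached[b]) {
            misses++;
            cached[b] = 1;
            if (b + 1 < NBLOCKS) cached[b + 1] = 1;   /* prefetch */
        }
    }
    printf("prefetch-on-miss: %d misses\n", misses);  /* every other block */

    /* tagged prefetch: also prefetch b+1 on the first reference
     * to a prefetched (tagged) block */
    memset(cached, 0, sizeof cached);
    memset(tag, 0, sizeof tag);
    misses = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        if (!cached[b]) {                 /* demand fetch */
            misses++;
            cached[b] = 1;
            if (b + 1 < NBLOCKS) { cached[b + 1] = 1; tag[b + 1] = 1; }
        } else if (tag[b]) {              /* first use of a prefetch */
            tag[b] = 0;
            if (b + 1 < NBLOCKS) { cached[b + 1] = 1; tag[b + 1] = 1; }
        }
    }
    printf("tagged prefetch:  %d misses\n", misses);  /* just one miss */
    return 0;
}

For 16 sequential blocks this prints 8 misses under prefetch-on-miss and 1 miss under tagged prefetch.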

Page 40:

Sequential Prefetching

Sequential prefetching with a degree of prefetching K
→ Prefetching K > 1 subsequent blocks helps the memory system stay ahead of rapid processor requests
→ But additional traffic and cache pollution are generated during program phases that show little spatial locality. Two responses:
   → adaptive sequential prefetching (Dahlgren et al. [1993])
   → FIFO stream buffers (Jouppi [1990])

[Figure: prefetch behavior when K = 2]

Page 41:

Sequential Prefetching

Properties
→ No changes to existing executables
→ Implemented with relatively simple hardware
→ Compared to software prefetching, sequential hardware prefetching performs poorly on non-sequential memory access patterns
→ Scalar references or array accesses with large strides can result in unnecessary prefetches

Page 42:

Prefetching with Arbitrary Strides

Employs special logic that monitors the processor's address-referencing pattern to detect constant-stride array references, by comparing successive addresses used by a load or store.

Assume a memory instruction mi references addresses a1, a2, a3 during three successive loop iterations:
→ Prefetching for mi is initiated if (a2 - a1) = Δ ≠ 0
→ The first prefetch address is A3 = a2 + Δ
→ Prefetching continues until the stride is broken, i.e., until (a_n - a_{n-1}) ≠ Δ

Mechanism: the RPT (reference prediction table)

Page 43:

Prefetching with Arbitrary Strides

RPT
→ Holds the reference histories for only the most recently used memory instructions
→ Table entries contain:
   → the address of the memory instruction
   → the previous address accessed by this instruction
   → a stride value, for entries that have established a stride
   → a state field that records the entry's current state (see the sketch below)
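A C sketch of one RPT entry and a plausible update rule. The four state names follow common descriptions of Chen and Baer's scheme (initial, transient, steady, no-prediction); the exact transitions, table lookup, and replacement policy are simplified assumptions.

#include <stdio.h>
#include <stdint.h>

enum rpt_state { INITIAL, TRANSIENT, STEADY, NO_PRED };

struct rpt_entry {
    uint64_t pc;          /* address of the memory instruction */
    uint64_t prev_addr;   /* previous address accessed by it */
    int64_t  stride;      /* stride, once one has been established */
    enum rpt_state state; /* current state of the entry */
};

/* Update an entry with a new effective address; returns the address
 * to prefetch, or 0 when no prefetch should be issued. */
static uint64_t rpt_update(struct rpt_entry *e, uint64_t addr)
{
    int64_t delta = (int64_t)(addr - e->prev_addr);
    int match = (delta == e->stride);
    e->prev_addr = addr;

    switch (e->state) {
    case INITIAL:         /* tentatively record a stride */
        e->stride = delta;
        e->state = match ? STEADY : TRANSIENT;
        break;
    case TRANSIENT:       /* stride confirmed -> steady, else give up */
        e->stride = match ? e->stride : delta;
        e->state = match ? STEADY : NO_PRED;
        break;
    case STEADY:          /* prediction held, or was broken once */
        if (!match) e->state = INITIAL;
        break;
    case NO_PRED:         /* keep retraining until a stride repeats */
        e->stride = delta;
        if (match) e->state = TRANSIENT;
        break;
    }
    return (e->state == STEADY) ? addr + (uint64_t)e->stride : 0;
}

int main(void)
{
    struct rpt_entry e = { 0x400100, 1000, 0, INITIAL };
    for (uint64_t a = 1008; a <= 1040; a += 8) {   /* stride-8 stream */
        uint64_t p = rpt_update(&e, a);
        if (p) printf("prefetch %llu\n", (unsigned long long)p);
    }
    return 0;
}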

Page 44:

Prefetching with Arbitrary Strides

[Figure: RPT organization and entry state transitions]

Page 45:

Prefetching with Arbitrary Strides

The RPT improves upon sequential policies by correctly handling strided array references, but the scheme described so far limits the prefetch distance to one loop iteration.

Prefetching with a distance
→ prefetch address = effective address + (stride × distance)
→ The RPT entries are maintained under the direction of the PC
→ Prefetches are initiated separately by a pseudo program counter, the lookahead program counter (LA-PC)
→ distance = the difference between the PC and the LA-PC

Page 46:

Integrating HW and SW Prefetching

Software prefetching
→ Compile-time analysis to schedule fetch instructions within the user program
Hardware prefetching
→ Exploits prefetching opportunities at run time, without any compiler or instruction-set support
Integrating these approaches
→ Gornish and Veidenbaum [1994]
→ Zhang and Torrellas [1995]
→ Chen [1995]

Page 47:

Integrating HW and SW Prefetching

Gornish and Veidenbaum [1994]
→ A variation on tagged hardware prefetching in which the degree of prefetching for a particular reference stream is calculated at compile time and passed on to the prefetch hardware
Zhang and Torrellas [1995]
→ Enables prefetching for irregular data structures: the compiler initializes tags in memory, and the actual prefetching is handled by hardware
Chen [1995]
→ A programmable prefetch engine, an extension of the RPT in which the tag, address, and stride information are supplied by the program

Page 48:

Prefetching Using a Global History Buffer

Prefetches from main memory into the lowest-level cache.

Conventional table-based prefetching
→ Stride prefetching
→ Correlation prefetching: Markov prefetching, distance prefetching

[Figure: conventional table-based prefetching: a prefetch key indexes a history table, and the prefetch algorithm uses the entry to generate prefetch addresses]

Page 49:

Conventional Table-Based Prefetching

Stride prefetching
→ Uses a table to store stride-related history information for individual load instructions
→ Following a cache miss to address a, if the algorithm detects a constant-stride pattern with stride s, it triggers prefetches for addresses a+s, a+2s, …, a+ds, where d is the prefetch degree

Page 50:

Conventional Table-Based Prefetching

Markov prefetching
→ Uses a history table to record consecutive miss addresses. When a cache miss occurs:
   → the miss address indexes the correlation table
   → the members of the table entry's address list are prefetched, most recent miss address first

Distance prefetching
→ A generalization of Markov prefetching that uses address deltas: the distances between consecutive miss addresses
→ More compact: one delta correlation can represent many miss-address correlations

Problems
a. Table data can become stale and consequently reduce prefetch accuracy
b. Tables suffer from conflicts when multiple prefetch keys hash to the same table entry
c. Tables hold a fixed, usually small, amount of history per entry

Page 51:

Prefetching Using a Global History Buffer

Index table
→ Prefetch algorithms access the index table with a key
→ The key can be a load instruction's PC, a cache miss address, or a hashed combination of the two
→ Entries contain pointers into the GHB

Global history buffer (GHB)
→ An n-entry FIFO table, implemented as a circular buffer
→ Holds the n most recent L2 miss addresses
→ Each entry stores a global miss address and a link pointer

[Figure: a prefetch key indexes the index table, whose entries point into the FIFO global history buffer of miss addresses; the prefetch algorithm walks the linked entries to produce prefetch addresses]

Page 52:

Prefetching Using a Global History Buffer

GHB example (using Markov prefetching)
a. When an L2 cache miss occurs, the miss address indexes the index table
b. On a hit, the index table entry points to the most recent occurrence of the same miss address in the GHB
c. That GHB entry is at the head of the linked list of other entries with the same miss address
d. For each entry on the list, the next FIFO-ordered entry is the miss address that immediately followed it in the past
e. These next miss addresses are the prefetch candidates; in the example, B and C

[Figure: the current miss address A indexes the index table; walking A's linked list through the GHB, the FIFO successors of past occurrences of A (here B and C) become the prefetches. A structural sketch follows.]
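A compact C sketch of the structure walked in this example; the table sizes, the key hash, and the prefetch degree are assumptions.

#include <stdio.h>
#include <stdint.h>

#define GHB_SIZE 8            /* FIFO of the 8 most recent miss addresses */
#define IT_SIZE  16           /* index table size (keyed by miss address) */
#define DEGREE   2            /* prefetch candidates emitted per miss */

static uint64_t ghb_addr[GHB_SIZE];
static int      ghb_link[GHB_SIZE];  /* previous occurrence of same key */
static int      ghb_head = 0;        /* total insertions so far */
static int      index_tab[IT_SIZE];  /* insertion number of the most
                                        recent entry for each key, or -1 */

static int key_of(uint64_t addr) { return (int)(addr % IT_SIZE); }
static int in_fifo(int n) { return n >= 0 && n > ghb_head - GHB_SIZE; }

/* Record an L2 miss and emit up to DEGREE Markov prefetch candidates:
 * the addresses that followed earlier occurrences of this address. */
static void ghb_miss(uint64_t addr)
{
    int k = key_of(addr), emitted = 0;
    for (int n = index_tab[k]; in_fifo(n) && emitted < DEGREE;
         n = ghb_link[n % GHB_SIZE]) {
        if (ghb_addr[n % GHB_SIZE] != addr) break;  /* key collision */
        int next = n + 1;         /* FIFO successor of that occurrence */
        if (next < ghb_head && in_fifo(next)) {
            printf("prefetch %llx\n",
                   (unsigned long long)ghb_addr[next % GHB_SIZE]);
            emitted++;
        }
    }
    ghb_link[ghb_head % GHB_SIZE] = index_tab[k];   /* push new entry */
    ghb_addr[ghb_head % GHB_SIZE] = addr;
    index_tab[k] = ghb_head++;
}

int main(void)
{
    for (int i = 0; i < IT_SIZE; i++) index_tab[i] = -1;
    uint64_t stream[] = { 0xA, 0xB, 0xA, 0xC, 0xA };  /* miss stream */
    for (int i = 0; i < 5; i++) ghb_miss(stream[i]);
    return 0;
}

With the miss stream A, B, A, C, A, the final miss of A walks its list and emits prefetches for C and B, the addresses that followed A in the past.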

Page 53:

Prefetching Using a Global History Buffer

Improvements
1. The GHB's FIFO naturally gives table space priority to the most recent history, eliminating the stale-data problem
2. The index table and the GHB can be sized separately
3. A designer can use the ordered global history to create more sophisticated prefetching methods than conventional table-based prefetching

Drawback
→ Collecting prefetch information requires multiple table accesses
→ However, the linked-list walk is short compared with the L2 miss latency

Page 54:

Conclusion for Data Prefetching

Prefetching schemes are diverse. Three basic questions help categorize a particular approach:
1. When are prefetches initiated?
2. Where are prefetched data placed?
3. What is prefetched?

The majority of prefetching schemes concentrate on numerical, array-based applications.

Despite the many application and system constraints, data prefetching has demonstrated the ability to reduce overall program execution time, both in simulation studies and in real systems.

Page 55:

References

Steven P. VanderWiel and David J. Lilja. 2000. Data Prefetch Mechanisms. ACM Computing Surveys, Vol. 32, No. 2.
Kyle J. Nesbit and James E. Smith. 2005. Data Cache Prefetching Using a Global History Buffer. IEEE Micro, Vol. 25, No. 1 (January/February 2005).