
Page 1: Global Load Instruction Aggregation Based on Code Motion

Global Load Instruction Aggregation

Based on Code Motion

The 2012 International Symposium on Parallel

Architectures, Algorithms and Programming.

December 18, 2012

Page 2: Global Load Instruction Aggregation Based on Code Motion

Outline

Background

Previous works

Motivations

Partial Redundancy Elimination (PRE)

Lazy Code Motion (LCM)

Global Load Instruction Aggregation (GLIA)

Experimental results

Conclusion

Page 3: Global Load Instruction Aggregation Based on Code Motion

[Figure: the speed gap between the processor and main memory]

Background

Page 4: Global Load Instruction Aggregation Based on Code Motion

[Figure: a cache memory sits between the processor and main memory; using it efficiently is important]

Background

Page 5: Global Load Instruction Aggregation Based on Code Motion

1. Prefetch instructions

2. Transform loop structures.

before:
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]

after:
for(i=0;i<10;i++)
  for(j=0;j<10;j++)
    ... = a[i][j]

Previous works
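A minimal C sketch of the loop interchange above; the array size and function names are illustrative, not taken from the slides. With a row-major array, the interchanged loops visit consecutive elements, so successive accesses tend to fall on the same cache line.

#define N 10

int a[N][N];

/* before: the inner loop varies i, so consecutive accesses are N ints apart */
int sum_before(void)
{
    int s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* after: the inner loop varies j, so consecutive accesses are adjacent in memory */
int sum_after(void)
{
    int s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}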

Page 6: Global Load Instruction Aggregation Based on Code Motion

for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]

[Figure: the memory layout of a and the order in which a[i][j] is accessed as i and j advance]

Previous works


Page 12: Global Load Instruction Aggregation Based on Code Motion

1. Local technique

ex. target: initial load instruction, loop only.

2. It is necessary to change the structure.

Problems

Page 13: Global Load Instruction Aggregation Based on Code Motion

[Figure: main memory holds a[i] and a[i+1]; the cache is still empty]

main(){
  x = a[i]
}

How can we apply cache optimization to any program globally?

Page 14: Global Load Instruction Aggregation Based on Code Motion

main(){
  x = a[i]
}

[Figure: loading a[i] brings the cache line holding a[i] and a[i+1] into the cache]

How can we apply cache optimization to any program globally?

Page 15: Global Load Instruction Aggregation Based on Code Motion

main(){
  ... = a[i]
  ... = b[i]
  ... = a[i+1]
}

[Figure: the cache holds a[i] and a[i+1]; main memory also holds b[i] and b[i+1]]

How can we apply cache optimization to any program globally?

Page 16: Global Load Instruction Aggregation Based on Code Motion

main(){
  ... = a[i]
  ... = b[i]
  ... = a[i+1]
}

[Figure: loading b[i] brings the line holding b[i] and b[i+1] into the cache, and a's line is evicted]

How can we apply cache optimization to any program globally?

Page 17: Global Load Instruction Aggregation Based on Code Motion

main(){
  ... = a[i]
  ... = b[i]
  ... = a[i+1]
}

[Figure: the access to a[i+1] now misses in the cache (cache miss)]

How can we apply cache optimization to any program globally?

Page 18: Global Load Instruction Aggregation Based on Code Motion

main(){
  ... = a[i]
  ... = b[i]
  ... = a[i+1]
}

Cache miss

We can remove this cache miss by changing the order of accesses.

How can we apply cache optimization to any program globally?
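A straight-line C sketch of the reordering above, assuming a[i] and a[i+1] share a cache line and the cache cannot hold both a's and b's lines at once; the function and parameter names are illustrative.

/* before: b[i] is accessed between the two accesses to a's cache line */
int access_before(const int *a, const int *b, int i)
{
    int x = a[i];        /* brings a's cache line in              */
    int y = b[i];        /* b's line may evict a's line           */
    int w = a[i + 1];    /* this access can then miss             */
    return x + y + w;
}

/* after: the two accesses to a's line are grouped, so the second one hits */
int access_after(const int *a, const int *b, int i)
{
    int x = a[i];
    int w = a[i + 1];    /* reused immediately: a cache hit       */
    int y = b[i];
    return x + y + w;
}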

Page 19: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

z = b[i]

Expelled from cache memory

w = a[i+j]

y = x+1

Code motion

Page 20: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

z = b[i]

w = a[i+j]

y = x+1

Code motion

Page 21: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

z = b[i]

Live range of w

y = x+1

w = a[i+j]

Code motion

Page 22: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

z = b[i]

y = x+1

[Figure: the live ranges of w and x overlap]

w = a[i+j]

Code motion

Page 23: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

z = b[i]

Spill

y = x+1

w = a[i+j]

Code motion

Page 24: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

t = Load(j)

z = b[i]

w = a[i+t]

y = x+1

Change the

access order

Code motion


Page 26: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

z = b[i]

w = a[i+j]

y = x+1

Delayed

Code motion
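A straight-line C sketch of the register-pressure problem shown above: hoisting the load of a[i+j] next to a[i] keeps w live across the intervening code and can force a spill, while delaying the moved load keeps its live range short. The names and the straight-line form are illustrative; the slides show the same situation on a control-flow graph.

/* hoisted: w is live from its definition down to its last use */
int hoisted(const int *a, const int *b, int i, int j)
{
    int x = a[i];
    int w = a[i + j];    /* long live range: x, w, z overlap below  */
    int z = b[i];
    int y = x + 1;       /* spill pressure is highest around here   */
    return y + z + w;
}

/* delayed: the moved load is placed just before its use */
int delayed(const int *a, const int *b, int i, int j)
{
    int x = a[i];
    int z = b[i];
    int y = x + 1;
    int w = a[i + j];    /* short live range, fewer overlaps        */
    return y + z + w;
}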

Page 27: Global Load Instruction Aggregation Based on Code Motion

We use Partial Redundancy Elimination (PRE)

One of the code optimizations

Eliminates redundant expressions

Implementation

Page 28: Global Load Instruction Aggregation Based on Code Motion

[Figure: PRE example on a branching control-flow graph]

before: x = a[i] on one path, then y = a[i] after the join

after: t = a[i]; x = t on that path, t = a[i] inserted on the other path, and y = t

PRE
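A C sketch of the PRE transformation above; the branch structure and the names are illustrative. Before the transformation the load of a[i] is partially redundant; after a load is inserted on the other path, the later load is fully redundant and is replaced by the temporary t.

int pre_before(const int *a, int i, int cond)
{
    int x = 0;
    if (cond)
        x = a[i];        /* a[i] is loaded on this path only       */
    int y = a[i];        /* partially redundant load               */
    return x + y;
}

int pre_after(const int *a, int i, int cond)
{
    int t;
    int x = 0;
    if (cond) {
        t = a[i];        /* original load, its value kept in t     */
        x = t;
    } else {
        t = a[i];        /* load inserted by PRE on this path      */
    }
    int y = t;           /* the redundant load is replaced by t    */
    return x + y;
}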

Page 29: Global Load Instruction Aggregation Based on Code Motion

LCM determines two insertion points: Earliest and Latest.

Knoop, J., Rüthing, O. and Steffen, B.: Lazy Code Motion, Proc. Programming Language Design and Implementation (PLDI), ACM, pp. 224-234, 1992.

x = a[i]

y = a[i]

LCM

• Earliest(n) denotes that node n is the insertion candidate closest to the start node.

• Latest(n) denotes that node n is the insertion candidate closest to the nodes that contain the same load instruction.

Page 30: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

y = a[i]

LCM

Page 31: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

y = a[i]

t = a[i]

LCM


Page 33: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

y = a[i]

t = a[i]

Delayed

LCM

Page 34: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

y = a[i]

t = a[i]

Delayed

LCM

Page 35: Global Load Instruction Aggregation Based on Code Motion

x = t

y = t

t = a[i]

LCM

Page 36: Global Load Instruction Aggregation Based on Code Motion

Purpose

1. Decrease cache misses.

2. Suppress register spills.

Extension

1. Move load instructions that are not redundant.

2. Delay them considering the order of memory accesses.

Global Load Instruction Aggregation (GLIA)

Page 37: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

w = a[i+1]

y = b[i]

GLIA

Page 38: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

t = a[i+1]

w = a[i+1]

y = b[i]

GLIA

Page 39: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

w = a[i+1]

y = b[i]

t = a[i+1]

GLIA


Page 41: Global Load Instruction Aggregation Based on Code Motion

x = a[i]

w = t

y = b[i]

t = a[i+1]

GLIA
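A straight-line C sketch of the GLIA result above: the load of a[i+1] is not redundant, but it is moved so that it is issued right after the load of a[i], and its value is carried to the original use in the temporary t. The names and the straight-line form are illustrative; the slides perform the same motion on a control-flow graph.

int glia_before(const int *a, const int *b, int i)
{
    int x = a[i];
    int y = b[i];
    int w = a[i + 1];    /* separated from the access to a[i]           */
    return x + y + w;
}

int glia_after(const int *a, const int *b, int i)
{
    int x = a[i];
    int t = a[i + 1];    /* aggregated with a[i]: likely the same line  */
    int y = b[i];
    int w = t;           /* the original use now reads the temporary    */
    return x + y + w;
}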

Page 42: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[i+1]

= b[i]

= a[i+1]

Application to the entire program

Page 43: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[i+1]

= b[i]

= a[i+1]

Application to the entire program

Page 44: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[i+1]

= b[i]

= a[i+1]

Application to the entire program

Page 45: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[i+1]

= b[i]

= a[i+1]

Application to the entire program

Page 46: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[i+1]

= b[i]

= a[i+1]

Application to the entire program

Page 47: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[i+1]

= b[i]

= a[i+1]

Application to the entire program

Page 48: Global Load Instruction Aggregation Based on Code Motion

We implemented our technique in the COINS compiler as an LIR converter.

Benchmark

SPEC2000

Measurement

1. Execution efficiency

2. The number of cache misses

Experiment

Page 49: Global Load Instruction Aggregation Based on Code Motion

Environment

SPARC64-V 2GHz, Solaris 10

Optimization

BASE: applies Dead Code Elimination (DCE)

GLIADCE: applies GLIA and DCE.

Experiment (1/2) | Execution efficiency

Page 50: Global Load Instruction Aggregation Based on Code Motion

The improvement for art is about 10.5%.

Experiment (1/2) | Execution efficiency

Page 51: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[j]

= b[i]

Reason for the decrease 1: speculative code motion

Page 52: Global Load Instruction Aggregation Based on Code Motion

= a[i]

= a[j]

= b[i]

Reason for the decrease 1: speculative code motion
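A C sketch of this slowdown cause; the branch and the names are illustrative. If the load of a[j] is hoisted above the branch so that it sits next to the load of a[i], it is executed even on the path that never needed it, which is speculative code motion.

int spec_before(const int *a, const int *b, int i, int j, int cond)
{
    int x = a[i];
    if (cond)
        return x + a[j]; /* a[j] is loaded only when cond holds           */
    return x + b[i];
}

int spec_after(const int *a, const int *b, int i, int j, int cond)
{
    int x = a[i];
    int t = a[j];        /* hoisted speculatively: executed on both paths */
    if (cond)
        return x + t;
    return x + b[i];
}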

Page 53: Global Load Instruction Aggregation Based on Code Motion

[Chart: the number of register spills]

Reason for the decrease 2: register spills

Page 54: Global Load Instruction Aggregation Based on Code Motion

System parameters of the x86 machine

Intel Core i5-2320, 3.00 GHz

Floating-point registers: 8

Integer registers: 8

L1D cache: 32 KB

L2 cache: 256 KB

L3 cache: 6144 KB

Experiment (2/2) | Cache misses

Page 55: Global Load Instruction Aggregation Based on Code Motion

The improvement for twolf is about 10.6%.

Experiment (2/2) | Level 2 cache misses

Page 56: Global Load Instruction Aggregation Based on Code Motion

The improvement for art is about 93.7%.

Experiment (2/2) | Level 3 cache misses

Page 57: Global Load Instruction Aggregation Based on Code Motion

We proposed a new cache optimization, GLIA.

1. GLIA can be applied to any program.

2. GLIA improves cache efficiency.

3. GLIA takes register spills into account.

Thank you for your attention.

Conclusion