
Page 1: DYNAMO vs. ADORE A Tale of Two Dynamic Optimizers

DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers

Wei Chung Hsu (徐慰中)
Computer Science Department, National Chiao Tung University (交通大學)

(work was done at the University of Minnesota, Twin Cities)
3/05/2010

Page 2:

Dynamo
• Dynamo is a dynamic optimizer.
• It won the best paper award at PLDI 2000 and has been cited 612 times.
• Work started at HP Labs and the HP systems lab. MIT took over and ported it to x86, calling it DynamoRIO. This group later started a new company, Determina (now acquired by VMware).
• Considered revolutionary, since optimizations had always been performed statically (i.e., at compile time).

Page 3:

SPEC CINT2006 for Opteron X4

| Name | Description | IC ×10⁹ | CPI | Tc (ns) | Exec time (s) | Ref time (s) | SPECratio |
|---|---|---|---|---|---|---|---|
| perl | Interpreted string processing | 2,118 | 0.75 | 0.40 | 637 | 9,777 | 15.3 |
| bzip2 | Block-sorting compression | 2,389 | 0.85 | 0.40 | 817 | 9,650 | 11.8 |
| gcc | GNU C Compiler | 1,050 | 1.72 | 0.40 | 724 | 8,050 | 11.1 |
| mcf | Combinatorial optimization | 336 | 10.00 | 0.40 | 1,345 | 9,120 | 6.8 |
| go | Go game (AI) | 1,658 | 1.09 | 0.40 | 721 | 10,490 | 14.6 |
| hmmer | Search gene sequence | 2,783 | 0.80 | 0.40 | 890 | 9,330 | 10.5 |
| sjeng | Chess game (AI) | 2,176 | 0.96 | 0.40 | 837 | 12,100 | 14.5 |
| libquantum | Quantum computer simulation | 1,623 | 1.61 | 0.40 | 1,047 | 20,720 | 19.8 |
| h264avc | Video compression | 3,102 | 0.80 | 0.40 | 993 | 22,130 | 22.3 |
| omnetpp | Discrete event simulation | 587 | 2.94 | 0.40 | 690 | 6,250 | 9.1 |
| astar | Games/path finding | 1,082 | 1.79 | 0.40 | 773 | 7,020 | 9.1 |
| xalancbmk | XML parsing | 1,058 | 2.70 | 0.40 | 1,143 | 6,900 | 6.0 |
| Geometric mean | | | | | | | 11.7 |

Very high cache miss rates; the ideal CPI should be 0.33.

Time = CPI × Instruction count × Clock period
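As a quick check on the table, the execution-time formula can be applied to the perl row (a small sketch; all inputs are taken from the table, and the results round slightly differently from the table's 637 s):

```python
# Execution time = CPI x instruction count x clock period (perl row).
ic = 2118e9          # instruction count: 2,118 x 10^9
cpi = 0.75
tc = 0.40e-9         # clock period: 0.40 ns
ref_time = 9777      # SPEC reference time in seconds

exec_time = cpi * ic * tc            # seconds
specratio = ref_time / exec_time     # SPECratio = reference time / measured time

print(round(exec_time, 1))   # 635.4 (the table rounds to 637)
print(round(specratio, 1))   # 15.4 (the table reports 15.3)
```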

Page 4:

Where have all the cycles gone?
• Cache misses
  – Capacity, compulsory/cold, conflict, coherence
  – I-cache and D-cache
  – TLB misses
• Branch mis-predictions
  – Static and dynamic prediction
  – Mis-speculation
• Pipeline stalls
  – Ineffective code scheduling, often caused by memory aliasing

These causes are unpredictable and hard to deal with at compile time.

Page 5:

Trend of Multi-cores

Exploiting these potentials demands thread-level parallelism.

[Figure: Intel Core i7 die photo]

Page 6:

Exploiting Thread-Level Parallelism

[Figure: three timelines for a store to *p followed by a load from *q. Sequential execution runs them in order. Traditional parallelization runs them in parallel only when the compiler can prove p != q; since the dependence is unpredictable, the compiler gives up. Thread-level speculation (TLS) runs them in parallel anyway: if p != q (e.g. Store 88 and Load 20), the parallel execution succeeds; if p == q, the speculative Load 88 conflicts with Store 88 and the speculation fails.]

Potentially more parallelism with speculation.
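The success and failure cases in the figure can be mimicked with a toy model (purely illustrative; real TLS detects conflicts in hardware via the coherence protocol, not with an explicit pointer comparison):

```python
# Toy model of thread-level speculation (TLS); illustrative only.
def tls_run(mem, p, q):
    spec_load = mem[q]     # speculative thread loads *q early, assuming p != q
    mem[p] = 88            # non-speculative thread stores 88 to *p
    if p == q:             # addresses alias: speculation fails,
        return mem[q], False   # the load is replayed and now sees 88
    return spec_load, True     # no aliasing: the early load was safe

print(tls_run({0: 5, 1: 20}, p=0, q=1))  # (20, True): parallel execution succeeds
print(tls_run({0: 5, 1: 20}, p=0, q=0))  # (88, False): speculation failure, replay
```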

Page 7:

Dynamic Optimizers

• Dynamic optimizers
  – Java VM (JVM) with a JIT compiler (dynamic compilation or adaptive optimization)
  – Dynamic Binary Optimizers (DBO)
    • Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64)
    • Non-native dynamic binary translators (e.g. x86 → IA64, ARM → MIPS, PPC → x86; QEMU, VMware, Rosetta)

Page 8:

More on why dynamic binary optimization
• New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries.
  – x86 P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …
• Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers.
  – DLLs, middleware, binary distribution, …
• Profile-sensitive optimizations would be more effective if performed at runtime.
  – Predication, speculation, branch prediction, prefetching
• A multi-core environment with dynamic resource sharing makes static optimization challenging.
  – Shared caches, off-chip bandwidth, shared FUs

Page 9:

How Dynamo Works

[Flowchart: Dynamo's main loop.] Dynamo interprets until a taken branch, then looks up the branch target in the code cache. On a hit, it jumps into the code cache (a signal handler returns control to the runtime). On a miss, if the target satisfies a start-of-trace condition, the counter for that branch target is incremented; once the counter exceeds a threshold, Dynamo interprets while generating code until an end-of-trace condition is met, then creates the trace, optimizes it, and emits it into the code cache.

Dynamo is VM based.
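The counter-and-threshold loop above can be sketched in a few lines (a minimal model; the threshold value and the trace recorder are placeholders, not Dynamo's actual parameters):

```python
# Minimal sketch of Dynamo-style hot-target detection; illustrative only.
THRESHOLD = 50

counters = {}      # branch target -> execution count
code_cache = {}    # trace head -> optimized trace

def on_taken_branch(target, record_trace):
    """Called each time interpretation reaches a taken branch."""
    if target in code_cache:
        return code_cache[target]          # hit: jump into the code cache
    counters[target] = counters.get(target, 0) + 1
    if counters[target] > THRESHOLD:
        trace = record_trace(target)       # interpret + code gen to end-of-trace
        code_cache[target] = trace         # emit into the cache
        return trace
    return None                            # keep interpreting

# Drive the loop with a fake trace recorder.
record = lambda t: [t, t + 1, t + 2]
for _ in range(60):
    on_taken_branch(0x400, record)
print(0x400 in code_cache)  # True once the counter exceeds the threshold
```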

Page 10:

Trace Selection

[Figure: a control-flow graph with blocks A through I. A branches to B and C; the hot path A→C→D→F→G→I→E (crossing a call and its return) is selected as a trace and laid out contiguously in the trace cache. Trace-exit branches (to B, to H) go back to the runtime.]
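Trace growth along the biased path in the figure can be sketched as follows (the CFG edge counts are made up to mirror the A..I figure; the end-of-trace test here is simply a back edge to the trace head):

```python
# Sketch: grow a trace by following the most frequently taken successor,
# stopping at a back edge to the trace head (end-of-trace condition).
succ_counts = {
    "A": {"B": 3, "C": 97}, "C": {"D": 100}, "D": {"F": 100},
    "F": {"G": 90, "H": 10}, "G": {"I": 100}, "I": {"E": 100},
    "E": {"A": 100},
}

def select_trace(head):
    trace, block = [head], head
    while True:
        nxt = max(succ_counts[block], key=succ_counts[block].get)
        if nxt == head:          # back edge to the trace head: stop
            return trace
        trace.append(nxt)
        block = nxt

print(select_trace("A"))  # ['A', 'C', 'D', 'F', 'G', 'I', 'E']
```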

Page 11:

Backpatching

[Figure: the trace A→C→D→F→G→I→E with exits to B and to H, plus a new trace H→I→E.] When H becomes hot, a new trace is selected starting from H, and the trace exit branch in block F is backpatched to branch to the new trace.
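This trace-chaining step can be sketched with a small model (illustrative names; exit stubs initially return to the runtime and are patched once their target gets a trace of its own):

```python
# Sketch of trace chaining via backpatching; illustrative only.
traces = {"A": ["A", "C", "D", "F", "G", "I", "E"]}
exit_stubs = {("F", "H"): "runtime"}   # exit in block F whose target is H

def make_trace(head, blocks):
    traces[head] = blocks
    # Backpatch every stub whose exit target is the new trace's head.
    for (block, target), dest in list(exit_stubs.items()):
        if target == head and dest == "runtime":
            exit_stubs[(block, target)] = head   # now jumps trace-to-trace

make_trace("H", ["H", "I", "E"])
print(exit_stubs[("F", "H")])  # "H": the exit in F now branches to the new trace
```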

Page 12:

Execution Migrates to Code Cache

[Figure: hot regions 1, 2, 3 of a.out are turned into traces 0–4 in the code cache by the interpreter/emulator, trace selector, and optimizer; execution gradually migrates from a.out into the code cache.]

Page 13:

Trace Based Optimizations
• Full and partial redundancy elimination
• Dead code elimination
• Trace scheduling
• Instruction cache locality improvement
• Dynamic procedure inlining (or procedure outlining)
• Some loop based optimizations
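One attraction of traces is that they are straight-line code, so classic passes become simple backward walks. A toy dead-code-elimination pass over a straight-line trace (illustrative only, not Dynamo's implementation; instructions are (destination, sources) pairs):

```python
# Toy straight-line-trace dead code elimination: drop an assignment whose
# destination is overwritten before any use.
def dce(trace):
    live, kept = set(), []
    for dst, srcs in reversed(trace):   # walk the trace backwards
        if dst in live or dst == "ret":
            kept.append((dst, srcs))
            live.discard(dst)           # this definition satisfies the use
            live.update(srcs)           # its sources become live
    return kept[::-1]

# r1 is recomputed before its first definition is ever used: that one is dead.
trace = [("r1", ["a"]), ("r1", ["b"]), ("ret", ["r1"])]
print(dce(trace))  # [('r1', ['b']), ('ret', ['r1'])]
```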

Page 14:

Summary of Dynamo
• Dynamic binary optimization customizes performance delivery:
  – Code is optimized by how the code is used
    • Dynamic trace formation and trace-based optimizations
  – Code is optimized for the machine it runs on
  – Code is optimized when all executables are available
  – Only the parts of the code that really matter are optimized

Page 15:

ADORE
• ADORE means ADaptive Object code RE-optimization.
• It was developed at the CSE department, University of Minnesota, Twin Cities.
• It applies a very different model for dynamic optimization systems.
• Considered evolutionary; cited 61 times.

Page 16:

Dynamic Binary Optimizer's Models

[Figure: two system stacks. In the first, the DBO sits between the application binaries and the operating system; in the second, the DBO runs beside the application binaries on top of the operating system.]

• DBO below the application — Dynamo (PA-RISC), DynamoRIO (x86): translate most execution paths and keep them in the code cache; easy to maintain control.
• DBO beside the application — ADORE (IA64, SPARC), COBRA (IA64, x86 – ongoing): translate only hot execution paths and keep them in the code cache; lower overhead.

Page 17:

ADORE Framework

[Figure: ADORE runs as a dynamic optimization thread alongside the main thread. The hardware performance monitoring unit (PMU) is initialized and interrupts the kernel on events; on kernel-buffer overflow, samples reach the optimization thread, which performs phase detection, trace selection, optimization, and deployment. On a phase change, traces are passed to the optimizer; optimized traces go into an initialized code cache and are deployed by patching.]
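The sampling side of this pipeline can be sketched as follows (illustrative; the buffer size and hotness cutoff are made-up numbers, not ADORE's, and real samples come from the PMU rather than a list):

```python
# Sketch of sampling-based profiling: aggregate sampled PCs and pick hot
# targets each time a sampling buffer "overflows".
from collections import Counter

BUF_SIZE = 8

def hot_pcs(samples, cutoff=0.25):
    """Return PCs that account for at least `cutoff` of a full buffer."""
    buf, hot = [], set()
    for pc in samples:
        buf.append(pc)
        if len(buf) == BUF_SIZE:              # buffer overflow: process it
            for pc2, n in Counter(buf).items():
                if n / BUF_SIZE >= cutoff:
                    hot.add(pc2)
            buf.clear()
    return hot

samples = [0x10, 0x10, 0x10, 0x24, 0x10, 0x10, 0x30, 0x10,
           0x10, 0x10, 0x24, 0x10, 0x10, 0x10, 0x10, 0x44]
print(sorted(hot_pcs(samples)))  # [16]: only PC 0x10 is hot
```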

Page 18:

Thread Level View

[Figure: timeline of two application threads. ADORE is initialized per thread; the kernel-buffer overflow handler fills a per-thread user buffer, and when the user buffer is full, ADORE is invoked on that thread and then goes back to sleep. The user buffer is maintained for one main event, usually CPU_CYCLES.]

Page 19:

Perf. of ADORE/Itanium on SPEC2000

Page 20:

Performance on BLAST

[Figure: % speed-up (roughly −15% to 60%) over a range of BLAST queries (blastn nt.1, blastn nt.10(4), blastn nt.10(5), blastn nt.10(7), blastp aa.1, blastx nt.1, tblastn aa.1) for binaries built with GCC -O2, ORC -O2, and ECC -O2.]

Page 21:

ADORE vs. Dynamo

| Tasks | Dynamo | ADORE |
|---|---|---|
| Observation (profiling) | Interpretation/instrumentation based | HPM sampling based |
| Optimization | Trace layout and classic optimization | I/D-cache related optimizations (prefetching + trace layout) |
| Code cache | Needs a large code cache | A small code cache is sufficient |
| Re-direction | Interpretation and trace chaining | Code patching |

Page 22:

ADORE on Multi-cores

• The COBRA (Continuous Object code Re-Adaptation) framework is a follow-up project, implemented on Itanium Montecito and new x86 multi-core machines.
• ADORE runs on SPARC Panther (UltraSPARC IV+) multi-core machines.
• ADORE is also used for TLS tuning.

Page 23:

COBRA Framework

• Optimization Thread
  – Centralized control
  – Initialization
  – Trace selection
  – Trace optimization
  – Trace patching
• Monitor Threads
  – Localized control
  – Per-thread profile

[Figure: a multi-threaded program with COBRA's monitoring and optimizing threads in the same address space, on a single system image. A perfmon sampling kernel driver reads the hardware performance counters of each processor into a kernel sampling buffer (KSB); per-thread monitoring threads drain it into per-thread user sampling buffers (USB) and profile buffers (PB). A per-thread phase and profile manager feeds the optimization thread, whose main controller drives trace selection and optimization, the trace cache, and the trace patcher for the main/working threads.]

Page 24:

Startup of a 4-thread OpenMP Program

[Figure: the main process (worker thread) vforks a monitoring process, then pthread_creates the OMP monitor thread and three more worker threads. The monitoring process creates one monitor thread per worker thread plus an optimizer thread; all run in the same address space. Numbered arrows 1–6 order the creations from start to end.]

Page 25:

Prefetch vs. NoPrefetch

• The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls.

[Figure: scalability of the DAXPY kernel on a 4-way Itanium 2 machine, normalized execution time vs. baseline for data working-set sizes 128K, 512K, and 2M, with 1, 2, or 4 threads, with and without prefetch; annotated differences of 26% and 34%.]

Page 26:

Prefetch vs. Prefetch with .excl

• .excl hint: prefetch a cache line in exclusive state instead of shared state (invalidation-based cache coherence protocol).

[Figure: scalability of the DAXPY kernel on a 4-way Itanium 2 machine, normalized execution time for working-set sizes 128K, 512K, and 2M, with 1, 2, or 4 threads, prefetch without/with .excl hints; annotated differences of 15% and 12%.]

Page 27:

Execution time on 4-way SMP

[Figure: speedup relative to the baseline (prefetch) on the NPB OMP v3.0 benchmarks (bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, avg) with 4 threads: prefetch, noprefetch, and prefetch.excl.]

noprefetch: up to 15%, average 4.7% speedup; prefetch.excl: up to 8%, average 2.7% speedup.

Page 28:

Execution time on cc-NUMA

[Figure: speedup relative to the baseline (prefetch) on the NPB OMP v3.0 benchmarks (bt.S, sp.S, lu.S, ft.S, mg.S, cg.S, avg) with 8 threads: prefetch, noprefetch, and prefetch.excl.]

noprefetch: up to 68%, average 17.5% speedup; prefetch.excl: up to 18%, average 8.5% speedup.

Page 29:

Summary of Results from COBRA
• We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors.
• Guided by the runtime profile, we experimented with two optimizations:
  – Reducing the aggressiveness of prefetching
    • Up to 15%, average 4.7% speedup on 4-way SMP
    • Up to 68%, average 17.5% speedup on SGI Altix cc-NUMA
  – Using the exclusive hint for prefetch
    • Up to 8%, average 2.7% speedup on 4-way SMP
    • Up to 18%, average 8.5% speedup on SGI Altix cc-NUMA

Page 30:

ADORE/SPARC
• ADORE has been ported to the SPARC/Solaris platform since 2005.
• Some porting issues:
  – ADORE uses the libcpc interface on Solaris to conduct runtime profiling. A kernel-buffer enhancement was added to Solaris 10.0 to reduce profiling and phase-detection overhead.
  – Reachability is a real problem (e.g. Oracle, Dyna3D).
  – The lack of a branch trace buffer is painful (e.g. BLAST).

Page 31:

Performance of In-Thread Opt. (USIII+)

[Figure: speedups ranging roughly from −10% to 60% for Base and Peak binaries.]

Page 32:

Helper Thread Prefetching for Multi-Core

[Figure: timeline. The main thread on the first core hits an L2 cache miss; a trigger (about 65 cycles of delay) activates a spin-waiting helper thread on the second core, which initiates prefetches so the main thread's later cache misses are avoided. The helper then spins again, waiting for the next trigger.]
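The trigger/wait/prefetch protocol in the figure can be sketched with two threads (a toy model; the 65-cycle trigger latency is not modeled, and a set stands in for the shared L2 cache):

```python
# Toy helper-thread prefetcher: a second thread waits on a trigger, then
# warms a software "cache" ahead of the main thread. Illustrative only.
import threading

cache = set()
trigger = threading.Event()

def helper(addresses):
    trigger.wait()           # wait for the activation trigger
    for a in addresses:
        cache.add(a)         # "prefetch": bring the lines into the cache

addrs = [0x100, 0x140, 0x180]
t = threading.Thread(target=helper, args=(addrs,))
t.start()
trigger.set()                # main thread fires the trigger at a cache miss
t.join()                     # (a real helper would spin for the next trigger)
print(all(a in cache for a in addrs))  # True: the later misses are avoided
```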

Page 33:

Performance of Dynamic Helper Thread (on Sun UltraSPARC IV+)

[Figure: speedups ranging roughly from −20% to 100% for Base and Peak binaries.]

Page 34:

Evaluation Environment for TLS

• Benchmarks: SPEC2000 written in C, -O3 optimization
• Underlying architecture: 4-core chip multiprocessor (CMP); speculation supported by coherence
• Simulator: superscalar with a detailed memory model; simulates communication latency; models bandwidth and contention; detailed, cycle-accurate simulation

[Figure: four processor (P) / cache (C) pairs connected by an interconnect.]

Page 35:

Dynamic Tuning for TLS

[Figure: speedups of 1.17x, 1.23x, and 1.37x across tuning configurations; parallel code overhead is called out.]

Page 36:

Summary of ADORE
• ADORE uses hardware performance monitoring (HPM) capability to implement a lightweight runtime profiling system. Efficient profiling and phase detection are the key to the success of dynamic native binary optimizers.
• ADORE can speed up real-world large applications already optimized by production compilers.
• ADORE works on two architectures: Itanium and SPARC. COBRA is a follow-up system of ADORE; it works on Itanium and x86.
• ADORE/COBRA can also optimize for multi-cores.
• ADORE has recently been applied to dynamic TLS tuning.

Page 37:

Conclusion

"It was the best of times, it was the worst of times…"
-- opening line of "A Tale of Two Cities"

Best of times for research: new areas where innovations are needed.
Worst of times for research: saturated areas where the technologies are mature or well understood, and it is hard to innovate.