Autotuning at Illinois María Jesús Garzarán University of Illinois

Autotuning at Illinois

María Jesús Garzarán

University of Illinois

Outline

1. Why Autotuning?

2. What is Autotuning?

3. Research Problems

Why autotuning?

• In the era of parallelism…
• Applications and software must maintain high efficiency as machines evolve.
  – Otherwise, no reason for new machines.
• Problem: High efficiency requires laborious tuning.
  – Cost increase.
  – Low performance if not enough resources.

• Would like to automate tuning.

Compilers

• One way is compilers, but compilers have limitations.
  – Lack semantic information → fewer choices
  – Must target all applications
  – Must be reasonably fast

Compiler vs. Manual Tuning: Discrete Fourier Transform

[Figure: compiler-generated versus manually tuned code for the DFT]

Compiler vs. Manual Tuning: Matrix-Matrix Multiplication

[Figure: MFLOPS versus matrix size for Intel MKL, icc -O3 -xT, and icc -O3; the gap is annotated as 20x]

Compiler vs. Manual Tuning: Matrix-Matrix Multiplication

loop 1: c[i*N+j] += a[i*N+k]*b[k*N+j]

loop 2: c[i][j] += a[i][k]*b[k][j]

loop 3: C += a[i][k]*b[k][j]
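The three loop bodies above are source-level variants of one computation. As a minimal sketch (in Python rather than the C the slide implies, and with hypothetical function names), the variants can be written out in full to confirm that they produce the same product; only the indexing style and the accumulation target differ.

```python
def mmm_loop1(a, b, n):
    """Loop 1: flat arrays with explicit index arithmetic."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i * n + j] += a[i * n + k] * b[k * n + j]
    return c


def mmm_loop2(a, b, n):
    """Loop 2: two-dimensional indexing, accumulating into c[i][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def mmm_loop3(a, b, n):
    """Loop 3: accumulate into a scalar (the slide's C), then store once."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c
```

For example, `mmm_loop2([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2)` and `mmm_loop3` on the same inputs return the same 2x2 result, and `mmm_loop1` returns it in flattened form.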

Compilers …

• Can and should improve

• But we will need other strategies (at least in the short term)




What is Autotuning

• An emerging strategy: empirical search
  – Goal: Automatically generate highly efficient code for each target machine (and input set).
  – Programmers develop metaprograms (programs that generate programs) that search the space of possible algorithms/implementations.

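Autotuning with empirical search

[Diagram: a metaprogram (description of the space of versions) drives a generator of the versions. The generated high-level code passes through a source-to-source optimizer and the native compiler to produce object code, which is executed on training input data. The measured performance is fed back to the generator, and the best version becomes the selected code.]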
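The generate, execute, measure, and select loop can be sketched in a few lines. This is a minimal illustration, not the implementation of any system named in these slides: `autotune`, `measure_seconds`, and the dictionary of candidate versions are hypothetical names, and the cost function is pluggable so that the search can be driven by something other than wall-clock time.

```python
import time


def measure_seconds(fn, data, repeats=3):
    """Best-of-N wall-clock time of fn on a fresh copy of the training data."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(list(data))
        best = min(best, time.perf_counter() - start)
    return best


def autotune(versions, training_data, measure=measure_seconds):
    """Exhaustive empirical search: return (name, function) of the cheapest version."""
    if not versions:
        raise ValueError("no candidate versions to search")
    best_name, best_cost = None, float("inf")
    for name, fn in versions.items():
        cost = measure(fn, training_data)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, versions[best_name]
```

A call such as `autotune({"asc": sorted, "desc": lambda d: sorted(d, reverse=True)}, range(1000))` returns whichever candidate ran fastest on this machine, which is the point of the approach: the answer is measured, not predicted.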

Autotuning

• More laborious than conventional programming, but
  – Longer lifetime → cost reduction
  – Can accumulate experience → better results
  – Can afford to search more extensively → better results

Examples of Existing Autotuning Systems

• ATLAS: Whaley, Petitet, Dongarra (Tennessee)
• BeBop: Demmel, Yelick, Im, Vuduc (Berkeley)
• Datamining: Jian, Garzarán, Snir (Illinois)
• FFTW: Frigo (MIT)
• Illinois Sorting: Li, Garzarán, Padua (Illinois)
• Matrix-matrix multiplication for GPU: Jiang, Snir (Illinois)
• PHiPAC: Bilmes, Asanovic, Vuduc, Iyer, Demmel, Chin, Lan (Berkeley)
• Space Pruning for GPU: Ryoo, Rodrigues, Stone, Baghsorkhi, Ueng, Stratton, Hwu (Illinois)
• SPIRAL: Moura, Pueschel (CMU), Johnson (Drexel), Garzarán, Padua (Illinois)
• SPIKETune: Wong, Kuck (Intel), Sameh (Purdue), Padua (Illinois)




Autotuning with empirical search

[Diagram: the empirical-search pipeline (metaprogram describing the version space, generator of the versions, source-to-source optimizer, native compiler, execution on training input data, selected code), annotated with open questions:]

• What to do when performance depends on the input?
• How to specify the search space?
• What is performance (execution time, power)?
• How to drive the search?

Research Issues

1. What to do when performance depends on input

2. Modeling/Search

3. Description of the space

4. What to tune

5. What to tune for

Very promising, but much to learn

Issue 1: Performance depends on input

• When performance depends on the input, we must generate dynamically adapting routines.
  – Illustrated with the generation of sorting routines

[CGO04] Li, Garzarán, Padua. A Dynamically Tuned Sorting Library. In Proc. of the Int. Symp. on Code Generation and Optimization, 2004.

[CGO05] Li, Garzarán, Padua. Optimizing Sorting with Genetic Algorithms. In Proc. of the Int. Symp. on Code Generation and Optimization, 2005.

Issue 1: Sorting

• Different algorithms to perform sorting
  – Radix sort
  – Quick sort
  – Merge sort

• No single algorithm is the best for all inputs and platforms

Our Contribution

• Design of hybrid algorithms and use of genetic search to find sorting routines that automatically adapt to the target machine and the input characteristics.

• Result:
  – Generation of the fastest sorting routines for sequential and parallel execution

Sorting

[Figure: performance (keys per cycle) versus standard deviation of the input for CC-Radix, Merge Sort, and Quicksort, in two panels: Intel Xeon and AMD Athlon MP. Annotation: same input, different performance.]

Hybrid sorting for dynamic adaptation

[Diagram: the sorting genome, a tree built from the primitives Divide with pivot, Divide by digit, Divide into block, and Select with entropy (branches: < theta, ≥ theta)]

Example of hybrid sorting

[Animation over the sorting genome (Divide with pivot, Divide by digit, Divide into block, Select with entropy with branches < theta and ≥ theta): the input is divided with a pivot into Bucket 1 and Bucket 2; operations are selected for each bucket based on entropy; the buckets are sorted one after the other until the whole input is sorted.]
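A hybrid routine of this kind can be sketched as a recursive function that inspects each partition and picks a primitive for it. The sketch below is an illustration under stated assumptions, not the generated routines from [CGO04, CGO05]: the names `hybrid_sort` and `digit_entropy`, the threshold `theta`, the 8-bit digit, and the rule "high digit entropy selects divide-by-digit, low entropy selects divide-with-pivot" are all hypothetical choices, and keys are assumed to be non-negative integers.

```python
import math
from collections import Counter


def digit_entropy(digits):
    """Shannon entropy (in bits) of the distribution of digit values."""
    n = len(digits)
    return -sum(c / n * math.log2(c / n) for c in Counter(digits).values())


def hybrid_sort(keys, theta=4.0, small=16, bits=8):
    """Sort non-negative integers, choosing a divide primitive per partition."""
    if len(keys) <= small:
        return sorted(keys)  # stand-in for a tuned small-input sort
    lo, hi = min(keys), max(keys)
    if lo == hi:
        return list(keys)
    # All keys share the bits above the highest bit where lo and hi differ,
    # so a digit taken just below that position orders the keys correctly.
    top = (lo ^ hi).bit_length()
    shift = max(0, top - bits)
    mask = (1 << bits) - 1
    digits = [(k >> shift) & mask for k in keys]
    if digit_entropy(digits) >= theta:
        # Select with entropy, branch ">= theta": divide by digit.
        buckets = {}
        for k, d in zip(keys, digits):
            buckets.setdefault(d, []).append(k)
        out = []
        for d in sorted(buckets):
            out.extend(hybrid_sort(buckets[d], theta, small, bits))
        return out
    # Branch "< theta": divide with pivot.
    pivot = keys[len(keys) // 2]
    less = [k for k in keys if k < pivot]
    equal = [k for k in keys if k == pivot]
    greater = [k for k in keys if k > pivot]
    return (hybrid_sort(less, theta, small, bits) + equal
            + hybrid_sort(greater, theta, small, bits))
```

Setting `theta=0.0` forces the digit-based path everywhere and a very large `theta` forces the pivot path, so the threshold is exactly the kind of parameter an empirical search would tune per machine.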

Learning: Algorithm Selection

[Diagram: training inputs and the target machine feed a learning mechanism, which produces a mapping from input data to the best algorithm; the mapping is used at runtime.]

Results: Sequential Sorting

[Figure: IBM Power3; ClassifierSort compared with IBM ESSL and C++ STL; annotation: 26%]

Results: Parallel Sorting

[Figure: two panels, both labeled Intel Quad Core]


Issue 2: Modeling/Search

• When the search space is too big, we must use models or better search mechanisms. Illustrated with:

1. An analytical model and hybrid approach for ATLAS

[PLDI03] Yotov, Li, Ren, Cibulskis, DeJong, Garzarán, Padua, Pingali, Stodghill, and Wu. A Comparison of Empirical and Model-driven Optimization. In PLDI, 2003.

[Proc of IEEE] Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? In Proc. of the IEEE, 2005.

[LCPC05] Epshteyn, Garzarán, Dejong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.

2. Genetic search for sorting [CGO04, CGO05]

ATLAS Modeling

• ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet, and Jack Dongarra at the University of Tennessee.
• ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS).
  – Use search to adapt to the target machine

Our Contribution

• Development of methods to speed up the search process.
  – Analytical models that replace the search
  – Hybrid models that combine models with empirical search

[LCPC05] Epshteyn, Garzarán, Dejong, Padua, Ren, Li, Yotov and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.

• The result
  – Same performance
  – Faster generation

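ATLAS Infrastructure

[Diagram: Detect Hardware Parameters produces NR, MulAdd, Latency, and L1Size. The ATLAS Search Engine (MMSearch) passes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM Code Generator (MMCase), which emits MiniMMM source. The source goes through Compile, Execute, Measure, and the resulting MFLOPS are fed back to the search engine.]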

Modeling for Optimization Parameters

• Our Modeling Engine

[Diagram: the ATLAS infrastructure with the search engine replaced by a Model. Hardware parameters NR, MulAdd, Latency, L1Size, and L1I$Size feed the Model, which passes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM Code Generator (MMCase), producing MiniMMM source.]

• Optimization parameters
  – NB: Hierarchy of Models (later)
  – MU, NU: MU*NU + MU + NU + Latency ≤ Registers
  – KU: maximize subject to L1 Instruction Cache
  – Latency, MulAdd: from hardware parameters
  – xFetch: set to 2
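The register-tile parameters MU and NU are bounded by the number of registers: read from the slide, the constraint is MU*NU + MU + NU + Latency ≤ Registers. A minimal sketch of how a model could pick them without any search follows. The objective used here (maximize the tile area MU*NU, preferring near-square tiles on ties) is an assumption for illustration and is not claimed to be the exact rule of the modeling engine; `choose_register_tile` is a hypothetical name.

```python
def choose_register_tile(registers, latency):
    """Return (MU, NU) maximizing MU*NU subject to
    MU*NU + MU + NU + latency <= registers, or None if nothing fits."""
    best_key, best_tile = None, None
    for mu in range(1, registers + 1):
        for nu in range(1, registers + 1):
            if mu * nu + mu + nu + latency > registers:
                continue
            # Larger tile area first; among equal areas prefer squarer tiles.
            key = (mu * nu, -abs(mu - nu))
            if best_key is None or key > best_key:
                best_key, best_tile = key, (mu, nu)
    return best_tile
```

With 32 registers and a latency of 4, the feasible tiles satisfy (MU+1)(NU+1) ≤ 29, and the largest area is 18, reached at (3, 6) and (6, 3); the function returns the first one it encounters.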

Modeling for Tile Size (NB)

• Models of increasing complexity
  – 3*NB^2 ≤ C
    • Whole work-set fits in L1
  – NB^2 + NB + 1 ≤ C
    • Fully associative
    • Optimal replacement
    • Line size: 1 word
  – ⌈NB^2/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B or ⌈NB^2/B⌉ + NB + 1 ≤ C/B
    • Line size > 1 word
  – (pair of alternative inequalities; formulas lost in extraction)
    • LRU replacement

[Diagrams: NB x NB tiles of A (M x K), B (K x N), and C (M x N), and the loop orders over M (I), N (J), and K]
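Each tile-size model is an inequality in NB, so the model-driven choice is the largest NB that still satisfies it. The sketch below evaluates three of the models as read from the slide: 3*NB^2 ≤ C, NB^2 + NB + 1 ≤ C, and ⌈NB^2/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B. It assumes C (cache capacity) and B (line size) are both given in array elements and that B divides C; the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def largest_nb(fits):
    """Largest NB >= 0 such that fits(NB) holds, assuming fits is monotone."""
    nb = 0
    while fits(nb + 1):
        nb += 1
    return nb


def nb_models(c, b):
    """Tile size predicted by each model for capacity c and line size b."""
    lines = c // b  # cache capacity in lines
    return {
        "whole work-set in L1": largest_nb(lambda nb: 3 * nb * nb <= c),
        "optimal replacement, 1-word lines": largest_nb(
            lambda nb: nb * nb + nb + 1 <= c),
        "optimal replacement, b-word lines": largest_nb(
            lambda nb: ceil_div(nb * nb, b) + ceil_div(nb, b) + 1 <= lines),
    }
```

For a 32 KB L1 holding 4096 doubles with 8-element lines, `nb_models(4096, 8)` gives 36 for the first model and 63 for the other two, which shows how much tile size the crudest model leaves unused.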

MMM Performance

[Figure: three panels (SGI R12000, Sun UltraSparc III, Intel Pentium III) plotting MFLOPS over a horizontal axis from 0 to 5000 for four versions: BLAS, COMPILER, ATLAS, and MODEL]

Models/Search

• Models reduce search time to 0.

• However, search is still necessary when a model does not exist.

Genetic search for sorting

[Diagram: the sorting genome, built from Divide with pivot, Divide by digit, Divide into block, and Select with entropy (branches: < theta, ≥ theta)]

Genetic operators are used to derive new offspring:
  – Mutation (add or remove subtrees, change parameters)
  – Crossover
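The two operators can be illustrated on a simpler genome than the slide's tree. The sketch below assumes a genome that is a flat list of parameter values (one per tunable parameter), so mutation only changes parameters and crossover splices two parents at a cut point; subtree insertion and removal are not modeled. `genetic_search`, its arguments, and the cost function are hypothetical, and in an autotuner the cost would be a measured execution time.

```python
import random


def genetic_search(cost, domains, population=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Minimize cost(genome) where genome[i] is drawn from domains[i]."""
    rng = random.Random(seed)

    def random_genome():
        return [rng.choice(d) for d in domains]

    def mutate(genome):
        # Change each parameter with probability mutation_rate.
        return [rng.choice(d) if rng.random() < mutation_rate else v
                for v, d in zip(genome, domains)]

    def crossover(p1, p2):
        # Single-point crossover: head of one parent, tail of the other.
        cut = rng.randrange(1, len(domains)) if len(domains) > 1 else 0
        return p1[:cut] + p2[cut:]

    pop = [random_genome() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:max(2, population // 2)]  # survivors are kept (elitism)
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(population - len(parents))]
        pop = parents + children
    return min(pop, key=cost)
```

Because the best half of each generation survives unchanged, the best cost never gets worse from one generation to the next; the fixed seed makes a run reproducible.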

Issue 2: Modeling/Search

We need tools to guide models and search:

P-Ray: Characterization of hardware

[LCPC08] Duchateau, Sidelnik, Garzarán, Padua. P-RAY: A Suite of Micro-benchmarks for Multi-core Architectures. In LCPC, 2008.

Characterize Hardware

• P-Ray: Development of benchmarks to measure hardware characteristics of multicore platforms



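[Diagram: the hardware parameters that P-Ray must supply (NR, MulAdd, Latency, L1Size, L1I$Size) feeding the ATLAS MM Code Generator (MMCase), which takes NB, MU, NU, KU, xFetch, MulAdd, and Latency and emits MiniMMM source]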

Our Contribution

• P-Ray: Tool to measure
  – Block size
  – Cache mapping
  – Processor mapping
  – Effective bandwidth
• The result
  – Correct results for 3 different platforms (Intel Xeon Harpertown, Sun UltraSparc T1 Niagara, Intel Core 2 Quad Kentsfield)

P-Ray: Processor Mapping

[Diagram: 8-core Intel Harpertown with two chips (Chip 1, Chip 2). Each L2 cache is shared by a pair of cores: (Core 1, Core 3), (Core 5, Core 7), (Core 2, Core 4), (Core 6, Core 8).]


Issue 3: Description of the Space

• ATLAS generator is written in C
• We need more effective notations to implement a generator (describe the search space)
• Two possibilities:
  – Domain Specific Languages
  – General Purpose Languages

Issue 3: Description of the Space

Illustrated with:

1. SPIRAL (Domain Specific Language)

[Proc. of IEEE05] Püschel, Moura, Johnson, Padua, Veloso, Singer, Xiong, Franchetti, Gacic, Voronenko, Chen, Johnson, and Rizzolo. Spiral: Code Generation for DSP Transforms. Proc. of the IEEE, 2005.

http://www.spiral.net

2. Metalanguage (General Purpose Language)

[LCPC05] Donadio, Brodman, Roeder, Yotov, Barthou, Cohen, Garzarán, Padua and Pingali. A Language for the Compact Representation of Multiple Program Versions. In LCPC, 2005.

SPIRAL

• SPIRAL, generator of signal processing algorithms (DFT, DCT, WHT, filters, …)

• SPIRAL uses empirical search to generate routines that adapt to the target machine:
  – Sequential, parallel, SIMD, …

SPIRAL Contribution

• Declarative domain-specific language and rewriting rules to specify the search space.

• The result
  – Generation of routines that run faster than IPP (manually tuned)
  – Intel has started to use SPIRAL to generate parts of the IPP library

SPIRAL

• Search based on breakdown and rewriting rules:

This is SPL, the SPIRAL metalanguage

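SPIRAL Program Generation

• Transform (e.g., DFT_p): a parameterized matrix
• Rule (CT): DFT_nm → (DFT_n ⊗ I_m) D (I_n ⊗ DFT_m) P
  – a breakdown strategy (Cooley-Tukey)
  – product of sparse matrices
• Ruletree: the rule is applied recursively; two ruletrees for DFT_8:
  – (a) DFT_8 → DFT_2, DFT_4; DFT_4 → DFT_2, DFT_2
  – (b) DFT_8 → DFT_4, DFT_2; DFT_4 → DFT_2, DFT_2
• SPL Formula: each ruletree expands into a formula for DFT_8, a product of sparse matrices built from F_2, I_2, I_4, D, and P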
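The idea that different ruletrees are different algorithms for the same transform can be checked with a small sketch. The code below is not SPL and not SPIRAL's generator; it is a plain Python recursion, with hypothetical names, that applies the Cooley-Tukey breakdown DFT_nm = (DFT_n ⊗ I_m) D (I_n ⊗ DFT_m) P according to a ruletree written as nested tuples, where a leaf is an integer size computed by the definition.

```python
import cmath


def dft_naive(x):
    """DFT by definition; used at the leaves of a ruletree."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def size(tree):
    """Transform size described by a ruletree (int leaf or (left, right))."""
    return tree if isinstance(tree, int) else size(tree[0]) * size(tree[1])


def dft_ruletree(x, tree):
    """Compute the DFT of x following the given ruletree."""
    if isinstance(tree, int):
        assert len(x) == tree
        return dft_naive(x)
    left, right = tree
    n, m = size(left), size(right)
    assert len(x) == n * m
    # P and (I_n ⊗ DFT_m): gather n stride-n subsequences, transform each.
    inner = [dft_ruletree(x[r::n], right) for r in range(n)]
    # D: diagonal matrix of twiddle factors.
    for r in range(n):
        for k2 in range(m):
            inner[r][k2] *= cmath.exp(-2j * cmath.pi * r * k2 / (n * m))
    # (DFT_n ⊗ I_m): a size-n transform across the subsequences.
    out = [0j] * (n * m)
    for k2 in range(m):
        col = dft_ruletree([inner[r][k2] for r in range(n)], left)
        for k1 in range(n):
            out[k2 + m * k1] = col[k1]
    return out
```

Ruletree (a) is `(2, (2, 2))` and ruletree (b) is `((2, 2), 2)`. Both return the same values as `dft_naive` on any length-8 input, but they traverse memory in different orders, which is why a search over ruletrees can pay off.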

SPIRAL

• Why is search important?

  – Different formulas (algorithms) have different execution times
    • They differ in the memory access pattern
    • They have different ILP

SPIRAL Performance Results

Metaprogramming

• General-purpose programming of autotuned libraries and applications.

• A metaprogram contains a compact description of the space of program versions and how to proceed with the search.

Metaprogram example

Metaprogram (the %try line gives the search strategy):

  %try s in {2,4,8}
  for j=1 to 128 by %s
    %for k=j to j+s-1
      a(%k) = …

Program shape for each value:

  s = 2:
  for j=1 to 128 by 2
    a(j) = …
    a(j+1) = …

  s = 4:
  for j=1 to 128 by 4
    a(j) = …
    a(j+1) = …
    a(j+2) = …
    a(j+3) = …

  s = 8:
  for j=1 to 128 by 8
    a(j) = …
    a(j+1) = …
    a(j+2) = …
    a(j+3) = …
    a(j+4) = …
    a(j+5) = …
    a(j+6) = …
    a(j+7) = …
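What the metaprogram does for each value of s can be mimicked with a small text generator. This is only a sketch of the expansion step, written in Python with hypothetical function names; it is not the metalanguage of [LCPC05], and the loop body stays elided exactly as on the slide.

```python
def generate_version(s, n=128):
    """Emit the loop unrolled by a factor of s, as text."""
    lines = [f"for j=1 to {n} by {s}"]
    for offset in range(s):
        index = "j" if offset == 0 else f"j+{offset}"
        lines.append(f"  a({index}) = …")
    return "\n".join(lines)


def generate_all(choices=(2, 4, 8)):
    """One program version per value listed in the %try annotation."""
    return {s: generate_version(s) for s in choices}
```

`generate_all()` returns three program texts, one per unroll factor; an empirical search would then compile and time each of them and keep the fastest.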


Issue 4: What to tune

1. Kernels (MMM, FFT, sorting, …)

2. Codelets

3. Primitives

Codelets

• A class of (short) code sequences that appear often in an application domain

• The set of codelets should cover much of the execution domain

• Applications are decomposed into codelets

• Codelets are autotuned

Codelets

• Need a database of codelets
  – Each codelet in the database contains a set of compiler optimizations
• The application is decomposed into codelets that are matched against the codelets in the database
  – Application codelets are optimized using the set of optimizations of the matched codelet in the database
• Collaboration with David Kuck and David Wong, Intel

Primitive Operations

• Same as codelets, but not identified automatically by the compiler

• The user is expected to write the application using primitives

• The primitive operations are tuned for each target platform

Example of Primitive Operations

• HTA : Hierarchically Tiled Arrays

[PPoPP06] Bikshandi, Guo, Hoeflinger, Almasi, Fraguela, Garzarán, Padua, and von Praun. Programming for Parallelism and Locality with Hierarchically Tiled Arrays. In PPoPP, 2006.

[PPoPP08] Guo, Bikshandi, Fraguela, Garzarán, and Padua. Programming with Tiles. In PPoPP, 2008.

Hierarchically Tiled Arrays (HTAs)

• HTA is a data type where tiles are explicit

• HTAs are manipulated with data parallel primitives
  – HTA programs look like sequential programs where parallelism is encapsulated into the data parallel primitives
• Result
  – Programs that run as fast as MPI (tested with the NAS benchmarks)
  – Fewer lines of code
  – Portable codes
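The notion of a data type with explicit tiles and tile-level primitives can be sketched as follows. This is a toy, one-level illustration with hypothetical names (`TiledArray`, `map`, `reduce`, `flatten`); it is not the HTA library's API, and it runs the tiles sequentially. The point is that the caller's code looks sequential while each primitive is free to process tiles independently, which is where a tuned or parallel implementation would plug in.

```python
import functools


class TiledArray:
    """A one-dimensional array stored as a list of explicit tiles."""

    def __init__(self, data, tile_size):
        data = list(data)
        self.tiles = [data[i:i + tile_size]
                      for i in range(0, len(data), tile_size)]

    def map(self, fn):
        """Apply fn to every element; each tile is an independent unit of work."""
        result = TiledArray([], 1)
        result.tiles = [[fn(v) for v in tile] for tile in self.tiles]
        return result

    def reduce(self, op, identity):
        """Reduce each tile, then combine the per-tile partial results.
        `identity` must be the identity element of the associative `op`."""
        partials = [functools.reduce(op, tile, identity) for tile in self.tiles]
        return functools.reduce(op, partials, identity)

    def flatten(self):
        """Return the elements as a flat list, in order."""
        return [v for tile in self.tiles for v in tile]
```

For example, `TiledArray(range(10), 4).map(lambda v: v * v).reduce(lambda a, b: a + b, 0)` computes a sum of squares with no explicit loop over tiles in user code.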

FFT using HTA parallel primitives

Can be autotuned

Data Parallel Primitives

• Challenge:

Can we extend data parallel primitive operations to other complex data types, such as sets, trees, graphs?


Issue 5: What to tune for

1. Execution Time (All the previous systems)

2. Power (Preliminary data in next slides)

3. Space

4. Reliability


Power in SPIRAL

• Processors allow software control of operating frequency and voltage

• e.g., Intel Pentium M 770 has 6 settings
  – 2.13 GHz at 1.340 V (max performance)
  – 800 MHz at 0.988 V (min power/energy)

Experimental Setup

• Intel Pentium M model 770
  – <2133MHz, 1.34V>, <1866MHz, 1.27V>, <1600MHz, 1.2V>, <1333MHz, 1.13V>, <1067MHz, 1.06V>, <800MHz, 0.99V>
• Measurements
  – HW: Agilent 34134A current probe and Agilent 34401A DMM
  – SW: SPIRAL-controlled automatic runtime and energy measurement routine
• Optimization space
  – voltage-frequency scaling

Dynamic voltage-frequency scaling

• Use of voltage scaling instructions
  – CPU-bound region → run at high frequency
  – Memory-bound region → run at low frequency
• Minimum impact on execution time and significant reduction in energy consumption
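The policy "CPU-bound regions at high frequency, memory-bound regions at low frequency" can be sketched as a per-region choice driven by the cache miss ratio. The frequency and voltage pairs below are the six Pentium M 770 settings from the experimental setup; the threshold value, the use of the miss ratio as the only signal, and the function names are assumptions for illustration, not the mechanism used in the SPIRAL experiments.

```python
# <MHz, V> settings of the Intel Pentium M 770, fastest first.
SETTINGS = [(2133, 1.34), (1866, 1.27), (1600, 1.20),
            (1333, 1.13), (1067, 1.06), (800, 0.99)]


def choose_setting(miss_ratio, threshold=0.1):
    """Memory-bound region (high miss ratio) -> lowest frequency;
    CPU-bound region -> highest frequency. The threshold is hypothetical."""
    return SETTINGS[-1] if miss_ratio >= threshold else SETTINGS[0]


def schedule(miss_ratios, threshold=0.1):
    """Pick a setting for each profiled region of the program."""
    return [choose_setting(m, threshold) for m in miss_ratios]
```

For a profile such as `[0.02, 0.30, 0.01]`, `schedule` selects the 2133 MHz setting for the first and last regions and the 800 MHz setting for the middle one. The threshold is itself a tunable parameter that a search could set per program.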

Dynamic voltage-frequency scaling: memory profile

[Figure: cache miss ratio over time for WHT-2^19 (out-of-cache). Each point shows the cache miss ratio every 100 seconds. A second slide zooms into the interval from 18000 to 26000 and marks the regions to run at low frequency and at high frequency.]

Dynamic voltage-frequency scaling: results

[Figure: energy (Joules) versus execution time (seconds) for WHT-2^19. Annotations for dynamic voltage scaling: same execution time with 10% less energy; same energy with less execution time.]

Compiler Optimizations (Future work)

[Figure: two plots of cache miss ratio versus iterations (0 to 5000)]

Applying dependence analysis and grouping together iterations with similar cache miss ratio increases the benefit of dynamic voltage scaling.

Research Agenda

1. Performance depends on input

2. Modeling/Search

3. Description of the space

4. What to automate

5. What to tune for
