The OpenUH Compiler: A Community Resource
Barbara Chapman
University of Houston
March, 2007
High Performance Computing and Tools Group
http://www.cs.uh.edu/~hpctools
Agenda
• OpenUH compiler
• OpenMP language extensions
• Compiler – Tools interactions
• Compiler cost modeling
OpenUH: A Reference OpenMP Compiler
• Based on Open64; integrates features from other major branches: Pathscale, ORC, UPC, …
• Complete support for OpenMP 2.5 in C/C++ and Fortran
• Freely available and open source
• Stable, portable
• Modularized and complete optimization framework
• Available on most Linux/Unix platforms
OpenUH: A Reference OpenMP Compiler
• Facilitates research and development, for us as well as for the HPC community:
  • Testbed for new language features
  • New compiler transformations
  • Interactions with a variety of programming tools
• Currently installed at Cobalt@NCSA and Columbia@NASA (Cobalt: 2x512 processors; Columbia: 20x512 processors)
The Open64 Compiler Suite
• An optimizing compiler suite for C/C++ and Fortran77/90 on Linux/IA-64 systems
• Open-sourced by SGI from the Pro64 compiler
• State-of-the-art intra- & interprocedural analysis and optimizations
• 5 levels of uniform IR (WHIRL), with IR-to-source "translators": whirl2c & whirl2f
• Used for research and commercial purposes: Intel, HP, QLogic, STMicroelectronics, UPC, CAF, U Delaware, Tsinghua, Minnesota, …
Major Modules in Open64

[Diagram: front ends (gfec, gfecc, f90) emit .B files in Very High WHIRL, lowered to High WHIRL (I/O lowering only for f90); IPA (LocalIPA producing .I files, MainIPA under -O3/-IPA), the inliner, LNO, and Lower MP (only for OpenMP, -mp) operate on High WHIRL; WHIRL2C/WHIRL2F (-phase:w=off) can take either path from this level and emit .w2c.c, .w2c.h/.w2f.f; Mainopt (-O2/O3) works on Mid WHIRL; "Lowerall" (-O0) lowers through Mid and Low WHIRL to CG.]
OpenUH Compiler Infrastructure

[Diagram: source code with OpenMP directives flows through the Open64 compiler infrastructure: FRONTENDS (C/C++, Fortran 90, OpenMP), OMP_PRELOWER (preprocess OpenMP), IPA (interprocedural analyzer), LNO (loop nest optimizer), LOWER_MP (transformation of OpenMP), WOPT (global scalar optimizer), then either CG (code generator for Itanium) or WHIRL2C & WHIRL2F (IR-to-source for non-Itanium targets), which emit source code with runtime library calls for a native compiler; object files are linked against a portable OpenMP runtime library to produce executables.]
OpenMP Implementation in OpenUH
• Frontends: parse OpenMP pragmas
• OMP_PRELOWER: preprocessing; semantic checking
• LOWER_MP: generation of microtasks for parallel regions; insertion of runtime calls; variable handling, …
• Runtime library: support for thread manipulation; implements user-level routines; monitoring environment
OpenMP Code Translation
int main(void)
{
  int a, b, c;
#pragma omp parallel private(c)
  do_sth(a, b, c);
  return 0;
}

_INT32 main()
{
  int a, b, c;
  /* microtask */
  void __ompregion_main1()
  {
    _INT32 __mplocal_c;
    /* shared variables are kept intact; accesses to the
       private variable are substituted */
    do_sth(a, b, __mplocal_c);
  }
  …
  /* OpenMP runtime calls */
  __ompc_fork(&__ompregion_main1);
  …
}
Runtime based on ORC work performed by Tsinghua University
Multicore Complexity
• Resources (L2 cache, memory bandwidth): shared or separate
• Each core: single-threaded or multithreaded, complex or simplified
• Individual cores: symmetric or asymmetric (heterogeneous)
• Examples: Cell processor, AMD dual-core, IBM Power4, Sun T-1 (Niagara)
Is OpenMP Ready for Multicore?
• Designed for medium-scale SMPs: <100 threads
• One-team-for-all scheme for work sharing and synchronization: simple but not flexible
• Some difficulties using OpenMP on these platforms:
  • Determining the optimal number of threads
  • Binding threads to the right processor cores
  • Finding a good scheduling policy and chunk size
Challenges Posed By New Architectures
• Hierarchical and hybrid parallelism: clusters, SMPs, CMP (multicores), SMT (simultaneous multithreading), …
• Diversity in kind and extent of resource sharing, potential for thread contention: ALU/FP units, cache, MCU, data path, memory bandwidth
• Homogeneous or heterogeneous
• Deeper memory hierarchy
• Size and scale
• Will many codes have multiple levels of parallelism? We may want sibling threads to share a workload on a multicore, but SMT threads to do different things
Subteams of Threads
for (j = 0; j < ProcessingNum; j++) {
#pragma omp for on threads (2 : omp_get_num_threads()-1)
  for (k = 0; k < M; k++) {  // on threads in subteam
    ...
    processing();
  }  // barrier involves subteam only
}
• MPI provides for definition of groups of pre-existing processes
• Why not allow worksharing among groups (or subteams) of pre-existing threads?
• Logical machine description, mapping of threads to it
• Or simple "spread" or "keep together" notations
Case Study: A Seismic Code
for (i = 0; i < N; i++) {
  ReadFromFile(i, ...);
  for (j = 0; j < ProcessingNum; j++)
    for (k = 0; k < M; k++) {
      process_data();  // involves several different seismic functions
    }
  WriteResultsToFile(i);
}
This loop is parallel
Kingdom Suite from Seismic Micro Technology
Goal: create OpenMP code for SMP with hyperthreading enabled
Parallel Seismic Kernel V1
for (j = 0; j < ProcessingNum; j++) {
#pragma omp for schedule(dynamic)
  for (k = 0; k < M; k++) {
    processing();  // user-configurable functions
  }  // here is the barrier
}  // end of j-loop
[Timeline diagram: each iteration serializes its Load Data, Process Data, and Save Data phases; I/O and computation do not overlap.]
The implicit barrier of the omp for causes the computation threads to wait for the I/O threads to complete.
Subteams of Threads
for (j = 0; j < ProcessingNum; j++) {
#pragma omp for on threads (2 : omp_get_num_threads()-1)
  for (k = 0; k < M; k++) {  // on threads in subteam
    ...
    processing();
  }  // barrier involves subteam only
}
• A parallel loop does not incur the overheads of nested parallelism
• But we need to avoid the global barrier early on in the loop's execution
• One way to do this would be to restrict loop execution to a subset of the team of executing threads
Parallel Seismic Code V2
Loadline(nStartLine, ...);  // preload the first line of data
#pragma omp parallel
{
  for (int iLineIndex = nStartLine; iLineIndex <= nEndLine; iLineIndex++) {
#pragma omp single nowait
    {  // loading the next line of data, NO WAIT!
      Loadline(iLineIndex + 1, ...);
    }
    for (j = 0; j < iNumTraces; j++)
#pragma omp for schedule(dynamic)
      for (k = 0; k < iNumSamples; k++)
        processing();
#pragma omp barrier
#pragma omp single nowait
    {
      SaveLine(iLineIndex);
    }
  }
}
[Timeline diagram: Load Data on thread 0, Save Data on thread 1, Process Data on threads 2 … omp_get_num_threads()-1; loading the next line and saving the previous one now overlap with computation along the timeline.]
OpenMP Scalability: Thread Subteam
• Thread subteam: the original thread team is divided into several subteams, each of which can work simultaneously
• Advantages:
  • Flexible worksharing/synchronization extension
  • Low overhead because of static partitioning
  • Facilitates thread-core mapping for better data locality and less resource contention
Implementation in OpenUH…
/* omp for */
void *threadsubteam;
__ompv_gtid_s1 = __ompc_get_local_thread_num();
__ompc_subteam_create(&idSet1, &threadsubteam);
/* threads not in the subteam skip the later work */
if (!__ompc_is_in_idset(__ompv_gtid_s1, &idSet1))
  goto L111;
__ompc_static_init_4(__ompv_gtid_s1, … &__do_stride, 1, 1, &threadsubteam);
for (__mplocal_i = __do_lower; __mplocal_i <= __do_upper; __mplocal_i++) {
  .........
}
__ompc_barrier(&threadsubteam);  /* barrier at subteam only */

/* omp single */
L111:  /* label marks the boundary between two worksharing bodies */
__ompv_gtid_s1 = __ompc_get_local_thread_num();
mpsp_status = __ompc_single(__ompv_gtid_s1);
if (mpsp_status == 1) {
  j = omp_get_thread_num();
  printf("I am the one: %d\n", j);
}
__ompc_end_single(__ompv_gtid_s1);
__ompc_barrier(NULL);  /* barrier at the default team */
• Tree-structured team and subteams in the runtime library
• Threads not in a subteam skip the work in the compiler translation
• Global thread IDs are converted into local IDs for loop scheduling
• Implicit barriers only affect threads in a subteam
BT-MZ Performance with Subteams
Platform: Columbia@NASA
OpenMP 3.0 and Beyond
• Ideas on support for multicore / higher levels of scalability
• Extend nested parallelism by binding threads in advance:
  • High overhead of dynamic thread creation/cancellation
  • Poor data locality between parallel regions executed by different threads without binding
• Describe the structure of the threads used in a computation: map to a logical machine, or group
• Explicit data migration
• Subteams of threads
• Control over the default behavior of idle threads
• Major thrust for the 3.0 spec: supports non-traditional loop parallelism
What About The Tools?
• Typically hard work to use, steep learning curve
• Low-level interaction with the user
• Tuning may be a fragmentary effort and may require multiple tools, often not integrated with each other, let alone with the compiler
• Can we improve tools' results, reduce user effort, and help the compiler if they interact?
Dragon Tool Browser

[Diagram: the front end (IPL) and IPA-Link populate a program information database; LNO contributes data dependence and array section information; CFG_IPL exports the control flow graph and call graph, rendered via VCG (.vcg, .ps, .bmp); the Dragon executable browses the database and receives WOPT/CG feedback.]
Exporting Program Information
Static and dynamic program information is exported
Productivity: Integrated Development Environment
[Diagram: a development environment for MPI/OpenMP built around a common program database interface. OpenUH supplies program analyses, a high-level representation, a selectively instrumented executable, and static/feedback optimizations; Perfsuite performs runtime monitoring of the executing application; KOJAK and TAU turn low-level trace data into a high-level profile for a performance problem analyzer; Dragon presents program analysis results. Queries for application information, application source code, runtime information/sampling, and performance feedback all flow through the database. Example application: fluid dynamics.]
http://www.cs.uh.edu/~copper NSF CCF-0444468
Offending critical region was rewritten
Courtesy of R. Morgan, NASA Ames
Cascade Results
Tuning Environment
• Using OpenUH selective instrumentation, combined with its internal cost model for procedures and its internal call graph, we find procedures with a high amount of work that are called infrequently and lie within a certain call-path level.
• Using our instrumented OpenMP runtime, we can monitor parallel regions.
• Components: selective instrumentation analysis; compiler and runtime components
A Performance Problem: GenIDLEST
• Real-world scientific simulation code: solves the incompressible Navier-Stokes and energy equations; MPI and OpenMP versions
• Platform: SGI Altix 3700; two distributed shared memory systems, each with 512 Intel Itanium 2 processors
• Thread count: 8
• The problem: the OpenMP version is slower than MPI
Timings of Diff_coeff Subroutine
[Chart: timings of the diff_coeff subroutine, OpenMP version vs. MPI version.]
We find that a single procedure is responsible for 20% of the time and that it is 9 times slower than MPI!
Performance analysis: comparing the metrics between OpenMP and MPI using KOJAK performance algebra.
Procedure Timings
Some loops are 27 times slower in OpenMP than MPI. These loops contain large amounts of stalling due to remote memory accesses to the shared heap.
We find large numbers of:
• Exceptions
• Flushes
• Cache misses
• Pipeline stalls
Pseudocode of the Problem Procedure

procedure diff_coeff() {
  allocation of arrays to heap by master thread
  initialization of shared arrays
  PARALLEL REGION {
    loop in parallel over lower_bound[my_thread_id],
        upper_bound[my_thread_id]
    computation on my portion of shared arrays
    …
  }
}
Shared Arrays
• Lower and upper bounds of computational loops are shared, and stored within the same memory page and cache line
• Delays in remote memory accesses are probable causes of exceptions causing processor flushes
Solution: Privatization
• Privatizing the arrays improved the performance of the whole program by 30% and resulted in a speedup of 10 for the problem procedure.
• This procedure now takes only 5% of the total time.
• Processor stalls are reduced significantly.
Stall Cycle Breakdown for Non-Privatized (NP) and Privatized (P) Versions of diff_coeff
[Bar chart: cycles (0 to 5.0E+10) for D-cache stalls, branch misprediction, instruction miss stalls, FLP units, and front-end flushes, comparing the NP, P, and NP-P series.]
OpenMP Platform-Awareness: Cost Modeling
• Cost modeling: estimating the cost, mostly the time, of executing a program (or a portion of it) on a given system (or a component of it), using compilers, runtime systems, performance tools, etc.
• An OpenMP cost model is critical for:
  • OpenMP compiler optimizations
  • Adaptive OpenMP runtime support
  • Load balancing in hybrid MPI/OpenMP
  • Targeting OpenMP to new architectures: multicore
  • Complementing empirical search
Example Usage of Cost Modeling
Case 1: What is the optimal tiling size for a loop tiling transformation?
• Cache size, miss penalties, loop overhead, …
DO K2 = 1, M, B
  DO J2 = 1, M, B
    DO I = 1, M
      DO K1 = K2, MIN(K2+B-1, M)
        DO J1 = J2, MIN(J2+B-1, M)
          Z(J1,I) = Z(J1,I) + X(K1,I) * Y(J1,K1)
Case 2: What is the maximum number of threads for parallel execution without performance degradation?
• Parallel overhead
• Ratio of parallelizable_work/total_work
• System capacities
• …
[Chart: performance of an OpenMP program, MFLOPS (0 to 30000) vs. number of threads (1, 2, 4, 8, 16, 32, 64, 128).]
Usage of OpenMP Cost Modeling

[Diagram: cost modeling in the OpenMP compiler combines architectural profiles (processor, cache topology, CMT platforms), the OpenMP implementation (parallel overheads, OpenMP runtime library), and application features (computation requirements, memory references) of OpenMP applications to determine parameters for OpenMP execution: number of threads, thread-core mapping, scheduling policy, chunk size.]
Modeling OpenMP
• Previous models:
  T_parallel_region = T_fork + T_worksharing + T_join
  T_worksharing = T_sequential / N_threads
• Our model aims to consider much more: multiple worksharing and synchronization portions in a parallel region; scheduling policy; chunk size; load imbalance; cache impact for multiple threads on multiple processors; …
Modeling OpenMP Parallel Regions
• A parallel region can encompass several worksharing and synchronization portions
• The sum of the longest execution time of all threads between each pair of synchronization points dominates the final execution time: load imbalance

[Diagram: a parallel region shown as the master thread and workers separated by synchronization points.]
Modeling OpenMP Worksharing
• Worksharing has overhead because of multiple dispatching of work chunks

[Diagram: thread i receives successive work chunks under schedule(type, chunkSize); time measured in cycles.]
Implementation in OpenUH

[Diagram: the cost models comprise a parallel model (parallel overhead, reduction cost) layered on a processor model (loop overhead; machine cost: computational resource cost, dependency latency cost, register spilling cost, operation cost, issue cost) and a cache model (cache cost: mem_ref cost, TLB cost).]
• Based on the existing cost models used in loop optimization
• Only works for perfectly nested loops: those permitting arbitrary transformations
• Used to guide conventional loop transformations: unrolling, tiling, interchanging
• Used to help auto-parallelization: justification, which level, interchanging
Cost Model Extensions
• Added a new phase to the compiler that traverses the IR to conduct modeling, working on OpenMP regions instead of perfectly nested loops
• Enhancements to model OpenMP details:
  • Reusing processor and cache models for processor and cache cycles
  • Modeling load imbalance: using max(thread_i_exe)
  • Modeling scheduling: adding a lightweight scheduler to the model
  • Reading an environment variable for the desired number of threads during modeling (so this is currently fixed)
Experiment
• Machine: Cobalt at NCSA (National Center for Supercomputing Applications): a 32-processor SGI Altix 3700; 1.5 GHz Itanium 2 with 6 MB L3 cache; 256 GB memory
• Benchmark: OpenMP version of a classic matrix-matrix multiplication (MMM) code; i, k, j loop order; 3 double floating point matrix sizes: 500, 1000, 1500
• OpenUH compiler flags: -O3 -mp
• Cycle-measuring tools: pfmon, perfsuite
#pragma omp parallel for private(i, j, k)
for (i = 0; i < N; i++)
  for (k = 0; k < K; k++)
    for (j = 0; j < M; j++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
Results
[Chart: modeling vs. measurement: CPU cycles (1E+07 to 1E+11) vs. number of threads (1 to 8) for Model-500, Measure-500, Model-1000, Measure-1000, Model-1500, and Measure-1500.]
• Measured data show irregular fluctuations, especially for the smaller dataset with larger numbers of threads: 10^8 cycles at 1.5 GHz is <0.1 second, so system-level noise from thread management is relatively large
• Overestimation for the 500x500 array from 1 to 5 threads, underestimation for all the rest: optimistic assumptions about resource utilization; the more threads, the greater the underestimation, owing to the lack of contention models for cache, memory, and bus
• Efficiency = Modeling_Time / Compilation_Time x 100% = 0.079 s / 6.33 s = 1.25%
Relative Accuracy: Modeling Different Chunk Sizes for Static Scheduling
Modeling vs Measuring for OpenMP Scheduling
[Chart: modeling vs. measuring for OpenMP scheduling: CPU cycles (billions, 0 to 35) vs. chunk size (1, 10, 100, 250, 500, 1000) for static-modeling, static-measuring, dynamic-measuring, and guided-measuring; 4 threads, matrix size 1000x1000. Large chunk sizes show load imbalance; small chunk sizes show excessive scheduling overheads.]
The model successfully captured the trend of the measured results.
Cost Model
• A detailed cost model could be used to recompile program regions that perform poorly, possibly with a focus on improving a specific aspect of the code
• Current models in OpenUH are inaccurate: most often they accurately predict trends, but they fail to account for resource contention
• Contention will be critical for modeling multicore platforms
• What level of accuracy should we be going for?
Summary
• The challenge of multicores demands "simple" parallel programming models; there is very much to explore in this regard
• Compiler technology has advanced, and public domain software has become fairly robust
• Many opportunities for exploiting this to improve: languages, compiler implementations, runtime systems, OS interactions, tool behavior, …