The OpenUH Compiler: A Community Resource
Barbara Chapman
University of Houston
March, 2007
High Performance Computing and Tools Group
http://www.cs.uh.edu/~hpctools
Agenda
• OpenUH compiler
• OpenMP language extensions
• Compiler – Tools interactions
• Compiler cost modeling
OpenUH: A Reference OpenMP Compiler
• Based on Open64; integrates features from other major branches: Pathscale, ORC, UPC, …
• Complete support for OpenMP 2.5 in C/C++ and Fortran
• Freely available and open source
• Stable, portable
• Modularized and complete optimization framework
• Available on most Linux/Unix platforms
OpenUH: A Reference OpenMP Compiler
• Facilitates research and development, for us as well as for the HPC community:
  • Testbed for new language features
  • New compiler transformations
  • Interactions with a variety of programming tools
• Currently installed at Cobalt@NCSA and Columbia@NASA (Cobalt: 2x512 processors; Columbia: 20x512 processors)
The Open64 Compiler Suite
• An optimizing compiler suite for C/C++ and Fortran77/90 on Linux/IA-64 systems
• Open-sourced by SGI from the Pro64 compiler
• State-of-the-art intra- & interprocedural analysis and optimizations
• 5 levels of uniform IR (WHIRL), with IR-to-source "translators": whirl2c & whirl2f
• Used for research and commercial purposes: Intel, HP, QLogic, STMicroelectronics, UPC, CAF, U Delaware, Tsinghua, Minnesota, …
Major Modules in Open64

[Diagram: front ends (gfec, gfecc, f90) emit .B files in Very High WHIRL, lowered to High WHIRL (I/O lowering only for f90); IPA (LocalIPA producing .I files, MainIPA under -O3/-IPA), the inliner, LNO, and Lower MP (only for OpenMP, -mp) operate on High WHIRL; WHIRL2C/WHIRL2F (-phase:w=off) can take either path from this level and emit .w2c.c, .w2c.h/.w2f.f; Mainopt (-O2/O3) works on Mid WHIRL; "Lowerall" (-O0) lowers through Mid and Low WHIRL to CG.]
OpenUH Compiler Infrastructure

[Diagram: source code with OpenMP directives flows through the Open64 compiler infrastructure: FRONTENDS (C/C++, Fortran 90, OpenMP), OMP_PRELOWER (preprocess OpenMP), IPA (interprocedural analyzer), LNO (loop nest optimizer), LOWER_MP (transformation of OpenMP), WOPT (global scalar optimizer), then either CG (code generator for Itanium) or WHIRL2C & WHIRL2F (IR-to-source for non-Itanium targets), which emit source code with runtime library calls for a native compiler; object files are linked against a portable OpenMP runtime library to produce executables.]
OpenMP Implementation in OpenUH
• Frontends: parse OpenMP pragmas
• OMP_PRELOWER: preprocessing; semantic checking
• LOWER_MP: generation of microtasks for parallel regions; insertion of runtime calls; variable handling, …
• Runtime library: support for thread manipulation; implements user-level routines; monitoring environment
OpenMP Code Translation
int main(void)
{
  int a, b, c;
#pragma omp parallel private(c)
  do_sth(a, b, c);
  return 0;
}

_INT32 main()
{
  int a, b, c;
  /* microtask */
  void __ompregion_main1()
  {
    _INT32 __mplocal_c;
    /* shared variables are kept intact; accesses to the
       private variable are substituted */
    do_sth(a, b, __mplocal_c);
  }
  …
  /* OpenMP runtime calls */
  __ompc_fork(&__ompregion_main1);
  …
}
Runtime based on ORC work performed by Tsinghua University
Multicore Complexity
• Resources (L2 cache, memory bandwidth): shared or separate
• Each core: single-threaded or multithreaded, complex or simplified
• Individual cores: symmetric or asymmetric (heterogeneous)
• Examples: Cell processor, AMD dual-core, IBM Power4, Sun T-1 (Niagara)
Is OpenMP Ready for Multicore?
• Designed for medium-scale SMPs: <100 threads
• One-team-for-all scheme for work sharing and synchronization: simple but not flexible
• Some difficulties using OpenMP on these platforms:
  • Determining the optimal number of threads
  • Binding threads to the right processor cores
  • Finding a good scheduling policy and chunk size
Challenges Posed By New Architectures
• Hierarchical and hybrid parallelism: clusters, SMPs, CMP (multicores), SMT (simultaneous multithreading), …
• Diversity in kind and extent of resource sharing, potential for thread contention: ALU/FP units, cache, MCU, data path, memory bandwidth
• Homogeneous or heterogeneous
• Deeper memory hierarchy
• Size and scale
• Will many codes have multiple levels of parallelism? We may want sibling threads to share a workload on a multicore, but SMT threads to do different things
Subteams of Threads
for (j = 0; j < ProcessingNum; j++) {
#pragma omp for on threads (2 : omp_get_num_threads()-1)
  for (k = 0; k < M; k++) {  // on threads in subteam
    ...
    processing();
  }  // barrier involves subteam only
}
• MPI provides for definition of groups of pre-existing processes
• Why not allow worksharing among groups (or subteams) of pre-existing threads?
• Logical machine description, mapping of threads to it
• Or simple "spread" or "keep together" notations
Case Study: A Seismic Code
for (i = 0; i < N; i++) {
  ReadFromFile(i, ...);
  for (j = 0; j < ProcessingNum; j++)
    for (k = 0; k < M; k++) {
      process_data();  // involves several different seismic functions
    }
  WriteResultsToFile(i);
}
This loop is parallel
Kingdom Suite from Seismic Micro Technology
Goal: create OpenMP code for SMP with hyperthreading enabled
Parallel Seismic Kernel V1
for (j = 0; j < ProcessingNum; j++) {
#pragma omp for schedule(dynamic)
  for (k = 0; k < M; k++) {
    processing();  // user-configurable functions
  }  // here is the barrier
}  // end of j-loop
[Timeline diagram: each iteration serializes its Load Data, Process Data, and Save Data phases; I/O and computation do not overlap.]
The implicit barrier of the omp for causes the computation threads to wait for the I/O threads to complete.
Subteams of Threads
for (j = 0; j < ProcessingNum; j++) {
#pragma omp for on threads (2 : omp_get_num_threads()-1)
  for (k = 0; k < M; k++) {  // on threads in subteam
    ...
    processing();
  }  // barrier involves subteam only
}
• A parallel loop does not incur the overheads of nested parallelism
• But we need to avoid the global barrier early on in the loop's execution
• One way to do this would be to restrict loop execution to a subset of the team of executing threads
Parallel Seismic Code V2
Loadline(nStartLine, ...);  // preload the first line of data
#pragma omp parallel
{
  for (int iLineIndex = nStartLine; iLineIndex <= nEndLine; iLineIndex++) {
#pragma omp single nowait
    {  // loading the next line of data, NO WAIT!
      Loadline(iLineIndex + 1, ...);
    }
    for (j = 0; j < iNumTraces; j++)
#pragma omp for schedule(dynamic)
      for (k = 0; k < iNumSamples; k++)
        processing();
#pragma omp barrier
#pragma omp single nowait
    {
      SaveLine(iLineIndex);
    }
  }
}
[Timeline diagram: Load Data on thread 0, Save Data on thread 1, Process Data on threads 2 … omp_get_num_threads()-1; loading the next line and saving the previous one now overlap with computation along the timeline.]
OpenMP Scalability: Thread Subteam
• Thread subteam: the original thread team is divided into several subteams, each of which can work simultaneously
• Advantages:
  • Flexible worksharing/synchronization extension
  • Low overhead because of static partitioning
  • Facilitates thread-core mapping for better data locality and less resource contention
Implementation in OpenUH…
/* omp for */
void *threadsubteam;
__ompv_gtid_s1 = __ompc_get_local_thread_num();
__ompc_subteam_create(&idSet1, &threadsubteam);
/* threads not in the subteam skip the later work */
if (!__ompc_is_in_idset(__ompv_gtid_s1, &idSet1))
  goto L111;
__ompc_static_init_4(__ompv_gtid_s1, … &__do_stride, 1, 1, &threadsubteam);
for (__mplocal_i = __do_lower; __mplocal_i <= __do_upper; __mplocal_i++) {
  .........
}
__ompc_barrier(&threadsubteam);  /* barrier at subteam only */

/* omp single */
L111:  /* label marks the boundary between two worksharing bodies */
__ompv_gtid_s1 = __ompc_get_local_thread_num();
mpsp_status = __ompc_single(__ompv_gtid_s1);
if (mpsp_status == 1) {
  j = omp_get_thread_num();
  printf("I am the one: %d\n", j);
}
__ompc_end_single(__ompv_gtid_s1);
__ompc_barrier(NULL);  /* barrier at the default team */
• Tree-structured team and subteams in the runtime library
• Threads not in a subteam skip the work in the compiler translation
• Global thread IDs are converted into local IDs for loop scheduling
• Implicit barriers only affect threads in a subteam
BT-MZ Performance with Subteams
Platform: Columbia@NASA
OpenMP 3.0 and Beyond
• Ideas on support for multicore / higher levels of scalability
• Extend nested parallelism by binding threads in advance:
  • High overhead of dynamic thread creation/cancellation
  • Poor data locality between parallel regions executed by different threads without binding
• Describe the structure of the threads used in a computation: map to a logical machine, or group
• Explicit data migration
• Subteams of threads
• Control over the default behavior of idle threads
• Major thrust for the 3.0 spec: supports non-traditional loop parallelism
What About The Tools?
• Typically hard work to use, steep learning curve
• Low-level interaction with the user
• Tuning may be a fragmentary effort and may require multiple tools, often not integrated with each other, let alone with the compiler
• Can we improve tools' results, reduce user effort, and help the compiler if they interact?
Dragon Tool Browser

[Diagram: the front end (IPL) and IPA-Link populate a program information database; LNO contributes data dependence and array section information; CFG_IPL exports the control flow graph and call graph, rendered via VCG (.vcg, .ps, .bmp); the Dragon executable browses the database and receives WOPT/CG feedback.]
Exporting Program Information
Static and dynamic program information is exported
Productivity: Integrated Development Environment
[Diagram: a development environment for MPI/OpenMP built around a common program database interface. OpenUH supplies program analyses, a high-level representation, a selectively instrumented executable, and static/feedback optimizations; Perfsuite performs runtime monitoring of the executing application; KOJAK and TAU turn low-level trace data into a high-level profile for a performance problem analyzer; Dragon presents program analysis results. Queries for application information, application source code, runtime information/sampling, and performance feedback all flow through the database. Example application: fluid dynamics.]
http://www.cs.uh.edu/~copper NSF CCF-0444468
Offending critical region was rewritten
Courtesy of R. Morgan, NASA Ames
Cascade Results
Tuning Environment
• Using OpenUH selective instrumentation, combined with its internal cost model for procedures and its internal call graph, we find procedures with a high amount of work that are called infrequently and lie within a certain call-path level.
• Using our instrumented OpenMP runtime, we can monitor parallel regions.
• Components: selective instrumentation analysis; compiler and runtime components
A Performance Problem: GenIDLEST
• Real-world scientific simulation code: solves the incompressible Navier-Stokes and energy equations; MPI and OpenMP versions
• Platform: SGI Altix 3700; two distributed shared memory systems, each with 512 Intel Itanium 2 processors
• Thread count: 8
• The problem: the OpenMP version is slower than MPI
Timings of Diff_coeff Subroutine
[Chart: timings of the diff_coeff subroutine, OpenMP version vs. MPI version.]
We find that a single procedure is responsible for 20% of the time and that it is 9 times slower than MPI!
Performance analysis: comparing the metrics between OpenMP and MPI using KOJAK performance algebra.
Procedure Timings
Some loops are 27 times slower in OpenMP than MPI. These loops contain large amounts of stalling due to remote memory accesses to the shared heap.
We find large numbers of:
• Exceptions
• Flushes
• Cache misses
• Pipeline stalls
Pseudocode of the Problem Procedure

procedure diff_coeff() {
  allocation of arrays to heap by master thread
  initialization of shared arrays
  PARALLEL REGION {
    loop in parallel over lower_bound[my_thread_id],
        upper_bound[my_thread_id]
    computation on my portion of shared arrays
    …
  }
}
Shared Arrays
• Lower and upper bounds of computational loops are shared, and stored within the same memory page and cache line
• Delays in remote memory accesses are probable causes of exceptions causing processor flushes
Solution: Privatization
• Privatizing the arrays improved the performance of the whole program by 30% and resulted in a speedup of 10 for the problem procedure.
• This procedure now takes only 5% of the total time.
• Processor stalls are reduced significantly.
Stall Cycle Breakdown for Non-Privatized (NP) and Privatized (P) Versions of diff_coeff
[Bar chart: cycles (0 to 5.0E+10) for D-cache stalls, branch misprediction, instruction miss stalls, FLP units, and front-end flushes, comparing the NP, P, and NP-P series.]
OpenMP Platform-Awareness: Cost Modeling
• Cost modeling: estimating the cost, mostly the time, of executing a program (or a portion of it) on a given system (or a component of it), using compilers, runtime systems, performance tools, etc.
• An OpenMP cost model is critical for:
  • OpenMP compiler optimizations
  • Adaptive OpenMP runtime support
  • Load balancing in hybrid MPI/OpenMP
  • Targeting OpenMP to new architectures: multicore
  • Complementing empirical search
Example Usage of Cost Modeling
Case 1: What is the optimal tiling size for a loop tiling transformation?
• Cache size, miss penalties, loop overhead, …
DO K2 = 1, M, B
  DO J2 = 1, M, B
    DO I = 1, M
      DO K1 = K2, MIN(K2+B-1, M)
        DO J1 = J2, MIN(J2+B-1, M)
          Z(J1,I) = Z(J1,I) + X(K1,I) * Y(J1,K1)
Case 2: What is the maximum number of threads for parallel execution without performance degradation?
• Parallel overhead
• Ratio of parallelizable_work/total_work
• System capacities
• …
[Chart: performance of an OpenMP program, MFLOPS (0 to 30000) vs. number of threads (1, 2, 4, 8, 16, 32, 64, 128).]
Usage of OpenMP Cost Modeling

[Diagram: cost modeling in the OpenMP compiler combines architectural profiles (processor, cache topology, CMT platforms), the OpenMP implementation (parallel overheads, OpenMP runtime library), and application features (computation requirements, memory references) of OpenMP applications to determine parameters for OpenMP execution: number of threads, thread-core mapping, scheduling policy, chunk size.]
Modeling OpenMP
• Previous models:
  T_parallel_region = T_fork + T_worksharing + T_join
  T_worksharing = T_sequential / N_threads
• Our model aims to consider much more: multiple worksharing and synchronization portions in a parallel region; scheduling policy; chunk size; load imbalance; cache impact for multiple threads on multiple processors; …
Modeling OpenMP Parallel Regions
• A parallel region can encompass several worksharing and synchronization portions
• The sum of the longest execution time of all threads between each pair of synchronization points dominates the final execution time: load imbalance

[Diagram: a parallel region shown as the master thread and workers separated by synchronization points.]
Modeling OpenMP Worksharing
• Worksharing has overhead because of multiple dispatching of work chunks

[Diagram: thread i receives successive work chunks under schedule(type, chunkSize); time measured in cycles.]
Implementation in OpenUH

[Diagram: the cost models comprise a parallel model (parallel overhead, reduction cost) layered on a processor model (loop overhead; machine cost: computational resource cost, dependency latency cost, register spilling cost, operation cost, issue cost) and a cache model (cache cost: mem_ref cost, TLB cost).]
• Based on the existing cost models used in loop optimization
• Only works for perfectly nested loops: those permitting arbitrary transformations
• Used to guide conventional loop transformations: unrolling, tiling, interchanging
• Used to help auto-parallelization: justification, which level, interchanging
Cost Model Extensions
• Added a new phase to the compiler that traverses the IR to conduct modeling, working on OpenMP regions instead of perfectly nested loops
• Enhancements to model OpenMP details:
  • Reusing processor and cache models for processor and cache cycles
  • Modeling load imbalance: using max(thread_i_exe)
  • Modeling scheduling: adding a lightweight scheduler to the model
  • Reading an environment variable for the desired number of threads during modeling (so this is currently fixed)
Experiment
• Machine: Cobalt at NCSA (National Center for Supercomputing Applications): a 32-processor SGI Altix 3700; 1.5 GHz Itanium 2 with 6 MB L3 cache; 256 GB memory
• Benchmark: OpenMP version of a classic matrix-matrix multiplication (MMM) code; i, k, j loop order; 3 double floating point matrix sizes: 500, 1000, 1500
• OpenUH compiler flags: -O3 -mp
• Cycle-measuring tools: pfmon, perfsuite
#pragma omp parallel for private(i, j, k)
for (i = 0; i < N; i++)
  for (k = 0; k < K; k++)
    for (j = 0; j < M; j++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
Results
[Chart: modeling vs. measurement: CPU cycles (1E+07 to 1E+11) vs. number of threads (1 to 8) for Model-500, Measure-500, Model-1000, Measure-1000, Model-1500, and Measure-1500.]
• Measured data show irregular fluctuations, especially for the smaller dataset with larger numbers of threads: 10^8 cycles at 1.5 GHz is <0.1 second, so system-level noise from thread management is relatively large
• Overestimation for the 500x500 array from 1 to 5 threads, underestimation for all the rest: optimistic assumptions about resource utilization; the more threads, the greater the underestimation, owing to the lack of contention models for cache, memory, and bus
• Efficiency = Modeling_Time / Compilation_Time x 100% = 0.079 s / 6.33 s = 1.25%
Relative Accuracy: Modeling Different Chunk Sizes for Static Scheduling
Modeling vs Measuring for OpenMP Scheduling
[Chart: modeling vs. measuring for OpenMP scheduling: CPU cycles (billions, 0 to 35) vs. chunk size (1, 10, 100, 250, 500, 1000) for static-modeling, static-measuring, dynamic-measuring, and guided-measuring; 4 threads, matrix size 1000x1000. Large chunk sizes show load imbalance; small chunk sizes show excessive scheduling overheads.]
The model successfully captured the trend of the measured results.
Cost Model
• A detailed cost model could be used to recompile program regions that perform poorly, possibly with a focus on improving a specific aspect of the code
• Current models in OpenUH are inaccurate: most often they accurately predict trends, but they fail to account for resource contention
• Contention will be critical for modeling multicore platforms
• What level of accuracy should we be going for?
Summary
• The challenge of multicores demands "simple" parallel programming models; there is very much to explore in this regard
• Compiler technology has advanced, and public domain software has become fairly robust
• Many opportunities for exploiting this to improve: languages, compiler implementations, runtime systems, OS interactions, tool behavior, …