
Dilemma of Parallel Programming

Xinhua Lin (林新华), HPC Lab of SJTU

@XJTU, 17th Oct 2011

Disclaimers

• I am not funded by Cray

• Slides marked with the Chapel logo are taken from Brad Chamberlain’s talk ‘The Mother of All Chapel Talks’, with his permission

• Funny pictures are from the Internet

About Me and the HPC Lab at SJTU

• Director of the HPC Lab
• Co-translator of PPP
• Co-founder of the HMPP CoC for AP & Japan

• One of the MS HPC invitation institutes @SH
• Supports the HPC Center of SJTU
• Holds the SJTU HPC Seminar monthly

http://itis.grid.sjtu.edu.cn/blog

Three Challenges for ParaProg in the Multi/Many-core Era

• Revolution vs. Evolution

• Low Level vs. High Level
  – Performance vs. Programmability

• Performance vs. Performance Portability

For more detail:
Paper version: China Education Network (中国教育网络), special issue on HPC and Cloud, Sep 2011
Online version: http://itis.grid.sjtu.edu.cn/blog

Outline

• The Right Level to Expose Parallelism

• A Review of ParaProg Languages

• Multiresolution and Chapel

The Right Level to Expose Parallelism

Can we stop the water / the parallelism?

Levels at which parallelism could be exposed (bottom to top):
Hardware → ISA → OS → Library → Language

Performance vs. Programmability

Low level (MPI, OpenMP, pthreads): expose the implementing mechanisms of the target machine.
“Why is everything so tedious?”

High level (HPF, ZPL): higher-level abstractions over the target machine.
“Why don’t I have more control?”

Low Level ←→ High Level

ParaProg Education

• Tired of teaching yet another specific language
  – MPI for clusters
  – OpenMP for SMPs, then multi-core CPUs
  – CUDA for GPUs, and now OpenCL
  – More on the way…

• Have to explain the same concepts with different tools
  – A single language to explain them all?

• Similar situation in OS education
  – Production OSes: Linux, Unix and Windows
  – An OS just for education: Minix

A Review of ParaProg Languages

Hybrid Programming Model

• MPI alone is insufficient in the multi/many-core era
  – OpenMP for multi-core
  – CUDA/OpenCL for many-core*

• So-called hybrid programming was invented as a temporary solution: workable, but ugly
  – MPI + OpenMP for multi-core clusters
  – MPI + CUDA/OpenCL for GPU clusters such as Tianhe-1A

• A similar two-level idea appears inside CUDA (threads and thread blocks) and OpenCL (work-items and work-groups)

* We will wait and see how OpenMP works on Intel MIC

ParaProg at Different Levels

• Low level (expose implementing mechanisms)
  – MPI, CUDA and OpenCL
  – OpenMP

• High level
  – PGAS: CAF, UPC and Titanium
  – Global view: NESL, ZPL
  – APGAS: Chapel, X10

• Directive based
  – HMPP, PGI Accelerator, Cray directives

Multiresolution and Chapel

What is Multiresolution?

Structure the language in a layered manner, permitting it to be used at multiple levels as required/desired:
– support high-level features and automation for convenience
– provide the ability to drop down to lower, more manual levels
– use appropriate separation of concerns to keep these layers clean

(A small Chapel sketch of the two levels follows the diagram below.)

Language concepts, as a layered stack (top to bottom):
  Distributions
  Data Parallelism
  Task Parallelism
  Locality Control
  Base Language
  Target Machine
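To make the layered idea concrete, here is a minimal sketch that I am adding (it is not a slide from the talk, and it uses later Chapel syntax than the 1.3 release mentioned further on). The same array initialization is written twice: once at the high level, as a global-view forall over a Block-distributed domain, and once at a lower level, with locales and per-locale chunks spelled out by hand. The chunkOf helper is hypothetical, introduced only for this illustration.

use BlockDist;

config const n = 1000;

// High level: global-view data parallelism over a Block-distributed domain;
// the compiler and runtime decide how iterations map to locales and tasks.
const D = {1..n} dmapped Block(boundingBox={1..n});
var A: [D] real;
forall i in D do
  A[i] = i;

// Lower level: explicit locality and task control with coforall and on-clauses.
var B: [1..n] real;
coforall loc in Locales do        // one task per locale
  on loc {                        // run this task's body on that locale
    for i in chunkOf(loc.id, numLocales, n) do
      B[i] = i;
  }

// Hypothetical helper (not a library routine): contiguous chunk of 1..n for locale id.
proc chunkOf(id: int, nl: int, n: int) {
  return id*n/nl + 1 .. (id+1)*n/nl;
}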

Where Chapel Was Born: HPCS

HPCS: High Productivity Computing Systems (DARPA et al.)
– Goal: raise the productivity of high-end computing users by 10x
– Productivity = Performance + Programmability + Portability + Robustness

• Phase II: Cray, IBM, Sun (July 2003 – June 2006)
  – Evaluated the entire system architecture’s impact on productivity…
    • processors, memory, network, I/O, OS, runtime, compilers, tools, …
    • …and new languages:
      Cray: Chapel    IBM: X10    Sun: Fortress

• Phase III: Cray, IBM (July 2006 – 2010)
  – Implement the systems and technologies resulting from Phase II
  – (Sun also continues work on Fortress, without HPCS funding)

Global-view vs. Fragmented

Problem: “Apply a 3-pt stencil to a vector”, i.e. b(i) = (a(i-1) + a(i+1))/2.

[Figure: in the global view the whole vector is updated as one logical operation; in the fragmented view the vector is split across processes, and each fragment must obtain its neighbours’ boundary elements before applying the same stencil locally.]

Global-view vs. SPMD Code

Global-view:

def main() {
  var n: int = 1000;
  var a, b: [1..n] real;

  forall i in 2..n-1 {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}

SPMD:

def main() {
  var n: int = 1000;
  var locN: int = n/numProcs;
  var a, b: [0..locN+1] real;

  if (iHaveRightNeighbor) {
    send(right, a(locN));
    recv(right, a(locN+1));
  }
  if (iHaveLeftNeighbor) {
    send(left, a(1));
    recv(left, a(0));
  }
  forall i in 1..locN {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}

Chapel Overview

• A design principle for HPC
  – “Support the general case, optimize for the common case”

• Data parallelism (from ZPL) + task parallelism (from the Cray MTA) + a scripting-language feel (see the sketch below)

• The latest version, 1.3.0, is available as open source:
  http://sourceforge.net/projects/chapel
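As a rough added illustration (a sketch of mine, not a slide from the talk), all three ingredients can appear in a few lines of Chapel:

// Data parallelism + task parallelism + a scripting-language feel, in one program.
config const n = 8;               // set from the command line with --n=16; type is inferred

var A: [1..n] real;

forall i in 1..n do               // data parallelism (ZPL heritage)
  A[i] = i * i;

cobegin {                         // task parallelism (Cray MTA heritage): two concurrent tasks
  writeln("sum = ", + reduce A);
  writeln("max = ", max reduce A);
}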

Language concepts, as a layered stack (top to bottom):
  Distributions
  Data Parallelism
  Task Parallelism
  Locality Control
  Base Language
  Target Machine

Chapel Example: Heat Transfer

[Figure: an n × n grid A with one boundary row held at 1.0; each interior point is repeatedly replaced by the average of its 4 neighbours, repeating until the max change is less than ε.]

Chapel Code For Heat Transfer
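The code from this slide did not survive the transcript, so what follows is a reconstruction of the well-known Chapel heat-transfer (Jacobi) example based on the problem statement above; it is written in current Chapel syntax and may differ in detail from what was actually shown.

config const n = 6,
             epsilon = 1.0e-5;

const BigD = {0..n+1, 0..n+1},    // n x n interior plus a boundary layer
      D    = {1..n, 1..n};        // the interior

var A, Temp: [BigD] real;

A[n+1, 1..n] = 1.0;               // hold one boundary row at 1.0

var delta: real;
do {
  forall (i, j) in D do           // replace each point by the average of its 4 neighbours
    Temp[i, j] = (A[i-1, j] + A[i+1, j] + A[i, j-1] + A[i, j+1]) / 4;
  delta = max reduce abs(A[D] - Temp[D]);
  A[D]  = Temp[D];
} while (delta > epsilon);        // repeat until the max change drops below epsilon

writeln("converged, final delta = ", delta);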

Chapel as Minix in ParaProg

• If I were to offer a ParaProg class, I’d want to teach about (a few of these are sketched below):
  – data parallelism
  – task parallelism
  – concurrency
  – synchronization
  – locality/affinity
  – deadlock, livelock, and other pitfalls
  – performance tuning
  – …
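As a small added illustration (not from the talk), here is how a few of these concepts, namely concurrency, synchronization and locality, can look in Chapel:

var done: sync bool;                  // synchronization: a full/empty sync variable

begin {                               // concurrency: spawn an asynchronous task
  on Locales[numLocales-1] do         // locality/affinity: run on the last locale
    writeln("hello from locale ", here.id);
  done.writeEF(true);                 // fill the sync var to signal completion
}

done.readFE();                        // the main task blocks here until the signal arrives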

Conclusion: Major Points

• Programmability and performance are the perennial dilemma of ParaProg

• Multiresolution sounds perfect in theory, but it is not yet mature enough for production

• However, Chapel could serve as the Minix of ParaProg education

Q&A