New Techniques for Programming GPU Clusters
Yifeng Chen, School of EECS, Peking University, China


Page 1:

New Techniques for Programming GPU Clusters

Yifeng Chen

School of EECS, Peking University, China

Page 2:

Two Conflicting Approaches to Programmability in HPC

Top-down approach:
- The core programming model is high-level (e.g., a functional parallel language).
- It must rely on heavy heuristic runtime optimization.
- Low-level program constructs are added to improve low-level control.
- Risks: programmers tend to avoid using the "extra" constructs, and the low-level controls do not fit well into the core model.

Bottom-up approach (PARRAY, PPoPP'12):
- The core programming model exposes the memory hierarchy.
- Same algorithm, same performance, same intellectual challenge, but shorter code.

Page 3:

GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU

Page 4:

Motivating Examples for PARRAY

[Figure: a paged host array (4096 rows split as 2048 + 2048) is distributed to Proc0 and Proc1 with MPI_Scatter over the network, then copied to GPU 0 and GPU 1 with cudaMemcpyHostToDevice over PCI.]

Page 5:

Basic Notation

Dimension Tree

Type Reference
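As a sketch of how these three concepts fit together, the declarations below are copied from the code examples on the later pages; the comments are one plausible reading of the notation, not an authoritative definition:

```
#parray {paged float [2][[2048][4096]]} H   // dimension tree: an outer dimension of
                                            // size 2 whose elements are 2048 x 4096
                                            // sub-arrays in paged host memory
#parray {dmem float # H_1} D                // type reference: D lives in device
                                            // memory, shaped like sub-dimension 1 of H
#parray {[#P][#D]} G                        // composes previously declared types
                                            // (P, a thread array, and D) into a
                                            // distributed layout
```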

Page 6:

Page 7:

Thread Arrays

[Figure repeated from Page 4: MPI_Scatter distributes the host array to Proc0 and Proc1 over the network; cudaMemcpyHostToDevice copies each half to GPU 0 and GPU 1 over PCI.]

Page 8:

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost
{
  #create P(p)
  #create H(host)
  #detour P(p)
  {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}

Generating CUDA+Pthread

[Diagram: the generated code starts one worker per GPU with pthread_create, synchronizes host and workers with sem_post/sem_wait, and finishes with pthread_join.]

Page 9:

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhost
{
  #create M(m)
  #create H(host)
  #detour M(m)
  {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}
Generating MPI or IB/verbs


Page 10:

Other Communication Patterns

[Figure: ALLTOALL and BCAST communication patterns.]

Page 11:

Generating Code for IB/verbs and YH

Communication layer:
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- A discontiguous RDMA communication pattern achieving zero-copy

Page 12:

Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem

Page 13:

Page 14:

(Before Nov 2011)

Page 15:

Direct Simulation of Turbulent Flows

Scale:
- Up to 14336^3 grid points, 3-D, single precision
- 12 distributed arrays, each with 11 TB of data (128 TB in total)
- Entire Tianhe-1A with 7168 nodes

Progress:
- 4096^3 completed; 8192^3 halfway; 14336^3 tested for performance

Software technologies:
- PARRAY code of only 300 lines
- Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.

Page 16:

Page 17:

Generated Code

Page 18:

Discussion

Other programming models?
- MPI (more expressive datatypes)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating zero-copy IB calls)

We need a software stack!
- Irregular structures must be encoded into arrays, and can then benefit from PARRAY.
- A runtime workflow layer is possible above PARRAY.
- PARRAY generates Pthread + CUDA + MPI code (future support for FPGA and MIC is possible), plus macros.
- Macros are compiled out: no performance loss.
- Typical training takes 3 days; friendly to engineers.