New Techniques for Programming GPU Clusters
Yifeng Chen, School of EECS, Peking University, China


Page 1:

New Techniques for Programming GPU Clusters

Yifeng Chen

School of EECS, Peking University, China

Page 2:

Two Conflicting Approaches to Programmability in HPC

Top-down approach:
- The core programming model is high-level (e.g., a functional parallel language).
- It must rely on heavy heuristic runtime optimization.
- Low-level program constructs are added to improve low-level control.
- Risks: programmers tend to avoid using the "extra" constructs, and the low-level controls do not fit well into the core model.

Bottom-up approach (PARRAY, PPoPP'12):
- The core programming model exposes the memory hierarchy.
- Same algorithm, same performance, same intellectual challenge, but shorter code.

Page 3:

GPU Clusters
- Tianhe: 1 GPU / 2 CPUs
- Tsubame: 3 GPUs / 2 CPUs
- Mole-8.5: 6 GPUs / 2 CPUs
- PKU McClus: 2 GPUs / 1 CPU

Page 4:

Motivating Examples for PARRAY

[Figure: a paged host array (4096 rows split as 2048 + 2048) is distributed to Proc0 and Proc1 with MPI_Scatter over the network, then copied to GPU 0 and GPU 1 with cudaMemcpyHostToDevice over PCI.]

Page 5:

Basic Notation

Dimension Tree

Type Reference
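As a sketch of how these three concepts fit together, the declarations below are copied from the code examples on the later pages; the comments are one plausible reading of the notation, not an authoritative definition:

```
#parray {paged float [2][[2048][4096]]} H   // dimension tree: an outer dimension of
                                            // size 2 whose elements are 2048 x 4096
                                            // sub-arrays in paged host memory
#parray {dmem float # H_1} D                // type reference: D lives in device
                                            // memory, shaped like sub-dimension 1 of H
#parray {[#P][#D]} G                        // composes previously declared types
                                            // (P, a thread array, and D) into a
                                            // distributed layout
```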

Page 6:

Page 7:

Thread Arrays

[Figure repeated from Page 4: MPI_Scatter distributes the host array to Proc0 and Proc1 over the network; cudaMemcpyHostToDevice copies each half to GPU 0 and GPU 1 over PCI.]

Page 8:

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host;
_pa_pthd* p;

#mainhost
{
  #create P(p)
  #create H(host)
  #detour P(p)
  {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}

Generating CUDA+Pthread

[Diagram: the generated code starts one worker per GPU with pthread_create, synchronizes host and workers with sem_post/sem_wait, and finishes with pthread_join.]

Page 9:

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhost
{
  #create M(m)
  #create H(host)
  #detour M(m)
  {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}
Generating MPI or IB/verbs


Page 10:

Other Communication Patterns

[Figure: ALLTOALL and BCAST communication patterns.]

Page 11:

Generating Code for IB/verbs and YH

Communication layer:
- Semi-bypassing the MPI layer
- Patching the InfiniBand layer
- A discontiguous RDMA communication pattern achieving zero-copy

Page 12:

Large-Scale FFT in 20 Lines
- Deeply optimized algorithm (ICS 2010)
- Zero-copy for hmem

Page 13:

Page 14:

(Before Nov 2011)

Page 15:

Direct Simulation of Turbulent Flows

Scale:
- Up to 14336^3 grid points, 3-D, single precision
- 12 distributed arrays, each with 11 TB of data (128 TB in total)
- Entire Tianhe-1A with 7168 nodes

Progress:
- 4096^3 completed; 8192^3 halfway; 14336^3 tested for performance

Software technologies:
- PARRAY code of only 300 lines
- Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.

Page 16:

Page 17:

Generated Code

Page 18:

Discussion

Other programming models?
- MPI (more expressive datatypes)
- OpenACC (optimization for coalescing accesses)
- PGAS (generating PGAS library calls)
- IB/verbs (directly generating zero-copy IB calls)

We need a software stack!
- Irregular structures must be encoded into arrays, and can then benefit from PARRAY.
- A runtime workflow layer is possible above PARRAY.
- PARRAY generates Pthread + CUDA + MPI code (future support for FPGA and MIC is possible), plus macros.
- Macros are compiled out: no performance loss.
- Typical training takes 3 days; friendly to engineers.