
Architecture Aware Algorithms and Software for Peta and Exascale


DESCRIPTION

Jack Dongarra from the University of Tennessee presented these slides at the Ken Kennedy Institute for Information Technology on Feb 13, 2014. Listen to the podcast review of this talk: http://insidehpc.com/2014/02/13/week-hpc-jack-dongarra-talks-algorithms-exascale/


Page 1: Architecture Aware Algorithms and Software for Peta and Exascale

2/13/14

Jack Dongarra
University of Tennessee | Oak Ridge National Laboratory | University of Manchester | Rice University

Architecture-aware Algorithms and Software for Peta and Exascale Computing

K2I Distinguished Lecture Series

Page 2: Architecture Aware Algorithms and Software for Peta and Exascale

Outline

•  Overview of High Performance Computing
•  Look at an implementation of some linear algebra algorithms on today's High Performance Computers
   §  As an example of the kind of thing needed

Page 3: Architecture Aware Algorithms and Software for Peta and Exascale

The TOP500 List
H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem; a small timing sketch follows below)
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org

[Plot: performance rate vs. problem size, TPP performance.]
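For orientation, Rmax comes from timing the solution of a dense Ax = b and dividing roughly 2/3 n^3 + 2 n^2 floating-point operations by the elapsed time. The sketch below is not HPL itself, only a single-node illustration of that measurement using LAPACKE's dgesv; it assumes a LAPACKE/CBLAS installation such as OpenBLAS, and the size n = 2000 is an arbitrary choice. Compile with something like cc -fopenmp bench.c -llapacke -lopenblas (details vary by installation).

  #include <stdio.h>
  #include <stdlib.h>
  #include <lapacke.h>
  #include <omp.h>

  /* Time a dense solve Ax = b and report Gflop/s, LINPACK style. */
  int main(void)
  {
      const int n = 2000;
      double *A = malloc((size_t)n * n * sizeof(double));
      double *b = malloc((size_t)n * sizeof(double));
      lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));

      srand(42);
      for (int i = 0; i < n * n; i++) A[i] = (double)rand() / RAND_MAX;
      for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

      double t0 = omp_get_wtime();
      LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, A, n, ipiv, b, n);
      double secs = omp_get_wtime() - t0;

      double dn = n;
      double flops = 2.0 / 3.0 * dn * dn * dn + 2.0 * dn * dn;
      printf("n = %d: %.3f s, %.1f Gflop/s\n", n, secs, flops / secs / 1e9);

      free(A); free(b); free(ipiv);
      return 0;
  }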

Page 4: Architecture Aware Algorithms and Software for Peta and Exascale

Performance Development of HPC Over the Last 20 Years

[Chart, 1993-2013, log scale from 100 Mflop/s to 100 Pflop/s: SUM grows from 1.17 Tflop/s to 224 Pflop/s; N=1 from 59.7 Gflop/s to 33.9 Pflop/s; N=500 from 400 Mflop/s to 118 Tflop/s. A "6-8 years" annotation marks the lag between the curves. For reference: my laptop delivers about 70 Gflop/s; my iPad2 and iPhone 4s about 1.02 Gflop/s.]

Page 5: Architecture Aware Algorithms and Software for Peta and Exascale

State of Supercomputing in 2014

•  Pflop/s computing fully established, with 31 systems.
•  Three technology architecture possibilities, or "swim lanes," are thriving:
   •  Commodity (e.g. Intel)
   •  Commodity + accelerator (e.g. GPUs)
   •  Special purpose lightweight cores (e.g. IBM BG)
•  Interest in supercomputing is now worldwide and growing in many new markets (over 50% of Top500 computers are in industry).
•  Exascale projects exist in many countries and regions.

Page 6: Architecture Aware Algorithms and Software for Peta and Exascale

November 2013: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1 | National University of Defense Technology | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 61 | 3.3 | 806
8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
9 | DOE / NNSA, L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
10 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 91* | 3.42 | 848
500 | Banking | HP | USA | 22,212 | .118 | 50 | |

Page 7: Architecture Aware Algorithms and Software for Peta and Exascale

Accelerators (53 systems)

[Chart: number of accelerated systems per year, 2006-2013.]

Intel MIC (13), NVIDIA K20 (16), NVIDIA 2090 (11), NVIDIA 2050 (7), NVIDIA 2070 (4), ATI GPU (2), IBM PowerXCell 8i (0), Clearspeed CSX600 (0)

By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 2 Brazil, 2 Switzerland, 1 Italy, 1 Poland, 1 Australia, 1 Saudi Arabia, 1 South Korea, 1 Spain, 1 UK

Page 8: Architecture Aware Algorithms and Software for Peta and Exascale

Top500 Performance Share of Accelerators

[Chart: fraction of total TOP500 performance contributed by accelerated systems, 2006-2013, rising to about 35%.]

53 of the 500 systems provide 35% of the accumulated performance.

Page 9: Architecture Aware Algorithms and Software for Peta and Exascale

For the Top 500: Rank at which Half of Total Performance is Accumulated

[Chart: number of systems accounting for half of the total performance, 1994-2012.]

[Chart: Top 500, November 2013, performance (Pflop/s) vs. rank.]

Page 10: Architecture Aware Algorithms and Software for Peta and Exascale

Commodity plus Accelerator Today

Commodity CPU: Intel Xeon, 8 cores at 3 GHz, 8 x 4 ops/cycle: 96 Gflop/s (DP).

Accelerator (GPU): Nvidia K20X "Kepler", 2688 CUDA cores (192 CUDA cores/SMX) at 0.732 GHz, 2688 x 2/3 ops/cycle: 1.31 Tflop/s (DP), 6 GB memory.

Interconnect: PCI-e Gen2/3, 16 lanes, 64 Gb/s (8 GB/s), about 1 GW/s in 8-byte words.

A quick check of this arithmetic follows below.
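A tiny program confirming the cores x ops/cycle x clock arithmetic, with the values taken from this slide:

  #include <stdio.h>

  /* Theoretical double-precision peak = cores x ops/cycle x clock (GHz). */
  static double peak_gflops(int cores, double ops_per_cycle, double ghz)
  {
      return cores * ops_per_cycle * ghz;
  }

  int main(void)
  {
      /* Intel Xeon: 8 cores, 4 DP ops/cycle, 3.0 GHz -> 96 Gflop/s. */
      printf("Xeon : %7.1f Gflop/s\n", peak_gflops(8, 4.0, 3.0));
      /* Nvidia K20X: 2688 CUDA cores, 2/3 DP op/cycle, 0.732 GHz -> ~1311 Gflop/s. */
      printf("K20X : %7.1f Gflop/s\n", peak_gflops(2688, 2.0 / 3.0, 0.732));
      return 0;
  }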

Page 11: Architecture Aware Algorithms and Software for Peta and Exascale

Linpack Efficiency

[Chart: Linpack efficiency (Rmax as a percentage of peak), 0-100%, plotted across the 500 systems.]


Page 15: Architecture Aware Algorithms and Software for Peta and Exascale

DLA Solvers

¨  We are interested in developing Dense Linear Algebra (DLA) solvers.
¨  Retool LAPACK and ScaLAPACK for multicore and hybrid architectures.

Page 16: Architecture Aware Algorithms and Software for Peta and Exascale

Last Generations of DLA Software

Software/algorithms follow hardware evolution in time:

LINPACK (70's), vector operations: relies on Level-1 BLAS operations.
LAPACK (80's), blocking, cache friendly: relies on Level-3 BLAS operations.
ScaLAPACK (90's), distributed memory: relies on PBLAS and message passing.
PLASMA, new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels.
MAGMA, hybrid algorithms (heterogeneity friendly): relies on a hybrid scheduler and hybrid kernels.

The flop-to-memory-traffic argument behind the Level-1 to Level-3 shift is sketched below.
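The reason for the Level-1 to Level-3 shift is the ratio of arithmetic to memory traffic (a standard argument, not spelled out on the slide): a Level-1 operation such as DAXPY performs O(1) flops per word moved, while a Level-3 operation such as DGEMM performs O(n) flops per word, so blocked Level-3 code can reuse data that sits in cache.

\[
\text{DAXPY: } \frac{2n\ \text{flops}}{3n\ \text{words}} = O(1),
\qquad
\text{DGEMM: } \frac{2n^{3}\ \text{flops}}{3n^{2}\ \text{words}} = O(n).
\]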

Page 17: Architecture Aware Algorithms and Software for Peta and Exascale

A New Generation of DLA Software

(Same software/algorithms timeline as the previous slide.)

Page 18: Architecture Aware Algorithms and Software for Peta and Exascale

Parallelization of LU and QR: parallelize the update.

•  Easy and done in any reasonable software.
•  This is the 2/3 n^3 term in the FLOP count.
•  Can be done efficiently with LAPACK + multithreaded BLAS.

[Diagram: at each step, factor the panel with lu( ) (dgetf2), then update the trailing matrix A(2) with dtrsm (+ dswp) and dgemm, producing the next L and U blocks.]

Fork-join parallelism, bulk synchronous processing; a sketch of this structure follows below.
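To make the fork-join structure concrete, here is a minimal sketch (not code from the slides) of a right-looking tile LU written with OpenMP parallel loops: the panel step runs on one core while the others wait, and every update phase ends at an implicit barrier. It assumes the matrix is stored as t x t column-major tiles of size nb, omits pivoting (the dgetf2 + dswp part of the slide) for brevity, and relies on a CBLAS implementation such as OpenBLAS; lu_fork_join and tile_getrf_nopiv are names invented for this sketch.

  #include <cblas.h>
  #include <omp.h>

  /* Unpivoted LU of one nb x nb column-major tile, in place. */
  void tile_getrf_nopiv(double *T, int nb)
  {
      for (int p = 0; p < nb; p++)
          for (int i = p + 1; i < nb; i++) {
              T[i + p * nb] /= T[p + p * nb];
              for (int j = p + 1; j < nb; j++)
                  T[i + j * nb] -= T[i + p * nb] * T[p + j * nb];
          }
  }

  /* A holds t*t tile pointers; tile (i,j) is A[i + j*t], each nb x nb.
     Run the BLAS single-threaded inside these parallel loops. */
  void lu_fork_join(double **A, int t, int nb)
  {
      for (int k = 0; k < t; k++) {
          /* Panel: factor the diagonal tile sequentially; other cores idle. */
          tile_getrf_nopiv(A[k + k * t], nb);

          /* Fork: block-row (U) and block-column (L) triangular solves.
             Implicit barrier (join) at the end of the parallel for. */
          #pragma omp parallel for schedule(dynamic)
          for (int m = k + 1; m < t; m++) {
              cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                          CblasUnit, nb, nb, 1.0, A[k + k * t], nb,
                          A[k + m * t], nb);               /* U(k,m) */
              cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                          CblasNonUnit, nb, nb, 1.0, A[k + k * t], nb,
                          A[m + k * t], nb);               /* L(m,k) */
          }

          /* Fork: trailing-matrix update, the 2/3 n^3 dgemm term; join again. */
          #pragma omp parallel for collapse(2) schedule(dynamic)
          for (int m = k + 1; m < t; m++)
              for (int n = k + 1; n < t; n++)
                  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                              nb, nb, nb, -1.0, A[m + k * t], nb,
                              A[k + n * t], nb, 1.0, A[m + n * t], nb);
      }
  }

The barriers at the end of each parallel for are exactly the synchronization points the next slides set out to remove.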

Page 19: Architecture Aware Algorithms and Software for Peta and Exascale

Synchronization (in LAPACK LU)

[Trace: cores vs. time for LAPACK LU, showing idle cores at each synchronization point.]

Ø  Fork-join
Ø  Bulk synchronous processing

Page 20: Architecture Aware Algorithms and Software for Peta and Exascale

PLASMA LU Factorization: Dataflow Driven

[DAG: an xGETF2 panel task feeds xTRSM tasks, which feed a cascade of xGEMM update tasks.]

The numerical program generates tasks, and the runtime system executes the tasks while respecting their data dependences. A sketch of this dataflow style, using OpenMP task dependences as a stand-in for PLASMA's runtime, follows below.
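PLASMA drives this with its own runtime (QUARK), but the dataflow idea can be sketched with standard OpenMP task dependences as a stand-in: every tile kernel becomes a task whose depend clauses name the tiles it reads and writes, and a task starts as soon as its inputs are ready, with no per-step barrier. The sketch below is not PLASMA code; it reuses the tile layout and the hypothetical tile_getrf_nopiv kernel from the fork-join sketch above and again omits pivoting.

  #include <cblas.h>
  #include <omp.h>

  void tile_getrf_nopiv(double *T, int nb);   /* from the fork-join sketch */

  /* Same tile layout as before: tile (i,j) is A[i + j*t], each nb x nb. */
  void lu_dataflow(double **A, int t, int nb)
  {
      #pragma omp parallel
      #pragma omp single
      for (int k = 0; k < t; k++) {
          double *Akk = A[k + k * t];

          /* Panel task: only tasks touching tile (k,k) must wait for it. */
          #pragma omp task depend(inout: Akk[0:nb*nb])
          tile_getrf_nopiv(Akk, nb);

          for (int m = k + 1; m < t; m++) {
              double *Akm = A[k + m * t], *Amk = A[m + k * t];

              #pragma omp task depend(in: Akk[0:nb*nb]) depend(inout: Akm[0:nb*nb])
              cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                          CblasUnit, nb, nb, 1.0, Akk, nb, Akm, nb);    /* U(k,m) */

              #pragma omp task depend(in: Akk[0:nb*nb]) depend(inout: Amk[0:nb*nb])
              cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                          CblasNonUnit, nb, nb, 1.0, Akk, nb, Amk, nb); /* L(m,k) */
          }

          for (int m = k + 1; m < t; m++)
              for (int n = k + 1; n < t; n++) {
                  double *Amk = A[m + k * t], *Akn = A[k + n * t];
                  double *Amn = A[m + n * t];

                  /* Update task: runs as soon as L(m,k) and U(k,n) exist. */
                  #pragma omp task depend(in: Amk[0:nb*nb], Akn[0:nb*nb]) depend(inout: Amn[0:nb*nb])
                  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                              nb, nb, nb, -1.0, Amk, nb, Akn, nb, 1.0, Amn, nb);
              }
      }   /* the implicit barrier of the parallel region waits for all tasks */
  }

Compared with the fork-join version, the update tasks of step k can overlap with the panel and triangular-solve tasks of step k+1, which is exactly the lookahead the DAG exposes.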

Page 21: Architecture Aware Algorithms and Software for Peta and Exascale

QUARK

¨  A runtime environment for the dynamic execution of precedence-constrained tasks (DAGs) on a multicore machine.
   Ø  Translation: if you have a serial program consisting of computational kernels (tasks) that are related by data dependencies, QUARK can help you execute that program, relatively efficiently and easily, in parallel on a multicore machine.

Page 22: Architecture Aware Algorithms and Software for Peta and Exascale

The Purpose of a QUARK Runtime

¨  Objectives
   Ø  High utilization of each core
   Ø  Scaling to large numbers of cores
   Ø  Synchronization-reducing algorithms
¨  Methodology
   Ø  Dynamic DAG scheduling (QUARK)
   Ø  Explicit parallelism
   Ø  Implicit communication
   Ø  Fine granularity / block data layout
¨  Arbitrary DAGs with dynamic scheduling

[Figure: fork-join parallelism vs. DAG-scheduled parallelism; note the synchronization penalty of fork-join in the presence of heterogeneity.]

Page 23: Architecture Aware Algorithms and Software for Peta and Exascale

QUARK Shared Memory Superscalar Scheduling

Definition (pseudocode):

  FOR k = 0..TILES-1
      A[k][k] ← DPOTRF(A[k][k])
      FOR m = k+1..TILES-1
          A[m][k] ← DTRSM(A[k][k], A[m][k])
      FOR m = k+1..TILES-1
          A[m][m] ← DSYRK(A[m][k], A[m][m])
          FOR n = k+1..m-1
              A[m][n] ← DGEMM(A[m][k], A[n][k], A[m][n])

Implementation (QUARK code in PLASMA):

  for (k = 0; k < A.mt; k++) {
      QUARK_CORE(dpotrf, ...);
      for (m = k+1; m < A.mt; m++) {
          QUARK_CORE(dtrsm, ...);
      }
      for (m = k+1; m < A.mt; m++) {
          QUARK_CORE(dsyrk, ...);
          for (n = k+1; n < m; n++) {
              QUARK_CORE(dgemm, ...);
          }
      }
  }

Page 24: Architecture Aware Algorithms and Software for Peta and Exascale

Pipelining: Cholesky Inversion

Three steps: factor (POTRF), invert L (TRTRI), multiply the L's (LAUUM).

Run on 48 cores; the matrix is 4000 x 4000 with a tile size of 200 x 200.

Critical path length (in tile tasks, for t tiles):
- POTRF + TRTRI + LAUUM executed one after the other: 25 (7t - 3)
- Cholesky factorization alone: 3t - 2
- All three steps pipelined: 18 (3t + 6)

Page 25: Architecture Aware Algorithms and Software for Peta and Exascale

Performance of PLASMA Cholesky, Double Precision: Comparing Various Numbers of Cores (Percentage of Theoretical Peak)

[Chart: Gflop/s vs. matrix size (up to 160,000) on 1024 cores (64 x 16-core) 2.00 GHz Intel Xeon X7550, 8,192 Gflop/s peak in double precision [nautilus], static scheduling.]

Fraction of peak reached per core count: 16 cores (.78), 32 (.70), 64 (.58), 96 (.51), 128 (.43), 160 (.46), 192 (.48), 384 (.31), 576 (.33), 768 (.30), 960 (.27), 1000 (.27).

Page 26: Architecture Aware Algorithms and Software for Peta and Exascale

¨  DAG too large to be generated ahead of time
   Ø  Generate it dynamically
   Ø  Merge parameterized DAGs with dynamically generated DAGs
¨  HPC is about distributed heterogeneous resources
   Ø  Have to be able to move the data across multiple nodes
   Ø  The scheduling cannot be centralized
   Ø  Take advantage of all available resources with minimal intervention from the user
¨  Facilitate the usage of the HPC resources
   Ø  Languages
   Ø  Portability

Page 27: Architecture Aware Algorithms and Software for Peta and Exascale

DPLASMA: Going to Distributed Memory

[Figure: the tile DAG (PO, TR, SY, GE tasks) distributed across nodes.]

Page 28: Architecture Aware Algorithms and Software for Peta and Exascale

Start with PLASMA:

  for i, j = 0..N
      QUARK_Insert( GEMM, A[i,j], INPUT, B[j,i], INPUT, C[i,i], INOUT )
      QUARK_Insert( TRSM, A[i,j], INPUT, B[j,i], INOUT )

Parse the C source code into an abstract syntax tree of QUARK_Insert calls (GEMM and TRSM tasks with their A, B and C tile arguments).

Analyze dependencies with the Omega test:  { 1 < i < N : GEMM(i, j) => TRSM(j) }

Generate code which contains the parameterized DAG:  GEMM(i, j) -> TRSM(j)

Loops and array references have to be affine.

Page 29: Architecture Aware Algorithms and Software for Peta and Exascale

DAG: Conceptualized & Parameterized

PLASMA (on node, QUARK): the DAG is unrolled dynamically over an execution window of tasks with their inputs and outputs. The number of tasks in the DAG is O(n^3); Cholesky: 1/3 n^3, LU: 2/3 n^3, QR: 4/3 n^3.

DPLASMA (distributed system, DAGuE): the DAG is parameterized, and the number of task types is O(1); Cholesky: 4 (POTRF, SYRK, GEMM, TRSM), LU: 4 (GETRF, GESSM, TSTRF, SSSSM), QR: 4 (GEQRT, LARFB, TSQRT, SSRFB). Small enough to store on each core in every node = scalable. (A small task-counting check follows below.)
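One way to see the O(n^3) versus O(1) contrast is to simply count the tasks generated by the tile Cholesky loop nest from page 23 for t tiles; the parameterized DAG needs only its four task types no matter how large t gets. (A small throwaway counter, not PLASMA or DPLASMA code.)

  #include <stdio.h>

  /* Count the tasks produced by the tile Cholesky loop nest (page 23). */
  int main(void)
  {
      for (int t = 10; t <= 100; t *= 10) {
          long potrf = 0, trsm = 0, syrk = 0, gemm = 0;
          for (int k = 0; k < t; k++) {
              potrf++;
              for (int m = k + 1; m < t; m++) trsm++;
              for (int m = k + 1; m < t; m++) {
                  syrk++;
                  for (int n = k + 1; n < m; n++) gemm++;
              }
          }
          printf("t = %3d tiles: %6ld tasks in the unrolled DAG "
                 "(POTRF %ld, TRSM %ld, SYRK %ld, GEMM %ld) vs. 4 task types\n",
                 t, potrf + trsm + syrk + gemm, potrf, trsm, syrk, gemm);
      }
      return 0;
  }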

Page 30: Architecture Aware Algorithms and Software for Peta and Exascale

Task Affinity in DPLASMA

User-defined data distribution function.

The runtime system, called PaRSEC, is distributed.

Page 31: Architecture Aware Algorithms and Software for Peta and Exascale

Runtime DAG Scheduling

¨  Every node has the symbolic DAG representation
   Ø  Only the (node-local) frontier of the DAG is considered
   Ø  Distributed scheduling based on remote completion notifications
¨  Background remote data transfer, automatic, with overlap
¨  NUMA / cache-aware scheduling
   Ø  Work stealing and sharing based on memory hierarchies

[Figure: the tile DAG (PO, TR, SY, GE tasks) partitioned across Node0-Node3.]

Page 32: Architecture Aware Algorithms and Software for Peta and Exascale

[Performance charts for LU, Cholesky, and QR using the DSBP (Distributed Square Block Packed) data layout.]

Platform: 81 nodes, dual-socket, quad-core Xeon L5420 (648 cores total at 2.5 GHz), ConnectX InfiniBand DDR 4x.

Page 33: Architecture Aware Algorithms and Software for Peta and Exascale

The PaRSEC Framework

[Diagram of the PaRSEC stack:]
- Hardware: cores, memory hierarchies, coherence, data movement, accelerators.
- Parallel runtime: scheduling, data, data movement.
- Domain-specific extensions: compact representation (PTG), dynamically discovered representation (DTG), specialized kernels, tasks.
- Applications: dense LA, sparse LA, chemistry, ...

Page 34: Architecture Aware Algorithms and Software for Peta and Exascale

Other Systems

             | PaRSEC                           | SMPss            | StarPU           | Charm++            | FLAME             | QUARK            | Tblas            | PTG
Scheduling   | Distr. (1/core)                  | Repl (1/node)    | Repl (1/node)    | Distr. (Actors)    | w/ SuperMatrix    | Repl (1/node)    | Centr.           | Centr.
Language     | Internal or Seq. w/ affine loops | Seq. w/ add_task | Seq. w/ add_task | Msg-driven objects | Internal (LA DSL) | Seq. w/ add_task | Seq. w/ add_task | Internal
Availability | Public                           | Public           | Public           | Public             | Public            | Public           | Not avail.       | Not avail.

Accelerator support: GPU (in five of the systems).

Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.

All projects support distributed and shared memory (QUARK with QUARKd; FLAME with Elemental).

Page 35: Architecture Aware Algorithms and Software for Peta and Exascale

Performance Development in Top500

[Chart: projected Top500 performance, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the N=1 and N=500 trend lines.]

Page 36: Architecture Aware Algorithms and Software for Peta and Exascale

Today's #1 System (Tianhe-2) vs. a Projected Exascale System

Systems              | 2014: Tianhe-2                      | 2020-2022            | Difference Today & Exa
System peak          | 55 Pflop/s                          | 1 Eflop/s            | ~20x
Power                | 18 MW (3 Gflops/W)                  | ~20 MW (50 Gflops/W) | O(1) (~15x)
System memory        | 1.4 PB (1.024 PB CPU + .384 PB CoP) | 32-64 PB             | ~50x
Node performance     | 3.43 TF/s (.4 CPU + 3 CoP)          | 1.2 or 15 TF/s       | O(1)
Node concurrency     | 24 cores CPU + 171 cores CoP        | O(1k) or 10k         | ~5x - ~50x
Node interconnect BW | 6.36 GB/s                           | 200-400 GB/s         | ~40x
System size (nodes)  | 16,000                              | O(100,000) or O(1M)  | ~6x - ~60x
Total concurrency    | 3.12 M (12.48 M threads, 4/core)    | O(billion)           | ~100x
MTTF                 | Few / day                           | O(<1 day)            | O(?)

The arithmetic behind the total-concurrency row is sketched below.
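The O(billion) total-concurrency row follows from simple arithmetic, assuming each thread sustains on the order of 1 Gflop/s (roughly what a single core-level thread delivers today):

\[
\frac{10^{18}\ \text{flop/s}\ (1\ \text{Eflop/s})}{10^{9}\ \text{flop/s per thread}} \approx 10^{9}\ \text{threads}.
\]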

Page 37: Architecture Aware Algorithms and Software for Peta and Exascale

Exascale System Architecture with a cap of $200M and 20 MW

(Same comparison table as the previous slide.)

Page 38: Architecture Aware Algorithms and Software for Peta and Exascale

Exascale System Architecture with a cap of $200M and 20 MW

(Same comparison table as the previous slides, except for the MTTF row:)

MTTF | Few / day | Many / day | O(?)

Page 39: Architecture Aware Algorithms and Software for Peta and Exascale

Major Changes to Software & Algorithms

•  Must rethink the design of our algorithms and software
   §  Another disruptive technology, similar to what happened with cluster computing and message passing
   §  Rethink and rewrite the applications, algorithms, and software
   §  Data movement is expensive
   §  Flops are cheap, so they are provisioned in excess

Page 40: Architecture Aware Algorithms and Software for Peta and Exascale

Critical Issues at Peta & Exascale for Algorithm and Software Design

•  Synchronization-reducing algorithms
   §  Break the fork-join model
•  Communication-reducing algorithms
   §  Use methods that attain known lower bounds on communication
•  Mixed-precision methods
   §  2x speed of ops and 2x speed for data movement (a small example follows below)
•  Autotuning
   §  Today's machines are too complicated; build "smarts" into the software to adapt to the hardware
•  Fault-resilient algorithms
   §  Implement algorithms that can recover from failures/bit flips
•  Reproducibility of results
   §  Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
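On the mixed-precision point, LAPACK already ships one concrete instance of the idea: dsgesv factors the matrix in single precision (roughly twice the op rate and half the data movement) and then iteratively refines the solution back to double-precision accuracy. A minimal sketch of calling it through LAPACKE follows; the size n and the random data are arbitrary choices, and a LAPACKE installation is assumed.

  #include <stdio.h>
  #include <stdlib.h>
  #include <lapacke.h>

  /* Mixed precision in practice: single-precision factorization plus
     iterative refinement to a double-precision solution (LAPACK dsgesv). */
  int main(void)
  {
      const int n = 1000;
      double *A = malloc((size_t)n * n * sizeof(double));
      double *b = malloc((size_t)n * sizeof(double));
      double *x = malloc((size_t)n * sizeof(double));
      lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
      lapack_int iter;

      srand(7);
      for (int i = 0; i < n * n; i++) A[i] = (double)rand() / RAND_MAX;
      for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

      lapack_int info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, 1, A, n, ipiv,
                                       b, n, x, n, &iter);

      /* iter > 0: number of refinement sweeps that were needed;
         iter < 0: refinement did not converge and the routine fell back
                   to an ordinary double-precision solve. */
      printf("info = %d, iter = %d\n", (int)info, (int)iter);

      free(A); free(b); free(x); free(ipiv);
      return 0;
  }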

Page 41: Architecture Aware Algorithms and Software for Peta and Exascale

Summary

•  Major challenges are ahead for extreme computing
   §  Parallelism O(10^9)
   §  Programming issues
      •  Hybrid
   §  Peak and HPL may be very misleading
      •  Nowhere near peak for most apps
   §  Fault tolerance
      •  Today the Sequoia BG/Q node failure rate is 1.25 failures/day
   §  Power
      •  Need 50 Gflops/W (today at about 2 Gflops/W)
•  We will need completely new approaches and technologies to reach the exascale level

Page 42: Architecture Aware Algorithms and Software for Peta and Exascale

Collaborators / Software / Support

u  PLASMA: http://icl.cs.utk.edu/plasma/
u  MAGMA: http://icl.cs.utk.edu/magma/
u  QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
u  PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/

u  Collaborating partners:
   University of Tennessee, Knoxville
   University of California, Berkeley
   University of Colorado, Denver