
1

Solving Finite Element Systems with 17 Billion Unknowns

at Sustained Teraflops Performance

Lehrstuhl für Informatik 10 (Systemsimulation)

Universität Erlangen-Nürnberg

www10.informatik.uni-erlangen.de

Erlangen, July 21, 2006

B. Bergen (LSS Erlangen/Los Alamos)

T. Gradl (LSS Erlangen)

F. Hülsemann (LSS Erlangen/EDF)

U. Rüde (LSS Erlangen, [email protected])

G. Wellein (RRZE Erlangen)

2

Overview

Motivation

Applications of Large-Scale FE Simulations
Direct and Inverse Bio-Electric Field Problems
3D image registration / image processing / analysis
Flow-Induced Noise and Acoustics
Multiphysics Problems in Nano Technology
Simulation of High Temperature Furnaces

Scalable Finite Element Solvers
Scalable Algorithms: Multigrid
Scalable Architecture: Hierarchical Hybrid Grids (HHG)
Results

Outlook
Towards Peta-Scale FE Solvers

3

Part I

Motivation

4

A little quiz ...

1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹²,

1 Pflops = 10¹⁵ floating point operations per second

What is the speed of your PC?

What is the speed of the fastest computer currently available (and where is it located)?

What was the speed of the fastest computer in 1995? 2000? 2005?

5

A little quiz ...

1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹²,

1 Pflops = 10¹⁵ floating point operations per second

What is the speed of your PC?

probably between 1 and 6.5 GFlops

6

A little quiz ...

1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹²,

1 Pflops = 10¹⁵ floating point operations per second

What is the speed of your PC?

What is the speed of the fastest computer currently available (and where is it located)?

What was the speed of the fastest computer in 1995? 2000? 2005?

7

A little quiz ...

1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹²,

1 Pflops = 10¹⁵ floating point operations per second

What is the speed of your PC?

What is the speed of the fastest computer currently available (and where is it located)?

367 Tflops: a Blue Gene/L in Livermore, California, with >130,000 processors

8

A little quiz ...

1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹²,

1 Pflops = 10¹⁵ floating point operations per second

What is the speed of your PC?

What is the speed of the fastest computer currently available (and where is it located)?

What was the speed of the fastest computer in 1995? 2000? 2005?

9

A little quiz ...

1 Kflops = 10³, 1 Mflops = 10⁶, 1 Gflops = 10⁹, 1 Tflops = 10¹²,

1 Pflops = 10¹⁵ floating point operations per second

What is the speed of your PC?
What is the speed of the fastest computer currently available (and where is it located)?
What was the speed of the fastest computer in
1995? 0.236 TFlops
2000? 12.3 TFlops
2005? 367 TFlops

... and how much has the speed of cars/airplanes/... improved in the same time?

additional question: When do you expect that computers exceed 1 PFlops?

10

Architecture Example: Our Pet Dinosaur (now in retirement)

HLRB-I: Hitachi SR 8000 at the Leibniz-Rechenzentrum der

Bayerischen Akademie der Wissenschaften

(No. 5 at time of installation in 2000)

8 Proc and 8 GB per node

1344 CPUs (168*8)

2016 Gflop total

Very sensitive to data structures

Currently being replaced by HLRB-II: 4096-processor SGI Altix

upgrade to >60 Tflops in 2007

KONWIHR = Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen

Project Gridlib (2001-2005)

Development of Hierarchical Hybrid Grids (HHG)

11

HHG Motivation I: Structured vs. Unstructured Grids (on Hitachi SR 8000)

gridlib/HHG Mflops rates for matrix-vector multiplication on one node of the Hitachi, compared with highly tuned JDS results for sparse matrices

[Bar chart: Mflops (0–8000) vs. number of unknowns (729, 35,937, 2,146,689) for the JDS and stencil-based implementations]

12

HHG Motivation II: DiMe Project

Cache-optimizations for sparse matrix codes

High Performance Multigrid Solvers

Efficient LBM codes for CFD

Collaboration project funded 1996 - 20xy by DFG

LRR München (Prof. Bode)

LSS Erlangen (Prof. Rüde)

www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/

Data Local Iterative Methods for the Efficient Solution of Partial Differential Equations

13

HHG Motivation III

Memory for the discretization is often the limiting factor on the size of problem that can be solved.

Common Practice
Add resolution by applying regular refinement to an unstructured input grid
Refinement does not add new information about the domain ⇒ CRS is overkill (or any other sparse matrix structure)

Missed Opportunity
Regularity of structured patches is not exploited

HHG
Develop new data structures that exploit regularity for enhanced performance
Employ stencil-based discretization techniques on structured patches to reduce memory usage
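To make the missed opportunity concrete, here is an illustrative sketch (not HHG source code; all names are invented for this example) of the same matrix-vector product written once for a generic CRS structure and once as a constant-coefficient 7-point stencil on a structured patch, the form that HHG can exploit inside regularly refined regions:

#include <cstddef>
#include <vector>

// Generic CRS: one stored coefficient per nonzero, indirect indexing.
void crs_matvec(const std::vector<std::size_t>& row_ptr,
                const std::vector<std::size_t>& col_idx,
                const std::vector<double>& val,
                const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
    double s = 0.0;
    for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
      s += val[k] * x[col_idx[k]];            // gather through col_idx
    y[i] = s;
  }
}

// Structured patch: a single 7-point stencil replaces the per-row
// coefficients, so no index arrays are stored and the loops vectorize.
void stencil_matvec(const double* x, double* y, int nx, int ny, int nz,
                    const double s[7]) {
  auto id = [=](int i, int j, int k) {
    return (std::size_t(k) * ny + j) * nx + i;
  };
  for (int k = 1; k < nz - 1; ++k)
    for (int j = 1; j < ny - 1; ++j)
      for (int i = 1; i < nx - 1; ++i)
        y[id(i, j, k)] = s[0] * x[id(i, j, k)]
                       + s[1] * x[id(i - 1, j, k)] + s[2] * x[id(i + 1, j, k)]
                       + s[3] * x[id(i, j - 1, k)] + s[4] * x[id(i, j + 1, k)]
                       + s[5] * x[id(i, j, k - 1)] + s[6] * x[id(i, j, k + 1)];
}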

14

Part II

Applications of Large-Scale Finite Element Simulations

15

Example: Flow-Induced Noise

Flow around a small square obstacle

Relatively simple geometries, but fine resolution for resolving physics

Images by M. Escobar (Dept. of Sensor Technology + LSTM)

16

Part III - a

Towards Scalable FE Software

Multigrid Algorithms

17

What is Multigrid?

Has nothing to do with „grid computing“

A general methodology
multi-scale (actually it is the „original“ multi-scale method)
many different applications
developed in the 1970s - ...

Useful e.g. for solving elliptic PDEs
large sparse systems of equations
iterative
convergence rate independent of problem size
asymptotically optimal complexity -> algorithmic scalability!
can solve e.g. a 2D Poisson problem in ~30 operations per gridpoint
efficient parallelization - if one knows how to do it
best (maybe the only?) basis for fully scalable FE solvers

18

Multigrid: V-Cycle

Goal: solve A_h u_h = f_h using a hierarchy of grids

[V-cycle diagram: relax on the fine grid, compute the residual, restrict it to the coarser grid, solve there by recursion, interpolate the correction, correct, and relax again]
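For illustration, a minimal, self-contained V(2,2)-cycle for a 1-D Poisson model problem is sketched below (hedged: the function names and the 1-D setting are invented for this example; the HHG solver applies the same cycle to 3-D tetrahedral grid hierarchies):

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One damped Gauss-Seidel-type sweep (in place) for -u'' = f, h = 1/(n-1).
void relax(Vec& u, const Vec& f, int sweeps) {
  const std::size_t n = u.size();
  const double h2 = 1.0 / ((n - 1.0) * (n - 1.0));
  for (int s = 0; s < sweeps; ++s)
    for (std::size_t i = 1; i + 1 < n; ++i)
      u[i] = 0.5 * u[i] + 0.25 * (u[i - 1] + u[i + 1] + h2 * f[i]);
}

Vec residual(const Vec& u, const Vec& f) {             // r = f - A u
  const std::size_t n = u.size();
  const double h2inv = (n - 1.0) * (n - 1.0);
  Vec r(n, 0.0);
  for (std::size_t i = 1; i + 1 < n; ++i)
    r[i] = f[i] - h2inv * (2.0 * u[i] - u[i - 1] - u[i + 1]);
  return r;
}

Vec restrict_fw(const Vec& r) {                        // full weighting, fine -> coarse
  Vec rc((r.size() + 1) / 2, 0.0);
  for (std::size_t i = 1; i + 1 < rc.size(); ++i)
    rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1];
  return rc;
}

void prolong_add(const Vec& ec, Vec& u) {              // add interpolated correction
  for (std::size_t i = 0; i + 1 < ec.size(); ++i) {
    u[2 * i]     += ec[i];
    u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]);
  }
  u[u.size() - 1] += ec[ec.size() - 1];
}

void v_cycle(Vec& u, const Vec& f) {                   // one V(2,2) cycle for A_h u_h = f_h
  if (u.size() <= 3) { relax(u, f, 50); return; }      // "solve" on the coarsest grid
  relax(u, f, 2);                                      // pre-smoothing
  Vec rc = restrict_fw(residual(u, f));                // restrict the residual
  Vec ec(rc.size(), 0.0);
  v_cycle(ec, rc);                                     // coarse-grid correction by recursion
  prolong_add(ec, u);                                  // interpolate and correct
  relax(u, f, 2);                                      // post-smoothing
}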

19

Parallel High Performance FE Multigrid

Parallelize „plain vanilla“ multigrid
partition domain
parallelize all operations on all grids
use clever data structures

Do not worry (so much) about coarse grids
idle processors?
short messages?
sequential dependency in grid hierarchy?

Why we do not use Domain Decomposition
DD without coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors
DD with coarse grids is still less efficient than multigrid and is as difficult to parallelize

20

Part III - b

Towards Scalable FE Software

Scalable Architecture: Hierarchical Hybrid Grids

21

Hierarchical Hybrid Grids (HHG)

Unstructured input grid

Resolves geometry of problem domain

Patch-wise regular refinement

generates nested grid hierarchies naturally suitable for geometric multigrid algorithms

New:

Modify storage formats and operations on the grid to exploit the regular substructures

Does an unstructured grid with 100 000 000 000 elements make sense?

HHG - Ultimate Parallel FE Performance!

22

HHG Refinement example

Input Grid

23

HHG Refinement example

Refinement Level one

24

HHG Refinement example

Refinement Level Two

25

HHG Refinement example

Structured Interior

26

HHG Refinement example

Structured Interior

27

HHG Refinement example

Edge Interior

28

HHG Refinement example

Edge Interior

29

Common HHG Misconceptions

Hierarchical hybrid grids (HHG)

are not just another block-structured grid
HHG are more flexible (unstructured, hybrid input grids)

are not just another unstructured geometric multigrid package
HHG achieve better performance; unstructured treatment of regular regions does not improve performance

30

Parallel HHG - Framework Design Goals

To realize good parallel scalability:

Minimize latency by reducing the number of messages that must be sent

Optimize for high bandwidth interconnects ⇒ large messages

Avoid local copying into MPI buffers
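A hedged illustration of the last goal: when the interface layer of a structured patch is stored contiguously (here, whole x-y planes of an nx·ny·nz block), the halo exchange can pass the data to MPI directly, with no packing into separate send buffers. Function and variable names are invented for this sketch and are not taken from the HHG sources.

#include <mpi.h>
#include <vector>

// Exchange the top/bottom halo planes of a structured block with the
// neighbor ranks 'below' and 'above' (MPI_PROC_NULL at domain boundaries).
void exchange_top_bottom(std::vector<double>& u, int nx, int ny, int nz,
                         int below, int above, MPI_Comm comm) {
  const int plane = nx * ny;                  // one contiguous x-y plane
  MPI_Request req[4];
  // Send the lowest/highest interior planes directly from the field array ...
  MPI_Isend(&u[1 * plane],        plane, MPI_DOUBLE, below, 0, comm, &req[0]);
  MPI_Isend(&u[(nz - 2) * plane], plane, MPI_DOUBLE, above, 1, comm, &req[1]);
  // ... and receive straight into the halo planes: no intermediate buffers.
  MPI_Irecv(&u[0],                plane, MPI_DOUBLE, below, 1, comm, &req[2]);
  MPI_Irecv(&u[(nz - 1) * plane], plane, MPI_DOUBLE, above, 0, comm, &req[3]);
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

Large contiguous messages of this kind also serve the first two goals: few messages, each big enough to exploit the interconnect bandwidth.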

31

HHG for Parallelization

Use regular HHG patches for partitioning the domain

32

HHG Parallel Update Algorithm

for each vertex do
    apply operation to vertex
end for

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for

update vertex primary dependencies
update edge primary dependencies
update secondary dependencies

33

HHG for Parallelization

34

Part III - c

Towards Scalable FE Software

Performance Results

35

Node Performance is Difficult! (B. Gropp)
DiMe project: Cache-aware Multigrid (1996 - ...)

Performance of 3D-MG-Smoother for 7-pt stencil in Mflops on Itanium 1.4 GHz:

grid size      513³   257³   129³    65³    33³    17³
standard        579    490    677    715   1344   1072
no blocking     819    849   1065    995   1417   2445
2x blocking    1282   1284   1319   1312   1913   2400
3x blocking    2049   2134   2140   2167   2389   2420

Array Padding

Temporal blocking - in EPIC assembly language

Software pipelining in the extreme (M. Stürmer - J. Treibig)

Node Performance is Possible!
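The sketch below illustrates the temporal-blocking idea behind the "2x blocking" row of the table (hedged: it is not the DiMe kernel itself, which additionally uses array padding, software pipelining and hand-tuned EPIC assembly). Two successive Gauss-Seidel sweeps of a constant-coefficient 7-point smoother are fused so that a k-plane receives its second update while it is still cache-resident:

#include <cstddef>

// One Gauss-Seidel update of a single k-plane for -Laplace(u) = f.
inline void relax_plane(double* u, const double* f, int nx, int ny, int k,
                        double h2) {
  auto id = [=](int i, int j, int kk) {
    return (std::size_t(kk) * ny + j) * nx + i;
  };
  for (int j = 1; j < ny - 1; ++j)
    for (int i = 1; i < nx - 1; ++i)
      u[id(i, j, k)] = (u[id(i - 1, j, k)] + u[id(i + 1, j, k)] +
                        u[id(i, j - 1, k)] + u[id(i, j + 1, k)] +
                        u[id(i, j, k - 1)] + u[id(i, j, k + 1)] +
                        h2 * f[id(i, j, k)]) / 6.0;
}

// Two fused sweeps ("2x blocking"): while plane k gets its first update,
// plane k-1 -- still in cache -- gets its second one. The result is
// identical to two separate lexicographic Gauss-Seidel sweeps.
void smooth_blocked2(double* u, const double* f, int nx, int ny, int nz,
                     double h2) {
  for (int k = 1; k < nz - 1; ++k) {
    relax_plane(u, f, nx, ny, k, h2);                   // first sweep, plane k
    if (k > 1) relax_plane(u, f, nx, ny, k - 1, h2);    // second sweep, lagged
  }
  relax_plane(u, f, nx, ny, nz - 2, h2);                // finish second sweep
}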

36

Single Processor HHG Performance on Itanium for Relaxation of a Tetrahedral Finite Element Mesh

37

Parallel HHG Performance on Altix for Multigrid with a Tetrahedral Finite Element Mesh

38

HHG: Parallel Scalability

#Procs   #DOFs ×10⁶   #Els ×10⁶   #Input Els   GFLOP/s    Time [s]
    64        2,144      12,884         6144    100/75         68
   128        4,288      25,769        12288    200/147        69
   256        8,577      51,539        24576    409/270        76
   512       17,167     103,079        49152    762/545        75
  1024       17,167     103,079        49152    1,456/964      43

Parallel scalability of the Poisson problem discretized by tetrahedral finite elements - SGI Altix (Itanium-2, 1.6 GHz)

B. Bergen, F. Hülsemann, U. Rüde: Is 1.7×10¹⁰ unknowns the largest finite element system that can be solved today? Supercomputing, Nov. 2005

2006 ISC Award for Application Scalability

39

Why Multigrid?

A direct elimination banded solver has complexity O(N^2.3) for a 3-D problem.

This becomes

338,869,529,114,764,631,553,758 = O(10²³) operations for our problem size

At one Teraflops this would result in a runtime of 10,000 years, which could be reduced to 10 years on a Petaflops system
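For reference, the arithmetic behind these figures, with N ≈ 1.7·10¹⁰ unknowns:

N^2.3 ≈ (1.7·10¹⁰)^2.3 ≈ 3.4·10²³ operations; 3.4·10²³ operations / 10¹² operations per second ≈ 3.4·10¹¹ s ≈ 10⁴ years.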

40

Part IV

Outlook

41

System Configuration: HLRB-II (Phase I)

4096 Intel Madison 9M (1.6 GHz) cores

single core per socket = 4096 cores

6 MByte L3 cache

FSB 533 (8.5 GByte/s for one core)

1.33 Byte/Flop = 0.17 Words/Flop (8.5 GByte/s vs. 6.4 GFlop/s peak per core)

17 TByte memory

26.2 TF Peak -> 24.5 TF Linpack

~ 7.4 TF aggreg. weighted Application Performance (LRZ BM)

~ 3.5 TF expected as every-day mean performance

256 core single system image

42

System Configuration: HLRB-II (Phase II)

Target: 3328+ Intel Montvale sockets

dual core = 6656+ cores

9 MByte L3 cache

FSB 667 (10.7 GB/s shared by two cores) ≈ 0.1 Words/Flop

40+ TByte memory

>60 TF Peak

13 TF LRZ benchmarks

~6 TF expected sustained

512 cores per node

Installation: 2007

43

Conclusions

Supercomputer Performance is Easy!

If parallel efficiency is bad, choose a slower serial algorithm
it is probably easier to parallelize
and will make your speedups look much more impressive

Introduce the “CrunchMe” variable for getting high Flops rates
advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start

Introduce the “HitMe” variable to get good cache hit rates
advanced version: disguise HitMe within “clever data structures” that introduce a lot of overhead

Never cite “time-to-solution”
who cares whether you solve a real-life problem anyway
it is the MachoFlops that interest the people who pay for your research

Never waste your time by trying to use a complicated algorithm in parallel (such as multigrid)
the more primitive the algorithm, the easier it is to maximize your MachoFlops.

44

Acknowledgements

Collaborators

In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, Neurozentrum, Radiologie, etc.

International: Utah, Technion, Constanta, Ghent, Boulder, ...

Dissertation Projects
B. Bergen (HHG development)
T. Gradl (HHG application / optimization)
J. Treibig (cache optimizations)
M. Stürmer (architecture-aware programming)
U. Fabricius (AMG methods and SW engineering for parallelization)
C. Freundl (parallel expression templates for PDE solvers)
J. Härtlein (expression templates for FE applications)
N. Thürey (LBM, free surfaces)
T. Pohl (parallel LBM)
... and 6 more

20 Diplom/Master theses, Studien/Bachelor theses
Especially for performance analysis / optimization for LBM: J. Wilke, K. Iglberger, S. Donath
... and 23 more

Funding: KONWIHR, DFG, NATO, BMBF

Graduate Education: Elitenetzwerk Bayern
Bavarian Graduate School in Computational Engineering (with TUM, since 2004)
Special International PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg), since Jan. 2006

45

Talk is Over

Please wake up!

46

FE Applications (in our group)

Solid-state lasers: heat conduction, elasticity, wave equation

Bio-electric fields: calculation of electrostatic or electromagnetic potential in heterogeneous domains

Computational Nano Technology: Lattice-Boltzmann simulation of charged particles in colloids; calculation of electrostatic potential in very large domains

Acoustics: room acoustics, flow-induced noise

Industrial High Temperature Furnaces: heat transport, radiation, ...

47

HHG Motivation III

Memory used to represent the discretization is often the limiting factor on the size of problem that can be solved.

Common Practice
Add resolution by applying regular refinement to an unstructured input grid
Refinement does not add new information about the domain ⇒ CRS is overkill (or any other sparse matrix structure)

Missed Opportunity
Regularity of structured patches is not exploited

HHG
Develop new data structures that exploit regularity for enhanced performance
Employ stencil-based discretization techniques on structured patches to reduce memory usage

48

Large Scale Acoustic Simulations

A concert hall may have >10,000 m³ volume

need a resolution of <1 cm to resolve the audible spectrum of wavelengths

need >10⁶ cells per m³

The concert hall will require solving a system with >10¹⁰ unknowns per (implicit) time step

Flow-induced noise (KONWIHR)
generated by turbulence and boundary effects
high-frequency noise requires a very fine resolution of the acoustic field
far-field computations => large domains

HHG will make unprecedented acoustic simulations possible!

49

HHG Motivation I

Standard sparse matrix data structures (like CRS) generally achieve poor performance!

Cache Effects
Difficult to find an ordering of the unknowns that maximizes data locality

Indirect Indexing
Precludes aggressive compiler optimizations that exploit instruction-level parallelism (ILP)

Continuously Variable Coefficients
Overkill for certain classes of problems

50

User-Friendliness: ParExPDE
Parallel Expression Templates for Partial Differential Equations

Library for the rapid development of numerical PDE solvers on parallel architectures

High level intuitive programming interface

Use of expression templates ensures good efficiency

Funded by KONWIHR (2001-2004)

User-friendly FE application development with excellent parallel efficiency
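To indicate what the expression-template mechanism buys, here is a minimal, hedged sketch of the idea (not ParExPDE code; all class names are invented): an expression such as u = 2·v + w is assembled as a type at compile time and evaluated element-wise in one loop, without temporary grid objects.

#include <cstddef>
#include <vector>

// CRTP base so that only expression types take part in the operators below.
template <class E> struct Expr {
  const E& self() const { return static_cast<const E&>(*this); }
};

template <class L, class R> struct Add : Expr<Add<L, R>> {
  const L& l; const R& r;
  Add(const L& l_, const R& r_) : l(l_), r(r_) {}
  double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class R> struct Scale : Expr<Scale<R>> {
  double a; const R& r;
  Scale(double a_, const R& r_) : a(a_), r(r_) {}
  double operator[](std::size_t i) const { return a * r[i]; }
};

struct GridFn : Expr<GridFn> {
  std::vector<double> v;
  explicit GridFn(std::size_t n, double x = 0.0) : v(n, x) {}
  double operator[](std::size_t i) const { return v[i]; }
  template <class E> GridFn& operator=(const Expr<E>& e) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = e.self()[i];  // one fused loop
    return *this;
  }
};

template <class L, class R>
Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) { return {l.self(), r.self()}; }

template <class R>
Scale<R> operator*(double a, const Expr<R>& r) { return {a, r.self()}; }

// usage sketch:
//   GridFn u(n), v(n, 1.0), w(n, 2.0);
//   u = 2.0 * v + w;    // expands to one loop: u[i] = 2.0*v[i] + w[i]

The overloaded operators only build lightweight expression objects; all arithmetic happens in the single loop of GridFn::operator=, which is what lets the high-level notation run at the speed of a hand-written loop.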

51

Conclusions

High performance simulation still requires “heroic programming”
Parallel programming: single node performance is difficult
Which architecture?
Which data structures?
Where are we going?
the end of Moore’s law
petaflops require >100,000 processors - and we can hardly handle 1000!
the memory wall
• latency
• bandwidth
• locality!

We are open for collaborations!

52

Conclusions (1)

High performance simulation still requires “heroic programming” … but we are on the way to making supercomputers more generally usable

Parallel programming is easy, node performance is difficult (B. Gropp)

Which architecture?

ASCI-type: custom CPU, massively parallel cluster of SMPs
• nobody has been able to show that these machines scale efficiently, except on a few very special applications and using enormous human effort

Earth-Simulator-type: vector CPU, as many CPUs as affordable
• impressive performance on vectorizable code, but need to check with more demanding data and algorithm structures

Hitachi class: modified custom CPU, cluster of SMPs
• excellent performance on some codes, but unexpected slowdowns on others; too exotic to have a sufficiently large software base

Others: BlueGene, Cray X1, multithreading, PIM, reconfigurable …, quantum computing, …

53

Conclusions (2)

Which data structures?
structured (inflexible)
unstructured (slow)
HHG (high development effort, even the prototype is 50 K lines of code)
meshless … (useful in niches)

Where are we going?
the end of Moore’s law
nobody builds CPUs with HPC-specific requirements high on the list of priorities
petaflops: 100,000 processors, and we can hardly handle 1000
It’s the locality - stupid!
the memory wall
• latency
• bandwidth

Distinguish between algorithms where control flow is
• data independent: latency hiding techniques (pipelining, prefetching, etc.) can help
• data dependent

54

Talk is Over

Please wake up!

55

In the Future?

What’s beyond Moore’s Law?

56

Conclusions (1)

High performance simulation still requires “heroic programming” … but we are on the way to making supercomputers more generally usable

Parallel programming is easy, node performance is difficult (B. Gropp)

Which architecture?

Large-scale cluster: custom CPU, massively parallel cluster of SMPs
• nobody has been able to show that these machines scale efficiently, except on a few very special applications and using enormous human effort

Earth-Simulator-type: vector CPU, as many CPUs as affordable
• impressive performance on vectorizable code, but need to check with more demanding data and algorithm structures

Hitachi class: modified custom CPU, cluster of SMPs
• excellent performance on some codes, but unexpected slowdowns on others; too exotic to have a sufficiently large software base

Others: BlueGene, Cray X1, multithreading, PIM, reconfigurable …, quantum computing, …

57

Conclusions (2)

Which data structures?
structured (inflexible)
unstructured (slow)
HHG (high development effort, even the prototype is 50 K lines of code)
meshless … (useful in niches)

Where are we going?
the end of Moore’s law
nobody builds CPUs for HPC-specific requirements
petaflops require >100,000 processors - and we can hardly handle 1000!
It’s the locality - stupid!
the memory wall
• latency
• bandwidth

Distinguish between algorithms where control flow is
• data independent: latency hiding techniques (pipelining, prefetching, etc.) can help
• data dependent

58

[Figure: Moore’s Law in Semiconductor Technology (F. Hossfeld) - transistors per die vs. year (1970-2005) for Intel microprocessors (4004 … Pentium Pro, Merced) and DRAM (1K … 4G), with growth rates of 42% and 52% per year]

59

[Figure: Information Density & Energy Dissipation (adapted by F. Hossfeld from C. P. Williams et al., 1998) - atoms per bit and energy per logic operation [pico-Joules] vs. year (1950-2020) for semiconductor technology, with markers for kT and ≈ 2017]

60

Acknowledgements

Collaborators
In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, etc.
Especially for foams: C. Körner (WTM)
International: Utah, Technion, Constanta, Ghent, Boulder, ...

Dissertation Projects
U. Fabricius (AMG methods and SW engineering for parallelization)
C. Freundl (parallel expression templates for PDE solvers)
J. Härtlein (expression templates for FE applications)
N. Thürey (LBM, free surfaces)
T. Pohl (parallel LBM)
... and 6 more

16 Diplom/Master theses, Studien/Bachelor theses
Especially for performance analysis / optimization for LBM: J. Wilke, K. Iglberger, S. Donath
... and 21 more

KONWIHR, DFG, NATO, BMBF

Elitenetzwerk Bayern
Bavarian Graduate School in Computational Engineering (with TUM, since 2004)
Special International PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg), to start Jan. 2006

61

ParExPDE - Performance Results

Speedup of ParExPDE on the LSS Cluster

# Proc.   Number of Unknowns   Wallclock Time (s)
    1         8.51·10⁶               291
    2         8.51·10⁶               138
    4         8.51·10⁶                71
    8         8.51·10⁶                39
   16         8.51·10⁶                24
   32         8.51·10⁶                14

• Multigrid solver (V(2,2) cycles) for Poisson problem with Dirichlet boundary
• LSS Cluster: AMD Opteron 2.2 GHz

62

ParExPDE - Performance Results

Scaleup of ParExPDE on the LSS Cluster

# Proc.   Number of Unknowns   Wallclock Time (s)
    1         2.00·10⁶                67
    2         4.02·10⁶                68
    4         8.05·10⁶                68
    8         1.61·10⁷                72
   16         3.22·10⁷                72
   32         6.45·10⁷                76

• Multigrid solver (V(2,2) cycles) for Poisson problem with Dirichlet boundary
• LSS Cluster: AMD Opteron 2.2 GHz

63

Source Localization

3D MRI data: 512 x 512 x 256 voxels

segmentation: 4 compartments

Localized epileptic focus

Dipole localization: search algorithm = optimization

Collaborators: Univ. of Utah (Chris Johnson), Ovidius Univ. Constanta (C. Popa), Bart Vanrumste (Gent, Univ. of Canterbury, New Zealand), G. Greiner, F. Fahlbusch, C. Wolters (Münster)

64

Source Localization

Localized epileptic focus

Dipole localization: search algorithm = optimization

Collaborators: Univ. of Utah (Chris Johnson), Ovidius Univ. Constanta (C. Popa), Bart Vanrumste (Gent, Univ. of Canterbury, New Zealand), G. Greiner, F. Fahlbusch, C. Wolters (Münster)

65

Data Requirements

3D MRI data: 512 x 512 x 256 voxels

segmentation: 4 compartments

Data types to describe isotropic and anisotropic conductivity

66

HHG Parallel Update Algorithm

Loop structure:

for each vertex do
    apply operation to vertex
end for

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for

With the dependency updates interleaved:

for each vertex do
    apply operation to vertex
end for
update vertex primary dependencies

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for
update edge primary dependencies

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for
update secondary dependencies

67

Numerical Models for Source Localization

Computationally expensive treatment of singularities
• Dipole models (blurred, subtraction, mathematical/Zenger correction)
• Resolution and complexity of the mesh are important

68

ParExPDE - Performance Results

Speedup of ParExPDE

# Proc.   Number of Unknowns   Wallclock Time (s)
    1         8.51·10⁶               291
    2         8.51·10⁶               138
    4         8.51·10⁶                71
    8         8.51·10⁶                39
   16         8.51·10⁶                24
   32         8.51·10⁶                14

Multigrid solver
V(2,2) cycles for Poisson problem
Dirichlet boundary
LSS Cluster

Compute Nodes (8x4 CPUs)

CPU: AMD Opteron 848

69

ParExPDE - Performance Results

Scaleup of ParExPDE on the LSS Cluster

# Proc.   Number of Unknowns   Wallclock Time (s)
    1         2.00·10⁶                67
    2         4.02·10⁶                68
    4         8.05·10⁶                68
    8         1.61·10⁷                72
   16         3.22·10⁷                72
   32         6.45·10⁷                76

• Multigrid solver
• V(2,2) cycles
• Poisson problem
• Dirichlet boundary
• LSS Cluster: AMD Opteron 2.2 GHz

70

ParExPDE - Applications

Simulation of solid-state lasers (Prof. Pflaum)

Calculation of heat conduction

Calculation of elasticity

Solution of Helmholtz equation

71

Bio-electric Field Computations

Reconstruction of electromagnetic fields from EEG measurements:

Source Localization

Neurosurgery, Kopfklinikum Erlangen

View through operation microscope

72

Why simulate and not experiment?

Open brain: EEG measurements for localizing functional brain regions

Simulation-based: virtual operation planning

73

Problem of inverse EEG/MEG

Direct Problem:
• Known: sources (strength, position, orientation)
• Wanted: potentials on the head surface

Inverse Problem:
• Known: potentials on the head surface
• Wanted: sources (strength, position, orientation)

74

Computational Requirements for Biomedical FE Computations

Resolution: 1024³ ≈ 10⁹ voxels

Complex source models, anisotropic conductivity, inverse problem

Bio-electro-chemical model for signal propagation

Coupling to bio-mechanical effects, e.g.
contraction of heart muscle
blood flow

Therapy planning, e.g. cardiac resynchronization pacemaker
optimal placement of electrodes
optimal control of resynchronization

ParExPDE as scalable parallel solver!

75

Computational Nano Technology

Extension of the LBM particle simulation by an electrostatic potential

Long range particle-particle interactions

Simulating charged nano-particles in flow field

generates ions in the fluid (three species of fluid: +,-,o)

generates charge distribution

ParExPDE for computing the (static) potential distribution in each time step

10⁹ - 10¹⁵ unknowns

compute forces on particles and ionized fluid from potential

Diploma thesis: C. Feichtinger

76

3. ORCAN

Component-based application framework to ease the development of parallel simulations

Central issues are
• Flexibility: dynamically switch modules
• Maintainability: enable long-term usage of a code
• Extensibility: enable reuse of existing codes and coupling to other solvers (multiphysics)
• Enable a well-organized SW development process

Modern Software Infrastructure for Technical High Performance Computing