Parallel Computer Architectures
Dieter an Mey
Center for Computing and Communication, RWTH Aachen University, Germany
Overview
- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500
Single Processor System
(Diagram: Proc connected to Memory)
- Main memory to store data and program.
- Processor to fetch the program from memory and execute program instructions: load data from memory, process data, and write results back to memory.
- Input/output is not covered here.
Caches
(Figure: Marc Tremblay, Sun)
Single Processor System
(Diagram: Proc - Cache - Memory)
Caches are smaller than main memory, but much faster. They are employed to bridge the gap between a bigger and slower main memory and the much faster processor. The cache is invisible to the programmer: only when measuring the runtime does the effect of caches become apparent.
Single Processor System
(Diagram: Proc with on-chip L1 cache ($), off-chip L2 cache ($), and Memory)
With a growing number of transistors on each chip over time, caches can be put on the same piece of silicon.
Ignored here: instruction caches, address caches (TLB), write buffers, and prefetch buffers, as data caches are most important for HPC applications.
In 2005 Intel Cancelled the 4 GHz Chip
Fast clock cycles make processor chips more expensive, hotter, and more power consuming.
The Impact of Moore's Law
(Chart: Intel processors, clock speed (MHz) and transistors (x1000) over time. Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm)
The number of transistors on a chip is still doubling every 18 months, but the clock speed is no longer growing that fast. Higher clock speed causes higher temperature and higher power consumption. Instead we'll see many more cores per chip!
Dual-Core Processors
(Diagram: two cores with local L1 caches sharing an L2 cache and Memory)
Since 2005/6 Intel and AMD have been producing dual-core processors for the mass market. In 2006/7 Intel and AMD introduced quad-core processors. By 2008 it will be hard to buy a PC without a dual-core processor. Your future PC / laptop will be a parallel computer!
Dual-Core Processors: Intel Woodcrest
(Diagram: Memory, shared L2 cache, two cores with local L1 caches)
Here: 4 MB shared L2 cache on chip, 2 cores with local L1 cache, and a socket for a second processor chip.
Multi-Core Processors
- UltraSPARC IV: 1.2 GHz, 130 nm, ~66 million transistors, 108 W; 2 cores with 64 KB L1 each and 2 x 8 MB off-chip L2 cache
- UltraSPARC IV+: 1.5 GHz, 90 nm, 295 million transistors, 90 W (?); 2 cores with 64 KB L1 each, 2 MB shared L2 on chip, 32 MB off-chip cache
- Opteron 875: 2.2 GHz, 90 nm, 199 mm^2, 233 million transistors, 95 W; 2 cores with 64 KB L1 and 1 MB L2 each
- UltraSPARC T1: 1.0 GHz, 90 nm, 378 mm^2, 300 million transistors, 72 W; 8 cores with 8 KB L1 each and a shared 3 MB L2 cache
What to Do with All These Threads
(Figure: Marc Tremblay, Sun - threads waiting in parallel)
Sun Fire T2000 at Aachen
(Diagram: one UltraSPARC T1 @ 1 GHz; 8 cores, each with 8 KB L1 cache; 4 x 0.75 MB L2 cache banks; one shared FPU; internal crossbar, 134 GB/s; 4 DDR2 memory controllers on chip; 25.6 GB/s memory bandwidth)
Sun T5120: Eight Cores x Eight Threads
(Diagram: one UltraSPARC T2 (Niagara 2) @ 1.4 GHz; 8 cores, each with 8 KB L1 cache, 8 threads per core, and 1 FPU per core; 8 x 0.5 MB L2 cache banks; internal crossbar; 4 FB-DRAM memory controllers on chip; 42.7 GB/s memory bandwidth)
Chip-Level Parallelism
- UltraSPARC III: superscalar, single core; 4 SPARC V9 instr/cycle; 1 active thread per core (cycle time = 1.11 ns)
- UltraSPARC IV+: superscalar, dual core; 2 x 4 SPARC V9 instr/cycle; 1 active thread per core (= 0.66 ns)
- Opteron 875: superscalar, dual core; 2 x 3 x86 instr/cycle; 1 active thread per core (= 0.45 ns)
- UltraSPARC T1: single issue, 8 cores; 8 x 1 SPARC V9 instr/cycle; 4 active threads per core, context switch comes for free (= 1.0 ns)
Shared Memory Parallel Computers: Uniform Memory Access (UMA)
(Diagram: four processors, each with its own cache, connected through a crossbar/bus to one Memory)
In a shared memory parallel computer multiple processors have access to the same main memory. Yes, a dual-core / multi-core processor based machine is a parallel computer on a chip.
- Crossbar adds latency
- Architecture is not scalable
Shared Memory Parallel Computers: Non-Uniform Memory Access (NUMA)
(Diagram: four processors, each with its own cache and local Memory, connected through a crossbar/bus)
- Faster local memory access
- Slower remote memory access
Sun Fire E2900 at Aachen
(Diagram: 12 dual-core UltraSPARC IV processors @ 1.2 GHz; 64 KB L1 per core; 2 x 8 MB L2 per processor; memory controller on chip, 2.4 GB/s; crossbar with 9.6 GB/s total peak memory bandwidth)
- Simplistic view, programmer's perspective
- Rather uniform memory access
Sun Fire V40z at Aachen
(Diagram: 4 dual-core Opteron processors @ 2.2 GHz; 64 KB L1 and 1 MB L2 per core; DDR-400 memory controller on chip; 6.4 GB/s to local memory; 8 GB/s inter-processor links)
- Simplistic view, programmer's perspective
- Non-uniform memory access
Distributed Memory Parallel Computer / Cluster
(Diagram: three nodes, each with Proc, Cache, and Memory, connected by an external network)
In a distributed memory parallel computer each processor has access only to its own main memory. Programs have to use an external network for communication and cooperation: they have to exchange messages.
MPI on Distributed Memory Parallel Computers
(Diagram: one MPI task per node; nodes connected by an external network)
Typically, when using message passing with MPI, one MPI process runs on each processor (core).
MPI is the de-facto standard for message passing. MPI is a program library plus a mechanism to launch multiple cooperating executable programs. Typically it is the same binary which is started on multiple processors (SPMD = single program multiple data paradigm).
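As a minimal sketch of the SPMD model (an illustration, not taken from the course material): the very same C program is started as every MPI task, and each copy asks the library for its rank to decide what to work on.

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: this one binary runs as every MPI task. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                /* start the MPI runtime     */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my task id: 0 .. size-1   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of MPI tasks */
        printf("Hello from task %d of %d\n", rank, size);
        MPI_Finalize();                        /* shut the runtime down     */
        return 0;
    }

Compiled with an MPI wrapper such as mpicc and launched with, e.g., mpiexec -np 4 a.out, four copies of the binary run, one per processor (core), and cooperate by exchanging messages (MPI_Send / MPI_Recv).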
MPI on Shared Memory Parallel Computers
(Diagram: MPI tasks on four processors sharing interleaved memory through a crossbar/bus)
MPI can be used on shared memory systems as well; the shared memory serves as the network. Again, typically one MPI process runs on each processor (core).
MPI is formally specified for C, C++ and Fortran. All major vendors provide an MPI library for their machines, and there are free versions available. Java implementations are available too, but they are not widely used and not standardized.
OpenMP on Shared Memory Parallel Computers
(Diagram: OpenMP threads on four processors sharing interleaved memory through a crossbar/bus)
On shared memory systems shared memory programming can be used, where typically one lightweight process (= thread) runs on each processor (core).
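As a minimal sketch (again not from the slides): an OpenMP parallel region forks a team of threads that all share the process's address space; the team size is usually taken from the OMP_NUM_THREADS environment variable.

    #include <stdio.h>
    #include <omp.h>

    /* One lightweight thread per processor (core); all threads share memory. */
    int main(void)
    {
    #pragma omp parallel                       /* fork a team of threads */
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                      /* implicit barrier and join */
        return 0;
    }

Compile with the compiler's OpenMP flag, e.g. -xopenmp (Sun Studio) or -fopenmp (GCC 4.2 and later).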
MPI on SMP Clusters
(Diagram: two SMP nodes, each with four processors and interleaved shared memory, connected by an external network; one MPI task per processor)
Today, most clusters have SMP nodes and MPI is well suited for this architecture.
Hybrid Parallelization on SMP-Clusters (MPI+OpenMP)
(Diagram: two SMP nodes connected by an external network; one MPI task per node, with multiple OpenMP threads inside each node)
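A minimal sketch of the hybrid scheme (an illustration, assuming one MPI task per node): each MPI task opens an OpenMP parallel region, so message passing connects the nodes while threads share memory inside each node.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    /* Hybrid MPI+OpenMP: MPI between nodes, OpenMP threads within a node. */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                /* one MPI task per SMP node */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                       /* one thread per core */
        printf("MPI task %d, OpenMP thread %d\n",
               rank, omp_get_thread_num());

        MPI_Finalize();                        /* all MPI calls stay outside
                                                  the parallel region */
        return 0;
    }

If MPI has to be called from inside parallel regions, MPI_Init_thread with an appropriate threading level should be used instead of MPI_Init.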
Innovative: Cluster OpenMP
(DSM = distributed shared memory system)
(Diagram: two SMP nodes connected by an external network; Cluster OpenMP spans both nodes, with OpenMP threads running on all processors)
Nodes of Today's Clusters are Shared Memory Machines with Multicore Processors
(Diagram: nodes on an external network; each node contains two dual-core processor chips, each chip with per-core L1 caches and a shared L2 cache, plus node-local Memory)
Networks and Topologies
Networks:
- Fast/Gigabit Ethernet
- Myrinet
- SCI
- QsNet (Quadrics)
- Infiniband
- Proprietary networks
Topologies:
- Bus
- Tree
- Fat tree
- 2D/3D torus
- Hypercube
- Crossbar / switch
Modern Parallel Computer Architectures
- COTS (= commercial off-the-shelf) / COW (= cluster of workstations) with 1 or 2 dual-core processor chips and a cheap network (Gigabit Ethernet), self-made
- Clusters of rack-mounted pizza boxes with 1-4 dual/quad-core processor chips and a fast network (Infiniband)
- SMP clusters with standard SMP servers and proprietary or multi-rail networks: Sun Fire Cluster, SGI Columbia (Altix nodes), ASC Purple (IBM p575 nodes)
- Supercomputers designed for high-end computing: Cray XT3, IBM BlueGene/L, Earth Simulator (NEC SX-6)
HPC @ RZ.RWTH-AACHEN.DE
(Overview: Sun Fire E25K cluster, Sun Fire T2000 systems, Sun Fire V40z cluster, Xeon cluster)
RWTH Aachen Compute Cluster
Compute Cluster of RWTH Aachen (Feb 08)

#nodes | model     | processor               | #procs | #cores | #threads | clock [MHz] | memory [GB] | network                      | accum. perf. [TFLOPS] | accum. mem. [TB]
2      | SF E25K   | UltraSPARC IV           | 72     | 144    | 144      | 1050        | 288         | Gigabit Ethernet             | 0.60   | 0.58
8      | SF E6900  | UltraSPARC IV           | 24     | 48     | 48       | 1200        | 96          | Gigabit Ethernet             | 0.92   | 0.77
20     | SF T2000  | UltraSPARC T1           | 1      | 8      | 64       | 1400        | 32          | Gigabit Ethernet             | 0.22   | 0.64
1      | SF T2000  | UltraSPARC T1           | 1      | 8      | 32       | 1000        | 8           | Gigabit Ethernet             | 0.0001 | 0.01
64     | SF V40z   | Opteron 848             | 4      | 4      | 4        | 2200        | 8           | Gigabit Ethernet             | 1.13   | 0.51
4      | SF V40z   | Opteron 875             | 4      | 8      | 8        | 2200        | 16          | Gigabit Ethernet, Infiniband | 0.14   | 0.06
2      | SF X4600  | Opteron 885             | 8      | 16     | 16       | 2600        | 32          | Gigabit Ethernet             | 0.17   | 0.06
7      |           | Xeon 5160 (Woodcrest)   | 2      | 4      | 4        | 3000        | 8           | Gigabit Ethernet, Infiniband | 0.17   | 0.06
2      |           | Xeon 5160 (Woodcrest)   | 2      | 4      | 4        | 3000        | 16          | Gigabit Ethernet, Infiniband | 0.05   | 0.03
4      |           | Xeon 5160 (Woodcrest)   | 2      | 8      | 8        | 2667        | 16          | Gigabit Ethernet, Infiniband | 0.17   | 0.06
55     |           | Xeon E5450 (Harpertown) | 2      | 8      | 8        | 3000        | 16          | Gigabit Ethernet, Infiniband | 2.64   | 0.88
5      |           | Xeon E5450 (Harpertown) | 2      | 8      | 8        | 3000        | 32          | Gigabit Ethernet, Infiniband | 0.24   | 0.16
2      | Fujitsu-Siemens RX600 | Xeon X7350 (Tigerton) | 4 | 16 | 16      | 2930        | 64          | Gigabit Ethernet, Infiniband | 0.19   | 0.13
176    | (sum)     |                         |        | 1740   | 2884     |             |             |                              | 6.64   | 3.95

Xeon node models: Fujitsu-Siemens RX200, Dell 1950.
System Management
Frontend nodes for interactive work, program development and testing, GUIs:
cluster.rz.RWTH-Aachen.DE =
  cluster-solaris.rz.RWTH-Aachen.DE =
  cluster-solaris-sparc.rz.RWTH-Aachen.DE
cluster-solaris-opteron.rz.RWTH-Aachen.DE
cluster-linux.rz.RWTH-Aachen.DE =
  cluster-linux-opteron.rz.RWTH-Aachen.DE
cluster-linux-xeon.rz.RWTH-Aachen.DE
cluster-windows.rz.RWTH-Aachen.DE =
  cluster-windows-xeon.rz.RWTH-Aachen.DE
Abbreviations: cl[uster], sol[aris], lin[ux], win[dows], x[eon], o[pteron], s[parc]
Batch system: Sun Grid Engine (jobs > 20 min) and Microsoft Compute Cluster, respectively.
Overview of HPC Tools
Current program development environment for HPC on the Sun SPARC, AMD Opteron and Intel Xeon systems at the RWTH:
4 platforms:
1. SPARC/Solaris 10, 64 bit
2. Opteron/Solaris, 64 bit
3. Opteron/Linux and Xeon/Linux, 64 bit
4. Opteron/Windows and Xeon/Windows, 64 bit
Covers serial programming, shared memory parallelization, and message passing: compilers / MPI libraries, debugging tools, performance analysis tools.
Programming Environment: Compilers + Debugging Tools
Sun Fire Cluster Programming Environment (platforms: SPARC and Opteron under Solaris, Opteron/Xeon under Linux and Windows)

Company   | Compiler           | Languages     | OpenMP support           | Autopar       | Debugger                        | Runtime analysis
Sun       | Studio 12          | F95/C/C++     | F95/C++                  | F95/C++       | dbx, sunstudio, thread analyzer | analyzer, collect, er_print, gprof
Intel     | V10.0              | F95/C++       | F95/C++, Threading Tools | F95/C++       | idb                             | vtune
GNU       | V4.0               | F95/C++       | -                        | -             | gdb                             | gprof
GNU       | V4.2               | F95/C++       | F95/C++                  | -             | gdb                             | gprof
PGI       | V7.1               | F77/F90/C/C++ | F77/F90/C/C++            | F77/F90/C/C++ | pgdbg                           | pgprof
Microsoft | Visual Studio 2003 | C++           | -                        | -             | Visual Studio                   | -
Microsoft | Visual Studio 2005 | C++           | C++                      | -             | Visual Studio                   | -
Etnus     | TotalView 8.3      | (debugger)    | -                        | -             | TotalView                       | -
MPI Implementations and Tools
Provider        | Version                                    | MPI-2 support | Debugger                               | Runtime analysis                            | Platform                     | Network
Sun             | HPC ClusterTools 6                         | yes           | TotalView                              | analyzer, mpprof                            | Solaris 10 (Opteron + Sparc) | tcp, shm
Sun             | HPC ClusterTools 7.1 (based on Open MPI)   | yes           | TotalView                              |                                             | Solaris 10 (Opteron + Sparc) | tcp, shm, ib
Intel           | Version 3.1 (based on mpich2)              | (yes)         | TotalView                              | Trace Collector & Analyzer (former Vampir)  | Linux                        | tcp, shm, ib
ANL             | mpich 1.2.6                                | no            | TotalView                              | jumpshot                                    | Sol, Lin, Win                | tcp, shm
ANL             | mpich2 1.0.x                               | yes (tcp)     | TotalView                              | jumpshot                                    | Sol, Lin, Win                | tcp, shm
Open MPI (p.d.) | OpenMPI 1.2.5 (based on FT-MPI, LA-MPI, LAM, PACX) | yes   | TotalView                              | ?                                           |                              | tcp, myr, Infiniband
Microsoft       | CCS V1 (based on mpich2)                   | (yes)         | Visual Studio w/ MS Compute Cluster Pack |                                           | Windows                      | tcp, shm, (Infiniband)
Univ Dresden    |                                            |               |                                        | Vampir-NG                                   | any                          |
VI-HPS          |                                            |               |                                        | multiple research tools                     | any                          |
Measuring Performance: the Linpack Benchmark
The theoretical peak performance is determined by the clock cycle and the number of floating point operations per cycle.
The actual floating point performance can be determined by the LINPACK benchmark (www.top500.org), solving a linear equation system with a full coefficient matrix of arbitrary size.
The unit of measurement is M[ega]flops = million floating point operations per second; G[iga]flops, T[era]flops, P[eta]flops (= 10^9, 10^12, 10^15 flops).
The Top500 list of the fastest supercomputers is updated twice per year.
Latest No. 1 (28th list, Nov. 2006): IBM BlueGene/L with 131072 processors and 32 TB total memory at Lawrence Livermore National Laboratory (LLNL); peak: 367 Tflops = 367000 Gflops; Linpack: 280 Tflops (76% of peak); matrix size N = 1,769,471.
For comparison: dual-core Intel Xeon 5160 (Woodcrest), 3 GHz: 2 cores * 4 flops/cycle (SSE) * 3 GHz = 24 Gflops.
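Written out as a formula, the peak-performance rule used in this comparison is

\[ R_{\mathrm{peak}} = n_{\mathrm{cores}} \times \frac{\mathrm{flops}}{\mathrm{cycle}} \times f_{\mathrm{clock}} = 2 \times 4 \times 3\,\mathrm{GHz} = 24\ \mathrm{Gflop/s}. \]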
The TOP500 List
(Chart: Linpack performance in Gflops, log scale from 1 to 1,000,000, June 1993 to June 2006; curves for TOP500 ranks 1, 50, 200, and 500, PC technology, and Moore's Law, plus RWTH peak performance (Rpeak) and RWTH Linpack performance; the Fujitsu VPP300 and the Sun Fire Cluster at Aachen University are marked)
The Current Top 20 (Nov 07)
Rank | Site            | Manuf.    | Computer    | Country | Procs  | Linpack Rmax [GFlops] | %peak | Processor              | MHz  | System family    | Interconnect
1    | LLNL            | IBM       | BlueGene/L  | USA     | 212992 | 478200 | 80.18 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
2    | FZ Juelich      | IBM       | BlueGene/P  | Germany | 65536  | 167300 | 75.08 | PowerPC 450            | 850  | IBM BlueGene     | proprietary
3    | SGI/NMCAC       | SGI       | SGI Altix   | USA     | 14336  | 126900 | 73.77 | Xeon 53xx (Clovertown) | 3000 | SGI Altix        | Infiniband
4    | TATA Sons       | HP        | Cluster     | India   | 14240  | 117900 | 69.00 | Xeon 53xx (Clovertown) | 3000 | HP Cluster       | Infiniband DDR
5    | Gov. Agency     | HP        | Cluster     | Sweden  | 13728  | 102800 | 70.20 | Xeon 53xx (Clovertown) | 2667 | HP Cluster       | Infiniband DDR
6    | Sandia          | Cray Inc. | Red Storm   | USA     | 26569  | 102200 | 80.14 | Opteron dual-core      | 2400 | Cray XT          | XT3 proprietary
7    | Oak Ridge       | Cray Inc. | Cray XT4    | USA     | 23016  | 101700 | 85.21 | Opteron dual-core      | 2600 | Cray XT          | XT3 proprietary
8    | IBM Watson      | IBM       | BlueGene/L  | USA     | 40960  | 91290  | 79.60 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
9    | NERSC/LBNL      | Cray Inc. | Cray XT4    | USA     | 19320  | 85368  | 84.97 | Opteron dual-core      | 2600 | Cray XT          | XT3 proprietary
10   | Stony Brook     | IBM       | BlueGene/L  | USA     | 36864  | 82161  | 79.60 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
11   | LLNL            | IBM       | pSeries     | USA     | 12208  | 75760  | 81.65 | POWER5                 | 1900 | IBM pSeries      | Federation
12   | Rensselaer      | IBM       | BlueGene/L  | USA     | 32768  | 73032  | 79.60 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
13   | Barcelona       | IBM       | BladeCenter | Spain   | 10240  | 63830  | 67.75 | PowerPC 970            | 2300 | IBM Cluster      | Myrinet
14   | NCSA            | Dell      | PowerEdge   | USA     | 9600   | 62680  | 69.97 | Xeon 53xx (Clovertown) | 2333 | Dell Cluster     | Infiniband SDR
15   | Leibniz RZ      | SGI       | Altix 4700  | Germany | 9728   | 56520  | 90.78 | Itanium 2              | 1600 | SGI Altix        | NUMAlink
16   | GSIC, Tokyo Tech | NEC/Sun  | Sun Fire    | Japan   | 11664  | 56430  | 55.31 | Opteron dual-core      | 2400 | Sun Fire         | Infiniband
17   | Univ Edinburgh  | Cray Inc. | Cray XT4    | UK      | 11328  | 54648  | 86.15 | Opteron dual-core      | 2800 | Cray XT          | XT3 proprietary
18   | Sandia          | Dell      | PowerEdge   | USA     | 9024   | 53000  | 81.57 | Xeon EM64T             | 3600 | Dell Cluster     | Infiniband
19   | CEA             | Bull SA   | NovaScale   | France  | 9968   | 52840  | 82.83 | Itanium 2              | 1600 | Bull SMP Cluster | Quadrics
20   | NASA/Ames       | SGI       | SGI Altix   | USA     | 10160  | 51870  | 85.09 | Itanium 2              | 1500 | SGI Altix        | NUMAlink/IB
Aachen on Rank 180 in June 2005
http://www.rz.rwth-aachen.de/hpc/sun/
Over 2 TeraFlop/s Linpack performance, April 2005. The upgrade from UltraSPARC III to UltraSPARC IV, including an increase of the main memory capacity, more than doubled our Linpack performance!
A linear system with 499,200 unknowns was solved in 11:12:48.8 hours at an average speed of 2054.4 billion floating point operations per second (GFlop/s). The program had a total memory footprint of 2 Terabyte. 1276 processor cores were kept busy with about 8.3 x 10^16 floating point operations.
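These figures agree with the standard Linpack operation count of roughly (2/3)N^3; as a quick cross-check (the arithmetic is mine, not from the slides):

\[ \tfrac{2}{3} N^3 = \tfrac{2}{3}\,(499\,200)^3 \approx 8.3 \times 10^{16}\ \mathrm{flops}, \qquad t \approx \frac{8.3 \times 10^{16}}{2054.4 \times 10^{9}\ \mathrm{flop/s}} \approx 40\,370\ \mathrm{s} \approx 11.2\ \mathrm{h}, \]

matching the reported 11:12:48.8 hours.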
Future Parallel Computers
- With a growing number of cores per chip the SMP box is shrinking.
- In a few years from now, many or all applications will be multi-threaded.
- SMP boxes with a small footprint will be the building blocks of large systems.
- Memory hierarchies will grow (L3 caches ...).
- Network latency will be close to 1 µs; network bandwidth several GB/s.
- Current research: Distributed Shared Memory (Cluster OpenMP ...), combining the advantages of SMP with the scalability of DMP.
- In 2008/9: Petaflop/s systems by IBM, Cray, NEC. Woodcrest ~ 24 Gflop/s @ ~100 W => 1 PFlop/s @ ~4 MW. Main problems: power supply and cooling.
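The ~4 MW estimate follows from simple scaling of the slide's own numbers:

\[ \frac{10^{15}\ \mathrm{flop/s}}{24 \times 10^{9}\ \mathrm{flop/s\ per\ chip}} \approx 42\,000\ \mathrm{chips}, \qquad 42\,000 \times 100\ \mathrm{W} \approx 4.2\ \mathrm{MW}. \]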
Some Web Links
Information related to HPC at the RWTH: http://www.rz.rwth-aachen.de/hpc/
Information related to MPI at the RWTH: http://www.rz.rwth-aachen.de/mpi/
Sun Fire SMP Cluster Primer: http://www.rz.rwth-aachen.de/hpc/primer
Web page of the SunHPC and VI-HPS workshops with more links and information: http://www.rz.rwth-aachen.de/sunhpc
Joint SunHPC Seminar (March 3-4) and VI-HPS Tuning Workshop (March 5-7)