Parallel Computer Architectures
Dieter an Mey
Center for Computing and Communication, RWTH Aachen University, Germany
Overview
- Processor Architecture
- System/Node Architecture
- Clusters
- HPC @ RZ.RWTH-AACHEN
- Top500
Single Processor System
(Diagram: Proc connected to Memory)
- Main memory to store data and program.
- Processor to fetch the program from memory and execute program instructions: load data from memory, process data, and write results back to memory.
- Input/output is not covered here.
Caches
(Figure: Marc Tremblay, Sun)
Single Processor System
(Diagram: Proc - Cache - Memory)
Caches are smaller than main memory, but much faster. They are employed to bridge the gap between a bigger and slower main memory and the much faster processor. The cache is invisible to the programmer: only when measuring the runtime does the effect of caches become apparent.
Single Processor System
(Diagram: Proc with on-chip L1 cache ($), off-chip L2 cache ($), and Memory)
With a growing number of transistors on each chip over time, caches can be put on the same piece of silicon.
Ignored here: instruction caches, address caches (TLB), write buffers, and prefetch buffers, as data caches are most important for HPC applications.
In 2005 Intel Cancelled the 4 GHz Chip
Fast clock cycles make processor chips more expensive, hotter, and more power consuming.
The Impact of Moore's Law
(Chart: Intel processors, clock speed (MHz) and transistors (x1000) over time. Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm)
The number of transistors on a chip is still doubling every 18 months, but the clock speed is no longer growing that fast. Higher clock speed causes higher temperature and higher power consumption. Instead we'll see many more cores per chip!
Dual-Core Processors
(Diagram: two cores with local L1 caches sharing an L2 cache and Memory)
Since 2005/6 Intel and AMD have been producing dual-core processors for the mass market. In 2006/7 Intel and AMD introduced quad-core processors. By 2008 it will be hard to buy a PC without a dual-core processor. Your future PC / laptop will be a parallel computer!
Dual-Core Processors: Intel Woodcrest
(Diagram: Memory, shared L2 cache, two cores with local L1 caches)
Here: 4 MB shared L2 cache on chip, 2 cores with local L1 cache, and a socket for a second processor chip.
Multi-Core Processors
- UltraSPARC IV: 1.2 GHz, 130 nm, ~66 million transistors, 108 W; 2 cores with 64 KB L1 each and 2 x 8 MB off-chip L2 cache
- UltraSPARC IV+: 1.5 GHz, 90 nm, 295 million transistors, 90 W (?); 2 cores with 64 KB L1 each, 2 MB shared L2 on chip, 32 MB off-chip cache
- Opteron 875: 2.2 GHz, 90 nm, 199 mm^2, 233 million transistors, 95 W; 2 cores with 64 KB L1 and 1 MB L2 each
- UltraSPARC T1: 1.0 GHz, 90 nm, 378 mm^2, 300 million transistors, 72 W; 8 cores with 8 KB L1 each and a shared 3 MB L2 cache
What to Do with All These Threads
(Figure: Marc Tremblay, Sun - threads waiting in parallel)
Sun Fire T2000 at Aachen
(Diagram: one UltraSPARC T1 @ 1 GHz; 8 cores, each with 8 KB L1 cache; 4 x 0.75 MB L2 cache banks; one shared FPU; internal crossbar, 134 GB/s; 4 DDR2 memory controllers on chip; 25.6 GB/s memory bandwidth)
Sun T5120: Eight Cores x Eight Threads
(Diagram: one UltraSPARC T2 (Niagara 2) @ 1.4 GHz; 8 cores, each with 8 KB L1 cache, 8 threads per core, and 1 FPU per core; 8 x 0.5 MB L2 cache banks; internal crossbar; 4 FB-DRAM memory controllers on chip; 42.7 GB/s memory bandwidth)
Chip-Level Parallelism
- UltraSPARC III: superscalar, single core; 4 SPARC V9 instr/cycle; 1 active thread per core (cycle time = 1.11 ns)
- UltraSPARC IV+: superscalar, dual core; 2 x 4 SPARC V9 instr/cycle; 1 active thread per core (= 0.66 ns)
- Opteron 875: superscalar, dual core; 2 x 3 x86 instr/cycle; 1 active thread per core (= 0.45 ns)
- UltraSPARC T1: single issue, 8 cores; 8 x 1 SPARC V9 instr/cycle; 4 active threads per core, context switch comes for free (= 1.0 ns)
Shared Memory Parallel Computers: Uniform Memory Access (UMA)
(Diagram: four processors, each with its own cache, connected through a crossbar/bus to one Memory)
In a shared memory parallel computer multiple processors have access to the same main memory. Yes, a dual-core / multi-core processor based machine is a parallel computer on a chip.
- Crossbar adds latency
- Architecture is not scalable
Shared Memory Parallel Computers: Non-Uniform Memory Access (NUMA)
(Diagram: four processors, each with its own cache and local Memory, connected through a crossbar/bus)
- Faster local memory access
- Slower remote memory access
Sun Fire E2900 at Aachen
(Diagram: 12 dual-core UltraSPARC IV processors @ 1.2 GHz; 64 KB L1 per core; 2 x 8 MB L2 per processor; memory controller on chip, 2.4 GB/s; crossbar with 9.6 GB/s total peak memory bandwidth)
- Simplistic view, programmer's perspective
- Rather uniform memory access
Sun Fire V40z at Aachen
(Diagram: 4 dual-core Opteron processors @ 2.2 GHz; 64 KB L1 and 1 MB L2 per core; DDR-400 memory controller on chip; 6.4 GB/s to local memory; 8 GB/s inter-processor links)
- Simplistic view, programmer's perspective
- Non-uniform memory access
Distributed Memory Parallel Computer / Cluster
(Diagram: three nodes, each with Proc, Cache, and Memory, connected by an external network)
In a distributed memory parallel computer each processor has access only to its own main memory. Programs have to use an external network for communication and cooperation: they have to exchange messages.
MPI on Distributed Memory Parallel Computers
(Diagram: one MPI task per node; nodes connected by an external network)
Typically, when using message passing with MPI, one MPI process runs on each processor (core).
MPI is the de-facto standard for message passing. MPI is a program library plus a mechanism to launch multiple cooperating executable programs. Typically it is the same binary which is started on multiple processors (SPMD = single program multiple data paradigm).
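As a minimal sketch of the SPMD model (an illustration, not taken from the course material): the very same C program is started as every MPI task, and each copy asks the library for its rank to decide what to work on.

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: this one binary runs as every MPI task. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                /* start the MPI runtime     */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my task id: 0 .. size-1   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of MPI tasks */
        printf("Hello from task %d of %d\n", rank, size);
        MPI_Finalize();                        /* shut the runtime down     */
        return 0;
    }

Compiled with an MPI wrapper such as mpicc and launched with, e.g., mpiexec -np 4 a.out, four copies of the binary run, one per processor (core), and cooperate by exchanging messages (MPI_Send / MPI_Recv).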
MPI on Shared Memory Parallel Computers
(Diagram: MPI tasks on four processors sharing interleaved memory through a crossbar/bus)
MPI can be used on shared memory systems as well; the shared memory serves as the network. Again, typically one MPI process runs on each processor (core).
MPI is formally specified for C, C++ and Fortran. All major vendors provide an MPI library for their machines, and there are free versions available. Java implementations are available too, but they are not widely used and not standardized.
OpenMP on Shared Memory Parallel Computers
(Diagram: OpenMP threads on four processors sharing interleaved memory through a crossbar/bus)
On shared memory systems shared memory programming can be used, where typically one lightweight process (= thread) runs on each processor (core).
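As a minimal sketch (again not from the slides): an OpenMP parallel region forks a team of threads that all share the process's address space; the team size is usually taken from the OMP_NUM_THREADS environment variable.

    #include <stdio.h>
    #include <omp.h>

    /* One lightweight thread per processor (core); all threads share memory. */
    int main(void)
    {
    #pragma omp parallel                       /* fork a team of threads */
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                      /* implicit barrier and join */
        return 0;
    }

Compile with the compiler's OpenMP flag, e.g. -xopenmp (Sun Studio) or -fopenmp (GCC 4.2 and later).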
MPI on SMP Clusters
(Diagram: two SMP nodes, each with four processors and interleaved shared memory, connected by an external network; one MPI task per processor)
Today, most clusters have SMP nodes and MPI is well suited for this architecture.
Hybrid Parallelization on SMP-Clusters (MPI+OpenMP)
(Diagram: two SMP nodes connected by an external network; one MPI task per node, with multiple OpenMP threads inside each node)
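A minimal sketch of the hybrid scheme (an illustration, assuming one MPI task per node): each MPI task opens an OpenMP parallel region, so message passing connects the nodes while threads share memory inside each node.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    /* Hybrid MPI+OpenMP: MPI between nodes, OpenMP threads within a node. */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                /* one MPI task per SMP node */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                       /* one thread per core */
        printf("MPI task %d, OpenMP thread %d\n",
               rank, omp_get_thread_num());

        MPI_Finalize();                        /* all MPI calls stay outside
                                                  the parallel region */
        return 0;
    }

If MPI has to be called from inside parallel regions, MPI_Init_thread with an appropriate threading level should be used instead of MPI_Init.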
Innovative: Cluster OpenMP
(DSM = distributed shared memory system)
(Diagram: two SMP nodes connected by an external network; Cluster OpenMP spans both nodes, with OpenMP threads running on all processors)
Nodes of Today's Clusters are Shared Memory Machines with Multicore Processors
(Diagram: nodes on an external network; each node contains two dual-core processor chips, each chip with per-core L1 caches and a shared L2 cache, plus node-local Memory)
Networks and Topologies
Networks:
- Fast/Gigabit Ethernet
- Myrinet
- SCI
- QsNet (Quadrics)
- Infiniband
- Proprietary networks
Topologies:
- Bus
- Tree
- Fat tree
- 2D/3D torus
- Hypercube
- Crossbar / switch
Modern Parallel Computer Architectures
- COTS (= commercial off-the-shelf) / COW (= cluster of workstations) with 1 or 2 dual-core processor chips and a cheap network (Gigabit Ethernet), self-made
- Clusters of rack-mounted pizza boxes with 1-4 dual/quad-core processor chips and a fast network (Infiniband)
- SMP clusters with standard SMP servers and proprietary or multi-rail networks: Sun Fire Cluster, SGI Columbia (Altix nodes), ASC Purple (IBM p575 nodes)
- Supercomputers designed for high-end computing: Cray XT3, IBM BlueGene/L, Earth Simulator (NEC SX-6)
HPC @ RZ.RWTH-AACHEN.DE
(Overview: Sun Fire E25K cluster, Sun Fire T2000 systems, Sun Fire V40z cluster, Xeon cluster)
RWTH Aachen Compute Cluster
Compute Cluster of RWTH Aachen (Feb 08)

#nodes | model     | processor               | #procs | #cores | #threads | clock [MHz] | memory [GB] | network                      | accum. perf. [TFLOPS] | accum. mem. [TB]
2      | SF E25K   | UltraSPARC IV           | 72     | 144    | 144      | 1050        | 288         | Gigabit Ethernet             | 0.60   | 0.58
8      | SF E6900  | UltraSPARC IV           | 24     | 48     | 48       | 1200        | 96          | Gigabit Ethernet             | 0.92   | 0.77
20     | SF T2000  | UltraSPARC T1           | 1      | 8      | 64       | 1400        | 32          | Gigabit Ethernet             | 0.22   | 0.64
1      | SF T2000  | UltraSPARC T1           | 1      | 8      | 32       | 1000        | 8           | Gigabit Ethernet             | 0.0001 | 0.01
64     | SF V40z   | Opteron 848             | 4      | 4      | 4        | 2200        | 8           | Gigabit Ethernet             | 1.13   | 0.51
4      | SF V40z   | Opteron 875             | 4      | 8      | 8        | 2200        | 16          | Gigabit Ethernet, Infiniband | 0.14   | 0.06
2      | SF X4600  | Opteron 885             | 8      | 16     | 16       | 2600        | 32          | Gigabit Ethernet             | 0.17   | 0.06
7      |           | Xeon 5160 (Woodcrest)   | 2      | 4      | 4        | 3000        | 8           | Gigabit Ethernet, Infiniband | 0.17   | 0.06
2      |           | Xeon 5160 (Woodcrest)   | 2      | 4      | 4        | 3000        | 16          | Gigabit Ethernet, Infiniband | 0.05   | 0.03
4      |           | Xeon 5160 (Woodcrest)   | 2      | 8      | 8        | 2667        | 16          | Gigabit Ethernet, Infiniband | 0.17   | 0.06
55     |           | Xeon E5450 (Harpertown) | 2      | 8      | 8        | 3000        | 16          | Gigabit Ethernet, Infiniband | 2.64   | 0.88
5      |           | Xeon E5450 (Harpertown) | 2      | 8      | 8        | 3000        | 32          | Gigabit Ethernet, Infiniband | 0.24   | 0.16
2      | Fujitsu-Siemens RX600 | Xeon X7350 (Tigerton) | 4 | 16 | 16      | 2930        | 64          | Gigabit Ethernet, Infiniband | 0.19   | 0.13
176    | (sum)     |                         |        | 1740   | 2884     |             |             |                              | 6.64   | 3.95

Xeon node models: Fujitsu-Siemens RX200, Dell 1950.
System Management
Frontend nodes for interactive work, program development and testing, GUIs:
cluster.rz.RWTH-Aachen.DE =
  cluster-solaris.rz.RWTH-Aachen.DE =
  cluster-solaris-sparc.rz.RWTH-Aachen.DE
cluster-solaris-opteron.rz.RWTH-Aachen.DE
cluster-linux.rz.RWTH-Aachen.DE =
  cluster-linux-opteron.rz.RWTH-Aachen.DE
cluster-linux-xeon.rz.RWTH-Aachen.DE
cluster-windows.rz.RWTH-Aachen.DE =
  cluster-windows-xeon.rz.RWTH-Aachen.DE
Abbreviations: cl[uster], sol[aris], lin[ux], win[dows], x[eon], o[pteron], s[parc]
Batch system: Sun Grid Engine (jobs > 20 min) and Microsoft Compute Cluster, respectively.
Overview of HPC Tools
Current program development environment for HPC on the Sun SPARC, AMD Opteron and Intel Xeon systems at the RWTH:
4 platforms:
1. SPARC/Solaris 10, 64 bit
2. Opteron/Solaris, 64 bit
3. Opteron/Linux and Xeon/Linux, 64 bit
4. Opteron/Windows and Xeon/Windows, 64 bit
Covers serial programming, shared memory parallelization, and message passing: compilers / MPI libraries, debugging tools, performance analysis tools.
Programming Environment: Compilers + Debugging Tools
Sun Fire Cluster Programming Environment (platforms: SPARC and Opteron under Solaris, Opteron/Xeon under Linux and Windows)

Company   | Compiler           | Languages     | OpenMP support           | Autopar       | Debugger                        | Runtime analysis
Sun       | Studio 12          | F95/C/C++     | F95/C++                  | F95/C++       | dbx, sunstudio, thread analyzer | analyzer, collect, er_print, gprof
Intel     | V10.0              | F95/C++       | F95/C++, Threading Tools | F95/C++       | idb                             | vtune
GNU       | V4.0               | F95/C++       | -                        | -             | gdb                             | gprof
GNU       | V4.2               | F95/C++       | F95/C++                  | -             | gdb                             | gprof
PGI       | V7.1               | F77/F90/C/C++ | F77/F90/C/C++            | F77/F90/C/C++ | pgdbg                           | pgprof
Microsoft | Visual Studio 2003 | C++           | -                        | -             | Visual Studio                   | -
Microsoft | Visual Studio 2005 | C++           | C++                      | -             | Visual Studio                   | -
Etnus     | TotalView 8.3      | (debugger)    | -                        | -             | TotalView                       | -
MPI Implementations and Tools
Provider        | Version                                    | MPI-2 support | Debugger                               | Runtime analysis                            | Platform                     | Network
Sun             | HPC ClusterTools 6                         | yes           | TotalView                              | analyzer, mpprof                            | Solaris 10 (Opteron + Sparc) | tcp, shm
Sun             | HPC ClusterTools 7.1 (based on Open MPI)   | yes           | TotalView                              |                                             | Solaris 10 (Opteron + Sparc) | tcp, shm, ib
Intel           | Version 3.1 (based on mpich2)              | (yes)         | TotalView                              | Trace Collector & Analyzer (former Vampir)  | Linux                        | tcp, shm, ib
ANL             | mpich 1.2.6                                | no            | TotalView                              | jumpshot                                    | Sol, Lin, Win                | tcp, shm
ANL             | mpich2 1.0.x                               | yes (tcp)     | TotalView                              | jumpshot                                    | Sol, Lin, Win                | tcp, shm
Open MPI (p.d.) | OpenMPI 1.2.5 (based on FT-MPI, LA-MPI, LAM, PACX) | yes   | TotalView                              | ?                                           |                              | tcp, myr, Infiniband
Microsoft       | CCS V1 (based on mpich2)                   | (yes)         | Visual Studio w/ MS Compute Cluster Pack |                                           | Windows                      | tcp, shm, (Infiniband)
Univ Dresden    |                                            |               |                                        | Vampir-NG                                   | any                          |
VI-HPS          |                                            |               |                                        | multiple research tools                     | any                          |
Measuring Performance: the Linpack Benchmark
The theoretical peak performance is determined by the clock cycle and the number of floating point operations per cycle.
The actual floating point performance can be determined by the LINPACK benchmark (www.top500.org), solving a linear equation system with a full coefficient matrix of arbitrary size.
The unit of measurement is M[ega]flops = million floating point operations per second; G[iga]flops, T[era]flops, P[eta]flops (= 10^9, 10^12, 10^15 flops).
The Top500 list of the fastest supercomputers is updated twice per year.
Latest No. 1 (28th list, Nov. 2006): IBM BlueGene/L with 131072 processors and 32 TB total memory at Lawrence Livermore National Laboratory (LLNL); peak: 367 Tflops = 367000 Gflops; Linpack: 280 Tflops (76% of peak); matrix size N = 1,769,471.
For comparison: dual-core Intel Xeon 5160 (Woodcrest), 3 GHz: 2 cores * 4 flops/cycle (SSE) * 3 GHz = 24 Gflops.
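Written out as a formula, the peak-performance rule used in this comparison is

\[ R_{\mathrm{peak}} = n_{\mathrm{cores}} \times \frac{\mathrm{flops}}{\mathrm{cycle}} \times f_{\mathrm{clock}} = 2 \times 4 \times 3\,\mathrm{GHz} = 24\ \mathrm{Gflop/s}. \]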
The TOP500 List
(Chart: Linpack performance in Gflops, log scale from 1 to 1,000,000, June 1993 to June 2006; curves for TOP500 ranks 1, 50, 200, and 500, PC technology, and Moore's Law, plus RWTH peak performance (Rpeak) and RWTH Linpack performance; the Fujitsu VPP300 and the Sun Fire Cluster at Aachen University are marked)
The Current Top 20 (Nov 07)
Rank | Site            | Manuf.    | Computer    | Country | Procs  | Linpack Rmax [GFlops] | %peak | Processor              | MHz  | System family    | Interconnect
1    | LLNL            | IBM       | BlueGene/L  | USA     | 212992 | 478200 | 80.18 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
2    | FZ Juelich      | IBM       | BlueGene/P  | Germany | 65536  | 167300 | 75.08 | PowerPC 450            | 850  | IBM BlueGene     | proprietary
3    | SGI/NMCAC       | SGI       | SGI Altix   | USA     | 14336  | 126900 | 73.77 | Xeon 53xx (Clovertown) | 3000 | SGI Altix        | Infiniband
4    | TATA Sons       | HP        | Cluster     | India   | 14240  | 117900 | 69.00 | Xeon 53xx (Clovertown) | 3000 | HP Cluster       | Infiniband DDR
5    | Gov. Agency     | HP        | Cluster     | Sweden  | 13728  | 102800 | 70.20 | Xeon 53xx (Clovertown) | 2667 | HP Cluster       | Infiniband DDR
6    | Sandia          | Cray Inc. | Red Storm   | USA     | 26569  | 102200 | 80.14 | Opteron dual-core      | 2400 | Cray XT          | XT3 proprietary
7    | Oak Ridge       | Cray Inc. | Cray XT4    | USA     | 23016  | 101700 | 85.21 | Opteron dual-core      | 2600 | Cray XT          | XT3 proprietary
8    | IBM Watson      | IBM       | BlueGene/L  | USA     | 40960  | 91290  | 79.60 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
9    | NERSC/LBNL      | Cray Inc. | Cray XT4    | USA     | 19320  | 85368  | 84.97 | Opteron dual-core      | 2600 | Cray XT          | XT3 proprietary
10   | Stony Brook     | IBM       | BlueGene/L  | USA     | 36864  | 82161  | 79.60 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
11   | LLNL            | IBM       | pSeries     | USA     | 12208  | 75760  | 81.65 | POWER5                 | 1900 | IBM pSeries      | Federation
12   | Rensselaer      | IBM       | BlueGene/L  | USA     | 32768  | 73032  | 79.60 | PowerPC 440            | 700  | IBM BlueGene     | proprietary
13   | Barcelona       | IBM       | BladeCenter | Spain   | 10240  | 63830  | 67.75 | PowerPC 970            | 2300 | IBM Cluster      | Myrinet
14   | NCSA            | Dell      | PowerEdge   | USA     | 9600   | 62680  | 69.97 | Xeon 53xx (Clovertown) | 2333 | Dell Cluster     | Infiniband SDR
15   | Leibniz RZ      | SGI       | Altix 4700  | Germany | 9728   | 56520  | 90.78 | Itanium 2              | 1600 | SGI Altix        | NUMAlink
16   | GSIC, Tokyo Tech | NEC/Sun  | Sun Fire    | Japan   | 11664  | 56430  | 55.31 | Opteron dual-core      | 2400 | Sun Fire         | Infiniband
17   | Univ Edinburgh  | Cray Inc. | Cray XT4    | UK      | 11328  | 54648  | 86.15 | Opteron dual-core      | 2800 | Cray XT          | XT3 proprietary
18   | Sandia          | Dell      | PowerEdge   | USA     | 9024   | 53000  | 81.57 | Xeon EM64T             | 3600 | Dell Cluster     | Infiniband
19   | CEA             | Bull SA   | NovaScale   | France  | 9968   | 52840  | 82.83 | Itanium 2              | 1600 | Bull SMP Cluster | Quadrics
20   | NASA/Ames       | SGI       | SGI Altix   | USA     | 10160  | 51870  | 85.09 | Itanium 2              | 1500 | SGI Altix        | NUMAlink/IB
Aachen on Rank 180 in June 2005
http://www.rz.rwth-aachen.de/hpc/sun/
Over 2 TeraFlop/s Linpack performance, April 2005. The upgrade from UltraSPARC III to UltraSPARC IV, including an increase of the main memory capacity, more than doubled our Linpack performance!
A linear system with 499,200 unknowns was solved in 11:12:48.8 hours at an average speed of 2054.4 billion floating point operations per second (GFlop/s). The program had a total memory footprint of 2 Terabyte. 1276 processor cores were kept busy with about 8.3 x 10^16 floating point operations.
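These figures agree with the standard Linpack operation count of roughly (2/3)N^3; as a quick cross-check (the arithmetic is mine, not from the slides):

\[ \tfrac{2}{3} N^3 = \tfrac{2}{3}\,(499\,200)^3 \approx 8.3 \times 10^{16}\ \mathrm{flops}, \qquad t \approx \frac{8.3 \times 10^{16}}{2054.4 \times 10^{9}\ \mathrm{flop/s}} \approx 40\,370\ \mathrm{s} \approx 11.2\ \mathrm{h}, \]

matching the reported 11:12:48.8 hours.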
Future Parallel Computers
- With a growing number of cores per chip the SMP box is shrinking.
- In a few years from now, many or all applications will be multi-threaded.
- SMP boxes with a small footprint will be the building blocks of large systems.
- Memory hierarchies will grow (L3 caches ...).
- Network latency will be close to 1 µs; network bandwidth several GB/s.
- Current research: Distributed Shared Memory (Cluster OpenMP ...), combining the advantages of SMP with the scalability of DMP.
- In 2008/9: Petaflop/s systems by IBM, Cray, NEC. Woodcrest ~ 24 Gflop/s @ ~100 W => 1 PFlop/s @ ~4 MW. Main problems: power supply and cooling.
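The ~4 MW estimate follows from simple scaling of the slide's own numbers:

\[ \frac{10^{15}\ \mathrm{flop/s}}{24 \times 10^{9}\ \mathrm{flop/s\ per\ chip}} \approx 42\,000\ \mathrm{chips}, \qquad 42\,000 \times 100\ \mathrm{W} \approx 4.2\ \mathrm{MW}. \]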
Some Web Links
Information related to HPC at the RWTH: http://www.rz.rwth-aachen.de/hpc/
Information related to MPI at the RWTH: http://www.rz.rwth-aachen.de/mpi/
Sun Fire SMP Cluster Primer: http://www.rz.rwth-aachen.de/hpc/primer
Web page of the SunHPC and VI-HPS workshops with more links and information: http://www.rz.rwth-aachen.de/sunhpc
Joint SunHPC Seminar (March 3-4) and VI-HPS Tuning Workshop (March 5-7)