MPIintroKurs



    Parallel Computer Architectures

    Dieter an Mey

    Center for Computing and Communication, RWTH Aachen University, Germany


    Overview

    Processor Architecture

    System/Node Architecture

    Clusters

    HPC @ RZ.RWTH-AACHEN

    Top500



    Single Processor System

    [Diagram: a processor connected to main memory]

    Main memory to store data and program

    Processor to fetch the program from memory and execute program instructions:
    load data from memory, process the data, and write results back to memory.

    Input/output is not covered here.


    Caches

    [Figure omitted; source: Marc Tremblay, Sun]


    Single Processor System

    [Diagram: processor, cache, and main memory]

    Caches are smaller than main memory, but much faster. They are employed to
    bridge the gap between the bigger, slower main memory and the much faster
    processor.

    The cache is invisible to the programmer. Only when measuring the runtime
    does the effect of caches become apparent.
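
    A small illustration of this effect, as a sketch (not from the slides; the
    array size, the strides and the use of clock() are arbitrary choices):
    traversing the same array with a large stride touches a new cache line on
    almost every access and therefore usually runs much slower per element.

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 24)                /* 16 M doubles, larger than any cache */
    static double a[N];

    /* sum every 'stride'-th element and report the time per element touched */
    static void sweep(int stride) {
        clock_t t0 = clock();
        double sum = 0.0;
        for (int i = 0; i < N; i += stride)
            sum += a[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2d: %.1f ns per element (sum = %g)\n",
               stride, 1e9 * sec / (N / stride), sum);
    }

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = 1.0;
        sweep(1);    /* consecutive accesses: mostly cache hits          */
        sweep(16);   /* 128-byte jumps: a new cache line on every access */
        return 0;
    }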


    Single Processor System

    [Diagram: processor with on-chip L1 cache ($) and off-chip L2 cache ($),
    connected to main memory]

    With the growing number of transistors on each chip over time, caches can be
    put on the same piece of silicon.

    I am ignoring instruction caches, address caches (TLB), write buffers and
    prefetch buffers here, as data caches are most important for HPC applications.


    In 2005 Intel cancelled the 4 GHz Chip

    Fast clock cycles make processor chips more expensive, hotter, and more
    power-consuming.


    The Impact of Moore's Law

    Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm

    [Chart: Intel processors: transistors (x1000) and clock speed (MHz) over time]

    The number of transistors on a chip is still doubling every 18 months, but
    the clock speed is no longer growing that fast. Higher clock speed causes
    higher temperature and higher power consumption. Instead, we'll see many
    more cores per chip!


    Dual-Core Processors

    [Diagram: two cores, each with its own L1 cache, sharing an L2 cache and the
    main memory]

    Since 2005/06 Intel and AMD have been producing dual-core processors for the
    mass market. In 2006/07 Intel and AMD introduced quad-core processors. By
    2008 it will be hard to buy a PC without a dual-core processor.

    Your future PC / laptop will be a parallel computer!


    Dual-Core Processors: Intel Woodcrest

    [Diagram: two processor chips attached to the main memory; each chip has two
    cores with local L1 caches and a shared L2 cache]

    Here: 4 MB shared cache on chip, 2 cores with local L1 caches, and a socket
    for a second processor chip.


    Multi-Core Processors

    [Diagrams of four multi-core processors:]

    UltraSPARC IV: 1.2 GHz, 130 nm, ~66 million transistors, 108 W;
    2 cores with 64 KB L1 each, 2 x 8 MB cache

    UltraSPARC IV+: 1.5 GHz, 90 nm, 295 million transistors, 90 W (?);
    2 cores with 64 KB L1 each, 2 MB L2 cache and a 32 MB cache

    Opteron 875: 2.2 GHz, 90 nm, 199 mm², 233 million transistors, 95 W;
    2 cores with 64 KB L1 each, 1 MB L2 cache per core

    UltraSPARC T1: 1.0 GHz, 90 nm, 378 mm², 300 million transistors, 72 W;
    8 cores with 8 KB L1 each, 3 MB shared L2 cache


    What to do with all these Threads?

    [Figure: "Waiting in parallel"; source: Marc Tremblay, Sun]



    Sun Fire T2000 at Aachen

    [Diagram: UltraSPARC T1 at 1 GHz: 8 cores with 8 KB L1 each, four 0.75 MB L2
    banks, an FPU attached to the internal crossbar (134 GB/s), 4 DDR2 memory
    controllers on chip, 25.6 GB/s memory bandwidth]


    Sun T5120: Eight Cores x Eight Threads

    [Diagram: 1 x UltraSPARC T2 (Niagara 2) @ 1.4 GHz: 8 cores with 8 KB L1 each,
    8 threads per core, 1 FPU per core, 8 x 0.5 MB L2 banks, internal crossbar,
    4 FB DRAM memory controllers on chip, 42.7 GB/s memory bandwidth]


    Chip-Level Parallelism

    [Timeline diagram: instructions issued per clock cycle]

    UltraSPARC III: superscalar, single core, 4 SPARC v9 instr/cycle,
    1 active thread per core, cycle time = 1.11 ns

    UltraSPARC IV+: superscalar, dual core, 2 x 4 SPARC v9 instr/cycle,
    1 active thread per core, cycle time = 0.66 ns

    Opteron 875: superscalar, dual core, 2 x 3 x86 instr/cycle,
    1 active thread per core, cycle time = 0.45 ns

    UltraSPARC T1: single issue, 8 cores, 8 x 1 SPARC v9 instr/cycle,
    4 active threads per core, context switch comes for free, cycle time = 1.0 ns
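
    The cycle times above are simply the inverse clock rates. A small sketch of
    the arithmetic (the clock rates, e.g. 900 MHz for the UltraSPARC III, are
    inferred from the cycle times and are not stated on the slide):

    #include <stdio.h>

    int main(void) {
        /* cycle time = 1 / clock; peak issue rate = clock * instructions/cycle */
        struct { const char *name; double ghz; int ipc; } chip[] = {
            { "UltraSPARC III", 0.9, 4     },   /* 1 core  x 4 instr/cycle */
            { "UltraSPARC IV+", 1.5, 2 * 4 },   /* 2 cores x 4 instr/cycle */
            { "Opteron 875",    2.2, 2 * 3 },   /* 2 cores x 3 instr/cycle */
            { "UltraSPARC T1",  1.0, 8 * 1 },   /* 8 cores x 1 instr/cycle */
        };
        for (int i = 0; i < 4; i++)
            printf("%-15s cycle time %.2f ns, peak %4.1f Ginstr/s\n",
                   chip[i].name, 1.0 / chip[i].ghz, chip[i].ghz * chip[i].ipc);
        return 0;
    }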



    Shared Memory Parallel Computers: Uniform Memory Access (UMA)

    [Diagram: four processors, each with its own cache, connected via a
    crossbar/bus to one shared memory]

    In a shared memory parallel computer, multiple processors have access to the
    same main memory.

    Yes, a dual-core / multi-core processor based machine is a parallel computer
    on a chip.

    - Crossbar adds latency
    - Architecture is not scalable


    Shared Memory Parallel Computers: Non-Uniform Memory Access (NUMA)

    [Diagram: four processors, each with its own cache and its own local memory,
    connected via a crossbar/bus]

    - Faster local memory access
    - Slower remote memory access


    Sun Fire E2900 at Aachen

    [Diagram: 12 dual-core UltraSPARC IV processors at 1.2 GHz; each core has a
    64 KB L1 cache, each processor 2 x 8 MB L2 cache and an on-chip memory
    controller (2.4 GB/s); a crossbar provides 9.6 GB/s total peak memory
    bandwidth]

    - Simplistic view
    - Programmer's perspective
    - Rather uniform memory access


    Sun Fire V40z at Aachen

    - Simplistic view
    - Programmer's perspective
    - Non-uniform memory access

    [Diagram: four dual-core Opteron processors at 2.2 GHz; each core has 64 KB
    L1 and 1 MB L2 cache; each processor has a DDR-400 memory controller on chip
    and its own local memory (6.4 GB/s), with 8 GB/s links between the
    processors]




    Distributed Memory Parallel Computer / Cluster

    [Diagram: several nodes, each with its own processor, cache, and memory,
    connected by an external network]

    In a distributed memory parallel computer, each processor has access only to
    its own main memory.

    Programs have to use an external network for communication and cooperation.
    They have to exchange messages.



    MPI on Distributed Memory Parallel Computers

    [Diagram: one MPI task per node; each node has its own processor, cache, and
    memory, and the nodes are connected by an external network]

    Typically, when using message passing with MPI, one MPI process runs on each
    processor (core).

    MPI is the de-facto standard for message passing.

    MPI is a program library plus a mechanism to launch multiple cooperating
    executable programs. Typically it is the same binary which is started on
    multiple processors (SPMD = single program multiple data paradigm).
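
    A minimal SPMD sketch in C (illustrative; the message content and tag are
    arbitrary, the MPI calls themselves are standard API): every process runs
    the same binary, determines its rank, and rank 0 sends a value to all
    other ranks.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I?                  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes?        */

        if (rank == 0) {
            int msg = 42;
            /* rank 0 sends one integer to every other rank */
            for (int dest = 1; dest < size; dest++)
                MPI_Send(&msg, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        } else {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d of %d received %d\n", rank, size, msg);
        }
        MPI_Finalize();
        return 0;
    }

    Built with an MPI compiler wrapper (e.g. mpicc) and started with the
    library's launcher (e.g. mpirun -np 4 ./a.out), this one binary runs as
    four cooperating processes.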



    MPI on Shared Memory Parallel Computers

    [Diagram: one MPI task per processor on a shared memory machine: interleaved
    memory, per-processor caches, crossbar/bus]

    MPI can be used on shared memory systems as well. The shared memory serves
    as the network. Again, typically one MPI process runs on each processor
    (core).

    MPI is formally specified for C, C++ and Fortran. All major vendors provide
    an MPI library for their machines, and there are free versions available.

    Java implementations are available, too, but they are not widely used and
    not standardized.



    OpenMP on Shared Memory Parallel Computers

    [Diagram: one OpenMP thread per processor on a shared memory machine:
    interleaved memory, per-processor caches, crossbar/bus]

    On shared memory systems, shared memory programming can be used, where
    typically one lightweight process (= thread) runs on each processor (core).
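
    A minimal OpenMP sketch in C (illustrative; the array size and the
    dot-product computation are arbitrary choices): the iterations of one
    parallel loop are shared among the threads of a team, typically one thread
    per core.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void) {
        double sum = 0.0;

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* iterations are distributed over the team; partial sums are combined
           by the reduction clause */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %g (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }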



    MPI on SMP Clusters

    [Diagram: two SMP nodes connected by an external network; one MPI task per
    processor within each node]

    Today, most clusters have SMP nodes and MPI is well suited for this architecture.



    Hybrid Parallelization on SMP Clusters (MPI + OpenMP)

    [Diagram: two SMP nodes connected by an external network: MPI between the
    nodes, OpenMP threads within each node]
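
    A minimal hybrid sketch in C (illustrative, not from the slides; the
    placeholder computation is arbitrary): one MPI process per node does its
    node-local work with OpenMP threads and communicates between nodes with MPI.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int rank, provided;
        /* request an MPI library that tolerates threads outside of MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        /* node-local work is shared among the OpenMP threads */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i);            /* placeholder computation */

        double global;
        /* only the master thread communicates (FUNNELED model) */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g\n", global);
        MPI_Finalize();
        return 0;
    }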



    Innovative: Cluster OpenMP
    (DSM = distributed shared memory system)

    [Diagram: two SMP nodes connected by an external network; Cluster OpenMP
    spans OpenMP threads across both nodes]

    Nodes of Today's Clusters are Shared Memory Machines with Multicore Processors


    [Diagram: several nodes on an external network; each node has two dual-core
    processor chips, each chip with a shared L2 cache and per-core L1 caches,
    attached to the node's memory]



    Networks and Topologies

    Networks:

    Fast-/Gigabit-Ethernet

    Myrinet

    SCI

    QsNet (Quadrics)

    Infiniband

    Proprietary networks

    Topologies:

    Bus

    Tree

    Fat Tree

    2D-, 3D- Torus

    Hypercube

    Crossbar / Switch



    Modern Parallel Computer Architectures

    - COTS (= commercial off-the-shelf), COW (= cluster of workstations) with 1
      or 2 dual-core processor chips and a cheap network (Gigabit Ethernet),
      self-made

    - Cluster of rack-mounted pizza boxes with 1-4 dual-/quad-core processor
      chips and a fast network (Infiniband)

    - SMP cluster with standard SMP servers and proprietary or multi-rail
      networks: Sun Fire Cluster, SGI Columbia (Altix nodes), ASC Purple
      (IBM p575 nodes)

    - Supercomputers designed for high-end computing: Cray XT3, IBM BlueGene/L,
      Earth Simulator (NEC SX6)



    HPC @ RZ.RWTH-AACHEN.DE


    [Photos: Sun Fire E25K cluster, Sun Fire T2000, Sun Fire V40z cluster,
    Xeon cluster]

    RWTH Aachen Compute Cluster


    Compute Cluster of RWTH Aachen (Feb 08)

    #nodes | model | processor type | #procs | #cores | #threads | clock [MHz] | memory [GB] | network | accumulated performance [TFLOPS] | accumulated memory [TB]
    2   | SF E25K  | UltraSPARC IV           | 72 | 144 | 144 | 1050 | 288 | Gigabit Ethernet             | 0.60   | 0.58
    8   | SF E6900 | UltraSPARC IV           | 24 | 48  | 48  | 1200 | 96  | Gigabit Ethernet             | 0.92   | 0.77
    20  | SF T2000 | UltraSPARC T1           | 1  | 8   | 64  | 1400 | 32  | Gigabit Ethernet             | 0.22   | 0.64
    1   | SF T2000 | UltraSPARC T1           | 1  | 8   | 32  | 1000 | 8   | Gigabit Ethernet             | 0.0001 | 0.01
    64  | SF V40z  | Opteron 848             | 4  | 4   | 4   | 2200 | 8   | Gigabit Ethernet             | 1.13   | 0.51
    4   | SF V40z  | Opteron 875             | 4  | 8   | 8   | 2200 | 16  | Gigabit Ethernet, Infiniband | 0.14   | 0.06
    2   | SF X4600 | Opteron 885             | 8  | 16  | 16  | 2600 | 32  | Gigabit Ethernet             | 0.17   | 0.06
    7   |          | Xeon 5160 (Woodcrest)   | 2  | 4   | 4   | 3000 | 8   | Gigabit Ethernet, Infiniband | 0.17   | 0.06
    2   |          | Xeon 5160 (Woodcrest)   | 2  | 4   | 4   | 3000 | 16  | Gigabit Ethernet, Infiniband | 0.05   | 0.03
    4   |          | Xeon 5160 (Woodcrest)   | 2  | 8   | 8   | 2667 | 16  | Gigabit Ethernet, Infiniband | 0.17   | 0.06
    55  |          | Xeon E5450 (Harpertown) | 2  | 8   | 8   | 3000 | 16  | Gigabit Ethernet, Infiniband | 2.64   | 0.88
    5   |          | Xeon E5450 (Harpertown) | 2  | 8   | 8   | 3000 | 32  | Gigabit Ethernet, Infiniband | 0.24   | 0.16
    2   | Fujitsu-Siemens RX600 | Xeon X7350 (Tigerton) | 4 | 16 | 16 | 2930 | 64 | Gigabit Ethernet, Infiniband | 0.19 | 0.13
    176 | sum      |                         |    | 1740 | 2884 |     |     |                              | 6.64   | 3.95

    (Xeon 5160 / E5450 node models: Fujitsu-Siemens RX200 and Dell 1950.)
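
    The accumulated performance column is consistent with
    peak = nodes x cores per node x clock x floating point operations per cycle.
    A small sketch for three of the rows (2 flops per core and cycle is an
    inference from the table, not stated there):

    #include <stdio.h>

    int main(void) {
        /* peak [GFLOPS] = nodes * cores/node * clock [GHz] * flops/cycle */
        struct { const char *row; int nodes, cores; double ghz, fpc; } r[] = {
            { "SF E25K (UltraSPARC IV)",  2, 144, 1.05, 2 },
            { "SF V40z (Opteron 848)",   64,   4, 2.2,  2 },
            { "Xeon E5450 (Harpertown)", 55,   8, 3.0,  2 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-27s %7.1f GFLOPS\n", r[i].row,
                   r[i].nodes * r[i].cores * r[i].ghz * r[i].fpc);
        /* ~604.8, ~1126.4 and ~2640.0 GFLOPS, matching the 0.60, 1.13 and
           2.64 TFLOPS table entries */
        return 0;
    }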

    System Management


    Frontend nodes for interactive work, program development and testing, GUIs:

    cluster.rz.RWTH-Aachen.DE =
    cluster-solaris.rz.RWTH-Aachen.DE =
    cluster-solaris-sparc.rz.RWTH-Aachen.DE

    cluster-solaris-opteron.rz.RWTH-Aachen.DE

    cluster-linux.rz.RWTH-Aachen.DE =
    cluster-linux-opteron.rz.RWTH-Aachen.DE

    cluster-linux-xeon.rz.RWTH-Aachen.DE

    cluster-windows.rz.RWTH-Aachen.DE =
    cluster-windows-xeon.rz.RWTH-Aachen.DE

    Abbreviations: cl[uster], sol[aris], lin[ux], win[dows], x[eon], o[pteron],
    s[parc]

    Batch system: Sun Grid Engine (jobs > 20 min) and Microsoft Compute Cluster,
    respectively.

    Overview over HPC Tools


    Current program development environment for HPC on the Sun SPARC, AMD
    Opteron and Intel Xeon systems at the RWTH:

    4 platforms:

    1. SPARC/Solaris 10, 64 bit
    2. Opteron/Solaris, 64 bit
    3. Opteron/Linux and Xeon/Linux, 64 bit
    4. Opteron/Windows and Xeon/Windows, 64 bit

    Serial programming, shared memory parallelization, message passing.

    Compilers / MPI libraries, debugging tools, performance analysis tools.

    Programming Environment: Compilers + Debugging Tools


    Sun Fire Cluster Programming Environment

    Sun Studio 12: F95/C/C++; OpenMP support: F95/C++; autoparallelization:
    F95/C++; debuggers: dbx, sunstudio, thread analyzer; runtime analysis:
    analyzer, collect, er_print, gprof

    Intel V10.0: F95/C++; OpenMP support: F95/C++, Threading Tools;
    autoparallelization: F95/C++; debugger: idb; runtime analysis: vtune

    GNU V4.0: F95/C++; debugger: gdb; runtime analysis: gprof

    GNU V4.2: F95/C++; OpenMP support: F95/C++; debugger: gdb; runtime
    analysis: gprof

    PGI V7.1: F77/F90/C/C++; OpenMP support and autoparallelization:
    F77/F90/C/C++; debugger: pgdbg; runtime analysis: pgprof

    Microsoft Visual Studio 2003: C++; debugger: Visual Studio

    Microsoft Visual Studio 2005: C++; OpenMP support: C++; debugger:
    Visual Studio

    Etnus TotalView 8.3: debugger

    (The original table also marks per-tool availability on the Sparc/Solaris,
    Opteron/Xeon Solaris, Linux, and Windows platforms.)

    MPI Implementations and Tools


    Provider | Version | MPI-2 support | Debugger | Runtime analysis | Platform | Network
    Sun | HPC ClusterTools 6 | yes | TotalView | analyzer, mpprof | Solaris 10, Opteron + Sparc | tcp, shm
    Sun | HPC ClusterTools 7.1 (based on Open MPI) | yes | TotalView | - | Solaris 10, Opteron + Sparc | tcp, shm, ib
    Intel | Version 3.1 (based on mpich2) | (yes) | TotalView | Trace Collector & Analyzer (former Vampir) | Linux | tcp, shm, ib
    ANL | mpich 1.2.6 | no | TotalView | jumpshot | Sol, Lin, Win | tcp, shm
    ANL | mpich2 1.0.x | yes (tcp) | TotalView | jumpshot | Sol, Lin, Win | tcp, shm
    Open MPI (p.d.) | OpenMPI 1.2.5 (based on FT-MPI, LA-MPI, LAM, PACX) | yes | TotalView | ? | - | tcp, myr, Infiniband
    Microsoft | CCS V1 (based on mpich2) | (yes) | Visual Studio w/ MS Compute Cluster Pack | - | Windows | tcp, shm, (Infiniband)
    Univ Dresden | - | - | - | Vampir-NG | any | -
    VI-HPS | - | - | - | multiple research tools | any | -




    Measuring Performance: The Linpack Benchmark

    The theoretical peak performance is determined by the clock cycle and the
    number of floating point operations per cycle. The actual floating point
    performance can be determined by the LINPACK benchmark (www.top500.org),
    which solves a linear equation system with a full coefficient matrix of an
    arbitrary size.

    The unit of measurement is M[ega]flops = million floating point operations
    per second; G[iga]flops, T[era]flops, P[eta]flops (= 10^9, 10^12, 10^15
    flops).

    The Top500 list of the fastest supercomputers is updated twice per year.

    Latest No. 1 (28th list, Nov. 2006): IBM BlueGene/L with 131,072 processors
    and 32 TB total memory at Lawrence Livermore National Laboratory (LLNL).
    Peak: 367 TFlops = 367,000 GFlops; Linpack: 280 TFlops (76% of peak);
    matrix size N = 1,769,471.

    For comparison, a dual-core Intel Xeon 5160 (Woodcrest) at 3 GHz:
    2 cores * 4 flops/cycle (SSE) * 3 GHz = 24 GFlops.
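
    Applying the same formula to the BlueGene/L entry above reproduces the
    367 TFlops figure (a sketch; the 4 flops/cycle for the PowerPC 440 is an
    assumption, not stated on the slide):

    #include <stdio.h>

    int main(void) {
        /* peak = processors * flops/cycle * clock [GHz] */
        double xeon = 2      * 4 * 3.0;   /* Woodcrest chip:        24 GFlops */
        double bgl  = 131072 * 4 * 0.7;   /* BlueGene/L:      ~367,000 GFlops */
        printf("Xeon 5160 peak:  %.0f GFlops\n", xeon);
        printf("BlueGene/L peak: %.0f GFlops = %.0f TFlops\n", bgl, bgl / 1e3);
        return 0;
    }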

    The TOP500 List


    [Chart: Linpack performance in GFlops (log scale, 1 to 1,000,000) versus
    time (Jun 93 to Jun 06): TOP500 rank 1, rank 50, rank 200 and rank 500,
    PC technology, Moore's Law, and the RWTH peak performance (Rpeak) and
    Linpack performance; annotated with the Fujitsu VPP300 and the Sun Fire
    Cluster at Aachen University]

    The current Top 20 (Nov 07)


    Rank | Site | Manufacturer | Computer | Country | Procs | Linpack Rmax | %peak | Processor | Proc. frequency [MHz] | System family | Interconnect
    1  | LLNL     | IBM     | Blue Ge   | USA     | 212992 | 478200 | 80.18 | PowerPC 440     | 700  | IBM BlueGe   | Proprietary
    2  | FZ Jueli | IBM     | Blue Ge   | Germany | 65536  | 167300 | 75.08 | PowerPC 450     | 850  | IBM BlueGe   | Proprietary
    3  | SGI/NM   | SGI     | SGI Alti  | USA     | 14336  | 126900 | 73.77 | Xeon 53xx (Clov | 3000 | SGI Altix    | Infiniband
    4  | TATA S   | HP      | Cluster   | India   | 14240  | 117900 | 69.00 | Xeon 53xx (Clov | 3000 | HP Cluster   | Infiniband DDR
    5  | Gov Age  | HP      | Cluster   | Sweden  | 13728  | 102800 | 70.20 | Xeon 53xx (Clov | 2667 | HP Cluster   | Infiniband DDR
    6  | Sandia   | Cray In | Sandia/   | USA     | 26569  | 102200 | 80.14 | Opteron Dual C  | 2400 | Cray XT      | XT3 proprietary
    7  | Oak Rid  | Cray In | Cray XT   | USA     | 23016  | 101700 | 85.21 | Opteron Dual C  | 2600 | Cray XT      | XT3 proprietary
    8  | IBM Wa   | IBM     | Blue Ge   | USA     | 40960  | 91290  | 79.60 | PowerPC 440     | 700  | IBM BlueGe   | Proprietary
    9  | NERSC/   | Cray In | Cray XT   | USA     | 19320  | 85368  | 84.97 | Opteron Dual C  | 2600 | Cray XT      | XT3 proprietary
    10 | Stony B  | IBM     | Blue Ge   | USA     | 36864  | 82161  | 79.60 | PowerPC 440     | 700  | IBM BlueGe   | Proprietary
    11 | LLNL     | IBM     | pSeries   | USA     | 12208  | 75760  | 81.65 | POWER5          | 1900 | IBM pSerie   | Federation
    12 | Renssel  | IBM     | Blue Ge   | USA     | 32768  | 73032  | 79.60 | PowerPC 440     | 700  | IBM BlueGe   | Proprietary
    13 | Barcelo  | IBM     | BladeCe   | Spain   | 10240  | 63830  | 67.75 | PowerPC 970     | 2300 | IBM Cluster  | Myrinet
    14 | NCSA     | Dell    | PowerE    | USA     | 9600   | 62680  | 69.97 | Xeon 53xx (Clov | 2333 | Dell Cluster | Infiniband SDR
    15 | Leibniz  | SGI     | Altix 470 | Germany | 9728   | 56520  | 90.78 | Itanium 2       | 1600 | SGI Altix    | NUMAlink
    16 | GSIC, TI | NEC/Su  | Sun Fir   | Japan   | 11664  | 56430  | 55.31 | Opteron Dual C  | 2400 | Sun Fire -   | Infiniband
    17 | Univ Edi | Cray In | Cray XT   | UK      | 11328  | 54648  | 86.15 | Opteron Dual C  | 2800 | Cray XT      | XT3 proprietary
    18 | Sandia   | Dell    | PowerE    | USA     | 9024   | 53000  | 81.57 | Xeon EM64T      | 3600 | Dell Cluster | Infiniband
    19 | CEA      | Bull SA | NovaSc    | France  | 9968   | 52840  | 82.83 | Itanium 2       | 1600 | Bull SMP Cl  | Quadrics
    20 | NASA/A   | SGI     | SGI Alti  | USA     | 10160  | 51870  | 85.09 | Itanium 2       | 1500 | SGI Altix    | Numalink/IB

    (Several names appear truncated in the source.)

    Aachen on Rank 180 in June 2005

    http://www.rz.rwth-aachen.de/hpc/sun/


    Over 2 TeraFlop/s Linpack performance, April 2005. The upgrade from
    UltraSPARC III to UltraSPARC IV, including an increase of the main memory
    capacity, more than doubled our Linpack performance!

    A linear system with 499,200 unknowns was solved in 11:12:48.8 hours at an
    average speed of 2054.4 billion floating point operations per second
    (GFlop/s).

    The program had a total memory footprint of 2 Terabyte.

    1276 processor cores were kept busy with 82,930,000,000 million (about
    8.3 x 10^16) floating point operations.
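
    These numbers are mutually consistent; a small sketch checking them (the
    2/3 n^3 operation count for the LU factorization is the standard Linpack
    estimate and is not stated on the slide):

    #include <stdio.h>

    int main(void) {
        double n     = 499200.0;               /* unknowns                     */
        double bytes = n * n * 8.0;            /* double precision matrix      */
        double flops = 2.0 / 3.0 * n * n * n;  /* dominant factorization cost  */
        double rate  = 2054.4e9;               /* sustained flop/s from above  */

        printf("matrix size : %.2f TB\n", bytes / 1e12);            /* ~2.0 TB  */
        printf("total flops : %.2e\n", flops);                      /* ~8.3e16  */
        printf("runtime     : %.2f hours\n", flops / rate / 3600.0); /* ~11.2 h */
        return 0;
    }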

    Future Parallel Computers


    With a growing number of cores per chip the SMP box is shrinking.

    In a few years from now, many or all applications will be multi-threaded

    SMP boxes with small footprint will be building blocks of large systems.

    Memory hierarchies will grow (L3 caches ...).

    Network latency will be close to 1 µs; network bandwidth will reach several
    GB/s.

    Current research: Distributed Shared Memory (Cluster OpenMP ...), combining
    the advantage of SMP with the scalability of DMP.

    In 2008/09: Petaflop/s systems by IBM, Cray, NEC.
    Woodcrest: ~24 GFlop/s @ ~100 W => 1 PFlop/s @ ~4 MW.

    Main problems: Power supply and cooling
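
    The 4 MW figure follows from scaling up the Woodcrest numbers, as in this
    rough sketch (it ignores memory, network and cooling):

    #include <stdio.h>

    int main(void) {
        double target   = 1e15;    /* 1 PFlop/s                       */
        double per_chip = 24e9;    /* ~24 GFlop/s per Woodcrest chip  */
        double watts    = 100.0;   /* ~100 W per chip                 */
        double chips    = target / per_chip;
        printf("%.0f chips, ~%.1f MW\n", chips, chips * watts / 1e6);
        /* ~41667 chips, ~4.2 MW */
        return 0;
    }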

    Some Web Links


    Information related to HPC at the RWTH

    http://www.rz.rwth-aachen.de/hpc/

    Information related to MPI at the RWTH

    http://www.rz.rwth-aachen.de/mpi/

    Sun Fire SMP Cluster Primer: http://www.rz.rwth-aachen.de/hpc/primer

    Web page of the SunHPC and VI-HPS workshops with more links and information:
    http://www.rz.rwth-aachen.de/sunhpc

    Joint SunHPC Seminar (March 3-4) and VI-HPS (March 5-7) Tuning Workshop