Introduction to Parallel Computers (Pendahuluan Paralel Komputer)


Introduction to High Performance Computing:
Parallel Computing, Distributed Computing, Grid Computing and More

Dr. Jay Boisseau
Director, Texas Advanced Computing Center
[email protected]

December 3, 2001

The University of Texas at Austin
Texas Advanced Computing Center


    Outline

    Preface

    What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Purpose

Purpose of this workshop: to educate researchers about the value and impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering.

Purpose of this presentation: to educate researchers about the techniques and tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing.


    Goals

Goals of this presentation are to help you:
1. understand the big picture of high performance computing
2. develop a comprehensive understanding of parallel computing
3. begin to understand how Grid and distributed computing will further enhance computational science capabilities


    Content and Context

This material is an introduction and an overview. It is not a comprehensive treatment of HPC, so further reading (much more!) is recommended.

This presentation is followed by additional speakers with detailed presentations on specific HPC and science topics.

Together, these presentations will help prepare you to use HPC in your scientific discipline.


    Background - me

Director of the Texas Advanced Computing Center (TACC) at the University of Texas

Formerly at the San Diego Supercomputer Center (SDSC) and the Arctic Region Supercomputing Center

10+ years in HPC

Have known Luis for 4 years; plan to develop a strong relationship between TACC and CeCalCULA


Background - TACC

Mission: to enhance the academic research capabilities of the University of Texas and its affiliates through the application of advanced computing resources and expertise

TACC activities include:
  Resources
  Support
  Development
  Applied research


    TACC Activities

TACC resources and support include:
  HPC systems
  Scientific visualization resources
  Data storage/archival systems

TACC research and development areas:
  HPC
  Scientific Visualization
  Grid Computing


    Current HPC Systems

[Diagram of current TACC HPC systems and network: CRAY SV1 (16 CPUs, 16 GB memory), CRAY T3E (256+ processors, 128 MB/processor, 500 GB disk), IBM SP (64+ processors, 256 MB/processor, 300 GB disk), a 640 GB archive, and the hosts aurora, golden, and azure, connected via FDDI, HiPPI, and an Ascend router.]


    New HPC Systems

Four IBM p690 HPC servers:
  16 Power4 processors per server
  1.3 GHz: 5.2 Gflops per processor, 83.2 Gflops per server
  16 GB shared memory, >200 GB/s memory bandwidth!
  144 GB disk

1 TB disk to partition across servers

Will configure as a single system (1/3 Tflop) with a single GPFS file system (1 TB) in 2Q02


    New HPC Systems

IA64 Cluster:
  20 2-way nodes
  Itanium (800 MHz) processors
  2 GB memory/node
  72 GB disk/node
  Myrinet 2000 switch
  180 GB shared disk

IA32 Cluster:
  32 2-way nodes
  Pentium III (1 GHz) processors
  1 GB memory
  18.2 GB disk/node
  Myrinet 2000 switch

750 GB IBM GPFS parallel file system for both clusters


    World-Class Vislab

SGI Onyx2: 24 CPUs, 6 InfiniteReality2 graphics pipelines, 24 GB memory, 750 GB disk

Front and rear projection systems:
  3x1 cylindrically-symmetric Power Wall
  5x2 large-screen, 16:9 panel Power Wall

Matrix switch between systems, projectors, rooms


    More Information

URL: www.tacc.utexas.edu

E-mail addresses:
  General information: [email protected]
  Technical assistance: [email protected]

Telephone numbers:
  Main office: (512) 475-9411
  Facsimile transmission: (512) 475-9445
  Operations room: (512) 475-9410


    Outline

    Preface

    What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Supercomputing

First HPC systems were vector-based systems (e.g., Cray)

  named "supercomputers" because they were an order of magnitude more powerful than commercial systems

Now, "supercomputer" has little meaning

  large systems are now just scaled-up versions of smaller systems

However, "high performance computing" has many meanings


    HPC Defined

High performance computing:

can mean high flop count
  per processor
  totaled over many processors working on the same problem
  totaled over many processors working on related problems

can mean faster turnaround time
  more powerful system
  scheduled to first available system(s)
  using multiple systems simultaneously


    My Definitions

HPC: any computational technique that solves a large problem faster than possible using single, commodity systems

  Custom-designed, high-performance processors (e.g., Cray, NEC)
  Parallel computing
  Distributed computing
  Grid computing


    My Definitions

Parallel computing: single systems with many processors working on the same problem

Distributed computing: many systems loosely coupled by a scheduler to work on related problems

Grid Computing: many systems tightly coupled by software and networks to work together on single problems or on related problems


    Importance of HPC

HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry.

Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.


    Outline

    Preface

    What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Parallel vs. Serial Computers

Two big advantages of parallel computers:
1. total performance
2. total memory

Parallel computers enable us to solve problems that:
  benefit from, or require, fast solution
  require large amounts of memory

Example that requires both: weather forecasting


    Parallel vs. Serial Computers

Some benefits of parallel computing include:

  more data points
    bigger domains
    better spatial resolution
    more particles

  more time steps
    longer runs
    better temporal resolution

  faster execution
    faster time to solution
    more solutions in same time
    larger simulations in real time


    Serial Processor Performance

[Plot: single-processor performance vs. time (years).]

Although Moore's Law predicts that single-processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached.


    Types of Parallel Computers

The simplest and most useful way to classify modern parallel computers is by their memory model:

  shared memory
  distributed memory


Shared vs. Distributed Memory

[Diagram: shared memory - processors (P) connected to one memory over a bus; distributed memory - processor/memory (P/M) pairs connected by a network.]

Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)

Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)


Shared Memory: UMA vs. NUMA

[Diagram: UMA - all processors share one bus to a single memory; NUMA - two bus-connected processor/memory groups joined by a network.]

Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs. (Sun E10000)

Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (SGI Origin)


    Distributed Memory: MPPs vs. Clusters

Processor-memory nodes are connected by some type of interconnect network

  Massively Parallel Processor (MPP): tightly integrated, single system image
  Cluster: individual computers connected by s/w

[Diagram: CPU/memory nodes connected by an interconnect network.]


    Processors, Memory, & Networks

Both shared and distributed memory systems have:
1. processors: now generally commodity RISC processors
2. memory: now generally commodity DRAM
3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

We will now begin to describe these pieces in detail, starting with definitions of terms.


    Processor-Related Terms

Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).

Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.

Register: a small, extremely fast location for storing data or instructions in the processor.


    Processor-Related Terms

Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.

Pipeline: technique enabling multiple instructions to be overlapped in execution.

Superscalar: multiple instructions are possible per clock period.

Flops: floating point operations per second.


    Processor-Related Terms

Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly.

Translation-Lookaside Buffer (TLB): keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).


    Memory-Related Terms

SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.

DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).

Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.


    Interconnect-Related Terms

Latency:
  Networks: how long does it take to start sending a "message"? Measured in microseconds.
  Processors: how long does it take to output results of some operations, such as floating point add, divide, etc., which are pipelined?

Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec.


    Interconnect-Related Terms

Topology: the manner in which the nodes are connected.

  Best choice would be a fully connected network (every processor to every other). Infeasible for cost and scaling reasons.
  Instead, processors are arranged in some variation of a grid, torus, or hypercube.

[Diagrams: 3-d hypercube, 2-d mesh, 2-d torus.]


    Processor-Memory Problem

Processors issue instructions roughly every nanosecond.

DRAM can be accessed roughly every 100 nanoseconds (!).

DRAM cannot keep processors busy! And the gap is growing:

  processors getting faster by 60% per year
  DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)


    Processor-Memory Performance Gap

[Chart: CPU vs. DRAM performance, 1980-2000, on a log scale. CPU performance (Moore's Law) improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50%/year. From D. Patterson, CS252, Spring 1998, UCB.]


    Processor-Memory Performance Gap

Problem becomes worse when remote (distributed or NUMA) memory is needed

  network latency is roughly 1000-10000 nanoseconds (roughly 1-10 microseconds)
  networks getting faster, but not fast enough

Therefore, cache is used in all processors

  almost as fast as processors (same circuitry)
  sits between processors and local memory
  expensive, can only use small amounts
  must design system to load cache effectively


Processor-Cache-Memory

[Diagram: CPU, cache, and main memory.]

Cache is much smaller than main memory, and hence there is a mapping of data from main memory to cache.


Memory Hierarchy

[Diagram: CPU, cache, local memory, remote memory. Moving away from the CPU, speed decreases while size increases and cost per bit decreases.]


    Cache-Related Terms

ICACHE: instruction cache

DCACHE (L1): data cache closest to registers

SCACHE (L2): secondary data cache
  Data from SCACHE has to go through DCACHE to registers
  SCACHE is larger than DCACHE
  Not all processors have SCACHE


    Cache Benefits

Data cache was designed with two key concepts in mind:

  Spatial locality
    When an element is referenced, its neighbors will be referenced too
    Cache lines are fetched together
    Work on consecutive data elements in the same cache line

  Temporal locality
    When an element is referenced, it might be referenced again soon
    Arrange code so that data in cache is reused often
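A minimal sketch (not from the original slides) of what "work on consecutive data elements in the same cache line" means in practice. Fortran stores arrays column-major; the array name and size below are illustrative:

      program loop_order
      parameter (n=1000)
      real a(n,n)
      integer i, j
c     cache-friendly: the inner loop runs over the first index, so
c     successive iterations touch consecutive memory locations that
c     share a cache line (spatial locality)
      do j = 1, n
         do i = 1, n
            a(i,j) = 0.0
         enddo
      enddo
c     cache-unfriendly: the inner loop strides through memory by n
c     elements, touching a different cache line almost every iteration
      do i = 1, n
         do j = 1, n
            a(i,j) = 0.0
         enddo
      enddo
      end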


Direct-Mapped Cache

[Diagram: blocks of main memory mapping onto cache lines.]

Direct-mapped cache: a block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
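As an illustration (cache size assumed, not from the slide): in a direct-mapped cache with 256 lines, the block at memory block address B goes to cache line B mod 256, so blocks 0, 256, 512, ... all compete for the same line.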


Fully Associative Cache

[Diagram: any block of main memory mapping to any cache line.]

Fully associative cache: a block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.


Set Associative Cache

[Diagram: 2-way set-associative cache and main memory.]

Set associative cache: the middle range of designs between direct-mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache, a block from main memory can go into N (N > 1) locations in the cache.


    Cache-Related Terms

Least Recently Used (LRU): cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.

Random Replace: cache replacement strategy for set associative caches. A cache block is randomly replaced.


    Example: CRAY T3E Cache

The CRAY T3E processors can execute:
  2 floating point ops (1 add, 1 multiply) and
  2 integer/memory ops (includes 2 loads or 1 store)

To help keep the processors busy:
  on-chip 8 KB direct-mapped data cache
  on-chip 8 KB direct-mapped instruction cache
  on-chip 96 KB 3-way set associative secondary data cache with random replacement


    Putting the Pieces Together

Recall:

Shared memory architectures:
  Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000
  Non-Uniform Memory Access (NUMA): most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA), systems. Ex: SGI Origin 2000

Distributed memory architectures:
  Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP
  Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters


    Symmetric Multiprocessors (SMPs)

SMPs connect processors to global shared memory using one of:
  bus
  crossbar

Provides a simple programming model, but has problems:
  buses can become saturated
  crossbar size must increase with # processors

Problem grows with number of processors, limiting maximum size of SMPs


    Shared Memory Programming

Programming models are easier since message passing is not necessary. Techniques:

  autoparallelization via compiler options
  loop-level parallelism via compiler directives
  OpenMP
  pthreads

More on programming models later.


    Massively Parallel Processors

Each processor has its own memory:
  memory is not shared globally
  adds another layer to the memory hierarchy (remote memory)

Processor/memory nodes are connected by an interconnect network:
  many possible topologies
  processors must pass data via messages
  communication overhead must be minimized


    Communications Networks

Custom
  Many vendors have custom interconnects that provide high performance for their MPP system
  CRAY T3E interconnect is the fastest for MPPs: lowest latency, highest bandwidth

Commodity
  Used in some MPPs and all clusters
  Myrinet, Gigabit Ethernet, Fast Ethernet, etc.


    Types of Interconnects

Fully connected: not feasible

Array and torus: Intel Paragon (2D array), CRAY T3E (3D torus)

Crossbar: IBM SP (8 nodes)

Hypercube and fat tree: SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)

Combinations of some of the above:
  IBM SP (crossbar & fully connected for 80 nodes)
  IBM SP (fat tree for > 80 nodes)


    Clusters

Similar to MPPs:
  Commodity processors and memory
    Processor performance must be maximized
  Memory hierarchy includes remote memory
  No shared memory--message passing
    Communication overhead must be minimized

Different from MPPs:
  All commodity, including interconnect and OS
  Multiple independent systems: more robust
  Separate I/O systems


    Cluster Pros and Cons

Pros:
  Inexpensive
  Fastest processors first
  Potential for true parallel I/O
  High availability

Cons:
  Less mature software (programming and system)
  More difficult to manage (changing slowly)
  Lower performance interconnects: not as scalable to large numbers (but have almost caught up!)


    Distributed Memory Programming

Message passing is most efficient:
  MPI
  MPI-2

Active/one-sided messages:
  Vendor libraries: SHMEM (T3E), LAPI (SP)
  Coming in MPI-2

Shared memory models can be implemented in software, but are not as efficient.

More on programming models in the next section.


    Distributed Shared Memory

More generally called cc-NUMA (cache coherent NUMA)

Consists of m SMPs with n processors each in a global address space:
  Each processor has some local memory (SMP)
  All processors can access all memory: extra directory hardware on each SMP tracks values stored in all SMPs
  Hardware guarantees cache coherency
  Access to memory on other SMPs is slower (NUMA)


    Distributed Shared Memory

Easier to build because of slower access to remote memory (no expensive bus/crossbar)

Similar cache problems

Code writers should be aware of data distribution

Load balance: minimize access of far memory


    DSM Rationale and Realities

Rationale: combine the ease of SMP programming with the scalability of MPP programming, at much the cost of an MPP

Reality: NUMA introduces additional layers in the memory hierarchy relative to SMPs, so scalability is limited if programmed as an SMP

Reality: performance and high scalability require programming to the architecture.


    Clustered SMPs

Simpler than DSMs:
  composed of nodes connected by a network, like an MPP or cluster
  each node is an SMP
  processors on one SMP do not share memory on other SMPs (no directory hardware in SMP nodes)
  communication between SMP nodes is by message passing

Ex: IBM Power3-based SP systems


Clustered SMP Diagram

[Diagram: two SMP nodes, each with four processors sharing memory over a bus, connected to each other by a network.]


    Reasons for Clustered SMPs

Natural extension of SMPs and clusters:
  SMPs offer great performance up to their crossbar/bus limit
  Connecting nodes is how memory and performance are increased beyond SMP levels
  Can scale to larger numbers of processors with a less scalable interconnect

Maximum performance:
  Optimize at SMP level - no communication overhead
  Optimize at MPP level - fewer messages necessary for same number of processors


    Clustered SMP Drawbacks

Clustering SMPs has drawbacks:
  No shared memory access over the entire system, unlike DSMs
  Has other disadvantages of DSMs
    Extra layer in memory hierarchy
    Performance requires more effort from the programmer than SMPs or MPPs

However, clustered SMPs provide a means for obtaining very high performance and scalability


    Clustered SMP: NPACI Blue Horizon

IBM SP system:
  Power3 processors: good peak performance (~1.5 Gflops)
  better sustained performance (highly superscalar and pipelined) than for many other processors
  SMP nodes have 8 Power3 processors
  System has 144 SMP nodes (1152 processors total)


    Programming Clustered SMPs

NSF: most users use only MPI, even for intra-node messages

DoE: most applications are being developed with MPI (between nodes) and OpenMP (intra-node)

MPI+OpenMP programming is more complex, but might yield maximum performance

Active messages and pthreads would theoretically give maximum performance


Types of Parallelism

Data parallelism: each processor performs the same task on different sets or sub-regions of data

Task parallelism: each processor performs a different task

Most parallel applications fall somewhere on the continuum between these two extremes.


    Data vs. Task Parallelism

Example of data parallelism:
  In a bottling plant, we see several processors, or bottle cappers, applying bottle caps concurrently on rows of bottles.

Example of task parallelism:
  In a restaurant kitchen, we see several chefs, or processors, working simultaneously on different parts of different meals.
  A good restaurant kitchen also demonstrates load balancing and synchronization--more on those topics later.


    Example: Master-Worker Parallelism

A common form of parallelism used in developing applications years ago (especially in PVM) was master-worker parallelism:

  a single processor is responsible for distributing data and collecting results (task parallelism)
  all other processors perform the same task on their portion of the data (data parallelism)
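A minimal, hedged sketch of this pattern in MPI-style Fortran, in the spirit of the examples later in this deck; the chunk size, message tags, and the placeholder "computation" are illustrative, not from the original:

      program master_worker
      include 'mpif.h'
      integer ierr, myid, nprocs, status(MPI_STATUS_SIZE)
      integer iw, i
      real work(100), result
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
c        master: distribute one chunk of (placeholder) data to each
c        worker, then collect one result from each worker
         do iw = 1, nprocs - 1
            do i = 1, 100
               work(i) = real(iw*100 + i)
            enddo
            call MPI_SEND(work, 100, MPI_REAL, iw, 1,
     &                    MPI_COMM_WORLD, ierr)
         enddo
         do iw = 1, nprocs - 1
            call MPI_RECV(result, 1, MPI_REAL, iw, 2,
     &                    MPI_COMM_WORLD, status, ierr)
         enddo
      else
c        worker: receive a chunk, do the same (placeholder) task on
c        it, and send the result back to the master
         call MPI_RECV(work, 100, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, status, ierr)
         result = 0.0
         do i = 1, 100
            result = result + work(i)
         enddo
         call MPI_SEND(result, 1, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end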


    Parallel Programming Models

The primary programming models in current use are:

  Data parallelism - operations are performed in parallel on collections of data structures. A generalization of array operations.

  Message passing - processes possess local memory and communicate with other processes by sending and receiving messages.

  Shared memory - each processor has access to a single shared pool of memory.


    Parallel Programming Models

Most parallelization efforts fall under the following categories:

  Codes can be parallelized using message-passing libraries such as MPI.
  Codes can be parallelized using compiler directives such as OpenMP.
  Codes can be written in new parallel languages.


Programming Models and Architectures

Natural mappings:
  data parallel: CM-2 (SIMD machine)
  message passing: IBM SP (MPP)
  shared memory: SGI Origin, Sun E10000

Implemented mappings:
  HPF (a data parallel language) and MPI (a message passing library) have been implemented on nearly all parallel machines
  OpenMP (a set of directives, etc. for shared memory programming) has been implemented on most shared memory systems


    SPMD

All current machines are MIMD systems (Multiple Instruction, Multiple Data) and are capable of either data parallelism or task parallelism.

The primary paradigm for programming parallel machines is the SPMD paradigm: Single Program, Multiple Data

  each processor runs a copy of the same source code
  enables data parallelism (through data decomposition) and task parallelism (through intrinsic functions that return the processor ID)
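A minimal sketch (not from the slides) of the SPMD idea in MPI Fortran: every processor runs the same program, and the processor ID selects which block of the data this copy works on. The array contents and the block split are illustrative:

      program spmd_add
      include 'mpif.h'
      parameter (n=1000)
      real x(n), y(n), z(n)
      integer ierr, myid, nprocs, nlocal, istart, iend, i
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     every processor runs this same code; the processor ID (rank)
c     selects which block of the arrays this copy works on
      nlocal = n / nprocs
      istart = myid*nlocal + 1
      iend   = istart + nlocal - 1
      if (myid .eq. nprocs - 1) iend = n
      do i = istart, iend
         y(i) = real(i)
         z(i) = real(2*i)
         x(i) = y(i) + z(i)
      enddo
      call MPI_FINALIZE(ierr)
      end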


    OpenMP - Shared Memory Standard

OpenMP is a new standard for shared memory programming: SMPs and cc-NUMAs.

OpenMP provides a standard set of directives, run-time library routines, and environment variables for parallelizing code under a shared memory model.

Very similar to Cray PVP autotasking directives, but with much more functionality. (Cray now supports OpenMP.)

See http://www.openmp.org for more information.


OpenMP Example

Fortran 77:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

Fortran 77 + OpenMP:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
!$OMP PARALLEL DO
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

The !$OMP PARALLEL DO directive specifies that the loop is executed in parallel: each processor executes a subset of the loop iterations.


    MPI - Message Passing Standard

MPI has emerged as the standard for message passing in both C and Fortran programs. No longer need to know MPL, PVM, TCGMSG, etc.

MPI is both large and small:

  MPI is large, since it contains 125 functions which give the programmer fine control over communications
  MPI is small, since message passing programs can be written using a core set of just six functions
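As a hedged illustration of that small core (a sketch, not part of the original slides), here is a complete Fortran program that uses only MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, and MPI_FINALIZE; the tag and buffer contents are arbitrary:

      program six_functions
      include 'mpif.h'
      integer ierr, myid, nprocs, status(MPI_STATUS_SIZE)
      real x
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      x = real(myid)
c     processor 0 sends one real to processor 1, which receives it
      if (myid .eq. 0 .and. nprocs .gt. 1) then
         call MPI_SEND(x, 1, MPI_REAL, 1, 100, MPI_COMM_WORLD, ierr)
      else if (myid .eq. 1) then
         call MPI_RECV(x, 1, MPI_REAL, 0, 100, MPI_COMM_WORLD,
     &                 status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end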


MPI Examples - Send and Receive

MPI messages are two-way: they require a send and a matching receive.

PE 0 calls MPI_SEND to pass the real variable x to PE 1; PE 1 calls MPI_RECV to receive the real variable y from PE 0:

      if (myid.eq.0) then
         call MPI_SEND(x,1,MPI_REAL,1,100,MPI_COMM_WORLD,ierr)
      endif
      if (myid.eq.1) then
         call MPI_RECV(y,1,MPI_REAL,0,100,MPI_COMM_WORLD,
     &                 status,ierr)
      endif


MPI Example - Global Operations

MPI also has global operations to broadcast and reduce (collect) information.

PE 6 collects the single (1) integer value n from all other processors and puts the sum (MPI_SUM) into allsum:

      call MPI_REDUCE(n,allsum,1,MPI_INTEGER,MPI_SUM,6,
     &                MPI_COMM_WORLD,ierr)

PE 5 broadcasts the single (1) integer value n to all other processors:

      call MPI_BCAST(n,1,MPI_INTEGER,5,MPI_COMM_WORLD,ierr)


    MPI Implementations

MPI is typically implemented on top of the highest performance native message passing library for every distributed memory machine.

  MPI is a natural model for distributed memory machines (MPPs, clusters)
  MPI offers higher performance on DSMs beyond the size of an individual SMP
  MPI is useful between SMPs that are clustered
  MPI can be implemented on shared memory machines


    Extensions to MPI: MPI-2

A standard for MPI-2 has been developed which extends the functionality of MPI. New features include:

  One-sided communications - eliminates the need to post matching sends and receives. Similar in functionality to the SHMEM PUT and GET on the CRAY T3E (most systems have an analogous library)
  Support for parallel I/O
  Extended collective operations

No full implementation yet - it is difficult for vendors


    MPI vs. OpenMP

There is no single best approach to writing a parallel code. Each has pros and cons:

  MPI - powerful, general, and universally available message passing library which provides very fine control over communications, but forces the programmer to operate at a relatively low level of abstraction.

  OpenMP - conceptually simple approach for creating parallel codes on shared memory machines, but not applicable to distributed memory platforms.


    MPI vs. OpenMP

MPI is the most general (problem types) and portable (platforms, although not efficient for SMPs).

The architecture and the problem type often make the decision for you.


    Parallel Libraries

Finally, there are parallel mathematics libraries that enable users to write (serial) codes, then call parallel solver routines:

  ScaLAPACK is for solving dense linear systems of equations, eigenvalue, and least squares problems. Also see PLAPACK.
  PETSc is for solving linear and non-linear partial differential equations (includes various iterative solvers for sparse matrices).
  Many others: check NETLIB for a complete survey: http://www.netlib.org


    Hurdles in Parallel Computing

There are some hurdles in parallel computing:

  Scalar performance: fast parallel codes require efficient use of the underlying scalar hardware
  Parallel algorithms: not all scalar algorithms parallelize well; may need to rethink the problem
  Communications: need to minimize the time spent doing communications
  Load balancing: all processors should do roughly the same amount of work
  Amdahl's Law: fundamental limit on parallel computing


    Scalar Performance

Underlying every good parallel code is a good scalar code.

If a code scales to 256 processors but only gets 1% of peak performance, it is still a bad parallel code.

  Good news: everything that you know about serial computing will be useful in parallel computing!
  Bad news: it is difficult to get good performance out of the processors and memory used in parallel machines. Need to use cache effectively.


Serial Performance

[Plot: time to solution vs. number of processors for a serial code and a parallel code.]

In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used.


Use Cache Effectively

[Diagram: a simplified memory hierarchy - the CPU, a small and fast cache, and big but slow main memory.]

The data cache was designed with two key concepts in mind:

  Spatial locality - the cache is loaded an entire line (4-32 words) at a time to take advantage of the fact that if a location in memory is required, nearby locations will probably also be required.

  Temporal locality - once a word is loaded into cache it remains there until the cache line is needed to hold another word of data.


    Non-Cache Issues

There are other issues to consider to achieve good serial performance:

  Force reductions, e.g., replacement of divisions with multiplications-by-inverse
  Evaluate and replace common sub-expressions
  Pushing loops inside subroutines to minimize subroutine call overhead
  Force function inlining (compiler option)
  Perform interprocedural analysis to eliminate redundant operations (compiler option)


    Parallel Algorithms

The algorithm must be naturally parallel!

  Certain serial algorithms do not parallelize well.
  Developing a new parallel algorithm to replace a serial algorithm can be one of the most difficult tasks in parallel computing.
  Keep in mind that your parallel algorithm may involve additional work or a higher floating point operation count.


    Parallel Algorithms

Keep in mind that the algorithm should:

  need the minimum amount of communication (Monte Carlo algorithms are excellent examples)
  balance the load among the processors equally

Fortunately, a lot of research has been done in parallel algorithms, particularly in the area of linear algebra. Don't reinvent the wheel; take full advantage of the work done by others:

  use parallel libraries supplied by the vendor whenever possible!
  use ScaLAPACK, PETSc, etc. when applicable


Load Balancing

The figures below show the timeline for parallel codes run on two processors. In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors, resulting in a shorter time to solution.

[Figure: timelines for PE 0 and PE 1 showing busy time, idle time, and synchronization points for an unbalanced and a well-balanced run.]


    Communications

Two key parameters of the communications network are:

  Latency: time required to initiate a message. This is the critical parameter in fine-grained codes, which require frequent interprocessor communications. Can be thought of as the time required to send a message of zero length.

  Bandwidth: steady-state rate at which data can be sent over the network. This is the critical parameter in coarse-grained codes, which require infrequent communication of large amounts of data.
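As a rough, illustrative model (numbers assumed, not from the slide), the time to send a message of size S is approximately t = latency + S/bandwidth. With 10 microseconds of latency and 100 MB/s of bandwidth, a 100-byte message takes about 11 microseconds (latency-dominated), while a 10 MB message takes about 0.1 seconds (bandwidth-dominated).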


    Latency and Bandwidth Example

Bucket brigade: the old style of fighting fires, in which the townspeople formed a line from the well to the fire and passed buckets of water down the line.

  latency - the delay until the first bucket arrives at the fire
  bandwidth - the rate at which buckets arrive at the fire


More on Communications

Time spent performing communications is considered overhead. Try to minimize the impact of communications:

  minimize the effect of latency by combining large numbers of small messages into small numbers of large messages
  communications and computation do not have to be done sequentially; they can often be overlapped

  Sequential:  t = t(comp) + t(comm)
  Overlapped:  t = t(comp) + t(comm) - t(comp ∩ comm)

Combining Small Messages into Larger Ones

The following examples of "phoning home" illustrate the value of combining many small messages into a single larger one.

Many short calls:
  dial ... "Hi mom" ... hang up
  dial ... "How are things?" ... hang up
  dial ... "in the U.S.?" ... hang up
  dial ... (at this point many mothers would not pick up the next call)

One longer call:
  dial ... "Hi mom. How are things in the U.S.? Yak, yak..." ... hang up

By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.

Overlapping Communications and Computations

In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors (PE 0 and PE 1). Assume periodic boundary conditions.

Stencil operation: y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

Boundary elements require data from the neighboring processor; interior elements do not.

  1. Initiate communications
  2. Perform computations on interior elements
  3. Wait until communications are finished
  4. Perform computations on boundary elements

[Figure: the 10 x 10 grid split between PE 0 and PE 1, with boundary and interior elements marked.]
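A hedged sketch of steps 1-4 in MPI Fortran (not from the original slides). It assumes exactly two processors, each holding a 10 x 5 block of the array plus two ghost columns; the tags, initialization, and the skipped edge rows are illustrative simplifications. Non-blocking sends/receives let the interior computation overlap the communication:

      program stencil_overlap
      include 'mpif.h'
c     assumes exactly two processors, each owning a 10 x 5 block of
c     the 10 x 10 array; columns 0 and 6 are ghost columns holding
c     the neighbor's boundary data (periodic boundary conditions)
      real x(10,0:6), y(10,5)
      integer req(4), stats(MPI_STATUS_SIZE,4)
      integer ierr, myid, other, i, j
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      other = 1 - myid
      do j = 1, 5
         do i = 1, 10
            x(i,j) = real(i + j)
         enddo
      enddo
c     1. initiate communications for the two ghost columns
      call MPI_IRECV(x(1,0), 10, MPI_REAL, other, 1,
     &               MPI_COMM_WORLD, req(1), ierr)
      call MPI_IRECV(x(1,6), 10, MPI_REAL, other, 2,
     &               MPI_COMM_WORLD, req(2), ierr)
      call MPI_ISEND(x(1,5), 10, MPI_REAL, other, 1,
     &               MPI_COMM_WORLD, req(3), ierr)
      call MPI_ISEND(x(1,1), 10, MPI_REAL, other, 2,
     &               MPI_COMM_WORLD, req(4), ierr)
c     2. compute interior elements while the messages are in flight
      do j = 2, 4
         do i = 2, 9
            y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)
         enddo
      enddo
c     3. wait until communications are finished
      call MPI_WAITALL(4, req, stats, ierr)
c     4. compute boundary columns, which need the ghost data
c        (the top and bottom rows are skipped here for brevity)
      do j = 1, 5, 4
         do i = 2, 9
            y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)
         enddo
      enddo
      call MPI_FINALIZE(ierr)
      end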

Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

  tN = (fp/N + fs) t1      effect of multiple processors on run time

  S = 1/(fs + fp/N)        effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors
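A worked example (numbers chosen for illustration, not from the slide): with fs = 0.01 and N = 100, S = 1/(0.01 + 0.99/100) = 1/0.0199 ≈ 50, so even a 1% serial fraction cuts the ideal 100x speedup roughly in half.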

Illustration of Amdahl's Law

[Plot: speedup vs. number of processors for several serial fractions.]

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

Amdahl's Law vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Plot: speedup vs. number of processors (0 to 250) for f = 0.99, comparing the Amdahl's Law curve with the lower speedup observed in practice.]

More on Amdahl's Law

Amdahl's Law can be generalized to any two processes with different speeds.

Example: apply it to f(processor) and f(memory). The growing processor-memory performance gap will undermine our efforts at achieving the maximum possible speedup!

Generalized Amdahl's Law

Amdahl's Law can be further generalized to handle an arbitrary number of processes of various speeds. (The total fractions representing each process must still equal 1.)

  Ravg = 1 / (f1/R1 + f2/R2 + ... + fN/RN)

This is a weighted harmonic mean. Application performance is limited by the performance of the slowest component as much as it is determined by the fastest.
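An illustrative example (numbers assumed, not from the slide): if half of the work runs at R1 = 10 Gflops and the other half at R2 = 1 Gflops, then Ravg = 1/(0.5/10 + 0.5/1) = 1/0.55 ≈ 1.8 Gflops, much closer to the slow rate than to the fast one.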

Gustafson's Law

Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

There is a way around this: increase the problem size.

  bigger problems mean bigger grids or more particles: bigger arrays
  number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases
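(For reference, and not stated on the original slide: Gustafson's Law is often written as a scaled speedup S(N) = fs + N*fp = N - fs(N - 1), where fs and fp are the serial and parallel fractions of the run time measured on the N-processor system.)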

    The 1st Question to Ask Yourself


    Before You Parallelize Your Code

Is it worth my time?

  Do the CPU requirements justify parallelization?
  Do I need a parallel machine in order to get enough aggregate memory?
  Will the code be used just once, or will it be a major production code?

Your time is valuable, and it can be very time consuming to write, debug, and test a parallel code. The more time you spend writing a parallel code, the less time you have to spend doing your research.

    The 2nd Question to Ask Yourself


    Before You Parallelize Your Code

How should I decompose my problem?

  Do the computations consist of a large number of small, independent problems - trajectories, parameter space studies, etc.? May want to consider a scheme in which each processor runs the calculation for a different set of data.

  Does each computation have large memory or CPU requirements? Will probably have to break up a single problem across multiple processors.


    Distributing the Data

Decision on how to distribute the data should consider these issues:

  Load balancing: often implies an equal distribution of data, but more generally means an equal distribution of work

  Communications: want to minimize the impact of communications, taking into account both the size and the number of messages

  Physics: choice of distribution will depend on the processes that are being modeled in each direction

    A Data Distribution Example


[Figure: two alternative distributions of a 2-D grid across processors.]

First distribution: a good distribution if the physics of the problem is the same in both directions; minimizes the amount of data that must be communicated between processors.

Second distribution: if expensive global operations need to be carried out in the x-direction (e.g., FFTs), this is probably a better choice.

    A More Difficult Example


Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object.

[Figure: a 2-D grid with an irregularly shaped object covering part of it.]

Neither data distribution from the previous example will result in good load balancing.

May need to consider an irregular grid or a different data structure.


    Choosing a Resource

The following factors should be taken into account when choosing a resource:

  What is the granularity of my code?
  Are there any special hardware features that I need or can take advantage of?
  How many processors will the code be run on?
  What are my memory requirements?

By carefully considering these points, you can make the right choice of computational platform.

    Choosing a Resource: Granularity


Granularity is a measure of the amount of work done by each processor between synchronization events.

[Figure: timelines for PE 0 and PE 1 in a low-granularity application (frequent synchronization) and a high-granularity application (infrequent synchronization).]

Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.

Choosing a Resource: Special Hardware Features

Various HPC platforms have different hardware features that your code may be able to take advantage of. Examples include:

  Hardware support for divide and square root operations (IBM SP)
  Parallel I/O file system (IBM SP)
  Data streams (CRAY T3E)
  Control over cache alignment (CRAY T3E)
  E-registers for bypassing the cache hierarchy (CRAY T3E)


    Importance of Parallel Computing

High performance computing has become almost synonymous with parallel computing.

Parallel computing is necessary to solve big problems (high resolution, lots of timesteps, etc.) in science and engineering.

Developing and maintaining efficient, scalable parallel applications is difficult. However, the payoff can be tremendous.


    Importance of Parallel Computing

Before jumping in, think about:
  whether or not your code truly needs to be parallelized
  how to decompose your problem

Then choose a programming model based on your problem and your available architecture.

Take advantage of the resources that are available - compilers, libraries, debuggers, performance analyzers, etc. - to help you write efficient parallel code.


    Useful References

Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach.

Patterson, D. A. and Hennessy, J. L., Computer Organization and Design: The Hardware/Software Interface.

D. Dowd, High Performance Computing.

D. Kuck, High Performance Computing. Oxford U. Press (New York), 1996.

D. Culler and J. P. Singh, Parallel Computer Architecture.


    Outline

Preface

What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Distributed Computing

Concept has been used for two decades.

Basic idea: run a scheduler across systems to run processes on the least-used systems first

  Maximize utilization
  Minimize turnaround time

Have to load executables and input files to the selected resource

  Shared file system
  File transfers upon resource selection

Examples of Distributed Computing

Workstation farms, Condor flocks, etc.
  Generally share a file system

SETI@home, Entropia, etc.
  Only one source code; a central server copies the correct binary code and input data to each system

Napster, Gnutella: file/data sharing

NetSolve
  Runs numerical kernels on any of multiple independent systems, much like a Grid solution

SETI@home: Global Distributed Computing

Running on 500,000 PCs, ~1000 CPU years per day

    485,821 CPU Years so far

    Sophisticated Data & Signal Processing Analysis

    Distributes Datasets from Arecibo Radio Telescope

Distributed vs. Parallel Computing

Different:

  Distributed computing executes independent (but possibly related) applications on different systems; jobs do not communicate with each other

  Parallel computing executes a single application across processors, distributing the work and/or data but allowing communication between processes

Non-exclusive: can distribute parallel applications to parallel computing systems

Grid Computing

Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, trust relationships.

Resources (HPC systems, visualization systems & displays, storage systems, sensors, instruments, people) are integrated via middleware to facilitate use of all resources.

Why Grids?

Resources have different functions, but multiple classes of resources are necessary for most interesting problems.

The power of any single resource is small compared to aggregations of resources.

Network connectivity is increasing rapidly in bandwidth and availability.

Large problems require teamwork and computation.

    Network Bandwidth Growth


Network vs. computer performance:

Computer speed doubles every 18 months

Network speed doubles every 9 months

Difference = an order of magnitude per 5 years (a worked check follows this list)

1986 to 2000: computers x 500; networks x 340,000

2001 to 2010: computers x 60; networks x 4,000
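
A quick worked check of the "order of magnitude per 5 years" claim, assuming the doubling times quoted above (18 months for computers, 9 months for networks) over a 5-year (60-month) window:

\[
\text{computers: } 2^{60/18} \approx 10\times, \qquad
\text{networks: } 2^{60/9} \approx 100\times
\]
\[
\frac{2^{60/9}}{2^{60/18}} = 2^{60/18} \approx 10,
\]
so networks gain roughly one order of magnitude on computers every 5 years.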

Graph: Moore's Law vs. storage improvements vs. optical improvements. From Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.

    Grid Possibilities


A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

1,000 physicists worldwide pool resources for petaflop analyses of petabytes of data

Civil engineers collaborate to design, execute, & analyze shake table experiments

Climate scientists visualize, annotate, & analyze terabyte simulation datasets

An emergency response team couples real-time data, weather model, population data

    Some Grid Usage Models


Distributed computing: job scheduling on Grid resources with secure, automated data transfer

Workflow: synchronized scheduling and automated data transfer from one system to the next in a pipeline (e.g., HPC system to visualization lab to storage system)

Coupled codes, with pieces running on different systems simultaneously

Meta-applications: parallel apps spanning multiple systems

    Grid Usage Models


Some models are similar to models already being used, but are much simpler due to:

    single sign-on

    automatic process scheduling

    automated data transfers

But Grids can encompass new resources like sensors and instruments, so new usage models will arise

    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

Access Grid (www.mcs.anl.gov/FL/accessgrid; DOE, NSF): Create & deploy group collaboration systems using commodity technologies

BlueGrid (IBM): Grid testbed linking IBM laboratories

DISCOM (www.cs.sandia.gov/discom; DOE Defense Programs): Create operational Grid providing access to resources at three U.S. DOE weapons laboratories

DOE Science Grid (sciencegrid.org; DOE Office of Science): Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities

Earth System Grid, ESG (earthsystemgrid.org; DOE Office of Science): Delivery and analysis of large climate model datasets for the climate research community

European Union (EU) DataGrid (eu-datagrid.org; European Union): Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics


    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

EuroGrid, Grid Interoperability (GRIP) (eurogrid.org; European Union): Create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus

Fusion Collaboratory (fusiongrid.org; DOE Office of Science): Create a national computational collaboratory for fusion research

Globus Project (globus.org; DARPA, DOE, NSF, NASA, Microsoft): Research on Grid technologies; development and support of the Globus Toolkit; application and deployment

GridLab (gridlab.org; European Union): Grid technologies and applications

GridPP (gridpp.ac.uk; U.K. eScience): Create & apply an operational grid within the U.K. for particle physics research

Grid Research Integration Development & Support Center (grids-center.org; NSF): Integration, deployment, support of the NSF Middleware Infrastructure for research & education


    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

Grid Application Development Software (hipersoft.rice.edu/grads; NSF): Research into program development technologies for Grid applications

Grid Physics Network (griphyn.org; NSF): Technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS

Information Power Grid (ipg.nasa.gov; NASA): Create and apply a production Grid for aerosciences and other NASA missions

International Virtual Data Grid Laboratory (ivdgl.org; NSF): Create international Data Grid to enable large-scale experimentation on Grid technologies & applications

Network for Earthquake Engineering Simulation Grid (neesgrid.org; NSF): Create and apply a production Grid for earthquake engineering

Particle Physics Data Grid (ppdg.net; DOE Science): Create and apply production Grids for data analysis in high energy and nuclear physics experiments


    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

TeraGrid (teragrid.org; NSF): U.S. science infrastructure linking four major resource sites at 40 Gb/s

UK Grid Support Center (grid-support.ac.uk; U.K. eScience): Support center for Grid projects within the U.K.

Unicore (BMBFT): Technologies for remote access to supercomputers


There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.

    Example Application Projects


    Earth Systems Grid: environment (US DOE)

    EU DataGrid: physics, environment, etc. (EU)

    EuroGrid: various (EU)

    Fusion Collaboratory (US DOE)

    GridLab: astrophysics, etc. (EU)

    Grid Physics Network (US NSF)

    MetaNEOS: numerical optimization (US NSF)

    NEESgrid: civil engineering (US NSF)

    Particle Physics Data Grid (US DOE)

Some Grid Requirements: Systems/Deployment Perspective


    Identity & authentication

    Authorization & policy

    Resource discovery

    Resource characterization

    Resource allocation

    (Co-)reservation, workflow

    Distributed algorithms

    Remote data access

    High-speed data transfer

    Performance guarantees

    Monitoring

    Adaptation

    Intrusion detection

    Resource management

    Accounting & payment

    Fault management

    System evolution

    Etc.



The Systems Challenges: Resource Sharing Mechanisms That...


Address security and policy concerns of resource owners and users

Are flexible enough to deal with many resource types and sharing modalities

Scale to large numbers of resources, many participants, many program components

Operate efficiently when dealing with large amounts of data & computation

    The Security Problem


Resources being used may be extremely valuable & the problems being solved extremely sensitive

Resources are often located in distinct administrative domains

Each resource may have its own policies & procedures

The set of resources used by a single computation may be large, dynamic, and/or unpredictable

Not just client/server

It must be broadly available & applicable:

Standard, well-tested, well-understood protocols

Integration with a wide variety of tools

    The Resource Management Problem


Enabling secure, controlled remote access to computational resources and management of remote computation:

Authentication and authorization

Resource discovery & characterization

Reservation and allocation

    Computation monitoring and control

    Grid Systems Technologies


Systems and security problems are addressed by new protocols & services, e.g., Globus:

Grid Security Infrastructure (GSI) for security

Globus Metadata Directory Service (MDS) for discovery

Globus Resource Allocation Manager (GRAM) protocol as a basic building block

Resource brokering & co-allocation services

GridFTP for data movement

    The Programming Problem


How does a user develop robust, secure, long-lived applications for dynamic, heterogeneous Grids?

Presumably need:

Abstractions and models to add to speed/robustness/etc. of development

Tools to ease application development and diagnose common problems

Code/tool sharing to allow reuse of code components developed by others

    Grid Programming Technologies


Grid applications are incredibly diverse (data, collaboration, computing, sensors, ...)

Seems unlikely there is one solution

Most applications have been written from scratch, with or without Grid services

Application-specific libraries have been shown to provide significant benefits

No new language, programming model, etc., has yet emerged that transforms things

But certainly still quite possible

Examples of Grid Programming Technologies


MPICH-G2: Grid-enabled message passing

    CoG Kits, GridPort: Portal construction, based on N-tier architectures

GDMP, Data Grid Tools, SRB: replica management, collection management

Condor-G: simple workflow management

Legion: object models for Grid computing

Cactus: Grid-aware numerical solver framework

Note the tremendous variety and application focus

    MPICH-G2: A Grid-Enabled MPI


A complete implementation of the Message Passing Interface (MPI) for heterogeneous, wide-area environments

Based on the Argonne MPICH implementation of MPI (Gropp and Lusk)

Uses Globus services for authentication, resource allocation, executable staging, output, etc.

Programs run in the wide area without change! (a minimal example follows)

See also: MetaMPI, PACX, STAMPI, MAGPIE

www.globus.org/mpi
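
For illustration, a minimal MPI program of the kind MPICH-G2 targets: it uses only standard MPI calls, so in principle the same source can be compiled and launched across machines at different sites without modification. This sketch is not taken from the MPICH-G2 distribution.

/* Minimal MPI program: each process reports its rank and host name. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Under a Grid-enabled MPI, these hosts may sit in different
     * administrative domains, yet the program text is unchanged. */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

It would be compiled with the local MPI compiler wrapper (e.g., mpicc) and started through the site's usual launch mechanism; with MPICH-G2 the launch typically goes through Globus services, with details depending on the installation.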

    Grid Events


Global Grid Forum: working meeting

Meets 3 times/year, alternates U.S.-Europe, with the July meeting as the major event

HPDC: major academic conference

HPDC-11 in Scotland with GGF-8, July 2002

Other meetings include IPDPS, CCGrid, EuroGlobus, Globus Retreats

    www.gridforum.org, www.hpdc.org

    Useful References


Book: The Grid: Blueprint for a New Computing Infrastructure (Morgan Kaufmann), www.mkp.com/grids

Perspective on Grids: The Anatomy of the Grid: Enabling Scalable Virtual Organizations, IJSA, 2001, www.globus.org/research/papers/anatomy.pdf

All URLs in this section of the presentation, especially: www.gridforum.org, www.grids-center.org, www.globus.org

    Outline


Preface

What is High Performance Computing?

Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC

    Value of Understanding Future Trends


Monitoring and understanding future trends in HPC is important:

users: applications should be written to be efficient on current and future architectures

developers: tools should be written to be efficient on current and future architectures

computing centers: system purchases are expensive and should have upgrade paths

    The Next Decade


1980s and 1990s:

academic and government requirements strongly influenced parallel computing architectures

academic influence was greatest in developing parallel computing software (for science & eng.)

commercial influence grew steadily in the late 1990s

In the next decade:

commercialization will become dominant in determining the architecture of systems

academic/research innovations will continue to drive the development of HPC software

    Commercialization


Computing technologies (including HPC) are now propelled by profits, not sustained by subsidies

Web servers, databases, transaction processing, and especially multimedia applications drive the need for computational performance.

Most HPC systems are scaled-up commercial systems, with less additional hardware and software compared to the commercial systems.

It's not engineering, it's economics.

    Processors and Nodes


Easy predictions:

microprocessor performance increase continues at ~60% per year (Moore's Law) for 5+ years (a quick consistency check follows this list)

total migration to 64-bit microprocessors

use of even more cache, more memory hierarchy

increased emphasis on SMPs

Tougher predictions:

resurgence of vectors in microprocessors? Maybe

dawn of multithreading in microprocessors? Yes
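
As a quick consistency check (not from the slides), growth of ~60% per year is essentially the same as doubling every 18 months:

\[
1.60^{1.5} \approx 2.0
\]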

    Building Fat Nodes: SMPs


More processors are faster, of course

SMPs are the simplest form of parallel systems

efficient if not limited by memory bus contention: small numbers of processors

Commercial market for high-performance servers at low cost drives the need for SMPs

HPC market for highest performance and ease of programming drives the development of SMPs

    Building Fat Nodes: SMPs


Trends are to:

build bigger SMPs

attempt to share memory across SMPs (cc-NUMA)

    Resurgence of Vectors


    Vectors keep functional units busy

vector registers are very fast

vectors are more efficient for loops of any stride

vectors are great for many science & eng. apps

Possible resurgence of vectors:

SGI/Cray has built the SV1ex and is building the SV2

NEC continues building (CMOS) parallel-vector, Cray-like systems

Microprocessors (Pentium 4, G4) have added vector-like functionality for multimedia purposes

    Dawn of Multithreading?


Memory speed will always be a bottleneck

Must overlap computation with memory accesses: tolerate latency

requires an immense amount of parallelism

requires processors with multiple streams and compilers that can define multiple threads (an illustrative sketch follows)
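
As a rough software analogy (this is not how the MTA hardware works internally, and it is not from the slides): run several independent, memory-bound streams as threads, so that while one stream stalls on a memory access another can make progress on a processor that supports multiple concurrent threads. The array size, thread count, and access pattern below are arbitrary placeholders.

/* Several independent, cache-unfriendly streams, one per thread. */
#include <stdio.h>

#define N (1 << 20)          /* elements per stream; large enough to spill out of cache */
#define STREAMS 8            /* number of independent streams (threads) */

static int data[STREAMS][N];

int main(void)
{
    long sums[STREAMS];

    /* Fill each stream's array; the values are unimportant, the access pattern matters. */
    for (int s = 0; s < STREAMS; s++)
        for (int i = 0; i < N; i++)
            data[s][i] = i;

    /* Each thread walks its own array with a pseudo-random index sequence, so most
     * loads miss in cache; with several streams in flight, hardware that supports
     * multiple threads can overlap those stalls instead of sitting idle. */
    #pragma omp parallel for num_threads(STREAMS)
    for (int s = 0; s < STREAMS; s++) {
        unsigned idx = (unsigned)s;
        long sum = 0;
        for (int i = 0; i < N; i++) {
            idx = (idx * 1103515245u + 12345u) & (N - 1);  /* full-period LCG mod 2^20 */
            sum += data[s][idx];
        }
        sums[s] = sum;
    }

    for (int s = 0; s < STREAMS; s++)
        printf("stream %d checksum %ld\n", s, sums[s]);
    return 0;
}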

    Multithreading Diagram


    Multithreading


Tera MTA was the first multithreaded HPC system

scientific success, production failure

MTA-2 will be delivered in a few months.

Multithreading will be implemented (in more limited fashion) in commercial processors.

    Networks


Commercial network bandwidth and latency approaching custom performance.

    Dramatic performance increases likely

    the network is the computer (Sun slogan)

    more companies, more competition

    no severe physical, economic limits

    Implications of faster networks

    more clusters

    collaborative, visual supercomputing

    Grid computing

    Commodity Clusters


Clusters provide some real advantages:

computing power: leverage workstations and PCs

high availability: replace one at a time

inexpensive: leverage existing competitive market

simple path to installing a parallel computing system

Major disadvantages were robustness of hardware and software, but both have improved

NCSA has huge clusters in production based on Pentium III and Itanium.

    Clustering SMPs


Inevitable (already here!):

leverages SMP nodes effectively, for the same reasons clusters leverage individual processors

Commercial markets drive need for SMPs

Combine advantages of SMPs, clusters:

more powerful nodes through multiprocessing

more powerful nodes -> more powerful cluster

interconnect scalability requirements reduced for the same number of processors

    Continued Linux Growth in HPC


Linux popularity growing due to price and availability of source code

    Major players now supporting Linux, esp. IBM

    Head start on Intel Itanium

    Programming Tools


However, programming tools will continue to lag behind hardware and OS capabilities:

Researchers will continue to drive the need for the most powerful tools to create the most efficient applications on the largest systems

Such technologies will look more like MPI than the Web, and maybe worse due to multi-tiered clusters of SMPs (MPI + OpenMP; active messages + threads?); a hybrid sketch follows this list

Academia will continue to play a large role in HPC software development.
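
A minimal sketch of the hybrid MPI + OpenMP style mentioned above: OpenMP threads share the work within an SMP node, and MPI combines the partial results across nodes. The array size and the per-element arithmetic are placeholders, not something from the presentation.

/* Hybrid MPI + OpenMP: thread-level reduction inside each process,
 * then a message-passing reduction across processes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* elements handled by each MPI process (illustrative) */

static double x[N];

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        printf("threads per process: %d\n", omp_get_max_threads());

    /* Threads within the SMP node split the loop; the reduction clause
     * combines their partial sums into 'local'. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        x[i] = (rank * (double)N + i) * 1.0e-6;   /* stand-in for real work */
        local += x[i] * x[i];
    }

    /* MPI then combines the per-node results across the cluster. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of squares across %d processes = %g\n", size, total);

    MPI_Finalize();
    return 0;
}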

    Grid Computing


    Parallelism will continue to grow in the form of

    SMPs

    clusters

    Cluster of SMPs (and maybe DSMs)

    Grids provide the next level

    connects multiple computers into virtual systems

    Already here: IBM, other vendors supporting Globus

    SC2001 dominated by Grid technologies

    Many major government awards (>$100M in past year)

    Emergence of Grids


But Grids enable much more than apps running on multiple computers (which can be achieved with MPI alone):

virtual operating system: provides a global workspace/address space via a single login

automatically manages files, data, accounts, and security issues

connects other resources (archival data facilities, instruments, devices) and people (collaborative environments)

    Grids Are Inevitable


    Inevitable (at least in HPC):

leverages computational power of all available systems

manages resources as a single system, which is easier for users

provides the most flexible resource selection and management, load sharing

researchers' desire to solve bigger problems will always outpace performance increases of single systems; just as multiple processors are needed, multiple multiprocessors will be deemed so

    Grid-Enabled Software


Commercial applications on single parallel systems and Grids will require that:

underlying architectures must be invisible: no parallel computing expertise required

usage must be simple

development must not be too difficult

Developments in ease-of-use will benefit scientists as users (not as developers)

Web-based interfaces: transparent supercomputing (MPIRE, Meta-MEME, etc.)

Grid-Enabled Collaborative and Visual Supercomputing


    Commercial world demands:

    multimedia applications

    real-time data processing

    online transaction processing

rapid prototyping and simulation in engineering, chemistry, and biology

    interactive, remote collaboration

    3D graphics, animation and virtual reality

    visualization

Grid-Enabled Collaborative, Visual Supercomputing


Academic world will leverage the resulting Grids linking computing and visualization systems via high-speed networks:

collaborative post-processing of data is already here

simulations will be visualized in 3D, virtual worlds in real time

such simulations can then be steered

multiple scientists can participate in these visual simulations

the time to insight (SGI slogan) will be reduced

    Web-based Grid Computing


Web currently used mostly for content delivery

Web servers on HPC systems can execute applications

Web servers on Grids can launch applications, move/store/retrieve data, display visualizations, etc.

NPACI HotPage already enables single sign-on to NPACI Grid resources

    Summary of Expectations


HPC systems will grow in performance but probably change little in design (5-10 years):

HPC systems will be larger versions of smaller commercial systems, mostly large SMPs and clusters of inexpensive nodes

Some processors will exploit vectors, as well as more/larger caches.

The best HPC systems will have been designed top-down instead of bottom-up, but all will have been designed to make the bottom profitable.

Multithreading is the only likely near-term major architectural change.

    Summary of Expectations


Using HPC systems will change much more:

Grid computing will become widespread in HPC and in commercial computing

Visual supercomputing and collaborative simulation will be commonplace.

WWW interfaces to HPC resources will make transparent supercomputing commonplace.

But programming the most powerful resources most effectively will remain difficult.

    Caution


Change is difficult to predict (and I am an astrophysicist, not an astrologer):

Accuracy of linear-extrapolation predictions degrades over long times (like weather forecasts)

Entirely new ideas can change everything: the WWW is an excellent example; Grid computing is probably the next

Eventually, something truly different will replace CMOS technology (nanotechnology? molecular computing? DNA computing?)

    Final Prediction


The thing about change is that things will be different afterwards.

    Alan McMahon (Cornell University)