Introduction to Parallel Computers (Pendahuluan Paralel Komputer)


Introduction to High Performance Computing:
Parallel Computing, Distributed Computing, Grid Computing and More

Dr. Jay Boisseau
Director, Texas Advanced Computing Center
[email protected]

December 3, 2001

The University of Texas at Austin
Texas Advanced Computing Center


    Outline

    Preface

    What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Purpose

Purpose of this workshop: to educate researchers about the value and impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering.

Purpose of this presentation: to educate researchers about the techniques and tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing.


    Goals

Goals of this presentation are to help you:
1. understand the big picture of high performance computing
2. develop a comprehensive understanding of parallel computing
3. begin to understand how Grid and distributed computing will further enhance computational science capabilities


    Content and Context

This material is an introduction and an overview. It is not a comprehensive treatment of HPC, so further reading (much more!) is recommended.

This presentation is followed by additional speakers with detailed presentations on specific HPC and science topics.

Together, these presentations will help prepare you to use HPC in your scientific discipline.


    Background - me

Director of the Texas Advanced Computing Center (TACC) at the University of Texas

Formerly at the San Diego Supercomputer Center (SDSC) and the Arctic Region Supercomputing Center

10+ years in HPC

Have known Luis for 4 years; plan to develop a strong relationship between TACC and CeCalCULA


Background - TACC

Mission: to enhance the academic research capabilities of the University of Texas and its affiliates through the application of advanced computing resources and expertise

TACC activities include:
  Resources
  Support
  Development
  Applied research


    TACC Activities

TACC resources and support include:
  HPC systems
  Scientific visualization resources
  Data storage/archival systems

TACC research and development areas:
  HPC
  Scientific Visualization
  Grid Computing


    Current HPC Systems

[Diagram of current TACC HPC systems and network: CRAY SV1 (16 CPUs, 16 GB memory), CRAY T3E (256+ processors, 128 MB/processor, 500 GB disk), IBM SP (64+ processors, 256 MB/processor, 300 GB disk), a 640 GB archive, and the hosts aurora, golden, and azure, connected via FDDI, HiPPI, and an Ascend router.]


    New HPC Systems

Four IBM p690 HPC servers:
  16 Power4 processors per server
  1.3 GHz: 5.2 Gflops per processor, 83.2 Gflops per server
  16 GB shared memory, >200 GB/s memory bandwidth!
  144 GB disk

1 TB disk to partition across servers

Will configure as a single system (1/3 Tflop) with a single GPFS file system (1 TB) in 2Q02


    New HPC Systems

IA64 Cluster:
  20 2-way nodes
  Itanium (800 MHz) processors
  2 GB memory/node
  72 GB disk/node
  Myrinet 2000 switch
  180 GB shared disk

IA32 Cluster:
  32 2-way nodes
  Pentium III (1 GHz) processors
  1 GB memory
  18.2 GB disk/node
  Myrinet 2000 switch

750 GB IBM GPFS parallel file system for both clusters


    World-Class Vislab

SGI Onyx2: 24 CPUs, 6 InfiniteReality2 graphics pipelines, 24 GB memory, 750 GB disk

Front and rear projection systems:
  3x1 cylindrically-symmetric Power Wall
  5x2 large-screen, 16:9 panel Power Wall

Matrix switch between systems, projectors, rooms


    More Information

URL: www.tacc.utexas.edu

E-mail addresses:
  General information: [email protected]
  Technical assistance: [email protected]

Telephone numbers:
  Main office: (512) 475-9411
  Facsimile transmission: (512) 475-9445
  Operations room: (512) 475-9410


    Outline

    Preface

    What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Supercomputing

First HPC systems were vector-based systems (e.g., Cray)

  named "supercomputers" because they were an order of magnitude more powerful than commercial systems

Now, "supercomputer" has little meaning

  large systems are now just scaled-up versions of smaller systems

However, "high performance computing" has many meanings


    HPC Defined

High performance computing:

can mean high flop count
  per processor
  totaled over many processors working on the same problem
  totaled over many processors working on related problems

can mean faster turnaround time
  more powerful system
  scheduled to first available system(s)
  using multiple systems simultaneously


    My Definitions

HPC: any computational technique that solves a large problem faster than possible using single, commodity systems

  Custom-designed, high-performance processors (e.g., Cray, NEC)
  Parallel computing
  Distributed computing
  Grid computing


    My Definitions

Parallel computing: single systems with many processors working on the same problem

Distributed computing: many systems loosely coupled by a scheduler to work on related problems

Grid Computing: many systems tightly coupled by software and networks to work together on single problems or on related problems


    Importance of HPC

HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry.

Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.


    Outline

    Preface

    What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Parallel vs. Serial Computers

Two big advantages of parallel computers:
1. total performance
2. total memory

Parallel computers enable us to solve problems that:
  benefit from, or require, fast solution
  require large amounts of memory

Example that requires both: weather forecasting


    Parallel vs. Serial Computers

Some benefits of parallel computing include:

  more data points
    bigger domains
    better spatial resolution
    more particles

  more time steps
    longer runs
    better temporal resolution

  faster execution
    faster time to solution
    more solutions in same time
    larger simulations in real time


    Serial Processor Performance

[Plot: single-processor performance vs. time (years).]

Although Moore's Law predicts that single-processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached.


    Types of Parallel Computers

The simplest and most useful way to classify modern parallel computers is by their memory model:

  shared memory
  distributed memory


Shared vs. Distributed Memory

[Diagram: shared memory - processors (P) connected to one memory over a bus; distributed memory - processor/memory (P/M) pairs connected by a network.]

Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)

Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)


Shared Memory: UMA vs. NUMA

[Diagram: UMA - all processors share one bus to a single memory; NUMA - two bus-connected processor/memory groups joined by a network.]

Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs. (Sun E10000)

Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (SGI Origin)


    Distributed Memory: MPPs vs. Clusters

Processor-memory nodes are connected by some type of interconnect network

  Massively Parallel Processor (MPP): tightly integrated, single system image
  Cluster: individual computers connected by s/w

[Diagram: CPU/memory nodes connected by an interconnect network.]


    Processors, Memory, & Networks

Both shared and distributed memory systems have:
1. processors: now generally commodity RISC processors
2. memory: now generally commodity DRAM
3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

We will now begin to describe these pieces in detail, starting with definitions of terms.


    Processor-Related Terms

Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).

Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.

Register: a small, extremely fast location for storing data or instructions in the processor.


    Processor-Related Terms

Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.

Pipeline: technique enabling multiple instructions to be overlapped in execution.

Superscalar: multiple instructions are possible per clock period.

Flops: floating point operations per second.


    Processor-Related Terms

Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly.

Translation-Lookaside Buffer (TLB): keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).


    Memory-Related Terms

SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.

DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).

Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.


    Interconnect-Related Terms

Latency:
  Networks: how long does it take to start sending a "message"? Measured in microseconds.
  Processors: how long does it take to output results of some operations, such as floating point add, divide, etc., which are pipelined?

Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec.


    Interconnect-Related Terms

Topology: the manner in which the nodes are connected.

  Best choice would be a fully connected network (every processor to every other). Infeasible for cost and scaling reasons.
  Instead, processors are arranged in some variation of a grid, torus, or hypercube.

[Diagrams: 3-d hypercube, 2-d mesh, 2-d torus.]


    Processor-Memory Problem

Processors issue instructions roughly every nanosecond.

DRAM can be accessed roughly every 100 nanoseconds (!).

DRAM cannot keep processors busy! And the gap is growing:

  processors getting faster by 60% per year
  DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)


    Processor-Memory Performance Gap

[Chart: CPU vs. DRAM performance, 1980-2000, on a log scale. CPU performance (Moore's Law) improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50%/year. From D. Patterson, CS252, Spring 1998, UCB.]


    Processor-Memory Performance Gap

Problem becomes worse when remote (distributed or NUMA) memory is needed

  network latency is roughly 1000-10000 nanoseconds (roughly 1-10 microseconds)
  networks getting faster, but not fast enough

Therefore, cache is used in all processors

  almost as fast as processors (same circuitry)
  sits between processors and local memory
  expensive, can only use small amounts
  must design system to load cache effectively


Processor-Cache-Memory

[Diagram: CPU, cache, and main memory.]

Cache is much smaller than main memory, and hence there is a mapping of data from main memory to cache.


Memory Hierarchy

[Diagram: CPU, cache, local memory, remote memory. Moving away from the CPU, speed decreases while size increases and cost per bit decreases.]


    Cache-Related Terms

ICACHE: instruction cache

DCACHE (L1): data cache closest to registers

SCACHE (L2): secondary data cache
  Data from SCACHE has to go through DCACHE to registers
  SCACHE is larger than DCACHE
  Not all processors have SCACHE


    Cache Benefits

Data cache was designed with two key concepts in mind:

  Spatial locality
    When an element is referenced, its neighbors will be referenced too
    Cache lines are fetched together
    Work on consecutive data elements in the same cache line

  Temporal locality
    When an element is referenced, it might be referenced again soon
    Arrange code so that data in cache is reused often
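A minimal sketch (not from the original slides) of what "work on consecutive data elements in the same cache line" means in practice. Fortran stores arrays column-major; the array name and size below are illustrative:

      program loop_order
      parameter (n=1000)
      real a(n,n)
      integer i, j
c     cache-friendly: the inner loop runs over the first index, so
c     successive iterations touch consecutive memory locations that
c     share a cache line (spatial locality)
      do j = 1, n
         do i = 1, n
            a(i,j) = 0.0
         enddo
      enddo
c     cache-unfriendly: the inner loop strides through memory by n
c     elements, touching a different cache line almost every iteration
      do i = 1, n
         do j = 1, n
            a(i,j) = 0.0
         enddo
      enddo
      end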


Direct-Mapped Cache

[Diagram: blocks of main memory mapping onto cache lines.]

Direct-mapped cache: a block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
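As an illustration (cache size assumed, not from the slide): in a direct-mapped cache with 256 lines, the block at memory block address B goes to cache line B mod 256, so blocks 0, 256, 512, ... all compete for the same line.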


Fully Associative Cache

[Diagram: any block of main memory mapping to any cache line.]

Fully associative cache: a block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.


Set Associative Cache

[Diagram: 2-way set-associative cache and main memory.]

Set associative cache: the middle range of designs between direct-mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache, a block from main memory can go into N (N > 1) locations in the cache.


    Cache-Related Terms

Least Recently Used (LRU): cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.

Random Replace: cache replacement strategy for set associative caches. A cache block is randomly replaced.


    Example: CRAY T3E Cache

The CRAY T3E processors can execute:
  2 floating point ops (1 add, 1 multiply) and
  2 integer/memory ops (includes 2 loads or 1 store)

To help keep the processors busy:
  on-chip 8 KB direct-mapped data cache
  on-chip 8 KB direct-mapped instruction cache
  on-chip 96 KB 3-way set associative secondary data cache with random replacement


    Putting the Pieces Together

Recall:

Shared memory architectures:
  Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000
  Non-Uniform Memory Access (NUMA): most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA), systems. Ex: SGI Origin 2000

Distributed memory architectures:
  Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP
  Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters


    Symmetric Multiprocessors (SMPs)

SMPs connect processors to global shared memory using one of:
  bus
  crossbar

Provides a simple programming model, but has problems:
  buses can become saturated
  crossbar size must increase with # processors

Problem grows with number of processors, limiting maximum size of SMPs


    Shared Memory Programming

Programming models are easier since message passing is not necessary. Techniques:

  autoparallelization via compiler options
  loop-level parallelism via compiler directives
  OpenMP
  pthreads

More on programming models later.


    Massively Parallel Processors

Each processor has its own memory:
  memory is not shared globally
  adds another layer to the memory hierarchy (remote memory)

Processor/memory nodes are connected by an interconnect network:
  many possible topologies
  processors must pass data via messages
  communication overhead must be minimized


    Communications Networks

Custom
  Many vendors have custom interconnects that provide high performance for their MPP system
  CRAY T3E interconnect is the fastest for MPPs: lowest latency, highest bandwidth

Commodity
  Used in some MPPs and all clusters
  Myrinet, Gigabit Ethernet, Fast Ethernet, etc.


    Types of Interconnects

Fully connected: not feasible

Array and torus: Intel Paragon (2D array), CRAY T3E (3D torus)

Crossbar: IBM SP (8 nodes)

Hypercube and fat tree: SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)

Combinations of some of the above:
  IBM SP (crossbar & fully connected for 80 nodes)
  IBM SP (fat tree for > 80 nodes)


    Clusters

Similar to MPPs:
  Commodity processors and memory
    Processor performance must be maximized
  Memory hierarchy includes remote memory
  No shared memory--message passing
    Communication overhead must be minimized

Different from MPPs:
  All commodity, including interconnect and OS
  Multiple independent systems: more robust
  Separate I/O systems


    Cluster Pros and Cons

Pros:
  Inexpensive
  Fastest processors first
  Potential for true parallel I/O
  High availability

Cons:
  Less mature software (programming and system)
  More difficult to manage (changing slowly)
  Lower performance interconnects: not as scalable to large numbers (but have almost caught up!)


    Distributed Memory Programming

Message passing is most efficient:
  MPI
  MPI-2

Active/one-sided messages:
  Vendor libraries: SHMEM (T3E), LAPI (SP)
  Coming in MPI-2

Shared memory models can be implemented in software, but are not as efficient.

More on programming models in the next section.


    Distributed Shared Memory

More generally called cc-NUMA (cache coherent NUMA)

Consists of m SMPs with n processors each in a global address space:
  Each processor has some local memory (SMP)
  All processors can access all memory: extra directory hardware on each SMP tracks values stored in all SMPs
  Hardware guarantees cache coherency
  Access to memory on other SMPs is slower (NUMA)


    Distributed Shared Memory

Easier to build because of slower access to remote memory (no expensive bus/crossbar)

Similar cache problems

Code writers should be aware of data distribution

Load balance: minimize access of far memory


    DSM Rationale and Realities

Rationale: combine the ease of SMP programming with the scalability of MPP programming, at much the cost of an MPP

Reality: NUMA introduces additional layers in the memory hierarchy relative to SMPs, so scalability is limited if programmed as an SMP

Reality: performance and high scalability require programming to the architecture.


    Clustered SMPs

Simpler than DSMs:
  composed of nodes connected by a network, like an MPP or cluster
  each node is an SMP
  processors on one SMP do not share memory on other SMPs (no directory hardware in SMP nodes)
  communication between SMP nodes is by message passing

Ex: IBM Power3-based SP systems


Clustered SMP Diagram

[Diagram: two SMP nodes, each with four processors sharing memory over a bus, connected to each other by a network.]


    Reasons for Clustered SMPs

Natural extension of SMPs and clusters:
  SMPs offer great performance up to their crossbar/bus limit
  Connecting nodes is how memory and performance are increased beyond SMP levels
  Can scale to larger numbers of processors with a less scalable interconnect

Maximum performance:
  Optimize at SMP level - no communication overhead
  Optimize at MPP level - fewer messages necessary for same number of processors


    Clustered SMP Drawbacks

Clustering SMPs has drawbacks:
  No shared memory access over the entire system, unlike DSMs
  Has other disadvantages of DSMs
    Extra layer in memory hierarchy
    Performance requires more effort from the programmer than SMPs or MPPs

However, clustered SMPs provide a means for obtaining very high performance and scalability


    Clustered SMP: NPACI Blue Horizon

IBM SP system:
  Power3 processors: good peak performance (~1.5 Gflops)
  better sustained performance (highly superscalar and pipelined) than for many other processors
  SMP nodes have 8 Power3 processors
  System has 144 SMP nodes (1152 processors total)


    Programming Clustered SMPs

NSF: most users use only MPI, even for intra-node messages

DoE: most applications are being developed with MPI (between nodes) and OpenMP (intra-node)

MPI+OpenMP programming is more complex, but might yield maximum performance

Active messages and pthreads would theoretically give maximum performance


Types of Parallelism

Data parallelism: each processor performs the same task on different sets or sub-regions of data

Task parallelism: each processor performs a different task

Most parallel applications fall somewhere on the continuum between these two extremes.


    Data vs. Task Parallelism

Example of data parallelism:
  In a bottling plant, we see several processors, or bottle cappers, applying bottle caps concurrently on rows of bottles.

Example of task parallelism:
  In a restaurant kitchen, we see several chefs, or processors, working simultaneously on different parts of different meals.
  A good restaurant kitchen also demonstrates load balancing and synchronization--more on those topics later.


    Example: Master-Worker Parallelism

A common form of parallelism used in developing applications years ago (especially in PVM) was master-worker parallelism:

  a single processor is responsible for distributing data and collecting results (task parallelism)
  all other processors perform the same task on their portion of the data (data parallelism)
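A minimal, hedged sketch of this pattern in MPI-style Fortran, in the spirit of the examples later in this deck; the chunk size, message tags, and the placeholder "computation" are illustrative, not from the original:

      program master_worker
      include 'mpif.h'
      integer ierr, myid, nprocs, status(MPI_STATUS_SIZE)
      integer iw, i
      real work(100), result
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
c        master: distribute one chunk of (placeholder) data to each
c        worker, then collect one result from each worker
         do iw = 1, nprocs - 1
            do i = 1, 100
               work(i) = real(iw*100 + i)
            enddo
            call MPI_SEND(work, 100, MPI_REAL, iw, 1,
     &                    MPI_COMM_WORLD, ierr)
         enddo
         do iw = 1, nprocs - 1
            call MPI_RECV(result, 1, MPI_REAL, iw, 2,
     &                    MPI_COMM_WORLD, status, ierr)
         enddo
      else
c        worker: receive a chunk, do the same (placeholder) task on
c        it, and send the result back to the master
         call MPI_RECV(work, 100, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, status, ierr)
         result = 0.0
         do i = 1, 100
            result = result + work(i)
         enddo
         call MPI_SEND(result, 1, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end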


    Parallel Programming Models

The primary programming models in current use are:

  Data parallelism - operations are performed in parallel on collections of data structures. A generalization of array operations.

  Message passing - processes possess local memory and communicate with other processes by sending and receiving messages.

  Shared memory - each processor has access to a single shared pool of memory.


    Parallel Programming Models

Most parallelization efforts fall under the following categories:

  Codes can be parallelized using message-passing libraries such as MPI.
  Codes can be parallelized using compiler directives such as OpenMP.
  Codes can be written in new parallel languages.


Programming Models and Architectures

Natural mappings:
  data parallel: CM-2 (SIMD machine)
  message passing: IBM SP (MPP)
  shared memory: SGI Origin, Sun E10000

Implemented mappings:
  HPF (a data parallel language) and MPI (a message passing library) have been implemented on nearly all parallel machines
  OpenMP (a set of directives, etc. for shared memory programming) has been implemented on most shared memory systems


    SPMD

All current machines are MIMD systems (Multiple Instruction, Multiple Data) and are capable of either data parallelism or task parallelism.

The primary paradigm for programming parallel machines is the SPMD paradigm: Single Program, Multiple Data

  each processor runs a copy of the same source code
  enables data parallelism (through data decomposition) and task parallelism (through intrinsic functions that return the processor ID)
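A minimal sketch (not from the slides) of the SPMD idea in MPI Fortran: every processor runs the same program, and the processor ID selects which block of the data this copy works on. The array contents and the block split are illustrative:

      program spmd_add
      include 'mpif.h'
      parameter (n=1000)
      real x(n), y(n), z(n)
      integer ierr, myid, nprocs, nlocal, istart, iend, i
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     every processor runs this same code; the processor ID (rank)
c     selects which block of the arrays this copy works on
      nlocal = n / nprocs
      istart = myid*nlocal + 1
      iend   = istart + nlocal - 1
      if (myid .eq. nprocs - 1) iend = n
      do i = istart, iend
         y(i) = real(i)
         z(i) = real(2*i)
         x(i) = y(i) + z(i)
      enddo
      call MPI_FINALIZE(ierr)
      end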


    OpenMP - Shared Memory Standard

OpenMP is a new standard for shared memory programming: SMPs and cc-NUMAs.

OpenMP provides a standard set of directives, run-time library routines, and environment variables for parallelizing code under a shared memory model.

Very similar to Cray PVP autotasking directives, but with much more functionality. (Cray now supports OpenMP.)

See http://www.openmp.org for more information.


OpenMP Example

Fortran 77:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

Fortran 77 + OpenMP:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
!$OMP PARALLEL DO
      do i=1,n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

The !$OMP PARALLEL DO directive specifies that the loop is executed in parallel: each processor executes a subset of the loop iterations.


    MPI - Message Passing Standard

MPI has emerged as the standard for message passing in both C and Fortran programs. No longer need to know MPL, PVM, TCGMSG, etc.

MPI is both large and small:

  MPI is large, since it contains 125 functions which give the programmer fine control over communications
  MPI is small, since message passing programs can be written using a core set of just six functions
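As a hedged illustration of that small core (a sketch, not part of the original slides), here is a complete Fortran program that uses only MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, and MPI_FINALIZE; the tag and buffer contents are arbitrary:

      program six_functions
      include 'mpif.h'
      integer ierr, myid, nprocs, status(MPI_STATUS_SIZE)
      real x
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      x = real(myid)
c     processor 0 sends one real to processor 1, which receives it
      if (myid .eq. 0 .and. nprocs .gt. 1) then
         call MPI_SEND(x, 1, MPI_REAL, 1, 100, MPI_COMM_WORLD, ierr)
      else if (myid .eq. 1) then
         call MPI_RECV(x, 1, MPI_REAL, 0, 100, MPI_COMM_WORLD,
     &                 status, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end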


MPI Examples - Send and Receive

MPI messages are two-way: they require a send and a matching receive.

PE 0 calls MPI_SEND to pass the real variable x to PE 1; PE 1 calls MPI_RECV to receive the real variable y from PE 0:

      if (myid.eq.0) then
         call MPI_SEND(x,1,MPI_REAL,1,100,MPI_COMM_WORLD,ierr)
      endif
      if (myid.eq.1) then
         call MPI_RECV(y,1,MPI_REAL,0,100,MPI_COMM_WORLD,
     &                 status,ierr)
      endif


MPI Example - Global Operations

MPI also has global operations to broadcast and reduce (collect) information.

PE 6 collects the single (1) integer value n from all other processors and puts the sum (MPI_SUM) into allsum:

      call MPI_REDUCE(n,allsum,1,MPI_INTEGER,MPI_SUM,6,
     &                MPI_COMM_WORLD,ierr)

PE 5 broadcasts the single (1) integer value n to all other processors:

      call MPI_BCAST(n,1,MPI_INTEGER,5,MPI_COMM_WORLD,ierr)


    MPI Implementations

MPI is typically implemented on top of the highest performance native message passing library for every distributed memory machine.

  MPI is a natural model for distributed memory machines (MPPs, clusters)
  MPI offers higher performance on DSMs beyond the size of an individual SMP
  MPI is useful between SMPs that are clustered
  MPI can be implemented on shared memory machines


    Extensions to MPI: MPI-2

A standard for MPI-2 has been developed which extends the functionality of MPI. New features include:

  One-sided communications - eliminates the need to post matching sends and receives. Similar in functionality to the SHMEM PUT and GET on the CRAY T3E (most systems have an analogous library)
  Support for parallel I/O
  Extended collective operations

No full implementation yet - it is difficult for vendors


    MPI vs. OpenMP

There is no single best approach to writing a parallel code. Each has pros and cons:

  MPI - powerful, general, and universally available message passing library which provides very fine control over communications, but forces the programmer to operate at a relatively low level of abstraction.

  OpenMP - conceptually simple approach for creating parallel codes on shared memory machines, but not applicable to distributed memory platforms.


    MPI vs. OpenMP

MPI is the most general (problem types) and portable (platforms, although not efficient for SMPs).

The architecture and the problem type often make the decision for you.


    Parallel Libraries

Finally, there are parallel mathematics libraries that enable users to write (serial) codes, then call parallel solver routines:

  ScaLAPACK is for solving dense linear systems of equations, eigenvalue, and least squares problems. Also see PLAPACK.
  PETSc is for solving linear and non-linear partial differential equations (includes various iterative solvers for sparse matrices).
  Many others: check NETLIB for a complete survey: http://www.netlib.org


    Hurdles in Parallel Computing

There are some hurdles in parallel computing:

  Scalar performance: fast parallel codes require efficient use of the underlying scalar hardware
  Parallel algorithms: not all scalar algorithms parallelize well; may need to rethink the problem
  Communications: need to minimize the time spent doing communications
  Load balancing: all processors should do roughly the same amount of work
  Amdahl's Law: fundamental limit on parallel computing


    Scalar Performance

Underlying every good parallel code is a good scalar code.

If a code scales to 256 processors but only gets 1% of peak performance, it is still a bad parallel code.

  Good news: everything that you know about serial computing will be useful in parallel computing!
  Bad news: it is difficult to get good performance out of the processors and memory used in parallel machines. Need to use cache effectively.


Serial Performance

[Plot: time to solution vs. number of processors for a serial code and a parallel code.]

In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used.


Use Cache Effectively

[Diagram: a simplified memory hierarchy - the CPU, a small and fast cache, and big but slow main memory.]

The data cache was designed with two key concepts in mind:

  Spatial locality - the cache is loaded an entire line (4-32 words) at a time to take advantage of the fact that if a location in memory is required, nearby locations will probably also be required.

  Temporal locality - once a word is loaded into cache it remains there until the cache line is needed to hold another word of data.


    Non-Cache Issues

There are other issues to consider to achieve good serial performance:

  Force reductions, e.g., replacement of divisions with multiplications-by-inverse
  Evaluate and replace common sub-expressions
  Pushing loops inside subroutines to minimize subroutine call overhead
  Force function inlining (compiler option)
  Perform interprocedural analysis to eliminate redundant operations (compiler option)


    Parallel Algorithms

The algorithm must be naturally parallel!

  Certain serial algorithms do not parallelize well.
  Developing a new parallel algorithm to replace a serial algorithm can be one of the most difficult tasks in parallel computing.
  Keep in mind that your parallel algorithm may involve additional work or a higher floating point operation count.


    Parallel Algorithms

Keep in mind that the algorithm should:

  need the minimum amount of communication (Monte Carlo algorithms are excellent examples)
  balance the load among the processors equally

Fortunately, a lot of research has been done in parallel algorithms, particularly in the area of linear algebra. Don't reinvent the wheel; take full advantage of the work done by others:

  use parallel libraries supplied by the vendor whenever possible!
  use ScaLAPACK, PETSc, etc. when applicable


Load Balancing

The figures below show the timeline for parallel codes run on two processors. In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors, resulting in a shorter time to solution.

[Figure: timelines for PE 0 and PE 1 showing busy time, idle time, and synchronization points for an unbalanced and a well-balanced run.]


    Communications

Two key parameters of the communications network are:

  Latency: time required to initiate a message. This is the critical parameter in fine-grained codes, which require frequent interprocessor communications. Can be thought of as the time required to send a message of zero length.

  Bandwidth: steady-state rate at which data can be sent over the network. This is the critical parameter in coarse-grained codes, which require infrequent communication of large amounts of data.
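As a rough, illustrative model (numbers assumed, not from the slide), the time to send a message of size S is approximately t = latency + S/bandwidth. With 10 microseconds of latency and 100 MB/s of bandwidth, a 100-byte message takes about 11 microseconds (latency-dominated), while a 10 MB message takes about 0.1 seconds (bandwidth-dominated).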


    Latency and Bandwidth Example

Bucket brigade: the old style of fighting fires, in which the townspeople formed a line from the well to the fire and passed buckets of water down the line.

  latency - the delay until the first bucket arrives at the fire
  bandwidth - the rate at which buckets arrive at the fire


More on Communications

Time spent performing communications is considered overhead. Try to minimize the impact of communications:

  minimize the effect of latency by combining large numbers of small messages into small numbers of large messages
  communications and computation do not have to be done sequentially; they can often be overlapped

  Sequential:  t = t(comp) + t(comm)
  Overlapped:  t = t(comp) + t(comm) - t(comp ∩ comm)

Combining Small Messages into Larger Ones

The following examples of "phoning home" illustrate the value of combining many small messages into a single larger one.

Many short calls:
  dial ... "Hi mom" ... hang up
  dial ... "How are things?" ... hang up
  dial ... "in the U.S.?" ... hang up
  dial ... (at this point many mothers would not pick up the next call)

One longer call:
  dial ... "Hi mom. How are things in the U.S.? Yak, yak..." ... hang up

By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.

Overlapping Communications and Computations

In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors (PE 0 and PE 1). Assume periodic boundary conditions.

Stencil operation: y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

Boundary elements require data from the neighboring processor; interior elements do not.

  1. Initiate communications
  2. Perform computations on interior elements
  3. Wait until communications are finished
  4. Perform computations on boundary elements

[Figure: the 10 x 10 grid split between PE 0 and PE 1, with boundary and interior elements marked.]
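A hedged sketch of steps 1-4 in MPI Fortran (not from the original slides). It assumes exactly two processors, each holding a 10 x 5 block of the array plus two ghost columns; the tags, initialization, and the skipped edge rows are illustrative simplifications. Non-blocking sends/receives let the interior computation overlap the communication:

      program stencil_overlap
      include 'mpif.h'
c     assumes exactly two processors, each owning a 10 x 5 block of
c     the 10 x 10 array; columns 0 and 6 are ghost columns holding
c     the neighbor's boundary data (periodic boundary conditions)
      real x(10,0:6), y(10,5)
      integer req(4), stats(MPI_STATUS_SIZE,4)
      integer ierr, myid, other, i, j
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      other = 1 - myid
      do j = 1, 5
         do i = 1, 10
            x(i,j) = real(i + j)
         enddo
      enddo
c     1. initiate communications for the two ghost columns
      call MPI_IRECV(x(1,0), 10, MPI_REAL, other, 1,
     &               MPI_COMM_WORLD, req(1), ierr)
      call MPI_IRECV(x(1,6), 10, MPI_REAL, other, 2,
     &               MPI_COMM_WORLD, req(2), ierr)
      call MPI_ISEND(x(1,5), 10, MPI_REAL, other, 1,
     &               MPI_COMM_WORLD, req(3), ierr)
      call MPI_ISEND(x(1,1), 10, MPI_REAL, other, 2,
     &               MPI_COMM_WORLD, req(4), ierr)
c     2. compute interior elements while the messages are in flight
      do j = 2, 4
         do i = 2, 9
            y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)
         enddo
      enddo
c     3. wait until communications are finished
      call MPI_WAITALL(4, req, stats, ierr)
c     4. compute boundary columns, which need the ghost data
c        (the top and bottom rows are skipped here for brevity)
      do j = 1, 5, 4
         do i = 2, 9
            y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)
         enddo
      enddo
      call MPI_FINALIZE(ierr)
      end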

Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

  tN = (fp/N + fs) t1      effect of multiple processors on run time

  S = 1/(fs + fp/N)        effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors
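A worked example (numbers chosen for illustration, not from the slide): with fs = 0.01 and N = 100, S = 1/(0.01 + 0.99/100) = 1/0.0199 ≈ 50, so even a 1% serial fraction cuts the ideal 100x speedup roughly in half.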

Illustration of Amdahl's Law

[Plot: speedup vs. number of processors for several serial fractions.]

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

Amdahl's Law vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Plot: speedup vs. number of processors (0 to 250) for f = 0.99, comparing the Amdahl's Law curve with the lower speedup observed in practice.]

More on Amdahl's Law

Amdahl's Law can be generalized to any two processes with different speeds.

Example: apply it to f(processor) and f(memory). The growing processor-memory performance gap will undermine our efforts at achieving the maximum possible speedup!

Generalized Amdahl's Law

Amdahl's Law can be further generalized to handle an arbitrary number of processes of various speeds. (The total fractions representing each process must still equal 1.)

  Ravg = 1 / (f1/R1 + f2/R2 + ... + fN/RN)

This is a weighted harmonic mean. Application performance is limited by the performance of the slowest component as much as it is determined by the fastest.
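An illustrative example (numbers assumed, not from the slide): if half of the work runs at R1 = 10 Gflops and the other half at R2 = 1 Gflops, then Ravg = 1/(0.5/10 + 0.5/1) = 1/0.55 ≈ 1.8 Gflops, much closer to the slow rate than to the fast one.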

Gustafson's Law

Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

There is a way around this: increase the problem size.

  bigger problems mean bigger grids or more particles: bigger arrays
  number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases
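(For reference, and not stated on the original slide: Gustafson's Law is often written as a scaled speedup S(N) = fs + N*fp = N - fs(N - 1), where fs and fp are the serial and parallel fractions of the run time measured on the N-processor system.)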

    The 1st Question to Ask Yourself


    Before You Parallelize Your Code

Is it worth my time?

  Do the CPU requirements justify parallelization?
  Do I need a parallel machine in order to get enough aggregate memory?
  Will the code be used just once, or will it be a major production code?

Your time is valuable, and it can be very time consuming to write, debug, and test a parallel code. The more time you spend writing a parallel code, the less time you have to spend doing your research.

    The 2nd Question to Ask Yourself


    Before You Parallelize Your Code

How should I decompose my problem?

  Do the computations consist of a large number of small, independent problems - trajectories, parameter space studies, etc.? May want to consider a scheme in which each processor runs the calculation for a different set of data.

  Does each computation have large memory or CPU requirements? Will probably have to break up a single problem across multiple processors.


    Distributing the Data

Decision on how to distribute the data should consider these issues:

  Load balancing: often implies an equal distribution of data, but more generally means an equal distribution of work

  Communications: want to minimize the impact of communications, taking into account both the size and the number of messages

  Physics: choice of distribution will depend on the processes that are being modeled in each direction

    A Data Distribution Example


[Figure: two alternative distributions of a 2-D grid across processors.]

First distribution: a good distribution if the physics of the problem is the same in both directions; minimizes the amount of data that must be communicated between processors.

Second distribution: if expensive global operations need to be carried out in the x-direction (e.g., FFTs), this is probably a better choice.

    A More Difficult Example


Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object.

[Figure: a 2-D grid with an irregularly shaped object covering part of it.]

Neither data distribution from the previous example will result in good load balancing.

May need to consider an irregular grid or a different data structure.


    Choosing a Resource

The following factors should be taken into account when choosing a resource:

  What is the granularity of my code?
  Are there any special hardware features that I need or can take advantage of?
  How many processors will the code be run on?
  What are my memory requirements?

By carefully considering these points, you can make the right choice of computational platform.

    Choosing a Resource: Granularity


Granularity is a measure of the amount of work done by each processor between synchronization events.

[Figure: timelines for PE 0 and PE 1 in a low-granularity application (frequent synchronization) and a high-granularity application (infrequent synchronization).]

Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.

Choosing a Resource: Special Hardware Features

Various HPC platforms have different hardware features that your code may be able to take advantage of. Examples include:

  Hardware support for divide and square root operations (IBM SP)
  Parallel I/O file system (IBM SP)
  Data streams (CRAY T3E)
  Control over cache alignment (CRAY T3E)
  E-registers for bypassing the cache hierarchy (CRAY T3E)


    Importance of Parallel Computing

High performance computing has become almost synonymous with parallel computing.

Parallel computing is necessary to solve big problems (high resolution, lots of timesteps, etc.) in science and engineering.

Developing and maintaining efficient, scalable parallel applications is difficult. However, the payoff can be tremendous.


    Importance of Parallel Computing

Before jumping in, think about:
  whether or not your code truly needs to be parallelized
  how to decompose your problem

Then choose a programming model based on your problem and your available architecture.

Take advantage of the resources that are available - compilers, libraries, debuggers, performance analyzers, etc. - to help you write efficient parallel code.


    Useful References

Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach.

Patterson, D. A. and Hennessy, J. L., Computer Organization and Design: The Hardware/Software Interface.

D. Dowd, High Performance Computing.

D. Kuck, High Performance Computing. Oxford U. Press (New York), 1996.

D. Culler and J. P. Singh, Parallel Computer Architecture.


    Outline

Preface

What is High Performance Computing?

    Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC


    Distributed Computing

Concept has been used for two decades.

Basic idea: run a scheduler across systems to run processes on the least-used systems first

  Maximize utilization
  Minimize turnaround time

Have to load executables and input files to the selected resource

  Shared file system
  File transfers upon resource selection

Examples of Distributed Computing

Workstation farms, Condor flocks, etc.
  Generally share a file system

SETI@home, Entropia, etc.
  Only one source code; a central server copies the correct binary code and input data to each system

Napster, Gnutella: file/data sharing

NetSolve
  Runs numerical kernels on any of multiple independent systems, much like a Grid solution

SETI@home: Global Distributed Computing

Running on 500,000 PCs, ~1000 CPU years per day

    485,821 CPU Years so far

    Sophisticated Data & Signal Processing Analysis

    Distributes Datasets from Arecibo Radio Telescope

Distributed vs. Parallel Computing

Different:

  Distributed computing executes independent (but possibly related) applications on different systems; jobs do not communicate with each other

  Parallel computing executes a single application across processors, distributing the work and/or data but allowing communication between processes

Non-exclusive: can distribute parallel applications to parallel computing systems

Grid Computing

Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, trust relationships.

Resources (HPC systems, visualization systems & displays, storage systems, sensors, instruments, people) are integrated via middleware to facilitate use of all resources.

Why Grids?

Resources have different functions, but multiple classes of resources are necessary for most interesting problems.

The power of any single resource is small compared to aggregations of resources.

Network connectivity is increasing rapidly in bandwidth and availability.

Large problems require teamwork and computation.

    Network Bandwidth Growth


Network vs. computer performance:

Computer speed doubles every 18 months

Network speed doubles every 9 months

Difference = an order of magnitude per 5 years (a worked check follows this list)

1986 to 2000: computers x 500; networks x 340,000

2001 to 2010: computers x 60; networks x 4,000
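
A quick worked check of the "order of magnitude per 5 years" claim, assuming the doubling times quoted above (18 months for computers, 9 months for networks) over a 5-year (60-month) window:

\[
\text{computers: } 2^{60/18} \approx 10\times, \qquad
\text{networks: } 2^{60/9} \approx 100\times
\]
\[
\frac{2^{60/9}}{2^{60/18}} = 2^{60/18} \approx 10,
\]
so networks gain roughly one order of magnitude on computers every 5 years.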

Graph: Moore's Law vs. storage improvements vs. optical improvements. From Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.

    Grid Possibilities


A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

1,000 physicists worldwide pool resources for petaflop analyses of petabytes of data

Civil engineers collaborate to design, execute, & analyze shake table experiments

Climate scientists visualize, annotate, & analyze terabyte simulation datasets

An emergency response team couples real-time data, weather model, population data

    Some Grid Usage Models


Distributed computing: job scheduling on Grid resources with secure, automated data transfer

Workflow: synchronized scheduling and automated data transfer from one system to the next in a pipeline (e.g., HPC system to visualization lab to storage system)

Coupled codes, with pieces running on different systems simultaneously

Meta-applications: parallel apps spanning multiple systems

    Grid Usage Models


Some models are similar to models already being used, but are much simpler due to:

    single sign-on

    automatic process scheduling

    automated data transfers

But Grids can encompass new resources like sensors and instruments, so new usage models will arise

    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

Access Grid (www.mcs.anl.gov/FL/accessgrid; DOE, NSF): Create & deploy group collaboration systems using commodity technologies

BlueGrid (IBM): Grid testbed linking IBM laboratories

DISCOM (www.cs.sandia.gov/discom; DOE Defense Programs): Create operational Grid providing access to resources at three U.S. DOE weapons laboratories

DOE Science Grid (sciencegrid.org; DOE Office of Science): Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities

Earth System Grid, ESG (earthsystemgrid.org; DOE Office of Science): Delivery and analysis of large climate model datasets for the climate research community

European Union (EU) DataGrid (eu-datagrid.org; European Union): Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics


    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

EuroGrid, Grid Interoperability (GRIP) (eurogrid.org; European Union): Create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus

Fusion Collaboratory (fusiongrid.org; DOE Office of Science): Create a national computational collaboratory for fusion research

Globus Project (globus.org; DARPA, DOE, NSF, NASA, Microsoft): Research on Grid technologies; development and support of the Globus Toolkit; application and deployment

GridLab (gridlab.org; European Union): Grid technologies and applications

GridPP (gridpp.ac.uk; U.K. eScience): Create & apply an operational grid within the U.K. for particle physics research

Grid Research Integration Development & Support Center (grids-center.org; NSF): Integration, deployment, support of the NSF Middleware Infrastructure for research & education


    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

Grid Application Development Software (hipersoft.rice.edu/grads; NSF): Research into program development technologies for Grid applications

Grid Physics Network (griphyn.org; NSF): Technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS

Information Power Grid (ipg.nasa.gov; NASA): Create and apply a production Grid for aerosciences and other NASA missions

International Virtual Data Grid Laboratory (ivdgl.org; NSF): Create international Data Grid to enable large-scale experimentation on Grid technologies & applications

Network for Earthquake Engineering Simulation Grid (neesgrid.org; NSF): Create and apply a production Grid for earthquake engineering

Particle Physics Data Grid (ppdg.net; DOE Science): Create and apply production Grids for data analysis in high energy and nuclear physics experiments


    Selected Major Grid Projects


Each entry: name (URL; sponsors): focus

TeraGrid (teragrid.org; NSF): U.S. science infrastructure linking four major resource sites at 40 Gb/s

UK Grid Support Center (grid-support.ac.uk; U.K. eScience): Support center for Grid projects within the U.K.

Unicore (BMBFT): Technologies for remote access to supercomputers


There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.

    Example Application Projects


    Earth Systems Grid: environment (US DOE)

    EU DataGrid: physics, environment, etc. (EU)

    EuroGrid: various (EU)

    Fusion Collaboratory (US DOE)

    GridLab: astrophysics, etc. (EU)

    Grid Physics Network (US NSF)

    MetaNEOS: numerical optimization (US NSF)

    NEESgrid: civil engineering (US NSF)

    Particle Physics Data Grid (US DOE)

Some Grid Requirements: Systems/Deployment Perspective


    Identity & authentication

    Authorization & policy

    Resource discovery

    Resource characterization

    Resource allocation

    (Co-)reservation, workflow

    Distributed algorithms

    Remote data access

    High-speed data transfer

    Performance guarantees

    Monitoring

    Adaptation

    Intrusion detection

    Resource management

    Accounting & payment

    Fault management

    System evolution

    Etc.



The Systems Challenges: Resource Sharing Mechanisms That...


Address security and policy concerns of resource owners and users

Are flexible enough to deal with many resource types and sharing modalities

Scale to large numbers of resources, many participants, many program components

Operate efficiently when dealing with large amounts of data & computation

    The Security Problem


Resources being used may be extremely valuable & the problems being solved extremely sensitive

Resources are often located in distinct administrative domains

Each resource may have its own policies & procedures

The set of resources used by a single computation may be large, dynamic, and/or unpredictable

Not just client/server

It must be broadly available & applicable:

Standard, well-tested, well-understood protocols

Integration with a wide variety of tools

    The Resource Management Problem


Enabling secure, controlled remote access to computational resources and management of remote computation:

Authentication and authorization

Resource discovery & characterization

Reservation and allocation

    Computation monitoring and control

    Grid Systems Technologies


Systems and security problems are addressed by new protocols & services, e.g., Globus:

Grid Security Infrastructure (GSI) for security

Globus Metadata Directory Service (MDS) for discovery

Globus Resource Allocation Manager (GRAM) protocol as a basic building block

Resource brokering & co-allocation services

GridFTP for data movement

    The Programming Problem


How does a user develop robust, secure, long-lived applications for dynamic, heterogeneous Grids?

Presumably need:

Abstractions and models to add to speed/robustness/etc. of development

Tools to ease application development and diagnose common problems

Code/tool sharing to allow reuse of code components developed by others

    Grid Programming Technologies


Grid applications are incredibly diverse (data, collaboration, computing, sensors, ...)

Seems unlikely there is one solution

Most applications have been written from scratch, with or without Grid services

Application-specific libraries have been shown to provide significant benefits

No new language, programming model, etc., has yet emerged that transforms things

But certainly still quite possible

Examples of Grid Programming Technologies


MPICH-G2: Grid-enabled message passing

    CoG Kits, GridPort: Portal construction, based on N-tier architectures

GDMP, Data Grid Tools, SRB: replica management, collection management

Condor-G: simple workflow management

Legion: object models for Grid computing

Cactus: Grid-aware numerical solver framework

Note the tremendous variety and application focus

    MPICH-G2: A Grid-Enabled MPI


A complete implementation of the Message Passing Interface (MPI) for heterogeneous, wide-area environments

Based on the Argonne MPICH implementation of MPI (Gropp and Lusk)

Uses Globus services for authentication, resource allocation, executable staging, output, etc.

Programs run in the wide area without change! (a minimal example follows)

See also: MetaMPI, PACX, STAMPI, MAGPIE

www.globus.org/mpi
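
For illustration, a minimal MPI program of the kind MPICH-G2 targets: it uses only standard MPI calls, so in principle the same source can be compiled and launched across machines at different sites without modification. This sketch is not taken from the MPICH-G2 distribution.

/* Minimal MPI program: each process reports its rank and host name. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Under a Grid-enabled MPI, these hosts may sit in different
     * administrative domains, yet the program text is unchanged. */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

It would be compiled with the local MPI compiler wrapper (e.g., mpicc) and started through the site's usual launch mechanism; with MPICH-G2 the launch typically goes through Globus services, with details depending on the installation.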

    Grid Events


Global Grid Forum: working meeting

Meets 3 times/year, alternates U.S.-Europe, with the July meeting as the major event

HPDC: major academic conference

HPDC-11 in Scotland with GGF-8, July 2002

Other meetings include IPDPS, CCGrid, EuroGlobus, Globus Retreats

    www.gridforum.org, www.hpdc.org

    Useful References


Book: The Grid: Blueprint for a New Computing Infrastructure (Morgan Kaufmann), www.mkp.com/grids

Perspective on Grids: The Anatomy of the Grid: Enabling Scalable Virtual Organizations, IJSA, 2001, www.globus.org/research/papers/anatomy.pdf

All URLs in this section of the presentation, especially: www.gridforum.org, www.grids-center.org, www.globus.org

    Outline


Preface

What is High Performance Computing?

Parallel Computing

Distributed Computing, Grid Computing, and More

    Future Trends in HPC

    Value of Understanding Future Trends


Monitoring and understanding future trends in HPC is important:

users: applications should be written to be efficient on current and future architectures

developers: tools should be written to be efficient on current and future architectures

computing centers: system purchases are expensive and should have upgrade paths

    The Next Decade


1980s and 1990s:

academic and government requirements strongly influenced parallel computing architectures

academic influence was greatest in developing parallel computing software (for science & eng.)

commercial influence grew steadily in the late 1990s

In the next decade:

commercialization will become dominant in determining the architecture of systems

academic/research innovations will continue to drive the development of HPC software

    Commercialization


Computing technologies (including HPC) are now propelled by profits, not sustained by subsidies

Web servers, databases, transaction processing, and especially multimedia applications drive the need for computational performance.

Most HPC systems are scaled-up commercial systems, with less additional hardware and software compared to the commercial systems.

It's not engineering, it's economics.

    Processors and Nodes


Easy predictions:

microprocessor performance increase continues at ~60% per year (Moore's Law) for 5+ years (a quick consistency check follows this list)

total migration to 64-bit microprocessors

use of even more cache, more memory hierarchy

increased emphasis on SMPs

Tougher predictions:

resurgence of vectors in microprocessors? Maybe

dawn of multithreading in microprocessors? Yes
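
As a quick consistency check (not from the slides), growth of ~60% per year is essentially the same as doubling every 18 months:

\[
1.60^{1.5} \approx 2.0
\]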

    Building Fat Nodes: SMPs


More processors are faster, of course

SMPs are the simplest form of parallel systems

efficient if not limited by memory bus contention: small numbers of processors

Commercial market for high-performance servers at low cost drives the need for SMPs

HPC market for highest performance and ease of programming drives the development of SMPs

    Building Fat Nodes: SMPs


Trends are to:

build bigger SMPs

attempt to share memory across SMPs (cc-NUMA)

    Resurgence of Vectors


    Vectors keep functional units busy

vector registers are very fast

vectors are more efficient for loops of any stride

vectors are great for many science & eng. apps

Possible resurgence of vectors:

SGI/Cray has built the SV1ex and is building the SV2

NEC continues building (CMOS) parallel-vector, Cray-like systems

Microprocessors (Pentium 4, G4) have added vector-like functionality for multimedia purposes

    Dawn of Multithreading?


Memory speed will always be a bottleneck

Must overlap computation with memory accesses: tolerate latency

requires an immense amount of parallelism

requires processors with multiple streams and compilers that can define multiple threads (an illustrative sketch follows)
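
As a rough software analogy (this is not how the MTA hardware works internally, and it is not from the slides): run several independent, memory-bound streams as threads, so that while one stream stalls on a memory access another can make progress on a processor that supports multiple concurrent threads. The array size, thread count, and access pattern below are arbitrary placeholders.

/* Several independent, cache-unfriendly streams, one per thread. */
#include <stdio.h>

#define N (1 << 20)          /* elements per stream; large enough to spill out of cache */
#define STREAMS 8            /* number of independent streams (threads) */

static int data[STREAMS][N];

int main(void)
{
    long sums[STREAMS];

    /* Fill each stream's array; the values are unimportant, the access pattern matters. */
    for (int s = 0; s < STREAMS; s++)
        for (int i = 0; i < N; i++)
            data[s][i] = i;

    /* Each thread walks its own array with a pseudo-random index sequence, so most
     * loads miss in cache; with several streams in flight, hardware that supports
     * multiple threads can overlap those stalls instead of sitting idle. */
    #pragma omp parallel for num_threads(STREAMS)
    for (int s = 0; s < STREAMS; s++) {
        unsigned idx = (unsigned)s;
        long sum = 0;
        for (int i = 0; i < N; i++) {
            idx = (idx * 1103515245u + 12345u) & (N - 1);  /* full-period LCG mod 2^20 */
            sum += data[s][idx];
        }
        sums[s] = sum;
    }

    for (int s = 0; s < STREAMS; s++)
        printf("stream %d checksum %ld\n", s, sums[s]);
    return 0;
}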

    Multithreading Diagram


    Multithreading


Tera MTA was the first multithreaded HPC system

scientific success, production failure

MTA-2 will be delivered in a few months.

Multithreading will be implemented (in more limited fashion) in commercial processors.

    Networks


Commercial network bandwidth and latency approaching custom performance.

    Dramatic performance increases likely

    the network is the computer (Sun slogan)

    more companies, more competition

    no severe physical, economic limits

    Implications of faster networks

    more clusters

    collaborative, visual supercomputing

    Grid computing

    Commodity Clusters


Clusters provide some real advantages:

computing power: leverage workstations and PCs

high availability: replace one at a time

inexpensive: leverage existing competitive market

simple path to installing a parallel computing system

Major disadvantages were robustness of hardware and software, but both have improved

NCSA has huge clusters in production based on Pentium III and Itanium.

    Clustering SMPs


Inevitable (already here!):

leverages SMP nodes effectively, for the same reasons clusters leverage individual processors

Commercial markets drive need for SMPs

Combine advantages of SMPs, clusters:

more powerful nodes through multiprocessing

more powerful nodes -> more powerful cluster

interconnect scalability requirements reduced for the same number of processors

    Continued Linux Growth in HPC


Linux popularity growing due to price and availability of source code

    Major players now supporting Linux, esp. IBM

    Head start on Intel Itanium

    Programming Tools


However, programming tools will continue to lag behind hardware and OS capabilities:

Researchers will continue to drive the need for the most powerful tools to create the most efficient applications on the largest systems

Such technologies will look more like MPI than the Web, and maybe worse due to multi-tiered clusters of SMPs (MPI + OpenMP; active messages + threads?); a hybrid sketch follows this list

Academia will continue to play a large role in HPC software development.
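
A minimal sketch of the hybrid MPI + OpenMP style mentioned above: OpenMP threads share the work within an SMP node, and MPI combines the partial results across nodes. The array size and the per-element arithmetic are placeholders, not something from the presentation.

/* Hybrid MPI + OpenMP: thread-level reduction inside each process,
 * then a message-passing reduction across processes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* elements handled by each MPI process (illustrative) */

static double x[N];

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        printf("threads per process: %d\n", omp_get_max_threads());

    /* Threads within the SMP node split the loop; the reduction clause
     * combines their partial sums into 'local'. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        x[i] = (rank * (double)N + i) * 1.0e-6;   /* stand-in for real work */
        local += x[i] * x[i];
    }

    /* MPI then combines the per-node results across the cluster. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of squares across %d processes = %g\n", size, total);

    MPI_Finalize();
    return 0;
}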

    Grid Computing


    Parallelism will continue to grow in the form of

    SMPs

    clusters

    Cluster of SMPs (and maybe DSMs)

    Grids provide the next level

    connects multiple computers into virtual systems

    Already here: IBM, other vendors supporting Globus

    SC2001 dominated by Grid technologies

    Many major government awards (>$100M in past year)

    Emergence of Grids


But Grids enable much more than apps running on multiple computers (which can be achieved with MPI alone):

virtual operating system: provides a global workspace/address space via a single login

automatically manages files, data, accounts, and security issues

connects other resources (archival data facilities, instruments, devices) and people (collaborative environments)

    Grids Are Inevitable


    Inevitable (at least in HPC):

leverages computational power of all available systems

manages resources as a single system, which is easier for users

provides the most flexible resource selection and management, load sharing

researchers' desire to solve bigger problems will always outpace performance increases of single systems; just as multiple processors are needed, multiple multiprocessors will be deemed so

    Grid-Enabled Software


Commercial applications on single parallel systems and Grids will require that:

underlying architectures must be invisible: no parallel computing expertise required

usage must be simple

development must not be too difficult

Developments in ease-of-use will benefit scientists as users (not as developers)

Web-based interfaces: transparent supercomputing (MPIRE, Meta-MEME, etc.)

Grid-Enabled Collaborative and Visual Supercomputing


    Commercial world demands:

    multimedia applications

    real-time data processing

    online transaction processing

rapid prototyping and simulation in engineering, chemistry, and biology

    interactive, remote collaboration

    3D graphics, animation and virtual reality

    visualization

Grid-Enabled Collaborative, Visual Supercomputing


Academic world will leverage the resulting Grids linking computing and visualization systems via high-speed networks:

collaborative post-processing of data is already here

simulations will be visualized in 3D, virtual worlds in real time

such simulations can then be steered

multiple scientists can participate in these visual simulations

the time to insight (SGI slogan) will be reduced

    Web-based Grid Computing


Web currently used mostly for content delivery

Web servers on HPC systems can execute applications

Web servers on Grids can launch applications, move/store/retrieve data, display visualizations, etc.

NPACI HotPage already enables single sign-on to NPACI Grid resources

    Summary of Expectations


HPC systems will grow in performance but probably change little in design (5-10 years):

HPC systems will be larger versions of smaller commercial systems, mostly large SMPs and clusters of inexpensive nodes

Some processors will exploit vectors, as well as more/larger caches.

The best HPC systems will have been designed top-down instead of bottom-up, but all will have been designed to make the bottom profitable.

Multithreading is the only likely near-term major architectural change.

    Summary of Expectations


Using HPC systems will change much more:

Grid computing will become widespread in HPC and in commercial computing

Visual supercomputing and collaborative simulation will be commonplace.

WWW interfaces to HPC resources will make transparent supercomputing commonplace.

But programming the most powerful resources most effectively will remain difficult.

    Caution


Change is difficult to predict (and I am an astrophysicist, not an astrologer):

Accuracy of linear-extrapolation predictions degrades over long times (like weather forecasts)

Entirely new ideas can change everything: the WWW is an excellent example; Grid computing is probably the next

Eventually, something truly different will replace CMOS technology (nanotechnology? molecular computing? DNA computing?)

    Final Prediction


The thing about change is that things will be different afterwards.

    Alan McMahon (Cornell University)