
Parallel Algorithm



Page 1: Parallel Algorithm

Flynn’s Classifications (1972) [1]
• SISD – Single Instruction stream, Single Data stream
– Conventional sequential machines
– The program executed is the instruction stream; the data operated on is the data stream
• SIMD – Single Instruction stream, Multiple Data streams
– Vector and array machines
– Processors execute the same program, but operate on different data streams
• MIMD – Multiple Instruction streams, Multiple Data streams
– Parallel machines
– Independent processors execute different programs, using unique data streams
• MISD – Multiple Instruction streams, Single Data stream
– Systolic array machines
– A common data structure is manipulated by separate processors, executing different instruction streams (programs)

Page 2: Parallel Algorithm

SISD

[Figure: SISD organization — a single control unit (C) issues one instruction stream (IS) to a single processor (P), which exchanges one data stream (DS) with memory (M). Source: Anshul Kumar, CSE IITD]

Page 3: Parallel Algorithm

SIMD

[Figure: SIMD organization — one control unit (C) broadcasts a single instruction stream (IS) to several processors (P), each operating on its own data stream (DS) from memory (M). Source: Anshul Kumar, CSE IITD]

Page 4: Parallel Algorithm

MISD

[Figure: MISD organization — multiple control units (C) issue separate instruction streams (IS) to separate processors (P), which all operate on a single data stream (DS) from memory (M). Source: Anshul Kumar, CSE IITD]

Page 5: Parallel Algorithm

MIMD

[Figure: MIMD organization — multiple control units (C) issue independent instruction streams (IS) to independent processors (P), each with its own data stream (DS) to and from memory (M). Source: Anshul Kumar, CSE IITD]

Page 6: Parallel Algorithm

Classification of Parallel Architectures

Parallel architectures (PAs)
• Data-parallel architectures (DPs)
– Vector architectures
– Associative and neural architectures
– SIMDs
– Systolic architectures
• Function-parallel architectures
– Instruction-level PAs (ILPs): pipelined processors, VLIWs, superscalar processors
– Thread-level PAs
– Process-level PAs (MIMDs): distributed-memory MIMD, shared-memory MIMD

[Ref: Sima et al.]

Page 7: Parallel Algorithm

What is Pipelining

• A technique used in advanced microprocessors where the microprocessor begins executing a second instruction before the first has been completed.

- A Pipeline is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages.

• With pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed.
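
As a rough illustration of this overlap, the following minimal Python sketch (function names are illustrative, not from the slides) compares the cycle count of an ideal, stall-free pipeline with purely sequential execution of the same instruction stream.

def pipelined_cycles(num_instructions, num_stages=5):
    # Ideal pipeline with no stalls: the first instruction needs num_stages
    # cycles, and each later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)

def sequential_cycles(num_instructions, num_stages=5):
    # Each instruction runs through all stages before the next one starts.
    return num_stages * num_instructions

for n in (4, 100, 1000):
    print(n, sequential_cycles(n), pipelined_cycles(n))

For a long instruction stream the pipelined machine approaches one completed instruction per cycle, which is the point of the technique.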

Page 8: Parallel Algorithm

Four Pipelined Instructions

[Figure: four instructions flowing through the IF, ID, EX, MEM and WB stages of a 5-stage pipeline. The first instruction completes after 5 cycles; each subsequent instruction completes one cycle later.]

Page 9: Parallel Algorithm

Instructions Fetch

• The Instruction Fetch (IF) stage is responsible for obtaining the requested instruction from memory. The instruction and the program counter (which is incremented to point to the next instruction) are stored in the IF/ID pipeline register as temporary storage so that they may be used in the next stage at the start of the next clock cycle.

Page 10: Parallel Algorithm

Instruction Decode

• The Instruction Decode (ID) stage is responsible for decoding the instruction and sending out the various control lines to the other parts of the processor. The instruction is sent to the control unit where it is decoded and the registers are fetched from the register file.

Page 11: Parallel Algorithm

Memory and IO

• The Memory and IO (MEM) stage is responsible for storing and loading values to and from memory. It is also responsible for input to or output from the processor. If the current instruction is not of memory or IO type, then the result from the ALU is passed through to the write-back stage.

Page 12: Parallel Algorithm

DATA FLOW COMPUTERS

• Data flow computers execute the instructions as the data becomes available

• Data Flow architectures are highly asynchronous

• In the data flow architecture there is no need to store intermediate or final results, because they are passed as tokens among instructions.

Page 13: Parallel Algorithm

(cont.)

• The program sequencing depends on the data availability

• The information appears as operation packets and data tokens

• Operation packet = opcode + operands + destination

• Data token = data (result) + destination

Page 14: Parallel Algorithm

(cont.)

• Data flow computers have packet communication architecture

• Dataflow computers have distributed multiprocessor organization

Page 15: Parallel Algorithm


What is a performance metric?

• Count – of how many times an event occurs
• Duration – of a time interval
• Size – of some parameter
• A value derived from these fundamental measurements

Page 16: Parallel Algorithm


Good metrics are …

• Linear – nice, but not necessary
• Reliable – required
• Repeatable – required
• Easy to use – nice, but not necessary
• Consistent – required
• Independent – required

Page 17: Parallel Algorithm

III. PERFORMANCE METRICS
A performance metric is a measure of a system's performance. It focuses on measuring a certain aspect of the system and allows comparison of various types of systems. The criteria for evaluating performance in parallel computing can include: speedup, efficiency and scalability.

Speedup
Speedup is the most basic of parameters in multiprocessing systems and shows how much faster a parallel algorithm is than a sequential algorithm. It is defined as follows:
Sp = T1 / Tp
where Sp is the speedup, T1 is the execution time for a sequential algorithm, Tp is the execution time for a parallel algorithm and p is the number of processors.

Page 18: Parallel Algorithm

There are three possibilities for speedup: linear, sub-linear and super-linear, shown in Figure 1. When Sp = p, i.e. when the speedup is equal to the number of processors, the speedup is called linear. In such a case, doubling the number of processors will double the speedup. In the case of sub-linear speedup, increasing the number of processors yields less than a proportional increase in speedup. Most algorithms are sub-linear because of the increasing parallel overhead from such areas as interprocessor communication, load imbalance, synchronization, and extra computation. An interesting case occurs with super-linear speedup, which is mainly due to cache size increase.

Page 19: Parallel Algorithm
Page 20: Parallel Algorithm

Efficiency
Another performance metric in parallel computing is efficiency. It is defined as the achieved fraction of the total potential parallel processing gain and estimates how well the processors are used in solving the problem:
Ep = Sp / p = T1 / (p · Tp)
where Ep is the efficiency.
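
As a quick sanity check of these two definitions, here is a tiny Python sketch; the timings in the example are made up purely for illustration.

def speedup(t_seq, t_par):
    # Sp = T1 / Tp
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    # Ep = Sp / p = T1 / (p * Tp)
    return speedup(t_seq, t_par) / p

# Hypothetical measurements: 100 s sequentially, 30 s on p = 4 processors.
print(speedup(100.0, 30.0))        # ~3.33 (sub-linear)
print(efficiency(100.0, 30.0, 4))  # ~0.83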

Page 21: Parallel Algorithm

The Random Access Machine Model

RAM model of serial computers:
– Memory is a sequence of words, each capable of containing an integer
– Each memory access takes one unit of time
– Basic operations (add, multiply, compare) take one unit of time
– Instructions are not modifiable
– Read-only input tape, write-only output tape

Page 22: Parallel Algorithm

7.2.2 The PRAM model
The PRAM is an idealized parallel machine which was developed as a straightforward generalization of the sequential RAM. Because we will be using it often, we give here a detailed description of it.
Description
A PRAM uses p identical processors, each one with a distinct id-number and each able to perform the usual computation of a sequential RAM equipped with a finite amount of local memory. The processors communicate through some shared global memory (Figure 7.4) to which all are connected. The shared memory contains a finite number of memory cells. There is a global clock that sets the pace of the machine execution. In one time-unit period each processor can perform, if it so wishes, any or all of the following three steps:
1. Read from a memory location, global or local;
2. Execute a single RAM operation; and
3. Write to a memory location, global or local.

Page 23: Parallel Algorithm

[Figure: PRAM organization — a control unit and processors P1, P2, …, Pn, each with its own private memory, all connected through an interconnection network to a shared global memory.]

Page 24: Parallel Algorithm


Classification of the PRAM model

• The power of a PRAM depends on the kind of access to the shared memory locations.

Page 25: Parallel Algorithm


Classification of the PRAM model

In every clock cycle,
• In the Exclusive Read Exclusive Write (EREW) PRAM, each memory location can be accessed by only one processor.
• In the Concurrent Read Exclusive Write (CREW) PRAM, multiple processors can read from the same memory location, but only one processor can write.

Page 26: Parallel Algorithm


Classification of the PRAM model

• In the Concurrent Read Concurrent Write (CRCW) PRAM, multiple processors can read from or write to the same memory location.

Page 27: Parallel Algorithm


Classification of the PRAM model

• It is easy to allow concurrent reading. However, concurrent writing gives rise to conflicts.

• If multiple processors write to the same memory location simultaneously, it is not clear what is written to the memory location.

Page 28: Parallel Algorithm


Classification of the PRAM model

• In the Common CRCW PRAM, all the processors must write the same value.

• In the Arbitrary CRCW PRAM, one of the processors arbitrarily succeeds in writing.

• In the Priority CRCW PRAM, processors have priorities associated with them and the highest priority processor succeeds in writing.

Page 29: Parallel Algorithm


Classification of the PRAM model

• The EREW PRAM is the weakest and the Priority CRCW PRAM is the strongest PRAM model.

• The relative powers of the different PRAM models are as follows.

Page 30: Parallel Algorithm


Classification of the PRAM model

• An algorithm designed for a weaker model can be executed within the same time and work complexities on a stronger model.

Page 31: Parallel Algorithm


Classification of the PRAM model

• We say model A is less powerful than model B if either:
– the time complexity for solving a problem is asymptotically less in model B than in model A, or
– if the time complexities are the same, the processor or work complexity is asymptotically less in model B than in model A.

Page 32: Parallel Algorithm


Classification of the PRAM model

An algorithm designed for a stronger PRAM model can be simulated on a weaker model either with asymptotically more processors (work) or with asymptotically more time.

Page 33: Parallel Algorithm


Adding n numbers on a PRAM

Page 34: Parallel Algorithm


Adding n numbers on a PRAM

• This algorithm works on the EREW PRAM model as there are no read or write conflicts.

• We will use this algorithm to design a matrix multiplication algorithm on the EREW PRAM.

Page 35: Parallel Algorithm

PRAMs are classified as EREW, CREW and CRCW.
EREW. In the exclusive-read-exclusive-write (EREW) PRAM model, no conflicts are permitted for either reading or writing. If, during the execution of a program on this model, some conflict occurs, the program's behavior is undefined.
CREW. In the concurrent-read-exclusive-write (CREW) PRAM model, simultaneous readings by many processors from some memory cell are permitted. However, if a writing conflict occurs, the behavior of the program is undefined. The idea behind this model is that it may be cheap to implement some broadcasting primitive on a real machine, so one needs to examine the usefulness of such a decision.

Page 36: Parallel Algorithm

CRCW. Finally, the concurrent-read-concurrent-write (CRCW) PRAM, the strongest of these models, permits simultaneous accesses for both reading and writing. In the case of multiple processors trying to write to some memory cell, one must define which of the processors eventually does the writing. There are several answers that researchers have given to this question; the most commonly used are the Common, Arbitrary and Priority rules described earlier.

Page 37: Parallel Algorithm

Theorem 8. Any algorithm written for the PRIORITY CRCW model can be simulated on the EREW model with a slowdown of O(lg r), where r is the number of processors employed by the algorithm.
Proof. We have an algorithm that runs correctly on a PRIORITY CRCW PRAM, and an EREW PRAM machine on which we want to simulate the algorithm. If we were to execute the algorithm without modification on the EREW machine, it would not work. The problem would not be in the

Page 38: Parallel Algorithm

executable part of the code, since both machines understand the same set of executable statements. Instead, the problem would be in the statements that access the shared memory for reading or writing. For example, every time the algorithm says:
Processor Pi reads from memory location y into x
it might involve a concurrent read, which the EREW machine cannot handle. The same is true for a concurrent write statement (depicted visually in Figure 7.8). In order to fix the problem, we have to turn each statement of that sort into a sequence of statements that do not involve any concurrency, but have the same result as if we had concurrency.
Let us assume that the algorithm uses r processors, named P1, P2, …, Pr; the EREW machine also uses r processors. The EREW machine, however, will need a little more memory: r auxiliary memory locations A[1..r], which will be used to resolve the conflicts.
The idea is to replace each code fragment of the algorithm:

Page 39: Parallel Algorithm

Processor Pi accesses (reads into x or writes x into) memory location y
with code which:
a) has the processors request permission to access a particular memory location,
b) finds out, for every memory location, whether there is a conflict, and
c) decides which of the competing processors will do the access.
This is achieved by the following fragment of code:

Page 40: Parallel Algorithm

1. Processor Pi writes (y, i) into A[i].
2. The auxiliary array A[1..r] is sorted lexicographically in increasing order.
3. Processor Pi reads A[i − 1] and A[i] and determines whether it is the highest-priority processor accessing some memory location.
4. If Pi is the highest-priority processor, then:
   if the operation was a write, Pi does the writing;
   else Pi does the reading, and the value read is propagated to the processors interested in this value.
The last step takes O(lg r) time. The sorting step also takes O(lg r), as the following non-trivial fact shows. (We mention it here without proof; for a proof, consult [?].)
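
The following Python sketch runs steps 1-4 sequentially for a single PRIORITY CRCW write step, assuming that a lower processor id means higher priority; the function name and data layout are illustrative, not from the text.

def simulate_priority_crcw_write(requests, memory):
    # requests: list of (processor_id, address, value); lower id = higher priority.
    # Steps 1-2: build the auxiliary array of (address, id, value) triples and
    # sort it lexicographically, so requests to the same address become adjacent.
    aux = sorted((addr, pid, val) for pid, addr, val in requests)
    # Steps 3-4: a processor performs the write only if it is the first
    # (highest-priority) entry for its address.
    for k, (addr, pid, val) in enumerate(aux):
        if k == 0 or aux[k - 1][0] != addr:
            memory[addr] = val
    return memory

mem = {0: None, 1: None}
print(simulate_priority_crcw_write([(3, 0, "c"), (1, 0, "a"), (2, 1, "b")], mem))
# {0: 'a', 1: 'b'} -- processor 1 wins the conflict at address 0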

Page 41: Parallel Algorithm

Prefix Sum – Doubling, CREW PRAM

Given: n elements in A[0 … n-1]
Var: A and j are global, i is local
spawn(P1, P2, …, Pn-1)          // note the number of processors
for all Pi, 1 <= i <= n-1:
    for j = 0 to log n - 1 do
        if (i - 2^j >= 0) then A[i] = A[i] + A[i - 2^j]
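
A sequential Python simulation of this doubling scheme; the per-round snapshot stands in for the simultaneous concurrent reads of the CREW model.

import math

def crew_prefix_sum(a):
    a = list(a)
    n = len(a)
    for j in range(math.ceil(math.log2(n))):     # j = 0 .. log n - 1
        prev = list(a)                           # snapshot = concurrent reads
        for i in range(n):                       # one iteration per processor
            if i - 2**j >= 0:
                a[i] = prev[i] + prev[i - 2**j]
    return a

print(crew_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]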

Page 42: Parallel Algorithm

Sum of elements – EREW PRAM

Given: n elements in A[0 … n-1]
Var: A and j are global, i is local
spawn(P0, P1, P2, …, Pn/2-1)    // p = n/2
for all Pi, 0 <= i <= n/2 - 1:
    for j = 0 to log n - 1 do
        if (i mod 2^j = 0) and (2i + 2^j < n) then A[2i] = A[2i] + A[2i + 2^j]
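
The same style of check works for the EREW sum above; within a round every active processor touches distinct cells, so no snapshot of A is needed.

import math

def erew_sum(a):
    a = list(a)
    n = len(a)
    for j in range(math.ceil(math.log2(n))):     # j = 0 .. log n - 1
        for i in range(n // 2):                  # processors P0 .. P(n/2 - 1)
            if i % 2**j == 0 and 2*i + 2**j < n:
                a[2*i] += a[2*i + 2**j]
    return a[0]

print(erew_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # 36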

Page 43: Parallel Algorithm

Algorithm
Let A and B be the given shared arrays of r and s elements respectively, sorted in nondecreasing order. It is required to merge them into a shared array C. As presented by Akl [4], the algorithm is as follows. Let P1, P2, …, PN be the N processors available.
Step 1: The algorithm selects N-1 elements from array A for a shared array A'. This divides A into N approximately equal-size segments. A shared array B' of N-1 elements of B is chosen similarly. For this step, each Pi inserts A[i * r/N] and B[i * s/N] in parallel into the ith location of A' and B' respectively.
Step 2: This step merges A' and B' into a shared array V of size 2N - 2. Each element v of V is a triple consisting of an element of A' or B', followed by its position in A' or B', followed by the name A or B. For this step each Pi:
a. Using sequential BINARY SEARCH, each processor searches the array B' in parallel to find the smallest j such that A'[i] < B'[j]. If such a j exists, then V[i + j - 1] is set to the triple (A'[i], i, “A”); otherwise V[i + N - 1] is set to the triple (A'[i], i, “A”).

Page 44: Parallel Algorithm

b. Using sequential BINARY SEARCH, each processor searches the array A' to find the smallest j such that B'[i] < A'[j]. If such a j exists, then V[i + j - 1] is set to the triple (B'[i], i, “B”); otherwise V[i + N - 1] is set to the triple (B'[i], i, “B”).
Step 3: To merge A and B into the shared array C, the indices of the two elements (one in A and one in B) at which each processor is to begin merging are computed in a shared array Q of ordered pairs. This step is executed as follows:
a. P1 sets Q[1] to (1, 1).
b. Each Pi, i >= 2, checks whether V[2i - 2] is equal to (A'[k], k, “A”) or not. If it is equal, then Pi searches B using BINARY SEARCH to find the smallest j such that B[j] > A'[k] and sets Q[i] to (k * r/N, j); otherwise Pi searches A using BINARY SEARCH to find the smallest j such that A[j] > B'[k] and sets Q[i] to (j, k * s/N).
Step 4: Each Pi, i < N, uses the sequential merge and Q[i] = (x, y), Q[i+1] = (u, v) to merge the two subarrays A[x..u-1] and B[y..v-1], placing the result of the merge in array C at position x + y - 1. Processor PN uses Q[N] = (w, z) to merge the two subarrays A[w..r] and B[z..s].

Page 45: Parallel Algorithm

List ranking – EREW algorithm

LIST-RANK(L)    (runs in O(lg n) time)
1. for each processor i, in parallel
2.     do if next[i] = nil
3.         then d[i] ← 0
4.         else d[i] ← 1
5. while there exists an object i such that next[i] ≠ nil
6.     do for each processor i, in parallel
7.         do if next[i] ≠ nil
8.             then d[i] ← d[i] + d[next[i]]
9.                  next[i] ← next[next[i]]
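
A sequential Python simulation of this pointer-jumping algorithm, run on the six-element list used in the figure on the next slide (node 2 is an unused placeholder so that the node numbers match):

def list_rank(next_ptr):
    # next_ptr[i] is the successor of node i, or None for a tail.
    # Returns d[i] = number of links from node i to the end of its list.
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    while any(p is not None for p in nxt):
        d_old, nxt_old = list(d), list(nxt)      # snapshots simulate parallel reads
        for i in range(n):
            if nxt_old[i] is not None:
                d[i] = d_old[i] + d_old[nxt_old[i]]
                nxt[i] = nxt_old[nxt_old[i]]
    return d

# The list 3 -> 4 -> 6 -> 1 -> 0 -> 5 from the figure.
nxt = [5, 0, None, 4, 6, None, 1]
print(list_rank(nxt))   # d[3]=5, d[4]=4, d[6]=3, d[1]=2, d[0]=1, d[5]=0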

Page 46: Parallel Algorithm

List ranking – EREW algorithm

[Figure: pointer-jumping trace for the list 3 → 4 → 6 → 1 → 0 → 5. (a) initial d values 1 1 1 1 1 0; (b) after one step 2 2 2 2 1 0; (c) after two steps 4 4 3 2 1 0; (d) after three steps 5 4 3 2 1 0.]

Page 47: Parallel Algorithm

Applications of List Ranking

• Expression tree evaluation
• Parentheses matching
• Tree traversals
• Ear decomposition of graphs
• Euler tour of trees

Page 48: Parallel Algorithm

Graph coloring

Determining whether the vertices of a graph can be colored with c colors so that no two adjacent vertices are assigned the same color is called the graph coloring problem. To solve the problem quickly, we can create a processor for every possible coloring of the graph; each processor then checks whether the coloring it represents is valid.
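
A sequential Python sketch of the same brute-force idea; the loop over colorings plays the role of the one-processor-per-coloring check in the parallel version (names are illustrative).

from itertools import product

def has_valid_coloring(edges, num_vertices, c):
    # Enumerate every assignment of c colors to the vertices and test whether
    # some assignment leaves no edge with equally colored endpoints.
    for coloring in product(range(c), repeat=num_vertices):
        if all(coloring[u] != coloring[v] for u, v in edges):
            return True
    return False

triangle = [(0, 1), (1, 2), (0, 2)]          # a triangle needs 3 colors
print(has_valid_coloring(triangle, 3, 2))    # False
print(has_valid_coloring(triangle, 3, 3))    # True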

Page 49: Parallel Algorithm

Linear Arrays and Rings

• Linear array
– Asymmetric network
– Degree d = 2
– Diameter D = N - 1
– Bisection bandwidth: b = 1
– Allows different sections of the channel to be used by different sources concurrently.
• Ring
– d = 2
– D = N - 1 for a unidirectional ring, or ⌊N/2⌋ for a bidirectional ring

[Figures: linear array; ring; ring arranged to use short wires.]

Page 50: Parallel Algorithm

Fully Connected and Chordal Ring Topologies

• Fully connected topology
– Needs N(N-1)/2 links to connect N processor nodes
– Examples: N = 16 -> 120 connections; N = 1,024 -> 523,776 connections
– D = 1
– d = N - 1
• Chordal ring
– Example: N = 16, d = 3 -> D = 5

Page 51: Parallel Algorithm

Multidimensional Meshes and Tori

• Mesh
– Popular topology, particularly for SIMD architectures, since it matches many data-parallel applications (e.g. image processing, weather forecasting)
– Examples: Illiac IV, Goodyear MPP, CM-2, Intel Paragon
– Asymmetric
– d = 2k, except at boundary nodes
– A k-dimensional mesh has N = n^k nodes
• Torus
– A mesh with wraparound connections at the boundaries to provide symmetry

[Figures: 2D grid and 3D cube.]

Page 52: Parallel Algorithm

Trees

• Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– An address is specified as a d-vector of radix-k coordinates describing the path down from the root
• Fixed degree
• Route up to the common ancestor and then down
• Bisection BW?

Page 53: Parallel Algorithm

Trees (cont.)

• Fat tree
– The channel width increases as we go up
– Solves the bottleneck problem toward the root
• Star
– Two-level tree with d = N - 1, D = 2
– Centralized supervisor node

Page 54: Parallel Algorithm

Hypercubes

• Each PE is connected to d = log N other PEs
• Binary labels of neighboring PEs differ in only one bit
• A d-dimensional hypercube can be partitioned into two (d-1)-dimensional hypercubes
• The distance between Pi and Pj in a hypercube is the number of bit positions in which i and j differ (i.e., the Hamming distance)
– Example: 10011 ⊕ 01001 = 11010, so the distance between PE11 and PE9 is 3

[Figure: hypercubes of dimension 0 through 5; the 3-cube has nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]

*From Parallel Computer Architectures; A Hardware/Software approach, D. E. Culler

Page 55: Parallel Algorithm

Hypercube routing functions

• Example: consider a 4-D hypercube (n = 4) with source address s = 0110 and destination address d = 1101.
Direction bits r = 0110 ⊕ 1101 = 1011
1. Route from 0110 to 0111 because r = 1011
2. Route from 0111 to 0101 because r = 1011
3. Skip dimension 3 because r = 1011
4. Route from 0101 to 1101 because r = 1011
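
A small Python sketch of this routing function (flip the differing bits, lowest dimension first); it reproduces the path of the example above.

def hypercube_route(src, dst, dims):
    # r = src XOR dst marks the dimensions in which the two labels differ;
    # cross one such dimension per hop, lowest dimension first.
    r = src ^ dst
    node, path = src, [src]
    for dim in range(dims):
        if r & (1 << dim):
            node ^= (1 << dim)      # move to the neighbor across this dimension
            path.append(node)
    return path

print([format(v, "04b") for v in hypercube_route(0b0110, 0b1101, 4)])
# ['0110', '0111', '0101', '1101']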

Page 56: Parallel Algorithm

k-ary n-cubes

• Rings, meshes, tori and hypercubes are special cases of a general topology called a k-ary n-cube
• It has n dimensions with k nodes along each dimension
– An n-processor ring is an n-ary 1-cube
– An n x n mesh is an n-ary 2-cube (without end-around connections)
– An n-dimensional hypercube is a 2-ary n-cube
• N = k^n
• Routing distance is minimized for topologies with higher dimension
• Cost is lowest for lower dimension; scalability is also greatest and VLSI layout is easiest

Page 57: Parallel Algorithm

Cube-connected cycles

• d = 3
• D = 2k - 1 + ⌊k/2⌋
• Example: N = 8
– We can use the 2CCC network

Page 58: Parallel Algorithm


Page 59: Parallel Algorithm

Network properties

• Node degree d – the number of edges incident on a node
– In-degree
– Out-degree
• The diameter D of a network is the maximum shortest path between any two nodes.
• A network is symmetric if it looks the same from every node.
• A network is scalable if it is expandable with scalable performance when the machine resources are increased.

Page 60: Parallel Algorithm

Bisection width

• Bisection width is the minimum number of wires that must be cut to divide the network into two equal halves. A small bisection width implies low bandwidth; a large bisection width implies a lot of extra wires.
• A cut of a network, C(N1, N2), is a set of channels that partitions the set of all nodes into two disjoint sets N1 and N2. Each element of C(N1, N2) is a channel with a source in N1 and destination in N2, or vice versa.
• A bisection of a network is a cut that partitions the entire network nearly in half, such that |N2| ≤ |N1| ≤ |N2| + 1. Here |N2| means the number of nodes that belong to the partition N2.
• The channel bisection B of a network is the minimum channel count over all bisections of the network:
B = min over all bisections C(N1, N2) of |C(N1, N2)|

Page 61: Parallel Algorithm

Factors Affecting Performance

• Functionality – how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence
• Network latency – worst-case time for a unit message to be transferred
• Bandwidth – maximum data rate
• Hardware complexity – implementation costs for wire, logic, switches, connectors, etc.

Page 62: Parallel Algorithm


2 × 2 Switches

*From Advanced Computer Architectures, K. Hwang, 1993.

Page 63: Parallel Algorithm

Switches

Module size | Legitimate states | Permutation connections
2 × 2       | 4                 | 2
4 × 4       | 256               | 24
8 × 8       | 16,777,216        | 40,320
N × N       | N^N               | N!

• Permutation connection: each input can be connected to only a single output.
• Legitimate state: each input can be connected to multiple outputs, but each output can be connected to only a single input.

Page 64: Parallel Algorithm

Single-stage networks

• Single-stage Shuffle-Exchange IN (left); perfect shuffle mapping function (right)
• Perfect shuffle operation: cyclic shift 1 place left, e.g. 101 --> 011
• Exchange operation: invert the least significant bit, e.g. 101 --> 100

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003

Page 65: Parallel Algorithm

Multistage Interconnection Networks

• The capability of single-stage networks is limited, but if we cascade enough of them together, they form a completely connected MIN (Multistage Interconnection Network).
• Switches can perform their own routing or can be controlled by a central router.
• This type of network can be classified into the following categories:
• Nonblocking
– A network is called strictly nonblocking if it can connect any idle input to any idle output regardless of what other connections are currently in progress.
• Rearrangeable nonblocking
– In this case a network should be able to establish all possible connections between inputs and outputs by rearranging its existing connections.
• Blocking
– A network is said to be blocking if it can perform many, but not all, possible connections between terminals.
– Example: the Omega network

Page 66: Parallel Algorithm

Omega networks

• A multistage IN using 2 × 2 switch boxes and a perfect-shuffle interconnect pattern between the stages
• In the Omega MIN there is one unique path from each input to each output.
• No redundant paths → no fault tolerance, and the possibility of blocking

Example:
• Connect input 101 to output 001
• Use the bits of the destination address, 001, to dynamically select a path
• Routing: 0 means use the upper output, 1 means use the lower output

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003
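
A minimal Python sketch of this destination-tag routing, assuming the usual Omega construction (a perfect shuffle followed by a 2 × 2 exchange at each of the log2 N stages); it reproduces the 101 → 001 example above.

def omega_route(src, dst, n_bits):
    # At each stage the current address is perfect-shuffled (cyclic left shift)
    # and the switch output is chosen by the next destination bit
    # (0 = upper, 1 = lower), which becomes the new low-order bit.
    mask = (1 << n_bits) - 1
    addr, hops = src, [src]
    for stage in range(n_bits):
        addr = ((addr << 1) | (addr >> (n_bits - 1))) & mask   # perfect shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1                # routing bit
        addr = (addr & ~1) | bit                               # exchange step
        hops.append(addr)
    return hops

print([format(v, "03b") for v in omega_route(0b101, 0b001, 3)])
# ['101', '010', '100', '001'] -- switch settings: upper, upper, lower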

Page 67: Parallel Algorithm

Omega networks

• log2 N stages of 2 × 2 switches
• N/2 switches per stage
• S = (N/2) log2 N switches in total
• Number of permutations in an Omega network: 2^S

Page 68: Parallel Algorithm

Baseline networks

• The network can be generated recursively
• The first stage is N × N, the second (N/2) × (N/2)
• Networks are topologically equivalent if one network can be easily reproduced from the other by simply rearranging nodes at each stage.

*From Advanced Computer Architectures, K. Hwang, 1993.

Page 69: Parallel Algorithm

Crossbar Network

• Each junction is a switching component connecting the row to the column.
• Only one connection can be made in each column.

*From Advanced Computer Architectures, K. Hwang, 1993.

Page 70: Parallel Algorithm

Crossbar Network

• The major advantage of the crossbar switch is its potential for speed.
• In one clock, a connection can be made between source and destination.
• The diameter of the crossbar is one.
• Blocking occurs if the destination is in use.
• Because of its complexity, the cost of the crossbar switch can become the dominant factor for a large multiprocessor system.
• Crossbars can be used to implement the a×b switches used in MINs. In this case each crossbar is small, so costs are kept down.

Page 71: Parallel Algorithm

Performance Comparison

Network  | Latency       | Switching complexity | Wiring complexity | Blocking
Bus      | Constant O(1) | O(N)                 | O(w)              | yes
MIN      | O(log2 N)     | O(N log2 N)          | O(N w log2 N)     | yes
Crossbar | O(1)          | O(N^2)               | O(N^2 w)          | no

Page 72: Parallel Algorithm

PCAM Algorithm Design

• Partitioning – computation and data are decomposed
• Communication – coordinate task execution
• Agglomeration – combining of tasks for performance
• Mapping – assignment of tasks to processors

Page 73: Parallel Algorithm
Page 74: Parallel Algorithm

Partitioning

• Ignore the number of processors and the target architecture.

• Expose opportunities for parallelism.
• Divide up both the computation and the data.
• Can take two approaches:
– domain decomposition
– functional decomposition

Page 75: Parallel Algorithm

Domain Decomposition

• Start algorithm design by analyzing the data.
• Divide the data into small pieces
– approximately equal in size.
• Then partition the computation by associating it with the data.
• Communication issues may arise as one task needs the data from another task.

Page 76: Parallel Algorithm

Functional Decomposition

• Focus on the computation.
• Divide the computation into disjoint tasks
– avoid data dependency among tasks.
• After dividing the computation, examine the data requirements of each task.

Page 77: Parallel Algorithm

Functional Decomposition

• Not as natural as domain decomposition
• Consider search problems
• Often functional decomposition is very useful at a higher level, e.g. climate modeling:
– ocean simulation
– hydrology
– atmosphere, etc.

Page 78: Parallel Algorithm

Communication

• The information flow between tasks is specified in this stage of the design

• Remember:– Tasks execute concurrently.– Data dependencies may limit concurrency.

Page 79: Parallel Algorithm

Communication

• Define channels
– Link the producers with the consumers.
– Consider the costs (intellectual and physical).
– Distribute the communication.
• Specify the messages that are sent.

Page 80: Parallel Algorithm

Communication Patterns

• Local vs. global
• Structured vs. unstructured
• Static vs. dynamic
• Synchronous vs. asynchronous

Page 81: Parallel Algorithm

Local Communication

• Communication within a neighborhood.

Algorithm choice determines communication.

Page 82: Parallel Algorithm

Global Communication

• Not localized.
• Examples
– All-to-all
– Master-worker

Page 83: Parallel Algorithm

Structured Communication

• Each task’s communication resembles each other task’s communication

• Is there a pattern?

Page 84: Parallel Algorithm

Unstructured Communication

• No regular pattern that can be exploited.

• Examples– Unstructured Grid– Resolution changes

Complicates the next stages of design

Page 85: Parallel Algorithm

Synchronous Communication

• Both consumers and producers are aware when communication is required

• Explicit and simple


Page 86: Parallel Algorithm

Asynchronous Communication

• Timing of send/receive is unknown.– No pattern

• Consider: very large data structure– Distribute among computational tasks (polling)– Define a set of read/write tasks– Shared Memory

Page 87: Parallel Algorithm

Agglomeration

• The partitioning and communication steps were abstract.
• Agglomeration moves to the concrete.
• Combine tasks so that they execute efficiently on some parallel computer.
• Consider replication.

Page 88: Parallel Algorithm

Mapping

• Specify where each task is to operate.
• Mapping may need to change depending on the target architecture.
• Mapping is NP-complete.

Page 89: Parallel Algorithm

Mapping

• Goal: Reduce Execution Time– Concurrent tasks ---> Different processors– High communication ---> Same processor

• Mapping is a game of trade-offs.

Page 90: Parallel Algorithm

Mapping

• Many domain-decomposition problems make mapping easy.– Grids– Arrays– etc.

Page 91: Parallel Algorithm

Speedup in Simplest Terms

• Speedup = sequential execution time / parallel execution time
• Quinn's notation for speedup is ψ(n, p) for data size n and p processors.

Page 92: Parallel Algorithm

Linear Speedup Usually Optimal

• Speedup is linear if S(n) = Θ(n).
• Theorem: The maximum possible speedup for parallel computers with n PEs for “traditional problems” is n.
• Proof:
– Assume the computation is partitioned perfectly into n processes of equal duration.
– Assume no overhead is incurred as a result of this partitioning of the computation (e.g., partitioning process, information passing, coordination of processes, etc.).
– Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation.
– The parallel running time is ts / n.
– Then the parallel speedup of this computation is S(n) = ts / (ts / n) = n.

Page 93: Parallel Algorithm

Linear Speedup Usually Optimal (cont.)

• We shall later see that this “proof” is not valid for certain types of nontraditional problems.
• Unfortunately, the best speedup possible for most applications is much smaller than n.
– The optimal performance assumed in the last proof is unattainable.
– Usually some parts of programs are sequential and allow only one PE to be active.
– Sometimes a large number of processors are idle for certain portions of the program.
• During parts of the execution, many PEs may be waiting to receive or to send data.
• E.g., recall that blocking can occur in message passing.

Page 94: Parallel Algorithm

Superlinear Speedup

• Superlinear speedup occurs when S(n) > n.
• Most texts besides Akl’s and Quinn’s argue that
– linear speedup is the maximum speedup obtainable.
• The preceding “proof” is used to argue that superlinearity is always impossible.
– Occasionally speedup that appears to be superlinear may occur, but can be explained by other reasons such as
• the extra memory in the parallel system,
• a sub-optimal sequential algorithm being used,
• luck, in the case of an algorithm that has a random aspect in its design (e.g., random selection).

Page 95: Parallel Algorithm

Superlinearity (cont.)

• Selim Akl has given a multitude of examples that establish that superlinear algorithms are required for many nonstandard problems.
– If a problem either cannot be solved, or cannot be solved in the required time, without the use of parallel computation, it seems fair to say that ts = ∞. Since for a fixed tp > 0, S(n) = ts / tp is greater than 1 for all sufficiently large values of ts, it seems reasonable to consider these solutions to be “superlinear”.
– Examples include “nonstandard” problems involving
• real-time requirements, where meeting deadlines is part of the problem requirements;
• problems where all data are not initially available, but have to be processed after they arrive;
• real-life situations such as a “person who can only keep a driveway open during a severe snowstorm with the help of friends”.
– Some problems are natural to solve using parallelism, and sequential solutions are inefficient.

Page 96: Parallel Algorithm

Superlinearity (cont.)

• The last chapter of Akl’s textbook and several journal papers by Akl were written to establish that superlinearity can occur.
– It may still be a long time before the possibility of superlinearity occurring is fully accepted.
– Superlinearity has long been a hotly debated topic and is unlikely to be widely accepted quickly.
• For more details on superlinearity, see [2] “Parallel Computation: Models and Methods”, Selim Akl, pp. 14-20 (Speedup Folklore Theorem) and Chapter 12.
• This material is covered in more detail in my PDA class.

Page 97: Parallel Algorithm

Speedup Analysis

• Recall the speedup definition: ψ(n,p) = ts / tp
• A bound on the maximum speedup is given by
ψ(n,p) ≤ [σ(n) + φ(n)] / [σ(n) + φ(n)/p + κ(n,p)]
– Inherently sequential computations are σ(n)
– Potentially parallel computations are φ(n)
– Communication operations are κ(n,p)
– The “≤” bound above is due to the assumption in the formula that the speedup of the parallel portion of the computation will be exactly p.
– Note κ(n,p) = 0 for SIMDs, since communication steps are usually included with computation steps.

Page 98: Parallel Algorithm

Execution time for the parallel portion: φ(n)/p

[Figure: time vs. processors — a nontrivial parallel algorithm’s computation component is a decreasing function of the number of processors used.]

Page 99: Parallel Algorithm

Time for communication: κ(n,p)

[Figure: time vs. processors — a nontrivial parallel algorithm’s communication component is an increasing function of the number of processors.]

Page 100: Parallel Algorithm

Execution time of the parallel portion: φ(n)/p + κ(n,p)

[Figure: time vs. processors — combining the two components shows that, for a fixed problem size, there is an optimum number of processors that minimizes overall execution time.]

Page 101: Parallel Algorithm

Speedup Plot: “elbowing out”

[Figure: speedup vs. processors — speedup rises, flattens out, and eventually falls off as more processors are added.]

Page 102: Parallel Algorithm

Cost

• The cost of a parallel algorithm (or program) is
Cost = parallel running time × number of processors
• Since “cost” is a much-overused word, the term “algorithm cost” is sometimes used for clarity.
• The cost of a parallel algorithm should be compared to the running time of a sequential algorithm.
– Cost removes the advantage of parallelism by charging for each additional processor.
– A parallel algorithm whose cost is big-oh of the running time of an optimal sequential algorithm is called cost-optimal.

Page 103: Parallel Algorithm

Cost Optimal

• From the last slide, a parallel algorithm is cost-optimal if
parallel cost = O(f(t)),
where f(t) is the running time of an optimal sequential algorithm.
• Equivalently, a parallel algorithm for a problem is said to be cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.
– By proportional, we mean that
cost = tp × n = k × ts
where k is a constant and n is the number of processors.
• In cases where no optimal sequential algorithm is known, the “fastest known” sequential algorithm is sometimes used instead.

Page 104: Parallel Algorithm

Efficiency

Efficiency = sequential execution time / (processors used × parallel execution time)
Efficiency = speedup / processors used
Efficiency = sequential running time / cost

Efficiency is denoted by ε(n,p) in Quinn for a problem of size n on p processors.

Page 105: Parallel Algorithm

Bounds on Efficiency

• Recall
(1) efficiency = speedup / processors
• For algorithms for traditional problems, superlinearity is not possible, so
(2) speedup ≤ processors
• Since speedup ≥ 0 and processors > 1, it follows from the above two equations that
0 ≤ ε(n,p) ≤ 1
• Algorithms for non-traditional problems also satisfy 0 ≤ ε(n,p). However, for superlinear algorithms it follows that ε(n,p) > 1, since speedup > p.

Page 106: Parallel Algorithm

Amdahl’s Law

Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S achievable by a parallel computer with n processors is

S(n) ≤ 1 / ( f + (1 - f)/n )

• The word “law” is often used by computer scientists when it is an observed phenomenon (e.g., Moore’s Law) and not a theorem that has been proven in a strict sense.
• However, Amdahl’s law can be proved for traditional problems.
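
A one-line Python rendering of the bound is handy for seeing how quickly the speedup saturates at 1/f; the value f = 0.05 below is an arbitrary example.

def amdahl_speedup(f, n):
    # Maximum speedup with n processors when a fraction f must run sequentially.
    return 1.0 / (f + (1.0 - f) / n)

for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.05, n), 2))   # approaches 1/f = 20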

Page 107: Parallel Algorithm

Proof for traditional problems: If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is given by tp ≥ f·ts + [(1 - f)·ts] / n, as shown below:

Page 108: Parallel Algorithm

Proof of Amdahl’s Law (cont.)

• Using the preceding expression for tp:

S(n) = ts / tp ≤ ts / ( f·ts + (1 - f)·ts / n ) = 1 / ( f + (1 - f)/n )

• The last expression is obtained by dividing the numerator and denominator by ts, which establishes Amdahl’s law.
• Multiplying the numerator and denominator by n produces the following alternate version of this formula:

S(n) ≤ n / ( 1 + (n - 1)·f )

Page 109: Parallel Algorithm

Amdahl’s Law

• The preceding proof assumes that speedup cannot be superlinear; i.e.,
S(n) = ts / tp ≤ n
– This assumption is only valid for traditional problems.
– Question: where is this assumption used?
• The pictorial portion of this argument is taken from chapter 1 of Wilkinson and Allen.
• Sometimes Amdahl’s law is just stated as S(n) ≤ 1/f.
• Note that S(n) never exceeds 1/f and approaches 1/f as n increases.

Page 110: Parallel Algorithm

Consequences of Amdahl’s Limitations to Parallelism

• For a long time, Amdahl’s law was viewed as a fatal flaw in the usefulness of parallelism.
• Amdahl’s law is valid for traditional problems and has several useful interpretations.
• Some textbooks show how Amdahl’s law can be used to increase the efficiency of parallel algorithms.
– See reference (16), the Jordan & Alaghband textbook.
• Amdahl’s law shows that efforts to further reduce the fraction of the code that is sequential may pay off in large performance gains.
• Hardware that achieves even a small decrease in the percentage of things executed sequentially may be considerably more efficient.

Page 111: Parallel Algorithm

Limitations of Amdahl’s Law

– A key flaw in past arguments that Amdahl’s law is a fatal limit to the future of parallelism is
• Gustafson’s Law: the proportion of the computation that is sequential normally decreases as the problem size increases.
– Note: Gustafson’s law is an “observed phenomenon” and not a theorem.
– Other limitations in applying Amdahl’s law:
• Its proof focuses on the steps in a particular algorithm, and does not consider that other algorithms with more parallelism may exist.
• Amdahl’s law applies only to “standard” problems where superlinearity cannot occur.

Page 112: Parallel Algorithm

Other Limitations of Amdahl’s Law

• Recall
ψ(n,p) ≤ [σ(n) + φ(n)] / [σ(n) + φ(n)/p + κ(n,p)]
• Amdahl’s law ignores the communication cost κ(n,p) in MIMD systems.
– This term does not occur in SIMD systems, as communication routing steps are deterministic and counted as part of the computation cost.
• On communication-intensive applications, even the κ(n,p) term does not capture the additional communication slowdown due to network congestion.
• As a result, Amdahl’s law usually overestimates the achievable speedup.

Page 113: Parallel Algorithm

Amdahl Effect

• Typically the communication time κ(n,p) has lower complexity than φ(n)/p (i.e., the time for the parallel part).
• As n increases, φ(n)/p dominates κ(n,p).
• As n increases,
– the sequential portion of the algorithm decreases,
– the speedup increases.
• Amdahl Effect: speedup is usually an increasing function of the problem size.

Page 114: Parallel Algorithm

Illustration of Amdahl Effect

[Figure: speedup vs. processors for n = 100, n = 1,000 and n = 10,000 — larger problem sizes give higher speedup curves.]

Page 115: Parallel Algorithm

The Isoefficiency Metric (Terminology)

• Parallel system – a parallel program executing on a parallel computer
• Scalability of a parallel system – a measure of its ability to increase performance as the number of processors increases
• A scalable system maintains efficiency as processors are added
• Isoefficiency – a way to measure scalability

Page 116: Parallel Algorithm

Notation Needed for the Isoefficiency Relation

• n – data size
• p – number of processors
• T(n,p) – execution time using p processors
• ψ(n,p) – speedup
• σ(n) – inherently sequential computations
• φ(n) – potentially parallel computations
• κ(n,p) – communication operations
• ε(n,p) – efficiency

Note: At least in some printings of Quinn’s textbook there appears to be a misprint on page 170, where one of the above symbols is substituted for another; to correct it, simply replace each occurrence of the wrong symbol.

Page 117: Parallel Algorithm

Isoefficiency Concepts

• T0(n,p) is the total time spent by processes doing work not done by the sequential algorithm.
• T0(n,p) = (p - 1)·σ(n) + p·κ(n,p)
• We want the algorithm to maintain a constant level of efficiency as the data size n increases. Hence, ε(n,p) is required to be a constant.
• Recall that T(n,1) represents the sequential execution time.

Page 118: Parallel Algorithm

The Isoefficiency Relation

Suppose a parallel system exhibits efficiency ε(n,p). Define

C = ε(n,p) / (1 - ε(n,p))
T0(n,p) = (p - 1)·σ(n) + p·κ(n,p)

In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied:

T(n,1) ≥ C·T0(n,p)

Page 119: Parallel Algorithm

Isoefficiency Relation Derivation (see page 170-17 in Quinn)

MAIN STEPS:
• Begin with the speedup formula
• Compute the total amount of overhead
• Assume efficiency remains constant
• Determine the relation between sequential execution time and overhead

Page 120: Parallel Algorithm

Deriving the Isoefficiency Relation (see Quinn, pgs 170-17)

Determine the overhead:
T0(n,p) = (p - 1)·σ(n) + p·κ(n,p)

Substitute the overhead into the speedup equation:
ψ(n,p) ≤ p·(σ(n) + φ(n)) / (σ(n) + φ(n) + T0(n,p))

Substitute T(n,1) = σ(n) + φ(n) and assume the efficiency is constant. This gives the isoefficiency relation:
T(n,1) ≥ C·T0(n,p)

Page 121: Parallel Algorithm

Isoefficiency Relation Usage

• Used to determine the range of processors for which a given level of efficiency can be maintained
• The way to maintain a given efficiency is to increase the problem size as the number of processors increases.
• The maximum problem size we can solve is limited by the amount of memory available.
• The memory size is a constant multiple of the number of processors for most parallel systems.

Page 122: Parallel Algorithm

The Scalability Function

• Suppose the isoefficiency relation reduces to n ≥ f(p).
• Let M(n) denote the memory required for a problem of size n.
• M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency.
• We call M(f(p))/p the scalability function [i.e., scale(p) = M(f(p))/p].

Page 123: Parallel Algorithm

Meaning of the Scalability Function

• To maintain efficiency when increasing p, we must increase n.
• The maximum problem size is limited by available memory, which increases linearly with p.
• The scalability function shows how memory usage per processor must grow to maintain efficiency.
• If the scalability function is a constant, the parallel system is perfectly scalable.

Page 124: Parallel Algorithm

Interpreting the Scalability Function

[Figure: memory needed per processor vs. number of processors, with curves C, C·log p, C·p and C·p·log p. Curves that stay at or below the available memory size per processor can maintain efficiency; curves that grow above it cannot.]

Page 125: Parallel Algorithm

Odd-Even Transposition Sort

• Parallel version of bubblesort – many compare-exchanges are done simultaneously.
• The algorithm consists of odd phases and even phases:
– In an even phase, even-numbered processes exchange numbers (via messages) with their right neighbor.
– In an odd phase, odd-numbered processes exchange numbers (via messages) with their right neighbor.
• The algorithm alternates odd and even phases for O(n) iterations.
Page 126: Parallel Algorithm

Odd-Even Transposition Sort

• Data movement

[Figure: general data-movement pattern for n = 5 — processes P0–P4 compare-exchange with alternating neighbors at times T = 1 through T = 5.]

Page 127: Parallel Algorithm

Odd-Even Transposition Sort

• Example (general pattern for n = 5, processes P0–P4):

T=0: 3 10 4 8 1
T=1: 3 10 4 8 1
T=2: 3 4 10 1 8
T=3: 3 4 1 10 8
T=4: 3 1 4 8 10
T=5: 1 3 4 8 10

Page 128: Parallel Algorithm

Odd-Even Transposition Code

• Compare-exchange is accomplished through message passing.

Even phase (pairs P0–P1, P2–P3, …):

P_i = 0, 2, 4, …, n-2:
    recv(&A, P_i+1); send(&B, P_i+1); if (A < B) B = A;

P_i = 1, 3, 5, …, n-1:
    send(&A, P_i-1); recv(&B, P_i-1); if (A < B) A = B;

Odd phase (pairs P1–P2, P3–P4, …):

P_i = 1, 3, 5, …, n-3:
    recv(&A, P_i+1); send(&B, P_i+1); if (A < B) B = A;

P_i = 2, 4, 6, …, n-2:
    send(&A, P_i-1); recv(&B, P_i-1); if (A < B) A = B;

(In each phase the left process of a pair keeps the smaller value and the right process keeps the larger.)
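
A sequential Python simulation of the whole sort (the compare-exchanges are done in place instead of via send/recv); it reproduces the 3 10 4 8 1 example from the earlier slide.

def odd_even_transposition_sort(a):
    # n phases; even phases compare-exchange pairs (0,1), (2,3), ... and
    # odd phases pairs (1,2), (3,4), ...  In the message-passing version each
    # pair is handled by two neighboring processes exchanging their values.
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([3, 10, 4, 8, 1]))   # [1, 3, 4, 8, 10]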

Page 129: Parallel Algorithm

Sorting on the CRCW and CREW PRAMs

• Sort on the CRCW PRAM
– Similar idea to that used to design MIN-CRCW
– Needs a more powerful (less realistic) model that resolves concurrent writes by summing the values to be written concurrently
– Uses n(n-1)/2 processors and runs in constant time
– Drawback: number of processors and the powerful model
• Sort on the CREW PRAM
– Uses an auxiliary two-dimensional array Win[1:n, 1:n]
– Separate write and separate sum in O(1) and O(log n) using O(n^2) processors

Page 130: Parallel Algorithm


Page 131: Parallel Algorithm

Odd-Even Merge Sort on the EREW PRAM

• Sorting on the EREW PRAM model: uses too many processors
• Merge Sort on the EREW PRAM
– Based on the idea of divide-and-conquer
– Idea: divide a list into two sub-lists, recursively sort, and merge
– Uses n processors with complexity Θ(n)
– S(n) = Θ(log n), C(n) = Θ(n^2)
• Odd-Even Merge Sort on the EREW PRAM
– Speeds up the merge process
– Make an odd-index list and an even-index list, and merge recursively
– Last, do an even-odd compare-exchange step

Page 132: Parallel Algorithm

Odd-Even Merge Sort on the EREW PRAM: cont’d

• Odd-Even Merge Sort on the EREW PRAM
– Number of processors used: n
– W(n) = 1 + 2 + … + log n = Θ((log n)^2)
– C(n) = Θ(n (log n)^2)

Page 133: Parallel Algorithm


Page 134: Parallel Algorithm


Page 135: Parallel Algorithm

Sorting on the One-dimensional Mesh

• Any comparison-based parallel sorting algorithm on the mesh Mp must perform at least n-1 communication steps to properly decide the relative order of the elements in P1 and Pn.
• A speedup of at most log n can therefore be achieved on the one-dimensional mesh.
• Two algorithms:
– Insertion sort
– Odd-even transposition sort

Page 136: Parallel Algorithm

Sorting on the Two-dimensional Mesh

• Order in the two-dimensional mesh
– Row- and column-major orders
– Snake order
• Snake-order sorting
– Repetition of row sort and column sort
– When sorting columns, the direction follows the snake order
– W(n) = (⌈log n⌉ + 1)·√n
– C(n) = n·W(n)
– S(n) ≈ √n

Page 137: Parallel Algorithm


Page 138: Parallel Algorithm


Page 139: Parallel Algorithm

Bitonic MergeSort EREW PRAM

• What is a bitonic list?
– A sequence of numbers x1, …, xn with the property that
• there exists an index i such that x1 < x2 < … < xi and xi > xi+1 > … > xn, or else
• there exists a cyclic shift of indices so that condition (1) holds.
• What is the rearrangement?
– For X = (x1, …, xn), X' = (x1', …, xn') is defined by
• xi' = min(xi, xi+n/2), xi+n/2' = max(xi, xi+n/2)
• Property
– Let A and B be the sub-lists of X' after the rearrangement.
– Then A and B are bitonic lists.
– And any element in B is larger than all elements in A.

Page 140: Parallel Algorithm


Page 141: Parallel Algorithm

Bitonic MergeSort EREW PRAM: cont’d

• Bitonic Sort
– Input: a bitonic list
– Output: a sorted list
– Algorithm: recursive rearrangement
• Bitonic Merge
– Merge of two increasing-order lists into one sorted list
– Change the index used for the rearrangement: (i + n/2) => (n - i + 1)
• The results are two bitonic sub-lists
– Call Bitonic Sort on the two different lists
• Bitonic MergeSort Algorithm
– Input: a random list
– Output: a sorted list
– Recursive call of Bitonic MergeSort, followed by a call of Bitonic Merge
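
A recursive Python sketch of bitonic mergesort along these lines, assuming the list length is a power of two (function names are illustrative):

def bitonic_merge(a, ascending=True):
    # a is a bitonic list: the rearrangement splits it into two bitonic halves,
    # with every element of one half <= every element of the other, then recurses.
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

def bitonic_mergesort(a, ascending=True):
    # Sort the two halves in opposite directions (giving a bitonic list),
    # then apply the bitonic merge.
    n = len(a)
    if n == 1:
        return a
    first = bitonic_mergesort(a[:n // 2], True)
    second = bitonic_mergesort(a[n // 2:], False)
    return bitonic_merge(first + second, ascending)

print(bitonic_mergesort([5, 1, 8, 3, 7, 2, 6, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8]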

Page 142: Parallel Algorithm


Page 143: Parallel Algorithm


Page 144: Parallel Algorithm

Bitonic MergeSort EREW PRAM: cont’d

• Bitonic MergeSort complexity
– W(n) = Θ((log n)^2)
– C(n) = Θ(n (log n)^2)
– S(n) = Θ(n / log n)
• Bitonic merge sorting network
– See the figure

Page 145: Parallel Algorithm

[Figure: bitonic merge sorting network (diagram not recoverable from the extracted text).]