1990 International Conference on Parallel Processing

MINIMIZING MEMORY REQUIREMENTS FOR PARTITIONABLE SIMD/SPMD MACHINES

Mark A. Nichols, markni@ecn.purdue.edu
Howard Jay Siegel, hj@ecn.purdue.edu
Henry G. Dietz, hankd@ecn.purdue.edu
Russell W. Quong, quong@ecn.purdue.edu
Wayne G. Nation, wn@ecn.purdue.edu

Parallel Processing Laboratory
School of Electrical Engineering
Purdue University, West Lafayette, IN 47907 USA

Abstract — A model for the creation of "perfect" memory maps for parallel machines capable of user-controlled partitionable SIMD/SPMD operation is developed. The term "perfect" implies that no memory fragmentation occurs and ensures that the memory map size is kept to a minimum. The major constraint on solving this problem is one inherent to the single program nature of both the SIMD and SPMD modes of parallelism. Specifically, all processors within the same submachine need to use identical addresses to access data allocated across their local memories. Necessary and sufficient conditions are derived for being able to create "perfect" memory maps and these results are applied to several partitionable interconnection networks.

1. Introduction

A partitionable machine has the ability to dynamically group together (partition) its processors to form independent or cooperating submachines of various sizes. Due to partitionability, different submachines can have different code and data segments. When a partitionable machine is operating in either the SIMD (Single Instruction - Multiple Data) mode [11] of parallelism or the SPMD (Single Program - Multiple Data) [6,7] restriction of the MIMD (Multiple Instruction - Multiple Data) mode [11] of parallelism, the processor memory utilization problem must be dealt with under addressing constraints inherent to these modes. Specifically, it is assumed that their single instruction/program nature forces all parallel code and variables that are distributed across all processor memories within the same submachine to be located at identical addresses within these memories. An implication of this is that if the code/data segments are packed into the memory maps in an arbitrary order, fragmentation within the memory maps can occur. Worst case, this fragmentation grows exponentially with the number of processors because it is directly related to the number of distinct submachines that can be utilized within a system.

This paper develops a model for the creation of fragment-free (perfect) memory maps for large-scale (e.g., 2^6- to 2^16-processor) partitionable machines capable of SIMD and/or SPMD operation, where each processor has an associated local memory (i.e., memory is physically distributed). The partitioning can be changed under user control at instruction level granularity. Thus, these results apply to many types of parallel architectures: partitionable SIMD, partitionable SPMD, MIMD when operating in partitionable SPMD, partitionable SIMD/SPMD (capable of dynamically switching between SIMD and SPMD at instruction level granularity), and partitionable SIMD/MIMD when operating in partitionable SIMD/SPMD mode. This research is motivated by the

This research was supported in part by the Office of Naval Research under grant no. N00014-90-J-1483, the National Science Foundation under grant no. 8624385-CDR, and the Naval Ocean Systems Center under the High Performance Computing Block, ONT.

study of the PASM parallel processing system [10,29,30].

The parallel applications addressed by this paper consist of an overall task that can be decomposed into many subtasks, where each subtask can employ parallelism and is mapped onto a specific submachine by the user. Typically, some subtasks will be able to execute concurrently, while others must adhere to temporal precedence constraints. In this paper, it is assumed that the user controls the partitioning of the system into submachines and specifies the mapping of subtasks onto submachines at compile-time. This mapping must specify any precedence constraints. While automatic system reconfiguration of partitionable machines is a long term goal [4], much research needs to be done. For the class of systems under consideration, explicit user control will be appropriate and necessary for the near term future to utilize these systems more efficiently and to help gain the prerequisite knowledge for automatic reconfiguration.

This paper provides necessary and sufficient conditions for a partitionable SIMD and/or SPMD machine to have fragment-free (perfect) memory maps for execution of any parallel programs. Because machine partitionability is usually a function of a machine's interconnection network, a study is also performed on a few networks to determine whether machines that employ these networks satisfy these necessary and sufficient conditions. The generation of processor memory maps is the primary focus; therefore, the treatment of SPMD mode code and data segments and SIMD mode data segments is considered. It is assumed that all such code and data for a given task will be stored in the processor memories. The code segments for SIMD mode are not kept within processor memories so, in general, they will not be addressed.

The resulting size of the memory maps for a parallel program is important for a number of reasons. First, if the entire memory requirement for a job can be satisfied by primary memory, then there is no need to go to a secondary memory source. Second, on some parallel machines, the time to load a parallel program into processor memories is significant because loading is performed one processor memory at a time and memory fragmentation will degrade performance. Lastly, hardware cost is a factor, especially within large-scale parallel processing systems. If a scheme always generates memory maps that contain no fragmentation, then the amount of memory required for an "average-sized" program is minimized. This provides system architects the option of implementing smaller memories.

Some parallel machines that have been built with physically distributed memory (i.e., each processor has an associated local memory) and some form of partitioning are: the BBN Butterfly* [5], IBM RP3* [23], Connection Machine [14,33], iPSC Hypercube [19], NCube system [13], and PASM [29,30]. Of these, the Butterfly, iPSC Hypercube,

* These machines use a shared memory addressing scheme, but the memories are physically distributed with each memory associated with one local processor.


NCube system, and RP3 are MIMD machines capable of partitionable SPMD mode. The Connection Machine hardware can support partitionable SIMD mode. (Also note that the proposed MAP [20] machine and the original designs for the Illiac IV [3] were partitionable SIMD architectures.) PASM is a partitionable SIMD/MIMD machine (capable of partitionable SIMD/SPMD operation). (While not partitionable, OPSILA is a SIMD/SPMD machine that has been built [2,8].)

Section 2 develops a model for describing partitioning, partitionable SIMD/SPMD addressing constraints, and creating "perfect" memory maps. Necessary and sufficient conditions on a partitionable machine for the creation of "perfect" memory maps are derived in Section 3, and Section 4 applies these results to machines that employ different types of partitionable interconnection networks.

2. Memory Model for Partitionable SIMD/SPMD Machines

The formalism presented within this section and the next requires some concepts from discrete mathematics [1,12,31], that are reviewed below.

(a) Let R be a binary relation on a set A. Relation R is reflexive if x ∈ A → x R x.

(b) Let R be a binary relation on a set A. Relation R is antisymmetric if for x, y ∈ A, (x R y ∧ y R x) → x = y.

(c) Let R be a binary relation on a set A. Relation R is transitive if for x, y, z ∈ A, (x R y ∧ y R z) → x R z.

(d) Given set A and binary relation ≤ on A, if ≤ is reflexive, antisymmetric, and transitive then the ordered pair (A, ≤) defines a partial ordering on A.

(e) The Hasse diagram of the partial ordering (A, ≤) is the directed acyclic graph (DAG) constructed by placing an arc from node x to node y if x ≤ y, x ≠ y, and there does not exist a z such that x < z < y, for x, y, z ∈ A.

(f) Given set A and binary relation ≤ on A, if for all a, b ∈ A either a ≤ b or b ≤ a, then the ordered pair (A, ≤) defines a total ordering on A.

(g) Given a partial ordering (A, ≤), a topological sort of (A, ≤) is a total ordering (A, ⊑) that has embedded within it the partial ordering (A, ≤). That is, if a, b ∈ A, then a ≤ b → a ⊑ b, while a ⊑ b does not necessarily imply a ≤ b.
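On a finite set, the three properties in definitions (a)-(d) can be checked mechanically. The following Python sketch (illustrative only; the function name is not from the paper) tests them for the superset relation on submachines, which is the partial ordering used throughout this paper:

```python
def is_partial_order(A, rel):
    """Check whether rel (modeling x <= y) is a partial order on finite set A."""
    reflexive = all(rel(x, x) for x in A)
    antisymmetric = all(x == y or not (rel(x, y) and rel(y, x))
                        for x in A for y in A)
    transitive = all(rel(x, z) or not (rel(x, y) and rel(y, z))
                     for x in A for y in A for z in A)
    return reflexive and antisymmetric and transitive

# The superset relation on submachines (sets of PEG indices) is a partial order:
subs = [frozenset(s) for s in ({0, 1, 2, 3}, {0, 1}, {2, 3}, {1})]
print(is_partial_order(subs, lambda x, y: x >= y))       # True
# Mere overlap is not a partial order: it fails antisymmetry.
print(is_partial_order(subs, lambda x, y: bool(x & y)))  # False
```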

The remainder of this section defines a model used to describe machine partitioning capabilities, partitionable SIMD/SPMD addressing constraints, and the creation of "perfect" memory maps. The first set of definitions covers partitionable machines capable of operating in SIMD and/or SPMD mode.

(a) A processing element (PE) is a processor/memory pair.

(b) A processing element group (PEG) is a set of one or more PEs (this paper assumes that the set of PEs within a given PEG is fixed for the duration of a parallel program [24]).

(c) A partitionable machine, M, includes a set of B PEGs, i.e., M = {peg_0, peg_1, ..., peg_{B-1}}.

(d) A submachine, S, of partitionable machine M is a subset of M, i.e., S ⊆ M, that is executing a given subtask.

(e) In a partitionable SIMD machine, the PEGs of a submachine are broadcast all of the code they execute. All PEs within a submachine are implicitly synchronized after every instruction.

(f) In a partitionable SPMD machine, each PE functions independently using its own program counter and all PEs in a given submachine are executing the same program. Because the PEs will more than likely be operating on different data sets, they may be taking different control paths through the program. There is no implicit synchronization as in the partitionable SIMD machine case.

(g) In a partitionable SIMD/SPMD machine, all submachines can independently dynamically switch between SIMD mode and SPMD mode at instruction level granularity.

The following collection of definitions focuses on machine partitioning, tasks, and the set of all possible submachines that can be generated on a partitionable machine. The submachine set is instrumental in the determination of whether a partitionable machine supports fragment-free memory maps.

(h) A partitioning, P, of machine M is a set of submachines, P = {S_0, S_1, ..., S_{|P|-1}}, such that each PEG of M is in exactly one S_i for 0 ≤ i < |P|, i.e., P consists of mutually disjoint subsets that cover M. The term |P| denotes the cardinality of the set P.

(i) The partitioning set, R, for machine M is the set of all possible partitionings permitted within M, i.e., R = {P_0, P_1, ..., P_{|R|-1}}.

(j) A submachine set, Q, for machine M with partitioning set R is the set of all possible submachines permitted within M, i.e.,

Q = {S_0, S_1, ..., S_{|Q|-1}} = ∪_{i=0}^{|R|-1} P_i,

where P_i ∈ R for 0 ≤ i < |R|.

(k) A submachine set Q is nested if for all S_x, S_y ∈ Q, (S_x ∩ S_y) ≠ ∅ implies that either S_x ⊆ S_y or S_y ⊆ S_x.

(l) A task, E, performed on machine M is a sequence of partitionings that are ordered in time, i.e., E = <P_0, P_1, ..., P_{|E|-1}>, where each P_i is a partitioning of M. M begins execution of E by being configured for partitioning P_0. For the duration of E, M successively reconfigures itself at run-time for each partitioning in E.

(m) A task submachine set, V, for task E is the set of all submachines used by E, i.e.,

V = {S_0, S_1, ..., S_{|V|-1}} = ∪_{i=0}^{|E|-1} P_i,

where P_i for 0 ≤ i < |E| is a partitioning within E. Note that V ⊆ Q.

As an example, consider a machine M that has four PEGs, i.e., M = {peg_0, peg_1, peg_2, peg_3}, and that the partitioning set R consists of four partitionings, i.e., R = {P_0, P_1, P_2, P_3}, where

P_0 = {M},
P_1 = {{peg_0, peg_1}, {peg_2, peg_3}},
P_2 = {{peg_0}, {peg_1, peg_2}, {peg_3}}, and
P_3 = {{peg_0}, {peg_1}, {peg_2}, {peg_3}}.

By definition, the corresponding submachine set is Q = ∪_{i=0}^{3} P_i = {S_0, S_1, ..., S_7}, where

S_0 = M                 S_1 = {peg_0, peg_1}
S_2 = {peg_1, peg_2}    S_3 = {peg_2, peg_3}
S_4 = {peg_0}           S_5 = {peg_1}
S_6 = {peg_2}           S_7 = {peg_3}.

The Hasse diagram H of (Q, ⊇) is shown in Fig. 1. Note that because (S_1 ∩ S_2) = {peg_1} and S_1 and S_2 are not subsets of one another, the submachine set Q is not nested.

If a different partitioning set R' = {P_0, P_1, P_3} is used, the new submachine set is Q' = P_0 ∪ P_1 ∪ P_3 = {S_0, S_1, S_3, S_4, S_5, S_6, S_7} (i.e., Q' = Q − {S_2}). The Hasse diagram H' of (Q', ⊇) is shown in Fig. 2. Note that Q' is nested.
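The example above can be checked mechanically. A small Python sketch (hypothetical helper names; PEGs are represented by integer indices) builds the submachine sets Q and Q' from the partitionings of definition (j) and applies the nestedness test of definition (k):

```python
from itertools import combinations

def submachine_set(partitionings):
    # Definition (j): Q is the union of all partitionings in R.
    return {frozenset(s) for p in partitionings for s in p}

def is_nested(Q):
    # Definition (k): any two overlapping submachines must be comparable.
    return all(sx <= sy or sy <= sx
               for sx, sy in combinations(Q, 2) if sx & sy)

P0 = [{0, 1, 2, 3}]
P1 = [{0, 1}, {2, 3}]
P2 = [{0}, {1, 2}, {3}]
P3 = [{0}, {1}, {2}, {3}]

Q = submachine_set([P0, P1, P2, P3])   # S_0 .. S_7: eight submachines
Qp = submachine_set([P0, P1, P3])      # Q' = Q - {S_2}
print(len(Q), is_nested(Q))            # 8 False ({0,1} and {1,2} overlap)
print(len(Qp), is_nested(Qp))          # 7 True
```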


Fig. 2: Hasse diagram H'.

Within the next set of definitions, memory maps, "perfect" memory maps, code/data segments, and addressing constraints are formally defined.

(n) For peg_i ∈ M, a PEG memory map, U_i, is an array of u_i memory locations U_i[0..u_i−1] that contains all the code/data for peg_i required for a task. The notation "a..b" refers to all addresses between a and b inclusive.

(o) A submachine segment, D_i, for submachine S_i ∈ Q is a space of d_i words D_i[0..d_i−1] that contains the SIMD data segments and SPMD data and code segments for the different subtasks executed by S_i within a particular parallel program. The data values held in this space can differ among PEs in S_i, however it is assumed that d_i is the same for all PEs in S_i. It is important to note that for a particular E, a given S_i may be contained in multiple P_j's, 0 ≤ j < |R|. Whenever this exact submachine S_i (with no more or no fewer PEGs) is used within any partitionings (P_j's) within a given task E, all SIMD data and SPMD data and code segments that are used by the submachine S_i are included within its submachine segment D_i.

(p) Given a submachine segment D for submachine S ∈ Q, the partitionable SIMD/SPMD addressing constraints state that D can be placed within the PEG memory maps for all peg_i ∈ S only if there exists a starting address, α, for 0 ≤ α ≤ (u_i − d), such that for all peg_i ∈ S, U_i[α..α+d−1] = D[0..d−1]. Note that the same α must be used for all peg_i ∈ S.*

(q) A concise set of PEG memory maps is a set of PEG memory maps where for any segment D_i, corresponding to submachine S_i ∈ Q, that has starting address α_i within the PEG memory maps, either α_i = 0 or there exists a peg_j ∈ S_i such that the memory location U_j[α_i−1] is occupied by another segment (associated with another submachine). Alternately stated, all segments within a concise set of PEG memory maps are bounded from below by either the bottom of the memory map or by a different segment.

(r) A PEG memory map hole is any unused portion of a PEG memory map that is located between address 0 and at least one submachine segment starting address.

(s) A perfect set of PEG memory maps is a set of PEG memory maps that does not contain any PEG memory map holes.

* A different model for partitionable SIMD/SPMD may assume that each processor has special hardware to allow for each PEG to have an independent base address for each D_i, so the α can be different in different PEGs. This is not assumed here for both theoretical and practical reasons. Theoretically, it makes the memory map problem uninteresting. Practically: (1) it increases the complexity of the hardware and may degrade general system performance, and (2) it complicates and increases the time required for inter-PE data references, due to the need to track and employ multiple relocatable bases.

Lastly, methods for "packing" submachine segments and sets of submachine segments into a set of PEG memory maps are defined below.

(t) A D-packing of some submachine segment D for submachine S ∈ Q consists of placing D into the PEG memory maps associated with each PEG in S at some starting address α subject to the partitionable SIMD/SPMD addressing constraints and such that no other previously D-packed segments are overwritten.

(u) A concise D-packing is a D-packing whose starting address α is made as small as possible while still adhering to the constraints associated with a D-packing.

(v) A Q-packing of a submachine set Q consists of D-packing each segment associated with a submachine in Q. If S_i ∈ Q, but S_i ∉ V, then d_i = 0 (segment D_i is null).

(w) A concise Q-packing is a Q-packing that only uses concise D-packings and that assigns a temporal ordering to all of the concise D-packings. Let such an ordering be represented by a total ordering of Q (a Q-pack ordering). Note that a concise Q-packing always results in a concise set of PEG memory maps.

(x) A Q-pack ordering, T, of a submachine set Q is a total ordering of Q, i.e., (Q, <), where if S_1 ∈ Q has segment D_1, S_2 ∈ Q has segment D_2, and S_1 < S_2 within T, then D_1 is concisely D-packed before D_2 in time. Given a submachine set Q, the only variation in concise Q-packings of Q is in the choice of which Q-pack ordering to use.

(y) A perfect Q-packing is one that results in a set of PEG memory maps that is perfect.
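Definitions (t)-(y) suggest a direct procedure: process the submachines in a chosen Q-pack ordering, and concisely D-pack each segment at the smallest starting address α that is simultaneously free in every member PEG's memory map, using the same α for all of them. The sketch below (illustrative names and segment sizes, not from the paper) implements this, together with the hole test of definitions (r)-(s):

```python
def free_at(pmap, a, d):
    # True if [a, a+d) overlaps no interval already placed in this PEG map.
    return all(a + d <= start or start + ds <= a for start, ds in pmap)

def concise_q_pack(order, size, n_pegs):
    """Concisely Q-pack segments in Q-pack ordering `order` (definition (w)).

    order: submachines as frozensets of PEG indices; size: submachine -> d_i.
    Returns one memory map per PEG as a list of (start, length) intervals.
    """
    maps = [[] for _ in range(n_pegs)]
    for S in order:
        d = size.get(S, 0)
        if d == 0:                 # definition (v): S not in V => null segment
            continue
        a = 0                      # smallest common alpha (definition (u));
        while not all(free_at(maps[p], a, d) for p in S):
            a += 1                 # same alpha for every PEG in S (def. (p))
        for p in S:
            maps[p].append((a, d))
    return maps

def is_perfect(maps):
    # Definitions (r)-(s): no unused location below any segment's start.
    for pmap in maps:
        top = 0
        for a, d in sorted(pmap):
            if a != top:
                return False       # hole between `top` and this segment
            top = a + d
    return True

M = frozenset({0, 1, 2, 3}); A = frozenset({0, 1})
sizes = {M: 3, A: 2}
good = concise_q_pack([M, A], sizes, 4)   # superset packed first
bad = concise_q_pack([A, M], sizes, 4)    # subset first: holes on pegs 2, 3
print(is_perfect(good), is_perfect(bad))  # True False
```

The linear scan for α is quadratic in the worst case; it is written for clarity, not speed.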

Fig. 3: Submachine segment sizes.

Continuing the previous example, assume that a particular parallel program is to be executed on machine M with partitioning set R', and with corresponding submachine set Q'. The various segment sizes (d_i's) of the different submachine segments (D_i's) to be concisely Q-packed into M's PEG memory maps are displayed in Fig. 3. Let T_1 be the following total ordering of Q': S_3 < S_7 < S_1 < S_0 < S_6 < S_4 < S_5. If T_1 is used as a Q-pack ordering, then a concise Q-packing of the submachine set Q' results in the concise set of PEG memory maps shown in Fig. 4. Each location represents a fixed-size block of memory, e.g., each location corresponds to 1K words. Unused portions of the PEG memory maps, up to the maximum required address, are represented with diagonal lines. Because two PEG memory map holes exist (one in PEG 2's memory map over locations 0 to 6 and another in PEG 3's memory map over locations 5 and 6), the set of PEG memory maps in Fig. 4 is not perfect.

Consider another Q-pack ordering T_2 represented by the following total ordering of Q': S_0 < S_1 < S_3 < S_4 < S_5 < S_6 < S_7. When T_2 is used,


Fig. 4: Concise set of PEG memory maps.

rather than T_1, a concise Q-packing results in the perfect set of PEG memory maps shown in Fig. 5. Note that the largest memory requirement over all PEGs is 14 locations when T_2 is used as the Q-pack ordering, whereas when T_1 was used, the requirement was 19 locations.

Fig. 5: Perfect set of PEG memory maps.

The following section uses the above model to derive the necessary and sufficient conditions for a machine in this class to be able to achieve a perfect Q-packing for all programs it executes.

3. Perfect Memory Maps for Partitionable SIMD/SPMD Machines

This section treats the problem of Q-packing a set of submachine segments obtained from an arbitrary parallel program so that the resulting set of PEG memory maps created does not suffer from any fragmentation. The results presented in this section are used to derive necessary and sufficient conditions on a partitionable machine for being able to always obtain a perfect Q-packing. These results are important because if perfect Q-packings are not obtainable, then the memory map creation problem for partitionable SIMD/SPMD programs reverts into a variation of the bin packing problem.

Consider the creation of PEG memory maps at compile-time for a partitionable SIMD/SPMD machine program. It is possible for a compiler to determine what submachine segments will be needed on which PEGs so that all segments are not sent to all PEGs. A particular PEG memory map generated for a partitionable SIMD/SPMD machine program will consist of the submachine segments from those submachines that incorporate the PEG.

The proposed method for assimilating a set of submachine segments generated from a parallel program into a set of PEG memory maps is to use a concise Q-packing. The motivation behind this is provided by the following theorem.

Theorem 1: Assume that a given partitionable machine M has a corresponding submachine set Q. Given a set of submachine segments generated from a particular parallel program, if there exists a Q-packing that is perfect, then there exists a concise Q-packing that is also perfect.

Proof: By contradiction. Assume that there exists a perfect Q-packing and that all concise Q-packings are not perfect. It is shown that there exists a Q-pack ordering for a concise Q-packing that results in the same perfect set of PEG memory maps that are generated by the perfect Q-packing that is assumed to exist. Let the Q-pack ordering be a topological sort over the precedences that show which segments are "placed" above other segments in space within the perfect Q-packing assumed to exist. Specifically, for all S_x, S_y ∈ Q, let α_x and α_y be the starting addresses of S_x's segment and S_y's segment, respectively, within the perfect Q-packing that exists. If (S_x ∩ S_y) ≠ ∅ and if α_x < α_y, then S_x must precede S_y within the Q-pack ordering. A concise Q-packing that uses this Q-pack ordering will result in a perfect set of PEG memory maps. □

Due to the fact that the only variation allowed within a concise Q-packing is its Q-pack ordering, the number of distinct concise Q-packings for a submachine set Q must be |Q|!. Therefore, Theorem 1 is significant because it says that if there exists a way to perfectly "pack" a set of segments into a set of PEG memory maps, then one of a possible |Q|! concise Q-packings must exist that can also do it perfectly. The following theorem indicates why perfect Q-packings are desirable.

Theorem 2: For each possible Q-pack ordering T_i for a given task, let μ_i be the maximum memory size required for any PEG. A perfect Q-packing yields the minimum μ_i over all possible Q-pack orderings.

Proof: Given that a perfect Q-packing cannot contain any PEG memory map holes, if a Q-pack ordering T_i results in a perfect Q-packing then all unused memory locations within any particular PEG memory map must exist after the last D-packed segment in space within the PEG memory map (e.g., see Fig. 5). This implies that μ_i must be a minimum over all the possible T_i's. □

The following group of theorems derive the conditions on a partitionable machine M that are necessary and sufficient for being able to perfectly Q-pack the segments from any parallel program executed on M.

Theorem 3: Assume that a given partitionable machine M has a corresponding submachine set Q and let H be the Hasse diagram of the partial ordering (Q, ⊇). If Q is nested then H is a forest of one or more trees.

Proof: By contradiction. Assume that Q is nested and that H is not a forest of one or more trees. Because H is a DAG and not a forest, H must contain a node S ∈ Q that has more than one immediate predecessor within (Q, ⊇). Let two of these immediate predecessors be S_x ∈ Q and S_y ∈ Q. For example, in Fig. 1, the Hasse diagram is a DAG and not a tree because the node S_5 = {peg_1} has exactly two immediate predecessors, namely, S_1 = {peg_0, peg_1} and S_2 = {peg_1, peg_2}. Due to the fact that S ⊆ S_x and S ⊆ S_y, it must be the case that (S_x ∩ S_y) ≠ ∅. Because Q is nested, either S_x ⊆ S_y or S_y ⊆ S_x. Without loss of generality, assume that S_x ⊆ S_y; therefore, there must be a path from S_y to S_x within H. Because there is an edge from S_x to S, this


contradicts the fact that S_y is an immediate predecessor of S within H. □

Theorem 4: Assume that a given partitionable machine M has a corresponding submachine set Q and let T be a Q-pack ordering of Q. If Q is nested and if T is a topological sort of the partial ordering (Q, ⊇), then all partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps.

Proof: By contradiction. Assume that Q is nested, that T is a topological sort of (Q, ⊇), and that there exists a partitionable SIMD/SPMD program executed on M that when concisely Q-packed by T will not have perfect PEG memory maps. Therefore, no matter what Q-pack ordering is used to concisely Q-pack the segments of this program into M's PEG memory maps, there will always exist at least one PEG memory map hole. This implies that there will always exist a submachine segment D_x (corresponding to S_x ∈ Q) with starting address α_x > 0 that bounds the PEG memory map hole from above in space. For example, in Fig. 4, the peg_2 memory map hole over locations 0 to 6 is bounded from above by submachine segment D_0. By definition of a concise Q-packing, D_x was concisely D-packed. This implies that there exists another submachine segment D_y (corresponding to S_y ∈ Q) on which D_x was "placed." Referring back to Fig. 4, if submachine segment D_0 corresponds to submachine segment D_x, then segment D_1 corresponds to segment D_y. Due to the PEG memory map hole, there exists a PEG memory map which contains D_x and does not contain D_y (in Fig. 4, where D_0 and D_1 correspond to D_x and D_y, respectively, this PEG memory map is the one for peg_2). Coupling this with the facts that D_x was "placed on" D_y and that Q is nested implies that S_y ⊆ S_x. Therefore, because T is a topological sort of (Q, ⊇), D_x must be concisely D-packed before D_y in time. However, this leads to a contradiction because D_x was concisely D-packed "on" D_y, which implies that D_y was concisely D-packed before D_x in time. □
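Theorem 4 requires a Q-pack ordering that is a topological sort of (Q, ⊇), i.e., supersets packed before their subsets. One simple way to obtain such an ordering (a sketch, not an algorithm from the paper) is to sort by decreasing cardinality, since a proper superset always has strictly more PEGs than its subsets:

```python
def q_pack_ordering(Q):
    # Supersets first: a topological sort of the partial ordering (Q, superset).
    # S a proper superset of S' implies |S| > |S'|, so decreasing
    # cardinality is a valid topological-sort key.
    return sorted(Q, key=len, reverse=True)

# The nested submachine set Q' from the example in Section 2:
Qp = [frozenset(s) for s in
      ({0, 1, 2, 3}, {0, 1}, {2, 3}, {0}, {1}, {2}, {3})]
T = q_pack_ordering(Qp)
# Every proper superset appears before each of its subsets:
print(all(not (T[j] > T[i])
          for i in range(len(T)) for j in range(i + 1, len(T))))  # True
```

Ties among incomparable submachines of equal size may be broken arbitrarily; any such ordering satisfies the hypothesis of Theorem 4.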

Theorem 5: Assume that a given partitionable machine M has a corresponding submachine set Q and let T be a Q-pack ordering of Q. If all partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps, then T is a topological sort of the partial ordering (Q, ⊇).

Proof: By contradiction. Assume that all partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps and that T is not a topological sort of (Q, ⊇). Choose a partitionable SIMD/SPMD program that operates solely on two submachines, V = {Sx, Sy} ⊆ Q, where Sx ⊂ Sy. Because T is not a topological sort of (Q, ⊇), T must order Sx before Sy. This ensures that Sy's segment is concisely D-packed after Sx's segment in time. Due to the fact that Sx ⊂ Sy implies that (Sx ∩ Sy) ≠ ∅, Sy's segment is concisely D-packed after Sx's segment in space. Because Sx and Sy are the only submachines that have segments to concisely D-pack, PEG memory map holes are created within all PEG memory maps for PEGs in the non-null set Sy − Sx, contradicting the assumption that the memory maps are perfect. For the example in Fig. 4, let Sx = {peg0, peg1}, Sy = {peg0, peg1, peg2, peg3}, Dx = D1, and Dy = D0 (ignore D5, D6, and D7 in the figure). Then there will be PEG memory map holes in locations 0 to 6 for PEGs {peg2, peg3} = Sy − Sx. □

Theorem 6: Assume that a given partitionable machine M has a corresponding submachine set Q and let T be a Q-pack ordering of Q. If all partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps, then Q is nested.

Proof: By contradiction. Assume that all partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps and that Q is not nested. Choose a partitionable SIMD/SPMD program that operates solely on two submachines, V = {Sx, Sy} ⊆ Q, where (Sx ∩ Sy) ≠ ∅, Sx is not a subset of Sy, and Sy is not a subset of Sx. These circumstances directly guarantee that Q is not nested. By Theorem 5, because all partitionable SIMD/SPMD programs executed on M can have their segments concisely Q-packed to form perfect PEG memory maps, T is a topological sort of (Q, ⊇). Due to the fact that Sx is not a subset of Sy and vice versa, T can order Sx before or after Sy; without loss of generality, assume that T orders Sx before Sy. This ensures that Sy's segment is concisely D-packed after Sx's segment in time. Further, because (Sx ∩ Sy) ≠ ∅, Sy's segment is concisely D-packed after Sx's segment in space. Because Sx and Sy are the only submachines that have segments to concisely D-pack, PEG memory map holes are created within all PEG memory maps for PEGs in the non-null set Sy − Sx, contradicting the assumption that the memory maps are perfect. For example, consider Fig. 6, which assumes that Sx = {peg1, peg2} and Sy = {peg2, peg3}. There is a PEG memory map hole over locations 0 and 1 for PEGs within the set Sy − Sx = {peg3}. □

Fig. 6: Maps for a non-nested submachine set.

Corollary 1: (Converse of Theorem 4) Assume that a given partitionable machine M has a corresponding submachine set Q and let T be a Q-pack ordering of Q. If all partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps, then Q is nested and T is a topological sort of the partial ordering (Q, ⊇).

Proof: Follows from Theorems 5 and 6. □

Corollary 2: Assume that a given partitionable machine M has a corresponding submachine set Q and let T be a Q-pack ordering of Q. All partitionable SIMD/SPMD programs executed on M that are concisely Q-packed by T will have perfect PEG memory maps if and only if Q is nested and T is a topological sort of the partial ordering (Q, ⊇).

Proof: Follows from Theorem 4 and Corollary 1. □

As an example of Corollary 2, Q', shown in Fig. 2, is nested, and the Q-pack ordering T2 defined in Section 2 is a topological sort of the partial ordering (Q', ⊇). The resulting perfect Q-packing was shown in Fig. 5. The Q-pack ordering T1, defined in Section 2, for the nested Q' is not a topological sort (S1 is not a superset of S0), and the resulting Q-packing shown in Fig. 4 is not perfect. The Q shown in Fig. 1 is not nested, and for the example in Fig. 6, even though the Q-packing is representable by a topological sort, the Q-packing is not perfect.


If a partitionable machine supports a submachine set Q that is nested, the only constraint on the Q-pack ordering that ensures a perfect Q-packing is that it be a topological sort of the partial ordering (Q, ⊇). As long as (Q, ⊇) is not a total ordering, many topological sorts of (Q, ⊇) will exist. The process of determining such a topological sort at compile time is straightforward. For example, given that Q is nested, a breadth-first traversal of the Hasse diagram of the partial ordering (Q, ⊇) (which by Theorem 3 is a forest of one or more trees) is equivalent to a topological sort of (Q, ⊇).
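As a sketch of this compile-time step (representing submachines as sets of PEG numbers is our own assumption, not the paper's), sorting a nested Q by decreasing submachine size yields one topological sort of (Q, ⊇), since a proper superset always contains strictly more PEGs than its subsets:

```python
# Sketch (representation assumed): for a nested submachine set Q,
# sorting by decreasing submachine size gives a topological sort of
# (Q, ⊇), because a proper superset has strictly more PEGs.

def qpack_order(Q):
    """Return the submachines of a nested Q, supersets first."""
    return sorted(Q, key=len, reverse=True)

Q_nested = [frozenset({0, 1, 2, 3}), frozenset({0, 1}),
            frozenset({2, 3}), frozenset({0}), frozenset({1})]
order = qpack_order(Q_nested)
# No submachine is preceded by one of its proper subsets:
is_topsort = all(not (order[j] > order[i])
                 for i in range(len(order))
                 for j in range(i + 1, len(order)))
```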

In some instances, Corollary 2 can also be applied to the SIMD code segments for each submachine used in a partitionable SIMD program. If each PEG has its own control unit that broadcasts instructions (as in PASM), the partitionable SIMD/SPMD addressing constraints can be applied to the control unit memory maps that contain the SIMD code to be broadcast. On such a partitionable SIMD system, Corollary 2 could be used so that the submachine code segments could be concisely Q-packed to form perfect memory maps.

4. Perfect Memory Maps Based on Specific Interconnection Networks

If the partitioning capability of partitionable machine M satisfies Corollary 2, the PEG memory maps for any parallel program executed on M can be perfectly Q-packed. Because, for the most part, a machine's partitioning capability is based on the partitioning capability of the interconnection network it employs, this section will examine two classes of common partitionable networks. This section also examines what can be done if perfect memory maps are desired on a partitionable machine that does not satisfy Corollary 2 (i.e., Q is not nested). It is concerning this point that a distinction must be made between a machine's partitioning capability and the partitioning capability provided for the programmer. These two do not necessarily have to be the same. Specifically, this section looks at the case where the submachine set Q is not nested, and by forcing the programmer to utilize submachines from a reduced submachine set Qr ⊆ Q, it is possible for the new submachine set Qr to be nested. This method can be used successfully with partitionable machines employing single-stage cube (hypercube) [25,27] or multistage cube [27,28] type interconnection networks. It will also be shown that the PM2I [25,27] and data manipulator family [9,27] interconnection networks satisfy Corollary 2. Without loss of generality, only the single-stage cube and PM2I networks will be studied in depth in this section, because their partitioning properties are the same as those of the multistage cube and data manipulator family, respectively [26].

Interconnection networks provide a means for transferring data among the PEs of a parallel machine. A significant feature supported by many interconnection networks is the ability to partition themselves into independent subnetworks that each maintain complete network functionality [26,27]. Given a machine with N = 2^m PEs and given that the PEs are addressed 0 through N−1, the PM2I interconnection network is based on the following connections:

PM2_{+i}(P) = P + 2^i modulo N and PM2_{−i}(P) = P − 2^i modulo N, for 0 ≤ i < m,

where P is the address of an arbitrary PE. The connections for the single-stage cube interconnection network are:

cube_i(p_{m−1} ... p_{i+1} p_i p_{i−1} ... p_0) = p_{m−1} ... p_{i+1} p̄_i p_{i−1} ... p_0, for 0 ≤ i < m,

where p_{m−1} ... p_{i+1} p_i p_{i−1} ... p_0 is the binary address of an arbitrary PE, and p̄_i is the ones complement of bit p_i.

Thus, with the PM2I network, PE P is connected to the 2m PEs PM2_{±i}(P), 0 ≤ i < m, and with the single-stage cube network, PE P is connected to the m PEs cube_i(P), 0 ≤ i < m. Of the machines mentioned in Section 1, the Connection Machine, iPSC Hypercube, and NCube system employ single-stage cube interconnection networks.
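The two connection patterns can be sketched directly (a hedged illustration; the function names are our own, not the paper's):

```python
# Sketch of the connection functions above, for a machine with
# N = 2**m PEs addressed 0 .. N-1 (names are our own).

def pm2_plus(P, i, N):
    """PM2_{+i}(P) = P + 2^i modulo N."""
    return (P + 2 ** i) % N

def pm2_minus(P, i, N):
    """PM2_{-i}(P) = P - 2^i modulo N."""
    return (P - 2 ** i) % N

def cube(P, i):
    """cube_i complements bit i of PE P's binary address."""
    return P ^ (1 << i)
```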

All submachines created from the partitioning of single-stage cube and PM2I networks can be represented by using PE address masks for PEGs rather than for PEs. A PE address mask [25,27] is an m-position mask used to specify a set of PEs, where each position of the mask corresponds to a bit position in the PE addresses. Each position of the mask is either a 0, 1, or X ("don't care"). For example, if a machine has N = 2^m PEs addressed 0 to N−1, then the set of even-numbered PEs is represented by the mask [X^{m−1} 0]. The notation X^{m−1} is a string of m−1 X's. A submachine mask, which specifies sets of PEGs instead of sets of PEs, can represent any distinct submachine possible with respect to the partitioning capabilities of the single-stage cube and PM2I interconnection networks. This section assumes that a partitionable machine M has B = 2^b PEGs addressed 0 to B−1.
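A PE address mask can be expanded mechanically; this small sketch (the string representation, most significant position first, is our own assumption) illustrates the even-PE example:

```python
from itertools import product

# Sketch (string representation is our own): expand a PE address mask
# over {'0', '1', 'X'}, most significant position first, into the set
# of PE addresses it specifies.

def expand_mask(mask):
    choices = [('0', '1') if c == 'X' else (c,) for c in mask]
    return {int(''.join(bits), 2) for bits in product(*choices)}

# For m = 3 the even-numbered PEs are [X^{m-1} 0], i.e., "XX0":
evens = expand_mask("XX0")
```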

The PM2I interconnection network is partitioned by starting with the submachine mask [X^b] and progressively "instantiating" mask positions that contain X's, from the least significant position to the most significant position, to create additional submachine masks. The instantiation of a submachine mask position containing an X consists of creating two new submachine masks: one mask that is the same as the original except for a 1 in the X's mask position and another mask that is the same as the original except for a 0 in the X's mask position. Each new submachine may independently be partitioned again. Because these instantiations are ordered from the least significant mask position to the most significant mask position, any PM2I submachine S_PM2I with K = 2^k PEGs must be represented with a submachine mask whose least significant b−k mask positions are either 0's or 1's and whose most significant k mask positions are X's. The process of creating PM2I submachine masks by instantiation simply starts with the set of all PEGs and progressively divides this set and resulting sets in half; therefore, Q_PM2I, the submachine set obtained from PM2I partitionings, is guaranteed to be nested. Thus, by Corollary 2, given that a partitionable machine is using a PM2I interconnection network, all partitionable SIMD/SPMD programs executed on the machine can have perfect memory maps. By Theorem 3, it is readily seen that the Hasse diagram of the partial ordering (Q_PM2I, ⊇) is a complete binary tree with height b, which leads to the fact that |Q_PM2I| = 2^{b+1} − 1.
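The instantiation process and the two stated properties (nestedness and |Q_PM2I| = 2^{b+1} − 1) can be checked with a small sketch (mask strings written most significant position first; the representation is our own, not the paper's):

```python
from itertools import product

# Sketch (our own representation): build Q_PM2I by instantiating X's
# from the least significant mask position upward, then check that the
# resulting submachine set is nested and has size 2^(b+1) - 1.

def pm2i_masks(b):
    masks = []
    def split(mask, pos):          # mask index 0 = most significant
        masks.append(''.join(mask))
        if pos < 0:
            return
        for bit in '01':
            child = mask.copy()
            child[pos] = bit       # instantiate the X at position pos
            split(child, pos - 1)
    split(['X'] * b, b - 1)
    return masks

def expand(mask):
    ch = [('0', '1') if c == 'X' else (c,) for c in mask]
    return frozenset(int(''.join(t), 2) for t in product(*ch))

subs = [expand(m) for m in pm2i_masks(3)]
# Nested: any two submachines are disjoint or one contains the other.
nested = all(a <= c or c <= a or not (a & c) for a in subs for c in subs)
```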

When the single-stage cube interconnection network is considered, Corollary 2 no longer applies. The reason is that the partitioning capability of the single-stage cube network is much more flexible than that of the PM2I network. There are no restrictions on the instantiation of mask positions when forming submachine masks. This flexibility implies that there exists a bijection between the set of all possible distinct b-position submachine masks and Q_cube, the set of all distinct single-stage cube submachines. Therefore, |Q_cube| = 3^b, which is considerably larger than |Q_PM2I|. Fig. 7 shows, for B = 4, the submachines within Q_cube represented by their submachine masks and the Hasse diagram of (Q_cube, ⊇) incorporating all of those submachines. It can be seen that Q_cube, for B = 4, is not nested. Because single-stage cube machines with B > 4 are made up of single-stage cube machines with B = 4, the submachine sets for single-stage cube machines with B > 4 are also not nested. Consequently, Corollary 2 cannot be directly applied to single-stage cube interconnection networks.
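A short sketch (our own representation) confirms both claims for b = 2: there are 3^b = 9 submachine masks, and the resulting submachine set is not nested:

```python
from itertools import product

# Sketch (our own representation): every b-position mask over {0,1,X}
# is a valid single-stage cube submachine mask, so |Q_cube| = 3^b;
# for B = 4 (b = 2) the submachine set is not nested.

def cube_submachines(b):
    subs = []
    for mask in product('01X', repeat=b):
        ch = [('0', '1') if c == 'X' else (c,) for c in mask]
        subs.append(frozenset(int(''.join(t), 2) for t in product(*ch)))
    return subs

subs = cube_submachines(2)
# "X0" = {0, 2} and "0X" = {0, 1} overlap, yet neither contains the other:
nested = all(a <= c or c <= a or not (a & c) for a in subs for c in subs)
```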


Fig. 7: Full single-stage cube partitioning for B = 4.

A PM2I network's submachine set Q_PM2I satisfies Corollary 2 because: (1) it is created by instantiating the b submachine mask positions of the initial submachine mask [X^b] in a fixed order (least significant mask position to most significant) and, (2) by definition, an instantiation of a submachine mask position containing an X simply divides the original submachine into two new submachines of equal size. Incorporating this rationale, it is possible to constrain the partitioning capability of the single-stage cube network so that a reduced submachine set Q'_cube is generated that adheres to the conditions required by Corollary 2.

To generate a nested Q'_cube, the instantiations of the b submachine mask positions of [X^b] must be done in a fixed order for any given task. If mask position i is instantiated before mask position j (i ≠ j), then the submachines represented by the masks p_{b−1} ... p_{j+1} 0 p_{j−1} ... p_0 and p_{b−1} ... p_{j+1} 1 p_{j−1} ... p_0 are disjoint subsets of the submachine p_{b−1} ... p_{j+1} X p_{j−1} ... p_0, where p_i is already fixed at 0 or 1. Thus, the resulting Q'_cube is nested, and the partial ordering (Q'_cube, ⊇) depends solely on the ordering applied to the instantiation of the submachine mask positions.
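This fixed-order instantiation can be sketched as follows (our own representation; the hypothetical `order` parameter lists mask indices, index 0 = most significant, in the order they are instantiated); any fixed order yields a nested reduced submachine set:

```python
from itertools import product

# Sketch (our own representation): instantiate the b mask positions of
# [X^b] in ONE fixed order.  Whatever order is chosen, the reduced
# submachine set is nested.

def restricted_cube_masks(b, order):
    masks = []
    def split(mask, depth):
        masks.append(''.join(mask))
        if depth == len(order):
            return
        for bit in '01':
            child = mask.copy()
            child[order[depth]] = bit   # instantiate next position in order
            split(child, depth + 1)
    split(['X'] * b, 0)
    return masks

def expand(mask):
    ch = [('0', '1') if c == 'X' else (c,) for c in mask]
    return frozenset(int(''.join(t), 2) for t in product(*ch))
```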

Fig. 8: ADM network topology.

These results also apply to the multistage version of the PM2I network. The augmented data manipulator (ADM) network [18,27] has N = 2^m inputs and outputs and m stages, where each stage consists of N switching elements (nodes) and 3N output links. There is also an (m+1)st column of network output nodes. Fig. 8 illustrates an ADM network topology for N = 8. The inverse ADM (IADM) [17] and gamma [21] networks have topologies with similar partitioning properties. The connections in this network are based on the PM2I interconnection functions defined earlier in this section. Specifically, each node in stage i, 0 ≤ i < m, can be independently set to straight across, −2^i modulo N, or +2^i modulo N. The stages are ordered m−1 to 0 from input to output. Partitioning the ADM network is essentially the same as for the PM2I network. If a set of 2^k PEs is represented by X^k p_{m−(k+1)} ... p_1 p_0, where each p_i is fixed at either 0 or 1, then each switch numbered X^k p_{m−(k+1)} ... p_1 p_0 in stages m−(k+1) to 0 is set to straight. Because the partitionability of the ADM network is in direct correspondence with the partitionability of the PM2I network, Corollary 2 can be applied to all machines employing an ADM, IADM, or gamma network.

Fig. 9: Generalized-cube network topology.

The multistage version of the cube network, represented by the generalized-cube network topology [27], has N = 2^m inputs and outputs and m stages, where each stage consists of N links and N/2 interchange boxes. The stages are ordered m−1 to 0 from input to output. An interchange box has two inputs and two outputs and can be set to either map the outputs straight across from the inputs or map the outputs as a swap of the inputs (see Fig. 9). Fig. 9 illustrates a generalized-cube network topology for N = 8. Other networks in this family include the baseline [34], indirect binary n-cube [22], multistage shuffle-exchange [32], omega [15], and SW-banyan (with S = F = 2) [16]. The connections in this network are based on the cube interconnection functions. The link labels of a stage i interchange box differ in the i-th bit position, i.e., stage i of the generalized-cube network topology contains the cube_i interconnection function. Partitioning a generalized-cube network is essentially the same as for a single-stage cube network. Once again it is possible to generate a Q that is not nested (e.g., for Fig. 9, if stage 0 is set to straight, PEs 0, 2, 4, and 6 form a submachine, while if stage 2 were set to straight instead, PEs 0, 1, 2, and 3 form a submachine). Due to the correspondence with the single-stage cube network's partitionability, Corollary 2 cannot be applied to machines that employ a multistage cube network. However, by restricting the possible submachines to those in Q'_cube, each of which can be supported by an independent multistage subnetwork [26,27], Corollary 2 can then be applied. Of the machines mentioned in Section 1, the Butterfly, PASM, and RP3 employ multistage cube interconnection networks.
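The non-nested example above can be checked in a couple of lines (a sketch for N = 8):

```python
# Sketch of the non-nested example above, for N = 8: setting stage 0 to
# straight fixes address bit 0; setting stage 2 to straight instead
# fixes address bit 2.

stage0_straight = {p for p in range(8) if p & 1 == 0}   # "XX0" = {0, 2, 4, 6}
stage2_straight = {p for p in range(8) if p & 4 == 0}   # "0XX" = {0, 1, 2, 3}
```

The two submachines overlap but neither contains the other, so a submachine set holding both cannot be nested.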


5. Conclusion

This work provides a model for creating fragment-free (perfect) memory maps on partitionable SIMD/SPMD parallel processing systems. Specifically, it has been shown what machine partitioning capabilities are necessary and sufficient for always being able to create perfect PEG memory maps for partitionable SIMD/SPMD programs. These conditions are dependent on the set of all distinct submachines that can be generated during the partitioning process and on the order in which the submachine data segments are packed into the PEG memory maps. The fact that all machines employing a PM2I or data manipulator family interconnection network satisfy these conditions demonstrates that even when these conditions are in effect, the resulting partitioning capabilities are still quite powerful. By constraining the single-stage cube and multistage cube networks' partitioning capabilities, there exists a rich set of reduced partitioning capabilities that satisfy the perfect memory map conditions; each of these reduced partitioning capabilities is as powerful as the PM2I's and data manipulator family's partitioning capabilities.

In summary, as large-scale parallel processing systems become feasible, it is important to understand the various software issues that will be associated with their efficient usage. Here, a model of memory maps for partitionable SIMD and/or SPMD machines was developed, and properties of these memory maps were derived.

Acknowledgements: The authors of this paper acknowledge many useful discussions with Shin-Dug Kim, Curtis Krauskopf, and Craig Warner.

References

[1] A. V. Aho, J. E. Hopcroft, J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, MA, 1983.
[2] M. Auguin, F. Boeri, J. P. Dalban, A. Vincent-Carrefour, "Experience using a SIMD/SPMD multiprocessor architecture," Microprocessing and Microprogramming, v. 21, Aug. 1987, pp. 171-177.
[3] G. H. Barnes, R. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, R. A. Stokes, "The Illiac IV computer," IEEE Trans. Computers, v. C-17, Aug. 1968, pp. 746-757.
[4] C. H. Chu, E. J. Delp, L. H. Jamieson, H. J. Siegel, F. J. Weil, A. B. Whinston, "A model for an intelligent operating system for executing image understanding tasks on a reconfigurable parallel architecture," J. Parallel and Distributed Computing, v. 6, Jun. 1989, pp. 598-622.
[5] W. Crowther, J. Goodhue, R. Thomas, W. Milliken, T. Blackadar, "Performance measurements on a 128-node butterfly parallel processor," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 531-540.
[6] F. Darema-Rodgers, D. A. George, V. A. Norton, G. F. Pfister, Environment and System Interface for VM/EPEX, Research Report RC11381 (#51260), IBM T. J. Watson Research Center, 1985.
[7] F. Darema, D. A. George, V. A. Norton, G. F. Pfister, "A single-program-multiple-data computational model for EPEX/FORTRAN," Parallel Computing, v. 7, Apr. 1988, pp. 11-24.
[8] P. Duclos, F. Boeri, M. Auguin, G. Giraudon, "Image processing on a SIMD/SPMD architecture: OPSILA," Int'l Conf. Pattern Recognition, Nov. 1988, pp. 430-433.
[9] T. Y. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Computers, v. C-23, Mar. 1974, pp. 309-318.
[10] S. A. Fineberg, T. L. Casavant, H. J. Siegel, "Experimental analysis of a mixed-mode parallel architecture performing sequence sorting," 1990 Int'l Conf. Parallel Processing, Aug. 1990, to appear.
[11] M. J. Flynn, "Very high-speed computing systems," Proc. IEEE, v. 54, Dec. 1966, pp. 1901-1909.
[12] J. L. Gersting, Mathematical Structures for Computer Science, W. H. Freeman & Co., NY, NY, 1982.
[13] J. P. Hayes, T. N. Mudge, Q. F. Stout, S. Colley, "Architecture of a hypercube supercomputer," 1986 Int'l Conf. Parallel Processing, Aug. 1986, pp. 653-660.
[14] W. D. Hillis, The Connection Machine, MIT Press, Cambridge, MA, 1985.
[15] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Computers, v. C-24, Dec. 1975, pp. 1145-1155.
[16] G. J. Lipovski, M. Malek, Parallel Computing: Theory and Comparisons, John Wiley & Sons, Inc., NY, NY, 1987.
[17] R. J. McMillen, H. J. Siegel, "Routing schemes for the augmented data manipulator network in an MIMD system," IEEE Trans. Computers, v. C-31, Dec. 1982, pp. 1202-1214.
[18] R. J. McMillen, H. J. Siegel, "Evaluation of cube and data manipulator networks," J. Parallel and Distributed Computing, v. 2, Feb. 1985, pp. 79-107.
[19] S. F. Nugent, "The iPSC/2 Direct-Connect communications technology," 3rd Conf. Hypercube Computers and Applications, Jan. 1988, pp. 51-60.
[20] G. J. Nutt, "Microprocessor implementation of a parallel processor," 4th Ann. Symp. Computer Architecture, Mar. 1977, pp. 147-152.
[21] D. S. Parker, C. S. Raghavendra, "The gamma network," IEEE Trans. Computers, v. C-33, Apr. 1984, pp. 367-373.
[22] M. C. Pease III, "The indirect binary n-cube microprocessor array," IEEE Trans. Computers, v. C-26, May 1977, pp. 458-473.
[23] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, J. Weiss, "The IBM Research Parallel Processor Prototype (RP3): introduction and architecture," 1985 Int'l Conf. Parallel Processing, Aug. 1985, pp. 764-771.
[24] M. D. Rice, S. D. Seidman, P. Y. Wang, "A formal model for SIMD computation," 2nd Symp. Frontiers of Massively Parallel Computation, Oct. 1988, pp. 601-607.
[25] H. J. Siegel, "Analysis techniques for SIMD machine interconnection networks and the effects of processor address masks," IEEE Trans. Computers, v. C-26, Feb. 1977, pp. 153-161.
[26] H. J. Siegel, "The theory underlying the partitioning of permutation networks," IEEE Trans. Computers, v. C-29, Sep. 1980, pp. 791-801.
[27] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, 2nd Edition, McGraw-Hill, NY, NY, 1990.
[28] H. J. Siegel, W. G. Nation, C. P. Kruskal, L. M. Napolitano, "Using the multistage cube network topology in parallel supercomputers," Proc. IEEE, v. 77, Dec. 1989, pp. 1932-1953.
[29] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, Jr., S. D. Smith, "PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Computers, v. C-30, Dec. 1981, pp. 934-947.
[30] H. J. Siegel, T. Schwederski, J. T. Kuehn, N. J. Davis IV, "An overview of the PASM parallel processing system," in Computer Architecture, Gajski, Milutinovic, Siegel, Furht, eds., IEEE Computer Society Press, Wash., DC, 1987, pp. 387-407.
[31] D. F. Stanat, D. F. McAllister, Discrete Mathematics in Computer Science, Prentice-Hall, Englewood Cliffs, NJ, 1977.
[32] S. Thanawastien, V. P. Nelson, "Interference analysis of shuffle/exchange networks," IEEE Trans. Computers, v. C-30, Aug. 1981, pp. 545-556.
[33] L. W. Tucker, G. G. Robertson, "Architecture and applications of the Connection Machine," Computer, v. 21, Aug. 1988, pp. 26-38.
[34] C.-L. Wu, T. Y. Feng, "On a class of multistage interconnection networks," IEEE Trans. Computers, v. C-29, Aug. 1980, pp. 694-702.
