118 - semantic scholar · 2017-11-07 · literature are the illiac, pm2i, and shuffle-exchange....



Performing the Shuffle with the PM2I and Illiac SIMD Interconnection Networks

Robert R. Seban
Howard Jay Siegel

Purdue University
School of Electrical Engineering
West Lafayette, Indiana 47907

Abstract: Three SIMD single stage interconnection networks which have been proposed and studied in the literature are the Illiac, PM2I, and Shuffle-Exchange. Here the ability of the Illiac and PM2I networks to perform the shuffle interconnection in an SIMD machine with N processors is examined. A lower bound of $3\sqrt{N}/2$ transfers for the Illiac to shuffle data is derived. An algorithm to do this task in $2\sqrt{N}-1$ transfers is given. A lower bound of $\log_2 N$ transfers for the PM2I to shuffle data has been published previously. An algorithm to do this task in $(\log_2 N)+1$ transfers is presented here.

1. Introduction

This paper extends SIMD interconnection network studies presented in [28, 31]. In particular, the ability of the PM2I and Illiac single stage interconnection SIMD machine networks to perform the shuffle interconnection is examined. In [28] it is shown that a lower bound on the number of transfers needed for the PM2I network to perform the shuffle is $\log_2 N$, where N is the number of processing elements in the SIMD machine. The algorithm presented here requires only $(\log_2 N)+1$ transfers. This algorithm is used as the basis for an algorithm to do the shuffle with the Illiac network in $2\sqrt{N}-1$ transfers. This compares favorably with an earlier result of $4(\sqrt{N}-1)$ in [25]. In addition, a lower bound of $3\sqrt{N}/2$ on the number of transfers required for the Illiac to do the shuffle is proved.

The model of SIMD machines used is described in Section 2. In Section 3 the interconnection networks are formally defined. An algorithm to shuffle data using the PM2I network is given in Section 4. The lower bound analysis and algorithm for performing the shuffle with the Illiac network are presented in Section 5.

2. SIMD Machine Model

Typically, an SIMD (single instruction stream - multiple data stream) machine [12] is a computer system consisting of a control unit, N processors, N memory modules, and an interconnection network. The control unit broadcasts instructions to the processors, and all active processors execute the same instruction at the same time. Each active processor executes the instruction on data in its own memory module. The interconnection network, sometimes referred to as an alignment or permutation network, provides for communications among the processors and memory modules. Examples of SIMD machines that have been constructed are the Illiac IV [6] and STARAN [2, 3].

One way to view the physical structure of an SIMD machine is as a set of N processing elements interconnected by a network, where each processing element (PE) consists of a processor with its own memory. This type of configuration is shown in Fig. 1. It is called the PE-to-PE organization. The network is unidirectional and connects each PE to some subset of the other PEs. A transfer instruction causes data to be moved from each PE to one of the PEs to which the PE is connected by the network. (Here only one-to-one communications will be considered, i.e., broadcasting (one-to-many) connections are not considered.) To move data between two processing elements that are not directly connected, the data must be passed through intermediary processing elements by executing a programmed sequence of data transfers. An alternative to the PE-to-PE SIMD machine organization is to position a bidirectional network between the processors and the memories. The PE-to-PE paradigm will be used here; however, the results presented will be applicable to the other organization also.

Fig. 1: PE-to-PE SIMD machine configuration, with N PEs.

This material is based upon work supported by the National Science Foundation under Grant ECS-8120896.

The formal model of an SIMD machine used here consists of five parts: processing elements, control unit instructions, processing element instructions, masking schemes, and interconnection functions. It is a mathematical model that provides a common basis for evaluating and comparing the various components of different SIMD machines. This model is based on the one presented in [31].

Each processing element (PE) is a processor together with its own memory. There are N PEs, addressed (numbered) from 0 to N-1, where $N = 2^m$. It is assumed that the processor contains a fast access general purpose register A and a data transfer register (DTR). When data transfers among PEs occur, it is the DTR contents of each PE that are transferred. At any point in time, each PE is either in the active or the inactive mode. If a PE is active, it executes the instructions broadcast to it by the control unit. If a PE is inactive, it will not execute the instructions broadcast to it.

The control unit stores the SIMD programs, executes control of flow instructions, and broadcasts processing element instructions to the PEs. An example of a control of flow instruction is the loop statement "for i = 0 until N-1 do ..."

The processing element instructions consist of those operations that each processor can perform on data in its individual memory or registers. It is assumed the set of processing element instructions includes the capability to move data among the registers. The notation "Z ← Y" means the contents of register Y are copied into register Z. The notation "Z ↔ Y" means two registers exchange their contents.

A masking scheme is a method for determining which PEs will be active at a given point in time. The PE address masking scheme uses an m-position mask to specify which PEs are to be activated, each position of the mask corresponding to a bit position in the binary addresses of the PEs [28]. Each position of the mask will contain either a 0, 1, or X ("don't care"). The only PEs that will be active are those that match the mask for all i, $0 \le i < m$: if the mask has a 0 in the i-th position, then the PE address must have a 0 in the i-th position; if the mask has a 1 in the i-th position, then the PE address must have a 1 in the i-th position; and if the mask has an X in the i-th position, then the PE address may have either a 0 or 1 in the i-th position. For example, if N = 8 and the mask is 1X0, then only PEs 4 and 6 are active. Superscripts are used as repetition factors, e.g., $X^3 0 1^2$ is XXX011. Square brackets will be used to denote a mask. Each PE instruction and interconnection function (defined below) will be accompanied by a mask specifying which PEs will execute that command. For example, executing "A ← DTR $[X^{m-1}0]$" means that each even numbered PE is active and loads its A register from its DTR. Each odd numbered PE is inactive and does nothing. Further information about the use and implementation of PE address masks is in [18, 28, 31, 34].
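The mask-matching rule can be illustrated with a small sketch. The function names `expand_mask` and `active_pes` are hypothetical, and a mask is written as a string whose leftmost character corresponds to the high-order address bit, which is an assumption consistent with the 1X0 example above.

```python
def expand_mask(spec):
    """Expand repetition factors, e.g. [("X", 3), ("0", 1), ("1", 2)] -> "XXX011"."""
    return "".join(symbol * count for symbol, count in spec)


def active_pes(mask, m):
    """Return the addresses of the PEs whose m-bit address matches the mask.

    mask is a string over {"0", "1", "X"}; mask[0] is the high-order position.
    """
    assert len(mask) == m
    matching = []
    for pe in range(1 << m):
        bits = format(pe, f"0{m}b")  # m-bit binary address, high-order bit first
        if all(c == "X" or c == b for c, b in zip(mask, bits)):
            matching.append(pe)
    return matching


# The examples from the text, for N = 8 (m = 3).
pes_1x0 = active_pes("1X0", 3)    # PEs 4 and 6
even_pes = active_pes("XX0", 3)   # the even numbered PEs
```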

An interconnection network can be described by a set of interconnection functions, where each interconnection function is a bijection (permutation) on the set of PE addresses [28]. When an interconnection function f is applied, PE i sends the contents of its DTR to the DTR of PE f(i). This occurs for all i simultaneously, for $0 \le i < N$ and PE i active. Saying that an interconnection function is a bijection means that every PE sends data to exactly one PE, and every PE receives data from exactly one PE (assuming all PEs are active). In this model, it is assumed that an inactive PE can receive data from another PE if an interconnection function is executed, but an inactive PE cannot send data. To pass data from one PE to another PE a programmed sequence of one or more interconnection functions must be executed. This sequence of functions moves the data from one PE's DTR to the other's by a single transfer or by passing the data through intermediary PEs.
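A single masked transfer under this model might be sketched as follows. The helper `apply_function` is hypothetical, and the sketch assumes that a PE which receives nothing keeps its previous DTR contents, a detail the model above does not state explicitly.

```python
def apply_function(f, dtr, active):
    """One transfer: every active PE i sends its DTR contents to PE f(i).

    f      : bijection on PE addresses (the interconnection function)
    dtr    : list of DTR contents, indexed by PE address
    active : predicate telling whether a PE is enabled by the current mask
    """
    new_dtr = list(dtr)              # assumption: no incoming data -> DTR unchanged
    for i in range(len(dtr)):
        if active(i):                # inactive PEs may receive but never send
            new_dtr[f(i)] = dtr[i]
    return new_dtr


N = 8
plus_one = lambda i: (i + 1) % N
all_active = apply_function(plus_one, list(range(N)), lambda i: True)
even_only = apply_function(plus_one, list(range(N)), lambda i: i % 2 == 0)
```

Because all sends read the old DTR values and write into a fresh copy, the sketch models the simultaneous nature of the transfer.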

In summary, an SIMD machine can be formally represented as the five-tuple (N, C, I, M, F), where:
(1) N is a positive integer, representing the number of PEs in the machine;
(2) C is the set of control unit instructions, i.e., instructions that are executed by the control unit in order to control the flow of the program;
(3) I is the set of processing element instructions, i.e., instructions that can be executed by each active PE and act on data within that PE;
(4) M is the set of masking schemes, where each mask partitions the set {0, 1, ..., N-1} into two disjoint sets, the enabled PEs and the disabled PEs; and
(5) F is the set of interconnection functions (i.e., the interconnection network), where each function is a bijection on the set {0, 1, ..., N-1}, which determines the communication links among the PEs.

A particular SIMD machine architecture can be described by specifying N, C, I, M, and F. In this paper, $N = 2^m$; C includes "for ... until ... do" instructions for controlling the flow of loops in the program; I includes instructions for moving data among the registers of a given PE; M includes PE address masks; and F is varied. The assumptions made about the SIMD machine to be used as the model are intentionally minimal so that the material presented is applicable to a wide range of machines.

3. The Interconnection Networks

A. Introduction

In this paper, three networks which can be constructed from a single stage of switches are examined. In a single stage network, data items may have to be passed through the switches several times before reaching their final destinations. Conceptually, a single stage network can be viewed as N input selectors and N output selectors, as shown in Fig. 2 [30]. The way in which the input selectors are connected to the output selectors determines the allowable interconnections.

Fig. 2: Conceptual view of a single-stage network. "IS" is input selector, "OS" is output selector.

The following notation will be used: let $N = 2^m$, let the binary representation of an arbitrary PE address P be $p_{m-1}p_{m-2} \ldots p_1p_0$, let $\bar{p}_i$ be the complement of $p_i$, and let $n = \sqrt{N}$ (N is taken to be a perfect square).

B. The Illiac Network

The Illiac network consists of four interconnection functions defined as follows:

$\mathrm{Illiac}_{+1}(P) = P + 1 \bmod N$
$\mathrm{Illiac}_{-1}(P) = P - 1 \bmod N$
$\mathrm{Illiac}_{+n}(P) = P + n \bmod N$
$\mathrm{Illiac}_{-n}(P) = P - n \bmod N$

These functions form a four nearest neighbor connection pattern, as shown for N = 16 in Fig. 3. This network was implemented in the Illiac IV SIMD machine, where N = 64 [1, 6].

Fig. 3: Illiac network for N = 16. (The actual Illiac IV SIMD machine had N = 64.) Vertical lines are $+\sqrt{N}$ and $-\sqrt{N}$. Horizontal lines are +1 and -1.

Relating this to the conceptual model of a single stage network shown in Fig. 2, for each i, $0 \le i < N$, input selector i has lines to output selectors i+1, i-1, i+n, and i-n, mod N. For each j, $0 \le j < N$, output selector j gets its inputs from input selectors j-1, j+1, j-n, and j+n, mod N. Since there is a single instruction stream in an SIMD machine, all active PEs must use the same interconnection function (connection) at the same time. For example, if PE 0 is sending data to PE 1, then all active PEs must send data using the $\mathrm{Illiac}_{+1}$ connection.

This type of network is included in the MPP [4, 5] and DAP [16] SIMD systems. Various properties and capabilities of the Illiac network are discussed in [6, 13, 25, 28, 31, 32].

C. The Plus-Minus $2^i$ (PM2I) Network

The Plus-Minus $2^i$ (PM2I) network consists of 2m interconnection functions defined by:

$\mathrm{PM2}_{+i}(P) = P + 2^i \bmod N$
$\mathrm{PM2}_{-i}(P) = P - 2^i \bmod N$

for $0 \le i < m$. For example, $\mathrm{PM2}_{+1}(2) = 4$ if N > 4. Since $P + 2^{m-1} = P - 2^{m-1}$, mod N, for all P, $0 \le P < N$, the interconnection functions $\mathrm{PM2}_{+(m-1)}$ and $\mathrm{PM2}_{-(m-1)}$ are equivalent. Fig. 4 shows the $\mathrm{PM2}_{+i}$ interconnections for N = 8. Diagrammatically, $\mathrm{PM2}_{-i}$ is the same as $\mathrm{PM2}_{+i}$ except the direction is reversed. This network is called the Plus-Minus $2^i$ since, in terms of mapping source addresses to destinations, it can add or subtract $2^i$ from the PE addresses, i.e., it allows PE P to send data to any one of PE $P + 2^i$ or PE $P - 2^i$, arithmetic mod N, $0 \le i < m$.

In terms of the conceptual model of a single stage network (Fig. 2), for the PM2I network, for each j, $0 \le j < N$, input selector j is connected to output selectors $j + 2^i$ and $j - 2^i$ mod N, for all i, $0 \le i < m$. For each j, $0 \le j < N$, output selector j gets its inputs from input selectors $j - 2^i$ and $j + 2^i$ mod N, for all i, $0 \le i < m$. As with the Illiac network, all active PEs must use the same PM2I interconnection function at the same time.

A network similar to the PM2I is used in the "Novel Multiprocessor Array" [24] and is included in the network of the Omen computer [15]. The concept underlying the SIMDA machine's interconnection network is similar to that of the PM2I [36]. The PM2I connection pattern forms the basis for the data manipulator [10], ADM [33], and gamma [26] multistage networks. Various properties of the PM2I are discussed in [11, 27, 28, 29, 31, 32].

Fig. 4: PM2I network for N = 8. (a) $\mathrm{PM2}_{+0}$ connections, (b) $\mathrm{PM2}_{+1}$ connections, (c) $\mathrm{PM2}_{+2}$ connections. For the $\mathrm{PM2}_{-i}$ connections, $0 \le i \le 2$, reverse the direction of the arrows.

D. The Shuffle-Exchange Network

The Shuffle-Exchange network consists of a shuffle function and an exchange function. The shuffle is defined by:

$\mathrm{shuffle}(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-2}p_{m-3} \ldots p_1p_0p_{m-1}$

and the exchange is defined by:

$\mathrm{exchange}(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-1}p_{m-2} \ldots p_1\bar{p}_0.$

For example, shuffle(3) = 6 and exchange(6) = 7, for $N \ge 8$. This network is shown in Fig. 5 for N = 8.

Fig. 5: Shuffle-Exchange network for N = 8. Solid line is exchange, dashed line is shuffle.

Consider the conceptual model of single stage networks shown in Fig. 2. For the Shuffle-Exchange single stage network, input selector $P = p_{m-1} \ldots p_1p_0$ is connected to output selectors $p_{m-2} \ldots p_1p_0p_{m-1}$ (= shuffle(P)) and $p_{m-1} \ldots p_1\bar{p}_0$ (= exchange(P)). Output selector $r_{m-1} \ldots r_1r_0$ gets its inputs from input selectors $r_0r_{m-1} \ldots r_2r_1$ and $r_{m-1} \ldots r_1\bar{r}_0$. As with the other networks, all active PEs must use the same interconnection function at the same time.

Mathematical properties of the shuffle are discussed in [14, 17]. The multistage omega network is a series of m Shuffle-Exchanges [21]. The shuffle is also included in the networks of the Omen [15] and RAP [9] systems. Features of the Shuffle-Exchange are discussed in [7, 8, 11, 13, 19, 20, 22, 23, 27, 28, 31, 32, 35, 37]. (The ability of each of the PM2I and Illiac networks to perform the exchange function in just two transfers was presented in [31] and is not considered here.)
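The interconnection functions of the three networks can be written down directly as address arithmetic. The following is a minimal sketch; the function names and argument order are illustrative choices, not notation from the paper.

```python
from math import isqrt


def illiac(delta, p, N):
    """Illiac function selected by delta, one of +1, -1, +n, -n with n = sqrt(N)."""
    n = isqrt(N)
    assert n * n == N and delta in (1, -1, n, -n)
    return (p + delta) % N


def pm2(sign, i, p, N):
    """PM2+i (sign = +1) or PM2-i (sign = -1): add or subtract 2**i, mod N."""
    assert sign in (1, -1)
    return (p + sign * (1 << i)) % N


def shuffle(p, N):
    """Rotate the m-bit address of p left by one position (N = 2**m)."""
    m = N.bit_length() - 1
    return ((p << 1) | (p >> (m - 1))) & (N - 1)


def exchange(p, N):
    """Complement the low-order bit of the address."""
    return p ^ 1


# Examples quoted in the text.
s = shuffle(3, 8)         # 6
e = exchange(6, 8)        # 7
q = pm2(+1, 1, 2, 8)      # 4
w = illiac(+8, 56, 64)    # wrap-around connection from PE 56 to PE 0
```

For i = m-1 the two PM2 directions coincide, e.g. `pm2(+1, 2, 5, 8) == pm2(-1, 2, 5, 8)`.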

4. Shuffling with the PM2I Network

The following ground rules will be used in the design and analysis of the algorithm to perform the shuffle with the PM2I network.
(1) The model and definitions presented in Sections 2 and 3 will be the formal basis for the results.
(2) When simulating the shuffle, the data that is originally in the DTR of PE P must be transferred to the DTR of PE shuffle(P).
(3) The time for each algorithm is in terms of the number of executions of interconnection functions required to perform the simulation.

The reason for (3) can be seen by considering the way in which various instructions can be implemented. The instructions in the simulation algorithms can be divided into three categories: control unit operations (in C), register to register operations (in I), and interprocessor data transfers (in F). Control unit operations, such as incrementing a count register in the control unit for a "for loop," can, in general, be done in parallel (overlapped) with the previously broadcast PE instruction, thus taking no additional time. Register to register operations within a PE will probably involve a single chip or, at worst, adjacent chips. The inter-PE data transfers will involve setting the controls of the interconnection network and passing data among the PEs, involving board to board, and probably rack to rack, distances. Thus, unless the number of register to register operations is much greater than the number of inter-PE data transfers, the time for the interprocessor transfers will be the dominating factor in determining the execution time of the simulation algorithm.

In the algorithm below ":" indicates a comment. When discussing the algorithms, "Li" is used as an abbreviation for "line i of the algorithm."

To understand the concept underlying the algorithm to perform the shuffle, consider the "distance" the shuffle moves a data item. The data item in the DTR of PE P, ...

Fig. 6: The idea underlying the algorithm for the PM2I to perform the shuffle, shown for N = 8.

Algorithm to perform the shuffle

This algorithm uses m + 1 inter-PE data transfers and m + 1 register to register moves. The operation of this algorithm for N = 8 is shown in Tab. 1. For example, consider the data item initially in the DTR of PE 5 (= 101). PE 5 does not match the mask in L1 ([XX0]). PE 5 does match the mask in L2 ([XX1]) and the data is moved to PE $\mathrm{PM2}_{+0}(5) = 6$ (= 110). PE 6 does match the mask in L4 when j = 1 ([X10]) and the data is moved to the A register of PE 6. The data is unaffected by L5 when j = 1 (since it is not in the DTR). PE 6 does match the mask in L4 when j = 2 ([1X0]) and the data is moved to the DTR of PE 6. PE 6 does match the mask in L5 when j = 2 ([XX0]) and the data is moved to the DTR of PE $\mathrm{PM2}_{+2}(6) = 2$. PE 2 does match the mask in L6 ([XX0]) and the data is moved to the DTR of PE $\mathrm{PM2}_{+0}(2) = 3$. PE 3 does not match the mask in L7 ([XX0]). Thus, the data from PE 5 is moved to PE 3 = shuffle(5). This is shown by the dotted line in Tab. 1.

To prove the algorithm is correct, induction will be used (assume all arithmetic is mod N). The induction hypothesis (proven correct below) is that after executing $\mathrm{PM2}_{+j}$ in L2 (for j = 0) or L5 (for $1 \le j < m$) the data originally in the DTR of PE $G = g_{m-1} \ldots g_1g_0$ will currently be in PE $P = p_{m-1} \ldots p_1p_0 = (g_{m-1} \ldots g_{j+2}g_{j+1}) * 2^{j+1} + (g_j \ldots g_1g_0) * 2$. (When j = 0, $P = (g_{m-1} \ldots g_2g_1) * 2 + (g_0) * 2$.) The data will be in the A register if $g_j = 0$ and in the DTR if $g_j = 1$.

Thus, when j = m-1, the data originally from PE G is in PE $(g_{m-1} \ldots g_1g_0) * 2$. The data item from the DTR of PE $(g_{m-1} \ldots g_1g_0) * 2$ is moved to PE $(g_{m-1} \ldots g_1g_0) * 2 + 1$ by L6, which is correct since this data item is from a PE where $g_j = g_{m-1} = 1$, so shuffle(G) = 2*G + 1. The data item from the A register of PE $(g_{m-1} \ldots g_1g_0) * 2$ is moved to the DTR of that PE by L7; this is correct since this data item is from a PE where $g_j = g_{m-1} = 0$, so shuffle(G) = 2*G.

To complete the correctness proof it must be shown that the induction hypothesis is true.

Basis: j = 0.

Tab. 1: Example of the algorithm for performing the shuffle using the PM2I when N = 8. It is assumed that initially the DTR of PE P contains the integer P, ...
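The N = 8 trace can be replayed with a short simulation. The sketch below is a reconstruction from the prose description of lines L1 to L7 (a copy into A on the even PEs, a $\mathrm{PM2}_{+0}$ transfer from the odd PEs, m-1 rounds of a masked A/DTR swap followed by a $\mathrm{PM2}_{+j}$ transfer, a final $\mathrm{PM2}_{+0}$ transfer, and a final copy from A). The general-m mask for L4 (address bit j set, bit 0 clear) is an assumption extrapolated from the N = 8 masks [X10] and [1X0], and all names are hypothetical.

```python
def shuffle(p, m):
    """Rotate the m-bit address of p left by one position."""
    return ((p << 1) | (p >> (m - 1))) & ((1 << m) - 1)


def pm2i_shuffle(m):
    """Simulate the seven-line PM2I shuffle sketch on N = 2**m PEs.

    Returns (final DTR contents, number of inter-PE transfers).
    Initially the DTR of PE P holds the integer P.
    """
    N = 1 << m
    dtr = list(range(N))
    a = [None] * N
    transfers = 0

    def even(p):
        return p % 2 == 0

    def pm2_plus(i, active):
        """Masked PM2+i transfer; PEs that receive nothing keep their DTR."""
        nonlocal dtr, transfers
        new_dtr = list(dtr)
        for p in range(N):
            if active(p):
                new_dtr[(p + (1 << i)) % N] = dtr[p]
        dtr = new_dtr
        transfers += 1

    for p in range(N):                       # L1: A <- DTR on even PEs
        if even(p):
            a[p] = dtr[p]
    pm2_plus(0, lambda p: not even(p))       # L2: PM2+0 on odd PEs
    for j in range(1, m):                    # L3: for j = 1 until m-1 do
        for p in range(N):                   # L4: A <-> DTR where bit j = 1, bit 0 = 0
            if even(p) and (p >> j) & 1:
                a[p], dtr[p] = dtr[p], a[p]
        pm2_plus(j, even)                    # L5: PM2+j on even PEs
    pm2_plus(0, even)                        # L6: PM2+0 on even PEs
    for p in range(N):                       # L7: DTR <- A on even PEs
        if even(p):
            dtr[p] = a[p]
    return dtr, transfers


final_dtr, count = pm2i_shuffle(3)
```

For m = 3 the sketch uses 4 transfers and leaves the integer G in the DTR of PE shuffle(G), i.e. `final_dtr == [0, 4, 1, 5, 2, 6, 3, 7]`, so the item that started in PE 5 ends in PE 3 as in the walkthrough.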

Case 1: The data item from the DTR of PE ... not moved by L2. It remains in the A register and $g_0 = 0$. Thus, the induction hypothesis is ...

Case 2: The data item from the DTR of PE ... the DTR and $g_0 = 1$. Thus, the induction hypothesis is true for j = 0 for this case.

Induction Step: Assume true for j = k-1 and show true for j = k.

Case 1: The data item from the DTR of PE ...

Subcase 1a: $p_k = 1$. The A register data is moved to the DTR of PE P by L4 and then to the DTR of ... Furthermore, the data is in the DTR and $g_k = 1$. Thus, the induction hypothesis is true for j = k for this subcase.

Subcase 1b: $p_k = 0$. The A register data is kept in the A register of PE P and not moved by L4 or ...

Case 2: The data item from the DTR of PE ...

5. Shuffling with the Illiac Network

In this section the use of the Illiac network to perform the shuffle will be examined. First, it will be shown that a lower bound on the number of transfers (executions of Illiac interconnection functions) needed is 3n/2. Then, an algorithm requiring 2n-1 transfers will be presented.

To show that a lower bound on the number of transfers is 3n/2, four of the N data moves which the shuffle performs will be considered. These are:

... moves are done simultaneously when the shuffle interconnection function is executed. It will now be shown that the Illiac cannot do all four in less than 3n/2 transfers, i.e., at least 3n/2 transfers are needed. To ...


In order to more easily visualize the data movements in the Illiac network the "wrap-around" connections (e.g., 7 to 8, 56 to 0) have been "unwrapped" by drawing eight projections of the network, as shown in Fig. 7. The actual network is labeled "C" for center, and the eight projections are labeled NW (north west), N (north), NE (north east), W (west), E (east), SW (south west), S (south), and SE (south east). Thus, each PE is represented nine times: once in the original (center) network, and once in each projection.

For example, consider the data movement from PE 7 to PE 8 using the $\mathrm{Illiac}_{+1}$ function. Normally, PE 7, which is in the rightmost column of the Illiac network, connects to PE 8, which is in the leftmost column, using a "wrap-around" connection. For purposes of this discussion, the data from PE 7 in C will be moved to C's PE 8 equivalent in the E projection.

In order to draw the projections, two constraints must be satisfied.
(1) Each projection has to be topologically isomorphic to the Illiac network.
(2) Each projection must have the proper adjacency to the C network and the other projections.
Proper adjacency means that two PEs, each from different projections, are drawn adjacent to one another if and only if they are connected in the original network. As an example of this, consider 7 in C, 63 in N, 0 in NE, and 8 in E.

One could continue generating more of these projections "ad infinitum" to represent all possible implementations of all possible moves. However, the goal here is to show that the set of moves (a) through (d) above cannot be done in less than 3n/2 steps. Therefore, projections which would involve more than 3n/2 steps to do any of (a) through (d) individually are not of interest and are unnecessary.

The lower bound proof is organized as follows. First it will be shown that there are only five sets of Illiac function executions that can perform both the 28 → 56 and 35 → 7 moves in less than 3n/2 steps (Fig. 7 and Tab. 2). Then it will be shown that there are only five sets of Illiac function executions (which happen to be different from the first five sets) that can perform both the 14 → 28 and 49 → 35 moves in less than 3n/2 steps (Fig. 8 and Tab. 3). Finally, it will be shown that no single set of less than 3n/2 Illiac function executions can perform all four moves (Tab. 4).

Tab. 2: All possible combinations of 28 → 56 and 35 → 7 paths that can be done individually in less than 3n/2 steps.

Tab. 3: All possible combinations of 14 → 28 and 49 → 35 paths that can be done individually in less than 3n/2 steps.

Fig. 7 shows all the paths from the source PE 28 in the C network to its associated destination PE 56 in the C network and in the eight projections. Also shown is the source PE 35 in the C network and its associated destination PE 7 in the C network and in the eight projections. There are only four ways to go from 28 to 56 in less than 3n/2 = 12 moves and these are shown at the top of Tab. 2. The four ways to go from 35 to 7 in less than 12 moves are shown on the side of Tab. 2. The four-tuple (w, x, y, z) means that the path consists of w $\mathrm{Illiac}_{+8}$ executions (moves), x $\mathrm{Illiac}_{+1}$ executions, y $\mathrm{Illiac}_{-8}$ executions, and z $\mathrm{Illiac}_{-1}$ executions. Note that for the purposes here the order of execution is irrelevant. For example, 28 in C can go to 56 in the NE projection by (0, 4, 5, 0), i.e., the path consists of four $\mathrm{Illiac}_{+1}$ moves and five $\mathrm{Illiac}_{-8}$ moves. Any path between 28 in C and 56 in NE must include these moves. This is true in general, i.e., if the path from PE A to PE B is given as (w, x, y, z) then (1) the moves specified by the four-tuple will send data from A to B, and (2) any path from A to B must include the moves specified by the four-tuple. In what follows {·} will denote the generalization of the path from n = 8 to any n.
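Since the order of execution is irrelevant, the destination reached by a four-tuple depends only on its net displacement. A minimal sketch (the helper name `apply_tuple` is hypothetical) checks the example above for n = 8:

```python
def apply_tuple(src, path, n):
    """PE reached from src by a four-tuple (w, x, y, z) on an n-by-n Illiac network.

    w, x, y, z count the Illiac+n, Illiac+1, Illiac-n and Illiac-1 executions.
    """
    w, x, y, z = path
    return (src + n * (w - y) + (x - z)) % (n * n)


n = 8
path = (0, 4, 5, 0)                 # four Illiac+1 moves and five Illiac-8 moves
dest = apply_tuple(28, path, n)     # PE 56, reached in the NE projection
moves = sum(path)                   # 9, below the 3n/2 = 12 bound
```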

Each square in Tab. 2 shows the set of moves needed to do both the 28 → 56 and 35 → 7 moves for all possible combinations of the individual moves which can be done in less than 12 {3n/2} steps. The five combinations which can be done in less than 12 {3n/2} steps are marked by a check (✓). For example, the 28 → 56 path ...

The analysis for the 14 → 28 and 49 → 35 transfers shown in Fig. 8 and Tab. 3 is similar. The five sets of Illiac functions which can do both of these transfers in less than 12 moves are checked in Tab. 3.

The final step of the proof is to examine all combinations of the five sets found in each of Tabs. 2 and 3 to see if there exists any set of transfers which can perform all four transfers (28 → 56, 35 → 7, 14 → 28, and 49 → 35) in less than 12 moves. This is shown in Tab. 4.

Fig. 7: The source/destination relationship for the moves 28 → 56 and 35 → 7 in an "unwrapped" Illiac network. The circle denotes a destination which can be reached in less than 3n/2 steps.

Fig. 8: The source/destination relationship for the moves 14 → 28 and 49 → 35 in an "unwrapped" Illiac network. The circle denotes a destination which can be reached in less than 3n/2 steps.

Acknowledgements: Some of the figures and tables in this paper are from "Interconnection Networks for Large Scale Parallel Processing: Theory and Case Studies," by H. J. Siegel, to be published by D. C. Heath and Co.

References

[1] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes, "The Illiac IV computer," IEEE Trans. Comput., Vol. C-17, Aug. 1968, pp. 746-757.

[2] K. E. Batcher, "STARAN parallel processor system hardware," AFIPS Conf. Proc. 1974 Nat'l. Computer Conf., May 1974, pp. 405-410.

[3] K. E. Batcher, "STARAN series E," 1977 Intl. Conf. Parallel Processing, Aug. 1977, pp. 140-143.

[4] K. E. Batcher, "Design of a massively parallel processor," IEEE Trans. Comput., Vol. C-29, Sept. 1980, pp. 836-840.

[5] K. E. Batcher, "Bit-serial parallel processing systems," IEEE Trans. Comput., Vol. C-31, Mar. 1982, pp. 377-384.

[6] W. J. Bouknight, S. A. Denneberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick, "The Illiac IV system," Proc. of the IEEE, Vol. 60, Apr. 1972, pp. 369-388.

[7] P-Y. Chen, D. H. Lawrie, P-C. Yew, and D. A. Padua, "Interconnection networks using shuffles," Computer, Vol. 14, Dec. 1981, pp. 55-64.

[8] P-Y. Chen, P-C. Yew, and D. H. Lawrie, "Performance of packet switching in buffered single-stage shuffle-exchange networks," 3rd Intl. Conf. Distributed Computer Systems, Oct. 1982, pp. 622-627.

[9] G. R. Couranz, M. S. Gerhardt, and C. J. Young, "Programmable RADAR signal processing using the RAP," 1974 Sagamore Computer Conf. Parallel Processing, Aug. 1974, pp. 37-52.

[10] T. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Comput., Vol. C-23, Mar. 1974, pp. 309-318.

6. Conclusions

The ability of the PM2I and Illiac single stage interconnection SIMD machine networks to perform the shuffle interconnection was examined. In [28] it was shown that a lower bound on the number of transfers needed for the PM2I network to perform the shuffle is log2 N. The algorithm described here and proven correct requires only (log2 N) + 1 transfers. This algorithm was used as a basis for an algorithm to do the shuffle with the

These results are of both theoretical and practical value. Theoretically, they add to the body of knowledge about the properties of the PM2I and Illiac networks. Practically, the algorithms presented could actually be used to perform the shuffle interconnection on a system that has implemented the PM2I or Illiac network. Furthermore, the lower bound proof shows that it is impossible to do the shuffle with the Illiac in any fewer

Tab. 4: Combination of relevant paths from Tabs. 2 and 3.

As demonstrated in Tab. 4, there is no such set. There are seven sets which require exactly 12 moves (indicated by checks), but none which requires fewer than 12. For example, 28 -> 56 and 35 -> 7 can be done using (3, 4, 4, 0), and 14 -> 28 and 49 -> 35 can be done using (2, 2, 2, 2); however, the combination of these two sets yields (3, 4, 4, 2), which totals 13 moves, more than 12.
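The combination in the example above can be reproduced with a short sketch. It assumes that two sets of function counts are combined by taking the componentwise maximum, which is consistent with the worked example but is not stated as a rule in this excerpt. The function name `combine` and the variable names are hypothetical, and the meaning of each of the four tuple positions is not given here, so the code treats them only as positions.

```python
def combine(a, b):
    """Componentwise maximum of two tuples of Illiac function counts.

    Assumption: this is how two sets of transfers are merged; it matches
    the example (3, 4, 4, 0) with (2, 2, 2, 2) giving (3, 4, 4, 2).
    """
    return tuple(max(x, y) for x, y in zip(a, b))


# Counts quoted in the text for each pair of transfers.
set_for_28_and_35 = (3, 4, 4, 0)   # does 28 -> 56 and 35 -> 7
set_for_14_and_49 = (2, 2, 2, 2)   # does 14 -> 28 and 49 -> 35

combined = combine(set_for_28_and_35, set_for_14_and_49)
total_moves = sum(combined)

print(combined)      # (3, 4, 4, 2)
print(total_moves)   # 13
```

Each set alone totals fewer than 12 moves (11 and 8), while the combined set totals 13, which is why this pairing does not beat the bound.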

In summary, four of the moves performed by the shuffle (28 -> 56, 35 -> 7, 14 -> 28, and 49 -> 35) have been examined. It has been shown that no set of Illiac function executions can do this in less than 3n/2 = 12 moves. As indicated above, this argument can be generalized directly using the substitutions listed.
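The four moves named above can be checked with a minimal sketch under stated assumptions: N = 64 PEs (so n = 8 and 3n/2 = 12, matching the figures quoted), the shuffle taken as a one-bit left rotation of the m = log2 N bit PE address, and the Illiac functions taken as +1, -1, +n, and -n modulo N. The names `shuffle` and `min_illiac_moves` are hypothetical. The search considers one datum in isolation on the wrapped network, so it does not model the requirement that a single set of function executions perform all four moves.

```python
from collections import deque

N = 64   # number of PEs (assumed; gives n = 8 and 3n/2 = 12)
n = 8    # n = sqrt(N)
m = 6    # m = log2(N) address bits


def shuffle(p):
    """Rotate the m-bit address p left by one bit position."""
    return ((p << 1) | (p >> (m - 1))) & (N - 1)


def min_illiac_moves(src, dst):
    """Breadth-first search for the fewest +1, -1, +n, -n steps (mod N)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        p = queue.popleft()
        if p == dst:
            return dist[p]
        for step in (1, -1, n, -n):
            q = (p + step) % N
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    raise ValueError("destination unreachable")


sources = [28, 35, 14, 49]
destinations = [shuffle(p) for p in sources]
single_costs = [min_illiac_moves(s, d) for s, d in zip(sources, destinations)]
lower_bound = 3 * n // 2

print(destinations)   # [56, 7, 28, 35]
print(single_costs)   # [7, 7, 4, 4]
print(lower_bound)    # 12
```

Under these assumptions each transfer taken alone needs fewer than 12 moves; the bound in the text concerns doing all four with one common set of function executions.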

Consider an algorithm for performing the shuffle interconnection function with the Illiac network. This will be done by replacing each PM2I interconnection function in the above algorithm with Illiac interconnection functions. For L2, use "Illiac+1 [X^{m-1}1]," since Illiac+1 = PM2+0. Similarly, for L6, use "Illiac+1 [X^{m-1}0]." To do L5, first recall that only the even numbered PEs contain the data of concern (after L2 is executed and before L6 is executed). Therefore, it is acceptable to use "PM2+j [X^m]" in L5, since any data movement among the odd numbered PEs is ignored (and overwritten by L6). To perform "PM2+j [X^m]," for 1 < j < m, with the Illiac network the algorithms presented in [31] can be used. Specifically, to perform


[11] J. P. Fishburn and R. A. Finkel, "Quotient networks," IEEE Trans. Comput., Vol. C-31, Apr. 1982, pp. 288-295.

[12] M. J. Flynn, "Very high-speed computing systems," Proc. of the IEEE, Vol. 54, Dec. 1966, pp. 1901-1909.

[13] W. M. Gentleman, "Some complexity results for matrix computations on parallel processors," Journal of the ACM, Vol. 25, Jan. 1978, pp. 112-115.

[14] S. W. Golomb, "Permutations by cutting and shuffling," SIAM Review, Vol. 3, Oct. 1961, pp. 293-297.

[15] L. C. Higbie, "The Omen computer: associative array processor," IEEE Computer Society Compcon 72, Sept. 1972, pp. 287-290.

[16] D. J. Hunt, "The ICL DAP and its application to image processing," in Languages and Architectures for Image Processing, M. J. B. Duff and S. Levialdi, eds., Academic Press, London, England, 1981, pp. 275-282.

[17] P. B. Johnson, "Congruences and card shuffling," American Mathematical Monthly, Vol. 63, Dec. 1956, pp. 718-719.

[18] J. T. Kuehn, H. J. Siegel, and P. D. Hallenbeck, "Design and simulation of an MC68000-based multimicroprocessor system," 1982 Int'l. Conf. Parallel Processing, Aug. 1982, pp. 353-362.

[19] T. Lang, "Interconnections between processors and memory modules using the shuffle-exchange network," IEEE Trans. Comput., Vol. C-25, May 1976, pp. 496-503.

[20] T. Lang and H. S. Stone, "A shuffle-exchange network with simplified control," IEEE Trans. Comput., Vol. C-25, Jan. 1976, pp. 55-66.

[21] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., Vol. C-24, Dec. 1975, pp. 1145-1155.

[22] D. Nassimi and S. Sahni, "Data broadcasting in SIMD computers," IEEE Trans. Comput., Vol. C-30, Feb. 1981, pp. 101-107.

[23] D. Nassimi and S. Sahni, "Parallel permutation and sorting algorithms and a new generalized connection network," Journal of the ACM, Vol. 29, July 1982, pp. 642-667.

[24] Y. Okada, H. Tajima, and R. Mori, "A reconfigurable parallel processor with microprogram control," IEEE Micro, Vol. 2, Nov. 1982, pp. 48-60.

[25] S. E. Orcutt, "Implementation of permutation functions in Illiac IV-type computers," IEEE Trans. Comput., Vol. C-25, Sept. 1976, pp. 929-936.

[26] D. S. Parker and C. S. Raghavendra, "The gamma network: a multiprocessor interconnection network with redundant paths," 9th Annual Symp. Computer Architecture, Apr. 1982, pp. 73-80.

[27] D. K. Pradhan and K. L. Kodandapani, "A uniform representation of single- and multistage interconnection networks used in SIMD machines," IEEE Trans. Comput., Vol. C-29, Sept. 1980, pp. 777-791.

[28] H. J. Siegel, "Analysis techniques for SIMD machine interconnection networks and the effects of processor address masks," IEEE Trans. Comput., Vol. C-26, Feb. 1977, pp. 153-161.

[29] H. J. Siegel, "Partitionable SIMD computer system interconnection network universality," 16th Annual Allerton Conf. Communication, Control, and Computing, Univ. Ill., Oct. 1978, pp. 586-595.

[30] H. J. Siegel, "Interconnection networks for SIMD machines," Computer, Vol. 12, June 1979, pp. 57-65.

[31] H. J. Siegel, "A model of SIMD machines and a comparison of various interconnection networks," IEEE Trans. Comput., Vol. C-28, Dec. 1979, pp. 907-917.

[32] H. J. Siegel, "The theory underlying the partitioning of permutation networks," IEEE Trans. Comput., Vol. C-29, Sept. 1980, pp. 791-801.

[33] H. J. Siegel and R. J. McMillen, "Using the Augmented Data Manipulator network in PASM," Computer, Vol. 14, Feb. 1981, pp. 25-33.

[34] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, Jr., and S. D. Smith, "PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Comput., Vol. C-30, Dec. 1981, pp. 934-947.

[35] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., Vol. C-20, Feb. 1971, pp. 153-161.

[36] A. H. Wester, "Special features in SIMDA," 1972 Sagamore Computer Conf., Aug. 1972, pp. 29-40.

[37] C. Wu and T. Feng, "The universality of the shuffle-exchange network," IEEE Trans. Comput., Vol. C-30, May 1981, pp. 324-332.