
Page 1

NHPCC (Hefei) • USTC • China • [email protected]

Parallel Computation

Architecture, Algorithm and Programming

Professor Guoliang Chen
Dept. of Comp. Sci. & Tech.

Univ. of Sci. & Tech. of China

Textbook Series for 21st Century

Page 2

Browse the homepages of HPCC in the USA
A crossbar chip design for SP2
Multistage crossbar network in the Cray Y-MP
The structure of the fat tree (the four-way fat tree implemented in CM-5)

Part I: Hardware basics for parallel computing

Chapter 1: Architecture models of parallel computer

Page 3

Buffered wormhole routing is used
  No conflict: 8 packet cells (called flits) pass through the 8x8 switch in every 40 MHz cycle
  Hot-spot conflict: only one crosspoint is enabled at a time
Central queue
  Dual-port RAM, 1 read and 1 write per clock cycle
  Deserializes eight flits from a FIFO into a chunk, then writes the chunk to the central queue in one cycle

Part I: Hardware basics for parallel computing

A crossbar chip design for SP2

Page 4

4x4 and 8x8 crossbar switches and 1x8 de-multiplexors in three stages
Supporting data streaming between 8 vector processors and 256 memory banks

Multistage crossbar network in Cray Y-MP

Part I: Hardware basics for parallel computing

Page 5

Alleviates the bottleneck toward the root of a tree structure (Leiserson 1985)
The number of links increases when traversing from a leaf node toward the root
Each node has 4 child nodes and 2 or 4 parent nodes (shown in the left figure)

Part I: Hardware basics for parallel computing

The structure of multiway fat tree

Figure 1.34 Four-way fat tree implemented in CM-5

Page 6

In the two-way fat tree shown in Figure 1.33, each node has 2 parent nodes. If each oval in the figure is regarded as a single node and the multi-edge between a pair of nodes as a single edge, then the fat tree can be regarded as a binary tree. The question is: if we regard each square (rather than each oval) in the figure as a node, what multistage interconnection network do we obtain when looking from the leaves toward the root?

In the butterfly network shown in Fig. 1.37, N = (k+1)·2^k. What is the node degree? The network diameter? And the bisection width?

Exercises (1)

Part I: Hardware basics for parallel computing

Figure 1.33 Two-way fat tree

Figure 1.37 Butterfly network (k = 3)

Page 7

Exercises (2)

The Cedar multiprocessor at Illinois is built with a clustered Omega network as shown in the following figure. Four 8x4 crossbar switches are used in stage 1 and four 4x8 crossbar switches are used in stage 2. There are 32 processors and 32 memory modules, divided into four clusters with eight processors per cluster.
  Figure out a fixed-priority scheme to avoid conflicts in using the crossbar switches for nonblocking connections. For simplicity, consider only the forward connections from the processors to the memory modules.
  Suppose both stages use 8x8 crossbar switches. Design a two-stage Cedar network to provide switched connections between 64 processors and 64 memory modules.
  Further expand the Cedar network to three stages, using 8x8 crossbar switches as building blocks, to connect 512 processors and 512 memory modules. Show all connections between adjacent stages from the input end to the output end.

Part I: Hardware basics for parallel computing

Page 8

Chapter 2 SMP, MPP and COW

The architecture of fat hypercube in Origin-2000
128-way high-performance switches in SP2

Part I: Hardware basics for parallel computing

Page 9

The architecture of fat hypercube in Origin-2000

Part I: Hardware basics for parallel computing

Each SPIDER router (R) has six ports: two for node connections and four for network connections
Modified from the traditional binary hypercube
Advantages: the linear bisection bandwidth of the hypercube while avoiding an increasing node degree
Each node contains two processors, and two nodes are connected to one router
Beyond 32 nodes, Origin-2000 employs a fat hypercube topology using extra routers (metarouters)

Figure 2.6 Fat hypercube topology in Origin 2000: (a) 2 nodes (b) 4 nodes (c) 8 nodes (d) 16 nodes (e) 32 nodes (f) 64 nodes (g) 128 nodes; each node (N, two processors plus an XBOW crossbar) attaches to a router (R), and the largest configurations add a Cray Router as metarouter

Page 10

128-way high-performance switches in SP2

Each link is 8-bit wide and bidirectional
Each frame consists of 16 processing nodes connected by a 16-way switchboard (each switchboard contains two stages of switch chips)
Eight frames are interconnected by an extra stage of switchboards, so this MIN has four switch stages
Packet switching, buffered wormhole routing, 40 MHz clock
The hardware latency is small when contention-free, but much higher in actual application processes

Part I: Hardware basics for parallel computing

Figure 2.10 A 128-way high-performance switch built with four stages of 16-way switches in IBM SP2: each frame holds nodes N0-N15 on one switchboard, and frames 1-8 are interconnected through four extra switchboards by 4-link bundles

Page 11

Exercises (1)

Point out the differences between a centralized file system and xFS

Part I: Hardware basics for parallel computing

Figure 2.19 Comparison of two file systems: (a) a centralized file server serving several clients over the network; (b) xFS, where clients, managers and storage servers are distributed across the network

Page 12

The code shown below is the process of short-message communication between two processes using active messages in a NOW. Process Q computes array A, and process P computes scalar x. The final result is the summation of x and A[7]. Please explain the process of remote fetching.

Process P                              Process Q
compute x                              compute A
:                                      :
am_enable(...);                        am_enable(...);
am_request_1(Q, request_h, 7);         :
am_poll( );                            am_poll( );
sum = x + y;                           :
:                                      :
am_disable( );                         am_disable( );

int request_h(vnn_t P, int k) { am_reply_1(P, reply_h, A[k]); }
int reply_h(vnn_t Q, int z)   { y = z; }

Exercises (2)

Part I: Hardware basics for parallel computing

Page 13

Chapter 3 Performance Evaluation (I)

Compute A_{n×n} × B_{n×n} = C_{n×n} using the computation models of DSM and DM.

Analyze the scalability of the FFT algorithm on a hypercube. In the hypercube network the distance between every pair of communicating processors is 1, so

  T_o = p · Σ_{j=0}^{log p − 1} (t_s + t_h + t_b n/p) = (t_s + t_h) p log p + t_b n log p    (1)

When p increases, n must also increase in order to keep E constant, so that T_e = k T_o, that is,

  t_c n log n = k [(t_s + t_h) p log p + t_b n log p]    (2)

where k = E/(1−E). Because n log p ≥ p log p, the second term in (2) is the dominant one, so it suffices to consider the case t_c n log n = k t_b n log p. By a simple algebraic transformation we obtain n = p^{k t_b / t_c}. Because W = t_c n log n, the iso-efficiency function f_E(p) is

  W = f_E(p) = k t_b p^{k t_b / t_c} log p    (3)

In (3), if k t_b / t_c < 1, the growth rate of W is less than O(p log p); if k t_b / t_c > 1, the iso-efficiency function W deteriorates rapidly as k t_b / t_c increases. When t_b = t_c: if k < 1 (that is, E < 0.5), then W = Θ(p log p); if k > 1 (that is, E > 0.5), for example E = 0.9 and k = 9, then W = Θ(p^9 log p). That is, the iso-efficiency function gets much worse in this case.

Part I: Hardware basics for parallel computing

Page 14

Given V_{1/2} = (1/2)V_∞ = (1/2) × 1.7 Mflops = 0.85 Mflops, compute:
  the upper triangle matrix of (p, p')
  the curve of the function (p') = f(p')

According to Figure 3.7 we can work out Table I, showing the relationship between the number of processors and the execution time. When the average speed is kept constant, we can work out Table II from Table I and the formula (p, p') = T/T'.

Part I: Hardware basics for parallel computing

Chapter 3 Performance Evaluation (II)

Table I Processor numbers and their related execution times

Proc num:        1         2        4        8        16       32       64      128
Execution time:  0.004029  0.00913  0.01362  0.01744  0.02144  0.02561  0.0296  0.03338

Table II Upper triangle matrix of (p, p')

Page 15

Browse the web site http://www.netlib.org/liblist.html

Analyze the scalability of the n-point FFT algorithm on a row-numbered mesh. Suppose the time to establish a communication connection is t_s, the per-hop latency is t_n, the time to send a packet is t_b, and the unit computation time is t_c.
  Compute the number of communication hops of processor P_i
  Compute the total communication latency T_o
  Compute the iso-efficiency function W = f_E(p) and discuss it. (Hint: comparing the FFT with the topology of the mesh, you can find that communication occurs only between processors in the same row or in the same column, and the largest number of communication hops is √p/2.)

The time complexity of the multiplication of two N×N matrices is T_1 = CN^3 s, where C is a constant. The time complexity of the parallel matrix multiplication on an n-node parallel machine is T_n = (CN^3/n + bN^2/√n) s, where b is another constant; the first term stands for the computing time and the second term stands for the communication overhead.
  Compute the speedup under a fixed workload and discuss it
  Compute the speedup under a fixed time and discuss it
  Compute the speedup under limited storage and discuss it

Part I: Hardware basics for parallel computing

Exercises

Page 16

Chapter 4 Algorithm Basics

PRAM Model
BSP Model
Summation Algorithm in LogP Model

Part II: Parallel Algorithm Designing

Page 17

An Example of PRAM Model

Performance attributes:
  The machine size n can be arbitrarily large
  The basic time step is called a cycle
  Within each cycle, each processor executes one instruction
  All processors implicitly synchronize at each cycle, and the synchronization overhead is assumed to be zero
  An instruction can be any random-access machine instruction

Compute the inner product s of two N-dimensional vectors A and B on an n-processor EREW PRAM computer
  Assign each processor 2N/n additions and multiplications, so it generates a local result in 2N/n cycles
  Add all n local sums into the final sum s by a tree-like reduction in log n cycles
  The total execution time is 2N/n + log n
  The speedup is n / (1 + n log n / (2N)) ≈ n when N >> n
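As a point of comparison with the message-passing programs used later in the course, here is a minimal MPI sketch of the same two-phase idea (local partial products, then a tree-like reduction). The vector length N and the element values are made up for illustration, and MPI_Reduce stands in for the log n reduction tree.

#include <mpi.h>
#include <stdio.h>

#define N 1024                          /* illustrative vector length, assumed divisible by n */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    int chunk = N / n;                  /* each processor performs 2N/n operations */
    double local = 0.0;
    for (int i = 0; i < chunk; i++) {
        double a = 1.0, b = 2.0;        /* stand-ins for A[rank*chunk+i] and B[rank*chunk+i] */
        local += a * b;
    }

    double s = 0.0;                     /* combine the n local sums (tree-like reduction) */
    MPI_Reduce(&local, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("inner product s = %g\n", s);
    MPI_Finalize();
    return 0;
}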

Part II: Parallel Algorithm Designing

Page 18

An Example of BSP Model

Compute the inner product s of two N-dimensional vectors A and B on an 8-processor BSP computer
  Superstep 1
    Computation: each processor computes its local sum in w = 2N/8 cycles
    Communication: processors 0, 2, 4, 6 send their local sums to processors 1, 3, 5, 7 (a 1-relation)
    Barrier synchronization
  Superstep 2
    Computation: processors 1, 3, 5, 7 each perform one addition (w = 1)
    Communication: processors 1, 5 send their intermediate results to processors 3, 7 (a 1-relation)
    Barrier synchronization
  Superstep 3
    Computation: processors 3, 7 each perform one addition (w = 1)
    Communication: processor 3 sends its intermediate result to processor 7 (a 1-relation)
    Barrier synchronization
  Superstep 4
    Computation: processor 7 performs one addition (w = 1) to generate the final sum
    No more communication or synchronization is needed

The total execution time is 2N/8 + 3g + 3l + 3 cycles. In general, the execution time is 2N/n + log n (g + l + 1) cycles on an n-processor BSP
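The superstep structure can be imitated in MPI with explicit barriers. The rough sketch below hard-codes the communication pattern of the example above for p = 8 (the local sums are made-up stand-ins for the 2N/8-element partial sums), and folds each superstep's single addition into the receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int id;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);               /* assumed to run with exactly 8 processes */

    double sum = id + 1.0;                            /* stand-in for the local sum of superstep 1 */
    double recv;

    /* supersteps 1-3: senders 0,2,4,6 -> 1,3,5,7; then 1,5 -> 3,7; then 3 -> 7 */
    int senders[3][4] = { {0, 2, 4, 6}, {1, 5, -1, -1}, {3, -1, -1, -1} };
    for (int step = 0; step < 3; step++) {
        for (int k = 0; k < 4; k++) {
            int s = senders[step][k];
            if (s < 0) continue;
            int d = s + (1 << step);                  /* the receiver is 2^step positions away */
            if (id == s) MPI_Send(&sum, 1, MPI_DOUBLE, d, 0, MPI_COMM_WORLD);
            if (id == d) {
                MPI_Recv(&recv, 1, MPI_DOUBLE, s, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += recv;                          /* the one addition of the next superstep */
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);                  /* barrier synchronization ends the superstep */
    }
    if (id == 7) printf("final sum at processor 7: %g\n", sum);
    MPI_Finalize();
    return 0;
}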

Part II: Parallel Algorithm Designing

Page 19

Summation Algorithm on LogP Model

In a summation algorithm, a processor that has k child processors needs to receive k messages. In order to overlap the overhead of the gap g, it needs to perform at least (k−1)(g−o) local summations.
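For example, with the parameters used in Figure a below (g = 4, o = 2), a processor with k = 2 children needs at least (2−1)×(4−2) = 2 local additions between successive receives to hide the gap.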

Part II: Parallel Algorithm Designing

Figure a: Summation tree (L = 4, g = 4, o = 2, P = 8). Each processor holds n (the underlined number) local data; the number in each circle is that processor's active time
Figure b: Each processor's active state over time (N = 81, t = 27, L = 4, g = 4, o = 2, P = 8)

Page 20

Suppose that at the beginning datum d_i is located on P_i (1 ≤ i ≤ n). The summation here means replacing the original d_i on P_i with the prefix sum Σ_{j=1}^{i} d_j. The summation algorithm on the PRAM model is given in Algorithm 4.3.

Summation Algorithm 4.3 on PRAM-EREW
Input: d_i is kept in P_i, where 1 ≤ i ≤ n
Output: Σ_{j=1}^{i} d_j replaces d_i in processor P_i
Begin
 for j = 0 to log n − 1 do
  for i = 2^j + 1 to n par-do
   (i) P_i obtains d_{i−2^j}
   (ii) d_i = d_i + d_{i−2^j}
  end for
 end for
End

(1) Compute the summation (n = 8) step by step according to the algorithm shown above
(2) Analyze the time complexity of Algorithm 4.3

Part II: Parallel Algorithm Designing

Exercises (1)
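For a concrete picture of the log n rounds, here is a serial C sketch that mimics Algorithm 4.3 by recursive doubling; n = 8 and the data values are made up, and the snapshot array plays the role of the values that all processors read "simultaneously" in one parallel step.

#include <stdio.h>

int main(void)
{
    int n = 8;
    double d[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};       /* d[1..n]; d[0] is unused */
    double prev[9];

    for (int stride = 1; stride < n; stride *= 2) {  /* j = 0 .. log n - 1, stride = 2^j */
        for (int i = 1; i <= n; i++) prev[i] = d[i]; /* values as read at the start of the step */
        for (int i = stride + 1; i <= n; i++)        /* i = 2^j + 1 .. n, done par-do on the PRAM */
            d[i] = prev[i] + prev[i - stride];
    }
    for (int i = 1; i <= n; i++)                     /* d[i] now holds d_1 + ... + d_i */
        printf("d[%d] = %g\n", i, d[i]);
    return 0;
}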

Page 21

When designing an algorithm on the APRAM model, we should try to make each processor's local computation time plus its read/write time comparable to the synchronization time B. When computing the summation of n numbers on APRAM, we can use the summation algorithm on a B-tree. Suppose p processors compute the summation of n numbers, and each processor holds n/p numbers. Each processor computes its local summation, then reads its B children's local summations from the shared memory, adds them together with its own, and puts the result into an appointed shared-memory unit SM. Finally, the summation result in the root processor is the total summation of the n numbers. Algorithm 4.4 shows the summation algorithm on APRAM.

Algorithm 4.4 The summation algorithm on APRAM
Input: n numbers
Output: the total summation, stored in a shared-memory unit SM
Begin
 (1) each processor computes the local summation of its n/p numbers and puts it into SM
 (2) Barrier
 (3) for k = ⌈log_B(p(B−1)+1)⌉ − 2 down to 0 do
  (3.1) for all P_i, 0 ≤ i ≤ p−1, do
     if P_i is on the k-th level then
      P_i sums its B children's local summation results and its own local one, then puts the result into SM
     end if
    end for
  (3.2) Barrier
  end for
End

(1) Using the parameters of the APRAM model, write out the expression for the time complexity of Algorithm 4.4
(2) Explain the role of the Barrier statements in the algorithm

Exercises (2)

Part II: Parallel Algorithm Designing

Page 22

The summation of n numbers on BSP can be done on a d-tree. Suppose p processors compute the summation of n numbers; each processor is assigned n/p numbers. First each processor performs the local summation of its n/p numbers, then the summation is performed on the d-tree from the bottom up. The process is shown in Algorithm 4.5.

Algorithm 4.5 Summation algorithm on BSP
Input: n numbers
Output: the total summation, stored in the root processor P_0
Begin
 (1) for all P_i, 0 ≤ i ≤ p−1, do
  (1.1) P_i computes the summation of its local n/p numbers
  (1.2) if P_i is on the (⌈log_d(p(d−1)+1)⌉ − 1)-th level then
     P_i sends its local summation result to its parent node
    end if
  end for
 (2) Barrier
 (3) for k = ⌈log_d(p(d−1)+1)⌉ − 2 down to 0 do
  (3.1) for all P_i do
     if P_i is on the k-th level then
      P_i receives its d children's messages, sums them with its own local summation result, then sends the result to its parent node
     end if
    end for
  (3.2) Barrier
  end for
End

(1) Analyze the time complexity of Algorithm 4.5
(2) How should the value of d be chosen?

Part II: Parallel Algorithm Designing

Exercises (3)

Page 23

Chapter 5 Designing Method

Parallel string matching algorithm
The shortest paths among all points

Part II: Parallel Algorithm Designing

Page 24

Suppose T = abaababaababaababaababa and P = abaababa (m = 8). Show the execution of the parallel string matching algorithm.

Answer:
For pattern P: WIT(1) = 0, WIT(2) = 1, WIT(3) = 2, WIT(4) = 4
We know that P is an aperiodic string. In order to match it against string T, we should
  compute WIT[1 : n−m+1] of P relative to T
  compute duel(p, q)
Partition T and P into 2^k-blocks (k = 1, 2, ...), and then compute duel(p, q) among the blocks in parallel.
k = 0, WIT = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
k = 1, WIT = [0,1] [2,0] [2,0] [2,0] [0,1] [0,1] [2,0] [2,0]
k = 2, WIT = [0,1,2,4] [2,0,2,2] [4,1,0,1] [2,4,2,0]
At last, we know that matching occurs at positions 1, 6, 11, and 16.

Part II: Parallel Algorithm Designing

Parallel string matching algorithm

(Figure: the text T written out, with the surviving candidate positions after each round — 1, 4, 6, 8, 9, 11, 14, 16 after k = 1 and 1, 6, 11, 16 after k = 2.)

Page 25

The shortest paths among all pairs of points

Part II: Parallel Algorithm Designing

Compute the shortest paths among all pairs of points in the directed, weighted graph shown in Figure 5.2.

Figure 5.2 (a) the directed weighted graph on vertices V0-V6 with its edge weights; (b) the initial weight matrix; (c)-(f) the successive distance matrices produced by the computation

Page 26

The following algorithm shows how to compute the function duel(p, q).

Input: WIT[1 : n−m+1], 1 ≤ p < q ≤ n−m+1, (q−p) ≤ m/2
Output: return the position of the duel's survivor, or null (which means one of p and q does not exist)
Procedure DUEL(p, q)
Begin
 if p = null then duel = q
 else if q = null then duel = p
 else
  (1) j = q − p + 1
  (2) w = WIT(j)
  (3) if T(q + w − 1) ≠ P(w) then
    (i) WIT(q) = w
    (ii) duel = p
   else
    (i) WIT(p) = q − p + 1
    (ii) duel = q
   end if
 end if
 end if
End

(1) Set T = abaababaabaababababa, P = abaababa, and compute WIT(i)
(2) Consider the duel case when p = 6 and q = 9

Part II: Parallel Algorithm Designing

Exercises (1)
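A minimal C rendering of procedure DUEL, assuming 1-based logical positions mapped onto 0-based C strings and using -1 for a "null" position; the WIT values in main come from the worked example of the previous page, and duel(1, 2) returns 1, matching the survivor shown there.

#include <stdio.h>

static const char *T = "abaababaababaababaababa";
static const char *P = "abaababa";
static int WIT[32];                       /* WIT[1..n-m+1], computed elsewhere */

int duel(int p, int q)                    /* assumes p < q <= p + m/2 */
{
    if (p < 0) return q;                  /* p is null */
    if (q < 0) return p;                  /* q is null */
    int j = q - p + 1;
    int w = WIT[j];                       /* nonzero here, since P is aperiodic and 2 <= j <= m/2 + 1 */
    if (T[(q + w - 1) - 1] != P[w - 1]) { /* compare T(q+w-1) with P(w) */
        WIT[q] = w;
        return p;
    } else {
        WIT[p] = q - p + 1;
        return q;
    }
}

int main(void)
{
    WIT[1] = 0; WIT[2] = 1; WIT[3] = 2; WIT[4] = 4;   /* witnesses of P from the worked example */
    printf("duel(1, 2) survives at position %d\n", duel(1, 2));
    return 0;
}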

Page 27

Consider the unit-weight directed graph shown in Figure 5.3. Use the multiplication of Boolean adjacency matrices to compute its transitive closure.

Exercises (2)

Part II: Parallel Algorithm Designing

Figure 5.3 Unit-weight directed graph (vertices a, b, c, d)

Page 28

Chapter 6 Designing Techniques

PSRS Algorithm
(m, n)-Selection Algorithm

Part II: Parallel Algorithm Designing

Page 29

Suppose the sequence length is n = 27 and there are p = 3 processors. The sorting process of PSRS is shown in Fig. 6.1.

Part II: Parallel Algorithm Designing

PSRS Algorithm

Figure 6.1 The PSRS process (n = 27, p = 3):
(a) Partition:       15 46 48 93 39 6 72 91 14 | 36 69 40 89 61 97 12 21 54 | 53 97 84 58 32 27 33 72 20
(b) Local sorting:   6 14 15 39 46 48 72 91 93 | 12 21 36 40 54 61 69 89 97 | 20 27 32 33 53 58 72 84 97
(c) Sampling:        6 39 72 | 12 40 69 | 20 33 72
(d) Sample sorting:  6 12 20 33 39 40 69 72 72
(e) Pivot choosing:  33 69
(f) Pivot partition, (g) global exchange, (h) merge sorting follow in the figure
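A serial C sketch of phases (b)-(e) of PSRS on the data of Figure 6.1; the pivot-based partition, global exchange and final merge are only indicated by a comment. Running it reproduces the pivots 33 and 69 chosen above.

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int a[27] = {15,46,48,93,39,6,72,91,14, 36,69,40,89,61,97,12,21,54, 53,97,84,58,32,27,33,72,20};
    int p = 3, n = 27, w = n / p;
    int samples[9], pivots[2];

    for (int i = 0; i < p; i++)                       /* (b) each "processor" sorts its own block */
        qsort(a + i * w, w, sizeof(int), cmp);

    for (int i = 0; i < p; i++)                       /* (c) p evenly spaced samples per block */
        for (int j = 0; j < p; j++)
            samples[i * p + j] = a[i * w + j * (w / p)];

    qsort(samples, p * p, sizeof(int), cmp);          /* (d) sort the p*p samples */
    for (int k = 1; k < p; k++)                       /* (e) choose the p-1 pivots */
        pivots[k - 1] = samples[k * p];

    printf("pivots: %d %d\n", pivots[0], pivots[1]);
    /* (f)-(h): split each block by the pivots, exchange the sections among the
       processors, and let every processor merge the sections it receives */
    return 0;
}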

Page 30

Input: an unordered n-sequence A = (a1, a2, ..., an)
Output: the m smallest elements of the sequence
Algorithm body
  Partition: partition A into g = n/m groups, each group containing m elements
  Local sorting: use a Batcher sorting network to sort each group in parallel
  Compare the groups pairwise and obtain the MIN sequences
  Sort & compare: for all the MIN sequences, repeat steps 2 and 3 until the m smallest elements are obtained

Part II: Parallel Algorithm Designing

(m, n)-Selection Algorithm

Figure 6.3 The (m, n)-selection network: sorted groups (S) are merged pairwise (M), each merge result is split into its MAX and MIN halves, and only the MIN halves go on to the next round

Page 31

Analyze the execution process of the convolution on a one-dimensional systolic array.

Part II: Parallel Algorithm Designing

Exercises (1)

Figure 6.10 The execution of the convolution (weights w1-w4, inputs x1-x6, results y1-y3) on a one-dimensional systolic array over cycles 1-12

Page 32

Use the following algorithm to compute the connected components of the graph in Fig. 6.11.

Algorithm 6.12 Hirschberg's algorithm for computing connected components on PRAM-CREW
Input: adjacency matrix A_{n×n}
Output: vector D[0 : n−1], where D(i) is the component to which vertex i belongs
Begin
 (1) for all i: 0 ≤ i ≤ n−1 par-do
   D(i) = i
  end for
  do steps (2) through (6) for log n iterations:
 (2) for all i, j: 0 ≤ i, j ≤ n−1 par-do
  (2.1) C(i) = min_j { D(j) | A(i, j) = 1 and D(i) ≠ D(j) }
  (2.2) if none then C(i) = D(i) endif
  end for
 (3) for all i, j: 0 ≤ i, j ≤ n−1 par-do
  (3.1) C(i) = min_j { C(j) | D(j) = i and C(j) ≠ i }
  (3.2) if none then C(i) = D(i) endif
  end for
 (4) for all i: 0 ≤ i ≤ n−1 par-do D(i) = C(i) end for
 (5) for log n iterations do
   for all i: 0 ≤ i ≤ n−1 par-do C(i) = C(C(i)) end for
  end for
 (6) for all i: 0 ≤ i ≤ n−1 par-do D(i) = min{ C(i), D(C(i)) } end for
End

Part II: Parallel Algorithm Designing

Exercises (2)

Vertex:     1 2 3 4 5 6 7 8
Component:  1 2 3 4 5 6 7 8

Figure 6.11 Undirected graph on vertices 1-8

Page 33

Part III: Parallel Numeric Algorithm

Chapter 8 Basic communication operations

The process of All-to-All Broadcast on a Hypercube by using SF
One-to-All and All-to-All Personalized Communication

Page 34

The process of All-to-All Broadcast on a Hypercube by using SF

Figure 8.12 All-to-all broadcast on an eight-node hypercube using SF: (a) initially node i holds message (i); (b) after the exchange along the first dimension pairs of nodes hold (0,1), (2,3), (4,5), (6,7); (c) after the second dimension each node holds (0,1,2,3) or (4,5,6,7); (d) after the third dimension every node holds (0, ..., 7)

All-to-All Broadcast on a Hypercube

Part III: Parallel Numeric Algorithm
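On a message-passing machine, the end state of Figure 8.12 — every node holding (0, ..., 7) — is what MPI's all-gather collective delivers; a minimal sketch, with one integer message per node and illustrative values:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int mine = rank;                       /* the message (i) initially held by node i */
    int all[64];                           /* room for up to 64 nodes' messages */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("node %d now holds (0..%d)\n", rank, p - 1);
    MPI_Finalize();
    return 0;
}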

Page 35

In One-to-All Personalized Communication (also called Single-Node Scatter), there are p packets in the source processor, and every packet has a different destination.

In All-to-All Personalized Communication (also called Total Exchange), every processor sends different packets of the same size m to the others.

Personalized Communication

Part III: Parallel Numeric Algorithm

Figure 8.14 (c) One-to-all personalized communication (single-node scatter) and its dual, single-node collect; (d) all-to-all personalized communication: before the exchange node i holds the packets M_{i,0}, ..., M_{i,p−1}, and afterwards it holds M_{0,i}, ..., M_{p−1,i}

Page 36

One-to-All Personalized Communication is also called Single-Node Scatter. It differs from one-to-all broadcast in that the source processor holds p packets and every packet has a different destination. Fig. 8.17 shows the single-node scatter process on an 8-processor hypercube. Please prove that the time to perform one-to-all personalized communication on a hypercube using SF or CT is

t_one-to-all-pers = t_s log p + m t_w (p − 1)

Exercises (1)

Part III: Parallel Numeric Algorithm

Figure 8.17 The single-node scatter process on an 8-processor hypercube
a) Initial distribution: node 0 holds (0, 1, 2, 3, 4, 5, 6, 7)
b) Distribution before the 2nd step: (0, 1, 2, 3) and (4, 5, 6, 7)
c) Distribution before the 3rd step: (0, 1), (2, 3), (4, 5), (6, 7)
d) Final distribution: node i holds (i)
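In MPI terms the single-node scatter of Figure 8.17 corresponds to MPI_Scatter; a minimal sketch for up to 8 processes, with one integer packet per destination and made-up packet contents:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int packets[8];                        /* the p packets held by the source node 0 */
    if (rank == 0)
        for (int i = 0; i < p; i++) packets[i] = 100 + i;

    int mine;                              /* after the scatter, node i holds only packet i */
    MPI_Scatter(packets, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("node %d received packet %d\n", rank, mine);
    MPI_Finalize();
    return 0;
}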

Page 37

All-to-All Personalized Communication is also called Total Exchange. Every processor sends different packets of the same size m to the others. Fig. 8.18 shows the total exchange process on a 6-processor ring. {x, y} denotes {source processor, destination processor}, and ({x1, y1}, {x2, y2}, ..., {xn, yn}) denotes the packet stream in the transfer process. Each processor keeps only the packets addressed to it and forwards the others. Please prove that the time to perform total exchange on a ring is

t_total-exchange = (t_s + (1/2) m t_w p)(p − 1)

Hint: the size of the packet sent in the i-th step is m(p − i)

Part III: Parallel Numeric Algorithm

Exercises (2)

Figure 8.18 The total exchange process on a 6-processor ring
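The total exchange of Figure 8.18 is exactly MPI's all-to-all collective; a minimal sketch with one integer packet per (source, destination) pair and made-up packet contents:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int sendbuf[16], recvbuf[16];          /* room for up to 16 processes, one int per pair */
    for (int j = 0; j < p; j++)
        sendbuf[j] = 10 * rank + j;        /* the packet {rank, j} */

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* recvbuf[i] now holds the packet {i, rank}, i.e. the one addressed to this node */
    printf("node %d received %d packets, one from every node\n", rank, p);
    MPI_Finalize();
    return 0;
}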

Page 38

Fox Matrix Multiplication Algorithm: divide the matrices A and B into p blocks A_{i,j} and B_{i,j} (0 ≤ i, j ≤ √p − 1) of size (n/√p) × (n/√p), and assign them to the processors P_{0,0}, P_{0,1}, ..., P_{√p−1,√p−1}. At the beginning, processor P_{i,j} contains A_{i,j} and B_{i,j}, and it wants to compute C_{i,j}. The steps are as follows:
  One-to-all broadcast the diagonal block A_{i,i} to the other processors in the same row
  Every processor performs the multiply-add operation on its own B block and the received A block
  Processor P_{i,j} sends its own B block to processor P_{i−1,j} cyclically
  If the block broadcast in this turn is A_{i,j}, then in the next turn choose block A_{i,(j+1) mod √p}, broadcast it to the other processors in the same row, and go back to step 2

In Figure 9.11 we show the process of the Fox algorithm for the matrix multiplication A_{4×4} × B_{4×4} on a 16-processor machine.

Part III: Parallel Numeric Algorithm

Chapter 9 Matrix Operation
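A compact MPI sketch of the Fox pattern for the degenerate case in which every processor holds a single element of A, B and C (so the √p × √p "blocks" are scalars); it is only meant to make the row broadcast and the cyclic upward shift of B concrete, and it assumes the number of processes is a perfect square. The element values are made up.

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int q = (int)(sqrt((double)size) + 0.5);          /* the process grid is q x q */

    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2], rank;
    MPI_Comm grid, row;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    int i = coords[0], j = coords[1];
    MPI_Comm_split(grid, i, j, &row);                 /* one communicator per grid row */

    double a = i * q + j + 1.0, b = 1.0, c = 0.0;     /* made-up elements of A and B */

    int src, dst;                                     /* dst = P(i-1,j), src = P(i+1,j) */
    MPI_Cart_shift(grid, 0, -1, &src, &dst);

    for (int stage = 0; stage < q; stage++) {
        double abcast = a;
        int root = (i + stage) % q;                   /* broadcast A(i, (i+stage) mod q) along row i */
        MPI_Bcast(&abcast, 1, MPI_DOUBLE, root, row);
        c += abcast * b;                              /* local multiply-add */
        MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0,   /* shift B upward cyclically */
                             grid, MPI_STATUS_IGNORE);
    }
    printf("P(%d,%d): c = %g\n", i, j, c);
    MPI_Finalize();
    return 0;
}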

Page 39

Part III: Parallel Numeric Algorithm

An Example of Fox Matrix Multiplication Algorithm

Figure 9.11 Communication process in Fox multiplication on 16 processors
a) one-to-all broadcast of A_{i,i} within each row i, then move B_{i,j} upwards cyclically
b) one-to-all broadcast of A_{i,(i+1) mod 4}, then move B_{i,j} upwards cyclically
c) one-to-all broadcast of A_{i,(i+2) mod 4}, then move B_{i,j} upwards cyclically
d) one-to-all broadcast of A_{i,(i+3) mod 4}

Page 40

Refer to Fig. 9.14. Algorithm 9.8 shows the matrix multiplication A_{m×n} × B_{n×k} = C_{m×k} on a two-dimensional m×k systolic array. A pair of elements with matching subscripts can be brought together for multiplication by delaying the matrix elements according to the principle of pipelining.

Algorithm 9.8
Input: A_{m×n}, B_{n×k}
Output: the result element c_{i,j} is put into P_{i,j}
Begin
 for i = 1 to m par-do
  for j = 1 to k par-do
   (1) c_{i,j} = 0
   (2) while P_{i,j} receives a and b do
    (2.1) c_{i,j} = c_{i,j} + a*b
    (2.2) if i < m then send b to P_{i+1,j} endif
    (2.3) if j < k then send a to P_{i,j+1} endif
    endwhile
  end for
 end for
End

The questions are:
  In order to make sure that a_{i,s} and b_{s,j} meet at the right time, by how many time units should row i of matrix A lag behind row i−1 (2 ≤ i ≤ m)? And by how many time units should column j of matrix B lag behind column j−1 (2 ≤ j ≤ k)?
  When j = k, should a still be sent to P_{i,j+1}? When i = m, should b still be sent to P_{i+1,j}?

Exercises (1)

Part III: Parallel Numeric Algorithm

Page 41

According to the Fox algorithm discussed in Section 9.4.3, write out a formal description of Fox multiplication.

Exercises (2)

Part III: Parallel Numeric Algorithm

Figure 9.14 The loading mode of matrices A (elements a11-a45 entering along the rows) and B (elements b11-b53 entering along the columns) into the 4×3 processor array P_{1,1}-P_{4,3}

Page 42

Odd-Even Reduction scheme for systems of tridiagonal equations
Gauss-Seidel scheme for systems of linear equations
Conjugate Gradient scheme for systems of linear equations

Part III: Parallel Numeric Algorithm

Chapter 10 System of Linear Equations

Page 43

Initialize the coefficient arrays
Reduce the coefficients to one equation
Get the result X_n
Substitute it back to get X_i (i = 1, 2, ..., n−1)

Part III: Parallel Numeric Algorithm

Odd-Even Reduction scheme (1)

#include "mpi.h"

#include "stdio.h"

#include "math.h"

int MOD (int a,int b) {

int c=a/b;

if (a>=0) return (a-c*b);

else return (b+a);

}

main (int argc, char** argv)

{

int myid,i,d,group_size,j,n=9;

double pred,starttime,endtime;

double fpie[12],gpie[12],hpie[12],bpie[12],r[12],delta[12],x[12];

double b[12]={0,0,2,7,13,18,6,3,9,1,0,0};

double f[12]={0,0,0,4,2,5,1,2,2,5,0,0};

double h[12]={0,0,1,5,6,4,1,3,1,0,0,0};

double g[12]={0,0,4,11,14,18,4,6,12,8,0,0};

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD,&myid);

MPI_Comm_size(MPI_COMM_WORLD,&group_size);

starttime=MPI_Wtime();

for (i=0;i<12;i++) {

fpie[i]=gpie[i]=hpie[i]=bpie[i]=0;

x[i]=r[i]=delta[i]=0;

}

Page 44

Odd-Even Reduction scheme (2)

Part III: Parallel Numeric Algorithm

for (i=0;i<=3-1;i++) {

pred=i*log(2);

d=(int)exp(pred);

for (j=2*myid*d+2*d+1; j<=n-1; j+=2*d*group_size) {

r[j]=f[j]/g[j-d]; delta[j]=h[j]/g[j+d];

fpie[j]=-r[j]*f[j-d];

gpie[j]=-delta[j]*f[j+d]-r[j]*h[j-d]+g[j];

hpie[j]=-delta[j]*h[j+d];

bpie[j]=b[j]-r[j]*b[j-d]-delta[j]*b[j+d];

}

r[n]=f[n]/g[n-d]; f[n]=r[n]*f[n-d];

g[n]=g[n]-r[n]*h[n-d]; b[n]=b[n]-r[n]*b[n-d];

for (j=2*d*myid+2*d+1; j<=n-1; j+=2*d*group_size) {

f[j]=fpie[j]; g[j]=gpie[j];

h[j]=hpie[j]; b[j]=bpie[j];

}

for (j=2*d+1;j<=n-1;j+=2*d) {

int buf=MOD((j-2*d-1)/(2*d),group_size);

MPI_Bcast(&f[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);

MPI_Bcast(&g[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);

MPI_Bcast(&h[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);

MPI_Bcast(&b[j],1,MPI_DOUBLE,buf,MPI_COMM_WORLD);

}

}

x[n]=b[n]/g[n];

for (i=3-1; i>-1; i--) {

pred=i*log(2); d=(int)exp(pred);

x[d+1]=(b[d+1]-h[d+1]*x[2*d+1])/g[d+1];

for (j=3*d+1+myid*2*d; j<n; j+=2*d*group_size) x[j]=(b[j]-f[j]*x[j-d]-h[j]*x[j+d])/g[j];

for (j=3*d+1; j<n; j+=2*d) {

int buf2=MOD((j-3*d-1)/(2*d),group_size);

MPI_Bcast(&x[j],1,MPI_DOUBLE,buf2,MPI_COMM_WORLD);

}

}

Page 45

endtime=MPI_Wtime();

if (myid==0) {

FILE *fp=fopen("eqdata.dat","w");

for (i=2;i<=n;i++) {

printf("X[%d]=%10.5f",i-1,x[i]);

fprintf("X[%d]=%10.5f",i-1,x[i]);

}

printf("It takes %fs to calculate it"), endtime-starttime);

fprintf("It takes %fs to calculate it"), endtime-starttime);

fclose(fp);

}

MPI_Finalize();

}

Odd-Even Reduction scheme (3)

Part III: Parallel Numeric Algorithm

Page 46

By a series of eliminations, the Gauss-Seidel scheme transforms the coefficient matrix into a diagonal matrix, and x_i is then computed directly from the i-th equation.

In the following example, with n = 4, we show the eliminations performed by the Gauss-Seidel scheme:
Find the pivot whose absolute value is the maximum in the coefficient matrix; say it is a11. The first equation is multiplied by -a_i1/a11 and added to the i-th equation (i = 2, 3, 4).
Find the pivot whose absolute value is the maximum in the coefficient matrix excluding the first row; say it is b22. The second equation is multiplied by -b_i2/b22 and added to the i-th equation (i = 1, 3, 4).
Repeating this, we finally obtain a diagonal system from which the x_i are computed directly: x1 = a15/a11, x2 = b25/b22, x3 = c35/c33, x4 = d45/d44.
A sequential sketch of this diagonalization is given after the matrices below.

Gauss-Seidel Scheme

Part III: Parallel Numeric Algorithm

The three stages of the elimination for n = 4 (the right-hand side is carried along as a fifth column a_{i5}, b_{i5}, c_{i5}, d_{i5}):

\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ 0 & b_{22} & b_{23} & b_{24}\\ 0 & b_{32} & b_{33} & b_{34}\\ 0 & b_{42} & b_{43} & b_{44}\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\end{bmatrix}=
\begin{bmatrix} a_{15}\\ b_{25}\\ b_{35}\\ b_{45}\end{bmatrix},\qquad
\begin{bmatrix} a_{11} & 0 & c_{13} & c_{14}\\ 0 & b_{22} & b_{23} & b_{24}\\ 0 & 0 & c_{33} & c_{34}\\ 0 & 0 & c_{43} & c_{44}\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\end{bmatrix}=
\begin{bmatrix} a_{15}\\ b_{25}\\ c_{35}\\ c_{45}\end{bmatrix},
\]
\[
\begin{bmatrix} a_{11} & 0 & 0 & 0\\ 0 & b_{22} & 0 & 0\\ 0 & 0 & c_{33} & 0\\ 0 & 0 & 0 & d_{44}\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\end{bmatrix}=
\begin{bmatrix} a_{15}\\ b_{25}\\ c_{35}\\ d_{45}\end{bmatrix}
\]
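A minimal sequential C sketch of this diagonalization is given below. It omits the pivot search described above (every diagonal pivot is assumed to be non-zero), and the 4x4 example data is an arbitrary choice made for illustration, not taken from the slides.

#include <stdio.h>

#define N 4

/* Diagonalize the augmented matrix a[N][N+1] in place: for every column k,
   row k is scaled by -a[i][k]/a[k][k] and added to every other row i,
   exactly as in the elimination steps described above. */
void solve(double a[N][N+1], double x[N])
{
    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++) {
            if (i == k) continue;
            double m = -a[i][k]/a[k][k];
            for (int j = k; j <= N; j++) a[i][j] += m*a[k][j];
        }
    }
    /* once the matrix is diagonal, x_i = a_{i,5} / a_{ii} */
    for (int i = 0; i < N; i++) x[i] = a[i][N]/a[i][i];
}

int main(void)
{
    /* arbitrary example with solution x = (1, 1, 1, 1) */
    double a[N][N+1] = {
        {2, 1, 1, 0,  4},
        {4, 3, 3, 1, 11},
        {8, 7, 9, 5, 29},
        {6, 7, 9, 8, 30}
    };
    double x[N];
    solve(a, x);
    for (int i = 0; i < N; i++) printf("x%d = %f\n", i+1, x[i]);
    return 0;
}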


Initialize x(0) = 0, p(0) = 0, r(0) = -b.
Compute the residual r(k) = A x(k-1) - b.
Compute the direction vector
p(k) = -r(k) + [ r^T(k) r(k) / r^T(k-1) r(k-1) ] p(k-1).
Compute the step
a(k) = -p^T(k) r(k) / [ p^T(k) A p(k) ],
and update x(k) = x(k-1) + a(k) p(k).
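As a reference point, the following minimal sequential C sketch implements exactly this iteration for the 3x3 exercise system given later in this chapter (the same block that is repeated three times in the 9x9 matrix of the MPI program below); the iteration cap of 1000 and the 1e-8 tolerance are assumptions of the sketch.

#include <stdio.h>

#define N 3

/* y = M v */
static void matvec(const double M[N][N], const double v[N], double y[N]) {
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++) y[i] += M[i][j]*v[j];
    }
}

static double dot(const double u[N], const double v[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += u[i]*v[i];
    return s;
}

int main(void) {
    double A[N][N] = {{2,1,0},{1,2,1},{0,1,1}};
    double b[N]    = {1,0,1};
    double x[N] = {0,0,0}, p[N] = {0,0,0}, r[N], Ap[N];
    double rr_old, rr_new, alpha;

    /* r(0) = A x(0) - b = -b */
    for (int i = 0; i < N; i++) r[i] = -b[i];
    rr_old = dot(r, r);

    for (int k = 1; k <= 1000 && rr_old > 1e-8; k++) {
        /* r(k) = A x(k-1) - b */
        matvec(A, x, r);
        for (int i = 0; i < N; i++) r[i] -= b[i];
        rr_new = dot(r, r);
        /* p(k) = -r(k) + [r'(k)r(k) / r'(k-1)r(k-1)] p(k-1) */
        for (int i = 0; i < N; i++) p[i] = -r[i] + (rr_new/rr_old)*p[i];
        /* a(k) = -p'r / p'Ap,  x(k) = x(k-1) + a(k) p(k) */
        matvec(A, p, Ap);
        alpha = -dot(p, r)/dot(p, Ap);
        for (int i = 0; i < N; i++) x[i] += alpha*p[i];
        rr_old = rr_new;
    }
    for (int i = 0; i < N; i++) printf("x%d = %f\n", i+1, x[i]);
    return 0;
}

For this system the sketch converges to x = (2, -3, 4), matching the output of the parallel program below.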

Part III: Parallel Numeric Algorithm

Conjugate Gradient Scheme (1)

#include <mpi.h>

#define ITERATION 1000

double A[9][9]={{2,1,0,0,0,0,0,0,0},

{1,2,1,0,0,0,0,0,0},

{0,1,1,0,0,0,0,0,0},

{0,0,0,2,1,0,0,0,0},

{0,0,0,1,2,1,0,0,0},

{0,0,0,0,1,1,0,0,0},

{0,0,0,0,0,0,2,1,0},

{0,0,0,0,0,0,1,2,1},

{0,0,0,0,0,0,0,1,1} }; /*Matrix to solve*/

double localA[9]; /* the row of Matrix that the node conserve*/

double b[9]={1,0,1,1,0,1,1,0,1};

double r[9]={-1,0,-1,-1,0,-1,-1,0,-1};

double d[9]={0,0,0,0,0,0,0,0,0};

double x[9]={0,0,0,0,0,0,0,0,0};

double d[9],temparr[9];

double localr,localx,locald,localtemp;

int id,size;


int main(int argc, char** argv)
{
    double num1,denom1,num2,denom2,temp;
    int i,j;

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&id);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    if (size!=9) {
        if (id==0) printf("need and only need 9 nodes!\n");
        goto the_exit;
    }
    printf("Node %d of %d starts...\n",id,size);
    for (i=0;i<size;i++) localA[i]=A[id][i];
    localr=r[id]; locald=d[id]; localx=x[id];
    for (i=0;i<ITERATION;i++) {
        temp=localr*localr;
        MPI_Barrier(MPI_COMM_WORLD);
        /* inner product of the old residual: denom1 = r(k-1)'r(k-1) */
        MPI_Allreduce(&temp,&denom1,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
        localr=0;
        for (j=0;j<size;j++) localr+=localA[j]*x[j];
        localr-=b[id];
        MPI_Barrier(MPI_COMM_WORLD);
        /* gather the new residual r = Ax - b */
        MPI_Allgather(&localr,1,MPI_DOUBLE,r,1,MPI_DOUBLE,MPI_COMM_WORLD);
        temp=localr*localr;
        MPI_Barrier(MPI_COMM_WORLD);

Part III: Parallel Numeric Algorithm

Conjugate Gradient Scheme (2)


/*calculate the inner product of r&r*/

MPI_Allreduce(&temp,&num1,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);

if (num1<.0001) break;

locald=num1*locald/denom1-localr;

MPI_Barrier(MPI_COMM_WORLD);

/*gather the vector of d*/

MPI_Allgather(&locald,1,MPI_DOUBLE,d,1,MPI_DOUBLE,MPI_COMM_WORLD);

temp=locald*localr;

MPI_Barrier(MPI_COMM_WORLD);

/*calculate the inner product of d&r*/

MPI_Allreduce(&temp,&num1,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);

localtemp=0;

for (j=0;j<size;j++) localtemp+=localA[j]*d[j];

temp=locald*localtemp;

MPI_Barrier(MPI_COMM_WORLD);

/*Calculate the product of d'[A]d*/

MPI_Allreduce(&temp,&denom2,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);

localx=localx-num2/denom2*locald;

MPI_Barrier(MPI_COMM_WORLD);

/*calculate the new vector of x*/

MPI_Allgather(&localx,1,MPI_DOUBLE,x,1,MPI_DOUBLE,MPI_COMM_WORLD);

MPI_Barrier(MPI_COMM_WORLD);

}

if (id==0) {

printf("the solution is");

for (i=0;i<size;i++) printf("x%d:%f",i,x[i]);

}

the_exit:

MPI_Finalize();

}

The solution is
x0: 2.000000    x1: -3.000000    x2: 4.000000
x3: 2.000000    x4: -3.000000    x5: 4.000000
x6: 2.000000    x7: -3.000000    x8: 4.000000
mat: end of program (1.590001 seconds)

Conjugate Gradient Scheme (3)

Part III: Parallel Numeric Algorithm


Given x(0) = [0, 0, 1]^T, use the Conjugate Gradient method to solve the system of equations shown below.

Algorithm 10.11  Odd-Even Reduction Scheme on SISD for a System of Tridiagonal Equations
Input:  A (n x n), b = [b_1, ..., b_n]^T
Output: x = [x_1, ..., x_n]^T
Begin
(1) for i = 0 to log n - 1 do
    (1.1) d = 2^i
    (1.2) for j = 2^(i+1) to n-1 step 2d do
              r_j = f_j / g_{j-d};   δ_j = h_j / g_{j+d}
              f'_j = -r_j f_{j-d}
              g'_j = g_j - δ_j f_{j+d} - r_j h_{j-d}
              h'_j = -δ_j h_{j+d}
              b'_j = b_j - r_j b_{j-d} - δ_j b_{j+d}
          end for
    (1.3) r_n = f_n / g_{n-d}
    (1.4) f_n = -r_n f_{n-d}
    (1.5) g_n = g_n - r_n h_{n-d}
    (1.6) b_n = b_n - r_n b_{n-d}
    (1.7) for j = 2^(i+1) to n-1 step 2d do
              g_j = g'_j;  f_j = f'_j;  h_j = h'_j;  b_j = b'_j
          end for
    end for

Part III: Parallel Numeric Algorithm

Exercises (1)

The system for the Conjugate Gradient exercise above:
\[
\begin{bmatrix} 2 & 1 & 0\\ 1 & 2 & 1\\ 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix}=
\begin{bmatrix} 1\\ 0\\ 1 \end{bmatrix}
\]


(2) x_n = b_n / g_n
(3) for i = log n - 1 to 0 step -1 do
    (3.1) d = 2^i
    (3.2) x_d = (b_d - h_d x_{2d}) / g_d
    (3.3) for j = 3d to n step 2d do
              x_j = (b_j - f_j x_{j-d} - h_j x_{j+d}) / g_j
          end for
    end for
End

Suppose:
Compute X from AX = B, where

Part III: Parallel Numeric Algorithm

Exercises (2)

\[
A=\begin{bmatrix}
4 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
4 & 11 & 5 & 0 & 0 & 0 & 0 & 0\\
0 & 2 & 14 & 6 & 0 & 0 & 0 & 0\\
0 & 0 & 5 & 18 & 4 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 4 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 2 & 6 & 3 & 0\\
0 & 0 & 0 & 0 & 0 & 2 & 12 & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 5 & 8
\end{bmatrix},\qquad
B=\begin{bmatrix}2\\7\\13\\18\\6\\3\\9\\1\end{bmatrix}
\]
(These are the tridiagonal coefficients f, g, h and the right-hand side b used in the Odd-Even Reduction program above.)
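For reference, a minimal sequential C sketch of Algorithm 10.11 applied to this system follows. The 1-based array layout, the hard-coded size n = 8, and the level count log n = 3 are choices made for this sketch rather than anything stated on the slides.

#include <stdio.h>

#define N 8      /* number of equations, a power of two */
#define LOGN 3

int main(void)
{
    /* sub-diagonal f, diagonal g, super-diagonal h and right-hand side b of the
       system above, stored 1-based (index 0 unused); f[1] and h[N] are zero */
    double f[N+2] = {0, 0, 4, 2, 5, 1, 2, 2, 5};
    double g[N+2] = {0, 4, 11, 14, 18, 4, 6, 12, 8};
    double h[N+2] = {0, 1, 5, 6, 4, 1, 3, 1, 0};
    double b[N+2] = {0, 2, 7, 13, 18, 6, 3, 9, 1};
    double fp[N+2], gp[N+2], hp[N+2], bp[N+2], x[N+2], r, delta;
    int i, j, d;

    /* step (1): forward reduction */
    for (i = 0; i < LOGN; i++) {
        d = 1 << i;
        for (j = 2*d; j <= N-1; j += 2*d) {      /* interior equations */
            r = f[j]/g[j-d];  delta = h[j]/g[j+d];
            fp[j] = -r*f[j-d];
            gp[j] = g[j] - delta*f[j+d] - r*h[j-d];
            hp[j] = -delta*h[j+d];
            bp[j] = b[j] - r*b[j-d] - delta*b[j+d];
        }
        r = f[N]/g[N-d];                         /* boundary equation n */
        f[N] = -r*f[N-d];
        g[N] = g[N] - r*h[N-d];
        b[N] = b[N] - r*b[N-d];
        for (j = 2*d; j <= N-1; j += 2*d) {
            f[j] = fp[j];  g[j] = gp[j];  h[j] = hp[j];  b[j] = bp[j];
        }
    }
    /* steps (2)-(3): back substitution */
    x[N] = b[N]/g[N];
    for (i = LOGN-1; i >= 0; i--) {
        d = 1 << i;
        x[d] = (b[d] - h[d]*x[2*d])/g[d];
        for (j = 3*d; j <= N-1; j += 2*d)
            x[j] = (b[j] - f[j]*x[j-d] - h[j]*x[j+d])/g[j];
    }
    for (i = 1; i <= N; i++) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}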


Compute the Principal nth Root of Unity when n = 8

Part III: Parallel Numeric Algorithm

Chapter 11 FFT

The principal n-th root of unity is ω = e^{2πi/n}. For n = 8, ω = e^{2πi/8} = cos(π/4) + i·sin(π/4) = √2/2 + (√2/2)i, and its powers are:

ω^1 = (√2/2)(1 + i)
ω^2 = i
ω^3 = (√2/2)(-1 + i)
ω^4 = -1
ω^5 = (√2/2)(-1 - i)
ω^6 = -i
ω^7 = (√2/2)(1 - i)
ω^8 = 1

Figure 11.1 Principal 8th Root of Unity (the eight powers plotted on the unit circle)
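As a quick check, a few lines of C99 (using <complex.h>) reproduce this table; the printing format is only an illustrative choice.

#include <stdio.h>
#include <complex.h>

#define PI 3.14159265358979323846

int main(void)
{
    int n = 8;
    double complex w = cexp(2.0*PI*I/n);   /* principal n-th root of unity */
    for (int k = 1; k <= n; k++) {
        double complex wk = cpow(w, k);    /* w^k */
        printf("w^%d = (%6.3f, %6.3f)\n", k, creal(wk), cimag(wk));
    }
    return 0;
}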


FFT algorithm on SIMD-MC2

Input:  a_k in P_k, where k = 0, ..., n-1
Output: b_k in P_k
1. for k = 0 to n-1 par-do
       C_k = a_k
   end for
2. for h = log n - 1 to 0 do
       for k = 0 to n-1 par-do
           (2.1) p = 2^h
           (2.2) q = n/p
           (2.3) z = ω^p
           (2.4) if (k mod p = k mod 2p) then par-do
                     (i)  C_k     = C_k + C_{k+p} z^{r(k) mod q}
                     (ii) C_{k+p} = C_k - C_{k+p} z^{r(k) mod q}
                 end if
       end for
   end for
3. for k = 0 to n-1 par-do
       b_k = C_{r(k)}
   end for
(Here r(k) denotes the bit-reversal permutation of k.)

Part III: Parallel Numeric Algorithm

Chapter 11 FFT


An Example of FFT algorithm on SIMD-MC2

Suppose n = 4 processors form a 2x2 mesh and are numbered in row-major order.
In step 2 of the algorithm, when h = 1: p = 2, q = 2, z = ω^2, and processors P0 and P1 satisfy k mod 2 = k mod 4:
P0: c0 = c0 + c2(ω^2)^0 = a0 + a2,   c2 = c0 - c2(ω^2)^0 = a0 - a2
P1: c1 = c1 + c3(ω^2)^0 = a1 + a3,   c3 = c1 - c3(ω^2)^0 = a1 - a3
At the same time, processors in the same column need to communicate with each other.
When h = 0: p = 1, q = 4, z = ω, and processors P0 and P2 satisfy k mod 1 = k mod 2:
P0: c0 = c0 + c1·ω^0 = (a0+a2) + (a1+a3),   c1 = c0 - c1·ω^0 = (a0+a2) - (a1+a3)
P2: c2 = c2 + c3·ω = (a0-a2) + (a1-a3)ω,    c3 = c2 - c3·ω = (a0-a2) - (a1-a3)ω
In step 3 the result is b0 = c0, b1 = c2, b2 = c1, b3 = c3; when computing b1 = c2 and b2 = c1, the processors on the diagonal need to communicate with each other.
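The following standalone C sketch (written for these notes as an illustration, not the SIMD-MC2 program itself) runs the two butterfly stages above for n = 4 on sample data and compares the result with a directly evaluated DFT; the input values are an arbitrary assumption.

#include <stdio.h>
#include <complex.h>

#define PI 3.14159265358979323846

int main(void)
{
    int n = 4;
    double complex a[4] = {1, 2, 3, 4};          /* sample input */
    double complex c[4], b[4], dft[4];
    double complex w = cexp(2.0*PI*I/n);         /* principal 4th root of unity */
    int rev[4] = {0, 2, 1, 3};                   /* 2-bit reversal r(k) */

    for (int k = 0; k < n; k++) c[k] = a[k];

    /* step 2 of the algorithm: h = 1 pairs (0,2),(1,3); h = 0 pairs (0,1),(2,3) */
    for (int h = 1; h >= 0; h--) {
        int p = 1 << h, q = n/p;
        double complex z = cpow(w, p);
        for (int k = 0; k < n; k++) {
            if (k % p == k % (2*p)) {            /* k is the upper element of its pair */
                double complex t = c[k+p]*cpow(z, rev[k] % q);
                double complex u = c[k];
                c[k]   = u + t;
                c[k+p] = u - t;
            }
        }
    }
    for (int k = 0; k < n; k++) b[k] = c[rev[k]];   /* step 3: un-shuffle */

    /* direct DFT for comparison: dft[j] = sum_k a[k] * w^(j*k) */
    for (int j = 0; j < n; j++) {
        dft[j] = 0;
        for (int k = 0; k < n; k++) dft[j] += a[k]*cpow(w, j*k);
    }
    for (int j = 0; j < n; j++)
        printf("b[%d] = (%5.2f,%5.2f)   direct DFT = (%5.2f,%5.2f)\n",
               j, creal(b[j]), cimag(b[j]), creal(dft[j]), cimag(dft[j]));
    return 0;
}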

Part III: Parallel Numeric Algorithm

Chapter 11 FFT

Figure 11.5 MC2 Structure in Parallel FFT (a √n x √n mesh of processors P0 ... P15, shown for n = 16)


Compute the inverse DFT of the following sequences:
(16, -0.76+8.66i, -6+6i, -9.25+2.66i, 0, -9.25-2.66i, -6-6i, -0.76-8.66i)
(4-i, 2+i, 2+i, -i, 4-i, 2+i, 2+i, -i)

Given n = 8 = 2^k, on the butterfly network, according to exp(r, i) = j (0 ≤ i ≤ n-1, 0 ≤ r ≤ k), compute the element ω^j in the coefficient matrix of the FFT for the 8 points distributed on the butterfly network.

Part III: Parallel Numeric Algorithm

Exercises


Three Parallelization Approaches

A sequential code fragment:
for(i=0;i<N;i++) A[i]=b[i]*b[i+1];
for(i=0;i<N;i++) c[i]=A[i]+A[i+1];

Equivalent parallel code using library routines:
id=my_process_id();
p=number_of_process();
for(i=id;i<N;i=i+p) A[i]=b[i]*b[i+1];
barrier();
for(i=id;i<N;i=i+p) c[i]=A[i]+A[i+1];

Equivalent code in Fortran 90 using array operations (my_process_id(), number_of_process(), and barrier() are not needed):
A(0:N-1)=b(0:N-1)*b(1:N)
c=A(0:N-1)+A(1:N)

Equivalent code using pragmas in SGI Power C:
#pragma parallel
#pragma shared(A,b,c)
#pragma local(i)
{
  #pragma pfor iterate(i=0;N;1)
  for(i=0;i<N;i++) A[i]=b[i]*b[i+1];
  #pragma synchronize
  #pragma pfor iterate(i=0;N;1)
  for(i=0;i<N;i++) c[i]=A[i]+A[i+1];
}

Part IV: Parallel Programming


A sequential C code to compute π

#define N 1000000
main() {
    double local, pi = 0.0, w;
    long i;
    w = 1.0/N;
    for (i = 0; i < N; i++) {
        local = (i + 0.5)*w;
        pi = pi + 4.0/(1.0 + local*local);
    }
    printf("pi is %f \n", pi*w);
}/*main()*/

Part IV: Parallel Programming


Shared-Variable Parallel Code for Computing π

#define N 1000000
main() {
    double local, pi = 0.0, w;
    long i;
A:  w = 1.0/N;
B:  #pragma parallel
    #pragma shared(pi, w)
    #pragma local(i, local)
    {
        #pragma pfor iterate(i=0; N; 1)
        for (i = 0; i < N; i++) {
P:          local = (i + 0.5)*w;
Q:          local = 4.0/(1.0 + local*local);
C:          #pragma critical
            pi = pi + local;
        }
    }
D:  printf("pi is %f \n", pi*w);
}/*main()*/

Part IV: Parallel Programming


Computing π in OpenMP

      program compute_pi
      integer n, i
      real w, x, sum, pi, f, a
C     function to integrate
      f(a) = 4.d0/(1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
C     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w), REDUCTION(+:sum)
      do i = 1, n
         x = w*(i - 0.5d0)
         sum = sum + f(x)
      enddo
!$OMP END PARALLEL DO
      pi = w*sum
      print *, 'computed pi = ', pi
      stop
      end
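An equivalent C sketch using the same OpenMP reduction idea; the fixed interval count N is an assumption of this sketch.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    double w = 1.0/N, sum = 0.0, x;
    long i;

    /* each thread accumulates part of the sum; OpenMP combines the parts */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        x = (i + 0.5)*w;
        sum += 4.0/(1.0 + x*x);
    }
    printf("computed pi = %f\n", sum*w);
    return 0;
}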

Part IV: Parallel Programming


Exercises:a Pthreads code computing π (1)

#include <stdio.h>#include <stdio.h>#include <pthread.h>#include <pthread.h>#include <synch.h>#include <synch.h>

extern unsigned * micro_timer;extern unsigned * micro_timer;unsigned int fin,Start;unsigned int fin,Start;semaphore_t semaphore;semaphore_t semaphore;barrier_spin_t barrier;barrier_spin_t barrier;double delta,pi;double delta,pi;typedef struct{typedef struct{

int low;int low;int high;int high;

}Arg;}Arg;

void child(arg)void child(arg)Arg *arg;Arg *arg;{{

int low=arg->low;int low=arg->low;int high=arg->high;int high=arg->high;int i;int i;double x,part=0.0;double x,part=0.0;

Part IV: Parallel Programming


Exercises:a Pthreads code computing π (2)

for(i=10; i<=high; i++){for(i=10; i<=high; i++){x=(i+0.5)*delta;x=(i+0.5)*delta;part+=1.0/(1.0+x*x);part+=1.0/(1.0+x*x);

}}pesma(&semaphore);pesma(&semaphore);

pi+=4.0*delta*part;pi+=4.0*delta*part;vsema(&semaphore);vsema(&semaphore);-barrier_spin(&barrier);-barrier_spin(&barrier);pthread_exit();pthread_exit();

}}

main(argc,argv)main(argc,argv)int argc;int argc;char *argv[ ];char *argv[ ];{{

int no_of_threads,segments,i;int no_of_threads,segments,i;pthread_t thread;pthread_t thread;Arg *arg;Arg *arg;if(arg!=3){if(arg!=3){

printf("usage:pi<no_of_threads><no_of_stprintf("usage:pi<no_of_threads><no_of_strips>\n");rips>\n");

exit(1);exit(1);}}

Part IV: Parallel Programming


Exercises:a Pthreads code computing π (3)

    no_of_threads = atoi(argv[1]);
    segments = atoi(argv[2]);
    delta = 1.0/segments;
    pi = 0.0;
    sema_init(&semaphore, 1);
    barrier_spin_init(&barrier, no_of_threads + 1);
    start = *micro_timer;
    for (i = 0; i < no_of_threads; i++) {
        arg = (Arg*)malloc(sizeof(Arg));
        arg->low = i*segments/no_of_threads;
        arg->high = (i+1)*segments/no_of_threads - 1;
        pthread_create(&thread, pthread_attr_default, child, arg);
    }
    barrier_spin(&barrier);
    fin = *micro_timer;
    printf("%u\n", fin - start);
    printf("\npi\t%15.14f\n", pi);
}
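The exercise code above relies on the textbook's semaphore and spin-barrier primitives (semaphore_t, barrier_spin_t, micro_timer). A sketch of the same computation using only standard POSIX interfaces (a mutex plus pthread_join) is shown below; the function and variable names are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

static double delta, pi = 0.0;
static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct { int low, high; } Arg;

static void *child(void *p)
{
    Arg *arg = (Arg *)p;
    double x, part = 0.0;
    for (int i = arg->low; i <= arg->high; i++) {
        x = (i + 0.5)*delta;
        part += 1.0/(1.0 + x*x);
    }
    pthread_mutex_lock(&pi_lock);      /* protect the shared accumulator */
    pi += 4.0*delta*part;
    pthread_mutex_unlock(&pi_lock);
    return NULL;
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        printf("usage: pi <no_of_threads> <no_of_strips>\n");
        return 1;
    }
    int nthreads = atoi(argv[1]), segments = atoi(argv[2]);
    pthread_t *tid = malloc(nthreads*sizeof(pthread_t));
    Arg *args = malloc(nthreads*sizeof(Arg));
    delta = 1.0/segments;

    for (int i = 0; i < nthreads; i++) {
        args[i].low  = i*segments/nthreads;
        args[i].high = (i+1)*segments/nthreads - 1;
        pthread_create(&tid[i], NULL, child, &args[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);    /* replaces the spin barrier */
    printf("pi\t%15.14f\n", pi);
    free(tid); free(args);
    return 0;
}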

Part IV: Parallel Programming


Message-Passing Code for Computing π

#define N 1000000
main(int argc, char **argv) {
    double local, pi, w, sum = 0.0;
    long i, taskid, numtask;
A:  w = 1.0/N;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtask);
B:  for (i = taskid; i < N; i = i + numtask) {
P:      local = (i + 0.5)*w;
Q:      local = 4.0/(1.0 + local*local);
        sum = sum + local;
    }
C:  MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
D:  if (taskid == 0) printf("pi is %f \n", pi*w);
    MPI_Finalize();
}/*main()*/

Part IV: Parallel Programming


A MPI-C program to compute π : 1

#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(a)
double a;
{
    return (4.0/(1.0 + a*a));
}

int main(argc, argv)
int argc;
char* argv[];
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;
    double startwtime, endwtime;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

Part IV: Parallel Programming



A MPI-C program to compute π : 2

    while (!done)
    {
        if (myid == 0)
        {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
            startwtime = MPI_Wtime();
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else
        {
            h = 1.0/(double)n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs)
            {
                x = h*((double)i - 0.5);
                sum += f(x);
            }

Part IV: Parallel Programming



A MPI-C program to compute π : 3

            mypi = h*sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE,
                       MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }/*else*/
    }/*while*/
    MPI_Finalize();
}/*main()*/

Part IV: Parallel Programming


A PVM-C program to compute π : 1

#define n 16  /* number of tasks */
#include "pvm3.h"

main(int argc, char** argv)
{
    int mytid, tids[n], me, i, N, rc, parent;
    double mypi, h, sum = 0.0, x;
    me = pvm_joingroup("pi");
    parent = pvm_parent();

    if (me == 0)
    {
        pvm_spawn("pi", (char**)0, 0, "", n-1, tids);
        printf("Enter the number of regions: ");
        scanf("%d", &N);
        pvm_initsend(PvmDataRaw);
        pvm_pkint(&N, 1, 1);
        pvm_mcast(tids, n-1, 5);
    }
    else

Part IV: Parallel Programming



A PVM-C program to compute π : 2

    {
        pvm_recv(parent, 5);
        pvm_upkint(&N, 1, 1);
    }
    pvm_barrier("pi", n); /* optional */
    h = 1.0/(double)N;
    for (i = me + 1; i <= N; i += n)
    {
        x = h*((double)i - 0.5);
        sum += 4.0/(1.0 + x*x);
    }
    mypi = h*sum;
    pvm_reduce(PvmSum, &mypi, 1, PVM_DOUBLE, 6, "pi", 0);
    if (me == 0) printf("pi is approximately %.16f\n", mypi);
    pvm_lvgroup("pi");
    pvm_exit();
}

Part IV: Parallel Programming



Exercises: a HPF code computing π

Reference ONLY

      parameter (N=20000)
      real pi, h
      integer i
      real x(N), y(N)
!HPF$ PROCESSORS Nodes(10)
!HPF$ ALIGN x(i) WITH y(i)
!HPF$ DISTRIBUTE x(CYCLIC) ONTO Nodes
      h = 1.0/N
      FORALL (i = 1:N)
         x(i) = (i - 0.5)*h
         y(i) = 4.0/(1.0 + x(i)*x(i))
      END FORALL
      pi = SUM(y)*h

Part IV: Parallel Programming


Exercises: a F90 code computing π

      INTEGER, PARAMETER :: N = 131072
      INTEGER, PARAMETER :: LONG = SELECTED_REAL_KIND(13,99)
      REAL(KIND=LONG) PI, WIDTH
      INTEGER, DIMENSION(N) :: ID
      REAL(KIND=LONG), DIMENSION(N) :: X, Y
      WIDTH = 1.0_LONG/N
      ID = (/ (I, I=1,N) /)
      X = (ID - 0.5)*WIDTH
      Y = 4.0/(1.0 + X*X)
      PI = SUM(Y)*WIDTH
10    FORMAT('ESTIMATION OF PI WITH ', I6, ' INTERVALS IS ', F14.12)
      PRINT 10, N, PI
      END

Part IV: Parallel Programming