++probleme tot

8/13/2019 ++Probleme Tot


Tutorial 1 / CS-3211 / week 19-24 Jan 2004

1. Suppose a galaxy has $10^{11}$ stars. Estimate the time it would take to perform 100 iterations of the basic N-body algorithm using $O(N^2)$ computations and a computer that is capable of 500 MFlops.

2. Find the diameter of: (a) a torus; (b) a tree network; (c) a d-dimensional mesh.

3. Look at the minimal distance deadlock-free algorithm for hypercube networks described in the textbook, page 15. Apply it for: (a) a five-dimensional hypercube network from node 7 to node 22; (b) repeat for an $8 \times 8$ mesh, using its perfect embedding in a hypercube network.

4. Determine how the largest complete binary tree can be embedded into a hypercube. What is the dilation of the mapping?

5. What is the average distance between two nodes in: (a) a mesh network; (b) a hypercube?



CS-3211: Parallel and Concurrent Programming

Tutorial 3

Week 2-7 Feb 2004

1. Develop an equation for message communication time $t_{comm}$ that incorporates the delay through multiple links as would occur in a static interconnection network. Develop the equation for a mesh, for a tree, and for a hypercube network, assuming that all destinations are randomly chosen.

2. (i) Devise an efficient way that a scatter operation can be done on an n-dimensional hypercube. What is its time complexity?
(ii) Repeat for an $n \times n$ torus.

3. In Linux (tembusu) you may use the gettimeofday() function to record small amounts of time (microseconds) as follows:

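#include <sys/time.h> /* add this: declares gettimeofday() and struct timeval */

...

struct timeval start, stop;

...

gettimeofday(&start, NULL);

... /* code to be timed */

gettimeofday(&stop, NULL);

/* Combine seconds and microseconds: tv_usec alone wraps around every second. */
printf(".. %ld ..(microsec)\n",
       (long)(stop.tv_sec - start.tv_sec) * 1000000L
       + (long)(stop.tv_usec - start.tv_usec));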

Measure the time to send a message in our parallel programming system (tembusu) using various communication routines [individual send-recv, broadcast, scatter, gather]. Repeat with the ping-pong method described in the class. Estimate the startup time $t_{startup}$ and the time to send one data item $t_{data}$.
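The timing pattern of question 3 can be wrapped in a small helper so that every measurement combines the seconds and microseconds fields in the same way. This is only a sketch: the function name `elapsed_usec` is hypothetical and not part of the tutorial or of any library.

```c
#include <assert.h>
#include <sys/time.h>

/* Elapsed time from start to stop in microseconds.
 * Both tv_sec and tv_usec are used, so the result stays correct when the
 * measured interval crosses a one-second boundary. */
long long elapsed_usec(struct timeval start, struct timeval stop)
{
    long long sec  = (long long)stop.tv_sec  - (long long)start.tv_sec;
    long long usec = (long long)stop.tv_usec - (long long)start.tv_usec;
    assert(sec > 0 || (sec == 0 && usec >= 0)); /* stop must not precede start */
    return sec * 1000000LL + usec;
}
```

A typical use is to call gettimeofday(&start, NULL) before the communication routine, gettimeofday(&stop, NULL) after it, and then print elapsed_usec(start, stop).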

4. Certain complex MPI communication functions can be simulated by a series of more basic ones.
(i) Use MPI_Send(..) and MPI_Recv(..) to write a few procedures MyBcast, MyScatter, etc. simulating MPI_Bcast(..), MPI_Scatter(..), MPI_Gather(..), MPI_Reduce(..). (See the MPI manual or Appendix B of the textbook for details on these routines.)
(ii) Use the procedure described in the above question to estimate the time taken by your simulating routines and compare with the time taken by the corresponding MPI routines.

5. Experiment with latency hiding on your system to determine how much computation is possible between sending messages. Investigate using both nonblocking and locally blocking send routines.



Tutorial 4

Week 9-14 Feb 2004

1. Image transformations, square partitions: Write (in pseudocode) a parallel program for the following question and analyze its efficiency.

1. Write a parallel program to perform image transformations (shifting, scaling, rotation) based on static task assignment and square partitions. (For example, for a $50 \times 60$ image one may divide the row number by 2 and the column number by 3 to get 6 square parts, each of size $25 \times 20$.)

2. Analyze your code in terms of communication, computation, overall parallel execution time, speedup, and efficiency.

2. Implementing image transformations: Implement the following image transformations: shifting, scaling, rotation (slide 3.22 or textbook) and run the program on Tembusu cluster.

1. Start using a simple graphical interface: an image is just a, say, $50 \times 60$ matrix filled in with digits; then you may simply display the image by printing on a terminal window.

2. Adapt the above program to handle real images, e.g., in PPM format. (An example of a PPM file, including a few explanations, may be found at the cs3211 course web page - Tutorial table, Misc column.)

3. Mandelbrot, static task assignment: Write an MPI program for Mandelbrot computation using a static task assignment (that is, simply divide the image into fixed areas). Run it on Tembusu cluster.

4. Mandelbrot, dynamic task assignment: Repeat the above question, but using a dynamic task assignment (slides 3.33-34).


5. Monte-Carlo method: Write an MPI program to compute $\pi/4$ using Monte-Carlo methods. (Run it on Tembusu cluster.)

1. Use a sequential random number generator and both methods described in the class (slides 3.34-35): (1) score how many random points within a $2 \times 2$ square lie within a circle of unit radius and (2) compute the corresponding integral $\int_0^1 \sqrt{1 - x^2}\,dx$.

2. Repeat the above question using a parallel random number generator (write your own implementation of such a parallel random number generator using the method described in the class - slides 3.36-37).



Tutorial 5

Week 16-21 Feb 2004

Regular questions

1. Analysis of divide-and-conquer method:

Analyze the divide-and-conquer method of assigning one processor to each node in a tree for adding numbers (see textbook, sec.4.1.2) in terms of communication, computation, overall parallel execution time, speedup, and efficiency.

2. Holes:

Suppose you own a hole punch capable of putting a hole in an arbitrarily thick stack of paper. If you insert the paper into the hole punch and activate it, you will get a piece of paper with one hole in it. If you fold the paper in half before inserting it into the hole punch, you will have a piece of paper with two holes in it. If you can only use the hole punch once, how many times must you fold a piece of paper in order to put n holes in it? Prove that your answer is correct and optimal.

3. Smallest value with an arbitrary number of processes:

Develop a divide-and-conquer algorithm that finds the smallest value in a set of $n$ values in $O(\log n)$ steps using $n/2$ processors. What is the time complexity if there are fewer than $n/2$ processors?

4. Two variants of summation:

Write a parallel program to compute the summation of $n$ integers in each of the following ways and assess their performance. Assume that $n$ is a power of 2.

(a) Partition the $n$ integers into $n/2$ pairs. Use $n/2$ processes to add together each pair of integers, resulting in $n/2$ integers. Repeat the method on the $n/2$ integers to obtain $n/4$ integers and continue until the final result is obtained. (Binary tree algorithm.)


(b) Divide the $n$ integers into $n/\log n$ groups of $\log n$ numbers each. Use $n/\log n$ processes, each adding the numbers in one group sequentially. Then add the $n/\log n$ results using method (a).

5. Integration:

Write a static assignment parallel program to compute $\pi$ using the formula

$$\int_0^1 \sqrt{1 - x^2}\,dx = \frac{\pi}{4}$$

using each of the following ways:

1. rectangular decomposition 1 (slide 4.22)

2. rectangular decomposition 2 (slide 4.23)

3. trapezoidal decomposition (slide 4.24)

Analyze each method in terms of speed and accuracy.

Additional, research-like question

Convex hull problem:

Given a set of n points in a plane, develop an algorithm and a parallel program to find the points that are on the perimeter of the smallest convex region containing all the points. (See textbook, Ex.4-22.)



Tutorial 6

Week 23-28 Feb 2004

1. Analyze insertion sort:

1. Compare the sequential and parallel, pipeline-like versions of insertion sort (textbook 5.3.2, slides 5.24-28) in terms of speedup and time complexity.

2. Modify the method to work for a sequence of $n$ numbers using $p$ processes (for arbitrary $n$, $p$), then repeat the above question for this new version.

2. Pipeline programs for ordinary calculations:

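1. Develop a pipeline method to compute $\sin(\theta)$ using the formula

$$\sin(\theta) = \theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \frac{\theta^7}{7!} + \frac{\theta^9}{9!} - \dots$$

for a series of inputs $\theta_0, \theta_1, \theta_2, \dots$. Repeat for $\cos(\theta)$ and $\tan(\theta)$.

2. Write a parallel program using pipelining (pseudocode!) to compute the polynomial

$$f(x) = a_0 x^0 + a_1 x^1 + \dots + a_{n-1} x^{n-1}$$

where the $a_i$, $x$, and $n$ are inputs.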

3. Radix sort:

Radix sort is similar to the bucket sort described in lecture 4, but specifically uses the bits of the numbers to identify the bucket into which each number is placed. First the most significant bit is used to place the numbers into two buckets, say B0 or B1. Then the next most significant bit is used to place the numbers from B0 into two new buckets, say B00 or B01; similarly with B1. Repeat till the least significant bit is reached.

Reformulate the method to become a pipeline solution, write a program (pseudocode!), and analyze its time complexity.


4. Outer product of two vectors:

The outer product of two vectors $A = (a_0, \dots, a_{n-1})$ and $B = (b_0, \dots, b_{n-1})$ is an $n \times n$ matrix $C$, where

$$C = \begin{pmatrix} a_0 b_0 & \cdots & a_0 b_{n-1} \\ \vdots & \ddots & \vdots \\ a_{n-1} b_0 & \cdots & a_{n-1} b_{n-1} \end{pmatrix}.$$

Develop a pipeline implementation for the outer product of two vectors and analyze it.

5. Pipeline, sieve of Eratosthenes:

Consider the following methods for implementing the sieve of Eratosthenes:

1. By a pipeline approach (textbook 5.3.3; slides 5.29-33)

2. By dividing the range of the numbers into m regions and assigning one region to each process to strike out multiples of prime numbers; use a master process to broadcast each already found prime number to processes.

Write parallel programs (pseudocode!) for each method and estimate their time complexity.


5. Second-largest key:

Given a list of $n$ keys a[0], ..., a[n-1], design a parallel algorithm to find the second-largest key in the list. [Note: Keys do not necessarily have distinct values.]

Additional, programming question

Game of Life:

Write an MPI parallel program to simulate the Game of Life as described in textbook, 6.3.3 and experiment with different initial populations. Try to implement both strip and square partitions and compare their performance on Tembusu cluster.



Tutorial 8

Week 8-13 Mar 2004

1. Implementation of load-balancing line structure technique:

Implement the load-balancing line structure technique (textbook, 7.2.3; Slides 7.17-19) and use it in one of your parallel programs.

2. Parallel Moore, centralized work pool approach:

Implement parallel Moore's single-source shortest path algorithm using the centralized work pool approach (textbook, p217; Slide 7.39). (Hint: Use a vertexQueue to store the tasks and a requestQueue to store the still-unsolved task requests. Then, the master process will exit the main loop when vertexQueue is empty and requestQueue is full.)

3. Parallel Moore, load-balancing line structure:

Implement Moore's algorithm using the load-balancing line structure technique.

4. Parallel Moore, decentralized work pool approach:

The decentralized work pool approach described in textbook, Section 7.4 for searching a graph is inefficient in that processes are only active after their vertex is placed on the queue. Develop a more efficient work pool approach that keeps processes more active.

5. Parallel Dijkstra vs. parallel Moore:

Write (pseudocode) a load-balancing parallel version of Dijkstra's algorithm for searching a graph. Compare its performance with the performance of a corresponding load-balancing parallel version of Moore's algorithm.



Tutorial 9

Week 15-20 Mar 2004

1: Analyze the code (using Bernstein's conditions)

forall (i = 2; i < 6; i++){

x = i - 2*i + i*i;

a[i] = a[x];

}

and determine whether any instances of the body can be executed simultaneously.

2: For the following code

int a[100], b[100];

forall (i = 2; i < 6; i++){

forall (j = 1; j < 8; j++){

a[i] = b[2*j] + a[i+j];

b[j] = b[i+2*j];

}

}

find the instances of the body that can be executed simultaneously and provide a schedule that minimizes parallel execution time.

3: List all possible outputs when the following code is executed

j = 0 ;

k = 0 ;

forall (i = 1; i



assuming that each assignment statement is atomic.

4: The following C-like parallel program is supposed to transpose a matrix:

forall (i = 0; i < n; i++)
    forall (j = 0; j < n; j++)
        a[i][j] = a[j][i];

Explain why the code will not work and correct it.

5: Determine and explain how the following code for a barrier works (based upon the two-phase barrier given in textbook Section 6.1.3)

void barrier()

{

    lock(arrival);
    count++;
    if (count < n) unlock(arrival); else unlock(departure);

    lock(departure);
    count--;
    if (count > 0) unlock(departure); else unlock(arrival);

    return;

}

Why is it necessary to use two lock variables, arrival and departure?



Tutorial 10

Week 22-27 Mar 2004

Regular questions

1: Modify the rank sort code given in Sec.9.1.3

for (i = 0; i < n; i++) { /* for each number */

x = 0 ;

for (j = 0; j < n; j++) /* count number of nos less than it */

if (a[i] > a[j]) x++;

b[x] = a[i]; /* copy number into correct place */

}

to cope with duplicates in the sequence of numbers (i.e., for it to sort in nondecreasing order).

2: The following is an attempt to code the odd-even transposition sort of Sec.9.2.2 as an SPMD program:

Process P_i

evenprocess = (i % 2 == 0);

evenphase = 1;

for (step = 0; step < n; step++, evenphase = !evenphase){

if ((evenphase && evenprocess) || (!evenphase) && !(evenprocess)){

send(&a, P_{i+1});

recv(&x, P_{i+1});

if (x < a) a = x; /* keep smaller number */

} else {

send(&a, P_{i-1});

recv(&x, P_{i-1});

if (x > a) a = x; /* keep larger number */

}

}

Determine whether the code is correct and, if not, correct it.

3: Implement (in pseudo-code) shear-sort (Sec.9.2.3). Explain why $\log n + 1$ phases are to be used.


4: Draw the exchange of numbers for the Quick-sort on a Hypercube (Sec.9.2.6) using the algorithm based on Gray code ordering (Fig.9.21). Illustrate the procedure on a particular set of numbers.

5: Draw the compare-and-exchange circuit configurations for the odd-even merge-sort algorithm described in Sec.9.2.7 to sort 16 numbers. Sort a sequence of numbers by hand using the odd-even merge-sort algorithm.

More questions - [no automatic allocation; send request email to kzhu]

6: Repeat the above problem 5 for bitonic merge-sort (Sec.9.2.8).

7: Analyze the systolic array for matrix multiplication as described in Sec.10.2.4, deriving equations for the computation and for the communication.

8: Develop a parallel program for convolution: Given $x_1, \dots, x_{N+n-1}$ and $w_1, \dots, w_n$, compute $y_i = \sum_{j=1}^{n} x_{i-j+n} w_j$, for $i = 1, \dots, N$. (See the textbook for a more detailed explanation.)

9: Develop a linear pipeline solution of the Gauss-Seidel method described in Sec.10.5.1 and write a pseudo-code parallel program to implement it.

10: Derive the system efficiency when implementing Gaussian elimination with the strip partition and the cyclic partition, as described in sec.10.3.2.


CS-3211; Tutorial 1

1. Suppose a galaxy has $10^{11}$ stars. Estimate the time it would take to perform 100 iterations of the basic N-body algorithm using $O(N^2)$ computations and a computer that is capable of 500 MFlops.

Solution: Each iteration takes $10^{11} \times 10^{11} = 10^{22}$ steps, so 100 iterations take $10^{24}$ steps. The computer handles $500 \times 10^6 = 5 \times 10^8$ floating-point operations per second. Hence the computation takes $10^{24}/(5 \times 10^8) = 2 \times 10^{15}$ seconds, which is about 63.4 million years.

Notice: As was pointed out at one tutorial, this is correct provided we suppose that each step takes 1 Flop. Otherwise the time is even larger.
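The arithmetic in this estimate can be checked with a few lines of C. The sketch below assumes, as the solution does, one Flop per pairwise step; the function names are illustrative only and a year is taken as 365 days.

```c
#include <assert.h>

/* Estimated run time in seconds for `iterations` steps of the basic
 * O(N^2) N-body algorithm, assuming one Flop per pairwise step. */
double nbody_seconds(double n_bodies, double iterations, double flops_per_sec)
{
    assert(flops_per_sec > 0.0);
    double steps = n_bodies * n_bodies * iterations;
    return steps / flops_per_sec;
}

/* Convert seconds to years of 365 days each. */
double seconds_to_years(double seconds)
{
    return seconds / (365.0 * 24.0 * 3600.0);
}
```

Calling nbody_seconds(1e11, 100.0, 5e8) gives about $2 \times 10^{15}$ seconds, and seconds_to_years of that value is roughly 63.4 million.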

2. Find the diameter of: (a) a torus; (b) a tree network; (c) a k-dimensional mesh.

Solution: (a) For an $m \times n$ torus ($m$ lines and $n$ columns), this is

$$d = \lfloor n/2 \rfloor + \lfloor m/2 \rfloor$$

The reason is that we may go in both directions on a line (respectively, column), so the shortest distance between two nodes in the same line is at most $\lfloor n/2 \rfloor$. Similarly for the columns. For example, in a $7 \times 10$ torus two points which realize this diameter are (1,1) and (4,6).

(b) In a (complete, balanced, binary) tree network, the longest (minimal) path is, for instance, between the left-most and the right-most leaves. If the tree has $k$ levels, this is $2(k - 1)$.



We have to express this in terms of the number of the network's nodes. If the tree has $k$ levels, then the number of vertices is $1 + 2 + 2^2 + \dots + 2^{k-1} = 2^k - 1$. If there are $n$ nodes in the tree, this gives $n = 2^k - 1$, hence $k = \log_2(n + 1)$. To conclude, the diameter is

$$d = 2(\log_2(n + 1) - 1)$$

Notice: If the branching degree $r$ is not 2, but still constant, a similar result is obtained, but the logarithm is in base $r$.

If the tree is not balanced or the branching degree may be different for different nodes, then the analysis is more complicated and less precise results are obtained.

(c) We suppose that the mesh is hypercubic, that is, it has the same length in all directions. In a k-dimensional mesh with $n$ nodes, each side then has $\sqrt[k]{n}$ nodes, and the greatest (minimal) distance is between two opposite corners, such as $(0, 0, \dots, 0)$ and $(\sqrt[k]{n} - 1, \dots, \sqrt[k]{n} - 1)$. A path between them has to traverse all $k$ directions, covering a length of $\sqrt[k]{n} - 1$ along each direction. Hence the result is

$$d = k(\sqrt[k]{n} - 1)$$
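The three diameter formulas of this solution are easy to evaluate for concrete sizes. The following sketch uses hypothetical function names and takes the tree size as a number of levels and the mesh size as a side length, which avoids computing logarithms and roots.

```c
#include <assert.h>

/* Diameter of an m x n torus: wrap-around links halve each dimension. */
int torus_diameter(int m, int n)
{
    assert(m > 0 && n > 0);
    return m / 2 + n / 2;   /* integer division gives the floor */
}

/* Diameter of a complete binary tree with `levels` levels
 * (that is, 2^levels - 1 nodes). */
int tree_diameter(int levels)
{
    assert(levels > 0);
    return 2 * (levels - 1);
}

/* Diameter of a k-dimensional mesh with `side` nodes along each
 * dimension (side is the k-th root of the number of nodes). */
int mesh_diameter(int k, int side)
{
    assert(k > 0 && side > 0);
    return k * (side - 1);
}
```

For the $7 \times 10$ torus of the example, torus_diameter(7, 10) returns 8, which matches the distance between (1,1) and (4,6).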

3. Look at the minimal distance deadlock-free algorithm for hypercube networks described in the textbook, page 15. Apply it for: (a) a five-dimensional hypercube network from node 7 to node 22; (b) repeat for an $8 \times 8$ mesh, using its perfect embedding in a hypercube network.

Solution: (a) The binary representation of 7 is 00111 and of 22 is 10110. The algorithm requires: (i) computing the bitwise exclusive or of the two addresses, which is 10001, and (ii) traversing the hypercube along the directions having 1 in the result, in our case directions 1 and 5 (numbered left to right). The resulting route of length 2 is: 7 = 00111 → 23 = 10111 → 22 = 10110.

(b) In the mesh, the routing algorithm is to go, say, first horizontally and then vertically from one node to the other. If the mesh is embedded in a hypercube, then this routing is different from the hypercube routing (generally it is longer), as the mesh has forgotten many of the hypercube links.
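The route in part (a) can be reproduced mechanically. The sketch below is one possible rendering of the XOR-based routing, correcting differing address bits from the most significant bit down as in the worked example; the function name and the `path` output array are assumptions made for illustration.

```c
#include <assert.h>

/* Writes the nodes visited when routing from src to dst in a
 * dims-dimensional hypercube into path[] (which must hold dims + 1
 * entries) and returns the number of hops. */
int hypercube_route(unsigned src, unsigned dst, int dims, unsigned path[])
{
    assert(dims > 0 && dims < 32);
    unsigned diff = src ^ dst;      /* bits in which the addresses differ */
    unsigned node = src;
    int hops = 0;
    path[0] = node;
    for (int bit = dims - 1; bit >= 0; bit--) {
        if (diff & (1u << bit)) {
            node ^= 1u << bit;      /* traverse the link in this dimension */
            path[++hops] = node;
        }
    }
    return hops;
}
```

With src = 7, dst = 22 and dims = 5 the function returns 2 hops and fills path with 7, 23, 22.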

4. Determine how the largest complete binary tree can be embedded into a hypercube. What is the dilation of the mapping?

Solution: We may recursively define a perfect embedding as follows. If we know how to embed a $k$-level tree in an $r$-dimensional hypercube, then we take the $(r+2)$-dimensional hypercube and map: the root of the tree to $(0, 0, \dots, 0)$, the left subtree into the $r$-dimensional subcube $(1, 0, *, \dots, *)$ and the right subtree into the $r$-dimensional subcube $(0, 1, *, \dots, *)$. This is a perfect embedding (each connection in the tree network is realized by one connection in the hypercube), but only a very small number of nodes of the hypercube are used.

If we relax the condition to have a perfect embedding, it is sometimes possible to get an irregular embedding that uses fewer nodes of the hypercube. E.g., a 3-level tree with the nodes represented by ( ), (0), (1), (00), (01), (10), (11) may be embedded in a 3-dimensional cube by the mapping:

( ) → (0, 0, 0), (0) → (1, 0, 0), (1) → (0, 0, 1),
(00) → (1, 1, 0), (01) → (1, 0, 1),
(10) → (0, 1, 0), (11) → (1, 1, 1),

having dilation 2.
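The dilation of the 3-level example can be verified by measuring, for every tree edge, the hypercube distance between the images of its endpoints. In the sketch below the tree is stored in heap order, ( ), (0), (1), (00), (01), (10), (11), and each image is the cube coordinate read as a 3-bit number; both conventions and the function names are assumptions for illustration.

```c
#include <assert.h>

/* Number of 1 bits in x, i.e. the hypercube distance encoded by an XOR. */
static int count_ones(unsigned x)
{
    int c = 0;
    while (x) { c += (int)(x & 1u); x >>= 1; }
    return c;
}

/* Dilation of an embedding of a complete binary tree stored in heap
 * order (children of node v are 2v+1 and 2v+2): the longest hypercube
 * distance between the images of a parent and its child. */
int tree_dilation(const unsigned image[], int n_nodes)
{
    assert(n_nodes > 0);
    int worst = 0;
    for (int child = 1; child < n_nodes; child++) {
        int parent = (child - 1) / 2;
        int d = count_ones(image[parent] ^ image[child]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

For the mapping above the image array is {0, 4, 1, 6, 5, 2, 7}. The two edges leaving node (1) are stretched to length 2, so the function returns 2.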



5. What is the average distance between two nodes in: (a) a mesh network; (b) a hypercube?

Solution: 5(a) Take an arbitrary position $(i, j)$ of an $m \times n$ mesh. The sum of the distances to the cells of the $i$-th line is

$$S = (j-1) + (j-2) + \dots + 2 + 1 + 0 + 1 + 2 + \dots + (n-j-1) + (n-j) = \frac{(j-1)j}{2} + \frac{(n-j)(n-j+1)}{2}.$$

For a line which is $k$ lines away from the $i$-th line we have to add $kn$ (an extra $k$ appears for each cell), hence the total sum of the distances from cell $(i, j)$ to the other ones is

$$S_{i,j} = [S + (i-1)n] + [S + (i-2)n] + \dots + [S + 2n] + [S + n] + [S] + [S + n] + [S + 2n] + \dots + [S + (m-i-1)n] + [S + (m-i)n]$$

$$= mS + n\frac{(i-1)i}{2} + n\frac{(m-i)(m-i+1)}{2}$$

$$= m\left(\frac{(j-1)j}{2} + \frac{(n-j)(n-j+1)}{2}\right) + n\left(\frac{(i-1)i}{2} + \frac{(m-i)(m-i+1)}{2}\right)$$

Here we may apply two different, but equivalent, methods:

Method 1: Find the total length of all paths and divide by the number of paths. (This is a simple general method.)

Method 2: Find the average of the lengths from one cell to the other ones, then take the average of these results over the cells. (The number of paths to be counted for each vertex is the same, so a simple, non-weighted average is enough.)

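By the first method, we get the total sum of the lengths to be

$$S_t = \sum_{i,j} S_{i,j}$$

$$= m\left[m \sum_j \left(\frac{(j-1)j}{2} + \frac{(n-j)(n-j+1)}{2}\right)\right] + n\left[n \sum_i \left(\frac{(i-1)i}{2} + \frac{(m-i)(m-i+1)}{2}\right)\right]$$

$$= m^2 \cdot \frac{1}{2}\left[2 \sum_j (j-1)j\right] + n^2 \cdot \frac{1}{2}\left[2 \sum_i (i-1)i\right]$$

$$= m^2\left[\sum_j j^2 - \sum_j j\right] + n^2\left[\sum_i i^2 - \sum_i i\right]$$

$$= m^2\left[\frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)}{2}\right] + n^2\left[\frac{m(m+1)(2m+1)}{6} - \frac{m(m+1)}{2}\right]$$

$$= m^2 \frac{(n-1)n(n+1)}{3} + n^2 \frac{(m-1)m(m+1)}{3}$$

$$= \frac{mn}{3}(mn - 1)(m + n)$$

The number of paths is $N = C^2_{mn} = \frac{(mn)(mn-1)}{2}$, but each path is counted two times in the above sum (once for each endpoint), hence the average is

$$A = \frac{S_t}{2N} = \frac{m+n}{3}$$

Nice formula... Maybe there is a different, simpler proof...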
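As a sanity check on the closed form $A = (m+n)/3$, the average can also be computed by brute force over all ordered pairs of distinct cells. The function name below is hypothetical, and the quadruple loop is only meant for small meshes.

```c
#include <assert.h>
#include <stdlib.h>

/* Average Manhattan distance over all pairs of distinct nodes of an
 * m x n mesh, computed by enumerating every ordered pair. */
double mesh_average_distance(int m, int n)
{
    assert(m > 0 && n > 0 && m * n > 1);
    long long total = 0;
    for (int i1 = 0; i1 < m; i1++)
        for (int j1 = 0; j1 < n; j1++)
            for (int i2 = 0; i2 < m; i2++)
                for (int j2 = 0; j2 < n; j2++)
                    total += abs(i1 - i2) + abs(j1 - j2);
    /* Pairs of a cell with itself contribute 0 to total and are
     * excluded from the count. */
    long long ordered_pairs = (long long)m * n * ((long long)m * n - 1);
    return (double)total / (double)ordered_pairs;
}
```

For a $4 \times 5$ mesh the total over ordered pairs is 1140 and there are 380 ordered pairs, so the function returns 3, in agreement with $(4+5)/3$.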

5(b): In a hypercube all vertices are equivalent, so it will be enough to compute the average path length for one vertex only.

If we start with vertex $00 \dots 0$ ($k$ times, for a $k$-dimensional hypercube), then the distance to an arbitrary vertex is given by the number of 1s in its representation. The total sum of the lengths to the other vertices is then

$$S = 1 \cdot C^1_k + 2 \cdot C^2_k + \dots + k \cdot C^k_k$$

This sum may be computed by taking the derivative of the well-known identity

$$(1 + x)^k = 1 + C^1_k x^1 + C^2_k x^2 + \dots + C^k_k x^k$$

The derivative is

$$k(1 + x)^{k-1} = 0 + 1 \cdot C^1_k x^0 + 2 \cdot C^2_k x^1 + \dots + k \cdot C^k_k x^{k-1}$$

Our sum is the right-hand side of the above identity when $x = 1$, hence

$$S = k 2^{k-1}$$

The number of vertices (different from $00 \dots 0$) is $2^k - 1$, hence the average path length (for $00 \dots 0$ and also for the whole hypercube) is

$$A = \frac{S}{2^k - 1} = \frac{k 2^{k-1}}{2^k - 1} = \frac{k}{2} \cdot \frac{2^k}{2^k - 1}$$

For large $k$ a good approximation of this is $k/2$.
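The identity $S = k 2^{k-1}$ can be checked numerically by counting the 1 bits of every vertex label. Both function names in this sketch are illustrative, and the enumeration is only practical for small $k$.

```c
#include <assert.h>

/* Sum of the distances from vertex 0 to every other vertex of a
 * k-dimensional hypercube, found by counting the 1 bits of each label. */
long long hypercube_distance_sum(int k)
{
    assert(k > 0 && k < 31);
    long long sum = 0;
    for (unsigned v = 1; v < (1u << k); v++)
        for (unsigned bits = v; bits != 0; bits >>= 1)
            sum += bits & 1u;
    return sum;
}

/* Closed form S = k * 2^(k-1) from the derivation. */
long long hypercube_distance_sum_closed(int k)
{
    assert(k > 0 && k < 31);
    return (long long)k << (k - 1);
}
```

For $k = 3$ both functions return 12, so the average distance is $12/7$, already close to $k/2 = 1.5$ plus a small correction.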

5(c): Try to find the average distance between two vertices of a tree network.
