Parallel and Distributed Computation (slides by L. Pagli)

1Cuba

PARALLEL AND DISTRIBUTED COMPUTATION

(Slides by L. Pagli)

• MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY

[Figure: processors P1, P2, P3, P4, P5, ..., Pn attached to an INTERCONNECTION NETWORK]

• CONNECTION MACHINE (THINKING COMP. & C.): 64,000 processors

• INTERNET: connects all the computers of the world


THREE TYPES OF MULTIPROCESSING FRAMEWORKS, CLOSELY RELATED

• CONCURRENT

• PARALLEL
   • PRAM
   • Bounded-degree network and VLSI

• DISTRIBUTED

MULTIPROCESSING ACTIVITIES TAKE PLACE IN A SINGLE MACHINE (POSSIBLY USING SEVERAL PROCESSORS), SHARING MEMORY AND TASKS.

TECHNICAL ASPECTS

• PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCHRONY, SHARE MEMORY TO A LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM BETWEEN THEM.

• DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS FREQUENT AND LESS SYNCHRONOUS, AND THE COOPERATION IS LIMITED.

PURPOSES

• PARALLEL COMPUTERS COOPERATE TO SOLVE (POSSIBLY DIFFICULT) PROBLEMS MORE EFFICIENTLY

• DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES. SOMETIMES COMMUNICATION WITH OTHER COMPUTERS IS NEEDED (E.G. DISTRIBUTED DATABASE OPERATIONS).

PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE

DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE, ONLY WHEN IT IS NECESSARY


FOR PARALLEL SYSTEMS

WE ARE INTERESTED IN SOLVING ANY PROBLEM IN PARALLEL

FOR DISTRIBUTED SYSTEMS

WE ARE INTERESTED IN SOLVING IN PARALLEL PARTICULAR PROBLEMS ONLY. TYPICAL EXAMPLES ARE:

• COMMUNICATION SERVICES
   ROUTING
   BROADCASTING

• MAINTENANCE OF CONTROL STRUCTURE
   * SPANNING TREE CONSTRUCTION
   TOPOLOGY UPDATE
   * LEADER ELECTION

• RESOURCE CONTROL ACTIVITIES
   LOAD BALANCING
   MANAGING GLOBAL DIRECTORIES
   * MUTUAL EXCLUSION


PARALLEL ALGORITHMS

• WHICH MODEL OF COMPUTATION IS BEST TO USE?

• HOW MUCH TIME DO WE EXPECT TO SAVE USING A PARALLEL ALGORITHM?

• HOW TO CONSTRUCT EFFICIENT ALGORITHMS?

MANY CONCEPTS OF COMPLEXITY THEORY MUST BE REVISITED

• IS PARALLELISM A SOLUTION FOR HARD PROBLEMS?

• ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION, THAT IS, INHERENTLY SEQUENTIAL PROBLEMS?

UNDECIDABLE PROBLEMS REMAIN UNDECIDABLE!


PRAM MODEL

• Joseph JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Pub. Comp., 1992

• Karp R.M., Ramachandran V., A survey of parallel algorithms for shared-memory machines, in J. van Leeuwen (Ed.), Handbook of Theoretical Comp. Science

• Ian Parberry, Parallel Complexity Theory, Research Notes in Theoretical Computer Science, John Wiley & Sons, 1987

TO FOCUS ON ALGORITHMIC ISSUES INDEPENDENTLY OF PHYSICAL LOCATIONS

[Figure: processors P1, P2, ..., Pi, ..., Pn, each connected to a Common Memory of cells 1, 2, 3, ..., m]

PRAM: n RAM processors, numbered from 1 to n and connected to a common memory of m cells

ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal computation and write another memory cell.

CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time!

Pi writes the message in cell x at time t
Pj reads the message in cell x at time t+1


ASSUMPTIONS

• Shared-memory: The array A is stored in the global memory and can be accessed by any processor.

• Synchronous mode of operation: In each unit of time, each processor is allowed to execute an instruction or to stay idle.

There are several variations regarding the handling of simultaneous access to the same memory location.

• EREW-PRAM (exclusive read, exclusive write)
• CREW-PRAM (concurrent read, exclusive write)
• CRCW-PRAM (concurrent read, CONCURRENT write) and a policy to resolve concurrent writes: Common, Priority, Arbitrary

The three models do not differ substantially in their computational power!
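The three concurrent-write policies can be illustrated with a minimal Python sketch. The function name `crcw_write` and the rule that the lowest-numbered processor wins under Priority are assumptions made for illustration; the slides only name the policies.

```python
import random


def crcw_write(requests, policy):
    """Resolve simultaneous writes to ONE memory cell.

    requests: non-empty list of (processor_id, value) pairs.
    policy:   "common", "priority" or "arbitrary".
    Returns the value that ends up stored in the cell.
    """
    if not requests:
        raise ValueError("at least one write request is required")
    if policy == "common":
        # All processors must agree on the value being written.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common policy: processors wrote different values")
        return values.pop()
    if policy == "priority":
        # Assumption: the processor with the smallest id has priority.
        return min(requests, key=lambda r: r[0])[1]
    if policy == "arbitrary":
        # Any one of the competing writes may succeed.
        return random.choice(requests)[1]
    raise ValueError(f"unknown policy: {policy}")
```

For example, `crcw_write([(3, "x"), (1, "y")], "priority")` stores the value written by processor 1.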

If each processor can execute its own local program we have a MIMD (multiple instruction, multiple data) model; otherwise a SIMD (single instruction, multiple data) model.


From Bertossi, Ch. 27

• Summation n log n

• Summation n

[R. Grossi]


Important parameters of the efficiency of a parallel algorithm

Tp(n) (or Tp) parallel time

p(n) (or p) number of processors

LOWER BOUND of the parallel computation

Let A be a problem and let Ts be the complexity of the optimal (or the best known) sequential algorithm for A. We have:

Tp >= Ts / p

Cn = Tp p: cost of the parallel algorithm

The parallel algorithm can be converted into a sequential algorithm that runs in O(Cn) time: a single processor simulates the p processors in p steps for each of the Tp parallel steps.

If the parallel time were less than Ts / p, we could derive a sequential algorithm better than the optimal one!


Parallel algorithm

time            1     2     3
processor P1    op1   op2   op3
processor P2    op4   op5

can be simulated by a single processor in a number of steps (time) <= 6

Tp = 3   p = 2   C = 6

Sequential algorithm

time    1     2     3     4     5
        op1   op4   op2   op5   op3

Ts = 5

Tp >= Ts/p

Ts/Tp: speed-up of the parallel algorithm
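The numbers of this example can be checked with a short Python sketch. The helper names `cost` and `speed_up` are hypothetical and only restate the definitions C = Tp p and Ts/Tp.

```python
from fractions import Fraction


def cost(tp, p):
    """Cost of a parallel algorithm: parallel time times number of processors."""
    return tp * p


def speed_up(ts, tp):
    """Speed-up of the parallel algorithm, kept as an exact fraction."""
    return Fraction(ts, tp)


# Values taken from the example: Tp = 3, p = 2, Ts = 5.
tp, p, ts = 3, 2, 5
example_cost = cost(tp, p)                 # 6 single-processor steps suffice
lower_bound_holds = tp >= Fraction(ts, p)  # 3 >= 5/2
example_speed_up = speed_up(ts, tp)        # 5/3
```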


MAXIMUM on the PRAM

Input: an array A of n = 2^k elements in the shared memory of a PRAM with n/2 processors

Output: the maximum element, stored in location S.

Algorithm MAX
begin
   for k := 1 to log n do
      for all i where 1 <= i <= n/2^k do in parallel
         A[i] := max {A[2i-1], A[2i]}
   S := A[1]
end

[Figure: binary tree of comparisons over A(1), ..., A(8): processors P1, P2, P3, P4 work at the first level, P1 and P2 at the second, P1 at the third; P1 then stores the result in S]
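Algorithm MAX can be simulated sequentially with a minimal Python sketch. The function name `pram_max` is hypothetical; each round first reads all operands and only then writes, which is an assumption that mimics the synchronous read-compute-write step of the PRAM.

```python
def pram_max(values):
    """Simulate algorithm MAX on n = 2^k values.

    Returns (maximum, number_of_parallel_rounds).
    """
    n = len(values)
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    a = [None] + list(values)          # 1-based indexing, as in the slides
    rounds = n.bit_length() - 1        # log2(n) for a power of two
    for k in range(1, rounds + 1):
        active = n // 2 ** k           # processors P1..P_active work in round k
        # Read phase: every active processor i reads A[2i-1] and A[2i].
        results = [max(a[2 * i - 1], a[2 * i]) for i in range(1, active + 1)]
        # Write phase: processor i writes its result into A[i].
        for i, value in enumerate(results, start=1):
            a[i] = value
    return a[1], rounds
```

On eight elements the simulation uses three rounds, matching Tp = O(log n).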


From the previous lower bound and sequential computation

C = Tp p >= Ts

From algorithm MAX

C = Tp p = O(n log n)   not optimal

Better algorithm:

• divide the n elements into k = n/log n subsets of log n elements each, one subset for each processor P1, P2, P3, ..., Pk

• each processor Pi computes the maximum mi of its subset with the sequential algorithm in time O(log n)

• algorithm MAX is executed among the local maxima m1, m2, m3, ..., mk in time O(log (n/log n)) = O(log n - log log n) = O(log n)

Overall time: Tp = O(log n) and p = n/log n

C = Tp p = O(n)   optimal
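The two phases of the better algorithm can be sketched in Python as follows. The function name `blocked_max` is hypothetical, the blocks are processed one after another here (on a PRAM each block would belong to a different processor), and the block size is taken as floor(log2 n), which is an assumption for inputs whose size is not of the exact form used in the slides.

```python
def blocked_max(values):
    """Two-phase maximum with about n / log n simulated processors.

    Returns (maximum, processors_used, tree_rounds). Requires n >= 1.
    """
    n = len(values)
    if n < 1:
        raise ValueError("at least one element is required")
    block = max(1, n.bit_length() - 1)          # about log2(n) elements each
    blocks = [values[i:i + block] for i in range(0, n, block)]
    # Phase 1: each processor scans its own block sequentially, O(log n) time.
    local = [max(b) for b in blocks]
    # Phase 2: algorithm MAX (pairwise tree reduction) on the local maxima.
    rounds = 0
    while len(local) > 1:
        local = [max(local[i:i + 2]) for i in range(0, len(local), 2)]
        rounds += 1
    return local[0], len(blocks), rounds
```

With n = 16 the sketch uses 4 blocks of 4 elements and 2 reduction rounds.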


PERFORMANCE OF A PARALLEL ALGORITHM

Four ways of measuring the performance of a parallel algorithm:

1. P(n) processors and Tp(n) time.

2. C(n) = P(n)Tp(n) cost and Tp(n) time.

The number of processors depends on the size n of the problem.

The second relation can be generalized to any number p < P(n) of processors: each of the Tp(n) parallel steps can be simulated by the p processors in O(P(n)/p) substeps; this simulation takes a total of O(Tp(n)P(n)/p) time.

3. O(Tp(n)P(n)/p) time for any number p < P(n) of processors.

If the number of processors p is larger than P(n), we can clearly achieve the running time Tp(n) by using P(n) processors only. Relation 3 can be further generalized.

4. O(C(n)/p + Tp(n)) time for any number p of processors.

In conclusion, in the design of a PRAM algorithm we can assume as many processors as we need and use the proper relation to analyze it.


PERFORMANCE OF ALGORITHM MAX

1. P(n) = n/2 processors and Tp(n) = O(log n) time.

2. C(n) = P(n)Tp(n) = O(n log n) cost and Tp(n) = O(log n) time.

Assume p = log n processors:

3. O(Tp(n)P(n)/p) = O(log n · n / log n) = O(n) time.

Therefore

4. O(n log n / p + log n) time. If p >= n, O(log n) time; otherwise O(n log n / p) time.

Work W(n) of a parallel algorithm: total number of operations used.

Work of algorithm MAX: W(n) = SUM_{j=1..log n} (n/2^j) + 1 = O(n)

W(n) <= C(n)

W(n) measures the total number of operations and has nothing to do with the number of processors available, while C(n) measures the cost of the algorithm relative to the number p of processors available.
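The gap between work and cost for algorithm MAX can be made concrete with a small Python sketch. The helper names `max_work`, `max_cost` and `max_time_bound` are hypothetical, n is assumed to be a power of two, and constant factors hidden by the O-notation are taken to be 1.

```python
def _log2(n):
    """Exact log2 for a power of two."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two, n >= 2")
    return n.bit_length() - 1


def max_work(n):
    """W(n) = sum over j = 1..log n of n / 2^j, plus 1 for the final copy."""
    return sum(n // 2 ** j for j in range(1, _log2(n) + 1)) + 1


def max_cost(n):
    """C(n) = P(n) * Tp(n) with P(n) = n/2 and Tp(n) = log n."""
    return (n // 2) * _log2(n)


def max_time_bound(n, p):
    """Relation 4: C(n)/p + Tp(n), rounded up, for any p >= 1."""
    return -(-max_cost(n) // p) + _log2(n)
```

For n = 1024 the work is 1024 operations while the cost is 5120, which is the O(n) versus O(n log n) difference stated above.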


Work-time presentation of a parallel algorithm

any number of parallel operations at each time unit is allowed

BRENT'S PRINCIPLE: given a parallel algorithm that runs in time T(n) and requires W(n) work, we can adapt this algorithm to run on a p-processor PRAM in time

Tp(n) <= ⌊W(n)/p⌋ + T(n)

Let Wi(n) be the number of operations of time unit i, 1 <= i <= T(n). Simulate each set of Wi(n) operations in ⌈Wi(n)/p⌉ parallel steps of the p processors, for each 1 <= i <= T(n). The p-processor PRAM algorithm takes

SUMi ⌈Wi(n)/p⌉ <= SUMi (⌊Wi(n)/p⌋ + 1) <= ⌊W(n)/p⌋ + T(n)

steps.

Brent's principle assumes that the scheduling of the operations to the processors is always a trivial task. This is not always true. It is easy if we use C(n) in place of W(n).

[Figure: algorithm A1, with its operations numbered 1 to 36 arranged over the time units t1, ..., t6]

algorithm A1: W(n) = 36, T1 = 6

Wi = 6, 7, 3, 8, 5, 7

A1 can be simulated by A2 with 3 processors in time T2(n) <= 36/3 + 6 = 18

[Figure: schedule of the 36 operations of A1 on the 3 processors of A2]

T2 = 14
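The simulation behind Brent's principle can be reproduced on this example with a minimal Python sketch. The function name `brent_time` is hypothetical; the per-step work values 6, 7, 3, 8, 5, 7 are the Wi of algorithm A1.

```python
def brent_time(work_per_step, p):
    """Parallel time on p processors when step i with Wi operations
    is executed in ceil(Wi / p) substeps."""
    return sum(-(-w // p) for w in work_per_step)   # -(-w // p) is ceil(w / p)


work_a1 = [6, 7, 3, 8, 5, 7]                # Wi for algorithm A1
total_work = sum(work_a1)                   # W(n) = 36
t2 = brent_time(work_a1, 3)                 # 2 + 3 + 1 + 3 + 2 + 3
bound = total_work // 3 + len(work_a1)      # floor(W/p) + T = 18
```

The simulated time stays below the bound because only some steps leave processors idle in their last substep.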


From Bertossi, Ch. 27

Basic techniques:

• Prefix sums

• Non-optimal sorting

• List ranking with pointer jumping

• Euler tour

[R. Grossi]


PARALLEL DIVIDE AND CONQUER

• Partition the input into several subsets of almost equal size

• Solve recursively the subproblem defined by each subset

• Combine the solutions of the subproblems into a solution to the overall problem

CONVEX HULL

[Figure: convex hull of the points v1, ..., v7; the points p and q separate the UPPER HULL from the LOWER HULL]

• Sort the x-coordinates in increasing order

• Compute the UPPER HULL

sequential algorithm: O(n log n)

Tp = O(log n)


w.l.o.g., let x(v1) < x(v2) < ... < x(vn), n = 2^k

• Divide the points into two subsets S1 = (v1, v2, ..., vn/2) and S2 = (vn/2+1, ..., vn)

Suppose the UPPER HULLs of S1 and S2 are already computed

• Compute ab: the upper common tangent

[Figure: upper hull q1, ..., q5 of S1 and upper hull q'1, ..., q'4 of S2, joined by the upper common tangent ab]

• The UPPER HULL of S is formed by q1, ..., qi = a, q'j = b, ..., q's

Algorithm UPPER HULL (Sketch)

1. If n <= 4, use a brute force method to determine UH(S)

2. Let S1 = (v1, v2, ..., vn/2), S2 = (vn/2+1, ..., vn); recursively compute UH(S1) and UH(S2) in parallel (Tp(n/2) time and 2W(n/2) operations)

3. Find the upper common tangent between UH(S1) and UH(S2) and deduce UH(S) (O(log n) sequential time, O(n) operations)

Tp(n) = Tp(n/2) + O(log n) = O(log^2 n)
W(n) = 2W(n/2) + O(n) = O(n log n)
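The structure of algorithm UPPER HULL can be sketched in Python under several stated assumptions: points are (x, y) tuples already sorted by distinct x-coordinates, the two recursive calls run one after the other instead of in parallel, the base case uses a simple left-to-right scan, and the tangent is found by brute force over all vertex pairs rather than by the O(log n) search mentioned in step 3. All function names are hypothetical.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive = left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def scan_upper_hull(points):
    """Base case: upper hull of a few x-sorted points by a left-to-right scan."""
    hull = []
    for p in points:
        # Keep only right turns; drop the middle point otherwise.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def upper_common_tangent(h1, h2):
    """Brute force: indices (i, j) such that no hull vertex lies above
    the line through h1[i] and h2[j]."""
    vertices = h1 + h2
    for i in range(len(h1)):
        for j in range(len(h2) - 1, -1, -1):
            if all(cross(h1[i], h2[j], q) <= 0 for q in vertices):
                return i, j
    raise ValueError("no tangent found; are the x-coordinates distinct?")


def upper_hull(points):
    """Divide and conquer upper hull of x-sorted points."""
    if len(points) <= 4:
        return scan_upper_hull(points)
    mid = len(points) // 2
    h1 = upper_hull(points[:mid])      # independent subproblems: these two
    h2 = upper_hull(points[mid:])      # calls could run in parallel
    i, j = upper_common_tangent(h1, h2)
    return h1[:i + 1] + h2[j:]
```

The combine step keeps q1, ..., qi = a from the left hull and q'j = b, ..., q's from the right hull, as in the slide.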


Intractable problems remain intractable in parallel

For an intractable (NP-hard) problem, the only known solutions require exponential time:

Ts = a b^n,   p = n^c (polynomial in the size of the input)

From the lower bound:

Tp >= a b^n / n^c > a (b/2)^n for large values of n

still exponential
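The inequality can be spelled out step by step; the derivation below only uses the lower bound Tp >= Ts / p and the fact that 2^n eventually exceeds any polynomial n^c.

```latex
\[
T_p \;\ge\; \frac{T_s}{p} \;=\; \frac{a\,b^{n}}{n^{c}},
\qquad
\frac{a\,b^{n}}{n^{c}} \;>\; a\left(\frac{b}{2}\right)^{n}
\;\Longleftrightarrow\;
2^{n} \;>\; n^{c},
\]
\[
\text{and } 2^{n} > n^{c} \text{ holds for all sufficiently large } n,
\text{ so } T_p > a\left(\frac{b}{2}\right)^{n}.
\]
```

The bound a(b/2)^n is itself exponential when b > 2.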

We consider only the class P, and in particular the class NC ⊆ P.

NC is the class of all (decision) problems that can be solved in polylog parallel time (i.e. Tp is of order O(log^k n)) with a polynomial number of processors.

NC contains the problems that can be efficiently solved in parallel.


PARALLEL                         SEQUENTIAL
Efficient algorithm: class NC    Efficient algorithm: class P
NC = P?                          P = NP?

• There are problems belonging to P for which NO EFFICIENT PARALLEL algorithm is known.

• There is no proof that such an algorithm does not exist.

P-complete problems              NP-complete problems
Monotone Circuit Value (MCVP)    Satisfiability (SAT)
Goldschlager's Th. (1984)        Cook's Th. (1969)

[Figure: the problems P1, P2, P3, ..., Ph of the class P all reduce to MCVP; the problems NP1, NP2, NP3, ..., NPK of the class NP all reduce to SAT]


MONOTONE CIRCUIT VALUE PROBLEM (MCVP)

Determine the value of the single output of a Boolean circuit consisting of two-input AND and OR gates and a set of inputs and their complements.

[Figure: circuit with inputs a, b, c, d, e, f, g and single output z]

z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
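Evaluating such a circuit sequentially is straightforward, as the following Python sketch shows for the circuit z above. The gate-list representation, the gate names g1, ..., g5 and the function name `evaluate_circuit` are assumptions made for illustration; gates must be listed so that every gate appears after its operands.

```python
def evaluate_circuit(inputs, gates, output):
    """Evaluate a monotone circuit of two-input AND/OR gates.

    inputs: dict mapping input names to 0 or 1.
    gates:  list of (name, op, left, right) in topological order.
    output: name of the wire whose value is returned.
    """
    values = dict(inputs)
    for name, op, left, right in gates:
        if op == "AND":
            values[name] = values[left] & values[right]
        elif op == "OR":
            values[name] = values[left] | values[right]
        else:
            raise ValueError(f"unsupported gate type: {op}")
    return values[output]


# z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
GATES = [
    ("g1", "AND", "a", "b"),
    ("g2", "OR", "g1", "c"),
    ("g3", "AND", "g2", "d"),
    ("g4", "AND", "e", "f"),
    ("g5", "OR", "g4", "g"),
    ("z", "AND", "g3", "g5"),
]
```

Each gate is visited once, so the evaluation takes time linear in the size of the circuit; the difficulty of MCVP lies in doing this in polylog parallel time.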

DEPTH FIRST SEARCH

[Figure: graph on the vertices a, b, c, d, e, f, g; arcs numbered according to the order of appearance on the adjacency list]

dfs numbers: a 1, b 2, c 5, d 3, e 4, f 6, g 7


MAX FLOW

[Figure: a network with source s and sink t whose arcs are labelled with capacities, and the same network with the arcs labelled with a flow]

A directed graph N (network) with two distinguished vertices: source s and sink t; each arc is labelled with its capacity (a positive integer). A flow f is a function such that

1. 0 <= f(e) <= c(e) for all arcs e (capacity constraint)
2. for any node other than s and t, the sum of the flow on all incoming arcs is equal to the sum of the flow on all outgoing arcs (conservation constraint)

The value of the FLOW is given by the sum of the flow on the outgoing arcs of s (equal to the sum of the flow on all incoming arcs of t). Find the maximum possible value of the flow.

In the example: f = 6

Sequential algorithm: O(n^3). No efficient parallel solution is known.
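A sequential maximum-flow computation can be sketched in Python as follows. This sketch uses the Edmonds-Karp method (shortest augmenting paths found by breadth-first search), which is a swapped-in technique and not the O(n^3) algorithm the slide refers to; the function name `max_flow` and the small network in the usage note are hypothetical, since the network of the figure cannot be recovered.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Value of a maximum flow from s to t.

    capacity: dict of dicts, capacity[u][v] = capacity of arc (u, v).
    """
    # Residual capacities, including reverse arcs initialised to 0.
    residual = {u: dict(arcs) for u, arcs in capacity.items()}
    for u, arcs in capacity.items():
        for v in arcs:
            residual.setdefault(v, {}).setdefault(u, 0)
    residual.setdefault(s, {})
    value = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value                     # no augmenting path is left
        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        value += bottleneck
```

For instance, on the hypothetical network `{"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}, "t": {}}` the arcs leaving s have total capacity 5 and the function returns a flow of that value.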


Parallel Decision Problems

Reducibility notion: let A1 and A2 be decision problems. A1 is NC-reducible to A2 if there exists an NC algorithm that transforms an arbitrary input u1 of A1 into an input u2 of A2, such that A1 answers yes for u1 if and only if A2 answers yes for u2. A2 is then at least as difficult as A1.

A problem A is P-complete if A ∈ P and every problem in the class P is NC-reducible to A.

If A is P-complete: A ∈ NC iff P = NC
If A is NP-complete: A ∈ P iff P = NP

The hope of finding an efficient parallel algorithm is very low

To show that a problem A is P-complete:

- A ∈ P
- MCVP is NC-reducible to A

MCVP
input: an acyclic network of AND and OR gates (two inputs each) and an assignment of constant values 0, 1 to each input line
output: the value of the single output


Sketch of GOLDSCHLAGER'S theorem

An arbitrary problem A ∈ P can be formulated as an MCVP problem.

• MCVP ∈ P, because z can be computed in O(n) sequential time.

• If A ∈ P, A is accepted by a deterministic TM in time T(n), polynomial in the input size n.

[Figure: tape of the TM with cells numbered 0, 1, ..., n, showing the input and the output]

Q = {q1, ..., qs}: set of states
Σ = {a1, ..., am}: tape alphabet
δ : Q × Σ → Q × Σ × {L, R}: transition function

The corresponding Boolean circuit is defined by the following Boolean functions:

1. H(i, t) = 1 if the head is on cell i at time t; 0 <= t <= T(n), 1 <= i <= T(n).

2. C(i, j, t) = 1 if cell i contains the symbol aj at time t; 0 <= t <= T(n), 1 <= i <= T(n), 1 <= j <= m.

3. S(k, t) = 1 if the state of the TM is qk at time t; 1 <= k <= s, 0 <= t <= T(n).

Each step of the Turing machine can be described by one level of the circuit computing H(i, t), C(i, j, t) and S(k, t).


EX:

TM     0       1
q1     -       1q2R
q2     0q3R    1q2L
q3     0q3L    1q3R

Q = {q1, q2, q3}   Σ = {0, 1}

t = 0

H(i, 0) = 1 if i = 1; 0 if 2 <= i <= n

C(i, j, 0) = 1 if i = 1, j = 2; 0 if i = 1, j = 1

S(k, 0) = 1 if k = 1; 0 if k ≠ 1

t > 0

H(i, t) = (H(i-1, t-1) AND "right shift") OR (H(i+1, t-1) AND "left shift")

"left shift" = (S(2, t-1) AND C(i+1, 2, t-1)) OR (S(3, t-1) AND C(i+1, 1, t-1))

[Figure: tape cells i-1, i, i+1 with the head moving between them]

Analogously compute C(i, j, t) and S(k, t). The circuit value is given by C(1, *, T(n)); the circuit itself can be built in O(log n) time with a quadratic number of processors.


THE PRAM IS A THEORETICAL (INFEASIBLE) MODEL

• The interconnection network between processors and memory would require a very large amount of area.

• Message routing on the interconnection network would require time proportional to the network size (i.e. the assumption of a constant access time to the memory is not realistic).

WHY IS THE PRAM A REFERENCE MODEL?

• Algorithm designers can forget the communication problems and focus their attention on the parallel computation only.

• There exist algorithms simulating any PRAM algorithm on bounded-degree networks.

E.g. a PRAM algorithm requiring time T(n) can be simulated on a mesh of trees in time T(n) log^2 n / log log n, that is, each step can be simulated with a slow-down of log^2 n / log log n.

• Instead of designing ad hoc algorithms for bounded-degree networks, design more general algorithms for the PRAM model and simulate them on a feasible network.


• For the PRAM model there exists a well-developed body of techniques and methods to handle different classes of computational problems.

• The discussion on parallel models of computation is still HOT

The current trend:

COARSE-GRAINED MODELS (BSP, LogP)

• The degree of parallelism allowed is independent of the number of processors.

• The computation is divided into supersteps, each of which includes
   • local computation
   • communication phase
   • synchronization phase

The study is still at the beginning!
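The three phases of a superstep can be illustrated with a small Python sketch that sums a list using threads. This is only an analogy under stated assumptions: the function name `bsp_sum`, the shared `inbox` list standing in for message passing, and the use of `threading.Barrier` for the synchronization phase are all choices made for illustration and are not part of the BSP or LogP definitions.

```python
import threading


def bsp_sum(data, workers=4):
    """Sum `data` in two supersteps on `workers` simulated processors."""
    chunks = [data[i::workers] for i in range(workers)]
    inbox = [0] * workers                 # one message slot per worker
    barrier = threading.Barrier(workers)
    result = []

    def run(pid):
        # Superstep 1
        local = sum(chunks[pid])          # local computation
        inbox[pid] = local                # communication phase
        barrier.wait()                    # synchronization phase
        # Superstep 2: worker 0 combines the partial sums it received.
        if pid == 0:
            result.append(sum(inbox))

    threads = [threading.Thread(target=run, args=(pid,)) for pid in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result[0]
```

The barrier guarantees that every partial sum has been delivered before worker 0 reads the inbox.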
