Parallel and Distributed Computation (slides by L. Pagli)
1Cuba
PARALLEL AND DISTRIBUTED COMPUTATION
(Slides by L. Pagli)
• MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY
[Figure: processors P1, P2, P3, P4, P5, ..., Pn attached to an INTERCONNECTION NETWORK]
• CONNECTION MACHINE (THINKING MACHINES CORP.): 64,000 processors
• INTERNET: connects computers all over the world
THREE TYPES OF MULTIPROCESSING FRAMEWORKS, CLOSELY RELATED
• CONCURRENT
• PARALLEL
    • PRAM
    • Bounded-degree networks and VLSI
• DISTRIBUTED
MULTIPROCESSING ACTIVITIES TAKE PLACE IN A SINGLE MACHINE (POSSIBLY USING SEVERAL PROCESSORS), SHARING MEMORY AND TASKS.
TECHNICAL ASPECTS
• PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCHRONY, SHARE MEMORY TO A LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM BETWEEN THEM.
• DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT: COMMUNICATION IS LESS FREQUENT AND LESS SYNCHRONOUS, AND COOPERATION IS LIMITED.
PURPOSES
• PARALLEL COMPUTERS COOPERATE TO SOLVE (POSSIBLY) DIFFICULT PROBLEMS MORE EFFICIENTLY.
• DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES. SOMETIMES COMMUNICATION WITH OTHER COMPUTERS IS NEEDED (E.G. DISTRIBUTED DATABASE OPERATIONS).
PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE
DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE, ONLY WHEN IT IS NECESSARY
FOR PARALLEL SYSTEMS
WE ARE INTERESTED IN SOLVING ANY PROBLEM IN PARALLEL
FOR DISTRIBUTED SYSTEMS
WE ARE INTERESTED IN SOLVING ONLY PARTICULAR PROBLEMS IN PARALLEL. TYPICAL EXAMPLES ARE:
• COMMUNICATION SERVICES: ROUTING, BROADCASTING
• MAINTENANCE OF CONTROL STRUCTURES: *SPANNING TREE CONSTRUCTION, TOPOLOGY UPDATE, *LEADER ELECTION
• RESOURCE CONTROL ACTIVITIES: LOAD BALANCING, MANAGING GLOBAL DIRECTORIES, *MUTUAL EXCLUSION
PARALLEL ALGORITHMS
• WHICH MODEL OF COMPUTATION IS THE BEST TO USE?
• HOW MUCH TIME DO WE EXPECT TO SAVE USING A PARALLEL ALGORITHM?
• HOW DO WE CONSTRUCT EFFICIENT ALGORITHMS?
MANY CONCEPTS OF COMPLEXITY THEORY MUST BE REVISITED
• IS PARALLELISM A SOLUTION FOR HARD PROBLEMS?
• ARE THERE PROBLEMS THAT DO NOT ADMIT AN EFFICIENT PARALLEL SOLUTION, THAT IS, INHERENTLY SEQUENTIAL PROBLEMS?
UNDECIDABLE PROBLEMS REMAIN UNDECIDABLE!
PRAM MODEL
• Joseph JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
• R. M. Karp, V. Ramachandran, A survey of parallel algorithms for shared-memory machines, in J. van Leeuwen (Ed.), Handbook of Theoretical Computer Science.
• Ian Parberry, Parallel Complexity Theory, Research Notes in Theoretical Computer Science, John Wiley & Sons, 1987.
TO FOCUS ON ALGORITHMIC ISSUES INDEPENDENTLY OF PHYSICAL LOCATIONS
[Figure: processors P1, P2, ..., Pi, ..., Pn, each connected to a Common Memory with cells 1, 2, 3, ..., m]
PRAM: n RAM processors, numbered from 1 to n, connected to a common memory of m cells.
ASSUMPTION: in each time unit, each Pi can read a memory cell, perform an internal computation and write another memory cell.
CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time!
Pi writes the message in cell x at time t; Pj reads the message in cell x at time t+1.
ASSUMPTIONS
• Shared memory: the array A is stored in the global memory and can be accessed by any processor.
• Synchronous mode of operation: in each unit of time, each processor is allowed to execute an instruction or to stay idle.
There are several variations regarding the handling of simultaneous accesses to the same memory location:
• EREW-PRAM (exclusive read, exclusive write)
• CREW-PRAM (concurrent read, exclusive write)
• CRCW-PRAM (concurrent read, concurrent write), with a policy to resolve concurrent writes: Common, Priority, Arbitrary
The three models do not differ substantially in their computational power!
If each processor can execute its own local program we have a MIMD (multiple instruction, multiple data) model; otherwise a SIMD (single instruction, multiple data) model.
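The three CRCW write policies named above can be illustrated with a small sketch. The function name `resolve_concurrent_write`, the request format and the choice that the lowest processor index wins under Priority are assumptions made for illustration; the slides only name the policies.

```python
import random

def resolve_concurrent_write(requests, policy, rng=None):
    """Decide which value lands in a cell when several processors write it
    in the same time unit.

    requests: list of (processor_id, value) pairs aimed at the same cell.
    policy:   "common", "priority" or "arbitrary".
    """
    if not requests:
        raise ValueError("no write requests")
    if policy == "common":
        # Common: the write is legal only if all processors agree on the value.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW requires identical values")
        return values.pop()
    if policy == "priority":
        # Priority: assumed here that the smallest processor index wins.
        return min(requests, key=lambda r: r[0])[1]
    if policy == "arbitrary":
        # Arbitrary: any one of the competing writes succeeds.
        rng = rng or random.Random()
        return rng.choice(requests)[1]
    raise ValueError(f"unknown policy: {policy}")
```

An algorithm that is correct under Arbitrary must work whichever write succeeds, which is why the sketch leaves the choice to a random generator.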
From Bertossi, Ch. 27
• Summation n log n
• Summation n
[R. Grossi]
Important parameters of the efficiency of a parallel algorithm
Tp(n) (or Tp): parallel time
p(n) (or p): number of processors
C(n) = Tp p: cost of the parallel algorithm
LOWER BOUND of the parallel computation
Let A be a problem and Ts be the complexity of the optimal (or best known) sequential algorithm for A. We have:
Tp >= Ts / p
The parallel algorithm can be converted into a sequential algorithm that runs in O(C(n)) time: the single processor simulates the p processors in p steps for each of the Tp parallel steps.
If the parallel time were less than Ts / p, we could derive a sequential algorithm better than the optimal one!!
Parallel algorithm
time           1     2     3
processor P1   op1   op2   op3
processor P2   op4   op5
can be simulated by a single processor in a number of steps (time) <= 6
Tp = 3, p = 2, C = 6
Sequential algorithm
time   1     2     3     4     5
       op1   op4   op2   op5   op3        Ts = 5
Tp >= Ts/p
Ts/Tp: speedup of the parallel algorithm
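The simulation argument can be replayed on the two-processor example. This is a minimal sketch; the function name `simulate_sequentially` and the list-of-lists schedule encoding (with `None` for an idle processor) are illustrative assumptions.

```python
def simulate_sequentially(schedule):
    """schedule[t][j] is the operation run by processor j at parallel step t
    (None when the processor is idle).

    Returns (Tp, p, cost, trace), where trace is the order in which a single
    processor would execute the operations.
    """
    tp = len(schedule)
    p = max(len(step) for step in schedule)
    # One processor replays each parallel step, one operation at a time.
    trace = [op for step in schedule for op in step if op is not None]
    return tp, p, tp * p, trace

# The example of the slide: P1 runs op1, op2, op3 and P2 runs op4, op5.
schedule = [["op1", "op4"], ["op2", "op5"], ["op3", None]]
tp, p, cost, trace = simulate_sequentially(schedule)
speedup = len(trace) / tp  # Ts / Tp for this example
```

The trace has 5 operations, which is within the cost C = 6, and it matches the sequential order op1 op4 op2 op5 op3 shown above.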
MAXIMUM on the PRAM
Input: an array A of n = 2^k elements in the shared memory of a PRAM with n/2 processors
Output: the maximum element, stored in location S.
Algorithm MAX
begin
    for j := 1 to log n do
        for all i where 1 <= i <= n/2^j do in parallel
            A[i] := max {A[2i-1], A[2i]}
    S := A[1]
end
[Figure: tree reduction on A(1), ..., A(8): processors P1, P2, P3, P4 work in the first round, P1, P2 in the second, P1 in the third; P1 then writes the result into S]
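Algorithm MAX can be simulated sequentially to check the number of rounds. The sketch below is an illustration only: the name `pram_max` is hypothetical, and each parallel round is emulated by computing all new values before storing them, so that the reads of a round never see the writes of the same round.

```python
def pram_max(values):
    """Simulate the tree-based MAX algorithm on n = 2^k values.

    Returns (maximum, rounds), where rounds is the number of parallel steps.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    a = [None] + list(values)  # 1-based indexing, as on the slide
    rounds = 0
    active = n // 2            # processors working in the current round
    while active >= 1:
        # Processor i reads A[2i-1] and A[2i], then writes A[i].
        new = [max(a[2 * i - 1], a[2 * i]) for i in range(1, active + 1)]
        a[1:active + 1] = new
        rounds += 1
        active //= 2
    return a[1], rounds
```

For n = 8 the simulation uses 4, 2 and 1 active processors in three successive rounds, that is log n parallel steps.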
From the previous lower bound and the sequential computation (Ts = O(n)):
C = Tp p >= Ts
From algorithm MAX:
C = Tp p = O(n log n): not optimal
Better algorithm:
• divide the n elements into k = n/log n subsets of log n elements each, one per processor P1, P2, P3, ..., Pk
• each processor Pi computes the maximum mi of its subset with the sequential algorithm in time O(log n)
• algorithm MAX is executed on the local maxima m1, m2, m3, ..., mk in time O(log (n/log n)) = O(log n - log log n) = O(log n)
Overall time: Tp = O(log n) and p = n/log n
C = Tp p = O(n): optimal
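The two-phase algorithm can be sketched as follows. The name `optimal_max` and the step counting are assumptions made for illustration; the block length is taken as floor(log2 n), and the local maxima are reduced pairwise as in algorithm MAX.

```python
import math

def optimal_max(values):
    """Two-phase maximum: sequential scan of blocks, then tree reduction.

    Returns (maximum, processors, parallel_steps).
    """
    n = len(values)
    if n == 0:
        raise ValueError("empty input")
    block = max(1, int(math.log2(n))) if n > 1 else 1
    # Phase 1: processor i scans its own block sequentially (block steps).
    local = [max(values[s:s + block]) for s in range(0, n, block)]
    processors = len(local)
    steps = block
    # Phase 2: pairwise reduction of the local maxima, one round per level.
    while len(local) > 1:
        local = [max(local[i:i + 2]) for i in range(0, len(local), 2)]
        steps += 1
    return local[0], processors, steps
```

With n = 16 the sketch uses 4 processors and 4 + 2 = 6 parallel steps, so the cost is 24, compared with 8 processors times 4 steps = 32 for the plain tree algorithm.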
PERFORMANCE OF PARALLEL ALGORITHMS
Four ways of measuring the performance of a parallel algorithm:
1. P(n) processors and Tp(n) time.
2. C(n) = P(n)Tp(n) cost and Tp(n) time.
The number of processors depends on the size n of the problem.
The second relation can be generalized to any number p < P(n) of processors: each of the Tp(n) parallel steps can be simulated by the p processors in O(P(n)/p) substeps; this simulation takes a total of O(Tp(n)P(n)/p) time.
3. O(Tp(n)P(n)/p) time for any number p < P(n) of processors.
If the number of processors p is larger than P(n), we can clearly achieve the running time Tp(n) by using P(n) processors only. Relation 3 can be further generalized:
4. O(C(n)/p + Tp(n)) time for any number p of processors.
In conclusion, in the design of a PRAM algorithm we can assume as many processors as we need and use the proper relation to analyze it.
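Relations 3 and 4 can be checked numerically. In this sketch, `simulated_time` and `relation4_bound` are hypothetical helper names, and the constants hidden by the O-notation are taken to be 1.

```python
import math

def simulated_time(P, Tp, p):
    """Steps used when p processors simulate an algorithm written for
    P processors and Tp parallel steps."""
    if p <= 0:
        raise ValueError("p must be positive")
    # Each original step is split into ceil(P/p) substeps; when p >= P a
    # single substep per step is enough, so the time stays Tp.
    return Tp * math.ceil(P / p)

def relation4_bound(P, Tp, p):
    """The bound C(n)/p + Tp(n) of relation 4, with C(n) = P * Tp."""
    return P * Tp / p + Tp
```

For example, with P = 512 and Tp = 10, ten processors need 10 * 52 = 520 steps, which is below the bound 5120/10 + 10 = 522, while any p >= 512 gives 10 steps.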
PERFORMANCE OF ALGORITHM MAX
1. P(n) = n/2 processors and Tp(n) = O(log n) time.
2. C(n) = P(n)Tp(n) = O(n log n) cost and Tp(n) = O(log n) time.
Assume p = log n processors:
3. O(Tp(n)P(n)/p) = O(log n · n/log n) = O(n) time.
Therefore
4. O(n log n / p + log n) time. If p >= n, O(log n) time; otherwise O(n log n / p) time.
Work W(n) of a parallel algorithm: total number of operations used.
Work of alg. MAX: W(n) = SUM_{j=1..log n} (n/2^j) + 1 = O(n)
W(n) <= C(n)
W(n) measures the total number of operations and has nothing to do with the number of processors available, while C(n) measures the cost of the algorithm relative to the number p of processors available.
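The gap between work and cost for algorithm MAX can be computed directly from the formulas above. The function name `max_work_and_cost` is an assumption, and the cost is taken with exactly n/2 processors and log n steps.

```python
def max_work_and_cost(n):
    """Work and cost of algorithm MAX for n a power of two.

    work = sum over rounds j of n / 2^j comparisons, plus the final store.
    cost = (n/2 processors) * (log n parallel steps).
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 2")
    log_n = n.bit_length() - 1
    work = sum(n // 2 ** j for j in range(1, log_n + 1)) + 1
    cost = (n // 2) * log_n
    return work, cost
```

For n = 1024 the work is 1023 + 1 = 1024 operations while the cost is 512 * 10 = 5120: most processors are idle in the later rounds.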
Work-time presentation of a parallel algorithm: any number of parallel operations is allowed at each time unit.
BRENT PRINCIPLE: given a parallel algorithm that runs in time T(n) and requires W(n) work, we can adapt this algorithm to run on a p-processor PRAM in time
Tp(n) <= ⌊W(n)/p⌋ + T(n)
Let Wi(n) be the number of operations of time unit i, 1 <= i <= T(n). Simulate each set of Wi(n) operations in ⌈Wi(n)/p⌉ parallel steps of the p processors, for each 1 <= i <= T(n). The p-processor PRAM algorithm takes
SUMi ⌈Wi(n)/p⌉ <= SUMi (⌊Wi(n)/p⌋ + 1) <= ⌊W(n)/p⌋ + T(n)
steps.
The Brent principle assumes that scheduling the operations on the processors is always a trivial task. This is not always true. It is easy if we use C(n) in place of W(n).
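The bound can be checked against the exact number of simulated steps. This is a minimal sketch with hypothetical names `brent_time` and `brent_bound`; it assumes the per-step operation counts Wi(n) are known in advance, which is the scheduling assumption the principle relies on.

```python
import math

def brent_time(work_per_step, p):
    """Exact number of steps when each time unit i, with Wi operations,
    is simulated in ceil(Wi / p) steps by p processors."""
    return sum(math.ceil(w / p) for w in work_per_step)

def brent_bound(work_per_step, p):
    """The bound floor(W / p) + T, with W the total work and T the number
    of time units of the original algorithm."""
    return sum(work_per_step) // p + len(work_per_step)
```

For algorithm MAX with n = 16 the rounds perform 8, 4, 2 and 1 comparisons; with p = 2 processors `brent_time` gives 4 + 2 + 1 + 1 = 8 steps, within the bound 15 // 2 + 4 = 11.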
[Figure: algorithm A1 performs W(n) = 36 operations, numbered 1 to 36, in T1 = 6 time units t1, ..., t6, with Wi = 6, 7, 3, 8, 5, 7 operations per time unit]
A1 can be simulated by A2 with 3 processors in time T2(n) <= 36/3 + 6 = 18
[Figure: the 36 operations of A1 scheduled on the 3 processors of A2; the resulting time is T2 = 14]
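One way to build an explicit schedule for this example is sketched below. The function name `schedule` is hypothetical, and the placement of individual operations is an assumption (operations of a time unit are simply handed out in numerical order), so it may differ from the assignment drawn in the figure; only the number of steps is compared.

```python
def schedule(work_per_step, p):
    """Return a list of steps; each step lists the (at most p) operation
    numbers executed together by the p processors."""
    rows = []
    next_op = 1
    for w in work_per_step:
        ops = list(range(next_op, next_op + w))
        next_op += w
        # Operations of one time unit of A1 are packed p at a time.
        for start in range(0, w, p):
            rows.append(ops[start:start + p])
    return rows

rows = schedule([6, 7, 3, 8, 5, 7], 3)
```

The schedule has 2 + 3 + 1 + 3 + 2 + 3 = 14 steps, which agrees with T2 = 14 and stays below the bound of 18.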
From Bertossi, Ch. 27
Basic techniques:
• Prefix sums
• Non-optimal sorting
• List ranking with pointer jumping
• Euler tour
[R. Grossi]
PARALLEL DIVIDE AND CONQUER
• Partition the input into several subsets of almost equal size
• Solve recursively the subproblem defined by each subset
• Combine the solutions of the subproblems into a solution to the overall problem
CONVEX HULL
[Figure: convex hull with vertices v1, ..., v7; the leftmost point p and the rightmost point q split it into the UPPER HULL and the LOWER HULL]
• Sort the points by x-coordinate in increasing order
• Compute the UPPER HULL
Sequential algorithm: O(n log n)
Tp = O(log n)
W.l.o.g., let x(v1) < x(v2) < ... < x(vn), n = 2^k
• Divide the points into two subsets S1 = (v1, v2, ..., vn/2) and S2 = (vn/2+1, ..., vn)
Suppose the UPPER HULLs of S1 and S2 are already computed
[Figure: UH(S1) with vertices q1, ..., q5 and UH(S2) with vertices q'1, ..., q'4, joined by the segment ab]
Compute ab: the upper common tangent
• The UPPER HULL of S is formed by q1, ..., qi = a, q'j = b, ..., q's
Algorithm UPPER HULL (Sketch)
1. If n <= 4, use a brute force method to determine UH(S)
2. Let S1 = (v1, v2, ..., vn/2), S2 = (vn/2+1, ..., vn); recursively compute UH(S1) and UH(S2) in parallel (Tp(n/2) time and 2W(n/2) operations)
3. Find the upper common tangent between UH(S1) and UH(S2) and deduce UH(S) (O(log n) sequential time, O(n) operations)
Tp(n) = Tp(n/2) + O(log n) = O(log^2 n)
W(n) = 2W(n/2) + O(n) = O(n log n)
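The recursive structure of the algorithm can be sketched sequentially. Two simplifications are assumed here and are not the method of the slides: the base case uses a linear scan instead of brute force, and the combine step rescans the concatenated hulls (a monotone-chain scan, O(n) time) instead of locating the upper common tangent by binary search. The names `cross`, `scan_upper` and `upper_hull` are illustrative.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def scan_upper(points):
    """Upper hull of points already sorted by increasing x."""
    hull = []
    for pt in points:
        # Drop the last vertex while it lies on or below the new edge.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], pt) >= 0:
            hull.pop()
        hull.append(pt)
    return hull

def upper_hull(points):
    """Divide-and-conquer upper hull; points sorted by distinct x."""
    n = len(points)
    if n <= 4:
        return scan_upper(points)          # base case (step 1)
    left = upper_hull(points[:n // 2])     # UH(S1)
    right = upper_hull(points[n // 2:])    # UH(S2), independent of UH(S1)
    # Combine (step 3): vertices cut off by the common tangent are removed
    # by the scan over the concatenation of the two hulls.
    return scan_upper(left + right)
```

The two recursive calls do not share data, which is what allows them to run in parallel on a PRAM.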
Intractable problems remain intractable in parallel
For an intractable (NP-hard) problem, the only known solutions require exponential time:
Ts = a b^n, p = n^c (polynomial in the size of the input)
From the lower bound:
Tp >= a b^n / n^c > a (b/2)^n for large values of n
still exponential
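A short justification of the last inequality, as a sketch: it assumes c is a constant and uses only the fact that 2^n eventually exceeds n^c.

```latex
\[
n^{c} < 2^{n} \ \text{for all sufficiently large } n
\quad\Longrightarrow\quad
T_p \;\ge\; \frac{a\,b^{n}}{n^{c}} \;>\; \frac{a\,b^{n}}{2^{n}}
\;=\; a\left(\frac{b}{2}\right)^{n}.
\]
```

The right-hand side grows exponentially when b > 2. For 1 < b <= 2 the same argument works with any constant d, 1 < d < b, in place of 2, giving Tp > a (b/d)^n.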
We consider only the class P, and in particular the class NC ⊆ P.
NC is the class of all (decision) problems that can be solved in polylog parallel time (i.e. Tp is of order O(log^k n)), with a polynomial number of processors.
NC contains the problems that can be efficiently solved in parallel.
                       PARALLEL        SEQUENTIAL
Efficient algorithms   Class NC        Class P
Open question          NC = P?         P = NP?
• There are problems belonging to P for which NO EFFICIENT PARALLEL algorithm is known.
• There is no proof that such an algorithm does not exist.
                       P-complete problems              NP-complete problems
Reference problem      Monotone Circuit Value (MCVP)    Satisfiability (SAT)
Theorem                Goldschlager's Th. (1977)        Cook's Th. (1971)
[Figure: every problem P1, P2, P3, ..., Ph in P reduces to MCVP; every problem NP1, NP2, NP3, ..., NPk in NP reduces to SAT]
MONOTONE CIRCUIT VALUE PROBLEM (MCVP)
Determine the value of the single output of a Boolean circuit consisting of two-input AND and OR gates and a set of inputs and their complements.
[Figure: circuit with inputs a, b, c, d, e, f, g and output z]
z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
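The example circuit can be evaluated gate by gate in sequential time linear in the number of gates. In this sketch the gate names g1, ..., g5, the tuple encoding and the function `evaluate_circuit` are assumptions; the gates are listed in topological order so that every operand is available when it is needed.

```python
def evaluate_circuit(inputs, gates, output):
    """Evaluate a monotone circuit.

    inputs: dict mapping input names to booleans.
    gates:  list of (name, op, x, y) in topological order, op in {"AND", "OR"}.
    output: name of the output gate.
    """
    values = dict(inputs)
    for name, op, x, y in gates:
        if op == "AND":
            values[name] = values[x] and values[y]
        elif op == "OR":
            values[name] = values[x] or values[y]
        else:
            raise ValueError(f"unsupported gate: {op}")
    return values[output]

# z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
GATES = [
    ("g1", "AND", "a", "b"),
    ("g2", "OR", "g1", "c"),
    ("g3", "AND", "g2", "d"),
    ("g4", "AND", "e", "f"),
    ("g5", "OR", "g4", "g"),
    ("z", "AND", "g3", "g5"),
]
```

The left branch of this circuit has depth 3 and must be evaluated level by level; for general circuits no way is known to compress such a dependency chain to polylogarithmic depth.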
DEPTH FIRST SEARCH
[Figure: graph on vertices a, b, c, d, e, f, g; arcs numbered according to the order of appearance on the adjacency list]
dfs numbers: a 1, b 2, c 5, d 3, e 4, f 6, g 7
MAX FLOW
[Figure: example network with source s, sink t and arc capacities, and the same network carrying a flow of value f = 6]
A directed graph N (network) with two distinguished vertices: source s and sink t; each arc is labelled with its capacity (a positive integer).
A flow f is a function such that
1. 0 <= f(e) <= c(e) for all arcs e (capacity constraint)
2. for every node other than s and t, the sum of the flows on all incoming arcs is equal to the sum of the flows on all outgoing arcs (conservation constraint)
The value of the FLOW is given by the sum of the flows on the outgoing arcs of s (equal to the sum of the flows on all incoming arcs of t). Find the maximum possible value of the flow.
Sequential algorithm: O(n^3). No efficient parallel solution is known.
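For reference, a compact sequential max-flow routine is sketched below. It uses breadth-first augmenting paths (the Edmonds-Karp method), which is a swapped-in textbook technique and not the O(n^3) algorithm mentioned on the slide; the function name, the dictionary encoding of capacities and the small test network are assumptions.

```python
from collections import deque

def max_flow(capacity, s, t):
    """capacity: dict mapping arcs (u, v) to positive integer capacities.
    Returns the value of a maximum flow from s to t."""
    if s == t:
        raise ValueError("source and sink must differ")
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)      # reverse arc for cancellations
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                      # no augmenting path is left
        # Bottleneck capacity along the path found.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck
```

Each augmentation depends on the residual network left by the previous one, which hints at why the problem resists straightforward parallelization.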
Parallel Decision Problems
Reducibility notion: let A1 and A2 be decision problems. A1 is NC-reducible to A2 if there exists an NC algorithm that transforms an arbitrary input u1 of A1 into an input u2 of A2, such that A1 answers yes for u1 if and only if A2 answers yes for u2. A2 is at least as difficult as A1.
A problem A is P-complete if A ∈ P and every problem in the class P is NC-reducible to A.
If A is P-complete: A ∈ NC iff P = NC
If A is NP-complete: A ∈ P iff P = NP
The hope of finding an efficient parallel algorithm is very low
To show that a problem A is P-complete:
- A ∈ P
- MCVP is NC-reducible to A
MCVP
input: an acyclic network of two-input AND and OR gates and an assignment of constant values 0, 1 to each input line
output: the value of the single output
Sketch of GOLDSCHLAGER'S theorem
An arbitrary problem A ∈ P can be formulated as an MCVP problem.
• MCVP ∈ P because z can be computed in O(n) sequential time.
• If A ∈ P, then A is accepted by a deterministic TM in time T(n), polynomial in the input size n.
[Figure: Turing machine tape, with input and output marked on cells 1, ..., n]
Q = {q1, ..., qs}: set of states; Σ = {a1, ..., am}: tape alphabet
δ: Q x Σ -> Q x Σ x {L, R}: transition function
The corresponding boolean circuit is defined by the following boolean functions:
1. H(i, t) = 1 if the head is on cell i at time t; 0 <= t <= T(n), 1 <= i <= T(n).
2. C(i, j, t) = 1 if cell i contains the symbol aj at time t; 0 <= t <= T(n), 1 <= i <= T(n), 1 <= j <= m.
3. S(k, t) = 1 if the state of the TM is qk at time t; 1 <= k <= s, 0 <= t <= T(n).
Each step of the Turing machine can be described by one level of the circuit computing H(i, t), C(i, j, t) and S(k, t).
EX: TM with Q = {q1, q2, q3}, Σ = {0, 1} (a1 = 0, a2 = 1) and transition table (symbol written, next state, head move):
        0          1
q1      -          1 q2 R
q2      0 q3 R     1 q2 L
q3      0 q3 L     1 q3 R
t = 0
H(i, 0) = 1 if i = 1; 0 if 2 <= i <= n
C(i, j, 0) = 1 if i = 1, j = 2; 0 if i = 1, j = 1
S(k, 0) = 1 if k = 1; 0 if k != 1
t > 0
H(i, t) = (H(i-1, t-1) AND "right shift") OR (H(i+1, t-1) AND "left shift")
"left shift" = (S(2, t-1) AND C(i+1, 2, t-1)) OR (S(3, t-1) AND C(i+1, 1, t-1))
[Figure: tape cells i-1, i, i+1 holding symbols 0 and 1]
Analogously compute C(i, j, t) and S(k, t). The circuit value is given by C(1, *, T(n)), and the circuit itself can be constructed in O(log n) time with a quadratic number of processors.
THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL
• The interconnection network between processors and memory would require a very large amount of area.
• Message routing on the interconnection network would require time proportional to the network size (i.e. the assumption of constant access time to the memory is not realistic).
WHY IS THE PRAM A REFERENCE MODEL?
• Algorithm designers can forget the communication problems and focus their attention on the parallel computation only.
• There exist algorithms simulating any PRAM algorithm on bounded-degree networks. E.g. a PRAM algorithm requiring time T(n) can be simulated on a mesh of trees in time T(n) log^2 n / log log n, that is, each step can be simulated with a slow-down of log^2 n / log log n.
• Instead of designing ad hoc algorithms for bounded-degree networks, design more general algorithms for the PRAM model and simulate them on a feasible network.
• For the PRAM model there exists a well-developed body of techniques and methods to handle different classes of computational problems.
• The discussion on parallel models of computation is still HOT
The current trend:
COARSE-GRAINED MODELS (BSP, LogP)
• The degree of parallelism allowed is independent of the number of processors.
• The computation is divided into supersteps, each of which includes
    • local computation
    • a communication phase
    • a synchronization phase
The study is still at the beginning!
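The superstep structure can be illustrated with a toy sequential simulation of a global sum. Everything in this sketch is an assumption made for illustration (the function name `bsp_sum`, the choice of processor 0 as collector, the use of plain lists as message inboxes); it models only the order of the three phases, not their cost.

```python
def bsp_sum(blocks):
    """Sum the data held by p simulated processors in two supersteps.

    blocks[i] is the local data of processor i.
    Returns (total, number_of_supersteps).
    """
    p = len(blocks)
    if p == 0:
        raise ValueError("at least one processor is required")
    inbox = [[] for _ in range(p)]

    # Superstep 1, local computation: each processor sums its own block.
    partial = [sum(block) for block in blocks]
    # Superstep 1, communication: every processor sends its partial sum
    # to processor 0.
    for i in range(p):
        inbox[0].append(partial[i])
    # Superstep 1, synchronization: messages are considered delivered only
    # after this point (the barrier is implicit in a sequential simulation).

    # Superstep 2, local computation: processor 0 combines the messages.
    total = sum(inbox[0])
    return total, 2
```

The number of simulated processors is simply the number of blocks, independent of the input size, which reflects the coarse-grained view described above.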