
Unit 2: Time and Global States

2.1. INTRODUCTION

We need to measure time accurately to know the time at which an event occurred at a computer. To do this, we need to synchronize its clock with an authoritative external clock. Clock synchronization algorithms are useful for concurrency control based on timestamp ordering, and for checking the authenticity of requests, e.g. in Kerberos.

There is no global clock in a distributed system, and synchronizing data in a distributed system is an enormous challenge in and of itself. In single-CPU systems, critical regions, mutual exclusion, and other synchronization problems are solved using methods such as semaphores. These methods do not work in distributed systems because they implicitly rely on the existence of shared memory.

Examples:

Two processes interacting using a semaphore must both be able to access the semaphore. In a centralized system, the semaphore is stored in the kernel and accessed by the processes using system calls.

If two events occur in a distributed system, it is difficult to determine which event occurred first.

Communication between processes in a distributed system can have unpredictable delays, processes can fail, and messages may be lost. Synchronization in distributed systems is therefore harder than in centralized systems, because it requires distributed algorithms. The following are the properties of distributed algorithms:

The relevant information is scattered among multiple machines.

Processes make decisions based only on locally available information.

A single point of failure in the system should be avoided.

No common clock or other precise global time source exists.

2.2. CLOCKS, EVENTS AND PROCESS STATES

Each computer in a DS has its own internal clock, used by local processes to obtain the value of the current time. Processes on different computers can timestamp their events, but clocks on different computers may give different times: computer clocks drift from perfect time, and their drift rates differ from one another.

Clock drift rate: the relative amount by which a computer clock differs from a perfect clock.

Even if clocks on all computers in a DS are set to the same time, they will eventually diverge quite significantly unless corrections are applied.
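The effect of drift on skew can be made concrete with a small sketch (the helper name and the numeric example are illustrative, not from the notes): if each clock may gain or lose up to ρ seconds per second of real time, two clocks drifting in opposite directions can diverge by up to 2ρt after t seconds.

```python
# Sketch: how a bounded drift rate turns into growing skew.
# With drift rate rho, a clock may gain or lose up to rho seconds per
# second, so two clocks can diverge by up to 2 * rho * t after t seconds.

def max_skew(rho: float, elapsed_seconds: float) -> float:
    """Worst-case skew between two clocks, each with drift rate rho."""
    return 2 * rho * elapsed_seconds

# A typical quartz clock drifts about 1e-6 s/s; over one day two such
# clocks may accumulate roughly 0.17 s of skew.
day = 24 * 3600
skew = max_skew(1e-6, day)
```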

A distributed system is defined as a collection P of N processes pi, i = 1, 2, …, N. Each process pi has a state si consisting of its variables, which it transforms as it executes. Processes communicate only by messages via a network. The actions of a process are: Send, Receive, and operations that change pi's state.

Event: the occurrence of a single action that a process carries out as it executes, e.g. Send, Receive, or a change of state.

Events at a single process pi can be placed in a total ordering, denoted by the relation →i between the events: e →i e' if and only if e occurs before e' at pi.

A history of process pi is the series of its events, ordered by →i:

history(pi) = hi = <ei^0, ei^1, ei^2, …>

Clocks

Physical clocks in computers are realized as crystal oscillation counters at the hardware level. Clock skew (offset) is a common problem with clocks.

To timestamp events, a computer uses its physical clock. At real time t, the OS reads the time Hi(t) on the computer's hardware clock. It then calculates the time on its software clock as

Ci(t) = αHi(t) + β

which can be used to timestamp events at pi.

Computer clocks are not generally in perfect agreement

Skew: the difference between the times on two clocks (at any instant). Computer clocks are subject to clock drift: they count time at different rates.

Clock drift rate: the difference per unit of time from some ideal reference clock.

Coordinated Universal Time (UTC): International Atomic Time is based on very accurate physical clocks (drift rate 10^-13). UTC is an international standard for timekeeping. It is based on atomic time, but occasionally adjusted to astronomical time. It is broadcast from radio stations on land and from satellites (e.g. GPS). Computers with receivers can synchronize their clocks with these timing signals. Signals from land-based stations are accurate to about 0.1-10 milliseconds; signals from GPS are accurate to about 1 microsecond.

2.3. SYNCHRONIZING PHYSICAL CLOCKS

There are two ways to synchronise a clock:

External synchronization

• This method synchronizes the process's clock with an authoritative external reference clock S(t), limiting the skew to a bound D > 0: |S(t) - Ci(t)| < D for all t.

• For example, synchronization with a UTC (Coordinated Universal Time) source.

Internal synchronization

• Synchronize the local clocks within a distributed system so that they disagree by no more than a bound D > 0, without necessarily achieving external synchronization: |Ci(t) - Cj(t)| < D for all i, j, t.


• For a system with external synchronization bound of D, the internal synchronization is bounded by 2D.

The correctness of a clock

A hardware clock is correct if its drift rate falls within a bound ρ > 0; then for any t and t' with t' > t, the following error bound on measuring the interval holds:

(1-ρ)(t'-t) ≤ H(t') - H(t) ≤ (1+ρ)(t'-t)

Under this condition, no jumps in the hardware clock's reading are allowed.

The most frequently used conditions are:

• Monotonically increasing

• Drift rate bounded between synchronization points

• Clock may jump ahead at synchronization points

Working of Computer timer:

To implement a clock in a computer, a counter register and a holding register are used.

The counter is decremented by a quartz crystal oscillator.

When it reaches zero, an interrupt is generated and the counter is reloaded from the holding register.

Clock skew problem

To avoid the clock skew problem, two types of clocks are used:

Logical clocks: to provide consistent event ordering.

Physical clocks: clocks whose values must not deviate from real time by more than a certain amount.

Software based solutions for synchronising clocks

The following techniques are used to synchronize clocks:

• time stamps of real-time clocks

• message passing

• round-trip time (local measurement)

Based on the above-mentioned techniques, the following algorithms provide clock synchronization:

• Cristian’s algorithm

• Berkeley algorithm

• Network time protocol (Internet)

2.3.1 Cristian’s algorithm

Cristian suggested the use of a time server, connected to a device that receives signals from a source of UTC, to synchronize computers externally. Round-trip times between processes are often reasonably short in practice, yet theoretically unbounded. Practical estimation is possible if round-trip times are sufficiently short in comparison to the required accuracy.

Principle: A time server S receives signals from a UTC source.

– Process p requests the time in a message mr and receives the time t in a message mt from S.
– p sets its clock to t + Tround/2, where Tround is the round-trip time recorded by p.
– min is an estimated minimum one-way transmission time.
– Accuracy is ±(Tround/2 - min): the earliest time at which S could have placed t in mt is min after p sent mr, and the latest is min before mt arrived at p, so the time by S's clock when mt arrives is in the range [t + min, t + Tround - min].

Fig 2.1 Cristian’s algorithm

Problems in this system:

Timer must never run backward.

Variable delays occur in message passing and delivery.
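The clock-setting rule above can be sketched in a few lines (a hypothetical helper, not from the notes): the client adopts t + Tround/2 and knows the result is accurate to within ±(Tround/2 - min).

```python
# Sketch of Cristian's estimate: the client sets its clock to
# t + Tround/2, where t is the server's reply and Tround the measured
# round-trip time; the accuracy bound is +/-(Tround/2 - min).

def cristian_estimate(t_server: float, t_round: float, t_min: float):
    """Return (estimated current time, accuracy bound)."""
    estimate = t_server + t_round / 2
    accuracy = t_round / 2 - t_min
    return estimate, accuracy

# Example: server replies t = 1000.0 s, round trip 20 ms, min 5 ms.
est, acc = cristian_estimate(t_server=1000.0, t_round=0.020, t_min=0.005)
# The true time lies somewhere in [est - acc, est + acc].
```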

2.3.2 Berkeley algorithm

Berkeley algorithm was developed to solve the problems of Cristian’s algorithm. This algorithm does not need external synchronization.

Master slave approach is used here. The master polls the slaves periodically about their clock readings.

Estimates of the slaves' local clock times are obtained using round-trip measurements, and an average is computed over the group of processes.

This method cancels out individual clocks' tendencies to run fast or slow, and the master tells each slave process by what amount of time to adjust its local clock. In case of master failure, a master-election algorithm is used.
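The master's averaging step can be sketched as follows (function and node names are illustrative, not from the notes): average the collected readings and send each node the offset it must apply, rather than an absolute time.

```python
# Sketch of the Berkeley averaging step: the master collects clock
# readings, computes their average, and sends each node the adjustment
# (a relative offset) it should apply to its local clock.

def berkeley_adjustments(readings: dict) -> dict:
    """Map each node to the offset it must apply to reach the average."""
    avg = sum(readings.values()) / len(readings)
    return {node: avg - value for node, value in readings.items()}

# Example: the average of 100, 95 and 105 is 100, so s1 is told to
# advance by 5 and s2 to fall back by 5.
adj = berkeley_adjustments({"master": 100.0, "s1": 95.0, "s2": 105.0})
```

Sending relative adjustments rather than absolute times avoids adding the message transmission delay into the correction.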


2.3.3 Network Time Protocol (NTP)

The Network Time Protocol defines architecture for a time service and a protocol to distribute time information over the Internet.

Features of NTP:

To provide a service enabling clients across the Internet to be synchronized accurately to UTC.

To provide a reliable service that can survive lengthy losses of connectivity.

To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers.

To provide protection against interference with the time service, whether malicious or accidental.

Fig 2.2: NTP strata

The NTP service is provided by a network of servers located across the Internet.

Primary servers are connected directly to a time source such as a radio clock receiving UTC.

Secondary servers are synchronized with primary servers.

The servers are connected in a logical hierarchy called a synchronization subnet.

Arrows denote synchronization control, numbers denote strata. The levels are called strata.

Working:

NTP follows a layered client-server architecture, based on UDP message passing.

Synchronization at clients with higher stratum numbers is less accurate, due to the increased latency to a stratum 1 time server.

If a stratum 1 server's UTC source fails, it may become a stratum 2 server that is synchronized through another stratum 1 server.

Modes of NTP:

NTP works in the following modes:

Multicast:

One computer periodically multicasts time info to all other computers on network.

These adjust their clocks assuming a very small transmission delay.


Only suitable for high-speed LANs; yields low but usually acceptable synchronization accuracy.

Procedure-call:

This is similar to Cristian's protocol. Here the server accepts requests from clients.

This is applicable where higher accuracy is needed, or where multicast is not supported by the network's hardware and software.

Symmetric:

This is used where high accuracy is needed.

Working of Procedure call and symmetric modes:

All messages carry timing history information.

The history includes the local timestamps of the send and receive of the previous NTP message, and the local timestamp of the send of this message.

For each pair i of messages (m, m') exchanged between two servers, the following values are computed:

– offset oi: an estimate of the actual offset between the two clocks

– delay di: the true total transmission time for the pair of messages.

Fig 2.3: Message exchange between NTP peers

Delay and offset:

Let o be the true offset of B's clock relative to A's clock, and let t and t' be the true transmission times of m and m' (Ti, Ti-1, … are not true times).

Ti-2 = Ti-3 + t + o .................... (1)

Ti = Ti-1 + t' - o .................... (2)

which leads to the delay di = t + t' = (Ti-2 - Ti-3) + (Ti - Ti-1) (the clock errors cancel out, giving an (almost) true d)

and the offset oi = ½[(Ti-2 - Ti-3) + (Ti-1 - Ti)] (only an estimate)
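The two estimates follow directly from the four timestamps; a minimal sketch (assumed function name and example values, not from the notes):

```python
# Sketch of the NTP offset/delay estimates from the four timestamps:
# Ti-3 = send at A, Ti-2 = receive at B, Ti-1 = send at B, Ti = receive at A.

def ntp_estimates(t3: float, t2: float, t1: float, t0: float):
    """Return (delay d_i, offset o_i) from timestamps Ti-3, Ti-2, Ti-1, Ti."""
    d = (t2 - t3) + (t0 - t1)        # total transmission time t + t'
    o = ((t2 - t3) + (t1 - t0)) / 2  # estimated offset of B's clock from A's
    return d, o

# Example: B's clock is 3 s ahead of A's, with a 1 s transmission each way.
# A sends at 100, B receives at 104 (true 101), B replies at 105 (true 102),
# A receives at 103.
d, o = ntp_estimates(t3=100.0, t2=104.0, t1=105.0, t0=103.0)
# d recovers the 2 s total transmission time; o recovers the 3 s offset.
```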

Implementing NTP

Statistical algorithms based on the 8 most recent pairs are used in NTP to determine the quality of the estimates.


The value of oi that corresponds to the minimum di is chosen as the estimate for o.

A time server communicates with multiple peers, eliminates peers with unreliable data, and favors peers with lower stratum numbers (e.g., for primary synchronization partner selection).

NTP phase lock loop model: modify local clock in accordance with observed drift rate.

Experiments achieve synchronization accuracies of 10 ms over the Internet, and 1 ms on a LAN, using NTP.

2.4 LOGICAL TIME AND LOGICAL CLOCKS

Processes that do not interact need not synchronize their clocks.

But when processes interact, their events must be ordered. This is done by logical clocks.

Consider a distributed system with n processes, p1, p2, …pn.

Each process pi executes on a separate processor, with no memory sharing.

Each pi has a state si. The process execution is a sequence of events.

As a result of an event, the local state of the process may change, or a message may be sent or received.

2.4.1. Lamport Ordering of Events

The partial ordering obtained by generalizing the relationship between the events of processes is called the happened-before relation (also causal ordering or potential causal ordering). This term was coined by Lamport. Happened-before defines a partial order on the events in a distributed system; some events cannot be placed in the order.

We say e →i e’ if e happens before e’ at process i.

e → e’ is defined using the following rules:

HB1: Local ordering: e → e' if e →i e' for any process i

HB2: Messages: send(m) → receive(m) for any message m

HB3: Transitivity: e → e” if e → e’ and e’ → e”

2.4.2 Logical Clocks

A Lamport logical clock is a monotonically increasing software counter, whose value need bear no particular relationship to any physical clock.

Lamport’s Algorithm

Assume that each process i keeps a local clock, Li. There are three rules:

1. At process i, increment Li before each event.

2. To send a message m at process i, apply rule 1 and then include the current local time in the message: i.e., send(m,Li).

3. To receive a message (m,t) at process j, set Lj = max(Lj, t) and then apply rule 1 before timestamping the receive event.
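The three rules can be sketched as a tiny class (the class and method names are assumptions for illustration, not from the notes):

```python
# Minimal Lamport clock sketch implementing the three rules above.

class LamportClock:
    def __init__(self):
        self.time = 0

    def event(self):
        # Rule 1: increment before each event.
        self.time += 1
        return self.time

    def send(self):
        # Rule 2: tick, then attach the current local time to the message.
        return self.event()

    def receive(self, t):
        # Rule 3: take the max of local time and message time, then tick.
        self.time = max(self.time, t)
        return self.event()

a, b = LamportClock(), LamportClock()
t = a.send()      # a's send event gets timestamp 1
b.event()         # an internal event at b gets timestamp 1
r = b.receive(t)  # b's receive event gets max(1, 1) + 1 = 2
```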


• A Lamport clock L orders events consistently with the logical happened-before ordering: if e → e', then L(e) < L(e').

• But the converse is not true.

L(e) < L(e’) does not imply e → e’

• Similar rules for concurrency are:

L(e) = L(e’) implies e║e’ (for distinct e,e’)

e║e’ does not imply L(e) = L(e’)

This implies that Lamport clocks arbitrarily order some concurrent events.

Fig 2.4: Lamport’s timestamps for the three process

The global time L(e) of an event e is just its local time. For an event e at process i, L(e) = Li(e).

Totally ordered logical clocks

Many systems require a total-ordering of events, not a partial-ordering.

Use Lamport's algorithm, but break ties using the process ID.

• L(e) = M * Li(e) + i, where M is the maximum number of processes.
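A short illustration of the tie-breaking formula (M = 10 here is an assumed bound, not from the notes):

```python
# Sketch of the total order above: L(e) = M * Li(e) + i, where Li(e) is
# the Lamport time of e at process i and M bounds the number of processes.

M = 10  # assumed upper bound on the number of processes

def total_order(local_time: int, pid: int) -> int:
    return M * local_time + pid

# Two concurrent events with equal Lamport time 3 at processes 1 and 2
# are now ordered deterministically by process ID: 31 < 32.
lo = total_order(3, 1) < total_order(3, 2)
```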

Vector clocks

The main aim of vector clocks is to produce an ordering that matches causality: V(e) < V(e') if and only if e → e', where V(e) is the vector clock value of event e. To implement a vector clock, label each event e with a vector V(e) = [c1, c2, …, cn], where ci is the number of events in process i that causally precede e.

• Each processor keeps a vector of values, instead of a single value.

• VCi is the clock at process i; it has a component for each process in the system.

VCi[i] corresponds to Pi's local "time".

VCi[j] represents Pi's knowledge of the "time" at Pj (the number of events that Pi knows have occurred at Pj).

• Each processor knows its own time exactly, and updates the values of other processors' clocks based on timestamps received in messages.
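The update rules can be sketched as follows (class and helper names are assumptions for illustration, not from the notes): each process ticks its own component on every event and merges component-wise maxima on receive, and causality can then be tested by component-wise comparison.

```python
# Minimal vector clock sketch: one component per process; tick own
# component on each event, merge component-wise maxima on receive.

class VectorClock:
    def __init__(self, pid: int, n: int):
        self.pid, self.v = pid, [0] * n

    def event(self):
        self.v[self.pid] += 1
        return list(self.v)          # timestamp of this event

    def send(self):
        return self.event()          # tick, then attach vector to message

    def receive(self, other):
        # Merge: component-wise max with the incoming vector, then tick.
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        return self.event()

def happened_before(u, v):
    """u -> v iff u <= v component-wise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
m = p0.send()                 # [1, 0]
r = p1.receive(m)             # [1, 1]
# The send causally precedes the receive, and the test recovers that.
```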


Fig 2.5: Vector Timestamps

Vector timestamps have the disadvantage, compared with Lamport timestamps, of taking up an amount of storage and message payload that is proportional to N, the number of processes.

2.5 GLOBAL STATES

This section examines how to check whether a property is true in a distributed system. This is relevant for the following problems:

• distributed garbage collection

• deadlock detection

• termination detection

• debugging

Distributed Garbage Collection

The global state of a distributed system consists of the local state of each process, together with the messages that are currently in transit, that is, that have been sent but not delivered.

An object is considered to be garbage if there are no longer any references to it anywhere in the distributed system.

Fig 2.6: Distributed Garbage Collection

To check that an object is garbage, verify that there are no references to it anywhere.

The process p1 has two objects that both have references:

• One has a reference within p1 itself

• p2 has a reference to the other. Process p2 also has one garbage object, with no references to it anywhere in the system.

It also has an object for which neither p1 nor p2 has a reference, but there is a reference to it in a message that is in transit between the processes.

Distributed deadlock detection:


A distributed deadlock occurs when each of a collection of processes waits for another process to send it a message, and there is a cycle in the graph of this 'waits-for' relationship. In the following figure, processes p1 and p2 are each waiting for a message from the other, so this system will never make progress.

Fig 2.7: Distributed deadlock detection

Termination detection

The termination detection problem is to test whether each process has halted, i.e. to find whether each process is active or passive.

A passive process is not engaged in any activity of its own, but is prepared to respond with a value requested by another process.

The phenomena of termination and deadlock are similar:

A deadlock may affect only a subset of the processes in a system, whereas termination requires that all processes have halted.

Passivity is not the same as waiting in a deadlock cycle: a deadlocked process is attempting to perform a further action for which another process waits; a passive process is not engaged in any activity.

Fig 2.8: Termination of process

Consider an application with processes pi, i = 1, 2, …, N, each with a variable xi. As the program executes, the variables may change value, and each value must stay within a given range. Relationships between the variables must be evaluated only over values that held at the same time.

Global States and Consistency cuts (Distributed Snapshots)

A distributed snapshot represents a state in which the distributed system might have been. A snapshot of the system is a single configuration of the system.


A distributed snapshot should reflect a consistent state. A global state is consistent if it could have been observed by an external observer. For a successful global state, all local states must be consistent.

If we have recorded that a process P has received a message from a process Q, then we should have also recorded that process Q had actually sent that message.

Otherwise, a snapshot will contain the recording of messages that have been received but never sent.

The reverse condition (Q has sent a message that P has not received) is allowed.

The notion of a global state can be graphically represented by what is called a cut. A cut represents the last event that has been recorded for each process.

The history of each process is given by:

history(pi) = hi = <ei^0, ei^1, ei^2, …>

Each event is either an internal action of the process or the sending or receiving of a message. We denote by si^k the state of process pi immediately before the kth event occurs. The state si in the global state S corresponding to the cut C is that of pi immediately after the last event processed by pi in the cut, ei^ci. The set of events {ei^ci} is called the frontier of the cut.

Fig 2.9: Types of cuts

In the above figure, <e1^1, e2^0> and <e2^2, e1^3> are the frontiers. The frontier <e1^1, e2^0> is inconsistent because at p2 it includes the receipt of the message m1, but at p1 it does not include the sending of that message. The frontier <e2^2, e1^3> is consistent, since it includes both the sending and the receipt of message m1, and the sending but not the receipt of message m2.

A consistent global state is one that corresponds to a consistent cut. We may characterize the execution of a distributed system as a series of transitions between global states of the system:

S0 → S1 → S2 → …

A run is a total ordering of all the events in a global history that is consistent with each local history's ordering. A linearization (or consistent run) is an ordering of the events in a global history that is consistent with the happened-before relation → on H. A linearization is also a run.


Global state predicates, stability, safety and liveness

Detecting a condition such as deadlock or termination amounts to evaluating a global state predicate.

A global state predicate is a function that maps from the set of global states of the processes in the system to {True, False}.

One useful characteristic of the predicates associated with an object being garbage, the system being deadlocked, or the system being terminated is that they are all stable: once the system enters a state in which the predicate is True, it remains True in all future states reachable from that state.

By contrast, when we monitor or debug an application we are often interested in non-stable predicates, such as that in our example of variables whose difference is supposed to be bounded.

Snapshot algorithm of Chandy and Lamport

Assumptions of the algorithm

The algorithm to determine global states records process states locally and the states are collected by a designated server.

No failures in channels and processes – exactly once delivery.

Unidirectional channels with FIFO-ordered message delivery.

There is always a path between two processes.

Global snapshot initiation at any process at any time.

No process halts during the snapshot.

Fig 2.10: Process and organization in distributed systems

Algorithm:

If a process Q receives the marker requesting a snapshot for the first time, it considers the process that sent the marker as its predecessor.

When Q completes its part of the snapshot, it sends its predecessor a DONE message.

By recursion, when the initiator of the distributed snapshot has received a DONE message from all its successors, it knows that the snapshot has been completely taken.
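Besides the DONE propagation described above, the core of Chandy-Lamport is the marker rule at each process; a compact single-process sketch (the class structure and names are assumptions for illustration, not the notes' full algorithm): on the first marker a process records its own state and starts recording every incoming channel; on each later marker it stops recording that channel.

```python
# Compact sketch of the Chandy-Lamport marker rules at one process.

MARKER = "MARKER"

class Process:
    def __init__(self, name, state, in_channels):
        self.name, self.state = name, state
        self.recorded_state = None
        self.recording = {}            # channel -> messages recorded in transit
        self.pending = set(in_channels)  # channels still awaiting a marker

    def start_snapshot(self):
        # On first marker: record own state, then record every incoming
        # channel until a marker arrives on it.
        self.recorded_state = self.state
        self.recording = {c: [] for c in self.pending}

    def on_message(self, channel, msg):
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot()
                self.recording[channel] = []  # this channel's state is empty
            self.pending.discard(channel)     # stop recording this channel
        elif self.recorded_state is not None and channel in self.pending:
            self.recording[channel].append(msg)  # message was in transit

p = Process("q", state=42, in_channels={"c1", "c2"})
p.on_message("c1", MARKER)   # first marker: record state; c1 is empty
p.on_message("c2", "m1")     # m1 was in transit on c2: recorded
p.on_message("c2", MARKER)   # q's part of the snapshot is now complete
```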


2.6 DISTRIBUTED DEBUGGING:

We now examine the problem of recording a system's global state so that we may make useful statements about whether a transitory state, as opposed to a stable state, occurred in an actual execution.

The observed processes pi, i = 1, 2, …, N send their initial state to a monitor process initially, and thereafter from time to time, in state messages. The monitor records the state messages from each process pi in a separate queue Qi, for each i = 1, 2, …, N.

In order for the monitor to distinguish consistent global states from inconsistent ones, the observed processes enclose their vector clock values with their state messages. Each queue Qi is kept in sending order, which can be established immediately by examining the ith component of the vector timestamps. The monitor may deduce nothing about the ordering of states sent by different processes from their arrival order, because of variable message latencies; it must instead examine the vector timestamps of the state messages.

Let S = (s1, s2, …, sN) be a global state drawn from the state messages that the monitor has received, and let V(si) be the vector timestamp of the state si received from pi. Then it can be shown that S is a consistent global state if and only if:

V(si)[i] ≥ V(sj)[i], for i, j = 1, 2, …, N (Condition CGS)

This says that the number of pi's events known at pj when it sent sj is no more than the number of events that had occurred at pi when it sent si. In other words, if one process's state depends upon another (according to the happened-before ordering), then the global state also encompasses the state upon which it depends.
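Condition CGS is a direct pairwise check over the vector timestamps; a minimal sketch (assumed function name, not from the notes):

```python
# Sketch of condition CGS: a candidate global state assembled from state
# messages is consistent iff V(si)[i] >= V(sj)[i] for all i, j.

def is_consistent(vts):
    """vts[i] is the vector timestamp of process i's candidate state si."""
    n = len(vts)
    return all(vts[i][i] >= vts[j][i] for i in range(n) for j in range(n))

# p0's state reflects 2 of its own events, but p1's state claims to have
# seen 3 events of p0: p1 depends on a p0 state the candidate lacks.
bad = is_consistent([[2, 0], [3, 1]])   # inconsistent
ok = is_consistent([[3, 0], [3, 1]])    # consistent
```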

Possibly Φ: there is a consistent global state S through which a linearization of H passes such that Φ(S) is True.

Definitely Φ: for all linearizations L of H, there is a consistent global state S through which L passes such that Φ(S) is True.

Evaluating possibly Φ

To evaluate possibly Φ, the monitor must traverse the lattice of reachable states, starting from the initial state. The monitor may discover the set of consistent states in level L+1 reachable from a given consistent state in level L by the following method. Let S = (s1, s2, …, sN) be a consistent state. Then a consistent state in the next level reachable from S is of the form S', which differs from S only by containing the next state (after a single event) of some process pi. The monitor can find all such states by traversing the queues of state messages Qi, i = 1, 2, …, N. The state S' is reachable from S if and only if:

V(sj)[j] ≥ V(si)[j], for j = 1, 2, …, N, j ≠ i

Evaluating definitely Φ

To evaluate definitely Φ, the monitor again traverses the lattice of reachable states a level at a time, starting from the initial state. It maintains the set States, which contains those states at the current level that may be reached on a linearization from the initial state by traversing only states for which Φ evaluates to False. As long as such a linearization exists, we may not assert definitely Φ: the execution could have taken this linearization, and Φ would be False at every stage along it. If we reach a level for which no such linearization exists, we may conclude definitely Φ.


Evaluating possibly Φ and definitely Φ in synchronous systems

The algorithms given so far work in an asynchronous system: we have made no timing assumptions. The price paid for this is that the monitor may examine a consistent global state S in which two local states si and sj occurred an arbitrarily long time apart in the actual execution of the system.

In a synchronous system, suppose that the processes keep their physical clocks internally synchronized within a known bound, and that the observed processes provide physical timestamps as well as vector timestamps in their state messages. Then the monitor need consider only those consistent global states whose local states could possibly have existed simultaneously, given the approximate synchronization of the clocks. With good enough clock synchronization, these will be far fewer than all globally consistent states.

We now give an algorithm to exploit synchronized clocks in this way. We assume that each observed process pi, i = 1, 2, …, N, and the monitor, which we shall call p0, keep physical clocks Ci. These are synchronized to within a known bound D > 0; that is, at the same real time:

|Ci(t) - Cj(t)| < D, for i, j = 1, 2, …, N

The observed processes send both their vector time and their physical time with their state messages to the monitor. The monitor now applies a condition that not only tests for consistency of a global state S (condition CGS) but also tests whether each pair of states could have occurred at the same real time, given the physical clock values.

For the second test, note that pi is in the state si from the time it first notified the monitor, Ci(si), to some later local time Li(si) when the next state transition occurs at pi. For si and sj to have held at the same real time we thus require, allowing for the bound on clock synchronization:

Ci(si) - D ≤ Cj(sj) ≤ Li(si) + D

The monitor must calculate a value for Li(si), which is measured against pi's clock. If the monitor has received a state message for pi's next state si', then Li(si) is Ci(si'). Otherwise, the monitor estimates Li(si) as C0 - max + D, where C0 is the monitor's current local clock value and max is the maximum transmission time for a state message.


2.7 COORDINATION AND AGREEMENT

Introduction

When more than one process runs in an execution environment, the processes need to coordinate their actions. To solve coordination problems, fixed master-slave relationships should be avoided, since a fixed master is a single point of failure. Distributed mutual exclusion can also be used for resource sharing.

When a collection of processes share resources, mutual exclusion is needed to prevent interference and ensure consistency. In a distributed system there are no shared variables or facilities provided by a single local kernel with which to solve this, so the solution must be based on message passing. The main issue to be addressed while implementing such a method is failure.

Failure Assumptions and Failure Detectors

The message-passing paradigm assumes reliable communication channels.

But there can be process failures, i.e. a whole process may crash.

A failure detector is an object or code module within a process that detects failures of other processes.

There are two types of failure detectors: unreliable failure detector and reliable failure detector.

The following are features of the unreliable failure detectors:

• Its verdicts are Unsuspected or Suspected, i.e. there may be no actual evidence of failure

• Each process sends an "alive" message to every other process

• A process from which no "alive" message is received within the timeout is suspected

• This is what is present in most practical systems

The following are features of the reliable failure detectors:

• Its verdicts are Unsuspected or Failed

• Reliable failure detectors can exist only in a synchronous system


2.8 DISTRIBUTED MUTUAL EXCLUSION

Mutual exclusion ensures that concurrent processes access shared resources or data in a serialized way. If a process, say pi, is executing in its critical section, then no other process can be executing in its critical section.

Distributed mutual exclusion provides a critical region in a distributed environment.

2.8.1 Algorithms for Distributed Mutual Exclusion

Consider N processes that do not fail, and assume the message delivery system is reliable. The methods for the critical section (CS) are:

enter(): enter the critical section, blocking if necessary

resourceAccesses(): access the shared resources within the critical section

exit(): leave the critical section; other processes may now enter

The following are the requirements for Mutual Exclusion (ME):

[ME1] safety: only one process at a time

[ME2] liveness: eventually enter or exit

[ME3] happened-before ordering: ordering of enter() is the same as HB ordering

The second requirement implies freedom from both deadlock and starvation; freedom from starvation is a fairness condition.

Performance Evaluation:

The following are the criteria for performance measures:

Bandwidth consumption, which is proportional to the number of messages sent in each entry and exit operation.

The client delay incurred by a process at each entry and exit operation.

Throughput of the system: the rate at which the collection of processes as a whole can access the critical section.

Central Server Algorithm

This employs the simplest way to grant permission to enter the critical section by using a server.

A process sends a request message to the server and awaits a reply from it.

The reply constitutes a token signifying permission to enter the critical section.

If no other process has the token at the time of the request, then the server replies immediately with the token.

If the token is currently held by another process, then the server does not reply but queues the request.

On exiting the critical section, the client sends a message to the server, giving it back the token.

Fig 2.11: Central Server Algorithm

The central server algorithm fulfils ME1 and ME2 but not ME3, i.e. safety and liveness are ensured but ordering is not satisfied. The performance of the algorithm is measured as follows:

Bandwidth: measured by the entering and exiting messages. Entering takes two messages (a request followed by a grant), which are delayed by the round-trip time. Exiting takes one release message, and does not delay the exiting process.

Throughput is measured by the synchronization delay: the round-trip of a release message and a grant message.
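The server's token handling can be sketched as a minimal single-threaded Python model (an illustration, not the textbook's code: messages are modelled as direct method calls and the process names are made up):

```python
from collections import deque

class CentralServer:
    """Toy model of the central-server mutual exclusion algorithm."""
    def __init__(self):
        self.holder = None      # process currently holding the token
        self.queue = deque()    # queued requests, FIFO

    def request(self, pid):
        """A client's request message: returns True if the token is
        granted immediately, False if the request was queued."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """A client returns the token on exiting its critical section;
        the server grants it to the head of the queue, if any."""
        assert self.holder == pid, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder
```

Note that grants are issued in queue (FIFO) order of arrival at the server, which satisfies liveness but, as stated above, not happened-before ordering (ME3).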

Ring Based Algorithm

This provides the simplest way to arrange mutual exclusion between N processes without requiring an additional process: arrange them in a logical ring.

Each process pi has a communication channel to the next process in the ring, p(i+1) mod N.

The unique token is a message passed from process to process around the ring in a single direction, say clockwise.

If a process does not require entry to the CS when it receives the token, then it immediately forwards the token to its neighbour.

A process that requires the token waits until it receives it, and then retains it.

To exit the critical section, the process sends the token on to its neighbour.

Fig 2.12: Ring based algorithm

This algorithm satisfies ME1 and ME2 but not ME3, i.e. safety and liveness are satisfied but not ordering. The performance measures include:

Bandwidth: the token continuously consumes bandwidth, except when a process is inside the CS. Exit requires only one message.

Delay: the entry delay experienced by a process ranges from zero messages (it has just received the token) to N messages (it has just passed the token on).

Throughput: the synchronization delay between one exit and the next entry is anywhere from 1 (the next process in the ring) to N (the same process again) message transmission times.
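The token's circulation can be illustrated with a small simulation (a sketch under the assumption that the set of requesting positions is fixed in advance; the name `wants` is illustrative):

```python
def ring_token_schedule(n, wants, start=0, rounds=2):
    """Pass the token clockwise around n ring positions, beginning at
    `start`; a position in `wants` enters the CS once when it holds the
    token. Returns the order in which positions entered the CS."""
    order = []
    pending = set(wants)
    p = start
    for _ in range(rounds * n):
        if p in pending:
            order.append(p)        # enter CS, then release
            pending.discard(p)
        p = (p + 1) % n            # forward token to the neighbour
    return order
```

Entry order is determined purely by ring position relative to the token, which is why happened-before ordering (ME3) is not satisfied.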

2.8.2 Multicast Synchronisation

This algorithm (due to Ricart and Agrawala) achieves mutual exclusion between N peer processes using multicast.

Processes that require entry to a critical section multicast a request message, and can enter it only when all the other processes have replied to this message.


The conditions under which a process replies to a request are designed to ensure that ME1, ME2 and ME3 are met.

Each process pi keeps a Lamport clock. Messages requesting entry are of the form <T, pi>, where T is the sender's timestamp.

Each process records its state of RELEASED, WANTED or HELD in a variable state.

If a process requests entry and all other processes are in state RELEASED, then all processes reply immediately.

If some process is in state HELD, then that process will not reply until it has finished with the critical section.

If some process is in state WANTED and has a smaller timestamp than the incoming request, it will queue the request until it has finished.

If two or more processes request entry at the same time, then whichever bears the lowest timestamp will be the first to collect N−1 replies.

Fig 2.13: Multicast Synchronisation

In the above figure, P1 and P2 request the CS concurrently. The timestamp of P1's request is 41 and that of P2's is 34.

When P3 receives their requests, it replies immediately.

When P2 receives P1's request, it finds that its own request has the lower timestamp, and so does not reply, holding P1's request in its queue.

P1, however, replies to P2, and P2 enters the CS. After P2 finishes, it replies to P1, and P1 enters the CS.
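The reply rule just described can be written as a small predicate (a sketch; breaking ties on equal Lamport timestamps by process identifier is the standard convention):

```python
def should_defer(state_j, ts_j, pid_j, ts_i, pid_i):
    """Decide at process pj whether to defer the reply to a request
    <ts_i, pid_i> from pi, given pj's state and own request <ts_j, pid_j>."""
    if state_j == "HELD":
        return True                      # reply only after exiting the CS
    if state_j == "WANTED":
        # pj defers iff its own request is earlier; ties broken by id.
        return (ts_j, pid_j) < (ts_i, pid_i)
    return False                         # RELEASED: reply immediately
```

With the figure's values, a WANTED process with timestamp 34 defers a request timestamped 41, while a WANTED process with timestamp 41 replies to a request timestamped 34.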

Granting entry takes 2(N-1) messages, N-1 to multicast request and N-1 replies.


Performance Evaluation:

Bandwidth consumption is high.

Client delay is again one round-trip time.

Synchronization delay is one message transmission time.

Maekawa’s Voting Algorithm

In this algorithm, it is not necessary for a process to gather replies from all of its peers: it need only obtain permission to enter from a subset of the peers, as long as the subsets used by any two processes overlap.

Think of processes as voting for one another to enter the CS. A candidate process must collect sufficient votes to enter. Processes in the intersection of two sets of voters ensure the safety property ME1 by casting their votes for only one candidate at a time.

A voting set Vi is associated with each process pi.

There must be at least one common member of any two voting sets, and for fairness all voting sets have the same size, K.

The optimal solution that minimizes K is K ≈ √N, with M = K.

• The voting sets are chosen as follows:

Vi ⊆ {p1, p2, … pN}, such that for all i, j = 1, 2, … N:

pi ∈ Vi

Vi ∩ Vj ≠ ∅ (any two voting sets have at least one common member)

|Vi| = K (all voting sets have the same size)

Each process pi is contained in M of the voting sets Vj
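A standard way to construct such voting sets (an approximation; Maekawa's optimal K ≈ √N construction uses finite projective planes) is to place N = k×k processes on a k×k grid and let Vi be the union of pi's row and column. Any two such sets intersect, and |Vi| = 2k − 1 ≈ 2√N:

```python
import math

def grid_voting_sets(n):
    """Grid construction of voting sets for n = k*k processes:
    V_i = row(i) ∪ column(i). Every pair of sets intersects, and
    each process belongs to exactly 2k - 1 sets."""
    k = math.isqrt(n)
    assert k * k == n, "this sketch assumes n is a perfect square"
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + j for j in range(k)}        # pi's whole row
        col = {j * k + c for j in range(k)}        # pi's whole column
        sets.append(row | col)
    return sets
```

The pairwise-intersection property is what guarantees ME1: any two candidates share at least one voter, who votes for only one of them at a time.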

Maekawa’s Voting Algorithm

On initialization
    state := RELEASED; voted := FALSE;

For pi to enter the critical section
    state := WANTED;
    Multicast request to all processes in Vi;
    Wait until (number of replies received = K);
    state := HELD;

On receipt of a request from pi at pj
    if (state = HELD or voted = TRUE)
    then
        queue request from pi without replying;
    else
        send reply to pi;
        voted := TRUE;
    end if

For pi to exit the critical section
    state := RELEASED;
    Multicast release to all processes in Vi;

On receipt of a release from pi at pj
    if (queue of requests is non-empty)
    then
        remove head of queue – from pk, say;
        send reply to pk;
        voted := TRUE;
    else
        voted := FALSE;
    end if

ME1 is met: if two processes could enter the CS at the same time, the processes in the intersection of their two voting sets would have to have voted for both, but the algorithm allows a process to cast at most one vote between successive receipts of a release message, a contradiction.

This algorithm is, however, deadlock-prone. If three processes concurrently request entry to the CS, then it is possible for p1 to reply to itself and hold off p2, for p2 to reply to itself and hold off p3, and for p3 to reply to itself and hold off p1.

Each process has then received only some of the replies it needs, and none can proceed.

If processes queue outstanding requests in happened-before order, ME3 can be satisfied and the algorithm is also deadlock-free.

Performance Evaluation:

Bandwidth utilization is 2√N messages per entry to the CS and √N per exit.

Client delay is the same as for Ricart and Agrawala's algorithm: one round-trip time.

Synchronization delay is one round-trip time, which is worse than Ricart and Agrawala's one message transmission time.

Fault Tolerance

Fault tolerance concerns how the algorithms react when messages are lost or when a process crashes.

None of the algorithms described above would tolerate the loss of messages if the channels were unreliable.

The ring-based algorithm cannot tolerate any single process crash failure.

Maekawa's algorithm can tolerate some process crash failures: a crash is tolerated if the crashed process is not in a voting set that is required.

The central server algorithm can tolerate the crash failure of a client process that neither holds nor has requested the token.

2.9 ELECTIONS

An algorithm for choosing a unique process to play a particular role is called an election algorithm. Many algorithms used in distributed systems require a coordinator, and in general all processes in the distributed system are equally suitable for the role; election algorithms are designed to choose one of them as coordinator.

Any process can serve as coordinator, and any process can "call an election", i.e. initiate the algorithm to choose a new coordinator. Elections may be needed when the system is initialized, or if the coordinator crashes or retires.

Assumptions: Every process/site has a unique identifier, e.g. its network address or a process number. Every process in the system knows the full set of identifiers, although not which processes are up or down. The process with the highest identifier among the non-crashed processes will be the new coordinator.

Requirements: When the election algorithm terminates, a single process has been selected and every process knows its identity. Every process pi has a variable elected_i to hold the coordinator's identifier: for all i, elected_i = undefined or elected_i = P, where P is the non-crashed process with the highest identifier, and all non-crashed processes eventually set elected_i = P.

E1 (safety): a participant pi has elected_i = ⊥ (undefined) or elected_i = P, where P is the non-crashed process with the largest identifier at the end of the run.

E2 (liveness): all processes pi participate in the election and eventually set elected_i ≠ ⊥, or crash.

Ring based Election Algorithm

All the processes arranged in a logical ring.

Each process has a communication channel to the next process.

All messages are sent clockwise around the ring.

Assume that no failures occur and that the system is asynchronous.


The ultimate goal is to elect a single process coordinator which has the largest identifier.

Fig 2.14: Election process using the ring-based election algorithm

Steps in the election process:

1. Initially, every process is marked as a non-participant. Any process can begin an election.

2. The starting process marks itself as a participant and places its identifier in a message to its neighbour.

3. A process that receives a message compares the identifier in it with its own. If the arrived identifier is larger, it passes the message on.

4. If the arrived identifier is smaller and the receiver is not a participant, it substitutes its own identifier in the message and forwards it. It does not forward the message if it is already a participant.

5. On forwarding a message in either case, the process marks itself as a participant.

6. If the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator.

7. The coordinator marks itself as a non-participant, sets elected_i, and sends an elected message to its neighbour enclosing its identifier.

The election was started by process 17. The highest process identifier encountered so far is 24.

Requirements:

E1 is met: all identifiers are compared, since a process must receive its own identifier back before sending an elected message.

E2 is also met, due to the guaranteed traversals of the ring.

Because it tolerates no failures, the ring-based algorithm is of limited practical use.

Performance Evaluation

If only a single process starts an election, the worst case is when its anti-clockwise neighbour has the highest identifier. A total of N−1 messages is then used to reach this neighbour, which does not announce its election until its identifier has completed another circuit, taking a further N messages. The elected message is then sent N times, making 3N−1 messages in all.

The turnaround time is also 3N−1 sequential message transmission times.
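The message count can be checked with a small simulation (a sketch assuming a single starter and no failures; the rule that a participant does not forward a smaller identifier matters only with concurrent starters and is omitted):

```python
def ring_election(ids, starter):
    """Simulate one election on a ring: ids[i] is the identifier at ring
    position i, and messages travel clockwise. Returns the elected
    identifier and the number of election-message hops until the winner
    receives its own identifier back (announcing it then costs N more)."""
    n = len(ids)
    p = ids.index(starter)
    token, hops = starter, 0
    while True:
        p = (p + 1) % n             # forward to the clockwise neighbour
        hops += 1
        if ids[p] == token:
            return token, hops      # own id came back: coordinator
        token = max(token, ids[p])  # larger identifier replaces smaller
```

In the worst case, with identifiers arranged so the starter's anti-clockwise neighbour is the highest, this yields 2N − 1 hops; adding the N elected messages gives the 3N − 1 total above.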

Bully Algorithm

This algorithm allows processes to crash during an election, although it assumes that message delivery between processes is reliable. It also assumes that the system is synchronous, so that timeouts can be used to detect process failure, and that each process knows which processes have higher identifiers and can communicate with all such processes. In this algorithm, there are three types of messages:

o Election message: sent to announce an election. A process begins an election when it notices, through timeouts, that the coordinator has failed; a response is expected within T = 2Ttrans + Tprocess from the time of sending, where Ttrans is the maximum message transmission time and Tprocess is the maximum message handling time.

o Answer message: sent in response to an election message.

o Coordinator message: sent to announce the identity of the elected process.

Fig 2.15: Stages in Bully Algorithm


Election process:

A process begins an election by sending an election message to those processes that have a higher identifier, and awaits an answer in response.

If none arrives within time T, the process considers itself the coordinator and sends a coordinator message to all processes with lower identifiers.

Otherwise, it waits a further time T′ for a coordinator message to arrive. If none arrives, it begins another election.

If a process receives a coordinator message, it sets its variable elected_i to the coordinator's identifier. If a process receives an election message, it sends back an answer message and begins another election, unless it has begun one already.
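A toy model of the message count (a sketch under strong assumptions: reliable instantaneous messages, no crashes during the run, and the schedule in which each successively higher process re-runs the election, which gives the worst-case count when the lowest process starts):

```python
def bully_election(alive_ids, initiator):
    """Follow the chain of elections upward from `initiator` among the
    live processes. Returns (coordinator, election messages sent)."""
    msgs = 0
    p = initiator
    while True:
        higher = [q for q in alive_ids if q > p]
        msgs += len(higher)         # one election message per higher id
        if not higher:
            return p, msgs          # no answer: p becomes coordinator
        p = min(higher)             # a higher live process takes over
```

Starting from the lowest of N live processes this gives N(N−1)/2 = O(N²) election messages; starting from the highest live process gives 0, matching the best case described below.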

Requirements:

• E1 may be broken if timeouts are inaccurate or if a crashed process is replaced.

• For example, suppose p3 crashes and is replaced by another process: p2 may set p3 as coordinator while p1 sets p2 as coordinator.

• E2 is clearly met by the assumption of reliable message delivery.

Performance Evaluation

In the best case, the process with the second-highest identifier notices the coordinator's failure. It can then immediately elect itself and send N−2 coordinator messages.

The bully algorithm requires O(N²) messages in the worst case, that is, when the process with the lowest identifier first detects the coordinator's failure. Then N−1 processes altogether begin elections, each sending election messages to the processes with higher identifiers.

2.10 MULTICAST COMMUNICATION

The system under consideration contains a collection of processes, which can communicate reliably over one-to-one channels. As before, processes may fail only by crashing. The processes are members of groups, which are the destinations of messages sent with the multicast operation. It is generally useful to allow processes to be members of several groups simultaneously. The operation multicast(g, m) sends the message m to all members of the group g of processes. Correspondingly, there is an operation deliver(m) that delivers a message sent by multicast to the calling process.

Every message m carries the unique identifier of the process sender(m) that sent it, and the unique destination group identifier group(m).

Reliable multicast

We define a reliable multicast with corresponding operations R-multicast and R-deliver.


A reliable multicast is one that satisfies the following properties:

Integrity: A correct process p delivers a message m at most once. Furthermore, p ∈ group(m) and m was supplied to a multicast operation by sender(m).

Validity: If a correct process multicasts message m, then it will eventually deliver m.

Agreement: If a correct process delivers message m, then all other correct processes in group(m) will eventually deliver m.

Reliable multicast algorithm

Types of message ordering

Three types of message ordering:

a. FIFO (first-in, first-out) ordering: if a correct process issues one multicast before another, every correct process that delivers the second message will deliver the first message before it

b. Causal ordering: if the multicast of one message happened-before the multicast of another, any correct process that delivers the second message will deliver the first message first

c. Total ordering: if a correct process delivers one message before another, any other correct process that delivers the second message will deliver the first message first

Implementing FIFO ordering

FIFO-ordered multicast (with operations FO-multicast and FO-deliver) is achieved with sequence numbers, much as we would achieve it for one-to-one communication. We shall consider only non-overlapping groups.

S_g^p is a count of how many messages p has sent to g and, for each q, R_g^q is the sequence number of the latest message p has delivered from process q that was sent to group g.

For p to FO-multicast a message to group g, it piggybacks the value S_g^p onto the message, B-multicasts the message to g and then increments S_g^p by 1. Upon receipt of a message from q bearing the sequence number S, p checks whether S = R_g^q + 1. If so, this message is the next one expected from the sender q, and p FO-delivers it, setting R_g^q := S. If S > R_g^q + 1, p places the message in the hold-back queue until the intervening messages have been delivered.

Since all messages from a given sender are delivered in the same sequence, and since a message's delivery is delayed until its sequence number has been reached, the condition for FIFO ordering is clearly satisfied.
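The receive side (FO-deliver with a hold-back queue) can be sketched as follows, assuming each sender numbers its messages 1, 2, 3, … (the class and method names are illustrative):

```python
class FifoReceiver:
    """Per-sender sequence numbering with a hold-back queue."""
    def __init__(self):
        self.expected = {}   # next expected sequence number per sender
        self.holdback = {}   # (sender, seq) -> message

    def receive(self, sender, seq, msg):
        """Called on B-delivery of msg from sender with piggybacked seq;
        returns the list of messages FO-delivered, in order."""
        self.holdback[(sender, seq)] = msg
        delivered = []
        nxt = self.expected.get(sender, 1)
        # Deliver the expected message and any held-back successors.
        while (sender, nxt) in self.holdback:
            delivered.append(self.holdback.pop((sender, nxt)))
            nxt += 1
        self.expected[sender] = nxt
        return delivered
```

A message arriving out of order simply waits in the hold-back queue; delivering the missing message then releases it and any successors in one step.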

Implementing total ordering

The basic approach to implementing total ordering is to assign totally ordered identifiers to multicast messages so that each process makes the same ordering decision based upon these identifiers.

The delivery algorithm is very similar to the one we described for FIFO ordering; the difference is that processes keep group-specific sequence numbers rather than process-specific sequence numbers. We only consider how to totally order messages sent to non-overlapping groups. We call the multicast operations TO-multicast and TO-deliver.

A process wishing to TO-multicast a message m to group g attaches a unique identifier id(m) to it. The messages for g are sent to the sequencer for g, sequencer(g), as well as to the members of g. (The sequencer may be chosen to be a member of g.) The process sequencer(g) maintains a group-specific sequence number sg which it uses to assign increasing and consecutive sequence numbers to the messages that it B-delivers. It announces the sequence numbers by B-multicasting order messages to g.

The obvious problem with a sequencer-based scheme is that the sequencer may become a bottleneck.

The ISIS algorithm for total ordering

A process B-multicasts its message to the members of the group. The group may be open or closed. The receiving processes propose sequence numbers for messages as they arrive and return these to the sender, which uses them to generate agreed sequence numbers. Each process q in group g keeps A_g^q, the largest agreed sequence number it has observed so far for group g, and P_g^q, its own largest proposed sequence number. The algorithm for process p to multicast a message m to group g is as follows:

1. p B-multicasts <m, i> to g, where i is a unique identifier for m.

2. Each process q replies to the sender p with a proposal for the message's agreed sequence number: P_g^q := max(A_g^q, P_g^q) + 1.

3. p collects all the proposed sequence numbers and selects the largest one, a, as the next agreed sequence number. It then B-multicasts <i, a> to g.

4. Each process q in g sets A_g^q := max(A_g^q, a) and attaches a to the message (which is identified by i).

Hold-back queue

The hold-back queue is ordered with the message with the smallest sequence number at the front.

When the agreed number is attached to a message, the queue is re-ordered.

When the message at the front has an agreed number, it is transferred to the delivery queue. Messages whose numbers are agreed but which are not at the front of the queue are not yet transferred.

Every process agrees on the same order and delivers messages in that order, therefore we have total ordering.
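The exchange of proposed and agreed sequence numbers (steps 2–4) can be sketched as follows (a synchronous, loss-free model; tie-breaking of equal proposals by process identifier is omitted for brevity):

```python
class IsisProcess:
    """Tracks A (largest agreed) and P (largest proposed) per group."""
    def __init__(self):
        self.A = 0
        self.P = 0

    def propose(self):
        # Step 2: propose P := max(A, P) + 1 for the new message.
        self.P = max(self.A, self.P) + 1
        return self.P

    def learn_agreed(self, a):
        # Step 4: adopt the agreed number chosen by the sender.
        self.A = max(self.A, a)

def isis_round(group):
    """Steps 2-4 for one message: collect proposals, pick the largest
    as the agreed number, and distribute it to the group."""
    a = max(q.propose() for q in group)   # step 3 at the sender
    for q in group:
        q.learn_agreed(a)
    return a
```

Because every process adopts the same agreed number for each message, all hold-back queues converge on the same delivery order.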

Implementing causal ordering

The causally ordered multicast operations are CO-multicast and CO-deliver. The algorithm takes account of the happened-before relationship only as it is established by multicast messages. Each process maintains a vector timestamp counting the precedent messages.

To CO-multicast a message to group g, a process adds 1 to its own entry in its vector timestamp and B-multicasts the message along with the timestamp to g.

When a process pi B-delivers m, it places it in a hold-back queue until the messages earlier in the causal ordering have been delivered. To establish this, pi waits until (a) earlier messages from the same sender have been delivered, and (b) any messages that the sender had delivered when it sent the multicast message have been delivered. Then it CO-delivers the message and updates its own timestamp. A process can immediately CO-deliver its own messages.
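The hold-back test can be written as a predicate on vector timestamps (a sketch: vectors are indexed by process number, and the two clauses correspond to conditions (a) and (b) above; the names are illustrative):

```python
def co_deliverable(v_msg, sender, v_local):
    """True iff a message from process `sender`, carrying vector
    timestamp v_msg, may be CO-delivered by a process whose local
    vector is v_local:
    (a) it is the next message expected from that sender, and
    (b) every message the sender had delivered before multicasting
        has also been delivered here."""
    return (v_msg[sender] == v_local[sender] + 1 and
            all(v_msg[k] <= v_local[k]
                for k in range(len(v_msg)) if k != sender))
```

A message failing the test stays in the hold-back queue and is re-tested whenever the local vector advances.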

2.11 CONSENSUS AND RELATED PROBLEMS

Problems of agreement

The consensus problem is for processes to agree on a value after one or more of them has proposed what that value should be.

The problems of Byzantine generals and interactive consistency are collectively referred to as problems of agreement.

Definition of the consensus problem

To reach consensus, every process pi begins in the undecided state and proposes a single value vi , drawn from a set D (i= 1,2… N ). The processes communicate with one another, exchanging values. Each process then sets the value of a decision variable, di. In doing so it enters the decided state, in which it may no longer change di (i= 1,2… N ).


The requirements of a consensus algorithm are that the following conditions should hold for every execution of it:

Termination: Eventually each correct process sets its decision variable.

Agreement: The decision value of all correct processes is the same: if pi and pj are correct and have entered the decided state, di= dj (i, j=1,2…N)

Integrity: If the correct processes all proposed the same value, then any correct process in the decided state has chosen that value.

The Byzantine generals problem

If processes can fail in arbitrary (Byzantine) ways, then faulty processes can in principle communicate random values to the others. This may seem unlikely in practice, but someone could deliberately make a process send different values to different peers in an attempt to thwart the others, which are trying to reach consensus.

The Byzantine generals problem differs from consensus in that a distinguished process supplies a value that the others are to agree upon, instead of each of them proposing a value. The requirements are:

Termination: Eventually each correct process sets its decision variable.

Agreement: The decision value of all correct processes is the same: if pi and pj are correct and have entered the decided state, then di= dj (i, j=1,2…N)

Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed.

Interactive consistency

The interactive consistency problem is another variant of consensus, in which every process proposes a single value. The goal of the algorithm is for the correct processes to agree on a vector of values, one for each process. We call this the ‘decision vector’.

The requirements for interactive consistency are:

Termination: Eventually each correct process sets its decision variable.

Agreement: The decision vector of all correct processes is the same.

Integrity: If pi is correct, then all correct processes decide on vi as the ith component of their vector.

Relating consensus to other problems

Consensus (C), Byzantine generals (BG) and interactive consistency (IC) are all problems concerned with making decisions in the context of arbitrary or crash failures.

We can sometimes generate solutions for one problem in terms of another.


Ci(v1, v2, …, vN) returns the decision value of pi in a run of the solution to the consensus problem, where v1, v2, …, vN are the values that the processes proposed.

BGi(j, v) returns the decision value of pi in a run of the solution to the Byzantine generals problem, where pj, the commander, proposes the value v.

ICi(v1, v2, …, vN)[j] returns the jth value in the decision vector of pi in a run of the solution to the interactive consistency problem, where v1, v2, …, vN are the values that the processes proposed.

It is possible to construct solutions out of the solutions to other problems. We give three examples:

IC from BG: We construct a solution to IC from BG by running BG N times, once with each process pi acting as the commander.

C from IC: For the case where a majority of processes are correct, we construct a solution to C from IC by running IC to produce a vector of values at each process, then applying an appropriate function on the vector’s values to derive a single value.

BG from C: We construct a solution to BG from C as follows:

The commander pj sends its proposed value v to itself and to each of the remaining processes.

All processes run C with the values v1, v2, …, vN that they receive (pj may be faulty, so these values may differ).

They derive BGi(j, v) = Ci(v1, v2, …, vN).

It must be checked that the termination, agreement and integrity conditions are preserved in each case.

Consensus in a synchronous system:

The algorithm to solve consensus in a synchronous system is based on a modified form of the integrity requirement. The algorithm uses only a basic multicast protocol. It assumes that up to f of the N processes exhibit crash failures.

The variable Values_i^r holds the set of proposed values known to process pi at the beginning of round r.

Each process multicasts the set of values that it has not sent in previous rounds.

It then takes delivery of similar multicast messages from other processes and records any new values.

The duration of a round is limited by setting a timeout based on the maximum time for a correct process to multicast a message.

After f +1 rounds, each process chooses the minimum value it has received as its decision value
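A crash-free run of this algorithm can be sketched in a few lines (an illustration only: timeouts, B-multicast and the crash cases are not modelled, so every process already sees every value after round 1):

```python
def consensus_rounds(initial, f):
    """Run f+1 rounds in which every process multicasts the values it
    has not yet sent; with no crashes all value sets coincide, and each
    process decides the minimum value it has seen."""
    n = len(initial)
    values = [{v} for v in initial]            # Values_i at round 1
    for _ in range(f + 1):
        everything = set().union(*values)      # all multicasts delivered
        values = [set(everything) for _ in range(n)]
    return [min(v) for v in values]            # decision at each process
```

The f+1 rounds matter only when crashes occur: they guarantee that even if a process crashes mid-multicast in each round, some round completes without a crash, after which all surviving processes hold the same set and therefore decide the same minimum.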

Limits for solutions to Byzantine Generals

Some cases of the Byzantine generals problem have no solution. Lamport et al. showed that with only three processes, one of which may be faulty, there is no solution. Pease et al. generalized this: there is no solution if the total number of processes is less than or equal to three times the number of failures, i.e. if N ≤ 3f.

Thus there is a solution with four processes and one failure, using two rounds: in the first, the commander sends the values, while in the second, each lieutenant sends the values it received.

Asynchronous Systems

All of the above solutions to consensus and the Byzantine generals problem apply only to synchronous systems.

Fischer et al. showed that there is no guaranteed solution in an asynchronous system, even with only one (crash) failure.

This impossibility is circumvented in practice by masking faults or by using failure detection. There is also a partial solution which assumes an adversary process and introduces random values into the processes to prevent an effective thwarting strategy; it does not always reach consensus.