Classical Distributed Algorithms with DDS
DESCRIPTION
The OMG DDS standard has seen very strong adoption as the distribution middleware of choice for a large class of mission- and business-critical systems, such as Air Traffic Control, Automated Trading, SCADA, and Smart Energy. The main reasons for choosing DDS are its efficiency, scalability, high availability and configurability (through its 20+ QoS policies). Yet all of these properties come at the cost of a relaxed consistency model with no strong guarantees over global invariants. As a result, many architects have to devise by themselves, with the DDS primitives as a foundation, correct algorithms for classical problems such as fault detection, leader election, consensus, distributed mutual exclusion, atomic multicast, and distributed queues. In this presentation we will explore DDS-based distributed algorithms for many classical, yet fundamental, problems in distributed systems. For simplicity, we'll start with algorithms that ignore the presence of failures. Then we will (1) demonstrate how these algorithms can be extended to deal with failures, and (2) introduce Paxos as one of the fundamental algorithms for consensus and atomic broadcast. Finally, we'll show how these classical algorithms can be used to implement useful extensions of the DDS semantics, such as multi-writer / multi-reader distributed queues.
TRANSCRIPT
OpenSplice DDS
Angelo CORSARO, Ph.D.
Chief Technology Officer
OMG DDS SIG Co-Chair
Classical Distributed Algorithms with DDS
[Developing Higher Level Abstractions on DDS]
Copyright 2011, PrismTech – All Rights Reserved.
Context
☐ The Data Distribution Service (DDS) provides a very useful foundation for building highly dynamic, reconfigurable, dependable and high-performance systems
☐ However, in building distributed systems with DDS one is often faced with two kinds of problems:
  ☐ How can distributed coordination problems be solved with DDS? e.g. distributed mutual exclusion, consensus, etc.
  ☐ How can higher-order primitives and abstractions be supported over DDS? e.g. fault-tolerant distributed queues, total-order multicast, etc.
☐ In this presentation we will look at how DDS can be used to implement some of the classical Distributed Algorithms that solve these problems
DDS Abstractions and Properties
Data Distribution Service
For Real-Time Systems
DDS provides a Topic-Based Publish/Subscribe abstraction based on:
☐ Topics: data distribution subjects
☐ DataWriters: data producers
☐ DataReaders: data consumers
[Figure: DDS Global Data Space, with Data Writers and Data Readers attached to TopicA, TopicB, TopicC and TopicD]
Data Distribution Service
☐ DataWriters and DataReaders are automatically and dynamically matched by the DDS Dynamic Discovery
☐ A rich set of QoS policies allows control of the existential, temporal, and spatial properties of data
DDS Topics
☐ A Topic defines a class of streams
☐ A Topic has an associated unique name, a user-defined extensible type and a set of QoS policies
☐ QoS Policies capture the Topic’s non-functional invariants
☐ Topics can be discovered or locally defined
[Figure: a Topic combines a Name (“Circle”, “Square”, “Triangle”, ...), a Type (ShapeType) and QoS (DURABILITY, DEADLINE, PRIORITY, …)]

struct ShapeType {
  @Key string color;
  long x;
  long y;
  long shapesize;
};
Topic Instances
☐ Each unique key value identifies a unique stream of data
☐ DDS not only demultiplexes “streams” but also provides lifecycle information
☐ A DDS DataWriter can write multiple instances
[Figure: a Topic of type ShapeType (keyed on color) with three Instances: color = “Green”, color = “red”, color = “Blue”]
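The key-to-instance relationship can be illustrated without any DDS API. The following is a minimal sketch in plain Scala, assuming a hypothetical case class that mirrors the IDL ShapeType; the names ShapeType (as a Scala class) and TopicInstances are illustrative only and are not part of OpenSplice or Escalier.

```scala
// Hypothetical Scala stand-in for the IDL ShapeType; `color` plays the role of the key.
final case class ShapeType(color: String, x: Int, y: Int, shapesize: Int)

object TopicInstances {
  // Demultiplex a stream of samples into per-instance streams, one per key value.
  // Within each instance the original write order is preserved.
  def instances(samples: Seq[ShapeType]): Map[String, Seq[ShapeType]] =
    samples.groupBy(_.color)
}
```

Each distinct color yields one entry in the resulting map, which is the sense in which a key value identifies a stream.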
Anatomy of a DDS Application
[Figure: the entities of a DDS application]
Domain (e.g. Domain 123)
  Domain Participant
    Topic
    Publisher
      DataWriter
    Subscriber
      DataReader
Partition (e.g. “Telemetry”, “Shapes”, ...)
Topic Instances/Samples
Channel Properties
☐ We can think of a DataWriter and its matching DataReaders as connected by a logical typed communication channel
☐ The properties of this channel are controlled by means of QoS Policies
☐ At the two extremes this logical communication channel can be:
  ☐ Best-Effort/Reliable Last n-values Channel
  ☐ Best-Effort/Reliable FIFO Channel
[Figure: a DataWriter (DW) connected through a Topic to three DataReaders (DR)]
Last n-values Channel
☐ The last n-values channel is useful when modeling distributed state
☐ When n=1 the last-value channel provides a way of modeling an eventually consistent distributed state
☐ This abstraction is very useful if what matters is the current value of a given topic instance
☐ The QoS Policies that give a Last n-values Channel are:
  ☐ RELIABILITY = BEST_EFFORT | RELIABLE
  ☐ HISTORY = KEEP_LAST(n)
  ☐ DURABILITY = TRANSIENT | PERSISTENT [in most cases]
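To make the KEEP_LAST(n) behavior concrete, here is a small sketch of the reader-side view of such a channel, written in plain Scala with no DDS dependency. The class name LastNCache and its methods are assumptions made for illustration; a real DataReader cache also handles reliability, durability and instance lifecycle, which are ignored here.

```scala
import scala.collection.immutable.Queue

// Sketch of a reader-side cache with KEEP_LAST(n) semantics:
// for every instance key only the n most recent samples are retained.
final class LastNCache[K, V](n: Int) {
  require(n > 0, "history depth must be positive")
  private var history = Map.empty[K, Queue[V]]

  // Store a new sample, evicting the oldest one when the depth is exceeded.
  def write(key: K, value: V): Unit = {
    val q = history.getOrElse(key, Queue.empty[V]).enqueue(value)
    history = history.updated(key, if (q.size > n) q.tail else q)
  }

  // Samples currently held for the instance, oldest first.
  def read(key: K): List[V] = history.get(key).map(_.toList).getOrElse(Nil)
}
```

With n = 1 the cache holds only the current value of each instance, which is the "distributed state" reading of the channel.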
FIFO Channel
☐ The FIFO Channel is useful when we care about every single sample that was produced for a given topic, as opposed to the “last value”
☐ This abstraction is very useful when writing distributed algorithms over DDS
☐ Depending on QoS Policies, DDS provides:
  ☐ Best-Effort/Reliable FIFO Channel
  ☐ FT-Reliable FIFO Channel (using an OpenSplice-specific extension)
☐ The QoS Policies that give a FIFO Channel are:
  ☐ RELIABILITY = BEST_EFFORT | RELIABLE
  ☐ HISTORY = KEEP_ALL
Membership
☐ We can think of a DDS Topic as defining a group
☐ The members of this group are matching DataReaders and DataWriters
☐ DDS’ dynamic discovery manages this group membership; however, it provides only a low-level interface to group management and eventual consistency of views
☐ In addition, the group view provided by DDS makes available matched readers on the writer side and matched writers on the reader side
☐ This is not sufficient for certain distributed algorithms.
[Figure: DataWriter Group View (one DW and its matched DRs) and DataReader Group View (one DR and its matched DWs)]
Fault-Detection
☐ DDS provides a built-in mechanism for detecting DataWriter faults through the LivelinessChangedStatus
☐ A writer is considered to have lost its liveliness if it has failed to assert it within its lease period
[Figure: DataReader Group View, one DR and its matched DWs]
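The lease rule can be sketched independently of DDS. The class below is a hypothetical, deterministic model (the name LivelinessMonitor and its methods are assumptions, not a DDS API): the caller supplies timestamps in milliseconds, and a writer counts as alive only if its last assertion is no older than the lease.

```scala
// Sketch of lease-based liveliness tracking. Times are plain milliseconds
// supplied by the caller, so the logic is deterministic and easy to test.
final class LivelinessMonitor(leaseMillis: Long) {
  private var lastAssert = Map.empty[Int, Long]

  // Record that writer `wid` asserted its liveliness at time `now`.
  def assertLiveliness(wid: Int, now: Long): Unit =
    lastAssert = lastAssert.updated(wid, now)

  // Writers whose lease has not expired at time `now`.
  def alive(now: Long): Set[Int] =
    lastAssert.collect { case (wid, t) if now - t <= leaseMillis => wid }.toSet

  // Writers that failed to assert their liveliness within the lease period.
  def notAlive(now: Long): Set[Int] = lastAssert.keySet -- alive(now)
}
```

Note that under the partially synchronous model a writer reported as not alive may merely be slow, so this is a suspicion rather than a proof of failure.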
System Model
☐ Partially Synchronous
  ☐ After a Global Stabilization Time (GST) communication latencies are bounded, yet the bound is unknown
☐ Non-Byzantine Fail/Recovery
  ☐ Processes can fail and restart but don’t perform malicious actions
Programming Environment
☐ The algorithms shown next are implemented on OpenSplice using the Escalier Scala API
☐ All algorithms are available as part of the Open Source project dada
☐ Scala: Fastest growing JVM Language; Open Source; www.scala-lang.org
☐ OpenSplice | DDS: #1 OMG DDS Implementation; Open Source; www.opensplice.org
☐ Escalier: Scala API for OpenSplice DDS; Open Source; github.com/kydos/escalier
☐ dada: DDS-based Advanced Distributed Algorithms Toolkit; Open Source; github.com/kydos/dada
Higher Level Abstractions
Group Management
☐ A Group Management abstraction should provide the ability to join/leave a group, provide the current view and detect failures of group members
☐ Ideally group management should also provide the ability to elect leaders
☐ A Group Member should represent a process

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)

  // Reactions handling Group Events
  val reactions: Reactions
}

case class MemberJoin(val mid: Int)
case class MemberLeave(val mid: Int)
case class MemberFailure(mid: Int)
case class EpochChange(epoch: Long)
case class NewLeader(mid: Option[Int])
Topic Types
☐ To implement the Group abstraction with support for Leader Election it is sufficient to rely on the following topic types:

enum TMemberStatus {
  JOINED,
  LEFT,
  FAILED,
  SUSPECTED
};

struct TMemberInfo {
  long mid; // member-id
  TMemberStatus status;
};
#pragma keylist TMemberInfo mid

struct TEventualLeaderVote {
  long long epoch;
  long mid;
  long lid; // voted leader-id
};
#pragma keylist TEventualLeaderVote mid
Topics
Group Management
☐ The TMemberInfo topic is used to advertise presence and manage the members’ state transitions
Leader Election
☐ The TEventualLeaderVote topic is used to cast votes for leader election
This leads us to:
☐ Topic(name = MemberInfo, type = TMemberInfo, QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
☐ Topic(name = EventualLeaderVote, type = TEventualLeaderVote, QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
Observation
☐ Notice that we are using two Last-Value Channels to implement both the (eventual) group management and the (eventual) leader election
☐ This makes it possible to:
  ☐ Let DDS provide our latest known state automatically thanks to the TransientLocal Durability
  ☐ Avoid periodically asserting our liveliness, as DDS will do that for our DataWriter
Leader Election
[Figure: timeline of members M0, M1 and M2 with three joins and one crash; each change of membership starts a new epoch]
epoch = 0: Leader: None => M1
epoch = 1: Leader: None => M1
epoch = 2: Leader: None => M0
epoch = 3: Leader: None => M0
☐ At the beginning of each epoch the leader is None
☐ In each new epoch a leader election algorithm is run
Distinguishing Groups
☐ To isolate the traffic generated by different groups, we use the group-id gid to name the partition in which all the group-related traffic takes place
[Figure: a DDS Domain containing partitions “1”, “2” and “3”; partition “2” is the partition associated with the group with gid=2]
Example [1/2]
☐ Events provide notification of group membership changes
☐ These events are handled by registering partial functions with the Group reactions

object GroupMember {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)
    group.join(mid)

    // Prints the members currently in the group view
    val printGroupView = () => {
      print("Group["+ gid +"] = { ")
      group.view foreach(m => print(m + " "))
      println("}")
    }

    // React to membership events by logging them and printing the new view
    group.reactions += {
      case MemberFailure(mid) => {
        println("Member "+ mid + " Failed.")
        printGroupView()
      }
      case MemberJoin(mid) => {
        println("Member "+ mid + " Joined")
        printGroupView()
      }
      case MemberLeave(mid) => {
        println("Member "+ mid +" Left")
        printGroupView()
      }
    }
  }
}
Example [2/2]
☐ An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change
☐ A Group Epoch change takes place each time there is a change in the group view
☐ The leader is eventually elected only if a majority of the processes currently in the view agree
☐ Otherwise the group leader is set to “None”

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: EventualLeaderElection <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt

    val group = Group(gid)
    group.join(mid)

    group.reactions += {
      // On every epoch change, vote for the smallest member-id in the view
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) =>
        println(">> NewLeader = "+ l)
    }
  }
}
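The majority rule stated in the bullets can be written down as a pure function. This is only a sketch under assumptions: the object name EventualLeader and the representation of votes as a map from voter-id to voted leader-id are hypothetical, and the sketch says nothing about how dada actually tallies the TEventualLeaderVote samples.

```scala
object EventualLeader {
  // `view` is the current group view; `votes` maps voter-id -> voted leader-id.
  // Only votes cast by current members, for current members, are counted.
  // A leader is returned only when a strict majority of the view agrees on it.
  def elect(view: Set[Int], votes: Map[Int, Int]): Option[Int] = {
    val valid = votes.filter { case (voter, lid) => view(voter) && view(lid) }
    val tally = valid.values.groupBy(identity).map { case (lid, vs) => lid -> vs.size }
    tally.collectFirst { case (lid, count) if count > view.size / 2 => lid }
  }
}
```

Because every member casts a single vote, at most one candidate can reach a strict majority, so the result is unambiguous; without a majority the function returns None, matching the last bullet.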
Distributed Mutex
Lamport’s Distributed Mutex
☐ A relatively simple Distributed Mutex Algorithm was proposed by Leslie Lamport as an example application of Lamport’s Logical Clocks
☐ The basic protocol (with the Agrawala optimization) works as follows (sketched):
  ☐ When a process needs to enter a critical section it sends a MUTEX request tagged with its current logical clock
  ☐ The process obtains the Mutex only when it has received ACKs from all the other processes in the group
  ☐ When a process receives a Mutex request it sends an ACK only if it does not have an outstanding Mutex request timestamped with a smaller logical clock
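The ordering and the ACK rule sketched above can be expressed in a few lines of plain Scala. The names LClock and MutexRule are hypothetical and chosen to avoid confusion with the LogicalClock type used by dada; ties between equal timestamps are broken by member-id, which is the usual way to obtain a total order from Lamport clocks.

```scala
// Lamport timestamp paired with the member-id, giving a total order on requests.
final case class LClock(ts: Long, mid: Int) extends Ordered[LClock] {
  def compare(that: LClock): Int =
    if (ts != that.ts) java.lang.Long.compare(ts, that.ts)
    else java.lang.Integer.compare(mid, that.mid)

  // Local event: advance the clock by one.
  def inc: LClock = copy(ts = ts + 1)

  // On receipt of a message stamped `other`: max(local, remote) + 1.
  def merge(other: LClock): LClock = copy(ts = math.max(ts, other.ts) + 1)
}

object MutexRule {
  // `myRequest` is None when this process has no outstanding request.
  // ACK at once unless our own outstanding request is older than the incoming one;
  // in that case the ACK is deferred until we release the mutex.
  def ackNow(incoming: LClock, myRequest: Option[LClock]): Boolean =
    myRequest.forall(mine => incoming < mine)
}
```

For two concurrent requests exactly one of the two processes defers its ACK, so only one of them can collect ACKs from all the others.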
Mutex Abstraction
☐ A base class defines the Mutex Protocol
☐ The Mutex companion uses dependency injection to decide which concrete mutex implementation to use

abstract class Mutex {
  def acquire()
  def release()
}
Foundation Abstractions
☐ The mutual exclusion algorithm essentially requires:
  ☐ FIFO communication channels between group members
  ☐ Logical Clocks
  ☐ MutexRequest and MutexAck Messages
These needs now have to be translated into topic types, topics, readers/writers and QoS settings
Topic Types
☐ For implementing the Mutual Exclusion Algorithm it is sufficient to define the following topic types:

struct TLogicalClock {
  long ts;
  long mid;
};
#pragma keylist TLogicalClock mid

struct TAck {
  long amid; // acknowledged member-id
  TLogicalClock ts;
};
#pragma keylist TAck ts.mid
Topics
We need essentially two topics:
☐ One topic for representing the Mutex Requests, and
☐ Another topic for representing Acks
This leads us to:
☐ Topic(name = MutexRequest, type = TLogicalClock, QoS = {Reliability.Reliable, History.KeepAll})
☐ Topic(name = MutexAck, type = TAck, QoS = {Reliability.Reliable, History.KeepAll})
Show me the Code!
☐ All the algorithms presented were implemented using DDS and Scala
☐ Specifically we’ve used the OpenSplice Escalier language mapping for Scala
☐ The resulting library has been baptized “dada” (DDS Advanced Distributed Algorithms) and is available under LGPL-v3
LCMutex
☐ The LCMutex is one of the possible Mutex protocols, implementing the Agrawala variation of the classical Lamport’s Algorithm

class LCMutex(val mid: Int, val gid: Int, val n: Int)(implicit val logger: Logger) extends Mutex {

  private var group = Group(gid)
  private var ts = LogicalClock(0, mid)
  private var receivedAcks = new AtomicLong(0)

  // Requests whose ACK is deferred until we release the mutex
  private var pendingRequests = new SynchronizedPriorityQueue[LogicalClock]()
  private var myRequest = LogicalClock.Infinite

  private val reqDW = DataWriter[TLogicalClock](LCMutex.groupPublisher(gid), LCMutex.mutexRequestTopic, LCMutex.dwQos)
  private val reqDR = DataReader[TLogicalClock](LCMutex.groupSubscriber(gid), LCMutex.mutexRequestTopic, LCMutex.drQos)
  private val ackDW = DataWriter[TAck](LCMutex.groupPublisher(gid), LCMutex.mutexAckTopic, LCMutex.dwQos)
  private val ackDR = DataReader[TAck](LCMutex.groupSubscriber(gid), LCMutex.mutexAckTopic, LCMutex.drQos)

  // Released once ACKs from all the other group members have been received
  private val ackSemaphore = new Semaphore(0)
LCMutex.acquire

  def acquire() {
    ts = ts.inc()
    myRequest = ts
    reqDW ! myRequest        // publish the request
    ackSemaphore.acquire()   // block until all ACKs have arrived
  }

Notice that, as the LCMutex is single-threaded, we can’t issue concurrent acquires.
LCMutex.release
Notice that, as the LCMutex is single-threaded, we can’t issue a new request before we release.

  def release() {
    myRequest = LogicalClock.Infinite
    // ACK every request that was deferred while we held the mutex
    (pendingRequests dequeueAll) foreach { req =>
      ts = ts inc()
      ackDW ! new TAck(req.id, ts)
    }
  }
LCMutex.onACK

  ackDR.reactions += {
    case DataAvailable(dr) => {
      // Count only the ACKs for us
      val acks = ((ackDR take) filter (_.amid == mid))
      val k = acks.length

      if (k > 0) {
        // Set the local clock to the max (tsi, tsj) + 1
        synchronized {
          val maxTs = math.max(ts.ts, (acks map (_.ts.ts)).max) + 1
          ts = LogicalClock(maxTs, ts.id)
        }
        val ra = receivedAcks.addAndGet(k)
        val groupSize = group.size
        // If we have received sufficiently many ACKs we can enter our Mutex!
        if (ra == groupSize - 1) {
          receivedAcks.set(0)
          ackSemaphore.release()
        }
      }
    }
  }
LCMutex.onReq

  reqDR.reactions += {
    case DataAvailable(dr) => {
      // Ignore our own requests
      val requests = (reqDR take) filterNot (_.mid == mid)

      if (!requests.isEmpty) {
        // Set the local clock to the max (tsi, tsj) + 1
        synchronized {
          val maxTs = math.max((requests map (_.ts)).max, ts.ts) + 1
          ts = LogicalClock(maxTs, ts.id)
        }
        requests foreach (r => {
          if (r < myRequest) {
            // The incoming request precedes ours: ACK immediately
            ts = ts inc()
            val ack = new TAck(r.mid, ts)
            ackDW ! ack
            None
          }
          else {
            // Our request comes first: defer the ACK until release
            (pendingRequests find (_ == r)).getOrElse({
              pendingRequests.enqueue(r)
              r})
          }
        })
      }
    }
  }
Distributed Queue
Distributed Queue Abstraction
☐ A distributed queue conceptually provides the ability to enqueue and dequeue elements
☐ Depending on the invariants that are guaranteed, the distributed queue implementation can be more or less efficient
☐ In what follows we’ll focus on a relaxed form of distributed queue, called Eventual Queue, which provides relaxed yet very useful semantics and is amenable to high-performance implementations
Eventual Queue Specification
☐ Invariants
  ☐ All enqueued elements will be eventually dequeued
  ☐ Each element is dequeued once
  ☐ If the queue is empty a dequeue returns nothing
  ☐ If the queue is non-empty a dequeue might return something
  ☐ Elements might be dequeued in a different order than they are enqueued
[Figure: Distributed Eventual Queue, with several DataWriters (DW) enqueueing and several DataReaders (DR) dequeueing]
Eventual Queue Abstraction
☐ A Queue can be seen as the composition of two simpler data structures, a Dequeue and an Enqueue
☐ The Enqueue simply allows adding elements
☐ The Dequeue simply allows getting elements

trait Enqueue[T] {
  def enqueue(t: T)
}

trait Dequeue[T] {
  def dequeue(): Option[T]
  def sdequeue(): Option[T]
  def length: Int
  def isEmpty: Boolean = length == 0
}

trait Queue[T] extends Enqueue[T] with Dequeue[T]
Eventual Queue on DDS
☐ One approach to implementing the eventual queue on DDS is to keep a local queue on each of the consumers and to run a coordination algorithm to enforce the Eventual Queue Invariants
☐ The advantage of this approach is that the latency of the dequeue is minimized and the throughput of enqueues is maximized (we’ll see that the latter is really a property of the eventual queue)
☐ The disadvantage, for some use cases, is that each consumer needs to store the whole queue locally; thus this solution is applicable mainly to symmetric environments running on LANs
Eventual Queue Invariants & DDS
☐ All enqueued elements will be eventually dequeued
☐ Each element is dequeued once
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
  ☐ These invariants require that we implement a distributed protocol ensuring that values are eventually picked up, and picked up only once!
☐ Elements might be dequeued in a different order than they are enqueued
Eventual Queue Invariants & DDS
☐ All enqueued elements will be eventually dequeued
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
☐ Elements might be dequeued in a different order than they are enqueued
  ☐ This essentially means that we can have a different local order for the queue elements on each consumer, which in turn means that we can distribute enqueued elements by simple DDS writes!
☐ The implication of this is that the enqueue operation is going to be as efficient as a DDS write
☐ Finally, to ensure eventual consistency in the presence of writer faults we’ll take advantage of OpenSplice FT-Reliability!
Dequeue Protocol: General Idea
☐ A possible Dequeue protocol can be derived from the Lamport/Agrawala Distributed Mutual Exclusion Algorithm
☐ The general idea is similar, as we want to order dequeues as opposed to accesses to some critical section; however, there are some important details to be sorted out to ensure that we really maintain the eventual queue invariants
☐ Key issues to be dealt with:
  ☐ DDS provides eventual consistency, thus we might have wildly different local views of the content of the queue (not just its order but the actual elements)
  ☐ Once a process has gained the right to dequeue, it has to be sure that it picks an element that nobody else has picked just before. It then has to ensure that its choice has been popped from all other local queues before it allows anybody else to pick a value
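The pick-then-pop step described in the last bullet can be modeled in memory. This is a sketch under stated assumptions, not the dada implementation: LocalQueue and DequeueStep are hypothetical names, elements are assumed to be unique (in the real protocol they carry a writer timestamp), the right to dequeue is assumed to have been granted already, and the pop is applied synchronously instead of being sent as a protocol message.

```scala
// Each consumer keeps its own copy of the queue, possibly in a different order.
final class LocalQueue[T](initial: Seq[T]) {
  private var items = initial.toVector
  def head: Option[T] = items.headOption
  // Remove the element chosen by whoever currently holds the right to dequeue.
  def pop(t: T): Unit = { items = items.filterNot(_ == t) }
  def contents: Vector[T] = items
}

object DequeueStep {
  // Executed by the consumer that currently holds the right to dequeue:
  // pick the head of its own local queue, then pop that element from every
  // local queue (its own included) before anybody else may dequeue.
  def dequeue[T](winner: LocalQueue[T], all: Seq[LocalQueue[T]]): Option[T] = {
    val picked = winner.head
    picked.foreach(t => all.foreach(_.pop(t)))
    picked
  }
}
```

Even though the two local orders differ, each element is handed out exactly once, and a dequeue on an empty queue returns nothing.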
Topic Types
☐ To implement the Eventual Queue over DDS we use three different Topic Types
☐ The TQueueCommand represents all the commands used by the protocol (more on this later)
☐ TQueueElement represents a writer time-stamped queue element

struct TLogicalClock {
  long long ts;
  long mid;
};

enum TCommandKind {
  DEQUEUE,
  ACK,
  POP
};

struct TQueueCommand {
  TCommandKind kind;
  long mid;
  TLogicalClock ts;
};
#pragma keylist TQueueCommand

typedef sequence<octet> TData;
struct TQueueElement {
  TLogicalClock ts;
  TData data;
};
#pragma keylist TQueueElement
Topics
To implement the Eventual Queue we need only two topics:
☐ One topic for representing the queue elements
☐ Another topic for representing all the protocol messages. Notice that the choice of using a single topic for all the protocol messages was carefully made to be able to ensure FIFO ordering between protocol messages
This leads us to:
☐ Topic(name = QueueElement, type = TQueueElement, QoS = {Reliability.Reliable, History.KeepAll})
☐ Topic(name = QueueCommand, type = TQueueCommand, QoS = {Reliability.Reliable, History.KeepAll})
Dequeue Protocol: A Sample Run
[Figure: message sequence between app 1 and app 2. Both local queues hold the elements (a, ts) and (b, ts’), in different orders. Labels appearing in the figure: req {(1,1)}, req {(1,2)}, ack {(2,2)}, deq():a, pop{ts, (3,1)}, ack {(4,1)}, deq():b, pop{ts, (5,2)}]
Example: Producer

object MessageProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageProducer <mid> <gid> <n> <samples>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val n = args(2).toInt
    val samples = args(3).toInt

    // Join the group and wait until n members are in the view
    val group = Group(gid)
    group.reactions += {
      case MemberJoin(mid) => println("Joined M["+ mid +"]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Enqueue[String]("CounterQueue", mid, gid)

    for (i <- 1 to samples) {
      val msg = "MSG["+ mid +", "+ i +"]"
      println(msg)
      queue.enqueue(msg)
      // Pace the write so that you can see what's going on
      Thread.sleep(300)
    }
  }
}
Example: Consumer

object MessageConsumer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      println("USAGE:\n\t MessageConsumer <mid> <gid> <readers-num> <n>")
      sys.exit(1)
    }
    val mid = args(0).toInt
    val gid = args(1).toInt
    val rn = args(2).toInt
    val n = args(3).toInt

    // Join the group and wait until n members are in the view
    val group = Group(gid)
    group.reactions += {
      case MemberJoin(mid) => println("Joined M["+ mid +"]")
    }
    group.join(mid)
    group.waitForViewSize(n)

    val queue = Queue[String]("CounterQueue", mid, gid, rn)

    val baseSleep = 1000
    while (true) {
      queue.sdequeue() match {
        case Some(s) => println(Console.MAGENTA_B + s + Console.RESET)
        case _ => println(Console.MAGENTA_B + "None" + Console.RESET)
      }
      // Sleep for a random time between one and two seconds
      val sleepTime = baseSleep + (math.random * baseSleep).toInt
      Thread.sleep(sleepTime)
    }
  }
}
Dealing with Faults
Fault-Detectors
☐ The algorithms presented so far can be easily extended to deal with failures by taking advantage of the group abstraction presented earlier
☐ The main issue to consider carefully is that, if a timing assumption is violated and a process is therefore falsely suspected of having crashed, the safety of some of those algorithms might be violated!
Paxos
Paxos in Brief
☐ Paxos is a protocol for state-machine replication proposed by Leslie Lamport in his “The Part-Time Parliament”
☐ The Paxos protocol works under asynchrony (to be precise, it is safe under asynchrony and makes progress under partial synchrony; both are not possible under asynchrony due to FLP) and admits a crash/recovery failure mode
☐ Paxos requires some form of stable storage
☐ The theoretical specification of the protocol is very simple and elegant
☐ The practical implementations of the protocol have to fill in many hairy details...
Paxos in Brief
☐ The Paxos protocol considers three different kinds of agents (the same process can play multiple roles):
  ☐ Proposers
  ☐ Acceptors
  ☐ Learners
☐ To make progress the protocol requires that a proposer acts as the leader in issuing proposals to acceptors on behalf of clients
☐ The protocol is safe even if there are multiple leaders; in that case progress might be sacrificed
  ☐ This implies that Paxos can use an eventual leader election algorithm to decide the distinguished proposer
Paxos Synod Protocol
[Pseudocode from “Ring Paxos: A High-Throughput Atomic Broadcast Protocol”, DSN 2010. Notice that the pseudocode is not fully correct, as it fails to make progress in several cases; however, it illustrates the key idea of the Paxos Synod protocol]
Paxos in Action
[Figure sequence: clients C1, C2, ..., Cn; proposers P1, P2, ..., Pk, one of them marked [Leader]; acceptors A1, A2, ..., Am; learners L1, L2, ..., Lh. The successive slides label the message exchanged in each step:]
☐ Phase 1A: phase1A(c-rnd)
☐ Phase 1B: phase1B(rnd, v-rnd, v-val)
☐ Phase 2A: phase2A(c-rnd, c-val)
☐ Phase 2B: phase2B(v-rnd, v-val)
☐ Decision: Decision(v-val)
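The message labels above map naturally onto the state kept by an acceptor. The following is a minimal single-instance sketch of the textbook Paxos rules, not the pseudocode from the slide and not the dada implementation. The class and method names are hypothetical, values are plain strings, rounds are assumed to start at 1 (0 meaning "nothing accepted yet"), and state is kept in memory although a real acceptor must write it to stable storage before replying.

```scala
// Replies named after the labels used in the phase diagrams.
final case class Phase1B(rnd: Int, vRnd: Int, vVal: Option[String])
final case class Phase2B(vRnd: Int, vVal: String)

// Single-instance acceptor (in-memory only; see the assumptions above).
final class Acceptor {
  private var rnd = 0                      // highest round promised so far
  private var vRnd = 0                     // round in which a value was last accepted
  private var vVal: Option[String] = None  // last accepted value, if any

  // Phase 1A received: promise only if the round is newer than any seen before.
  def onPhase1A(cRnd: Int): Option[Phase1B] =
    if (cRnd > rnd) { rnd = cRnd; Some(Phase1B(rnd, vRnd, vVal)) } else None

  // Phase 2A received: accept unless a higher round has already been promised.
  def onPhase2A(cRnd: Int, cVal: String): Option[Phase2B] =
    if (cRnd >= rnd) {
      rnd = cRnd; vRnd = cRnd; vVal = Some(cVal)
      Some(Phase2B(vRnd, cVal))
    } else None
}

object Proposer {
  // After a quorum of Phase1B replies: propose the value accepted in the highest
  // round, or the proposer's own value when no acceptor has accepted anything.
  def chooseValue(replies: Seq[Phase1B], own: String): String = {
    val accepted = replies.filter(_.vVal.isDefined)
    if (accepted.isEmpty) own else accepted.maxBy(_.vRnd).vVal.get
  }
}
```

The value-selection rule in chooseValue is what keeps the protocol safe with multiple leaders: a later round can only re-propose a value that may already have been chosen.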
Eventual Queue with Paxos
☐ The Eventual Queue we specified in the previous section can be implemented using an adaptation of the Paxos protocol
☐ In this case, consumers don’t cache the queue locally but leverage a mid-tier running the Paxos protocol to serve dequeues
[Figure: C1, C2, ..., Cn and P1, P2, ..., Pm connected to an [Eventual Queue] tier composed of [Proposers] (Pi), [Acceptors] (Ai) and [Learners] (L1)]
Summing Up
Concluding Remarks
☐ OpenSplice DDS provides a good foundation to effectively and efficiently express some of the most important distributed algorithms
  ☐ e.g. DataWriter fault-detection and OpenSplice FT-Reliable Multicast
☐ dada provides access to reference implementations of many of the most important distributed algorithms
  ☐ It is implemented in Scala, which means you can also use these libraries from Java!
References
☐ Scala: Fastest growing JVM Language; Open Source; www.scala-lang.org
☐ OpenSplice | DDS: #1 OMG DDS Implementation; Open Source; www.opensplice.org
☐ Escalier: Scala API for OpenSplice DDS; Open Source; github.com/kydos/escalier
☐ Simple C++ API for DDS; Open Source; github.com/kydos/simd-cxx
☐ DDS-PSM-Java for OpenSplice DDS; Open Source; github.com/kydos/simd-java
☐ dada: DDS-based Advanced Distributed Algorithms Toolkit; Open Source; github.com/kydos/dada
:: Connect with Us ::
☐ @prismtech
☐ @acorsaro
☐ youtube.com/opensplicetube
☐ slideshare.net/angelo.corsaro
☐ opensplice.com
☐ forums.opensplice.org
☐ opensplice.org
☐ [email protected]