distributed systems 2006 virtual synchrony ii* *with material adapted from ken birman, ben wang,...

Distributed Systems 2006

Virtual Synchrony II*

*With material adapted from Ken Birman, Ben Wang, Bill Burke, Bela Ban

Distributed Systems 2006 2

Plan

We skip Section 18.2

Tracking group membership: We’ll base it on 2PC and 3PC

Fault-tolerant multicast: We’ll use membership

Ordered multicast: We’ll base it on fault-tolerant multicast

Tools for solving practical replication and availability problems: we’ll base them on ordered multicast

Robust Web Services: We’ll build them with these tools

2PC and 3PC: Our first “tools” (lowest layer)


Recap: Elements of Virtual Synchrony

Support for process groups– Processes may join and leave dynamically – Excluded if (thought to) fail

Various reliable, ordered multicast protocols– Strive to replace synchronous, totally-ordered, dynamically uniform

protocols View-synchronous delivery

– Two processes that are member of the same view receive same set of multicasts during that view

Identical process groups views and rankings– Identical sequences of group membership lists

Gap-freedom guarantees– If, after a failure, m1 has been delivered to a destination, then any

message m0 that should be delivered prior to m1 is also delivered State transfer for joining processes

– A new process may obtain group state from existing member– (Will develop this shortly)


Distributed Algorithms

Election Consensus Consistent snapshot Replicated data and synchronization State transfer Load-balancing Fault tolerance

– Primary-backup– Coordinator-cohort


State transfer

We need to transfer current state of a process group to joining members?

”State” is application-dependent– Creating state should be done by

application itself– E.g., org.jgroups.MessageListener

• byte[] getState()• void setState(byte[] state)

Simple approach– Just make all members transfer

state to joining process

Reasonable approach– Joining process pulls state from

one existing member– Take another if first one fails


Load-Balancing

Coordinate members of a group to share workload in order to obtain a speed-ud for parallelism?

Use groups communication to implement load-balancing of request...

Styles of algorithms– Group decides who should handle request– Client decides who should handle request


Group Decides

Multicast request from client to full membership– May require expensive transfer to all

Need deterministic rule for deciding who should handle request– E.g., with abcast, requests may be numbered in a

total order– A process may handle the ith request if the process

rank is i mod n

With cbcast?


Client affinity schemes

Group members provide clients with information used to select appropriate serve to send request to

Best choice dependent on data size, fault-tolerance needed, queries/updates, ...

E.g.,– Static assigment of client to specific server

• + caching, - for very active clients

– Pick a random server– Base choice on (approximate) load information

• Could also be used with previous approach


Using approximate load information

Assume that processes send out load reports using abcast– E.g., 0 for no load, 1 for currently handling 1 request etc.

Represent load on group of n servers as a vector: [l0, ..., ln-1]– lmax = max(l0, ..., ln-1) + 1– [l0’, ..., ln-1’] = [lmax - l0, ..., lmax - ln-1]– L’ = l0 ’ + ... + ln-1 ’

Map incoming requests to process– Given a request, choose a random number, r, between 0 and L’

• By applying pseudo-random generator, same seed at all processes

– Now choose process i if l0’ + ... + li’ <= r < l0’ + ... + li+1’ (i < n-1)

(Think of l0’, ..., li’ as points on a line with length L’, then the algorithm selects the segment that r is within)


Fault tolerance

We want to offer clients “fault-tolerant request execution”?

We can replace a traditional service with a group of members– Each request is assigned to a

primary (ideally, spread the work around) and a backup

• Primary sends a “cc” of the response to the request to the backup

– Backup keeps a copy of the request and steps in only if the primary crashes before replying

Sometimes called “coordinator/cohort” just to distinguish from “primary/backup”


Trade-offs


Trade-offs

Membership– Static

• Fixed membership, changing connectivity and availability

– Dynamic• Changing membership, fixed

connectivity and availability

Consistency– Internal

• Defined with respect to members of group observing messages

– External• Defined with respect to

external observer (e.g., a database)


Toolkits

Isis Horus Ensemble JGroups Spread


Features of major virtual synchrony platforms

Isis: first and no longer widely used– But was perhaps the most successful; has major roles

in NYSE, Swiss Exchange, French Air Traffic Control system (two major subsystems of it), US AEGIS Naval warship

– Pioneered use of cbcast– Also was first to offer a publish-subscribe interface

that mapped topics to groups


Features of major virtual synchrony platforms

Horus, JGroups and Ensemble– Successors to Isis– These focus on flexible protocol stack linked directly into

application address space• A stack is a pile of micro-protocols• Can assemble an optimized solution fitted to specific needs of the

application by plugging together “properties this application requires”, lego-style

• The system is optimized to reduce overheads of this compositional style of protocol stack

Use– JGroups is very popular

• Used in, e.g., JBoss, OpenSymphony OSCache, Jetty, Tomcat– Ensemble is somewhat popular and supported by a user

community– Horus works well but is not widely used.


Horus/JGroups/Ensemble protocol stacks

Application belongs to process group

comm

nak

frag

mbrshp fc

comm

comm

nak frag

comm

nak

frag mbrshp

parcld

comm

nak

frag

mbrshp merge

total total


Spread Toolkit

Focused on a sort of “RISC” approach– Very simple architecture and system– Fairly fast, easy to use, rather popular

Supports one large group within which user sees many small “lightweight” subgroups that seem to be free-standing

Protocols implemented by Spread “agents” that relay messages to apps


Case: J2EE/JBoss

Java 2 Enterprise Edition (J2EE)– Multi-tiered, distributed application model / reference

architecture• Tiered = physically layered architecture

– Technologies to support this reference architecture

Enterprise Java Beans (EJB)– Server-side component model– Component: “… a unit of composition with contractually specified

interfaces and explicit context dependencies only. A software component can be deployed independently and is subject to composition by third parties”

JBoss– Open source J2EE Application Server– Arguably the most popular Java application server


J2EE Business Context

Powerful workstations (and servers) makes distributed computing viable– Programming abstractions for distributed computing needed

The Web…– Internet-enabled business systems / enterprise information systems– Specific requirements for web applications

• Scalability– support variations in load– e.g., Amazon before Christmas

• Availability– Very small downtime periods– e.g., eBay (400 million transactions/day)

• Security– Authenticate and authorize users

• Usability– Different users should have different contents in different forms

• Performance– Reasonable response times needed– Requests often arrive in bursts

– Also, e.g., time-to-market...


Architectural Solution


Tiers

Client tier– User interfaces

• Internet browser• Standalone Java clients• (COM applications)

Middle tier– Web tier

• Web server for handling requests from browsers– Gets request from client tier– Forwards to business component tier– Renders result

– Business component tier• Core “business logic”

– E.g., customers, accounts, …, relationships and rules among these in Web shop• Realized by EJBs running in an EJB container• “Application server” = middle-tier component server compatible with J2EE

Enterprise information systems tier– Databases– Backend systems


Example: EHR in Ribe Amt

DB2:Database

RMI-IIOP

<<SessionBean>>:LoginSessionBean

Trifork: Application Server

<<SessionBean>>:CommandSessionBean

<<SessionBean>>:PrintHandler

PC:Client

PC:Server

MVS:Backend

DB2:Database

XML

PAS:Fødesystem

<<SessionBean>>:PersistencyDelegate

:Security

Medicin:UI

Lab:UI

Notat:UI

:Validering

MQ Series

MQ Series


Specific J2EE-Related Technologies

JavaServer Pages (JSP)– Creating dynamic web-based content

Java Servlets– Extending functionality of web servers

Java Messaging Service (JMS)– Asynchronous point-to-point and many-to-many messaging

Java Naming and Directory Interface (JNDI)– Directory-based retrieval of user-defined objects

J2EE Connector Architecture– Standard architecture for integrating external systems

RMI over IIOP– RMI using OMG’s Internet Inter-Orb Protocol

Java DataBase Connectivity (JDBC)– Uniform interface to relational databases

Enterprise JavaBeans (EJB)


EJB Deployment View

EJB container– Manage execution of components– Expose platform services


EJB Code View

Need to enable remote clients to access bean– Remote interface– ~ Proxy object

Manage lifecycle of bean– Home interface– Possibly functionality to

locate specific instances– ~ Factory object

Implement functionality– Bean class

Clients use generated stubs


Detailed Example


EJB Types

Entity beans– Representing business data objects– Data members map to data items stored in associated data base– Container-managed persistence

• Container loads and stores data• No application code required

– Bean-managed persistence• Bean code responsible itself for persistence

– “handcrafted” JDBC

Session beans– Business logic and services– “Stateful“ (SFSBs)

• Can keep state on behalf of client• Successive calls go to same component• Container handles life-cycle

– Passivation– Activation

• E.g., CommandBean– “Stateless” (SLSBs)

• Does not keep any state on behalf of client• Each successive call delegated to stateless

session bean as needed• Easy scalability and load balancing• E.g., PrintHandlerBean

Message-driven beans– Stateless– Asynchronous listener style of invocation


”Clustering” and J2EE

Scalability– I want to handle x times the number of concurrent access than what I

have now High availability

– Services are accessible with reasonable (and predictable) response times at any time

– E.g., 99.999 (5 Nines in Telco) Load balancing

– A way to obtain high availability and better performance by dispatching incoming requests to different servers

– Session affinity (or stickiness)– Checking heart beat

Failover– Process can continue when it is re-directed to a “backup” node because

the original one fails– What is the policy? Round-robin?

Fault tolerance– A service that guarantees strictly correct behavior despite system failure


JGroups in JBoss?

Serverless JMS Clustering

– Replication of entity beans, SLSBs and SFSBs– HA-JNDI– Session replication (integrated Tomcat, Jetty)

Cache– Replicated transactional clustered cache


The real world is complex...


Default JBoss JGroup Configuration

<UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"ip_ttl="8" ip_mcast="true" mcast_send_buf_size="800000” mcast_recv_buf_size="150000"

ucast_send_buf_size="800000" ucast_recv_buf_size="150000" loopback="false"/>

<PING timeout="2000" num_initial_members="3" up_thread="true" down_thread="true"/><MERGE2 min_interval="10000" max_interval="20000"/><FD shun="true" up_thread="true" down_thread="true" timeout="2500" max_tries="5"/><VERIFY_SUSPECT timeout="3000" num_msgs="3" up_thread="true" down_thread="true"/><pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800" max_xmit_size="8192" up_thread="true" down_thread="true"/><UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10" down_thread="true"/><pbcast.STABLE desired_avg_gossip="20000" up_thread="true" down_thread="true"/><FRAG frag_size="8192” down_thread="true" up_thread="true"/><pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/><pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>


Serverless JMS

Java Messaging Service based on JGroups– Peer-to-peer architecture rather than Client/Server

Client publishing to a topic– Other clients subscribe to topics– The ”publish/subscribe” paradigm

Instead of sending message to server, and server distributes to multiple clients: publisher multicasts message– JMS Server just another member

• Handles persistent messages (DB)


Serverless JMS

Publisher

JMS Server

Subscriber

Subscriber

Subscriber

Client/Server Model

Publisher

JMS Server

Subscriber

Subscriber

Subscriber

Serverless Model

Multicast

(discard) (accept)

(accept)

(accept)

(accept)

(discard)

Cost: 4 unicasts Cost: 1 multicast


Serverless JMS

Clients are still able to publish even when server is down

Caveat– works in scenario where client and server are in same

multicast-reachable network


Summary

We gave further examples of what can be built on top of Virtual Synchrony

Brief examples of toolkits and uses– Isis, Ensemble Horus, Spread, JGroups

J2EE introduction as bonus

distributed systems 2006 virtual synchrony ii* *with material adapted from ken birman, ben wang,...

Documents

distributed systems

group state

request slide

multicast request

current state

request client

process groups processes

new process