ZOOKEEPER: WAIT-FREE COORDINATION FOR INTERNET-SCALE SYSTEMS

Authors: P. Hunt, M. Konar, F. P. Junqueira, B. Reed
Presenter: Lian Mo

WHAT IS ZOOKEEPER

A distributed, open-source coordination service for distributed applications.
It provides tools for implementing primitives and tasks such as locks, master election, and tracking live processes.

WHAT IS ZOOKEEPER

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace similar to a file system.
Applications act as clients. They connect to ZooKeeper servers and invoke operations through the client API.

WHAT IS ZOOKEEPER

Wait-free property
- slow processes cannot slow down fast ones
- no deadlocks
- easier to implement
FIFO client order and linearizable writes guarantees.

ZOOKEEPER VS CHUBBY

Similarities:
• Both provide interfaces similar to a UNIX-like file system
• Both provide a mechanism to follow changes on files (events & watches)
• Both have sessions
• Both keep a write-ahead replay log
• Both have regular and ephemeral nodes

Differences:
• Chubby is a distributed lock system only; in ZooKeeper a lock system may be implemented, but is not a must
• Chubby: reads and writes all go to the leader; ZooKeeper: reads on any server
• Chubby manages client caches; ZooKeeper clients manage their own caches

DATA MODEL

ZooKeeper follows a hierarchical namespace.
Each node in the namespace is called a znode.

Znodes are data objects that clients manipulate through the ZooKeeper API.

ZNODES

Regular znodes
- can have children
- created and deleted by clients explicitly

Ephemeral znodes
- cannot have children
- created by clients explicitly
- deleted by clients explicitly, or removed automatically by the system when the session that created them terminates

ZNODES

- Are in-memory data nodes
- Data is read and written in its entirety
- Not for storing general data, but meta-data
- Map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes
- Have their own meta-data: time stamps and version counters

ZNODE WATCHES

How do clients keep track of znode changes?
Clients could poll periodically, but that is inefficient.

Watches!
- One-time trigger associated with a session
- Indicate that a change happened, but do not provide the new data
- Unregistered once triggered or when the session closes
- E.g. getChildren(path, watch)
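
The one-time trigger behaviour can be sketched with a toy in-memory store. This is an illustration only, not the ZooKeeper client: `ToyStore`, `get_data`, `set_data`, and the path `/app/config` are hypothetical names chosen for the example.

```python
class ToyStore:
    """Toy in-memory stand-in for a ZooKeeper server (not the real client API)."""

    def __init__(self):
        self.data = {}       # path -> value
        self.watches = {}    # path -> callbacks waiting for the next change

    def get_data(self, path, watcher=None):
        # Reading with a watcher registers a one-time trigger on this path.
        if watcher is not None:
            self.watches.setdefault(path, []).append(watcher)
        return self.data.get(path)

    def set_data(self, path, value):
        self.data[path] = value
        # Fire and clear: each watch is delivered once, then unregistered.
        for watcher in self.watches.pop(path, []):
            watcher(path)    # the event names the path, not the new value


store = ToyStore()
store.set_data("/app/config", "v1")
events = []


def on_change(path):
    events.append(path)


store.get_data("/app/config", watcher=on_change)   # registers the watch
store.set_data("/app/config", "v2")                 # triggers it once
store.set_data("/app/config", "v3")                 # no watch left, no event
```

Only the first update produces an event. A client that wants to keep following the znode has to read it again with the watch flag set.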

ZOOKEEPER API

create(path, data, flags)
delete(path, version)
exists(path, watch)
getData(path, watch)
setData(path, data, version)
getChildren(path, watch)
sync(path)
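
The version argument of setData and delete makes updates conditional. A minimal sketch of that check, assuming a toy dictionary-backed store (`ToyZnodes` and `BadVersion` are hypothetical names, not the real client classes) and assuming that version -1 skips the check, as the -1 in the configuration example suggests:

```python
class BadVersion(Exception):
    """Raised when the caller's expected version does not match the znode's."""


class ToyZnodes:
    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, 0)

    def get_data(self, path):
        return self.nodes[path]  # returns (data, version)

    def set_data(self, path, data, version):
        _, current = self.nodes[path]
        # -1 skips the check; otherwise the write only applies if nobody
        # else has updated the znode since the caller read it.
        if version != -1 and version != current:
            raise BadVersion(f"expected {version}, found {current}")
        self.nodes[path] = (data, current + 1)


zk = ToyZnodes()
zk.create("/counter", 0)
value, version = zk.get_data("/counter")
zk.set_data("/counter", value + 1, version)      # succeeds, version 0 -> 1

try:
    zk.set_data("/counter", 99, version)         # stale version 0: rejected
    stale_write_rejected = False
except BadVersion:
    stale_write_rejected = True

zk.set_data("/counter", 7, -1)                   # unconditional write
```

Reading the version and passing it back on the write gives clients a compare-and-set style update without holding a lock.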

ORDERING GUARANTEES

Linearizable writes: all requests that update the state of ZooKeeper are serializable and respect precedence

FIFO client order: all requests from a given client are executed in the order that they were sent by the client

Notification order: if a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.

WHY IMPORTANT

If a system has just elected a new leader, the new leader must change many configuration parameters and notify the other processes once it finishes.
- While the new leader is making changes, we don’t want other processes to start using the configuration.
- If the new leader dies before the configuration has been fully updated, we don’t want processes to use that partial configuration.
Solution: set a READY znode.

delete READY -> update config -> create READY
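
The READY pattern can be sketched with a plain dictionary standing in for the znode tree. The function names, the `/config/` prefix, and the `crash_after` parameter are hypothetical and exist only to simulate a leader dying mid-update. The real pattern relies on ZooKeeper's ordering guarantees, which a single-threaded dictionary provides trivially.

```python
def update_config(store, new_params, crash_after=None):
    """Hypothetical leader routine: delete READY, write params, create READY."""
    store.pop("/READY", None)                 # 1. mark configuration as in flux
    for i, (key, value) in enumerate(new_params.items()):
        if crash_after is not None and i >= crash_after:
            return                            # simulated leader crash mid-update
        store[f"/config/{key}"] = value       # 2. update configuration znodes
    store["/READY"] = b""                     # 3. publish: config is complete


def read_config(store):
    """Workers use the configuration only when READY exists."""
    if "/READY" not in store:
        return None
    return {p.split("/")[-1]: v
            for p, v in store.items() if p.startswith("/config/")}


store = {}
update_config(store, {"host": "a", "port": 1})
complete = read_config(store)      # full configuration is visible

update_config(store, {"host": "b", "port": 2}, crash_after=1)
partial = read_config(store)       # READY is missing, so workers see nothing
```

After the simulated crash the store holds a half-updated configuration, but no worker uses it because READY was never recreated.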

ZOOKEEPER GUARANTEES

Liveness guarantees: if a majority of ZooKeeper servers are active and communicating, the service will be available

Durability guarantees: if the ZooKeeper service responds successfully to a change request, that change persists across any number of failures as long as a quorum of servers is eventually able to recover
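
As a small worked example of the majority condition, the arithmetic below computes how many servers form a quorum and how many failures an ensemble can absorb. The function names are illustrative, and the sketch assumes a quorum means a simple majority, as the liveness guarantee states.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)
```

Under this assumption a 5-server ensemble needs 3 servers and tolerates 2 failures, while a 4-server ensemble tolerates only 1, the same as a 3-server ensemble.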

PRIMITIVES IMPLEMENTED BY ZOOKEEPER

- Configuration Management
- Rendezvous
- Group Membership
- Simple Locks
- Simple Locks without Herd Effect
- Read/Write Locks
- Double Barrier

CONFIGURATION

Workers get the configuration
-getData(“../config/settings”, true)

Administrators change the configuration
-setData(“../config/settings”, newConf, -1)

Workers are notified of the change and get the new settings
-getData(“…/config/settings”, true)

GROUP MEMBERSHIP

Each process that is a member of the group creates an ephemeral child under the workers znode when it starts
-create(“../workers/workerName”, hostInfo, EPHEMERAL)

If a process fails or ends, the znode that represents it is removed automatically

Processes can obtain group information by listing the children of the workers znode
-getChildren(“../workers”, true)
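
A minimal sketch of group membership with ephemeral znodes, assuming a toy store that tracks which session created each znode. `ToyEnsemble`, its methods, and the session ids are hypothetical. In the real service the session ends through an explicit close or a timeout.

```python
class ToyEnsemble:
    def __init__(self):
        self.owner = {}  # znode path -> session id that created it

    def create_ephemeral(self, path, session_id):
        self.owner[path] = session_id

    def close_session(self, session_id):
        # When a session ends, every ephemeral znode it created is removed.
        self.owner = {p: s for p, s in self.owner.items() if s != session_id}

    def get_children(self, parent):
        prefix = parent.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.owner if p.startswith(prefix))


zk = ToyEnsemble()
zk.create_ephemeral("/workers/w1", session_id=1)
zk.create_ephemeral("/workers/w2", session_id=2)
before = zk.get_children("/workers")   # both workers are members

zk.close_session(2)                    # w2 crashes or exits
after = zk.get_children("/workers")    # only w1 remains
```

No process has to deregister a failed member; the membership list follows from which sessions are still alive.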

LOCKS

For a simple lock, we can use an EPHEMERAL znode l to represent the lock. To acquire the lock, a client tries to create l
-Success: the client holds the lock
-Fail: the client sets a read watch on that znode and tries to create it again when it gets the notification

LOCKS

Line up all the clients requesting the lock; each client obtains the lock in order of request arrival.

1) id = create(“.../locks/x-”, SEQUENCE|EPHEMERAL)
2) getChildren(“.../locks/”, false)
3) if id is the 1st child, exit
4) exists(name of last child before id, true)
5) if it does not exist, goto 2)
6) wait for notification from the watch
7) goto 2)

Each client watches only the znode just ahead of its own. No herd effect!
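
The queueing behaviour of the steps above can be simulated in memory. This is a sketch under simplifying assumptions, not the real recipe: `ToyLock` is a hypothetical class, a counter stands in for sequential znode names, and a dictionary stands in for watches. The point it illustrates is that releasing the lock wakes exactly one waiter.

```python
class ToyLock:
    def __init__(self):
        self.next_seq = 0
        self.queue = {}      # sequence number -> client name
        self.watchers = {}   # watched sequence number -> waiting client name
        self.notified = []   # clients woken by a deletion, in order

    def request(self, client):
        # Step 1: create a sequential ephemeral znode; the counter gives the order.
        seq = self.next_seq
        self.next_seq += 1
        self.queue[seq] = client
        # Steps 2-4: watch only the znode just ahead of ours, if any.
        earlier = [s for s in self.queue if s < seq]
        if earlier:
            self.watchers[max(earlier)] = client
        return seq

    def holder(self):
        # Step 3: the lowest sequence number holds the lock.
        return self.queue[min(self.queue)] if self.queue else None

    def release(self, seq):
        del self.queue[seq]
        # Deleting a znode wakes at most the one client watching it.
        waiter = self.watchers.pop(seq, None)
        if waiter is not None:
            self.notified.append(waiter)
            # Step 7 (goto 2): the waiter re-checks and watches its new predecessor.
            earlier = [s for s in self.queue if s < seq]
            if earlier:
                self.watchers[max(earlier)] = waiter


lock = ToyLock()
a = lock.request("A")
b = lock.request("B")
c = lock.request("C")
first = lock.holder()    # "A" holds the lock
lock.release(a)
second = lock.holder()   # "B" is next; "C" was never woken
```

With the simple lock, every waiting client would be notified on release and all but one would fail to re-create the znode. Here the notification list contains only the next client in line.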

ZOOKEEPER IMPLEMENTATION

For read requests, a server reads the state of its local database.
For write requests, servers forward them to the leader. The servers then run an agreement protocol and finally commit the changes to the ZooKeeper database, which is fully replicated across all servers of the ensemble.

Pics from Scott Leberknight’s presentation

REQUEST PROCESSOR

Converts write requests into idempotent transactions
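
A minimal sketch of what this conversion can look like for a conditional setData request, assuming state is a dictionary of path to (data, version). The function names and the transaction layout are hypothetical. The idea shown is that the transaction records the resulting state rather than the condition, so applying it twice is harmless.

```python
def to_transaction(state, request):
    """Turn a conditional setData request into an idempotent transaction."""
    path, data, expected = request["path"], request["data"], request["version"]
    _, current = state[path]
    if expected != -1 and expected != current:
        return {"type": "error", "path": path}
    # The transaction carries absolute values: new data and new version.
    return {"type": "setData", "path": path, "data": data, "version": current + 1}


def apply_txn(state, txn):
    if txn["type"] == "setData":
        state[txn["path"]] = (txn["data"], txn["version"])


state = {"/cfg": ("old", 3)}
txn = to_transaction(state, {"path": "/cfg", "data": "new", "version": 3})
apply_txn(state, txn)
apply_txn(state, txn)   # applying the same transaction again changes nothing

# The same request is now stale, so it becomes an error transaction.
rejected = to_transaction(state, {"path": "/cfg", "data": "x", "version": 3})
```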

ATOMIC BROADCAST

The leader executes the request and broadcasts the change to the ZooKeeper state through Zab, an atomic broadcast protocol
- Order guarantees
- Write-ahead log to keep track of proposals

REPLICATED DATABASE

Each replica keeps a copy of the ZooKeeper state in memory.
When a ZooKeeper server recovers from a crash, it needs to recover this internal state.
It could replay the entire write-ahead log, but that is slow!
Instead, servers take periodic fuzzy snapshots, so recovering from a crash only requires redelivery of the messages since the start of the snapshot.
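
Recovery from a fuzzy snapshot can be sketched as follows. The log format (zxid, path, data, version), the `replay` function, and the example paths are assumptions made for illustration. The sketch depends on the logged transactions being idempotent, so re-applying a change the snapshot already captured is harmless.

```python
def replay(snapshot, log, start_zxid):
    """Rebuild state from a snapshot plus the transactions logged since it began."""
    state = dict(snapshot)
    for zxid, path, data, version in log:
        if zxid >= start_zxid:
            state[path] = (data, version)   # idempotent: sets absolute values
    return state


log = [
    (1, "/foo", "f1", 1),
    (2, "/goo", "g1", 1),
    (3, "/foo", "f2", 2),
    (4, "/goo", "g2", 2),
]
# A fuzzy snapshot that started at zxid 3: it captured /foo before txn 3 was
# applied but /goo after txn 4, so it matches no single point in time.
fuzzy_snapshot = {"/foo": ("f1", 1), "/goo": ("g2", 2)}

recovered = replay(fuzzy_snapshot, log, start_zxid=3)
```

The recovered state matches the state after transaction 4, even though the snapshot itself never corresponded to a consistent state.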

CLIENT-SERVER INTERACTION

When a server processes a write request, it also sends out and clears the notifications for any watches that correspond to that update.
Servers process writes in order and do not process other reads or writes concurrently.
Read requests are processed locally.

A read may return a stale value.
To avoid this, clients call sync followed by a read. The FIFO ordering guarantee together with the global guarantee of sync ensures that the result of the read reflects any changes that happened before the sync was issued.
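
The difference between a plain local read and a sync followed by a read can be sketched with a toy leader and a lagging follower. `Leader`, `Follower`, and their methods are hypothetical stand-ins, and the sketch collapses the real protocol into a single catch-up step.

```python
class Leader:
    def __init__(self):
        self.log = []          # committed (path, value) writes, in order

    def write(self, path, value):
        self.log.append((path, value))


class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0       # how many leader writes this replica has applied
        self.state = {}

    def catch_up(self):
        for path, value in self.leader.log[self.applied:]:
            self.state[path] = value
        self.applied = len(self.leader.log)

    def read(self, path):
        return self.state.get(path)   # local read: fast, possibly stale

    def sync(self):
        self.catch_up()               # apply all writes pending at the leader


leader = Leader()
follower = Follower(leader)
leader.write("/x", "1")
follower.catch_up()
leader.write("/x", "2")               # follower has not applied this yet

stale = follower.read("/x")           # still sees the old value
follower.sync()
fresh = follower.read("/x")           # sees the write that preceded the sync
```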

CLIENT-SERVER INTERACTION

Timeouts are used to detect session failures.
To make sure the view of the server is at least as recent as the view of the client, the server checks the last zxid seen by the client against its own last zxid.

If the client has a more recent view, the server does not reestablish the session until the server has caught up.
A client can always find another server with a recent view of the system.
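
The zxid comparison on reconnect can be sketched as a simple check. The function names, server names, and zxid values are hypothetical. The sketch only shows the comparison, not the session protocol around it.

```python
def can_accept_session(server_last_zxid, client_last_zxid):
    """A server may serve a reconnecting client only if it is not behind it."""
    return server_last_zxid >= client_last_zxid


def pick_server(servers, client_last_zxid):
    """Return the first server whose view is recent enough, else None."""
    for name, last_zxid in servers.items():
        if can_accept_session(last_zxid, client_last_zxid):
            return name
    return None


servers = {"s1": 40, "s2": 57}         # server name -> last zxid applied
chosen = pick_server(servers, 52)      # s1 is behind the client, s2 is not
```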

APPLICATIONS & ORGANIZATIONS USING ZOOKEEPER

Yahoo!
Leader election, configuration management, sharding, locking, group membership etc.

Apache HBase
The Hadoop database uses ZooKeeper for master election, server lease management, bootstrapping, and coordination between servers.

Eclipse
Eclipse Communication Framework & Gyrex use ZooKeeper as the core cloud component for node membership and management, coordination of jobs executing among workers, a lock service, a simple queue service, and more.

APPLICATIONS & ORGANIZATIONS USING ZOOKEEPER

- AdroitLogic
- Deepdyve
- Helprace
- Katta
- 101tec
- Neo4j
- Rackspace
- CXF DOSGi
- Solr
- Benipal Technologies
- Makara

EVALUATION

THROUGHPUT

THROUGHPUT UPON FAILURES

LATENCY OF REQUEST

CONCLUSION

Simple interface and powerful abstractions

Use fast reads with watches to achieve high throughput for read-dominant workloads

Wait-free property is essential for high performance

REFERENCE

Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX Annual Technical Conference. Vol. 8. 2010.

Junqueira, Flavio, and Benjamin Reed. ZooKeeper: Distributed Process Coordination. O'Reilly Media, Inc., 2013.

Reed, Benjamin. “ZooKeeper: Wait-free Coordination for Internet-scale System.” USENIX Annual Technical Conference. Boston, MA. 24 June, 2010. https://www.usenix.org/legacy/events/atc10/tech/slides/hunt.pdf

The Apache Software Foundation. (2016, July 20). ZooKeeper Overview. Retrieved February 20, 2017, from Apache.org, https://zookeeper.apache.org/doc/trunk/zookeeperOver.html

Haloi, Saurav. “Introduction to Apache ZooKeeper.” GNUnify. Pune, IN. Feb 16, 2013. http://www.slideshare.net/sauravhaloi/introduction-to-apache-zookeeper?qid=1634c565-7f28-4f08-8182-2824cdf14d20&v=&b=&from_search=1

Leberknight, Scott. “Apache ZooKeeper.” Near Infinity 2012 spring conference. http://www.slideshare.net/scottleber/apache-zookeeper?qid=77f8af68-ec4d-442d-890c-f8ae99b54b24&v=&b=&from_search=3

Shah2239. "Zookeeper: Wait-free Coordination for Internet-scale Systems Part 3." YouTube. YouTube, 06 Dec. 2016. Web. 20 Feb. 2017.

Shah2239. "Zookeeper: Wait-free Coordination for Internet-scale Systems Part 1." YouTube. YouTube, 06 Dec. 2016. Web. 20 Feb. 2017.

THANK YOU!

QUESTIONS?
