ZOOKEEPER: WAIT-FREE COORDINATION FOR
INTERNET-SCALE SYSTEMS
Authors: P. Hunt, M. Konar, F. P. Junqueira, B. Reed
Presenter: Lian Mo
WHAT IS ZOOKEEPER
A distributed, open-source coordination service for distributed applications.
It provides tools for implementing primitives and tasks such as locks, master election, and tracking live processes.
WHAT IS ZOOKEEPER
ZooKeeper lets distributed processes coordinate with each other through a shared hierarchical namespace, similar to a file system.
Applications act as clients. They connect to ZooKeeper servers and invoke operations through the client API.
WHAT IS ZOOKEEPER
Wait-free property
- slow processes cannot slow down fast ones
- no deadlocks
- easier to implement
FIFO client order and linearizable writes are guaranteed.
ZOOKEEPER VS CHUBBY
Similarities:
• Both provide interfaces similar to a UNIX-like file system
• Both provide a mechanism to follow changes on files (events & watches)
• Both have sessions
• Both keep a write-ahead replay log
• Both have regular and ephemeral nodes
Differences:
• Chubby is a distributed lock system only. In ZooKeeper, a lock system may be implemented, but is not a must
• Chubby: reads & writes all go to the leader. ZooKeeper: reads on any server
• Chubby manages client caches. ZooKeeper clients manage their own caches
DATA MODEL
ZooKeeper follows a hierarchical namespace.
Each node in the namespace is called a znode.
Znodes are data objects that clients manipulate through the ZooKeeper API.
ZNODES
Regular znodes
- can have children
- created and deleted by clients explicitly
Ephemeral znodes
- cannot have children
- created by clients explicitly
- deleted by clients explicitly, or removed automatically by the system when the session that created them terminates
ZNODES
Are in-memory data nodes
Data is read and written in its entirety
Not for storing general data, but meta-data
Map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes
Have their own meta-data: time stamps and version counters
ZNODE WATCHES
How do clients keep track of znode changes?
They could poll periodically, but that is inefficient.
Watches!
- One-time trigger associated with a session
- Indicates that a change happened, but does not deliver the new data
- Unregistered once triggered or when the session closes
- E.g. getChildren(path, watch)
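The one-time nature of a watch can be illustrated with a small in-memory model. This is only a sketch: `TinyStore`, `get_data` and `set_data` are hypothetical names for a toy stand-in, not the real ZooKeeper client API.

```python
from collections import defaultdict


class TinyStore:
    """Toy in-memory stand-in for a ZooKeeper server (hypothetical, not the real API)."""

    def __init__(self):
        self.data = {}                     # path -> value
        self.watches = defaultdict(list)   # path -> callbacks waiting for one change

    def get_data(self, path, watch=None):
        # Reading with a watch registers a one-time trigger on this path.
        if watch is not None:
            self.watches[path].append(watch)
        return self.data.get(path)

    def set_data(self, path, value):
        self.data[path] = value
        # Fire and clear: each watch is notified once, and only with the path,
        # not with the new value.
        callbacks, self.watches[path] = self.watches[path], []
        for callback in callbacks:
            callback(path)


events = []
store = TinyStore()
store.set_data("/config", b"v1")
store.get_data("/config", watch=events.append)
store.set_data("/config", b"v2")   # triggers the watch
store.set_data("/config", b"v3")   # no watch is left, so no second event
print(events)
```

The client sees a single event for two updates, so it has to read again (and set a new watch) to learn the current value.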
ZOOKEEPER API
create(path, data, flags)
delete(path, version)
exists(path, watch)
getData(path, watch)
setData(path, data, version)
getChildren(path, watch)
sync(path)
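The version argument of setData and delete makes an update conditional. A minimal sketch of that check, assuming a toy `Znode` class and a hypothetical `BadVersionError` (neither is the real client library):

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is out of date (hypothetical name)."""


class Znode:
    def __init__(self, data):
        self.data = data
        self.version = 0


def set_data(znode, data, version):
    # version == -1 means "skip the version check", as in setData(path, data, -1).
    if version != -1 and version != znode.version:
        raise BadVersionError(f"expected {version}, found {znode.version}")
    znode.data = data
    znode.version += 1
    return znode.version


node = Znode(b"a")
seen = node.version                          # two clients both read version 0
set_data(node, b"from-client-1", seen)       # the first conditional write wins
try:
    set_data(node, b"from-client-2", seen)   # the second one is rejected
    rejected = False
except BadVersionError:
    rejected = True
set_data(node, b"forced", -1)                # unconditional write
print(rejected, node.version, node.data)
```

The rejected client would normally re-read the znode and retry with the fresh version.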
ORDERING GUARANTEES
Linearizable writes: all requests that update the state of ZooKeeper are serializable and respect precedence
FIFO client order: all requests from a given client are executed in the order that they were sent by the client
Notification order: if a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.
WHY IMPORTANT
If a system has just elected a new leader, the new leader must change many configuration parameters and notify the other processes once it finishes.
- While the new leader is making changes, we don’t want other processes to start using the configuration.
- If the new leader dies before the configuration has been fully updated, we don’t want processes to use that partial configuration.
Solution: set a READY znode.
Sequence: delete READY, update config, create READY
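The READY protocol can be sketched with a plain dictionary standing in for the znode tree. The paths, function names and the `crash_after` parameter are hypothetical, and the sketch is single-threaded, so it shows the protocol steps rather than the ordering guarantees that make them safe under concurrency.

```python
READY = "/app/ready"             # hypothetical paths
CONFIG_PREFIX = "/app/config/"


def publish_config(tree, settings, crash_after=None):
    """New leader: delete READY, update the config znodes, create READY."""
    tree.pop(READY, None)
    for i, (key, value) in enumerate(settings.items()):
        if crash_after is not None and i >= crash_after:
            return False         # simulated leader crash: READY is never recreated
        tree[CONFIG_PREFIX + key] = value
    tree[READY] = b""
    return True


def read_config(tree):
    """Other processes only use the configuration when READY exists."""
    if READY not in tree:
        return None
    return {p[len(CONFIG_PREFIX):]: v for p, v in tree.items() if p.startswith(CONFIG_PREFIX)}


tree = {}
publish_config(tree, {"host": "a", "port": "1"})
complete_view = read_config(tree)
publish_config(tree, {"host": "b", "port": "2"}, crash_after=1)
partial_view = read_config(tree)
print(complete_view, partial_view)
```

After the simulated crash the tree holds a mix of old and new values, but readers get `None` because READY is missing.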
ZOOKEEPER GUARANTEES
Liveness guarantees: if a majority of ZooKeeper servers are active and communicating, the service will be available
Durability guarantees: if the ZooKeeper service responds successfully to a change request, that change persists across any number of failures as long as a quorum of servers is eventually able to recover
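The majority requirement translates into simple arithmetic. A small sketch, with hypothetical helper names:

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

An ensemble of 4 tolerates no more failures than an ensemble of 3, which is why odd ensemble sizes are the usual choice.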
PRIMITIVES IMPLEMENTED BY ZOOKEEPER
Configuration Management
Rendezvous
Group Membership
Simple Locks
Simple Locks without Herd Effect
Read/Write Locks
Double Barrier
CONFIGURATION
Workers get the configuration
- getData(“../config/settings”, true)
Administrators change the configuration
- setData(“../config/settings”, newConf, -1)
Workers are notified of the change and get the new settings
- getData(“…/config/settings”, true)
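Because a watch fires only once, a worker has to set a new watch every time it reloads the settings. The sketch below models that loop in memory. `ConfigStore` and `Worker` are hypothetical classes, not the ZooKeeper client.

```python
class ConfigStore:
    """Toy stand-in for the settings znode (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.watchers = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self.watchers.append(watcher)
        return self.value

    def set_data(self, value):
        self.value = value
        # One-time triggers: notify the current watchers, then forget them.
        pending, self.watchers = self.watchers, []
        for notify in pending:
            notify()


class Worker:
    def __init__(self, store):
        self.store = store
        self.settings = None
        self.reload()

    def reload(self):
        # Read the settings and re-arm the watch in the same call.
        self.settings = self.store.get_data(watcher=self.reload)


store = ConfigStore("v1")
workers = [Worker(store), Worker(store)]
store.set_data("v2")
store.set_data("v3")
print([w.settings for w in workers])
```

If `reload` read the value without passing a watcher, the workers would see "v2" and then miss "v3".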
GROUP MEMBERSHIP
Each process that is a member of the group creates an ephemeral child under the workers znode when it starts
- create(“../workers/workerName”, hostInfo, EPHEMERAL)
If a process fails or ends, the znode that represents it is removed automatically
Processes can obtain group information by listing the children of the workers znode
- getChildren(“../workers”, true)
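Group membership relies on ephemeral znodes disappearing with their session. A minimal in-memory sketch, assuming a hypothetical `ToyEnsemble` class and made-up worker names:

```python
class ToyEnsemble:
    """Tracks ephemeral znodes and the session that owns each one (hypothetical)."""

    def __init__(self):
        self.ephemeral_owner = {}   # znode path -> id of the session that created it

    def create_ephemeral(self, path, session_id):
        if path in self.ephemeral_owner:
            raise ValueError(f"{path} already exists")
        self.ephemeral_owner[path] = session_id

    def close_session(self, session_id):
        # Session ended or timed out: drop every ephemeral znode it created.
        self.ephemeral_owner = {
            p: s for p, s in self.ephemeral_owner.items() if s != session_id
        }

    def get_children(self, parent):
        prefix = parent.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.ephemeral_owner if p.startswith(prefix))


zk = ToyEnsemble()
zk.create_ephemeral("/workers/worker-a", session_id=1)
zk.create_ephemeral("/workers/worker-b", session_id=2)
before = zk.get_children("/workers")
zk.close_session(1)                 # worker-a fails or exits
after = zk.get_children("/workers")
print(before, after)
```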
LOCKS
For a simple lock, we can use an EPHEMERAL znode l to represent the lock. To acquire the lock, a client tries to create l
- Success: the client holds the lock
- Failure: the client sets a read watch on that znode and tries to create it again when it gets the notification
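The cost of this simple scheme shows up when the lock is released: every waiting client is notified and retries, but only one can win. A toy model of that behavior, with a hypothetical `SimpleLock` class:

```python
class SimpleLock:
    """Toy model of the single-znode lock (hypothetical, in-memory only)."""

    def __init__(self):
        self.holder = None
        self.waiting = []   # clients that set a watch on the lock znode

    def try_acquire(self, client):
        if self.holder is None:
            self.holder = client      # create succeeded
            return True
        self.waiting.append(client)   # create failed: watch the znode and wait
        return False

    def release(self):
        # Deleting the znode fires every watch, so all waiters retry at once.
        self.holder = None
        woken, self.waiting = self.waiting, []
        for client in woken:
            self.try_acquire(client)
        return len(woken)


lock = SimpleLock()
results = [lock.try_acquire(c) for c in ["a", "b", "c", "d"]]
woken = lock.release()
print(results, woken, lock.holder, lock.waiting)
```

Three clients wake up for one release and two of them go straight back to waiting.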
LOCKS
Line up all the clients requesting the lock; each client obtains the lock in order of request arrival.
1) id = create(“.../locks/x-”, SEQUENCE|EPHEMERAL)
2) getChildren(“.../locks/”, false)
3) if id is the 1st child, exit
4) exists(name of last child before id, true)
5) if it does not exist, goto 2)
6) wait for notification from the watch
7) goto 2)
Each client watches only one other znode. No herd effect!
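The core decision in steps 2) to 4) is choosing which znode to watch. A minimal sketch of just that step, assuming zero-padded sequence suffixes so that string order matches creation order. The function name and the znode names are hypothetical.

```python
def znode_to_watch(my_id, children):
    """Return the znode just before my_id, or None if my_id holds the lock."""
    ordered = sorted(children)        # zero-padded suffixes sort in creation order
    position = ordered.index(my_id)
    return None if position == 0 else ordered[position - 1]


children = {"x-0000000002", "x-0000000000", "x-0000000001"}
watching = {c: znode_to_watch(c, children) for c in children}

# The current holder releases the lock by deleting its znode.
children.remove("x-0000000000")
next_holder = [c for c in children if znode_to_watch(c, children) is None]
print(watching, next_holder)
```

Only the client behind the deleted znode is notified. The others keep watching their own predecessor.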
ZOOKEEPER IMPLEMENTATION
ZOOKEEPER IMPLEMENTATION
For read requests, a server reads the state of its local database.
For write requests, servers forward them to the leader. The servers then run an agreement protocol, and finally commit the changes to the ZooKeeper database, which is fully replicated across all servers of the ensemble.
Pics from Scott Leberknight’s presentation
REQUEST PROCESSOR
Converts write requests into idempotent transactions
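One way to picture the conversion: a conditional write ("set if the version is 1") becomes a transaction that records the resulting state ("data is f2, version is 2"), which can be applied any number of times. The sketch below assumes hypothetical names (`SetDataTxn`, `to_txn`, `apply_txn`) and returns `None` where a real server would produce an error transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetDataTxn:
    path: str
    data: bytes
    version: int      # the version the znode will have after the write


def to_txn(state, path, data, expected_version):
    """Turn a conditional setData request into a transaction that states the outcome."""
    _, current_version = state[path]
    if expected_version != -1 and expected_version != current_version:
        return None   # stand-in for an error transaction
    return SetDataTxn(path, data, current_version + 1)


def apply_txn(state, txn):
    # Applying the same transaction twice leaves the same state: idempotent.
    state[txn.path] = (txn.data, txn.version)


state = {"/foo": (b"f1", 1)}
txn = to_txn(state, "/foo", b"f2", expected_version=1)
apply_txn(state, txn)
once = dict(state)
apply_txn(state, txn)             # delivered again, e.g. during recovery
print(state == once, state)
```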
ATOMIC BROADCAST
The leader executes the request and broadcasts the change to the ZooKeeper state through Zab, an atomic broadcast protocol
Order guarantees
Write-ahead log to keep track of proposals
REPLICATED DATABASE
Each replica has a copy in memory of the ZooKeeper state
When a ZooKeeper server recovers from a crash, it needs to recover this internal state.
It could redeliver the whole write-ahead log, but that is slow!
Instead, servers take periodic fuzzy snapshots (recovering from a crash then only requires redelivery of the messages since the start of the snapshot)
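A fuzzy snapshot is taken while updates keep being applied, so it may capture a mix of old and new znodes. Because transactions are idempotent, replaying the log from the start of the snapshot still converges to the correct state. A small sketch under those assumptions, with hypothetical paths and values:

```python
def apply_txn(state, txn):
    path, data, version = txn
    state[path] = (data, version)   # idempotent: states the result, not a delta


log = [("/foo", "f2", 2), ("/goo", "g2", 2), ("/foo", "f3", 3)]
live = {"/foo": ("f1", 1), "/goo": ("g1", 1)}

# Fuzzy snapshot: znodes are copied one at a time while transactions are applied.
snapshot = {}
snapshot["/goo"] = live["/goo"]     # copied before any transaction
for txn in log:
    apply_txn(live, txn)
snapshot["/foo"] = live["/foo"]     # copied after all transactions

# The snapshot mixes old /goo with new /foo, a state that never existed.
never_existed = dict(snapshot)

# Recovery: load the snapshot, then redeliver the log from the snapshot start.
recovered = dict(snapshot)
for txn in log:
    apply_txn(recovered, txn)
print(never_existed, recovered == live)
```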
CLIENT-SERVER INTERACTION
When a server processes a write request, it also sends out and clears the notifications for any watches that correspond to that update
Servers process writes in order, and do not process other reads or writes concurrently
Read requests are processed locally
A read may return a stale value
Clients can call sync followed by a read: the FIFO client order guarantee and the global guarantee of sync ensure that the read reflects any changes that happened before the sync was issued
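The effect of a stale read and of sync can be modeled with a follower that lags behind. This is a deliberately simplified sketch: `ToyFollower` is hypothetical, and the real sync is an asynchronous operation ordered through the leader rather than a local flush.

```python
class ToyFollower:
    """Follower whose local state can lag behind committed updates (hypothetical)."""

    def __init__(self, state):
        self.state = dict(state)
        self.backlog = []             # committed updates not yet applied locally

    def receive(self, path, value):
        self.backlog.append((path, value))

    def read(self, path):
        # Reads are served from local state only, so they can be stale.
        return self.state.get(path)

    def sync(self):
        # Apply everything that was committed before the sync was issued.
        for path, value in self.backlog:
            self.state[path] = value
        self.backlog.clear()


follower = ToyFollower({"/x": 1})
follower.receive("/x", 2)             # committed elsewhere, not applied here yet
stale = follower.read("/x")
follower.sync()
fresh = follower.read("/x")
print(stale, fresh)
```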
CLIENT-SERVER INTERACTION
Timeouts are used to detect session failures
To make sure the server’s view is at least as recent as the client’s view, the server checks the client’s last zxid against its own
If the client has a more recent view, the server does not reestablish the session until the server has caught up.
A client can always find another server with a recent view of the system.
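The zxid comparison on reconnect is a simple "at least as recent" test. A minimal sketch, with hypothetical function names, server names and zxid values:

```python
def can_serve(server_zxid, client_last_zxid):
    """A server may take over a session only if it has seen what the client has seen."""
    return server_zxid >= client_last_zxid


def pick_server(servers, client_last_zxid):
    """Return the first server whose view is recent enough, or None."""
    for name, zxid in servers.items():
        if can_serve(zxid, client_last_zxid):
            return name
    return None


servers = {"s1": 41, "s2": 45, "s3": 47}   # last zxid applied by each server
chosen = pick_server(servers, client_last_zxid=44)
print(chosen)
```

Here `s1` is behind the client, so the session would be reestablished with `s2`.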
APPLICATIONS & ORGANIZATIONS USING ZOOKEEPER
Yahoo!: leader election, configuration management, sharding, locking, group membership etc.
Apache HBase: the Hadoop database uses ZooKeeper for master election, server lease management, bootstrapping, and coordination between servers.
Eclipse: Eclipse Communication Framework & Gyrex use ZooKeeper as the core cloud component for node membership and management, coordination of jobs executing among workers, a lock service, a simple queue service and a lot more.
APPLICATIONS & ORGANIZATIONS USING ZOOKEEPER
AdroitLogic
Deepdyve
Helprace
Katta
101tec
Neo4j
Rackspace
CXF DOSGi
Solr
Benipal Technologies
Makara
EVALUATION
THROUGHPUT
THROUGHPUT
THROUGHPUT UPON FAILURES
LATENCY OF REQUEST
CONCLUSION
Simple interface and powerful abstractions
Use fast reads with watches to achieve high throughput for read-dominant workloads
Wait-free property is essential for high performance
REFERENCE
Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX annual technical conference. Vol. 8. 2010.
Junqueira, Flavio, and Benjamin Reed. ZooKeeper: distributed process coordination. O'Reilly Media, Inc., 2013.
Reed, Benjamin. “ZooKeeper: Wait-free Coordination for Internet-scale System.” USENIX Annual Technical Conference. Boston, MA. 24 June, 2010. https://www.usenix.org/legacy/events/atc10/tech/slides/hunt.pdf
The Apache Software Foundation. (2016, July 20). ZooKeeper Overview. Retrieved February 20, 2017, from Apache.org, https://zookeeper.apache.org/doc/trunk/zookeeperOver.html
Haloi, Saurav. “Introduction to Apache ZooKeeper.” GNUnify. Pune, IN. Feb 16, 2013. http://www.slideshare.net/sauravhaloi/introduction-to-apache-zookeeper?qid=1634c565-7f28-4f08-8182-2824cdf14d20&v=&b=&from_search=1
Leberknight, Scott. “Apache ZooKeeper.” Near Infinity 2012 spring conference. http://www.slideshare.net/scottleber/apache-zookeeper?qid=77f8af68-ec4d-442d-890c-f8ae99b54b24&v=&b=&from_search=3
Shah2239. "Zookeeper: Wait-free Coordination for Internet-scale Systems Part 3." YouTube. YouTube, 06 Dec. 2016. Web. 20 Feb. 2017.
Shah2239. "Zookeeper: Wait-free Coordination for Internet-scale Systems Part 1." YouTube. YouTube, 06 Dec. 2016. Web. 20 Feb. 2017.
THANK YOU!
QUESTIONS?