ZOOKEEPER: WAIT-FREE COORDINATION FOR INTERNET-SCALE SYSTEMS

Authors: P. Hunt, M. Konar, F. P. Junqueira, B. Reed
Presenter: Lian Mo

WHAT IS ZOOKEEPER

A distributed, open-source coordination service for distributed applications
It provides tools for implementing primitives and tasks such as locks, master election, and tracking live processes

WHAT IS ZOOKEEPER

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace similar to a file system.
Applications act as clients. They can connect to ZooKeeper servers and invoke operations on them through the client API.

WHAT IS ZOOKEEPER

Wait-free property
---- slow processes cannot slow down fast ones
---- no deadlocks
---- easier to implement
FIFO client order and linearizable writes guarantees

ZOOKEEPER VS CHUBBY

Similarities:
• Both provide an interface similar to a UNIX-like file system
• Both provide a mechanism to follow changes on files (events & watches)
• Both have sessions
• Both keep a write-ahead replay log
• Both have regular and ephemeral nodes

Differences:
• Chubby is a distributed lock system only; with ZooKeeper a lock system may be implemented, but is not a must
• Chubby: reads and writes all go to the leader; ZooKeeper: reads can be served by any server
• Chubby manages client caches; ZooKeeper clients manage their own caches

DATA MODEL

ZooKeeper follows a hierarchical namespace
Each node in the namespace is called a znode.

Znodes are data objects that clients manipulate through the ZooKeeper API

ZNODES

Regular znodes
- can have children
- created and deleted by clients explicitly

Ephemeral znodes
- cannot have children
- created by clients explicitly
- deleted by clients explicitly, or removed by the system automatically when the session that created them terminates

ZNODES

Are in-memory data nodes
Data is read and written in its entirety
Not for storing general data, but meta-data
Map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes
Have meta-data of their own: time stamps and version counters

ZNODE WATCHES

How do clients keep track of znode changes?
Could poll periodically, but that is inefficient.

Watches!
One-time trigger associated with a session
Indicates that a change happened but does not provide the change itself
Unregistered once triggered or when the session closes
E.g. getChildren(path, watch)
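
The one-time nature of a watch can be sketched with a small in-memory model. This is only an illustration: `ToyZnodeStore` and its methods are hypothetical names invented here, not the ZooKeeper client API, and the callback receives only the path, mirroring the point that a watch signals a change without carrying it.

```python
class ToyZnodeStore:
    """Hypothetical in-memory stand-in for a ZooKeeper server (not the real client)."""

    def __init__(self):
        self.children = {}  # parent path -> set of child names
        self.watches = {}   # parent path -> one-shot callbacks

    def get_children(self, path, watch=None):
        # Registering a watch is optional and tied to this single read.
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return sorted(self.children.get(path, set()))

    def create(self, parent, name):
        self.children.setdefault(parent, set()).add(name)
        # Fire and clear: each registered watch triggers at most once.
        for callback in self.watches.pop(parent, []):
            callback(parent)


store = ToyZnodeStore()
events = []
store.get_children("/workers", watch=events.append)
store.create("/workers", "w1")  # triggers the watch
store.create("/workers", "w2")  # no watch is registered any more
print(events)
```

To keep receiving notifications, a client would have to read again with the watch flag set after each event.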

ZOOKEEPER API

create(path, data, flags)
delete(path, version)
exists(path, watch)
getData(path, watch)
setData(path, data, version)
getChildren(path, watch)
sync(path)

ORDERING GUARANTEES

Linearizable writes: all requests that update the state of ZooKeeper are serializable and respect precedence

FIFO client order: all requests from a given client are executed in the order that they were sent by the client

Notification order: if a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.

WHY IMPORTANT

If a system has just elected a new leader, the new leader must change many configuration parameters and notify the other processes once it finishes.
---- While the new leader is making changes, we don't want other processes to start using the configuration.
---- If the new leader dies before the configuration has been fully updated, we don't want processes to use that partial configuration.
Set a READY znode.

delete READY -> update config -> create READY
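
The READY sequence can be sketched as follows. This is a single-process toy, assuming a plain dict stands in for the znode tree; `update_config`, `read_config`, and the `crash_after` parameter are hypothetical names used only to simulate a leader that dies part-way through. In ZooKeeper itself, the ordering guarantees are what make this pattern safe when the leader pipelines its writes.

```python
def update_config(znodes, new_settings, crash_after=None):
    """Leader side: delete READY, write the settings, create READY.

    `crash_after` (hypothetical) simulates the leader dying after that many writes.
    """
    znodes.pop("/config/READY", None)
    for i, (key, value) in enumerate(new_settings.items()):
        if crash_after is not None and i >= crash_after:
            return False  # leader died; READY is never recreated
        znodes["/config/" + key] = value
    znodes["/config/READY"] = b""
    return True


def read_config(znodes):
    """Worker side: use the configuration only when READY exists."""
    if "/config/READY" not in znodes:
        return None
    return {path.rsplit("/", 1)[1]: value
            for path, value in znodes.items()
            if path.startswith("/config/") and path != "/config/READY"}


znodes = {}
ok = update_config(znodes, {"a": 1, "b": 2})
first = read_config(znodes)
crashed = update_config(znodes, {"a": 10, "b": 20}, crash_after=1)
second = read_config(znodes)  # partial update is never exposed
print(first, second)
```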

ZOOKEEPER GUARANTEES

Liveness guarantees: if a majority of ZooKeeper servers are active and communicating, the service will be available

Durability guarantees: if the ZooKeeper service responds successfully to a change request, that change persists across any number of failures as long as a quorum of servers is eventually able to recover
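
As a small worked example of the majority condition, the arithmetic below shows how many servers must be up for a given ensemble size. It assumes simple majority quorums; the function names are made up for this sketch.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that may be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption a 4-server ensemble tolerates no more failures than a 3-server one, which is one reason odd ensemble sizes are common.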

PRIMITIVES IMPLEMENTED BY ZOOKEEPER

Configuration Management
Rendezvous
Group Membership
Simple Locks
Simple Locks without Herd Effect
Read/Write Locks
Double Barrier

CONFIGURATION

Workers get the configuration
- getData("../config/settings", true)

Administrators change the configuration
- setData("../config/settings", newConf, -1)

Workers are notified of the change and get the new settings
- getData("…/config/settings", true)
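
The read, update, re-read cycle above might look like this in a toy model. `ToyConfigNode` and `Worker` are hypothetical classes, not the real client; the only behaviour borrowed from ZooKeeper is that a version of -1 matches any version and that a watch must be re-armed by reading again.

```python
class ToyConfigNode:
    """Hypothetical single znode with data, a version counter, and one-shot watches."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self.watchers = []

    def get_data(self, watch=None):
        if watch is not None:
            self.watchers.append(watch)
        return self.data, self.version

    def set_data(self, data, version):
        # -1 means "any version"; otherwise the write is conditional.
        if version != -1 and version != self.version:
            raise ValueError("version mismatch")
        self.data = data
        self.version += 1
        pending, self.watchers = self.watchers, []
        for notify in pending:
            notify()


class Worker:
    def __init__(self, node):
        self.node = node
        self.settings = None
        self.refresh()

    def refresh(self):
        # Reading with a watch re-arms the one-time trigger.
        self.settings, _ = self.node.get_data(watch=self.refresh)


node = ToyConfigNode("v1")
worker = Worker(node)
node.set_data("v2", -1)  # administrator changes the configuration
print(worker.settings)
```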

GROUP MEMBERSHIP

Each process that is a member of the group creates an ephemeral child under the workers znode when it starts
- create("../workers/workerName", hostInfo, EPHEMERAL)

If a process fails or ends, the znode that represents it is automatically removed

Processes can obtain group information by listing the children of the workers znode
- getChildren("../workers", true)
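
A minimal sketch of why ephemeral znodes fit group membership, assuming a toy in-memory group in which each member znode remembers the session that created it. `ToyGroup` and its method names are hypothetical.

```python
class ToyGroup:
    """Hypothetical model of a workers znode with ephemeral children."""

    def __init__(self):
        self.members = {}  # child name -> (session id, host info)

    def join(self, session_id, name, host_info):
        self.members[name] = (session_id, host_info)

    def close_session(self, session_id):
        # Ephemeral znodes disappear with the session that created them.
        self.members = {name: entry for name, entry in self.members.items()
                        if entry[0] != session_id}

    def list_members(self):
        return sorted(self.members)


group = ToyGroup()
group.join(session_id=1, name="worker-a", host_info="10.0.0.1")
group.join(session_id=2, name="worker-b", host_info="10.0.0.2")
before = group.list_members()
group.close_session(1)  # worker-a fails or exits
after = group.list_members()
print(before, after)
```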

LOCKS

For a simple lock, we could just use an EPHEMERAL znode l that represents the lock. To acquire the lock, a client tries to create l

- Success: the client holds the lock
- Failure: the client sets a watch on that znode and tries to create it again when it is notified
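
The simple lock can be modelled in a few lines to show what happens on release. `ToySimpleLock` is a hypothetical single-process model; the `wakeups` counter is added here only to make visible that every waiting client is notified even though just one can win.

```python
class ToySimpleLock:
    """Hypothetical model of a lock built on one ephemeral znode."""

    def __init__(self):
        self.holder = None
        self.waiters = []  # clients with a watch on the lock znode
        self.wakeups = 0

    def acquire(self, client):
        if self.holder is None:
            self.holder = client     # create succeeded
            return True
        self.waiters.append(client)  # create failed: watch the znode
        return False

    def release(self):
        self.holder = None
        woken, self.waiters = self.waiters, []
        self.wakeups += len(woken)   # every waiter is notified
        for client in woken:
            self.acquire(client)     # all retry; only one wins


lock = ToySimpleLock()
results = [lock.acquire(c) for c in ("a", "b", "c")]
lock.release()
print(results, lock.holder, lock.wakeups)
```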

LOCKS

Line up all the clients requesting the lock; each client obtains the lock in order of request arrival.

1) id = create(".../locks/x-", SEQUENCE|EPHEMERAL)
2) getChildren(".../locks", false)
3) if id is the 1st child, exit
4) exists(name of last child before id, true)
5) if it does not exist, goto 2)
6) wait for notification from the watch
7) goto 2)

Each client watches only one other znode, so there is no herd effect!
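
Steps 3 and 4 come down to ordering the children by sequence number and finding the immediate predecessor. The helper below is a sketch of that decision only; `lock_step` is a hypothetical name, and it assumes child names end in `-<sequence number>` as produced by the SEQUENCE flag.

```python
def lock_step(my_node, children):
    """Return None if my_node holds the lock, else the one child to watch."""

    def sequence(name):
        return int(name.rsplit("-", 1)[1])

    ordered = sorted(children, key=sequence)
    index = ordered.index(my_node)
    # Lowest sequence number holds the lock; everyone else watches
    # only the znode immediately ahead of it.
    return None if index == 0 else ordered[index - 1]


children = ["x-0000000003", "x-0000000001", "x-0000000002"]
holder_view = lock_step("x-0000000001", children)
third_view = lock_step("x-0000000003", children)
print(holder_view, third_view)
```

When the holder's znode goes away, only the client watching it is notified, and re-running the step on the new child list tells that client whether it now holds the lock.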

ZOOKEEPER IMPLEMENTATION

ZOOKEEPER IMPLEMENTATION

For read requests, a server reads the state of its local database
For write requests, servers forward them to the leader. The servers then run an agreement protocol, and finally commit the changes to the ZooKeeper database, which is fully replicated across all servers of the ensemble

Pics from Scott Leberknight’s presentation

REQUEST PROCESSOR

Converts write requests into idempotent transactions
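
One way to picture the conversion: a conditional write ("set the data if the version matches") is evaluated once against the current state and turned into a transaction that records the resulting state outright, so applying it again changes nothing. The dict layout and the names `to_transaction` and `apply_transaction` are assumptions made for this sketch.

```python
def to_transaction(state, path, data, expected_version):
    """Turn a conditional setData request into an unconditional transaction."""
    current_version = state[path]["version"]
    if expected_version != -1 and expected_version != current_version:
        return {"type": "error", "path": path}
    # The transaction captures the final data and version, not the condition.
    return {"type": "set_data", "path": path,
            "data": data, "version": current_version + 1}


def apply_transaction(state, txn):
    if txn["type"] == "set_data":
        state[txn["path"]] = {"data": txn["data"], "version": txn["version"]}


state = {"/foo": {"data": "f1", "version": 1}}
txn = to_transaction(state, "/foo", "f2", expected_version=1)
apply_transaction(state, txn)
once = dict(state["/foo"])
apply_transaction(state, txn)  # applying it again has no further effect
twice = dict(state["/foo"])
print(once, twice)
```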

ATOMIC BROADCAST

The leader executes the request and broadcasts the change to the ZooKeeper state through Zab, an atomic broadcast protocol
Order guarantees
Write-ahead log to keep track of proposals

REPLICATED DATABASE

Each replica has a copy of the ZooKeeper state in memory
When a ZooKeeper server recovers from a crash, it needs to recover this internal state.
Could redeliver the entire write-ahead log, but that is slow!
Instead, servers take periodic fuzzy snapshots, so recovering from a crash only requires redelivery of the messages since the start of the snapshot
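
A toy illustration of why a fuzzy snapshot is still recoverable, assuming transactions are idempotent overwrites of (data, version). The snapshot below is deliberately taken while writes are in flight, so it matches no single point in time, yet replaying the log from the start of the snapshot restores the live state. The paths and values are made up for the example.

```python
def apply(state, txn):
    path, data, version = txn
    state[path] = (data, version)  # plain overwrite: safe to apply more than once


live = {"/foo": ("f1", 1), "/goo": ("g1", 1)}
log = [("/foo", "f2", 2), ("/goo", "g2", 2), ("/foo", "f3", 3)]

# Fuzzy snapshot: /goo is copied before the log is applied and /foo after,
# because the state is not locked while the snapshot is written.
snapshot = {"/goo": live["/goo"]}
for txn in log:
    apply(live, txn)
snapshot["/foo"] = live["/foo"]

# Recovery: load the snapshot, then redeliver everything since its start.
recovered = dict(snapshot)
for txn in log:
    apply(recovered, txn)

print(snapshot, recovered)
```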

CLIENT-SERVER INTERACTION

When a server processes a write request, it sends out and clears the notifications for any watches that correspond to that update
Servers process writes in order, and do not process other reads or writes concurrently
Read requests are processed locally

A read may return a stale value
Clients that need the latest value call sync followed by a read
The FIFO ordering guarantee and the global guarantee of sync ensure that the result of the read reflects any changes made before the sync was issued
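
The stale-read case and the effect of sync can be sketched with a follower that lags behind the leader's log. `ToyLeader` and `ToyFollower` are hypothetical; in particular, this toy `sync` simply applies everything the leader has logged, which only approximates how the real operation is ordered behind pending writes.

```python
class ToyLeader:
    def __init__(self):
        self.log = []  # committed (key, value) writes, in order

    def write(self, key, value):
        self.log.append((key, value))


class ToyFollower:
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # how much of the leader's log has been applied locally
        self.state = {}

    def read(self, key):
        # Served from local state only, so it may lag behind the leader.
        return self.state.get(key)

    def sync(self):
        for key, value in self.leader.log[self.applied:]:
            self.state[key] = value
        self.applied = len(self.leader.log)


leader = ToyLeader()
follower = ToyFollower(leader)
leader.write("/x", 1)
stale = follower.read("/x")  # follower has not caught up yet
follower.sync()
fresh = follower.read("/x")
print(stale, fresh)
```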

CLIENT-SERVER INTERACTION

Timeouts are used to detect session failures
To make sure the server's view is at least as recent as the client's view, the client's last zxid is checked when it connects to a new server

If the client has a more recent view, the server does not reestablish the session until the server has caught up.
A client can always find another server with a recent view of the system.
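
The zxid check amounts to a simple comparison. The helper below is a sketch under the assumption that each server exposes the zxid of the last transaction it has applied; `eligible_servers` and the server names are made up.

```python
def eligible_servers(client_last_zxid, server_zxids):
    """Servers whose view is at least as recent as the client's."""
    return [name for name, zxid in server_zxids.items()
            if zxid >= client_last_zxid]


servers = {"s1": 10, "s2": 7, "s3": 12}
candidates = eligible_servers(9, servers)
print(candidates)
```

A server that is behind, such as `s2` here, would hold off on reestablishing the session until it has caught up.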

APPLICATIONS & ORGANIZATIONS USING ZOOKEEPER

Yahoo!
Leader election, configuration management, sharding, locking, group membership, etc.

Apache HBase
The Hadoop database uses ZooKeeper for master election, server lease management, bootstrapping, and coordination between servers.

Eclipse
Eclipse Communication Framework & Gyrex use ZooKeeper as the core cloud component for node membership and management, coordination of jobs executing among workers, a lock service, a simple queue service, and a lot more.

APPLICATIONS & ORGANIZATIONS USING ZOOKEEPER

AdroitLogic
Deepdyve
Helprace
Katta
101tec
Neo4j
Rackspace
CXF DOSGi
Solr
Benipal Technologies
Makara

EVALUATION

THROUGHPUT

THROUGHPUT

THROUGHPUT UPON FAILURES

LATENCY OF REQUEST

CONCLUSION

Simple interface and powerful abstractions

Use fast reads with watches to achieve high throughput for read-dominant workloads

Wait-free property is essential for high performance

REFERENCE

Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX Annual Technical Conference. Vol. 8. 2010.

Junqueira, Flavio, and Benjamin Reed. ZooKeeper: Distributed Process Coordination. O'Reilly Media, Inc., 2013.

Reed, Benjamin. "ZooKeeper: Wait-free Coordination for Internet-scale System." USENIX Annual Technical Conference. Boston, MA. 24 June, 2010. https://www.usenix.org/legacy/events/atc10/tech/slides/hunt.pdf

The Apache Software Foundation. (2016, July 20). ZooKeeper Overview. Retrieved February 20, 2017, from Apache.org, https://zookeeper.apache.org/doc/trunk/zookeeperOver.html

Haloi, Saurav. "Introduction to Apache ZooKeeper." GNUnify. Pune, IN. Feb 16, 2013. http://www.slideshare.net/sauravhaloi/introduction-to-apache-zookeeper?qid=1634c565-7f28-4f08-8182-2824cdf14d20&v=&b=&from_search=1

Leberknight, Scott. "Apache ZooKeeper." Near Infinity 2012 spring conference. http://www.slideshare.net/scottleber/apache-zookeeper?qid=77f8af68-ec4d-442d-890c-f8ae99b54b24&v=&b=&from_search=3

Shah2239. "Zookeeper: Wait-free Coordination for Internet-scale Systems Part 3." YouTube. YouTube, 06 Dec. 2016. Web. 20 Feb. 2017.

Shah2239. "Zookeeper: Wait-free Coordination for Internet-scale Systems Part 1." YouTube. YouTube, 06 Dec. 2016. Web. 20 Feb. 2017.

THANK YOU!

QUESTIONS?