TRANSCRIPT
ZooKeeper Futures: Expanding the Menagerie
Henry Robinson, Software Engineer @ Cloudera
Hadoop meetup - 11/5/2009
Thursday, 5 November 2009
Upcoming features for ZooKeeper
▪Observers
▪Dynamic ensembles
▪ZooKeeper in Cloudera’s Distribution for Hadoop
Observers
▪ ZOOKEEPER-368
▪ Problem:
▪ Every node in a ZooKeeper cluster has to vote
▪ So increasing the cluster size increases the cost of write operations
▪ But increasing the cluster size is the only way currently to get client scalability
▪ False tension between number of clients and performance
▪ Should only increase size of voting cluster to improve reliability
Observers
▪ It’s worse than that
▪ Since clients are given a list of servers in the ensemble to connect to, the cluster is not isolated from swamping due to the number of clients
▪ That is, if a swarm of clients connect to one server and kill it, they’ll move on to another and do the same.
▪ Now we are sharing the same number of clients amongst fewer servers!
▪ So if these were enough clients originally to down a server, the prognosis is not good for those remaining
▪ Only ⌈n/2⌉ of the n servers have to die before a majority quorum is impossible and the cluster is no longer live
Observers
▪ Simple way to attack this problem: non-voting cluster members
▪Act as a fan-in point for client connections by proxying requests to the inner voting ensemble
▪Doesn’t matter if they die (in the sense that liveness is preserved) - cluster is still available for writes
▪Write throughput stays roughly constant as number of Observers increases
▪ So we can freely scale the number of Observers to meet the requirements of the number of clients
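The arithmetic behind this slide can be sketched as follows. This is an illustrative model, not ZooKeeper code: a write commits once a majority of the *voting* members acknowledge it, so the per-write cost depends only on the number of voters, and Observers can be added freely without changing it.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of the voting ensemble."""
    return voters // 2 + 1

def write_cost(voters: int, observers: int) -> int:
    """Acks needed to commit a write: a majority of voters.
    Observers never vote, so they add nothing to this cost."""
    return quorum_size(voters)

def is_live(voters: int, failed_voters: int) -> bool:
    """The cluster can commit writes while a majority of voters survive."""
    return voters - failed_voters >= quorum_size(voters)
```

For a 5-voter ensemble, `write_cost(5, 0)` and `write_cost(5, 100)` are both 3, while `is_live(5, 3)` is false: adding Observers scales client capacity without touching the quorum math.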
Observers: More benefits
▪ Voting ensemble members must meet strict latency contracts in order not to be considered ‘failed’
▪Therefore distributing ZooKeeper across many racks, or even datacenters, is problematic.
▪No such requirements made of Observers
▪ So deploy the voting ensemble for reliability and low-latency communication, and everywhere you need a client, add an Observer
▪Reads get served locally, so wide distribution isn’t too painful for some workloads
▪ The likelihood of a partition grows the more widely the ensemble is distributed, so keeping the voting ensemble compact and pushing Observers outwards increases availability in some cases
▪Good integration point for publish-subscribe, and for specific optimisations
Observers: Current state
▪ This patch required a lot of structural work
▪ Hoping to get it into 3.3
▪One major refactor patch committed
▪Core patch up on ZOOKEEPER-368
▪ Check it out and add comments!
▪ Fully functional - you can apply the patch, update your configuration and start using Observers today
▪Benchmarks show expected (and pleasing!) performance improvements
▪To come in future JIRAs - performance tweaking (batching)
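As a sketch of what “update your configuration” looks like, here is the Observer syntax as it eventually shipped in ZooKeeper 3.3 — the patch-era syntax on ZOOKEEPER-368 may differ, and the hostnames below are hypothetical. The Observer declares its own role, and every server’s config tags the Observer entries:

```
# In the Observer's own zoo.cfg:
peerType=observer

# In every server's zoo.cfg, Observers are tagged on the server line:
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
server.4=obs1:2888:3888:observer
```

The voting members (zoo1–zoo3) are unchanged; clients can then be pointed at obs1 to keep connection load off the voting ensemble.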
Dynamic Ensembles
▪ ZOOKEEPER-107
▪ Problem:
▪ What if you really do want to change the membership of your cluster?
▪ Downtime is problematic for a ‘highly-available’ service
▪ But failures occur and machines get repurposed or upgraded
Dynamic Ensembles
▪ We would like to be able to add or remove machines from the cluster without stopping the world
▪Conceptually, this is reasonably easy - we have a mechanism for updating information on every server synchronously, and in order
▪ (it’s called ZooKeeper)
▪ In practice, this is rather involved:
▪ When is a new cluster ‘live’?
▪ Who votes on the cluster membership change?
▪ How do we deal with slow members?
▪ What happens when the leader changes?
▪ How do we find the cluster when it’s completely different?
Dynamic Ensembles
▪ Getting all this right is hard
▪ (good!)
▪A fundamental change in how ZooKeeper is designed - much of the code is predicated on a static view of the cluster membership
▪ Ideally, we want to prove that the resulting protocol is correct
▪The key observation is that membership changes must be voted upon by both the old and the new configuration
▪ So this is no magic bullet if the cluster is down
▪Need to keep track of old configurations so that each vote can be tallied with the right quorum
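The “voted upon by both the old and the new configuration” rule can be sketched as follows — an illustrative model, not the actual ZOOKEEPER-107 code. During a membership change, a proposal commits only once it has a majority in the outgoing ensemble *and* a majority in the incoming one, which is why old configurations must be kept around for tallying:

```python
def has_majority(acks: set, members: set) -> bool:
    """True if the acking servers form a majority of `members`."""
    return len(acks & members) > len(members) // 2

def reconfig_committed(acks: set, old_members: set, new_members: set) -> bool:
    """A membership change needs a quorum of BOTH configurations,
    so the same votes are tallied against each ensemble separately."""
    return has_majority(acks, old_members) and has_majority(acks, new_members)
```

For example, moving from {A, B, C} to {B, C, D, E}: acks from {B, C} satisfy the old quorum but not the new one, so the change cannot commit until a third member of the new ensemble (say D) also acknowledges. This also shows why the feature is no magic bullet if the cluster is down — the old configuration still has to reach quorum.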
Dynamic Ensembles
▪ Lots of discussion on the JIRA
▪ although no public activity for a couple of months
▪ I have code that pretty much works
▪But waiting until Observers gets committed before I move focus completely to this
▪Current situation not *too* bad; there are upgrade workarounds that are a bit scary theoretically but work OK in practice
ZooKeeper Packages in CDH
▪ We maintain Cloudera’s Distribution for Hadoop
▪ Packages for Mapred, HDFS, HBase, Pig and Hive
▪We see ZooKeeper as increasingly important to that stack, as well as having a wide variety of other applications
▪Therefore, we’ve packaged ZooKeeper 3.2.1 and are making it a first class part of CDH
▪We’ll track the Apache releases, and also backport important patches
▪Wrapped up in the service framework:
▪ /sbin/service zookeeper start
▪RPMs and tarballs are done, DEBs to follow imminently
▪Download RPMs at http://archive.cloudera.com/redhat/cdh/unstable/
Thanks! [email protected]