Messaging Architecture @FB (Fifth Elephant Conference)

Post on 25-Dec-2014





These are based on my participation in FB Messaging design in 2010. Software evolves fast - this may not reflect the state of the world today.


  • 1. Messaging Architecture @FB (Joydeep Sen Sarma)
  • 2. Background. WAS: key participant in the design team (10/2009-04/2010); vocal proponent of the final consensus architecture. WAS NOT: part of the actual implementation; involved in the FB chat backend. IS NOT: a Facebook employee since 08/2011.
  • 3. Problem. 1 billion users. Volume: 25 messages/day * 4 KB (excluding attachments); 10 TB per day. Indexes/summaries: keyword/thread ID/label index; label/thread message counts.
  • 4. Problem (cont.). Must have: cross-continent copy; ideally concurrent updates across regions; at least disaster recoverable. No single point of failure for the entire service: FB has downtime on a few MySQL databases per day and no one cares. Cannot, cannot, cannot lose data.
  • 5. Solved Problem. Attachment store: Haystack; stores FB photos; optimized for immutable data. Hiring the best programmers available: choose the best design, not implementation; but get things done fast.
  • 6. Write Throughput. Disk: need a log-structured container; can store small messages inline; can store the keyword index as well; what about read performance? Flash/memory: expensive; only metadata.
  • 7. LSM Trees. High write throughput. Recent data is clustered: nice, fits a mailbox access pattern. Inherently snapshotted: backups/DR should be easy.
  • 8. Reads? Write-optimized => read penalty. Cache the working set in the app server: at most one app server per user; all mailbox updates go via the application server; serve directly from cache. Cold start: LSM tree clustering should make retrieving recent messages/threads fast.
  • 9. SPOF? Single HBase/HDFS cluster? NO! Lots of 100-node clusters. HDFS NameNode HA.
  • 10. Cassandra vs. HBase (abridged). Tested it out (c. 2010): HBase held up, (FB-internal) Cassandra didn't. Tried to understand the internals: HBase held up, Cassandra didn't. Really, really trusted HDFS: stored PBs of data for years with no loss. Missing features in HBase/HDFS can be added.
  • 11. Disaster Recovery (HBase). 1. Ship the HLog to the remote data center in real time. 2. Update the remote snapshot every day. 3. Reset the remote HLog. No need to synchronize #2 and #3 perfectly: HLog replay is idempotent.
  • 12. Test! Try to avoid writing a cache in Java
  • 13. What about Flash? In HBase: store recent LSM tree segments in flash; store the HBase block cache; inefficient in Cassandra! (3x LSM trees/cache). In the app server: page the user cache in/out from flash.
  • 14. Lingering Doubts. Small components vs. big systems: small components are better. Is HDFS too big? Separate DataNode, BlockManager, NameNode; HBase doesn't need the NameNode. Gave up on cross-DC concurrency: partition users if required; the global user->DC registry needs to deal with partitions and conflict resolution; TBD.
  • 15. Cassandra vs. HBase
  • 16. Cassandra: Flat Earth. The world is hierarchical: PCI bus, rack, data center, region, continent ...; odds of partitioning differ. vs. Symmetric hash ring spanning continents: odds of partitioning considered constant.
  • 17. Cassandra: No Centralization. The world has central (but HA) tiers: DNS servers, core switches, Memcache tier, ... Cassandra: all servers independent; no authoritative commit log or snapshot; Do Repeat Your Reads (DRYR) paradigm.
  • 18. Philosophies have Consequences. Consistent reads are expensive: N=3, R=2, W=2. Ugh: why are reads expensive in a write-optimized system? Is consistency foolproof? Edge cases with failed writes; Internet still debating. If science has bugs, then imagine code!
  • 19. Distributed Storage vs. Database. How to recover a failed block or disk? Distributed storage (HDFS): simple; find other replicas for that block. Distributed database (Cassandra): a ton of my databases lived on that drive. Hard: let's merge all the affected databases.
  • 20. Eventual Consistency. Read-modify-write pattern is problematic: 1. Read value. 2. Apply business logic. 3. Write value. A stale read leads to junk. What about atomic increments?
  • 21. Conflict Resolution. Easy to resolve conflicts in increments. Imagine multi-row transactions: pointless resolving conflicts at the row level. Solve conflicts at the highest possible layer: transaction monitor.
  • 22. How did it work out? A ton of missing HBase/HDFS features were added: Bloom filters, NameNode HA, remote HLog shipping, modified block placement policy, sticky regions, improved block cache. User -> app server mapping via ZooKeeper. App server worked out.
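The volume estimate on slide 3 can be worked through as a quick calculation. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes), and the 100-million-sender figure in the second call is a hypothetical input chosen for illustration, not a number from the slides. Multiplying the stated inputs literally gives roughly 100 TB per day; the 10 TB per day on the slide would correspond to a smaller sending population or a lower average rate, and the slides do not say which.

```python
# Back-of-the-envelope message volume, using the inputs quoted on slide 3.
# Assumption: decimal units (1 KB = 1000 bytes, 1 TB = 10**12 bytes).

def daily_volume_tb(users: int, messages_per_user: int, message_bytes: int) -> float:
    """Return raw message bytes written per day, expressed in terabytes."""
    return users * messages_per_user * message_bytes / 10**12

# Slide inputs taken literally: 1 billion users, 25 messages/day, 4 KB each.
literal_tb = daily_volume_tb(1_000_000_000, 25, 4_000)

# Hypothetical: only 100 million users send at that rate on a given day.
reduced_tb = daily_volume_tb(100_000_000, 25, 4_000)

print(literal_tb, reduced_tb)
```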
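The LSM tree properties listed on slide 7 (cheap writes, recent data clustered together, immutable segments that make snapshots easy) can be illustrated with a toy in-memory model. This is a minimal sketch and not HBase's implementation: the class name `TinyLSM`, the `memtable_limit` parameter, and the key format are all made up for illustration, and there is no compaction, deletion, or on-disk format.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: a mutable memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # most recent writes, unsorted
        self.segments = []   # oldest first; each is a tuple of sorted (key, value) pairs

    def put(self, key, value):
        # Writes only touch the memtable, which is why they are cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a new immutable, sorted segment.
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Newest data first: memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def snapshot(self):
        # Segments never change after a flush, so a snapshot is just a
        # reference to the current segment list.
        return tuple(self.segments)

db = TinyLSM(memtable_limit=2)
db.put("user1:msg1", "hello")
db.put("user1:msg2", "world")    # triggers a flush
db.put("user1:msg1", "hello v2") # newer value shadows the flushed one
```

The read path shows the penalty slide 8 goes on to discuss: a lookup for an old key may have to probe every segment, while recently written keys are found in the memtable or the newest segment.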
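Slide 11 claims that the daily snapshot and the log reset need not line up exactly because log replay is idempotent. The sketch below models why that can hold, under a stated assumption: every log entry carries a sequence number and an entry is only applied if it is at least as new as what the store already has. The entry layout `(seq, key, value)` and the function name `replay` are hypothetical and do not reflect HBase's actual HLog format.

```python
def replay(snapshot, log):
    """Apply log entries on top of a snapshot; safe to repeat or overlap.

    snapshot: dict mapping key -> (seq, value)
    log: iterable of (seq, key, value) entries in any order
    """
    state = dict(snapshot)
    for seq, key, value in log:
        current = state.get(key)
        # Skip entries older than what the snapshot already contains.
        if current is None or seq >= current[0]:
            state[key] = (seq, value)
    return state

log = [(1, "thread:7", "draft"), (2, "thread:9", "hi"), (3, "thread:7", "sent")]

# Two remote snapshots taken at different points in the log.
early_snapshot = replay({}, log[:1])
late_snapshot = replay({}, log[:2])

# Replaying the full log over either snapshot, or replaying it twice,
# converges to the same state.
from_early = replay(early_snapshot, log)
from_late = replay(late_snapshot, log)
replayed_twice = replay(from_early, log)
```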
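The N=3, R=2, W=2 setting on slide 18 is the usual quorum condition R + W > N: any set of R replicas read must share at least one replica with any set of W replicas written. The brute-force check below is a sketch for illustration (the function name is made up); it also shows where the read cost comes from, since a consistent read has to touch R replicas rather than one.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r intersects every write quorum of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

# Slide setting: every read sees at least one replica holding the latest write.
consistent = quorums_always_overlap(3, 2, 2)

# Cheaper reads (R=1) with W=2 can miss the latest write entirely.
cheap_reads = quorums_always_overlap(3, 1, 2)
```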
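Slides 20 and 21 contrast read-modify-write under stale reads with increments, which are easy to reconcile. The sketch below makes the difference concrete. It assumes a last-write-wins rule keyed on a timestamp for the read-modify-write case; that rule, the function names, and the numbers are illustrative choices, not something stated in the slides.

```python
def read_modify_write(stale_value: int, delta: int) -> int:
    """A client computes a full new value from whatever it last read."""
    return stale_value + delta

def last_write_wins(writes):
    """Resolve conflicting full-value writes by keeping the latest timestamp."""
    return max(writes)[1]   # writes are (timestamp, value) pairs

def merge_increments(base: int, deltas) -> int:
    """Increments commute, so conflicting updates can simply be summed."""
    return base + sum(deltas)

unread_count = 10

# Two clients read the same (soon stale) value and write back full values.
write_a = (1, read_modify_write(unread_count, 5))
write_b = (2, read_modify_write(unread_count, 3))
lww_result = last_write_wins([write_a, write_b])    # client A's +5 is lost

# The same two updates expressed as increments merge without loss.
merged_result = merge_increments(unread_count, [5, 3])
```

This only covers a single counter. As slide 21 notes, once an update spans several rows, resolving each row on its own does not restore a consistent result, which is the argument for handling conflicts at a higher layer.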
