Messaging Architecture @FB
Joydeep Sen Sarma
Background
• WAS:
– Key participant on the design team (10/2009-04/2010)
– Vocal proponent of the final consensus architecture
• WAS NOT:
– Part of the actual implementation
– Involved in the FB chat backend
• IS NOT: a Facebook employee (since 08/2011)
Problem
• 1 billion users
• Volume:
– 25 messages/day * 4 KB (excluding attachments)
– 10 TB per day
• Indexes/Summaries:
– Keyword/ThreadId/Label index
– Label/Thread message counts
Problem – Cont.
• Must have: cross-continent copy
– Ideally concurrent updates across regions
– At least disaster recoverable
• No single point of failure for the entire service
– FB has downtime on a few MySQL databases per day
– No one cares
• Cannot Cannot Cannot lose Data
Solved Problem
• Attachment store: Haystack
– Already stores FB photos
– Optimized for immutable data
• Hiring the best programmers available
– Choose the best design, not the best current implementation
– But get things done fast
Write Throughput
• Disk:
– Need a log-structured container (sketch after this slide)
– Can store small messages inline
– Can store the keyword index as well
– What about read performance?
• Flash/Memory:
– Expensive
– Only metadata
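As a rough illustration of the write path, here is a minimal sketch of a log-structured message container in Java. The record format and class names are made up for this example, not the real system: the point is that every write is a sequential append, and small messages ride inline in the log.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Illustrative append-only message log: every write is a sequential
    // append, which is what gives the design its write throughput.
    public class MessageLog implements AutoCloseable {
        private final DataOutputStream out;

        public MessageLog(String path) throws IOException {
            // 'true' opens the file in append mode
            this.out = new DataOutputStream(new FileOutputStream(path, true));
        }

        // Small messages are stored inline as length-prefixed records.
        public void append(long userId, String body) throws IOException {
            byte[] payload = body.getBytes(StandardCharsets.UTF_8);
            out.writeLong(userId);
            out.writeInt(payload.length);
            out.write(payload);
        }

        // Group-commit point: one flush can cover many appended messages.
        public void sync() throws IOException {
            out.flush();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }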
LSM Trees
• High write throughput
• Recent data clustered
– Nice! Fits a mailbox access pattern
• Inherently snapshotted
– Backups/DR should be easy
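A toy sketch of the LSM idea above (all names hypothetical): writes land in an in-memory sorted map, which is flushed as an immutable sorted segment when full. Newest-first segment order makes mailbox-style reads of recent messages cheap, and the immutable segments are natural snapshot units for backup/DR.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.TreeMap;

    // Toy LSM tree keyed by timestamp: recent data stays clustered.
    public class MailboxLsm {
        private static final int MEMTABLE_LIMIT = 4; // tiny, for illustration
        private TreeMap<Long, String> memtable = new TreeMap<>();
        // Newest segment first; segments are immutable once flushed.
        private final Deque<TreeMap<Long, String>> segments = new ArrayDeque<>();

        public void put(long timestamp, String message) {
            memtable.put(timestamp, message);
            if (memtable.size() >= MEMTABLE_LIMIT) {
                segments.addFirst(memtable); // flush: becomes an immutable segment
                memtable = new TreeMap<>();
            }
        }

        // Mailbox pattern: fetch the n most recent messages. Scan the
        // memtable, then the newest segments first, and stop early.
        public List<String> recent(int n) {
            List<String> out = new ArrayList<>();
            for (String m : memtable.descendingMap().values()) {
                if (out.size() == n) return out;
                out.add(m);
            }
            for (TreeMap<Long, String> seg : segments) {
                for (String m : seg.descendingMap().values()) {
                    if (out.size() == n) return out;
                    out.add(m);
                }
            }
            return out;
        }
    }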
Reads?
• Write-optimized => read penalty
• Cache the working set in the App Server
– At most one App Server per user
– All mailbox updates go via the App Server
– Serve reads directly from the cache
• Cold start:
– LSM tree clustering should make retrieving recent messages/threads fast (see the cache sketch below)
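A sketch of the App Server cache, assuming a simple LRU over user mailboxes; loadRecentFromHBase is a hypothetical stand-in for the recent-data scan. Because all of a user's updates flow through one server, the cache stays authoritative and warm reads never touch HBase.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    // Sketch: one App Server owns each user, so its cache can serve reads
    // directly; a cold start falls back to a scan of recent data.
    public class MailboxCache {
        private final Function<Long, List<String>> loadRecentFromHBase; // stand-in loader
        private final Map<Long, List<String>> lru;

        public MailboxCache(int maxUsers, Function<Long, List<String>> loader) {
            this.loadRecentFromHBase = loader;
            this.lru = new LinkedHashMap<Long, List<String>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, List<String>> e) {
                    return size() > maxUsers; // LRU eviction of cold users
                }
            };
        }

        public synchronized List<String> read(long userId) {
            // Cold start: LSM clustering makes this recent-data scan cheap.
            return lru.computeIfAbsent(userId, loadRecentFromHBase);
        }

        public synchronized void applyUpdate(long userId, String message) {
            // All updates route through this server, keeping the cache fresh.
            read(userId).add(0, message); // loader must return a mutable list
        }
    }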
SPOF?
• A single HBase/HDFS cluster?
• NO!
– Lots of 100-node clusters
– HDFS NameNode HA
Cassandra vs. HBase (abridged)
• Tested them out (c. 2010)
– HBase held up, (FB-internal) Cassandra didn’t
• Tried to understand the internals
– HBase held up, Cassandra didn’t
• Really, really trusted HDFS
– Stored PBs of data for years with no loss
• Missing features in HBase/HDFS can be added
Disaster Recovery (HBase)
1. Ship the HLog to the remote data center in real time
2. Every day, update the remote snapshot
3. Reset the remote HLog
• No need to synchronize #2 and #3 perfectly
– HLog replay is idempotent (sketch below)
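Why replay is idempotent, sketched under the assumption that each log entry targets a fully versioned cell (row, column, timestamp), as HBase cells do; the entry and store types here are simplified stand-ins. Applying the same entry twice rewrites the same cell version with the same value, so imperfect alignment of #2 and #3 only causes harmless re-application.

    import java.util.HashMap;
    import java.util.Map;

    // Simplified HLog replay: each entry targets a fully versioned cell,
    // so replaying an entry any number of times yields the same state.
    public class LogReplay {
        // Stand-in for an HLog entry (real entries carry more context).
        record Entry(String row, String column, long timestamp, String value) {}

        // cell key -> value; the timestamp is part of the key (a "version")
        private final Map<String, String> store = new HashMap<>();

        public void apply(Entry e) {
            store.put(e.row() + "/" + e.column() + "@" + e.timestamp(), e.value());
        }

        public static void main(String[] args) {
            LogReplay r = new LogReplay();
            Entry e = new Entry("user42", "msg:1001", 1700000000L, "hello");
            r.apply(e);
            r.apply(e); // replaying the same HLog entry is harmless
            System.out.println(r.store.size()); // still 1 cell version
        }
    }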
Test!
• Try to avoid writing a cache in Java (large in-heap caches stress the GC)
What about Flash?
• In HBase:
– Store recent LSM tree segments in flash
– Store the HBase block cache in flash
– Inefficient in Cassandra! (3 replicas => 3x LSM trees/cache)
• In the App Server:
– Page the user cache in/out of flash (sketch below)
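A sketch of paging the user cache through flash, assuming one file per evicted user as the flash tier (paths and serialization are illustrative only): evicted users spill to flash and page back in on their next access, which is much cheaper than rebuilding from HBase.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    // Two-tier user cache: hot users in RAM, evicted users paged to flash.
    public class FlashPagedCache {
        private final Map<Long, String> ram = new HashMap<>();
        private final Path flashDir;

        public FlashPagedCache(Path flashDir) { this.flashDir = flashDir; }

        public void evictToFlash(long userId) throws IOException {
            String state = ram.remove(userId);
            if (state != null) {
                Files.writeString(flashDir.resolve("user-" + userId),
                                  state, StandardCharsets.UTF_8);
            }
        }

        public String get(long userId) throws IOException {
            String state = ram.get(userId);
            if (state == null) {
                Path page = flashDir.resolve("user-" + userId);
                if (Files.exists(page)) {
                    state = Files.readString(page, StandardCharsets.UTF_8);
                    ram.put(userId, state); // page back in
                }
            }
            return state; // null => true cold start, rebuild from HBase
        }
    }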
Lingering Doubts
• Small components vs. big systems
– Small components are better
– Is HDFS too big?
– Could separate DataNode, BlockManager, NameNode
• HBase doesn’t need the NameNode
• Gave up on cross-DC concurrency
– Partition users if required
– A global user->DC registry needs to deal with partitions and conflict resolution
– TBD
Cassandra vs. HBase
Cassandra: Flat Earth
• The world is hierarchical
– PCI bus, rack, data center, region, continent, …
– The odds of partitioning differ at each level
vs.
• Symmetric hash ring spanning continents
– Odds of partitioning treated as constant
Cassandra – No Centralization
• The world has central (but HA) tiers:
– DNS servers, core switches, the memcache tier, …
• Cassandra: all servers independent
– No authoritative commit log or snapshot
– Do Repeat Your Reads (DRYR) paradigm
Philosophies have Consequences
• Consistent reads are expensive
– N=3, R=2, W=2 (R + W > N, so read and write quorums overlap)
– Ugh: why are reads expensive in a write-optimized system? (sketch below)
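The quorum arithmetic, sketched with a trivial in-process replica array (no failures or repair modeled): with N=3, W=2, R=2 every read quorum overlaps every write quorum, but a consistent read must wait on two replicas and reconcile by timestamp, which is exactly the read expense being complained about.

    // Quorum sketch for N=3 replicas with W=2, R=2 (R + W > N).
    public class Quorum {
        record Versioned(long timestamp, String value) {}

        static final int N = 3, W = 2, R = 2;
        private final Versioned[] replicas = new Versioned[N];

        // A write succeeds once W replicas have acknowledged it.
        public void write(long ts, String value) {
            for (int i = 0; i < W; i++) {        // remaining replica updated lazily
                replicas[i] = new Versioned(ts, value);
            }
        }

        // A consistent read waits on R replicas and keeps the newest value:
        // one of them is guaranteed to overlap the last write quorum.
        public String read() {
            Versioned best = null;
            for (int i = 0; i < R; i++) {
                Versioned v = replicas[N - 1 - i]; // any R of the N replicas
                if (v != null && (best == null || v.timestamp() > best.timestamp())) {
                    best = v;
                }
            }
            return best == null ? null : best.value();
        }
    }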
• Is consistency foolproof?
– Edge cases with failed writes
– The Internet is still debating
– If the science has bugs, imagine the code!
Distributed Storage vs. Database
• How do you recover a failed block or disk?
• Distributed storage (HDFS):
– Simple: find other replicas of that block (sketch below)
• Distributed database (Cassandra):
– A ton of my databases lived on that drive
– Hard: let’s merge all the affected databases
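A sketch of storage-level recovery, assuming a simple map from block ID to the nodes holding replicas (all types illustrative): when a node or disk dies, each lost block is just re-copied from any surviving replica, with no database-level merge anywhere.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // HDFS-style recovery sketch: blocks are the unit of replication,
    // so losing a disk reduces to re-replicating each affected block.
    public class BlockRecovery {
        // blockId -> nodes currently holding a replica
        private final Map<Long, Set<String>> replicaMap = new HashMap<>();

        public void addReplica(long blockId, String node) {
            replicaMap.computeIfAbsent(blockId, id -> new HashSet<>()).add(node);
        }

        // A node died: for every block it held, copy from a survivor.
        public void recoverNode(String deadNode, String targetNode) {
            for (Map.Entry<Long, Set<String>> e : replicaMap.entrySet()) {
                Set<String> holders = e.getValue();
                if (holders.remove(deadNode) && !holders.isEmpty()) {
                    String source = holders.iterator().next();
                    // copyBlock(source, targetNode, e.getKey()) would go here
                    holders.add(targetNode);
                }
            }
        }
    }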
Eventual Consistency
• The read-modify-write pattern is problematic:
1. Read value
2. Apply business logic
3. Write value
– A stale read leads to junk (see the sketch below)
• What about atomic increments?
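The failure mode in miniature (a plain variable stands in for an eventually consistent store that can serve stale reads): two clients read, apply logic, and write back, and one update silently vanishes. Server-side atomic increments fix this for counters, but not for arbitrary business logic.

    // Lost-update sketch: read-modify-write over a store that can serve
    // stale reads. Both clients think they succeeded; one update is gone.
    public class LostUpdate {
        public static void main(String[] args) {
            int unread = 10;            // value in the store

            int readA = unread;         // client A reads 10
            int readB = unread;         // client B reads 10 (stale once A writes)

            unread = readA + 1;         // A: business logic says +1 -> writes 11
            unread = readB + 1;         // B: also writes 11, clobbering A

            System.out.println(unread); // 11, but two messages arrived: junk
        }
    }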
Conflict Resolution
• Easy to resolve conflicts in increments (sketch after this slide)
• Imagine multi-row transactions
– Pointless to resolve conflicts at the row level
• Solve conflicts at the highest possible layer
– e.g., a transaction monitor
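Why increments are the easy case, sketched with a standard G-Counter-style merge (a textbook technique, not necessarily what was built here): each replica increments only its own slot, and conflicting copies merge by per-slot maximum, so no increment is ever lost. Multi-row transactions admit no such cheap merge, hence resolving at a higher layer.

    // G-Counter sketch: per-replica increment slots merge without conflict.
    public class GCounter {
        private final long[] slots;   // one slot per replica
        private final int myReplica;

        public GCounter(int replicas, int myReplica) {
            this.slots = new long[replicas];
            this.myReplica = myReplica;
        }

        public void increment() {
            slots[myReplica]++;       // only ever touch our own slot
        }

        public long value() {
            long sum = 0;
            for (long s : slots) sum += s;
            return sum;
        }

        // Conflict "resolution" is a per-slot max: commutative, idempotent.
        public void merge(GCounter other) {
            for (int i = 0; i < slots.length; i++) {
                slots[i] = Math.max(slots[i], other.slots[i]);
            }
        }
    }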
How did it work out?
• A ton of missing HBase/HDFS features were added:
– Bloom filters, NameNode HA
– Remote HLog shipping
– Modified block placement policy
– Sticky regions
– Improved block cache
– …
• User -> App Server mapping via ZooKeeper (sketch below)
• The App Server approach worked out
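A sketch of the user -> App Server lookup via ZooKeeper, assuming one ephemeral znode per user shard (the path layout and shard scheme are assumptions, not the documented design): an App Server claims a shard with an ephemeral node, so ownership is released automatically if it dies, and any frontend resolves the current owner with a read.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;
    import java.nio.charset.StandardCharsets;

    // Hypothetical routing sketch: map a user shard to an App Server
    // with an ephemeral znode that vanishes if the server dies.
    public class UserRouter {
        private final ZooKeeper zk;

        public UserRouter(ZooKeeper zk) { this.zk = zk; }

        // Claim ownership of a user shard (path and scheme are assumptions).
        public void claimShard(String shardId, String serverAddr) throws Exception {
            zk.create("/app-servers/" + shardId,
                      serverAddr.getBytes(StandardCharsets.UTF_8),
                      Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL); // released automatically on server death
        }

        // Look up which App Server currently owns a user's shard.
        public String lookup(String shardId) throws Exception {
            byte[] data = zk.getData("/app-servers/" + shardId, false, null);
            return new String(data, StandardCharsets.UTF_8);
        }
    }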