Messaging Architecture @FB (Fifth Elephant Conference)
Posted on 25-Dec-2014
Description: These slides are based on my participation in the FB Messaging design in 2010. Software evolves fast; this may not reflect the state of the world today.
- 1. Messaging Architecture @FB Joydeep Sen Sarma
- 2. Background
  - WAS: a key participant on the design team (10/2009-04/2010), and a vocal proponent of the final consensus architecture
  - WAS NOT: part of the actual implementation, or involved in the FB Chat backend
  - IS NOT: a Facebook employee (since 08/2011)
- 3. Problem
  - 1 billion users
  - Volume: 25 messages/day * 4KB each (excluding attachments) -> ~10 TB per day
  - Indexes/summaries: keyword/thread-id/label indexes; per-label and per-thread message counts
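The sizing above can be sanity-checked with trivial arithmetic (mine, not the deck's). At the stated per-user rates the raw number is closer to 100 TB/day, so the slide's 10 TB figure presumably already discounts for the fraction of users active on a given day:

```python
# Back-of-envelope volume check (my arithmetic, not from the deck):
# if every one of 1B users sent 25 messages of 4KB each per day,
# the raw volume would be ~100 TB/day.
users = 1_000_000_000
msgs_per_day = 25
msg_bytes = 4 * 1024

raw_bytes_per_day = users * msgs_per_day * msg_bytes
print(raw_bytes_per_day / 10**12, "TB/day at full activity")  # ~102.4
```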
- 4. Problem (cont.)
  - Must have a cross-continent copy: ideally concurrent updates across regions, at minimum disaster-recoverable
  - No single point of failure for the entire service (FB has downtime on a few MySQL databases per day; no one cares)
  - Cannot, cannot, CANNOT lose data
- 5. Solved Problems
  - Attachment store: Haystack (stores FB photos; optimized for immutable data)
  - Hiring: the best programmers available
  - Choose the best design, not the best implementation -- but get things done fast
- 6. Write Throughput
  - Disk: need a log-structured container
    - Can store small messages inline
    - Can store the keyword index as well
    - But what about read performance?
  - Flash/memory: expensive -- metadata only
- 7. LSM Trees
  - High write throughput
  - Recent data is clustered -- nice! Fits a mailbox access pattern
  - Inherently snapshotted, so backups/DR should be easy
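The LSM properties the slide relies on can be sketched in a few lines (a toy illustration, not HBase's actual implementation): writes land in an in-memory memtable, which is periodically flushed as an immutable sorted segment, and reads check the memtable and then segments newest-first, which is why recent data is cheap to retrieve:

```python
import bisect

class TinyLSM:
    """Toy LSM tree: writes go to an in-memory memtable; when it
    fills, it is flushed as an immutable sorted segment. Reads check
    the memtable first, then segments newest-first, so recent data is
    found quickly (the clustering property the slide relies on)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []  # immutable sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Immutable sorted run: easy to snapshot, back up, or ship off-site.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):  # newest segment first
            i = bisect.bisect_left(seg, (key,))
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None
```

A real LSM store adds write-ahead logging, compaction of old segments, and Bloom filters to skip segments that cannot contain a key (the deck mentions Bloom filters being added to HBase later).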
- 8. Reads?
  - Write-optimized => read penalty
  - Cache the working set in the app server
    - At most one app server per user; all mailbox updates go via the application server
    - Serve reads directly from cache
  - Cold start: LSM tree clustering should make retrieving recent messages/threads fast
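The "at most one app server per user" invariant needs a deterministic user-to-server mapping. The deck's closing slide says this was ultimately done via ZooKeeper; purely as an illustration of the idea (server names and the hashing scheme are my assumptions, not FB's), rendezvous hashing gives a stable assignment that moves few users when the server list changes:

```python
import hashlib

SERVERS = ["app-01", "app-02", "app-03"]  # hypothetical server names

def server_for(user_id: str) -> str:
    """Deterministically map each user to exactly one app server, so
    all of that user's mailbox updates and cached state live in one
    place. Rendezvous (highest-random-weight) hashing: score every
    server against the user and pick the max."""
    return max(
        SERVERS,
        key=lambda s: hashlib.sha256(f"{s}:{user_id}".encode()).hexdigest(),
    )
```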
- 9. SPOF? A single HBase/HDFS cluster? NO!
  - Lots of 100-node clusters
  - HDFS NameNode HA
- 10. Cassandra vs. HBase (abridged)
  - Tested both (c. 2010, FB internal): HBase held up, Cassandra didn't
  - Tried to understand the internals: HBase held up, Cassandra didn't
  - Really, really trusted HDFS: it had stored PBs of data for years with no loss
  - Missing features in HBase/HDFS could be added
- 11. Disaster Recovery (HBase)
  1. Ship the HLog to a remote data center in real time
  2. Every day, update the remote snapshot
  3. Reset the remote HLog
  - No need to synchronize #2 and #3 perfectly: HLog replay is idempotent
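Why idempotent replay makes the loose synchronization safe can be shown with a small sketch (my simplification of the mechanism, not HBase code): every edit carries a monotonically increasing sequence id, and the replica ignores edits at or below the sequence id its snapshot already covers, so replaying a log that overlaps the snapshot changes nothing:

```python
class RemoteReplica:
    """Sketch of idempotent log replay: edits carry increasing
    sequence ids, and any edit at or below the sequence id already
    reflected in the snapshot is a no-op. Steps #2 and #3 above can
    therefore overlap imperfectly without corrupting the replica."""

    def __init__(self):
        self.data = {}
        self.applied_seq = 0

    def load_snapshot(self, snapshot, snapshot_seq):
        self.data = dict(snapshot)
        self.applied_seq = snapshot_seq

    def replay(self, log):
        for seq, key, value in log:
            if seq <= self.applied_seq:   # already in snapshot: skip
                continue
            self.data[key] = value
            self.applied_seq = seq
```

Replaying the same log twice, or replaying it after loading a snapshot that already contains some of its edits, converges to the same state.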
- 12. Test! Try to avoid writing a cache in Java
- 13. What about Flash?
  - In HBase: store recent LSM tree segments in Flash; store the HBase block cache there
    - Inefficient in Cassandra! (3x the LSM trees/cache)
  - In the app server: page the user cache in/out of Flash
- 14. Lingering Doubts
  - Small components vs. big systems: small components are better. Is HDFS too big? Separate the DataNode, BlockManager, and NameNode -- HBase doesn't need the NameNode
  - Gave up on cross-DC concurrency: partition users if required; a global user->DC registry needs to deal with partitions and conflict resolution (TBD)
- 15. Cassandra vs. HBase
- 16. Cassandra: Flat Earth
  - The world is hierarchical: PCI bus, rack, data center, region, continent... and the odds of partitioning differ at each level
  - vs. Cassandra's symmetric hash ring spanning continents, where the odds of partitioning are treated as constant
- 17. Cassandra: No Centralization
  - The real world has central (but HA) tiers: DNS servers, core switches, the memcache tier, ...
  - Cassandra: all servers are independent; no authoritative commit log or snapshot
  - The "Do Repeat Your Reads" (DRYR) paradigm
- 18. Philosophies have Consequences
  - Consistent reads are expensive: N=3, R=2, W=2
  - Ugh: why are reads expensive in a write-optimized system?
  - Is consistency foolproof? Edge cases with failed writes; the Internet is still debating
  - If science has bugs, then imagine code!
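The N/R/W arithmetic behind the slide is the standard quorum condition: a read of R replicas is guaranteed to overlap the W replicas that acknowledged the latest write iff R + W > N. A one-line check makes the trade-off explicit:

```python
def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """Quorum overlap condition: with N replicas, W-ack writes and
    R-replica reads, every read intersects the latest write's
    replica set iff R + W > N."""
    return r + w > n

# N=3, R=2, W=2: consistent, but every read now touches two replicas --
# which is why reads get expensive in this write-optimized system.
assert reads_see_latest_write(3, 2, 2)
assert not reads_see_latest_write(3, 1, 2)  # R=1 reads can miss the write
```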
- 19. Distributed Storage vs. Database
  - How to recover a failed block or disk?
  - Distributed storage (HDFS): simple -- find the other replicas of each block
  - Distributed database (Cassandra): "a ton of my databases lived on that drive" -- hard: merge all the affected databases
- 20. Eventual Consistency
  - The read-modify-write pattern is problematic:
    1. Read value
    2. Apply business logic
    3. Write value
  - A stale read leads to junk
  - What about atomic increments?
- 21. Conflict Resolution
  - Easy to resolve conflicts in increments
  - Now imagine multi-row transactions: pointless to resolve conflicts at the row level
  - Solve conflicts at the highest possible layer -- the transaction monitor
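Why increments are the easy case can be shown with a grow-only counter in the CRDT style (my example; the deck does not name CRDTs): each replica counts its own increments, and merging takes the per-replica maximum, so concurrent updates on both sides of a partition combine without losing anything. Multi-row transactions have no such commutative merge, which is the slide's point:

```python
class GCounter:
    """Grow-only counter: each replica tracks its own increment total,
    and merge takes the per-replica max. Concurrent increments on two
    replicas reconcile automatically; the total is the sum."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())
```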
- 22. How did it work out?
  - A ton of missing HBase/HDFS features were added: Bloom filters, NameNode HA, remote HLog shipping, a modified block placement policy, sticky regions, an improved block cache
  - User -> AppServer mapping via ZooKeeper
  - The app server worked out