PagerDuty: Span the WAN? Yes You Can!
2015-10-01 MAKING PAGERDUTY MORE RELIABLE USING PXC
Span the WAN. Why?
PagerDuty: some history
• Monolithic Ruby on Rails + MySQL
• Hosted in AWS us-east-1
• AWS outages in 2010 and 2011
• …including correlated multi-AZ failures
• PagerDuty was heavily impacted
• Needed resiliency to this failure mode
Design goals
• Continuity during a DC drop (AZ or Region)
• No operator intervention
• Can’t lose data
• Can’t delay data (shelf life)
• Timely notifications - always
  • Measured in tens of seconds
Design decisions
• Masterless: peer-based & clustered
• Can’t tolerate staleness: synchronous WAN replication
• Manage state: consistent reads
• Opted to use Cassandra
• …despite many of Cassandra’s features not being relevant
How Cassandra is often used
• Massive throughput
• Lots of data
• Horizontally scalable
• Eventually consistent
• High write:read ratio
• High performance individual operations
Essential Cassandra features for PagerDuty
• Quorum operations
• Tuneable consistency
• Synchronous WAN replication
WAN-spanning system design
System architecture
[Diagram: Clustered Application; Distributed Coordination (ZooKeeper); Shared cross-DC datastore (Cassandra)]
Quorum consistency systems
• Each item replicated N times
• Writes: require W of N replicas
• Reads: require R of N replicas
• W + R <= N: read can miss a write
• W + R > N: read can’t miss a write
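The overlap rule above can be checked by brute force. The following is a minimal sketch, not part of the original talk: the function name `quorums_always_overlap` is made up for illustration, and replicas are modeled as plain integers. It enumerates every possible write set of size W and read set of size R and reports whether they always share at least one replica.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-replica write set shares a replica with every R-replica read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# W + R > N: a read always sees at least one replica that took the write.
print(quorums_always_overlap(5, 3, 3))
# W + R <= N: some read set can miss the write entirely.
print(quorums_always_overlap(5, 2, 3))
```

The first call prints `True` and the second prints `False`, matching the two cases in the bullets.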
[Diagram labels: WRITE, READ]
Cassandra setup
• Replication factor: N=5
• Three DCs
• DC-aware placement strategy
• W=3: all writes hit multiple DCs
• R=3: all reads hit multiple DCs
• 3 + 3 > 5: consistent reads
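To see why a quorum of 3 always crosses a DC boundary, it is enough that no DC holds more than two of the five replicas. The sketch below checks that by enumeration. The node-to-DC mapping in `PLACEMENT` is an assumption for illustration: the slides show five nodes in three DCs, and later slides imply Cass 1 and Cass 2 share DC-A, but they do not say how Cass 3, 4, and 5 are split between DC-B and DC-C.

```python
from itertools import combinations

# Assumed placement (hypothetical split of Cass 3-5 between DC-B and DC-C).
PLACEMENT = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-C",
    "Cass 5": "DC-C",
}


def min_dcs_touched(placement: dict, quorum: int) -> int:
    """Smallest number of distinct DCs covered by any group of `quorum` replicas."""
    return min(
        len({placement[node] for node in group})
        for group in combinations(placement, quorum)
    )


print(min_dcs_touched(PLACEMENT, 3))  # quorum of 3
print(min_dcs_touched(PLACEMENT, 2))  # quorum of 2, for contrast
```

Under this placement a quorum of 3 always spans at least 2 DCs, while a quorum of 2 could be satisfied inside a single DC.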
[Diagram: five Cassandra nodes (Cass 1 to Cass 5) placed across DC-A, DC-B, and DC-C]
Data layer summary
• Data safe against DC failure
• Consistent reads (of acknowledged writes)
• Expensive multi-DC writes & reads
• Managing state: No ACID transactions!
• Enforce “transactions” in the application layer
Application layer: “transactions”
• Sequence of logic and Cassandra operations
• Implement sequence as idempotent
• Failure is not an option
• Enforce transaction ordering
• Expect (some) (transient) inconsistencies
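One way to read "implement sequence as idempotent" is that every step can be replayed safely after a failure. The sketch below illustrates that idea only; it is not PagerDuty's implementation. `FlakyStore` is a hypothetical in-memory stand-in for a datastore whose acknowledgements can time out even though the write landed, and `run_idempotent` is a made-up retry helper.

```python
class FlakyStore:
    """Hypothetical stand-in for a datastore whose acks can be lost."""

    def __init__(self, acks_to_drop: int):
        self.rows = {}
        self._acks_to_drop = acks_to_drop

    def upsert(self, key, value):
        self.rows[key] = value  # the write itself lands...
        if self._acks_to_drop > 0:  # ...but the acknowledgement may be lost
            self._acks_to_drop -= 1
            raise TimeoutError("ack lost")


def run_idempotent(step, max_attempts: int = 5) -> int:
    """Retry `step` until it succeeds; safe only because `step` is idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            step()
            return attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise


store = FlakyStore(acks_to_drop=2)
# Keyed overwrite with a deterministic key: replays converge to the same row.
attempts = run_idempotent(lambda: store.upsert("event-123", {"status": "notified"}))
print(attempts, store.rows)
```

The step runs three times here, yet the store ends up with exactly one row for `event-123`. A non-idempotent step (for example, appending to a list) would have left duplicates after the same retries.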
Tales from production
What about the network?
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
• Network diversity limits DC choices
• Result? Uneven network latencies
…and how you should think of the network
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
Reads and writes
[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
Read and write performance
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
• R = W = 3 means every operation hits replicas in at least two DCs (by design)
• Reads coordinated from DC-B or DC-C nodes will take >3 ms
• Reads coordinated from DC-A nodes will take >24 ms
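The latency floors above follow from the topology: a quorum operation completes only when the third-nearest replica answers. The sketch below works that out under stated assumptions. The link latencies (DC-A to DC-B 24 ms, DC-A to DC-C 24 ms, DC-B to DC-C 3 ms) are read off the slides; the split of Cass 3, 4, and 5 between DC-B and DC-C is a guess; intra-DC latency and processing time are treated as zero; and the coordinator is assumed to hold a replica itself.

```python
# Assumed placement (the Cass 3-5 split between DC-B and DC-C is hypothetical).
PLACEMENT = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-C",
    "Cass 5": "DC-C",
}
# Inter-DC latencies in milliseconds, as shown on the slides.
LINK_MS = {
    frozenset(("DC-A", "DC-B")): 24,
    frozenset(("DC-A", "DC-C")): 24,
    frozenset(("DC-B", "DC-C")): 3,
}


def latency_ms(a: str, b: str) -> int:
    """Network latency between two nodes; same-DC hops are treated as 0 ms."""
    dc_a, dc_b = PLACEMENT[a], PLACEMENT[b]
    return 0 if dc_a == dc_b else LINK_MS[frozenset((dc_a, dc_b))]


def quorum_latency_floor(coordinator: str, quorum: int = 3) -> int:
    """Latency to the quorum-th nearest replica: a lower bound on the operation."""
    latencies = sorted(latency_ms(coordinator, node) for node in PLACEMENT)
    return latencies[quorum - 1]


for node in PLACEMENT:
    print(node, quorum_latency_floor(node), "ms")
```

Under these assumptions the floor is 24 ms for the DC-A coordinators (Cass 1, Cass 2) and 3 ms for the others, which lines up with the ">24 ms" and ">3 ms" figures on the slide.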
Another latency effect? Per-node read volume
Per-node read volume: why so skewed?
Writes: Which replicas are involved? All 5
[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
Writes: per-node volume
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
• N=5, so there is a write op on each replica
• All replicas experience the same per-node write load
Reads: Which replicas are involved? Only 3!
[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]
Reads: per-node volume
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
• Coordinator chooses R fastest replicas (R=3)
• Network latency steers to the nearest replicas
Reads: per-node volume (Cass 3 as coord)
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
• Chooses Cass 3, 4, and 5
• Same when Cass 4 or Cass 5 coordinates
Reads: per-node volume (Cass 1 as coord)
[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC link latencies of 24 ms, 24 ms, and 3 ms]
• Hits Cass 1, Cass 2, and (randomly) one of Cass 3, 4, 5
• Same when Cass 2 coordinates
Reads: per-node volume, uniform coord usage
Coordinator node   Cass 1   Cass 2   Cass 3   Cass 4   Cass 5
Cass 1             1        1        0.33     0.33     0.33
Cass 2             1        1        0.33     0.33     0.33
Cass 3             0        0        1        1        1
Cass 4             0        0        1        1        1
Cass 5             0        0        1        1        1
Total requests     2        2        3.66     3.66     3.66
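The table can be reproduced with a small model: each coordinator sends a read to its 3 nearest replicas and breaks latency ties uniformly at random. This is a sketch under assumptions, not the talk's own analysis. The link latencies come from the slides; the split of Cass 3, 4, and 5 between DC-B and DC-C is hypothetical; same-DC latency is treated as zero; and each coordinator is assumed to receive the same share of requests.

```python
from fractions import Fraction

# Assumed placement (the Cass 3-5 split between DC-B and DC-C is hypothetical).
PLACEMENT = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-C",
    "Cass 5": "DC-C",
}
LINK_MS = {
    frozenset(("DC-A", "DC-B")): 24,
    frozenset(("DC-A", "DC-C")): 24,
    frozenset(("DC-B", "DC-C")): 3,
}


def latency_ms(a: str, b: str) -> int:
    dc_a, dc_b = PLACEMENT[a], PLACEMENT[b]
    return 0 if dc_a == dc_b else LINK_MS[frozenset((dc_a, dc_b))]


def expected_reads(coordinator: str, r: int = 3) -> dict:
    """Expected share of one read served by each node when the r nearest are chosen."""
    times = {node: latency_ms(coordinator, node) for node in PLACEMENT}
    cutoff = sorted(times.values())[r - 1]
    closer = [n for n, t in times.items() if t < cutoff]  # always chosen
    tied = [n for n, t in times.items() if t == cutoff]   # share the leftover slots
    tied_share = Fraction(r - len(closer), len(tied))
    return {
        n: Fraction(1) if n in closer else (tied_share if n in tied else Fraction(0))
        for n in PLACEMENT
    }


# One request per coordinator, summed per serving node (the table's last row).
totals = {n: sum(expected_reads(c)[n] for c in PLACEMENT) for n in PLACEMENT}
for node, total in totals.items():
    print(node, round(float(total), 2))
```

The model gives 2 for Cass 1 and Cass 2 and 11/3 (about 3.67) for Cass 3, 4, and 5. The slide's 3.66 is the sum of the rounded 0.33 entries.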
Per-node read volume: reality vs. theory
What about scaling out?
• Asymmetrical per-node read volumes
• So each DC has different CPU and disk IO needs
• Different node size?
• Different per-DC node count?
• What about DC degradation or loss?
• End up with same-sized nodes
When a data center vanishes…
Major outage: DC-C (May, 2015)
• All hosts unreachable for ~5 hours
Seamless data center migration (August 2015)
• Moved DC-C fleet from one provider to another
• Remove old node; add new node
• No application-level migration needed
• Zero customer impact
DC-A to DC-B fiber cut (September, 2015)
• DC-A to DC-B network latency rose from 24 ms to 200 ms and stayed there for 48 hours
• All Cassandra operations now take 24 ms
FIBER CUT EAST-1
And back to where we started
What have we learned?
• WAN-spanning synchronous replication is a thing
• Data layer consistent reads are practical
• Application layer consequences for managing state
• Network topology affects:
  • Request performance
  • Per-node load
• Trade off latency for reliability
Span the WAN?
Yes you can!