PagerDuty: Span the WAN? Yes you can!

2015-10-01 Span the WAN? Yes you can! [email protected] #CassandraSummit

TRANSCRIPT

Page 1: PagerDuty: Span the WAN? Yes you can!

2015-10-01

Span the WAN? Yes you can!

[email protected]

#CassandraSummit

Page 2: PagerDuty: Span the WAN? Yes you can!

MAKING PAGERDUTY MORE RELIABLE USING PXC

Span the WAN. Why?

Page 3: PagerDuty: Span the WAN? Yes you can!


Page 4: PagerDuty: Span the WAN? Yes you can!

SPAN THE WAN? YES YOU CAN!

Page 5: PagerDuty: Span the WAN? Yes you can!

PagerDuty: some history

• Monolithic Ruby on Rails + MySQL
• Hosted in AWS us-east-1
• AWS outages in 2010 and 2011
• …including correlated multi-AZ failures
• PagerDuty was heavily impacted
• Needed resiliency to this failure mode

Page 6: PagerDuty: Span the WAN? Yes you can!

Design goals

• Continuity during a DC drop (AZ or Region)
• No operator intervention
• Can’t lose data
• Can’t delay data (shelf life)
• Timely notifications - always
  • Measured in tens of seconds

Page 7: PagerDuty: Span the WAN? Yes you can!

Design decisions

• Masterless: peer-based & clustered
• Can’t tolerate staleness: synchronous WAN replication
• Manage state: consistent reads
• Opted to use Cassandra
• …despite many of Cassandra’s features not being relevant

Page 8: PagerDuty: Span the WAN? Yes you can!

How Cassandra is often used

• Massive throughput
• Lots of data
• Horizontally scalable
• Eventually consistent
• High write:read ratio
• High performance individual operations

Page 9: PagerDuty: Span the WAN? Yes you can!

Essential Cassandra features for PagerDuty

• Quorum operations
• Tuneable consistency
• Synchronous WAN replication

Page 10: PagerDuty: Span the WAN? Yes you can!


WAN-spanning system design

Page 11: PagerDuty: Span the WAN? Yes you can!

System architecture

• Shared cross-DC datastore (Cassandra)
• Distributed Coordination (ZooKeeper)
• Clustered Application

Page 12: PagerDuty: Span the WAN? Yes you can!

Quorum consistency systems

• Each item replicated N times
• Writes: require W of N replicas
• Reads: require R of N replicas
• W + R <= N: read can miss a write
• W + R > N: read can’t miss a write

[Diagram: WRITE and READ replica sets]
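The overlap rule on this slide can be checked by brute force. The following is a minimal sketch, not part of the original talk: the function name `read_can_miss_write` is made up for illustration, and replicas are modeled as plain integers rather than real Cassandra nodes.

```python
from itertools import combinations


def read_can_miss_write(n: int, w: int, r: int) -> bool:
    """Return True if some set of r read replicas shares no member
    with some set of w write replicas, out of n replicas in total."""
    replicas = range(n)
    for write_set in combinations(replicas, w):
        for read_set in combinations(replicas, r):
            if not set(write_set) & set(read_set):
                return True  # this read would not see this write
    return False


if __name__ == "__main__":
    # (N, W, R) combinations: the first matches the talk's N=5, W=3, R=3.
    for n, w, r in [(5, 3, 3), (5, 2, 3), (3, 2, 2), (3, 1, 1)]:
        verdict = "can miss" if read_can_miss_write(n, w, r) else "cannot miss"
        print(f"N={n} W={w} R={r}: a read {verdict} a write")
```

Enumerating every pair of replica sets is only practical for small N, but it makes the pigeonhole argument concrete: when W + R > N the two sets must share at least one replica.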

Page 13: PagerDuty: Span the WAN? Yes you can!

Cassandra setup

• Replication factor: N=5
• Three DCs
• DC-aware placement strategy
• W=3: all writes hit multiple DCs
• R=3: all reads hit multiple DCs
• 3 + 3 > 5: consistent reads

[Diagram: Cass 1 to Cass 5 spread across DC-A, DC-B, and DC-C]
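A small check of why N=5 over three DCs with W=R=3 meets the design goals. This is a sketch under an assumption: the slide does not state which node lives in which DC, so the 2/2/1 split in `PLACEMENT` is hypothetical, as are the function names.

```python
from itertools import combinations

N, W, R = 5, 3, 3

# Assumption for illustration only: a 2/2/1 split of the five replicas.
PLACEMENT = {
    "DC-A": ["Cass 1", "Cass 2"],
    "DC-B": ["Cass 3", "Cass 4"],
    "DC-C": ["Cass 5"],
}


def replicas_left_without(dc):
    """Replicas still reachable after losing one whole DC."""
    return [n for d, nodes in PLACEMENT.items() if d != dc for n in nodes]


def quorum_survives_any_single_dc_loss(quorum):
    """True if every single-DC loss still leaves `quorum` replicas."""
    return all(len(replicas_left_without(dc)) >= quorum for dc in PLACEMENT)


def every_quorum_spans_multiple_dcs(quorum):
    """True if no group of `quorum` replicas fits inside a single DC."""
    node_dc = {n: d for d, nodes in PLACEMENT.items() for n in nodes}
    return all(
        len({node_dc[n] for n in group}) >= 2
        for group in combinations(node_dc, quorum)
    )


if __name__ == "__main__":
    print("Quorum survives any DC loss:", quorum_survives_any_single_dc_loss(W))
    print("Every quorum spans >1 DC:", every_quorum_spans_multiple_dcs(W))
```

Under this split no DC holds more than two replicas, so any three acknowledgements necessarily come from at least two DCs, and losing any one DC still leaves three replicas to form a quorum.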

Page 14: PagerDuty: Span the WAN? Yes you can!

Data layer summary

• Data safe against DC failure
• Consistent reads (of acknowledged writes)
• Expensive multi-DC writes & reads
• Managing state: No ACID transactions!
• Enforce “transactions” in the application layer

Page 15: PagerDuty: Span the WAN? Yes you can!

Application layer: “transactions”

• Sequence of logic and Cassandra operations
• Implement sequence as idempotent
• Failure is not an option
• Enforce transaction ordering
• Expect (some) (transient) inconsistencies
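One way to picture an idempotent sequence that is retried until it completes is sketched below. It is not PagerDuty's implementation: a plain dict stands in for quorum reads and writes, and the key names, the `TransientError` class, and the `crash_after_step` hook are all hypothetical, added only to simulate failures.

```python
class TransientError(Exception):
    """Simulated failure partway through the sequence."""


def process_event(store, event_id, payload, crash_after_step=None):
    """Run the whole sequence; safe to call again after a partial run."""
    if store.get(f"done:{event_id}"):
        return  # already completed by an earlier attempt

    # Step 1: keys derive from the event id, so a retry overwrites the
    # same row with the same value instead of creating a duplicate.
    store[f"incident:{event_id}"] = {"payload": payload, "state": "triggered"}
    if crash_after_step == 1:
        raise TransientError("crashed after step 1")

    # Step 2: same idea for the follow-up record.
    store[f"notification:{event_id}"] = {"incident": event_id}
    if crash_after_step == 2:
        raise TransientError("crashed after step 2")

    # Step 3: completion marker written last.
    store[f"done:{event_id}"] = True


def run_until_done(store, event_id, payload, crashes):
    """Retry the sequence, injecting one simulated crash per entry."""
    attempts = 0
    for crash in list(crashes) + [None]:
        attempts += 1
        try:
            process_event(store, event_id, payload, crash_after_step=crash)
        except TransientError:
            continue  # failure is not an option: go around again
        break
    return attempts


if __name__ == "__main__":
    clean, retried = {}, {}
    run_until_done(clean, "e1", "disk full", crashes=[])
    attempts = run_until_done(retried, "e1", "disk full", crashes=[1, 2, 1])
    print(attempts, clean == retried)
```

Between attempts the store holds a partially applied sequence, which is the kind of transient inconsistency the slide says to expect; the final state matches a single clean run.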

Page 16: PagerDuty: Span the WAN? Yes you can!


Tales from production

Page 17: PagerDuty: Span the WAN? Yes you can!

What about the network?

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

• Network diversity limits DC choices
• Result? Uneven network latencies

Page 18: PagerDuty: Span the WAN? Yes you can!

…and how you should think of the network

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

Page 19: PagerDuty: Span the WAN? Yes you can!

Reads and writes

[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]

Page 20: PagerDuty: Span the WAN? Yes you can!

Read and write performance

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

• R = W = 3 means always hitting replicas in two DCs (by design)
• Reads coordinated from DC-B or DC-C nodes will take > 3 ms
• Reads coordinated from DC-A nodes will take > 24 ms
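The latency floors above follow from waiting for the third-fastest replica. The sketch below reproduces them under stated assumptions: the node-to-DC mapping in `NODE_DC` is hypothetical (the transcript does not preserve it), the 24 ms links are taken to be the two links out of DC-A and the 3 ms link to be DC-B to DC-C, and intra-DC round trips are treated as 0 ms.

```python
# Inter-DC round-trip times in ms, as read off the slide's diagram labels.
RTT_MS = {
    frozenset(["DC-A", "DC-B"]): 24,
    frozenset(["DC-A", "DC-C"]): 24,
    frozenset(["DC-B", "DC-C"]): 3,
}

# Assumption for illustration only: which node sits in which DC.
NODE_DC = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-B",
    "Cass 5": "DC-C",
}


def rtt(dc_a, dc_b):
    """Round trip between two DCs; same-DC traffic is treated as free."""
    return 0 if dc_a == dc_b else RTT_MS[frozenset([dc_a, dc_b])]


def quorum_latency_floor(coordinator, quorum=3):
    """Network time until the `quorum`-th fastest replica can answer."""
    delays = sorted(rtt(NODE_DC[coordinator], dc) for dc in NODE_DC.values())
    return delays[quorum - 1]


if __name__ == "__main__":
    for node in NODE_DC:
        print(f"{node} as coordinator: >= {quorum_latency_floor(node)} ms")
```

This only models network delay; actual request times add replica processing on top, which is why the slide states the figures as lower bounds.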

Page 21: PagerDuty: Span the WAN? Yes you can!


Another latency effect? Per-node read volume


Page 22: PagerDuty: Span the WAN? Yes you can!


Per-node read volume: why so skewed?


Page 23: PagerDuty: Span the WAN? Yes you can!

Writes: Which replicas are involved? All 5

[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]

Page 24: PagerDuty: Span the WAN? Yes you can!

Writes: per-node volume

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

• N=5, so there is a write op on each replica
• All replicas experience the same per-node write load

Page 25: PagerDuty: Span the WAN? Yes you can!

Reads: Which replicas are involved? Only 3!

[Diagram: a client and replicas R1 to R5 across DC-A, DC-B, and DC-C]

Page 26: PagerDuty: Span the WAN? Yes you can!

Reads: per-node volume

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

• Coordinator chooses R fastest replicas (R=3)
• Network latency steers to the nearest replicas

Page 27: PagerDuty: Span the WAN? Yes you can!

Reads: per-node volume (Cass 3 as coord)

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

• Chooses 3, 4, and 5
• Same when Cass 4 or Cass 5 coordinates

Page 28: PagerDuty: Span the WAN? Yes you can!

Reads: per-node volume (Cass 1 as coord)

[Diagram: Cass 1 to Cass 5 across DC-A, DC-B, and DC-C; inter-DC links labeled 24 ms, 24 ms, and 3 ms]

• Hits 1, 2 and (randomly) one of 3, 4, 5
• Same when Cass 2 coordinates

Page 29: PagerDuty: Span the WAN? Yes you can!

Reads: per-node volume, uniform coord usage

Coordinator \ Node   Cass 1   Cass 2   Cass 3   Cass 4   Cass 5

Cass 1               1        1        0.33     0.33     0.33
Cass 2               1        1        0.33     0.33     0.33
Cass 3               0        0        1        1        1
Cass 4               0        0        1        1        1
Cass 5               0        0        1        1        1

Total requests       2        2        3.66     3.66     3.66
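The table can be derived from the "R fastest replicas" rule with ties broken uniformly at random. The sketch below does that with exact fractions. The node-to-DC mapping in `NODE_DC` is an assumption (the transcript does not preserve it), intra-DC round trips are treated as 0 ms, and the function names are made up for illustration.

```python
from fractions import Fraction

# Inter-DC round-trip times in ms, as read off the slide's diagram labels.
RTT_MS = {
    frozenset(["DC-A", "DC-B"]): 24,
    frozenset(["DC-A", "DC-C"]): 24,
    frozenset(["DC-B", "DC-C"]): 3,
}

# Assumption for illustration only: which node sits in which DC.
NODE_DC = {
    "Cass 1": "DC-A",
    "Cass 2": "DC-A",
    "Cass 3": "DC-B",
    "Cass 4": "DC-B",
    "Cass 5": "DC-C",
}


def rtt(dc_a, dc_b):
    """Round trip between two DCs; same-DC traffic is treated as free."""
    return 0 if dc_a == dc_b else RTT_MS[frozenset([dc_a, dc_b])]


def read_shares(coordinator, r=3):
    """Expected reads per node for one request at this coordinator."""
    delay = {n: rtt(NODE_DC[coordinator], dc) for n, dc in NODE_DC.items()}
    cutoff = sorted(delay.values())[r - 1]  # delay of the r-th nearest
    closer = [n for n in delay if delay[n] < cutoff]
    tied = [n for n in delay if delay[n] == cutoff]
    shares = {n: Fraction(0) for n in delay}
    for n in closer:
        shares[n] = Fraction(1)  # always among the r nearest
    for n in tied:
        # Remaining slots are split evenly among equally distant nodes.
        shares[n] = Fraction(r - len(closer), len(tied))
    return shares


def total_reads_per_node(r=3):
    """Sum of shares when each node coordinates exactly one request."""
    totals = {n: Fraction(0) for n in NODE_DC}
    for coordinator in NODE_DC:
        for node, share in read_shares(coordinator, r).items():
            totals[node] += share
    return totals


if __name__ == "__main__":
    for node, total in total_reads_per_node().items():
        print(f"{node}: {float(total):.2f}")
```

The exact totals are 2 for Cass 1 and Cass 2 and 11/3 for the other three nodes; the slide's 3.66 is that value with the 0.33 entries truncated.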

Page 30: PagerDuty: Span the WAN? Yes you can!


Per-node read volume: reality vs. theory


Page 31: PagerDuty: Span the WAN? Yes you can!

What about scaling out?

• Asymmetrical per-node read volumes
• So each DC has different CPU and disk IO needs
• Different node size?
• Different per-DC node count?
• What about DC degradation or loss?
• End up with same-sized nodes

Page 32: PagerDuty: Span the WAN? Yes you can!


When a data center vanishes…

Page 33: PagerDuty: Span the WAN? Yes you can!

Major outage: DC-C (May 2015)

• All hosts unreachable for ~5 hours

Page 34: PagerDuty: Span the WAN? Yes you can!

Seamless data center migration (August 2015)

• Moved DC-C fleet from one provider to another
• Remove old node; add new node
• No application-level migration needed
• Zero customer impact

Page 35: PagerDuty: Span the WAN? Yes you can!

DC-A to DC-B fiber cut (September 2015)

• DC-A to DC-B network latency 24 ms -> 200 ms, lasted 48 hours
• All Cass ops now take 24 ms

[Diagram label: FIBER CUT EAST-1]

Page 36: PagerDuty: Span the WAN? Yes you can!


And back to where we started

Page 37: PagerDuty: Span the WAN? Yes you can!

What have we learned?

• WAN-spanning synchronous replication is a thing
• Data layer consistent reads are practical
• Application layer consequences for managing state
• Network topology affects:
  • Request performance
  • Per-node load
• Trade off latency for reliability

Page 38: PagerDuty: Span the WAN? Yes you can!


Span the WAN?

Yes you can!


Page 39: PagerDuty: Span the WAN? Yes you can!


[email protected] PAGERDUTY.COM/JOBS
