High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011
DESCRIPTION
Designing a massively scalable, highly available persistence layer has been one of the great challenges we’ve faced building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be met using traditional sharding or scaling approaches. In this talk we first look at the challenges of running high-availability services in the cloud, and then describe how we’ve architected “in-flight” and “post-flight” data into separate datastores that can be implemented using a range of technologies.

TRANSCRIPT
twilio CLOUD COMMUNICATIONS
SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD
Oct 11, 2011, Web 2.0 Expo
Evan Cooke, Co-Founder & CTO
High-Availability: Sounds good, we need that!
Yummmm Technical Meat!
Availability = Uptime / (Uptime + Downtime)
Availability % Downtime/yr Downtime/mo
99.9% ("three nines") 8.76 hours 43.2 minutes
99.99% ("four nines") 52.56 minutes 4.32 minutes
99.999% ("five nines") 5.26 minutes 25.9 seconds
99.9999% ("six nines") 31.5 seconds 2.59 seconds
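The table follows directly from the formula above. A quick Python sketch (not from the talk) that reproduces the figures, treating a month as 30 days:

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600

def downtime(availability, period_seconds):
    """Seconds of allowed downtime: period * (1 - uptime fraction)."""
    return period_seconds * (1 - availability)

for a in (0.999, 0.9999, 0.99999, 0.999999):
    print("%8.4f%% -> %9.1f s/yr, %8.1f s/mo"
          % (a * 100,
             downtime(a, SECONDS_PER_YEAR),
             downtime(a, SECONDS_PER_MONTH)))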
Can’t rely on a human to respond within a 5-minute window! Must use automation.
2.5 Hours Down - September 23, 2010
“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
11 Hours Down - October 4, 2010
“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”
Hours Down - November 14, 2010
“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”
Happens to the best
Causes of Downtime
• Lack of best practice: change control
• Lack of best practice: monitoring of the relevant components
• Lack of best practice: requirements and procurement
• Lack of best practice: operations
• Lack of best practice: avoidance of network failures
• Lack of best practice: avoidance of internal application failures
• Lack of best practice: avoidance of external services that fail
• Lack of best practice: physical environment
• Lack of best practice: network redundancy
• Lack of best practice: technical solution of backup
• Lack of best practice: process solution of backup
• Lack of best practice: physical location
• Lack of best practice: infrastructure redundancy
• Lack of best practice: storage architecture redundancy
E. Marcus and H. Stern, Blueprints for High Availability, Second Edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
[Diagram: the causes above grouped into buckets]
Data Persistence: storage architecture redundancy; technical solution of backup; process solution of backup
Change Control: change control; monitoring of the relevant components; requirements and procurement
Operations: operations; avoidance of internal app failures; avoidance of external services that fail
Datacenter: avoidance of network failures; physical environment; network redundancy; physical location; infrastructure redundancy
Cloud vs. Non-Cloud
Today: Data Persistence and Change Control
Lessons learned @twilio
[Diagram: Twilio platform - Developers build on Voice, SMS, and Phone Numbers APIs; Carriers route inbound/outbound calls to End Users; Mobile/Browser VoIP; SMS sent to/from phone numbers and short codes; dynamically buy phone numbers]
Twilio provides web service APIs to automate Voice and SMS communications
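As a hedged illustration (not from the slides), this is roughly what driving that API looked like with the twilio-python helper library of that era; the account SID, auth token, and phone numbers below are placeholders:

from twilio.rest import TwilioRestClient  # twilio-python, circa 2011

# Placeholder credentials; real values come from your account dashboard.
client = TwilioRestClient('ACxxxxxxxxxxxxxxxx', 'your_auth_token')

# Send an SMS from a Twilio number you own to an end user.
message = client.sms.messages.create(
    to='+15558675309',
    from_='+15551234567',
    body='Hello from the Twilio API')
print(message.sid)  # unique identifier for the queued message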
[Growth charts]
6 (2009) → 20 (2010) → 70+ (2011)
100x growth in Tx/Day over 1 year
Servers: 10 servers (2009) → 10’s of servers (2010) → 100’s of servers (2011)
2011:
• 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams
2011:
• Frameworks
  - PHP for frontend components
  - Python Twisted & gevent for async network services
  - Java for backend services
• Storage technology
  - MySQL for core DB services
  - Redis for queuing and messaging
Data persistence is hard (especially in the cloud)
Data persistence is the hardest technical problem most scalable SaaS businesses face
What is data persistence?
Stuff that looks like this: Databases, Queues, Files
[Diagram: incoming requests → load balancer (LB) → Tier 1 (A services + queues) → Tier 2 (B services) → Tier 3 (C and D services) backed by SQL, files, and K/V stores]
Data Persistence!
Why is persistence so hard?
• Difficult to change structure
  - Huge inertia, e.g., large schema migrations
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
Difficult to Change Structure
Before (...500 million rows):
Id | Name  | Value
1  | Bob   | 12
2  | Jane  | 78
3  | Steve | 56

ALTER TABLE names DROP COLUMN Value
HOURS later...

After:
Id | Name
1  | Bob
2  | Jane
3  | Steve
‣ You live with data decisions for a long time
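One common workaround, sketched below under the assumption of a MySQL table like the one above, is to avoid the single giant ALTER: create the new table shape empty, backfill in small batches, then swap names (the same idea tools such as pt-online-schema-change automate; writes arriving during the copy still need triggers or double-writes, which this sketch omits):

import MySQLdb  # assumes the MySQL-python driver and a `names` table as above

conn = MySQLdb.connect(db='prod')
cur = conn.cursor()

# Empty clone with the new shape; ALTER on an empty table is instant.
cur.execute("CREATE TABLE names_new LIKE names")
cur.execute("ALTER TABLE names_new DROP COLUMN Value")

last_id, batch = 0, 10000
while True:
    # Copy a small chunk; short transactions avoid long table locks.
    copied = cur.execute(
        "INSERT INTO names_new (Id, Name) "
        "SELECT Id, Name FROM names WHERE Id > %s ORDER BY Id LIMIT %s",
        (last_id, batch))
    if copied == 0:
        break
    cur.execute("SELECT MAX(Id) FROM names_new")
    last_id = cur.fetchone()[0]
    conn.commit()

# Atomic swap: readers see the old table until the instant of the rename.
cur.execute("RENAME TABLE names TO names_old, names_new TO names")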
Painful to Recover from Failures
[Diagram: writes (W) and reads (R) hit a primary DB, which replicates to a secondary DB that also serves reads]
Data on secondary? How much data? R/W consistency?
‣ Because of this complexity, failover is a human process
Woeful Performance/Scalability
‣ Poor I/O in the cloud today, 100x slower than real HW
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 169.31 111.88 57.43 469.31 0.90 2.25 12.24 2.29 4.36 1.12 59.01
sdc 178.22 110.89 59.41 396.04 0.93 1.98 13.08 1.58 3.50 1.18 53.56
sdd 145.54 102.97 50.50 384.16 0.78 1.90 12.63 1.00 2.34 1.03 44.85
sde 166.34 95.05 54.46 337.62 0.85 1.69 13.27 1.12 2.84 1.22 47.92
md0 0.00 0.00 880.20 2007.92 3.44 7.82 7.99 0.00 0.00 0.00 0.00
~10 MB/s write - EC2 m1.xlarge, RAID-0 across 4x ephemeral disks
Woeful Performance/Scalability
[Diagram: a row of DB shards]
‣ Difficult to horizontally scale in the cloud
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 11655168000; in additional pool allocated 0
Internal hash tables (constant factor + variable factor)
  Adaptive hash index 223758224 (179959576 + 43798648)
  Page hash 11248264
  Dictionary cache 45048690 (44991344 + 57346)
  File system 84400 (82672 + 1728)
  Lock system 28180376 (28119464 + 60912)
  Recovery system 0 (0 + 0)
  Threads 428608 (406936 + 21672)
Dictionary memory allocated 57346
Buffer pool size 693759
Buffer pool size, bytes 11366547456
Free buffers 1
Database pages 691085
Old database pages 255087
Modified db pages 326490
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 497782847, not young 0
24.78 youngs/s, 0.00 non-youngs/s
Pages read 447257683, created 16982810, written 405153433
24.82 reads/s, 1.14 creates/s, 33.36 writes/s
Buffer pool hit rate 993 / 1000, young-making rate 7 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.39/s
LRU len: 691085, unzip_LRU len: 0
I/O sum[2753]:cur[2], unzip sum[0]:cur[0]
@!#$%^&* Complex
• Incredibly complex configuration
  - A billion knobs and buttons
  - Whole companies exist just to tune DBs
• Lots of consistency/transactional models
• Multi-region data is unsolved - Facebook and Google struggle
Deep breath, step back. Think about each problem (using @twilio examples).
• Software that runs in the cloud
• Open source
1. Difficult to Change Structure
• Don’t have structure
  - key/value databases (SimpleDB, Cassandra)
  - document-oriented databases (CouchDB, MongoDB)
• Don’t store a lot of data...
Don’t Store Stuff (1)
• Outsource data as much as possible
• But NOT to your customers
Don’t Store Stuff (1)
• Aggressively archive and move data offline (see the sketch below)
[Diagram: ~500M rows archived out to S3/SimpleDB (keep indices in memory)]
• Build UX that supports longer/restricted access times to older data
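A minimal sketch of that archiving pattern, assuming boto (the Python AWS library of the era) and a hypothetical bucket name; each cold row moves to S3 and only a small index entry stays hot:

import json
import boto  # boto 2.x; AWS credentials come from the environment

bucket = boto.connect_s3().get_bucket('example-archive')  # hypothetical bucket

def archive_row(row):
    """Push one cold row to S3; return the key to keep in the in-memory index."""
    key = bucket.new_key('rows/%d.json' % row['Id'])
    key.set_contents_from_string(json.dumps(row))
    return key.name  # the hot store now only needs Id -> key

def fetch_archived(key_name):
    """Slower path for old data; the UX should tolerate this latency."""
    return json.loads(bucket.get_key(key_name).get_contents_as_string())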
Don’t Store Stuff (1)
• Avoid stateful systems/architectures where possible
[Diagram: Browser with Cookie: SessionID → web servers → shared Session DB]
Don’t Store Stuff (1)
• Avoid stateful systems/architectures where possible
[Diagram: Browser with Cookie: enc($session) → web servers, no Session DB]
• Store state in the client browser (sketch below)
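A stdlib-only sketch of that idea: serialize the session, sign it so the client can't tamper with it, and hand it back as the cookie value. Production use would also encrypt the payload, as the enc($session) on the slide implies; the secret here is a placeholder:

import base64, hashlib, hmac, json

SECRET = b'replace-with-a-real-secret'

def encode_session(session):
    """dict -> tamper-evident cookie value."""
    payload = base64.urlsafe_b64encode(json.dumps(session).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + '.' + sig

def decode_session(cookie):
    """cookie value -> dict, rejecting anything the client modified."""
    payload, sig = cookie.rsplit('.', 1)
    want = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, want):
        raise ValueError('tampered session cookie')
    return json.loads(base64.urlsafe_b64decode(payload))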
2. Painful to Recover from Failures
• Avoid single points of failure
  - E.g., master-master (active/active)
  - Complex to set up, complex failure modes
  - Sometimes it’s the only solution
  - Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components...
Separate Stateful and Stateless Components (2)
[Diagram: a request flows App A → App B → App C]
On failure of App B, even if we boot a replacement, we lose data
Separate Stateful and Stateless Components (2)
[Diagram: the same path with queues between App A, App B, and App C]
On failure, even if we boot a replacement, we lose data
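A minimal sketch of the queueing idea (redis-py assumed; queue and handler names are illustrative). Requests waiting in the queue survive a node failure, and a freshly booted replacement worker simply resumes consuming; note that a task already popped by the crashed node is still lost unless something like BRPOPLPUSH protects it:

import json
import redis  # assumes redis-py and a reachable Redis instance

r = redis.StrictRedis()

def enqueue(task):
    # Producer (App A): hand the request to the queue, not to a specific node.
    r.lpush('app_b_work', json.dumps(task))

def worker(handle):
    # Consumer (App B): any instance, including a replacement after a crash,
    # pops the next task from the shared queue.
    while True:
        _, raw = r.brpop('app_b_work')
        handle(json.loads(raw))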
Separate Stateful and Stateless Components (2)
[Diagram: App A → App B → App C with the client connection held open across the whole path]
Keep the connection open for the whole app path! (hint: use an evented framework; sketch below)
On failure, we don’t lose a single request
Twilio’s SMS stack uses this approach
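A rough gevent sketch of the "hold the connection open" idea (the port and the process_sms stand-in are hypothetical): the client only gets a success response once the whole downstream path has completed, so a mid-path failure surfaces as an error the client can retry rather than a silently lost request:

from gevent.pywsgi import WSGIServer

def process_sms(body):
    # Stand-in for the full downstream path (App A -> App B -> App C).
    # Each hop blocks only this greenlet; thousands can wait concurrently.
    return b'queued'

def app(environ, start_response):
    try:
        result = process_sms(environ['wsgi.input'].read())
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [result]
    except Exception:
        # Nothing was acknowledged, so the client knows to retry.
        start_response('503 Service Unavailable', [('Content-Type', 'text/plain')])
        return [b'retry']

WSGIServer(('', 8080), app).serve_forever()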
2. Painful to Recover from Failures
• Avoid single points of failure
  - E.g., master-master (active/active)
  - Complex to set up, complex failure modes
  - Sometimes it’s the only solution
  - Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components
• Build a data change control process to avoid mistakes and errors...
Components deployed at different frequencies: Partially Continuous Deployment
4 buckets - deployment frequency (risk), log scale:
1000x - Website Content (CMS)
100x - Website Code (PHP/Ruby etc.)
10x - REST API (Python/Java etc.)
1x - Big DB Schema (SQL)
Deployment Processes:
Website Content - One Click
Website Code - One Click, CI Tests
REST API - One Click, CI Tests, Human Sign-off
Big DB Schema - Human-Assisted Click, CI Tests, Human Sign-off
3. Woeful Performance/Scalability
• If disk I/O is poor, avoid disk
  - Tune tune tune. Keep your indices in memory
  - Use an in-memory datastore, e.g., Redis, and configure replication such that if you have a master failure, you can always promote a slave
• When disk I/O saturates, shard (see the sketch below)
  - LOTs of sharding info on the web
  - Method of last resort: a single point of failure becomes multiple single points of failure
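For the sharding bullet, a bare-bones sketch of key-based routing (the host names are made up). Naive mod-N is shown because it is the simplest thing that works; note that adding a shard remaps most keys, which is why consistent hashing is usually preferred:

import hashlib

SHARDS = ['db0.internal', 'db1.internal', 'db2.internal', 'db3.internal']

def shard_for(key):
    """Deterministically map a key (e.g., an account SID) to one shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every reader and writer agrees on placement without a lookup service.
assert shard_for('AC123') == shard_for('AC123')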
4. @#$%^&* Complex
• Bring the simplest tool to the job
  - Use a strictly consistent store only if you need it
  - If you don’t need HA, don’t add the complexity
• There is no magic database. Decompose requirements, mix-and-match datastores as needed...
Magic Database: does it all! Consistency, Availability, Partition-tolerance - it’s got all three.
Twilio Data Lifecycle (4)
CREATE → name: foo, status: INIT, ret: 0
UPDATE → name: foo, status: QUEUED, ret: 0
UPDATE → name: foo, status: GOING, ret: 0
       → name: foo, status: DONE, ret: 42
Twilio Examples: Call, SMS, Conference. Other Examples: Order, Workflow, $
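A tiny sketch of such a lifecycle record and its guarded transitions (status names from the slide; the storage layer is elided):

# Legal transitions for an in-flight resource such as a Call or SMS.
TRANSITIONS = {
    'INIT':   {'QUEUED'},
    'QUEUED': {'GOING'},
    'GOING':  {'DONE'},
    'DONE':   set(),  # post-flight: the record never changes again
}

def create(name):
    return {'name': name, 'status': 'INIT', 'ret': 0}

def update(record, status, ret=None):
    if status not in TRANSITIONS[record['status']]:
        raise ValueError('illegal transition %s -> %s'
                         % (record['status'], status))
    record['status'] = status
    if ret is not None:
        record['ret'] = ret
    return record

call = create('foo')
update(call, 'QUEUED'); update(call, 'GOING'); update(call, 'DONE', ret=42)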
Twilio Data Lifecycle (4)
In-Flight: INIT → QUEUED → GOING | Post-Flight: DONE (ret: 42)
Twilio Data Lifecycle (4) - Applications
In-Flight: atomically update part of a workflow
Post-Flight: billing, log access, analytics, reporting
Twilio Data Lifecycle (4) - High-Availability Properties
In-Flight: strict consistency, key/value, ~20ms
Post-Flight: eventual consistency, range queries w/ filters, ~200ms
Twilio Data Lifecycle (4)
In-Flight → Data Store A, Post-Flight → Data Store B: systems with very different access semantics
In-Flight: strict consistency, key/value, ~20ms, 10k-1M records
Post-Flight, fed via queues (sketch below):
• Logs (REST API): eventual consistency, range queries, filtered queries, ~200ms, billions
• Reporting: eventual consistency, arbitrary queries, high latency, billions
• Billing: idempotent, aggregation, key/value, billions
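A hedged sketch of wiring the two sides together: in-flight state lives in a strict, low-latency K/V store (Redis here, purely for illustration), and completion pushes the finished record onto a queue that the post-flight consumers drain at their own pace:

import json
import redis  # illustrative choice; any strict low-latency K/V store fits

r = redis.StrictRedis(decode_responses=True)

def update_inflight(sid, fields):
    # Strict-consistency, ~ms read-modify-write while the call/SMS is live.
    r.hset('inflight:' + sid, mapping=fields)

def complete(sid, ret):
    # Freeze the record and hand it to the post-flight consumers
    # (logs, billing, reporting) via a queue they drain asynchronously.
    record = r.hgetall('inflight:' + sid)
    record.update({'status': 'DONE', 'ret': ret})
    r.lpush('postflight', json.dumps(record))
    r.delete('inflight:' + sid)  # keeps the hot store at 10k-1M records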
Candidate technologies:
In-Flight: MySQL, PostgreSQL, Redis, NDB
Post-Flight, via queues:
• Logs (REST API): sharded SQL, Cassandra/Acunu, MongoDB, Riak, CouchDB
• Billing: sharded SQL, Redis
• Reporting: sharded SQL, Redis, Hadoop
Why is persistence so hard? (recap)
• Difficult to change structure
  - Huge inertia, e.g., large schema migrations
  → Don’t store stuff! Go schema-less
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
  → Separate stateful/stateless; change control processes
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g., EC2)
  → Memory FTW; shard
• Freak’in complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
  → Decompose the data lifecycle; minimize complexity
[Diagram: the original three-tier architecture, reworked]
• Aggregate into HA queues
• Master-Master MySQL
• Move the file store to S3
• Move K/V to SimpleDB w/ local cache
• Idempotent request path (sketch below)
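The idempotent request path can be sketched as a client-supplied token gating each side effect (Redis is used illustratively as the dedup store); a retry with the same token returns the original result instead of repeating the work:

import redis  # illustrative dedup store

r = redis.StrictRedis(decode_responses=True)

def idempotent(token, do_work):
    """Run do_work at most once per token; replays get the saved result."""
    if not r.setnx('idem:' + token, 'claimed'):
        # A replay arriving before the result is stored sees None and
        # can poll again rather than re-running the work.
        return r.get('result:' + token)
    result = do_work()
    r.set('result:' + token, str(result))
    return result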
HA is Hard
SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD
Focus on data:
• How you store it
• Where you store it
• When you can delete it
• Control changes to it
Open Problems...
• HA queue
• Simple multi-AZ / multi-region consistent K/V (in-flight)
• Massively scalable, filterable range queries at ~200ms (logs)
• Simple HA Hadoop (reporting)
• Massively scalable aggregator (billing)
twilio - http://www.twilio.com
@emcooke