High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011
DESCRIPTION
Designing a massively scalable, highly available persistence layer has been one of the great challenges we’ve faced building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be met using traditional sharding or scaling approaches. In this talk we first look at the challenges of running high-availability services in the cloud, and then describe how we’ve architected “in-flight” and “post-flight” data into separate datastores that can be implemented using a range of technologies.

TRANSCRIPT
twilio CLOUD COMMUNICATIONS
SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD
Oct 11, 2011, Web 2.0 Expo
Evan Cooke, Co-Founder & CTO
High-Availability: Sounds good, we need that!
Yummmm Technical Meat!
Availability = Uptime / (Uptime + Downtime)
Availability % Downtime/yr Downtime/mo
99.9% ("three nines") 8.76 hours 43.2 minutes
99.99% ("four nines") 52.56 minutes 4.32 minutes
99.999% ("five nines") 5.26 minutes 25.9 seconds
99.9999% ("six nines") 31.5 seconds 2.59 seconds
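The table follows directly from the formula above. A quick Python sketch (not from the talk) that reproduces the figures, treating a month as 30 days:

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600

def downtime(availability, period_seconds):
    """Seconds of allowed downtime: period * (1 - uptime fraction)."""
    return period_seconds * (1 - availability)

for a in (0.999, 0.9999, 0.99999, 0.999999):
    print("%8.4f%% -> %9.1f s/yr, %8.1f s/mo"
          % (a * 100,
             downtime(a, SECONDS_PER_YEAR),
             downtime(a, SECONDS_PER_MONTH)))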
Can’t rely on a human to respond within a 5-minute window! Must use automation.
2.5 Hours Down - September 23, 2010
“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
11 Hours Down - October 4, 2010
“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”
Hours Down - November 14, 2010
“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”
Happens to the best
Causes of Downtime
• Lack of best practice: change control
• Lack of best practice: monitoring of the relevant components
• Lack of best practice: requirements and procurement
• Lack of best practice: operations
• Lack of best practice: avoidance of network failures
• Lack of best practice: avoidance of internal application failures
• Lack of best practice: avoidance of external services that fail
• Lack of best practice: physical environment
• Lack of best practice: network redundancy
• Lack of best practice: technical solution of backup
• Lack of best practice: process solution of backup
• Lack of best practice: physical location
• Lack of best practice: infrastructure redundancy
• Lack of best practice: storage architecture redundancy
E. Marcus and H. Stern, Blueprints for High Availability, Second Edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
[Diagram: the causes above grouped into buckets]
Data Persistence: storage architecture redundancy; technical solution of backup; process solution of backup
Change Control: change control; monitoring of the relevant components; requirements and procurement
Operations: operations; avoidance of internal app failures; avoidance of external services that fail
Datacenter: avoidance of network failures; physical environment; network redundancy; physical location; infrastructure redundancy
Cloud vs. Non-Cloud
Today: Data Persistence and Change Control
Lessons learned @twilio
[Diagram: Twilio platform - Developers build on Voice, SMS, and Phone Numbers APIs; Carriers route inbound/outbound calls to End Users; Mobile/Browser VoIP; SMS sent to/from phone numbers and short codes; dynamically buy phone numbers]
Twilio provides web service APIs to automate Voice and SMS communications
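As a hedged illustration (not from the slides), this is roughly what driving that API looked like with the twilio-python helper library of that era; the account SID, auth token, and phone numbers below are placeholders:

from twilio.rest import TwilioRestClient  # twilio-python, circa 2011

# Placeholder credentials; real values come from your account dashboard.
client = TwilioRestClient('ACxxxxxxxxxxxxxxxx', 'your_auth_token')

# Send an SMS from a Twilio number you own to an end user.
message = client.sms.messages.create(
    to='+15558675309',
    from_='+15551234567',
    body='Hello from the Twilio API')
print(message.sid)  # unique identifier for the queued message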
[Growth charts]
6 (2009) → 20 (2010) → 70+ (2011)
100x growth in Tx/Day over 1 year
Servers: 10 servers (2009) → 10’s of servers (2010) → 100’s of servers (2011)
2011:
• 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams
2011:
• Frameworks
  - PHP for frontend components
  - Python Twisted & gevent for async network services
  - Java for backend services
• Storage technology
  - MySQL for core DB services
  - Redis for queuing and messaging
Data persistence is hard (especially in the cloud)
Data persistence is the hardest technical problem most scalable SaaS businesses face
What is data persistence?
Stuff that looks like this: Databases, Queues, Files
[Diagram: incoming requests → load balancer (LB) → Tier 1 (A services + queues) → Tier 2 (B services) → Tier 3 (C and D services) backed by SQL, files, and K/V stores]
Data Persistence!
Why is persistence so hard?
• Difficult to change structure
  - Huge inertia, e.g., large schema migrations
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
Difficult to Change Structure
Before (...500 million rows):
Id | Name  | Value
1  | Bob   | 12
2  | Jane  | 78
3  | Steve | 56

ALTER TABLE names DROP COLUMN Value
HOURS later...

After:
Id | Name
1  | Bob
2  | Jane
3  | Steve
‣ You live with data decisions for a long time
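One common workaround, sketched below under the assumption of a MySQL table like the one above, is to avoid the single giant ALTER: create the new table shape empty, backfill in small batches, then swap names (the same idea tools such as pt-online-schema-change automate; writes arriving during the copy still need triggers or double-writes, which this sketch omits):

import MySQLdb  # assumes the MySQL-python driver and a `names` table as above

conn = MySQLdb.connect(db='prod')
cur = conn.cursor()

# Empty clone with the new shape; ALTER on an empty table is instant.
cur.execute("CREATE TABLE names_new LIKE names")
cur.execute("ALTER TABLE names_new DROP COLUMN Value")

last_id, batch = 0, 10000
while True:
    # Copy a small chunk; short transactions avoid long table locks.
    copied = cur.execute(
        "INSERT INTO names_new (Id, Name) "
        "SELECT Id, Name FROM names WHERE Id > %s ORDER BY Id LIMIT %s",
        (last_id, batch))
    if copied == 0:
        break
    cur.execute("SELECT MAX(Id) FROM names_new")
    last_id = cur.fetchone()[0]
    conn.commit()

# Atomic swap: readers see the old table until the instant of the rename.
cur.execute("RENAME TABLE names TO names_old, names_new TO names")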
Painful to Recover from Failures
[Diagram: writes (W) and reads (R) hit a primary DB, which replicates to a secondary DB that also serves reads]
Data on secondary? How much data? R/W consistency?
‣ Because of this complexity, failover is a human process
Woeful Performance/Scalability
‣ Poor I/O in the cloud today, 100x slower than real HW
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 169.31 111.88 57.43 469.31 0.90 2.25 12.24 2.29 4.36 1.12 59.01
sdc 178.22 110.89 59.41 396.04 0.93 1.98 13.08 1.58 3.50 1.18 53.56
sdd 145.54 102.97 50.50 384.16 0.78 1.90 12.63 1.00 2.34 1.03 44.85
sde 166.34 95.05 54.46 337.62 0.85 1.69 13.27 1.12 2.84 1.22 47.92
md0 0.00 0.00 880.20 2007.92 3.44 7.82 7.99 0.00 0.00 0.00 0.00
~10 MB/s write - EC2 m1.xlarge, RAID-0 across 4x ephemeral disks
Woeful Performance/Scalability
[Diagram: a row of DB shards]
‣ Difficult to horizontally scale in the cloud
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 11655168000; in additional pool allocated 0
Internal hash tables (constant factor + variable factor)
  Adaptive hash index 223758224 (179959576 + 43798648)
  Page hash 11248264
  Dictionary cache 45048690 (44991344 + 57346)
  File system 84400 (82672 + 1728)
  Lock system 28180376 (28119464 + 60912)
  Recovery system 0 (0 + 0)
  Threads 428608 (406936 + 21672)
Dictionary memory allocated 57346
Buffer pool size 693759
Buffer pool size, bytes 11366547456
Free buffers 1
Database pages 691085
Old database pages 255087
Modified db pages 326490
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 497782847, not young 0
24.78 youngs/s, 0.00 non-youngs/s
Pages read 447257683, created 16982810, written 405153433
24.82 reads/s, 1.14 creates/s, 33.36 writes/s
Buffer pool hit rate 993 / 1000, young-making rate 7 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.39/s
LRU len: 691085, unzip_LRU len: 0
I/O sum[2753]:cur[2], unzip sum[0]:cur[0]
@!#$%^&* Complex
• Incredibly complex configuration
  - A billion knobs and buttons
  - Whole companies exist just to tune DBs
• Lots of consistency/transactional models
• Multi-region data is unsolved - Facebook and Google struggle
Deep breath, step back. Think about each problem (using @twilio examples).
• Software that runs in the cloud
• Open source
1. Difficult to Change Structure
• Don’t have structure
  - key/value databases (SimpleDB, Cassandra)
  - document-oriented databases (CouchDB, MongoDB)
• Don’t store a lot of data...
Don’t Store Stuff (1)
• Outsource data as much as possible
• But NOT to your customers
Don’t Store Stuff (1)
• Aggressively archive and move data offline (see the sketch below)
[Diagram: ~500M rows archived out to S3/SimpleDB (keep indices in memory)]
• Build UX that supports longer/restricted access times to older data
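A minimal sketch of that archiving pattern, assuming boto (the Python AWS library of the era) and a hypothetical bucket name; each cold row moves to S3 and only a small index entry stays hot:

import json
import boto  # boto 2.x; AWS credentials come from the environment

bucket = boto.connect_s3().get_bucket('example-archive')  # hypothetical bucket

def archive_row(row):
    """Push one cold row to S3; return the key to keep in the in-memory index."""
    key = bucket.new_key('rows/%d.json' % row['Id'])
    key.set_contents_from_string(json.dumps(row))
    return key.name  # the hot store now only needs Id -> key

def fetch_archived(key_name):
    """Slower path for old data; the UX should tolerate this latency."""
    return json.loads(bucket.get_key(key_name).get_contents_as_string())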
Don’t Store Stuff (1)
• Avoid stateful systems/architectures where possible
[Diagram: Browser with Cookie: SessionID → web servers → shared Session DB]
Don’t Store Stuff (1)
• Avoid stateful systems/architectures where possible
[Diagram: Browser with Cookie: enc($session) → web servers, no Session DB]
• Store state in the client browser (sketch below)
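A stdlib-only sketch of that idea: serialize the session, sign it so the client can't tamper with it, and hand it back as the cookie value. Production use would also encrypt the payload, as the enc($session) on the slide implies; the secret here is a placeholder:

import base64, hashlib, hmac, json

SECRET = b'replace-with-a-real-secret'

def encode_session(session):
    """dict -> tamper-evident cookie value."""
    payload = base64.urlsafe_b64encode(json.dumps(session).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + '.' + sig

def decode_session(cookie):
    """cookie value -> dict, rejecting anything the client modified."""
    payload, sig = cookie.rsplit('.', 1)
    want = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, want):
        raise ValueError('tampered session cookie')
    return json.loads(base64.urlsafe_b64decode(payload))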
2. Painful to Recover from Failures
• Avoid single points of failure
  - E.g., master-master (active/active)
  - Complex to set up, complex failure modes
  - Sometimes it’s the only solution
  - Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components...
Separate Stateful and Stateless Components (2)
[Diagram: a request flows App A → App B → App C]
On failure of App B, even if we boot a replacement, we lose data
Separate Stateful and Stateless Components (2)
[Diagram: the same path with queues between App A, App B, and App C]
On failure, even if we boot a replacement, we lose data
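A minimal sketch of the queueing idea (redis-py assumed; queue and handler names are illustrative). Requests waiting in the queue survive a node failure, and a freshly booted replacement worker simply resumes consuming; note that a task already popped by the crashed node is still lost unless something like BRPOPLPUSH protects it:

import json
import redis  # assumes redis-py and a reachable Redis instance

r = redis.StrictRedis()

def enqueue(task):
    # Producer (App A): hand the request to the queue, not to a specific node.
    r.lpush('app_b_work', json.dumps(task))

def worker(handle):
    # Consumer (App B): any instance, including a replacement after a crash,
    # pops the next task from the shared queue.
    while True:
        _, raw = r.brpop('app_b_work')
        handle(json.loads(raw))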
Separate Stateful and Stateless Components (2)
[Diagram: App A → App B → App C with the client connection held open across the whole path]
Keep the connection open for the whole app path! (hint: use an evented framework; sketch below)
On failure, we don’t lose a single request
Twilio’s SMS stack uses this approach
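A rough gevent sketch of the "hold the connection open" idea (the port and the process_sms stand-in are hypothetical): the client only gets a success response once the whole downstream path has completed, so a mid-path failure surfaces as an error the client can retry rather than a silently lost request:

from gevent.pywsgi import WSGIServer

def process_sms(body):
    # Stand-in for the full downstream path (App A -> App B -> App C).
    # Each hop blocks only this greenlet; thousands can wait concurrently.
    return b'queued'

def app(environ, start_response):
    try:
        result = process_sms(environ['wsgi.input'].read())
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [result]
    except Exception:
        # Nothing was acknowledged, so the client knows to retry.
        start_response('503 Service Unavailable', [('Content-Type', 'text/plain')])
        return [b'retry']

WSGIServer(('', 8080), app).serve_forever()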
2. Painful to Recover from Failures
• Avoid single points of failure
  - E.g., master-master (active/active)
  - Complex to set up, complex failure modes
  - Sometimes it’s the only solution
  - Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components
• Build a data change control process to avoid mistakes and errors...
Components deployed at different frequencies: Partially Continuous Deployment
4 buckets - deployment frequency (risk), log scale:
1000x - Website Content (CMS)
100x - Website Code (PHP/Ruby etc.)
10x - REST API (Python/Java etc.)
1x - Big DB Schema (SQL)
Deployment Processes:
Website Content - One Click
Website Code - One Click, CI Tests
REST API - One Click, CI Tests, Human Sign-off
Big DB Schema - Human-Assisted Click, CI Tests, Human Sign-off
3. Woeful Performance/Scalability
• If disk I/O is poor, avoid disk
  - Tune tune tune. Keep your indices in memory
  - Use an in-memory datastore, e.g., Redis, and configure replication such that if you have a master failure, you can always promote a slave
• When disk I/O saturates, shard (see the sketch below)
  - LOTs of sharding info on the web
  - Method of last resort: a single point of failure becomes multiple single points of failure
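For the sharding bullet, a bare-bones sketch of key-based routing (the host names are made up). Naive mod-N is shown because it is the simplest thing that works; note that adding a shard remaps most keys, which is why consistent hashing is usually preferred:

import hashlib

SHARDS = ['db0.internal', 'db1.internal', 'db2.internal', 'db3.internal']

def shard_for(key):
    """Deterministically map a key (e.g., an account SID) to one shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every reader and writer agrees on placement without a lookup service.
assert shard_for('AC123') == shard_for('AC123')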
4. @#$%^&* Complex
• Bring the simplest tool to the job
  - Use a strictly consistent store only if you need it
  - If you don’t need HA, don’t add the complexity
• There is no magic database. Decompose requirements, mix-and-match datastores as needed...
Magic Database: does it all! Consistency, Availability, Partition-tolerance - it’s got all three.
Twilio Data Lifecycle (4)
CREATE → name: foo, status: INIT, ret: 0
UPDATE → name: foo, status: QUEUED, ret: 0
UPDATE → name: foo, status: GOING, ret: 0
       → name: foo, status: DONE, ret: 42
Twilio Examples: Call, SMS, Conference. Other Examples: Order, Workflow, $
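A tiny sketch of such a lifecycle record and its guarded transitions (status names from the slide; the storage layer is elided):

# Legal transitions for an in-flight resource such as a Call or SMS.
TRANSITIONS = {
    'INIT':   {'QUEUED'},
    'QUEUED': {'GOING'},
    'GOING':  {'DONE'},
    'DONE':   set(),  # post-flight: the record never changes again
}

def create(name):
    return {'name': name, 'status': 'INIT', 'ret': 0}

def update(record, status, ret=None):
    if status not in TRANSITIONS[record['status']]:
        raise ValueError('illegal transition %s -> %s'
                         % (record['status'], status))
    record['status'] = status
    if ret is not None:
        record['ret'] = ret
    return record

call = create('foo')
update(call, 'QUEUED'); update(call, 'GOING'); update(call, 'DONE', ret=42)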
Twilio Data Lifecycle (4)
In-Flight: INIT → QUEUED → GOING | Post-Flight: DONE (ret: 42)
Twilio Data Lifecycle (4) - Applications
In-Flight: atomically update part of a workflow
Post-Flight: billing, log access, analytics, reporting
Twilio Data Lifecycle (4) - High-Availability Properties
In-Flight: strict consistency, key/value, ~20ms
Post-Flight: eventual consistency, range queries w/ filters, ~200ms
Twilio Data Lifecycle (4)
In-Flight → Data Store A, Post-Flight → Data Store B: systems with very different access semantics
In-Flight: strict consistency, key/value, ~20ms, 10k-1M records
Post-Flight, fed via queues (sketch below):
• Logs (REST API): eventual consistency, range queries, filtered queries, ~200ms, billions
• Reporting: eventual consistency, arbitrary queries, high latency, billions
• Billing: idempotent, aggregation, key/value, billions
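A hedged sketch of wiring the two sides together: in-flight state lives in a strict, low-latency K/V store (Redis here, purely for illustration), and completion pushes the finished record onto a queue that the post-flight consumers drain at their own pace:

import json
import redis  # illustrative choice; any strict low-latency K/V store fits

r = redis.StrictRedis(decode_responses=True)

def update_inflight(sid, fields):
    # Strict-consistency, ~ms read-modify-write while the call/SMS is live.
    r.hset('inflight:' + sid, mapping=fields)

def complete(sid, ret):
    # Freeze the record and hand it to the post-flight consumers
    # (logs, billing, reporting) via a queue they drain asynchronously.
    record = r.hgetall('inflight:' + sid)
    record.update({'status': 'DONE', 'ret': ret})
    r.lpush('postflight', json.dumps(record))
    r.delete('inflight:' + sid)  # keeps the hot store at 10k-1M records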
Candidate technologies:
In-Flight: MySQL, PostgreSQL, Redis, NDB
Post-Flight, via queues:
• Logs (REST API): sharded SQL, Cassandra/Acunu, MongoDB, Riak, CouchDB
• Billing: sharded SQL, Redis
• Reporting: sharded SQL, Redis, Hadoop
Why is persistence so hard? (recap)
• Difficult to change structure
  - Huge inertia, e.g., large schema migrations
  → Don’t store stuff! Go schema-less
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
  → Separate stateful/stateless; change control processes
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g., EC2)
  → Memory FTW; shard
• Freak’in complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
  → Decompose the data lifecycle; minimize complexity
[Diagram: the original three-tier architecture, reworked]
• Aggregate into HA queues
• Master-Master MySQL
• Move the file store to S3
• Move K/V to SimpleDB w/ local cache
• Idempotent request path (sketch below)
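The idempotent request path can be sketched as a client-supplied token gating each side effect (Redis is used illustratively as the dedup store); a retry with the same token returns the original result instead of repeating the work:

import redis  # illustrative dedup store

r = redis.StrictRedis(decode_responses=True)

def idempotent(token, do_work):
    """Run do_work at most once per token; replays get the saved result."""
    if not r.setnx('idem:' + token, 'claimed'):
        # A replay arriving before the result is stored sees None and
        # can poll again rather than re-running the work.
        return r.get('result:' + token)
    result = do_work()
    r.set('result:' + token, str(result))
    return result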
HA is Hard
SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD
Focus on data:
• How you store it
• Where you store it
• When you can delete it
• Control changes to it
Open Problems...
• HA queue
• Simple multi-AZ / multi-region consistent K/V (in-flight)
• Massively scalable, filterable range queries at ~200ms (logs)
• Simple HA Hadoop (reporting)
• Massively scalable aggregator (billing)
twilio - http://www.twilio.com
@emcooke