High Availability by Design
David Prinzing – October 14, 2015
Web-scale Computing (that can run on your laptop!)
● Continuous Availability → no planned outages; no foolish outages
● Linear Scalability → add another unit of compute, get another unit of capacity
● Amazing Performance → e.g. transaction duration reduced, 2 min → 5 sec
● Information Security → e.g. certified PCI Level 1 Service Provider
Project Experience
MyTownPerks
● Application: real-time loyalty calculations from payment card transactions
● Availability: when are people not shopping? No planned outages
● Scalability: serving the Fortune 6 million (local merchants)
● Performance: compute between card swipe and receipt print
● Security: certified PCI Level 1 Service Provider
Clear Capital
● Application: real-time real estate appraisal analysis, validation
● Availability: when are appraisers not working? No planned outages
● Scalability: geospatial search on a table with > 1 billion rows
● Performance: transaction duration reduced from ~2 min to ~5 sec
● Security: the customers are banks!
Algorithmic Ads
● Application: algorithmically-generated display ads from web page(s)
Architectural Overview
● Web Applications
● Mobile Applications
● External Integration Applications
● Core API Cluster (Business Logic, Data Access)
● Queue Worker Cluster
● Machine Learning Cluster
● Database Cluster (Apache Cassandra)
Continuous Availability Requires:
● Distributed Infrastructure
● Distributed Database
● Stateless Services (API)
● Intelligent Client Applications
Production VPC 10.1.0.0/16
Internet Gateway
Availability Zone A (us-west-2a)
● DMZ Tier. Function: DMZ (ELB, NAT, Ops); Subnet: 10.1.1.0/24; Routing: Internet Gateway; ELB: www, api.example.com
  ○ nat1a, IP: 10.1.1.10, 50.112.130.119; SG: NAT-Prod
  ○ ops1a, IP: 10.1.1.11, 50.112.130.110; SG: Ops-Prod
● Application Tier. Function: Application Cluster; Subnet: 10.1.11.0/24; Routing: NAT1a
  ○ app1a, IP: 10.1.11.10; SG: WS-Prod
  ○ app2a, IP: 10.1.11.11; SG: WS-Prod
● Database Tier. Function: Database Cluster; Subnet: 10.1.21.0/24; Routing: NAT1a
  ○ db1a (seed), IP: 10.1.21.10; SG: DB-Prod
  ○ db2a, IP: 10.1.21.11; SG: DB-Prod
Availability Zone B (us-west-2b)
● DMZ Tier. Function: DMZ (ELB, NAT, Ops); Subnet: 10.1.2.0/24; Routing: Internet Gateway; ELB: www, api.example.com
  ○ nat1b, IP: 10.1.2.10, 50.112.130.114; SG: NAT-Prod
  ○ ops1b, IP: 10.1.2.11, 50.112.130.110; SG: Ops-Prod
● Application Tier. Function: Application Cluster; Subnet: 10.1.12.0/24; Routing: NAT1b
  ○ app1b, IP: 10.1.12.10; SG: WS-Prod
  ○ app2b, IP: 10.1.12.11; SG: WS-Prod
● Database Tier. Function: Database Cluster; Subnet: 10.1.22.0/24; Routing: NAT1b
  ○ db1b (seed), IP: 10.1.22.10; SG: DB-Prod
  ○ db2b, IP: 10.1.22.11; SG: DB-Prod
Availability Zone C (us-west-2c)
● DMZ Tier. Function: DMZ (ELB, NAT, Ops); Subnet: 10.1.3.0/24; Routing: Internet Gateway; ELB: www, api.example.com
  ○ nat1c, IP: 10.1.3.10, 50.112.129.94; SG: NAT-Prod
  ○ ops1c, IP: 10.1.3.11, 50.112.128.59; SG: Ops-Prod
● Application Tier. Function: Application Cluster; Subnet: 10.1.13.0/24; Routing: NAT1c
  ○ app1c, IP: 10.1.13.10; SG: WS-Prod
  ○ app2c, IP: 10.1.13.11; SG: WS-Prod
● Database Tier. Function: Database Cluster; Subnet: 10.1.23.0/24; Routing: NAT1c
  ○ db1c (seed), IP: 10.1.23.10; SG: DB-Prod
  ○ db2c, IP: 10.1.23.11; SG: DB-Prod
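The addressing in the Production VPC diagram appears to follow a regular pattern: the third octet is a tier offset (DMZ 0, application 10, database 20) plus a zone index (a = 1, b = 2, c = 3), and nodes in the application and database tiers start at host .10. The sketch below encodes that reading with Python's standard ipaddress module. The helper names subnet_for and host_for are hypothetical, and the pattern itself is inferred from the diagram rather than stated on the slide.

```python
import ipaddress

# VPC range taken from the diagram.
VPC = ipaddress.ip_network("10.1.0.0/16")

# Assumption: third octet = tier offset + zone index, as inferred from the diagram.
TIER_OFFSET = {"dmz": 0, "app": 10, "db": 20}
ZONE_INDEX = {"a": 1, "b": 2, "c": 3}


def subnet_for(tier: str, zone: str) -> ipaddress.IPv4Network:
    """Return the /24 subnet for a tier in a given availability zone."""
    third_octet = TIER_OFFSET[tier] + ZONE_INDEX[zone]
    return ipaddress.ip_network(f"10.1.{third_octet}.0/24")


def host_for(tier: str, zone: str, n: int) -> ipaddress.IPv4Address:
    """Return the address of the n-th node (1-based); node 1 sits at host .10."""
    return subnet_for(tier, zone).network_address + 9 + n


if __name__ == "__main__":
    for tier in TIER_OFFSET:
        for zone in ZONE_INDEX:
            net = subnet_for(tier, zone)
            # Every subnet must fall inside the VPC range.
            print(tier, zone, net, net.subnet_of(VPC))
    print("app2c:", host_for("app", "c", 2))
```

Deriving addresses from a rule like this is one way to keep the layout as configuration in code; it says nothing about how the original environment was actually provisioned.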
DataStax Enterprise
One Integrated Database (no ETL)
● Real-time data (with CQL)
● Analytics (parallel computation)
● Search (faceted, geo-spatial)
Advantages
● Cost-effective, open source
● Massive linear scalability
● High performance/speed
● Continuous availability
● Simple admin, operations
● Elastic, incremental expansion
● Cloud-compatible; scale out
● Data compression
Disadvantages
● Eventual (tunable) consistency
● Think differently...
Availability: Distributed Database
High availability with distributed data:
● 3 real-time replicas in each of 2 data centers (6 total)
● 2 replicas needed to function (local quorum: 2 of 3)
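The "2 of 3" figure follows from the usual quorum rule, a majority of the replicas: floor(RF / 2) + 1. A minimal sketch of that arithmetic for the replica counts on this slide; the function names are illustrative only and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


REPLICAS_PER_DATA_CENTER = 3  # from the slide
DATA_CENTERS = 2              # from the slide

if __name__ == "__main__":
    total = REPLICAS_PER_DATA_CENTER * DATA_CENTERS
    print("total replicas:", total)
    print("local quorum:", quorum(REPLICAS_PER_DATA_CENTER))
    print("replicas that may fail per data center:",
          tolerable_failures(REPLICAS_PER_DATA_CENTER))
```

With three replicas per data center, a local quorum of two leaves room for one replica in that data center to be unavailable.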
RESTful Web Services API
HTTP Method and URI: POST (create), GET (read), PUT (update), DELETE (delete)
/plural_noun
● POST: Create/return a new noun. 201 Created or 202 Accepted; Location header
● GET: Get list of nouns. 200 OK; 404 Not Found
● PUT: 405 Method Not Allowed
● DELETE: Delete all nouns. 204 No Content; 404 Not Found; 405 Method Not Allowed
/plural_noun/id
● POST: 405 Method Not Allowed
● GET: Get indicated noun. 200 OK; 404 Not Found
● PUT: Update indicated noun. 204 No Content; 404 Not Found
● DELETE: Delete indicated noun. 204 No Content; 404 Not Found
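The method and status-code matrix above can be sketched as a small in-memory dispatcher. This is an illustration under assumptions, not the speaker's implementation: the NounStore class, the /v1/nouns path and the sequential IDs are hypothetical, the collection GET always returns 200 (the slide also lists 404 without saying when it applies), and the collection DELETE always returns 204.

```python
import itertools


class NounStore:
    """Hypothetical in-memory resource that mirrors the status-code matrix."""

    def __init__(self):
        self._items = {}
        self._ids = itertools.count(1)  # assumption: simple sequential IDs

    def handle(self, method, noun_id=None, body=None):
        """Return a (status_code, payload) tuple for a request."""
        if noun_id is None:  # collection URI: /plural_noun
            if method == "POST":
                new_id = str(next(self._ids))
                self._items[new_id] = body
                # 201 Created plus a Location header for the new resource.
                return 201, {"Location": f"/v1/nouns/{new_id}"}
            if method == "GET":
                return 200, list(self._items.values())
            if method == "DELETE":
                self._items.clear()
                return 204, None
            return 405, None  # PUT on the collection is not allowed

        # item URI: /plural_noun/id
        if method == "POST":
            return 405, None  # POST on an item is not allowed
        if noun_id not in self._items:
            return 404, None
        if method == "GET":
            return 200, self._items[noun_id]
        if method == "PUT":
            self._items[noun_id] = body
            return 204, None
        if method == "DELETE":
            del self._items[noun_id]
            return 204, None
        return 405, None


if __name__ == "__main__":
    store = NounStore()
    print(store.handle("POST", body={"name": "example"}))
    print(store.handle("GET", "1"))
    print(store.handle("PUT"))
```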
Representational State Transfer (REST) using HTTP verbs and addressable resources/nouns
Basic URI Structure:
https://api.example.com/api-version/plural_noun/id
API Version (a simple ordinal number):
/v1/organizations/2TP68NHBVYNE
Nested path parameters (use sparingly):
/v1/organizations/2TP68NHBVYNE/users/2TP7RB0LSLD6
Filtering query parameters:
GET /v1/events?log_level=ERROR
Action verbs (use quite sparingly):
GET /v1/files/2V34H956A4S4/download
Content-Type and Accept request headers (json, xml, html):
Content-Type: application/json
Location response headers to identify newly created resources:
Location: https://api.example.com/v1/organizations/2TP68NHBVYNE
Basic or Token Authentication over SSL/TLS (request header):
Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
Authorization: Bearer baf05b90-bfa9-11e4-a7eb-77baa98f4f40
See: Apigee’s Web API Design Guidelines
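The URI and header conventions above can be reproduced with the Python standard library alone. The helper names resource_url, basic_auth and bearer_auth are hypothetical; the base URL, resource IDs and token are the example values from the slides. The Basic value shown on the slide is the Base64 encoding of "username:password".

```python
import base64
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # example host from the slides


def resource_url(version: int, *segments: str, **filters: str) -> str:
    """Build /v<version>/<plural_noun>/<id>/... with optional filter parameters."""
    path = "/".join([f"v{version}", *segments])
    query = f"?{urlencode(filters)}" if filters else ""
    return f"{BASE_URL}/{path}{query}"


def basic_auth(username: str, password: str) -> str:
    """Authorization header value for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth(token: str) -> str:
    """Authorization header value for token authentication."""
    return f"Bearer {token}"


if __name__ == "__main__":
    print(resource_url(1, "organizations", "2TP68NHBVYNE"))
    print(resource_url(1, "events", log_level="ERROR"))
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": basic_auth("username", "password"),
    }
    print(headers)
```

Both header forms carry credentials in recoverable form, which is why the slide pairs them with SSL/TLS.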
Continuous Availability
● Customer Demonstration (Test Environment)
  ○ Complete loss of one of three availability zones
  ○ Sending transactions through: before… while going down… while down… while coming back up… and when restored.
  ○ 100% successful!
● No Planned Outages
● 100% Availability… for how long?
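The slides do not say how clients kept transactions flowing during the zone outage. One common pattern, shown here purely as an assumption, is for the client to retry a failed request against the remaining endpoints. The function send_with_failover, the send callable and the endpoint labels are all hypothetical.

```python
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when no endpoint accepted the request."""


def send_with_failover(endpoints: Sequence[str],
                       send: Callable[[str], str],
                       rounds: int = 2) -> str:
    """Try each endpoint in order, for a limited number of rounds.

    `send` is a caller-supplied function that performs the request against
    one endpoint and raises ConnectionError if that endpoint is unreachable.
    """
    last_error = None
    for _ in range(rounds):
        for endpoint in endpoints:
            try:
                return send(endpoint)
            except ConnectionError as exc:
                last_error = exc  # remember the failure, move to the next endpoint
    raise AllEndpointsFailed(f"no endpoint reachable after {rounds} rounds") from last_error


if __name__ == "__main__":
    down = {"zone-c"}  # simulate the loss of one availability zone

    def fake_send(endpoint: str) -> str:
        if endpoint in down:
            raise ConnectionError(f"{endpoint} is unreachable")
        return f"ok via {endpoint}"

    print(send_with_failover(["zone-c", "zone-a", "zone-b"], fake_send))
```

Retrying is only safe when the request can be repeated without side effects, or when the API deduplicates repeated submissions.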
Summary
[Diagram: Internet Gateway in front of three availability zones (a, b, c), each with an ELB, a NAT instance (nat1a, nat1b, nat1c), an ops instance (ops1a, ops1b, ops1c), application nodes (app1 to app3) and Cassandra (C*) database nodes (db1, db2, db3…)]
Continuous Availability:
● Distributed Infrastructure
  ○ Multiple AZs, maybe Regions
  ○ Automate; config as code
● Distributed Database
  ○ Apache Cassandra rocks!
  ○ Eventual consistency, replicas
● Stateless Services (API)
  ○ Integrate on APIs, not DB
  ○ API first, before client apps
  ○ Adopt consistent patterns
  ○ Documentation (Swagger)
  ○ Ops-friendly package
● Intelligent Client Apps
  ○ Session state on client
  ○ Consume RESTful APIs
David Prinzing