high availability by design



High Availability by Design

David Prinzing – October 14, 2015

Web-scale Computing (that can run on your laptop!)

● Continuous Availability → no planned outages; no foolish outages

● Linear Scalability → add another unit of compute, get another unit of capacity

● Amazing Performance → e.g. transaction duration reduced, 2 min → 5 sec

● Information Security → e.g. certified PCI Level 1 Service Provider

Project Experience

MyTownPerks
● Application: real-time loyalty calculations from payment card transactions
● Availability: when are people not shopping? No planned outages
● Scalability: serving the Fortune 6 million (local merchants)
● Performance: compute between card swipe and receipt print
● Security: certified PCI Level 1 Service Provider

Clear Capital
● Application: real-time real estate appraisal analysis, validation
● Availability: when are appraisers not working? No planned outages
● Scalability: geospatial search on a table with > 1 billion rows
● Performance: transaction duration reduced from ~2 min to ~5 sec
● Security: the customers are banks!

Algorithmic Ads
● Application: algorithmically-generated display ads from web page(s)

Architectural Overview

[Diagram components]
● Web Applications
● Mobile Applications
● External Integration Applications
● Core API Cluster (Business Logic, Data Access)
● Database Cluster (Apache Cassandra)
● Queue/Worker Cluster
● Machine Learning Cluster

Continuous Availability Requires:
● Distributed Infrastructure
● Distributed Database
● Stateless Services (API)
● Intelligent Client Applications

Technology Selections

Dropwizard

AWS Infrastructure Design

Production VPC 10.1.0.0/16, behind an Internet Gateway

Availability Zone A (us-west-2a)
● DMZ Tier (ELB, NAT, Ops): subnet 10.1.1.0/24; routing: Internet Gateway; ELB: www, api.example.com
○ nat1a: IP 10.1.1.10 / 50.112.130.119; SG: NAT-Prod
○ ops1a: IP 10.1.1.11 / 50.112.130.110; SG: Ops-Prod
● Application Tier (Application Cluster): subnet 10.1.11.0/24; routing: NAT1a
○ app1a: IP 10.1.11.10; SG: WS-Prod
○ app2a: IP 10.1.11.11; SG: WS-Prod
● Database Tier (Database Cluster, C*): subnet 10.1.21.0/24; routing: NAT1a
○ db1a (seed): IP 10.1.21.10; SG: DB-Prod
○ db2a: IP 10.1.21.11; SG: DB-Prod

Availability Zone B (us-west-2b)
● DMZ Tier (ELB, NAT, Ops): subnet 10.1.2.0/24; routing: Internet Gateway; ELB: www, api.example.com
○ nat1b: IP 10.1.2.10 / 50.112.130.114; SG: NAT-Prod
○ ops1b: IP 10.1.2.11 / 50.112.130.110; SG: Ops-Prod
● Application Tier (Application Cluster): subnet 10.1.12.0/24; routing: NAT1b
○ app1b: IP 10.1.12.10; SG: WS-Prod
○ app2b: IP 10.1.12.11; SG: WS-Prod
● Database Tier (Database Cluster, C*): subnet 10.1.22.0/24; routing: NAT1b
○ db1b (seed): IP 10.1.22.10; SG: DB-Prod
○ db2b: IP 10.1.22.11; SG: DB-Prod

Availability Zone C (us-west-2c)
● DMZ Tier (ELB, NAT, Ops): subnet 10.1.3.0/24; routing: Internet Gateway; ELB: www, api.example.com
○ nat1c: IP 10.1.3.10 / 50.112.129.94; SG: NAT-Prod
○ ops1c: IP 10.1.3.11 / 50.112.128.59; SG: Ops-Prod
● Application Tier (Application Cluster): subnet 10.1.13.0/24; routing: NAT1c
○ app1c: IP 10.1.13.10; SG: WS-Prod
○ app2c: IP 10.1.13.11; SG: WS-Prod
● Database Tier (Database Cluster, C*): subnet 10.1.23.0/24; routing: NAT1c
○ db1c (seed): IP 10.1.23.10; SG: DB-Prod
○ db2c: IP 10.1.23.11; SG: DB-Prod
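The subnets listed for this design appear to follow a simple numbering pattern: the third octet is a tier digit (DMZ 0, application 1, database 2) followed by a zone digit (a = 1, b = 2, c = 3), and the first two nodes of a cluster take host addresses .10 and .11. The slides do not state this rule; it is inferred from the addresses shown. A minimal Python sketch under that assumption, with hypothetical helper names `subnet_for` and `node_ip`:

```python
import ipaddress

# Assumption: numbering rule inferred from the slide, not stated by the author.
VPC = ipaddress.ip_network("10.1.0.0/16")
TIERS = {"dmz": 0, "app": 1, "db": 2}   # tens digit of the third octet
ZONES = {"a": 1, "b": 2, "c": 3}        # ones digit of the third octet


def subnet_for(tier: str, zone: str) -> ipaddress.IPv4Network:
    """Return the /24 subnet for a tier in an availability zone."""
    third_octet = TIERS[tier] * 10 + ZONES[zone]
    return ipaddress.ip_network(f"10.1.{third_octet}.0/24")


def node_ip(tier: str, zone: str, n: int) -> ipaddress.IPv4Address:
    """Return the address of the n-th node (1-based): node 1 is .10, node 2 is .11."""
    return subnet_for(tier, zone).network_address + 9 + n


if __name__ == "__main__":
    for tier in TIERS:
        for zone in ZONES:
            net = subnet_for(tier, zone)
            # Every subnet must sit inside the production VPC range.
            assert net.subnet_of(VPC)
            print(tier, zone, net, node_ip(tier, zone, 1))
```

Encoding tier and zone in the address makes it possible to tell where a host lives from its IP alone, which can help when reading logs or security group rules.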

DataStax Enterprise

One Integrated Database (no ETL)
● Real-time data (with CQL)
● Analytics (parallel computation)
● Search (faceted, geo-spatial)

Advantages
● Cost-effective, open source
● Massive linear scalability
● High performance/speed
● Continuous availability
● Simple admin, operations
● Elastic, incremental expansion
● Cloud-compatible; scale out
● Data compression

Disadvantages
● Eventual (tunable) consistency
● Think differently...

Availability: Distributed Database

High availability with distributed data:
● 3 real-time replicas in each of 2 data centers (6 total)
● 2 replicas needed to function (local quorum: 2 of 3)
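The "2 of 3" figure follows from the usual quorum rule: a quorum is a strict majority of the replicas, floor(RF / 2) + 1, and a local quorum counts only the replicas in the coordinator's own data center. A small Python sketch of that arithmetic (the function names are illustrative, not part of any driver API):

```python
def quorum(replication_factor: int) -> int:
    """Smallest strict majority of the replicas."""
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    rf_per_dc = 3      # from the slide: 3 replicas in each data center
    data_centers = 2   # from the slide: 2 data centers
    print("total replicas:", rf_per_dc * data_centers)
    print("local quorum:", quorum(rf_per_dc))
    print("replicas that may fail per data center:", tolerable_failures(rf_per_dc))
```

With three replicas per data center, one replica can be unavailable and local-quorum reads and writes still succeed.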

Scalability: Linear, Demonstrated

Dropwizard Framework

RESTful Web Services API: HTTP Method and URI

/plural_noun
● POST (create): Create/return a new noun. 201 Created or 202 Accepted; Location header
● GET (read): Get list of nouns. 200 OK; 404 Not Found
● PUT (update): 405 Method Not Allowed
● DELETE (delete): Delete all nouns. 204 No Content; 404 Not Found; 405 Method Not Allowed

/plural_noun/id
● POST (create): 405 Method Not Allowed
● GET (read): Get indicated noun. 200 OK; 404 Not Found
● PUT (update): Update indicated noun. 204 No Content; 404 Not Found
● DELETE (delete): Delete indicated noun. 204 No Content; 404 Not Found
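The method/URI conventions above can be expressed as a lookup table, which is handy for checking that a service answers each verb consistently. The sketch below is only an illustration in Python: `resource_kind` and `expected_statuses` are hypothetical helpers, and the rule that an odd number of path segments after the version means a collection is an assumption drawn from the /plural_noun and /plural_noun/id pattern.

```python
from http import HTTPStatus as S

# Status codes per (resource kind, HTTP method), transcribed from the slide.
EXPECTED = {
    ("collection", "POST"): (S.CREATED, S.ACCEPTED),
    ("collection", "GET"): (S.OK, S.NOT_FOUND),
    ("collection", "PUT"): (S.METHOD_NOT_ALLOWED,),
    ("collection", "DELETE"): (S.NO_CONTENT, S.NOT_FOUND, S.METHOD_NOT_ALLOWED),
    ("item", "POST"): (S.METHOD_NOT_ALLOWED,),
    ("item", "GET"): (S.OK, S.NOT_FOUND),
    ("item", "PUT"): (S.NO_CONTENT, S.NOT_FOUND),
    ("item", "DELETE"): (S.NO_CONTENT, S.NOT_FOUND),
}


def resource_kind(path: str) -> str:
    """Classify /version/plural_noun[/id[/plural_noun[/id]]] as collection or item."""
    segments = [s for s in path.split("/") if s]
    after_version = segments[1:]  # drop the API version segment, e.g. "v1"
    if not after_version:
        raise ValueError("path has no resource segment")
    return "item" if len(after_version) % 2 == 0 else "collection"


def expected_statuses(method: str, path: str):
    """Return the status codes the convention allows for this request."""
    return EXPECTED[(resource_kind(path), method.upper())]


if __name__ == "__main__":
    for method, path in [("POST", "/v1/organizations"),
                         ("PUT", "/v1/organizations"),
                         ("GET", "/v1/organizations/2TP68NHBVYNE")]:
        print(method, path, [int(code) for code in expected_statuses(method, path)])
```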

Representational State Transfer (REST) using HTTP verbs and addressable resources/nouns

Basic URI Structure:
https://api.example.com/api-version/plural_noun/id

API Version (a simple ordinal number):
/v1/organizations/2TP68NHBVYNE

Nested path parameters (use sparingly):
/v1/organizations/2TP68NHBVYNE/users/2TP7RB0LSLD6

Filtering query parameters:
GET /v1/events?log_level=ERROR

See: Apigee’s Web API Design Guidelines

Action verbs (use quite sparingly):
GET /v1/files/2V34H956A4S4/download

Content-Type and Accept request headers (json, xml, html):
Content-Type: application/json

Location response headers to identify newly created resources:
Location: https://api.example.com/v1/organizations/2TP68NHBVYNE

Basic or Token Authentication over SSL/TLS (request header):
Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
Authorization: Bearer baf05b90-bfa9-11e4-a7eb-77baa98f4f40
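As an illustration of the URI and header conventions, the following Python sketch builds a resource URI and the two Authorization header forms. The helper names are hypothetical. The Basic value on the slide is simply the Base64 encoding of the placeholder credentials "username:password", which is why it must only be sent over SSL/TLS.

```python
import base64


def resource_uri(version: int, *segments: str, host: str = "api.example.com") -> str:
    """Build https://host/v<version>/plural_noun/id[/...] from path segments."""
    return "https://" + host + "/" + "/".join([f"v{version}", *segments])


def basic_auth_header(username: str, password: str) -> str:
    """Basic authentication: Base64 of 'username:password' (encoded, not encrypted)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth_header(token: str) -> str:
    """Token authentication: the token is passed through unchanged."""
    return "Bearer " + token


if __name__ == "__main__":
    print(resource_uri(1, "organizations", "2TP68NHBVYNE"))
    print("Authorization:", basic_auth_header("username", "password"))
    print("Authorization:", bearer_auth_header("baf05b90-bfa9-11e4-a7eb-77baa98f4f40"))
```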

Continuous Availability

● Customer Demonstration (Test Environment)
○ Complete loss of one of three availability zones
○ Sending transactions through: before… while going down… while down… while coming back up… and when restored.
○ 100% successful!
● No Planned Outages
● 100% Availability… for how long?

Customer Availability Testing

[Figures: Healthy Ring; ⅓ Nodes Down]

Result: 100% successful transactions!
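One way to see why losing a whole availability zone need not fail any transaction is a toy model in Python. It assumes, beyond what the slides state, that the three local replicas of each row are placed one per availability zone and that every request needs two replica acknowledgements (local quorum, 2 of 3). The names and phases below are illustrative only.

```python
ZONES = ("a", "b", "c")   # assumption: one replica of each row per zone
REQUIRED_ACKS = 2         # local quorum: 2 of 3


def live_replicas(down_zones):
    """Replicas still reachable when the given zones are down."""
    return [zone for zone in ZONES if zone not in down_zones]


def request_succeeds(down_zones) -> bool:
    """A request succeeds if enough replicas can acknowledge it."""
    return len(live_replicas(down_zones)) >= REQUIRED_ACKS


# Phases of the demonstration, with the zones unavailable in each phase.
PHASES = [
    ("before", set()),
    ("while going down", {"c"}),
    ("while down", {"c"}),
    ("while coming back up", {"c"}),
    ("when restored", set()),
]

if __name__ == "__main__":
    for name, down in PHASES:
        print(f"{name}: {'ok' if request_succeeds(down) else 'FAILED'}")
    # Losing a second zone at the same time would break the quorum.
    print("two zones down:", "ok" if request_succeeds({"b", "c"}) else "FAILED")
```

The model treats a zone that is going down or coming back up as fully unavailable, which is the pessimistic case; it says nothing about throughput or latency during the outage.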

Summary

[Diagram: an Internet Gateway and three ELBs in front of three availability zones, each with a NAT host (nat1a, nat1b, nat1c), an Ops host (ops1a, ops1b, ops1c), application nodes (app1 to app3) and Cassandra (C*) database nodes (db1, db2, db3…)]

Continuous Availability:

● Distributed Infrastructure
○ Multiple AZs, maybe Regions
○ Automate; config as code

● Distributed Database
○ Apache Cassandra rocks!
○ Eventual consistency, replicas

● Stateless Services (API)
○ Integrate on APIs, not DB
○ API first, before client apps
○ Adopt consistent patterns
○ Documentation (Swagger)
○ Ops-friendly package

● Intelligent Client Apps
○ Session state on client
○ Consume RESTful APIs
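Keeping session state on the client is what lets any API node serve any request. One common way to do that, shown here only as a hedged Python sketch and not as the approach used in these projects, is for the service to hand the client a signed token that carries the state, so no node has to remember the session. `SECRET`, `issue_token`, and `read_token` are hypothetical names, and a real deployment would also need expiry and key rotation.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical key shared by all API nodes


def issue_token(state: dict, secret: bytes = SECRET) -> str:
    """Serialize session state and sign it so the client can hold it."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def read_token(token: str, secret: bytes = SECRET):
    """Return the session state, or None if the signature does not match."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(secret, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    token = issue_token({"user": "2TP7RB0LSLD6", "org": "2TP68NHBVYNE"})
    print(read_token(token))          # any node holding SECRET can verify it
    print(read_token(token + "00"))   # a tampered token is rejected
```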

David Prinzing