Operating a Highly Available Cloud Service

Depankar Neogi
Chief Architect, QuickBase, Intuit Inc.
November 14, 2013

Presented at Boston Cloud Services Meetup
http://www.meetup.com/Boston-cloud-services/events/141118632/


Agenda

•Intuit and QuickBase

•Building and Running Highly Available Cloud Services

–People & Process

–Technology


The single most important thing to keep in mind when designing for High Availability is to anticipate failure.


Improving 60M Lives

• 20% of GDP & Pay 1 in 12
• Apps for >50% of Fortune 500
• Facilitate $40B Tax Refunds
• #1 Financial Management Software
• #1 for Innovation in Computer Software Industry


What is QuickBase?

An Enterprise platform to empower your team to build applications

• One platform solves jobs across the enterprise: Project Management, IT helpdesk, CRM, Field service, Human resources, etc.
• Easily customized to meet unique business needs
• Requirements, processes and teams evolving constantly
• Excel to QuickBase in less than 5 minutes
• 500,000+ current users
• Brand NEW modern UI enables Ease of Use
• More than 4,500 companies use QuickBase


QuickBase – Customized applications matching your unique requirements

• Open extensible APIs
• Common Infrastructure Services
• Role-Based UI
• Dashboards & Reports
• Business logic & workflow
• Secure Access Control
• Relational Data Tables
• Data Storage & Backup


Modern, Easy, Productive, Dynamic, Fast

30 million requests per day

80 K unique visitors per day

100,000 active apps at any time

25 milliseconds median processing time

Supports Dynamic DML, DDL, CRUD

Cloud based Database with a beautiful UX


New QuickBase DIY Data Access

[Diagram: New QuickBase DIY Data Access]

1. QuickBase UI Extended with new DIY data sharing
2. New Data Sharing Service
3. Connections to Popular Industry Data

Components shown: Data Mapping, WSQL, Transforms, Virtual tables, Cache, Warehouse, Scheduler, Repository, Liberator Library, Liberators, ANY API

Intuit-class infrastructure (security, billing, HADR, hosting)


AVAILABILITY


PSTN Systems Availability SLA

Availability            Downtime per year   Downtime per month   Downtime per week
99.9999 % “six nines”   31.5 secs           2.59 secs            0.605 secs
99.999 % “five nines”   5.26 mins           25.9 secs            6.05 secs


Web Services Availability SLA

Availability   Downtime per year   Downtime per month   Downtime per week
99.95 %        4.38 hrs            21.9 mins            5.04 mins
99.9 %         8.76 hrs            43.8 mins            10.1 mins
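The downtime figures in an SLA table follow directly from the availability percentage. The sketch below is one way to reproduce them; the function name is hypothetical, and it assumes a 365-day year with a month taken as one twelfth of a year (a 30-day month gives slightly smaller monthly values).

```python
# Seconds in each reporting window.
# Assumption: 365-day year, month = year / 12 (about 30.4 days).
SECONDS_PER_YEAR = 365 * 24 * 3600
WINDOW_SECONDS = {
    "year": SECONDS_PER_YEAR,
    "month": SECONDS_PER_YEAR / 12,
    "week": 7 * 24 * 3600,
}


def downtime_budget(availability_percent: float) -> dict[str, float]:
    """Return the allowed downtime, in seconds, per window for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return {name: secs * unavailable_fraction for name, secs in WINDOW_SECONDS.items()}


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.999, 99.9999):
        budget = downtime_budget(target)
        line = ", ".join(f"{secs:.2f} s/{window}" for window, secs in budget.items())
        print(f"{target}%: {line}")
```

Dividing the results by 60 or 3600 gives the minutes and hours usually quoted in SLA documents.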

http://www.google.com/apps/intl/en/terms/sla.html


PEOPLE & PROCESSES Operating High Availability Service


People & Process: Monitoring Business Metrics

• It’s critical to detect a problem before your customers have to tell you or you have to ask them.

• By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.

• Five evolutionary questions that monitoring should answer:

1. Is there a problem?

2. Where is the problem?

3. What is the problem?

4. Why is there a problem?

5. Will there be a problem?

• External versus Internal Monitoring


http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/
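Comparing a live business metric against its historical curve can be sketched as follows. This is a minimal illustration, not the method used in the talk: the function name, the three-standard-deviation tolerance, and the sample data are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(current: float, history: list[float], tolerance: float = 3.0) -> bool:
    """Flag `current` if it lies more than `tolerance` standard deviations
    from the mean of the values seen at the same point in earlier periods."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > tolerance * spread


# Hypothetical data: logins per minute at 10:00 on the previous five Mondays.
previous_mondays = [980.0, 1010.0, 995.0, 1020.0, 1000.0]

print(is_anomalous(1005.0, previous_mondays))  # close to the historical curve
print(is_anomalous(400.0, previous_mondays))   # sharp drop from the historical curve
```

A check like this answers only the first of the five questions ("Is there a problem?"); locating and explaining the problem still needs system-level monitoring.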


People & Process: Invest in Good Tools

95K requests in a 12-hour window

Peak request rate: 4.3 req/sec (1,286 requests per 5-minute window)

Processing time: 61 milliseconds per request

A good tool will help you find the needle in a haystack, fast


People & Process: Incident Management Process

• Incident Management Team (IMT)

• Incident Management Response Plan

• Activating the IMT, notifications

• Having the right break-out rooms

• Classification of the incident

• Communication of the incident

• Time keeper

• Management versus Technical Process

• Tracking:

– SLA

– RPO (recovery point objective)

– RTO (recovery time objective)

• Incident closure, recovery

• Evaluation process
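Tracking RPO and RTO during an incident comes down to comparing a few timestamps against the agreed objectives. The sketch below assumes a hypothetical `Incident` record with three timestamps; the field names and the example targets are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_replica: datetime  # most recent data known to be safely replicated
    outage_start: datetime       # when the service became unavailable
    service_restored: datetime   # when the service was available again


def achieved_rpo(incident: Incident) -> timedelta:
    """Window of data at risk: time between the last replicated data and the outage."""
    return incident.outage_start - incident.last_good_replica


def achieved_rto(incident: Incident) -> timedelta:
    """Time taken to restore service after the outage began."""
    return incident.service_restored - incident.outage_start


def meets_objectives(incident: Incident, rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True when both the recovery point and recovery time objectives were met."""
    return achieved_rpo(incident) <= rpo_target and achieved_rto(incident) <= rto_target


# Hypothetical incident: 5 minutes of data at risk, 40 minutes to recover.
incident = Incident(
    last_good_replica=datetime(2013, 11, 14, 9, 55),
    outage_start=datetime(2013, 11, 14, 10, 0),
    service_restored=datetime(2013, 11, 14, 10, 40),
)
print(achieved_rpo(incident), achieved_rto(incident))
print(meets_objectives(incident, timedelta(minutes=15), timedelta(hours=1)))
```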


People & Process: Runbook and messaging

• Runbook

– Detail process for managing the incident

– Contact Information

– Managing data center cutover, recovery steps, testing, managing replication

• Messaging book

– Who is responsible for communication

– Who creates and approves the message

– How you communicate

– At what cadence

– What you tell your customers

• Social Media Strategy

– If you are not transparent, your customers will let you know

– Social Media coordinator – own the channels


People & Process: Service Page

Provide customers with the ability to check the health of the system and be notified of any service-related issues.

People & Process: Service Page


Transparency is Key. If you let the customers know what you know, they will respect you and may remain loyal to your business.


People & Process: Business Fault Isolation

• What if your data center went down

• And the production server is down because the data center is down

• And your email server was in the same data center

• And your marketing server was in the same data center

• And your service page was on a server in the same data center

• How do you communicate with all your customers?

Business Fault Isolation protects your business from a SPOF (single point of failure).

People & Process: Review Process

• SaaS or Operations Review Process should have a fixed cadence and be led by a company leader

• Review Team should include leaders from:

– Finance

– Compliance & Risk

– CTO

– Operations

– Product

• Dashboard with KPI

• Review Fire drills

• Change Control Process

– Preferably change one thing at a time


TECHNOLOGIES Operating High Availability Service


The Three Pillars of High Availability

The goal of High Availability and Disaster Recovery (HA/DR) is to provide Business Continuance through:

HA/DR directly enhances a customer’s experience through greater offering availability

Lack of Service Outage = Happy Customers = Greater Business Value


High Availability Architecture Principles

•Design for Failure

–Avoid Single Points of Failure

–Graceful Degradation and Soft Dependencies

–Asynchronous Design

–Keep State Confined to Where it is Needed

•Design for Operability

–Design to be Monitored

–Design for Hot Deployment and Rollback

–Automate Where Possible

•Keep Everything “In Production”

•Scale Out (Not Up)

•Keep it Fresh…and Mature
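As an illustration of "Graceful Degradation and Soft Dependencies", the sketch below wraps a non-critical call so that its failure degrades one part of a page instead of failing the whole request. All names here (`call_soft_dependency`, `fetch_recommendations`, `render_dashboard`) are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_soft_dependency(call: Callable[[], T], fallback: T) -> T:
    """Call an optional dependency and return `fallback` if it fails."""
    try:
        return call()
    except Exception:
        # A soft dependency must not take the whole request down with it.
        return fallback


def fetch_recommendations() -> list[str]:
    """Hypothetical non-critical service that is currently failing."""
    raise ConnectionError("recommendation service unreachable")


def render_dashboard() -> dict:
    return {
        # Core content: always shown.
        "reports": ["Open tickets", "Sales pipeline"],
        # Optional content: degrades to an empty list on failure.
        "recommendations": call_soft_dependency(fetch_recommendations, fallback=[]),
    }


print(render_dashboard())
```

In a real service the fallback path would also be logged and counted, so that the degraded state shows up in monitoring.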


Architecture Patterns for High Availability

Swimlanes

1) Active/Passive

2) Active/Active

3) Single Write Master

4) Store and Forward


Active / Passive

[Diagram: Active Data in the Primary Data Center is copied by Near Real-time Replication to a Passive Back Up in the Secondary Data Center]


Swimlane Principle

A “Swimlane” is:

A set of predefined systems and software infrastructure tuned to support a predefined workload

•Only a portion of an offering’s total users are hosted on any given swimlane

Within a Swimlane:

–Each Swimlane is independent and self-sufficient and shares no compute/storage resources with other swimlanes

–Offering transactions occur within a Swimlane

–Only access to Shared Services goes outside the Swimlane

–Standard Fault Detection and Fault Recovery methods are used
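One way to host only a portion of users on each swimlane is to map every customer deterministically to a lane. The sketch below uses a hash of a customer ID; the lane names and the hashing scheme are assumptions for illustration, not a description of how QuickBase assigns customers.

```python
import hashlib

# Hypothetical lane names; each lane has its own web, app and storage tiers.
SWIMLANES = ["swimlane-1", "swimlane-2", "swimlane-3", "swimlane-4"]


def swimlane_for(customer_id: str, lanes: list[str] = SWIMLANES) -> str:
    """Map a customer to exactly one swimlane, always the same one."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(lanes)
    return lanes[index]


for customer in ("acme-corp", "globex", "initech"):
    print(customer, "->", swimlane_for(customer))
```

Plain modulo hashing moves most customers when the number of lanes changes, so a production system would more likely keep an explicit customer-to-lane lookup table.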


High Availability with Swimlanes

[Diagram: Application Partitioning via Swimlanes. Two data centers, DC 1 (Fault Domain 1) and DC 2 (Fault Domain 2), each contain Swimlanes 1 to 4; every swimlane has its own WS, AS and Storage. Traffic from the Internet reaches the data centers through DNS, GTM, and each data center's F5 GTM and F5 LTM.]

WS: web server; AS: app server


Swimlanes Support Application Needs

• Scalability

– Replicated swimlanes add capacity with linear scalability

• Fault Isolation

– Complete failure only impacts a subset of users due to application partitioning and data sharding

• High Availability

– Individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart

• Disaster Recovery

– Disaster recovery is achieved through swimlane failover, either in the same or a remote data center

• Automation

– The identical nature of a swimlane allows for a high degree of operational automation


Active / Active – Swim Lanes

[Diagram: a Global Load Balancer in front of Data Center 1 and Data Center 2, four groups of 25% customers, and Replication links between the data centers]

Database pairings shown:

– DB1 active / DB3 passive

– DB2 active / DB4 passive

– DB3 active / DB1 passive

– DB4 active / DB2 passive


Active / Active – Single Write Master

[Diagram: DC1, DC2, DC3 and DC4 each have a Read Cache. Writes and Updates go to the single write master, and Cache Updates flow to the read caches.]
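The single write master idea can be sketched in a few lines: every write goes to one master, and each data center serves reads from its own cache. The class below is a toy in-memory model under that assumption; the class and method names are hypothetical, and a real system would propagate cache updates asynchronously over the network.

```python
class SingleWriteMaster:
    """Toy model: one authoritative store, one read cache per data center."""

    def __init__(self, data_centers: list[str]):
        self.master: dict[str, str] = {}
        self.read_caches: dict[str, dict[str, str]] = {dc: {} for dc in data_centers}

    def write(self, key: str, value: str) -> None:
        """All writes go to the single master, then fan out as cache updates."""
        self.master[key] = value
        for cache in self.read_caches.values():
            cache[key] = value  # done synchronously here only to keep the sketch short

    def read(self, data_center: str, key: str):
        """Reads are served from the local data center's cache."""
        return self.read_caches[data_center].get(key)


store = SingleWriteMaster(["DC1", "DC2", "DC3", "DC4"])
store.write("plan", "enterprise")
print(store.read("DC3", "plan"))
```

Because there is only one writer, there are no write conflicts to resolve; the trade-off is that the master is the component whose failover needs the most care.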


Design for Failure: Resiliency Patterns

Throttling versus Circuit Breaker
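Throttling limits how fast callers may send requests, whereas a circuit breaker stops calls to a dependency that is already failing. A common way to throttle is a token bucket, sketched below; the talk does not say which algorithm it has in mind, so the class name, rate, and capacity are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests and `rate_per_sec` on average."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock          # injectable so the bucket can be tested
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=5.0, capacity=10)
accepted = sum(bucket.allow() for _ in range(25))
print(f"{accepted} of 25 burst requests accepted")
```

Throttled requests are usually rejected quickly with a retry hint, which protects the service without tying up threads.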


Circuit Breaker Pattern

http://techblog.netflix.com/2012_02_01_archive.html

Circuit Breaker State Diagram (C: Caller, D: Dependency)

Closed

– On call / pass through

– Call succeeds / reset count

– Call fails / count failure

– Threshold reached / trip breaker

Open

– On call / fail

– On timeout / attempt reset

Half Open

– On call / pass through

– On succeed / reset

– On fail / trip breaker

Transitions: Closed to Open (trip breaker); Open to Half Open (attempt reset); Half Open to Open (trip breaker); Half Open to Closed (reset)
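The three states in the diagram (Closed, Open, Half Open) can be expressed as a small class. This is a minimal single-threaded sketch, not the Netflix implementation referenced on the slide; the class name, the default threshold, and the default timeout are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency calls are suspended")  # open: on call, fail
            self.state = "half_open"  # open: on timeout, attempt reset
        try:
            result = func(*args, **kwargs)  # closed or half open: pass through
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Closed: reset the count. Half open: reset the breaker.
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # Half open: any failure trips. Closed: trip once the threshold is reached.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller wraps each dependency call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treats `CircuitOpenError` as a signal to use a fallback.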


Circuit Breaker Pattern: Example

http://techblog.netflix.com/2012_02_01_archive.html


Circuit Breaker Pattern: Example

Example of how threads, network timeouts and retries combine

http://techblog.netflix.com/2012_02_01_archive.html


Examples of Tools for Building HA Systems

• Highly Available DNS – Akamai, Dyn, AWS Route 53

• Load Balancing – F5 LTM, F5 GTM, AWS ELB

• Data Replication – GoldenGate

• Monitoring – eHealth, Spectrum, Wily, Splunk, Cacti

• Application Performance – DynaTrace, NewRelic

• Deployment – Perforce, Maven, Nexus, Hudson, Puppet

• Distributed Databases – NuoDB, VoltDB, several NoSQL types

• Distributed Storage – GlusterFS, Atmos, OpenStack

• HA Devices – Veritas Cluster Server

• OS Virtualization – AWS, VMware, Xen, Parallels

• Network Virtualization – AWS, VMware NSX, PLUMgrid

• Caching – Memcached, Akamai, CloudFront

• Failure Testing – Netflix Chaos Monkey

• DDoS Protection – Arbor, Riverbed


Trust Not the Execution Environment

“Everything Fails, All the Time.” – Werner Vogels, CTO of Amazon.com


Summary: Operating HA Service

Monitoring Business Metrics

Incident Management Process

Runbooks

Social Media & Messaging

Service Page

Business Fault Isolation

SLA, RPO, RTO

Failover Drills

Review Process

Change one thing at a time

Principles:

– Design for Failure

– Design for Operability

– Keep Everything “In Production”

– Scale Out (stateless)

– Keep it Fresh

Patterns:

– Active/Active

– Swimlanes

– Active/Passive

– Store-Forward

Design:

– Throttling

– Circuit Breaker

– Caching

– Rollback

– Healthchecks

Tools


Thank You!