Failover and Global Server Load Balancing for Better Network Availability
DESCRIPTION
Speaker Jeremy Hitchcock of Dynamic Network Services presents how to obtain better uptime and availability through network techniques like failover, global server load balancing, and CDN balancing. Presented at Interop NYC 09.
TRANSCRIPT
Failover and Global Server Load Balancing for Better Network Availability
Jeremy Hitchcock CEO
Dynamic Network Services
Overview
• Problem space: Keeping services up
• About Failover and GSLB
• Case Study: Roll your own CDN in...quick
• Case Study: Speed and Stability
• Case Study: DR You Can Sleep On
• General lessons for network availability
You are probably…
• Software service provider
• Completely online
• Uptime and revenue directly related
• Audience is international (non-geographical)
So is everyone else (and there are a lot more of us)!
Mean Time Between Failures (MTBF) (Local)
Fiber Cuts (Network/global)
Failures Are a Way of Life
• Affects bottom line
• Gets people paged
• Brands lose value
A Better Way?
• Current tools: in-house scripts, appliances, CDN networks
• Either high opex or capex
• New options in infrastructure
• Example:
– 5-10 person [boot-strapped] companies rolling self-healing, auto-provisioning networks
Optimizing The Wrong Part
• Hardware redundancy is expensive
• Single points of failure are bad
• Infrastructure is not a core function
• Things break; automate everything
• Easier (cheaper) than you think
Realizations
• Things break; route around outages
• Infrastructure providers aplenty today
• Users more sensitive to outages
• Internet users are around the world
– Speed of light is still c
– An RTT of 100 ms with 50 objects adds up
Traffic management is critical
Different Architectures, Different Results
Old → New
• Use hardware redundancy, local → Use software redundancy
• Super-site build out → Regionalize, all over-provisioned
• Page on failure, fix based on page → Email report in morning
• Planned deployments → Automatic load handling
• Single master datacenter → Many POPs, all closer to users
• DR is a passive, manual failover → DR and failover blended together
New Tools (new to some)
• Automatic failover
• Global server load balancing
• CDN balancing/managing
• Opex relative to actual usage
• Avoid capex step functions
• Two active components, traffic switch
• Implies external monitoring
• Hide outages
Failover
Standard operation
On Failover
Failover Use Cases
• Two servers for www.domain.com
– On failure, redirect from one to the other
– Works via DNS
– Redirect to a static page
• Requirements
– External monitoring point
– External DNS
– Low DNS caching TTL values
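The failover pattern above can be sketched as a tiny loop: an external monitor probes the primary, and the DNS answer is switched to the backup on failure. This is a minimal illustration, not a real provider API; the addresses are RFC 5737 documentation addresses and `choose_record` is a hypothetical helper.

```python
import socket

# Hypothetical sketch of DNS failover: probe the primary from an external
# monitoring point, and answer queries with the backup when it is down.
PRIMARY = "192.0.2.10"   # documentation-range addresses (RFC 5737)
BACKUP = "192.0.2.20"

def is_up(ip, port=80, timeout=2.0):
    """External health check: can we open a TCP connection to the server?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_record(primary_up):
    """Return the A record to serve; a low TTL lets caches recover quickly."""
    return {"name": "www.domain.com", "type": "A",
            "ttl": 30,  # low TTL: resolvers re-ask soon after a switch
            "value": PRIMARY if primary_up else BACKUP}

record = choose_record(is_up(PRIMARY))
```

The low TTL is the requirement that makes the redirect take effect: resolvers that cached the old answer keep using it until the TTL expires.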
• More than two active components
• Traffic management
– Targeting (geo, network)
– Weighting (percent)
• Failover plus optimized RTT
• Hostname to A record mapping
Global Server Load Balancing (GSLB)
Global Server Load Balancing Use Cases
• Regionalize eyeballs/end-users
• Internet outages/subpar speeds avoided
• Weight based on load, percentages
• Requirements:
– Same as failover
– A bit of math/algorithms to balance traffic
– Many-to-many mappings
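The "bit of math" behind GSLB can be as simple as a per-query weighted pick within the client's region. A minimal sketch, assuming made-up region names, addresses, and weights (none of these come from the talk):

```python
import random

# Hypothetical GSLB answer selection: narrow by client region (geo targeting),
# then pick one address weighted by percentage.
POOLS = {
    "us": [("192.0.2.10", 70), ("192.0.2.11", 30)],   # (address, weight %)
    "eu": [("198.51.100.5", 100)],
}

def resolve(client_region, rng=random):
    """Return one A-record value for this query, weighted within the region."""
    pool = POOLS.get(client_region, POOLS["us"])  # fall back to a default pool
    addresses = [addr for addr, _ in pool]
    weights = [w for _, w in pool]
    return rng.choices(addresses, weights=weights, k=1)[0]
```

Over many queries the 70/30 split emerges, and a region with a single entry always gets that entry; real systems also feed server load and health into the weights.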
• Two complete systems
• Balance between CDNs
– Bandwidth commits
– Regional advantages
• Works on CNAMEs
CDN Management
CDN Manager
• Try out a mix of networks – CDNs, infrastructure providers
• Better manage traffic – Cost/performance reasons
• Requirements – Same as GSLB but with DNS alias CNAMEs
• Internet doesn't care about domain.com
• twitter.com 128.121.146.228
• Lots of tricks you can do here
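One of those tricks, sketched below: because CDN management works on CNAMEs, the managed hostname can answer with an alias pointing at whichever CDN should serve a given query. The CDN hostnames and the region split are illustrative assumptions, not real providers:

```python
# Hypothetical CDN balancing on CNAMEs: map the client's region to a
# provider alias (e.g. keep one region on one CDN for cost or performance).
CDN_TARGETS = {
    "cdn-a": "customer.cdn-a.example.net.",
    "cdn-b": "customer.cdn-b.example.net.",
}

def cname_for(client_region):
    """Answer www.domain.com with an alias into the chosen CDN."""
    choice = "cdn-b" if client_region == "eu" else "cdn-a"
    return {"name": "www.domain.com", "type": "CNAME",
            "value": CDN_TARGETS[choice]}
```

Because the answer is an alias rather than an address, the CDN keeps control of its own edge-server addressing underneath.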
Traffic Cop: DNS
Lenses and Options
• Evaluation Criteria
– Soft/hard costs, capital/operating costs
• Outcome based
– Determine your metrics, test those
• Potential Outcomes
– Roll it in house
– CDN network
– Hardware appliances
– SaaS-based
Which one is better?
• Roll it in house
– Mid-high capex, higher-than-you-think opex
– Lots of soft costs, application specific though
• CDN network
– Little capex, high opex
– Some have more knobs than others
• Hardware appliances
– High capex, low opex
– Need to make full investment into architecture
• SaaS-based
– Little capex, low-mid opex
– Let others worry about this for you
Case Study 1: Roll your own CDN in...quick
Wikia and regionalizing CDNs for better delivery
CDN Choice and Transparency
• Lots of CDNs
– Two great public ones
– 30 (more?) private providers
– Telco/ISP options
• Currently give customer hostname – (customer.cdn.com)
• Only test with live traffic
CDN Manager: Enabling Testing
• Segment traffic and test
• Try 2 or 10 CDNs
• Low-risk method to collect data
• Data collection has to be from end points
– Your office computer is not the Internet
• Can better rate cost/performance
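The low-risk segmentation idea above can be sketched by hashing each client into a stable bucket, so a small fixed slice of live traffic exercises a candidate CDN while everyone else stays on the incumbent. The percentage and the CDN labels here are illustrative assumptions:

```python
import hashlib

# Hypothetical traffic split for CDN testing: a deterministic hash keeps each
# client on one consistent CDN, while a fixed percentage lands on the candidate.
TEST_PERCENT = 5  # send 5% of queries to the CDN under test

def bucket(client_ip):
    """Stable 0-99 bucket per client IP."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def cdn_for(client_ip):
    return "candidate-cdn" if bucket(client_ip) < TEST_PERCENT else "incumbent-cdn"
```

Because the hash is deterministic, a given client never flaps between providers mid-session, and the measured performance comes from real end points rather than an office machine.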
CDN Manager: Wikia
• Wikia runs several niche wikis (audience)
• Optimize traffic delivery for those niches
• Wanted to determine the best CDN based on actual data
CDN Manager: Wikia
• In America, use CDN
• In Europe, use their own
• Why? Who knows, but it's the best for their traffic
Discussion
• Not all CDNs are the same
• Multiple relationships to manage
• Cost control/performance of CDNs
• Audience and economies drive decisions
Case Study 2: Speed and Stability
Twitter and keeping up
Speed and Stability
• All Internet sites have DNS
– Range from good, bad, ugly
• Online services must be fast and accurate
– Latency and uptime are what matter
• Things fail all the time; send users to what works
Speed and Stability: Twitter
• Spiky and growing traffic (like a lot)
• Things change too fast to keep up
• Load balance a lot
• Easier to scale core competencies
• One less thing to worry about
Speed and Stability: Twitter
• DNS part of system to make site work
• Desire not to be an expert in it
• Huge, widespread audience
• Online-only service
Discussion
• When infrastructure changes rapidly, external monitoring is good
• A failover message is better than timeouts
• Keep traffic regionalized through targeting
• Outsource non-core competencies
• Latency affects page views and ad revenue
Case Study 3: Disaster Recovery You Can Sleep With
37 Signals and doing what needs to get done
Disaster Recovery Implementation
• Requirements
– One good facility (A)
– One backup facility (B)
– Ability to recognize facility A is out
– Ability to direct traffic from A to B
Authorize.net Interlude
• DR implementation timeline
– Late July: move to new DR facility and plan
– July 2: fire at Fisher Plaza (unplanned)
– July 3: …
• Only missing a traffic engineering switch
• TTLs (DNS record caching) make a big difference
– Still a problem today
– secure.authorize.net. 86400 IN A 64.94.118.32
• Full discussion: http://bit.ly/23mayf
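The arithmetic behind the TTL problem above is worth spelling out: a resolver that cached the record just before the failover keeps handing out the dead facility's address for up to one full TTL. A small sketch (the helper name is mine, not the talk's):

```python
# Worst-case staleness after a DNS failover: a resolver that cached the old
# record immediately before the switch holds it for a full TTL.
def worst_case_switch_seconds(ttl_seconds):
    return ttl_seconds

# With the 86400-second TTL shown on the authorize.net record:
print(worst_case_switch_seconds(86400) / 3600)   # prints 24.0 (hours stale)
# With a failover-friendly 30-second TTL:
print(worst_case_switch_seconds(30))             # prints 30 (seconds stale)
```

That gap between 24 hours and 30 seconds is why low TTLs appear in the failover requirements earlier in the talk.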
DR: 37 Signals
• Cloud-based SaaS tools, have to be up
• External DNS important for controlling traffic
• What if facility A is down and DNS is only at A?
• External DNS means failover/DR is possible
Discussion
• Ensuring full replication is usually easy
• Traffic management is usually the problem
• Easy to confuse cold assets/warm spares/hot active
• People wait until they have an outage to implement DR
Overall Notes
• Networked services need to be rock solid
• Failover, GSLB, and CDN management are within reach
• Wikia, Twitter, and 37 Signals use external traffic management for their applications
• Audience matters; so do testing and benchmarking
• DynTini
twitter.com/dyntini
Copy of presentation?
Leave a business card in back (or talk to me afterwards) and I'll send it to you
Dynamic Network Services, Inc. 1230 Elm St. Fifth Floor Manchester, NH 03101
+1 888.840.3258 [email protected] dyn.com
Join us for drinks: dyntini.com Follow us on Twitter: @DynInc
Contact Us
Uptime Is the Bottom Line.