Failover and Global Server Load Balancing for Better Network Availability
DESCRIPTION
Speaker Jeremy Hitchcock of Dynamic Network Services presents how to obtain better uptime and availability through network techniques like failover, global server load balancing, and CDN balancing. Presented at Interop NYC 09.
TRANSCRIPT
Failover and Global Server Load Balancing for Better Network Availability
Jeremy Hitchcock CEO
Dynamic Network Services
Overview
• Problem space: Keeping services up
• About Failover and GSLB
• Case Study: Roll your own CDN in...quick
• Case Study: Speed and Stability
• Case Study: DR You Can Sleep On
• General lessons for network availability
You are probably…
• Software service provider
• Completely online
• Uptime and revenue directly related
• Audience is international (non-geographical)
So is everyone else (and there are a lot more of us)!
Mean Time Between Failures (MTBF) (Local)
Fiber Cuts (Network/global)
Failures Are a Way of Life
• Affects bottom line
• Gets people paged
• Brands lose value
A Better Way?
• Current tools: in-house scripts, appliances, CDN networks
• Either high opex or capex
• New options in infrastructure
• Example:
– 5-10 person [boot-strapped] companies rolling self-healing, auto-provisioning networks
Optimizing The Wrong Part
• Hardware redundancy is expensive
• Single points of failure are bad
• Infrastructure is not a core function
• Things break; automate everything
• Easier (cheaper) than you think
Realizations
• Things break; route around outages
• Infrastructure providers aplenty today
• Users more sensitive to outages
• Internet users are around the world
– Speed of light is still c
– An RTT of 100 ms with 50 objects adds up
Traffic management is critical
Different Architectures, Different Results
Old → New
• Use hardware redundancy, local → Use software redundancy
• Super-site build out → Regionalize, all over-provisioned
• Page on failure, fix based on page → Email report in morning
• Planned deployments → Automatic load handling
• Single master datacenter → Many POPs, all closer to users
• DR is a passive, manual failover → DR and failover blended together
New Tools (new to some)
• Automatic failover
• Global server load balancing
• CDN balancing/managing
• Opex relative to actual usage
• Avoid capex step functions
• Two active components, traffic switch
• Implies external monitoring
• Hide outages
Failover
Standard operation
On Failover
Failover Use Cases
• Two servers for www.domain.com
– On failure, redirect from one to the other
– Works via DNS
– Redirect to a static page
• Requirements
– External monitoring point
– External DNS
– Low DNS caching TTL values
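The failover pattern above can be sketched as a tiny loop: an external monitor probes the primary, and the DNS answer is switched to the backup on failure. This is a minimal illustration, not a real provider API; the addresses are RFC 5737 documentation addresses and `choose_record` is a hypothetical helper.

```python
import socket

# Hypothetical sketch of DNS failover: probe the primary from an external
# monitoring point, and answer queries with the backup when it is down.
PRIMARY = "192.0.2.10"   # documentation-range addresses (RFC 5737)
BACKUP = "192.0.2.20"

def is_up(ip, port=80, timeout=2.0):
    """External health check: can we open a TCP connection to the server?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_record(primary_up):
    """Return the A record to serve; a low TTL lets caches recover quickly."""
    return {"name": "www.domain.com", "type": "A",
            "ttl": 30,  # low TTL: resolvers re-ask soon after a switch
            "value": PRIMARY if primary_up else BACKUP}

record = choose_record(is_up(PRIMARY))
```

The low TTL is the requirement that makes the redirect take effect: resolvers that cached the old answer keep using it until the TTL expires.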
• More than two active components
• Traffic management
– Targeting (geo, network)
– Weighting (percent)
• Failover plus optimized RTT
• Hostname to A record mapping
Global Server Load Balancing (GSLB)
Global Server Load Balancing Use Cases
• Regionalize eyeballs/end-users
• Internet outages/subpar speeds avoided
• Weight based on load, percentages
• Requirements:
– Same as failover
– A bit of math/algorithms to balance traffic
– Many-to-many mappings
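The "bit of math" behind GSLB can be as simple as a per-query weighted pick within the client's region. A minimal sketch, assuming made-up region names, addresses, and weights (none of these come from the talk):

```python
import random

# Hypothetical GSLB answer selection: narrow by client region (geo targeting),
# then pick one address weighted by percentage.
POOLS = {
    "us": [("192.0.2.10", 70), ("192.0.2.11", 30)],   # (address, weight %)
    "eu": [("198.51.100.5", 100)],
}

def resolve(client_region, rng=random):
    """Return one A-record value for this query, weighted within the region."""
    pool = POOLS.get(client_region, POOLS["us"])  # fall back to a default pool
    addresses = [addr for addr, _ in pool]
    weights = [w for _, w in pool]
    return rng.choices(addresses, weights=weights, k=1)[0]
```

Over many queries the 70/30 split emerges, and a region with a single entry always gets that entry; real systems also feed server load and health into the weights.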
• Two complete systems
• Balance between CDNs
– Bandwidth commits
– Regional advantages
• Works on CNAMEs
CDN Management
CDN Manager
• Try out a mix of networks – CDNs, infrastructure providers
• Better manage traffic – Cost/performance reasons
• Requirements – Same as GSLB but with DNS alias CNAMEs
• Internet doesn't care about domain.com
• twitter.com 128.121.146.228
• Lots of tricks you can do here
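One of those tricks, sketched below: because CDN management works on CNAMEs, the managed hostname can answer with an alias pointing at whichever CDN should serve a given query. The CDN hostnames and the region split are illustrative assumptions, not real providers:

```python
# Hypothetical CDN balancing on CNAMEs: map the client's region to a
# provider alias (e.g. keep one region on one CDN for cost or performance).
CDN_TARGETS = {
    "cdn-a": "customer.cdn-a.example.net.",
    "cdn-b": "customer.cdn-b.example.net.",
}

def cname_for(client_region):
    """Answer www.domain.com with an alias into the chosen CDN."""
    choice = "cdn-b" if client_region == "eu" else "cdn-a"
    return {"name": "www.domain.com", "type": "CNAME",
            "value": CDN_TARGETS[choice]}
```

Because the answer is an alias rather than an address, the CDN keeps control of its own edge-server addressing underneath.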
Traffic Cop: DNS
Lenses and Options
• Evaluation Criteria
– Soft/hard costs, capital/operating costs
• Outcome based
– Determine your metrics, test those
• Potential Outcomes
– Roll it in house
– CDN network
– Hardware appliances
– SaaS-based
Which one is better?
• Roll it in house
– Mid-high capex, higher-than-you-think opex
– Lots of soft costs, application specific though
• CDN network
– Little capex, high opex
– Some have more knobs than others
• Hardware appliances
– High capex, low opex
– Need to make full investment into architecture
• SaaS-based
– Little capex, low-mid opex
– Let others worry about this for you
Case Study 1: Roll your own CDN in...quick
Wikia and regionalizing CDNs for better delivery
CDN Choice and Transparency
• Lots of CDNs
– Two great public ones
– 30 (more?) private providers
– Telco/ISP options
• Currently give customer hostname – (customer.cdn.com)
• Only test with live traffic
CDN Manager: Enabling Testing
• Segment traffic and test
• Try 2 or 10 CDNs
• Low-risk method to collect data
• Data collection has to be from end points
– Your office computer is not the Internet
• Can better rate cost/performance
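The low-risk segmentation idea above can be sketched by hashing each client into a stable bucket, so a small fixed slice of live traffic exercises a candidate CDN while everyone else stays on the incumbent. The percentage and the CDN labels here are illustrative assumptions:

```python
import hashlib

# Hypothetical traffic split for CDN testing: a deterministic hash keeps each
# client on one consistent CDN, while a fixed percentage lands on the candidate.
TEST_PERCENT = 5  # send 5% of queries to the CDN under test

def bucket(client_ip):
    """Stable 0-99 bucket per client IP."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def cdn_for(client_ip):
    return "candidate-cdn" if bucket(client_ip) < TEST_PERCENT else "incumbent-cdn"
```

Because the hash is deterministic, a given client never flaps between providers mid-session, and the measured performance comes from real end points rather than an office machine.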
CDN Manager: Wikia
• Wikia runs several niche wikis (audience)
• Optimize traffic delivery for those niches
• Wanted to determine the best CDN based on actual data
CDN Manager: Wikia
• In America, use CDN
• In Europe, use their own
• Why? Who knows, but it's the best for their traffic
Discussion
• Not all CDNs are the same
• Multiple relationships to manage
• Cost control/performance of CDNs
• Audience and economies drive decisions
Case Study 2: Speed and Stability
Twitter and keeping up
Speed and Stability
• All Internet sites have DNS
– Range from good, bad, ugly
• Online services must be fast and accurate
– Latency and uptime are what matter
• Things fail all the time; send users to what works
Speed and Stability: Twitter
• Spiky and growing traffic (like a lot)
• Things change too fast to keep up
• Load balance a lot
• Easier to scale core competencies
• One less thing to worry about
Speed and Stability: Twitter
• DNS part of system to make site work
• Desire not to be an expert in it
• Huge, widespread audience
• Online-only service
Discussion
• When infrastructure changes rapidly, external monitoring is good
• A failover message is better than timeouts
• Keep traffic regionalized through targeting
• Outsource non-core competencies
• Latency affects page views and ad revenue
Case Study 3: Disaster Recovery You Can Sleep With
37 Signals and doing what needs to get done
Disaster Recovery Implementation
• Requirements
– One good facility (A)
– One backup facility (B)
– Ability to recognize facility A is out
– Ability to direct traffic from A to B
Authorize.net Interlude
• DR implementation timeline
– Late July: move to new DR facility and plan
– July 2: fire at Fisher Plaza (unplanned)
– July 3: …
• Only missing a traffic engineering switch
• TTLs (DNS record caching) make a big difference
– Still a problem today
– secure.authorize.net. 86400 IN A 64.94.118.32
• Full discussion: http://bit.ly/23mayf
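The arithmetic behind the TTL problem above is worth spelling out: a resolver that cached the record just before the failover keeps handing out the dead facility's address for up to one full TTL. A small sketch (the helper name is mine, not the talk's):

```python
# Worst-case staleness after a DNS failover: a resolver that cached the old
# record immediately before the switch holds it for a full TTL.
def worst_case_switch_seconds(ttl_seconds):
    return ttl_seconds

# With the 86400-second TTL shown on the authorize.net record:
print(worst_case_switch_seconds(86400) / 3600)   # prints 24.0 (hours stale)
# With a failover-friendly 30-second TTL:
print(worst_case_switch_seconds(30))             # prints 30 (seconds stale)
```

That gap between 24 hours and 30 seconds is why low TTLs appear in the failover requirements earlier in the talk.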
DR: 37 Signals
• Cloud-based SaaS tools, have to be up
• External DNS important for controlling traffic
• What if facility A is down and DNS is only at A?
• External DNS means failover/DR is possible
Discussion
• Ensuring full replication is usually easy
• Traffic management is usually the problem
• Easy to confuse cold assets/warm spares/hot active
• People wait until they have an outage to implement DR
Overall Notes
• Networked services need to be rock solid
• Failover, GSLB, and CDN management are within reach
• Wikia, Twitter, and 37 Signals use external traffic management for their applications
• Audience matters; so do testing and benchmarking
• DynTini
twitter.com/dyntini
Copy of presentation?
Leave a business card in back (or talk to me afterwards) and I'll send it to you
Dynamic Network Services, Inc. 1230 Elm St. Fifth Floor Manchester, NH 03101
+1 888.840.3258 [email protected] dyn.com
Join us for drinks: dyntini.com Follow us on Twitter: @DynInc
Contact Us
Uptime Is the Bottom Line.