Preparing for CDN Failure:
Why and How
Aaron Peters, TurboBytes
Kyle Young, Mobify
@aaronpeters & @ksgyoung
Multi-continent outage for 17 minutes
100% broken in North America and Europe.
#1 eyeballs network in Germany
Single country, single ASN perf degradations are not uncommon in CDN land …
Synthetic monitoring
The Pros
Very Specific
Known Intervals
Controlled Locations
Lots of Information
Synthetic monitoring
The Cons
Limited Locations
Datacenters
Hyper-idealized Client Model
RUM for Page Load Time
CDN must serve HTML + resources
Third party content adds noise
Can’t capture CDN failing
Not very useful
RUM with Test Object
Fetch small object from CDN(s) after onload
Beacon timings, or a Fail
Use Nav Timing for best insight
Challenge: get your JS on other sites too, to capture the CDN failing
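A minimal sketch of the test-object approach above. The function names, the CDN label, both URLs and the 5000 ms limit are illustrative assumptions, not from the talk. It also assumes the CDN sends CORS headers for the test object, so the script can read the response status.

```javascript
// Hypothetical helper: turn one test-object fetch into a beacon payload.
// Either timings, or a Fail (error response, network error, or too slow).
function buildBeacon(cdn, result, maxMs) {
  if (!result.ok || result.durationMs > maxMs) {
    return { cdn: cdn, status: 'fail' };
  }
  return { cdn: cdn, status: 'ok', ms: Math.round(result.durationMs) };
}

// Browser-only part: fetch the small object, then send the beacon.
function runTestObjectCheck(cdn, objectUrl, beaconUrl, maxMs) {
  const start = performance.now();
  return fetch(objectUrl, { cache: 'no-store' })
    // Read the whole body so the duration covers the full download.
    .then((res) => res.arrayBuffer().then(() => res.ok))
    .catch(() => false) // network error or blocked request counts as a Fail
    .then((ok) => {
      const result = { ok: ok, durationMs: performance.now() - start };
      const payload = JSON.stringify(buildBeacon(cdn, result, maxMs));
      navigator.sendBeacon(beaconUrl, payload);
    });
}

// Run after onload so the test does not compete with the page itself.
if (typeof window !== 'undefined') {
  window.addEventListener('load', () => {
    runTestObjectCheck('cdn-a', 'https://cdn-a.example.com/test.gif',
      'https://rum.example.com/beacon', 5000);
  });
}
```

Keeping buildBeacon separate from the fetch makes the Fail rule easy to change without touching the browser code.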
RUM for Page Resources
Resource Timing API
Send assets with the Timing-Allow-Origin (TAO) header
Onload => beacon timings
Easy, right? Not so fast …
Starting Points
Did the asset come from CDN?
Did it load fast enough?
Was it a good response (200/304)?
Fetched from network?
if (total time > 20 ms) { // from network }
DOES NOT WORK!
RT API has many quirks
DNS time and Connect time always zero in IE
No data for 4xx/5xx responses, except in IE
Nothing in RT API until asset fully loaded, except in IE
Firefox (FF) doesn’t tell the truth
How to measure CDN perf with the RT API
main.css
inline JS, high in HEAD, exec only if window.chrome
setTimeout(checkRTAPI, 5000);
at onload or when timer ended:
if ( main.css in RT API && connectTime > 3 ) { loaded fine from network } else { meh }
Too slow: Fail
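The check on this slide could be sketched roughly as follows. The helper name checkAsset and the result labels are assumptions, and reading "Too slow: Fail" as "the asset is not in the RT API when the check runs" is an interpretation of the slide, not something the talk spells out.

```javascript
// Hypothetical helper: classify one asset from Resource Timing entries.
// `entries` is what performance.getEntriesByType('resource') returns.
function checkAsset(entries, fileName, minConnectMs) {
  const entry = entries.find((e) => e.name.split('?')[0].endsWith(fileName));
  if (!entry) return 'fail'; // nothing in the RT API yet: too slow
  const connectTime = entry.connectEnd - entry.connectStart;
  // A measurable connect time suggests a real trip to the network.
  return connectTime > minConnectMs ? 'network' : 'meh';
}

function checkRTAPI() {
  const entries = performance.getEntriesByType('resource');
  return checkAsset(entries, 'main.css', 3);
}

// Inline, high in HEAD; runs only where window.chrome exists.
if (typeof window !== 'undefined' && window.chrome) {
  let done = false;
  const once = () => {
    if (done) return;
    done = true;
    console.log('main.css:', checkRTAPI()); // beacon this in practice
  };
  const timer = setTimeout(once, 5000);
  window.addEventListener('load', () => { clearTimeout(timer); once(); });
}
```

Note that a reused connection also reports a connect time of zero, so 'meh' does not necessarily mean the asset came from cache.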
transferSize attribute, FTW!
transferSize = byte size that came over the wire
if ( transferSize != 0 ) { // from network }
Status: no browser is implementing this yet
Background/discussion: https://github.com/w3c/navigation-timing/issues/3
Future
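A small sketch of how the transferSize check could look where the attribute is available. The function name and result labels are assumptions. One caveat worth hedging: transferSize is also reported as zero for cross-origin assets served without the TAO header, so this only works for assets that send it.

```javascript
// Hypothetical helper: decide from one Resource Timing entry whether the
// asset came over the wire. Returns 'unknown' when the browser does not
// expose transferSize at all.
function fetchedFromNetwork(entry) {
  if (typeof entry.transferSize !== 'number') return 'unknown';
  return entry.transferSize !== 0 ? 'network' : 'not-network';
}

// Browser-only usage: classify every resource on the page.
if (typeof window !== 'undefined') {
  window.addEventListener('load', () => {
    for (const entry of performance.getEntriesByType('resource')) {
      console.log(entry.name, fetchedFromNetwork(entry));
    }
  });
}
```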
Take-Aways
Measure Fail Ratio too, not just Speed
Use RUM for real-world performance insight
And Synthetic monitoring for deep visibility
Beware of the many bugs in Res Timing API
Doing Multi-CDN
Perf data: High volume, high quality
Decision making: Make good sense of the data, quickly
Targeting: Very granular, very accurate
Time-to-Switch: Asap!
Where to switch CDNs?
In DNS:
cs109.wac.edgecastcdn.net.
cds.z4b9c4e6.hwcdn.net
OR
In HTML:
//cdn1.mydomain.com/main.css
//cdn2.mydomain.com/main.css
Traffic management in DNS
client → resolver → authoritative
I see the request comes from NL, based on resolver IP address …
… so I’ll hand out the CNAME to the CDN configured for NL
CDN A
CDN B
Static Geo
Always route to that CDN in that geo
Easy: no need to monitor perf
But what if CDN has boo boo?
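Static Geo boils down to a fixed lookup table in the authoritative DNS. A minimal sketch, where the country codes, hostnames and function name are all made up for illustration:

```javascript
// Hypothetical static table: country code -> CNAME target to hand out.
const STATIC_GEO = new Map([
  ['NL', 'cdn-a.example.net.'],
  ['DE', 'cdn-a.example.net.'],
  ['IN', 'cdn-b.example.net.'],
]);
const DEFAULT_CNAME = 'cdn-a.example.net.';

// The answer never depends on how the CDN is performing right now.
function cnameForCountry(country) {
  return STATIC_GEO.get(country) || DEFAULT_CNAME;
}
```

Nothing in this table reacts to an outage, which is exactly the weakness the last bullet points at.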
Dynamic Geo
Always route to best CDN per geo
Needs solid perf data (RUM !)
Geo targeting accuracy important
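One way to picture Dynamic Geo: rebuild the routing table from RUM samples on a schedule. This sketch picks the CDN with the lowest median per geo. The sample shape and function names are assumptions, and a median alone is a deliberate simplification.

```javascript
// Median of a non-empty array of numbers.
function median(values) {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// samples: [{ geo: 'NL', cdn: 'cdn-a', ms: 120 }, ...] from RUM beacons.
// Returns geo -> { cdn, ms } for the CDN with the lowest median in that geo.
function routingTable(samples) {
  const byKey = new Map(); // "geo|cdn" -> [ms, ...]
  for (const { geo, cdn, ms } of samples) {
    const key = geo + '|' + cdn;
    if (!byKey.has(key)) byKey.set(key, []);
    byKey.get(key).push(ms);
  }
  const best = {};
  for (const [key, values] of byKey) {
    const [geo, cdn] = key.split('|');
    const m = median(values);
    if (!best[geo] || m < best[geo].ms) best[geo] = { cdn: cdn, ms: m };
  }
  return best;
}
```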
Dynamic Geo + ASN
Holy grail: gives best results
Really needs RUM data, and lots of it
Targeting accuracy even more important
Geo targeting gone wrong
client → 8.8.8.8 → authoritative
I see the request comes from MY, based on resolver IP address …
… so I’ll hand out the CNAME to the CDN configured for MY
CDN A
CDN B - best in India!
EDNS0 to the rescue!
client → 8.8.8.8 → authoritative
I see the request comes from IN, based on client IP address /24 …
… so I’ll hand out the CNAME to the CDN configured for IN
CDN A
CDN B - best in India!
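The difference between the last two slides comes down to which address gets geolocated. A rough sketch, assuming the authoritative server exposes the EDNS0 Client Subnet option (RFC 7871) as a clientSubnet string. The query shape, the toy geo lookup and the documentation-range client prefix are all hypothetical, not a real DNS library API.

```javascript
// Toy GeoIP table keyed by the first three octets (hypothetical data).
const GEO_BY_PREFIX = new Map([
  ['8.8.8', 'MY'],      // where the public resolver appears to be
  ['203.0.113', 'IN'],  // documentation prefix standing in for the client
]);

function geoLookup(ip) {
  const prefix = ip.split('.').slice(0, 3).join('.');
  return GEO_BY_PREFIX.get(prefix) || 'unknown';
}

// Prefer the client subnet when the resolver passes it along;
// otherwise fall back to the resolver's own IP address.
function countryForQuery(query) {
  const ip = query.clientSubnet
    ? query.clientSubnet.split('/')[0]
    : query.resolverIp;
  return geoLookup(ip);
}
```

With only the resolver IP the client lands in MY; with the /24 it lands in IN and gets the CDN that is best there.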
Decision Making
Look at everything, not just ‘Response Time’
Use multiple statistics, not just median
Make your ‘decider’ sensitive to Fail Ratio !
Tuning your logic takes time
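To make these points concrete, here is a sketch of a 'decider' that combines median, 95th percentile and Fail Ratio into one score. The weights, names and the nearest-rank percentile are assumptions chosen for illustration. The talk does not prescribe a formula, and real weights need tuning.

```javascript
// Nearest-rank percentile on an ascending array; Infinity if empty.
function percentile(sorted, p) {
  if (sorted.length === 0) return Infinity;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize one CDN: successful timings plus a count of failed fetches.
function summarize(timingsMs, failCount) {
  const sorted = [...timingsMs].sort((a, b) => a - b);
  const total = sorted.length + failCount;
  return {
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    failRatio: total === 0 ? 0 : failCount / total,
  };
}

// Lower is better. Weights are assumed to be positive numbers.
function score(stats, w) {
  return stats.median * w.median + stats.p95 * w.p95 + stats.failRatio * w.failRatio;
}

// statsByCdn: { 'cdn-a': stats, 'cdn-b': stats } -> name of lowest score.
function pickCdn(statsByCdn, w) {
  let best = null;
  for (const [cdn, stats] of Object.entries(statsByCdn)) {
    const s = score(stats, w);
    if (best === null || s < best.score) best = { cdn: cdn, score: s };
  }
  return best ? best.cdn : null;
}
```

A large failRatio weight lets a CDN with a good median but a 20% Fail Ratio lose to a slower CDN that actually delivers.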
Coping with low volume data
Don’t make changes
Make decisions with lower confidence
Have a dynamic targeting granularity
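A dynamic targeting granularity can be sketched as a fallback chain: use the country + ASN decision when it has enough samples, otherwise the country, otherwise a global default. The key format, the threshold and the reserved example ASN are assumptions for illustration.

```javascript
// decisions: Map of key -> { cdn, samples }.
// Keys: 'DE|AS64500' (country + ASN), 'DE' (country), '*' (global).
function chooseCdn(decisions, country, asn, minSamples) {
  const keys = [country + '|' + asn, country, '*'];
  for (const key of keys) {
    const d = decisions.get(key);
    // Only trust a decision that is backed by enough data.
    if (d && d.samples >= minSamples) return { cdn: d.cdn, level: key };
  }
  return null; // not enough data at any level: don't make changes
}
```

Lowering minSamples trades confidence for granularity, which corresponds to the "lower confidence" option on the slide.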
Experiment: do a Pat Meenan !
http://www.slideshare.net/patrickmeenan/service-workers-for-performance
Hi, I’m Pat
Example CDN Perf Program
Limitations
Not Practical for Monitoring
Humans are Required
Misses Important Factors (e.g. SSL)
Hard to Commit to Bandwidth
Support Lines - A Tale of Lowered Expectations
IT Crowd - Fremantle Media
Batman - 20th Century Fox
Best Effort
Switch over to an alternate CDN for the entire service, across the globe, as per the “Holy S#!t Handbook”.
This Isn’t Pretty
Cold Cache
Backend Thrashing
3 to 4 hours of Intermittent Failure
Lessons Learned
Hot Standby
Geographic DNS Control and Optimization
More Monitoring
Thank you.
@aaronpeters - [email protected]
@ksgyoung - [email protected]
Questions?