Integrating Multiple CDN Providers at Etsy - Velocity Europe (London) 2013

Post on 08-Sep-2014

DESCRIPTION

Relying on a single content delivery network for your site can limit your flexibility. By diversifying your CDN providers you can put the power back in your hands and get the best of all worlds in terms of performance, reliability and cost. In this talk Marcus and Laurie present Etsy’s recent work integrating multiple CDN providers into its site delivery infrastructure. This presentation was delivered at Velocity Europe, November 2013.

TRANSCRIPT

Integrating Multiple CDN Providers

Our experiences at Etsy

@lozzd • @ickymettle

Marcus Barczak Laurie Denness

Staff Operations Engineers

Beginning of 2010 Today

Background

▪ First started using a single CDN in 2008

▪ Exponential Growth

▪ Start of 2012 began investigation into running multiple CDNs

Why use a CDN?

▪ Goal: Consistently fast user experience globally

▪ Improve last mile performance by caching content close to the user

▪ Offload content delivery from origin infrastructure to the CDN provider

Why use more than one CDN?

▪ Resilience

- Eliminate single point of failure

▪ Flexibility

- Balance traffic based on business requirements

▪ Cost

- Manage provider costs

The Plan

http://www.flickr.com/photos/malloy/195204215

The Plan

1. Establish evaluation criteria

2. Initial configuration and testing

3. Test with production traffic

4. Operationalising

Evaluation Criteria

http://www.flickr.com/photos/49212595@N00/5646403386

Evaluation Criteria

▪ Performance

▪ Configuration

▪ Reporting, Metrics and Logging

▪ Culture

Performance

▪ Baseline Response Times

- Should be within ±5% of our existing CDN provider’s response times

▪ Hit Ratios and Origin Offload

- Provider should achieve equivalent or better origin offload performance and hit ratios
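
As a rough illustration of how these criteria could be checked, the sketch below computes a hit ratio, an origin offload figure, and a ±5% baseline comparison. The function names and sample numbers are hypothetical and not from the talk; offload is computed here by bytes, which is only one of several possible definitions.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests answered from the CDN cache."""
    total = hits + misses
    return hits / total if total else 0.0


def origin_offload(edge_bytes: int, origin_bytes: int) -> float:
    """Fraction of bytes delivered by the CDN that origin did not have to serve."""
    return 1 - origin_bytes / edge_bytes if edge_bytes else 0.0


def within_baseline(candidate_ms: float, baseline_ms: float,
                    tolerance: float = 0.05) -> bool:
    """True when the candidate response time is within +/- tolerance of baseline."""
    return abs(candidate_ms - baseline_ms) <= tolerance * baseline_ms


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    print(hit_ratio(hits=90, misses=10))
    print(origin_offload(edge_bytes=1000, origin_bytes=250))
    print(within_baseline(candidate_ms=104, baseline_ms=100))
```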

Configuration

▪ Complexity

- How complex is the provider’s configuration system?

▪ Self service

- Can you make changes directly, or do they require professional services or other intervention?

▪ Latency for changes

- How quickly do changes propagate?

Reporting, Metrics and Logging

▪ Resolution

▪ Latency

▪ Delivery

▪ Customisation

Culture

▪ Understand our culture

▪ Postmortems

▪ Access to technical staff

▪ Shared success

Initial Configuration and Testing

http://www.flickr.com/photos/7269902@N07/4592239326

Clean the house

http://www.flickr.com/photos/mastergeorge/8562623590

Clean the house

▪ Managing caching TTLs from origin

- CDNs honour the origin cache-control headers!

<LocationMatch "\.(gif|jpg|jpeg|png|css|js)$">
    Header set Cache-Control "max-age=94670800"
</LocationMatch>
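
To make the TTL in that header concrete, here is a small sketch that extracts `max-age` from a Cache-Control value and converts it to days. The helper name is hypothetical; the header value is the one from the slide.

```python
import re


def max_age_seconds(cache_control: str):
    """Return the max-age directive as an int, or None when it is absent."""
    match = re.search(r"(?:^|,)\s*max-age=(\d+)", cache_control,
                      flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


ttl = max_age_seconds("max-age=94670800")
days = ttl / 86400  # about 1096 days, i.e. roughly three years

if __name__ == "__main__":
    print(f"max-age of {ttl} seconds is about {days:.0f} days")
```

The same helper can be pointed at any response header captured with `curl -i` to confirm that the edge is passing the origin TTL through unchanged.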

Clean the house

▪ Manage gzip compression from origin

- Honoured by CDNs

- Compression from origin to CDN

## mod_deflate compression - see OPS-1537 ##
AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript [..]

Clean the house

If you can do it at origin, do it at origin

Mean Time To Curl

http://www.flickr.com/photos/wwarby/3297205226

curl -i -H 'Host: img0.etsystatic.com' \
    global-ssl.fastly.net/someimage.jpg

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: MISS
X-Cache-Hits: 0

curl -i -H 'Host: img0.etsystatic.com' \
    global-ssl.fastly.net/someimage.jpg

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1

https://www.etsy.com/listing/99871278

Mean Time To Curl = Done

Mean Time To Curl

▪ No need to touch existing infrastructure

▪ Smoke test of functionality

▪ 10 minutes from configuration to curl

▪ New providers should be plug and play
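
The two curl responses earlier in this section differ only in their cache headers, which suggests a simple smoke test: fetch the same object twice and expect a MISS followed by a HIT. The sketch below runs that check on captured header text. The helper names are hypothetical, and `X-Cache` is the header shown in those responses; other providers may report cache status differently.

```python
def parse_headers(raw: str) -> dict:
    """Parse a raw HTTP response head into a dict with lower-cased header names."""
    headers = {}
    for line in raw.splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def smoke_test(first: str, second: str) -> bool:
    """Expect the first fetch to miss the cache and the repeat fetch to hit it."""
    a, b = parse_headers(first), parse_headers(second)
    return a.get("x-cache") == "MISS" and b.get("x-cache") == "HIT"


FIRST = """HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
X-Served-By: cache-lo82-LHR
X-Cache: MISS
X-Cache-Hits: 0"""

SECOND = FIRST.replace("X-Cache: MISS", "X-Cache: HIT").replace(
    "X-Cache-Hits: 0", "X-Cache-Hits: 1")

if __name__ == "__main__":
    print("smoke test passed" if smoke_test(FIRST, SECOND) else "smoke test failed")
```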

Testing In Production

http://www.flickr.com/photos/solarnu/10646426865

Testing with Production Traffic

▪ Images only at first

▪ Good test of caching performance

▪ Easy to test by swapping hostnames

▪ Made even easier with our A/B testing framework

A/B Test Framework

▪ Fine grained control

▪ Enable test for specific users or groups

▪ Percentage of users

▪ All controlled via configuration in code

▪ Rapid and complete rollback

Configure Mappings to CDNs

$server_config["image"] = array(
    'akamai' => array(
        'img0-ak.etsystatic.com',
        'img1-ak.etsystatic.com',
    ),
    'edgecast' => array(
        'img0-ec.etsystatic.com',
        'img1-ec.etsystatic.com',
    ),
    'fastly' => array(
        'img0-f.etsystatic.com',
        'img1-f.etsystatic.com',
    ),
);

Test Controls

$server_config['ab']['cdn'] = array(
    'enabled' => 'on',
    'weights' => array(
        'akamai' => 0.0,
        'edgecast' => 0.0,
        'fastly' => 0.0,
        'origin' => 100.0,
    ),
    'override' => 'cdn_diversity',
);
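
One way weights like these could drive provider selection is to hash each user into a stable bucket between 0 and 100 and walk the cumulative weights. This is a sketch under that assumption, not Etsy's A/B framework; `bucket` and `pick_provider` are hypothetical names.

```python
import hashlib

# Weights copied from the Test Controls slide: everything still goes to origin.
WEIGHTS = {"akamai": 0.0, "edgecast": 0.0, "fastly": 0.0, "origin": 100.0}


def bucket(user_id: str) -> float:
    """Map a user id to a stable point in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100.0


def pick_provider(user_id: str, weights: dict) -> str:
    """Return the provider whose cumulative weight range contains the user's bucket."""
    point = bucket(user_id)
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    raise ValueError("weights must sum to 100")


if __name__ == "__main__":
    print(pick_provider("user-42", WEIGHTS))
```

Because the bucket depends only on the user id, a given user keeps the same provider until the weights change, and setting every CDN weight back to 0.0 acts as a complete rollback.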

Metrics and Monitoring

http://www.flickr.com/photos/nicolasfleury/6073151084

Metrics and Monitoring

Even if it doesn’t move, graph it anyway

Simplest approach: Provider’s dashboards

▪ Get more detail by pulling metrics in house

▪ Write script to pull data from API

▪ Create dashboards with data
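
A script of that kind ends by handing numbers to a graphing system. Assuming Graphite is the destination, the sketch below formats values in Carbon's plaintext protocol, one `path value timestamp` line per metric. The metric names and values are made up, and the vendor API call that would produce them is deliberately left out because each provider's API differs.

```python
import time


def graphite_lines(prefix: str, metrics: dict, timestamp=None) -> list:
    """Format metrics as Carbon plaintext lines, one 'path value timestamp' per metric."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{name} {value} {ts}\n"
            for name, value in sorted(metrics.items())]


if __name__ == "__main__":
    # Hypothetical values, standing in for a vendor API response.
    sample = {"hit_ratio": 0.93, "requests_per_sec": 1200}
    for line in graphite_lines("cdn.fastly", sample, timestamp=1383980000):
        print(line, end="")
```

The resulting lines can be written to a Carbon listener over TCP, which accepts the plaintext protocol on port 2003 by default.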

Testing Plan

1. for c in $cdns; do rampup $c; done;

2. Deliberately slow and steady

3. Watch traffic increase

4. Watch origin offload increase

5. Watch performance
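
The ramp-up loop in step 1 can be sketched as follows. This is a hypothetical illustration, not the tooling used at Etsy: the 5-point step, the 10% target, and both function names are assumptions, and a real run would pause between steps to watch traffic, origin offload and performance.

```python
def ramp_steps(target: float, step: float = 5.0) -> list:
    """Return the successive weights from one step up to the target."""
    steps, current = [], 0.0
    while current < target:
        current = min(current + step, target)
        steps.append(current)
    return steps


def shift_weight(weights: dict, cdn: str, new_weight: float) -> dict:
    """Set one CDN's weight, taking the difference from the 'origin' share."""
    updated = dict(weights)
    delta = new_weight - updated[cdn]
    if delta > updated["origin"]:
        raise ValueError("not enough origin share left to shift")
    updated[cdn] = new_weight
    updated["origin"] -= delta
    return updated


weights = {"akamai": 0.0, "edgecast": 0.0, "fastly": 0.0, "origin": 100.0}
for cdn in ("akamai", "edgecast", "fastly"):
    for level in ramp_steps(10.0, step=5.0):
        weights = shift_weight(weights, cdn, level)
        # A real ramp would wait here and check the graphs before continuing.

if __name__ == "__main__":
    print(weights)
```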

Downsides of this approach

▪ AB testing can’t be used for main site

▪ Exposing your test CNAMEs

- Especially if hotlinking is a concern

How do you know it’s broke?

▪ Check the graphs!

▪ Check with your community

▪ Keep support in the loop

Operationalising

http://www.flickr.com/photos/98047351@N05/9706165200

Content Partitioning

Etsy’s site partitioning

▪ Dynamic HTML Content: www.etsy.com

▪ Static Assets (js, css, fonts): site.etsystatic.com

▪ Listing Images, Avatars: imgX.etsystatic.com

Balancing Traffic in Production

http://www.flickr.com/photos/wok_design/2499217405

Balancing Traffic Using DNS

▪ Traffic Manager

▪ Extends DNS to dynamically return records based on rules

▪ Weighted round robin
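
To show the effect of weighting, the sketch below simulates a resolver that answers each query with one CNAME target chosen at random in proportion to its weight. This is weighted random selection, which approximates weighted round robin over many queries; it is not the traffic manager's actual algorithm. The target names are the CNAMEs that appear in the talk's dig output, and the 50/25/25 split is made up.

```python
import random
from collections import Counter

# Hypothetical weights for illustration only.
TARGETS = {
    "www.etsy.com.edgekey.net.": 50,
    "cs34.adn.edgecastcdn.net.": 25,
    "global-ssl.fastly.net.": 25,
}


def answer(rng: random.Random, targets: dict) -> str:
    """Pick one CNAME target with probability proportional to its weight."""
    names = list(targets)
    return rng.choices(names, weights=[targets[n] for n in names], k=1)[0]


def simulate(targets: dict, queries: int, seed: int = 0) -> Counter:
    """Count how often each target is returned across a number of queries."""
    rng = random.Random(seed)
    return Counter(answer(rng, targets) for _ in range(queries))


if __name__ == "__main__":
    for name, count in simulate(TARGETS, 10_000).most_common():
        print(f"{name:30s} {count}")
```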

Balancing Traffic Using DNS

[2589:~] $ dig +short www.etsy.com
www.etsy.com.edgekey.net.
e2463.b.akamaiedge.net.
23.74.122.37

[2589:~] $ dig +short www.etsy.com
cs34.adn.edgecastcdn.net.
93.184.219.54

[2589:~] $ dig +short www.etsy.com
global-ssl.fastly.net.
185.31.19.184

[2589:~] $ dig +short www.etsy.com
etsy.com.
38.123.123.123

Balancing Traffic Using DNS

▪ Rule updates typically made via web UI

▪ Can be slow and error prone

▪ Changes need to be applied to all three domains

▪ API available to make changes programmatically

cdncontrol

http://www.flickr.com/photos/foshydog/4441105829

DNS balancing downsides

▪ Low TTLs for fast convergence


▪ More DNS lookups for users

▪ Not 100% instant or deterministic

▪ Mo QPS == Mo Money
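
The link between low TTLs and query volume is simple arithmetic: a caching resolver with steady demand for a name has to refresh it about once per TTL. The sketch below works that through for two hypothetical TTL values; the function names and the TTLs are assumptions, not figures from the talk.

```python
SECONDS_PER_DAY = 86_400


def max_refreshes_per_day(ttl_seconds: int) -> int:
    """Upper bound on lookups one caching resolver makes per day for one name."""
    if ttl_seconds <= 0:
        raise ValueError("TTL must be positive")
    return SECONDS_PER_DAY // ttl_seconds


def relative_query_load(old_ttl: int, new_ttl: int) -> float:
    """How many times more refreshes the new TTL causes compared with the old one."""
    return old_ttl / new_ttl


if __name__ == "__main__":
    for ttl in (3600, 30):  # hypothetical TTLs
        print(f"TTL {ttl:>4}s -> up to {max_refreshes_per_day(ttl)} lookups per resolver per day")
    print(f"load multiplier: {relative_query_load(3600, 30):.0f}x")
```

On the convergence side, the same TTL bounds how long a well-behaved resolver keeps serving the old answer after a weight change.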

50% within 1 minute

Long Tail is Loooong

Monitoring in Production

http://www.flickr.com/photos/9229426@N05/5160787240

Whoopsie Page

▪ Static HTML delivered for 5xx errors

- Branding

- Translated error messages

- Links to status page

Failure Beacons

1. 1x1 tracking pixel embedded in page

[...]
<img src="//failure.etsy.com/status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com">
</body>
</html>

2. Request creates an access log line

3. Scrape them out minutely using logster

self.reg = re.compile('^\S+(\s:)? (?P<remote_addr>[0-9\.]+),? [0-9\.,\- ]+ \[[^\]]+\] \"GET /status/images/beacon\.gif\?(beacon_)?source=(?P<source>\S+) HTTP/1\.\d\" \d+ [\d\-]+ \"(?P<referrer>[^\"]+)\" \"(?P<user_agent>[^\"]+)\" .*$')

4. Logster posts event counts to Graphite

5. Alert on Graphite graph in Nagios
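
Steps 2 to 5 can be sketched end to end as: count beacon requests per source in a batch of access log lines, then turn the count into a Nagios-style status. This is a simplified stand-in, not the logster parser from the talk. The regex is a reduced version written for this example, the sample lines are made up, and the thresholds are hypothetical; only the Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL) are standard.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Simplified pattern: only the request part of the access log line is matched.
REQUEST = re.compile(r'"GET (?P<target>/status/images/beacon\.gif\?\S*) HTTP/1\.\d"')


def count_beacons(lines) -> Counter:
    """Count beacon hits per source value (beacon_source or source parameter)."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue
        params = parse_qs(urlsplit(match.group("target")).query)
        source = params.get("beacon_source") or params.get("source")
        if source:
            counts[source[0]] += 1
    return counts


def nagios_status(count: int, warn: int, crit: int) -> tuple:
    """Map an event count to a Nagios exit code and label."""
    if count >= crit:
        return 2, "CRITICAL"
    if count >= warn:
        return 1, "WARNING"
    return 0, "OK"


# Made-up sample lines for illustration.
SAMPLE_LINES = [
    '[31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com HTTP/1.1"',
    '[31/Oct/2013:07:06:43 +0000] "GET /status/images/beacon.gif?source=fastly_origin_failure-etsy.com HTTP/1.1"',
    '[31/Oct/2013:07:06:44 +0000] "GET /index.html HTTP/1.1"',
]

if __name__ == "__main__":
    for source, count in count_beacons(SAMPLE_LINES).items():
        print(source, count, nagios_status(count, warn=5, crit=20))
```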

Failure Beacons

▪ Client IP address can be geolocated

▪ Optional extra debugging information

[31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com&provider_error=Connection%20timed%20out&server_identity=cache-ny57-NYC HTTP/1.1"
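
The extra fields travel as ordinary query-string parameters, so they can be recovered with the standard library. A minimal sketch, assuming the request is the first double-quoted field on the log line as in the excerpt; `beacon_fields` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlsplit

SAMPLE = ('[31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif'
          '?beacon_source=fastly_origin_failure-etsy.com'
          '&provider_error=Connection%20timed%20out'
          '&server_identity=cache-ny57-NYC HTTP/1.1"')


def beacon_fields(log_line: str) -> dict:
    """Return the decoded query parameters of the request on a log line."""
    try:
        request = log_line.split('"')[1]  # text between the first pair of quotes
        _method, target, _protocol = request.split(" ")
    except (IndexError, ValueError):
        return {}
    return dict(parse_qsl(urlsplit(target).query))


if __name__ == "__main__":
    for key, value in beacon_fields(SAMPLE).items():
        print(f"{key}: {value}")
```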

Tracking Requests to Origin

GET / HTTP/1.1
User-Agent: curl/7.24.0
Accept: */*
X-Forwarded-Host: www.etsy.com
[...]
X-CDN-Provider: edgecast
[...]
Host: www.etsy.com
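
With a header like `X-CDN-Provider` on every request that reaches origin, traffic can be attributed to a provider. The sketch below reads that header from raw request text and tallies requests per provider. The function names are hypothetical, and the "direct" label for requests without the header is an assumption made for this example.

```python
from collections import Counter


def cdn_provider(raw_request: str) -> str:
    """Return the X-CDN-Provider value, or 'direct' when the header is missing."""
    for line in raw_request.splitlines()[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "x-cdn-provider":
            return value.strip().lower()
    return "direct"


def tally(requests) -> Counter:
    """Count requests per provider."""
    return Counter(cdn_provider(r) for r in requests)


VIA_CDN = """GET / HTTP/1.1
User-Agent: curl/7.24.0
Accept: */*
X-Forwarded-Host: www.etsy.com
X-CDN-Provider: edgecast
Host: www.etsy.com"""

NO_CDN = """GET / HTTP/1.1
User-Agent: curl/7.24.0
Host: www.etsy.com"""

if __name__ == "__main__":
    print(tally([VIA_CDN, VIA_CDN, NO_CDN]))
```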

Backend Monitoring

▪ Logster on CDN provider header

▪ Vendor APIs to bring data in house

▪ Data in-house benefits include

- Integration with our anomaly detection systems

- Consistent and unified view of all CDN metrics

- We control data retention period

Awareness

▪ Over 100 engineers

▪ Deploying 60 times a day

▪ Correlating external and internal services

Deploy lines

Frontend Monitoring

▪ Performance is important to us

▪ Monitoring overall site performance

▪ Monitoring performance by CDN provider

▪ Real User Monitoring (SOASTA mPulse) on key pages to track page performance

Downsides

http://www.flickr.com/photos/39272170@N00/3841286802

Debugging: What broke?

▪ MTTD/MTTR can be extremely low with this system

▪ But not always

▪ Non technical member base

▪ Confusing and time consuming

▪ Amazing support team

▪ Log as much information as possible

Conclusions/Takeaways

http://www.flickr.com/photos/sk8geek/4649776194

Great success

▪ 12 months in, the benefits have far outweighed the few downsides

▪ We’re continuing to evolve the system

▪ We’ll be sure to share our experience with the community along the way

Links/Open Source

▪ cdncontrol
http://github.com/etsy/cdncontrol
http://github.com/etsy/cdncontrol_ui

▪ logster
http://github.com/etsy/logster

▪ CDN API to Graphite scripts
http://github.com/lozzd/cdn_scripts

Thanks!

Questions?
