SouthBay SRE Meetup, Jan 2016

Michael Kehoe Senior Site Reliability Engineer LinkedIn SouthBay SRE Meetup LinkedIn Traffic Shifting

Upload: michael-kehoe

Post on 10-Feb-2017

Category: Engineering


Michael Kehoe Senior Site Reliability Engineer

LinkedIn

SouthBay SRE Meetup

LinkedIn Traffic Shifting

2

$ whoami
Michael Kehoe

• Sr Site Reliability Engineer (SRE)

• Member of PROD-SRE

• https://www.linkedin.com/in/michaelkkehoe

3

LinkedIn Multicolo History

4

What is a Traffic Shift?

• Edge (PoP) shift

• Datacenter load shift

• Single master failovers

5

Why do we do traffic shifts?

• To mitigate user impact from problems with a 3rd party provider or LinkedIn’s infrastructure/services

• To validate Disaster Recovery (DR) in case of any datacenter failure

• To validate and test capacity headroom across our datacenters

• To expose bugs and suboptimal configurations by load testing one or more datacenters

• To perform planned maintenance

• To validate and exercise the traffic shift automation

6

Traffic shifting: How do we do it?

7

Edge Traffic shifts: How does it work?

• We use IPVS to load balance at our edges

• We can withdraw anycast routes to remove traffic from that PoP

• Health checks on our edge proxy are tested by DNS providers to verify whether that PoP is in rotation

• We can fail those health checks to remove unicast traffic from that PoP
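
The two removal mechanisms on this slide can be sketched as a toy Python model. This is only an illustration under stated assumptions: the `PoP` class, its fields, and `shift_out` are hypothetical names, not LinkedIn tooling. In practice the route withdrawal happens in the routing layer and the health check is polled by the DNS providers.

```python
from dataclasses import dataclass


@dataclass
class PoP:
    """Toy model of one edge PoP (all names here are illustrative)."""
    name: str
    anycast_announced: bool = True    # are anycast routes advertised from this PoP?
    healthcheck_passing: bool = True  # does the edge proxy health check succeed?

    def withdraw_anycast(self) -> None:
        # Withdrawing the routes stops anycast traffic from reaching this PoP.
        self.anycast_announced = False

    def fail_healthcheck(self) -> None:
        # A failing health check signals DNS providers that this PoP is out
        # of rotation, which removes its unicast traffic.
        self.healthcheck_passing = False

    def traffic_types(self) -> set:
        """Return which kinds of traffic can still reach this PoP."""
        kinds = set()
        if self.anycast_announced:
            kinds.add("anycast")
        if self.healthcheck_passing:
            kinds.add("unicast")
        return kinds


def shift_out(pop: PoP) -> None:
    """Take a PoP fully out of rotation by applying both mechanisms."""
    pop.withdraw_anycast()
    pop.fail_healthcheck()
```

Calling `shift_out(PoP("pop-a"))` leaves the model with no traffic types, which mirrors a full edge shift away from that PoP.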

8

Edge Traffic shifts

9

Datacenter Traffic shifts: How does it work?

• Different traffic types are partitioned and controlled separately

  • Logged-in vs Logged-out

  • CDN

  • Monitoring

  • Microsites

• Logged-in users are placed into ‘buckets’ and have primary/secondary datacenter assignments

• Buckets are marked online/offline to move site traffic
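
The bucket mechanism can be sketched as follows. This is a minimal sketch, not the actual implementation: the bucket count, the hash-based member-to-bucket mapping, the datacenter names `dc-a`/`dc-b`, and the function names are all assumptions made for illustration. The slide only states that logged-in users have buckets with primary/secondary assignments and that buckets are marked online or offline.

```python
import hashlib

NUM_BUCKETS = 10  # assumed bucket count, for illustration only

# Hypothetical primary/secondary datacenter assignment per bucket.
ASSIGNMENTS = {
    b: ("dc-a", "dc-b") if b % 2 == 0 else ("dc-b", "dc-a")
    for b in range(NUM_BUCKETS)
}

# (bucket, datacenter) pairs that operators have marked offline.
offline = set()


def bucket_for(member_id: int) -> int:
    """Map a member to a stable bucket with a hash (an assumed scheme)."""
    digest = hashlib.sha256(str(member_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def route(member_id: int) -> str:
    """Pick the primary datacenter unless the bucket is offline there."""
    bucket = bucket_for(member_id)
    primary, secondary = ASSIGNMENTS[bucket]
    if (bucket, primary) in offline:
        return secondary
    return primary


def shift_out(datacenter: str) -> None:
    """Mark every bucket offline in one datacenter to drain its traffic."""
    for bucket in range(NUM_BUCKETS):
        offline.add((bucket, datacenter))
```

After `shift_out("dc-a")`, every member in this sketch routes to `dc-b`. Marking only some buckets offline would move a fraction of the traffic instead of all of it.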

10

Mitigating Impact: What a traffic shift looks like

11

Load testing: How do we do it?

12

Load testing: How do we do it?

13

Single Master Failover: How does it work?

• Only used in extreme cases

• Leverage distributed locking in Apache Zookeeper

• Single master services have a Spring component that checks the mastership of the service in a particular datacenter
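
The mastership check can be illustrated with a small Python sketch. It rests on loud assumptions: `InMemoryLockService` is an in-process stand-in for a distributed lock service and does not use or mimic the ZooKeeper API, and `MastershipChecker` is a hypothetical stand-in for the Spring component named on the slide. The lock path and datacenter names are invented for the example.

```python
import threading


class InMemoryLockService:
    """Stand-in for a distributed lock service such as ZooKeeper (not its API)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._holders = {}  # lock path -> datacenter currently holding it

    def try_acquire(self, path, owner):
        """Grant the lock if it is free or already held by this owner."""
        with self._guard:
            holder = self._holders.setdefault(path, owner)
            return holder == owner

    def release(self, path, owner):
        """Release the lock only if this owner holds it."""
        with self._guard:
            if self._holders.get(path) == owner:
                del self._holders[path]


class MastershipChecker:
    """Hypothetical per-datacenter component that checks mastership."""

    def __init__(self, locks, service, datacenter):
        self.locks = locks
        self.path = "/mastership/" + service  # assumed lock path layout
        self.datacenter = datacenter

    def is_master(self):
        # Only the datacenter holding the lock should run the service actively.
        return self.locks.try_acquire(self.path, self.datacenter)

    def step_down(self):
        # Giving up the lock lets another datacenter take over mastership.
        self.locks.release(self.path, self.datacenter)
```

In this sketch a failover amounts to the current master calling `step_down()`, after which the next datacenter to call `is_master()` takes the lock and becomes master.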

14

Single Master Failover: How does it work?

15

Conclusion

• The best way to prepare for a disaster is to practice one regularly!

• Tooling and automation are your best friends during an outage

• Capacity planning/management is extremely important

16

Questions?

Thank You

©2014 LinkedIn Corporation. All Rights Reserved.