SouthBay SRE Meetup, Jan 2016
Michael Kehoe Senior Site Reliability Engineer
SouthBay SRE Meetup: LinkedIn Traffic Shifting
$ whoami
Michael Kehoe
• Sr Site Reliability Engineer (SRE)
• Member of PROD-SRE
• https://www.linkedin.com/in/michaelkkehoe
Why do we do traffic shifts?
• To mitigate user impact from problems with a 3rd party provider or LinkedIn’s infrastructure/services
• To validate Disaster Recovery (DR) in case of any datacenter failure
• To validate and test capacity headroom across our datacenters
• To expose bugs and suboptimal configurations by load testing one or more datacenters
• To perform planned maintenance
• To validate and exercise the traffic shift automation
Edge Traffic shifts How does it work?
• We use IPVS to load balance at our edges
• We can withdraw anycast routes to remove traffic from that PoP
• Health checks on our edge proxy are tested by DNS providers to verify whether that PoP is in rotation
• We can fail those health checks to remove unicast traffic from that PoP
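The health-check mechanism above can be sketched roughly as follows. This is a minimal illustration, not LinkedIn's implementation: `PopState`, `health_check`, and the status/body values are hypothetical names chosen for the example, and the assumption is that a DNS provider treats any non-200 response as "out of rotation".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PopState:
    """Hypothetical rotation state for a single PoP."""
    name: str
    in_rotation: bool = True


def health_check(pop: PopState) -> Tuple[int, str]:
    """Return the (status, body) an edge proxy might serve to a DNS provider's probe."""
    if pop.in_rotation:
        return 200, "GOOD"
    # Deliberately failing the check signals the DNS provider to stop
    # directing unicast traffic to this PoP.
    return 503, "OUT_OF_ROTATION"


def fail_out(pop: PopState) -> None:
    """Take the PoP out of rotation by making its health check fail."""
    pop.in_rotation = False
```

Calling `fail_out(pop)` flips the check from passing to failing. Withdrawing anycast routes is a separate step handled at the routing layer and is not modelled here.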
Datacenter Traffic shifts How does it work?
• Different traffic types are partitioned and controlled separately
  • Logged-in vs Logged-out
  • CDN
  • Monitoring
  • Microsites
• Logged-in users are placed into ‘buckets’ and have primary/secondary datacenter assignments
• Buckets are marked online/offline to move site traffic
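The bucket model above might look something like the sketch below. The talk does not say how users map to buckets or how the online/offline flag is stored, so the CRC32 hash, the `Bucket` fields, and the `route` and `shift_away` helpers are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    """Hypothetical bucket with primary/secondary datacenter assignments."""
    primary: str
    secondary: str
    online: bool = True  # offline means "do not serve from the primary"


def bucket_for(member_id: str, num_buckets: int) -> int:
    """Map a member to a bucket with a stable hash (assumed scheme)."""
    return zlib.crc32(member_id.encode("utf-8")) % num_buckets


def route(member_id: str, buckets: List[Bucket]) -> str:
    """Pick the datacenter that should serve this logged-in member."""
    bucket = buckets[bucket_for(member_id, len(buckets))]
    return bucket.primary if bucket.online else bucket.secondary


def shift_away(buckets: List[Bucket], datacenter: str) -> int:
    """Mark offline every bucket whose primary is `datacenter`; return how many moved."""
    moved = 0
    for bucket in buckets:
        if bucket.primary == datacenter and bucket.online:
            bucket.online = False
            moved += 1
    return moved
```

Marking buckets offline one at a time, rather than all at once, would let traffic move gradually, which fits the capacity-validation goal mentioned earlier in the talk.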
Single Master Failover How does it work?
• Only used in extreme cases
• Leverage distributed locking in Apache ZooKeeper
• Single master services have a Spring component that checks the mastership of the service in a particular datacenter
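The mastership check can be illustrated with an in-memory stand-in for a ZooKeeper-style lock, where the contender with the lowest sequence number holds the lock. This is only a sketch of the idea: `InMemoryLock` and `is_master` are hypothetical names, nothing here talks to a real ZooKeeper ensemble, and the actual component described in the talk is a Spring (Java) component whose details are not given.

```python
import itertools
from typing import Dict, Optional


class InMemoryLock:
    """Stand-in for a distributed lock: the lowest sequence number wins."""

    def __init__(self) -> None:
        self._seq = itertools.count()
        self._contenders: Dict[str, int] = {}  # datacenter -> sequence number

    def join(self, datacenter: str) -> None:
        """Register a datacenter as a contender for mastership."""
        if datacenter not in self._contenders:
            self._contenders[datacenter] = next(self._seq)

    def leave(self, datacenter: str) -> None:
        """Drop a contender, as when its session ends or it is failed out."""
        self._contenders.pop(datacenter, None)

    def holder(self) -> Optional[str]:
        """Return the datacenter currently holding the lock, if any."""
        if not self._contenders:
            return None
        return min(self._contenders, key=self._contenders.get)


def is_master(lock: InMemoryLock, datacenter: str) -> bool:
    """The check a single-master service would run before acting as master."""
    return lock.holder() == datacenter
```

A failover in this model is just the current holder leaving, after which the next contender passes the `is_master` check.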
Conclusion
• The best way to prepare for a disaster is to practice one regularly!
• Tooling and automation are your best friends during an outage
• Capacity planning/management is extremely important