Kaushik Veeraraghavan - USENIX · Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia...
TRANSCRIPT
![Page 1: Kaushik Veeraraghavan - USENIX · Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and](https://reader033.vdocument.in/reader033/viewer/2022042621/5f6cc0ac590a343e3e12706c/html5/thumbnails/1.jpg)
Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song
Facebook Inc.
Each user request touches hundreds of systems
[Diagram: a webserver fanning out to Newsfeed, DB, Cache, PYML, Ads, Search, Everstore, Ptail, Coefficient, Laser, and Scribe]
The workload is constantly evolving
• Many products
• Growing user base
• Rapid software change
  • Facebook: 3 daily releases
  • Instagram: continuous release
Goals
• How many machines does each software system need?
• Can we serve peak load?
• Are we operating efficiently?
Common approaches to capacity management
• Load modeling: simulate how system behaves at high load
• Load testing: benchmark using synthetic workloads
[Diagram: a webserver fanning out to Newsfeed, DB, Cache, PYML, Ads, Search, Everstore, Ptail, Coefficient, Laser, and Scribe]
Live user traffic is the most representative workload
• Accurate distribution of reads & writes
• Do not need a custom test setup
Live traffic load tests measure peak serving capacity
• Direct live user traffic at target
Live traffic load tests measure peak serving capacity safely
• Monitor health metrics
  • Response latency
  • Server error
• Reset load when thresholds are hit
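The monitor-and-reset loop on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Kraken's actual implementation: the `Health` fields, the threshold values, the fixed step size, and the `read_health` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Health:
    latency_ms: float  # response latency observed at the target
    error_rate: float  # fraction of requests that returned a server error


# Hypothetical thresholds; real values would be chosen per system.
MAX_LATENCY_MS = 500.0
MAX_ERROR_RATE = 0.01


def is_healthy(h: Health) -> bool:
    """True while both health metrics stay within their thresholds."""
    return h.latency_ms <= MAX_LATENCY_MS and h.error_rate <= MAX_ERROR_RATE


def run_load_test(read_health: Callable[[float], Health],
                  baseline: float, step: float, max_load: float) -> float:
    """Raise load step by step and return the highest load that stayed healthy.

    `read_health(load)` is assumed to apply `load` to the target and report
    the resulting health. When a threshold is hit, load is reset to `baseline`.
    """
    load = baseline
    peak = baseline
    while load + step <= max_load:
        load += step
        if not is_healthy(read_health(load)):
            read_health(baseline)  # reset load to the safe baseline
            break
        peak = load
    return peak
```

A caller would supply a `read_health` that actually shifts traffic and samples metrics; a stub that degrades above some load is enough to exercise the loop.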
Roadmap
• Kraken measures peak serving capacity at all scales
  • A single webserver
  • A single cluster
  • An entire geographical region
• Kraken identifies bottlenecks limiting utilization
  • Load imbalance
  • Network saturation
• Challenges in deploying Kraken
Kraken uses weights to route requests
[Diagram: DNS, edge POPs, regions, frontend clusters, service clusters (Search, Newsfeed), and backend clusters, with edge weight, cluster weight, and webserver weight labels on the routing paths]
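One way to picture weight-based routing is a weighted random choice at each level. The sketch below is an assumption-laden illustration rather than Facebook's routing code: the function names, the two-level structure (edge weights pick a region, cluster weights pick a frontend cluster), and the example names are hypothetical.

```python
import random
from typing import Dict, Optional, Tuple


def pick(weights: Dict[str, float],
         rng: Optional[random.Random] = None) -> str:
    """Pick one destination with probability proportional to its weight."""
    chooser = rng if rng is not None else random
    names = list(weights)
    return chooser.choices(names, weights=[weights[n] for n in names], k=1)[0]


def route(edge_weights: Dict[str, float],
          cluster_weights: Dict[str, Dict[str, float]],
          rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Route one request: edge weights choose a region,
    then that region's cluster weights choose a frontend cluster."""
    region = pick(edge_weights, rng)
    cluster = pick(cluster_weights[region], rng)
    return region, cluster


# Hypothetical example: raising a weight sends proportionally more traffic.
edge = {"region-a": 3.0, "region-b": 1.0}
clusters = {
    "region-a": {"fe-1": 1.0, "fe-2": 1.0},
    "region-b": {"fe-3": 1.0},
}
print(route(edge, clusters))
```

A webserver-weight level could be added the same way by calling `pick` once more on the chosen cluster's server weights.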
Kraken measures a webserver’s peak serving capacity
• Peak webserver capacity: 175 requests per second (RPS)
• Production target: 90% utilization, i.e., 157 RPS
Kraken measures a cluster’s peak serving capacity
• Max cluster capacity = (webserver capacity) * (num. webservers in cluster)
[Figure annotation: 90%]
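The capacity arithmetic from the webserver and cluster slides can be worked through as below. The 175 RPS peak and 90% target come from the slides; the cluster size of 1,000 webservers is a made-up number for illustration, and rounding the per-server target down to a whole request is an assumption that happens to match the slide's 157 RPS.

```python
def production_target_rps(peak_rps: int, target_percent: int) -> int:
    """Per-webserver production target, rounded down to whole requests."""
    return peak_rps * target_percent // 100


def max_cluster_capacity(webserver_capacity_rps: int, num_webservers: int) -> int:
    """Max cluster capacity = (webserver capacity) * (num. webservers in cluster)."""
    return webserver_capacity_rps * num_webservers


peak = 175                                # measured peak per webserver (from the slide)
target = production_target_rps(peak, 90)  # 175 * 0.90 = 157.5, rounded down to 157
servers = 1000                            # hypothetical cluster size

print(target)                               # 157
print(max_cluster_capacity(peak, servers))  # 175000
```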
Kraken measures a region’s peak serving capacity
• We now serve 20% more users with the same infrastructure
[Figure: 74% (2015) vs. 90% (2016)]
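As a rough sanity check on the "20% more users" claim, one can compare the two figure values. This assumes that 74% and 90% are utilization levels for 2015 and 2016 and that the number of users served scales linearly with utilization; neither assumption is stated on the slide.

```python
def extra_users_fraction(old_utilization: float, new_utilization: float) -> float:
    """Relative gain in users served on the same hardware,
    assuming users served scale linearly with utilization."""
    return new_utilization / old_utilization - 1.0


gain = extra_users_fraction(0.74, 0.90)
print(f"{gain:.1%}")  # 21.6%
```

Under those assumptions the gain comes out slightly above 20%, which is consistent with the rounded figure on the slide.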
Inefficient load balancing limits utilization
Network saturation limits utilization
Challenge: non-linear response to traffic shifts
[Diagram: a webserver fanning out to Newsfeed, DB, Cache, PYML, Ads, Search, Everstore, Ptail, Coefficient, Laser, and Scribe, annotated with a question mark]
Challenge: how can we foster experimentation?
• Set conservative thresholds
• Communicate widely about tests
• Encourage collaboration
  • Monitoring
  • Failure mitigation strategies
Conclusion
• We’ve run 50+ regional, 1000+ cluster live traffic load tests in 3 years
• Kraken has helped us identify hundreds of bottlenecks and verify fixes
• We can now serve 20% more users with the same infrastructure
Kraken: assumptions and caveats
• Stateless servers
• Routable requests
• Load impacts downstream systems
Kraken: user traffic management
[Diagram: DNS, PoPs, datacenter regions, and a datacenter containing a web LB, frontend clusters, a service LB, and a service cluster, with edge weight, cluster weight, and server weight labels on the routing paths]
Kraken: live traffic load tests
[Diagram: within Kraken, the Health Monitor measures health, Feedback Control increases or resets load, and the Traffic Shifter updates weights; the other labeled components are DNS, edge POP, web LB, and frontend cluster]
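The "update weights" step can be illustrated with a small weight-shifting helper. This is only a sketch under assumptions: the function name `shift_traffic`, the rule of moving the same fraction from every other destination, and the example cluster names are hypothetical and not taken from the talk.

```python
from typing import Dict


def shift_traffic(weights: Dict[str, float], target: str,
                  fraction: float) -> Dict[str, float]:
    """Return new weights with `fraction` of every other destination's
    weight moved onto `target`. The total weight is unchanged."""
    if target not in weights:
        raise KeyError(target)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    new = dict(weights)
    for name, weight in weights.items():
        if name != target:
            delta = weight * fraction
            new[name] -= delta
            new[target] += delta
    return new


# Hypothetical example: move half of the other clusters' traffic onto fe-1.
before = {"fe-1": 0.5, "fe-2": 0.25, "fe-3": 0.25}
after = shift_traffic(before, "fe-1", 0.5)
print(after)  # {'fe-1': 0.75, 'fe-2': 0.125, 'fe-3': 0.125}
```

Resetting load after a threshold is hit would then amount to restoring the saved `before` weights.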
Health metrics for systems affected by web load

| Service type | Metrics |
| --- | --- |
| Web servers | CPU utilization, latency, error rate, fraction of operational servers |
| Aggregator–leaf | CPU utilization, error rate, response quality |
| Proxygen (software L7 load balancer) | CPU utilization, latency, connections, retransmit rate, Ethernet utilization, memory capacity utilization |
| Memcache | Latency, object lease count |
| TAO | CPU utilization, write success rate, read latency |
| Batch processor | Queue length, exception rate |
| Logging | Error rate |
| Search | CPU utilization |
| Service discovery | CPU utilization |
| Message delivery | CPU utilization |
Continuous runs measuring a webserver’s capacity
[Figure annotation: 175]
Some lessons from a thousand Kraken tests
• Simplicity is key to Kraken’s success.
• Identifying the right performance, error rate and latency metrics to track is difficult.
• Cheap solutions, like allocating capacity or fixing misconfiguration, are often more impactful than profile-based tuning or system redesign.