a datacenter infrastructure perspective...a datacenter infrastructure perspective re-emergence of...
TRANSCRIPT
![Page 1: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/1.jpg)
![Page 2: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/2.jpg)
Kim HazelwoodFacebook AI Infrastructure
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
![Page 3: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/3.jpg)
Re-Emergence of Machine Learning
0
500
1000
1500
2000
2500
3000
2001 2003 2005 2007 2009 2011 2013 2015 2017
Gradient-Based Learning Applied to Document Recognition, LeCun et al., 1998
Cit
ati
on
Co
un
t
![Page 4: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/4.jpg)
Why Now?
More Compute
Bigger
(and better) Data
Better Algorithms
![Page 5: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/5.jpg)
Machine Learning Execution Flow
Data Features Training Evaluation Inference
Offline Online
![Page 6: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/6.jpg)
Machine learning Execution Flow
Data Features Training Eval InferenceModel Predictions
![Page 7: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/7.jpg)
Let’s Answer Some Pressing Questions• How does Facebook leverage machine learning?
• Does Facebook design hardware? For machine learning?
• Does Facebook design machine learning platforms and frameworks?
• Are these hardware and software solutions available to the community?
• What assumptions break when scaling to 2B people?
![Page 8: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/8.jpg)
How Does facebook Use Machine Learning?
Face taggingNews Feed
Search Translation Ads
![Page 9: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/9.jpg)
Major Services and Use Cases
News Feed Ads Search
Language Translation
Sigma Facer Lumos
Speech Recognition
Content Understanding
Classification Service
![Page 10: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/10.jpg)
What ML Models Do We Leverage?
GBDTSVM MLP CNN RNN
Support Vector Machines
Gradient-Boosted Decision Trees
Multi-Layer Perceptron
Convolutional Neural Nets
Recurrent Neural Nets
Facer Sigma News Feed
Ads
Search
Sigma
Facer
Lumos
Language Translation
Content Understanding
Speech Rec
![Page 11: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/11.jpg)
How Often Do We Train Models?
minutes hours days months
FacerFeed/Ads
![Page 12: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/12.jpg)
How Long Does Training Take?
seconds minutes hours days
FacerLanguage
Translation
![Page 13: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/13.jpg)
How Much Compute Does Inference Consume?
100X 10x 1x
News Feed Sigma
![Page 14: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/14.jpg)
Let’s Answer Some Pressing Questions• How does Facebook leverage machine learning?
• Does Facebook design hardware? For machine learning?
• Does Facebook design machine learning platforms and frameworks?
• Are these hardware and software solutions available to the community?
• What assumptions break when scaling to 2B people?
![Page 15: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/15.jpg)
Facebook AI Ecosystem
Frameworks: Core ML SoftwareCaffe2 / PyTorch / etc
Platforms: Workflow Management, DeploymentFB Learner
Large-Scale InfrastructureServers, Storage, Network Strategy
![Page 16: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/16.jpg)
The Infrastructure View
DataOffline Training
Online Inference
Storage Challenges!
Network Challenges!
Compute Challenges!
![Page 17: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/17.jpg)
Does Facebook Design Hardware?• Yes! All designs released through the Open Compute Project since 2010
• Facebook Server Design Philosophy
– Identify a small number of major services with unique resource requirements
– Design servers for those major services
![Page 18: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/18.jpg)
Twin LakesFor the web tier and other “stateless services”
Open Compute “Sleds” are 2U x 3 Across in an Open Compute Rack
![Page 19: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/19.jpg)
Tioga PassFor compute or memory-intensive workloads:
![Page 20: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/20.jpg)
Bryce Canyon and LightningFor storage-heavy workloads
![Page 21: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/21.jpg)
Big BasinIn 2017, we transitioned from Big Sur to Big Basin GPU Servers for ML training
Big SurIntegrated Compute8 Nvidia M40 GPUs
Big BasinJBOG Design (CPU headnode)8 Nvidia V100 GPUs per Big Basin
![Page 22: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/22.jpg)
Let’s Answer Some Pressing Questions• How does Facebook leverage machine learning?
• Does Facebook design hardware? For machine learning?
• Does Facebook design machine learning platforms and frameworks?
• Are these hardware and software solutions available to the community?
• What assumptions break when scaling to 2B people?
![Page 23: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/23.jpg)
Facebook AI Frameworks
• Used for Production
• Stability
• Scale & Speed
• Data Integration
• Relatively Fixed
• Used for Research
• Flexible
• Fast Iteration
• Debuggable
• Less Robust
![Page 24: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/24.jpg)
Deep Learning Frameworks
Frameworkbackends
Vendor and numeric libraries
Apple CoreML Nvidia TensorRTIntel/Nervana
ngraphQualcom
SNPE
…
O(n2) pairs
![Page 25: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/25.jpg)
Shared model and operator representation
Open Neural Network Exchange
Frameworkbackends
Vendor and numeric libraries
Apple CoreML Nvidia TensorRTIntel/Nervana
ngraphQualcom
SNPE
…
From O(n2) to O(n) pairs
![Page 26: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/26.jpg)
Facebook AI Ecosystem
Frameworks: Core ML SoftwareCaffe2 / PyTorch / etc
Platforms: Workflow Management, DeploymentFB Learner
Large-Scale InfrastructureServers, Storage, Network Strategy
![Page 27: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/27.jpg)
FB Learner Platform
FB Learner Feature Store
FB Learner Flow
FB Learner Predictor
• AI Workflow • Model Management and Deployment
![Page 28: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/28.jpg)
Tying It All Together
Data Features Training Evaluation Inference
FBLearnerFeature Store
FBLearnerFlow
FBLearnerPredictor
Bryce Canyon Big Basin Tioga Pass Twin Lakes
![Page 29: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/29.jpg)
2 Billion People
What changes when you scale to over
![Page 30: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/30.jpg)
Santa Clara, CaliforniaAshburn, VirginiaPrineville, OregonForest City, North CarolinaLulea, SwedenAltoona, IowaFort Worth, TexasClonee, IrelandLos Lunas, New MexicoOdense, DenmarkNew Albany, OhioPapillion, Nebraska
![Page 31: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/31.jpg)
Scaling Challenges / Opportunities
Lots of Data Lots of Compute
![Page 32: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/32.jpg)
Scaling Challenges / Opportunities: Data
Lots of Data
Data quality (and potentially quantity) correlates well with user experience
Network design mattersGeographic locations matterDatabase configs matter
![Page 33: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/33.jpg)
Scaling Challenges / Opportunities: Compute
Lots of Compute
Can leverage idle resources on nights and weekends for “free”
Must consider geographic resource distribution
![Page 34: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/34.jpg)
Scaling Opportunity: Free Compute!
![Page 35: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/35.jpg)
Scaling Challenges: Disaster Recovery• Seamlessly handle the loss of an entire datacenter
• Geographic compute diversity becomes critical
• Delaying even the offline training portion of machine learning workloads has a measurable impact
![Page 36: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/36.jpg)
Key Takeaways
Facebook AILots of
DataWide variety
of models
Full stack challenges Global scale
![Page 37: A Datacenter Infrastructure Perspective...A Datacenter Infrastructure Perspective Re-Emergence of Machine Learning 0 500 1000 1500 2000 2500 3000 2001 2003 2005 2007 2009 2011 2013](https://reader030.vdocument.in/reader030/viewer/2022040612/5f0461517e708231d40db056/html5/thumbnails/37.jpg)