A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenters: Alexey Tumanov, Daniel Crankshaw
TRANSCRIPT
Presenters: Alexey Tumanov, Daniel Crankshaw
Corey Zumar, Giulio Zhou, Xin Wang, Michael Franklin, Joseph Gonzalez, Ion Stoica
BDD/RISE Mini-Retreat 2017 · May 2, 2017
Clipper: A Low-Latency Online Prediction Serving System
Learning Produces a Trained Model
[Diagram: Big Data + complex model → Training/Learning → trained Model]
[Diagram: Query → Model → Decision ("CAT")]
Serving
[Diagram: Big Data → Training/Learning → Model → Serving; an Application sends a Query to the served Model and receives a Decision]
Prediction-serving for interactive applications. Timescale: ~10s of milliseconds
Google Translate Serving
- 140 billion words a day [1]
- 82,000 GPUs running 24/7
- Invented new hardware: the Tensor Processing Unit (TPU)
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
Prediction-Serving Raises New Challenges
Prediction-Serving Challenges
- Support low-latency, high-throughput serving workloads
  - Models getting more complex: 10s of GFLOPs [1]
  - Using specialized hardware for predictions
  - Deployed on the critical path: must maintain SLOs under heavy load
- Large and growing ecosystem of ML models and frameworks
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.
Prediction-Serving Challenges
- Support low-latency, high-throughput serving workloads
- Large and growing ecosystem of ML models and frameworks
Applications across many domains: content recommendation, fraud detection, personal assistants, robotic control, machine translation
Big Companies Build One-Off Systems
Problems:
- Expensive to build and maintain
- Highly specialized and require both ML and systems expertise
- Tightly-coupled model and application: difficult to change or update the model
- Only supports a single ML framework
- Varying physical resource requirements
- Difficult to deploy and brittle to manage
Large and growing ecosystem of ML models and frameworks
But most companies can't build new serving systems…
Use existing systems: Offline Scoring
[Diagram: Big Data → Training → Model; a batch analytics job scores all inputs X, producing predictions Y that are written to a datastore]
Use existing systems: Offline Scoring
Low-Latency Serving
[Diagram: the Application sends a Query, looks up the precomputed decision (X → Y) in the datastore, and receives the Decision]
Use existing systems: Offline Scoring
Problems:
- Requires the full set of queries ahead of time
  - Small and bounded input domain
- Wasted computation and space
  - Can render and store unneeded predictions
- Costly to update
  - Re-run the batch job
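To make the offline-scoring pattern concrete, here is a minimal sketch (not part of Clipper): a batch job precomputes a prediction for every anticipated input and writes it to a datastore, and the online path is just a lookup. The TrainedModel class, the dict-based datastore, and the example queries are hypothetical placeholders.

# Minimal sketch of the offline-scoring pattern described above.
# TrainedModel and the dict-based datastore are hypothetical placeholders.

class TrainedModel:
    def predict(self, x):
        return len(x) % 2          # stand-in for a real model's output

def batch_score(model, candidate_inputs):
    """Batch job: precompute a prediction for every anticipated query."""
    return {x: model.predict(x) for x in candidate_inputs}

# Offline: score the (small, bounded) input domain and store the results.
datastore = batch_score(TrainedModel(), ["query-a", "query-b"])

def serve(query):
    """Online path: a low-latency datastore lookup instead of model inference."""
    if query not in datastore:
        # Queries outside the precomputed domain require re-running the batch job.
        raise KeyError(f"no precomputed prediction for {query!r}")
    return datastore[query]

print(serve("query-a"))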
Prediction-Serving Challenges
- Support low-latency, high-throughput serving workloads
- Large and growing ecosystem of ML models and frameworks
Clipper Solutions
How does Clipper address these challenges?
- Simplifies deployment through a layered architecture
- Serves many models across ML frameworks concurrently
- Employs caching, batching, and scale-out for high-performance serving
Clipper Decouples Applications and Models
[Diagram: Applications issue Predict and Feedback calls over Clipper's RPC/REST interface; Clipper talks over RPC to model containers (MC), e.g., a Caffe model container]
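To make the decoupling concrete, the sketch below shows how an application might issue a Predict call against Clipper's REST interface. The host and port (assumed here to be the query frontend's default), the application name "digit-classifier", and the input encoding are illustrative assumptions; consult the Clipper documentation for the exact request format.

# Sketch of an application-side Predict call over Clipper's REST interface.
# The host/port, app name, and input encoding are illustrative assumptions.
import requests

CLIPPER_URL = "http://localhost:1337"   # query frontend (assumed default port)
APP_NAME = "digit-classifier"           # hypothetical registered application

def predict(input_vector):
    """Send one query to Clipper and return its decision."""
    resp = requests.post(
        f"{CLIPPER_URL}/{APP_NAME}/predict",
        json={"input": input_vector},
        timeout=1.0,                    # keep the caller's latency bounded
    )
    resp.raise_for_status()
    return resp.json()

print(predict([0.0] * 784))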
Clipper Architecture
[Diagram: Applications → Predict/Observe RPC/REST interface → Clipper → RPC → model containers (MC), e.g., Caffe]
- Model Abstraction Layer: provides a common interface to models while bounding latency and maximizing throughput
- Model Selection Layer: improves accuracy through bandit methods and ensembles, online learning, and personalization
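The Model Abstraction Layer relies on each model being wrapped in a container that exposes a uniform batch-predict interface, so Clipper can treat models from different frameworks identically. Below is a minimal illustrative sketch of such a wrapper; the class name, signature, and toy model are assumptions, not Clipper's actual container API.

# Minimal sketch of the common container interface assumed by the Model
# Abstraction Layer. Illustrative only; not Clipper's actual container API.
from typing import Callable, List

class ModelContainer:
    """Wraps a framework-specific predict function behind a uniform
    batch interface: a list of inputs in, one serialized output per input."""

    def __init__(self, predict_fn: Callable[[list], object]):
        self.predict_fn = predict_fn

    def predict_batch(self, inputs: List[list]) -> List[str]:
        return [str(self.predict_fn(x)) for x in inputs]

# Any framework's model can be wrapped the same way, e.g. a toy "model":
container = ModelContainer(lambda x: sum(x) > 0)
print(container.predict_batch([[1.0, -2.0], [3.0, 0.5]]))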
Clipper Architecture
[Diagram: the same layered stack, showing the key mechanism in each layer]
- Model Selection Layer: selection policy
- Model Abstraction Layer: caching, adaptive batching (caching illustrated below)
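As a rough illustration of the caching component, the sketch below memoizes (model, version, input) → prediction so repeated queries can skip model inference. The LRU eviction policy and capacity are illustrative choices, not necessarily what Clipper implements.

# Sketch of a prediction cache keyed on (model, version, input).
# LRU eviction and the 4096-entry capacity are illustrative choices.
from collections import OrderedDict

class PredictionCache:
    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, model: str, version: int, x):
        key = (model, version, x)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as recently used
            return self.entries[key]
        return None

    def put(self, model: str, version: int, x, prediction):
        key = (model, version, x)
        self.entries[key] = prediction
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

cache = PredictionCache()
cache.put("cat-detector", 1, ("img-123",), "CAT")
print(cache.get("cat-detector", 1, ("img-123",)))  # cache hit -> "CAT"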
Clipper Model Selection Layer: Selection Policy
Selection policies supported by Clipper:
- Exploit multiple models to estimate confidence
- Use multi-armed bandit algorithms to learn the optimal model selection online
- Online personalization across ML frameworks
*See paper for details [NSDI 2017]
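As a rough illustration of bandit-based selection, the sketch below uses a generic Exp3-style algorithm to pick among deployed models and update weights from observed feedback. This is not Clipper's exact policy (see the NSDI 2017 paper for those); the reward definition, model names, and gamma parameter are assumptions.

# Generic Exp3-style multi-armed bandit for choosing among deployed models.
# Illustrative only; Clipper's actual selection policies are in the paper.
import math
import random

class Exp3ModelSelector:
    def __init__(self, model_names, gamma=0.1):
        self.models = list(model_names)
        self.gamma = gamma
        self.weights = {m: 1.0 for m in self.models}

    def _probs(self):
        total = sum(self.weights.values())
        k = len(self.models)
        return {m: (1 - self.gamma) * self.weights[m] / total + self.gamma / k
                for m in self.models}

    def select(self):
        probs = self._probs()
        r, acc = random.random(), 0.0
        for m in self.models:
            acc += probs[m]
            if r <= acc:
                return m
        return self.models[-1]

    def observe(self, model, reward):
        """reward in [0, 1], e.g. 1 if user feedback confirmed the prediction."""
        probs = self._probs()
        estimated = reward / probs[model]   # importance-weighted reward
        self.weights[model] *= math.exp(self.gamma * estimated / len(self.models))

selector = Exp3ModelSelector(["caffe-resnet", "sklearn-svm"])
chosen = selector.select()
selector.observe(chosen, reward=1.0)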
Batching to Improve Throughput
A single page load may generate many queries
- Why batching helps:
  - Hardware acceleration
  - Helps amortize system overhead
- Optimal batch size depends on:
  - Hardware configuration
  - Model and framework
  - System load
Adaptive Batching to Improve Throughput
Clipper solution: adaptively trade off latency and throughput (sketched below)
- Increase the batch size until the latency objective is exceeded (Additive Increase)
- If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
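A minimal sketch of the additive-increase/multiplicative-decrease controller described above. The step size, backoff fraction, and maximum batch size are illustrative constants, not Clipper's tuned values.

# AIMD batch-size controller, as described on the slide.
# The additive step (+1), backoff factor (0.9), and cap are illustrative.
class AIMDBatchSizer:
    def __init__(self, latency_slo_ms, step=1, backoff=0.9, max_batch=1024):
        self.latency_slo_ms = latency_slo_ms
        self.step = step
        self.backoff = backoff
        self.max_batch = max_batch
        self.batch_size = 1

    def update(self, observed_batch_latency_ms):
        """Call after each batch with its measured end-to-end latency."""
        if observed_batch_latency_ms > self.latency_slo_ms:
            # Multiplicative decrease: cut the batch size by a fraction.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # Additive increase: probe for a larger batch under the SLO.
            self.batch_size = min(self.max_batch, self.batch_size + self.step)
        return self.batch_size

sizer = AIMDBatchSizer(latency_slo_ms=20.0)
for latency in [5.0, 8.0, 12.0, 25.0, 9.0]:   # hypothetical measurements
    print(sizer.update(latency))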
Conclusion
- Prediction-serving is an important and challenging area for systems research
  - Support low-latency, high-throughput serving workloads
  - Serve a large and growing ecosystem of ML frameworks
- Clipper is a first step towards addressing these challenges
  - Simplifies deployment through a layered architecture
  - Serves many models across ML frameworks concurrently
  - Employs caching, adaptive batching, and container scale-out to meet interactive serving workload demands
  - Going beyond an academic prototype to build a real, open-source system
https://github.com/ucbrise/clipper
[email protected]