a low-latency online prediction serving system · 2017-07-10 · presenter: alexey tumanov daniel...

29
Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, Guilio Zhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion Stoica BDD/RISE Mini-Retreat 2017 May 2, 2017 A Low-Latency Online Prediction Serving System Clipper

Upload: others

Post on 12-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Presenter: Alexey TumanovDaniel Crankshaw

Corey Zumar, Guilio Zhou, Xin WangMichael Franklin, Joseph Gonzalez, Ion Stoica

BDD/RISE Mini-Retreat 2017May 2, 2017

A Low-Latency Online Prediction Serving System

Clipper

Page 2: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

BigData

Complex Model

Training

Learning

Page 3: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

BigData

Complex Model

Training

Learning

Page 4: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Learning Produces a Trained Model

“CAT”

Query Decision

Model

Page 5: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

BigData

Training

Learning

Application

Decision

Query

?

Serving

Model

Page 6: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

BigData

Training

Learning

Application

Decision

Query

Model

Prediction-Serving for interactive applicationsTimescale: ~10s of milliseconds

Serving

Page 7: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Google Translate

Serving

82,000 GPUs running 24/7

[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

140 billion words a day1Invented New Hardware!Tensor Processing Unit

(TPU)

Page 8: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Prediction-Serving Raises New Challenges

Page 9: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Prediction-Serving Challenges

Support low-latency, high-throughput serving workloads

???Create VWCaffe 9

Large and growing ecosystem of ML models and frameworks

Page 10: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Models getting more complexØ 10s of GFLOPs [1]

Support low-latency, high-throughput serving workloads

Using specialized hardware for predictions

Deployed on critical pathØ Maintain SLOs under heavy load

[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.

Page 11: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Prediction-Serving Challenges

Support low-latency, high-throughput serving workloads

???Create VWCaffe 11

Large and growing ecosystem of ML models and frameworks

Page 12: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Large and growing ecosystem of ML models and frameworks

???

ContentRec.

FraudDetection

PersonalAsst.

RoboticControl

MachineTranslation

Create VWCaffe 12

Page 13: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Big Companies Build One-Off Systems

Problems:Ø Expensive to build and maintain

Ø Highly specialized and require ML and systems expertise

Ø Tightly-coupled model and applicationØ Difficult to change or update model

Ø Only supports single ML framework

Page 14: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

???

ContentRec.

FraudDetection

PersonalAsst.

RoboticControl

MachineTranslation

Create VWCaffe

Varying physicalresource requirements

Difficult to deploy andbrittle to manage

Large and growing ecosystem of ML models and frameworks

Page 15: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

But most companies can’t build new

serving systems…

Page 16: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

BigData

Batch Analytics

Model

Training

Use existing systems: Offline Scoring

Page 17: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

BigData

Model

Training

Batch Analytics

ScoringX Y

DatastoreUse existing systems: Offline Scoring

Page 18: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Application

Decision

Query

Look up decision in datastore

Low-Latency Serving

X Y

Use existing systems: Offline Scoring

Page 19: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Application

Decision

Query

Look up decision in datastore

Low-Latency Serving

X Y

Problems:Ø Requires full set of queries ahead of time

Ø Small and bounded input domainØ Wasted computation and space

Ø Can render and store unneeded predictionsØ Costly to update

Ø Re-run batch job

Use existing systems: Offline Scoring

Page 20: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Prediction-Serving Challenges

Support low-latency, high-throughput serving workloads

???Create VWCaffe 20

Large and growing ecosystem of ML models and frameworks

Page 21: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

How does Clipper address these challenges?

Page 22: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

q Simplifies deployment through layered architecture

q Serves many models across ML frameworks concurrently

q Employs caching, batching, scale-out for high-performance serving

Clipper Solutions

Page 23: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Clipper

Predict FeedbackRPC/REST Interface

CaffeMC MC MC

RPC RPC RPC RPC

Clipper Decouples Applications and Models

Applications

Model Container (MC)

Page 24: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Clipper Architecture

Clipper

Caffe

ApplicationsPredict ObserveRPC/REST Interface

MC MC MCRPC RPC RPC RPC

Model Abstraction LayerProvide a common interface to modelswhile bounding latency and maximizing throughput.

Model Selection LayerImprove accuracy through bandit methods and ensembles, online learning, and personalization

Model Container (MC)

Page 25: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

Clipper Architecture

Clipper

Caffe

ApplicationsPredict ObserveRPC/REST Interface

MC MC MCRPC RPC RPC RPC

Model Selection LayerSelection Policy

Model Abstraction LayerCaching

Adaptive Batching

Model Container (MC)

Page 26: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

ClipperModel Selection LayerSelection Policy

Selection policies supported by ClipperØExploit multiple models to estimate confidenceØ Use multi-armed bandit algorithms to learn

optimal model-selection onlineØ Online personalization across ML frameworks

*See paper for details [NSDI 2017]

Page 27: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

A single page load may generatemany queries

Batching to Improve ThroughputØ Optimal batch depends on:

Ø hardware configurationØ model and frameworkØ system load

Ø Why batching helps:

HardwareAcceleration

Helps amortizesystem overhead

Page 28: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

A single page load may generatemany queries

Clipper Solution:

Adaptively tradeoff latency and throughput…

Ø Inc. batch size until the latency objective is exceeded (Additive Increase)

Ø If latency exceeds SLO cut batch size by a fraction (Multiplicative Decrease)

Ø Why batching helps:

HardwareAcceleration

Helps amortizesystem overhead

Ø Optimal batch depends on:Ø hardware configurationØ model and frameworkØ system load

Batching to Improve ThroughputAdaptive

Page 29: A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenter: Alexey Tumanov Daniel Crankshaw Corey Zumar, GuilioZhou, Xin Wang Michael Franklin, Joseph Gonzalez, Ion

ConclusionØ Prediction-serving is an important and challenging area for systems

researchØ Support low-latency, high-throughput serving workloadsØ Serve large and growing ecosystem of ML frameworks

Ø Clipper is a first step towards addressing these challengesØ Simplifies deployment through layered architectureØ Serves many models across ML frameworks concurrentlyØ Employs caching, adaptive batching, container scale-out to meet interactive

serving workload demandsØ Beyond academic prototype to build a real, open-source system

https://github.com/ucbrise/[email protected]