machine learning model life cycle management in production

Post on 01-Jan-2022

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

harish@datatron.comjerry@datatron.com

Machine Learning Model Life Cycle Management

In Production - #aiops #mlops

datatron

Who are we?

�2

Harish DoddiCEO

• Lyft Surge Pricing Model• Twitter’s Distributed Photo

Storage - 12 PB• Architected Snapchat Stories -

scaled from 0 to billions

Jerry XuCTO

• Lyft ETA Machine Learning Model - replaced Google

• Twitter’s Manhattan - replaced Cassandra

• Founding team of Windows Azure

Team: Previously worked at places like Amazon, Microsoft, and AnacondaHeadquarters: 350 Townsend Street, Suite 204, San Francisco, CA

datatron

Today’s Enterprises Data Science Lifecycle

�3

Development

Data Science

Initial AnalysisModel development using multiple frameworksTraining and Experimenting

Production

Engineering DevOps

DeploymentWorkflow processRoll back strategyA/B Testing

Post-Production

DevOps

Model Performance MonitoringInfrastructure MonitoringScalingModel Governance

datatron�4

Production and Post-Production

Deployment, Monitoring & Governance

Data Aggregation

Discovery & Analysis

Training & Experimenting

datatron

Lesson 1

Need for single centralized production teamand platform

�5

datatron�6

Fraud Data Science

MarketingData Science

Credit Risk Data Science

Production infrastructure

Production infrastructure

Production infrastructure

Financial Institutions Production Environment

datatron

Hidden Technical Debt in Machine Learning Systems

�7

Google PaperHidden Technical Debt in Machine Learning Systems

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. … it

is remarkably easy to incur massive ongoing maintenance costs at the system level when

applying machine learning.

Boundary erosionEntanglementHidden feedback loopsUndeclared consumersData dependenciesChanges in the external worldSystem-level anti-patterns

datatron�8

Fraud Data Science

Horizontal DEVOPS AI team

MarketingData Science

Credit Risk Data Science

Production Model Deployment

Centralized Production AI Platform like Datatron

datatron

Lesson 2

Start adopting best practices and standardization early

�9

datatron�10

Old Way

Data Science

End User

Machine Learning

Engineering DevOps

• Teams operate in silos, don’t speak the same language• Errors due to lack of communication• Engineering has to write stand-alone scripts

Production Model

Teams face cross-functional

inefficiencies

datatron

Central devops platform for AI

models streamlines the process and

reduces the inefficiencies

�11

Data Science

End User

platformdatatron

DevOps

Production Model

New Way

datatron

Lesson 3

Data Science Universe shouldn’t be limited

�12

datatron�13

Model Containerization

Data Scientist Universe is un-limited

Team

Upcoming Frameworks

New Way

Team Team

Model

TensorFlowModel

scikit-learnModel

Apache Spark

No Model Containerization

Data Scientist Universe is limited

Old Way

TeamTeam Team

Model

datatron�14

datatron

Framework Agnostic

LibraryAgnostic

LanguageAgnostic

Infrastructure Agnostic

SageMaker

datatron

Lesson 4

Models may go wrong, you need to monitor them continuously

�15

datatron�16

South Park and Alexa

datatron�17

RecommendationModel

Features

Business is continuously losing value

Garbage

Old Way

RecommendationModel

Features

Recommendation

Minimize the continuous loss

Alert!

New Way

Alert!

datatron�18

New Way

# o

f Fra

ud

Tra

nsa

ctio

ns

Jan Feb Mar Apr May Jun

Reality

Model Prediction

Anomalies Detected

Production Model Monitoring

Allows You To Act Preemptively

Old Way

TOOLATE

# o

f Fra

ud

Tra

nsa

ctio

ns

Post-mortem reports

datatron

Model Performance monitoring• Confusion Matrix• Gain and Lift charts• Kolomogorov Smirnov chart• Area Under the ROC curve• Gini Coefficient• Concordant – Discordant ratio• Root Mean Squared Error (RMSE)• etcModel Timeout monitoringInfrastructure monitoringOrganization KPI monitoringAnomoly monitoring

�19

Monitoring for Machine Learning Models

datatron

Lesson 5

You either NEVER deploy a model, or you have to do it over and over again

�20

datatron

ML model is a continuously optimizing process

�21

Concept driftNew concept comes up

Model Building

Model Deployment

Model Monitoring

Model Management

datatron

Connecting Machine Learning to Software world

�22

Before Now

Software deployment once a 1 or 2 years

Software deployment every day

Future

Machine Learning models will deploy

very frequent and fast

Machine Learning models deploy

very slow

BUT

datatron

Lesson 6

Model Governance should be automated as much as possible

�23

datatron�24

Log

Comments

Versions

Model Versioning

New Way

Production Model Governance

Each Model Action Is Logged

Old Way

Comments

Version Log

LogLog

Log

Time for an audit

datatron

Lesson 7

Be prepared, your number of model will grow

�25

datatron�26

Deployment Learning: 1 model vs Multiple models

datatron

Value Proposition: Cost per order of magnitude

�27

Without Model Management

# of models

With Model Management

As the number of models increases, the cost also increases

As the number of models increases, the cost significantly decreases

# of models

Cost per model

Cost per model

datatron

Lesson 8

Senior people are required at later stages

�28

datatron

Software Development vs ML Model Development

�29

EvolutionTestingImplementationDesignRequirements

Monitoring and

Optimization

Deploy to Production

Training / Testing

Data Preparation

Requirements

Senior People

Senior People

datatron

Lesson 9

Your real work starts AFTER you deploy the model to production

�30

datatron

Enterprise AI Life Cycle

�31

Exploration Training Deploy

datatron

Model performance Model latency Infrastructure monitoring

Feature distribution Model result

Model routing Challenger KPI based selection

Model timeout Fall back strategy Alerting

Split traffic Shadowing Multi-Armed Bandit Optimization

Blue-Green deployment Rollback Canary

Enterprise AI Life Cycle After Deployment

�32

Deploy A/B Testing SLA

Model Selection

Anomaly Detection

Monitoring

Replay End-to-end tracing Logging

Troubleshooting

Cluster management integration Security check

IT Env Integration

Versioning History Approval process

Auditing

Tensorflow, sklearn, H2O, SAS, SparkML etc.

ML Frameworks

Support

HARISH@DATATRON.COM

Thank you!

top related