DataEngConf SF16 - Three lessons learned from building a production machine learning system


TRANSCRIPT

Page 1: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Three lessons learned from building a production machine learning system

Michael Manapat Stripe @mlmanapat

Page 2: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Fraud

• Card numbers are stolen by hacking, malware, etc.

• “Dumps” are sold in “carding” forums

• Fraudsters use numbers in dumps to buy goods, which they then resell

• Cardholders dispute transactions

• Merchant ends up bearing cost of fraud

Page 3: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• We train binary classifiers to predict fraud

• We use open source tools

• Scalding/Summingbird for feature generation

• scikit-learn for model training (eventually: github.com/stripe/brushfire)
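To make the setup concrete, here is a minimal sketch of training and validating a binary fraud classifier with scikit-learn. The data, features, and model choice are hypothetical placeholders, not Stripe's actual pipeline.

```python
# Minimal sketch: train and validate a binary fraud classifier with scikit-learn.
# The data and feature matrix below are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# X: one row of charge features per transaction; y: 1 if the charge was later
# disputed as fraud, 0 otherwise (labels arrive ~60 days after the charge).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))             # placeholder feature matrix
y = (rng.random(10_000) < 0.05).astype(int)   # placeholder fraud labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score validation charges and apply a blocking threshold on the fraud probability.
scores = model.predict_proba(X_val)[:, 1]
blocked = scores > 0.5
print("precision:", precision_score(y_val, blocked, zero_division=0))
print("recall:", recall_score(y_val, blocked, zero_division=0))
```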

Page 4: DataEngConf SF16 - Three lessons learned from building a production machine learning system

1

Don’t treat models as black boxes

Page 5: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Early ML at Stripe

• Focused on training with more and more data and adding more and more features

• Didn’t think much about

• ML algorithms (e.g., tuning hyperparameters)

• The deeper reasons behind any particular set of results

Substantial reduction in fraud rate

Page 6: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Product development

From a product standpoint:

• We were blocking high-risk charges and surfacing only the decision

• We wanted to provide Stripe users with insight into our actions: the reasons behind the scores

Page 7: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Score reasons

Example decision tree (leaf value = score, with the count of training examples at each leaf in parentheses; left branch = predicate True, right = False):

                X < 10
          True /      \ False
           Y < 5      X < 15
           /    \      /    \
     0.1 (20) 0.3 (30) 0.5 (10) 0.9 (40)

X = 5, Y = 3: score = 0.1

Which feature is "driving" the score more?

Page 8: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Score reasons

Hold out feature X (predicates on X become unknown):

                X < ?
               /     \
           Y < 5      X < ?
           /    \      /    \
     0.1 (20) 0.3 (30) 0.5 (10) 0.9 (40)

With X unknown and Y = 3, the charge can land in the 0.1 (20), 0.5 (10), or 0.9 (40) leaves; averaging those leaves weighted by their training counts gives the holdout score:

X = ?, Y = 3: (20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 = 0.61

Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51

Now producing richer reasons with multiple predicates
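A minimal sketch of this holdout computation on the toy tree above; the tree encoding and function names are illustrative, not the production implementation.

```python
# Sketch: estimate how much a single feature "drives" a decision-tree score by
# re-scoring the example with that feature held out (unknown), averaging the
# reachable leaves weighted by their training counts. Toy tree from the slides.

# Internal node: (feature, threshold, true_branch, false_branch).
# Leaf: (score, training_count).
TREE = ("X", 10,
        ("Y", 5, (0.1, 20), (0.3, 30)),
        ("X", 15, (0.5, 10), (0.9, 40)))

def score(node, example, held_out=None):
    """Return (weighted score, total weight) for `example`.

    If a predicate tests the held-out feature, both branches are explored and
    combined, weighted by the training counts under each branch."""
    if len(node) == 2:                       # leaf: (score, count)
        return node[0] * node[1], node[1]
    feature, threshold, t_branch, f_branch = node
    if feature == held_out:
        ts, tw = score(t_branch, example, held_out)
        fs, fw = score(f_branch, example, held_out)
        return ts + fs, tw + fw
    branch = t_branch if example[feature] < threshold else f_branch
    return score(branch, example, held_out)

example = {"X": 5, "Y": 3}
orig_total, orig_weight = score(TREE, example)
hold_total, hold_weight = score(TREE, example, held_out="X")
original = orig_total / orig_weight                   # 0.1
holdout = hold_total / hold_weight                    # (20*0.1 + 10*0.5 + 40*0.9) / 70 ≈ 0.61
print("score delta for X:", abs(holdout - original))  # ≈ 0.51
```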

Page 9: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Model introspection

If a model didn't look good in validation, it wasn't clear what to do (besides trying more features/data)

What if we used our “score reasons” to debug model issues?

Page 10: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• Take all false positives (in validation data or in production) and group them by generated reason (see the sketch after this list)

• Were a substantial fraction of the false positives driven by a few features?

• Did all the comparisons in the explanation predicates make sense? (Were they comparisons a human might make for fraud?)

• Our models were overfit!
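A minimal sketch of the grouping step, assuming a hypothetical log of false positives with a `top_reason` field:

```python
# Sketch: group false positives by the feature named in their top score reason
# to see whether a few features drive most of the mistakes.
# `false_positives` and its field/reason names are hypothetical placeholders.
from collections import Counter

false_positives = [
    {"charge_id": "ch_1", "top_reason": "card_country_mismatch"},
    {"charge_id": "ch_2", "top_reason": "card_country_mismatch"},
    {"charge_id": "ch_3", "top_reason": "amount_vs_user_median"},
    # ... all false positives from validation or production
]

reason_counts = Counter(fp["top_reason"] for fp in false_positives)
total = sum(reason_counts.values())
for reason, count in reason_counts.most_common(10):
    print(f"{reason}: {count} ({count / total:.0%} of false positives)")
```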

Page 11: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Actioning insights

• Hyperparameter optimization (see the sketch below)

• Feature selection

[Plot: precision vs. recall]
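For example, a minimal scikit-learn sketch of a cross-validated hyperparameter search; the placeholder data, parameter grid, and scoring choice are illustrative assumptions.

```python
# Sketch: tune tree-ensemble hyperparameters with cross-validated grid search
# instead of accepting defaults. Grid, data, and scoring are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5_000, 20))             # placeholder features
y_train = (rng.random(5_000) < 0.05).astype(int)   # placeholder labels

param_grid = {
    "max_depth": [4, 8, 16, None],       # shallower trees can reduce overfitting
    "min_samples_leaf": [1, 10, 100],
    "n_estimators": [100, 300],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="average_precision",   # the precision/recall tradeoff matters for fraud
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```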

Page 12: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Summary

• Don’t treat models as black boxes

• Thinking about the learning process (vs. just features and data) can yield significant payoffs

• Tooling for introspection can accelerate model development/“debugging”

Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie, Jocelyn Ross, Tom Switzer

Page 13: DataEngConf SF16 - Three lessons learned from building a production machine learning system

2

Have a plan for counterfactual evaluation

Page 14: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• December 31st, 2013

• Train a binary classifier for disputes on data from Jan 1st to Sep 30th

• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)

• Based on validation data, pick a policy for actioning scores: block if score > 50
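A minimal sketch of picking such a blocking threshold from validation data; the 90% precision target and the placeholder data are assumptions for illustration.

```python
# Sketch: choose a score threshold ("block if score > T") from validation data
# by inspecting the precision/recall tradeoff. The 90% precision target is an
# illustrative assumption, not a quoted business requirement.
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation labels and scores (stand-ins for the Oct 1 - Oct 31
# charges, whose dispute labels arrive ~60 days later).
rng = np.random.default_rng(0)
y_val = (rng.random(5_000) < 0.05).astype(int)
scores = np.clip(rng.normal(30 + 40 * y_val, 15), 0, 100)  # fraud scores higher on average

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# Pick the lowest threshold that reaches the (illustrative) precision target.
target_precision = 0.90
ok = precision[:-1] >= target_precision    # precision has one extra trailing entry
threshold = thresholds[ok][0] if ok.any() else thresholds[-1]
print(f"policy: block if score > {threshold:.1f}")
```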

Page 15: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Questions (1)

• Business complains about high false positive rate: what would happen if we changed the policy to "block if score > 70"?

• What are the production precision and recall of the model?

Page 16: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• December 31st, 2014. We repeat the exercise from a year earlier

• Train a model on data from Jan 1st to Sep 30th

• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)

• Validation results look ~ok (but not great)

• We put the model into production and the results are terrible

Page 17: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Questions (2)

• Why did the validation results for the new model look so much worse?

• How do we know if the retrained model really is better than the original model?

Page 18: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Counterfactual evaluation

• Our model changes reality (the world is different because of its existence)

• We can answer some questions (around model comparisons) with A/B tests

• For all these questions, we want an approximation of the charge/outcome distribution that would exist if there were no model

Page 19: DataEngConf SF16 - Three lessons learned from building a production machine learning system

One approach

• Probabilistically reverse a small fraction of our block decisions

• The higher the score, the lower the probability we let the charge through

• Weight samples by 1 / P(allow)

• Get information on the area we want to improve on
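A minimal sketch of this scheme; the specific propensity function is an illustrative choice (picked to be consistent with the example numbers on the next slides), not the production rule.

```python
# Sketch: probabilistically reverse a small fraction of block decisions so we
# still observe outcomes for high-score charges. The propensity function below
# is illustrative: the higher the score, the lower P(allow).
import random

BLOCK_THRESHOLD = 50

def p_allow(score):
    """Probability of letting a charge through despite the blocking policy."""
    if score <= BLOCK_THRESHOLD:
        return 1.0                                   # policy would allow it anyway
    return max(0.0005, 0.30 - 0.01 * (score - 55))   # decays as the score rises

def select_action(score, rng=random):
    p = p_allow(score)
    action = "Allow" if rng.random() < p else "Block"
    return action, p

# Charges that are let through are later labeled (OK / Fraud) and weighted by
# 1 / P(allow) in any counterfactual estimate.
for score in (10, 45, 55, 65, 100):
    action, p = select_action(score)
    print(score, action, p, "weight:", 1 / p if action == "Allow" else None)
```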

Page 20: DataEngConf SF16 - Three lessons learned from building a production machine learning system

ID   Score   P(Allow)   Original Action   Selected Action   Outcome
1    10      1.0        Allow             Allow             OK
2    45      1.0        Allow             Allow             Fraud
3    55      0.30       Block             Block             -
4    65      0.20       Block             Allow             Fraud
5    100     0.0005     Block             Block             -
6    60      0.25       Block             Allow             OK

Page 21: DataEngConf SF16 - Three lessons learned from building a production machine learning system

ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
1    10      1.0        1        Allow             Allow             OK
2    45      1.0        1        Allow             Allow             Fraud
4    65      0.20       5        Block             Allow             Fraud
6    60      0.25       4        Block             Allow             OK

Evaluating the "block if score > 50" policy:

Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
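A minimal sketch of computing these weighted estimates from the charges that were let through; the rows mirror the table above.

```python
# Sketch: estimate production precision/recall of a "block if score > 50"
# policy from charges that were let through, weighting each by 1 / P(allow).
# Rows mirror the example table above (only charges with observed outcomes).
rows = [
    {"score": 10, "p_allow": 1.0,  "outcome": "OK"},
    {"score": 45, "p_allow": 1.0,  "outcome": "Fraud"},
    {"score": 65, "p_allow": 0.20, "outcome": "Fraud"},
    {"score": 60, "p_allow": 0.25, "outcome": "OK"},
]

def weighted_precision_recall(rows, threshold):
    blocked = fraud = true_pos = 0.0
    for r in rows:
        w = 1.0 / r["p_allow"]
        would_block = r["score"] > threshold
        is_fraud = r["outcome"] == "Fraud"
        blocked += w * would_block
        fraud += w * is_fraud
        true_pos += w * (would_block and is_fraud)
    return true_pos / blocked, true_pos / fraud

precision, recall = weighted_precision_recall(rows, threshold=50)
print(f"precision ≈ {precision:.2f}, recall ≈ {recall:.2f}")  # ≈ 0.56, 0.83
```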

Page 22: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• The propensity function controls the exploration/exploitation tradeoff

• Precision, recall, etc. are estimators

• Variance of the estimators decreases the more we allow through

• Bootstrap to get error bars (pick rows from the table uniformly at random with replacement; see the sketch after this list)

• Li, Chen, Kleban, Gupta: "Counterfactual Estimation and Optimization of Click Metrics for Search Engines"
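A minimal sketch of the bootstrap, reusing the hypothetical `rows` and `weighted_precision_recall` from the previous sketch:

```python
# Sketch: bootstrap a confidence interval for the weighted precision estimate
# by resampling logged rows uniformly at random with replacement.
import random

def bootstrap_precision(rows, threshold, n_boot=10_000, rng=random):
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        try:
            precision, _ = weighted_precision_recall(sample, threshold)
        except ZeroDivisionError:      # resample had no blocked or no fraud rows
            continue
        estimates.append(precision)
    estimates.sort()
    lo = estimates[int(0.025 * len(estimates))]
    hi = estimates[int(0.975 * len(estimates))]
    return lo, hi

print("95% interval for precision:", bootstrap_precision(rows, threshold=50))
```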

Page 23: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Summary

• Have a plan for counterfactual evaluation before you productionize your first model

• You can back yourself into a corner (with no data to retrain on) if you address this later

• You should be monitoring the production performance of your model anyway (cf. next lesson)

Alyssa Frazee, Julia Evans, Roban Kramer, Ryan Wang

Page 24: DataEngConf SF16 - Three lessons learned from building a production machine learning system

3

Invest in production monitoring for your models

Page 25: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Production vs. data stack

• Ruby/Mongo vs. Scala/Hadoop/Thrift

• Some issues

• Divergence between production and training definitions

• Upstream changes to library code in production feature generation can change feature definitions

• True vs. “True”

Page 26: DataEngConf SF16 - Three lessons learned from building a production machine learning system

[Diagram: domain-specific scoring service (business logic), "pure" model evaluation service, logged scoring requests, aggregation jobs]

Aggregation jobs keep track of:

• Overall action rate and rate per Stripe user

• Score distributions

• Feature distributions (% null, p50/p90 for numerical values, etc.)
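A minimal sketch of computing such aggregates from logged scoring requests; the log schema and the use of pandas are assumptions for illustration.

```python
# Sketch: compute monitoring aggregates (action rate, score distribution,
# feature null rates and percentiles) from logged scoring requests.
# The log schema here is a hypothetical placeholder.
import pandas as pd

logs = pd.DataFrame([
    {"model": "fraud_v3", "user": "acct_1", "score": 12, "action": "allow", "amount": 25.0},
    {"model": "fraud_v3", "user": "acct_2", "score": 87, "action": "block", "amount": None},
    # ... one row per logged scoring request
])

for model, g in logs.groupby("model"):
    block_rate = (g["action"] == "block").mean()
    score_p50, score_p90 = g["score"].quantile([0.5, 0.9])
    print(model, "block rate:", block_rate, "score p50/p90:", score_p50, score_p90)

    # Per-feature distributions: % null and p50/p90 for numeric features.
    for feature in ["amount"]:
        pct_null = g[feature].isna().mean()
        p50, p90 = g[feature].quantile([0.5, 0.9])
        print(f"  {feature}: {pct_null:.0%} null, p50={p50}, p90={p90}")

    # Action rate per Stripe user (e.g. to alert on sudden per-user spikes).
    per_user_block_rate = g.groupby("user")["action"].apply(lambda a: (a == "block").mean())
```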

Page 27: DataEngConf SF16 - Three lessons learned from building a production machine learning system

[Diagram: aggregation jobs (get all aggregates per model) run over logged scoring requests from the domain-specific scoring service (business logic) and the "pure" model evaluation service]

Page 28: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Summary

• Monitor the production inputs to and outputs of your models

• Have dashboards that can be watched on deploys and alerting for significant anomalies

• Bake the monitoring into generic ML infrastructure (so that each ML application isn’t redoing this)

Steve Mardenfeld, Tom Switzer

Page 29: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• Don’t treat models as black boxes

• Have a plan for counterfactual evaluation before productionizing your first model

• Build production monitoring for action rates, score distributions, and feature distributions (and bake into ML infra)

Page 30: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Thanks

Stripe is hiring data scientists, engineers, and engineering managers!

[email protected] | @mlmanapat