DataEngConf SF16 - Three lessons learned from building a production machine learning system


TRANSCRIPT

Page 1: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Three lessons learned from building a production machine learning system

Michael Manapat Stripe @mlmanapat

Page 2: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Fraud

• Card numbers are stolen by hacking, malware, etc.

• “Dumps” are sold in “carding” forums

• Fraudsters use numbers in dumps to buy goods, which they then resell

• Cardholders dispute transactions

• Merchant ends up bearing cost of fraud

Page 3: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• We train binary classifiers to predict fraud

• We use open source tools

• Scalding/Summingbird for feature generation

• scikit-learn for model training (eventually: github.com/stripe/brushfire)
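To make the setup concrete, here is a minimal sketch of training and validating a binary fraud classifier with scikit-learn. The data, features, and model choice are hypothetical placeholders, not Stripe's actual pipeline.

```python
# Minimal sketch: train and validate a binary fraud classifier with scikit-learn.
# The data and feature matrix below are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# X: one row of charge features per transaction; y: 1 if the charge was later
# disputed as fraud, 0 otherwise (labels arrive ~60 days after the charge).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))             # placeholder feature matrix
y = (rng.random(10_000) < 0.05).astype(int)   # placeholder fraud labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score validation charges and apply a blocking threshold on the fraud probability.
scores = model.predict_proba(X_val)[:, 1]
blocked = scores > 0.5
print("precision:", precision_score(y_val, blocked, zero_division=0))
print("recall:", recall_score(y_val, blocked, zero_division=0))
```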

Page 4: DataEngConf SF16 - Three lessons learned from building a production machine learning system

1

Don’t treat models as black boxes

Page 5: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Early ML at Stripe

• Focused on training with more and more data and adding more and more features

• Didn’t think much about

• ML algorithms (e.g., tuning hyperparameters)

• The deeper reasons behind any particular set of results

Substantial reduction in fraud rate

Page 6: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Product development

From a product standpoint:

• We were blocking high-risk charges and surfacing only the decision

• We wanted to provide Stripe users with insight into our actions: the reasons behind the scores

Page 7: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Score reasons

Example decision tree (leaf value = score, with the count of training examples at each leaf in parentheses; left branch = predicate True, right = False):

                X < 10
          True /      \ False
           Y < 5      X < 15
           /    \      /    \
     0.1 (20) 0.3 (30) 0.5 (10) 0.9 (40)

X = 5, Y = 3: score = 0.1

Which feature is "driving" the score more?

Page 8: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Score reasons

Hold out feature X (predicates on X become unknown):

                X < ?
               /     \
           Y < 5      X < ?
           /    \      /    \
     0.1 (20) 0.3 (30) 0.5 (10) 0.9 (40)

With X unknown and Y = 3, the charge can land in the 0.1 (20), 0.5 (10), or 0.9 (40) leaves; averaging those leaves weighted by their training counts gives the holdout score:

X = ?, Y = 3: (20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 = 0.61

Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51

Now producing richer reasons with multiple predicates
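A minimal sketch of this holdout computation on the toy tree above; the tree encoding and function names are illustrative, not the production implementation.

```python
# Sketch: estimate how much a single feature "drives" a decision-tree score by
# re-scoring the example with that feature held out (unknown), averaging the
# reachable leaves weighted by their training counts. Toy tree from the slides.

# Internal node: (feature, threshold, true_branch, false_branch).
# Leaf: (score, training_count).
TREE = ("X", 10,
        ("Y", 5, (0.1, 20), (0.3, 30)),
        ("X", 15, (0.5, 10), (0.9, 40)))

def score(node, example, held_out=None):
    """Return (weighted score, total weight) for `example`.

    If a predicate tests the held-out feature, both branches are explored and
    combined, weighted by the training counts under each branch."""
    if len(node) == 2:                       # leaf: (score, count)
        return node[0] * node[1], node[1]
    feature, threshold, t_branch, f_branch = node
    if feature == held_out:
        ts, tw = score(t_branch, example, held_out)
        fs, fw = score(f_branch, example, held_out)
        return ts + fs, tw + fw
    branch = t_branch if example[feature] < threshold else f_branch
    return score(branch, example, held_out)

example = {"X": 5, "Y": 3}
orig_total, orig_weight = score(TREE, example)
hold_total, hold_weight = score(TREE, example, held_out="X")
original = orig_total / orig_weight                   # 0.1
holdout = hold_total / hold_weight                    # (20*0.1 + 10*0.5 + 40*0.9) / 70 ≈ 0.61
print("score delta for X:", abs(holdout - original))  # ≈ 0.51
```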

Page 9: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Model introspection

If a model didn't look good in validation, it wasn't clear what to do (besides trying more features/data)

What if we used our “score reasons” to debug model issues?

Page 10: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• Take all false positives (in validation data or in production) and group them by generated reason (see the sketch after this list)

• Were a substantial fraction of the false positives driven by a few features?

• Did all the comparisons in the explanation predicates make sense? (Were they comparisons a human might make for fraud?)

• Our models were overfit!
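A minimal sketch of the grouping step, assuming a hypothetical log of false positives with a `top_reason` field:

```python
# Sketch: group false positives by the feature named in their top score reason
# to see whether a few features drive most of the mistakes.
# `false_positives` and its field/reason names are hypothetical placeholders.
from collections import Counter

false_positives = [
    {"charge_id": "ch_1", "top_reason": "card_country_mismatch"},
    {"charge_id": "ch_2", "top_reason": "card_country_mismatch"},
    {"charge_id": "ch_3", "top_reason": "amount_vs_user_median"},
    # ... all false positives from validation or production
]

reason_counts = Counter(fp["top_reason"] for fp in false_positives)
total = sum(reason_counts.values())
for reason, count in reason_counts.most_common(10):
    print(f"{reason}: {count} ({count / total:.0%} of false positives)")
```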

Page 11: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Actioning insights

• Hyperparameter optimization (see the sketch below)

• Feature selection

[Plot: precision vs. recall]
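For example, a minimal scikit-learn sketch of a cross-validated hyperparameter search; the placeholder data, parameter grid, and scoring choice are illustrative assumptions.

```python
# Sketch: tune tree-ensemble hyperparameters with cross-validated grid search
# instead of accepting defaults. Grid, data, and scoring are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5_000, 20))             # placeholder features
y_train = (rng.random(5_000) < 0.05).astype(int)   # placeholder labels

param_grid = {
    "max_depth": [4, 8, 16, None],       # shallower trees can reduce overfitting
    "min_samples_leaf": [1, 10, 100],
    "n_estimators": [100, 300],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="average_precision",   # the precision/recall tradeoff matters for fraud
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```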

Page 12: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Summary

• Don’t treat models as black boxes

• Thinking about the learning process (vs. just features and data) can yield significant payoffs

• Tooling for introspection can accelerate model development/“debugging”

Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie, Jocelyn Ross, Tom Switzer

Page 13: DataEngConf SF16 - Three lessons learned from building a production machine learning system

2

Have a plan for counterfactual evaluation

Page 14: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• December 31st, 2013

• Train a binary classifier for disputes on data from Jan 1st to Sep 30th

• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)

• Based on validation data, pick a policy for actioning scores: block if score > 50
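A minimal sketch of picking such a blocking threshold from validation data; the 90% precision target and the placeholder data are assumptions for illustration.

```python
# Sketch: choose a score threshold ("block if score > T") from validation data
# by inspecting the precision/recall tradeoff. The 90% precision target is an
# illustrative assumption, not a quoted business requirement.
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation labels and scores (stand-ins for the Oct 1 - Oct 31
# charges, whose dispute labels arrive ~60 days later).
rng = np.random.default_rng(0)
y_val = (rng.random(5_000) < 0.05).astype(int)
scores = np.clip(rng.normal(30 + 40 * y_val, 15), 0, 100)  # fraud scores higher on average

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# Pick the lowest threshold that reaches the (illustrative) precision target.
target_precision = 0.90
ok = precision[:-1] >= target_precision    # precision has one extra trailing entry
threshold = thresholds[ok][0] if ok.any() else thresholds[-1]
print(f"policy: block if score > {threshold:.1f}")
```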

Page 15: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Questions (1)

• Business complains about high false positive rate: what would happen if we changed the policy to "block if score > 70"?

• What are the production precision and recall of the model?

Page 16: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• December 31st, 2014. We repeat the exercise from a year earlier

• Train a model on data from Jan 1st to Sep 30th

• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)

• Validation results look ~ok (but not great)

• We put the model into production and the results are terrible

Page 17: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Questions (2)

• Why did the validation results for the new model look so much worse?

• How do we know if the retrained model really is better than the original model?

Page 18: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Counterfactual evaluation

• Our model changes reality (the world is different because of its existence)

• We can answer some questions (around model comparisons) with A/B tests

• For all these questions, we want an approximation of the charge/outcome distribution that would exist if there were no model

Page 19: DataEngConf SF16 - Three lessons learned from building a production machine learning system

One approach

• Probabilistically reverse a small fraction of our block decisions

• The higher the score, the lower the probability we let the charge through

• Weight samples by 1 / P(allow)

• Get information on the area we want to improve on
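A minimal sketch of this scheme; the specific propensity function is an illustrative choice (picked to be consistent with the example numbers on the next slides), not the production rule.

```python
# Sketch: probabilistically reverse a small fraction of block decisions so we
# still observe outcomes for high-score charges. The propensity function below
# is illustrative: the higher the score, the lower P(allow).
import random

BLOCK_THRESHOLD = 50

def p_allow(score):
    """Probability of letting a charge through despite the blocking policy."""
    if score <= BLOCK_THRESHOLD:
        return 1.0                                   # policy would allow it anyway
    return max(0.0005, 0.30 - 0.01 * (score - 55))   # decays as the score rises

def select_action(score, rng=random):
    p = p_allow(score)
    action = "Allow" if rng.random() < p else "Block"
    return action, p

# Charges that are let through are later labeled (OK / Fraud) and weighted by
# 1 / P(allow) in any counterfactual estimate.
for score in (10, 45, 55, 65, 100):
    action, p = select_action(score)
    print(score, action, p, "weight:", 1 / p if action == "Allow" else None)
```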

Page 20: DataEngConf SF16 - Three lessons learned from building a production machine learning system

ID   Score   P(Allow)   Original Action   Selected Action   Outcome
1    10      1.0        Allow             Allow             OK
2    45      1.0        Allow             Allow             Fraud
3    55      0.30       Block             Block             -
4    65      0.20       Block             Allow             Fraud
5    100     0.0005     Block             Block             -
6    60      0.25       Block             Allow             OK

Page 21: DataEngConf SF16 - Three lessons learned from building a production machine learning system

ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
1    10      1.0        1        Allow             Allow             OK
2    45      1.0        1        Allow             Allow             Fraud
4    65      0.20       5        Block             Allow             Fraud
6    60      0.25       4        Block             Allow             OK

Evaluating the "block if score > 50" policy:

Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
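A minimal sketch of computing these weighted estimates from the charges that were let through; the rows mirror the table above.

```python
# Sketch: estimate production precision/recall of a "block if score > 50"
# policy from charges that were let through, weighting each by 1 / P(allow).
# Rows mirror the example table above (only charges with observed outcomes).
rows = [
    {"score": 10, "p_allow": 1.0,  "outcome": "OK"},
    {"score": 45, "p_allow": 1.0,  "outcome": "Fraud"},
    {"score": 65, "p_allow": 0.20, "outcome": "Fraud"},
    {"score": 60, "p_allow": 0.25, "outcome": "OK"},
]

def weighted_precision_recall(rows, threshold):
    blocked = fraud = true_pos = 0.0
    for r in rows:
        w = 1.0 / r["p_allow"]
        would_block = r["score"] > threshold
        is_fraud = r["outcome"] == "Fraud"
        blocked += w * would_block
        fraud += w * is_fraud
        true_pos += w * (would_block and is_fraud)
    return true_pos / blocked, true_pos / fraud

precision, recall = weighted_precision_recall(rows, threshold=50)
print(f"precision ≈ {precision:.2f}, recall ≈ {recall:.2f}")  # ≈ 0.56, 0.83
```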

Page 22: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• The propensity function controls the exploration/exploitation tradeoff

• Precision, recall, etc. are estimators

• Variance of the estimators decreases the more we allow through

• Bootstrap to get error bars (pick rows from the table uniformly at random with replacement; see the sketch after this list)

• Li, Chen, Kleban, Gupta: "Counterfactual Estimation and Optimization of Click Metrics for Search Engines"
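A minimal sketch of the bootstrap, reusing the hypothetical `rows` and `weighted_precision_recall` from the previous sketch:

```python
# Sketch: bootstrap a confidence interval for the weighted precision estimate
# by resampling logged rows uniformly at random with replacement.
import random

def bootstrap_precision(rows, threshold, n_boot=10_000, rng=random):
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        try:
            precision, _ = weighted_precision_recall(sample, threshold)
        except ZeroDivisionError:      # resample had no blocked or no fraud rows
            continue
        estimates.append(precision)
    estimates.sort()
    lo = estimates[int(0.025 * len(estimates))]
    hi = estimates[int(0.975 * len(estimates))]
    return lo, hi

print("95% interval for precision:", bootstrap_precision(rows, threshold=50))
```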

Page 23: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Summary

• Have a plan for counterfactual evaluation before you productionize your first model

• You can back yourself into a corner (with no data to retrain on) if you address this later

• You should be monitoring the production performance of your model anyway (cf. next lesson)

Alyssa Frazee, Julia Evans, Roban Kramer, Ryan Wang

Page 24: DataEngConf SF16 - Three lessons learned from building a production machine learning system

3

Invest in production monitoring for your models

Page 25: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Production vs. data stack

• Ruby/Mongo vs. Scala/Hadoop/Thrift

• Some issues

• Divergence between production and training definitions

• Upstream changes to library code in production feature generation can change feature definitions

• True vs. “True”

Page 26: DataEngConf SF16 - Three lessons learned from building a production machine learning system

[Diagram: domain-specific scoring service (business logic), "pure" model evaluation service, logged scoring requests, aggregation jobs]

Aggregation jobs keep track of:

• Overall action rate and rate per Stripe user

• Score distributions

• Feature distributions (% null, p50/p90 for numerical values, etc.)
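A minimal sketch of computing such aggregates from logged scoring requests; the log schema and the use of pandas are assumptions for illustration.

```python
# Sketch: compute monitoring aggregates (action rate, score distribution,
# feature null rates and percentiles) from logged scoring requests.
# The log schema here is a hypothetical placeholder.
import pandas as pd

logs = pd.DataFrame([
    {"model": "fraud_v3", "user": "acct_1", "score": 12, "action": "allow", "amount": 25.0},
    {"model": "fraud_v3", "user": "acct_2", "score": 87, "action": "block", "amount": None},
    # ... one row per logged scoring request
])

for model, g in logs.groupby("model"):
    block_rate = (g["action"] == "block").mean()
    score_p50, score_p90 = g["score"].quantile([0.5, 0.9])
    print(model, "block rate:", block_rate, "score p50/p90:", score_p50, score_p90)

    # Per-feature distributions: % null and p50/p90 for numeric features.
    for feature in ["amount"]:
        pct_null = g[feature].isna().mean()
        p50, p90 = g[feature].quantile([0.5, 0.9])
        print(f"  {feature}: {pct_null:.0%} null, p50={p50}, p90={p90}")

    # Action rate per Stripe user (e.g. to alert on sudden per-user spikes).
    per_user_block_rate = g.groupby("user")["action"].apply(lambda a: (a == "block").mean())
```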

Page 27: DataEngConf SF16 - Three lessons learned from building a production machine learning system

[Diagram: aggregation jobs (get all aggregates per model) run over logged scoring requests from the domain-specific scoring service (business logic) and the "pure" model evaluation service]

Page 28: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Summary

• Monitor the production inputs to and outputs of your models

• Have dashboards that can be watched on deploys and alerting for significant anomalies

• Bake the monitoring into generic ML infrastructure (so that each ML application isn’t redoing this)

Steve Mardenfeld, Tom Switzer

Page 29: DataEngConf SF16 - Three lessons learned from building a production machine learning system

• Don’t treat models as black boxes

• Have a plan for counterfactual evaluation before productionizing your first model

• Build production monitoring for action rates, score distributions, and feature distributions (and bake into ML infra)

Page 30: DataEngConf SF16 - Three lessons learned from building a production machine learning system

Thanks

Stripe is hiring data scientists, engineers, and engineering managers!

[email protected] | @mlmanapat