TRANSCRIPT
Three lessons learned from building a production machine learning system
Michael Manapat, Stripe (@mlmanapat)
Fraud
• Card numbers are stolen by hacking, malware, etc.
• “Dumps” are sold in “carding” forums
• Fraudsters use numbers in dumps to buy goods, which they then resell
• Cardholders dispute transactions
• Merchant ends up bearing cost of fraud
• We train binary classifiers to predict fraud
• We use open source tools
• Scalding/Summingbird for feature generation
• scikit-learn for model training (eventually: github.com/stripe/brushfire)
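As a rough illustration of this setup (not Stripe's actual pipeline), training a binary fraud classifier with scikit-learn might look like the sketch below; the feature matrix and labels are placeholders for the Scalding/Summingbird-generated features and dispute outcomes.

```python
# Minimal sketch of training a binary fraud classifier with scikit-learn.
# X and y are placeholders for the real feature matrix and dispute labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)                    # placeholder features
y = (rng.rand(10_000) < 0.05).astype(int)   # placeholder labels (~5% fraud)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

preds = model.predict(X_val)
print("precision:", precision_score(y_val, preds, zero_division=0))
print("recall:   ", recall_score(y_val, preds, zero_division=0))
```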
Lesson 1: Don't treat models as black boxes
Early ML at Stripe
• Focused on training with more and more data and adding more and more features
• Didn’t think much about
• ML algorithms (e.g., tuning hyperparameters)
• The deeper reasons behind any particular set of results
Substantial reduction in fraud rate
Product development
From a product standpoint:
• We were blocking high-risk charges and surfacing just the decision
• We wanted to provide Stripe users insight into our actions: reasons for scores
Score reasons
X = 5, Y = 3: score = 0.1
Which feature is “driving” the score more?
[Decision tree: the root splits on X < 10 (true = left, false = right). The left child splits on Y < 5, with leaf scores 0.1 (20 training examples) and 0.3 (30). The right child splits on X < 15, with leaf scores 0.5 (10) and 0.9 (40).]
Score reasons
X = ?, Y = 3: (20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 ≈ 0.61
Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51
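A minimal sketch of this hold-out computation, assuming the toy tree and leaf counts above (the helper names are made up for illustration; this is not the brushfire implementation):

```python
# Sketch of a "score reason" via feature hold-out: mask one feature, collect
# every leaf still reachable given the known features, and average the leaf
# scores weighted by training counts. Tree structure and counts are the toy
# example from the slide.

TREE = {
    "feature": "X", "threshold": 10,
    "left": {                                       # X < 10
        "feature": "Y", "threshold": 5,
        "left":  {"score": 0.1, "count": 20},       # Y < 5
        "right": {"score": 0.3, "count": 30},
    },
    "right": {                                      # X >= 10
        "feature": "X", "threshold": 15,
        "left":  {"score": 0.5, "count": 10},       # X < 15
        "right": {"score": 0.9, "count": 40},
    },
}

def tree_score(node, example):
    """Ordinary tree evaluation."""
    if "score" in node:
        return node["score"]
    branch = "left" if example[node["feature"]] < node["threshold"] else "right"
    return tree_score(node[branch], example)

def reachable_leaves(node, example, held_out):
    """Leaves consistent with the known features; splits on the held-out
    feature are allowed to go either way."""
    if "score" in node:
        return [node]
    if node["feature"] == held_out:
        return (reachable_leaves(node["left"], example, held_out)
                + reachable_leaves(node["right"], example, held_out))
    branch = "left" if example[node["feature"]] < node["threshold"] else "right"
    return reachable_leaves(node[branch], example, held_out)

def holdout_score(tree, example, held_out):
    """Average of reachable leaf scores, weighted by training counts."""
    leaves = reachable_leaves(tree, example, held_out)
    total = sum(leaf["count"] for leaf in leaves)
    return sum(leaf["count"] * leaf["score"] for leaf in leaves) / total

example = {"X": 5, "Y": 3}
original = tree_score(TREE, example)                   # 0.1
masked_x = holdout_score(TREE, example, held_out="X")  # (20*0.1 + 10*0.5 + 40*0.9)/70 ≈ 0.61
print(original, masked_x, abs(masked_x - original))    # score delta ≈ 0.51
```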
Now producing richer reasons with multiple predicates
[Same tree with splits on the held-out feature X masked ("X < ?"); leaves 0.1 (20), 0.3 (30), 0.5 (10), 0.9 (40).]
Model introspection
If a model didn't look good in validation, it wasn't clear what to do (besides trying more features/data)
What if we used our "score reasons" to debug model issues?
• Take all false positives (in validation data or in production) and group by generated reason (sketched below)
• Were a substantial fraction of the false positives driven by a few features?
• Did all the comparisons in the explanation predicates make sense? (Were they comparisons a human might make for fraud?)
• Our models were overfit!
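A sketch of that grouping step, assuming a validation table with a per-charge top_reason column produced by the hold-out computation above (all column names here are assumptions):

```python
# Group validation false positives by the feature that most drove each score
# and see whether a few features account for most of the mistakes.
import pandas as pd

validation = pd.DataFrame({
    "charge_id":  [1, 2, 3, 4, 5, 6],
    "label":      [0, 0, 0, 1, 0, 0],          # 1 = actually fraudulent
    "prediction": [1, 1, 1, 1, 0, 1],          # 1 = model said "fraud"
    "top_reason": ["amount", "amount", "country", None, None, "amount"],
})

false_positives = validation[(validation.prediction == 1) & (validation.label == 0)]
print(false_positives.groupby("top_reason").size().sort_values(ascending=False))
# If one or two features dominate the false positives, inspect those features
# and the comparison predicates: implausible splits often indicate overfitting.
```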
Actioning insights
• Hyperparameter optimization
• Feature selection
[Chart: precision and recall]
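One way these insights might be actioned, sketched with scikit-learn's grid search and univariate feature selection; the parameters and ranges are illustrative, not the values used at Stripe:

```python
# Sketch of hyperparameter optimization (e.g. limiting tree depth to curb
# overfitting) combined with simple feature selection in one pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("select", SelectKBest(f_classif)),          # keep the k most informative features
    ("model", RandomForestClassifier(random_state=0)),
])

search = GridSearchCV(
    pipeline,
    param_grid={
        "select__k": [5, 10, 20],
        "model__max_depth": [4, 8, 16],          # shallower trees generalize better here
        "model__n_estimators": [100, 200],
    },
    scoring="average_precision",
    cv=3,
)
# search.fit(X_train, y_train)  # X_train/y_train as in the earlier training sketch
# print(search.best_params_, search.best_score_)
```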
Summary• Don’t treat models as black boxes
• Thinking about the learning process (vs. just features and data) can yield significant payoffs
• Tooling for introspection can accelerate model development/“debugging”
Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie, Jocelyn Ross, Tom Switzer
Lesson 2: Have a plan for counterfactual evaluation
• December 31st, 2013
• Train a binary classifier for disputes on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Based on validation data, pick a policy for actioning scores: block if score > 50
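A sketch of how such a policy threshold might be chosen from the validation month, assuming a target precision; the target value and the toy scores/labels are made up:

```python
# Pick the lowest score threshold whose validation precision meets a target,
# which maximizes recall subject to that precision constraint.
import numpy as np
from sklearn.metrics import precision_recall_curve

# val_scores: model scores (0-100) on the validation charges
# val_labels: dispute outcomes observed after the ~60-day wait
val_labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
val_scores = np.array([5, 20, 45, 55, 60, 75, 80, 95])

precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)
target_precision = 0.60
candidates = thresholds[precision[:-1] >= target_precision]
block_threshold = candidates.min() if len(candidates) else thresholds.max()
print(f"policy: block if score >= {block_threshold}")
```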
Questions (1)
• Business complains about high false positive rate: what would happen if we changed the policy to "block if score > 70"?
• What are the production precision and recall of the model?
• December 31st, 2014. We repeat the exercise from a year earlier
• Train a model on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Validation results look ~ok (but not great)
• We put the model into production and the results are terrible
Questions (2)
• Why did the validation results for the new model look so much worse?
• How do we know if the retrained model really is better than the original model?
Counterfactual evaluation
• Our model changes reality (the world is different because of its existence)
• We can answer some questions (around model comparisons) with A/B tests
• For all these questions, we want an approximation of the charge/outcome distribution that would exist if there were no model
One approach
• Probabilistically reverse a small fraction of our block decisions
• The higher the score, the lower the probability we let the charge through
• Weight samples by 1 / P(allow) (sketch below)
• Get information on the area we want to improve on
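A sketch of the reversal logic with a made-up propensity function (the real function is a design choice; see the exploration/exploitation note further down):

```python
# For charges the model would block, allow a small, score-dependent fraction
# through and record P(allow) so outcomes can later be weighted by 1 / P(allow).
# The propensity function below is an assumption, not Stripe's.
import random

def p_allow(score):
    """Probability of letting a would-be-blocked charge through (score in 0-100)."""
    return max(0.0005, 0.05 * (1 - score / 100.0))  # higher score -> lower probability

def action(charge_id, score, threshold=50):
    if score <= threshold:
        return {"id": charge_id, "score": score, "p_allow": 1.0, "action": "allow"}
    p = p_allow(score)
    selected = "allow" if random.random() < p else "block"
    # log p_allow with the request so the analysis job can weight by 1 / p_allow
    return {"id": charge_id, "score": score, "p_allow": p, "action": selected}

print(action("ch_1", 65))
```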
All charges, with the probabilistic reversal applied:

ID | Score | P(Allow) | Original Action | Selected Action | Outcome
1  | 10    | 1.0      | Allow           | Allow           | OK
2  | 45    | 1.0      | Allow           | Allow           | Fraud
3  | 55    | 0.30     | Block           | Block           | -
4  | 65    | 0.20     | Block           | Allow           | Fraud
5  | 100   | 0.0005   | Block           | Block           | -
6  | 60    | 0.25     | Block           | Allow           | OK
Allowed charges only, weighted by 1 / P(Allow):

ID | Score | P(Allow) | Weight | Original Action | Selected Action | Outcome
1  | 10    | 1.0      | 1      | Allow           | Allow           | OK
2  | 45    | 1.0      | 1      | Allow           | Allow           | Fraud
4  | 65    | 0.20     | 5      | Block           | Allow           | Fraud
6  | 60    | 0.25     | 4      | Block           | Allow           | OK
Evaluating the "block if score > 50" policy
Precision = 5/9 ≈ 0.56
Recall = 5/6 ≈ 0.83
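Reproducing those numbers from the weighted table above, as a sketch:

```python
# Propensity-weighted precision/recall for the "block if score > 50" policy,
# computed from the allowed charges only, with weight = 1 / P(allow).
allowed = [
    # (score, p_allow, fraud?)
    (10, 1.00, False),
    (45, 1.00, True),
    (65, 0.20, True),
    (60, 0.25, False),
]

threshold = 50
tp = fp = fn = 0.0
for score, p_allow, fraud in allowed:
    weight = 1.0 / p_allow
    would_block = score > threshold
    if would_block and fraud:
        tp += weight
    elif would_block and not fraud:
        fp += weight
    elif not would_block and fraud:
        fn += weight

print("precision:", tp / (tp + fp))   # 5 / 9 ≈ 0.56
print("recall:   ", tp / (tp + fn))   # 5 / 6 ≈ 0.83
```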
• The propensity function controls the exploration/exploitation tradeoff
• Precision, recall, etc. are estimators
• Variance of the estimators decreases the more we allow through
• Bootstrap to get error bars (pick rows from the table uniformly at random with replacement; sketch below)
• Li, Chen, Kleban, Gupta: "Counterfactual Estimation and Optimization of Click Metrics for Search Engines"
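A sketch of that bootstrap, using the same rows as the previous sketch: resample the table uniformly at random with replacement and recompute the weighted metric on each resample.

```python
# Bootstrap error bars for the propensity-weighted precision estimate.
import numpy as np

# Rows from the weighted table: (score, p_allow, fraud?)
allowed = [(10, 1.00, False), (45, 1.00, True), (65, 0.20, True), (60, 0.25, False)]

def weighted_precision(rows, threshold=50):
    tp = fp = 0.0
    for score, p_allow, fraud in rows:
        if score > threshold:
            if fraud:
                tp += 1.0 / p_allow
            else:
                fp += 1.0 / p_allow
    return tp / (tp + fp) if (tp + fp) else float("nan")  # nan if nothing blocked

rng = np.random.default_rng(0)
samples = []
for _ in range(10_000):
    resample = [allowed[i] for i in rng.integers(0, len(allowed), size=len(allowed))]
    samples.append(weighted_precision(resample))

low, high = np.nanpercentile(samples, [2.5, 97.5])
print(f"precision 95% interval: [{low:.2f}, {high:.2f}]")
```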
Summary
• Have a plan for counterfactual evaluation before you productionize your first model
• You can back yourself into a corner (with no data to retrain on) if you address this later
• You should be monitoring the production performance of your model anyway (cf. next lesson)
Alyssa Frazee, Julia Evans, Roban Kramer, Ryan Wang
Lesson 3: Invest in production monitoring for your models
Production vs. data stack
• Ruby/Mongo vs. Scala/Hadoop/Thrift
• Some issues:
• Divergence between production and training definitions
• Upstream changes to library code in production feature generation can change feature definitions
• True vs. "True"
[Architecture: a domain-specific scoring service (business logic) calls a "pure" model evaluation service; scoring requests are logged, and aggregation jobs compute all aggregates per model from the logs.]
Aggregation jobs keep track of:
• Overall action rate and rate per Stripe user
• Score distributions
• Feature distributions (% null, p50/p90 for numerical values, etc.) (sketch below)
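A sketch of those aggregates computed from a table of logged scoring requests; the column names and the amount feature are assumptions for illustration:

```python
# Aggregates the monitoring jobs might compute from logged scoring requests:
# action rate (overall and per Stripe user), score distribution, and
# per-feature distributions (% null, p50/p90 for numerical features).
import pandas as pd

logged = pd.DataFrame({
    "stripe_user": ["acct_a", "acct_a", "acct_b", "acct_b", "acct_b"],
    "score":       [12, 88, 45, 91, 7],
    "action":      ["allow", "block", "allow", "block", "allow"],
    "amount":      [5.0, None, 20.0, 120.0, 3.5],   # an example feature
})

block_rate = (logged.action == "block").mean()
block_rate_per_user = logged.groupby("stripe_user")["action"].apply(lambda a: (a == "block").mean())
score_quantiles = logged.score.quantile([0.5, 0.9])
feature_null_pct = logged.amount.isna().mean()
feature_quantiles = logged.amount.quantile([0.5, 0.9])

print("block rate:", block_rate)
print("block rate per user:", block_rate_per_user.to_dict())
print("score p50/p90:", score_quantiles.to_dict())
print("amount % null:", feature_null_pct, "amount p50/p90:", feature_quantiles.to_dict())
# Watch these on deploys and alert on significant shifts from the baseline.
```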
Summary
• Monitor the production inputs to and outputs of your models
• Have dashboards that can be watched on deploys and alerting for significant anomalies
• Bake the monitoring into generic ML infrastructure (so that each ML application isn't redoing this)
Steve Mardenfeld, Tom Switzer
• Don’t treat models as black boxes
• Have a plan for counterfactual evaluation before productionizing your first model
• Build production monitoring for action rates, score distributions, and feature distributions (and bake into ML infra)
Thanks
Stripe is hiring data scientists, engineers, and engineering managers!
[email protected] | @mlmanapat