Python as Part of a Production Machine Learning Stack, by Michael Manapat (PyData SV 2014)

Post on 27-Jan-2015


DESCRIPTION

Over the course of three years, we've built Stripe from scratch and scaled it to process billions of dollars of transaction volume a year by making it easy and painless for merchants to get set up and start accepting payments. While the vast majority of transactions facilitated by Stripe are honest, we do need to protect our merchants from rogue individuals and groups seeking to "test" or "cash" stolen credit cards. To combat this sort of activity, Stripe uses Python (together with Scala and Ruby) as part of its production machine learning pipeline to detect and block fraud in real time. In this talk, I'll go through the scikit-learn-based modeling process for a sample data set derived from production data to illustrate how we train and validate our models. We'll also walk through how we deploy the models and monitor them in our production environment, and how Python has allowed us to do this at scale.

TRANSCRIPT

Python as part of a production machine learning stack
Michael Manapat, @mlmanapat, Stripe

Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Stripe is a technology company focusing on making payments easy
- Short application

Tokenization
[Diagram] Customer browser and Stripe: Stripe.js, Token
[Diagram] Merchant server and Stripe: API call, Result

API Call

    import stripe

    # Charge the card represented by the token created in the browser.
    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        email="customer@example.com"
    )

Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago

Fraud / business violations
- Terms of service violations: e-cigarettes, drugs, weapons, etc.
How do we find these automatically?

Merchant sign-up flow
Application submission -> Website scraped -> Text scored -> Application reviewed

Merchant sign-up flow
Application submission -> Website scraped -> Text scored -> Application reviewed
[Diagram label] Machine learning pipeline and service

Building a classifier: e-cigarettes

    import pandas

    # pandas loads pickled objects with read_pickle.
    data = pandas.read_pickle('ecigs')
    data.head()

                                                    text  violator
    0  " please verify your age i am 21 years or older ...      True
    1  coming soon toggle me drag me with your mouse ...     False
    2  drink moscow mules cart 0 log in or create an ...     False
    3  vapors electronic cigarette buy now insuper st...      True
    4  t-shirts shorts hawaii about us silver coll...     False

    [5 rows x 2 columns]

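Features for text classification

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()  # instantiate the vectorizer before fitting
    features = cv.fit_transform(data['text'])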

Sparse matrix of word counts from input text (omitting feature selection)
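Note (illustrative sketch, not from the talk): the word-count matrix can be inspected on a few made-up strings. The sample texts below are assumptions and stand in for scraped website text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for scraped website text (made up for illustration).
texts = [
    "buy electronic cigarette vapors now",
    "t-shirts and shorts from hawaii",
    "electronic cigarette starter kit",
]

cv = CountVectorizer()
features = cv.fit_transform(texts)  # SciPy sparse matrix, one row per document

n_docs, n_words = features.shape

# Column index assigned to the word "cigarette" in the learned vocabulary.
cigarette_col = cv.vocabulary_["cigarette"]

# Per-document counts of "cigarette", converted to a plain list.
cigarette_counts = features[:, cigarette_col].toarray().ravel().tolist()
```

Each row is one document and each column one vocabulary word, so most entries are zero, which is why a sparse representation is used.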

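Features for text classification

    # In 2014-era scikit-learn this function lived in sklearn.cross_validation.
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'], test_size=0.2)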

- Avoid leakage
- Other cross-validation methods
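Note (illustrative sketch, not from the talk): one common way to limit leakage is to fit the vectorizer inside each training fold rather than on the full data set. The sketch below assumes a scikit-learn Pipeline with stratified k-fold cross-validation; the texts and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up examples: 1 = violator, 0 = not a violator.
texts = [
    "buy electronic cigarette vapors now",
    "electronic cigarette starter kit",
    "vapor liquid and cigarette refills",
    "cheap vapors and cigarette cartridges",
    "t-shirts and shorts from hawaii",
    "drink moscow mules at our bar",
    "coming soon new online store",
    "silver jewelry collection on sale",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refit on the training part of every fold, so vocabulary
# learned from held-out documents never reaches the model.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds)  # accuracy per fold
```

With a data set this small the fold scores are not meaningful; the point is only where the fitting happens.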

Training

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_train, y_train)

Serializer reads from model.intercept_ and model.coef_
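Note (illustrative sketch, not from the talk): the slide does not say what format the serializer writes. Assuming JSON, and assuming whitespace tokenization as a simplification of CountVectorizer, a logistic regression can be scored from the intercept and coefficients alone. The names serialize_model and score and the parameter values are hypothetical.

```python
import json
import math


def serialize_model(intercept, coef, vocabulary):
    """Pack learned parameters into a JSON string.

    intercept:  float, for example model.intercept_[0]
    coef:       list of floats, for example model.coef_[0].tolist()
    vocabulary: dict of word -> column index, for example cv.vocabulary_
    """
    weights = {word: coef[index] for word, index in vocabulary.items()}
    return json.dumps({"intercept": intercept, "weights": weights}, sort_keys=True)


def score(text, serialized):
    """Return the predicted probability using only the serialized parameters."""
    params = json.loads(serialized)
    z = params["intercept"]
    # Adding a word's weight once per occurrence equals count * weight.
    for word in text.lower().split():
        z += params["weights"].get(word, 0.0)
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


# Hypothetical parameter values, for illustration only.
payload = serialize_model(-1.0, [2.0, -0.5], {"cigarette": 0, "shorts": 1})
probability = score("electronic cigarette", payload)
```

Because only plain numbers are stored, a scorer like this can be reimplemented in another language without depending on scikit-learn.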

 

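Validation

    import matplotlib.pyplot
    from sklearn.metrics import roc_curve

    probs = model.predict_proba(X_test)
    # Column 1 holds the probability of the positive (violator) class.
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    matplotlib.pyplot.plot(fpr, tpr)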

ROC: Receiver operating characteristic
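Note (illustrative sketch, not from the talk): an ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered. The pure-Python version below is a simplified stand-in for roc_curve, assumes both classes are present, and uses made-up labels and scores.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists with one point per distinct threshold, from (0, 0)."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold step by step; everything scoring >= threshold is flagged.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and not y)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


# Made-up labels (True = violator) and model scores.
labels = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
fpr, tpr = roc_points(labels, scores)
area = auc(fpr, tpr)
```

An area close to 1 means violators are usually ranked above non-violators; 0.5 corresponds to random ordering.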

 

Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
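Note (illustrative sketch, not from the talk): the "sanitize text" step could look roughly like this, using only the standard library's html.parser. The class and function names are hypothetical, and the lowercasing is an assumption based on the lowercase sample rows shown earlier.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and lowercase the result.
        return " ".join(" ".join(self._chunks).split()).lower()


def sanitize(html):
    """Return the lowercased visible text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Vapors</h1><p>Buy now &amp; save</p>"
    "<script>var x = 1;</script></body></html>"
)
clean = sanitize(page)
```

Real snapshots are messier than this example, so a production step would likely need more rules than a tag filter.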

Luigi

    import luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        # Declares that snapshots must be fetched before features are generated.
        def requires(self):
            return GetSnapshots()

Luigi runs tasks on Hadoop cluster

Scoring as a service
Application submission -> Website scraped -> Text scored -> Application reviewed
[Diagram labels] Thrift RPC, Scoring Service

Scoring as a service

    struct ScoringRequest {
      1: string text
      2: optional string model_name
    }

    struct ScoringResponse {
      1: double score  // Experiments?
      2: double request_duration
    }
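Note (illustrative sketch, not from the talk): the request/response shapes can be mirrored in plain Python to show how a handler might fill them in. This is not Thrift-generated code; the handle function, the "default" model name, and seconds as the unit of request_duration are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct


@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds here; the unit is an assumption


DEFAULT_MODEL = "default"  # hypothetical fallback name


def handle(request: ScoringRequest,
           models: Dict[str, Callable[[str], float]]) -> ScoringResponse:
    """Score the text with the requested model and time the call."""
    start = time.perf_counter()
    model = models[request.model_name or DEFAULT_MODEL]
    score = model(request.text)
    return ScoringResponse(score=score,
                           request_duration=time.perf_counter() - start)


# Hypothetical registry of scoring functions keyed by model name.
models = {"default": lambda text: 0.25, "ecigs": lambda text: 0.9}
response = handle(ScoringRequest(text="vapors electronic cigarette",
                                 model_name="ecigs"), models)
```

Keeping model_name optional lets callers pick a specific model while older callers keep working against the default.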

Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation

- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
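Note (illustrative sketch, not from the talk): Graphite's Carbon daemon accepts metrics over a plaintext protocol, one "path value timestamp" line per data point. The metric name below is hypothetical, and the talk does not say how the service actually reports to Graphite.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, Unix timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "{} {} {}\n".format(path, value, int(timestamp))


def send_metric(line, host="localhost", port=2003):
    """Send a metric line to a Carbon listener (2003 is its default plaintext port)."""
    with socket.create_connection((host, port), timeout=1.0) as conn:
        conn.sendall(line.encode("ascii"))


# Hypothetical metric name; the line is only formatted here, not sent.
line = format_metric("scoring.ecigs.request_duration", 0.012, timestamp=1400000000)
```

Calling send_metric(line) would deliver the point to a Carbon instance on the given host, if one is running.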

Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
Thanks! @mlmanapat
