Transcript
Page 1: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014

Python as part of a production machine learning stack
Michael Manapat, @mlmanapat, Stripe

Page 2:

Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Page 3:

Stripe is a technology company focusing on making payments easy
- Short application

Page 4:

Tokenization
Customer browser -> Stripe: Stripe.js
Stripe -> Customer browser: Token
Merchant server -> Stripe: API call
Stripe -> Merchant server: Result

Page 5:

API Call
import stripe
stripe.Charge.create(
    amount=400,
    currency="usd",
    card="tok_103xnl2gR5VxTSB",
    [email protected]"
)

Page 6:

Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago

Page 7:

Fraud / business violations
- Terms of service violations: e-cigarettes, drugs, weapons, etc.
How do we find these automatically?

Page 8:

Merchant sign up flow
Application submission -> Website scraped -> Text scored -> Application reviewed

Page 9:

Merchant sign up flow
Application submission -> Website scraped -> Text scored -> Application reviewed
Machine learning pipeline and service

Page 10:

Building a classifier: e-cigarettes
import pandas
data = pandas.read_pickle('ecigs')
data.head()

   text                                                  violator
0  " please verify your age i am 21 years or older ...   True
1  coming soon toggle me drag me with your mouse ...     False
2  drink moscow mules cart 0 log in or create an ...     False
3  vapors electronic cigarette buy now insuper st...     True
4  t-shirts shorts hawaii about us silver coll...        False

[5 rows x 2 columns]

Page 11:

Features for text classification
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
features = cv.fit_transform(data['text'])

Sparse matrix of word counts from input text (omitting feature selection)
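A minimal sketch of what the "sparse matrix of word counts" looks like, assuming a recent scikit-learn. The three documents are made-up stand-ins for scraped website text, not Stripe data.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for scraped website text.
docs = [
    "vapors electronic cigarette buy now",
    "t-shirts shorts hawaii about us",
    "electronic cigarette starter kit",
]

cv = CountVectorizer()
features = cv.fit_transform(docs)  # one row per document, one column per word

vocab = cv.vocabulary_  # maps each word to its column index
first_doc_count = features[0, vocab["cigarette"]]   # occurrences in document 0
second_doc_count = features[1, vocab["cigarette"]]  # occurrences in document 1

print(features.shape)
print(first_doc_count, second_doc_count)
```

Most entries are zero because each document uses only a small part of the vocabulary, which is why the result is stored as a sparse matrix.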

Page 12:

Features for text classification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    features, data['violator'], test_size=0.2)

- Avoid leakage
- Other cross-validation methods
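One way to guard against leakage when using other cross-validation methods is to put the vectorizer and the model in a single scikit-learn pipeline, so the vocabulary is refit on the training part of every fold. This is a sketch under that assumption, with made-up texts and labels; it is not necessarily how the talk's code was organized.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: six violators followed by six non-violators.
texts = [
    "buy electronic cigarette now",
    "vapor liquid cigarette shop",
    "electronic cigarette starter kit",
    "vapor pens and liquid refills",
    "cigarette vapor flavors on sale",
    "best electronic vapor deals",
    "shirts and shorts from hawaii",
    "coffee beans roasted daily",
    "handmade silver jewelry shop",
    "book your yoga class online",
    "fresh flowers delivered today",
    "custom wooden furniture store",
]
labels = [True] * 6 + [False] * 6

# The vectorizer is fit inside each fold, so held-out text never
# influences the vocabulary used for training.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, texts, labels, cv=3)

print(scores)
```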

Page 13:

Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

Serializer reads from model.intercept_ and model.coef_
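The talk does not show the serializer, so the following is only a sketch of the idea: a fitted logistic regression is fully described by `model.intercept_`, `model.coef_` and the vectorizer's vocabulary, and those are enough to score text without scikit-learn. The JSON layout and the `score` helper are hypothetical, and the whitespace tokenizer is a simplification that matches `CountVectorizer` only for plain lowercase words.

```python
import json
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data.
texts = [
    "buy electronic cigarette now",
    "electronic cigarette vapor shop",
    "hawaii shirts and shorts",
    "silver jewelry shop online",
]
labels = [1, 1, 0, 0]

cv = CountVectorizer()
model = LogisticRegression().fit(cv.fit_transform(texts), labels)

# Hypothetical serialization format: vocabulary, weights, intercept.
serialized = json.dumps({
    "vocabulary": {word: int(idx) for word, idx in cv.vocabulary_.items()},
    "coef": model.coef_[0].tolist(),
    "intercept": float(model.intercept_[0]),
})

def score(text, blob):
    """Return P(violator) using only the serialized parameters."""
    params = json.loads(blob)
    z = params["intercept"]
    for word in text.lower().split():  # simplified tokenizer (assumption)
        idx = params["vocabulary"].get(word)
        if idx is not None:
            z += params["coef"][idx]   # each occurrence adds its weight
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

query = "electronic cigarette shop"
from_blob = score(query, serialized)
from_sklearn = model.predict_proba(cv.transform([query]))[0, 1]
print(from_blob, from_sklearn)
```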

 

Page 14:

Validation
import matplotlib.pyplot
from sklearn.metrics import roc_curve
probs = model.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
matplotlib.pyplot.plot(fpr, tpr)

Page 15:

ROC: Receiver operating characteristic
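A small worked example of the ROC computation, assuming scikit-learn's `roc_curve` and `roc_auc_score`. The labels and scores are made up. Each threshold yields one (false positive rate, true positive rate) point, and the area under the curve summarizes the trade-off in a single number.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for four merchants.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold, from strictest to most lenient.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"fpr={f:.2f} tpr={t:.2f}")
print(f"AUC = {auc:.2f}")
```

Passing `fpr` and `tpr` to `matplotlib.pyplot.plot` draws the curve itself.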

 

Page 16:

Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
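The "sanitize text (strip HTML)" step could look roughly like the sketch below, which uses only Python's standard `html.parser`. The class and function names are hypothetical; the talk does not say which HTML library the pipeline used.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and lowercase the result.
        return " ".join(" ".join(self._chunks).split()).lower()

def sanitize(html):
    """Return lowercase visible text from an HTML snapshot."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

snapshot = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Vapors</h1><p>Buy <b>now</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(sanitize(snapshot))
```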

Page 17:

Luigi
import luigi

class GetSnapshots(luigi.Task):
    def run(self):
        ...

class GenFeatures(luigi.Task):
    def requires(self):
        return GetSnapshots()
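To make the `requires()` / `run()` pattern concrete without a Luigi installation, here is a toy, standard-library-only model of dependency resolution. It is not Luigi's implementation: `Task`, `STORE` and `build` are hypothetical stand-ins, and `requires()` returns a list here, whereas the Luigi code above returns a single task.

```python
# Stand-in for S3/HDFS: maps an output name to its contents.
STORE = {}

class Task:
    """Toy stand-in for luigi.Task."""

    def requires(self):
        return []

    def output(self):
        return type(self).__name__

    def complete(self):
        # A task counts as done once its output exists.
        return self.output() in STORE

    def run(self):
        raise NotImplementedError

class GetSnapshots(Task):
    def run(self):
        STORE[self.output()] = ["<p>vapors buy now</p>"]

class GenFeatures(Task):
    def requires(self):
        return [GetSnapshots()]

    def run(self):
        snapshots = STORE[GetSnapshots().output()]
        # Placeholder "feature": number of whitespace-separated tokens.
        STORE[self.output()] = [len(s.split()) for s in snapshots]

def build(task, order):
    """Run requirements first, then the task; skip completed tasks."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, order)
    task.run()
    order.append(task.output())

order = []
build(GenFeatures(), order)
print(order)

# A second build does nothing because both outputs already exist.
rerun = []
build(GenFeatures(), rerun)
print(rerun)
```

Skipping tasks whose output already exists is what makes a failed pipeline cheap to restart: only the missing steps run again.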

Page 18:

Luigi runs tasks on Hadoop cluster

Page 19:

Scoring as a service
Application submission -> Website scraped -> Text scored -> Application reviewed
Thrift RPC to Scoring Service

Page 20:

Scoring as a service
struct ScoringRequest {
  1: string text
  2: optional string model_name
}

struct ScoringResponse {
  1: double score  // Experiments?
  2: double request_duration
}
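A sketch of the handler logic behind such an interface, written in plain Python with dataclasses that mirror the two Thrift structs. The `ScoringHandler` class, the model registry and the toy models are assumptions for illustration; a real service would use Thrift-generated classes and a Thrift server.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct

@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds spent scoring

class ScoringHandler:
    """Hypothetical handler: looks up a model by name and times the call."""

    def __init__(self, models: Dict[str, Callable[[str], float]], default_model: str):
        self.models = models
        self.default_model = default_model

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        name = request.model_name or self.default_model
        value = self.models[name](request.text)
        return ScoringResponse(score=value,
                               request_duration=time.perf_counter() - start)

# Toy models standing in for deserialized classifiers.
handler = ScoringHandler(
    models={
        "ecigs": lambda text: 1.0 if "cigarette" in text else 0.0,
        "always_zero": lambda text: 0.0,
    },
    default_model="ecigs",
)

default_response = handler.score(ScoringRequest(text="electronic cigarette"))
named_response = handler.score(
    ScoringRequest(text="electronic cigarette", model_name="always_zero"))
print(default_response.score, named_response.score)
```

Selecting the model by name on each request is one way the optional `model_name` field could support running experiments side by side.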

Page 21:

Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation

Page 22:

- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
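For the monitoring side, Graphite's Carbon daemon accepts a plaintext protocol of one `metric value timestamp` line per data point, by default on port 2003. The sketch below formats such a line; the metric name and the `send_metric` helper are hypothetical, and the talk does not say how the service actually reported to Graphite.

```python
import socket
import time

def graphite_line(metric, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{metric} {value} {ts}\n"

def send_metric(line, host, port=2003):
    """Send a formatted line to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical metric name for the scoring service's latency.
line = graphite_line("scoring.ecigs.request_duration", 0.012, 1400000000)
print(line, end="")
```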

Page 23:

Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
Thanks! @mlmanapat

