Python as Part of a Production Machine Learning Stack, by Michael Manapat (PyData SV 2014)

Python as part of a production machine learning stack
Michael Manapat
@mlmanapat
Stripe

Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Stripe is a technology company focusing on making payments easy
- Short application

Tokenization

[Diagram: Customer browser -> Stripe (via Stripe.js); Merchant server -> Stripe (API call); Stripe -> Merchant server (result)]

API call

    import stripe

    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        email="customer@example.com"
    )

Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago

Fraud / business violations
- Terms of service violations: e-cigarettes, drugs, weapons, etc.
How do we find these automatically?

Merchant sign-up flow

[Diagram: Application submission -> Website scraped -> Text scored -> Application reviewed]

Merchant sign-up flow

[Diagram: Application submission -> Website scraped -> Machine learning pipeline and service]

Building a classifier: e-cigarettes

    import pandas

    data = pandas.read_pickle('ecigs')
    data.head()

                                                    text  violator
    0  " please verify your age i am 21 years or older ...    True
    1  coming soon toggle me drag me with your mouse ...     False
    2  drink moscow mules cart 0 log in or create an ...     False
    3  vapors electronic cigarette buy now insuper st...      True
    4  t-shirts shorts hawaii about us silver coll...        False

    [5 rows x 2 columns]

Features for text classification

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()
    features = cv.fit_transform(data['text'])

Sparse matrix of word counts from input text (omitting feature selection)
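To make the word-count matrix concrete, here is a minimal plain-Python sketch of the idea. It is a stand-in for illustration, not how CountVectorizer is implemented: the sample texts and the function name `count_matrix` are assumptions, and the tokenization (lowercase, split on whitespace) is simpler than sklearn's default.

```python
from collections import Counter


def count_matrix(texts):
    """Return (vocabulary, rows), where each row maps column index -> count."""
    # Build a sorted vocabulary so column indices are deterministic.
    vocab = sorted({word for text in texts for word in text.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        # Store only non-zero entries, which is what makes the matrix sparse.
        rows.append({index[word]: n for word, n in counts.items()})
    return vocab, rows


# Hypothetical two-document corpus.
texts = ["buy vapors buy now", "shorts hawaii about us"]
vocab, rows = count_matrix(texts)
```

Each row only stores the columns for words that actually occur in that document, so memory grows with the text length rather than with the vocabulary size.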

Features for text classification

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'], test_size=0.2)

- Avoid leakage
- Other cross-validation methods
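As one example of another cross-validation method, k-fold splitting holds out each sample exactly once. The sketch below is a plain-Python illustration under simple assumptions (contiguous folds, no shuffling or stratification); the function name `k_fold_indices` is hypothetical, and sklearn ships its own `KFold` class for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=5, k=2))
```

No index appears on both sides of a fold, so a model fitted on the training indices never sees the rows it is evaluated on.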

Training

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_train, y_train)

Serializer reads from model.intercept_ and model.coef_
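The talk does not show the serializer itself. A minimal sketch of the idea, assuming a JSON format and word-count features, might look like the following; `serialize_model`, `score`, and the parameter values are all hypothetical. The point is that a logistic regression needs only its intercept and coefficients to score new text.

```python
import json
import math


def serialize_model(intercept, coef, vocabulary):
    """Pack the parameters of a fitted linear model into a JSON string."""
    # float() also converts numpy scalars, which json cannot encode directly.
    weights = {word: float(coef[i]) for word, i in vocabulary.items()}
    return json.dumps({"intercept": float(intercept), "weights": weights})


def score(serialized, text):
    """Return the logistic-regression probability for a piece of text."""
    params = json.loads(serialized)
    z = params["intercept"]
    for word in text.lower().split():
        # Each occurrence adds the word's weight, matching a word-count feature.
        z += params["weights"].get(word, 0.0)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters; with sklearn these would come from
# model.intercept_[0], model.coef_[0], and cv.vocabulary_.
blob = serialize_model(-1.0, [2.0, -0.5], {"vapors": 0, "shorts": 1})
```

Because the scoring side depends only on the serialized parameters, it does not need sklearn installed.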

Validation

    import matplotlib.pyplot
    from sklearn.metrics import roc_curve

    probs = model.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    matplotlib.pyplot.plot(fpr, tpr)

ROC: Receiver operating characteristic
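For readers unfamiliar with the ROC curve, the sketch below computes its points by hand: for each score threshold, it records the false positive rate and true positive rate of flagging everything at or above that threshold. This is a plain-Python illustration with made-up labels and scores, not sklearn's `roc_curve` implementation; the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return a list of (fpr, tpr) points, one per distinct score threshold."""
    positives = sum(1 for y in y_true if y)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scoring >= threshold is flagged.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels and scores for illustration.
points = roc_points([True, False, True, False], [0.9, 0.8, 0.7, 0.1])
area = auc(points)
```

An area of 1.0 means every violator scores above every non-violator, while 0.5 is what random scores would give.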

Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
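The sanitization step in the list above is not shown in the talk. One possible sketch, using only the standard library's `html.parser`, is given here; the class and function names are hypothetical, and lowercasing and whitespace normalization are assumptions based on the look of the sample data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def sanitize(html):
    """Return lowercased, whitespace-normalized text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).lower().split())


clean = sanitize(
    "<html><script>var x = 1;</script><h1>Vapors</h1><p>Buy  now</p></html>")
```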

Luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        def requires(self):
            return GetSnapshots()

Luigi runs tasks on Hadoop cluster

Scoring as a service

[Diagram: Application submission -> Website scraped -> (Thrift RPC) -> Scoring Service]

Scoring as a service

    struct ScoringRequest {
        1: string text
        2: optional string model_name
    }

    struct ScoringResponse {
        1: double score  // Experiments?
        2: double request_duration
    }
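The handler behind such an interface could be sketched roughly as follows. This is not generated Thrift code: the dataclasses only mirror the two structs above, and `ScoringHandler`, the model registry, and the toy models are hypothetical. It shows how the optional `model_name` can select a model and how `request_duration` can be measured.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None


@dataclass
class ScoringResponse:
    score: float
    request_duration: float


class ScoringHandler:
    """Dispatch a request to a named model and time the call."""

    def __init__(self, models: Dict[str, Callable[[str], float]], default: str):
        self.models = models
        self.default = default

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        # Fall back to the default model when the optional field is unset.
        model = self.models[request.model_name or self.default]
        value = model(request.text)
        return ScoringResponse(
            score=value, request_duration=time.perf_counter() - start)


# Toy models for illustration only.
handler = ScoringHandler(
    models={
        "ecigs": lambda text: 1.0 if "vapors" in text else 0.0,
        "always_zero": lambda text: 0.0,
    },
    default="ecigs",
)
response = handler.score(ScoringRequest(text="vapors buy now"))
```

Keeping the model lookup behind one handler is what lets a caller switch models by name without changing client code.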

Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation

- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)

Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Thanks! @mlmanapat
