Spark & Machine Learning Workflows
Juliet Hougland @j_houg

© Cloudera, Inc. All rights reserved.


Spark Execution Model


Modeling Lifecycle

[Diagram: Historic Data → Model Training → Persisted Model; New Data → Model Scoring (with the Model) → Model Result]


[Diagram: Model Training. Historic Data is split into Training Data and Test Data; the Model Pipeline (Featurization, Model Fitting) is fit and evaluated, yielding a Persisted Model.]


Pipelines


Real Example
Churn Prediction for a Telco


The Dataset

KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
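As a rough illustration of how rows like these might be turned into typed records, here is a minimal sketch using only the standard library. The field names (`state`, `intl_plan`, `voice_mail_plan`, `churned`) and the assumption that positions 6 to 19 are all numeric are guesses based on the sample rows, not something the slides state.

```python
import csv
import io

# Two of the sample rows from the slide, verbatim
RAW = """KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
"""

def parse_row(fields):
    """Convert one raw row into a dict; column meanings are assumed."""
    fields = [f.strip() for f in fields]
    return {
        "state": fields[0],
        "intl_plan": fields[4] == "yes",
        "voice_mail_plan": fields[5] == "yes",
        # Positions 6..19 look numeric in every sample row
        "numeric": [float(v) for v in fields[6:20]],
        # The label carries a trailing period in most rows ("False.")
        "churned": fields[20].rstrip(".") == "True",
    }

rows = [parse_row(r) for r in csv.reader(io.StringIO(RAW)) if r]
```

Stripping the trailing period from the label is the one cleanup step the raw data clearly needs before any indexer or encoder sees it.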


Scikit-learn Pipelines

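from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

# get_data() stands for whatever loads the feature matrix X and labels Y
X, Y = get_data()
gbr = GradientBoostingClassifier()
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
gbr.fit(X_train, Y_train)
# A classifier produces predictions with predict()
Y_predicted = gbr.predict(X_test)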


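Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, Y = get_data()
# categorical_features was removed in scikit-learn 0.22; newer versions select columns with ColumnTransformer
pipeline = Pipeline([
    ('ohe', OneHotEncoder(categorical_features=[0, 20])),
    ('gbr', GradientBoostingClassifier()),
])

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
pipeline.fit(X_train, Y_train)
# The final step is a classifier, so the pipeline predicts
Y_predicted = pipeline.predict(X_test)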
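To make the pipeline contract concrete, here is a small pure-Python sketch of the idea: every step but the last is fit and then used to transform the data, and the last step is fit on the transformed result. The classes `OneHot`, `MajorityByRow`, and `MiniPipeline` are hypothetical toys written for this illustration; they are not scikit-learn classes and skip almost everything the real ones do.

```python
class OneHot:
    """Toy encoder: one 0/1 column per category seen during fit."""
    def fit(self, X, y=None):
        self.categories_ = sorted(set(X))
        return self

    def transform(self, X):
        return [[1 if x == c else 0 for c in self.categories_] for x in X]


class MajorityByRow:
    """Toy classifier: remembers the majority label for each encoded row."""
    def fit(self, X, y):
        votes = {}
        for row, label in zip(X, y):
            votes.setdefault(tuple(row), []).append(label)
        self.table_ = {k: max(set(v), key=v.count) for k, v in votes.items()}
        # Fall back to the overall majority label for unseen rows
        self.default_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.table_.get(tuple(row), self.default_) for row in X]


class MiniPipeline:
    """Chains (name, step) pairs: transformers first, one estimator last."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipeline = MiniPipeline([("ohe", OneHot()), ("clf", MajorityByRow())])
pipeline.fit(["yes", "no", "no", "yes", "no"], [True, False, False, True, False])
predictions = pipeline.predict(["no", "yes", "maybe"])
```

Because featurization lives inside the pipeline object, the same encoding learned at training time is reapplied at scoring time, which is the property that matters once the model is persisted and deployed.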


Apache Spark MLlib Pipelines


MLlib Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

# Index the string label and the categorical international-plan column
label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'intl_plan', outputCol = 'intl_plan_indexed')

# reduced_numeric_cols is assumed to be a list of numeric column names defined beforehand
assembler = VectorAssembler(
    inputCols = ['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol = 'features')
classifier = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
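For readers unfamiliar with what an indexing stage does to a column such as `intl_plan`, the following pure-Python sketch mimics the default behaviour of `StringIndexer`: labels are ordered by descending frequency and mapped to `0.0`, `1.0`, and so on. This is an illustration of the idea only, not Spark code; `fit_string_indexer` is a hypothetical helper, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumed)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# The intl_plan values from the five sample rows of the dataset
intl_plan = ["no", "no", "no", "yes", "yes"]

mapping = fit_string_indexer(intl_plan)
indexed = [mapping[v] for v in intl_plan]
```

The mapping is learned during `fit`, which is why the indexers belong inside the pipeline: new data gets the same label-to-index assignment the model was trained with.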


Deploy!


[Diagram: New Data → Model Scoring (with the Model) → Model Result]


Well, how did you save your model?

You have a few options:
• Pickle
• Joblib
• PMML
• Custom


“Pickles are for delis”

• Insecure
• Not Portable
• Big
• Slow

http://pyvideo.org/pycon-us-2014/pickles-are-for-delis-not-software.html
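The "insecure" point can be shown in a few lines. In this sketch, a class named `NotAModel` (a hypothetical stand-in for a tampered model file) tells pickle to call a function of its choosing when the file is loaded. The payload here is a harmless arithmetic expression, but it could be any callable.

```python
import pickle

class NotAModel:
    """Stand-in for a tampered 'model' file."""
    def __reduce__(self):
        # pickle stores this callable and its arguments;
        # pickle.loads() will call it instead of rebuilding the object
        return (eval, ("6 * 7",))

payload = pickle.dumps(NotAModel())

# Loading runs eval("6 * 7"); the caller gets its result, not a model
result = pickle.loads(payload)
```

Loading a pickle is therefore equivalent to running code from whoever produced the file, which is one reason to prefer a declarative format for models that cross trust boundaries.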


Storing Models as PMML

// Export a Spark MLlib model to a local file in PMML format
pipeline.toPMML("/path/to_my_file.xml")

# Export a scikit-learn model to a file in PMML format
from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr = True)


Spark PMML Export Supported Models


Distributed Model Fitting


Modeling Lifecycle


[Diagram: Model Training (Historic Data split into Training Data and Test Data, fed to the Model Pipeline: Featurization, Model Fitting) produces the Model; New Data → Model Scoring → Model Result]


MLlib Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

# Index the string label and the categorical international-plan column
label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'intl_plan', outputCol = 'intl_plan_indexed')

# reduced_numeric_cols is assumed to be a list of numeric column names defined beforehand
assembler = VectorAssembler(
    inputCols = ['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol = 'features')
classifier = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])


Distributed Grid Search


Modeling Lifecycle


[Diagram: Model Training repeated for several candidate Model Pipelines, each with its own Training Data and Test split; the resulting Model feeds Model Scoring (New Data → Model Result)]



Fit multiple models… Serially


from sklearn import ensemble, model_selection
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

# One value per parameter except loss, so the grid holds two candidates
tuned_parameters = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [2],  # scikit-learn requires at least 2
    # Classification losses ("log_loss" is "deviance" before scikit-learn 1.1)
    "loss": ["log_loss", "exponential"],
}

gbr = ensemble.GradientBoostingClassifier()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="accuracy")
clf.fit(X_train, Y_train)  # fits every candidate, one after another
best = clf.best_estimator_


Fit multiple models… in Parallel


from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],  # scikit-learn requires at least 2
    # Classification losses ("log_loss" is "deviance" before scikit-learn 1.1)
    "loss": ["log_loss", "exponential"],
}

gbr = ensemble.GradientBoostingClassifier()
# n_jobs=10 runs up to ten fits at once on one machine; pre_dispatch limits queued jobs
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="accuracy",
                   n_jobs=10, pre_dispatch=2)
clf.fit(X_train, Y_train)
best = clf.best_estimator_
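It helps to count how much work a grid of this shape implies. The sketch below expands a grid with the same dimensions as the one above (3 × 2 × 3 × 2 × 2 values) into explicit candidates and multiplies by the number of cross-validation folds. The specific parameter values are illustrative; only the list lengths matter for the count.

```python
from itertools import product

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],
    "loss": ["log_loss", "exponential"],
}
cv = 3  # folds per candidate, as in the grid search above

# Every combination of one value per parameter is one candidate model
names = sorted(tuned_parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(tuned_parameters[n] for n in names))
]

# Each candidate is fit once per fold
total_fits = len(candidates) * cv
```

Here that comes to 72 candidates and 216 fits, none of which depends on another, which is what makes the search a natural fit for parallel and distributed execution.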


Fit multiple models… Distributed

https://bigdatapix.tumblr.com/


from sklearn import ensemble
from spark_sklearn import GridSearchCV

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],  # scikit-learn requires at least 2
    # Classification losses ("log_loss" is "deviance" before scikit-learn 1.1)
    "loss": ["log_loss", "exponential"],
}

gbr = ensemble.GradientBoostingClassifier()
# spark_sklearn's GridSearchCV takes the SparkContext (sc) as its first argument
clf = GridSearchCV(sc, gbr, cv=3, param_grid=tuned_parameters, scoring="accuracy")
clf.fit(X_train, Y_train)
best = clf.best_estimator_
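The distribution pattern itself is simple: each candidate is an independent task, workers score them, and the driver keeps the best. The sketch below imitates that with a local thread pool standing in for Spark executors. The `evaluate` function is a hypothetical toy objective used in place of "fit this candidate and return its cross-validation score", and the two-parameter grid is cut down for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

grid = {"max_depth": [3, 4], "learning_rate": [0.001, 0.01, 0.05]}

def evaluate(params):
    """Toy stand-in for fitting one candidate and returning its CV score."""
    # Peaks at max_depth=4, learning_rate=0.01
    score = -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.01)
    return params, score

names = sorted(grid)
candidates = [dict(zip(names, values))
              for values in product(*(grid[n] for n in names))]

# Each candidate is an independent task; the workers stand in for executors
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(evaluate, candidates))

# The driver only has to compare scores and keep the winner
best_params, best_score = max(results, key=lambda r: r[1])
```

Note that in this pattern the training data must fit on every worker; what is distributed is the set of candidate models, not the data.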


Distributed Model Scoring


What do you mean by “Deploy?”


[Diagram: New Data → Model Scoring (with the Model) → Model Result]


Scoring with REST Server

[Diagram: HTTP Request → Model Scoring (with the Persisted Model) → HTTP Response]
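A scoring server of this kind can be sketched with the standard library alone. In the example below, `score` is a hypothetical stand-in for a persisted model's prediction function (its four-service-calls rule is invented for the illustration), and the `/score` path and JSON payload shape are likewise assumptions rather than anything the slides specify.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener

def score(features):
    """Hypothetical stand-in for a persisted model's predict()."""
    # Toy rule: flag churn risk at four or more customer service calls
    return {"churn": features["customer_service_calls"] >= 4}

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, score it, return JSON
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps(score(features)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def request_score(url, features):
    """Client side: POST one record and decode the JSON response."""
    req = Request(url, data=json.dumps(features).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    opener = build_opener(ProxyHandler({}))  # talk to localhost directly
    with opener.open(req) as resp:
        return json.loads(resp.read())

# Port 0 asks the OS for any free port
server = HTTPServer(("127.0.0.1", 0), ScoringHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/score" % server.server_port
result = request_score(url, {"customer_service_calls": 5})

server.shutdown()
server.server_close()
```

One request per record is fine for interactive use; for large batches the per-request overhead adds up, which is the motivation for the batch-scoring approaches that follow.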


Distributed Batch Model Scoring: With REST Server


Distributed Batch Model Scoring


[Diagram: New Data → Model Scoring (with the Model) → Model Result]


Distributed Batch Model Scoring: With REST Server


Distributed Batch Model Scoring: With Spark + JPMML

File pmmlFile = ...;

Evaluator evaluator = EvaluatorUtil.createEvaluator(pmmlFile);

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator).withTargetCols().withOutputCols().exploded(false);

Transformer pmmlTransformer = pmmlTransformerBuilder.build();
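A common shape for batch scoring inside a distributed job, sketched here in plain Python, is to load the model once per partition and then stream that partition's rows through it. Everything in this example is illustrative: `load_model` is a hypothetical loader with an invented toy rule, and two ordinary lists stand in for the partitions of a distributed dataset.

```python
def load_model():
    """Hypothetical loader; a real job would read the persisted model file."""
    load_model.calls += 1
    # Toy rule standing in for the model's prediction function
    return lambda row: row["customer_service_calls"] >= 4

load_model.calls = 0  # counts how often the model is loaded

def score_partition(rows):
    """Score one partition, loading the model once rather than once per row."""
    model = load_model()
    for row in rows:
        yield dict(row, churn=model(row))

# Two plain lists stand in for the partitions of a distributed dataset
partitions = [
    [{"id": 1, "customer_service_calls": 1},
     {"id": 2, "customer_service_calls": 5}],
    [{"id": 3, "customer_service_calls": 4}],
]

scored = [row for part in partitions for row in score_partition(part)]
```

With PySpark, a function of this shape (iterator in, iterator out) can be passed to `rdd.mapPartitions`, so the scoring runs next to the data instead of sending every record over HTTP.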


Modeling Lifecycle


[Diagram: New Data → Model Scoring (with the Model) → Model Result]


Juliet Hougland @j_houg

Thank You!
