Spark & Machine Learning Workflows
Juliet Hougland @j_houg
© Cloudera, Inc. All rights reserved.
Spark Execution Model
Modeling Lifecycle
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Model Training
[Diagram: Historic Data, Training Data, Test Data, Model Pipeline (Featurization, Model Fitting), Persisted Model, Evaluation]
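The training diagram separates historic data into training and test sets before the pipeline is fitted. A minimal sketch of that split in plain Python follows; the function name `split_historic_data`, the fixed seed, and the toy data are assumptions for illustration, not part of the talk.

```python
import random


def split_historic_data(rows, test_size=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


historic = list(range(10))  # stand-in for ten historic records
train, test = split_historic_data(historic, test_size=0.2)
```

The model is fitted on `train` only, and `test` is held back for the evaluation step.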
Pipelines
Real Example: Churn Prediction for a Telco
The Dataset
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
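Each record is a comma-separated line ending in a True/False field. A small sketch of parsing one record is shown here. Treating the last field as the churn label is an assumption based on the rest of the deck, and `parse_row` is a hypothetical helper.

```python
def parse_row(line):
    """Split one comma-separated record into feature strings and a boolean label."""
    fields = [field.strip() for field in line.split(",")]
    # Assumption: the last field is the churn label. Some rows end with a
    # trailing period, so it is stripped before comparing.
    label = fields[-1].rstrip(".") == "True"
    return fields[:-1], label


row = ("KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, "
       "16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.")
features, churned = parse_row(row)
```

The feature values stay as strings here; converting the numeric ones and encoding the yes/no columns is the job of the featurization steps.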
Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

X, Y = get_data()
gbr = GradientBoostingClassifier()
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
gbr.fit(X_train, Y_train)
# A fitted classifier scores new rows with predict(), not transform().
Y_predicted = gbr.predict(X_test)
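Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, Y = get_data()
pipeline = Pipeline([
    # categorical_features was removed in scikit-learn 0.22; newer versions
    # select columns with sklearn.compose.ColumnTransformer instead.
    ('ohe', OneHotEncoder(categorical_features=[0, 20])),
    ('gbr', GradientBoostingClassifier()),
])

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
pipeline.fit(X_train, Y_train)
# The final step is a classifier, so the pipeline is scored with predict().
Y_predicted = pipeline.predict(X_test)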
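To make the pipeline idea concrete, here is a toy sketch of what chaining does: every step but the last is fitted and used to transform the data, and the last step is the estimator that predicts. `MiniPipeline`, `YesNoEncoder`, and `MajorityClassifier` are hypothetical stand-ins written for this illustration, not scikit-learn classes.

```python
class MiniPipeline:
    """Toy pipeline: transformers followed by one final estimator."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Fit each transformer and pass its output to the next step.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the fitted transformers, then ask the estimator to predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class YesNoEncoder:
    """Hypothetical transformer: maps 'yes'/'no' strings to 1/0."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[1 if value == "yes" else 0 for value in row] for row in X]


class MajorityClassifier:
    """Hypothetical estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


pipeline = MiniPipeline([("enc", YesNoEncoder()), ("clf", MajorityClassifier())])
pipeline.fit([["yes"], ["no"], ["no"]], [True, False, False])
predictions = pipeline.predict([["yes"], ["no"]])
```

Because featurization and model travel together in one object, the same transformations are applied at training time and at scoring time.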
Apache Spark MLlib Pipelines
MLlib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
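The two indexer stages turn string columns into numeric indices. By default, StringIndexer gives the most frequent value index 0.0. The plain-Python sketch below mimics that behaviour on made-up values; `fit_string_index` is a hypothetical helper, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(i) for i, value in enumerate(ordered)}


intl_plan = ["no", "yes", "no", "no", "yes", "no"]  # made-up column values
index = fit_string_index(intl_plan)
intl_plan_indexed = [index[value] for value in intl_plan]
```

The mapping is learned when the pipeline is fitted and then reused unchanged when new data is scored.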
Deploy!
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Well, how did you save your model?
You have a few options:
• Pickle
• Joblib
• PMML
• Custom
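As one illustration of the "Custom" option, the sketch below stores a model's parameters as JSON and reloads them for scoring. The linear model, its parameter names, and the helpers `save_model`, `load_model`, and `score` are all assumptions made up for this example.

```python
import json
import os
import tempfile


def save_model(params, path):
    """Write model parameters to a JSON file."""
    with open(path, "w") as f:
        json.dump(params, f, sort_keys=True)


def load_model(path):
    """Read model parameters back from a JSON file."""
    with open(path) as f:
        return json.load(f)


def score(params, x):
    """Score one row with a hypothetical linear model."""
    return sum(w * v for w, v in zip(params["weights"], x)) + params["bias"]


model = {"weights": [0.5, -1.0], "bias": 0.25}
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(model, path)
restored = load_model(path)
```

A format like this is readable from any language, but every model type needs its own save, load, and scoring code.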
“Pickles are for delis”
• Insecure
• Not Portable
• Big
• Slow
http://pyvideo.org/pycon-us-2014/pickles-are-for-delis-not-software.html
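The "Insecure" point can be shown in a few lines: unpickling calls whatever callable the pickle names, so loading a file from an untrusted source can run arbitrary code. In this harmless sketch the class `NotAModel` is made up for illustration, and the "payload" only evaluates an arithmetic expression.

```python
import pickle


class NotAModel:
    """An object whose pickle runs a callable of its choosing when loaded."""

    def __reduce__(self):
        # pickle.loads() will call eval("6 * 7") instead of rebuilding the object.
        return (eval, ("6 * 7",))


payload = pickle.dumps(NotAModel())
loaded = pickle.loads(payload)  # the expression is evaluated here, at load time
```

Nothing in `payload` looks like a model, yet loading it executed code chosen by whoever created the file.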
Storing Models as PMML

// Export a Spark MLlib model to a local file in PMML format
pipeline.toPMML("/path/to_my_file.xml")

# Export a scikit-learn model to a file in PMML format
from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr = True)
Spark PMML Export Supported Models
Distributed Model Fitting
Modeling Lifecycle
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Model Training
[Diagram: Historic Data, Training Data, Test Data, Model Pipeline (Featurization, Model Fitting), Persisted Model, Evaluation]
MLlib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
Distributed Grid Search
Modeling Lifecycle
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Model Training
[Diagram: six copies of the training flow side by side, each with Training Data, Test Data, Model Pipeline, Persisted Model, Evaluation]
Fit multiple models… Serially
from sklearn import ensemble, model_selection
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [2],  # integer values must be at least 2
    "loss": ['ls', 'lad'],  # named 'squared_error' and 'absolute_error' in scikit-learn 1.0+
}

# 'ls', 'lad' and the median-absolute-error scorer are regression options,
# so the estimator here is the gradient boosting regressor.
gbr = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters,
                   scoring="neg_median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_
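A grid search fits one model per parameter combination per cross-validation fold. The sketch below enumerates the combinations for a grid patterned on this slide and counts the fits. `expand_grid` is a hypothetical helper, and the parameter values are illustrative, since only the number of options per parameter affects the count.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


small_grid = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [2],
    "loss": ["ls", "lad"],
}
candidates = expand_grid(small_grid)
cv = 3
total_fits = len(candidates) * cv  # one fit per candidate per fold
```

With a single worker these fits run one after another, and GridSearchCV by default adds one more fit to retrain the best candidate on the full training set.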
Fit multiple models… in Parallel
from sklearn import ensemble, model_selection
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],  # integer values must be at least 2
    "loss": ['ls', 'lad'],  # named 'squared_error' and 'absolute_error' in scikit-learn 1.0+
}

# 'ls', 'lad' and the median-absolute-error scorer are regression options,
# so the estimator here is the gradient boosting regressor.
gbr = ensemble.GradientBoostingRegressor()
# n_jobs=10 runs up to ten fits at once on the local machine.
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters,
                   scoring="neg_median_absolute_error",
                   n_jobs=10, pre_dispatch=2)
clf.fit(X_train, Y_train)
best = clf.best_estimator_
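Each candidate in a grid can be fitted independently of the others, which is what makes the search easy to parallelize. The sketch below shows the idea with a thread pool from the standard library. scikit-learn itself schedules work through joblib, so this is an analogy rather than its implementation, and `fit_and_score` is a made-up stand-in for fitting and evaluating one model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


def fit_and_score(params):
    """Hypothetical stand-in for fitting one model and returning its score."""
    return params["max_depth"] * params["learning_rate"]


grid = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],
    "loss": ["ls", "lad"],
}
candidates = expand_grid(grid)

# Serial: one candidate after another.
serial_scores = [fit_and_score(p) for p in candidates]

# Parallel: up to ten candidates in flight; map() keeps the input order.
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel_scores = list(pool.map(fit_and_score, candidates))
```

The results are identical either way; only the wall-clock time changes, and the speedup is capped by the cores of a single machine.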
Fit multiple models… Distributed
https://bigdatapix.tumblr.com/
from sklearn import ensemble, model_selection
from spark_sklearn import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],  # integer values must be at least 2
    "loss": ['ls', 'lad'],  # regression losses
}

# 'ls', 'lad' and the median-absolute-error scorer are regression options,
# so the estimator here is the gradient boosting regressor.
gbr = ensemble.GradientBoostingRegressor()
# spark_sklearn's GridSearchCV takes a SparkContext as its first argument;
# sc is assumed to be an existing SparkContext.
clf = GridSearchCV(sc, gbr, cv=3, param_grid=tuned_parameters,
                   scoring="neg_median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_
Distributed Model Scoring
What do you mean by “Deploy?”
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Scoring with REST Server
[Diagram: HTTP Request, Model Scoring, Persisted Model, HTTP Response]
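A REST scoring server wraps the persisted model behind an HTTP endpoint: features arrive in the request and the score goes back in the response. The sketch below uses only the Python standard library. The linear `MODEL`, the JSON field names, and the port are assumptions for illustration, and a real deployment would load a persisted model instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = {"weights": [0.5, -1.0], "bias": 0.25}  # hypothetical persisted model


def score_request(body):
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        features = json.loads(body)["features"]
        score = sum(w * x for w, x in zip(MODEL["weights"], features)) + MODEL["bias"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing key, or non-numeric features.
        return 400, json.dumps({"error": "expected a JSON object with a features list"})
    return 200, json.dumps({"score": score})


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = score_request(self.rfile.read(length))
        data = payload.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the scoring server; blocks until interrupted."""
    HTTPServer(("localhost", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping the scoring logic in `score_request` means it can be exercised without opening a socket.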
Distributed Batch Model Scoring: With REST Server
Distributed Batch Model Scoring
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Distributed Batch Model Scoring: With REST Server
Distributed Batch Model Scoring: With Spark + JPMML

File pmmlFile = ...;

Evaluator evaluator = EvaluatorUtil.createEvaluator(pmmlFile);

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
    .withTargetCols()
    .withOutputCols()
    .exploded(false);

Transformer pmmlTransformer = pmmlTransformerBuilder.build();
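The general pattern behind distributed batch scoring is that each partition of the new data is scored independently, with the model loaded once per partition rather than once per row. The plain-Python sketch below simulates partitions as lists. `load_model`, the linear model, and the data are assumptions for illustration and do not use the JPMML API.

```python
def load_model():
    """Hypothetical loader; in practice this would read the persisted model."""
    return {"weights": [0.5, -1.0], "bias": 0.25}


def score_partition(rows):
    """Score every row in one partition, loading the model only once."""
    model = load_model()
    for row in rows:
        yield sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]


# Stand-in for a distributed dataset: two partitions of feature rows.
partitions = [[[2.0, 1.0], [0.0, 0.0]], [[4.0, 1.0]]]
results = [list(score_partition(partition)) for partition in partitions]
```

In PySpark, a function of this shape, taking an iterator of rows and yielding results, is what `rdd.mapPartitions` expects.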
Modeling Lifecycle
[Diagram: Historic Data, Model Training, Persisted Model, Model Scoring, New Data, Model Result]
Juliet Hougland @j_houg
Thank You!