Spark & Machine Learning Workflows

Juliet Hougland @j_houg

© Cloudera, Inc. All rights reserved.

Post on 05-Jun-2020

Spark Execution Model

Modeling Lifecycle

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]
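
The lifecycle in the diagram can be sketched end to end in a few lines. This is only a toy illustration, assuming a hypothetical one-feature threshold "model" and JSON as the persisted format; `train`, `score`, and the sample numbers are invented for the example and do not come from the talk.

```python
import json
import os
import tempfile

def train(historic):
    """Model training: use the midpoint between the class means as a threshold."""
    pos = [x for x, churned in historic if churned]
    neg = [x for x, churned in historic if not churned]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return {"threshold": threshold}

def score(model, new_data):
    """Model scoring: flag every value above the learned threshold."""
    return [x > model["threshold"] for x in new_data]

# Historic data: (customer service calls, churned?) pairs, made up for this sketch.
historic_data = [(0, False), (1, False), (2, False), (5, True), (6, True)]
model = train(historic_data)

# Persisted model: written once by the training job ...
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump(model, f)

# ... and read back later by a separate scoring job.
with open(model_path) as f:
    loaded_model = json.load(f)

new_data = [1, 4, 7]
model_result = score(loaded_model, new_data)
```

The point of the split is that training and scoring only share the persisted file, so they can run at different times and on different machines.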

Model Training

[Diagram: Historic Data is split into Training Data and Test Data; the Model Pipeline (Featurization, Model Fitting) produces a Persisted Model, which goes to Evaluation]

Pipelines

Real Example: Churn Prediction for a Telco

The Dataset

KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False.
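
Before any featurization, rows like these have to be split into typed fields. A minimal sketch using only the standard library follows; the column meanings (state first, international plan fifth, churn label last) are assumptions read off the sample rows, and `parse_row` is a hypothetical helper, not code from the talk.

```python
import csv
import io

# Two rows copied from the slide; the trailing "." is part of the raw label.
raw = """KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
"""

def parse_row(fields):
    """Split one record into (state, intl_plan, numeric values, churned)."""
    fields = [f.strip() for f in fields]
    state = fields[0]
    intl_plan = fields[4] == "yes"              # assumed to be the 5th column
    churned = fields[-1].rstrip(".") == "True"  # label carries a trailing period
    # Everything from the 7th field up to (not including) the label is numeric.
    numeric = [float(v) for v in fields[6:-1]]
    return state, intl_plan, numeric, churned

rows = [parse_row(r) for r in csv.reader(io.StringIO(raw)) if r]
```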

Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

X, Y = get_data()
gbr = GradientBoostingClassifier()
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
gbr.fit(X_train, Y_train)
# Predictions come from predict(), not transform()
Y_predicted = gbr.predict(X_test)

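Scikit-learn Pipelines

from sklearn import model_selection
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, Y = get_data()
# The slide used OneHotEncoder(categorical_features=[0, 20]); that argument was
# removed in scikit-learn 0.22, so ColumnTransformer selects columns 0 and 20 here
pipeline = Pipeline([
    ('ohe', ColumnTransformer([('cat', OneHotEncoder(), [0, 20])], remainder='passthrough')),
    ('gbr', GradientBoostingClassifier()),
])

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
pipeline.fit(X_train, Y_train)
# The final stage is a classifier, so predictions come from predict(), not transform()
Y_predicted = pipeline.predict(X_test)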
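
For readers who want to run the pipeline pattern without the telco data, here is a self-contained variant on synthetic data. It is a sketch under assumptions: the data generator stands in for `get_data()`, only column 0 is categorical, and `ColumnTransformer` is used to pick the column to encode. None of the numbers relate to the talk's dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for get_data(): column 0 is a categorical code,
# columns 1-2 are numeric.
rng = np.random.RandomState(0)
n = 200
category = rng.randint(0, 4, size=n)
numeric = rng.normal(size=(n, 2))
X = np.column_stack([category, numeric])
Y = ((numeric[:, 0] + (category == 2)) > 0.5).astype(int)

pipeline = Pipeline([
    # One-hot encode column 0 and pass the numeric columns through unchanged.
    ("ohe", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), [0])],
        remainder="passthrough")),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)
pipeline.fit(X_train, Y_train)          # fits the encoder, then the classifier
Y_predicted = pipeline.predict(X_test)  # applies the same encoding before predicting
accuracy = pipeline.score(X_test, Y_test)
```

Because the encoder lives inside the pipeline, it is fitted on the training split only and reapplied unchanged at prediction time.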

Apache Spark MLLib Pipelines

MLLib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

Deploy!

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]

Well, how did you save your model?

You have a few options:
• Pickle
• Joblib
• PMML
• Custom

“Pickles are for delis”

• Insecure
• Not Portable
• Big
• Slow

http://pyvideo.org/pycon-us-2014/pickles-are-for-delis-not-software.html
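
The "insecure" point is worth seeing concretely: unpickling can execute code chosen by whoever produced the file. The sketch below is a deliberately harmless demonstration; the `Malicious` class and the expression it evaluates are invented for illustration.

```python
import pickle

class Malicious:
    # pickle calls __reduce__ when serializing. The callable and arguments it
    # returns are invoked by pickle.loads() on the receiving side.
    def __reduce__(self):
        # Harmless here, but any expression could be placed in this string.
        return (eval, ("__import__('os').getpid()",))

payload = pickle.dumps(Malicious())

# Loading does not rebuild a Malicious object; it runs eval(...) and
# returns whatever the attacker-chosen expression produced.
loaded = pickle.loads(payload)
```

A model file loaded this way is therefore only as trustworthy as its source, which is one argument for a declarative format such as PMML.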

Storing Models as PMML

// Export a Spark MLLib model to a local file in PMML format
// (toPMML is provided by spark.mllib models that implement PMMLExportable)
pipeline.toPMML("/path/to_my_file.xml")

# Export a scikit-learn model to a file in PMML format
from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr=True)

Spark PMML Export Supported Models

Distributed Model Fitting

Modeling Lifecycle

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]

Model Training

[Diagram: Historic Data is split into Training Data and Test Data; the Model Pipeline (Featurization, Model Fitting) produces a Persisted Model, which goes to Evaluation]

MLLib Pipelines

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

label_indexer = StringIndexer(inputCol='churned', outputCol='label')
plan_indexer = StringIndexer(inputCol='intl_plan', outputCol='intl_plan_indexed')

assembler = VectorAssembler(
    inputCols=['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

Distributed Grid Search

Modeling Lifecycle

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]

Model Training

[Diagram: six copies of the training flow side by side, each with Training Data, Test, Model Pipeline, Persisted Model, and Evaluation]

Fit multiple models… Serially

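from sklearn import ensemble, model_selection
# GridSearchCV moved from sklearn.grid_search to sklearn.model_selection in 0.18
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300],
    "max_depth": [4],
    "learning_rate": [0.01],
    "min_samples_split": [2],  # the slide had 1; scikit-learn requires at least 2
    "loss": ['ls', 'lad'],     # renamed 'squared_error' / 'absolute_error' in scikit-learn 1.0
}

# 'ls' / 'lad' and the median absolute error score are regression settings,
# so the estimator is a regressor rather than the slide's GradientBoostingClassifier
gbr = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="neg_median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_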

Fit multiple models… in Parallel

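from sklearn import ensemble, model_selection
# GridSearchCV moved from sklearn.grid_search to sklearn.model_selection in 0.18
from sklearn.model_selection import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],  # the slide had [1, 3]; scikit-learn requires at least 2
    "loss": ['ls', 'lad'],        # renamed 'squared_error' / 'absolute_error' in scikit-learn 1.0
}

# Regression losses and scoring, so a regressor replaces the slide's GradientBoostingClassifier
gbr = ensemble.GradientBoostingRegressor()
# n_jobs=10 fits up to ten candidates at once on this machine
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters, scoring="neg_median_absolute_error", n_jobs=10, pre_dispatch=2)
clf.fit(X_train, Y_train)
best = clf.best_estimator_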
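
The jump from the serial grid to the parallel one is easier to appreciate as arithmetic. The sketch below counts cross-validation fits from the number of values listed per parameter on the two slides; `count_fits` is a hypothetical helper, and the count ignores the final refit of the best model.

```python
from math import prod

def count_fits(values_per_param, cv):
    """Return (parameter combinations, cross-validation fits) for a grid."""
    combinations = prod(values_per_param.values())
    return combinations, combinations * cv

# Number of candidate values per parameter on the "Serially" slide.
serial_sizes = {"n_estimators": 1, "max_depth": 1, "learning_rate": 1,
                "min_samples_split": 1, "loss": 2}
# Number of candidate values per parameter on the "in Parallel" slide.
parallel_sizes = {"n_estimators": 3, "max_depth": 2, "learning_rate": 3,
                  "min_samples_split": 2, "loss": 2}

serial_combos, serial_fits = count_fits(serial_sizes, cv=3)
parallel_combos, parallel_fits = count_fits(parallel_sizes, cv=3)
```

With 3-fold cross-validation the first grid needs 6 fits and the second needs 216. Each fit is independent of the others, which is what makes the search easy to spread across cores or a cluster.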

Fit multiple models… Distributed

https://bigdatapix.tumblr.com/

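from sklearn import ensemble, model_selection
# Same interface as scikit-learn's GridSearchCV, but candidate fits run as Spark tasks
from spark_sklearn import GridSearchCV

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

tuned_parameters = {
    "n_estimators": [300, 400, 200],
    "max_depth": [4, 3],
    "learning_rate": [0.01, 0.05, 0.001],
    "min_samples_split": [2, 3],  # the slide had [1, 3]; scikit-learn requires at least 2
    "loss": ['ls', 'lad'],
}

# Regression losses and scoring, so a regressor replaces the slide's GradientBoostingClassifier
gbr = ensemble.GradientBoostingRegressor()
# spark_sklearn's GridSearchCV takes the SparkContext first; sc is assumed to exist already
clf = GridSearchCV(sc, gbr, cv=3, param_grid=tuned_parameters, scoring="neg_median_absolute_error")
clf.fit(X_train, Y_train)
best = clf.best_estimator_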

Distributed Model Scoring

What do you mean by “Deploy?”

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]

Scoring with REST Server

[Diagram: HTTP Request → Model Scoring (backed by the Persisted Model) → HTTP Response]
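
The request/response flow in the diagram can be sketched with the standard library alone. Everything here is an assumption made for illustration: `load_model` returns a stub rule rather than a real persisted model, and the `/score` path and JSON shape are invented. A production service would use a proper web framework.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_model():
    """Hypothetical stand-in for loading a persisted model from disk."""
    return lambda features: sum(features) > 10.0

MODEL = load_model()   # loaded once at startup, reused for every request

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Request body: {"features": [numbers]}
        length = int(self.headers.get("Content-Length", 0))
        record = json.loads(self.rfile.read(length))
        body = json.dumps({"churn": MODEL(record["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep the example quiet
        pass

# Ignore any system proxy settings for this local call.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def score_over_http(url, features):
    """Client side: one HTTP request in, one model result out."""
    request = urllib.request.Request(
        url, data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"})
    with opener.open(request) as response:
        return json.loads(response.read())

# Port 0 asks the OS for any free port; a real service would use a fixed one.
server = HTTPServer(("127.0.0.1", 0), ScoringHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/score" % server.server_port

result_high = score_over_http(url, [6.0, 7.5])
result_low = score_over_http(url, [1.0, 2.0])
server.shutdown()
server.server_close()
```

Loading the model once at startup keeps per-request latency down to parsing, scoring, and serialization.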

Distributed Batch Model Scoring: With REST server

Distributed Batch Model Scoring

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]

Distributed Batch Model Scoring: With REST Server

Distributed Batch Model Scoring: With Spark + JPMML

File pmmlFile = ...;

Evaluator evaluator = EvaluatorUtil.createEvaluator(pmmlFile);

TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator).withTargetCols().withOutputCols().exploded(false);

Transformer pmmlTransformer = pmmlTransformerBuilder.build();

Modeling Lifecycle

[Diagram: Historic Data → Model Training → Persisted Model → Model Scoring → Model Result; New Data feeds Model Scoring]

Juliet Hougland @j_houg

Thank You!