Machine Learning Pipeline with Spark ML


Upload: datamantra

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Machine learning pipeline with spark ml

Machine Learning Pipeline with Spark ML

End to End Machine learning

https://github.com/RamkSwamy/sparkmlpipeline

Page 2: Machine learning pipeline with spark ml

● Ram Kuppuswamy

● Worked at Microsoft for 13 years

● Co-Founder, Zinnia Systems Pvt. Ltd.

● Big data consultant and trainer at datamantra.io

Page 3: Machine learning pipeline with spark ml

Agenda

● Machine Learning Pipeline

● Spark ML API

● Components of ML API

● Building a pipeline

● Persisting a model

● Evaluating a pipeline Model

● Cross validating pipeline Model

Page 4: Machine learning pipeline with spark ml

Machine learning

● Most developers think machine learning is mostly about the learning algorithm

● Most big data libraries, such as MLlib and Mahout, focus on implementing algorithms in a distributed manner

● But when you try to productionize an end-to-end solution, you quickly realize that machine learning is not just about the learning algorithm

● There are many other important steps in building an end-to-end machine learning application

Page 5: Machine learning pipeline with spark ml

Stages of Machine Learning application

● Data Exploration

  ○ Read Data

  ○ Missing data

  ○ Look at correlation

  ○ Statistics of independent variables

● Data Preparation (Preprocess data)

  ■ Indexing the labels

  ■ Handling categorical variables

  ■ Converting text data to numeric values (Word2Vec)

Page 6: Machine learning pipeline with spark ml

Stages of ML Application continued...

● Model training

● Model evaluation

● Model Tuning

● Repeat this process many times

Page 7: Machine learning pipeline with spark ml

Spark MLlib

● Only focused on model learning

● No standard way to do the other steps of an ML pipeline

● No way to combine all these steps and execute them

● Based on the RDD API

● Though some of these steps were added later, they were not uniform across the algorithms

Page 8: Machine learning pipeline with spark ml

Spark ML

● Provides a higher-level API for construction and tuning of ML workflows

● Built on top of DataFrames

● We are using Spark 2.0, in which the DataFrame-based ML API is the primary library for machine learning going forward, while the RDD-based MLlib API has entered maintenance mode

Page 9: Machine learning pipeline with spark ml

Case Study

● We use the Census Income dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Census+Income

● The training and test data are provided separately in two files:

  ○ Adult.data

  ○ Adult.test

● Objective: predict whether an individual's income is >50K or <=50K by constructing a pipeline.

Page 10: Machine learning pipeline with spark ml

Abstractions of Spark ML

● Transformer

● Estimator

● Evaluator

● Pipeline

● Params

Page 11: Machine learning pipeline with spark ml

Data

Page 12: Machine learning pipeline with spark ml

Data Exploration

● Read the data and create a DataFrame

  ○ Util:loadSalaryCsvTrain

  ○ Util:loadSalaryCsvTest

● Look at the schema

  ○ SalaryDataSchema

● Look at statistics of variables

  ○ SalaryDataExplore

Page 13: Machine learning pipeline with spark ml

Data Preparation

● Clean the data

  ○ cleanDataFrame Util

● Label Indexing

● Categorical handling

  ○ String Indexing

  ○ One-hot encoding

Page 14: Machine learning pipeline with spark ml

Estimator

● An Estimator is an abstraction of an algorithm that is fitted on a DataFrame to produce a model.

● It implements a method fit():

DF → Estimator → Model

Page 15: Machine learning pipeline with spark ml

Label Indexing

● We want to create the label, which is the dependent variable

● We have 2 different values: a) >50K b) <=50K

● One will take the value 0 and the other 1

● We use the StringIndexer API to achieve this

● It encodes a string column of labels to a column of label indices.

● The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0
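A plain-Python sketch of the frequency-based indexing described above. It only imitates the documented behaviour of StringIndexer and is not Spark code; the function name `fit_label_index` and the alphabetical tie-break are assumptions made for this illustration.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label to an index in [0, numLabels), most frequent first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch, chosen to keep the result deterministic).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# Toy income column: "<=50K" appears three times, ">50K" twice.
incomes = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]
label_index = fit_label_index(incomes)

# Applying the learned mapping is the "transform" half of the job.
indexed = [label_index[value] for value in incomes]
```

Because "<=50K" is the more frequent value here, it receives index 0 and ">50K" receives index 1.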

Page 16: Machine learning pipeline with spark ml

String Indexer

● We use the StringIndexer API to achieve this

● It encodes a string column of labels to a column of label indices.

● SalaryLabelIndexing

Page 17: Machine learning pipeline with spark ml

Categorical handling

● We have many categorical fields, such as occupation, sex, workclass, relationship and marital_status.

● They are all String types. We use StringIndexer to generate the indices and then OneHotEncoder, which maps a column of label indices to a column of binary vectors, with at most a single one-value.

● This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

Page 18: Machine learning pipeline with spark ml

StringIndexer

● It is an Estimator; fitting it on the data produces a StringIndexerModel

Page 19: Machine learning pipeline with spark ml

Example: StringIndexer

Page 20: Machine learning pipeline with spark ml

Transformer

● A Transformer is an abstraction of an algorithm which transforms one DataFrame into another.

● It implements a method transform()

DF → Transformer → DF

Page 21: Machine learning pipeline with spark ml

OneHotEncoder

● In Spark 2.0 it is a Transformer which takes a column of category indices and converts it into a column of binary vectors

● CategoricalForWorkClass
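To make the "at most a single one-value" wording concrete, here is a plain-Python sketch of one-hot encoding a category index. It is not Spark code: the function `one_hot` is hypothetical, and its `drop_last` argument is modelled on the encoder's `dropLast` parameter, under which the last category is represented by an all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense binary vector (list of floats)."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    # With drop_last the vector has one slot fewer, so the last
    # category is encoded as all zeros.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Three workclass categories indexed 0, 1, 2 by a previous indexing step.
first = one_hot(0, 3)                    # one slot set
last = one_hot(2, 3)                     # all zeros when the last category is dropped
full = one_hot(2, 3, drop_last=False)    # keeps a slot for every category
```

Dropping the last category avoids a redundant column, since its value is implied when all other slots are zero.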

Page 22: Machine learning pipeline with spark ml

Vector assembler

● A feature transformer that merges multiple columns into a vector column.

● The LogisticRegression model expects a vector column named "features" by default; you can also set a different column name.

● SalaryVectorAssembler
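The merging step can be pictured with a small plain-Python sketch. It is only an illustration of the idea, not Spark code; the function `assemble` and the column names in the sample row are assumptions.

```python
def assemble(row, input_cols):
    """Concatenate numeric and vector-valued columns of one row into a flat feature vector."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-valued column (for example a one-hot encoded field).
            features.extend(float(v) for v in value)
        else:
            # Plain numeric column.
            features.append(float(value))
    return features


row = {"age": 39, "workclass_vec": [0.0, 1.0], "hours_per_week": 40}
row["features"] = assemble(row, ["age", "workclass_vec", "hours_per_week"])
```

The order of `input_cols` fixes the position of each value in the resulting vector, so the same order has to be used for training and test data.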

Page 23: Machine learning pipeline with spark ml

Building pipeline

Page 24: Machine learning pipeline with spark ml

Pipeline

● Chain of Transformers and Estimators

● Pipeline itself is an Estimator

● Fitting it on a DataFrame produces a model (a PipelineModel)

● Once you have defined a pipeline, you can have the training and test datasets go through the same stages of processing

  ○ buildOneHotPipeLine

  ○ buildPipeLineForFeaturePreparation

  ○ buildDataPrepPipeLine
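The way a pipeline chains its stages can be sketched in plain Python. This is a toy model of the abstraction, not Spark's implementation: rows are dicts instead of DataFrame rows, and the stages `Strip` and `MeanFill` are hypothetical examples invented for this sketch.

```python
class Transformer:
    """Turns one dataset (here a list of dict rows) into another."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer (a model)."""
    def fit(self, rows):
        raise NotImplementedError


class Strip(Transformer):
    """Hypothetical cleaning stage: trims whitespace in one string column."""
    def __init__(self, col):
        self.col = col

    def transform(self, rows):
        return [{**row, self.col: row[self.col].strip()} for row in rows]


class MeanFillModel(Transformer):
    """Fitted form of MeanFill: replaces missing values with the learned mean."""
    def __init__(self, col, mean):
        self.col, self.mean = col, mean

    def transform(self, rows):
        return [{**row, self.col: self.mean if row[self.col] is None else row[self.col]}
                for row in rows]


class MeanFill(Estimator):
    """Hypothetical stage: learns a column mean from the data it is fitted on."""
    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        present = [row[self.col] for row in rows if row[self.col] is not None]
        return MeanFillModel(self.col, sum(present) / len(present))


class PipelineModel(Transformer):
    """All stages fitted; applies them in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by the earlier stages.
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            fitted.append(model)
            rows = model.transform(rows)
        return PipelineModel(fitted)


train = [{"workclass": " Private", "age": 20.0},
         {"workclass": " State-gov", "age": None},
         {"workclass": " Private", "age": 40.0}]
test = [{"workclass": " Private ", "age": None}]

model = Pipeline([Strip("workclass"), MeanFill("age")]).fit(train)
prepared_test = model.transform(test)
```

The mean used to fill the test row is the one learned from the training rows, which is the point of sending both datasets through the same fitted pipeline.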

Page 25: Machine learning pipeline with spark ml

Model training

Page 26: Machine learning pipeline with spark ml

Prediction with test data

Page 27: Machine learning pipeline with spark ml

Logistic regression

● We want to train the LogisticRegression model with the training data

● We create a pipeline which takes the training data, trains the model with it, and provides that model for us to use for prediction on the test data

● Example code: LRTraining module

Page 28: Machine learning pipeline with spark ml

Model training with training data

● The model expects the feature vectors in a column named "features" and the labels (the dependent variable that we are trying to predict) in a column named "label" by default.
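As a rough picture of what training on "features" and "label" means, here is a plain-Python logistic regression sketch. It uses simple batch gradient descent, which is not the optimizer Spark uses; the function names, the `step` size and the toy rows are assumptions, while `reg_param` and `max_iter` are named after the regularization and iteration settings that get tuned later in the deck.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, reg_param=0.0, max_iter=200, step=0.5):
    """rows: list of {"features": [floats], "label": 0.0 or 1.0}. Returns (weights, bias)."""
    n = len(rows)
    n_features = len(rows[0]["features"])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(max_iter):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row in rows:
            z = sum(w * x for w, x in zip(weights, row["features"])) + bias
            error = sigmoid(z) - row["label"]
            for j, x in enumerate(row["features"]):
                grad_w[j] += error * x
            grad_b += error
        # Average gradient of the log loss plus an L2 penalty on the weights.
        weights = [w - step * (g / n + reg_param * w) for w, g in zip(weights, grad_w)]
        bias -= step * grad_b / n
    return weights, bias


def predict(model, features):
    weights, bias = model
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 if sigmoid(z) > 0.5 else 0.0


# Toy, linearly separable training data with a single feature.
training = [{"features": [-2.0], "label": 0.0}, {"features": [-1.0], "label": 0.0},
            {"features": [1.0], "label": 1.0}, {"features": [2.0], "label": 1.0}]
lr_model = train_logistic_regression(training)
```

Calling `predict(lr_model, [1.5])` on this toy model returns the positive class, and `predict(lr_model, [-1.5])` the negative one.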

Page 29: Machine learning pipeline with spark ml

Model evaluation

Page 30: Machine learning pipeline with spark ml

Evaluator

● Area under the ROC curve is used to measure the quality of our model

● We use evaluators such as BinaryClassificationEvaluator in this process

● An Evaluator takes the data and a metric-related parameter and computes the requested metric, for example area under the ROC curve or area under the PR curve

● It has an evaluate() method
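For intuition about the metric itself, area under the ROC curve can be computed directly as the fraction of (positive, negative) pairs that the model ranks correctly. The plain-Python sketch below uses that pairwise definition; it is not how Spark computes the value internally, and the function name and sample scores are assumptions.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive gets the higher score."""
    positives = [s for label, s in zip(labels, scores) if label == 1.0]
    negatives = [s for label, s in zip(labels, scores) if label == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correctly_ranked = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correctly_ranked += 1.0
            elif pos == neg:
                correctly_ranked += 0.5  # ties count as half
    return correctly_ranked / (len(positives) * len(negatives))


labels = [1.0, 0.0, 1.0, 0.0]
scores = [0.9, 0.8, 0.7, 0.1]   # predicted probability of the positive class
auc = area_under_roc(labels, scores)
```

Here three of the four positive/negative pairs are ordered correctly, so the result is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to a perfect one.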

Page 31: Machine learning pipeline with spark ml

Evaluator continued

● Takes Data and Parameters and provides metrics

  ○ SalaryEvaluator

DF + Param → Evaluator → Metric

Page 32: Machine learning pipeline with spark ml

Model selection

Page 33: Machine learning pipeline with spark ml

Cross-validator

● We want to tune the performance of our models

● It takes the following inputs:

  ○ Estimator: the pipeline we have built

  ○ Parameter Grid: regularization and number of iterations for LR

  ○ Evaluator: Binary classification evaluator

● Find the best parameters

  ○ SalaryCrossValidator
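The interplay of estimator, parameter grid and evaluator can be sketched in plain Python. This is a simplified illustration, not Spark's CrossValidator: folds are assigned by row position rather than randomly, the metric is assumed to be "larger is better", and the toy estimator `fit_shrunken_mean` and metric `negative_mse` are hypothetical stand-ins for the pipeline and the binary classification evaluator.

```python
from itertools import product


def param_grid(options):
    """Expand {"name": [values, ...]} into a list of every parameter combination."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*(options[n] for n in names))]


def cross_validate(rows, grid, fit, evaluate, num_folds=3):
    """Return (best_params, best_mean_score); a larger score is treated as better."""
    best_params, best_score = None, None
    for params in grid:
        scores = []
        for fold in range(num_folds):
            train = [r for i, r in enumerate(rows) if i % num_folds != fold]
            held_out = [r for i, r in enumerate(rows) if i % num_folds == fold]
            scores.append(evaluate(fit(train, params), held_out))
        mean_score = sum(scores) / num_folds
        if best_score is None or mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_shrunken_mean(train, params):
    """Toy estimator: predicts the training mean, shrunk towards 0 by reg_param."""
    return (sum(train) / len(train)) / (1.0 + params["reg_param"])


def negative_mse(prediction, held_out):
    """Toy evaluator: negated mean squared error, so larger is better."""
    return -sum((value - prediction) ** 2 for value in held_out) / len(held_out)


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grid = param_grid({"reg_param": [0.0, 1.0], "max_iter": [10, 50]})
best_params, best_score = cross_validate(values, grid, fit_shrunken_mean, negative_mse)
```

Every combination in the grid is fitted once per fold, so the cost grows with the grid size times the number of folds; the toy estimator ignores `max_iter`, which is included only to show a two-parameter grid.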

Page 34: Machine learning pipeline with spark ml

Recap ML Pipelines

Load Data → Pipeline (StringIndexer → OneHotEncoder → VectorAssembler → LogisticRegression) → Evaluate

Legend: Transformer, Estimator