TRANSCRIPT
Machine Learning Pipeline with Spark ML
End-to-End Machine Learning
https://github.com/RamkSwamy/sparkmlpipeline
● Ram Kuppuswamy
● Worked at Microsoft for 13 years
● Co-Founder, Zinnia Systems Pvt. Ltd.
● Big data consultant and trainer at datamantra.io
Agenda
● Machine Learning Pipeline
● Spark ML API
● Components of ML API
● Building a pipeline
● Persisting a model
● Evaluating a pipeline model
● Cross-validating a pipeline model
Machine Learning
● Most developers think machine learning is mostly about the learning algorithm
● Most big data libraries, like MLlib and Mahout, are focused on implementing algorithms in a distributed manner
● But when you try to productionize an end-to-end solution, you quickly realize that machine learning is not just about the learning algorithm
● There are many other important steps in building an end-to-end machine learning application
Stages of a Machine Learning Application
● Data Exploration
○ Read data
○ Handle missing data
○ Look at correlations
○ Statistics of independent variables
● Data Preparation (Preprocess data)
○ Indexing the labels
○ Handling categorical variables
○ Turning text data into numeric values (word2vec)
Stages of an ML Application (continued)
● Model training
● Model evaluation
● Model tuning
● Repeat this process many times
Spark MLlib
● Focused only on model learning
● No standard way to do the other steps of an ML pipeline
● No way to combine all these steps and execute them together
● Based on the RDD API
● Though some of these steps were added later, they were not uniform across the algorithms
Spark ML
● Provides a higher-level API for constructing and tuning ML workflows
● Built on top of DataFrames
● We are using Spark 2.0, in which ML is the library for machine learning going forward and MLlib will be deprecated
Case Study
● We use the following dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Census+Income
● The training and test data are given in two separate files:
○ Adult.data
○ Adult.test
● Objective: predict whether the income of an individual is >50K or <=50K by constructing a pipeline.
Abstractions of Spark ML
● Transformer
● Estimator
● Evaluator
● Pipeline
● Params
Data
Data Exploration
● Read the data and create a DataFrame
○ Util: loadSalaryCsvTrain
○ Util: loadSalaryCsvTest
● Look at the schema
○ SalaryDataSchema
● Look at statistics of the variables
○ SalaryDataExplore
Data Preparation
● Clean the data
○ cleanDataFrame Util
● Label indexing
● Categorical handling
○ String indexing
○ One-hot encoding
Estimator
● An Estimator is an abstraction for an algorithm which is fitted on a DataFrame, returning a Model
● It implements a method fit()

DF → Estimator → Model
Label Indexing
● We want to create the label, which is the dependent variable
● We have two different values: a) >50K b) <=50K
● One will take the value '0' and the other '1'
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label indices
● The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0
String Indexer
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label indices
● SalaryLabelIndexing
Categorical Handling
● We have many categorical fields, such as occupation, sex, workclass, relationship and marital_status
● They are all String types; we use the StringIndexer to generate the indices and then use OneHotEncoder, which maps a column of label indices to a column of binary vectors with at most a single one-value
● This encoding allows algorithms which expect continuous features, such as logistic regression, to use categorical features
StringIndexer
● It is an Estimator; fitting it on the data produces a StringIndexerModel
Example: StringIndexer
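The slide's example is the repo's SalaryLabelIndexing module; a minimal stand-alone sketch of the same idea, with toy data and column names that are assumptions rather than the repo's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder()
  .master("local[2]").appName("StringIndexerExample").getOrCreate()
import spark.implicits._

// Toy stand-in for the census income column
val df = Seq("<=50K", ">50K", "<=50K", "<=50K").toDF("income")

// StringIndexer is an Estimator: fit() returns a StringIndexerModel
val indexer = new StringIndexer()
  .setInputCol("income")
  .setOutputCol("label")

val indexed = indexer.fit(df).transform(df)
indexed.show()
// "<=50K" is the most frequent label, so it gets index 0.0
```

Because indices are assigned by frequency, the majority class <=50K becomes label 0.0 and >50K becomes 1.0.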
Transformer
● A Transformer is an abstraction for an algorithm which transforms one DataFrame into another
● It implements a method transform()

DF → Transformer → DF
OneHotEncoder
● It is a Transformer which takes the indexed data and converts it into a vector
● CategoricalForWorkClass
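CategoricalForWorkClass is the repo's module; a hedged sketch of the same StringIndexer-then-OneHotEncoder step on made-up workclass values (note: in Spark 2.x OneHotEncoder is a plain Transformer, as the slide says; later Spark versions replace it with an Estimator that needs a fit() step):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

val spark = SparkSession.builder()
  .master("local[2]").appName("OneHotExample").getOrCreate()
import spark.implicits._

val df = Seq("Private", "Self-emp", "Private", "Federal-gov").toDF("workclass")

// First map strings to indices with the StringIndexer Estimator
val indexed = new StringIndexer()
  .setInputCol("workclass").setOutputCol("workclassIndex")
  .fit(df).transform(df)

// OneHotEncoder (a Transformer in Spark 2.x) maps each index to a
// sparse binary vector with at most a single one-value
val encoded = new OneHotEncoder()
  .setInputCol("workclassIndex").setOutputCol("workclassVec")
  .transform(indexed)

encoded.show(truncate = false)
```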
VectorAssembler
● A feature transformer that merges multiple columns into a vector column
● The LogisticRegression model expects a vector column named "features" by default, or you can set it
● SalaryVectorAssembler
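SalaryVectorAssembler is the repo's module; a small sketch of the idea with two assumed numeric columns from the census data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder()
  .master("local[2]").appName("AssemblerExample").getOrCreate()
import spark.implicits._

val df = Seq((39, 40.0), (50, 13.0)).toDF("age", "hours_per_week")

// Merge the numeric (and one-hot encoded) columns into the single
// "features" vector column that LogisticRegression expects by default
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week"))
  .setOutputCol("features")

val assembled = assembler.transform(df)
assembled.show(truncate = false)
```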
Building a Pipeline
Pipeline
● A chain of Transformers and Estimators
● A Pipeline is itself an Estimator
● It is fitted on a DataFrame, turning it into a model
● Once you have defined a pipeline, you can have the training and test datasets go through the same stages of processing
○ buildOneHotPipeLine
○ buildPipeLineForFeaturePreparation
○ buildDataPrepPipeLine
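The build*PipeLine helpers above are repo modules; the chaining itself can be sketched with assumed toy stages like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val spark = SparkSession.builder()
  .master("local[2]").appName("PipelineExample").getOrCreate()
import spark.implicits._

val train = Seq((39, "<=50K"), (50, ">50K"), (28, "<=50K")).toDF("age", "income")

// Stages to chain; the Pipeline itself is an Estimator
val labelIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")
val assembler = new VectorAssembler().setInputCols(Array("age")).setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler))

// fit() runs each stage in order and returns a PipelineModel (a Transformer),
// so the test set can go through exactly the same processing as training
val model = pipeline.fit(train)
val prepared = model.transform(train)
prepared.show()
```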
Model training
Prediction with test data
Logistic Regression
● We want to train the LogisticRegression model with the training data
● We create a pipeline which takes the training data, trains the model with it, and provides that model for us to use for prediction on the test data
● Example code: LRTraining module
Model Training with Training Data
● The model expects the feature vectors in a column named "features" and the labels (the dependent variable we are trying to predict) in a column named "label" by default
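The LRTraining module referenced above is in the repo; under the assumption of a tiny made-up training set, the end-to-end train-then-predict flow looks roughly like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val spark = SparkSession.builder()
  .master("local[2]").appName("LRTrainingSketch").getOrCreate()
import spark.implicits._

val train = Seq(
  (25, 40.0, "<=50K"), (45, 60.0, ">50K"),
  (23, 20.0, "<=50K"), (52, 55.0, ">50K")
).toDF("age", "hours_per_week", "income")

val labelIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week")).setOutputCol("features")

// LogisticRegression picks up the "features" and "label" columns by default
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val model = new Pipeline().setStages(Array(labelIndexer, assembler, lr)).fit(train)

// The fitted PipelineModel runs the test rows through the same stages
val test = Seq((48, 50.0, ">50K")).toDF("age", "hours_per_week", "income")
val scored = model.transform(test)
scored.select("prediction").show()
```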
Model evaluation
Evaluator
● The area under the ROC curve is used to measure the accuracy of our model
● We use evaluators such as BinaryClassificationEvaluator in this process
● An Evaluator takes the data and a metric-related parameter, and evaluates the metric asked for, say area under the ROC curve or the PR curve
● It has an evaluate() method
Evaluator (continued)
● Takes data and parameters and provides metrics
○ SalaryEvaluator

(DF, Param) → Evaluator → Metric
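SalaryEvaluator is the repo's module; a minimal sketch of the evaluate() call on assumed pre-vectorized toy data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .master("local[2]").appName("EvaluatorSketch").getOrCreate()
import spark.implicits._

val data = Seq(
  (0.0, Vectors.dense(25.0, 40.0)), (1.0, Vectors.dense(45.0, 60.0)),
  (0.0, Vectors.dense(23.0, 20.0)), (1.0, Vectors.dense(52.0, 55.0))
).toDF("label", "features")

val predictions = new LogisticRegression().fit(data).transform(data)

// The Evaluator takes a DataFrame of predictions plus a metric
// parameter, and evaluate() returns a single Double
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")   // or "areaUnderPR"

val auc = evaluator.evaluate(predictions)
println(s"AUC = $auc")
```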
Model selection
Cross-Validator
● We want to tune the performance of our models
● It takes the following inputs:
○ Estimator: the pipeline we have built
○ Parameter grid: regularization and number of iterations for LR
○ Evaluator: binary classification evaluator
● Find the best parameters
○ SalaryCrossValidator
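SalaryCrossValidator is the repo's module; a sketch of wiring the three inputs together, using a bare LogisticRegression as the estimator for brevity (the talk's code passes the full pipeline) and an assumed toy dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .master("local[2]").appName("CrossValidatorSketch").getOrCreate()
import spark.implicits._

val data = Seq(
  (0.0, Vectors.dense(25.0, 40.0)), (1.0, Vectors.dense(45.0, 60.0)),
  (0.0, Vectors.dense(23.0, 20.0)), (1.0, Vectors.dense(52.0, 55.0)),
  (0.0, Vectors.dense(30.0, 35.0)), (1.0, Vectors.dense(48.0, 50.0)),
  (0.0, Vectors.dense(27.0, 30.0)), (1.0, Vectors.dense(55.0, 65.0)),
  (0.0, Vectors.dense(22.0, 25.0)), (1.0, Vectors.dense(50.0, 45.0)),
  (0.0, Vectors.dense(29.0, 38.0)), (1.0, Vectors.dense(47.0, 58.0))
).toDF("label", "features")

val lr = new LogisticRegression()

// Grid over regularization and number of iterations, as on the slide
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)                 // the repo passes the built pipeline here
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)

// fit() evaluates every combination on every fold and keeps the best model
val cvModel = cv.fit(data)
```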
Recap: ML Pipelines

Load Data → StringIndexer → OneHotEncoder → VectorAssembler → LogisticRegression → Pipeline → Evaluate
(StringIndexer and LogisticRegression are Estimators; OneHotEncoder and VectorAssembler are Transformers)