TRANSCRIPT
Machine Learning Pipeline with Spark ML
End-to-End Machine Learning
https://github.com/RamkSwamy/sparkmlpipeline
● Ram Kuppuswamy
● Worked at Microsoft for 13 years
● Co-Founder, Zinnia Systems Pvt. Ltd.
● Big data consultant and trainer at datamantra.io
Agenda
● Machine Learning Pipeline
● Spark ML API
● Components of ML API
● Building a pipeline
● Persisting a model
● Evaluating a pipeline model
● Cross-validating a pipeline model
Machine Learning
● Most developers think machine learning is mostly about the learning algorithm
● Most big data libraries, like MLlib and Mahout, are focused on implementing algorithms in a distributed manner
● But when you try to productionize an end-to-end solution, you quickly realize that machine learning is not just about the learning algorithm
● There are many other important steps in building an end-to-end machine learning application
Stages of a Machine Learning Application
● Data Exploration
○ Read data
○ Handle missing data
○ Look at correlations
○ Statistics of independent variables
● Data Preparation (Preprocess data)
○ Indexing the labels
○ Handling categorical variables
○ Turning text data into numeric values (word2vec)
Stages of an ML Application (continued)
● Model training
● Model evaluation
● Model tuning
● Repeat this process many times
Spark MLlib
● Focused only on model learning
● No standard way to do the other steps of an ML pipeline
● No way to combine all these steps and execute them together
● Based on the RDD API
● Though some of these steps were added later, they were not uniform across the algorithms
Spark ML
● Provides a higher-level API for constructing and tuning ML workflows
● Built on top of DataFrames
● We are using Spark 2.0, in which ML is the library for machine learning going forward and MLlib will be deprecated
Case Study
● We use the following dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Census+Income
● The training and test data are given in two separate files:
○ Adult.data
○ Adult.test
● Objective: predict whether the income of an individual is >50K or <=50K by constructing a pipeline.
Abstractions of Spark ML
● Transformer
● Estimator
● Evaluator
● Pipeline
● Params
Data
Data Exploration
● Read the data and create a DataFrame
○ Util: loadSalaryCsvTrain
○ Util: loadSalaryCsvTest
● Look at the schema
○ SalaryDataSchema
● Look at statistics of the variables
○ SalaryDataExplore
Data Preparation
● Clean the data
○ cleanDataFrame Util
● Label indexing
● Categorical handling
○ String indexing
○ One-hot encoding
Estimator
● An Estimator is an abstraction for an algorithm which is fitted on a DataFrame, returning a Model
● It implements a method fit()

DF → Estimator → Model
Label Indexing
● We want to create the label, which is the dependent variable
● We have two different values: a) >50K b) <=50K
● One will take the value '0' and the other '1'
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label indices
● The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0
String Indexer
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label indices
● SalaryLabelIndexing
Categorical Handling
● We have many categorical fields, such as occupation, sex, workclass, relationship and marital_status
● They are all String types; we use the StringIndexer to generate the indices and then use OneHotEncoder, which maps a column of label indices to a column of binary vectors with at most a single one-value
● This encoding allows algorithms which expect continuous features, such as logistic regression, to use categorical features
StringIndexer
● It is an Estimator; fitting it on the data produces a StringIndexerModel
Example: StringIndexer
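The slide's example is the repo's SalaryLabelIndexing module; a minimal stand-alone sketch of the same idea, with toy data and column names that are assumptions rather than the repo's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

val spark = SparkSession.builder()
  .master("local[2]").appName("StringIndexerExample").getOrCreate()
import spark.implicits._

// Toy stand-in for the census income column
val df = Seq("<=50K", ">50K", "<=50K", "<=50K").toDF("income")

// StringIndexer is an Estimator: fit() returns a StringIndexerModel
val indexer = new StringIndexer()
  .setInputCol("income")
  .setOutputCol("label")

val indexed = indexer.fit(df).transform(df)
indexed.show()
// "<=50K" is the most frequent label, so it gets index 0.0
```

Because indices are assigned by frequency, the majority class <=50K becomes label 0.0 and >50K becomes 1.0.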
Transformer
● A Transformer is an abstraction for an algorithm which transforms one DataFrame into another
● It implements a method transform()

DF → Transformer → DF
OneHotEncoder
● It is a Transformer which takes the indexed data and converts it into a vector
● CategoricalForWorkClass
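CategoricalForWorkClass is the repo's module; a hedged sketch of the same StringIndexer-then-OneHotEncoder step on made-up workclass values (note: in Spark 2.x OneHotEncoder is a plain Transformer, as the slide says; later Spark versions replace it with an Estimator that needs a fit() step):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

val spark = SparkSession.builder()
  .master("local[2]").appName("OneHotExample").getOrCreate()
import spark.implicits._

val df = Seq("Private", "Self-emp", "Private", "Federal-gov").toDF("workclass")

// First map strings to indices with the StringIndexer Estimator
val indexed = new StringIndexer()
  .setInputCol("workclass").setOutputCol("workclassIndex")
  .fit(df).transform(df)

// OneHotEncoder (a Transformer in Spark 2.x) maps each index to a
// sparse binary vector with at most a single one-value
val encoded = new OneHotEncoder()
  .setInputCol("workclassIndex").setOutputCol("workclassVec")
  .transform(indexed)

encoded.show(truncate = false)
```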
VectorAssembler
● A feature transformer that merges multiple columns into a vector column
● The LogisticRegression model expects a vector column named "features" by default, or you can set it
● SalaryVectorAssembler
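SalaryVectorAssembler is the repo's module; a small sketch of the idea with two assumed numeric columns from the census data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

val spark = SparkSession.builder()
  .master("local[2]").appName("AssemblerExample").getOrCreate()
import spark.implicits._

val df = Seq((39, 40.0), (50, 13.0)).toDF("age", "hours_per_week")

// Merge the numeric (and one-hot encoded) columns into the single
// "features" vector column that LogisticRegression expects by default
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week"))
  .setOutputCol("features")

val assembled = assembler.transform(df)
assembled.show(truncate = false)
```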
Building a Pipeline
Pipeline
● A chain of Transformers and Estimators
● A Pipeline is itself an Estimator
● It is fitted on a DataFrame, turning it into a model
● Once you have defined a pipeline, you can have the training and test datasets go through the same stages of processing
○ buildOneHotPipeLine
○ buildPipeLineForFeaturePreparation
○ buildDataPrepPipeLine
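The build*PipeLine helpers above are repo modules; the chaining itself can be sketched with assumed toy stages like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val spark = SparkSession.builder()
  .master("local[2]").appName("PipelineExample").getOrCreate()
import spark.implicits._

val train = Seq((39, "<=50K"), (50, ">50K"), (28, "<=50K")).toDF("age", "income")

// Stages to chain; the Pipeline itself is an Estimator
val labelIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")
val assembler = new VectorAssembler().setInputCols(Array("age")).setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler))

// fit() runs each stage in order and returns a PipelineModel (a Transformer),
// so the test set can go through exactly the same processing as training
val model = pipeline.fit(train)
val prepared = model.transform(train)
prepared.show()
```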
Model training
Prediction with test data
Logistic Regression
● We want to train the LogisticRegression model with the training data
● We create a pipeline which takes the training data, trains the model with it, and provides that model for us to use for prediction on the test data
● Example code: LRTraining module
Model Training with Training Data
● The model expects the feature vectors in a column named "features" and the labels (the dependent variable we are trying to predict) in a column named "label" by default
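The LRTraining module referenced above is in the repo; under the assumption of a tiny made-up training set, the end-to-end train-then-predict flow looks roughly like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val spark = SparkSession.builder()
  .master("local[2]").appName("LRTrainingSketch").getOrCreate()
import spark.implicits._

val train = Seq(
  (25, 40.0, "<=50K"), (45, 60.0, ">50K"),
  (23, 20.0, "<=50K"), (52, 55.0, ">50K")
).toDF("age", "hours_per_week", "income")

val labelIndexer = new StringIndexer().setInputCol("income").setOutputCol("label")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week")).setOutputCol("features")

// LogisticRegression picks up the "features" and "label" columns by default
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val model = new Pipeline().setStages(Array(labelIndexer, assembler, lr)).fit(train)

// The fitted PipelineModel runs the test rows through the same stages
val test = Seq((48, 50.0, ">50K")).toDF("age", "hours_per_week", "income")
val scored = model.transform(test)
scored.select("prediction").show()
```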
Model evaluation
Evaluator
● The area under the ROC curve is used to measure the accuracy of our model
● We use evaluators such as BinaryClassificationEvaluator in this process
● An Evaluator takes the data and a metric-related parameter, and evaluates the metric asked for, say area under the ROC curve or the PR curve
● It has an evaluate() method
Evaluator (continued)
● Takes data and parameters and provides metrics
○ SalaryEvaluator

(DF, Param) → Evaluator → Metric
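SalaryEvaluator is the repo's module; a minimal sketch of the evaluate() call on assumed pre-vectorized toy data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .master("local[2]").appName("EvaluatorSketch").getOrCreate()
import spark.implicits._

val data = Seq(
  (0.0, Vectors.dense(25.0, 40.0)), (1.0, Vectors.dense(45.0, 60.0)),
  (0.0, Vectors.dense(23.0, 20.0)), (1.0, Vectors.dense(52.0, 55.0))
).toDF("label", "features")

val predictions = new LogisticRegression().fit(data).transform(data)

// The Evaluator takes a DataFrame of predictions plus a metric
// parameter, and evaluate() returns a single Double
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")   // or "areaUnderPR"

val auc = evaluator.evaluate(predictions)
println(s"AUC = $auc")
```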
Model selection
Cross-Validator
● We want to tune the performance of our models
● It takes the following inputs:
○ Estimator: the pipeline we have built
○ Parameter grid: regularization and number of iterations for LR
○ Evaluator: binary classification evaluator
● Find the best parameters
○ SalaryCrossValidator
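SalaryCrossValidator is the repo's module; a sketch of wiring the three inputs together, using a bare LogisticRegression as the estimator for brevity (the talk's code passes the full pipeline) and an assumed toy dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .master("local[2]").appName("CrossValidatorSketch").getOrCreate()
import spark.implicits._

val data = Seq(
  (0.0, Vectors.dense(25.0, 40.0)), (1.0, Vectors.dense(45.0, 60.0)),
  (0.0, Vectors.dense(23.0, 20.0)), (1.0, Vectors.dense(52.0, 55.0)),
  (0.0, Vectors.dense(30.0, 35.0)), (1.0, Vectors.dense(48.0, 50.0)),
  (0.0, Vectors.dense(27.0, 30.0)), (1.0, Vectors.dense(55.0, 65.0)),
  (0.0, Vectors.dense(22.0, 25.0)), (1.0, Vectors.dense(50.0, 45.0)),
  (0.0, Vectors.dense(29.0, 38.0)), (1.0, Vectors.dense(47.0, 58.0))
).toDF("label", "features")

val lr = new LogisticRegression()

// Grid over regularization and number of iterations, as on the slide
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)                 // the repo passes the built pipeline here
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)

// fit() evaluates every combination on every fold and keeps the best model
val cvModel = cv.fit(data)
```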
Recap: ML Pipelines

Load Data → StringIndexer → OneHotEncoder → VectorAssembler → LogisticRegression → Pipeline → Evaluate
(StringIndexer and LogisticRegression are Estimators; OneHotEncoder and VectorAssembler are Transformers)