GRACE HOPPER CELEBRATION INDIA 2016 | #GHCI16 | PRESENTED BY THE ANITA BORG INSTITUTE AND THE ASSOCIATION FOR COMPUTING MACHINERY INDIA
Introduction to Predictive Analytics: Hands-On Workshop Using R & Python
Presenters:
Python: Lavanya Sita Tekumalla, Sharmistha Jat
R: Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu
Agenda
• Basics of Predictive Modeling Techniques (30m)
• Hands-on Workshop: Regression
  (1) Build Model: R (30m)
  (2) Build Model: Python (30m)
What is Predictive Analytics?
Learn from available data and make meaningful predictions.
Why Predictive Analytics?
Too much data, too many scenarios... it is hard for humans to explicitly describe predictive rules for all scenarios.
Exercise: let's predict something…
Predict how long it takes to reach home
Common Analytics Tasks...
Supervised Learning (Regression): predict a continuous target
Can I predict the time taken to get home from past history?
Can I predict the Sensex value from past market history?
Common Analytics Tasks...
Supervised Learning (Classification): predict the class/type of an object
Classify images of cats vs. dogs from examples
Identify handwritten digits by studying examples
Common Analytics Tasks...
Unsupervised Learning (Clustering): identify groups inherent in data
Given a set of news articles, what are the underlying topics or themes?
Predict Movie Success ??
Predict Movie Success: Features
• Features:
  – Actors
  – Director
  – Gross budget
  – Social media feedback
  – Genre and keywords
  – Release date
Example: Predict Movie Sales?
Known Data: advertising dollars and corresponding sales for many prior movies
Prediction Task: for a new movie, given its advertising budget, can you forecast sales?
Regression:
Sales = f (Advertising budget)
How do we learn f?
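One way to learn f is ordinary least-squares regression. A minimal sketch with scikit-learn; the advertising/sales numbers below are invented purely for illustration:

```python
# Fit a linear f: sales = f(advertising budget) from toy examples.
import numpy as np
from sklearn.linear_model import LinearRegression

budget = np.array([[1.0], [2.0], [3.0], [4.0]])  # advertising spend (toy units)
sales = np.array([2.1, 4.0, 6.2, 7.9])           # observed sales (toy units)

f = LinearRegression().fit(budget, sales)        # learn f from examples
predicted = f.predict([[5.0]])                   # forecast sales for a new movie
```

The learned f is simply the line that minimizes squared error over the known examples.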
Example: Movie Hit/Flop from Budget and Trailer Facebook Likes?
Known Data: budgets and Facebook statistics of various hit and flop movies
Prediction Task: for a new movie with a known budget and number of Facebook likes on its trailer, what is the probability of a hit?
Classification:
Can I learn the separating line between hit and flop movies?
(Figure: scatter plot of Facebook likes vs. budget for hit and flop movies)
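Such a separating line can be learned with a linear classifier. A hedged sketch using logistic regression, with invented (budget, trailer-likes) pairs:

```python
# Learn a linear boundary between "hit" and "flop" movies from toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 10], [2, 15], [8, 90], [9, 120]])  # [budget, trailer likes]
y = np.array([0, 0, 1, 1])                           # 0 = flop, 1 = hit

clf = LogisticRegression().fit(X, y)
p_hit = clf.predict_proba([[7, 80]])[0, 1]           # probability of a hit
```

Logistic regression gives exactly what the slide asks for: a probability of a hit, not just a hard label.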
The Predictive Analytics Framework
Data/Examples → Feature Extraction → Learning Algorithm → Model
New Data Instance + Model → Prediction
Evaluation: How well is my algorithm working?
Model Selection: What learning algorithm should I use?
Important Aspects of Analytics Framework:
•Feature Engineering: Finding the discerning characteristics
•Data Collection: Collecting the right data / combining multiple sources
•Cleanup: Huge effort - noise/missing data/format conversion...
"If you torture the data long enough, it will confess to anything." -- Ronald Coase
“The goal is to turn data into information and information into insight." -- Carly Fiorina
Regression Analysis: What?
● "Regression analysis is a way of finding and representing the relationship between two or more variables."
● A simple yet effective tool for prediction and estimation
Why?
● To predict an event/outcome using the attributes or features influencing it.
Examples
• Why don't UPS truck drivers take left turns?
• Predict a movie's rating
Regression Analysis: How?
The key is to arrive at an equation that captures the relationship between the outcome and its influencing features.
It answers the questions:
• Which variables matter most, and which least?
  — Independent/Predictors/Features
  — Dependent/Outcome
• How do those variables interact with each other?

Y = β0 + β1x1 + β2x2 + ... + ε
e.g. Movie Rating as a function of Budget and Duration
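Once the coefficients are known, the equation is evaluated directly. A tiny worked example with invented coefficients (a real model would estimate the β values from data):

```python
# Evaluate Y = b0 + b1*x1 + b2*x2 with made-up coefficients.
b0, b1, b2 = 5.0, 0.01, 0.02  # intercept, budget weight, duration weight

def predict_rating(budget_millions, duration_minutes):
    return b0 + b1 * budget_millions + b2 * duration_minutes

rating = predict_rating(100, 120)  # 5.0 + 1.0 + 2.4 = 8.4
```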
Data Exploration
Identify the nature of the data and the patterns in the underlying set.
Descriptive analysis: describes or summarizes the raw data, making it more human-interpretable; it condenses data into nuggets of information (mean, median).
- Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
- Nature of the data distribution (spread around the mean, skewness, outliers)
Data variables: Continuous (quantitative) vs. Categorical (qualitative)
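The descriptive statistics above take only a line or two each. A sketch with pandas on made-up duration values:

```python
# Summarize a numeric column: central tendency and missingness.
import pandas as pd

df = pd.DataFrame({"duration": [90, 100, 110, None, 150]})

mean = df["duration"].mean()       # skips missing values
median = df["duration"].median()
missing = df["duration"].isna().sum()  # how many values need imputing/omitting
```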
Visualize Data Distribution
Visualizing Relationships Between Variables
- How are two features/variables related to one another?
• -1.00 → an increase in one variable accompanies a decrease in the other
• +1.00 → an increase in one variable accompanies an increase in the other
• 0 → no correlation at all
- Is there redundancy?
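Pairwise correlations of this kind can be computed with pandas. A sketch on invented columns, where budget_usd and budget_inr are deliberately redundant:

```python
# Pearson correlations between all pairs of columns; redundant features
# show up as correlations near +/-1.
import pandas as pd

df = pd.DataFrame({
    "budget_usd": [1.0, 2.0, 3.0, 4.0],
    "budget_inr": [83.0, 166.0, 249.0, 332.0],  # same feature, rescaled
    "rating":     [4.0, 3.5, 3.0, 2.5],
})
corr = df.corr()
```

Here corr.loc["budget_usd", "budget_inr"] is +1.0, flagging the redundancy, while budget vs. rating is perfectly inverse in this toy data.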
Data Cleansing
What is cleansing?
"Conversion of raw data → technically correct data → consistent data"
Why is cleansing important?
Incorrect or inconsistent data can lead to false conclusions.
• Removal of outliers that can skew your results
• Handling of missing data
• Removal of duplicates
• Transformation of data
R packages for data cleansing: mice, Amelia, missForest, Hmisc, mi
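The slide lists R packages; the same cleansing steps can be sketched in Python with pandas (an equivalent-in-spirit illustration on invented data):

```python
# Duplicates, missing values, and a crude outlier filter with pandas.
import pandas as pd

df = pd.DataFrame({"budget": [10.0, None, 10.0, 5000.0],
                   "rating": [7.0, 6.0, 7.0, 8.0]})

df = df.drop_duplicates()             # remove duplicate rows
df = df.dropna(subset=["budget"])     # drop rows with a missing budget
df = df[df["budget"] < 1000]          # crude outlier removal (toy threshold)
```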
Data Cleansing
(Figure: plotting missing data using the mice package in R)
Feature Selection
Identify the important variables for building predictive models, free from correlated variables, bias, and unwanted noise.
e.g. the Boruta package in R identifies important variables using random forests.
Building the Model
R - Workshop
R Setup
• Copy the install binaries and packages to your laptop
• Install R & RStudio
• Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
• Copy the model code, RDS file, and the dataset
• Set the working directory using setwd("<dir containing the script, dataset, and RDS file>")
Explore Data using R
Validate the Model
• Run the model against the "test" dataset that was set aside before training
• Compare predicted values against the actual observed values
• (Cross-)validation assesses the fit of the model
• The model should neither under-fit nor over-fit future unseen data
• Validate regression using:
  — R² (higher is better)
  — Residuals (ideally randomly distributed, to avoid heteroscedasticity)
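Both checks are easy to compute directly. A sketch with scikit-learn on invented predicted/actual values (in the workshop these come from the trained model):

```python
# Validate a regression fit: R^2 score plus raw residuals.
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

r2 = r2_score(y_actual, y_pred)  # closer to 1.0 is better
residuals = y_actual - y_pred    # should look randomly scattered around zero
```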
Python - Workshop
Basic Pipeline
1) Data loading and inspection
2) Cleaning and preprocessing
3) Train/test partitioning
4) Feature selection
5) Regression
6) Model selection, parameter tuning, regularisation
Data Loading
# load the IMDB data into a Python list
import csv

imdb_data_csv = csv.reader(open('movie_metadata.csv'))
imdb_data = []
for item in imdb_data_csv:
    imdb_data.append(item)
Columns in Data
'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'
Preprocessing of Data
Steps:
1) Convert text fields to numbers
2) Convert strings (numbers in a CSV are read as strings) to float or int
3) Remove NaNs
4) Remove uninteresting columns from the data
5) Feature selection

import numpy as np
data_float = preprocessing(imdb_data)
data_np = np.array(data_float)
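The preprocessing helper is called but not shown on the slides; here is one hypothetical minimal version covering steps 2 and 3 for numeric columns (the column indices and behavior are assumptions, not the workshop's actual code):

```python
# Hypothetical sketch of preprocessing(): convert chosen columns to float
# and drop rows whose values do not parse (blank / non-numeric fields).
def preprocessing(rows, numeric_cols=(3,)):
    cleaned = []
    for row in rows[1:]:                      # skip the CSV header row
        try:
            converted = list(row)
            for c in numeric_cols:
                converted[c] = float(row[c])  # string -> float
            cleaned.append(converted)
        except ValueError:                    # unparseable field: drop row
            continue
    return cleaned
```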
Train and Test data partitioning
from sklearn.model_selection import train_test_split
# remove the label column from the data
data_np_x = np.delete(data_np, [20], axis=1)

# data partitioning
x_train, x_test, y_train, y_test = train_test_split(
    data_np_x, data_np[:, 20], test_size=0.25, random_state=0)
Regression
# apply regression and voila!
from sklearn.linear_model import Ridge
regr_0 = Ridge(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# model evaluation
from sklearn.metrics import mean_absolute_error
print('absolute error:', mean_absolute_error(y_test, y_pred))

from sklearn.metrics import mean_squared_error
print('squared error:', mean_squared_error(y_test, y_pred))
Feature Selection
Select important columns that correlate well with the output:
1) Model learning and inference are faster
2) Accuracy can improve
3) Feature selection using PCA
from sklearn.decomposition import TruncatedSVD
from copy import deepcopy

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
# data_np_onehot: the data matrix after one-hot encoding of categorical columns
data_svd = deepcopy(data_np_onehot)
data_svd = svd.fit_transform(data_svd)
Model Selection
How do we select the parameters of a model?

Types of Regression
Popular regression models:
1) Linear regression
2) Ridge regression: L2 smoothing
3) Kernel regression: higher-order/non-linear
4) Lasso regression: L1 smoothing
5) Decision tree regression (CART)
6) Random forest regression
Ridge Regression: Regularization
Why regularization?
-- Less training data: avoid overfitting
-- Noisy data: smoothing / robustness to outliers
Ridge Regression: Regularization
# apply Ridge regression
from sklearn.linear_model import Ridge
regr_ridge = Ridge(alpha=10)
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))

# alpha determines how much smoothing/regularization of the weights we want
How to select the parameter alpha?
K-fold Cross Validation
How to select the parameter alpha?
K-fold Cross Validation
verbose_level = 10
from sklearn.model_selection import GridSearchCV
regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
                          param_grid={"alpha": [10, 1, 0.1]})
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
print(regr_ridge.best_params_)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))
Lasso regression: Feature Sparsity
Another form of Regularization with L1 Norm:
# Lasso regression
from sklearn.linear_model import Lasso
regr_0 = Lasso(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# alpha determines how much sparsity-inducing regularization of the weights we want
Lasso regression: Feature Sparsity
(Figure: plots of the learned coefficients, Ridge regression vs. Lasso regression)
Lasso Regression: Regularization

verbose_level = 1
from sklearn.linear_model import Lasso
regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
                       param_grid={"alpha": [0.01, 0.1, 1, 10]})
regr_ls.fit(x_train, y_train)
y_pred = regr_ls.predict(x_test)
print(regr_ls.best_params_)

# model evaluation
print('Lasso absolute error:', mean_absolute_error(y_test, y_pred))
print('Lasso squared error:', mean_squared_error(y_test, y_pred))
Decision Tree Regression
Decision Tree Regression: Visualization with Depth
(Figures: fitted regression trees at depths 1, 2, and 5)
Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5, 6]})
# alternative without grid search: regr_dt = DecisionTreeRegressor(max_depth=2)
regr_dt.fit(x_train, y_train)
y_pred = regr_dt.predict(x_test)
print(regr_dt.best_params_)

# model evaluation
print('decision tree absolute error:', mean_absolute_error(y_test, y_pred))
print('decision tree squared error:', mean_squared_error(y_test, y_pred))
Random Forest for Regression
--> Learn multiple decision trees, each on a random subset of the data
--> Predict the average of the predictions from the individual trees
Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5]})
regr_rf.fit(x_train, y_train)
y_pred = regr_rf.predict(x_test)
print(regr_rf.best_params_)

# model evaluation
print('Random Forest absolute error:', mean_absolute_error(y_test, y_pred))
print('Random Forest squared error:', mean_squared_error(y_test, y_pred))
Other Forms Of Regression
# Support Vector Regression
from sklearn.svm import SVR
kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
                          param_grid={"C": [10, 1, 0.1, 1e-2], "epsilon": [0.05, 0.1, 0.2]})

# Gaussian Process Regression
from sklearn.gaussian_process import GaussianProcessRegressor
kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5, verbose=10,
                          param_grid={"alpha": [10, 1, 0.1, 1e-2]})
Recap of Python Session
Preprocessing
 --> Feature selection
 --> Handling missing data
 --> Handling categorical data

Model Evaluation: creating training and testing splits

Model Selection
 --> Finding parameters: cross-validation
 --> Various regression models:
  a. Simple model: linear regression
  b. Regularization (L2 norm): ridge regression
  c. Sparse regularization: lasso regression
  d. Interpretable: decision trees
  e. Random forests: ensembles of decision trees
Thank you