Predictive Analytics Workshop

Grace Hopper Celebration India 2016 | #GHCI16 | Presented by the Anita Borg Institute and the Association for Computing Machinery India

Introduction to Predictive Analytics: Hands-On Workshop Using R & Python

Presenters: Python: Lavanya Sita Tekumalla, Sharmistha Jat; R: Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu


TRANSCRIPT

Page 1: Predictive Analytics Workshop

#GHCI16

Introduction to Predictive Analytics: Hands-On Workshop Using R & Python

Presenters:

Python: Lavanya Sita Tekumalla, Sharmistha Jat

R: Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu

Page 2: Predictive Analytics Workshop

Agenda

• Basics of Predictive Modeling Techniques (30m)
• Hands-on Workshop: Regression
 (1) Build Model: R (30m)
 (2) Build Model: Python (30m)

Page 3: Predictive Analytics Workshop

What is Predictive Analytics?

Learn from available data and make meaningful predictions.

Why Predictive Analytics?

Too much data, too many scenarios... It is hard for humans to explicitly describe predictive rules for every scenario.

Exercise: let's predict something…

Predict how long it takes to reach home

Page 4: Predictive Analytics Workshop

Common Analytics Tasks...

Supervised Learning - Regression: Predict a continuous target

Can I predict the time taken to get home from past history?
Can I predict the Sensex value from past market history?

Page 5: Predictive Analytics Workshop

Common Analytics Tasks...

Supervised Learning - Classification: Predict the class/type of an object

Can I classify images of cats vs. dogs from examples?
Can I identify handwritten digits by studying examples?

Page 6: Predictive Analytics Workshop

Common Analytics Tasks...

Unsupervised Learning - Clustering: Identify groups inherent in data

Given a set of news articles, what are the underlying topics or themes?

Page 7: Predictive Analytics Workshop

Predict Movie Success?

Page 8: Predictive Analytics Workshop

Predict Movie Success: Features

• Features:

 - Actors
 - Director
 - Gross budget
 - Social media feedback
 - Genre and keywords
 - Release date

Page 9: Predictive Analytics Workshop

Example: Predict Movie Sales?

Known Data: Available advertising dollars and corresponding sales for lots of prior movies

Prediction Task: For a new movie, given its advertising budget, can you forecast sales?

Regression:

Sales = f (Advertising budget)

How do we learn f?
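As a concrete sketch of what "learning f" means, here is a linear fit with scikit-learn; the budget and sales numbers are invented for illustration and are not from the workshop dataset.

```python
# A minimal sketch of learning f from (budget, sales) pairs.
# All numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

budget = np.array([[1.0], [2.0], [3.0], [4.0]])  # advertising spend (millions)
sales = np.array([2.1, 3.9, 6.2, 7.8])           # observed sales (millions)

f = LinearRegression().fit(budget, sales)        # learn f from the examples
forecast = f.predict([[5.0]])                    # forecast sales for a new budget
print(forecast)
```

The learned f here is just a straight line; the rest of the workshop explores richer families of f.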

Page 10: Predictive Analytics Workshop

Example: Movie Hit/Flop from Budget and Trailer Facebook Likes?

Known Data: Budgets and Facebook statistics of various hit and flop movies...

Prediction Task: For a new movie, given its budget and the Facebook likes on its trailer, what is the probability of a hit?

Classification:

Can I learn the separating line between hit and flop movies? (Plot: trailer Facebook likes vs. budget, with hits and flops as points.)
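A toy sketch of learning such a separating line with logistic regression; all the data below are invented for illustration and this is not the workshop's code.

```python
# Toy hit/flop classifier from budget and trailer likes; data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [budget (millions), trailer Facebook likes (thousands)]
X = np.array([[10, 5], [15, 8], [80, 90], [120, 150], [20, 60], [90, 12]])
y = np.array([0, 0, 1, 1, 1, 0])  # 1 = hit, 0 = flop

clf = LogisticRegression().fit(X, y)
p_hit = clf.predict_proba([[100, 120]])[0, 1]  # probability of a hit
print(p_hit)
```

The classifier's decision boundary is exactly the "separating line" the slide asks about.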

Page 11: Predictive Analytics Workshop

The Predictive Analytics Framework

Data/Examples → Feature Extraction → Learning Algorithm → Model

New Data Instance + Model → Prediction

Evaluation: How well is my algorithm working?
Model Selection: Which learning algorithm should I use?

Page 12: Predictive Analytics Workshop

Important Aspects of Analytics Framework:

• Feature Engineering: Finding the discerning characteristics

• Data Collection: Collecting the right data / combining multiple sources

• Cleanup: Huge effort - noise, missing data, format conversion...

"If you torture the data long enough, it will confess to anything." -- Ronald Coase

“The goal is to turn data into information and information into insight." -- Carly Fiorina

Page 13: Predictive Analytics Workshop

Regression Analysis: What?

• "Regression analysis is a way of finding and representing the relationship between two or more variables."

• A simple yet effective tool for prediction and estimates

Regression Analysis: Why?

• To predict an event/outcome using the attributes or features influencing it.

Examples:
• Why don't UPS truck drivers take left turns?
• Predict a movie's rating

Page 14: Predictive Analytics Workshop

Regression Analysis: How?

The key is to arrive at an equation that captures the relationship between the outcome and its influencing features.

It answers the questions:
• Which variables matter most, and which the least?
 - Independent variables / predictors / features
 - Dependent variable / outcome
• How do those variables interact with each other?

Y = β0 + β1x1 + β2x2 + ... + ε

(e.g., Movie Rating as a function of Budget and Duration)
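A small sketch of fitting this equation in Python: generate data from known coefficients and check that the fit recovers them. The data are synthetic, not the workshop dataset.

```python
# Fit Y = b0 + b1*x1 + b2*x2 + noise and read back the coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, 200)      # x1
duration = rng.uniform(80, 180, 200)   # x2
rating = 2.0 + 0.03 * budget + 0.01 * duration + rng.normal(0, 0.1, 200)

X = np.column_stack([budget, duration])
model = LinearRegression().fit(X, rating)
print(model.intercept_, model.coef_)   # close to the true 2.0, [0.03, 0.01]
```

The learned intercept and coefficients are the β0, β1, β2 of the slide's equation; ε is the noise term.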

Page 15: Predictive Analytics Workshop

Data Exploration

Identify the nature of the data and the patterns in the underlying set.

Descriptive analysis: Describes or summarizes the raw data, making it more human-interpretable. It condenses data into nuggets of information (mean, median).

- Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
- Nature of the data distribution (spread around the mean, skewness, outliers)

Data variables are either continuous (quantitative) or categorical (qualitative).
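In the Python half, the same descriptive pass is a few pandas calls; the tiny sample below is made up and merely stands in for the real dataset.

```python
# Descriptive analysis sketch with pandas; data are invented.
import pandas as pd

df = pd.DataFrame({'budget': [10.0, 12.0, 15.0, 200.0, None],
                   'duration': [95, 110, 100, 130, 105]})

print(df['budget'].describe())  # count, mean, std, min, quartiles, max
print(df['budget'].skew())      # strong positive skew from the 200 outlier
print(df.isnull().sum())        # one missing budget value to impute or omit
```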

Page 16: Predictive Analytics Workshop

Visualize Data Distribution

Page 17: Predictive Analytics Workshop

Visualizing Relationships Between Variables

- How are two features/variables related to one another? (correlation coefficient)
• -1.00 → an increase in one variable accompanies a decrease in the other
• +1.00 → an increase in one variable accompanies an increase in the other
• 0 → no linear correlation

- Is there redundancy?
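The redundancy question can be answered numerically with a correlation matrix; the near-duplicate pair below is synthetic, built for illustration.

```python
# Correlation sketch: one near-redundant pair, one unrelated column.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
budget = rng.uniform(1, 100, 100)
df = pd.DataFrame({
    'budget': budget,
    'gross': 1.5 * budget + rng.normal(0, 5, 100),  # nearly redundant with budget
    'duration': rng.uniform(80, 180, 100),          # unrelated to budget
})
print(df.corr())  # values near +1/-1 flag redundancy; near 0, little linear relation
```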

Page 18: Predictive Analytics Workshop

Data Cleansing

What is cleansing?
"Conversion of raw data → technically correct data → consistent data"

Why is cleansing important?
Incorrect or inconsistent data can lead to drawing false conclusions.

• Removal of outliers, which can skew your results
• Removal of missing data
• Removal of duplicates
• Transformation of data

R packages for data cleansing: mice, Amelia, missForest, Hmisc, mi
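The four cleansing steps can be sketched in pandas as well (the workshop itself does this in R); the data and the outlier threshold below are invented.

```python
# Duplicates, missing data, outliers, and a transformation, in pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [10.0, 10.0, None, 5000.0, 20.0],
                   'rating': [6.5, 6.5, 7.0, 1.0, 8.0]})

df = df.drop_duplicates()                # removal of duplicates
df = df.dropna(subset=['budget'])        # removal of missing data
df = df[df['budget'] < 1000]             # removal of an (assumed) outlier
df['log_budget'] = np.log(df['budget'])  # transformation of data
print(df)
```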

Page 19: Predictive Analytics Workshop

Data Cleansing

Plotting missing data using the mice package in R

Page 20: Predictive Analytics Workshop

Feature Selection

Identify the important variables for building predictive models, so that the model is free from correlated variables, bias, and unwanted noise.

e.g. the Boruta package in R → identifies important variables using Random Forest

Page 21: Predictive Analytics Workshop

Building the Model

Page 22: Predictive Analytics Workshop

R - Workshop

Page 23: Predictive Analytics Workshop

R Setup

• Copy the install binaries and packages to your laptop
• Install R & RStudio
• Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
• Copy the model code, RDS file, and the dataset
• Set the working directory using:

 setwd("<dir where you have the script, dataset, RDS file>")

Page 24: Predictive Analytics Workshop

Explore Data using R

Page 25: Predictive Analytics Workshop

Validate the Model

• Run the model against the "test" data set which was set aside for prediction after training
• Check the predicted vs. actual observed values
• (Cross-)validation is done to assess the fitness of the model
• The model should not under- (or) over-fit future unseen data
• Validate regression using:
 - R² (higher is better)
 - Residuals (ideally should have a random distribution, to avoid heteroscedasticity)
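The same R² and residual checks, sketched in Python with made-up test values (in R this information comes out of the fitted model's summary):

```python
# Validation sketch: R^2 and residuals on invented test values.
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([5.0, 6.0, 7.0, 8.0])  # actual observed values
y_pred = np.array([5.1, 5.8, 7.2, 7.9])  # model predictions

print(r2_score(y_test, y_pred))  # closer to 1 is better
residuals = y_test - y_pred
print(residuals)                 # should look randomly scattered around 0
```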

Page 26: Predictive Analytics Workshop

Python - Workshop

Page 27: Predictive Analytics Workshop

Basic Pipeline

1) Data loading and inspection
2) Cleaning and preprocessing
3) Train/test partitioning
4) Feature selection
5) Regression
6) Model selection, parameter tuning, regularisation

Page 28: Predictive Analytics Workshop

Data Loading

# loading imdb data into a python list format
import csv

imdb_data_csv = csv.reader(open('movie_metadata.csv'))
imdb_data = []
for item in imdb_data_csv:
    imdb_data.append(item)

Page 29: Predictive Analytics Workshop

Columns in Data: 'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'

Page 30: Predictive Analytics Workshop

Preprocessing of data

Steps:

1) Convert text fields to numbers
2) Convert strings (numbers in a CSV get read as strings) to float or int type
3) Remove NaNs
4) Remove uninteresting columns from the data
5) Feature selection

import numpy as np

data_float = preprocessing(imdb_data)  # workshop helper implementing the steps above
data_np = np.array(data_float)
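`preprocessing` is a workshop helper whose body isn't on the slide. A hypothetical sketch of what such a function might do follows; the column indices and the exact behavior are assumptions, not the presenters' code.

```python
# Hypothetical sketch of a preprocessing helper (not the workshop's code).
import math

def preprocessing(rows):
    """Coerce assumed numeric columns to float; drop rows with NaNs
    or unparseable values."""
    body = rows[1:]          # skip the header row
    numeric_cols = [2, 3]    # assumed positions of two numeric columns
    cleaned = []
    for row in body:
        try:
            vals = [float(row[i]) for i in numeric_cols]
        except (ValueError, IndexError):
            continue         # drop rows with text where a number is expected
        if any(math.isnan(v) for v in vals):
            continue         # drop NaNs
        cleaned.append(vals)
    return cleaned
```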

Page 31: Predictive Analytics Workshop

Train and Test data partitioning

from sklearn.model_selection import train_test_split

# remove label from data
data_np_x = np.delete(data_np, [20], axis=1)

# data partitioning
x_train, x_test, y_train, y_test = train_test_split(
    data_np_x, data_np[:, 20], test_size=0.25, random_state=0)

Page 32: Predictive Analytics Workshop

Regression

# apply regression and voila!
from sklearn.linear_model import Ridge

regr_0 = Ridge(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# model evaluation
from sklearn.metrics import mean_absolute_error
print('absolute error:', mean_absolute_error(y_test, y_pred))

from sklearn.metrics import mean_squared_error
print('squared error:', mean_squared_error(y_test, y_pred))

Page 33: Predictive Analytics Workshop

Feature Selection

Select important columns that correlate well with the output:
1) Model learning and inference become faster
2) Accuracy improvement
3) Feature selection using PCA (here via truncated SVD)

from sklearn.decomposition import TruncatedSVD
from copy import deepcopy

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)

data_svd = deepcopy(data_np_onehot)  # one-hot encoded data from preprocessing
data_svd = svd.fit_transform(data_svd)

Page 34: Predictive Analytics Workshop

Model Selection: how to select the parameters of a model

Types of Regression

Popular regression models:
1) Linear Regression
2) Ridge Regression: L2 smoothing
3) Kernel Regression: higher-order/non-linear
4) Lasso Regression: L1 smoothing
5) Decision Tree Regression (CART)
6) Random Forest Regression

Page 35: Predictive Analytics Workshop

Ridge Regression: Regularization

Why Regularization?

- Less training data: avoid overfitting

- Noisy data: smoothing / robustness to outliers

Page 36: Predictive Analytics Workshop

Ridge Regression: Regularization

# apply Ridge regression!
from sklearn.linear_model import Ridge

regr_ridge = Ridge(alpha=10)
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))

# alpha determines how much smoothing/regularization of the weights we want

Page 37: Predictive Analytics Workshop

How to select parameter alpha?

K-fold Cross-Validation:

Page 38: Predictive Analytics Workshop

How to select parameter alpha?

K-fold Cross-Validation:

verbose_level = 10
from sklearn.model_selection import GridSearchCV

regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
                          param_grid={"alpha": [10, 1, 0.1]})
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
print(regr_ridge.best_params_)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))

Page 39: Predictive Analytics Workshop

Lasso regression: Feature Sparsity

Another form of Regularization with L1 Norm:

# Lasso Regression
from sklearn.linear_model import Lasso

regr_0 = Lasso(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# alpha determines how much sparsity-inducing regularization of the weights we want

Page 40: Predictive Analytics Workshop

Lasso regression: Feature Sparsity

Plotting the coefficients: Ridge Regression vs. Lasso Regression
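The contrast that the coefficient plot makes can also be checked numerically: lasso drives most weights to exactly zero, while ridge only shrinks them. The data below are synthetic, built so that only two of ten features matter.

```python
# Ridge shrinks coefficients; Lasso drives many to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 features matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.sum(ridge.coef_ == 0))  # ridge: no exact zeros
print(np.sum(lasso.coef_ == 0))  # lasso: most of the 10 weights are exactly 0
```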

Page 41: Predictive Analytics Workshop

Lasso Regression with Regularization

verbose_level = 1
from sklearn.linear_model import Lasso

regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
                       param_grid={"alpha": [0.01, 0.1, 1, 10]})
regr_ls.fit(x_train, y_train)
y_pred = regr_ls.predict(x_test)
print(regr_ls.best_params_)

# model evaluation
print('Lasso absolute error:', mean_absolute_error(y_test, y_pred))
print('Lasso squared error:', mean_squared_error(y_test, y_pred))

Page 42: Predictive Analytics Workshop

Decision Tree Regression

Page 43: Predictive Analytics Workshop

Decision Tree Regression: Visualization with depth

(Plots of the fitted tree's predictions at depths 1, 2, and 5)
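What those plots show can be reproduced on a synthetic curve: a depth-d tree predicts at most 2**d distinct constant values, so the fit gets finer as depth grows. The data below are invented.

```python
# Depth controls granularity: a depth-d tree predicts at most 2**d values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

for depth in (1, 2, 5):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    n_levels = len(np.unique(tree.predict(X)))
    print(depth, n_levels)  # the step function gets finer as depth grows
```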

Page 44: Predictive Analytics Workshop

Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5, 6]})
# regr_dt = DecisionTreeRegressor(max_depth=2)
regr_dt.fit(x_train, y_train)
y_pred = regr_dt.predict(x_test)
print(regr_dt.best_params_)

# model evaluation
print('decision tree absolute error:', mean_absolute_error(y_test, y_pred))
print('decision tree squared error:', mean_squared_error(y_test, y_pred))

Page 45: Predictive Analytics Workshop

Random Forest for Regression

--> Learn multiple decision trees, each on a random subset of the data
--> Predict the value as the average of the predictions from the individual trees

Page 46: Predictive Analytics Workshop

Random Forest Regression

from sklearn.ensemble import RandomForestRegressor

regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5]})
# regr_rf = RandomForestRegressor(max_depth=2)
regr_rf.fit(x_train, y_train)
y_pred = regr_rf.predict(x_test)
print(regr_rf.best_params_)

# model evaluation
print('Random Forest absolute error:', mean_absolute_error(y_test, y_pred))
print('Random Forest squared error:', mean_squared_error(y_test, y_pred))

Page 47: Predictive Analytics Workshop

Other Forms Of Regression

# Support Vector Regression
from sklearn.svm import SVR

kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
                          param_grid={"C": [10, 1, 0.1, 1e-2],
                                      "epsilon": [0.05, 0.1, 0.2]})

# Gaussian Process Regression
from sklearn.gaussian_process import GaussianProcessRegressor

kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5, verbose=10,
                          param_grid={"alpha": [10, 1, 0.1, 1e-2]})

Page 48: Predictive Analytics Workshop

Recap of Python Session

Preprocessing:
--> Feature selection
--> Handling missing data
--> Handling categorical data

Model Evaluation: making training and testing data

Model Selection:
--> Finding parameters: cross-validation
--> Various regression models:
 a. Simple model: Linear Regression
 b. Regularization (L2 norm): Ridge Regression
 c. Sparse regularization: Lasso Regression
 d. Interpretable: decision trees
 e. Random forests: ensembles of decision trees

Page 49: Predictive Analytics Workshop

Thank you