Predictive analytics v0.1
Mirror mirror on the wall, help me predict and know it all…
Ankur Sansanwal
https://www.linkedin.com/in/ankursansanwal
2
Why are we here
3
Let's establish the baseline
• What does this represent?
  – I eat this for breakfast
  – There is always a first time
4
Let's establish the baseline
• How familiar R you with R?
  – Jedi Master
  – I publish R libraries
  – It's a dance studio down the road
5
Let's get ready
• Please install R Studio (GUI) or R (Shell)
  – You can download R Studio from https://www.rstudio.com/ and R from https://cran.r-project.org/bin/windows/base/
• Copy wine.csv & wine_test.csv from the thumb drive
  – This data comes from Liquid Assets (www.liquidasset.com/winedata.html)
• This talk is inspired by The Analytics Edge course work at edx.org
6
Statistics refresher before the fun part
• Independent & Dependent variable
• One Variable Linear Regression
• SSE
• SST
• R2
7
What is
• Dependent variable: the variable that you are trying to predict
• Independent variable: variables that you believe influence the dependent variable
8
One Variable Linear Regression
• y = 0.5 (avg. growing temp) – 1.25
  – How does price change based on temp?
  – Is this line perfect?
• Baseline model: y = 7
  – What is the price of wine when temp is 16 or 18 degrees?
[Figure: Price (dependent var.) vs. avg. growing temp (independent var.)]
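To make the questions on this slide concrete, here is a minimal R sketch that evaluates the slide's line and the baseline at 16 and 18 degrees. The function names predict_price and baseline_price are made up for illustration; only the two formulas come from the slide.

```r
# Regression line from the slide: price = 0.5 * (avg. growing temp) - 1.25
predict_price <- function(temp) 0.5 * temp - 1.25

# Baseline model from the slide: always predict 7, whatever the temperature
baseline_price <- function(temp) rep(7, length(temp))

temps <- c(16, 18)
predict_price(temps)   # 6.75 7.75
baseline_price(temps)  # 7 7
```

Each extra degree adds 0.5 to the predicted price, while the baseline ignores temperature altogether.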
9
The best model (line) should have minimal errors…
• How do you calculate error?
• Error = Actual value – Predicted value = 8 – 7 = 1
[Figure: Price (dependent var.) vs. avg. growing temp (independent var.)]
10
One of the measures of quality of the model is Sum of Squared Errors (SSE)
• SSE = (e1)^2 + (e2)^2 + (e3)^2 + (e4)^2 + … + (en)^2
[Figure: Price (dependent var.) vs. avg. growing temp (independent var.)]
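The error and SSE definitions can be sketched in a few lines of R. Only the first pair of values (actual 8, predicted 7) comes from the talk; the other numbers are made up purely for illustration.

```r
# Actual prices and model predictions. Only the first pair (8 vs 7) is from
# the slides; the remaining values are hypothetical.
actual    <- c(8, 6.5, 7.2, 6.0)
predicted <- c(7, 7.0, 7.0, 7.0)

errors <- actual - predicted  # error = actual value - predicted value
sse <- sum(errors^2)          # sum of squared errors

errors[1]  # 1
sse        # 2.29
```

Squaring keeps positive and negative errors from cancelling each other out.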
11
The smaller the errors vs baseline the better is the model
• Let's call the SSE of the baseline SST
• SST = 10.15
• SSE = 6.5
• R2 = 1 – SSE / SST = 1 – 6.5 / 10.15 = 0.36
[Figure: Price (dependent var.) vs. avg. growing temp (independent var.)]
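As a quick check, the R2 formula can be evaluated in R by plugging in the SST and SSE values quoted on this slide. The variable names are illustrative.

```r
sst <- 10.15  # SSE of the baseline model, as quoted on the slide
sse <- 6.5    # SSE of the regression line, as quoted on the slide

r2 <- 1 - sse / sst
round(r2, 2)  # 0.36
```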
12
But why is the value of R2 always between 0 and 1?
• R2 = 1 – SSE / SST
• 0 <= SSE <= SST. Why is this the case?
13
Quiz
• What would be the R2 of a perfect model?
14
Remember
• Good models for easy problems will have R2 close to 1
• Good models for hard problems can still have R2 close to 0
15
The real deal
Build a multi-variable regression model in R
16
Step 1 – Have a look at the data (wine.csv)
17
Step 2 – Ask: what's the business question we are trying to solve? “Predict the price of wine?”
18
Quiz
• What are the independent & dependent variables in our data set?
19
Step 3 – Check your working directory
20
Step 4 – Your working directory should be set to the location of wine.csv file
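Steps 3 and 4 might look like this at the R prompt. The path in the commented-out setwd() call is a placeholder, not a real location; replace it with the folder where you copied wine.csv.

```r
getwd()  # print the current working directory

# Hypothetical path: point this at the folder that holds wine.csv
# setwd("path/to/folder/with/wine_csv")

file.exists("wine.csv")  # TRUE once the working directory is set correctly
```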
21
Step 5 – Load csv file
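A minimal sketch of loading the file, assuming wine.csv sits in the working directory. The data frame name wine is a choice made here, not something the transcript confirms.

```r
# Assumes wine.csv is in the current working directory
wine <- read.csv("wine.csv")

str(wine)      # column names and types
summary(wine)  # min, quartiles, mean and max for each column
```

Looking at str() and summary() right after loading is a cheap way to catch a wrong file or a badly parsed column.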
22
Step 6 – Let's create a one-variable regression model
• Name of the model
• Dependent var
• Independent var
• Name of the data set
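The four parts named on this slide map onto a single lm() call. This is a sketch that assumes wine.csv has columns named Price and AGST (the transcript mentions AGST and price but does not show the file header), and the names wine and model1 are chosen here.

```r
# Assumes wine.csv is in the working directory with columns Price and AGST
wine <- read.csv("wine.csv")

# model name <- lm(dependent var ~ independent var, data = data set)
model1 <- lm(Price ~ AGST, data = wine)
model1  # prints the fitted intercept and the AGST coefficient
```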
23
Step 7 – Understand the output of the model
• Model used
• Error terms
• R2
• Adjusted R2: adjusts R2 for the number of indep. var. relative to the no. of data pts.
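The output discussed here is what summary() prints for a fitted model. A sketch, again assuming columns named Price and AGST in wine.csv:

```r
# Assumes wine.csv is in the working directory with columns Price and AGST
wine <- read.csv("wine.csv")
model1 <- lm(Price ~ AGST, data = wine)

model1_summary <- summary(model1)
model1_summary                # call, residuals, coefficients table, R2, adjusted R2
model1_summary$r.squared      # shown as "Multiple R-squared" in the printout
model1_summary$adj.r.squared  # shown as "Adjusted R-squared" in the printout
```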
24
Step 8 – Calculate SSE for Model1
As the name says, the sum of squared errors
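In R, the errors of a fitted model are its residuals, so SSE is one line. The sketch assumes the same Price and AGST columns as before; sse1 is a name chosen here.

```r
# Assumes wine.csv is in the working directory with columns Price and AGST
wine <- read.csv("wine.csv")
model1 <- lm(Price ~ AGST, data = wine)

# residuals() returns actual minus fitted value for every row
sse1 <- sum(residuals(model1)^2)
sse1
```

The comparison table later in the talk reports 5.73 for Model 1.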
25
Step 9 – Add another independent variable to the model
• The new variable
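Extra independent variables are added to the formula with +. The column name HarvestRain is an assumption (the talk writes it as "Harvest Rain"); check names(wine) for the exact spelling in your file.

```r
# Assumes wine.csv is in the working directory; HarvestRain is an assumed column name
wine <- read.csv("wine.csv")

model2 <- lm(Price ~ AGST + HarvestRain, data = wine)
summary(model2)
```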
26
Step 10 – Compare the two models
• Which is better?
27
Step 11 – But let's calculate SSE for Model 2 as well…
To know which of the two models is better, compare their SSE
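One way to put the two models side by side, as a sketch under the same column-name assumptions (Price, AGST, HarvestRain):

```r
# Assumes wine.csv is in the working directory with the column names used below
wine <- read.csv("wine.csv")
model1 <- lm(Price ~ AGST, data = wine)
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)

# Lower SSE and higher R2 on the same data mean a closer fit
sse <- c(model1 = sum(residuals(model1)^2),
         model2 = sum(residuals(model2)^2))
r2  <- c(model1 = summary(model1)$r.squared,
         model2 = summary(model2)$r.squared)
rbind(SSE = sse, R2 = r2)
```

Adding a variable can never increase SSE on the training data, which is one reason adjusted R2 is worth checking too.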
28
Step 12 – Let's go all in…
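"All in" here means using every independent variable listed in the comparison table. The column names HarvestRain and WinterRain are assumed spellings; Age and FrancePop are written as in the talk.

```r
# Assumes wine.csv is in the working directory with the column names used below
wine <- read.csv("wine.csv")

model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)
summary(model3)
sum(residuals(model3)^2)  # SSE for Model 3
```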
29
Step 13 – So which of the models is the best?

Model     Independent variables                                  R2     SSE
Model 1   AGST                                                   0.43   5.73
Model 2   AGST + Harvest Rain                                    0.70   2.97
Model 3   AGST + Harvest Rain + Winter Rain + Age + FrancePop    0.82   1.73
30
Step 14 – What does the output tell us about the independent variables?
Columns of the coefficients table:
• Estimate: the coefficient
• Std. Error: how much the coef. is likely to vary from the est. value
• t value: Estimate / Std. Error
• Pr(>|t|): how plausible it is that the coefficient is actually 0
How to read them:
• A coefficient of 0 means that the independent variable does not change our prediction for the dependent variable. If a coefficient is not significantly different from 0, consider removing that variable
• The larger the abs. value of the t value, the more likely the coefficient is to be significant
• The closer the value of Pr(>|t|) is to 1, the less significant the independent variable
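The coefficients table can also be pulled out as a matrix, which makes the "Estimate / Std. Error" relationship easy to verify. A sketch, assuming the all-in model and the same assumed column names (HarvestRain, WinterRain):

```r
# Assumes wine.csv is in the working directory with the column names used below
wine <- read.csv("wine.csv")
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(model3))
coefs

# Dividing the first column by the second reproduces the "t value" column
coefs[, "Estimate"] / coefs[, "Std. Error"]
```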
31
Step 15 – Let's improve the model by excluding FrancePop (the least significant var.)
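Dropping a variable just means leaving it out of the formula. A sketch with the same assumed column names; model4 is a name chosen here.

```r
# Assumes wine.csv is in the working directory with the column names used below
wine <- read.csv("wine.csv")

model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(model4)  # compare adjusted R2 and the Pr(>|t|) column with the all-in model
```

R2 cannot go up when a variable is removed, so adjusted R2 and the significance of the remaining coefficients are the things to watch.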
32
Refresher – What is correlation
• +1: highly correlated (positively)
• 0: no correlation
• -1: highly correlated (negatively)
33
Step 16 – Identify multicollinearity, i.e. the situation when two independent var. are highly correlated
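A simple way to look for highly correlated pairs is cor(). This sketch assumes the Age and FrancePop columns exist under those names; the numeric-column filter is a precaution added here in case the file has non-numeric columns.

```r
# Assumes wine.csv is in the working directory with columns Age and FrancePop
wine <- read.csv("wine.csv")

cor(wine$Age, wine$FrancePop)  # correlation between one pair of variables

# Correlation matrix for every pair of numeric columns
numeric_cols <- wine[sapply(wine, is.numeric)]
round(cor(numeric_cols), 2)
```

Values near +1 or -1 between two independent variables are the ones to worry about.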
34
Are we there yet!
Use our model to predict price of wine
35
Before we do that
• Data we build our model on is called training data
• Data we test our model on is called test data
36
Step 17 – Load test data
37
Step 18 – Predict (finally)
Pretty close!
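Steps 17 and 18 together might look like this. The transcript does not say which model was used for the prediction; the sketch assumes the Step 15 model (without FrancePop) and the same assumed column names in both files.

```r
# Assumes wine.csv and wine_test.csv are in the working directory
wine      <- read.csv("wine.csv")       # training data
wine_test <- read.csv("wine_test.csv")  # test data

# Assumption: predict with the model from Step 15
model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)

predicted <- predict(model4, newdata = wine_test)
data.frame(actual = wine_test$Price, predicted = predicted)
```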
38
Step 19 – Last but not least, let's calculate R2 to quantify how good our prediction is…
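Test-set R2 uses the same formula as before, with the baseline being the mean price of the training data. A sketch, assuming the Step 15 model and the same assumed column names:

```r
# Assumes wine.csv and wine_test.csv are in the working directory
wine      <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")
model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
predicted <- predict(model4, newdata = wine_test)

sse <- sum((wine_test$Price - predicted)^2)         # errors of our model
sst <- sum((wine_test$Price - mean(wine$Price))^2)  # errors of the baseline
test_r2 <- 1 - sse / sst
test_r2
```

Unlike R2 on training data, this value can be negative when the model does worse than the baseline on the test set.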
39
Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction.
40
Things that have worked for me
• “See” the data before you model
  – I use Sublime text editor
• Define the business question – WRITE it down
• Be friends with an ETL Jedi (data transformation expert)
• It's OK if it's simple!