Predictive Analytics v0.1 AS

Mirror mirror on the wall, help me predict and know it all… Ankur Sansanwal https://www.linkedin.com/in/ankursansanwal

Upload: ankur-sansanwal

Post on 16-Jan-2017


Page 2

Why are we here

Page 3

Let's establish the baseline

• What does this represent?
– I eat this for breakfast
– There is always a first time

Page 4

Let's establish the baseline

• How familiar R you with R?
– Jedi Master: I publish R libraries
– It’s a dance studio down the road

Page 5

Let's get ready

• Please install RStudio (GUI) or R (shell)
– You can download RStudio from https://www.rstudio.com/ and R from https://cran.r-project.org/bin/windows/base/
• Copy wine.csv and wine_test.csv from the thumb drive
– This data comes from Liquid Assets (www.liquidasset.com/winedata.html)
• This talk is inspired by The Analytics Edge course at edx.org

Page 6

Statistics refresher before the fun part

• Independent & dependent variable
• One-variable linear regression
• SSE
• SST
• R²

Page 7

What is

• Dependent variable: the variable that you are trying to predict
• Independent variable: a variable that you believe influences the dependent variable

Page 8

One Variable Linear Regression

• y = 0.5 × (avg. growing temp) – 1.25
– How does price change based on temp?
– Is this line perfect?
• Baseline model: y = 7
– What is the price of wine when temp is 16 or 18 degrees?

[Plot: Price (dependent var.) vs. avg. growing temp (independent var.)]
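A minimal R sketch of the two models on this slide. The function names `line_price` and `baseline_price` are made up for illustration; the coefficients are the ones shown above.

```r
# Regression line from the slide: price = 0.5 * temp - 1.25
line_price <- function(temp) 0.5 * temp - 1.25

# Baseline model from the slide: always predict 7, whatever the temperature
baseline_price <- function(temp) rep(7, length(temp))

temps <- c(16, 18)
line_price(temps)      # 6.75 7.75
baseline_price(temps)  # 7 7
```

The line's prediction moves by 0.5 for every degree of temperature, while the baseline ignores temperature altogether.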

Page 9

The best model (line) should have minimal errors…

• How do you calculate error?
• Error = Actual value – Predicted value
        = 8 – 7 = 1

[Plot: Price (dependent var.) vs. avg. growing temp (independent var.)]

Page 10

One of the measures of quality of the model is Sum of Squared Errors (SSE)

• SSE = (e₁)² + (e₂)² + (e₃)² + (e₄)² + … + (eₙ)², where eᵢ is the error for data point i

[Plot: Price (dependent var.) vs. avg. growing temp (independent var.)]
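As a rough sketch in R, SSE is just the squared errors added up. The `sse` helper and the three prices are hypothetical; only the first point (actual 8, predicted 7) echoes the earlier error example.

```r
# Sum of squared errors: square each error (actual - predicted), then add them up
sse <- function(actual, predicted) sum((actual - predicted)^2)

# Hypothetical prices, compared against a baseline that predicts 7 everywhere
actual    <- c(8, 6, 7.5)
predicted <- c(7, 7, 7)

actual - predicted       # errors: 1.0 -1.0 0.5
sse(actual, predicted)   # 1 + 1 + 0.25 = 2.25
```

Squaring keeps positive and negative errors from cancelling each other out.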

Page 11

The smaller the errors vs. the baseline, the better the model

• Let's call the SSE of the baseline model SST
• SST = 10.15
• SSE = 6.5
• R² = 1 – SSE / SST
     = 1 – 6.5 / 10.15
     ≈ 0.36

[Plot: Price (dependent var.) vs. avg. growing temp (independent var.)]
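Plugging the slide's SST and SSE into the formula, as a small R sketch. The helper name `r_squared` is an assumption made for this example.

```r
# R-squared compares the model's errors (SSE) with the baseline's errors (SST)
r_squared <- function(sse, sst) 1 - sse / sst

# Values from the slide
r_squared(sse = 6.5, sst = 10.15)     # about 0.36

# A model no better than the baseline has SSE equal to SST
r_squared(sse = 10.15, sst = 10.15)   # 0
```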

Page 12

But why is the value of R² always between 0 and 1?

• R² = 1 – SSE / SST
• 0 <= SSE <= SST (on the data used to build the model). Why is this the case?

Page 13

Quiz

• What would be the R² of a perfect model?

Page 14

Remember

• Good models for easy problems will have R² close to 1
• Good models for hard problems will have R² close to 0

Page 15

The real deal

Build a multi-variable regression model in R

Page 16

Step 1 – Have a look at the data (wine.csv)

Page 17

Step 2 – Ask: what's the business question we are trying to solve? “Predict the price of wine?”

Page 18

Quiz

• What are the independent and dependent variables in our data set?

Page 19

Step 3 – Check your working directory

Page 20

Step 4 – Set your working directory to the folder that contains wine.csv
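Steps 3 and 4 could look roughly like this in R. The slide's screenshots are not preserved, so this is a sketch rather than the exact commands shown; `data_dir` is a placeholder that you would replace with your own folder.

```r
# Step 3: print the current working directory
getwd()

# Step 4: point R at the folder that holds wine.csv.
# `data_dir` is a placeholder; here it simply reuses the current folder.
data_dir <- getwd()
setwd(data_dir)

# Check whether wine.csv is visible from the working directory
file.exists("wine.csv")
```

In RStudio the same thing can be done from the Session menu, but the commands make the step repeatable.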

Page 21

Step 5 – Load the CSV file
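A hedged sketch of loading the file with `read.csv`. So that it runs anywhere, it falls back to a tiny stand-in file with made-up values when wine.csv is not in the working directory; the column names `AGST` and `Price` in the stand-in are assumptions.

```r
csv_path <- "wine.csv"

if (!file.exists(csv_path)) {
  # Stand-in file with made-up values, NOT the real wine data
  csv_path <- tempfile(fileext = ".csv")
  write.csv(data.frame(AGST = c(16.2, 17.1, 15.8),
                       Price = c(6.8, 7.5, 6.4)),
            csv_path, row.names = FALSE)
}

# Read the CSV into a data frame
wine <- read.csv(csv_path)

# Step 1 revisited: look at the structure and a summary of each column
str(wine)
summary(wine)
```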

Page 22

Step 6 – Let's create a one-variable regression model

Parts of the command:
– Name of the model
– Dependent variable
– Independent variable
– Name of the data set
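The four parts map onto R's `lm` function roughly as below. The original command is not preserved, so the names `model1`, `Price` and `AGST` are assumptions (AGST appears later in the Step 13 table), and the data is a made-up stand-in so the sketch runs without wine.csv.

```r
# Stand-in data (made-up values); with the real file use:
#   wine <- read.csv("wine.csv")
set.seed(1)
wine <- data.frame(AGST = runif(25, 15, 18))
wine$Price <- -1.25 + 0.5 * wine$AGST + rnorm(25, sd = 0.4)

# model1 : name of the model
# Price  : dependent variable (left of ~)
# AGST   : independent variable (right of ~)
# wine   : name of the data set
model1 <- lm(Price ~ AGST, data = wine)

# Printing the model shows the intercept and the AGST coefficient
model1
```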

Page 23

Step 7 – Understand the output of the model

– Call: the model used
– Residuals: the error terms
– Multiple R-squared: R²
– Adjusted R-squared: adjusts R² for the number of independent variables relative to the number of data points
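A sketch of pulling those pieces out of `summary()`. The model name `model1`, the column names and the stand-in data are assumptions, as the original screenshot is not preserved.

```r
# Stand-in data (made-up values) so the sketch runs without wine.csv
set.seed(1)
wine <- data.frame(AGST = runif(25, 15, 18))
wine$Price <- -1.25 + 0.5 * wine$AGST + rnorm(25, sd = 0.4)
model1 <- lm(Price ~ AGST, data = wine)

# Full report: Call, Residuals, Coefficients, R-squared values
s <- summary(model1)
s

# The individual pieces can also be read directly
s$call            # the model used
s$r.squared       # Multiple R-squared
s$adj.r.squared   # Adjusted R-squared
```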

Page 24

Step 8 – Calculate SSE for Model 1

As the name says, SSE is the sum of squared errors
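One way this might be done: a fitted `lm` object stores its errors in `residuals`, so SSE is one line. `model1`, the column names and the stand-in data are assumptions for illustration.

```r
# Stand-in data (made-up values) so the sketch runs without wine.csv
set.seed(1)
wine <- data.frame(AGST = runif(25, 15, 18))
wine$Price <- -1.25 + 0.5 * wine$AGST + rnorm(25, sd = 0.4)
model1 <- lm(Price ~ AGST, data = wine)

# Residuals are the errors: actual price minus predicted price
head(model1$residuals)

# Square them and add them up
SSE <- sum(model1$residuals^2)
SSE
```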

Page 25

Step 9 – Add another independent variable to the model

The new variable

Page 26

Step 10 – Compare the two models

Which is better?

Page 27

Step 11 – But let's calculate SSE for Model 2 as well…

To know which of the two models is better, compare their SSE
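Steps 9 to 11 might be sketched as follows. The column name `HarvestRain` is an assumption based on the "Harvest Rain" variable in the Step 13 table, and the data is a made-up stand-in, so the numbers will not match the slides.

```r
# Stand-in data (made-up values) so the sketch runs without wine.csv
set.seed(1)
wine <- data.frame(AGST = runif(25, 15, 18),
                   HarvestRain = runif(25, 40, 300))
wine$Price <- -1.25 + 0.5 * wine$AGST - 0.004 * wine$HarvestRain +
  rnorm(25, sd = 0.3)

# Step 9: extra independent variables are added with +
model1 <- lm(Price ~ AGST, data = wine)
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)

# Step 10: compare R-squared
c(Model1 = summary(model1)$r.squared, Model2 = summary(model2)$r.squared)

# Step 11: compare SSE
SSE1 <- sum(model1$residuals^2)
SSE2 <- sum(model2$residuals^2)
c(Model1 = SSE1, Model2 = SSE2)
```

On the data used to fit them, adding a variable can only keep SSE the same or lower it, which is one reason Adjusted R-squared is worth watching too.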

Page 28

Step 12 – Let's go all in…
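"All in" presumably means using every independent variable listed in the Step 13 table. A sketch under that assumption, with assumed column names and a made-up stand-in data set (`make_wine` is a hypothetical helper):

```r
# Hypothetical helper that builds stand-in data (made-up values, not wine.csv)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18),
                  HarvestRain = runif(n, 40, 300),
                  WinterRain = runif(n, 400, 800),
                  Age = sample(3:31, n, replace = TRUE))
  d$FrancePop <- 60000 - 500 * d$Age + rnorm(n, sd = 300)
  d$Price <- -3 + 0.6 * d$AGST - 0.004 * d$HarvestRain +
    0.001 * d$WinterRain + 0.02 * d$Age + rnorm(n, sd = 0.3)
  d
}
set.seed(42)
wine <- make_wine(25)

# Model 3: all five independent variables
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)
summary(model3)

SSE3 <- sum(model3$residuals^2)
SSE3
```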

Page 29

Step 13 – So which of the models is the best?

                        Model 1   Model 2               Model 3
Independent variables   AGST      AGST + Harvest Rain   AGST + Harvest Rain + Winter Rain + Age + FrancePop
R²                      0.43      0.70                  0.82
SSE                     5.73      2.97                  1.73

Page 30

Step 14 – What does the output tell us about the independent variables?

Columns of the coefficient table:
– Estimate: the coefficient
– Std. Error: how much the coefficient is likely to vary from the estimated value
– t value: Estimate / Std. Error
– Pr(>|t|): probability of seeing a t value at least this extreme if the coefficient were actually 0

• If a coefficient is not significantly different from 0, consider removing the variable: it does not change our prediction for the dependent variable
• The larger the absolute value of the t value, the more likely the coefficient is to be significant
• The closer the value of Pr(>|t|) is to 1, the less significant the independent variable is
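The coefficient table can be pulled out of the summary as a matrix, which makes the columns easy to inspect. This is a sketch on made-up stand-in data with assumed column names; `make_wine` is a hypothetical helper.

```r
# Hypothetical helper that builds stand-in data (made-up values, not wine.csv)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18),
                  HarvestRain = runif(n, 40, 300),
                  WinterRain = runif(n, 400, 800),
                  Age = sample(3:31, n, replace = TRUE))
  d$FrancePop <- 60000 - 500 * d$Age + rnorm(n, sd = 300)
  d$Price <- -3 + 0.6 * d$AGST - 0.004 * d$HarvestRain +
    0.001 * d$WinterRain + 0.02 * d$Age + rnorm(n, sd = 0.3)
  d
}
set.seed(42)
wine <- make_wine(25)
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)

# One row per coefficient; columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(model3))
coefs

# The t value is the estimate divided by its standard error
coefs[, "Estimate"] / coefs[, "Std. Error"]

# Sort by p-value: the least significant variables end up at the bottom
coefs[order(coefs[, "Pr(>|t|)"]), ]
```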

Page 31

Step 15 – Let's improve the model by excluding FrancePop (the least significant variable)
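Dropping a variable just means leaving it out of the formula. A sketch with assumed column names and made-up stand-in data (`make_wine` and `model4` are hypothetical names), so the output will differ from the real wine data:

```r
# Hypothetical helper that builds stand-in data (made-up values, not wine.csv)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18),
                  HarvestRain = runif(n, 40, 300),
                  WinterRain = runif(n, 400, 800),
                  Age = sample(3:31, n, replace = TRUE))
  d$FrancePop <- 60000 - 500 * d$Age + rnorm(n, sd = 300)
  d$Price <- -3 + 0.6 * d$AGST - 0.004 * d$HarvestRain +
    0.001 * d$WinterRain + 0.02 * d$Age + rnorm(n, sd = 0.3)
  d
}
set.seed(42)
wine <- make_wine(25)

model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)

# Same model without FrancePop
model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(model4)

# Adjusted R-squared accounts for the number of variables used
c(Model3 = summary(model3)$adj.r.squared,
  Model4 = summary(model4)$adj.r.squared)
```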

Page 32

Refresher – What is correlation?

• +1: highly correlated
• 0: no correlation
• –1: highly correlated

Page 33

Step 16 – Identify multicollinearity, i.e. the situation when two independent variables are highly correlated
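R's `cor` function is one way to spot this. In the made-up stand-in data below, FrancePop is deliberately built from Age so the two come out highly correlated; the column names and the `make_wine` helper are assumptions.

```r
# Hypothetical helper that builds stand-in data (made-up values, not wine.csv)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18),
                  HarvestRain = runif(n, 40, 300),
                  WinterRain = runif(n, 400, 800),
                  Age = sample(3:31, n, replace = TRUE))
  d$FrancePop <- 60000 - 500 * d$Age + rnorm(n, sd = 300)
  d
}
set.seed(42)
wine <- make_wine(25)

# Correlation between one pair of independent variables
age_pop_cor <- cor(wine$Age, wine$FrancePop)
age_pop_cor

# Correlation between every pair of columns, rounded for readability
round(cor(wine), 2)
```

Values near +1 or -1 between two independent variables are the warning sign; values near 0 are fine.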

Page 34

Are we there yet!

Use our model to predict price of wine

Page 35

Before we do that

• The data we use to build our model is called training data
• The data we use to test our model is called test data

Page 36

Step 17 – Load test data

Page 37

Step 18 – Predict (finally)

Pretty close!
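Steps 17 and 18 could be sketched with `predict`. With the real files the two data frames would come from `read.csv("wine.csv")` and `read.csv("wine_test.csv")`; here both are made-up stand-ins, and `make_wine`, `wineTest` and `model4` are assumed names.

```r
# Hypothetical helper that builds stand-in data (made-up values)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18),
                  HarvestRain = runif(n, 40, 300),
                  WinterRain = runif(n, 400, 800),
                  Age = sample(3:31, n, replace = TRUE))
  d$Price <- -3 + 0.6 * d$AGST - 0.004 * d$HarvestRain +
    0.001 * d$WinterRain + 0.02 * d$Age + rnorm(n, sd = 0.3)
  d
}
set.seed(42)
wine     <- make_wine(25)  # training data: used to build the model
wineTest <- make_wine(5)   # test data: never seen by the model

model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)

# Predict the price for each row of the test data
predicted <- predict(model4, newdata = wineTest)

# Put predictions next to the actual prices to eyeball how close they are
data.frame(actual = wineTest$Price, predicted = predicted)
```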

Page 38

Step 19 – Last but not least, let's calculate R² to quantify how good our prediction is…
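A sketch of R² on the test data, using the same formula as before. One assumption here: the baseline is taken to be the mean price of the training data. All data is a made-up stand-in and the names are hypothetical.

```r
# Hypothetical helper that builds stand-in data (made-up values)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18),
                  HarvestRain = runif(n, 40, 300),
                  WinterRain = runif(n, 400, 800),
                  Age = sample(3:31, n, replace = TRUE))
  d$Price <- -3 + 0.6 * d$AGST - 0.004 * d$HarvestRain +
    0.001 * d$WinterRain + 0.02 * d$Age + rnorm(n, sd = 0.3)
  d
}
set.seed(42)
wine     <- make_wine(25)
wineTest <- make_wine(5)
model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
predicted <- predict(model4, newdata = wineTest)

# SSE: errors of our model on the test data
SSE <- sum((wineTest$Price - predicted)^2)

# SST: errors of the baseline (mean training price) on the test data
SST <- sum((wineTest$Price - mean(wine$Price))^2)

R2_test <- 1 - SSE / SST
R2_test
```

Unlike R² on the training data, this value can drop below 0 if the model predicts the test data worse than the baseline does.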

Page 39

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction.

Page 40

Things that have worked for me

• “See” the data before you model
– I use the Sublime Text editor
• Define the business question – WRITE it down
• Befriend an ETL Jedi (data transformation expert)
• It’s OK if it’s simple!