Predictive analytics v0.1 as

Mirror mirror on the wall, help me predict and know it all…

Ankur Sansanwal
https://www.linkedin.com/in/ankursansanwal

Why are we here

Let's establish the baseline

• What does this represent?

I eat this for breakfast
There is always a first time

Let's establish the baseline

• How familiar R you with R?

Jedi Master: I publish R libraries
It’s a dance studio down the road

Let's get ready

• Please install RStudio (GUI) or R (shell)
  – You can download RStudio from https://www.rstudio.com/ and R from https://cran.r-project.org/bin/windows/base/
• Copy wine.csv and wine_test.csv from the thumb drive
  – This data comes from Liquid Assets (www.liquidasset.com/winedata.html)
• This talk is inspired by The Analytics Edge coursework at edx.org

Statistics refresher before the fun part

• Independent & dependent variables
• One-variable linear regression
• SSE
• SST
• R²

What is

• Dependent variable: the variable that you are trying to predict
• Independent variable: a variable that you believe influences the dependent variable

One Variable Linear Regression

• y = 0.5 × (avg. growing temp) - 1.25
  – How does price change based on temp?
  – Is this line perfect?
• Baseline model: y = 7
  – What is the price of wine when temp is 16 or 18 degrees?

[Chart: Price (dependent var.) vs. avg. growing temp (independent var.)]
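
A minimal R sketch of the two models on this slide, evaluated at 16 and 18 degrees. The function names predict_price and baseline_price are made up for illustration; only the two equations come from the slide.

```r
# Regression line from the slide: price = 0.5 * (avg. growing temp) - 1.25
predict_price <- function(temp) 0.5 * temp - 1.25

# Baseline model from the slide: always predict 7, whatever the temperature
baseline_price <- function(temp) rep(7, length(temp))

temps <- c(16, 18)
predict_price(temps)   # 6.75 7.75
baseline_price(temps)  # 7 7
```

The regression line moves the predicted price by 0.5 for every extra degree, while the baseline ignores temperature altogether.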

The best model (line) should have minimal errors…

• How do you calculate error?
• Error = actual value - predicted value = 8 - 7 = 1

[Chart: Price (dependent var.)]

One of the measures of the quality of a model is the Sum of Squared Errors (SSE)

• SSE = (e₁)² + (e₂)² + (e₃)² + (e₄)² + … + (eₙ)²

[Chart: Price (dependent var.)]
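
The SSE formula can be sketched in a few lines of R. The prices below are invented for illustration and are not taken from wine.csv; only the first point (actual 8, predicted 7) echoes the error example from the previous slide.

```r
# Hypothetical actual prices, made up for this example
actual    <- c(8, 6.5, 7.2, 7.9)
# Predictions of the baseline model y = 7
predicted <- rep(7, length(actual))

errors <- actual - predicted  # e1, e2, ..., en
sse    <- sum(errors^2)       # square each error, then add them up
sse                           # 2.1
```

Squaring keeps positive and negative errors from cancelling each other out.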

The smaller the errors vs. the baseline, the better the model

• Let's call the SSE of the baseline model SST
• SST = 10.15
• SSE = 6.5
• R² = 1 - SSE / SST = 1 - 6.5 / 10.15 ≈ 0.36

[Chart: Price (dependent var.)]
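
As a quick check, the R² arithmetic can be reproduced in R from the SST and SSE values quoted on this slide. The variable names are arbitrary.

```r
sst <- 10.15  # SSE of the baseline model, from the slide
sse <- 6.5    # SSE of the regression model, from the slide

r2 <- 1 - sse / sst
round(r2, 2)  # 0.36
```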

But why is the value of R² always between 0 and 1?

• R² = 1 - SSE / SST

• 0 ≤ SSE ≤ SST on the data used to build the model. Why is this the case?

Quiz

• What would be the R² of a perfect model?

Remember

• Good models for easy problems will have an R² close to 1
• Good models for hard problems can still have an R² close to 0

The real deal

Build a multi-variable regression model in R

Step 1 – Have a look at the data (wine.csv)

Step 2 – Ask: what's the business question we are trying to solve? “Predict the price of wine?”

Quiz

• What are the independent & dependent variables in our data set?

Step 3 – Check your working directory

Step 4 – Your working directory should be set to the location of the wine.csv file

Step 5 – Load csv file
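
Steps 3 to 5 might look roughly like this in R. The data_dir value is a placeholder to replace with your own folder, and the data frame name wine is an assumption; the file-exists guard is only there so the sketch runs even without the thumb drive.

```r
# Step 3: show the current working directory
getwd()

# Step 4: point R at the folder that holds wine.csv
# "." is a placeholder meaning "stay where we are"; replace it with your own path
data_dir <- "."
setwd(data_dir)

# Step 5: load the CSV into a data frame (skipped here if the file is missing)
if (file.exists("wine.csv")) {
  wine <- read.csv("wine.csv")
  str(wine)             # column names, types and the first few values
  print(summary(wine))  # min, max, mean and quartiles for each column
} else {
  message("wine.csv not found in ", getwd())
}
```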

Step 6 – Let's create a one-variable regression model

• Name of the model
• Dependent var
• Independent var
• Name of the data set
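
A sketch of what the Step 6 call could look like, with the four parts labeled in a comment. The data here is synthetic and only stands in for wine.csv; the column names Price and AGST and the model name model1 are assumptions.

```r
# Stand-in for wine.csv: synthetic values for illustration, NOT the real data
set.seed(42)
wine <- data.frame(AGST = runif(25, min = 15, max = 18))
wine$Price <- 0.5 * wine$AGST - 1.25 + rnorm(25, sd = 0.3)

# model name <- lm(dependent var ~ independent var, data = name of the data set)
model1 <- lm(Price ~ AGST, data = wine)
coef(model1)  # fitted intercept and slope for AGST
```

With the real file, the synthetic block would be replaced by the data frame loaded from wine.csv.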

Step 7 – Understand the output of the model

• Model used
• Error terms
• R²
• Adjusted R²: adjusts R² for the number of independent variables relative to the number of data points

Step 8 – Calculate SSE for Model 1

As the name says, it is the sum of the squared errors
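
Steps 7 and 8 can be sketched as below. The data is again synthetic (a stand-in for wine.csv), and the names wine, Price, AGST, model1 and sse1 are assumptions.

```r
# Stand-in for wine.csv: synthetic values for illustration, NOT the real data
set.seed(42)
wine <- data.frame(AGST = runif(25, min = 15, max = 18))
wine$Price <- 0.5 * wine$AGST - 1.25 + rnorm(25, sd = 0.3)
model1 <- lm(Price ~ AGST, data = wine)

# Step 7: residual summary, coefficient table, Multiple and Adjusted R-squared
summary(model1)

# Step 8: SSE is the sum of the squared residuals (error terms)
sse1 <- sum(residuals(model1)^2)
sse1

# Cross-check against the refresher: R2 = 1 - SSE / SST,
# where SST is the SSE of a baseline that always predicts the mean price
sst <- sum((wine$Price - mean(wine$Price))^2)
1 - sse1 / sst             # same number as the line below
summary(model1)$r.squared
```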

Step 9 – Add another independent variable to the model

The new variable: Harvest Rain

Step 10 – Compare the two models

Which is better?

Step 11 – But let's calculate SSE for Model 2 as well…

To know which of the two models is better, compare their SSE
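
Steps 9 to 11 could be sketched as follows. The data is synthetic and only imitates the shape of wine.csv; the column name HarvestRain (no space) is an assumption about the CSV header.

```r
# Stand-in for wine.csv: synthetic values for illustration, NOT the real data
set.seed(42)
n <- 25
wine <- data.frame(AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300))
wine$Price <- 0.5 * wine$AGST - 0.004 * wine$HarvestRain - 1.25 + rnorm(n, sd = 0.3)

model1 <- lm(Price ~ AGST, data = wine)                # Step 6
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)  # Step 9: one more variable

# Steps 10 and 11: compare R2 and SSE side by side
sse1 <- sum(residuals(model1)^2)
sse2 <- sum(residuals(model2)^2)
data.frame(
  model = c("model1", "model2"),
  r2    = c(summary(model1)$r.squared, summary(model2)$r.squared),
  sse   = c(sse1, sse2)
)
```

On the data used to fit the models, adding a variable can never increase SSE, so the adjusted R² from summary() is a useful second opinion.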

Step 12 – Let's go all in…

Step 13 – So which of the models is the best?

            Independent variables                                  R²     SSE
Model 1     AGST                                                   0.43   5.73
Model 2     AGST + Harvest Rain                                    0.70   2.97
Model 3     AGST + Harvest Rain + Winter Rain + Age + FrancePop    0.82   1.73

Step 14 – What does the output tell us about the independent variables?

Columns of the coefficient table:
• Estimate: the coefficient
• Std. Error: how much the coefficient is likely to vary from the estimated value
• t value: Estimate / Std. Error
• Pr(>|t|): how plausible it is that the coefficient is actually 0

How to read them:
• If a coefficient is not significantly different from 0, consider removing the variable; it means that the independent variable does not change our prediction for the dependent variable
• The larger the absolute value of the t value, the more likely the coefficient is to be significant
• The closer the value of Pr(>|t|) is to 1, the less significant the independent variable

Step 15 – Let's improve the model by excluding FrancePop (the least significant variable)
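
Steps 12 to 15 can be sketched on synthetic data as below. The column names (HarvestRain, WinterRain and so on) are assumptions about the CSV header, and FrancePop is generated as pure noise here, so the numbers will not match the real wine.csv results.

```r
# Stand-in for wine.csv: synthetic values for illustration, NOT the real data
set.seed(42)
n <- 25
wine <- data.frame(
  AGST        = runif(n, 15, 18),
  HarvestRain = runif(n, 40, 300),
  WinterRain  = runif(n, 400, 800),
  Age         = sample(3:31, n),
  FrancePop   = runif(n, 43000, 55000)  # pure noise in this stand-in
)
wine$Price <- 0.5 * wine$AGST - 0.004 * wine$HarvestRain +
  0.001 * wine$WinterRain + 0.02 * wine$Age - 1.25 + rnorm(n, sd = 0.3)

# Step 12: go all in with every independent variable
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)

# Step 14: coefficient table with Estimate, Std. Error, t value and Pr(>|t|)
coefs <- summary(model3)$coefficients
coefs
coefs[, "Estimate"] / coefs[, "Std. Error"]  # reproduces the "t value" column

# Step 15: refit without FrancePop and compare adjusted R2
model4 <- update(model3, . ~ . - FrancePop)
c(model3 = summary(model3)$adj.r.squared,
  model4 = summary(model4)$adj.r.squared)
```

update() reuses the original call and only changes the formula, which avoids retyping the remaining variables.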

Refresher – What is correlation?

• +1: highly correlated
• 0: no correlation
• -1: highly correlated

Step 16 – Identify multicollinearity, i.e. the situation when two independent variables are highly correlated
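
One way to spot multicollinearity is to look at pairwise correlations with cor(). The sketch below uses synthetic data in which FrancePop is deliberately built from Age, so the two are strongly (negatively) correlated; the column names are assumptions and the values are not from wine.csv.

```r
# Synthetic stand-in, NOT the real data: FrancePop is constructed from Age
set.seed(42)
n   <- 25
age <- sample(3:31, n)
wine <- data.frame(
  Age        = age,
  FrancePop  = 55000 - 400 * age + rnorm(n, sd = 300),
  WinterRain = runif(n, 400, 800)
)

# Correlation between one pair of independent variables
cor(wine$Age, wine$FrancePop)

# Correlation matrix for every pair of columns at once
round(cor(wine), 2)
```

Values near +1 or -1 between two independent variables are the warning sign; values near 0 are fine.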

Are we there yet!

Use our model to predict price of wine

Before we do that

• The data we build our model on is called training data
• The data we test our model on is called test data

Step 17 – Load test data

Step 18 – Predict (finally)

Pretty close!

Step 19 – Last but not least, let's calculate R² to quantify how good our prediction is…
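
Steps 17 to 19 could be sketched like this. The helper make_wine is hypothetical and generates synthetic rows standing in for wine.csv and wine_test.csv; the two-variable model is chosen only to keep the example short.

```r
# Hypothetical helper that makes synthetic rows; stands in for the two CSV files
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300))
  d$Price <- 0.5 * d$AGST - 0.004 * d$HarvestRain - 1.25 + rnorm(n, sd = 0.3)
  d
}
set.seed(42)
wine      <- make_wine(25)  # training data (stand-in for wine.csv)
wine_test <- make_wine(5)   # Step 17: test data (stand-in for wine_test.csv)

model <- lm(Price ~ AGST + HarvestRain, data = wine)

# Step 18: predict prices for rows the model has never seen
predicted <- predict(model, newdata = wine_test)
cbind(actual = wine_test$Price, predicted = predicted)

# Step 19: out-of-sample R2; the baseline is the mean price of the TRAINING data
sse <- sum((wine_test$Price - predicted)^2)
sst <- sum((wine_test$Price - mean(wine$Price))^2)
r2_test <- 1 - sse / sst
r2_test
```

Unlike the training R², the test-set R² can come out negative when the model predicts worse than the baseline.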

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction.

Things that have worked for me

• “See” the data before you model
  – I use the Sublime Text editor
• Define the business question – WRITE it down
• Befriend an ETL Jedi (data transformation expert)
• It's OK if it's simple!
