Data Mining with R: Regression

7/27/2019 Data Mining With R- Regression http://slidepdf.com/reader/full/data-mining-with-r-regression 1/23 Data Mining with R Regression models Hamideh Iraj [email protected]

Upload: hamideh-iraj

Post on 13-Apr-2018



Data Mining with R: Regression models

Hamideh Iraj

[email protected]


Slides Reference

This is a curation from:

Data Analysis Course

Weeks 4-5-6

https://www.coursera.org/course/dataanalysis


Galton Data – Introduction

library(UsingR)

data(galton)

----------------------------------

head(galton)

tail(galton)

----------------------------------

dim(galton)

str(galton)

summary(galton)

summary(galton$child)


Galton Data – Plotting

par(mfrow=c(1,2))

hist(galton$child,col="blue",breaks=100)

hist(galton$parent,col="blue",breaks=100)


Galton Data – Plotting – cont.

pairs(galton)


What is Regression Analysis?

Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.

http://en.wikipedia.org/wiki/Regression_analysis


Fitting a line

plot(galton$parent, galton$child, pch=19,col="blue")

lm1 <- lm(child ~ parent, data=galton)

lines(galton$parent,lm1$fitted,col="red", lwd=3)

lwd: the line width


Plot Residuals

plot(galton$parent,lm1$residuals,col="blue",pch=19)

abline(h=0,col="red",lwd=3)
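
A residual is the observed outcome minus the fitted value, which is why the residual plot is centred on the horizontal line at zero. The sketch below checks this on simulated data; the `heights` data frame is a hypothetical stand-in for `galton` (an assumption, so the block runs without the UsingR package).

```r
# Simulated stand-in for the galton data (assumption: not the real dataset)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)

fit <- lm(child ~ parent, data = heights)

# Residual = observed value minus fitted value
manual_resid <- heights$child - fitted(fit)
same_as_resid <- isTRUE(all.equal(unname(manual_resid), unname(resid(fit))))
print(same_as_resid)

# With an intercept in the model, least-squares residuals average to zero
print(mean(resid(fit)))
```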


Linear Model Coefficients

summary(lm1)

lm1$coeff
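
For a model with one predictor, the two coefficients can be reproduced by hand: the slope is the covariance of predictor and outcome divided by the variance of the predictor, and the intercept makes the line pass through the two means. A minimal sketch on simulated data (the `heights` data frame is a hypothetical stand-in for `galton`):

```r
# Simulated stand-in for the galton data (assumption: not the real dataset)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)

fit <- lm(child ~ parent, data = heights)

# Least-squares slope and intercept computed directly
slope <- cov(heights$parent, heights$child) / var(heights$parent)
intercept <- mean(heights$child) - slope * mean(heights$parent)

print(c(intercept = intercept, slope = slope))
print(coef(fit))  # should agree with the manual values above
```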


Why care about model accuracy?

http://en.wikipedia.org/wiki/Linear_regression


Model Accuracy Measures

P-value

Confidence Interval

R^2

Adjusted R^2


P-value

Most Common Measure of Statistical Significance

Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?

Some typical values (single test)

P < 0.05 (significant)

P < 0.01 (strongly significant)

P < 0.001 (very significant)
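
The p-value that `summary()` reports for a coefficient can be reproduced from the estimate and its standard error. The sketch below does this for the slope of a simple regression fitted to simulated data (the `heights` data frame is a hypothetical stand-in for `galton`); "nothing is going on" here means a true slope of zero.

```r
# Simulated stand-in for the galton data (assumption: not the real dataset)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)

fit <- lm(child ~ parent, data = heights)
coefs <- summary(fit)$coefficients

# t statistic: how many standard errors the estimate lies from zero
t_stat <- coefs["parent", "Estimate"] / coefs["parent", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(p_manual)
print(coefs["parent", "Pr(>|t|)"])  # value reported by summary()
```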


Confidence intervals

A confidence interval is a type of interval estimate of a population parameter and is used to indicate the reliability of an estimate.

confint(lm1,level=0.95)

http://en.wikipedia.org/wiki/Confidence_interval
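
For a linear model, the interval returned by `confint()` is the estimate plus or minus a t quantile times the standard error. A small sketch on simulated data (the `heights` data frame is a hypothetical stand-in for `galton`) that rebuilds the 95% interval for the slope by hand:

```r
# Simulated stand-in for the galton data (assumption: not the real dataset)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)

fit <- lm(child ~ parent, data = heights)

est <- unname(coef(fit)["parent"])
se <- summary(fit)$coefficients["parent", "Std. Error"]

# 97.5% quantile of the t distribution gives a two-sided 95% interval
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(lower = est - t_crit * se, upper = est + t_crit * se)

print(ci_manual)
print(confint(fit, "parent", level = 0.95))  # should match the manual interval
```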


R^2

R^2: the proportion of response variation "explained" by the regressors in the model.

R^2 = 1: the fitted model explains all variability in the response y.

R^2 = 0: indicates no 'linear' relationship between the response variable and regressors (for straight line regression this means that the straight line model is a constant line: slope = 0, intercept = \bar{y}).

http://en.wikipedia.org/wiki/Coefficient_of_determination
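
R^2 can be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch on simulated data (the `heights` data frame is a hypothetical stand-in for `galton`):

```r
# Simulated stand-in for the galton data (assumption: not the real dataset)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)

fit <- lm(child ~ parent, data = heights)

rss <- sum(resid(fit)^2)                                # variation left unexplained
tss <- sum((heights$child - mean(heights$child))^2)    # total variation around the mean
r2_manual <- 1 - rss / tss

print(r2_manual)
print(summary(fit)$r.squared)  # value reported by summary()

# With a single predictor, R^2 equals the squared correlation
print(cor(heights$parent, heights$child)^2)
```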


Adjusted R^2

The use of an adjusted R^2 (often written as \bar R^2 and pronounced "R bar squared") is an attempt to take account of the phenomenon of the R^2 automatically and spuriously increasing when extra explanatory variables are added to the model.

http://en.wikipedia.org/wiki/Coefficient_of_determination
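
The effect can be seen by adding a predictor that is pure noise: R^2 cannot go down, while the adjusted version applies a penalty for the extra term. The sketch below uses simulated data; the `heights` data frame, the `noise` column and the helper `adj_r2()` are hypothetical names introduced for illustration.

```r
# Simulated stand-in for the galton data plus an unrelated noise column (assumptions)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)
heights$noise <- rnorm(200)

fit_small <- lm(child ~ parent, data = heights)
fit_big <- lm(child ~ parent + noise, data = heights)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p predictors
adj_r2 <- function(model) {
  r2 <- summary(model)$r.squared
  n <- nobs(model)
  p <- length(coef(model)) - 1  # predictors, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Plain R^2 never decreases when a predictor is added
print(c(small = summary(fit_small)$r.squared, big = summary(fit_big)$r.squared))

# Adjusted R^2, by hand and as reported by summary()
print(c(manual = adj_r2(fit_big), builtin = summary(fit_big)$adj.r.squared))
```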


Predicting with Linear Regression

coef(lm1)[1] + coef(lm1)[2]*80

newdata <- data.frame(parent=80)

predict(lm1,newdata)
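
Both approaches give the same number, and `predict()` also accepts several new values at once and can return an interval around each prediction. A minimal sketch on simulated data (the `heights` data frame and the chosen parent heights are assumptions for illustration):

```r
# Simulated stand-in for the galton data (assumption: not the real dataset)
set.seed(1)
heights <- data.frame(parent = rnorm(200, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(200, sd = 2.2)

fit <- lm(child ~ parent, data = heights)

# Column name must match the predictor name used in the formula
new_parents <- data.frame(parent = c(64, 68, 72))

# Intercept + slope * x, computed by hand
manual <- coef(fit)[1] + coef(fit)[2] * new_parents$parent
pred <- predict(fit, new_parents)
print(cbind(manual = unname(manual), predict = unname(pred)))

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
pred_int <- predict(fit, new_parents, interval = "prediction", level = 0.95)
print(pred_int)
```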


Multivariate Linear Regression

WHO childhood hunger data

Dataset: http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?ofile=text&filter=COUNTRY:*

hunger <- read.csv("./hunger.csv")

hunger <- hunger[hunger$Sex!="Both sexes", ]


Multivariate Linear Regression – cont.

lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)

lmBoth2 <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)

lmBoth: same slopes

lmBoth2: different slopes
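
The difference between the two models is the interaction term: without it both groups share one slope, with it each group gets its own. The sketch below illustrates this on a small simulated data set; the `toy` data frame and its values are hypothetical and only borrow the column names `Year`, `Sex` and `Numeric` from the hunger data.

```r
# Hypothetical toy data, NOT the WHO hunger data
set.seed(2)
toy <- data.frame(
  Year = rep(1990:2009, times = 2),
  Sex = factor(rep(c("Female", "Male"), each = 20))
)
trend <- ifelse(toy$Sex == "Male",
                30 - 0.40 * (toy$Year - 1990),
                25 - 0.20 * (toy$Year - 1990))
toy$Numeric <- trend + rnorm(40, sd = 0.5)

# One common slope for Year, separate intercepts per sex
same_slopes <- lm(Numeric ~ Year + Sex, data = toy)

# Year * Sex expands to Year + Sex + Year:Sex, giving a slope per sex
diff_slopes <- lm(Numeric ~ Year * Sex, data = toy)
print(names(coef(diff_slopes)))

# Slope for the reference level (Female) and for Male
female_slope <- unname(coef(diff_slopes)["Year"])
male_slope <- female_slope + unname(coef(diff_slopes)["Year:SexMale"])
print(c(female = female_slope, male = male_slope))

# F test of whether the interaction improves on the common-slope model
print(anova(same_slopes, diff_slopes))
```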


Model Selection

step(lmBoth2)
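
`step()` searches for a model with a lower AIC by adding or dropping terms; started from a full model with default arguments, it works backwards by dropping terms. A minimal sketch on simulated data, where only `x1` truly affects the outcome (the data frame `d` and its columns are hypothetical):

```r
# Hypothetical data: y depends on x1 only; x2 and x3 are noise
set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 2 + 3 * d$x1 + rnorm(100)

full <- lm(y ~ x1 + x2 + x3, data = d)

# trace = 0 suppresses the step-by-step log; the result is the selected lm object
selected <- step(full, trace = 0)

print(formula(selected))

# extractAIC() returns c(equivalent df, AIC); step() compares the second value
print(c(full = extractAIC(full)[2], selected = extractAIC(selected)[2]))
```

Which noise terms are dropped depends on the simulated sample, so the selected formula is best read from the output rather than assumed.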


Regression with Factor Variables

Outcome is still quantitative

Covariate(s) are factor variables

Fitting lines = fitting means

Want to evaluate contribution of all factor levels at once
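
"Fitting lines = fitting means" can be checked directly: with a single factor covariate, the intercept is the mean of the reference level and every other coefficient is a difference from that mean. The sketch below uses simulated scores; the `films` data frame and its rating levels are hypothetical, not the movies data. `anova()` then tests all factor levels at once.

```r
# Hypothetical data: scores for three rating groups (NOT the movies dataset)
set.seed(4)
films <- data.frame(rating = factor(rep(c("G", "PG", "R"), each = 30)))
films$score <- c(rnorm(30, mean = 70, sd = 8),
                 rnorm(30, mean = 65, sd = 8),
                 rnorm(30, mean = 60, sd = 8))

fit_factor <- lm(score ~ rating, data = films)
group_means <- tapply(films$score, films$rating, mean)

# Intercept = mean of the reference level "G"
print(c(coef(fit_factor)["(Intercept)"], group_means["G"]))

# Other coefficients = difference between a level's mean and the reference mean
print(c(coef(fit_factor)["ratingPG"], group_means["PG"] - group_means["G"]))

# One F test for the contribution of the whole factor
print(anova(fit_factor))
```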


Regression with Factor Variables – cont.

Dataset: http://www.rossmanchance.com/iscam2/data/movies03RT.txt

movies <- read.table("./movies.txt",sep="\t",header=TRUE,quote="")

head(movies)


Regression with Factor Variables – cont.

lm2 <- lm(movies$score ~ as.factor(movies$rating))

summary(lm2)
