data mining with r- regression
7/27/2019 Data Mining With R- Regression
http://slidepdf.com/reader/full/data-mining-with-r-regression 1/23
Data Mining with R: Regression Models
Hamideh Iraj
Slides Reference
This is a curation from:
Data Analysis Course
Weeks 4-5-6
https://www.coursera.org/course/dataanalysis
Galton Data – Introduction
library(UsingR)
data(galton)
----------------------------------
head(galton)
tail(galton)
----------------------------------
dim(galton)
str(galton)
summary(galton)
summary(galton$child)
Galton Data - Plotting
par(mfrow=c(1,2))
hist(galton$child,col="blue",breaks=100)
hist(galton$parent,col="blue",breaks=100)
Galton Data – Plotting - cont.
pairs(galton)
What is Regression Analysis?
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
http://en.wikipedia.org/wiki/Regression_analysis
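As a concrete instance of the definition above, the simplest case (one dependent variable y, one independent variable x) can be written as follows. The symbols \beta_0, \beta_1 and \varepsilon_i are notation assumed here for illustration; they do not appear in the slides.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n

(\hat\beta_0, \hat\beta_1) = \arg\min_{b_0, b_1} \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
```

The second line is the ordinary least squares criterion, which is what lm() minimizes when it fits a straight line.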
Fitting a line
plot(galton$parent, galton$child, pch=19, col="blue")  # x = parent, y = child, matching the model
lm1 <- lm(child ~ parent, data=galton)
lines(galton$parent, lm1$fitted.values, col="red", lwd=3)  # lwd sets the line width
Plot Residuals
plot(galton$parent,lm1$residuals,col="blue",pch=19)
abline(h=0, col="red", lwd=3)  # horizontal reference line at residual = 0
Linear Model Coefficients
summary(lm1)
coef(lm1)
Why care about model Accuracy?
http://en.wikipedia.org/wiki/Linear_regression
Model Accuracy Measures
P-value
Confidence Interval
R^2
Adjusted R^2
P-value
Most Common Measure of Statistical Significance
Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?
Some typical values (single test)
P < 0.05 (significant)
P < 0.01 (strongly significant)
P < 0.001 (very significant)
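A minimal sketch of where these p-values come from in R. It assumes the built-in `cars` data set as a stand-in (so that it runs without the UsingR package); the names `fit`, `coef_table`, `p_slope` and `p_manual` are illustrative only.

```r
# Assumption: built-in `cars` data (stopping distance vs. speed) replaces the slide data.
fit <- lm(dist ~ speed, data = cars)

# summary() stores estimates, standard errors, t statistics and p-values in a matrix.
coef_table <- summary(fit)$coefficients
p_slope <- coef_table["speed", "Pr(>|t|)"]

# The same p-value by hand: "if the true slope were 0, how unusual is a
# t statistic at least this large in absolute value?" (two-sided test)
t_slope <- coef_table["speed", "t value"]
p_manual <- 2 * pt(abs(t_slope), df = df.residual(fit), lower.tail = FALSE)

print(coef_table)
print(c(from_summary = p_slope, by_hand = p_manual))
```

The thresholds listed above apply to a single test; when many coefficients or models are tested at once, a multiple-testing correction such as p.adjust() may be appropriate.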
Confidence intervals
A confidence interval is a type of interval estimate of a population parameter and is used to indicate the reliability of an estimate.
confint(lm1,level=0.95)
http://en.wikipedia.org/wiki/Confidence_interval
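The interval returned by confint() for a linear model can be reproduced as estimate ± t-quantile × standard error. The sketch below assumes the built-in `cars` data set as a stand-in for the slide data; the variable names are illustrative.

```r
# Assumption: built-in `cars` data replaces the slide data.
fit <- lm(dist ~ speed, data = cars)

est <- coef(fit)                             # point estimates
se <- sqrt(diag(vcov(fit)))                  # standard errors of the estimates
t_crit <- qt(0.975, df = df.residual(fit))   # two-sided 95% critical value

# Manual 95% confidence intervals, one row per coefficient.
ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# Built-in equivalent.
ci_builtin <- confint(fit, level = 0.95)

print(ci_manual)
print(ci_builtin)
```

A narrower interval indicates a more precisely estimated coefficient; an interval for the slope that excludes 0 corresponds to a p-value below 0.05 for that coefficient.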
R^2
R^2: the proportion of response variation "explained" by the regressors in the model.
R^2 = 1: the fitted model explains all variability in the response y.
R^2 = 0: no 'linear' relationship between the response variable and the regressors (for straight-line regression, this means the fitted line is constant: slope = 0, intercept = \bar{y}).
http://en.wikipedia.org/wiki/Coefficient_of_determination
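A short sketch of the definition, computing R^2 as 1 minus the ratio of residual to total sum of squares and comparing it with the value reported by summary(). It assumes the built-in `cars` data set as a stand-in for the slide data; all variable names are illustrative.

```r
# Assumption: built-in `cars` data replaces the slide data.
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist

ss_res <- sum(residuals(fit)^2)   # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)    # total variation around the mean of y

r2_manual <- 1 - ss_res / ss_tot
r2_builtin <- summary(fit)$r.squared

# A constant line (intercept only, slope fixed at 0) explains nothing.
fit_null <- lm(dist ~ 1, data = cars)
r2_null <- 1 - sum(residuals(fit_null)^2) / ss_tot

print(c(manual = r2_manual, builtin = r2_builtin, null_model = r2_null))
```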
Adjusted R^2
The use of an adjusted R^2 (often written as \bar R^2 and pronounced
"R bar squared") is an attempt to take account of the phenomenon
of the R^2 automatically and spuriously increasing when extra
explanatory variables are added to the model.
http://en.wikipedia.org/wiki/Coefficient_of_determination
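The effect described above can be sketched by adding a pure-noise regressor to a model. The example assumes the built-in `cars` data set and a simulated `noise` column; the helper `adj_r2()` is hypothetical and implements the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of regressors excluding the intercept.

```r
# Assumption: built-in `cars` data plus an unrelated, simulated regressor.
set.seed(1)
dat <- cars
dat$noise <- rnorm(nrow(dat))

fit_small <- lm(dist ~ speed, data = dat)
fit_big <- lm(dist ~ speed + noise, data = dat)

# Hypothetical helper: adjusted R^2 from the plain R^2.
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n <- nobs(fit)                  # number of observations
  p <- length(coef(fit)) - 1      # regressors, not counting the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Plain R^2 can only go up when a regressor is added, even a useless one;
# the adjusted version penalizes the extra parameter.
print(c(r2_small = summary(fit_small)$r.squared,
        r2_big = summary(fit_big)$r.squared))
print(c(adj_small = adj_r2(fit_small), adj_big = adj_r2(fit_big)))
```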
Predicting with Linear Regression
coef(lm1)[1] + coef(lm1)[2]*80
newdata <- data.frame(parent=80)
predict(lm1,newdata)
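Both approaches above give the same point prediction, and predict() can also attach an interval to it. The sketch below assumes the built-in `cars` data set as a stand-in and a hypothetical new observation at speed = 20.

```r
# Assumption: built-in `cars` data replaces the slide data.
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 20)   # column name must match the regressor

# Point prediction two ways: from the coefficients and via predict().
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 20)
by_predict <- unname(predict(fit, new_obs))

# Interval for the mean response vs. interval for a single new observation.
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

print(c(by_hand = by_hand, by_predict = by_predict))
print(conf_int)
print(pred_int)
```

The prediction interval is wider than the confidence interval because it also accounts for the scatter of individual observations around the line. Predictions far outside the range of the observed regressor values should be treated with caution.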
Multivariate Linear Regression
WHO childhood hunger data
Dataset: http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?ofile=text&filter=COUNTRY:*
hunger <- read.csv("./hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes", ]
Multivariate Linear Regression – cont.
# Same slopes: one common Year slope, separate intercepts per sex
lmBoth <- lm(Numeric ~ Year + Sex, data=hunger)
# Different slopes: the Sex*Year interaction lets the Year slope differ by sex
lmBoth2 <- lm(Numeric ~ Year + Sex + Sex*Year, data=hunger)
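The difference between the two models can be sketched on simulated data, which avoids downloading the WHO file. Everything in the data frame `sim` is made up for illustration (column names mirror the hunger data; the true slopes are chosen arbitrarily).

```r
# Assumption: simulated data with the same column names as the hunger data.
set.seed(42)
n <- 200
year <- rep(1990:2009, length.out = n)
sex <- factor(rep(c("Female", "Male"), each = n / 2))

# True model: both groups decline over time, males faster than females.
value <- 30 - 0.3 * (year - 1990) +
  ifelse(sex == "Male", 2 - 0.2 * (year - 1990), 0) +
  rnorm(n, sd = 1)
sim <- data.frame(Year = year, Sex = sex, Numeric = value)

fit_same <- lm(Numeric ~ Year + Sex, data = sim)   # parallel lines
fit_diff <- lm(Numeric ~ Year * Sex, data = sim)   # Year + Sex + Year:Sex

# Group-specific slopes from the interaction model.
slope_female <- unname(coef(fit_diff)["Year"])
slope_male <- unname(coef(fit_diff)["Year"] + coef(fit_diff)["Year:SexMale"])
print(c(female = slope_female, male = slope_male))

# F test: does allowing different slopes improve the fit?
print(anova(fit_same, fit_diff))
```

In an R formula, `Year * Sex` expands to `Year + Sex + Year:Sex`, so it specifies the same model as writing the main effects and the interaction separately.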
Model Selection
step(lmBoth2)
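step() performs stepwise model selection, by default comparing candidate models by AIC and dropping terms as long as that lowers the criterion. A minimal sketch, assuming the built-in `cars` data set with an added pure-noise column (both are stand-ins, not the hunger data):

```r
# Assumption: built-in `cars` data plus an unrelated, simulated regressor.
set.seed(7)
dat <- cars
dat$noise <- rnorm(nrow(dat))

full <- lm(dist ~ speed + noise, data = dat)

# trace = 0 suppresses the step-by-step log; the result is the selected model.
selected <- step(full, trace = 0)

print(formula(selected))
print(c(AIC_full = AIC(full), AIC_selected = AIC(selected)))
```

Stepwise selection is a heuristic; the p-values of the selected model do not account for the search, so they tend to look more favorable than they should.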
Regression with Factor Variables
Outcome is still quantitative
Covariate(s) are factor variables
Fitting lines = fitting means
Want to evaluate contribution of all factor levels at once
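The points above can be sketched with the built-in `PlantGrowth` data set (an assumption made here so the example runs without downloading the movies file): the coefficients of a factor regression are group means and differences between group means, and anova() tests all levels at once.

```r
# Assumption: built-in `PlantGrowth` data (weight by treatment group).
fit <- lm(weight ~ group, data = PlantGrowth)

# Mean outcome in each factor level.
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Intercept = mean of the reference level ("ctrl");
# each other coefficient = that level's mean minus the reference mean.
print(coef(fit))
print(group_means)

# One F test for the contribution of the whole factor.
print(anova(fit))
```

The reference level is the first factor level; relevel() can change it, for example relevel(PlantGrowth$group, ref = "trt1").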
Regression with Factor Variables – cont.
Dataset: http://www.rossmanchance.com/iscam2/data/movies03RT.txt
movies <- read.table("./movies.txt",sep="\t",header=TRUE,quote="")
head(movies)
Regression with Factor Variables – cont.
lm2 <- lm(movies$score ~ as.factor(movies$rating))
summary(lm2)