simple linear reg stat

Post on 26-Dec-2015

Chapter 10 Simple Linear Regression and Correlation

Linear Regression

Methods for studying the relationship of two or more quantitative variables

Examples:
• Predict salary from education and years of experience
• Predict sales from the amount of advertising expenditures
• Predict vocabulary size from the age and amount of education of parents

Variables:
• Response/outcome/dependent variable
• Predictor/explanatory/independent variable


Relationships between the response and predictor variables
• Functional or mathematical relation: $y = f(x)$ (deterministic)
• Structural or statistical relation: $y = f(x) + \text{error}$ (stochastic/probabilistic)

Goals:
1) What is a reasonable model? (a) $f$ (b) errors
2) When $f$ has unknown parameters, estimate the parameters
3) Predict $y$ at a new $x$


Simple Linear Regression (SLR)

Basic model:

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \dots, n$$

• $Y_i$: the response/dependent variable
• $x_i$: the predictor/explanatory/independent variable
• $y_i$: the observed value of $Y_i$
• $x_i$: treated as a fixed quantity (or conditioned upon)
• $\epsilon_i$: the random error, typically assumed to have $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and usually assumed normally distributed

Key assumptions (to be checked later):
• Linear relationship
• Independent (uncorrelated) errors
• Constant variance errors
• Normally distributed errors
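To make the model assumptions concrete, the following is a minimal sketch that simulates one data set from the SLR model with independent, constant-variance, normal errors. The parameter values and the names `beta0`, `beta1`, `sigma`, and `sim` are made up for illustration and have nothing to do with the crime rate data.

```r
# Simulate one data set from Y_i = beta0 + beta1 * x_i + eps_i.
# All numeric values below are hypothetical.
set.seed(1)
n     <- 50
beta0 <- 10      # hypothetical intercept
beta1 <- 2       # hypothetical slope
sigma <- 3       # hypothetical error standard deviation

x   <- seq(0, 20, length.out = n)       # predictor, treated as fixed
eps <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps          # responses generated by the model

sim <- data.frame(x = x, y = y)
head(sim)
```

Plotting `sim$y` against `sim$x` should show points scattered around a straight line with roughly constant vertical spread.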


The SLR model can also be written as

$$Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \ \sigma^2)$$


• The mean of $Y$ given $x$ (known as the conditional mean) is a linear function of $x$ given by $E(Y \mid x) = \beta_0 + \beta_1 x$
• $\beta_0$ is the conditional mean when $x = 0$
• If we replace $x$ by $x - \bar x$, then $\beta_0$ is interpreted as the conditional mean when $x = \bar x$
• $\beta_1$ is the slope, i.e. the change in the mean of $Y$ per unit change in $x$
• $\sigma^2$ is the variation of responses about the mean
• The relationship is described by the true regression line $E(Y \mid x) = \beta_0 + \beta_1 x$
• The model is called “linear” not because it is linear in $x$, but rather because it is linear in the parameters $\beta_0$ and $\beta_1$
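The reinterpretation of the intercept after centering the predictor can be checked numerically. The sketch below uses simulated, made-up data (not the crime rate data); `fit_raw` and `fit_centered` are illustrative names.

```r
# Effect of replacing x by x - mean(x) on the fitted intercept and slope.
# The data are simulated and purely illustrative.
set.seed(2)
x <- runif(40, min = 60, max = 90)
y <- 100 - 0.8 * x + rnorm(40, sd = 5)

fit_raw      <- lm(y ~ x)               # intercept: estimated mean at x = 0
fit_centered <- lm(y ~ I(x - mean(x)))  # intercept: estimated mean at x = mean(x)

coef(fit_raw)
coef(fit_centered)
mean(y)   # equals the intercept of the centered fit
```

The slope is the same in both fits; only the meaning (and value) of the intercept changes.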


Example: Crime Rate

A criminologist studying the relationship between level of education and crime rate in medium-sized U.S. counties collected the following data for a random sample of 84 counties; $x$ is the percentage of individuals in the county having at least a high-school diploma and $y$ is the crime rate (crimes reported per 100,000 residents) last year.

[Figure: Scatter Plot for the Crime Rate Data. x-axis: Percentage of having at least high school diplomas (60 to 90); y-axis: Crime Rate (per 100K residents) (2000 to 14000).]

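Fitting the SLR model - least squares (LS) estimation

Choose $\beta_0$, $\beta_1$ to minimize the sum of squared deviations (vertical distances) of all data points to the fitted line:

$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2, \qquad (\hat\beta_0, \hat\beta_1) \equiv \arg\min_{\beta_0, \beta_1} Q(\beta_0, \beta_1)$$

Taking first partial derivatives and setting them equal to zero yields the normal equations:

$$n \hat\beta_0 + \hat\beta_1 \sum x_i = \sum y_i$$
$$\hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 = \sum x_i y_i$$

which are equivalent to

$$\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0, \qquad \sum x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$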


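• Least squares estimators:

$$\hat\beta_0 = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad \hat\beta_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$

With

$$S_{xy} = \sum (x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i$$
$$S_{xx} = \sum (x_i - \bar x)^2 = \sum x_i^2 - \frac{1}{n} \left(\sum x_i\right)^2$$
$$S_{yy} = \sum (y_i - \bar y)^2 = \sum y_i^2 - \frac{1}{n} \left(\sum y_i\right)^2$$

the estimators simplify to

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$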
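The least squares estimators can be computed by hand and compared with `lm()`. The sketch below does this on simulated, made-up data: once via the standard closed forms $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, and once by solving the normal equations as a 2 x 2 linear system. The names `Sxy`, `Sxx`, `A`, `rhs`, and `b_normal` are illustrative.

```r
# Least squares by hand on simulated (hypothetical) data, checked against lm().
set.seed(3)
n <- 30
x <- runif(n, 60, 90)
y <- 20000 - 170 * x + rnorm(n, sd = 2000)

# Closed-form estimators
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1  <- Sxy / Sxx
b0  <- mean(y) - b1 * mean(x)

# Same estimates from the normal equations written as a 2 x 2 linear system
A   <- matrix(c(n,      sum(x),
                sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))
b_normal <- solve(A, rhs)

c(b0 = b0, b1 = b1)
b_normal
coef(lm(y ~ x))
```

All three sets of numbers should agree up to floating-point error.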


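• $\hat\beta_0$ and $\hat\beta_1$ are the best linear unbiased estimators of $\beta_0$ and $\beta_1$
• The fitted values: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$
• Residuals: $e_i = y_i - \hat y_i$
• Least squares (LS) line: $\hat y = \hat\beta_0 + \hat\beta_1 x$
• $(\bar x, \bar y)$ is the “centroid” of the scatter plot; the LS line passes through it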
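Two consequences of least squares fitting are easy to verify numerically: the residuals sum to zero (and are orthogonal to the predictor), and the fitted line passes through the centroid. The following sketch checks both on simulated, made-up data.

```r
# Fitted values, residuals and the centroid property (simulated data).
set.seed(4)
x <- runif(25, 60, 90)
y <- 20000 - 170 * x + rnorm(25, sd = 2000)
fit <- lm(y ~ x)

y_hat <- fitted(fit)      # fitted values
e     <- residuals(fit)   # residuals y - y_hat

sum(e)       # zero up to rounding error
sum(x * e)   # zero up to rounding error

# The LS line evaluated at mean(x) returns mean(y)
predict(fit, data.frame(x = mean(x)))
mean(y)
```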

[Figure: Scatter Plot for the Crime Rate Data with the fitted LS line $\hat y = 20517.6 - 170.58 x$ and the centroid $(\bar x, \bar y)$ marked. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents). A second panel plots the residuals against the observation index.]

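Goodness of fit of the LS line

Residuals: $e_i = y_i - \hat y_i$

Error sum of squares (SSE): $\mathrm{SSE} = \sum e_i^2 = \sum (y_i - \hat y_i)^2$

Compare with the SSE for the simplest model, $Y_i = \beta_0 + \epsilon_i$, for which $\hat\beta_0 = \bar y$ and $\mathrm{SSE} = \sum (y_i - \bar y)^2$, referred to as the (corrected) total sum of squares (SST), which measures the variability of the $y_i$ around their mean $\bar y$

Then SST can be decomposed as

$$\sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum (y_i - \hat y_i)^2$$

SST = SSR + SSE

SSR: the regression sum of squares, which measures the variation in $y$ that is accounted for by regression on $x$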
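The decomposition SST = SSR + SSE can be confirmed numerically. The sketch below computes the three sums of squares directly from their definitions on simulated, made-up data.

```r
# Check SST = SSR + SSE on simulated (hypothetical) data.
set.seed(5)
x <- runif(40, 60, 90)
y <- 20000 - 170 * x + rnorm(40, sd = 2000)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)             # total variability around the mean
SSR <- sum((fitted(fit) - mean(y))^2)   # variability accounted for by the regression
SSE <- sum(residuals(fit)^2)            # leftover (error) variability

c(SST = SST, SSR = SSR, SSE = SSE, SSR_plus_SSE = SSR + SSE)
```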


The coefficient of determination:

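$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad 0 \le R^2 \le 1$$

which represents the proportion of variation in $y$ that is accounted for by regression on $x$.

Relationship to the sample correlation coefficient $r$:

$$r^2 = R^2$$

The sign of $r$ is the same as the sign of $\hat\beta_1$.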
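In simple linear regression the coefficient of determination equals the squared sample correlation, and the correlation has the same sign as the fitted slope. The following sketch checks both on simulated, made-up data.

```r
# R^2 versus the sample correlation coefficient (simulated data).
set.seed(6)
x <- runif(40, 60, 90)
y <- 20000 - 170 * x + rnorm(40, sd = 2000)
fit <- lm(y ~ x)

SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
R2  <- 1 - SSE / SST      # coefficient of determination from its definition
r   <- cor(x, y)          # sample correlation coefficient

R2
r^2
summary(fit)$r.squared
sign(r) == sign(coef(fit)["x"])   # correlation and slope share the same sign
```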


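Estimation of $\sigma^2$

A common unbiased estimator of $\sigma^2$ is given by

$$s^2 = \frac{\sum e_i^2}{n-2} = \frac{\mathrm{SSE}}{n-2} = \mathrm{MSE}$$

MSE: mean square error
• The d.f. for $s^2$ is $n-2$ since 2 unknown parameters, $\beta_0$ and $\beta_1$, are estimated from the data of size $n$.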
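Why the divisor is $n-2$ rather than $n$ can be illustrated with a small Monte Carlo experiment. This is only a sketch under made-up settings (true line, `sigma2`, `n`, and `reps` are all hypothetical): it repeatedly simulates data from an SLR model and averages the two candidate estimators.

```r
# Monte Carlo illustration: SSE/(n-2) is unbiased for sigma^2, SSE/n is not.
set.seed(7)
n      <- 10
x      <- 1:n
sigma2 <- 4       # true error variance (hypothetical)
reps   <- 2000

sse <- replicate(reps, {
  y <- 1 + 0.5 * x + rnorm(n, sd = sqrt(sigma2))
  deviance(lm(y ~ x))   # SSE of the fitted line
})

mean(sse / (n - 2))   # averages close to sigma2
mean(sse / n)         # systematically smaller than sigma2
```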

Crime rate example continued:

Obtain the point estimates of the following:
(1) The difference in the mean crime rate for two counties whose high-school graduation rates differ by one percentage point;
(2) The mean crime rate last year in counties with high-school graduation percentage $x = 80$;
(3) The random error standard deviation $\sigma$.


# read in the data set
> crime=read.table("crimerate.txt",header=FALSE)
> names(crime)=c("rate","percentage")

# scatter plot
> plot(crime$percentage,crime$rate,main="Scatter Plot for the Crime Rate Data",
+      xlab="Percentage of having at least high school diplomas",
+      ylab="Crime Rate (per 100K residents)",type="p",pch=16)

# fitting a SLR model using least squares
> g1=lm(rate~percentage,data=crime)

# adding the fitted regression line to the scatter plot
> abline(g1,col="red",lwd=2)

[Figure: Scatter Plot for the Crime Rate Data with the fitted regression line drawn in red. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]

# LS estimation results
> summary(g1)

Call:
lm(formula = rate ~ percentage, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-5278.3 -1757.5  -210.5  1575.3  6803.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared:  0.1703,  Adjusted R-squared:  0.1602
F-statistic: 16.83 on 1 and 82 DF,  p-value: 9.571e-05


> summary(g1)$coeff
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 20517.5999 3277.64269  6.259865 1.672906e-08
percentage   -170.5752   41.57433 -4.102897 9.571396e-05

> predict(g1,data.frame(percentage=80),se=TRUE)
$fit
       1
6871.585

$se.fit
[1] 263.6425

$df
[1] 82

$residual.scale
[1] 2356.292

> deviance(g1) # SSE
[1] 455273165
> df.residual(g1) # df for SSE
[1] 82
> sqrt(deviance(g1)/df.residual(g1)) # estimate for sigma
[1] 2356.292
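The three point estimates requested in the exercise can be reproduced by plain arithmetic from the numbers printed above. The sketch below hard-codes those printed values (so it runs without `crimerate.txt`); item (3) is read here as the error standard deviation $\sigma$, matching the `# estimate for sigma` line in the output.

```r
# Point estimates for the exercise, from the printed coefficients and SSE.
b0  <- 20517.5999   # intercept estimate from summary(g1)$coeff
b1  <- -170.5752    # slope estimate from summary(g1)$coeff
SSE <- 455273165    # error sum of squares from deviance(g1)
n   <- 84           # number of counties

diff_per_point <- b1                    # (1) change in mean crime rate per percentage point
mean_at_80     <- b0 + b1 * 80          # (2) estimated mean crime rate at x = 80
sigma_hat      <- sqrt(SSE / (n - 2))   # (3) estimate of sigma

c(diff_per_point = diff_per_point, mean_at_80 = mean_at_80, sigma_hat = sigma_hat)
```

Up to rounding of the printed coefficients, `mean_at_80` agrees with the `$fit` value and `sigma_hat` with the `$residual.scale` value returned by `predict()`.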


> residuals(g1)
          1           2           3           4           5           6           7           8
  591.96401  1648.56552  1660.99033  1518.99033   568.44147  -159.63749 -2357.48712  -828.00967
          9          10          11          12          13          14          15          16
   97.96401  1401.56552 -1233.46080   285.56552  2426.26477 -1594.28410 -1493.43448 -2615.16004
…
         81          82          83          84
-1363.25778  2533.01666   621.14071    28.11439

> summary(g1)$residuals # same as residuals(g1)
> sum(residuals(g1)^2) # SSE
[1] 455273165

> plot(residuals(g1),pch=16,main="Scatter Plot of Residuals",ylab="Residuals",xlab="")
> abline(h=0,lty=2)

[Figure: Scatter Plot of Residuals. x-axis: observation index (0 to 80); y-axis: Residuals (-4000 to 6000), with a dashed horizontal line at 0.]

> fitted.values(g1)
       1        2        3        4        5        6        7        8        9
7895.036 6530.434 6701.010 6701.010 5677.559 9259.637 8918.487 6701.010 7895.036
      10       11       12       13       14       15       16       17       18
6530.434 7724.461 6530.434 7212.735 6189.284 6530.434 7042.160 7212.735 8065.611
…
      82       83       84
5506.983 6359.859 7553.886
> plot(crime$percentage,fitted.values(g1),pch=16,xlab="Percentage",ylab="Fitted Values")
> abline(g1,lty=2)
> plot(fitted.values(g1),residuals(g1),main="",ylab="Residuals",xlab=expression(hat(y)),pch=16)
> plot(crime$percentage,residuals(g1),main="",ylab="Residuals",xlab="Percentage",pch=16)

[Figure, three panels: (1) Fitted Values against Percentage, with the LS line dashed; (2) Residuals against fitted values $\hat y$; (3) Residuals against Percentage.]

Statistical Inference for Simple Linear Regression

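Inference on $\beta_0$ and $\beta_1$

Under the SLR model with normally distributed errors, and with $S_{xx} = \sum (x_i - \bar x)^2$:

$$\hat\beta_0 \sim N\left(\beta_0, \ \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right)$$

$$\hat\beta_1 \sim N\left(\beta_1, \ \frac{\sigma^2}{S_{xx}}\right)$$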
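A standard result for the SLR model with fixed $x$ is that the LS slope is unbiased, with variance $\sigma^2 / S_{xx}$ where $S_{xx} = \sum (x_i - \bar x)^2$. The following Monte Carlo sketch illustrates this under made-up settings (the true parameters, `x`, and `reps` are all hypothetical).

```r
# Monte Carlo look at the sampling distribution of the LS slope.
set.seed(8)
x     <- 1:20
beta0 <- 5      # hypothetical true intercept
beta1 <- 2      # hypothetical true slope
sigma <- 2      # hypothetical true error standard deviation
Sxx   <- sum((x - mean(x))^2)
reps  <- 2000

b1_hat <- replicate(reps, {
  y <- beta0 + beta1 * x + rnorm(length(x), sd = sigma)
  sum((x - mean(x)) * (y - mean(y))) / Sxx   # LS slope for this sample
})

mean(b1_hat)    # close to beta1
var(b1_hat)     # close to sigma^2 / Sxx
sigma^2 / Sxx
```

A histogram of `b1_hat` should look approximately normal and centered at `beta1`.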


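$$\frac{\hat\beta_0 - \beta_0}{SD(\hat\beta_0)} \sim N(0,1) \quad \text{and} \quad \frac{\hat\beta_1 - \beta_1}{SD(\hat\beta_1)} \sim N(0,1)$$

where $SD(\cdot)$ denotes the true standard deviation of the estimator.

$$\frac{(n-2) S^2}{\sigma^2} = \frac{\mathrm{SSE}}{\sigma^2} \sim \chi^2_{n-2}$$

$S^2 = \mathrm{SSE}/(n-2)$ is distributed independently of $(\hat\beta_0, \hat\beta_1)$

Replacing $\sigma$ by $s$ gives the estimated standard errors, with $S_{xx} = \sum (x_i - \bar x)^2$:

$$SE(\hat\beta_0) = s \sqrt{\frac{\sum x_i^2}{n S_{xx}}} \quad \text{and} \quad SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$$

$$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2} \quad \text{and} \quad \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$$

$100(1-\alpha)\%$ CI’s on $\beta_0$ and $\beta_1$ are given by

$$\hat\beta_0 \pm t_{n-2, \alpha/2} \, SE(\hat\beta_0)$$

$$\hat\beta_1 \pm t_{n-2, \alpha/2} \, SE(\hat\beta_1)$$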


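Hypothesis tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \neq \beta_1^0$

Use the t-test:

$$t_0 = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true}$$

Reject $H_0$ at level $\alpha$ if $|t_0| > t_{n-2, \alpha/2}$, or equivalently if p-value $= 2 P(T_{n-2} \ge |t_0|) < \alpha$

In particular, for testing whether there is a linear relationship,

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$

Reject $H_0$ at level $\alpha$ if $|t_0| = |\hat\beta_1| / SE(\hat\beta_1) > t_{n-2, \alpha/2}$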


Crime Rate Example continued:

(1) Test linear relationship at $\alpha = 0.05$

> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared:  0.1703,  Adjusted R-squared:  0.1602
F-statistic: 16.83 on 1 and 82 DF,  p-value: 9.571e-05
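The t value and p-value in the `percentage` row can be reproduced from the slope estimate and its standard error alone. The sketch below hard-codes the more precise values printed by `summary(g1)$coeff` earlier in the transcript, so it runs without the data file.

```r
# Two-sided t-test of H0: beta1 = 0 from the reported estimate and standard error.
b1    <- -170.5752   # estimated slope
se_b1 <- 41.57433    # its standard error
n     <- 84          # number of counties
alpha <- 0.05

t0      <- (b1 - 0) / se_b1                                  # test statistic
t_crit  <- qt(1 - alpha / 2, df = n - 2)                     # critical value t_{n-2, alpha/2}
p_value <- 2 * pt(abs(t0), df = n - 2, lower.tail = FALSE)   # two-sided p-value

t0
t_crit
p_value
abs(t0) > t_crit   # TRUE means H0 is rejected at level alpha
```

Since $|t_0|$ exceeds the critical value (equivalently, the p-value is below 0.05), the test concludes that there is a linear relationship between crime rate and graduation percentage.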


(2) Calculate a 95% CI on the change in the mean crime rate for every one percentage point increase in high-school graduation rate.

> confint(g1)
                 2.5 %      97.5 %
(Intercept) 13997.3245 27037.87538
percentage   -253.2798   -87.87061

> # we can specify a particular parameter
> # as well as change confidence level
> confint(g1,"percentage",level=0.9)
                 5 %      95 %
percentage -239.7403 -101.4101
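The intervals returned by `confint()` follow the form estimate ± t quantile × standard error. As a sketch, they can be recomputed from the slope estimate and standard error printed by `summary(g1)$coeff` (hard-coded below), with 82 residual degrees of freedom.

```r
# Confidence intervals for the slope from the reported estimate and standard error.
b1    <- -170.5752   # estimated slope
se_b1 <- 41.57433    # its standard error
df    <- 82          # residual degrees of freedom, n - 2

ci95 <- b1 + c(-1, 1) * qt(0.975, df) * se_b1   # 95% CI
ci90 <- b1 + c(-1, 1) * qt(0.95,  df) * se_b1   # 90% CI

ci95
ci90
```

Both intervals lie entirely below zero, which is consistent with rejecting $H_0: \beta_1 = 0$ at the corresponding levels.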


Analysis of Variance (ANOVA) for SLR

ANOVA is a statistical technique to decompose the total variability in the $y_i$’s into separate variance components associated with specific sources

Decomposition of the variability and degrees of freedom (d.f.)

$$\sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum (y_i - \hat y_i)^2$$

SST = SSR + SSE
d.f.: $n-1 = 1 + (n-2)$

A mean square is defined as a sum of squares divided by its d.f.
Mean square regression: $\mathrm{MSR} = \mathrm{SSR}/1$
Mean square error: $\mathrm{MSE} = \mathrm{SSE}/(n-2)$


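Since

$$F_0 = \frac{\mathrm{MSR}}{\mathrm{MSE}} \sim F_{1, n-2} \quad \text{when } H_0: \beta_1 = 0 \text{ is true},$$

we can test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at level $\alpha$ by rejecting $H_0$ if $F_0 > f_{1, n-2, \alpha}$ (equivalent to $|t_0| > t_{n-2, \alpha/2}$)

Analysis of variance (ANOVA) table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F statistic
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 SST              n-1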

Crime Rate Example continued:
- Test the significance of the linear relationship between the crime rate and the high-school graduation rate at $\alpha = 0.05$

> anova(g1)
Analysis of Variance Table

Response: rate
           Df    Sum Sq  Mean Sq F value    Pr(>F)
percentage  1  93462942 93462942  16.834 9.571e-05 ***
Residuals  82 455273165  5552112
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
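Every entry of the ANOVA table follows from the two sums of squares. The sketch below hard-codes the SSR and SSE values printed by `anova(g1)` and rebuilds the mean squares, the F statistic, its p-value, and $R^2$; it also compares $F_0$ with the square of the t value reported by `summary(g1)$coeff`.

```r
# Rebuild the ANOVA quantities from the printed sums of squares.
SSR <- 93462942    # regression sum of squares
SSE <- 455273165   # error sum of squares
n   <- 84          # number of counties

MSR <- SSR / 1          # mean square regression
MSE <- SSE / (n - 2)    # mean square error
F0  <- MSR / MSE        # F statistic
p_value <- pf(F0, df1 = 1, df2 = n - 2, lower.tail = FALSE)

t0 <- -4.102897         # t value for the slope, as printed earlier
R2 <- SSR / (SSR + SSE) # coefficient of determination, SSR / SST

c(F0 = F0, t0_squared = t0^2, p_value = p_value, R2 = R2)
```

In simple linear regression the F test and the two-sided t-test of $\beta_1 = 0$ are the same test, which is why $F_0 = t_0^2$ and the two p-values coincide.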


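Prediction of Future Observations

To predict the value of a future response $Y^*$ at a specified value $x^*$

Use a confidence interval to estimate the fixed unknown mean of $Y^*$, denoted by $\mu^* = E(Y^*) = \beta_0 + \beta_1 x^*$

$$\hat\mu^* = \hat\beta_0 + \hat\beta_1 x^*$$

$$\hat\mu^* \pm t_{n-2, \alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$$

Use a prediction interval to predict the value of the r.v. $Y^*$

$$\hat Y^* - Y^* \sim N\left(0, \ \sigma^2 \left[1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}\right]\right)$$

$$\hat Y^* \pm t_{n-2, \alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$$

where $\hat Y^* = \hat\beta_0 + \hat\beta_1 x^*$ and $S_{xx} = \sum (x_i - \bar x)^2$.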
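The standard interval formulas for the mean response and for a future observation differ only by the extra "1 +" under the square root. The following sketch evaluates both by hand on simulated, made-up data and compares them with `predict()`; the names `se_mean`, `se_pred`, `ci`, and `pred_int` are illustrative.

```r
# Confidence and prediction intervals at x_star, by hand versus predict().
set.seed(9)
n <- 30
x <- runif(n, 60, 90)
y <- 20000 - 170 * x + rnorm(n, sd = 2000)
fit <- lm(y ~ x)

x_star <- 80
alpha  <- 0.05
s      <- summary(fit)$sigma               # estimate of sigma
Sxx    <- sum((x - mean(x))^2)
t_crit <- qt(1 - alpha / 2, df = n - 2)
y_star <- sum(coef(fit) * c(1, x_star))    # point estimate at x_star

se_mean <- s * sqrt(1 / n + (x_star - mean(x))^2 / Sxx)       # for the mean response
se_pred <- s * sqrt(1 + 1 / n + (x_star - mean(x))^2 / Sxx)   # for a new observation

ci       <- y_star + c(-1, 1) * t_crit * se_mean
pred_int <- y_star + c(-1, 1) * t_crit * se_pred

ci
predict(fit, data.frame(x = x_star), interval = "confidence")
pred_int
predict(fit, data.frame(x = x_star), interval = "prediction")
```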


Crime rate example continued:
(a) Calculate a 95% CI for the average crime rate in counties with an 80% high-school graduation rate;
(b) Calculate a 95% PI for the crime rate of a future selected county with an 80% high-school graduation rate.

> predict(g1,data.frame(percentage=80), interval="confidence")
       fit      lwr      upr
1 6871.585 6347.116 7396.054

> predict(g1,data.frame(percentage=80), interval="prediction")
       fit     lwr      upr
1 6871.585 2154.92 11588.25
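Both intervals can be rebuilt from quantities already reported in the transcript: the fitted value and its standard error (`$fit` and `$se.fit` from `predict(..., se=TRUE)`) and the residual standard error. The sketch below hard-codes those printed numbers, so it does not need the data file.

```r
# Rebuild the 95% CI and PI at percentage = 80 from reported quantities.
fit_80 <- 6871.585   # fitted value at percentage = 80
se_fit <- 263.6425   # standard error of the fitted mean
s      <- 2356.292   # residual standard error
df     <- 82         # residual degrees of freedom
t_crit <- qt(0.975, df)

# CI for the mean response uses se_fit alone
ci <- fit_80 + c(-1, 1) * t_crit * se_fit

# PI for a new county adds the error variance s^2 to se_fit^2
pred_int <- fit_80 + c(-1, 1) * t_crit * sqrt(s^2 + se_fit^2)

ci
pred_int
```

The prediction interval is roughly nine times wider than the confidence interval here, because the variability of a single county's crime rate dominates the uncertainty in the estimated mean.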


> grid=seq(60,90,1)
> conf=predict(g1,data.frame(percentage=grid),interval="confidence")
> pred=predict(g1,data.frame(percentage=grid),interval="prediction")
> matplot(grid,pred,lty=c(1,2,2),col=c("red","green","green"),type="l",lwd=2,main="CI vs PI",
+         xlab="Percentage of having at least high school diplomas",
+         ylab="Crime Rate (per 100K residents)")
> matplot(grid,conf[,2:3],lty=c(2,2),col=c("blue","blue"),type="l",add=TRUE,lwd=2)

[Figure: CI vs PI. Fitted line (red, solid) with prediction limits (green, dashed) and confidence limits (blue, dashed). x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]

Both the CI and the PI are narrowest when $x^* = \bar x$.

Predicting beyond the range of the observed data (extrapolation) is risky and should generally be avoided.
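That the intervals are narrowest at the mean of the predictor can be checked by computing interval widths over a grid. The sketch below uses simulated, made-up data and the same `predict()` calls as the plot above; `ci_width` and `pi_width` are illustrative names.

```r
# Interval widths over a grid of x values (simulated data).
set.seed(10)
x <- runif(40, 60, 90)
y <- 20000 - 170 * x + rnorm(40, sd = 2000)
fit <- lm(y ~ x)

grid <- seq(60, 90, by = 1)
new  <- data.frame(x = grid)
conf <- predict(fit, new, interval = "confidence")
pred <- predict(fit, new, interval = "prediction")

ci_width <- conf[, "upr"] - conf[, "lwr"]
pi_width <- pred[, "upr"] - pred[, "lwr"]

grid[which.min(ci_width)]   # grid point with the narrowest CI
grid[which.min(pi_width)]   # grid point with the narrowest PI
mean(x)                     # both are the grid point closest to this value
```

The widths grow as the grid point moves away from `mean(x)`, which is one reason extrapolation is unreliable even when the linear model holds.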
