simple linear reg stat

Chapter 10 Simple Linear Regression and Correlation

Linear Regression

Methods for studying the relationship of two or more quantitative variables

Example:
• Predict salary from education and years of experience
• Predict sales from the amount of advertising expenditures
• Predict vocabulary size from the age and amount of education of parents

Variables:
• Response/outcome/dependent variable
• Predictor/explanatory/independent variable

Relationships between the response and predictor variables
• Functional or mathematical relation: $y = f(x)$
  – deterministic
• Structural or statistical relation: $y = f(x) + \text{error}$
  – stochastic/probabilistic

Goals:
1) What is a reasonable model?
   (a) for $f$ (b) for the errors
2) When $f$ has unknown parameters, estimate the parameters
3) Predict $y$ at a new $x$

Simple Linear Regression (SLR)

Basic model: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \dots, n$

• $Y$: the response/dependent variable
• $x$: the predictor/explanatory/independent variable
• $y_i$: the observed value of $Y_i$
• $x_i$: treated as a fixed quantity (or conditioned upon)
• $\epsilon_i$: the random error, typically assumed to have $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and usually assumed normally distributed

Key assumptions (to be checked later):
• Linear relationship
• Independent (uncorrelated) errors
• Constant variance errors
• Normally distributed errors

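The SLR model can also be written as $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, independently for $i = 1, \dots, n$

• The mean of $Y$ given $x$ (known as the conditional mean) is a linear function of $x$ given by $E(Y|x) = \beta_0 + \beta_1 x$
• $\beta_0$ is the conditional mean when $x = 0$
• If we replace $x$ by $x - \bar{x}$, then $\beta_0$ is interpreted as the conditional mean when $x = \bar{x}$
• $\beta_1$ is the slope, i.e. the change in the mean of $Y$ per unit change in $x$
• $\sigma^2$ is the variation of responses about the mean $E(Y|x)$
• The relationship is described by the true regression line $E(Y|x) = \beta_0 + \beta_1 x$
• The model is called “linear” not because it is linear in $x$, but rather because it is linear in the parameters $\beta_0$ and $\beta_1$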
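To make the model concrete, the following minimal sketch simulates data from an SLR model. The parameter values (`beta0`, `beta1`, `sigma`), the sample size and the grid of `x` values are illustrative assumptions, not taken from the crime rate data.

```r
# Simulate n observations from Y_i = beta0 + beta1 * x_i + eps_i
# (all numeric values below are hypothetical, chosen only for illustration)
set.seed(1)
n     <- 50
beta0 <- 2      # hypothetical intercept
beta1 <- 0.5    # hypothetical slope
sigma <- 1      # hypothetical error standard deviation

x   <- seq(0, 10, length.out = n)       # predictor, treated as fixed
eps <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps          # observed responses

cond_mean <- beta0 + beta1 * x          # true regression line E(Y | x)
```

Plotting `y` against `x` and overlaying `cond_mean` shows the responses scattering around the true regression line with roughly constant spread.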

Example: Crime Rate
A criminologist studying the relationship between level of education and crime rate in medium-sized U.S. counties collected the following data for a random sample of 84 counties; $x$ is the percentage of individuals in the county having at least a high-school diploma and $y$ is the crime rate (crimes reported per 100,000 residents) last year.

[Figure: Scatter Plot for the Crime Rate Data. x-axis: Percentage of having at least high school diplomas]

Fitting the SLR model - least squares (LS) estimation
Choose $\hat\beta_0$, $\hat\beta_1$ to minimize the sum of squared deviations (vertical distances) of all data points to the fitted line:
$Q(\beta_0, \beta_1) = \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i)]^2$

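Taking first partial derivatives and setting them equal to zero yields the normal equations:

$n\beta_0 + \beta_1 \sum x_i = \sum y_i$
$\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i$

which are equivalent to
$\sum (y_i - \beta_0 - \beta_1 x_i) = 0$ and $\sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0$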
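The differentiation step can be written out as follows. This is a sketch of the standard derivation, with $Q$ denoting the sum of squared deviations being minimized.

```latex
\begin{align*}
Q(\beta_0,\beta_1) &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial Q}{\partial \beta_0} &= -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; n\beta_0 + \beta_1 \sum x_i = \sum y_i \\
\frac{\partial Q}{\partial \beta_1} &= -2\sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i
\end{align*}
```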

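• Least squares estimators:

$\hat\beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2}$

$\hat\beta_1 = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$

Equivalently, with
$S_{xy} = \sum (x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)$
$S_{xx} = \sum (x_i - \bar x)^2 = \sum x_i^2 - \frac{1}{n}(\sum x_i)^2$
the estimators are $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$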

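• $\hat\beta_0$ and $\hat\beta_1$ are the best linear unbiased estimates of $\beta_0$ and $\beta_1$
• The fitted values: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$
• Residuals: $e_i = y_i - \hat y_i$
• Least squares (LS) line: $\hat y = \hat\beta_0 + \hat\beta_1 x$
• $(\bar x, \bar y)$ is the “centroid” of the scatter plot, and the LS line passes through it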
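As a rough check of the closed-form estimators, the sketch below computes the slope as $S_{xy}/S_{xx}$ and the intercept as $\bar y - \hat\beta_1 \bar x$ by hand and compares them with `lm()`. The data are simulated with made-up values (they are not the crime rate data), and the variable names are illustrative.

```r
# Hypothetical data, loosely on the scale of the crime rate example
set.seed(2)
x <- runif(30, 60, 90)
y <- 20000 - 170 * x + rnorm(30, sd = 2000)
n <- length(x)

# Closed-form least squares estimates
Sxy <- sum(x * y) - sum(x) * sum(y) / n   # sum of (x_i - xbar)(y_i - ybar)
Sxx <- sum(x^2) - sum(x)^2 / n            # sum of (x_i - xbar)^2
b1  <- Sxy / Sxx                          # slope estimate
b0  <- mean(y) - b1 * mean(x)             # intercept estimate

# Same fit via lm() for comparison
fit <- lm(y ~ x)

# Fitted values and residuals from the hand-computed line
y_hat <- b0 + b1 * x
e     <- y - y_hat
```

With these definitions, `coef(fit)` agrees with `c(b0, b1)`, and `sum(e)` and `sum(x * e)` are zero up to rounding, which is exactly what the normal equations require.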

[Figure: scatter plot of the crime rate data with the fitted LS line $\hat y = 20517.6 - 170.58x$]

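Goodness of fit of the LS line
Residuals: $e_i = y_i - \hat y_i$, $i = 1, \dots, n$

Error sum of squares (SSE): $\mathrm{SSE} = \sum e_i^2 = \sum (y_i - \hat y_i)^2$
Compare with the SSE for the simplest model: $Y_i = \beta_0 + \epsilon_i$, for which $\hat\beta_0 = \bar y$, and $\mathrm{SSE} = \sum (y_i - \bar y)^2$, referred to as the (corrected) total sum of squares (SST), which measures the variability of the $y_i$ around their mean $\bar y$

Then SST can be decomposed as
$\sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum (y_i - \hat y_i)^2$
SST = SSR + SSE
SSR: the regression sum of squares, which measures the variation in $y$ that is accounted for by regression on $x$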

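The coefficient of determination:

$r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$, $0 \le r^2 \le 1$

which represents the proportion of variation in $y$ that is accounted for by regression on $x$.

Relationship to the sample correlation coefficient $r$: $r = \pm\sqrt{r^2}$

The sign of $r$ is the same as the sign of $\hat\beta_1$.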
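For the crime rate data, the coefficient of determination can be recomputed from the sums of squares that R reports for the fitted model (`anova(g1)` and `deviance(g1)`, shown elsewhere in these notes). The sketch below only redoes that arithmetic; the variable names are illustrative.

```r
# Sums of squares copied from the R output for the crime rate fit
ssr <- 93462942     # regression sum of squares
sse <- 455273165    # error sum of squares
sst <- ssr + sse    # total sum of squares

r2 <- ssr / sst     # coefficient of determination, about 0.1703

# The sample correlation takes the sign of the estimated slope
b1 <- -170.5752     # slope estimate from summary(g1)
r  <- sign(b1) * sqrt(r2)
```

The value of `r2` matches the "Multiple R-squared" line of `summary(g1)`: only about 17% of the variation in crime rate is accounted for by the regression on graduation percentage.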

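Estimation of $\sigma^2$
A common unbiased estimator of $\sigma^2$ is given by

$s^2 = \frac{\sum e_i^2}{n-2} = \frac{\mathrm{SSE}}{n-2} = \mathrm{MSE}$

MSE: Mean square error
• The d.f. for $s^2$ is $n - 2$ since 2 unknown parameters, $\beta_0$ and $\beta_1$, are estimated from the data of size $n$.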

Crime rate example continued:
Obtain the point estimates of the following:
(1) The difference in the mean crime rate for two counties whose high-school graduation rates differ by one percentage point;
(2) The mean crime rate last year in counties with high school graduation percentage X=80;
(3) The random error .

# read in the data set
> crime=read.table("crimerate.txt",header=FALSE)
> names(crime)=c("rate","percentage")

# scatter plot
> plot(crime$percentage,crime$rate,main="Scatter Plot for the Crime Rate Data",
+      xlab="Percentage of having at least high school diplomas",
+      ylab="Crime Rate (per 100K residents)",type="p",pch=16)

# fitting a SLR model using least squares
> g1=lm(rate~percentage,data=crime)

# adding the fitted regression line to the scatter plot
> abline(g1,col="red",lwd=2)


# LS estimation results
> summary(g1)

Call:
lm(formula = rate ~ percentage, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-5278.3 -1757.5  -210.5  1575.3  6803.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared:  0.1703, Adjusted R-squared:  0.1602
F-statistic: 16.83 on 1 and 82 DF,  p-value: 9.571e-05

> summary(g1)$coeff
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 20517.5999 3277.64269  6.259865 1.672906e-08
percentage   -170.5752   41.57433 -4.102897 9.571396e-05

> predict(g1,data.frame(percentage=80),se=TRUE)
$fit
       1
6871.585

$se.fit
[1] 263.6425

$df
[1] 82

$residual.scale
[1] 2356.292

> deviance(g1) # SSE
[1] 455273165
> df.residual(g1) # df for SSE
[1] 82
> sqrt(deviance(g1)/df.residual(g1)) # estimate for sigma
[1] 2356.292

> residuals(g1)
          1           2           3           4           5           6           7           8
  591.96401  1648.56552  1660.99033  1518.99033   568.44147  -159.63749 -2357.48712  -828.00967
          9          10          11          12          13          14          15          16
   97.96401  1401.56552 -1233.46080   285.56552  2426.26477 -1594.28410 -1493.43448 -2615.16004
…
         81          82          83          84
-1363.25778  2533.01666   621.14071    28.11439

> summary(g1)$residuals # same as residuals(g1)
> sum(residuals(g1)^2) # SSE
[1] 455273165

> plot(residuals(g1),pch=16,main="Scatter Plot of Residuals",ylab="Residuals",xlab="")
> abline(h=0,lty=2)

[Figure: Scatter Plot of Residuals, residuals plotted against observation index]

> fitted.values(g1)
       1        2        3        4        5        6        7        8        9
7895.036 6530.434 6701.010 6701.010 5677.559 9259.637 8918.487 6701.010 7895.036
      10       11       12       13       14       15       16       17       18
6530.434 7724.461 6530.434 7212.735 6189.284 6530.434 7042.160 7212.735 8065.611
…
      82       83       84
5506.983 6359.859 7553.886
> plot(crime$percentage,fitted.values(g1),pch=16,xlab="Percentage",ylab="Fitted Values")
> abline(g1,lty=2)
> plot(fitted.values(g1),residuals(g1),main="",ylab="Residuals",xlab=expression(hat(y)),pch=16)
> plot(crime$percentage,residuals(g1),main="",ylab="Residuals",xlab="Percentage",pch=16)

[Figures: fitted values vs. Percentage; residuals vs. fitted values; residuals vs. Percentage]
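The point estimates asked for in the crime rate example can be read off the R output. As a check on the arithmetic, using the coefficients printed by `summary(g1)$coeff` and the SSE from `deviance(g1)` (the small difference from the printed fit 6871.585 comes from rounding the coefficients):

```latex
\begin{align*}
\text{(1)}\quad & \hat\beta_1 = -170.5752
  \quad\text{(estimated change in mean crime rate per one percentage point)} \\
\text{(2)}\quad & \hat y = \hat\beta_0 + \hat\beta_1 (80)
  = 20517.5999 - 170.5752 \times 80 \approx 6871.58 \\
& s = \sqrt{\mathrm{SSE}/(n-2)} = \sqrt{455273165/82} \approx 2356.29
  \quad\text{(estimate of } \sigma\text{)}
\end{align*}
```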

Statistical Inference for Simple Linear Regression

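Inference on $\beta_0$ and $\beta_1$

$E(\hat\beta_0) = \beta_0$, $SD(\hat\beta_0) = \sigma\sqrt{\frac{\sum x_i^2}{n S_{xx}}}$; $E(\hat\beta_1) = \beta_1$, $SD(\hat\beta_1) = \frac{\sigma}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x_i - \bar x)^2$

$\frac{\hat\beta_0 - \beta_0}{SD(\hat\beta_0)} \sim N(0,1)$ and $\frac{\hat\beta_1 - \beta_1}{SD(\hat\beta_1)} \sim N(0,1)$

$\frac{(n-2)S^2}{\sigma^2} \sim \chi^2_{n-2}$, and $S^2$ is distributed independently of $\hat\beta_0$ and $\hat\beta_1$

Estimated standard errors: $SE(\hat\beta_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}$ and $SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$

$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}$ and $\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$

$100(1-\alpha)\%$ CI’s on $\beta_0$ and $\beta_1$ are given by
$\hat\beta_0 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_0)$ and $\hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1)$

Hypotheses tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \ne \beta_1^0$

Use the t-test:

$t = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)} \sim t_{n-2}$ when $H_0$ is true

Reject $H_0$ at level $\alpha$ if $|t| > t_{n-2,\alpha/2}$

or p-value $= 2P(T_{n-2} > |t|) < \alpha$, where $T_{n-2}$ has a $t$ distribution with $n-2$ d.f.

Particularly, for testing if there is a linear relationship,
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$

Reject $H_0$ at level $\alpha$ if $|t| = \frac{|\hat\beta_1|}{SE(\hat\beta_1)} > t_{n-2,\alpha/2}$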

Crime Rate Example continued:
(1) Test linear relationship at $\alpha = 0.05$

> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared:  0.1703, Adjusted R-squared:  0.1602
F-statistic: 16.83 on 1 and 82 DF,  p-value: 9.571e-05
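The t-test for the slope can be reproduced by hand from the estimate and standard error in the coefficient table. This is a minimal sketch of that calculation; the numbers are copied from the R output, and the variable names are illustrative.

```r
# Slope estimate and its standard error, from summary(g1)$coeff
b1    <- -170.5752
se_b1 <- 41.57433
df    <- 82         # n - 2 with n = 84 counties
alpha <- 0.05

# Test H0: beta1 = 0 vs. H1: beta1 != 0
t_stat  <- (b1 - 0) / se_b1
t_crit  <- qt(1 - alpha / 2, df)                        # two-sided critical value
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value

reject <- abs(t_stat) > t_crit
```

Here `t_stat` and `p_value` agree with the `t value` and `Pr(>|t|)` columns for `percentage`, and `reject` is `TRUE`, so the linear relationship is significant at the 0.05 level.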

(2) Calculate a 95% CI on the change in the mean crime rate for every one percentage point increase in high-school graduation rate.
> confint(g1)
                 2.5 %      97.5 %
(Intercept) 13997.3245 27037.87538
percentage   -253.2798   -87.87061

> # we can specify a particular parameter
> # as well as change confidence level
> confint(g1,"percentage",level=0.9)
                 5 %      95 %
percentage -239.7403 -101.4101
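The `confint()` intervals for the slope follow directly from the estimate plus or minus a $t$ quantile times the standard error. The sketch below redoes that computation from the printed coefficient table; `slope_ci` is a hypothetical helper name introduced only for this illustration.

```r
# Slope estimate and standard error, from summary(g1)$coeff
b1    <- -170.5752
se_b1 <- 41.57433
df    <- 82

# CI for the slope at a given confidence level (hypothetical helper)
slope_ci <- function(level) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  b1 + c(-1, 1) * t_crit * se_b1
}

ci95 <- slope_ci(0.95)   # compare with confint(g1)["percentage", ]
ci90 <- slope_ci(0.90)   # compare with confint(g1, "percentage", level = 0.9)
```

Up to rounding of the inputs, `ci95` and `ci90` reproduce the two intervals printed by `confint()`. Neither interval contains 0, consistent with the t-test.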

Analysis of Variance (ANOVA) for SLR

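ANOVA is a statistical technique to decompose the total variability in the $y_i$’s into separate variance components associated with specific sources

Decomposition of the variability and degrees of freedom (d.f.)
$\sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum (y_i - \hat y_i)^2$
SST = SSR + SSE
d.f.: $n-1 = 1 + (n-2)$

A mean square is defined by a sum of squares divided by its d.f.
Mean square regression: $\mathrm{MSR} = \mathrm{SSR}/1$
Mean square error: $\mathrm{MSE} = \mathrm{SSE}/(n-2)$

Since $F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \left(\frac{\hat\beta_1}{SE(\hat\beta_1)}\right)^2 = t^2 \sim F_{1,n-2}$ when $H_0$ is true,

we can test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$ at level $\alpha$ by rejecting $H_0$ if $F > f_{1,n-2,\alpha}$ (equivalent to $|t| > t_{n-2,\alpha/2}$, since $t_{n-2,\alpha/2}^2 = f_{1,n-2,\alpha}$)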

Analysis of variance (ANOVA) table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F statistic
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 SST              n-1

Crime Rate Example continued:
- Test the significance of the linear relationship between the crime rate and the high-school graduation rate at $\alpha = 0.05$

> anova(g1)
Analysis of Variance Table

Response: rate
           Df    Sum Sq  Mean Sq F value    Pr(>F)
percentage  1  93462942 93462942  16.834 9.571e-05 ***
Residuals  82 455273165  5552112
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
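The F statistic in the ANOVA table is the ratio of the two mean squares, and in SLR it equals the square of the t statistic for the slope. The sketch below checks both facts using the numbers printed by `anova(g1)` and `summary(g1)`; the variable names are illustrative.

```r
# Sums of squares from anova(g1); n = 84 counties
ssr <- 93462942
sse <- 455273165
n   <- 84

msr <- ssr / 1          # mean square regression
mse <- sse / (n - 2)    # mean square error

f_stat  <- msr / mse
p_value <- pf(f_stat, 1, n - 2, lower.tail = FALSE)

# t statistic for the slope, from summary(g1)$coeff
t_stat <- -4.102897
```

Here `f_stat` is about 16.834 and matches `t_stat^2` up to rounding, and `p_value` matches the p-value of the two-sided t-test.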

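Prediction of Future Observations
To predict the value of a future response $Y^*$ at a specified value $x^*$

Use confidence interval to estimate the fixed unknown mean of $Y^*$, denoted by $\mu^* = E(Y^*) = \beta_0 + \beta_1 x^*$

$\hat\mu^* = \hat\beta_0 + \hat\beta_1 x^*$, $\frac{\hat\mu^* - \mu^*}{s\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}} \sim t_{n-2}$

$100(1-\alpha)\%$ CI: $\hat\mu^* \pm t_{n-2,\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$

Use prediction interval to predict the value of the r.v. $Y^*$

$\hat Y^* - Y^* \sim N\left(0, \sigma^2\left[1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}\right]\right)$, where $\hat Y^* = \hat\beta_0 + \hat\beta_1 x^*$

$100(1-\alpha)\%$ PI: $\hat Y^* \pm t_{n-2,\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$

Here $S_{xx} = \sum (x_i - \bar x)^2$.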

Crime rate example continued:
(a) Calculate 95% CI for the average crime rate in counties with 80% high-school graduation rate;
(b) Calculate 95% PI for the crime rate of a future selected county with 80% high-school graduation rate.

> predict(g1,data.frame(percentage=80), interval="confidence")
       fit      lwr      upr
1 6871.585 6347.116 7396.054

> predict(g1,data.frame(percentage=80), interval="prediction")
       fit     lwr      upr
1 6871.585 2154.92 11588.25

> grid=seq(60,90,1)
> conf=predict(g1,data.frame(percentage=grid),interval="confidence")
> pred=predict(g1,data.frame(percentage=grid),interval="prediction")
> matplot(grid,pred,lty=c(1,2,2),col=c("red","green","green"),type="l",lwd=2,main="CI vs PI",
+         xlab="Percentage of having at least high school diplomas",
+         ylab="Crime Rate (per 100K residents)")
> matplot(grid,conf[,2:3],lty=c(2,2),col=c("blue","blue"),type="l",add=TRUE,lwd=2)

[Figure: CI vs PI. Fitted line (red, solid) with prediction limits (green, dashed) and confidence limits (blue, dashed); x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents)]
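Both intervals at 80% can be rebuilt from quantities R already printed: the fitted value and its standard error (`$fit` and `$se.fit` from `predict(..., se=TRUE)`) and the residual standard error $s$. The sketch below assumes those printed values and uses illustrative variable names; the prediction standard error adds $s^2$ to the squared standard error of the fitted mean.

```r
# Quantities printed by predict(g1, data.frame(percentage=80), se=TRUE)
fit    <- 6871.585    # estimated mean crime rate at x* = 80
se_fit <- 263.6425    # standard error of the estimated mean
s      <- 2356.292    # residual standard error (estimate of sigma)
df     <- 82

t_crit <- qt(0.975, df)   # for 95% intervals

# 95% CI for the mean response at x* = 80
conf_int <- fit + c(-1, 1) * t_crit * se_fit

# 95% PI for a new observation at x* = 80
se_pred  <- sqrt(s^2 + se_fit^2)
pred_int <- fit + c(-1, 1) * t_crit * se_pred
```

Up to rounding, `conf_int` and `pred_int` match the `lwr` and `upr` columns from `predict()`. The prediction interval is far wider because it also accounts for the error of a single new county, not just the uncertainty in the estimated mean.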

Both CI and PI have shortest widths when $x^* = \bar x$.

Predicting beyond the range of observed data (extrapolation) is risky and should generally be avoided.
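The claim about the widths follows from the interval formulas. Writing $w_{CI}$ and $w_{PI}$ for the half-widths (notation introduced here only for this sketch) and $S_{xx} = \sum (x_i - \bar x)^2$:

```latex
\begin{align*}
w_{CI}(x^*) &= t_{n-2,\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}, &
w_{PI}(x^*) &= t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}
\end{align*}
```

The only term depending on $x^*$ is $(x^* - \bar x)^2 \ge 0$, which is zero exactly when $x^* = \bar x$. Both half-widths therefore attain their minimum there and grow as $x^*$ moves away from $\bar x$, which is one reason extrapolated predictions are so imprecise.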
