Formal Statement of Simple Linear Regression Model
Yi = β0 + β1Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²
- i = 1, ..., n
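As a concrete illustration, here is a minimal simulation sketch of this model in Python (not from the lecture; the parameter values, sample size, and variable names are made-up choices):

```python
# Minimal sketch simulating Y_i = beta0 + beta1 * X_i + eps_i.
# beta0, beta1, sigma, and n are illustrative values, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
n = 30
beta0, beta1, sigma = 2.0, 0.5, 1.0

X = np.linspace(0.0, 10.0, n)           # known constants X_i
eps = rng.normal(0.0, sigma, size=n)    # E(eps_i) = 0, Var(eps_i) = sigma^2
Y = beta0 + beta1 * X + eps             # response in the i-th trial
```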
Least Squares Linear Regression
- We seek to minimize

  Q = ∑_{i=1}^{n} (Yi − (β0 + β1Xi))²

- Choose b0 and b1 as estimators for β0 and β1; b0 and b1 will minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), ..., (Xn, Yn).
Normal Equations
- The result of this minimization step is called the normal equations. b0 and b1 are called point estimators of β0 and β1, respectively.

  ∑Yi = n·b0 + b1·∑Xi

  ∑XiYi = b0·∑Xi + b1·∑Xi²

- This is a system of two equations in two unknowns. The solution is given by...
Solution to Normal Equations
After a lot of algebra one arrives at

  b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

  b0 = Ȳ − b1·X̄

  X̄ = ∑Xi / n,  Ȳ = ∑Yi / n
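These closed-form formulas translate directly to code; a minimal sketch (assuming X and Y are 1-D numpy arrays, e.g. from the simulation above):

```python
import numpy as np

def least_squares(X, Y):
    # Closed-form solution of the normal equations.
    Xbar, Ybar = X.mean(), Y.mean()
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    b0 = Ybar - b1 * Xbar
    return b0, b1
```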
Properties of Solution
- The i-th residual is defined to be

  ei = Yi − Ŷi

- ∑i ei = 0
- ∑i Yi = ∑i Ŷi
- ∑i Xi·ei = 0
- ∑i Ŷi·ei = 0
- The regression line always goes through the point (X̄, Ȳ)
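A self-contained numerical check of these properties (made-up data; np.polyfit is used as an off-the-shelf least squares fit):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0.0, 10.0, 30)
Y = 2.0 + 0.5 * X + rng.normal(size=30)

b1, b0 = np.polyfit(X, Y, 1)          # degree-1 fit returns (slope, intercept)
Yhat = b0 + b1 * X
e = Y - Yhat                          # residuals e_i = Y_i - Yhat_i

print(np.isclose(e.sum(), 0.0))           # sum_i e_i = 0
print(np.isclose(Y.sum(), Yhat.sum()))    # sum_i Y_i = sum_i Yhat_i
print(np.isclose((X * e).sum(), 0.0))     # sum_i X_i e_i = 0
print(np.isclose((Yhat * e).sum(), 0.0))  # sum_i Yhat_i e_i = 0
```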
Alternative format of linear regression model:
  Yi = β0* + β1(Xi − X̄) + εi

The least squares estimator b1 for β1 remains the same as before. The least squares estimator for β0* = β0 + β1X̄ becomes

  b0* = b0 + b1X̄ = (Ȳ − b1X̄) + b1X̄ = Ȳ

Hence the estimated regression function is

  Ŷ = Ȳ + b1(X − X̄)
s² estimator for σ²

  s² = MSE = SSE / (n − 2) = ∑(Yi − Ŷi)² / (n − 2) = ∑ei² / (n − 2)

- MSE is an unbiased estimator of σ²: E(MSE) = σ²
- The sum of squares SSE has n − 2 "degrees of freedom" associated with it.
- Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
Normal Error Regression Model
Yi = β0 + β1Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi ~ iid N(0, σ²) (note this is different: now we know the distribution)
- i = 1, ..., n
Inference concerning β1
Tests concerning β1 (the slope) are often of interest, particularly

  H0: β1 = 0
  Ha: β1 ≠ 0

The null hypothesis model

  Yi = β0 + (0)·Xi + εi

implies that there is no relationship between Y and X. Note that the means of all the Yi's are then equal at all levels of Xi.
Sampling Dist. of b1

- The point estimator for β1 is

  b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

- For the normal error regression model the sampling distribution of b1 is normal, with mean and variance given by

  E(b1) = β1

  Var(b1) = σ² / ∑(Xi − X̄)²
Estimated variance of b1
- When we don't know σ², we replace it with the MSE estimate
- Let

  s² = MSE = SSE / (n − 2),  where SSE = ∑ei² and ei = Yi − Ŷi

  Plugging s² in for σ² in

  Var(b1) = σ² / ∑(Xi − X̄)²

  gives the estimated variance

  s²{b1} = s² / ∑(Xi − X̄)²
Recap
- We now have an expression for the sampling distribution of b1 when σ² is known:

  b1 ~ N(β1, σ² / ∑(Xi − X̄)²)    (1)

- When σ² is unknown we have an unbiased point estimator of σ², giving the estimated variance

  s²{b1} = s² / ∑(Xi − X̄)²
Sampling Distribution of (b1 − β1)/s{b1}

- b1 is normally distributed, so (b1 − β1)/√Var(b1) is a standard normal variable
- We don't know Var(b1), so it must be estimated from the data. We have already denoted its estimate s²{b1}
- Using this estimate, it can be shown that

  (b1 − β1) / s{b1} ~ t(n − 2),  where s{b1} = √(s²{b1})
Confidence Intervals and Hypothesis Tests
Now that we know the sampling distribution of (b1 − β1)/s{b1} (t with n − 2 degrees of freedom) we can construct confidence intervals and hypothesis tests easily.

1 − α confidence limits for β1

- The 1 − α confidence limits for β1 are

  b1 ± t(1 − α/2; n − 2)·s{b1}

- Note that this quantity can be used to calculate confidence intervals given n and α.
- Fixing α can guide the choice of sample size if a particular confidence interval width is desired; given a sample size, vice versa.
- It is also useful for hypothesis testing; a computational sketch follows.
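A minimal sketch of these confidence limits (illustrative, not the lecture's code; X and Y are assumed 1-D numpy arrays):

```python
import numpy as np
from scipy import stats

def slope_ci(X, Y, alpha=0.05):
    # 1 - alpha confidence limits: b1 +/- t(1 - alpha/2; n - 2) * s{b1}
    n, Xbar = len(X), X.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * Xbar
    mse = np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2)
    s_b1 = np.sqrt(mse / Sxx)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return b1 - t * s_b1, b1 + t * s_b1
```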
Tests Concerning β1
- Example 1: two-sided test
  - H0: β1 = 0
  - Ha: β1 ≠ 0
  - Test statistic

    t* = (b1 − 0) / s{b1}
Tests Concerning β1
- We have an estimate of the sampling distribution of b1 from the data.
- If the null hypothesis holds, then the b1 estimate coming from the data should lie within the 95% confidence interval of the sampling distribution centered at 0 (in this case)

  t* = (b1 − 0) / s{b1}

Decision rules

  If |t*| ≤ t(1 − α/2; n − 2), accept H0
  If |t*| > t(1 − α/2; n − 2), reject H0

The absolute values make the test two-sided; the sketch below applies this rule.
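A sketch of this decision rule (hypothetical helper; b1, s_b1, and n would come from a fit such as the one above):

```python
from scipy import stats

def slope_t_test(b1, s_b1, n, alpha=0.05):
    # Two-sided test of H0: beta_1 = 0 at level alpha.
    t_star = (b1 - 0.0) / s_b1
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return "reject H0" if abs(t_star) > t_crit else "accept H0"
```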
Inferences Concerning β0
- Largely, inference procedures regarding β0 can be performed in the same way as those for β1
- Remember the point estimator b0 for β0:

  b0 = Ȳ − b1·X̄
Sampling distribution of b0
- When the error variance is known

  E(b0) = β0

  σ²{b0} = σ²·(1/n + X̄² / ∑(Xi − X̄)²)

- When the error variance is unknown

  s²{b0} = MSE·(1/n + X̄² / ∑(Xi − X̄)²)
Confidence interval for β0
The 1 − α confidence limits for β0 are obtained in the same manner as those for β1:

  b0 ± t(1 − α/2; n − 2)·s{b0}
Sampling Distribution of Ŷh

- We have Ŷh = b0 + b1·Xh
- Since this quantity is itself a linear combination of the Yi's, its sampling distribution is itself normal.
- The mean of the sampling distribution is

  E{Ŷh} = E{b0} + E{b1}·Xh = β0 + β1Xh

Biased or unbiased? Unbiased, since the mean of the sampling distribution equals the mean response at Xh.
Sampling Distribution of Ŷh

- So, plugging in, we get

  σ²{Ŷh} = σ²·(1/n + (Xh − X̄)² / ∑(Xi − X̄)²)

- Since we often won't know σ², we can, as usual, plug in s² = SSE/(n − 2), our estimate for it, to get our estimate of this sampling distribution variance

  s²{Ŷh} = s²·(1/n + (Xh − X̄)² / ∑(Xi − X̄)²)
No surprise...

- The studentized point estimator for the mean response is distributed as a t-distribution with n − 2 degrees of freedom

  (Ŷh − E{Ŷh}) / s{Ŷh} ~ t(n − 2)

- This means that we can construct confidence intervals in the same manner as before.
Confidence Intervals for E(Yh)

- The 1 − α confidence limits for E(Yh) are

  Ŷh ± t(1 − α/2; n − 2)·s{Ŷh}

- From this, hypothesis tests can be constructed as usual; a sketch follows.
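A sketch of this interval (illustrative; same assumptions on X and Y as before, with Xh a scalar):

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, Y, Xh, alpha=0.05):
    # 1 - alpha confidence interval for E(Y_h) at X = Xh.
    n, Xbar = len(X), X.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * Xbar
    mse = np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2)
    Yh_hat = b0 + b1 * Xh
    s_Yh = np.sqrt(mse * (1 / n + (Xh - Xbar) ** 2 / Sxx))
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return Yh_hat - t * s_Yh, Yh_hat + t * s_Yh
```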
Prediction interval for a single new observation

- If the regression parameters are unknown, the 1 − α prediction interval for a new observation Yh is given by

  Ŷh ± t(1 − α/2; n − 2)·s{pred}

- We have

  σ²{pred} = σ²{Yh − Ŷh} = σ²{Yh} + σ²{Ŷh} = σ² + σ²{Ŷh}

  An unbiased estimator of σ²{pred} is s²{pred} = MSE + s²{Ŷh}, which is given by

  s²{pred} = MSE·[1 + 1/n + (Xh − X̄)² / ∑(Xi − X̄)²]
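The prediction interval differs from the mean-response interval only in the extra "1 +" term; a sketch under the same assumptions:

```python
import numpy as np
from scipy import stats

def prediction_interval(X, Y, Xh, alpha=0.05):
    # 1 - alpha prediction interval for a new observation at X = Xh.
    n, Xbar = len(X), X.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * Xbar
    mse = np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2)
    s_pred = np.sqrt(mse * (1 + 1 / n + (Xh - Xbar) ** 2 / Sxx))
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    Yh_hat = b0 + b1 * Xh
    return Yh_hat - t * s_pred, Yh_hat + t * s_pred
```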
ANOVA table for simple lin. regression
Source of Variation | SS                   | df    | MS              | E(MS)
--------------------|----------------------|-------|-----------------|----------------------
Regression          | SSR = ∑(Ŷi − Ȳ)²     | 1     | MSR = SSR/1     | σ² + β1²·∑(Xi − X̄)²
Error               | SSE = ∑(Yi − Ŷi)²    | n − 2 | MSE = SSE/(n−2) | σ²
Total               | SSTO = ∑(Yi − Ȳ)²    | n − 1 |                 |
F Test of β1 = 0 vs. β1 ≠ 0

ANOVA provides a battery of useful tests. For example, ANOVA provides an easy test for the two-sided hypothesis

  H0: β1 = 0
  Ha: β1 ≠ 0

Test statistic from before:  t* = (b1 − 0) / s{b1}
ANOVA test statistic:  F* = MSR / MSE
F Distribution
- The F distribution is the ratio of two independent χ² random variables, each normalized by its corresponding degrees of freedom.
- The test statistic F* follows the distribution F* ~ F(1, n − 2)
Hypothesis Test Decision Rule
Since F* is distributed as F(1, n − 2) when H0 holds, the decision rule to follow when the risk of a Type I error is to be controlled at α is:

  If F* ≤ F(1 − α; 1, n − 2), conclude H0
  If F* > F(1 − α; 1, n − 2), conclude Ha
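A sketch of the ANOVA F test built from the sums of squares in the table above (illustrative; X, Y as before):

```python
import numpy as np
from scipy import stats

def anova_f_test(X, Y, alpha=0.05):
    # F* = MSR / MSE, compared against F(1 - alpha; 1, n - 2).
    n, Xbar, Ybar = len(X), X.mean(), Y.mean()
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    b0 = Ybar - b1 * Xbar
    Yhat = b0 + b1 * X
    ssr = np.sum((Yhat - Ybar) ** 2)        # regression SS, df = 1
    sse = np.sum((Y - Yhat) ** 2)           # error SS, df = n - 2
    f_star = (ssr / 1) / (sse / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
    return f_star, f_star > f_crit          # True -> conclude Ha
```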
General Linear Test
- The test of β1 = 0 versus β1 ≠ 0 is but a single example of a general test for linear statistical models.
- The general linear test has three parts:
  - Full model
  - Reduced model
  - Test statistic
Full Model Fit
- A full linear model is first fit to the data

  Yi = β0 + β1Xi + εi

- Using this model, the error sum of squares is obtained; here, for example, the simple linear model with non-zero slope is the "full" model

  SSE(F) = ∑[Yi − (b0 + b1Xi)]² = ∑(Yi − Ŷi)² = SSE
Fit Reduced Model
- One can test the hypothesis that a simpler model is a "better" model via a general linear test (which is really a likelihood ratio test in disguise). For instance, consider a "reduced" model in which the slope is zero (i.e. no relationship between input and output):

  H0: β1 = 0
  Ha: β1 ≠ 0

- The model when H0 holds is called the reduced or restricted model:

  Yi = β0 + εi

- The SSE for the reduced model is obtained (under H0 the least squares estimate of β0 is b0 = Ȳ):

  SSE(R) = ∑(Yi − b0)² = ∑(Yi − Ȳ)² = SSTO
Test Statistic
- The idea is to compare the two error sums of squares, SSE(F) and SSE(R).
- Because the full model F has more parameters than the reduced model R, SSE(F) ≤ SSE(R) always holds.
- In the general linear test, the test statistic is

  F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]

  which follows the F distribution when H0 holds; a sketch follows this list.
- dfR and dfF are the degrees of freedom associated with the reduced and full model error sums of squares, respectively.
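The statistic reduces to a few lines of code once the two error sums of squares and their degrees of freedom are known; a generic sketch:

```python
from scipy import stats

def general_linear_test(sse_r, df_r, sse_f, df_f, alpha=0.05):
    # F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]
    f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
    f_crit = stats.f.ppf(1 - alpha, df_r - df_f, df_f)
    return f_star, f_star > f_crit          # True -> reject the reduced model
```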
R² (Coefficient of determination)

- SSTO measures the variation in the observations Yi when X is not considered
- SSE measures the variation in the Yi after a predictor variable X is employed
- A natural measure of the effect of X in reducing variation in Y is to express the reduction in variation (SSTO − SSE = SSR) as a proportion of the total variation:

  R² = SSR / SSTO = 1 − SSE / SSTO

- Note that since 0 ≤ SSE ≤ SSTO, we have 0 ≤ R² ≤ 1
Coefficient of Correlation
  r = ±√R²

Range: −1 ≤ r ≤ 1
Remedial Measures
- How do we know that the regression function is a good explainer of the observed data?
  - Plotting
  - Tests
- What if it is not? What can we do about it?
  - Transformation of variables
Residuals
- Remember the definition of residuals:

  ei = Yi − Ŷi

- And the difference between that and the unknown true error:

  εi = Yi − E(Yi)

- In a normal regression model the εi's are assumed to be iid N(0, σ²) random variables. The observed residuals ei should reflect these properties.
Departures from Model...
To be studied by residuals:

- Regression function not linear
- Error terms do not have constant variance
- Error terms are not independent
- Model fits all but one or a few outlier observations
- Error terms are not normally distributed
- One or more predictor variables have been omitted from the model
Diagnostics for Residuals
- Plot of residuals against predictor variable
- Plot of absolute or squared residuals against predictor variable
- Plot of residuals against fitted values
- Plot of residuals against time or other sequence
- Plot of residuals against omitted predictor variables
- Box plot of residuals
- Normal probability plot of residuals (see the plotting sketch below)
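A self-contained plotting sketch covering three of these diagnostics (made-up data; matplotlib and scipy assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
X = np.linspace(0.0, 10.0, 50)
Y = 2.0 + 0.5 * X + rng.normal(size=50)
b1, b0 = np.polyfit(X, Y, 1)
Yhat = b0 + b1 * X
e = Y - Yhat

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(X, e)
axes[0].set_title("Residuals vs predictor")
axes[1].scatter(Yhat, e)
axes[1].set_title("Residuals vs fitted values")
stats.probplot(e, dist="norm", plot=axes[2])   # normal probability plot
plt.tight_layout()
plt.show()
```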
Tests Involving Residuals
- Tests for constancy of variance (Brown-Forsythe test, Breusch-Pagan test, Section 3.6)
- Tests for normality of error distribution
Brown-Forsythe Test
- The test statistic for comparing the means of the absolute deviations of the residuals around the group medians is

  t*_BF = (d̄1 − d̄2) / (s·√(1/n1 + 1/n2))

  where the pooled variance is

  s² = [∑(di1 − d̄1)² + ∑(di2 − d̄2)²] / (n − 2)
Brown-Forsythe Test

- If n1 and n2 are not extremely small,

  t*_BF ~ t(n − 2)

  approximately.
- From this, confidence intervals and tests can be constructed; a sketch using scipy follows.
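In scipy, the Levene test with center="median" is the Brown-Forsythe variant (it reports an F-type statistic rather than the t form above, but addresses the same constancy-of-variance question). Splitting the residuals at the median of X is an illustrative grouping choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = np.linspace(0.0, 10.0, 60)
Y = 2.0 + 0.5 * X + rng.normal(size=60)
b1, b0 = np.polyfit(X, Y, 1)
e = Y - (b0 + b1 * X)

low, high = e[X <= np.median(X)], e[X > np.median(X)]
stat, pvalue = stats.levene(low, high, center="median")
print(stat, pvalue)
```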
F test for lack of fit
- A formal test for determining whether a specific type of regression function adequately fits the data.
- Assumptions (the usual ones): the observations Y | X are
  1. i.i.d.
  2. normally distributed
  3. of the same variance σ²
- Requires: repeat observations at one or more X levels (called replicates)
Full Model vs. Regression Model
- The full model is

  Yij = μj + εij    (full model)

  where
  - μj are parameters, j = 1, ..., c
  - εij are iid N(0, σ²)
- Since the error terms have expectation zero,

  E(Yij) = μj
Full Model
- In the full model there is a different mean (a free parameter) for each level of X
- In the regression model the mean responses are constrained to lie on a line

  E(Y) = β0 + β1X
Fitting the Full Model
- The estimators of μj are simply

  μ̂j = Ȳj

- The error sum of squares of the full model therefore is

  SSE(F) = ∑∑(Yij − Ȳj)² = SSPE

  (SSPE: pure error sum of squares)
Degrees of Freedom
- The ordinary total sum of squares has n − 1 degrees of freedom.
- Each of the j terms is an ordinary total sum of squares, so each has nj − 1 degrees of freedom.
- The number of degrees of freedom of SSPE is the sum of the component degrees of freedom:

  dfF = ∑j (nj − 1) = ∑j nj − c = n − c
General Linear Test
- Remember: the general linear test proposes a reduced model null hypothesis; this will be our normal regression model
- The full model will be as described (one independent mean for each level of X)

  H0: E(Y) = β0 + β1X
  Ha: E(Y) ≠ β0 + β1X
SSE For Reduced Model
The SSE for the reduced model is as before; remember

  SSE(R) = ∑i ∑j [Yij − (b0 + b1Xj)]² = ∑i ∑j (Yij − Ŷij)²

and it has n − 2 degrees of freedom: dfR = n − 2
F Test Statistic
From the general linear test approach,

  F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]

  F* = [(SSE − SSPE) / ((n − 2) − (n − c))] / [SSPE / (n − c)]

Lack of fit sum of squares:

  SSLF = SSE − SSPE

Then, noting that (n − 2) − (n − c) = c − 2,

  F* = [SSLF / ((n − 2) − (n − c))] / [SSPE / (n − c)] = MSLF / MSPE
F Test Rule
- From the F test we know that large values of F* lead us to reject the null hypothesis (a sketch follows):

  If F* ≤ F(1 − α; c − 2, n − c), conclude H0
  If F* > F(1 − α; c − 2, n − c), conclude Ha
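A sketch of the whole lack-of-fit computation (illustrative; X must contain replicates so that each distinct level has a group mean):

```python
import numpy as np
from scipy import stats

def lack_of_fit_test(X, Y, alpha=0.05):
    n, levels = len(X), np.unique(X)
    c = len(levels)
    # Full (cell-means) model: pure error around each group mean.
    sspe = sum(np.sum((Y[X == xj] - Y[X == xj].mean()) ** 2) for xj in levels)
    # Reduced model: the simple linear regression fit.
    Xbar = X.mean()
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)
    b0 = Y.mean() - b1 * Xbar
    sse = np.sum((Y - (b0 + b1 * X)) ** 2)
    sslf = sse - sspe
    f_star = (sslf / (c - 2)) / (sspe / (n - c))
    f_crit = stats.f.ppf(1 - alpha, c - 2, n - c)
    return f_star, f_star > f_crit          # True -> evidence of lack of fit
```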
Variance decomposition
SSE = SSPE + SSLF:

  ∑∑(Yij − Ŷij)² = ∑∑(Yij − Ȳj)² + ∑∑(Ȳj − Ŷij)²
Example decomposition (figure omitted).
Box Cox Transforms
- It can be difficult to graphically determine which transformation of Y is most appropriate for correcting
  - skewness of the distributions of error terms
  - unequal variances
  - nonlinearity of the regression function
- The Box-Cox procedure automatically identifies a transformation from the family of power transformations on Y
Box Cox Transforms
- This family is of the form

  Y′ = Y^λ

- Examples include:

  λ = 2:     Y′ = Y²
  λ = 0.5:   Y′ = √Y
  λ = 0:     Y′ = ln Y  (by definition)
  λ = −0.5:  Y′ = 1/√Y
  λ = −1:    Y′ = 1/Y
Box Cox Cont.
- The normal error regression model with the response variable a member of the family of power transformations becomes

  Yi^λ = β0 + β1Xi + εi

- This model has an additional parameter, λ, that needs to be estimated
- Maximum likelihood is a way to estimate this parameter; a scipy-based sketch follows
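scipy provides a Box-Cox utility; note it chooses the λ that best normalizes Y on its own (a marginal likelihood), not the λ maximizing the regression likelihood, so treat it as a rough starting point. Y must be strictly positive:

```python
import numpy as np
from scipy import stats

Y = np.array([1.2, 2.3, 2.9, 4.1, 5.8, 8.4, 11.9])  # made-up positive data
Y_transformed, lmbda = stats.boxcox(Y)               # lambda chosen by MLE
print(lmbda)
```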
Using the Bonferroni inequality cont.
- To achieve a 1 − α family confidence level for β0 and β1 (for example) using the Bonferroni procedure, we know that both individual intervals must be made more conservative.
- Returning to our confidence intervals for β0 and β1 from before:

  b0 ± t(1 − α/2; n − 2)·s{b0}
  b1 ± t(1 − α/2; n − 2)·s{b1}

- To achieve a 1 − α family confidence level, these intervals must widen to

  b0 ± t(1 − α/4; n − 2)·s{b0}
  b1 ± t(1 − α/4; n − 2)·s{b1}

- Then, writing A1 and A2 for the events that each interval covers its parameter (so each complement Āi has probability α/2),

  P(A1 ∩ A2) ≥ 1 − P(Ā1) − P(Ā2) = 1 − α/2 − α/2 = 1 − α
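A minimal sketch of the Bonferroni joint intervals: each limit simply uses t(1 − α/4; n − 2) in place of t(1 − α/2; n − 2) (hypothetical helper; b0, b1, s_b0, s_b1 would come from earlier fits):

```python
from scipy import stats

def bonferroni_joint_ci(b0, b1, s_b0, s_b1, n, alpha=0.05):
    # Joint 1 - alpha coverage via two 1 - alpha/2 individual intervals.
    t = stats.t.ppf(1 - alpha / 4, df=n - 2)
    return ((b0 - t * s_b0, b0 + t * s_b0),
            (b1 - t * s_b1, b1 + t * s_b1))
```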