Formal Statement of Simple Linear Regression Model
Yi = β0 + β1Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi is a random error term with mean E(εi) = 0 and variance Var(εi) = σ²
- i = 1, ..., n
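As a concrete illustration, here is a minimal simulation sketch of this model in Python (not from the lecture; the parameter values, sample size, and variable names are made-up choices):

```python
# Minimal sketch simulating Y_i = beta0 + beta1 * X_i + eps_i.
# beta0, beta1, sigma, and n are illustrative values, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
n = 30
beta0, beta1, sigma = 2.0, 0.5, 1.0

X = np.linspace(0.0, 10.0, n)           # known constants X_i
eps = rng.normal(0.0, sigma, size=n)    # E(eps_i) = 0, Var(eps_i) = sigma^2
Y = beta0 + beta1 * X + eps             # response in the i-th trial
```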
Least Squares Linear Regression
- We seek to minimize

  Q = ∑_{i=1}^{n} (Yi − (β0 + β1Xi))²

- Choose b0 and b1 as estimators for β0 and β1; b0 and b1 will minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), ..., (Xn, Yn).
Normal Equations
- The result of this minimization step is called the normal equations. b0 and b1 are called point estimators of β0 and β1, respectively.

  ∑Yi = n·b0 + b1·∑Xi

  ∑XiYi = b0·∑Xi + b1·∑Xi²

- This is a system of two equations in two unknowns. The solution is given by...
Solution to Normal Equations
After a lot of algebra one arrives at

  b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

  b0 = Ȳ − b1·X̄

  X̄ = ∑Xi / n,  Ȳ = ∑Yi / n
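These closed-form formulas translate directly to code; a minimal sketch (assuming X and Y are 1-D numpy arrays, e.g. from the simulation above):

```python
import numpy as np

def least_squares(X, Y):
    # Closed-form solution of the normal equations.
    Xbar, Ybar = X.mean(), Y.mean()
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    b0 = Ybar - b1 * Xbar
    return b0, b1
```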
Properties of Solution
- The i-th residual is defined to be

  ei = Yi − Ŷi

- ∑i ei = 0
- ∑i Yi = ∑i Ŷi
- ∑i Xi·ei = 0
- ∑i Ŷi·ei = 0
- The regression line always goes through the point (X̄, Ȳ)
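A self-contained numerical check of these properties (made-up data; np.polyfit is used as an off-the-shelf least squares fit):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0.0, 10.0, 30)
Y = 2.0 + 0.5 * X + rng.normal(size=30)

b1, b0 = np.polyfit(X, Y, 1)          # degree-1 fit returns (slope, intercept)
Yhat = b0 + b1 * X
e = Y - Yhat                          # residuals e_i = Y_i - Yhat_i

print(np.isclose(e.sum(), 0.0))           # sum_i e_i = 0
print(np.isclose(Y.sum(), Yhat.sum()))    # sum_i Y_i = sum_i Yhat_i
print(np.isclose((X * e).sum(), 0.0))     # sum_i X_i e_i = 0
print(np.isclose((Yhat * e).sum(), 0.0))  # sum_i Yhat_i e_i = 0
```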
Alternative format of linear regression model:
  Yi = β0* + β1(Xi − X̄) + εi

The least squares estimator b1 for β1 remains the same as before. The least squares estimator for β0* = β0 + β1X̄ becomes

  b0* = b0 + b1X̄ = (Ȳ − b1X̄) + b1X̄ = Ȳ

Hence the estimated regression function is

  Ŷ = Ȳ + b1(X − X̄)
s² estimator for σ²

  s² = MSE = SSE / (n − 2) = ∑(Yi − Ŷi)² / (n − 2) = ∑ei² / (n − 2)

- MSE is an unbiased estimator of σ²: E(MSE) = σ²
- The sum of squares SSE has n − 2 "degrees of freedom" associated with it.
- Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
Normal Error Regression Model
Yi = β0 + β1Xi + εi

- Yi is the value of the response variable in the i-th trial
- β0 and β1 are parameters
- Xi is a known constant, the value of the predictor variable in the i-th trial
- εi ~ iid N(0, σ²) (note this is different: now we know the distribution)
- i = 1, ..., n
Inference concerning β1
Tests concerning β1 (the slope) are often of interest, particularly

  H0: β1 = 0
  Ha: β1 ≠ 0

The null hypothesis model

  Yi = β0 + (0)·Xi + εi

implies that there is no relationship between Y and X. Note that the means of all the Yi's are then equal at all levels of Xi.
Sampling Dist. of b1

- The point estimator for β1 is

  b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²

- For the normal error regression model the sampling distribution of b1 is normal, with mean and variance given by

  E(b1) = β1

  Var(b1) = σ² / ∑(Xi − X̄)²
Estimated variance of b1
- When we don't know σ², we replace it with the MSE estimate
- Let

  s² = MSE = SSE / (n − 2),  where SSE = ∑ei² and ei = Yi − Ŷi

  Plugging s² in for σ² in

  Var(b1) = σ² / ∑(Xi − X̄)²

  gives the estimated variance

  s²{b1} = s² / ∑(Xi − X̄)²
Recap
- We now have an expression for the sampling distribution of b1 when σ² is known:

  b1 ~ N(β1, σ² / ∑(Xi − X̄)²)    (1)

- When σ² is unknown we have an unbiased point estimator of σ², giving the estimated variance

  s²{b1} = s² / ∑(Xi − X̄)²
Sampling Distribution of (b1 − β1)/s{b1}

- b1 is normally distributed, so (b1 − β1)/√Var(b1) is a standard normal variable
- We don't know Var(b1), so it must be estimated from the data. We have already denoted its estimate s²{b1}
- Using this estimate, it can be shown that

  (b1 − β1) / s{b1} ~ t(n − 2),  where s{b1} = √(s²{b1})
Confidence Intervals and Hypothesis Tests
Now that we know the sampling distribution of (b1 − β1)/s{b1} (t with n − 2 degrees of freedom) we can construct confidence intervals and hypothesis tests easily.

1 − α confidence limits for β1

- The 1 − α confidence limits for β1 are

  b1 ± t(1 − α/2; n − 2)·s{b1}

- Note that this quantity can be used to calculate confidence intervals given n and α.
- Fixing α can guide the choice of sample size if a particular confidence interval width is desired; given a sample size, vice versa.
- It is also useful for hypothesis testing; a computational sketch follows.
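A minimal sketch of these confidence limits (illustrative, not the lecture's code; X and Y are assumed 1-D numpy arrays):

```python
import numpy as np
from scipy import stats

def slope_ci(X, Y, alpha=0.05):
    # 1 - alpha confidence limits: b1 +/- t(1 - alpha/2; n - 2) * s{b1}
    n, Xbar = len(X), X.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * Xbar
    mse = np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2)
    s_b1 = np.sqrt(mse / Sxx)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return b1 - t * s_b1, b1 + t * s_b1
```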
Tests Concerning β1
- Example 1: two-sided test
  - H0: β1 = 0
  - Ha: β1 ≠ 0
  - Test statistic

    t* = (b1 − 0) / s{b1}
Tests Concerning β1
- We have an estimate of the sampling distribution of b1 from the data.
- If the null hypothesis holds, then the b1 estimate coming from the data should lie within the 95% confidence interval of the sampling distribution centered at 0 (in this case)

  t* = (b1 − 0) / s{b1}

Decision rules

  If |t*| ≤ t(1 − α/2; n − 2), accept H0
  If |t*| > t(1 − α/2; n − 2), reject H0

The absolute values make the test two-sided; the sketch below applies this rule.
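A sketch of this decision rule (hypothetical helper; b1, s_b1, and n would come from a fit such as the one above):

```python
from scipy import stats

def slope_t_test(b1, s_b1, n, alpha=0.05):
    # Two-sided test of H0: beta_1 = 0 at level alpha.
    t_star = (b1 - 0.0) / s_b1
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return "reject H0" if abs(t_star) > t_crit else "accept H0"
```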
Inferences Concerning β0
- Largely, inference procedures regarding β0 can be performed in the same way as those for β1
- Remember the point estimator b0 for β0:

  b0 = Ȳ − b1·X̄
Sampling distribution of b0
- When the error variance is known

  E(b0) = β0

  σ²{b0} = σ²·(1/n + X̄² / ∑(Xi − X̄)²)

- When the error variance is unknown

  s²{b0} = MSE·(1/n + X̄² / ∑(Xi − X̄)²)
Confidence interval for β0
The 1 − α confidence limits for β0 are obtained in the same manner as those for β1:

  b0 ± t(1 − α/2; n − 2)·s{b0}
Sampling Distribution of Ŷh

- We have Ŷh = b0 + b1·Xh
- Since this quantity is itself a linear combination of the Yi's, its sampling distribution is itself normal.
- The mean of the sampling distribution is

  E{Ŷh} = E{b0} + E{b1}·Xh = β0 + β1Xh

Biased or unbiased? Unbiased, since the mean of the sampling distribution equals the mean response at Xh.
Sampling Distribution of Ŷh

- So, plugging in, we get

  σ²{Ŷh} = σ²·(1/n + (Xh − X̄)² / ∑(Xi − X̄)²)

- Since we often won't know σ², we can, as usual, plug in s² = SSE/(n − 2), our estimate for it, to get our estimate of this sampling distribution variance

  s²{Ŷh} = s²·(1/n + (Xh − X̄)² / ∑(Xi − X̄)²)
No surprise...

- The studentized point estimator for the mean response is distributed as a t-distribution with n − 2 degrees of freedom

  (Ŷh − E{Ŷh}) / s{Ŷh} ~ t(n − 2)

- This means that we can construct confidence intervals in the same manner as before.
Confidence Intervals for E(Yh)

- The 1 − α confidence limits for E(Yh) are

  Ŷh ± t(1 − α/2; n − 2)·s{Ŷh}

- From this, hypothesis tests can be constructed as usual; a sketch follows.
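A sketch of this interval (illustrative; same assumptions on X and Y as before, with Xh a scalar):

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, Y, Xh, alpha=0.05):
    # 1 - alpha confidence interval for E(Y_h) at X = Xh.
    n, Xbar = len(X), X.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * Xbar
    mse = np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2)
    Yh_hat = b0 + b1 * Xh
    s_Yh = np.sqrt(mse * (1 / n + (Xh - Xbar) ** 2 / Sxx))
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return Yh_hat - t * s_Yh, Yh_hat + t * s_Yh
```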
Prediction interval for a single new observation

- If the regression parameters are unknown, the 1 − α prediction interval for a new observation Yh is given by

  Ŷh ± t(1 − α/2; n − 2)·s{pred}

- We have

  σ²{pred} = σ²{Yh − Ŷh} = σ²{Yh} + σ²{Ŷh} = σ² + σ²{Ŷh}

  An unbiased estimator of σ²{pred} is s²{pred} = MSE + s²{Ŷh}, which is given by

  s²{pred} = MSE·[1 + 1/n + (Xh − X̄)² / ∑(Xi − X̄)²]
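The prediction interval differs from the mean-response interval only in the extra "1 +" term; a sketch under the same assumptions:

```python
import numpy as np
from scipy import stats

def prediction_interval(X, Y, Xh, alpha=0.05):
    # 1 - alpha prediction interval for a new observation at X = Xh.
    n, Xbar = len(X), X.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * Xbar
    mse = np.sum((Y - (b0 + b1 * X)) ** 2) / (n - 2)
    s_pred = np.sqrt(mse * (1 + 1 / n + (Xh - Xbar) ** 2 / Sxx))
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    Yh_hat = b0 + b1 * Xh
    return Yh_hat - t * s_pred, Yh_hat + t * s_pred
```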
ANOVA table for simple lin. regression
Source of Variation | SS                   | df    | MS              | E(MS)
--------------------|----------------------|-------|-----------------|----------------------
Regression          | SSR = ∑(Ŷi − Ȳ)²     | 1     | MSR = SSR/1     | σ² + β1²·∑(Xi − X̄)²
Error               | SSE = ∑(Yi − Ŷi)²    | n − 2 | MSE = SSE/(n−2) | σ²
Total               | SSTO = ∑(Yi − Ȳ)²    | n − 1 |                 |
F Test of β1 = 0 vs. β1 ≠ 0

ANOVA provides a battery of useful tests. For example, ANOVA provides an easy test for the two-sided hypothesis

  H0: β1 = 0
  Ha: β1 ≠ 0

Test statistic from before:  t* = (b1 − 0) / s{b1}
ANOVA test statistic:  F* = MSR / MSE
F Distribution
- The F distribution is the ratio of two independent χ² random variables, each normalized by its corresponding degrees of freedom.
- The test statistic F* follows the distribution F* ~ F(1, n − 2)
Hypothesis Test Decision Rule
Since F* is distributed as F(1, n − 2) when H0 holds, the decision rule to follow when the risk of a Type I error is to be controlled at α is:

  If F* ≤ F(1 − α; 1, n − 2), conclude H0
  If F* > F(1 − α; 1, n − 2), conclude Ha
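A sketch of the ANOVA F test built from the sums of squares in the table above (illustrative; X, Y as before):

```python
import numpy as np
from scipy import stats

def anova_f_test(X, Y, alpha=0.05):
    # F* = MSR / MSE, compared against F(1 - alpha; 1, n - 2).
    n, Xbar, Ybar = len(X), X.mean(), Y.mean()
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    b0 = Ybar - b1 * Xbar
    Yhat = b0 + b1 * X
    ssr = np.sum((Yhat - Ybar) ** 2)        # regression SS, df = 1
    sse = np.sum((Y - Yhat) ** 2)           # error SS, df = n - 2
    f_star = (ssr / 1) / (sse / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
    return f_star, f_star > f_crit          # True -> conclude Ha
```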
General Linear Test
- The test of β1 = 0 versus β1 ≠ 0 is but a single example of a general test for linear statistical models.
- The general linear test has three parts:
  - Full model
  - Reduced model
  - Test statistic
Full Model Fit
- A full linear model is first fit to the data

  Yi = β0 + β1Xi + εi

- Using this model, the error sum of squares is obtained; here, for example, the simple linear model with non-zero slope is the "full" model

  SSE(F) = ∑[Yi − (b0 + b1Xi)]² = ∑(Yi − Ŷi)² = SSE
Fit Reduced Model
- One can test the hypothesis that a simpler model is a "better" model via a general linear test (which is really a likelihood ratio test in disguise). For instance, consider a "reduced" model in which the slope is zero (i.e. no relationship between input and output):

  H0: β1 = 0
  Ha: β1 ≠ 0

- The model when H0 holds is called the reduced or restricted model:

  Yi = β0 + εi

- The SSE for the reduced model is obtained (under H0 the least squares estimate of β0 is b0 = Ȳ):

  SSE(R) = ∑(Yi − b0)² = ∑(Yi − Ȳ)² = SSTO
Test Statistic
- The idea is to compare the two error sums of squares, SSE(F) and SSE(R).
- Because the full model F has more parameters than the reduced model R, SSE(F) ≤ SSE(R) always holds.
- In the general linear test, the test statistic is

  F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]

  which follows the F distribution when H0 holds; a sketch follows this list.
- dfR and dfF are the degrees of freedom associated with the reduced and full model error sums of squares, respectively.
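The statistic reduces to a few lines of code once the two error sums of squares and their degrees of freedom are known; a generic sketch:

```python
from scipy import stats

def general_linear_test(sse_r, df_r, sse_f, df_f, alpha=0.05):
    # F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]
    f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
    f_crit = stats.f.ppf(1 - alpha, df_r - df_f, df_f)
    return f_star, f_star > f_crit          # True -> reject the reduced model
```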
R² (Coefficient of determination)

- SSTO measures the variation in the observations Yi when X is not considered
- SSE measures the variation in the Yi after a predictor variable X is employed
- A natural measure of the effect of X in reducing variation in Y is to express the reduction in variation (SSTO − SSE = SSR) as a proportion of the total variation:

  R² = SSR / SSTO = 1 − SSE / SSTO

- Note that since 0 ≤ SSE ≤ SSTO, we have 0 ≤ R² ≤ 1
Coefficient of Correlation
  r = ±√R²

Range: −1 ≤ r ≤ 1
Remedial Measures
- How do we know that the regression function is a good explainer of the observed data?
  - Plotting
  - Tests
- What if it is not? What can we do about it?
  - Transformation of variables
Residuals
- Remember the definition of residuals:

  ei = Yi − Ŷi

- And the difference between that and the unknown true error:

  εi = Yi − E(Yi)

- In a normal regression model the εi's are assumed to be iid N(0, σ²) random variables. The observed residuals ei should reflect these properties.
Departures from Model...
To be studied by residuals:

- Regression function not linear
- Error terms do not have constant variance
- Error terms are not independent
- Model fits all but one or a few outlier observations
- Error terms are not normally distributed
- One or more predictor variables have been omitted from the model
Diagnostics for Residuals
- Plot of residuals against predictor variable
- Plot of absolute or squared residuals against predictor variable
- Plot of residuals against fitted values
- Plot of residuals against time or other sequence
- Plot of residuals against omitted predictor variables
- Box plot of residuals
- Normal probability plot of residuals (see the plotting sketch below)
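A self-contained plotting sketch covering three of these diagnostics (made-up data; matplotlib and scipy assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
X = np.linspace(0.0, 10.0, 50)
Y = 2.0 + 0.5 * X + rng.normal(size=50)
b1, b0 = np.polyfit(X, Y, 1)
Yhat = b0 + b1 * X
e = Y - Yhat

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(X, e)
axes[0].set_title("Residuals vs predictor")
axes[1].scatter(Yhat, e)
axes[1].set_title("Residuals vs fitted values")
stats.probplot(e, dist="norm", plot=axes[2])   # normal probability plot
plt.tight_layout()
plt.show()
```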
Tests Involving Residuals
- Tests for constancy of variance (Brown-Forsythe test, Breusch-Pagan test, Section 3.6)
- Tests for normality of error distribution
Brown-Forsythe Test
- The test statistic for comparing the means of the absolute deviations of the residuals around the group medians is

  t*_BF = (d̄1 − d̄2) / (s·√(1/n1 + 1/n2))

  where the pooled variance is

  s² = [∑(di1 − d̄1)² + ∑(di2 − d̄2)²] / (n − 2)
Brown-Forsythe Test

- If n1 and n2 are not extremely small,

  t*_BF ~ t(n − 2)

  approximately.
- From this, confidence intervals and tests can be constructed; a sketch using scipy follows.
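In scipy, the Levene test with center="median" is the Brown-Forsythe variant (it reports an F-type statistic rather than the t form above, but addresses the same constancy-of-variance question). Splitting the residuals at the median of X is an illustrative grouping choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = np.linspace(0.0, 10.0, 60)
Y = 2.0 + 0.5 * X + rng.normal(size=60)
b1, b0 = np.polyfit(X, Y, 1)
e = Y - (b0 + b1 * X)

low, high = e[X <= np.median(X)], e[X > np.median(X)]
stat, pvalue = stats.levene(low, high, center="median")
print(stat, pvalue)
```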
F test for lack of fit
- A formal test for determining whether a specific type of regression function adequately fits the data.
- Assumptions (the usual ones): the observations Y | X are
  1. i.i.d.
  2. normally distributed
  3. of the same variance σ²
- Requires: repeat observations at one or more X levels (called replicates)
Full Model vs. Regression Model
- The full model is

  Yij = μj + εij    (full model)

  where
  - μj are parameters, j = 1, ..., c
  - εij are iid N(0, σ²)
- Since the error terms have expectation zero,

  E(Yij) = μj
Full Model
- In the full model there is a different mean (a free parameter) for each level of X
- In the regression model the mean responses are constrained to lie on a line

  E(Y) = β0 + β1X
Fitting the Full Model
- The estimators of μj are simply

  μ̂j = Ȳj

- The error sum of squares of the full model therefore is

  SSE(F) = ∑∑(Yij − Ȳj)² = SSPE

  (SSPE: pure error sum of squares)
Degrees of Freedom
- The ordinary total sum of squares has n − 1 degrees of freedom.
- Each of the j terms is an ordinary total sum of squares, so each has nj − 1 degrees of freedom.
- The number of degrees of freedom of SSPE is the sum of the component degrees of freedom:

  dfF = ∑j (nj − 1) = ∑j nj − c = n − c
General Linear Test
- Remember: the general linear test proposes a reduced model null hypothesis; this will be our normal regression model
- The full model will be as described (one independent mean for each level of X)

  H0: E(Y) = β0 + β1X
  Ha: E(Y) ≠ β0 + β1X
SSE For Reduced Model
The SSE for the reduced model is as before; remember

  SSE(R) = ∑i ∑j [Yij − (b0 + b1Xj)]² = ∑i ∑j (Yij − Ŷij)²

and it has n − 2 degrees of freedom: dfR = n − 2
F Test Statistic
From the general linear test approach,

  F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] / [SSE(F) / dfF]

  F* = [(SSE − SSPE) / ((n − 2) − (n − c))] / [SSPE / (n − c)]

Lack of fit sum of squares:

  SSLF = SSE − SSPE

Then, noting that (n − 2) − (n − c) = c − 2,

  F* = [SSLF / ((n − 2) − (n − c))] / [SSPE / (n − c)] = MSLF / MSPE
F Test Rule
- From the F test we know that large values of F* lead us to reject the null hypothesis (a sketch follows):

  If F* ≤ F(1 − α; c − 2, n − c), conclude H0
  If F* > F(1 − α; c − 2, n − c), conclude Ha
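A sketch of the whole lack-of-fit computation (illustrative; X must contain replicates so that each distinct level has a group mean):

```python
import numpy as np
from scipy import stats

def lack_of_fit_test(X, Y, alpha=0.05):
    n, levels = len(X), np.unique(X)
    c = len(levels)
    # Full (cell-means) model: pure error around each group mean.
    sspe = sum(np.sum((Y[X == xj] - Y[X == xj].mean()) ** 2) for xj in levels)
    # Reduced model: the simple linear regression fit.
    Xbar = X.mean()
    b1 = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)
    b0 = Y.mean() - b1 * Xbar
    sse = np.sum((Y - (b0 + b1 * X)) ** 2)
    sslf = sse - sspe
    f_star = (sslf / (c - 2)) / (sspe / (n - c))
    f_crit = stats.f.ppf(1 - alpha, c - 2, n - c)
    return f_star, f_star > f_crit          # True -> evidence of lack of fit
```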
Variance decomposition
SSE = SSPE + SSLF:

  ∑∑(Yij − Ŷij)² = ∑∑(Yij − Ȳj)² + ∑∑(Ȳj − Ŷij)²
Example decomposition (figure omitted).
Box Cox Transforms
- It can be difficult to graphically determine which transformation of Y is most appropriate for correcting
  - skewness of the distributions of error terms
  - unequal variances
  - nonlinearity of the regression function
- The Box-Cox procedure automatically identifies a transformation from the family of power transformations on Y
Box Cox Transforms
- This family is of the form

  Y′ = Y^λ

- Examples include:

  λ = 2:     Y′ = Y²
  λ = 0.5:   Y′ = √Y
  λ = 0:     Y′ = ln Y  (by definition)
  λ = −0.5:  Y′ = 1/√Y
  λ = −1:    Y′ = 1/Y
Box Cox Cont.
- The normal error regression model with the response variable a member of the family of power transformations becomes

  Yi^λ = β0 + β1Xi + εi

- This model has an additional parameter, λ, that needs to be estimated
- Maximum likelihood is a way to estimate this parameter; a scipy-based sketch follows
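scipy provides a Box-Cox utility; note it chooses the λ that best normalizes Y on its own (a marginal likelihood), not the λ maximizing the regression likelihood, so treat it as a rough starting point. Y must be strictly positive:

```python
import numpy as np
from scipy import stats

Y = np.array([1.2, 2.3, 2.9, 4.1, 5.8, 8.4, 11.9])  # made-up positive data
Y_transformed, lmbda = stats.boxcox(Y)               # lambda chosen by MLE
print(lmbda)
```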
Using the Bonferroni inequality cont.
- To achieve a 1 − α family confidence level for β0 and β1 (for example) using the Bonferroni procedure, we know that both individual intervals must be made more conservative.
- Returning to our confidence intervals for β0 and β1 from before:

  b0 ± t(1 − α/2; n − 2)·s{b0}
  b1 ± t(1 − α/2; n − 2)·s{b1}

- To achieve a 1 − α family confidence level, these intervals must widen to

  b0 ± t(1 − α/4; n − 2)·s{b0}
  b1 ± t(1 − α/4; n − 2)·s{b1}

- Then, writing A1 and A2 for the events that each interval covers its parameter (so each complement Āi has probability α/2),

  P(A1 ∩ A2) ≥ 1 − P(Ā1) − P(Ā2) = 1 − α/2 − α/2 = 1 − α
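A minimal sketch of the Bonferroni joint intervals: each limit simply uses t(1 − α/4; n − 2) in place of t(1 − α/2; n − 2) (hypothetical helper; b0, b1, s_b0, s_b1 would come from earlier fits):

```python
from scipy import stats

def bonferroni_joint_ci(b0, b1, s_b0, s_b1, n, alpha=0.05):
    # Joint 1 - alpha coverage via two 1 - alpha/2 individual intervals.
    t = stats.t.ppf(1 - alpha / 4, df=n - 2)
    return ((b0 - t * s_b0, b0 + t * s_b0),
            (b1 - t * s_b1, b1 + t * s_b1))
```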