
    INFERENCE FOR REGRESSION –  PART 1

    Topics Outline

    Review of Least Squares Regression Line 

    The Linear Regression Model

    Confidence Intervals for the Intercept and the Slope

    Testing the Hypothesis of No Linear Relationship

    Inference about Prediction

    Residuals

    Conditions for Regression Inference

    Review of Least Squares Regression Line

    In simple linear regression, we consider a data set consisting of the paired observations

$(x_1, y_1), \ldots, (x_n, y_n)$. Our goal is to investigate how the two quantitative variables x and y, corresponding to the data values $x_i$ and $y_i$, are related. We are also interested in predicting a future response y from information about x.

The correlation coefficient r measures the direction and strength of the linear relationship between two quantitative variables. Values of r close to (−1) or (+1) indicate a strong negative or positive linear relationship.

The least-squares regression line of the response variable y on the explanatory variable x is the line

$$\hat{y} = a + bx$$

that minimizes the sum of the squares of the vertical distances of the data points $(x_i, y_i)$ from the line. The slope

$$b = r\,\frac{s_y}{s_x}$$

of the regression line is the rate at which the predicted response $\hat{y}$ changes along the line as the explanatory variable x changes. Specifically, b is the change in $\hat{y}$ when x increases by 1.

The intercept of the regression line

$$a = \bar{y} - b\bar{x}$$

is the predicted response $\hat{y}$ when the explanatory variable x = 0. This prediction is of no statistical interest unless x can actually take values near 0.

The coefficient of determination $r^2$ is the square of the correlation coefficient r. It measures the fraction of the variation in the response variable y that is explained by the least-squares regression on the explanatory variable x.

The least squares regression line can be used to predict the value of the response variable y for a given value of the explanatory variable x by substituting this x into the equation of the line.
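As a quick illustration of these two formulas, here is a minimal Python sketch that computes the slope and intercept from summary statistics. The numerical values below are made up for illustration only; they are not taken from this handout.

# Minimal sketch: slope and intercept from summary statistics (hypothetical numbers)
r = 0.90                   # correlation coefficient between x and y
s_x, s_y = 0.70, 0.39      # sample standard deviations of x and y
x_bar, y_bar = 4.88, 2.85  # sample means of x and y

b = r * s_y / s_x          # slope: b = r * (s_y / s_x)
a = y_bar - b * x_bar      # intercept: a = y_bar - b * x_bar
print(f"y-hat = {a:.3f} + {b:.3f} x")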


    Example 1

    Car plant electricity usage

    The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon

    the plant’s production, based on the data for each month of the previous year: 

Month        Production x ($ million)   Electricity usage y (million kWh)
January          4.51                        2.48
February         3.58                        2.26
March            4.31                        2.47
April            5.06                        2.77
May              5.64                        2.99
June             4.99                        3.05
July             5.29                        3.18
August           5.83                        3.46
September        4.70                        3.03
October          5.61                        3.26
November         4.90                        2.67
December         4.20                        2.53

The scatterplot shows a positive linear relationship, with no extreme outliers or potentially influential observations. Higher levels of production do tend to require higher levels of electricity.

The correlation coefficient $r = \sqrt{0.8021} \approx 0.896$ is high, indicating a strong linear relationship between Production and Electricity usage. The equation of the least-squares regression line is

$$\hat{y} = a + bx = 0.409 + 0.499\,x$$

Because $r^2 = 0.8021$, about 80% of the variation in Electricity usage is explained by Production levels.

    Is the observed relationship statistically significant?

[Scatterplot: Car Plant Electricity Usage. Electricity usage (million kWh) versus Production ($ million), with fitted line y = 0.4988x + 0.409 and R² = 0.8021.]
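The fit above was produced with Excel. A minimal Python sketch (assuming numpy and scipy are available) that reproduces the same slope, intercept, and r² from the twelve data points:

import numpy as np
from scipy import stats

# Monthly production ($ million) and electricity usage (million kWh), January through December
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53])

fit = stats.linregress(x, y)
print(f"slope b     = {fit.slope:.4f}")      # about 0.4988
print(f"intercept a = {fit.intercept:.4f}")  # about 0.4090
print(f"r           = {fit.rvalue:.4f}")     # about 0.896
print(f"r squared   = {fit.rvalue**2:.4f}")  # about 0.8021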


    The Linear Regression Model

Regression analysis is used primarily to predict the values of the response variable y based on the values of the explanatory variable x. To assess the accuracy of these predictions, we need to consider the mathematical model for linear regression.

    Figure 1 provides a summary of the estimation process for simple linear regression.

The mathematical model for linear regression analysis assumes that the observed data points $(x_1, y_1), \ldots, (x_n, y_n)$ constitute a random sample from a population. We suppose that in the population there is an underlying linear relationship between the explanatory variable x and the response variable y:

$$y = \alpha + \beta x + \varepsilon$$

where $\varepsilon$ is a random variable referred to as the error (or residual) term. The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.

The random variable $\varepsilon$ is assumed to have a mean of zero and standard deviation $\sigma$. A consequence of this assumption is that the mean of y is equal to

$$\mu_y = \alpha + \beta x$$

This is the equation of the true regression line.

The unknown parameters $\alpha$ (true intercept) and $\beta$ (true slope), which determine the relationship between x and y, can be estimated from the data set $(x_1, y_1), \ldots, (x_n, y_n)$.

It can be shown that the estimators a and b from the least squares method are the best linear unbiased estimators of $\alpha$ and $\beta$ (whatever that means!).

The estimation of $\alpha$ and $\beta$ is a statistical process much like the estimation of $\mu$ using the sample statistic $\bar{x}$. In regression, $\alpha$ and $\beta$ are two unknown parameters of interest, and the coefficients a and b obtained from the least squares line are the sample statistics used to estimate these parameters.

The third unknown parameter, the standard deviation $\sigma$ of the error $\varepsilon$, can also be estimated from the data set. Recall that the residuals (errors) are the vertical deviations of the data points from the least-squares line:

residual = (observed y) − (predicted y) = $y - \hat{y}$

There are n residuals, one for each data point, and their mean is 0. The estimate of $\sigma$ is given by the sample standard deviation of the residuals

$$s = \sqrt{\frac{\sum_{i=1}^{n}(\text{residual}_i - 0)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}$$

and is referred to as the regression standard error (or standard error of estimate). The regression standard error for our example is s = 0.173. (See Excel output on the last page.)
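Continuing the Python sketch from Example 1 (reusing x, y, and fit), the regression standard error can be computed directly from the residuals:

# Regression standard error s, estimated from the residuals
y_hat = fit.intercept + fit.slope * x         # fitted values
residuals = y - y_hat                         # observed minus predicted
n = len(y)
s = np.sqrt(np.sum(residuals**2) / (n - 2))   # note the divisor n - 2
print(f"regression standard error s = {s:.3f}")   # about 0.173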


Figure 1 The estimation process in simple linear regression.

Regression model: $y = \alpha + \beta x + \varepsilon$, where $\sigma$ is the standard deviation of $\varepsilon$.
True regression line: $\mu_y = \alpha + \beta x$.
Regression parameters: $\alpha$, $\beta$, $\sigma$.

Sample data: the pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

From the sample data we compute the statistics a, b, s and the estimated regression line $\hat{y} = a + bx$. The values of a, b, s provide estimates of $\alpha$, $\beta$, $\sigma$.


    Confidence Intervals for the Intercept and the Slope

If we did the experiment many times with the same $x_i$'s, we would get different $y_i$'s each time, due to random errors. Therefore, we would also get different values for the least squares estimators a and b of the population parameters $\alpha$ and $\beta$. Indeed, a and b are sample statistics that have their own sampling distributions.

Let $SE_a$ and $SE_b$ be estimates of the standard errors (i.e. standard deviations) of a and b, respectively. It can be shown that the level C confidence intervals for the intercept and the slope are given by the following confidence limits:

$$\alpha:\quad a \pm t^{*}\,SE_a$$

$$\beta:\quad b \pm t^{*}\,SE_b$$

Here $t^{*}$ is the critical value for the $t(n-2)$ density curve with area C between $-t^{*}$ and $t^{*}$.

Note: All t procedures in simple linear regression have n − 2 degrees of freedom.

    Example 1 (Continued)

    For our example (see Excel output),

a = 0.4090, $SE_a$ = 0.3860
b = 0.4988, $SE_b$ = 0.0784

There are 12 data points, so the degrees of freedom are n − 2 = 10. For 95% confidence and df = 10, the t-table gives $t^{*}$ = 2.228.

95% CI for $\alpha$: $a \pm t^{*}SE_a$ = 0.4090 ± (2.228)(0.3860) = 0.4090 ± 0.86, or −0.45 to 1.27

Hence, the true value of the intercept lies in the interval from −0.45 to 1.27, and this statement is made with 95% confidence.

Note: Inferences for the population intercept are rarely of practical importance.

95% CI for $\beta$: $b \pm t^{*}SE_b$ = 0.4988 ± (2.228)(0.0784) = 0.4988 ± 0.1748, or 0.32 to 0.67

Thus the management of the car plant can be 95% confident that, within the range of the data set, the mean electricity usage increases by somewhere between a third of a million kilowatt-hours and two thirds of a million kilowatt-hours for every additional $1 million of production.
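These intervals can also be computed in the Python sketch (reusing fit and n from the earlier snippets; the intercept_stderr attribute assumes SciPy 1.6 or later):

# 95% confidence intervals for the slope and the intercept (df = n - 2 = 10)
t_star = stats.t.ppf(0.975, df=n - 2)                      # critical value, about 2.228
ci_slope = (fit.slope - t_star * fit.stderr,
            fit.slope + t_star * fit.stderr)               # about (0.32, 0.67)
ci_intercept = (fit.intercept - t_star * fit.intercept_stderr,
                fit.intercept + t_star * fit.intercept_stderr)  # about (-0.45, 1.27)
print("95% CI for slope:    ", ci_slope)
print("95% CI for intercept:", ci_intercept)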


    Testing the Hypothesis of No Linear Relationship

One of the first things we want to do upon obtaining the sample regression equation

$$\hat{y} = a + bx$$

is to test its slope b. If there is no (linear) relationship between the variables x and y, then the slope of the regression equation would be expected to be zero. If b = 0, then $\hat{y} = a$ and thus x is useless as a predictor of y.

Recall that $\beta$ is unknown and represents the slope of the true unknown regression line

$$\mu_y = \alpha + \beta x$$

while b is the estimate of the slope obtained by fitting a line to the data set.

Hence, we can determine the existence of a statistically significant relationship between the x and y variables by testing whether $\beta$ (the true slope) is equal to 0. The null and alternative hypotheses are stated as follows:

$$H_0:\ \beta = 0 \quad \text{(There is no linear relationship.)}$$

$$H_a:\ \beta \neq 0 \quad \text{(There is a linear relationship.)}$$

If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship. It can be shown that the test statistic is

$$t = \frac{b}{SE_b}$$

Example 1 (Continued)

To test the hypothesis

$$H_0:\ \beta = 0 \qquad H_a:\ \beta \neq 0$$

we calculate the test statistic (see also Excel output):

$$t = \frac{0.4988}{0.0784} = 6.37$$

The t-table shows that the two-sided P-value for the t distribution with 10 degrees of freedom is smaller than 0.001. (Excel gives P-value = 0.000082.)

We reject $H_0$ and conclude that the slope of the population regression line is not 0. In other words, the data provide very strong evidence to conclude that the distribution of electricity usage does depend upon the level of production.
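In the Python sketch, the same test statistic and two-sided P-value follow directly from the fitted slope and its standard error (linregress also reports this P-value as fit.pvalue):

# Test H0: beta = 0 against Ha: beta != 0
t_stat = fit.slope / fit.stderr                      # about 6.37
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided P-value, about 0.000082
print(f"t = {t_stat:.2f}, P-value = {p_value:.6f}")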

An alternative to testing the existence of a linear relationship between the x and y variables is to set up a confidence interval for $\beta$ and to determine whether the hypothesized value ($\beta = 0$) is included in the interval. The 95% confidence interval for $\beta$ is 0.32 to 0.67. Because this interval does not contain 0, we conclude that there is a significant linear relationship between x and y. Had the interval included 0, the conclusion would have been that no (linear) relationship exists between the variables.


    Inference about Prediction

There are several reasons for building a linear regression. One, of course, is to predict response values (y's) at one or more values of the explanatory variable x.

    Example 1 (Continued)

If the monthly production is x* = $5 million, then the plant manager can predict that the electricity usage will be

$$\hat{y}^{*} = 0.409 + (0.4988)(5) = 2.903 \text{ million kWh}$$

How accurate is this prediction likely to be? Can we supply this prediction with a margin of error?

Given a specified value of the explanatory variable x*, which is not necessarily one of the values $x_1, \ldots, x_n$, we can construct two fundamentally different kinds of intervals.

1. Confidence interval for the expected (mean) response $E(y^{*}) = \mu_{y^{*}} = \alpha + \beta x^{*}$:

$$\hat{y}^{*} \pm t^{*}\,SE_{mean} \qquad \text{where} \qquad SE_{mean} = s\sqrt{\frac{1}{n} + \frac{(x^{*}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}}$$

This confidence interval expresses our uncertainty about the regression line. If we knew $\alpha$ and $\beta$, then we would know the regression line exactly and our confidence interval would be one point.

2. Prediction interval for an individual (future) response $y^{*}$:

$$\hat{y}^{*} \pm t^{*}\,SE_{ind} \qquad \text{where} \qquad SE_{ind} = s\sqrt{1 + \frac{1}{n} + \frac{(x^{*}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}}$$

This prediction interval expresses our uncertainty about the regression line and the fact that there are errors in the data. If we knew $\alpha$ and $\beta$, we would know the regression line exactly, but the length of our prediction interval would not shrink to zero, since the error term in

$$y^{*} = \alpha + \beta x^{*} + \varepsilon^{*}$$

always has a fixed variance $\sigma^{2}$.


In both intervals, $t^{*}$ is the critical value for the $t(n-2)$ density curve with area C between $-t^{*}$ and $t^{*}$, and

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^{2}}{n-2}}$$

is the regression standard error.

Both intervals are centered at $\hat{y}^{*}$ and have the usual form

point estimate ± (critical value)(standard error)

$$\hat{y}^{*} \pm t^{*}\,SE$$

    However, the prediction interval is wider than the confidence interval because it is harder to predict one individual response than to predict a mean response.

     Individuals are always more variable than averages!

Excel’s Regression tool does not have an option for computing confidence and prediction intervals. These intervals can be computed using the formulas along with the output of the Regression tool.

    Example 1 (Continued)

For our example, $\hat{y}^{*}$ = 2.903, $t^{*}$ = 2.228, $SE_{mean}$ = 0.0507, $SE_{ind}$ = 0.1802.

The 95% confidence interval for the mean response $\mu_{y^{*}} = \alpha + \beta x^{*}$ at the value x* = 5 is

$$\hat{y}^{*} \pm t^{*}\,SE_{mean} = 2.903 \pm (2.228)(0.0507) = 2.903 \pm 0.113, \quad \text{or } 2.79 \text{ to } 3.02$$

This interval implies that with a monthly production of $5 million, the mean electricity usage is between about 2.8 and 3 million kWh.

A 95% prediction interval for a future response at the value x* = 5 is

$$\hat{y}^{*} \pm t^{*}\,SE_{ind} = 2.903 \pm (2.228)(0.1802) = 2.903 \pm 0.401, \quad \text{or } 2.50 \text{ to } 3.30$$

This prediction interval indicates that if next month's production target is $5 million, then with 95% confidence next month's electricity usage will be somewhere between 2.5 and 3.3 million kWh.

Thus, while the expected or average electricity usage in a month with $5 million of production is known to lie somewhere between 2.8 and 3.0 million kWh, the electricity usage in a particular month with $5 million of production will be somewhere between 2.5 and 3.3 million kWh.
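A minimal sketch of these two interval computations, continuing the earlier Python snippets (reusing x, n, s, t_star, and fit):

# 95% confidence and prediction intervals at x* = 5
x_star = 5.0
y_star = fit.intercept + fit.slope * x_star                      # point prediction, about 2.903
sxx = np.sum((x - x.mean())**2)                                  # sum of squares of x about its mean

se_mean = s * np.sqrt(1/n + (x_star - x.mean())**2 / sxx)        # about 0.0507
se_ind  = s * np.sqrt(1 + 1/n + (x_star - x.mean())**2 / sxx)    # about 0.1802

ci = (y_star - t_star * se_mean, y_star + t_star * se_mean)      # about (2.79, 3.02)
pi = (y_star - t_star * se_ind,  y_star + t_star * se_ind)       # about (2.50, 3.30)
print("95% CI for the mean response:", ci)
print("95% prediction interval:     ", pi)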


    Residuals

The residuals ($y - \hat{y}$) give useful information about the contribution of individual data points to the overall pattern of scatter. Residual values show how much the observed values differ from the fitted values. If a particular residual is positive, the corresponding data point is above the line; if it is negative, the point is below the line. The only time a residual is zero is when the point lies directly on the line.

Example 1 (Continued)

There are twelve residuals:

Observation    1      2      3      4      5      6      7      8      9     10     11     12
Residual    −0.18   0.07  −0.09  −0.16  −0.23   0.15   0.13   0.14   0.28   0.05  −0.18   0.03

We can construct a residual plot by plotting the residuals against the explanatory variable x or the predicted (also called fitted) values $\hat{y}$. In a residual plot, the "residual = 0" line represents the position of the least-squares line in the scatterplot of y against x. (See Excel output.)

Residual plots are the primary tool for determining whether the assumed regression model is appropriate.
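A minimal sketch of such a residual plot (assuming matplotlib is available and reusing x and residuals from the earlier snippets):

import matplotlib.pyplot as plt

# Residual plot: residuals versus the explanatory variable
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")        # the "residual = 0" reference line
plt.xlabel("Production ($ million)")
plt.ylabel("Residual")
plt.title("Residuals versus Production")
plt.show()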

    Conditions for Regression Inference

An important step in determining whether the assumed linear regression model $y = \alpha + \beta x + \varepsilon$ is appropriate involves testing for the significance of the relationship between the explanatory and response variables. The tests of significance in regression analysis are based on four assumptions about the error term $\varepsilon$.

Figure 2 illustrates the regression model assumptions and their implications. Note that in this graphical interpretation, the mean response $\mu_y$ moves along a straight line as the explanatory variable x changes. The normal curves show how the observed response y will vary when x is held fixed at different values. All of the curves have the same standard deviation $\sigma$, so the variability of y is the same for all values of x.

    Figure 2 Assumptions for the linear regression model.


Here are the four conditions for regression inference, their implications, and how to check whether the conditions are satisfied.

    1. Linearity

Condition: The error term $\varepsilon$ is a random variable with mean 0.

Implication
Because $\alpha$ and $\beta$ are constants, the mean of y is

$$\mu_y = \alpha + \beta x$$

implying a linear relationship between x and y.

How to check
Look for curved patterns or other departures from a straight-line overall pattern in the residual plot. (You can also use the original scatterplot, but the residual plot magnifies any effects.)

Example 1
The scatterplot and the residual plot both show a linear relationship.

    2. Independence

Condition: The values of $\varepsilon$ are statistically independent.

Implication
The value of $\varepsilon$ for a particular value of x is not related to the value of $\varepsilon$ for any other value of x. Thus, the value of y for a particular value of x is not related to the value of y for any other value of x.

How to check
Signs of dependence in the residual plot are a bit subtle. In general, if the residual plot displays a random pattern with no apparent trends, cycles, alternations, or clumping, it is reasonable to conclude that the independence assumption holds.

Example 1
The residual plot shows random variation around the "residual = 0" line.

    3. Normality

Condition: The error term $\varepsilon$ is a normally distributed random variable (with mean 0 and standard deviation $\sigma$).

Implication
Because y is a linear function of $\varepsilon$, y is also a normally distributed random variable (with mean $\mu_y = \alpha + \beta x$ and standard deviation $\sigma$).

How to check
Check for clear skewness or other major departures from normality in the histogram of the residuals. Or, check whether the points in the normal probability plot (Q-Q plot) are far from a 45° line. (A short sketch of this check follows condition 4 below.)

Example 1
The histogram of the residuals does not show any important deviations from normality.


    4. Equal spread

Condition: The standard deviation of $\varepsilon$ is the same for all values of x.

Implication
The standard deviation of y about the regression line equals $\sigma$ and is the same for all values of x.

How to check
Look at the scatter of the residuals above and below the "residual = 0" line in the residual plot. The scatter should be roughly the same from one end to the other.

Example 1
The residual plot shows no unusual variation in the scatter of the residuals above and below the line as x varies.
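A minimal sketch of the normality check in condition 3 (the equal-spread check in condition 4 reuses the residual plot sketched earlier); it assumes matplotlib and reuses residuals and the imports from the earlier snippets:

# Normality check: histogram and normal probability (Q-Q) plot of the residuals
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=6)
axes[0].set_xlabel("Residual")
axes[0].set_title("Histogram of the Residuals")
stats.probplot(residuals, dist="norm", plot=axes[1])   # points should lie close to the straight line
plt.show()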

    Example 2

    The following figure shows some general patterns that might be observed in any residual plot.

Good pattern – residuals are randomly scattered.

Curved pattern – the relationship is not linear.

Change in variability – $\sigma$ is not equal for all values of x.


    Excel Output for Car Plant Electricity Usage

Regression Statistics
Multiple R           0.895606
R Square             0.802109
Adjusted R Square    0.782320
Standard Error       0.172948
Observations         12

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept      0.409048       0.385991       1.059736   0.314190   -0.450992    1.269089
Production     0.498830       0.078352       6.366551   0.000082    0.324252    0.673409

[Residual plot: Residuals versus Production ($ million).]

[Histogram of the Residuals.]
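A similar summary can be produced in Python with statsmodels (a sketch, assuming the package is installed and reusing x and y from the earlier snippets):

import statsmodels.api as sm

# Reproduce the regression output: coefficients, standard errors, t stats, P-values, 95% CIs
X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())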