1 simple linear regression chapter 16. 2 introduction in this chapter we examine the relationship...

1

Simple Linear Regression

Chapter 16

2

Introduction

• In this chapter we examine the relationship among interval variables via a mathematical equation.

• The motivation for using the technique:
– Forecast the value of a dependent variable (y) from the values of independent variables (x1, x2, …, xk).
– Analyze the specific relationships between the independent variables and the dependent variable.

3

16.1 Simple Linear Regression Model

The model has a deterministic component and a probabilistic component.

4

The Deterministic Part of the Model

[Figure: house cost plotted against house size]

• Most lots sell for $25,000.
• Building a house costs about $75 per square foot.
• House cost = 25000 + 75(Size)

5


The Model

However, house cost may vary even among houses of the same size!

6


The Model

[Figure: house cost plotted against house size; most lots sell for $25,000]

Since cost behaves unpredictably, we add a random component to the deterministic equation.
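As a rough illustration of the two components, the sketch below adds a normally distributed error to the deterministic house-cost equation. The noise standard deviation of $10,000 and the names `deterministic_cost` and `simulated_cost` are assumptions made for this example; the slides do not specify them.

```python
import random

LOT_PRICE = 25_000      # from the slide: most lots sell for $25,000
COST_PER_SQFT = 75      # from the slide: about $75 per square foot
NOISE_SD = 10_000       # assumption for illustration only

def deterministic_cost(size_sqft):
    """Deterministic part of the model: 25000 + 75(Size)."""
    return LOT_PRICE + COST_PER_SQFT * size_sqft

def simulated_cost(size_sqft, rng):
    """Deterministic part plus a random error with mean zero."""
    return deterministic_cost(size_sqft) + rng.gauss(0, NOISE_SD)

rng = random.Random(0)  # fixed seed so the run is repeatable
same_size_costs = [simulated_cost(2_000, rng) for _ in range(5)]

print(deterministic_cost(2_000))   # 175000
print(same_size_costs)             # five different costs for the same size
```

Houses of the same size share one deterministic cost, yet each simulated cost differs because of the random component.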

7

The Model

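• The simple first order (linear) regression model is:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where
y = dependent variable
x = independent variable
\(\beta_0\) = y-intercept
\(\beta_1\) = slope of the line (Rise/Run)
\(\varepsilon\) = error variable

[Figure: the line \(y = \beta_0 + \beta_1 x\), showing the intercept \(\beta_0\) and the slope \(\beta_1\) = Rise/Run]

\(\beta_0\) and \(\beta_1\) are unknown population parameters and are therefore estimated from the data.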

8

16.2 The Least Squares Method

• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that cuts through the data.

Question: What is the best line to describe the specific linear relationship?

[Figure: scatter plot of y against x]

9

The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared errors.

[Figure: for a given x, the error is the difference between the actual value of y and the value of y given by the equation]

10


Here is a short comparison of two possible lines drawn over four data points, (1, 2), (2, 4), (3, 1.5) and (4, 3.2):
1. A horizontal line (at y = 2.5)
2. A positively increasing line
Which line is better?

[Figure: the four data points with the two candidate lines]

11


[Figure: the four data points (1, 2), (2, 4), (3, 1.5) and (4, 3.2) with the two candidate lines]

Positively increasing line:
\[ \text{Sum of squared differences} = (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89 \]

Horizontal line at 2.5:
\[ \text{Sum of squared differences} = (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99 \]

The smaller the sum of squared differences, the better the fit of the line to the data.
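The comparison can be checked with a few lines of Python. This is a minimal sketch; the helper name `sum_squared_differences` is made up for the example, and the increasing line is taken to be y = x, which is what the fitted values 1, 2, 3, 4 in the sums imply.

```python
# The four data points from the slide.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_squared_differences(line, data):
    """Sum of squared vertical differences between the data and a line."""
    return sum((y - line(x)) ** 2 for x, y in data)

sse_increasing = sum_squared_differences(lambda x: x, points)    # line y = x
sse_horizontal = sum_squared_differences(lambda x: 2.5, points)  # line y = 2.5

print(round(sse_increasing, 2), round(sse_horizontal, 2))  # prints 7.89 3.99
```

By this criterion the horizontal line fits these four points better than the increasing one.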

12

The Estimated Coefficients

To calculate the estimates of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared differences between the data points and the line, use the formulas shown below (alternative formulas are given later):

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]

The regression equation that estimates the equation of the first order linear model is:

\[ \hat{y} = b_0 + b_1 x \]
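To see the formulas at work, the sketch below applies them to the four data points used in the earlier line comparison. The function name `least_squares` is an assumption for this example only.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [2, 4, 1.5, 3.2])
print(round(b0, 3), round(b1, 3))  # prints 2.4 0.11
```

For these points the least squares line is roughly y = 2.4 + 0.11x, which lies close to the horizontal line at 2.5.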

13

The Simple Linear Regression Line

• Example 2
– A car dealer wants to find the relationship between the odometer reading and the selling price of 3-year-old Tauruses.
– A random sample of 100 cars is selected, and the data recorded.
– Find the regression line.

Car   Price   Odometer
1     14.6    37.4
2     14.1    44.8
3     14.0    45.8
4     15.6    30.9
5     15.6    31.7
6     14.7    34.0
.     .       .
.     .       .
.     .       .

Price is the dependent variable y; Odometer is the independent variable x.

14

• To make it easier to use Excel results, we'll use another version of the formula for b1:

\[ b_1 = \frac{\mathrm{cov}(x, y)}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]

cov(x, y), the covariance of x and y, is a measure of the common 'movement' of x and y (do both generally increase together, or do they move in opposite directions?).

cov(x, y) is also denoted by \(S_{xy}\).

15

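• Solution
– Solving by hand: calculate the following statistics, where n = 100.

\[ \bar{x} = 36.011; \qquad \bar{y} = 14.841; \qquad s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 43.509 \]

\[ \mathrm{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -2.909 \]

Excel provides a population covariance of -2.879. The sample covariance is -2.879 n/(n - 1) = -2.879(100/99) ≈ -2.909.

\[ b_1 = \frac{\mathrm{cov}(X, Y)}{s_x^2} = \frac{-2.909}{43.509} = -.0669 \]

\[ b_0 = \bar{y} - b_1 \bar{x} = 14.841 - (-.0669)(36.011) \approx 17.248 \quad \text{(carrying the unrounded slope)} \]

\[ \hat{y} = b_0 + b_1 x = 17.248 - .0669x \]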
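The hand calculation can be reproduced from the summary statistics alone. This is a sketch using the rounded values quoted on the slide, so the last digits may differ slightly from the Excel output; the variable names are chosen for this example.

```python
# Summary statistics quoted on the slide (n = 100 cars).
x_bar = 36.011        # mean odometer reading
y_bar = 14.841        # mean price
s_x_squared = 43.509  # sample variance of the odometer readings
cov_xy = -2.909       # sample covariance of odometer and price

b1 = cov_xy / s_x_squared   # slope
b0 = y_bar - b1 * x_bar     # intercept, using the unrounded slope

print(round(b1, 4), round(b0, 3))
```

Keeping the slope unrounded when computing the intercept is what brings the result in line with the Excel intercept of about 17.249.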

Car Price

16

• Solution – continued
– Using the computer:

Tools > Data analysis > Regression > [Shade the y range and the x range] > OK


Car Price

17

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.805168
R Square            0.648295
Adjusted R Square   0.644707
Standard Error      0.326489
Observations        100

ANOVA
             df   SS          MS          F             Significance F
Regression   1    19.255607   19.255607   180.6429887   5.75078E-24
Residual     98   10.446293   0.1065948
Total        99   29.7019

             Coefficients   Standard Error   t Stat      P-value       Lower 95%      Upper 95%
Intercept    17.24873       0.1820926        94.725045   3.57186E-98   16.88737056    17.61008
Odometer     -0.066861      0.0049746        -13.44035   5.75078E-24   -0.076732894   -0.056989

\[ \hat{y} = 17.248 - 0.0669x \]

where x is the odometer reading and \(\hat{y}\) is the predicted selling price.

Car Price

18

Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot, Price against Odometer, with the fitted line \(\hat{y} = 17.248 - 0.0669x\); there is no data near an odometer reading of 0]

The intercept is b0 = $17,248.

• The regression equation describes the linear relationship only within the range covered by the sample.
• Thus, do not interpret the intercept as the "price of cars that have not been driven".

The slope of the line is b1 = -0.0669. For each additional mile on the odometer, the price decreases by an average of $0.0669.

19

Interpreting the Linear Regression Equation

• Remember: The regression equation pertains to the sample only!!

• To generalize the results by making inference about the population, we are about to apply statistical inference techniques.

20

16.3 Model Assumptions

• The error \(\varepsilon\) is a critical part of the regression model.
• Four requirements involving \(\varepsilon\) must be satisfied:
– The probability distribution of \(\varepsilon\) is normal.
– The mean of \(\varepsilon\) is zero: \(E(\varepsilon) = 0\).
– The standard deviation of \(\varepsilon\) is \(\sigma_\varepsilon\) for all values of x.
– The errors associated with different values of y are all independent.

21

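The Normality of \(\varepsilon\)

The mean value of y for a given value of x is

\[ E(y|x) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x, \quad \text{since } E(\varepsilon) = 0. \]

Recall: \(y = \beta_0 + \beta_1 x + \varepsilon\). Since \(\beta_0 + \beta_1 x\) is deterministic and \(\varepsilon\) is normally distributed, y is also normally distributed.

The standard deviation of y is \(\sigma_\varepsilon\) for all values of x. The standard deviation remains constant, but the mean value changes with x.

[Figure: normal distributions of y at x1, x2 and x3, centered at \(E(y|x_1) = \beta_0 + \beta_1 x_1\), \(E(y|x_2) = \beta_0 + \beta_1 x_2\) and \(E(y|x_3) = \beta_0 + \beta_1 x_3\)]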

22

The independence of the errors

Here is a case where linear regression is not the right model to apply.

[Figure: scatter plot of the data points]

Notice that for small values of y and for large values of y the errors are mostly negative, while for midrange values of y the errors are positive. The errors are not independent. Consequently, linear regression is not the correct model to work with here. One can also question the assumption that the mean error is zero.

23

16.4 Assessing the Model

• The least squares method produces a regression line whether or not there is a linear relationship between x and y.

• Consequently, it is important to assess how well the linear model fits the data (how strong the linear relationship is).

• Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.

24

Sum of Squares for Errors

• SSE is the sum of the squared vertical differences between the points and the regression line. It is the function we minimized when constructing the regression equation.

• It can serve as a measure of how well the line fits the data. SSE is defined by

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

• A shortcut formula:

\[ SSE = (n - 1)\left[ s_Y^2 - \frac{[\mathrm{cov}(X, Y)]^2}{s_x^2} \right] \]

[Figure: a data point \((x_i, y_i)\) and the corresponding value \(\hat{y}_i\) on the line]

25

Standard Error of Estimate

If \(\sigma_\varepsilon\) is small, the errors tend to be close to their mean. Since the mean error is equal to zero, the model fits the data well when \(\sigma_\varepsilon\) is close to zero. Therefore, we can use \(\sigma_\varepsilon\) as a measure of the suitability of using a linear model. To do this we need to estimate \(\sigma_\varepsilon\). An unbiased estimator of \(\sigma_\varepsilon^2\) is given by \(s_\varepsilon^2\):

\[ s_\varepsilon^2 = \frac{SSE}{n - 2} \]

The standard error of estimate is defined by

\[ s_\varepsilon = \sqrt{\frac{SSE}{n - 2}} \]
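For the car example, SSE and the standard error of estimate can be approximated from summary statistics with the shortcut formula SSE = (n - 1)[s_Y² - cov(X, Y)²/s_x²]. This is a sketch only: the value s_Y² = .3000 is the one quoted in the coefficient of determination example later in the slides, and because all inputs are rounded the results match the Excel output (SSE 10.446, Standard Error 0.3265) only approximately.

```python
import math

n = 100
s_y_squared = 0.3000   # sample variance of price (rounded)
s_x_squared = 43.509   # sample variance of odometer (rounded)
cov_xy = -2.909        # sample covariance (rounded)

# Shortcut formula for the sum of squares for errors.
sse = (n - 1) * (s_y_squared - cov_xy ** 2 / s_x_squared)

# Standard error of estimate: square root of SSE / (n - 2).
s_eps = math.sqrt(sse / (n - 2))

print(round(sse, 2), round(s_eps, 4))
```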

26

Testing the Slope

• The slope is not equal to zero: different inputs (x) yield different outputs (y). Linear relationship.
• The slope is equal to zero: different inputs (x) yield the same output (y). No linear relationship.

[Figure: two panels, one with a sloped line and one with a horizontal line]

27

Testing the Slope

• We can draw inference about \(\beta_1\) from b1 by testing
H0: \(\beta_1 = 0\)
H1: \(\beta_1 \neq 0\) (or < 0, or > 0)
– The test statistic is

\[ t = \frac{b_1 - \beta_1}{s_{b_1}} \]

where \(s_{b_1}\), the standard error of b1, is

\[ s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n - 1) s_x^2}} \]

– If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n - 2.

28

Testing the Slope, Example

• Example 4
– Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 2. Use \(\alpha\) = 5%.
– Solution. The alternative hypothesis here (H1) is of the form 'not equal to zero', because we try to show that there is a linear relationship, which can be shown by either a positive or a negative value of \(\beta_1\).

29

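• Solving by hand
– To compute t we need the values of b1 and \(s_{b_1}\).

\[ b_1 = -.0669 \]

\[ s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n - 1) s_x^2}} = \frac{.3265}{\sqrt{(99)(43.509)}} = .00497 \]

\[ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-.0669 - 0}{.00497} \approx -13.44 \]

(\(s_x^2\) is obtained from Descriptive Statistics in Excel.)

– For the two-tail rejection region, t > t.025 or t < -t.025 with \(\nu\) = n - 2 = 98. From the t-table, t.025 = 1.984 approximately. Since t = -13.44 < -1.984, we reject H0.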
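The hand calculation of the test statistic can be sketched as follows. The inputs are the rounded values from the slides, and the critical value 1.984 is taken from the slide's t-table lookup rather than computed, so this is an illustration of the arithmetic rather than a full testing routine.

```python
import math

n = 100
b1 = -0.0669          # estimated slope
s_eps = 0.3265        # standard error of estimate
s_x_squared = 43.509  # sample variance of odometer
t_critical = 1.984    # t.025 with 98 d.f., as read from the t-table on the slide

# Standard error of the slope.
s_b1 = s_eps / math.sqrt((n - 1) * s_x_squared)

# Test statistic under H0: beta1 = 0.
t_stat = (b1 - 0) / s_b1

# Two-tail test: reject H0 when |t| exceeds the critical value.
reject_h0 = abs(t_stat) > t_critical

print(round(s_b1, 5), round(t_stat, 2), reject_h0)
```

With rounded inputs the statistic comes out near -13.45, which agrees with the Excel t Stat of -13.44 to within rounding.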

30

• Using the computer

[Excel SUMMARY OUTPUT for the regression: the t Stat for Odometer is -13.44035, with a P-value of 5.75078E-24]

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.

Car Price

31

Coefficient of determination \(R^2\)

– To measure the strength of the linear relationship we use the coefficient of determination.

[Figure, first panel: here the line explains all the variation between the different y values.]

[Figure, second panel: here the line explains only some of the variation among the y values (there are errors).]

32

Coefficient of determination

• The mathematical formulation of \(R^2\) is based on the following property (stated without proof):

Overall variation in y =

The variation of the regression model around the mean +

The variation of the actual points around the line

That is: SST = SSR + SSE

Observe next

33

Coefficient of Determination \(R^2\)

Coefficient of determination = [Variation in y explained by the linear relationship] / [Total variation in y]

\[ R^2 = \frac{SSR}{SST}, \quad \text{where } SST = \sum (y_i - \bar{y})^2 \text{ and } SSR = \sum (\hat{y}_i - \bar{y})^2 \]

34

Coefficient of determination - insight

SST = SSR + SSE

• When the relationship between x and y is perfectly linear, there is no deviation of the actual points from the line, so SSE = 0. Therefore, SSR = SST and \(R^2\) = SSR/SST = 1.

• When the relationship between x and y is not perfectly linear, there are deviations of the actual points from the regression line, so SSE > 0. Therefore, SSR < SST, and \(R^2\) = SSR/SST < 1.

• When no linear relationship exists between x and y, none of the total variation among the actual points is explained by a linear relationship, so SST = SSE. Therefore, \(R^2\) = 0.
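The decomposition SST = SSR + SSE can be verified numerically. The sketch below fits the least squares line to the four data points from the earlier line comparison and computes the three sums; the variable names are chosen for this example.

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares estimates.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation in y
ssr = sum((f - y_bar) ** 2 for f in fitted)           # variation explained by the line
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation

r_squared = ssr / sst
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```

For these points almost all of the variation is unexplained, so R² is close to 0, which matches the earlier observation that a nearly horizontal line fits them best.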

35

Coefficient of determination, Example

• Example 3
– Find the coefficient of determination for Example 2; what does this statistic tell you about the model?
• Solution
– Solving by hand; we use an alternative form of the \(R^2\) formula that makes it easier to use Excel.

\[ R^2 = \frac{[\mathrm{cov}(X, Y)]^2}{s_x^2 s_y^2} = \frac{[-2.909]^2}{(43.509)(.3000)} = .6483 \]
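The alternative form of R² uses only the sample covariance and the two sample variances. A minimal sketch with the rounded values from the slide is shown below; the coefficient of correlation r is recovered as the square root of R² carrying the sign of the covariance.

```python
import math

cov_xy = -2.909        # sample covariance of odometer and price
s_x_squared = 43.509   # sample variance of odometer
s_y_squared = 0.3000   # sample variance of price

# Coefficient of determination from summary statistics.
r_squared = cov_xy ** 2 / (s_x_squared * s_y_squared)

# Coefficient of correlation: same sign as the covariance.
r = math.copysign(math.sqrt(r_squared), cov_xy)

print(round(r_squared, 4), round(r, 4))
```

The magnitude of r agrees with the Multiple R value of about 0.805 reported in the Excel output.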

36

Coefficient of determination – Example

– Using the computer: from the regression output we have R Square = 0.6483.

65% of the variation in the auction selling price is explained by the variation in the odometer reading. The rest (35%) remains unexplained by this model.

37

Coefficient of Correlation

• The coefficient of correlation is used to measure the strength of association between two variables.

• The coefficient values range between -1 and 1.
– If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
– If r = 0 there is no linear pattern.

• The coefficient can be used to test for linear relationship between two variables.

38

16.6 Using the Regression Equation

• Before using the regression model, we need to assess how well it fits the data.

• If we are satisfied with how well the model fits the data, we can use it to predict the values of y.

• To make a prediction we use
– Point prediction, and
– Interval prediction

39

Point Prediction

• Example 5
– Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 2).

\[ \hat{y} = 17067 - .0623x = 17067 - .0623(40{,}000) = 14{,}575 \quad \text{(a point prediction)} \]

– It is predicted that a car with 40,000 miles on the odometer would sell for $14,575.

– How close is this prediction to the real price?

40

Interval Estimates

• Two intervals can be used to discover how closely the predicted value will match the true value of y.
– Prediction interval – predicts y for a given value of x,
– Confidence interval – predicts the average y for a given x.

In both formulas, \(x_g\) is the given value of x.

– The confidence interval

\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}} \]

– The prediction interval

\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}} \]

41

Interval Estimates, Example

• Example 5 - continued
– Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
– Two types of predictions are required:
• A prediction for a specific car
• A prediction for the mean price per car

42


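• Solution
– A prediction interval provides the price estimate for a single car:

\[ \hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n - 1) s_x^2}} \]

\[ [17.250 - .0669(40)] \pm 1.984(.3265)\sqrt{1 + \frac{1}{100} + \frac{(40 - 36.011)^2}{(100 - 1)(43.509)}} = 14.574 \pm .652 \]

Here b0 = 17.250, b1 = -.0669, \(x_g\) = 40, \(\bar{x}\) = 36.011, \(s_x^2\) = 43.509, \(s_\varepsilon\) = .3265, and \(t_{.025,98}\) is approximately 1.984.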

43

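• Solution – continued
– A confidence interval provides the estimate of the mean price per car for a Ford Taurus with a reading of 40,000 miles on the odometer.

• The confidence interval (95%) =

\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \]

\[ = 14.574 \pm 1.984(.3265)\sqrt{\frac{1}{100} + \frac{(40 - 36.011)^2}{(100 - 1)(43.509)}} = 14.574 \pm .076 \]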
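Both interval calculations share most of their arithmetic, which the sketch below makes explicit. It uses the rounded values from the slides, with b0 = 17.250 as the intercept implied by the point estimate 14.574, and takes the critical value 1.984 from the slide instead of computing it. The helper name `half_width` is an assumption for this example.

```python
import math

n = 100
b0, b1 = 17.250, -0.0669   # rounded estimates used in the interval slides
x_bar = 36.011             # mean odometer reading
s_x_squared = 43.509       # sample variance of odometer
s_eps = 0.3265             # standard error of estimate
t_critical = 1.984         # t.025 with 98 d.f., from the slide
x_g = 40                   # given odometer reading

y_hat = b0 + b1 * x_g      # point prediction

def half_width(include_individual_error):
    """Half-width of the interval at x_g.

    The prediction interval for a single car adds 1 under the square
    root; the confidence interval for the mean price does not.
    """
    extra = 1 if include_individual_error else 0
    spread = extra + 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x_squared)
    return t_critical * s_eps * math.sqrt(spread)

prediction_half = half_width(True)    # single car
confidence_half = half_width(False)   # mean price per car

print(round(y_hat, 3), round(prediction_half, 3), round(confidence_half, 3))
# prints 14.574 0.652 0.076
```

The prediction interval is much wider than the confidence interval because it must also cover the variation of an individual car around the mean price.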

