Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Summarizing Bivariate Data

Introduction to Linear Regression


Linear Relations

The relationship y = a + bx is the equation of a straight line.

The value b, called the slope of the line, is the amount by which y increases when x increases by 1 unit.

The value of a, called the intercept (or sometimes the vertical intercept) of the line, is the height of the line above the value x = 0.


Example

[Figure: graph of the line y = 7 + 3x for x from 0 to 8. The intercept is a = 7, and when x increases by 1, y increases by b = 3.]


Example

[Figure: graph of the line y = 17 - 4x for x from 0 to 8. The intercept is a = 17, and when x increases by 1, y changes by b = -4 (i.e., decreases by 4).]
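The two example lines can be checked numerically. The following is a minimal sketch; the helper name `line` is an assumption made for illustration and does not come from the slides.

```python
def line(a, b, x):
    """Height of the line y = a + b*x at the given x."""
    return a + b * x

# The two example lines from the slides: y = 7 + 3x and y = 17 - 4x.
for a, b in [(7, 3), (17, -4)]:
    # The change in y for a 1-unit increase in x equals the slope b,
    # whichever x we start from.
    changes = [line(a, b, x + 1) - line(a, b, x) for x in range(0, 5)]
    print(f"a = {a}, b = {b}: y at x = 0 is {line(a, b, 0)}, changes = {changes}")
```

The value at x = 0 reproduces the intercept a, and every 1-unit step in x changes y by exactly b.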


Least Squares Line

The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x1, y1), (x2, y2), ..., (xn, yn) is the sum of the squared deviations about the line:

$$\sum [y - (a + bx)]^2 = [y_1 - (a + bx_1)]^2 + [y_2 - (a + bx_2)]^2 + \cdots + [y_n - (a + bx_n)]^2$$

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.
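To make the criterion concrete, here is a small sketch that evaluates the sum of squared deviations for two candidate lines. The data set and the function name `sum_squared_deviations` are hypothetical and chosen only for illustration.

```python
def sum_squared_deviations(a, b, xs, ys):
    """Sum of [y - (a + b*x)]^2 over all data pairs."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# Two candidate lines; the one with the smaller sum fits better
# according to the least squares criterion.
sse_1 = sum_squared_deviations(0.0, 2.0, xs, ys)  # y = 0 + 2x
sse_2 = sum_squared_deviations(1.0, 1.5, xs, ys)  # y = 1 + 1.5x
print(sse_1, sse_2)
```

Of these two candidates, y = 0 + 2x has the smaller sum, so it fits these data better. The least squares line is the one that makes this sum as small as possible over all choices of a and b.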


Coefficients a and b

The slope of the least squares line is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and the y intercept is

$$a = \bar{y} - b\bar{x}$$

We write the equation of the least squares line as

$$\hat{y} = a + bx$$

where the ^ above y emphasizes that $\hat{y}$ (read as y-hat) is a prediction of y resulting from the substitution of a particular x value into the equation.


Calculating Formula for b

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
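The calculating formula is the definition of b with both sums expanded. A short derivation sketch, using only $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$:

```latex
\begin{align*}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\bar{x}\bar{y} \\
  &= \sum xy - \frac{(\sum x)(\sum y)}{n} - \frac{(\sum x)(\sum y)}{n}
     + \frac{(\sum x)(\sum y)}{n} \\
  &= \sum xy - \frac{(\sum x)(\sum y)}{n}, \\[4pt]
\sum (x - \bar{x})^2
  &= \sum x^2 - \frac{(\sum x)^2}{n}
  \quad \text{(the same expansion with $y$ replaced by $x$).}
\end{align*}
```

Dividing the first result by the second gives the calculating formula for b.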


Greyhound Example Continued

         x      y      x - x̄    (x - x̄)²     y - ȳ   (x - x̄)(y - ȳ)
       240     39    -85.615    7329.994   -20.038    1715.60
       430     81    104.385   10896.148    21.962    2292.45
        69     17   -256.615   65851.456   -42.038   10787.72
       607     96    281.385   79177.302    36.962   10400.41
       257     61    -68.615    4708.071     1.962    -134.59
       480   70.5    154.385   23834.609    11.462    1769.49
       340     65     14.385     206.917     5.962      85.75
       467     82    141.385   19989.609    22.962    3246.41
       335     67      9.385      88.071     7.962      74.72
       239     47    -86.615    7502.225   -12.038    1042.72
        95     20   -230.615   53183.456   -39.038    9002.87
       178     35   -147.615   21790.302   -24.038    3548.45
       496     87    170.385   29030.917    27.962    4764.22
Sum   4233  767.5              323589.08              48596.19


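Calculations

From the previous slide, we have

$$\sum (x - \bar{x})(y - \bar{y}) = 48596.19 \quad \text{and} \quad \sum (x - \bar{x})^2 = 323589.08$$

So

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{48596.19}{323589.08} = 0.15018$$

Also n = 13, $\sum x = 4233$ and $\sum y = 767.5$, so

$$\bar{x} = \frac{4233}{13} = 325.615 \quad \text{and} \quad \bar{y} = \frac{767.5}{13} = 59.0385$$

This gives

$$a = \bar{y} - b\bar{x} = 59.0385 - 0.15018(325.615) = 10.138$$

The regression line is $\hat{y} = 10.138 + 0.15018x$.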
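The hand calculation can be reproduced directly from the thirteen (x, y) pairs in the Greyhound table. This is a minimal sketch in plain Python; the variable names are illustrative assumptions.

```python
# Greyhound data from the table: distance x and standard fare y.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of products and squares of the deviations from the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.4f}")
print(f"S_xy = {s_xy:.2f}, S_xx = {s_xx:.2f}")
print(f"b = {b:.5f}, a = {a:.3f}")
```

Up to rounding, the printed values agree with the slide: b = 0.15018 and a = 10.138.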


Minitab Graph

The following graph is a copy of the output from a Minitab command to graph the regression line.

[Figure: Regression Plot of Standard Fare (vertical axis) versus Distance (horizontal axis). Fitted line: Standard Fare = 10.1380 + 0.150179 Distance; S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%]


Greyhound Example Revisited

         x      y        x²       xy
       240     39     57600     9360
       430     81    184900    34830
        69     17      4761     1173
       607     96    368449    58272
       257     61     66049    15677
       480   70.5    230400    33840
       340     65    115600    22100
       467     82    218089    38294
       335     67    112225    22445
       239     47     57121    11233
        95     20      9025     1900
       178     35     31684     6230
       496     87    246016    43152
Sum   4233  767.5   1701919   298506



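Using the calculation formula we have:

$$n = 13, \quad \sum x = 4233, \quad \sum y = 767.5, \quad \sum x^2 = 1701919, \quad \text{and} \quad \sum xy = 298506$$

so

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{298506 - \dfrac{(4233)(767.5)}{13}}{1701919 - \dfrac{(4233)^2}{13}} = \frac{48596.19}{323589.08} = 0.15018$$

As before,

$$a = \bar{y} - b\bar{x} = 59.0385 - 0.15018(325.615) = 10.138$$

and the regression line is $\hat{y} = 10.138 + 0.15018x$.

Notice that we get the same result.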
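The calculating formula needs only the five summary totals, not the individual deviations. The sketch below assumes those totals are already known; Σy is entered as 767.5, the exact sum of the thirteen fares listed in the table.

```python
# Summary totals for the Greyhound data.
n = 13
sum_x = 4233
sum_y = 767.5        # exact sum of the thirteen listed fares
sum_x2 = 1701919
sum_xy = 298506

# Calculating formula for the slope, then the intercept a = y_bar - b * x_bar.
numerator = sum_xy - sum_x * sum_y / n
denominator = sum_x2 - sum_x ** 2 / n
b = numerator / denominator
a = sum_y / n - b * sum_x / n

print(f"numerator = {numerator:.2f}, denominator = {denominator:.2f}")
print(f"b = {b:.5f}, a = {a:.3f}")
```

The numerator and denominator match the sums of deviation products and squared deviations found earlier, so b and a are unchanged.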


Three Important Questions

To examine how useful or effective the line is in summarizing the relationship between x and y, we consider the following three questions.

1. Is a line an appropriate way to summarize the relationship between the two variables?

2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?

3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?


Terminology

The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives

$\hat{y}_1 = a + bx_1$ = 1st predicted value
$\hat{y}_2 = a + bx_2$ = 2nd predicted value
...
$\hat{y}_n = a + bx_n$ = nth predicted value

The residuals for the least squares line are the values $y_1 - \hat{y}_1,\ y_2 - \hat{y}_2,\ \ldots,\ y_n - \hat{y}_n$.


Greyhound Example Continued

Predicted values are $\hat{y} = 10.1 + 0.150x$ and residuals are $y - \hat{y}$.

    x      y   Predicted value   Residual
  240     39        46.18          -7.181
  430     81        74.72           6.285
   69     17        20.50          -3.500
  607     96       101.30          -5.297
  257     61        48.73          12.266
  480   70.5        82.22         -11.724
  340     65        61.20           3.801
  467     82        80.27           1.728
  335     67        60.45           6.552
  239     47        46.03           0.969
   95     20        24.41          -4.405
  178     35        36.87          -1.870
  496     87        84.63           2.373
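The predicted values and residuals in the table can be regenerated from the data and the fitted coefficients. This is a sketch that assumes the rounded coefficients reported earlier (a = 10.138, b = 0.15018), so the last digit of a residual may differ slightly from Minitab's.

```python
# Greyhound data: distance x and standard fare y.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

# Rounded least squares coefficients reported on the slides.
a, b = 10.138, 0.15018

fitted = [a + b * xi for xi in x]                    # predicted values y-hat
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # residuals y - y-hat

for xi, yi, fi, ri in zip(x, y, fitted, residuals):
    print(f"{xi:5d} {yi:6.1f} {fi:8.2f} {ri:8.3f}")
```

The residuals from a least squares line sum to zero; with the rounded coefficients used here the sum is only approximately zero.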


Residual Plot

A residual plot is a scatter plot of the data pairs (x, residual). The following plot was produced by Minitab from the Greyhound example.

[Figure: Residuals Versus x (response is y); horizontal axis x, vertical axis Residual]


Residual Plot - What to look for.

[Figure: residual plot; horizontal axis x, vertical axis Residual]

Isolated points or patterns indicate potential problems.

Ideally the points should be randomly spread out above and below zero.

This residual plot indicates no systematic bias using the least squares line to predict the y value. Generally this is the kind of pattern that you would like to see.

Note:
1. Values below 0 indicate overprediction.
2. Values above 0 indicate underprediction.


The Greyhound example continued

[Figure: Residuals Versus x (response is y), annotated to mark the region where predicted fares are too high and the region where predicted fares are too low]

For the Greyhound example, it appears that the line systematically predicts fares that are too high for cities close to Rochester and predicts fares that are too low for most cities between 200 and 500 miles.
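The pattern described for the Greyhound residuals can also be checked without a plot by looking at the signs of the residuals in each distance range. The sketch below assumes the rounded coefficients a = 10.138 and b = 0.15018, and it treats "close to Rochester" as under 200 miles, a cutoff chosen here for illustration and not stated on the slides.

```python
# Greyhound data: distance x and standard fare y.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

a, b = 10.138, 0.15018  # rounded coefficients from the slides
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Assumed cutoff: "close" means under 200 miles (illustrative only).
near = [r for xi, r in zip(x, residuals) if xi < 200]
middle = [r for xi, r in zip(x, residuals) if 200 <= xi <= 500]

# Negative residual: prediction too high. Positive residual: prediction too low.
print("under 200 miles, too high:", sum(r < 0 for r in near), "of", len(near))
print("200 to 500 miles, too low:", sum(r > 0 for r in middle), "of", len(middle))
```

Under these assumptions, every city in the near group has a negative residual and most cities in the 200 to 500 mile group have a positive one, which is the systematic pattern the residual plot shows.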


More Residual Plots

Another common type of residual plot is a scatter plot of the data pairs ($\hat{y}$, residual). The following plot was produced by Minitab for the Greyhound data. Notice that this residual plot shows the same type of systematic problems with the model.

[Figure: Residuals Versus the Fitted Values (response is y); horizontal axis Fitted Value, vertical axis Residual]


Definition formulae

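The total sum of squares, denoted by SSTo, is defined as

$$\text{SSTo} = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2 = \sum (y - \bar{y})^2$$

The residual sum of squares, denoted by SSResid, is defined as

$$\text{SSResid} = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 = \sum (y - \hat{y})^2$$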


Calculational formulae

SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:

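$$\text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$\text{SSResid} = \sum y^2 - a\sum y - b\sum xy$$

The coefficient of determination, $r^2$, can be computed as

$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$$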


Coefficient of Determination

The coefficient of determination, denoted by $r^2$, gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

Note that the coefficient of determination is the square of the Pearson correlation coefficient.
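The link between $r^2$ and the Pearson correlation coefficient can be verified numerically. The sketch below uses the Greyhound data and computes r from the usual deviation form, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² Σ(y - ȳ)²); the variable names are illustrative assumptions.

```python
import math

# Greyhound data: distance x and standard fare y.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # this is SSTo

# Pearson correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Coefficient of determination from the least squares fit.
b = s_xy / s_xx
a = y_bar - b * x_bar
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ss_resid / s_yy

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, 1 - SSResid/SSTo = {r_squared:.4f}")
```

Both routes give the same value, about 0.935 for these data.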



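For the Greyhound example:

$$n = 13, \quad \sum y = 767.5, \quad \sum y^2 = 53119.25, \quad \sum xy = 298506, \quad b = 0.150179 \quad \text{and} \quad a = 10.1380$$

$$\text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n} = 53119.25 - \frac{(767.5)^2}{13} = 7807.23$$

$$\text{SSResid} = \sum y^2 - a\sum y - b\sum xy = 53119.25 - 10.1380(767.5) - 0.150179(298506) = 509.117$$

The value 509.117 is obtained with unrounded a and b. The computational formula for SSResid is sensitive to rounding, so carrying only the digits shown gives a slightly different result.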
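The definition formulas and the computational formulas for the two sums of squares can be compared on the Greyhound data. This is a minimal sketch with illustrative variable names; a and b are kept unrounded because the computational formula for SSResid is sensitive to rounding.

```python
# Greyhound data: distance x and standard fare y.
x = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
y = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Definition formulas.
ss_to_def = sum((yi - y_bar) ** 2 for yi in y)
ss_resid_def = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Computational formulas.
sum_y = sum(y)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
ss_to_calc = sum_y2 - sum_y ** 2 / n
ss_resid_calc = sum_y2 - a * sum_y - b * sum_xy

print(f"SSTo:    {ss_to_def:.2f} (definition), {ss_to_calc:.2f} (computational)")
print(f"SSResid: {ss_resid_def:.3f} (definition), {ss_resid_calc:.3f} (computational)")
```

Both pairs agree, giving SSTo of about 7807.23 and SSResid of about 509.117.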


$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{509.117}{7807.23} = 0.9348$$

We can say that 93.5% of the variation in the fare (y) can be attributed to the least squares linear relationship between distance (x) and fare.


More on variability

The standard deviation about the least squares line is denoted $s_e$ and given by

$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}}$$

$s_e$ is interpreted as the “typical” amount by which an observation deviates from the least squares line.


$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}} = \sqrt{\frac{509.117}{11}} = 6.80$$

The “typical” deviation of actual fare from the prediction is $6.80.
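The arithmetic for the standard deviation about the line is short enough to check in a few lines. This sketch takes SSResid = 509.117 and n = 13 from the Greyhound example as given.

```python
import math

ss_resid = 509.117   # residual sum of squares for the Greyhound data
n = 13               # number of (x, y) pairs

# Standard deviation about the least squares line; the divisor is n - 2.
s_e = math.sqrt(ss_resid / (n - 2))
print(f"s_e = {s_e:.3f}")
```

The result rounds to 6.80, in line with the value S reported in the Minitab output.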


Minitab output for Regression

Regression Analysis: Standard Fare versus Distance

The regression equation is
Standard Fare = 10.1 + 0.150 Distance

Predictor      Coef   SE Coef       T       P
Constant     10.138     4.327    2.34   0.039
Distance    0.15018   0.01196   12.56   0.000

S = 6.803   R-Sq = 93.5%   R-Sq(adj) = 92.9%

Analysis of Variance

Source            DF       SS       MS        F       P
Regression         1   7298.1   7298.1   157.68   0.000
Residual Error    11    509.1     46.3
Total             12   7807.2

In this output, the regression equation is the least squares regression line. The Constant coefficient is a and the Distance coefficient is b. S is $s_e$ and R-Sq is $r^2$. The Residual Error SS is SSResid and the Total SS is SSTo.
