© buddy freeman, 2015 simple linear regression and correlation

© Buddy Freeman, 2015

Simple Linear Regression and Correlation


Tennessee Energy Consumption Estimates (taken from http://www.eia.doe.gov/emeu/states/sep_use/tra/use_tra_tn.html)

Year    Motor Gasoline           Ethanol
        (in 1000s of barrels)    (in 1000s of barrels)
1983    53,310                     281
1984    56,348                     592
1985    57,068                     686
1986    59,317                     857
1987    56,506                   1,277
1988    58,224                   1,410
1989    58,937                   1,079
1990    56,954                     583
1991    55,187                     426
1992    57,667                     516
1993    60,286                     593
1994    62,062                     841
1995    63,907                     358
1996    63,928                       7
1997    65,162                       7
1998    66,842                       8
1999    69,151                       0
2000    68,252                       0
2001    67,385                       0

Correlation question: From 1983 to 2001 in the state of Tennessee, were motor gasoline consumption and ethanol consumption significantly related to each other?

In a correlation problem, one is interested in measuring the strength of the relationship between variables.
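One common way to measure the strength of a linear relationship is the Pearson correlation coefficient, r. The sketch below computes it for the Tennessee data. The choice of Pearson's r and the helper name `pearson_r` are assumptions made for illustration, and the code alone does not answer whether the relationship is significant.

```python
from math import sqrt

# Tennessee data, 1983-2001, in 1000s of barrels (from the table above).
gasoline = [53310, 56348, 57068, 59317, 56506, 58224, 58937, 56954, 55187,
            57667, 60286, 62062, 63907, 63928, 65162, 66842, 69151, 68252,
            67385]
ethanol = [281, 592, 686, 857, 1277, 1410, 1079, 583, 426,
           516, 593, 841, 358, 7, 7, 8, 0, 0, 0]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ssx = sum((a - mean_x) ** 2 for a in x)
    ssy = sum((b - mean_y) ** 2 for b in y)
    return cov_xy / sqrt(ssx * ssy)


r = pearson_r(gasoline, ethanol)
print(f"r = {r:.3f}")
```

A value of r near -1 or +1 indicates a strong linear relationship; a value near 0 indicates a weak one.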



Regression question: From 1983 to 2001 in the state of Tennessee, could the ethanol consumption in one year have been used to predict motor gasoline consumption in the following year?

In a regression problem, one is interested in predicting one variable (called the dependent variable) based on another variable (called the independent variable).



The Key Word

Linear: A Straight Line


What is the equation for a straight line?

Do you recall $y = mx + b$?

What is $m$? Answer: the slope

What is $b$? Answer: the y-intercept

$x$ is the independent variable, and $y$ is the dependent variable.

In the text, the equation is given by:

$$\hat{y} = b_0 + b_1 x$$


The General Simple Linear Regression Problem

Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yields the “best” fit to these points.

$$\hat{Y}_i = b_0 + b_1 x_i$$

i    Xi    Yi
1    X1    Y1
2    X2    Y2
.    .     .
.    .     .
.    .     .
n    Xn    Yn


Visually

[Figure: scatter plot of the sample points, X on the horizontal axis and Y on the vertical axis, with the line $\hat{Y}_i = b_0 + b_1 x_i$ drawn through them.]

What does “best” mean? By “best” we mean the smallest error in prediction.


Error Defined

By “best” we mean the smallest error in prediction.

If one picks an arbitrary point in the random sample, $(X_i, Y_i)$, how “far” is the point from the line $\hat{Y}_i = b_0 + b_1 x_i$?

$Y_i$ is the actual y-value.

$\hat{Y}_i = b_0 + b_1 x_i$ is the predicted y-value (the value on the line).

The error is the difference between $Y_i$ and $\hat{Y}_i$.

$$\text{Error} = Y_i - \hat{Y}_i$$


General Problem Restated

$$\text{Error} = Y_i - \hat{Y}_i$$

[Figure: scatter plot of Y against X with a line through the points; braces mark the error of each point above and below the line.]

Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yields the smallest error over all the sample.

What would you want $\sum (Y_i - \hat{Y}_i)$ to be?

The errors for the points above the line should balance the errors for the points below the line, resulting in a sum of zero.

Unfortunately, there are an infinite number of lines possessing this property. Any line that passes through the point $(\bar{X}, \bar{Y})$ will have this property, because it is a property of the mean.
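This property can be checked numerically. The sketch below uses the commercials and water-consumption sample that appears later in these slides, together with a few arbitrarily chosen slopes (hypothetical values picked only for illustration). Every line through the point of means gives an error sum of zero, up to floating-point rounding.

```python
# Sample from later in these slides: X = commercials, Y = gallons of water.
x = [11, 7, 12, 8, 10, 13, 8, 10, 14, 7]
y = [8000, 5000, 9000, 4000, 8000, 10000, 5000, 7000, 10000, 4000]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)


def error_sum(slope):
    """Sum of (Y - predicted Y) for the line with this slope through the means."""
    intercept = mean_y - slope * mean_x
    return sum(yi - (intercept + slope * xi) for xi, yi in zip(x, y))


# Arbitrary slopes (hypothetical): very different lines, same zero error sum.
error_sums = {slope: error_sum(slope) for slope in (-500.0, 0.0, 250.0, 910.714)}
for slope, total in error_sums.items():
    print(f"slope {slope:>9.3f}: sum of errors = {total:.6f}")
```

Because a zero error sum cannot single out one line, a different criterion is needed.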


General Problem Restated in terms of Least Squares

$$\text{Error} = Y_i - \hat{Y}_i$$

Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yields the smallest sum of the squares of the errors (SSE) over all the sample.

Find the value of $b_0$ and the value of $b_1$ that will minimize

$$SSE = \sum (Y_i - \hat{Y}_i)^2, \quad \text{where } \hat{Y}_i = b_0 + b_1 x_i$$


Solution of the Least Squares Problem

$$\text{Error} = Y_i - \hat{Y}_i$$

$$SSE = \sum (Y_i - \hat{Y}_i)^2, \quad \text{where } \hat{Y}_i = b_0 + b_1 x_i$$

$$SSE = \sum (Y_i - b_0 - b_1 x_i)^2$$

Noting that SSE is a function of two variables, we can restate the problem once again.

$$f(b_0, b_1) = SSE = \sum (Y_i - b_0 - b_1 x_i)^2$$

Finding the values of variables that will maximize/minimize a function is a calculus problem. Because calculus is not a prerequisite to this course, the details are omitted, but the process results in two equations and two unknowns.
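For readers who have seen calculus, the omitted step can be sketched as follows. This derivation is not part of the original slides and is not required for the course: set both partial derivatives of $f(b_0, b_1) = \sum (Y_i - b_0 - b_1 x_i)^2$ to zero and rearrange.

```latex
\begin{align*}
\frac{\partial f}{\partial b_0} &= -2 \sum (Y_i - b_0 - b_1 x_i) = 0
  &&\Longrightarrow\quad n b_0 + b_1 \sum x_i = \sum Y_i \\
\frac{\partial f}{\partial b_1} &= -2 \sum x_i (Y_i - b_0 - b_1 x_i) = 0
  &&\Longrightarrow\quad b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i Y_i
\end{align*}
```

These are the two equations in the two unknowns $b_0$ and $b_1$.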


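The Normal Equations

matrix form:

$$\begin{bmatrix} n & \sum x \\ \sum x & \sum x^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum xy \end{bmatrix}$$

or

algebraic form:

$$n b_0 + b_1 \sum x = \sum y$$

$$b_0 \sum x + b_1 \sum x^2 = \sum xy$$

There are many ways to solve a system of two equations and two unknowns. If you have a favorite, feel free to use it.

Two relationships that I expect you to know are:

$$b_1 = \frac{\text{covariation in } X \text{ \& } Y}{\text{variation in } X} \quad \text{and} \quad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$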



Now the specifics are introduced with an example.



The Random Sample

Day    X = Number of Commercials    Y = Water Consumption (gallons)
 1     11                            8,000
 2      7                            5,000
 3     12                            9,000
 4      8                            4,000
 5     10                            8,000
 6     13                           10,000
 7      8                            5,000
 8     10                            7,000
 9     14                           10,000
10      7                            4,000


Generate Graph

First, graph the data. The scatter plot of the data may indicate that a linear model is totally inappropriate and a waste of time.

The following three slides give some examples of nonlinear patterns.

Following the nonlinear examples, the graph of the data in the random sample is constructed.


Example of a Nonlinear Pattern

[Figure: quadratic pattern, $f(x) = ax^2 + bx + c$.]


[Figure: cubic pattern, $f(x) = ax^3 + bx^2 + cx + d$.]



[Figure: exponential pattern, $f(x) = be^{ax}$.]



[Figure: the exponential pattern $f(x) = be^{ax}$ plotted again after taking logarithms.]

$$\ln f(x) = \ln\left(be^{ax}\right) = \ln b + ax$$

Transformed to a linear pattern
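The effect of the logarithm can be sketched in code. The values of a and b below are hypothetical, chosen only for illustration, and the helper `fit_line` is a minimal least squares fit written for this sketch. Fitting a straight line to (x, ln f(x)) recovers a as the slope and ln b as the y-intercept.

```python
from math import exp, log

# Hypothetical parameters for f(x) = b * e^(a x); not from the slides.
a_true, b_true = 0.2, 3.0
xs = list(range(1, 21))
ys = [b_true * exp(a_true * x) for x in xs]

# Transform: ln f(x) = ln b + a x, which is linear in x.
log_ys = [log(y) for y in ys]


def fit_line(x, y):
    """Least squares y-intercept and slope of a straight line through (x, y)."""
    n = len(x)
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ssx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    slope = cov_xy / ssx
    intercept = sum(y) / n - slope * sum(x) / n
    return intercept, slope


intercept, slope = fit_line(xs, log_ys)
a_est, b_est = slope, exp(intercept)
print(f"a = {a_est:.4f}, b = {b_est:.4f}")
```

With real data the transformed points would only roughly follow a line, so the recovered values would be estimates rather than exact.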



The Scatter Graph

[Figure: scatter plot of H2O Consumption (vertical axis) against Number of Commercials (horizontal axis), showing the points (11, 8000), (7, 5000), (12, 9000), (8, 4000), (10, 8000), (13, 10000), (8, 5000), (10, 7000), (14, 10000), and (7, 4000).]



Find the slope and the y-intercept of the line $\hat{Y}_i = b_0 + b_1 x_i$ that is the “best” fit to these points.

[Figure: The Scatter Graph (with “guesstimated” line).]


The Initial Calculations

Worksheet:

Day         X      X²         XY         Y             Y²
1          11     121     88,000     8,000     64,000,000
2           7      49     35,000     5,000     25,000,000
3          12     144    108,000     9,000     81,000,000
4           8      64     32,000     4,000     16,000,000
5          10     100     80,000     8,000     64,000,000
6          13     169    130,000    10,000    100,000,000
7           8      64     40,000     5,000     25,000,000
8          10     100     70,000     7,000     49,000,000
9          14     196    140,000    10,000    100,000,000
10          7      49     28,000     4,000     16,000,000
TOTALS    100   1,056    751,000    70,000    540,000,000
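The worksheet totals can be reproduced with a few lines of code. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# The random sample: X = number of commercials, Y = water consumption (gallons).
x = [11, 7, 12, 8, 10, 13, 8, 10, 14, 7]
y = [8000, 5000, 9000, 4000, 8000, 10000, 5000, 7000, 10000, 4000]

n = len(x)
sum_x = sum(x)                                   # total of the X column
sum_x2 = sum(xi ** 2 for xi in x)                # total of the X squared column
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # total of the XY column
sum_y = sum(y)                                   # total of the Y column
sum_y2 = sum(yi ** 2 for yi in y)                # total of the Y squared column

print(n, sum_x, sum_x2, sum_xy, sum_y, sum_y2)
```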


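Some Basic Formulas

Each quantity is shown first with its definitional formula and then with its computational formula.

$$SSX = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{\left(\sum X\right)^2}{n}$$

$$SSY = SST = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}$$

$$\text{cov}(X \,\&\, Y) = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{\left(\sum X\right)\left(\sum Y\right)}{n}$$

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = \frac{\left[\text{cov}(X \,\&\, Y)\right]^2}{SSX}$$

$$SSE = \sum (Y_i - \hat{Y}_i)^2 = SST - SSR$$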


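X = Number of Commercials; Y = Water Consumption (gallons)

$$\sum X = 100, \quad \sum X^2 = 1{,}056, \quad \sum XY = 751{,}000, \quad \sum Y = 70{,}000, \quad \sum Y^2 = 540{,}000{,}000, \quad n = 10$$

Sample regression model: $\hat{Y}_i = b_0 + b_1 x_i$, the predicted value.

$$\text{slope} = b_1 = \frac{\text{covariation in } X \text{ and } Y}{\text{variation in } X} = \frac{\text{cov}(X \,\&\, Y)}{SSX} = \frac{751{,}000 - \frac{(100)(70{,}000)}{10}}{1{,}056 - \frac{(100)^2}{10}} = \frac{51{,}000}{56} = 910.714$$

$$Y\text{-intercept} = b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n} = \frac{70{,}000}{10} - 910.714 \left(\frac{100}{10}\right) = -2{,}107.14$$

$\hat{Y}_i = -2{,}107.14 + 910.714\, x_i$, the predicted value.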
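The same arithmetic can be sketched in code, starting from the worksheet totals. As a cross-check, the sketch also solves the two normal equations directly with Cramer's rule. That solution method is a choice made here for illustration; any method for two equations in two unknowns would do.

```python
# Worksheet totals for the random sample.
n = 10
sum_x, sum_x2, sum_xy, sum_y = 100, 1056, 751000, 70000

# Computational formulas for covariation and variation.
cov_xy = sum_xy - sum_x * sum_y / n     # covariation in X and Y
ssx = sum_x2 - sum_x ** 2 / n           # variation in X

b1 = cov_xy / ssx                       # slope
b0 = sum_y / n - b1 * sum_x / n         # y-intercept

# Cross-check: solve the normal equations
#   n*b0     + sum_x*b1  = sum_y
#   sum_x*b0 + sum_x2*b1 = sum_xy
# with Cramer's rule.
det = n * sum_x2 - sum_x * sum_x
b0_check = (sum_y * sum_x2 - sum_x * sum_xy) / det
b1_check = (n * sum_xy - sum_x * sum_y) / det

print(f"b1 = {b1:.3f}, b0 = {b0:.2f}")
```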


Interpretation of the Slope and the Y-intercept

$\hat{Y}_i = -2{,}107.14 + 910.714\, x_i$, the predicted value.

X = Number of Commercials; Y = Water Consumption (gallons)

Interpret the slope. (What does the slope mean in terms of the problem?)

For each additional commercial, we expect the water consumption to increase by 910.714 gallons.

Interpret the y-intercept. (What does the y-intercept mean in terms of the problem?)

If there are no commercials, we expect the water consumption to be a negative 2,107.14 gallons.

?????????? Think about it. ??????????


[Figure: “Welcome to Mulvany, Tennessee.” A diagram labeled City Water Plant, Reservoir, Sensor line, and River.]

If the water consumption is a negative 2,107.14 gallons, which way is the water flowing in the pipe from the reservoir to the city?

We know that the water does not flow back into the reservoir.

Does this result mean that the regression model is worthless?


Interpolation versus Extrapolation

In the random sample, the smallest X is 7 (days 2 and 10) and the largest X is 14 (day 9).

Between the smallest (7) and the largest (14) values of X used to compute the sample regression model, we may interpolate with statistical significance.

[Figure: the scatter graph of H2O Consumption with the line $\hat{Y}_i = b_0 + b_1 x_i$. The relevant range runs from the smallest X (7), at the points (7, 5000) and (7, 4000), to the largest X (14), at the point (14, 10000). Predictions inside the relevant range are interpolation; predictions on either side of it are extrapolation.]
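One way to keep the distinction in view is to report, with every prediction, whether the X value falls inside the relevant range. The function `predict` below is a hypothetical helper written for this sketch; it uses the rounded coefficients from the sample regression model.

```python
# Rounded coefficients of the sample regression model.
B0 = -2107.14
B1 = 910.714

# Relevant range: smallest and largest X in the random sample.
X_MIN, X_MAX = 7, 14


def predict(commercials):
    """Return (predicted gallons, True if the prediction is an interpolation)."""
    predicted = B0 + B1 * commercials
    interpolating = X_MIN <= commercials <= X_MAX
    return predicted, interpolating


results = {c: predict(c) for c in (0, 10, 20)}
for c, (gallons, interpolating) in results.items():
    label = "interpolation" if interpolating else "extrapolation"
    print(f"{c:>2} commercials: {gallons:>10,.2f} gallons ({label})")
```

The negative prediction at zero commercials is an extrapolation far below the smallest observed X, which is why it makes no physical sense.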

To determine if the model has statistical significance, we still have to perform some more calculations.


Calculation of SSE by Definition

The predicted values come from the sample regression equation $\hat{Y}_i = -2{,}107.14 + 910.714\, X_i$.

 i    Xi       Yi          Ŷi      Yi − Ŷi       (Yi − Ŷi)²
 1    11    8,000   7,910.714       89.286        7,971.990
 2     7    5,000   4,267.858      732.142      536,031.908
 3    12    9,000   8,821.428      178.572       31,887.959
 4     8    4,000   5,178.572   -1,178.572    1,389,031.959
 5    10    8,000   7,000.000    1,000.000    1,000,000.000
 6    13   10,000   9,732.142      267.858       71,747.908
 7     8    5,000   5,178.572     -178.572       31,887.959
 8    10    7,000   7,000.000        0.000            0.000
 9    14   10,000  10,642.856     -642.856      413,263.837
10     7    4,000   4,267.858     -267.858       71,747.908
                        Sum:         0.000    SSE = 3,553,571.429

First, you insert the Xi values into the sample regression equation to calculate the predicted values.

Second, you calculate the deviations of the points from the line.

Finally, you calculate the squares of the deviations of the points from the line and sum them to obtain SSE.
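The three steps can be sketched in code. The sketch uses unrounded coefficients computed from the sample, so individual predicted values may differ from the slide's table in the last decimal places.

```python
# The random sample: X = number of commercials, Y = water consumption (gallons).
x = [11, 7, 12, 8, 10, 13, 8, 10, 14, 7]
y = [8000, 5000, 9000, 4000, 8000, 10000, 5000, 7000, 10000, 4000]
n = len(x)

# Unrounded least squares coefficients from the computational formulas.
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ssx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = cov_xy / ssx
b0 = sum(y) / n - b1 * sum(x) / n

# First: predicted values from the sample regression equation.
predicted = [b0 + b1 * xi for xi in x]

# Second: deviations of the points from the line.
errors = [yi - yhat for yi, yhat in zip(y, predicted)]

# Finally: square the deviations and sum them to obtain SSE.
sse = sum(e ** 2 for e in errors)

print(f"sum of errors = {sum(errors):.3f}, SSE = {sse:,.3f}")
```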


Calculation of SSE by “Backing” into it

$$SST = SSR + SSE$$

SSR = variation explained by regression

SSE = variation not explained by regression

Therefore,

$$SSE = SST - SSR.$$

However,

$$SST = SSY = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} = 540{,}000{,}000 - \frac{(70{,}000)^2}{10} = 50{,}000{,}000$$

and

$$SSR = \frac{\left[\text{cov}(X \,\&\, Y)\right]^2}{SSX} = \frac{(51{,}000)^2}{56} = 46{,}446{,}428.5714$$

Hence,

$$SSE = SST - SSR = 50{,}000{,}000 - 46{,}446{,}428.5714 = 3{,}553{,}571.4286$$
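The shortcut needs only quantities that are already known from the worksheet totals, as this minimal sketch shows (variable names are chosen here for illustration).

```python
# Quantities already computed from the worksheet totals.
n = 10
sum_y, sum_y2 = 70000, 540000000
cov_xy, ssx = 51000, 56

sst = sum_y2 - sum_y ** 2 / n   # total variation in Y (SSY)
ssr = cov_xy ** 2 / ssx         # variation explained by regression
sse = sst - ssr                 # variation not explained by regression

print(f"SST = {sst:,.4f}, SSR = {ssr:,.4f}, SSE = {sse:,.4f}")
```

The result agrees with the SSE obtained from the definition, without computing any predicted values.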


Calculation of the Standard Error of the Estimate

$$\text{error variance} = S_e^2 = MSE = \frac{SSE}{n-2} = \frac{3{,}553{,}571.4286}{8} = 444{,}196.42857$$

$$\text{standard error of the estimate} = S_e = \sqrt{S_e^2} = \sqrt{444{,}196.42857} = 666.48 \text{ gallons}$$

Interpretation: The “typical” error made when predicting the number of gallons of water consumed based on the number of commercials is about 666.48 gallons.
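In code, the calculation is two lines once SSE is known. This is a minimal sketch using the SSE value from the slides.

```python
from math import sqrt

sse = 3553571.4286   # sum of squared errors, from the slides
n = 10               # sample size

mse = sse / (n - 2)  # error variance, S_e squared
s_e = sqrt(mse)      # standard error of the estimate, in gallons

print(f"MSE = {mse:,.5f}, S_e = {s_e:.2f} gallons")
```

The divisor is n - 2 rather than n because two quantities, the slope and the y-intercept, were estimated from the sample.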


The Question

At the .05 level of significance, is there evidence that a linear relationship exists between the number of commercials and water consumption?

We have almost enough calculated to be able to answer the question.

Just one more...


Calculation of the Standard Error of the Slope

standard error of the slope = $S_{b_1}$

(also called the standard error of the regression coefficient, $b_1$)

$$S_{b_1} = \frac{S_e}{\sqrt{SSX}} = \frac{666.48}{\sqrt{56}} = 89.062$$
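A minimal sketch of the same calculation, using the rounded standard error of the estimate from the slides:

```python
from math import sqrt

s_e = 666.48   # standard error of the estimate (gallons), from the slides
ssx = 56       # variation in X

s_b1 = s_e / sqrt(ssx)   # standard error of the slope

print(f"S_b1 = {s_b1:.3f}")
```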


Test Statistics for Regression

$$t_{n-2} = \frac{b_1 - \beta_1}{S_{b_1}}$$

or

$$F_{1,\,n-2} = \frac{MSR}{MSE} = \frac{S_R^2}{S_e^2}$$

Now, what is $\beta_1$?

Well, that’s another story.
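Until that story is told, the arithmetic of the two test statistics can still be sketched. The sketch assumes a hypothesized value of $\beta_1 = 0$ (no linear relationship); that value is an assumption made here and is not stated in these slides. It also takes MSR to be SSR divided by 1, matching the 1 in the subscript of F. No critical values or conclusions are computed.

```python
from math import sqrt

# Quantities computed earlier from the random sample.
n = 10
cov_xy, ssx, sst = 51000, 56, 50000000

b1 = cov_xy / ssx             # slope
ssr = cov_xy ** 2 / ssx       # variation explained by regression
sse = sst - ssr               # variation not explained by regression

msr = ssr / 1                 # mean square regression (assumed 1 degree of freedom)
mse = sse / (n - 2)           # mean square error
s_b1 = sqrt(mse / ssx)        # standard error of the slope, S_e / sqrt(SSX)

beta1_null = 0.0              # ASSUMPTION: hypothesized slope, not given in the slides
t_stat = (b1 - beta1_null) / s_b1
f_stat = msr / mse

print(f"t = {t_stat:.4f}, F = {f_stat:.4f}, t squared = {t_stat ** 2:.4f}")
```

Under this assumption the two statistics carry the same information: the square of t equals F.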


The End
