© Buddy Freeman, 2015 Simple Linear Regression and Correlation

Upload: margaretmargaret-hubbard

Post on 26-Dec-2015


Page 1: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Simple Linear Regression and Correlation

Page 2: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Tennessee Energy Consumption Estimates (taken from http://www.eia.doe.gov/emeu/states/sep_use/tra/use_tra_tn.html)

Year    Motor Gasoline           Ethanol
        (in 1000s of barrels)    (in 1000s of barrels)
1983    53,310                     281
1984    56,348                     592
1985    57,068                     686
1986    59,317                     857
1987    56,506                   1,277
1988    58,224                   1,410
1989    58,937                   1,079
1990    56,954                     583
1991    55,187                     426
1992    57,667                     516
1993    60,286                     593
1994    62,062                     841
1995    63,907                     358
1996    63,928                       7
1997    65,162                       7
1998    66,842                       8
1999    69,151                       0
2000    68,252                       0
2001    67,385                       0

Correlation question: From 1983 to 2001 in the state of Tennessee, were motor gasoline consumption and ethanol consumption significantly related to each other?

In a correlation problem, one is interested in measuring the strength of the relationship between variables.
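One common way to measure that strength is the sample (Pearson) correlation coefficient. The slides do not define it at this point, so the formula and the function name `pearson_r` below are assumptions of this sketch; only the data come from the table.

```python
import math

# Tennessee data from the table (1000s of barrels), 1983-2001.
gasoline = [53310, 56348, 57068, 59317, 56506, 58224, 58937, 56954, 55187,
            57667, 60286, 62062, 63907, 63928, 65162, 66842, 69151, 68252, 67385]
ethanol = [281, 592, 686, 857, 1277, 1410, 1079, 583, 426,
           516, 593, 841, 358, 7, 7, 8, 0, 0, 0]

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists (assumed formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ssx = sum((x - mean_x) ** 2 for x in xs)
    ssy = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

r = pearson_r(gasoline, ethanol)
print(f"r = {r:.3f}")
```

A value near -1 or +1 would indicate a strong linear relationship; whether it is statistically significant is a separate test.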

Page 3: © Buddy Freeman, 2015 Simple Linear Regression and Correlation


Regression question: From 1983 to 2001 in the state of Tennessee, could the ethanol consumption in one year have been used to predict motor gasoline consumption in the following year?

In a regression problem, one is interested in predicting one variable (called the dependent variable) based on another variable (called the independent variable).

Page 4: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Simple Linear Regression and Correlation

The Key Word

Page 5: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Simple Linear Regression and Correlation

A Straight Line

Page 6: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

What is the equation for a straight line?

Do you recall $y = mx + b$?

What is $m$? Answer: the slope

What is $b$? Answer: the y-intercept

$x$ is the independent variable, and $y$ is the dependent variable.

In the text, the equation is given by:

$\hat{y} = b_0 + b_1 x$

Page 7: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The General Simple Linear Regression Problem

Given a random sample of the related x and y values,

i    Xi    Yi
1    X1    Y1
2    X2    Y2
.    .     .
.    .     .
.    .     .
n    Xn    Yn

find the value of the slope and the value of the y-intercept that yield the “best” fit to these points:

$\hat{Y}_i = b_0 + b_1 x_i$

Page 8: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Visually

[Figure: scatter plot of Y against X with the line $\hat{Y}_i = b_0 + b_1 x_i$]

Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yield the “best” fit to these points.

What does “best” mean? By “best” we mean the smallest error in prediction.

Page 9: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Error Defined

[Figure: scatter plot of Y against X with the line $\hat{Y}_i = b_0 + b_1 x_i$; a brace marks the error for one point]

By “best” we mean the smallest error in prediction.

If one picks an arbitrary point in the random sample, $(X_i, Y_i)$, how “far” is the point from the line $\hat{Y}_i = b_0 + b_1 x_i$?

$Y_i$ is the actual y-value.

$\hat{Y}_i = b_0 + b_1 x_i$ is the predicted y-value (the value on the line).

The error is the difference between $Y_i$ and $\hat{Y}_i$.

Error = $Y_i - \hat{Y}_i$

Page 10: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

General Problem Restated

[Figure: scatter plot of Y against X with a fitted line; braces mark the errors for points above and below the line]

Error = $Y_i - \hat{Y}_i$

Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yield the smallest error over all the sample.

What would you want $\sum (Y_i - \hat{Y}_i)$ to be?

The errors for the points above the line should balance the errors for the points below the line, resulting in a sum of zero.

Unfortunately, there are infinitely many lines possessing this property. Any line that passes through the point $(\bar{X}, \bar{Y})$ will have this property, because it is a property of the mean.
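The claim that every line through the point of means has errors summing to zero can be checked numerically. The sketch below uses a small hypothetical sample (not from the slides) and tries several very different slopes.

```python
# Hypothetical sample, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

def residual_sum(slope):
    """Sum of errors for the line with this slope through (mean_x, mean_y)."""
    intercept = mean_y - slope * mean_x
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys))

# Very different slopes, yet every error sum is zero (up to rounding).
sums = [residual_sum(slope) for slope in (-3.0, 0.0, 0.6, 10.0)]
for s in sums:
    print(abs(round(s, 9)))
```

Because a zero sum does not single out one line, a different criterion is needed to define the "best" fit.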

Page 11: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

General Problem Restated in terms of Least Squares

[Figure: scatter plot of Y against X with a fitted line; braces mark the errors for points above and below the line]

Error = $Y_i - \hat{Y}_i$

Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yield the smallest sum of the squares of the errors (SSE) over all the sample.

Find the value of $b_0$ and the value of $b_1$ that will minimize

$\text{SSE} = \sum (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i = b_0 + b_1 x_i$

Page 12: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Solution of the Least Squares Problem

[Figure: scatter plot of Y against X with a fitted line; braces mark the errors, Error = $Y_i - \hat{Y}_i$]

Find the value of $b_0$ and the value of $b_1$ that will minimize

$\text{SSE} = \sum (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i = b_0 + b_1 x_i$

$\text{SSE} = \sum (Y_i - b_0 - b_1 x_i)^2$

Noting that SSE is a function of two variables, we can restate the problem once again.

Page 13: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Solution of the Least Squares Problem

[Figure: scatter plot of Y against X with a fitted line; braces mark the errors, Error = $Y_i - \hat{Y}_i$]

Find the value of $b_0$ and the value of $b_1$ that will minimize

$f(b_0, b_1) = \text{SSE} = \sum (Y_i - b_0 - b_1 x_i)^2$

Finding the values of variables that will maximize/minimize a function is a calculus problem. Because calculus is not a prerequisite to this course, the details are omitted, but the process results in two equations and two unknowns.
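For readers who have had calculus, the omitted step can be sketched as follows: set both partial derivatives of $f$ to zero and rearrange. This is the standard route and is offered only as an outline, not as the course's required material.

```latex
\begin{align*}
f(b_0, b_1) &= \sum_{i=1}^{n} (Y_i - b_0 - b_1 x_i)^2 \\[4pt]
\frac{\partial f}{\partial b_0} &= -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; n\,b_0 + \Big(\sum x_i\Big) b_1 = \sum Y_i \\[4pt]
\frac{\partial f}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i\,(Y_i - b_0 - b_1 x_i) = 0
  \;\Longrightarrow\; \Big(\sum x_i\Big) b_0 + \Big(\sum x_i^2\Big) b_1 = \sum x_i Y_i
\end{align*}
```

The two resulting equations are linear in the two unknowns $b_0$ and $b_1$.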

Page 14: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

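The Normal Equations

Matrix form:

$$\begin{bmatrix} n & \sum x \\ \sum x & \sum x^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum xy \end{bmatrix}$$

or, in algebraic form:

$$n\,b_0 + \Big(\sum x\Big) b_1 = \sum y$$

$$\Big(\sum x\Big) b_0 + \Big(\sum x^2\Big) b_1 = \sum xy$$

There are many ways to solve a system of two equations and two unknowns. If you have a favorite, feel free to use it.

Two relationships that I expect you to know are:

$$b_1 = \frac{\text{covariation in X \& Y}}{\text{variation in X}} \quad \text{and} \quad b_0 = \frac{\sum y}{n} - b_1 \frac{\sum x}{n}$$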
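As one possible way to solve the two normal equations, the sketch below applies Cramer's rule. The function name `solve_normal_equations` and the test points are hypothetical; the points lie exactly on y = 1 + 2x so the answer is known in advance.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_x2, sum_xy):
    """Solve  n*b0 + sum_x*b1 = sum_y  and  sum_x*b0 + sum_x2*b1 = sum_xy
    for (b0, b1) using Cramer's rule."""
    det = n * sum_x2 - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b1 = (n * sum_xy - sum_x * sum_y) / det
    return b0, b1

# Hypothetical points lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
b0, b1 = solve_normal_equations(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_x2=sum(x * x for x in xs),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
)
print(b0, b1)  # prints: 1.0 2.0
```

Substitution or elimination by hand gives the same result; Cramer's rule is used here only because it is short to write down.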

Page 15: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Normal Equations

Now the specifics are introduced with an example.

Page 16: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Random Sample

Day    X = Number of Commercials    Y = Water Consumption (gallons)
 1     11                            8,000
 2      7                            5,000
 3     12                            9,000
 4      8                            4,000
 5     10                            8,000
 6     13                           10,000
 7      8                            5,000
 8     10                            7,000
 9     14                           10,000
10      7                            4,000

Page 17: © Buddy Freeman, 2015 Simple Linear Regression and Correlation


Generate Graph

First, graph the data. The scatter plot of the data may indicate that a linear model is totally inappropriate and a waste of time.

The following three slides give some examples of nonlinear patterns.

Following the nonlinear examples, the graph of the data in the random sample is constructed.

Page 18: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Example of a Nonlinear Pattern

[Figure: scatter plot showing a quadratic pattern; x-axis from 0 to 20, y-axis from 0 to 45]

$f(x) = ax^2 + bx + c$

Page 19: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Example of a Nonlinear Pattern

[Figure: scatter plot showing a cubic pattern; x-axis from 0 to 25, y-axis from 0 to 700]

$f(x) = ax^3 + bx^2 + cx + d$

Page 20: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Example of a Nonlinear Pattern

[Figure: scatter plot showing an exponential pattern; x-axis from 0 to 20, y-axis from 0 to 90]

$f(x) = b e^{ax}$

Page 21: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Example of a Nonlinear Pattern

[Figure: scatter plot showing an exponential pattern; x-axis from 0 to 20, y-axis from 0 to 90]

$f(x) = b e^{ax}$

Transformed to a linear pattern:

$\ln f(x) = \ln\left(b e^{ax}\right) = \ln b + ax$
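The logarithmic transformation can be checked numerically. In the sketch below, the constants `a` and `b` and the x values are hypothetical; the point is only that the transformed values lie on a straight line with slope a and intercept ln b.

```python
import math

# Hypothetical constants for f(x) = b * e^(a x); not taken from the slides.
a, b = 0.25, 3.0
xs = [0, 2, 4, 6, 8, 10]
ys = [b * math.exp(a * x) for x in xs]

# After the transformation, ln f(x) = ln b + a x is a straight line in x.
log_ys = [math.log(y) for y in ys]

# The slope between consecutive transformed points is constant and equals a.
slopes = [(log_ys[i + 1] - log_ys[i]) / (xs[i + 1] - xs[i])
          for i in range(len(xs) - 1)]
intercept = log_ys[0]  # value at x = 0, which is ln b

print([round(s, 6) for s in slopes])
print(round(math.exp(intercept), 6))
```

With real data the transformed points would only be approximately linear, and a regression line would then be fitted to x and ln y.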

Page 22: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Scatter Graph

[Figure: scatter plot; x-axis "Number of Commercials", y-axis "H2O Consumption"]

Plotted points:

(11, 8000)
( 7, 5000)
(12, 9000)
( 8, 4000)
(10, 8000)
(13, 10000)
( 8, 5000)
(10, 7000)
(14, 10000)
( 7, 4000)

Page 23: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Scatter Graph

[Figure: scatter plot; x-axis "Number of Commercials", y-axis "H2O Consumption"]

Find the slope and the y-intercept of the line that is the “best” fit to these points.

$\hat{Y}_i = b_0 + b_1 x_i$

Page 24: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Scatter Graph (with “guesstimated” line)

[Figure: scatter plot with a hand-drawn line; x-axis "Number of Commercials", y-axis "H2O Consumption"]

Find the slope and the y-intercept of the line that is the “best” fit to these points.

$\hat{Y}_i = b_0 + b_1 x_i$

Page 25: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Initial Calculations

Worksheet:

Day       X      X^2       XY         Y        Y^2
 1       11      121     88,000     8,000     64,000,000
 2        7       49     35,000     5,000     25,000,000
 3       12      144    108,000     9,000     81,000,000
 4        8       64     32,000     4,000     16,000,000
 5       10      100     80,000     8,000     64,000,000
 6       13      169    130,000    10,000    100,000,000
 7        8       64     40,000     5,000     25,000,000
 8       10      100     70,000     7,000     49,000,000
 9       14      196    140,000    10,000    100,000,000
10        7       49     28,000     4,000     16,000,000

TOTALS  100    1,056    751,000    70,000    540,000,000
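The worksheet totals can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative and only the ten (X, Y) pairs come from the slides.

```python
# Random sample from the slides: (number of commercials, gallons of water).
sample = [(11, 8000), (7, 5000), (12, 9000), (8, 4000), (10, 8000),
          (13, 10000), (8, 5000), (10, 7000), (14, 10000), (7, 4000)]

n = len(sample)
sum_x = sum(x for x, _ in sample)
sum_x2 = sum(x * x for x, _ in sample)
sum_xy = sum(x * y for x, y in sample)
sum_y = sum(y for _, y in sample)
sum_y2 = sum(y * y for _, y in sample)

print(n, sum_x, sum_x2, sum_xy, sum_y, sum_y2)
# prints: 10 100 1056 751000 70000 540000000
```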

Page 26: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

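Some Basic Formulas

Each line gives the definitional formula first and the computational formula second.

$$SSX = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{\left(\sum X\right)^2}{n}$$

$$SSY = SST = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}$$

$$\text{cov}(X \& Y) = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{\left(\sum X\right)\left(\sum Y\right)}{n}$$

$$SSE = \sum (Y_i - \hat{Y}_i)^2 = SST - SSR$$

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = \frac{\left[\text{cov}(X \& Y)\right]^2}{SSX}$$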

Page 27: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

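X = Number of Commercials; Y = Water Consumption (gallons)

$\sum X = 100$, $\sum X^2 = 1,056$, $\sum XY = 751,000$, $\sum Y = 70,000$, $\sum Y^2 = 540,000,000$, $n = 10$

Sample regression model: $\hat{Y}_i = b_0 + b_1 x_i$ is the predicted value.

$$\text{slope} = b_1 = \frac{\text{covariation in X and Y}}{\text{variation in X}} = \frac{\text{cov}(X \& Y)}{SSX} = \frac{751,000 - \frac{(100)(70,000)}{10}}{1,056 - \frac{(100)^2}{10}} = \frac{51,000}{56} = 910.714$$

$$\text{Y-intercept} = b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n} = \frac{70,000}{10} - 910.714 \left(\frac{100}{10}\right) = -2,107.14$$

$\hat{Y}_i = -2,107.14 + 910.714\, x_i$ is the predicted value.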
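The slope and intercept calculation can be reproduced from the worksheet totals with the computational formulas. This is a minimal sketch; the helper `predict` is a hypothetical name added for illustration.

```python
# Worksheet totals from the slides.
n = 10
sum_x, sum_x2 = 100, 1056
sum_y, sum_xy = 70000, 751000

ssx = sum_x2 - sum_x ** 2 / n          # variation in X
cov_xy = sum_xy - sum_x * sum_y / n    # covariation in X and Y

b1 = cov_xy / ssx                      # slope
b0 = sum_y / n - b1 * sum_x / n        # y-intercept

def predict(x):
    """Predicted water consumption (gallons) for x commercials."""
    return b0 + b1 * x

print(round(b1, 3), round(b0, 2))
# prints: 910.714 -2107.14
```

Because the fitted line passes through the point of means (10, 7,000), `predict(10)` returns 7,000 up to rounding.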

Page 28: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Interpretation of the Slope and the Y-intercept

$\hat{Y}_i = -2,107.14 + 910.714\, x_i$ is the predicted value.

X = Number of Commercials; Y = Water Consumption (gallons)

Interpret the slope. (What does the slope mean in terms of the problem?)

For each additional commercial, we expect the water consumption to increase by 910.714 gallons.

Interpret the y-intercept. (What does the y-intercept mean in terms of the problem?)

If there are no commercials, we expect the water consumption to be negative 2,107.14 gallons.

?????????? Think about it. ??????????

Page 29: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

[Illustration (Microsoft Clip Art): "Welcome to Mulvany, Tennessee"; labels: City, Water Plant, Reservoir, Sensor line, River]

If the water consumption is negative 2,107.14 gallons, which way is the water flowing in the pipe from the reservoir to the city?

We know that the water does not flow back into the reservoir.

Does this result mean that the regression model is worthless?

Page 30: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Interpolation versus Extrapolation

Day    X = Number of Commercials    Y = Water Consumption (gallons)
 1     11                            8,000
 2      7                            5,000    (smallest X)
 3     12                            9,000
 4      8                            4,000
 5     10                            8,000
 6     13                           10,000
 7      8                            5,000
 8     10                            7,000
 9     14                           10,000    (largest X)
10      7                            4,000    (smallest X)

Page 31: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Interpolation versus Extrapolation

[Figure: scatter plot with the line $\hat{Y}_i = b_0 + b_1 x_i$; x-axis "Number of Commercials", y-axis "H2O Consumption". The points (7, 5000) and (7, 4000) mark the smallest X, 7, and the point (14, 10000) marks the largest X, 14. The relevant range between them is labeled "Interpolation"; the regions on either side are labeled "Extrapolation".]

Between the smallest (7) and the largest (14) values of X used to compute the sample regression model, we may interpolate with statistical significance.

To determine if the model has statistical significance, we still have to perform some more calculations.
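The relevant range can be enforced in code so that a prediction is only returned for interpolation. The function name `predict_in_range` and the decision to raise an error on extrapolation are choices made for this sketch, not something the slides prescribe.

```python
X_MIN, X_MAX = 7, 14          # smallest and largest X in the random sample
B0, B1 = -2107.14, 910.714    # rounded coefficients from the slides

def predict_in_range(x):
    """Return predicted gallons, refusing to extrapolate outside [X_MIN, X_MAX]."""
    if not X_MIN <= x <= X_MAX:
        raise ValueError(f"x = {x} is outside the relevant range [{X_MIN}, {X_MAX}]")
    return B0 + B1 * x

print(round(predict_in_range(9), 3))   # 9 commercials is inside the relevant range
try:
    predict_in_range(0)                # the y-intercept case is an extrapolation
except ValueError as err:
    print(err)
```

This is why the negative y-intercept is not alarming: zero commercials lies outside the range of the data used to fit the line.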

Page 32: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by Definition

 i    Xi      Yi      Predicted: -2,107.14 + 910.714 Xi
 1    11     8,000     7,910.714
 2     7     5,000     4,267.858
 3    12     9,000     8,821.428
 4     8     4,000     5,178.572
 5    10     8,000     7,000.000
 6    13    10,000     9,732.142
 7     8     5,000     5,178.572
 8    10     7,000     7,000.000
 9    14    10,000    10,642.856
10     7     4,000     4,267.858

Page 33: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by Definition

First, you insert the Xi values into the sample regression equation, $\hat{Y}_i = -2,107.14 + 910.714\, X_i$, to calculate the predicted values.

Page 34: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by Definition

First, you insert the Xi values into the sample regression equation to calculate the predicted values.

Second, you calculate the deviations of the points from the line.

 i    Xi      Yi      Predicted     Error (Yi - Predicted)
 1    11     8,000     7,910.714        89.286
 2     7     5,000     4,267.858       732.142
 3    12     9,000     8,821.428       178.572
 4     8     4,000     5,178.572    -1,178.572
 5    10     8,000     7,000.000     1,000.000
 6    13    10,000     9,732.142       267.858
 7     8     5,000     5,178.572      -178.572
 8    10     7,000     7,000.000         0.000
 9    14    10,000    10,642.856      -642.856
10     7     4,000     4,267.858      -267.858

Sum of errors                            0.000

Page 35: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by Definition

First, you insert the Xi values into the sample regression equation to calculate the predicted values.

Second, you calculate the deviations of the points from the line.

Finally, you calculate the squares of the deviations of the points from the line and sum them to obtain SSE.

 i    Xi      Yi      Predicted     Error          Error squared
 1    11     8,000     7,910.714        89.286         7,971.990
 2     7     5,000     4,267.858       732.142       536,031.908
 3    12     9,000     8,821.428       178.572        31,887.959
 4     8     4,000     5,178.572    -1,178.572     1,389,031.959
 5    10     8,000     7,000.000     1,000.000     1,000,000.000
 6    13    10,000     9,732.142       267.858        71,747.908
 7     8     5,000     5,178.572      -178.572        31,887.959
 8    10     7,000     7,000.000         0.000             0.000
 9    14    10,000    10,642.856      -642.856       413,263.837
10     7     4,000     4,267.858      -267.858        71,747.908

Totals                                   0.000    SSE = 3,553,571.429

Predicted is $\hat{Y}_i = -2,107.14 + 910.714\, X_i$; Error is $Y_i - \hat{Y}_i$.
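The three steps can be sketched as code. This version uses the unrounded coefficients (b1 = 51,000/56), so individual predicted values may differ from the slide's table in the last decimal place.

```python
sample = [(11, 8000), (7, 5000), (12, 9000), (8, 4000), (10, 8000),
          (13, 10000), (8, 5000), (10, 7000), (14, 10000), (7, 4000)]

# Unrounded coefficients: b1 = 51,000/56 and b0 = 7,000 - 10*b1.
b1 = 51000 / 56
b0 = 7000 - 10 * b1

# Step 1: predicted values from the sample regression equation.
predicted = [b0 + b1 * x for x, _ in sample]
# Step 2: deviations of the points from the line.
errors = [y - y_hat for (_, y), y_hat in zip(sample, predicted)]
# Step 3: square the deviations and sum them.
sse = sum(e ** 2 for e in errors)

print(abs(round(sum(errors), 6)), round(sse, 3))
```

The errors sum to zero because the least squares line passes through the point of means.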

Page 36: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by “Backing” into it

$SST = SSR + SSE$

SSR = variation explained by regression

SSE = variation not explained by regression

Page 37: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by “Backing” into it

$SST = SSR + SSE$ (variation explained by regression + variation not explained by regression)

Therefore,

$SSE = SST - SSR$.

Page 38: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by “Backing” into it

Therefore, $SSE = SST - SSR$.

However,

$$SST = SSY = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} = 540,000,000 - \frac{(70,000)^2}{10} = 50,000,000$$

Page 39: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of SSE by “Backing” into it

$SST = SSR + SSE$ (variation explained by regression + variation not explained by regression)

Therefore, $SSE = SST - SSR$.

However,

$$SST = SSY = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} = 540,000,000 - \frac{(70,000)^2}{10} = 50,000,000$$

and

$$SSR = \frac{\left[\text{cov}(X \& Y)\right]^2}{SSX} = \frac{(51,000)^2}{56} = 46,446,428.5714.$$

Hence,

$$SSE = SST - SSR = 50,000,000 - 46,446,428.5714 = 3,553,571.4286$$
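The "backing into it" route needs only the totals already computed, which the following sketch makes explicit. Variable names are illustrative.

```python
n = 10
sum_y, sum_y2 = 70000, 540000000
cov_xy, ssx = 51000, 56

sst = sum_y2 - sum_y ** 2 / n   # total variation in Y (SSY)
ssr = cov_xy ** 2 / ssx         # variation explained by regression
sse = sst - ssr                 # variation not explained by regression

print(sst, round(ssr, 4), round(sse, 4))
```

The result agrees with the SSE obtained by definition, without computing any predicted values.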

Page 40: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of the Standard Error of the Estimate

$$\text{error variance} = S_e^2 = MSE = \frac{SSE}{n-2} = \frac{3,553,571.4286}{8} = 444,196.42857$$

$$\text{standard error of the estimate} = S_e = \sqrt{S_e^2} = \sqrt{444,196.42857} = 666.48 \text{ gallons}$$

Interpretation: The “typical” error made when predicting the number of gallons of water consumed based on the number of commercials is about 666.48 gallons.

Page 41: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

The Question

At the .05 level of significance, is there evidence that a linear relationship exists between the number of commercials and water consumption?

We have almost enough calculated to be able to answer the question.

Just one more...

Page 42: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Calculation of the Standard Error of the Slope

standard error of the slope = $S_{b_1}$ (also called the standard error of the regression coefficient, $b_1$)

$$S_{b_1} = \frac{S_e}{\sqrt{SSX}} = \frac{666.48}{\sqrt{56}} = 89.062$$
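Both standard errors follow directly from SSE and SSX. The sketch below recomputes them from the quantities on the earlier slides; variable names are illustrative.

```python
import math

n = 10
ssx = 56
sse = 50_000_000 - 51_000 ** 2 / 56   # SST - SSR from the earlier slides

mse = sse / (n - 2)                   # error variance, S_e squared
s_e = math.sqrt(mse)                  # standard error of the estimate (gallons)
s_b1 = s_e / math.sqrt(ssx)           # standard error of the slope

print(round(mse, 5), round(s_e, 2), round(s_b1, 3))
```

The divisor n - 2 reflects the two coefficients, b0 and b1, estimated from the sample.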

Page 43: © Buddy Freeman, 2015 Simple Linear Regression and Correlation

Test Statistics for Regression

$$t_{n-2} = \frac{b_1 - \beta_1}{S_{b_1}} \quad \text{or} \quad F_{1,\,n-2} = \frac{S_R^2}{S_e^2} = \frac{MSR}{MSE}$$

Now, what is $\beta_1$?

Well, that’s another story.
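As a preview, the two test statistics can be computed under one common assumption: that the hypothesized slope is zero (no linear relationship). The slides leave the hypothesized value for later, so `beta1_hypothesized = 0.0` below is an assumption of this sketch, as is the use of the standard t-table critical value for 8 degrees of freedom.

```python
import math

n = 10
b1 = 51_000 / 56                 # slope from the slides
ssr = 51_000 ** 2 / 56           # variation explained by regression
sse = 50_000_000 - ssr           # variation not explained by regression
mse = sse / (n - 2)
msr = ssr / 1                    # one independent variable, so 1 degree of freedom
s_b1 = math.sqrt(mse / 56)       # standard error of the slope

beta1_hypothesized = 0.0         # ASSUMPTION: null hypothesis of no linear relationship
t_stat = (b1 - beta1_hypothesized) / s_b1
f_stat = msr / mse

T_CRITICAL = 2.306               # two-tailed t, alpha = .05, 8 df (standard t table)
print(round(t_stat, 3), round(f_stat, 3), abs(t_stat) > T_CRITICAL)
```

Under this assumption the two statistics are equivalent in simple linear regression, since the square of the t statistic equals the F statistic.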

Page 44: © Buddy Freeman, 2015 Simple Linear Regression and Correlation


The End