Introduction to Probability and Statistics, Thirteenth Edition
Chapter 12: Linear Regression and Correlation

Upload: milo-scott

Post on 03-Jan-2016


TRANSCRIPT

Page 1: Introduction to Probability and Statistics Thirteenth Edition Chapter 12 Linear Regression and Correlation

Introduction to Probability and Statistics
Thirteenth Edition

Chapter 12

Linear Regression and Correlation

Page 2

Correlation & Regression

Univariate & Bivariate Statistics
• Univariate: frequency distribution, mean, mode, range, standard deviation
• Bivariate: correlation between two variables

Correlation
• a linear pattern of relationship between one variable (x) and another variable (y); an association between two variables
• the scatterplot gives a graphical representation of the relationship between the two variables

Warning: no proof of causality. We cannot assume that x causes y.

Page 3

1. Correlation Analysis

• The correlation coefficient measures the strength of the relationship between x and y.

Sample Pearson's correlation coefficient:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

where

$$ S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \qquad S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} \qquad S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} $$

Page 4

Pearson's Correlation Coefficient

"r" indicates:
• strength of relationship (strong, weak, or none)
• direction of relationship
  ◦ positive (direct): variables move in the same direction
  ◦ negative (inverse): variables move in opposite directions

r ranges in value from -1.0 to +1.0:

Strong Negative    No Rel.    Strong Positive
     -1.0            0.0           +1.0

Page 5

Limitations of Correlation

• linearity: correlation cannot describe non-linear relationships (e.g., the relation between anxiety and performance)

• no proof of causation: we cannot assume that x causes y

Page 6

Some Correlation Patterns

[Four scatterplots of y versus x, illustrating linear relationships and curvilinear relationships]

Page 7

Some Correlation Patterns

[Four scatterplots of y versus x, illustrating strong relationships and weak relationships]

Page 8

Example

The table shows the heights and weights of n = 10 randomly selected college football players.

Player      1    2    3    4    5    6    7    8    9    10
Height, x   73   71   75   72   72   75   67   69   71   69
Weight, y   185  175  200  210  190  195  150  170  180  175

Calculate:

$$ S_{xx} = 60.4 \qquad S_{yy} = 2610 \qquad S_{xy} = 328 $$

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{328}{\sqrt{(60.4)(2610)}} = .8261 $$
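The arithmetic above is easy to check by machine. A minimal sketch using only the Python standard library (the variable names are ours, not the text's):

```python
import math

# Heights (x) and weights (y) of the n = 10 college football players
x = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
y = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(x)

# Corrected sums of squares and cross-products
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Sample Pearson's correlation coefficient
r = Sxy / math.sqrt(Sxx * Syy)
print(round(Sxx, 1), round(Syy, 1), round(Sxy, 1), round(r, 4))
# prints: 60.4 2610.0 328.0 0.8261
```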

Page 9

Example – scatterplot

[Scatterplot of Weight vs Height for the ten players: heights from 66 to 75 on the x-axis, weights from 150 to 210 on the y-axis]

r = .8261: strong positive correlation.

As the player's height increases, so does his weight.

Page 10

Inference using r

• The population coefficient of correlation is called ρ ("rho"). We can test for a significant correlation between x and y using a t test:

$$ H_0: \rho = 0 \quad \text{versus} \quad H_1: \rho \neq 0 $$

Test statistic:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

Reject $H_0$ if $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$, with df = n - 2.

Page 11

Example

Is there a significant positive correlation between weight and height in the population of all college football players?

$$ r = .8261, \quad n = 10 $$

$$ H_0: \rho = 0 \quad \text{versus} \quad H_1: \rho > 0 $$

Test statistic:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} = .8261\sqrt{\frac{8}{1-.8261^2}} = 4.15 $$

Use the t-table with n - 2 = 8 df to bound the p-value as p-value < .005.

There is a significant positive correlation between weight and height in the population of all college football players.
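The t statistic for this test can be reproduced directly from r and n (a sketch in standard-library Python; the names are ours):

```python
import math

r, n = 0.8261, 10  # sample correlation and sample size from the example

# Test statistic for H0: rho = 0 versus H1: rho > 0, with df = n - 2
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))
# prints: 4.15
```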

Page 12

2. Linear Regression

Regression: Correlation + Prediction

Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
◦ Dependent variable: denoted Y
◦ Independent variables: denoted X1, X2, …, Xk

Page 13

Example

Let y be the monthly sales revenue for a company. This might be a function of several variables:
◦ x1 = advertising expenditure
◦ x2 = time of year
◦ x3 = state of the economy
◦ x4 = size of inventory

We want to predict y using knowledge of x1, x2, x3 and x4.

Page 14

Some Questions

Which of the independent variables are useful and which are not?

How could we create a prediction equation that allows us to predict y using knowledge of x1, x2, x3, etc.?

How good is this prediction?

We start with the simplest case, in which the response y is a function of a single independent variable, x.

Page 15

Model Building

A statistical model separates the systematic component of a relationship from the random component:

Data = Systematic component + Random errors

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.

Page 16

A Simple Linear Regression Model

The explanatory and response variables are numeric, and the relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line).

Model:

$$ Y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) $$

• β1 > 0: positive association
• β1 < 0: negative association
• β1 = 0: no association

Page 17

Picturing the Simple Linear Regression Model

[Regression plot: the line with intercept β0 and slope β1, with the error ε shown as the vertical deviation of a point (x, y) from the line]

Page 18

Simple Linear Regression Analysis

Variables: x = independent variable, y = dependent variable

Parameters: α = y-intercept, β = slope, ε ~ normal distribution with mean 0 and variance σ²

$$ y = \alpha + \beta x + \varepsilon \qquad \hat{y} = a + bx $$

y = actual value of a score; ŷ = predicted value

Page 19

Simple Linear Regression Model…

[Plot of the fitted line ŷ = a + bx, with intercept a and slope b = Δy/Δx]

Page 20

The Method of Least Squares

The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).

Best-fitting line: $\hat{y} = a + bx$

• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized: choose a and b to minimize

$$ SSE = \sum (y - \hat{y})^2 = \sum (y - a - bx)^2 $$

Page 21

Least Squares Estimators

Calculate the sums of squares:

$$ S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} \qquad S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} $$

Best-fitting line: $\hat{y} = a + bx$, where

$$ b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x} $$

Page 22

Example

The table shows the IQ scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student             1   2   3   4   5   6   7   8   9   10
IQ score, x         39  43  21  64  57  47  28  75  34  52
Calculus grade, y   65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:

$$ \sum x = 460 \qquad \sum y = 760 \qquad \sum x^2 = 23634 \qquad \sum y^2 = 59816 \qquad \sum xy = 36854 $$

$$ \bar{x} = 46 \qquad \bar{y} = 76 $$

Page 23

Example

Calculate the sums of squares:

$$ S_{xx} = 23634 - \frac{(460)^2}{10} = 2474 \qquad S_{yy} = 59816 - \frac{(760)^2}{10} = 2056 $$

$$ S_{xy} = 36854 - \frac{(460)(760)}{10} = 1894 $$

$$ b = \frac{S_{xy}}{S_{xx}} = \frac{1894}{2474} = .76556 \quad \text{and} \quad a = \bar{y} - b\bar{x} = 76 - .76556(46) = 40.78 $$

Best-fitting line: $\hat{y} = 40.78 + .77x$
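The least squares estimates can be verified with a few lines of Python (a sketch using only the standard library; the names are ours):

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # IQ scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus grades
n = len(x)

# Sums of squares needed for the slope and intercept
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Least squares estimates: b = Sxy/Sxx and a = ybar - b*xbar
b = Sxy / Sxx
a = sum(y) / n - b * sum(x) / n
print(round(b, 5), round(a, 2))
# prints: 0.76556 40.78
```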

Page 24

The Analysis of Variance

The total variation in the experiment is measured by the total sum of squares:

$$ \text{Total SS} = S_{yy} = \sum (y - \bar{y})^2 $$

The Total SS is divided into two parts:

SSR (sum of squares for regression): measures the variation explained by using x in the model.

SSE (sum of squares for error): measures the leftover variation not explained by x.

Page 25

The Analysis of Variance

We calculate

$$ SSR = \frac{(S_{xy})^2}{S_{xx}} = \frac{(1894)^2}{2474} = 1449.9741 $$

$$ SSE = \text{Total SS} - SSR = S_{yy} - \frac{(S_{xy})^2}{S_{xx}} = 2056 - 1449.9741 = 606.0259 $$

Page 26

The ANOVA Table

Total df = n - 1
Regression df = 1
Error df = n - 1 - 1 = n - 2

Mean squares: MSR = SSR/1, MSE = SSE/(n - 2)

Source      df     SS        MS           F
Regression  1      SSR       SSR/1        MSR/MSE
Error       n - 2  SSE       SSE/(n - 2)
Total       n - 1  Total SS

Page 27

The Calculus Problem

$$ SSR = \frac{(S_{xy})^2}{S_{xx}} = \frac{(1894)^2}{2474} = 1449.9741 $$

$$ SSE = \text{Total SS} - SSR = 2056 - 1449.9741 = 606.0259 $$

Source      df   SS         MS         F
Regression  1    1449.9741  1449.9741  19.14
Error       8    606.0259   75.7532
Total       9    2056.0000
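The ANOVA decomposition in the table can be checked numerically (a sketch; the sums of squares are the ones found for the calculus example):

```python
# Sums of squares from the calculus example
Sxx, Syy, Sxy, n = 2474, 2056, 1894, 10

SSR = Sxy ** 2 / Sxx        # variation explained by the regression
SSE = Syy - SSR             # leftover, unexplained variation
MSR = SSR / 1               # regression df = 1
MSE = SSE / (n - 2)         # error df = n - 2
F = MSR / MSE
print(round(SSR, 4), round(SSE, 4), round(MSE, 4), round(F, 2))
# prints: 1449.9741 606.0259 75.7532 19.14
```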

Page 28

Testing the Usefulness of the Model (The F Test)

You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.

Hypothesis: $H_0$: the model is not useful in predicting y.

Test statistic:

$$ F = \frac{MSR}{MSE} $$

Reject $H_0$ if $F > F_{\alpha}$ with df = 1 and n - 2.

This test is exactly equivalent to the t-test for the slope, with t² = F.

Page 29

Minitab Output

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor   Coef     SE Coef   T      P
Constant    40.784   8.507     4.79   0.001
x           0.7656   0.1750    4.38   0.002

S = 8.70363   R-Sq = 70.5%   R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    1450.0   1450.0   19.14   0.002
Residual Error  8    606.0    75.8
Total           9    2056.0

Annotations: the regression coefficients a and b appear in the Coef column; the stated equation is the least squares regression line; MSE = 75.8; the T value for x tests $H_0: \beta = 0$, and t² = F.

Page 30

Testing the Usefulness of the Model

• The first question to ask is whether the independent variable x is of any use in predicting y.

• If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.

$$ H_0: \beta = 0 \quad \text{versus} \quad H_a: \beta \neq 0 $$

Page 31

Testing the Usefulness of the Model

The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic:

$$ H_0: \beta = 0 \quad \text{versus} \quad H_a: \beta \neq 0 $$

Test statistic:

$$ t = \frac{b - 0}{\sqrt{MSE/S_{xx}}} $$

which has a t distribution with df = n - 2; or construct a confidence interval:

$$ b \pm t_{\alpha/2}\sqrt{\frac{MSE}{S_{xx}}} $$

Page 32

The Calculus Problem

• Is there a significant relationship between the calculus grades and the IQ scores at the 5% level of significance?

$$ H_0: \beta = 0 \quad \text{versus} \quad H_a: \beta \neq 0 $$

$$ t = \frac{b - 0}{\sqrt{MSE/S_{xx}}} = \frac{.7656}{\sqrt{75.7532/2474}} = 4.38 $$

Reject $H_0$ when |t| > 2.306. Since t = 4.38 falls into the rejection region, $H_0$ is rejected.

There is a significant linear relationship between the calculus grades and the IQ scores for the population of college freshmen.
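This t statistic can be recomputed from the quantities already found (a sketch in standard-library Python; the names are ours):

```python
import math

# From the calculus example: slope, error mean square, and Sxx
b = 1894 / 2474                        # = .76556
MSE = (2056 - 1894 ** 2 / 2474) / 8    # = 75.7532
Sxx = 2474

# t statistic for H0: beta = 0, with df = n - 2 = 8
t = (b - 0) / math.sqrt(MSE / Sxx)
print(round(t, 2))
# prints: 4.38
```

Note that t² equals the F statistic from the ANOVA table, as the text states.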

Page 33

Measuring the Strength of the Relationship

• If the independent variable x is useful in predicting y, you will want to know how well the model fits.

• The strength of the relationship between x and y can be measured using:

Correlation coefficient:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

Coefficient of determination:

$$ r^2 = \frac{(S_{xy})^2}{S_{xx}\,S_{yy}} = \frac{SSR}{\text{Total SS}} $$

Page 34

Measuring the Strength of the Relationship

• Since Total SS = SSR + SSE, r² measures:

◦ the proportion of the total variation in the responses that can be explained by using the independent variable x in the model;

◦ the percent reduction in the total variation obtained by using the regression equation rather than just using the sample mean ȳ to estimate y.

$$ r^2 = \frac{SSR}{\text{Total SS}} $$

For the calculus problem, r² = .705, or 70.5%: 70.5% of the variability in calculus scores can be explained by the model.
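A quick check of r² computed both ways (a sketch; the values are those from the calculus example):

```python
import math

Sxx, Syy, Sxy = 2474, 2056, 1894   # sums of squares from the calculus example

SSR = Sxy ** 2 / Sxx
r2_from_anova = SSR / Syy                          # SSR / Total SS
r2_from_r = (Sxy / math.sqrt(Sxx * Syy)) ** 2      # square of Pearson's r

print(round(r2_from_anova, 3), round(r2_from_r, 3))
# prints: 0.705 0.705
```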

Page 35

Estimation and Prediction

To estimate the average value of y when x = x₀ (confidence interval):

$$ \hat{y} \pm t_{\alpha/2}\sqrt{MSE\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} $$

To predict a particular value of y when x = x₀ (prediction interval):

$$ \hat{y} \pm t_{\alpha/2}\sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} $$

Page 36

The Calculus Problem

Estimate the average calculus grade for students whose IQ score is 50 with a 95% confidence interval.

Calculate $\hat{y} = 40.78424 + .76556(50) = 79.06$:

$$ \hat{y} \pm 2.306\sqrt{75.7532\left(\frac{1}{10} + \frac{(50-46)^2}{2474}\right)} = 79.06 \pm 6.55, $$

or 72.51 to 85.61.

Page 37

The Calculus Problem

Estimate the calculus grade for a particular student whose IQ score is 50 with a 95% prediction interval.

Calculate $\hat{y} = 40.78424 + .76556(50) = 79.06$:

$$ \hat{y} \pm 2.306\sqrt{75.7532\left(1 + \frac{1}{10} + \frac{(50-46)^2}{2474}\right)} = 79.06 \pm 21.11, $$

or 57.95 to 100.17.

Notice how much wider this interval is!
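Both intervals can be reproduced from the fitted quantities (a sketch in standard-library Python; 2.306 is the t critical value with 8 df used in the text):

```python
import math

# Fitted quantities from the calculus example
a, b = 40.78424, 0.76556
MSE, Sxx, n, xbar = 75.7532, 2474, 10, 46
t_crit = 2.306                     # t_{.025} with df = n - 2 = 8
x0 = 50

yhat = a + b * x0
ci_half = t_crit * math.sqrt(MSE * (1 / n + (x0 - xbar) ** 2 / Sxx))
pi_half = t_crit * math.sqrt(MSE * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))

print(round(yhat, 2))                                      # 79.06
print(round(yhat - ci_half, 2), round(yhat + ci_half, 2))  # 72.51 85.61
print(round(yhat - pi_half, 2), round(yhat + pi_half, 2))  # 57.95 100.17
```

The extra "1 +" inside the prediction-interval square root is what makes it so much wider than the confidence interval.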

Page 38

Minitab Output

Predicted Values for New Observations
New Obs   Fit     SE Fit   95.0% CI          95.0% PI
1         79.06   2.84     (72.51, 85.61)    (57.95, 100.17)

Values of Predictors for New Observations
New Obs   x
1         50.0

[Fitted line plot of y = 40.78 + 0.7656x with 95% CI and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]

Green prediction bands are always wider than red confidence bands. Both intervals are narrowest when x = x̄.

Page 39

Estimation and Prediction

• Once you have determined that the regression line is useful, and have used the diagnostic plots to check for violation of the regression assumptions, you are ready to use the regression line to:

◦ estimate the average value of y for a given value of x;

◦ predict a particular value of y for a given value of x.

Page 40

Estimation and Prediction

• The best estimate of either E(y) or y for a given value x = x₀ is

$$ \hat{y} = a + b x_0 $$

• Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.

Page 41

Regression Assumptions

1. The relationship between x and y is linear, given by y = α + βx + ε.

2. The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and constant variance σ².

• Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.

Page 42

Diagnostic Tools

1. Normal probability plot or histogram of residuals

2. Plot of residuals versus fit or residuals versus variables

3. Plot of residuals versus order

Page 43

Residuals

• The residual error is the "leftover" variation in each data point after the variation explained by the regression model has been removed.

• If all assumptions have been met, these residuals should be normal, with mean 0 and variance σ².

$$ \text{Residual}_i = y_i - \hat{y}_i = y_i - (a + b x_i) $$
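Residuals for the calculus example can be computed directly; for a least squares fit they sum to zero, and their squares sum to SSE (a sketch; the names are ours):

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # IQ scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus grades
n = len(x)

# Refit the least squares line
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = Sxy / Sxx
a = sum(y) / n - b * sum(x) / n

# Residuals: observed minus fitted values
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(sum(resid), 8))                 # ~0 for a least squares fit
print(round(sum(e * e for e in resid), 4))  # SSE = 606.0259
```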

Page 44

Normal Probability Plot

If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.

If not, you will often see the pattern fail in the tails of the graph.

[Normal probability plot of the residuals (response is y): percent versus residual]

Page 45

Residuals versus Fits

If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.

If not, you will see a pattern in the residuals.

[Plot of residuals versus the fitted values (response is y)]