Introduction to Probability and Statistics
Thirteenth Edition
Chapter 12
Linear Regression and Correlation
Correlation & Regression
Univariate & Bivariate Statistics
U: frequency distribution, mean, mode, range, standard deviation
B: correlation – two variables
Correlation
linear pattern of relationship between one variable (x) and another variable (y) – an association between two variables
graphical representation of the relationship between two variables
Warning: No proof of causality. Cannot assume x causes y.
1. Correlation Analysis
• The correlation coefficient measures the strength of the relationship between x and y.
Sample Pearson’s correlation coefficient:
r = S_xy / √(S_xx S_yy)
where
S_xx = Σx² − (Σx)²/n
S_yy = Σy² − (Σy)²/n
S_xy = Σxy − (Σx)(Σy)/n
Pearson’s Correlation Coefficient
“r” indicates the strength of the relationship (strong, weak, or none) and its direction:
positive (direct) – variables move in same direction
negative (inverse) – variables move in opposite directions
r ranges in value from –1.0 to +1.0
Strong Negative No Rel. Strong Positive
-1.0 0.0 +1.0
Limitations of Correlation
linearity: can’t describe non-linear relationships, e.g., the relation between anxiety & performance
no proof of causation: cannot assume x causes y
[Scatterplots: linear relationships vs. curvilinear relationships]
Some Correlation Patterns
[Scatterplots: strong relationships vs. weak relationships]
Some Correlation Patterns
Example
The table shows the heights and weights of n = 10 randomly selected college football players.
Player 1 2 3 4 5 6 7 8 9 10
Height, x 73 71 75 72 72 75 67 69 71 69
Weight, y 185 175 200 210 190 195 150 170 180 175
r = S_xy / √(S_xx S_yy) = 328 / √((60.4)(2610)) = .8261
[Scatterplot of Weight vs Height]
r = .8261
Strong positive correlation
As the player’s height increases, so does his weight.
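The correlation in this example can be verified with a short script (a sketch that simply applies the sums-of-squares formulas from this chapter; the variable names are my own):

```python
import math

# Heights and weights of the n = 10 football players from the example
x = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
y = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(x)

# Sums of squares, computed exactly as defined in the slides
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

r = Sxy / math.sqrt(Sxx * Syy)
print(round(Sxx, 1), round(Syy, 1), round(Sxy, 1))  # 60.4 2610.0 328.0
print(round(r, 4))  # 0.8261
```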
Example – scatter plot
Inference using r
• The population coefficient of correlation is called ρ (“rho”). We can test for a significant correlation between x and y using a t test:
H0: ρ = 0 versus Ha: ρ ≠ 0
Test statistic: t = r √(n − 2) / √(1 − r²)
Reject H0 if t > t_α/2 or t < −t_α/2 with df = n − 2.
Is there a significant positive correlation between weight and height in the population of all college football players?
r = .8261
H0: ρ = 0 versus Ha: ρ > 0
Test statistic: t = r √(n − 2) / √(1 − r²) = .8261 √8 / √(1 − .8261²) = 4.15
Use the t-table with n-2 = 8 df to bound the p-value as p-value < .005.
There is a significant positive correlation between weight and height in the population of all college football players.
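This t statistic is easy to reproduce numerically (a sketch; the names are mine, and the table value 3.355 is the one-tailed t critical value for df = 8 at α = .005):

```python
import math

r, n = 0.8261, 10  # sample correlation and sample size from the example

t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 2))  # 4.15
# Since t = 4.15 exceeds the one-tailed critical value 3.355 (df = 8,
# alpha = .005), the p-value is bounded as p < .005, as stated above.
```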
Example
2. Linear Regression
Regression: Correlation + Prediction
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
◦ Dependent variable: denoted Y
◦ Independent variables: denoted X1, X2, …, Xk
Example
Let y be the monthly sales revenue for a company. This might be a function of several variables:
◦ x1 = advertising expenditure
◦ x2 = time of year
◦ x3 = state of economy
◦ x4 = size of inventory
We want to predict y using knowledge of x1, x2, x3 and x4.
Some Questions
Which of the independent variables are useful and which are not?
How could we create a prediction equation that allows us to predict y using knowledge of x1, x2, x3, etc.?
How good is this prediction?
We start with the simplest case, in which the response y is a function of a single independent variable, x.
A statistical model separates the systematic component of a relationship from the random component.
Statistical model: Data = Systematic component + Random errors
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
Model Building
Explanatory and Response Variables are Numeric
Relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line).
Model: Y = β0 + β1x + ε, ε ~ N(0, σ²)
• β1 > 0 ⇒ Positive Association
• β1 < 0 ⇒ Negative Association
• β1 = 0 ⇒ No Association
A Simple Linear Regression Model
Picturing the Simple Linear Regression Model
[Regression plot: y versus x; β0 = intercept, β1 = slope, ε = error]
Simple Linear Regression Analysis
Variables: x = independent variable, y = dependent variable
Parameters: α = y-intercept, β = slope, ε ~ normal distribution with mean 0 and variance σ²
Model: y = α + βx + ε (actual value of a score)
ŷ = a + bx (predicted value)
Simple Linear Regression Model…
[Line plot: ŷ = a + bx, with b = slope = Δy/Δx and a = intercept]
The Method of Least Squares
The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.
Best fitting line: ŷ = a + bx
Choose a and b to minimize SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:
S_xx = Σx² − (Σx)²/n
S_yy = Σy² − (Σy)²/n
S_xy = Σxy − (Σx)(Σy)/n
Best fitting line: ŷ = a + bx, where b = S_xy / S_xx and a = ȳ − b x̄
Example
The table shows the IQ scores for a random sample of n = 10 college freshmen, along with their final calculus grades.
Student 1 2 3 4 5 6 7 8 9 10
IQ Scores, x 39 43 21 64 57 47 28 75 34 52
Calculus grade, y 65 78 52 82 92 89 73 98 56 75
Use your calculator to find the sums and sums of squares.
Σx = 460    Σy = 760
Σx² = 23634  Σy² = 59816
Σxy = 36854
S_xx = 23634 − (460)²/10 = 2474
S_yy = 59816 − (760)²/10 = 2056
S_xy = 36854 − (460)(760)/10 = 1894
b = S_xy / S_xx = 1894/2474 = .76556 and a = ȳ − b x̄ = 76 − .76556(46) = 40.78
Best fitting line: ŷ = 40.78 + .77x
Example
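The least squares estimates for this example can be verified with a few lines of code (a sketch that follows the formulas above; the variable names are mine):

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # IQ scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus grades
n = len(x)

# Sums of squares needed for the slope and intercept
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

b = Sxy / Sxx                 # slope
a = sum(y)/n - b * sum(x)/n   # intercept: a = ybar - b*xbar
print(round(b, 5), round(a, 2))  # 0.76556 40.78
```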
The Analysis of Variance
The total variation in the experiment is measured by the total sum of squares:
Total SS = S_yy = Σ(y − ȳ)²
The Total SS is divided into two parts:
SSR (sum of squares for regression): measures the variation explained by using x in the model.
SSE (sum of squares for error): measures the leftover variation not explained by x.
We calculate
SSR = (S_xy)² / S_xx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = S_yy − (S_xy)²/S_xx = 2056 − 1449.9741 = 606.0259
The Analysis of Variance
The ANOVA Table
Total df = n − 1
Regression df = 1
Error df = n − 1 − 1 = n − 2
Mean Squares: MSR = SSR/1, MSE = SSE/(n − 2)

Source       df      SS        MS            F
Regression   1       SSR       SSR/1         MSR/MSE
Error        n − 2   SSE       SSE/(n − 2)
Total        n − 1   Total SS
The Calculus Problem
Source       df   SS          MS          F
Regression   1    1449.9741   1449.9741   19.14
Error        8    606.0259    75.7532
Total        9    2056.0000
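The ANOVA quantities follow directly from the sums of squares (a sketch; the hard-coded values come from the calculus example, and the names are my own):

```python
Syy, Sxx, Sxy, n = 2056.0, 2474.0, 1894.0, 10  # sums of squares and sample size

SSR = Sxy**2 / Sxx   # variation explained by the regression
SSE = Syy - SSR      # leftover (error) variation
MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE
print(round(SSR, 4), round(SSE, 4), round(F, 2))  # 1449.9741 606.0259 19.14
```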
You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.
This test is exactly equivalent to the t-test, with t2 = F.
Hypothesis:
H0: the model is not useful in predicting y
Test Statistic: F = MSR / MSE
Reject H0 if F > F_α with df = 1 and n − 2.
Testing the Usefulness of the Model (The F Test)
Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor  Coef    SE Coef  T     P
Constant   40.784  8.507    4.79  0.001
x          0.7656  0.1750   4.38  0.002

S = 8.70363  R-Sq = 70.5%  R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   1450.0  1450.0  19.14  0.002
Residual Error  8   606.0   75.8
Total           9   2056.0
Minitab Output
Annotations: regression coefficients a and b; least squares regression line; MSE; to test H0: β = 0, use the t in the row for x (note that t² = F).
Testing the Usefulness of the Model
• The first question to ask is whether the independent variable x is of any use in predicting y.
• If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.
H0: β = 0 versus Ha: β ≠ 0
The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic.
Test statistic: t = b / √(MSE / S_xx), which has a t distribution with df = n − 2,
or a (1 − α)100% confidence interval: b ± t_α/2 √(MSE / S_xx)
Testing the Usefulness of the Model
H0: β = 0 versus Ha: β ≠ 0
The Calculus Problem
•Is there a significant relationship between the calculus grades and the IQ scores at the 5% level of significance?
H0: β = 0 versus Ha: β ≠ 0
Test statistic: t = b / √(MSE / S_xx) = .7656 / √(75.7532 / 2474) = 4.38
Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection region, H0 is rejected.
There is a significant linear relationship between the calculus grades and the IQ scores for the population of college freshmen.
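The slope t test can be reproduced numerically (a sketch using the estimates computed in the example; the names are my own):

```python
import math

b, MSE, Sxx = 0.76556, 75.7532, 2474.0  # slope estimate, mean squared error, S_xx

t = b / math.sqrt(MSE / Sxx)  # t = b / sqrt(MSE / S_xx)
print(round(t, 2))  # 4.38
# As noted earlier, this test is equivalent to the F test: t^2 = F = 19.14.
```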
Measuring the Strength of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:
Correlation coefficient: r = S_xy / √(S_xx S_yy)
Coefficient of determination: r² = (S_xy)² / (S_xx S_yy) = SSR / Total SS
Measuring the Strength of the Relationship
• Since Total SS = SSR + SSE, r² = SSR/Total SS measures
- the proportion of the total variation in the responses that can be explained by using the independent variable x in the model, and
- the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.
For the calculus problem, r² = .705, or 70.5%, meaning that 70.5% of the variability in calculus scores can be explained by the model.
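r² can be read straight off the ANOVA decomposition (a sketch; the numbers are the SSR and Total SS from the calculus example):

```python
SSR, total_SS = 1449.9741, 2056.0  # from the ANOVA table for the calculus problem

r_sq = SSR / total_SS  # proportion of variation explained by the model
print(round(r_sq, 3))  # 0.705
```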
Estimation and Prediction
To estimate the average value of y when x = x0 (confidence interval):
ŷ ± t_α/2 √( MSE (1/n + (x0 − x̄)² / S_xx) )

To predict a particular value of y when x = x0 (prediction interval):
ŷ ± t_α/2 √( MSE (1 + 1/n + (x0 − x̄)² / S_xx) )
The Calculus Problem
Estimate the average calculus grade for students whose IQ score is 50 with a 95% confidence interval.
Calculate ŷ = 40.78424 + .76556(50) = 79.06.
ŷ ± t_α/2 √( MSE (1/n + (x0 − x̄)² / S_xx) )
= 79.06 ± 2.306 √( 75.7532 (1/10 + (50 − 46)²/2474) )
= 79.06 ± 6.55, or 72.51 to 85.61.
Estimate the calculus grade for a particular student whose IQ score is 50 with a 95% confidence interval.
Calculate ŷ = 40.78424 + .76556(50) = 79.06.
ŷ ± t_α/2 √( MSE (1 + 1/n + (x0 − x̄)² / S_xx) )
= 79.06 ± 2.306 √( 75.7532 (1 + 1/10 + (50 − 46)²/2474) )
= 79.06 ± 21.11, or 57.95 to 100.17.
Notice how much wider this interval is!
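Both intervals can be computed directly from the formulas above (a sketch; the names and hard-coded values come from the calculus example):

```python
import math

# Estimates and summary values from the calculus example
a, b = 40.78424, 0.76556
MSE, Sxx, xbar, n = 75.7532, 2474.0, 46.0, 10
t_crit = 2.306   # t_.025 with df = 8
x0 = 50          # IQ score of interest

y_hat = a + b * x0
ci_half = t_crit * math.sqrt(MSE * (1/n + (x0 - xbar)**2 / Sxx))      # confidence
pi_half = t_crit * math.sqrt(MSE * (1 + 1/n + (x0 - xbar)**2 / Sxx))  # prediction

print(round(y_hat, 2))                                       # 79.06
print(round(y_hat - ci_half, 2), round(y_hat + ci_half, 2))  # 72.51 85.61
print(round(y_hat - pi_half, 2), round(y_hat + pi_half, 2))  # 57.95 100.17
```

The prediction interval differs from the confidence interval only by the extra "1 +" under the square root, which accounts for the variation of an individual response around the line.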
The Calculus Problem
Predicted Values for New Observations
New Obs  Fit    SE Fit  95.0% CI        95.0% PI
1        79.06  2.84    (72.51, 85.61)  (57.95, 100.17)

Values of Predictors for New Observations
New Obs  x
1        50.0
Minitab Output
Green prediction bands are always wider than red confidence bands.
Both intervals are narrowest when x = x-bar.
Confidence and prediction intervals when x = 50
[Fitted line plot: y = 40.78 + 0.7656 x, with 95% CI and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]
Estimation and Prediction
• Once you have determined that the regression line is useful and have used the diagnostic plots to check for violation of the regression assumptions,
• you are ready to use the regression line to:
Estimate the average value of y for a given value of x
Predict a particular value of y for a given value of x.
• The best estimate of either E(y) or y for a given value x = x0 is
• Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
ŷ = a + bx0
Estimation and Prediction
Regression Assumptions
1. The relationship between x and y is linear, given by y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and constant variance, σ².
•Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.
Assumptions:
Diagnostic Tools
1. Normal probability plot or histogram of residuals
2. Plot of residuals versus fit or residuals versus variables
3. Plot of residual versus order
Residuals
• The residual error is the “leftover” variation in each data point after the variation explained by the regression model has been removed.
• If all assumptions have been met, these residuals should be normal, with mean 0 and variance σ².
Residual = yᵢ − ŷᵢ = yᵢ − (a + bxᵢ)
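Two quick sanity checks on the residuals for the calculus example (a sketch; a and b are the estimates computed earlier, and the names are my own):

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # IQ scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus grades
a, b = 40.78424, 0.76556                      # least squares estimates

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# Least squares residuals always sum to (essentially) zero,
# and their squared sum reproduces SSE from the ANOVA table (~606.0).
print(round(sum(residuals), 6))
print(round(sum(e**2 for e in residuals), 1))  # 606.0
```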
If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
If not, you will often see the pattern fail in the tails of the graph.
Normal Probability Plot
[Normal probability plot of the residuals (response is y)]
If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.
If not, you will see a pattern in the residuals.
Residuals versus Fits
[Plot of residuals versus the fitted values (response is y)]