© Copyright McGraw-Hill 2000 10-1 Correlation and Regression CHAPTER 10




Objectives

Draw a scatter plot for a set of ordered pairs.

Compute the correlation coefficient.

Test the hypothesis H0: ρ = 0.

Compute the equation of the regression line.


Objectives (cont’d.)

Compute the coefficient of determination.

Compute the standard error of estimate.

Find a prediction interval.

Be familiar with the concept of multiple regression.


Introduction

In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.


Statistical Methods

Correlation is a statistical method used to determine whether a linear relationship between variables exists.

Regression is a statistical method used to describe the nature of the relationship between variables—that is, positive or negative, linear or nonlinear.


Statistical Questions

1. Are two or more variables related?

2. If so, what is the strength of the relationship?

3. What type of relationship exists?

4. What kind of predictions can be made from the relationship?


Vocabulary

A correlation coefficient is a measure of how variables are related.

In a simple relationship, there are only two types of variables under study.

In multiple relationships, many variables are under study.


Scatter Plots

A scatter plot is a graph of the ordered pairs (x,y) of numbers consisting of the independent variable, x, and the dependent variable, y.

A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.


Scatter Plot Example

[Scatter plot: number of absences (x-axis, 0 to 20) versus final grade (y-axis, 30 to 100).]


Correlation Coefficient

The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.

The symbol for the sample correlation coefficient is r.

The symbol for the population correlation coefficient is ρ (rho).


Correlation Coefficient (cont’d.)

The range of the correlation coefficient is from −1 to +1.

If there is a strong positive linear relationship between the variables, the value of r will be close to +1.

If there is a strong negative linear relationship between the variables, the value of r will be close to −1.


Correlation Coefficient (cont’d.)

When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0.

[Scale: r = −1 indicates a strong negative linear relationship, r = 0 no linear relationship, and r = +1 a strong positive linear relationship.]


Formula for the Correlation Coefficient r

r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²] · [n(Σy²) − (Σy)²]}

where n is the number of data pairs.
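As a sketch, the formula can be evaluated directly from the column sums. The data pairs below are hypothetical values in the spirit of the absences/final-grade scatter plot shown earlier:

```python
import math

# Hypothetical data: x = number of absences, y = final grade
xs = [6, 2, 15, 9, 12, 5, 8]
ys = [82, 86, 43, 74, 58, 90, 78]

def correlation_coefficient(xs, ys):
    # r = [n(Sxy) - (Sx)(Sy)] / sqrt{[n(Sx2) - (Sx)^2][n(Sy2) - (Sy)^2]}
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

r = correlation_coefficient(xs, ys)
print(round(r, 3))  # -0.944: a strong negative linear relationship
```

The negative sign matches intuition here: as absences rise, the final grade falls.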


Population Correlation Coefficient

Formally defined, the population correlation coefficient, ρ, is the correlation computed by using all possible pairs of data values (x, y) taken from a population.


Hypothesis Testing

In hypothesis testing, one of the following is true:

H0: ρ = 0 This null hypothesis means that there is no correlation between the x and y variables in the population.

H1: ρ ≠ 0 This alternative hypothesis means that there is a significant correlation between the variables in the population.


t Test for the Correlation Coefficient

Formula for the t test for the correlation coefficient:

t = r · √[(n − 2) / (1 − r²)]

with degrees of freedom equal to n − 2.
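A minimal sketch of the test statistic, using the illustrative values r = −0.944 and n = 7 from the hypothetical absences/final-grade data; the critical value is an assumed t-table lookup:

```python
import math

r, n = -0.944, 7       # illustrative sample correlation and sample size
t_crit = 2.571         # assumed t-table value: two-tailed, alpha = 0.05, d.f. = 5

t = r * math.sqrt((n - 2) / (1 - r ** 2))
df = n - 2             # degrees of freedom = n - 2

# |t| far exceeds the critical value, so H0: rho = 0 would be rejected
print(round(t, 3), df)
```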


Possible Relationships Between Variables

There is a direct cause-and-effect relationship between the variables: that is, x causes y.

There is a reverse cause-and-effect relationship between the variables: that is, y causes x.

The relationship between the variables may be caused by a third variable: that is, y may appear to cause x but in reality z causes x.


Possible Relationships Between Variables (cont’d.)

There may be a complexity of interrelationships among many variables; that is, x may cause y but w, t, and z fit into the picture as well.

The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.


Interpretation of Relationships

When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.


Regression Line

If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit.

Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.


Scatter Plot with Three Lines

[Figure: a scatter plot with three candidate lines drawn through the same points; axes x and y.]


A Linear Relation

[Figure: a scatter plot whose points follow a straight line; axes x and y.]


Equation of a Line

In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept.

In statistics, the equation of the regression line is written as y' = a + bx, where b is the slope of the line and a is the y' intercept.


Regression Line

Formulas for the regression line y' = a + bx:

a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]

b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]

where a is the y' intercept and b is the slope of the line.
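A sketch of both formulas on the same hypothetical absences/final-grade pairs used earlier; the shared denominator n·Σx² − (Σx)² is computed once:

```python
# Hypothetical data: x = number of absences, y = final grade
xs = [6, 2, 15, 9, 12, 5, 8]
ys = [82, 86, 43, 74, 58, 90, 78]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sx2 = sum(x * x for x in xs)

den = n * sx2 - sx ** 2          # shared denominator n*Sx2 - (Sx)^2
a = (sy * sx2 - sx * sxy) / den  # y' intercept
b = (n * sxy - sx * sy) / den    # slope

# Rounded to three decimal places per the chapter's rounding rule:
print(round(a, 3), round(b, 3))  # y' = 102.493 - 3.622x
```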


Rounding Rule

When calculating the values of a and b, round to three decimal places.


Assumptions for Valid Predictions

For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line.

The standard deviation of each of the dependent variables must be the same for each value of the independent variable.


Limits of Predictions

Remember when assumptions are made, they are based on present conditions or on the premise that present trends will continue.

The assumption may not prove true in the future.


Procedure

Finding the correlation coefficient and the regression line equation

Step 1 Make a table with columns for subject, x, y, xy, x², and y².

Step 2 Find the values of xy, x², and y². Place them in the appropriate columns.

Step 3 Substitute in the formula to find the value of r.


Procedure (cont’d.)

Step 4 When r is significant, substitute in the formulas to find the values of a and b for the regression line equation.


Total Variation

The total variation, Σ(y − ȳ)², is the sum of the squares of the vertical distance each point is from the mean.

The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.


Two Parts of Total Variation

The variation obtained from the relationship (i.e., from the predicted y' values) is Σ(y' − ȳ)² and is called the explained variation.

Variation due to chance, found by Σ(y − y')², is called the unexplained variation. This variation cannot be attributed to the relationship.


Total Variation (cont’d.)

Hence, the total variation is equal to the sum of the explained variation and the unexplained variation:

Σ(y − ȳ)² = Σ(y' − ȳ)² + Σ(y − y')²

For a single point, the differences are called deviations.
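The decomposition can be checked numerically. A sketch using the hypothetical absences/final-grade pairs and the illustrative fitted line y' = 102.493 − 3.622x (the identity holds only approximately here because the coefficients are rounded):

```python
xs = [6, 2, 15, 9, 12, 5, 8]
ys = [82, 86, 43, 74, 58, 90, 78]
a, b = 102.493, -3.622            # illustrative fitted regression line

ybar = sum(ys) / len(ys)
pred = [a + b * x for x in xs]

total       = sum((y - ybar) ** 2 for y in ys)               # sum (y - ybar)^2
explained   = sum((yp - ybar) ** 2 for yp in pred)           # sum (y' - ybar)^2
unexplained = sum((y - yp) ** 2 for y, yp in zip(ys, pred))  # sum (y - y')^2

print(round(total, 1), round(explained + unexplained, 1))
print(round(explained / total, 3))  # the coefficient of determination r^2
```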


Coefficient of Determination

The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable.

The symbol for the coefficient of determination is r2.


Coefficient of Nondetermination

The coefficient of nondetermination is a measure of the unexplained variation.

The formula for the coefficient of nondetermination is 1.00 − r².


Standard Error of Estimate

The standard error of estimate, denoted by s_est, is the standard deviation of the observed y values about the predicted y' values. The formula for the standard error of estimate is:

s_est = √[Σ(y − y')² / (n − 2)]
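A sketch on the same hypothetical pairs and illustrative fitted line; note that the n − 2 in the denominator matches the degrees of freedom of the t test:

```python
import math

xs = [6, 2, 15, 9, 12, 5, 8]
ys = [82, 86, 43, 74, 58, 90, 78]
a, b = 102.493, -3.622   # illustrative fitted regression line y' = a + bx

n = len(xs)
pred = [a + b * x for x in xs]
# Standard deviation of the observed y values about the predicted y' values
s_est = math.sqrt(sum((y - yp) ** 2 for y, yp in zip(ys, pred)) / (n - 2))
print(round(s_est, 3))
```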


Prediction Interval

The standard error of estimate can be used for constructing a prediction interval about a y' value.

The formula for the prediction interval is:

y' − t_α/2 · s_est · √[1 + 1/n + n(x − X̄)² / (n·Σx² − (Σx)²)] < y < y' + t_α/2 · s_est · √[1 + 1/n + n(x − X̄)² / (n·Σx² − (Σx)²)]

The d.f. = n − 2.
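A sketch that evaluates the interval for one new x value, reusing the illustrative fitted line and standard error from the hypothetical absences data; the t value is an assumed table lookup for alpha = 0.05 with d.f. = 5:

```python
import math

xs = [6, 2, 15, 9, 12, 5, 8]     # hypothetical absences data
a, b = 102.493, -3.622           # illustrative fitted line y' = a + bx
s_est = 6.055                    # standard error of estimate for these pairs
t_crit = 2.571                   # assumed t-table value, alpha = 0.05, d.f. = 5

n = len(xs)
sx, sx2 = sum(xs), sum(x * x for x in xs)
xbar = sx / n

x = 10                           # predict the grade of a student with 10 absences
y_prime = a + b * x
margin = t_crit * s_est * math.sqrt(
    1 + 1 / n + n * (x - xbar) ** 2 / (n * sx2 - sx ** 2)
)
print(round(y_prime - margin, 1), "< y <", round(y_prime + margin, 1))
```

The interval is wide, which is typical for individual predictions from a small sample.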


Multiple Regression

In multiple regression, there are several independent variables and one dependent variable, and the equation is:

y' = a + b1x1 + b2x2 + … + bkxk

where x1, x2, …, xk are the independent variables.


Multiple Regression (cont’d.)

Multiple regression analysis is used when a statistician thinks there are several independent variables contributing to the variation of the dependent variable.

This analysis then can be used to increase the accuracy of predictions for the dependent variable over one independent variable alone.


Assumptions for Multiple Regression

Normality assumption—for any specific value of the independent variable, the values of the y variable are normally distributed.

Equal variance assumption—the variances for the y variable are the same for each value of the independent variable.

Linearity assumption—there is a linear relationship between the dependent variable and the independent variables.


Assumptions (cont’d.)

Nonmulticollinearity assumption—the independent variables are not correlated.

Independence assumption—the values for the y variable are independent.


Multiple Correlation Coefficient

In multiple regression, as in simple regression, the strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient.

This multiple correlation coefficient is symbolized by R.


Multiple Correlation Coefficient Formula

The formula for R (in the case of two independent variables) is:

R = √[(r²yx1 + r²yx2 − 2·ryx1·ryx2·rx1x2) / (1 − r²x1x2)]

where ryx1 is the correlation coefficient for the variables y and x1; ryx2 is the correlation coefficient for the variables y and x2; and rx1x2 is the correlation coefficient for the variables x1 and x2.
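A sketch with made-up pairwise correlation values (the three r values are assumptions for illustration only):

```python
import math

# Hypothetical pairwise correlations among y, x1, and x2
r_yx1, r_yx2, r_x1x2 = 0.845, 0.791, 0.371

R = math.sqrt(
    (r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2)
    / (1 - r_x1x2 ** 2)
)
print(round(R, 3))  # R is at least as large as either individual correlation
```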


Coefficient of Multiple Determination

As with simple regression, R2 is the coefficient of multiple determination, and it is the amount of variation explained by the regression model.

The expression 1 − R² represents the amount of unexplained variation, called the error or residual variation.


F Test for Significance of R

The formula for the F test is:

F = (R² / k) / [(1 − R²) / (n − k − 1)]

where n is the number of data groups (x1, x2, …, y) and k is the number of independent variables.

The degrees of freedom are d.f.N = n − k and d.f.D = n − k − 1.
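A sketch with n and k chosen as illustrative assumptions, and R carried over from the multiple correlation sketch:

```python
# Illustrative values: multiple R from n = 5 data groups with k = 2 predictors
R, n, k = 0.989, 5, 2

F = (R ** 2 / k) / ((1 - R ** 2) / (n - k - 1))
dfn, dfd = n - k, n - k - 1   # degrees of freedom as defined on this slide
print(round(F, 2), dfn, dfd)
```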


Adjusted R²

Since the value of R² is dependent on n (the number of data pairs) and k (the number of variables), statisticians also calculate what is called an adjusted R², denoted by R²adj.

This is based on the number of degrees of freedom.


Adjusted R² (cont’d.)

The formula for the adjusted R² is:

R²adj = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
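A sketch using assumed values of R², n, and k; the adjustment pulls R² down, and the penalty grows as k approaches n:

```python
# Illustrative values: R^2 with n = 5 data groups and k = 2 predictors
R2, n, k = 0.978, 5, 2

R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(round(R2_adj, 3))  # smaller than R2, as expected
```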


Summary

The strength and direction of the linear relationship between variables is measured by the value of the correlation coefficient r.

r can assume values between and including −1 and +1.

The closer the value of the correlation coefficient is to −1 or +1, the stronger the linear relationship is between the variables.

A value of −1 or +1 indicates a perfect linear relationship.


Summary (cont’d.)

Relationships can be linear or curvilinear.

To determine the shape, one draws a scatter plot of the variables.

If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.


Summary (cont’d.)

In addition, relationships can be multiple. That is, there can be two or more independent variables and one dependent variable.

A coefficient of correlation and a regression equation can be found for multiple relationships, just as they can be found for simple relationships.


Summary (cont’d.)

The coefficient of determination is a better indicator of the strength of a linear relationship than the correlation coefficient.

It is better because it identifies the percentage of variation of the dependent variable that is directly attributable to the variation of the independent variable.

The coefficient of determination is obtained by squaring the correlation coefficient and converting the result to a percentage.


Summary (cont’d.)

Another statistic used in correlation and regression is the standard error of estimate, which is an estimate of the standard deviation of the y values about the predicted y' values.

The standard error of estimate can be used to construct a prediction interval about a specific value (point estimate) y' of the mean of the y values for a given x.


Conclusion

Many relationships among variables exist in the real world. One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.