chapter 10 correlation and regression © mcgraw-hill, bluman, 5th ed., chapter 10 1

63
Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Upload: sabina-quinn

Post on 17-Dec-2015

260 views

Category:

Documents


12 download

TRANSCRIPT

Page 1: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10

Correlation and Regression

© McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Page 2: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10 Overview Introduction 10-1 Correlation

10-2 Regression

10-3 Coefficient of Determination and Standard Error of the Estimate

Bluman, Chapter 10 2

Page 3: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10 Objectives1. Draw a scatter plot for a set of ordered pairs.

2. Compute the correlation coefficient.

3. Test the hypothesis H0: ρ = 0.

4. Compute the equation of the regression line.

5. Compute the coefficient of determination.

6. Compute the standard error of the estimate.

7. Find a prediction interval.

Bluman, Chapter 10 3

Page 4: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Introduction In addition to hypothesis testing and

confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.

Bluman, Chapter 10 4

Page 5: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Introduction CorrelationCorrelation is a statistical method used

to determine whether a linear relationship between variables exists.

RegressionRegression is a statistical method used to describe the nature of the relationship between variables—that is, positive or negative, linear or nonlinear.

Bluman, Chapter 10 5

Page 6: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Introduction The purpose of this chapter is to answer

these questions statistically:

1. Are two or more variables related?

2. If so, what is the strength of the relationship?

3. What type of relationship exists?

4. What kind of predictions can be made from the relationship?

Bluman, Chapter 10 6

Page 7: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Introduction

1. Are two or more variables related?

2. If so, what is the strength of the relationship?

To answer these two questions, statisticians use the correlation coefficientcorrelation coefficient, a numerical measure to determine whether two or more variables are related and to determine the strength of the relationship between or among the variables.

Bluman, Chapter 10 7

Page 8: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Introduction

3. What type of relationship exists?

There are two types of relationships: simple and multiple.

In a simple relationship, there are two variables: an independent variable independent variable (predictor variable) and a dependent variable dependent variable (response variable).

In a multiple relationship, there are two or more independent variables that are used to predict one dependent variable.

Bluman, Chapter 10 8

Page 9: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Introduction

4. What kind of predictions can be made from the relationship?

Predictions are made in all areas and daily. Examples include weather forecasting, stock market analyses, sales predictions, crop predictions, gasoline price predictions, and sports predictions. Some predictions are more accurate than others, due to the strength of the relationship. That is, the stronger the relationship is between variables, the more accurate the prediction is.

Bluman, Chapter 10 9

Page 10: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

10.1 Scatter Plots and Correlation A scatter plot scatter plot is a graph of the ordered

pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y.

Bluman, Chapter 10 10

Page 11: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-1Example 10-1

Page #536

Bluman, Chapter 10 11

Page 12: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-1: Car Rental CompaniesConstruct a scatter plot for the data shown for car rental companies in the United States for a recent year.

Step 1: Draw and label the x and y axes.

Step 2: Plot each point on the graph.

Bluman, Chapter 10 12

Page 13: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-1: Car Rental Companies

Bluman, Chapter 10 13

Positive Relationship

Page 14: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Correlation The correlation coefficient correlation coefficient computed from

the sample data measures the strength and direction of a linear relationship between two variables.

There are several types of correlation coefficients. The one explained in this section is called the Pearson product moment Pearson product moment correlation coefficient (PPMC)correlation coefficient (PPMC).

The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is .

Bluman, Chapter 10 14

Page 15: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Correlation The range of the correlation coefficient is from

1 to 1. If there is a strong positive linear strong positive linear

relationship relationship between the variables, the value of r will be close to 1.

If there is a strong negative linear strong negative linear relationship relationship between the variables, the value of r will be close to 1.

Bluman, Chapter 10 15

Page 16: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Correlation

Bluman, Chapter 10 16

Page 17: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Correlation CoefficientThe formula for the correlation coefficient is

where n is the number of data pairs.

Rounding Rule: Round to three decimal places.

Bluman, Chapter 10 17

2 22 2

n xy x yr

n x x n y y

Page 18: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-1Example 10-2

Page #540

Bluman, Chapter 10 18

Page 19: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-4: Car Rental CompaniesCompute the correlation coefficient for the data in Example 10–1.

Bluman, Chapter 10 19

CompanyCars x

(in 10,000s)Income y(in billions) xy x2 y2

ABCDEF

63.029.020.819.113.48.5

7.03.92.12.81.41.5

441.00113.1043.6853.4818.762.75

3969.00841.00432.64364.81179.5672.25

49.0015.214.417.841.962.25

Σ x = 153.8

Σ y = 18.7

Σ xy = 682.77

Σ x2 = 5859.26

Σ y2 = 80.67

Page 20: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-4: Car Rental CompaniesCompute the correlation coefficient for the data in Example 10–1.

Bluman, Chapter 10 20

Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx2 = 5859.26, Σy2 = 80.67, n = 6

2 22 2

n xy x yr

n x x n y y

2 2

6 682.77 153.8 18.7

6 5859.26 153.8 6 80.67 18.7

r

0.982 (strong positive relationship)r

Page 21: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Hypothesis Testing In hypothesis testing, one of the following is

true:

H0: 0 This null hypothesis means that there is no correlation between

the x and y variables in the population.

H1: 0 This alternative hypothesis means that there is a significant correlation between the variables in the population.

Bluman, Chapter 10 21

Page 22: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

t Test for the Correlation Coefficient

Bluman, Chapter 10 22

2

2

1

with degrees of freedom equal to 2.

nt r

r

n

Page 23: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-1Example 10-3

Page #544

Bluman, Chapter 10 23

Page 24: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-3: Car Rental CompaniesTest the significance of the correlation coefficient found in Example 10–2. Use α = 0.05 and r = 0.982.

Step 1: State the hypotheses.

H0: ρ = 0 and H1: ρ 0

Step 2: Find the critical value.

Since α = 0.05 and there are 6 – 2 = 4 degrees of freedom, the critical values obtained from Table F are ±2.776.

Bluman, Chapter 10 24

Page 25: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Step 3: Compute the test value.

Step 4: Make the decision. Reject the null hypothesis.

Step 5: Summarize the results. There is a significant relationship between the number of cars a rental agency owns and its annual income.

Example 10-3: Car Rental Companies

Bluman, Chapter 10 25

2

2

1

n

t rr 2

6 20.982 10.4

1 0.982

Page 26: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-1Example 10-4

Page #545

Bluman, Chapter 10 26

Page 27: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-4: Car Rental CompaniesUsing Table I, test the significance of the correlation coefficient r = 0.067, from Example 10–3, at α = 0.01.

Step 1: State the hypotheses.

H0: ρ = 0 and H1: ρ 0

There are 9 – 2 = 7 degrees of freedom. The value in Table I when α = 0.01 is 0.798.

For a significant relationship, r must be greater than 0.798 or less than -0.798. Since r = 0.067, do not reject the null.

Hence, there is not enough evidence to say that there is a significant linear relationship between the variables.

Bluman, Chapter 10 27

Page 28: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Possible Relationships Between Variables

When the null hypothesis has been rejected for a specific a value, any of the following five possibilities can exist.

1.There is a direct cause-and-effect relationship between the variables. That is, x causes y.

2.There is a reverse cause-and-effect relationship between the variables. That is, y causes x.

3.The relationship between the variables may be caused by a third variable.

4.There may be a complexity of interrelationships among many variables.

5.The relationship may be coincidental.

Bluman, Chapter 10 28

Page 29: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Possible Relationships Between Variables

1. There is a reverse cause-and-effect relationship between the variables. That is, y causes x.

For example, water causes plants to grow poison causes death heat causes ice to melt

Bluman, Chapter 10 29

Page 30: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Possible Relationships Between Variables

2. There is a reverse cause-and-effect relationship between the variables. That is, y causes x.

For example, Suppose a researcher believes excessive coffee

consumption causes nervousness, but the researcher fails to consider that the reverse situation may occur. That is, it may be that an extremely nervous person craves coffee to calm his or her nerves.

Bluman, Chapter 10 30

Page 31: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Possible Relationships Between Variables

3. The relationship between the variables may be caused by a third variable.

For example, If a statistician correlated the number of deaths

due to drowning and the number of cans of soft drink consumed daily during the summer, he or she would probably find a significant relationship. However, the soft drink is not necessarily responsible for the deaths, since both variables may be related to heat and humidity.

Bluman, Chapter 10 31

Page 32: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Possible Relationships Between Variables

4. There may be a complexity of interrelationships among many variables.

For example, A researcher may find a significant relationship

between students’ high school grades and college grades. But there probably are many other variables involved, such as IQ, hours of study, influence of parents, motivation, age, and instructors.

Bluman, Chapter 10 32

Page 33: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Possible Relationships Between Variables

5. The relationship may be coincidental.

For example, A researcher may be able to find a significant

relationship between the increase in the number of people who are exercising and the increase in the number of people who are committing crimes. But common sense dictates that any relationship between these two values must be due to coincidence.

Bluman, Chapter 10 33

Page 34: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

10.2 Regression If the value of the correlation coefficient is

significant, the next step is to determine the equation of the regression line regression line which is the data’s line of best fit.

Bluman, Chapter 10 34

Page 35: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Regression

Bluman, Chapter 10 35

Best fit Best fit means that the sum of the squares of the vertical distance from each point to the line is at a minimum.

Page 36: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Regression Line

Bluman, Chapter 10 36

y a bx

2

22

22

where

= intercept

= the slope of the line.

y x x xya

n x x

n xy x yb

n x x

a y

b

Page 37: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-2Example 10-5

Page #553

Bluman, Chapter 10 37

Page 38: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-5: Car Rental CompaniesFind the equation of the regression line for the data in Example 10–2, and graph the line on the scatter plot.

Bluman, Chapter 10 38

Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx2 = 5859.26, Σy2 = 80.67, n = 6

2

22

y x x xya

n x x

22

n xy x yb

n x x

2

18.7 5859.26 153.8 682.77

6 5859.26 153.8

0.396

2

6 682.77 153.8 18.7

6 5859.26 153.8

0.106

0.396 0.106 y a bx y x

Page 39: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Find two points to sketch the graph of the regression line.

Use any x values between 10 and 60. For example, let x equal 15 and 40. Substitute in the equation and find the corresponding y value.

Plot (15,1.986) and (40,4.636), and sketch the resulting line.

Example 10-5: Car Rental Companies

Bluman, Chapter 10 39

0.396 0.106

0.396 0.106 15

1.986

y x

0.396 0.106

0.396 0.106 40

4.636

y x

Page 40: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-5: Car Rental CompaniesFind the equation of the regression line for the data in Example 10–2, and graph the line on the scatter plot.

Bluman, Chapter 10 40

0.396 0.106 y x

15, 1.986

40, 4.636

Page 41: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-2Example 10-6

Page #555

Bluman, Chapter 10 41

Page 42: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles.

x = 20 corresponds to 200,000 automobiles.

Hence, when a rental agency has 200,000 automobiles, its revenue will be approximately $2.516 billion.

Example 10-11: Car Rental Companies

Bluman, Chapter 10 42

0.396 0.106

0.396 0.106 20

2.516

y x

Page 43: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Regression The magnitude of the change in one variable

when the other variable changes exactly 1 unit is called a marginal changemarginal change. The value of slope b of the regression line equation represents the marginal change.

For valid predictions, the value of the correlation coefficient must be significant.

When r is not significantly different from 0, the best predictor of y is the mean of the data values of y.

Bluman, Chapter 10 43

Page 44: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Assumptions for Valid Predictions1. For any specific value of the independent

variable x, the value of the dependent variable y must be normally distributed about the regression line. See Figure 10–16(a).

2. The standard deviation of each of the dependent variables must be the same for each value of the independent variable. See Figure 10–16(b).

Bluman, Chapter 10 44

Page 45: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Extrapolations (Future Predictions) ExtrapolationExtrapolation, or making predictions beyond

the bounds of the data, must be interpreted cautiously.

Remember that when predictions are made, they are based on present conditions or on the premise that present trends will continue. This assumption may or may not prove true in the future.

Bluman, Chapter 10 45

Page 46: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Procedure TableStep 1: Make a table with subject, x, y, xy, x2, and

y2 columns.

Step 2: Find the values of xy, x2, and y2. Place them in the appropriate columns and sum each column.

Step 3: Substitute in the formula to find the value of r.

Step 4: When r is significant, substitute in the formulas to find the values of a and b for the regression line equation y = a + bx.

Bluman, Chapter 10 46

Page 47: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

10.3 Coefficient of Determination and Standard Error of the Estimate

The total variation total variation is the sum of the squares of the vertical distances each point is from the mean.

The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.

Bluman, Chapter 10 47

2 y y

Page 48: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Variation The variation obtained from the

relationship (i.e., from the predicted y' values) is and is called the explained variationexplained variation.

Variation due to chance, found by , is called the unexplained unexplained variationvariation. This variation cannot be attributed to the relationships.

Bluman, Chapter 10 48

2 y y

2 y y

Page 49: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Variation

Bluman, Chapter 10 49

Page 50: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Coefficient of Determiation The coefficient of determination coefficient of determination is the

ratio of the explained variation to the total variation.

The symbol for the coefficient of determination is r 2.

Another way to arrive at the value for r 2 is to square the correlation coefficient.

Bluman, Chapter 10 50

2 explained variation

total variationr

Page 51: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Coefficient of Nondetermiation The coefficient of nondetermination coefficient of nondetermination is

a measure of the unexplained variation. The formula for the coefficient of

determination is 1.00 – r 2.

Bluman, Chapter 10 51

Page 52: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Standard Error of the Estimate The standard error of estimatestandard error of estimate,

denoted by sest is the standard deviation of the observed y values about the predicted y' values. The formula for the standard error of estimate is:

Bluman, Chapter 10 52

2

2

est

y ys

n

Page 53: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-3Example 10-7

Page #569

Bluman, Chapter 10 53

Page 54: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

A researcher collects the following data and determines that there is a significant relationship between the age of a copy machine and its monthly maintenance cost. The regression equation is y = 55.57 + 8.13x. Find the standard error of the estimate.

Example 10-7: Copy Machine Costs

Bluman, Chapter 10 54

Page 55: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-7: Copy Machine Costs

Bluman, Chapter 10 55

MachineAge x

(years)Monthlycost, y y y – y (y – y )2

ABCDEF

123446

6278709093103

63.7071.8379.9688.0988.09

104.35

-1.706.17

-9.961.914.91

-1.35

2.8938.068999.2016

3.648124.1081

1.8225

169.7392

55.57 8.13

55.57 8.13 1 63.70

55.57 8.13 2 71.83

55.57 8.13 3 79.96

55.57 8.13 4 88.09

55.57 8.13 6 104.35

y x

y

y

y

y

y

2

2

est

y ys

n

169.73926.51

4 ests

Page 56: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-3Example 10-8

Page #570

Bluman, Chapter 10 56

Page 57: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-8: Copy Machine Costs

Bluman, Chapter 10 57

2

2

est

y a y b xys

n

Page 58: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Example 10-8: Copy Machine Costs

Bluman, Chapter 10 58

496 1778

MachineAge x

(years)Monthlycost, y xy y2

ABCDEF

123446

6278709093

103

62156210360372618

3,8446,0844,9008,1008,649

10,609

2

2

est

y a y b xys

n

42,186

42,186 55.57 496 8.13 17786.48

4

ests

Page 59: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Formula for the Prediction Interval about a Value y

Bluman, Chapter 10 59

2

2 22

2

2 22

11

11

with d.f. = - 2

est

est

n x Xy t s y

n n x x

n x Xy t s

n n x x

n

Page 60: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Chapter 10Correlation and Regression

Section 10-3Example 10-9

Page #571

Bluman, Chapter 10 60

Page 61: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

For the data in Example 10–7, find the 95% prediction interval for the monthly maintenance cost of a machine that is 3 years old.

Step 1: Find

Step 2: Find y for x = 3.

Step 3: Find sest.

(as shown in Example 10-13)

Example 10-9: Copy Machine Costs

Bluman, Chapter 10 61

2, , and . x x X2 20

20 82 3.36

x x X

55.57 8.13 3 79.96 y

6.48ests

Page 62: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Step 4: Substitute in the formula and solve.

Example 10-9: Copy Machine Costs

Bluman, Chapter 10 62

2

2 22

2

2 22

11

11

est

est

n x Xy t s y

n n x x

n x Xy t s

n n x x

2

2

2

2

6 3 3.3179.96 2.776 6.48 1

6 6 82 20

6 3 3.3179.96 2.776 6.48 1

6 6 82 20

y

Page 63: Chapter 10 Correlation and Regression © McGraw-Hill, Bluman, 5th ed., Chapter 10 1

Step 4: Substitute in the formula and solve.

Example 10-9: Copy Machine Costs

Bluman, Chapter 10 63

2

2

2

2

6 3 3.3179.96 2.776 6.48 1

6 6 82 20

6 3 3.3179.96 2.776 6.48 1

6 6 82 20

y

79.96 19.43 79.96 19.43

60.53 99.39

y

y

Hence, you can be 95% confident that the interval 60.53 < y < 99.39 contains the actual value of y.