chapter 9 section 9.1 - correlation - tnfaculty.southwest.tn.edu/hprovinc/content/materials/lecture...


Upload: lamnhan

Post on 17-Apr-2018



Chapter 9

Section 9.1 - Correlation

Objectives:

• Introduce linear correlation, independent and dependent variables, and the types of correlation

• Find a correlation coefficient

• Test a population correlation coefficient ρ using a table

• Perform a hypothesis test for a population correlation coefficient ρ

• Distinguish between correlation and causation

Correlation

• A relationship between two variables.

• The data can be represented by ordered pairs (x, y)

x is the independent (or explanatory) variable

y is the dependent (or response) variable

A scatter plot can be used to determine whether a linear (straight line) correlation exists between two variables.

Types of Correlation


Example: Constructing a Scatter Plot

An economist wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and determine whether there appears to be a positive or negative linear correlation or no linear correlation. (Source: World Bank and U.S. Energy Information Administration)

Solution:

Correlation coefficient

• A measure of the strength and the direction of a linear relationship between two variables.

• The symbol r represents the sample correlation coefficient.

• A formula for r is

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

where n is the number of data pairs.

• The population correlation coefficient is represented by ρ (rho).

• The range of the correlation coefficient is -1 to 1.
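The computational formula for r can be sketched in Python. This is a minimal illustration, not the worked solution for the GDP example: the function name `correlation_coefficient` and the five data pairs are made up for demonstration.

```python
import math

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r from the sums n, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data pairs, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_coefficient(xs, ys)
print(f"r = {r:.3f}")
```

A value of r near 1 or -1 suggests a strong linear correlation, while a value near 0 suggests little or no linear correlation.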


Linear Correlation

Calculating a Correlation Coefficient


Example: Finding the Correlation Coefficient

Calculate the correlation coefficient for the gross domestic products and carbon dioxide emissions data. What can you conclude?

Solution:

Using a Table to Test a Population Correlation Coefficient ρ

• Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance.

• Use Table 11 in Appendix B.

• If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is significant.


Example: Determine whether ρ is significant for five pairs of data (n = 5) at a level of significance of α = 0.01.

Solution:

If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to conclude that the correlation is significant.
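The table-based decision rule can be sketched in a few lines of Python. The critical value 0.959 is the one quoted above for n = 5 and α = 0.01; the helper name `is_significant` and the sample values of r are hypothetical.

```python
def is_significant(r, critical_value):
    """Decision rule: the correlation is significant when |r| exceeds the critical value."""
    return abs(r) > critical_value

# Critical value for n = 5 and alpha = 0.01, as quoted in the example above.
CRITICAL_VALUE = 0.959

# Hypothetical sample correlation coefficients, for illustration only.
for r in (0.97, -0.98, 0.90):
    verdict = "significant" if is_significant(r, CRITICAL_VALUE) else "not enough evidence"
    print(f"r = {r:+.2f}: {verdict}")
```

The absolute value matters: a strong negative correlation such as r = -0.98 passes the test just as a strong positive one does.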


Example: Using a Table to Test a Population Correlation Coefficient ρ

For the Old Faithful data below, you used 25 pairs of data to find r ≈ 0.979. Is the correlation coefficient significant? Use α = 0.05.

Solution:

Hypothesis Testing for a Population Correlation Coefficient ρ

• A hypothesis test can also be used to determine whether the sample correlation coefficient r provides enough evidence to conclude that the population correlation coefficient ρ is significant at a specified level of significance.

• A hypothesis test can be one-tailed or two-tailed.

• Left-tailed test

H0: ρ ≥ 0 (no significant negative correlation)

Ha: ρ < 0 (significant negative correlation)

• Right-tailed test

H0: ρ ≤ 0 (no significant positive correlation)

Ha: ρ > 0 (significant positive correlation)

• Two-tailed test

H0: ρ = 0 (no significant correlation)

Ha: ρ ≠ 0 (significant correlation)


The t-Test for the Correlation Coefficient

• Can be used to test whether the correlation between two variables is significant.

• The test statistic is r.

• The standardized test statistic

$$ t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

follows a t-distribution with d.f. = n - 2.

• In this text, only two-tailed hypothesis tests for ρ are considered.
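The standardized test statistic can be sketched in Python as follows. The value r = 0.882 comes from the t-test example later in this section; n = 10 is taken from the recap in Section 9.3 and is assumed here to describe the same data set. The critical value 2.306 is the standard two-tailed t-table entry for α = 0.05 with d.f. = 8. The function name `t_statistic` is illustrative.

```python
import math

def t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    if n <= 2:
        raise ValueError("need n > 2 so that d.f. = n - 2 is positive")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r, n = 0.882, 10       # r from the notes; n = 10 is an assumption (see lead-in)
t = t_statistic(r, n)
t_critical = 2.306     # two-tailed, alpha = 0.05, d.f. = 8 (standard t-table)

# Two-tailed test: reject H0 (rho = 0) when t falls in either rejection region.
reject_h0 = abs(t) > t_critical
print(f"t = {t:.3f}, d.f. = {n - 2}, reject H0: {reject_h0}")
```

Under these assumptions t falls well inside the rejection region, which is consistent with the later statement that x and y have a significant linear correlation.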

Using the t-Test for ρ


Example: t-Test for a Correlation Coefficient

Previously you calculated r ≈ 0.882 (see the example "Finding the Correlation Coefficient"). Test the significance of this correlation coefficient. Use α = 0.05.

Solution:

Correlation and Causation

• The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.

• If there is a significant correlation between two variables, you should consider the following possibilities.

1. Is there a direct cause-and-effect relationship between the variables?

• Does x cause y?

2. Is there a reverse cause-and-effect relationship between the variables?

• Does y cause x?

3. Is it possible that the relationship between the variables can be caused by a third variable or by a combination of several other variables?

4. Is it possible that the relationship between two variables may be a coincidence?


Section 9.2 - Linear Regression

Objectives:

• Find the equation of a regression line

• Predict y-values using a regression equation

Regression lines

• After verifying that the linear correlation between two variables is significant, the next step is to determine the equation of the line that best models the data (the regression line).

• The regression line can be used to predict the value of y for a given value of x.

Residual

• The difference between the observed y-value and the predicted y-value for a given x-value on the line.

Regression line (line of best fit)

• The line for which the sum of the squares of the residuals is a minimum.

• The equation of a regression line for an independent variable x and a dependent variable y is

ŷ = mx + b

where m is the slope, b is the y-intercept, and ŷ is the predicted y-value for a given x-value.


The Equation of a Regression Line

• ŷ = mx + b where

$$ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \qquad \text{and} \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$

• ȳ is the mean of the y-values in the data

• x̄ is the mean of the x-values in the data

• The regression line always passes through the point (x̄, ȳ).

Example: Finding the Equation of a Regression Line

Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data.

Solution:
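Separately from the worked solution, which needs the data table, the slope and intercept formulas for a regression line can be sketched in Python. The function name `regression_line` and the five data pairs are made up for illustration; they are not the GDP and CO2 data.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n   # b = y-bar - m * x-bar
    return m, b

# Hypothetical data pairs, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = regression_line(xs, ys)
print(f"y-hat = {m:.3f}x + {b:.3f}")

# The line passes through (x-bar, y-bar).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print(abs((m * x_bar + b) - y_bar) < 1e-9)
```

Because b is defined as ȳ - m·x̄, the final check holds for any data set, which is why the regression line always passes through the point of means.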


Example: Predicting y-Values Using Regression Equations

The regression equation for the gross domestic products (in trillions of dollars) and carbon dioxide emissions (in millions of metric tons) data is ŷ = 196.152x + 102.289. Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from Section 9.1 that x and y have a significant linear correlation.)

1. 1.2 trillion dollars

Solution:

2. 2.0 trillion dollars

Solution:

3. 2.5 trillion dollars

Solution:
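All three predictions follow the same substitution into ŷ = 196.152x + 102.289, which can be sketched in Python. The function name `predict_co2` is illustrative; the coefficients and the three GDP values are the ones stated in the example.

```python
def predict_co2(gdp_trillions):
    """Predicted CO2 emissions (millions of metric tons) for a GDP in trillions of dollars."""
    return 196.152 * gdp_trillions + 102.289

# The three gross domestic products from the example.
predictions = {x: predict_co2(x) for x in (1.2, 2.0, 2.5)}
for x, y_hat in predictions.items():
    print(f"x = {x}: y-hat = {y_hat:.3f} million metric tons")
```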


Section 9.3 - Measures of Regression and Prediction Intervals

Objectives:

• Interpret the three types of variation about a regression line

• Find and interpret the coefficient of determination

• Find and interpret the standard error of the estimate for a regression line

• Construct and interpret a prediction interval for y

Variation About a Regression Line

• Three types of variation about a regression line

Total variation

Explained variation

Unexplained variation

• To find the total variation, you must first calculate

The total deviation

The explained deviation

The unexplained deviation

Total Deviation = $y_i - \bar{y}$

Explained Deviation = $\hat{y}_i - \bar{y}$

Unexplained Deviation = $y_i - \hat{y}_i$

Total variation

• The sum of the squares of the differences between the y-value of each ordered pair and the mean of y.

Total Variation = $\sum (y_i - \bar{y})^2$

Explained variation

• The sum of the squares of the differences between each predicted y-value and the mean of y.

Explained Variation = $\sum (\hat{y}_i - \bar{y})^2$

Unexplained variation

• The sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value.

Unexplained Variation = $\sum (y_i - \hat{y}_i)^2$

The sum of the explained and unexplained variation is equal to the total variation.

Total variation = Explained variation + Unexplained variation
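The three types of variation can be checked numerically with a short Python sketch. The data pairs are hypothetical, and the line ŷ = 0.6x + 2.2 is the least-squares line for that made-up data; the identity holds for the least-squares line, not for an arbitrary line.

```python
def variations(xs, ys, m, b):
    """Return (total, explained, unexplained) variation about the line y-hat = m*x + b."""
    y_bar = sum(ys) / len(ys)
    y_hat = [m * x + b for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return total, explained, unexplained

# Hypothetical data pairs and their least-squares line, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

total, explained, unexplained = variations(xs, ys, m, b)
print(f"total = {total:.2f}, explained = {explained:.2f}, unexplained = {unexplained:.2f}")

# Total variation = explained variation + unexplained variation (up to rounding).
print(abs(total - (explained + unexplained)) < 1e-9)
```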


Coefficient of determination

• The ratio of the explained variation to the total variation.

• Denoted by r²

$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$

Example: Coefficient of Determination

The correlation coefficient for the gross domestic products and carbon dioxide emissions data as calculated in Section 9.1 is r ≈ 0.882. Find the coefficient of determination. What does this tell you about the explained variation of the data about the regression line? About the unexplained variation?

Solution:

Standard error of estimate

• The standard deviation of the observed y_i-values about the predicted ŷ-value for a given x_i-value.

• Denoted by s_e.

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

where n is the number of ordered pairs in the data set.

• The closer the observed y-values are to the predicted y-values, the smaller the standard error of estimate will be.
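Both measures can be sketched in Python. The observed y-values and the predicted values below are hypothetical (the predictions come from the line ŷ = 0.6x + 2.2 at x = 1, ..., 5), and the function names are illustrative.

```python
import math

def coefficient_of_determination(explained, total):
    """r^2 = explained variation / total variation."""
    return explained / total

def standard_error_of_estimate(ys, y_hat):
    """s_e = sqrt(sum((y_i - y_hat_i)^2) / (n - 2))."""
    n = len(ys)
    if n <= 2:
        raise ValueError("need more than two ordered pairs")
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return math.sqrt(unexplained / (n - 2))

# Hypothetical observed and predicted y-values, for illustration only.
ys = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

y_bar = sum(ys) / len(ys)
total = sum((y - y_bar) ** 2 for y in ys)
explained = sum((yh - y_bar) ** 2 for yh in y_hat)

r_squared = coefficient_of_determination(explained, total)
s_e = standard_error_of_estimate(ys, y_hat)
print(f"r^2 = {r_squared:.3f}, s_e = {s_e:.3f}")
```

Here r² = 0.6 would be read as about 60% of the variation in y being explained by the regression line, with the remaining 40% unexplained.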


Example: Standard Error of Estimate

The regression equation for the gross domestic products and carbon dioxide emissions data as calculated in Section 9.2 is ŷ = 196.152x + 102.289. Find the standard error of estimate.

Solution:


Prediction Intervals

• Two variables have a bivariate normal distribution if, for any fixed value of x, the corresponding values of y are normally distributed and, for any fixed value of y, the corresponding values of x are normally distributed.

Constructing a Prediction Interval for y for a Specific Value of x

• A prediction interval can be constructed for the true value of y.

• Given a linear regression equation ŷ = mx + b and x0, a specific value of x, a c-prediction interval for y is

ŷ - E < y < ŷ + E

where

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}} $$

• The point estimate is ŷ and the margin of error is E. The probability that the prediction interval contains y is c.
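The margin of error E and the resulting interval can be sketched in Python. All inputs below are hypothetical and are not the GDP and CO2 values: they correspond to five made-up data pairs with x = 1, ..., 5 and the line ŷ = 0.6x + 2.2. The critical value t_c = 3.182 is the standard t-table entry for c = 0.95 with d.f. = n - 2 = 3; the function name `prediction_interval` is illustrative.

```python
import math

def prediction_interval(x0, m, b, n, sum_x, sum_x2, s_e, t_c):
    """Return (lower, upper) for the c-prediction interval y-hat - E < y < y-hat + E."""
    x_bar = sum_x / n
    y_hat = m * x0 + b   # point estimate
    margin = t_c * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat - margin, y_hat + margin

# Hypothetical inputs, for illustration only (see lead-in).
lower, upper = prediction_interval(
    x0=6, m=0.6, b=2.2,
    n=5, sum_x=15, sum_x2=55,
    s_e=math.sqrt(0.8),
    t_c=3.182,           # c = 0.95, d.f. = 3, from a standard t-table
)
print(f"{lower:.3f} < y < {upper:.3f}")
```

The interval widens as x0 moves away from x̄, because the term n(x0 - x̄)² grows under the square root.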


Example: Constructing a Prediction Interval

Construct a 95% prediction interval for the carbon dioxide emissions when the gross domestic product is $3.5 trillion. What can you conclude?

Recall, n = 10, ŷ = 196.152x + 102.289, s_e = 138.255

$\sum x = 15.8, \quad \sum x^2 = 32.44, \quad \bar{x} = 1.975$

Solution: