Post on 13-Jan-2016

Basic Concepts of Correlation

Definition

A correlation exists between two variables when the values of one are associated in some way with the values of the other.

Definition

The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample.

Exploring the Data

We can often see a relationship between two variables by constructing a scatterplot.

Figure 10-2 following shows scatterplots with different characteristics.

Scatterplots of Paired Data

Figure 10-2

Requirements

1. The sample of paired (x, y) data is a simple random sample of quantitative data.

2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern.

3. The outliers must be removed ONLY if they are known to be errors. The effects of any other outliers should be considered by calculating r with and without the outliers included.

Notation for the Linear Correlation Coefficient

n = number of pairs of sample data

Σ denotes the addition of the items indicated.

Σx denotes the sum of all x-values.

Σx² indicates that each x-value should be squared and then those squares added.

(Σx)² indicates that the x-values should be added and then the total squared.

Σxy indicates that each x-value should first be multiplied by its corresponding y-value. After obtaining all such products, find their sum.

r = linear correlation coefficient for sample data.

ρ = linear correlation coefficient for population data.

Formula 10-1

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$

The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample.
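As a minimal sketch of Formula 10-1, the sums in the notation can be computed directly in Python. The function name `linear_correlation` and the five data pairs are hypothetical, chosen only for illustration.

```python
import math

def linear_correlation(xs, ys):
    """Compute r from paired sample data using Formula 10-1."""
    if len(xs) != len(ys):
        raise ValueError("x and y must contain the same number of values")
    n = len(xs)                                   # number of pairs
    sum_x = sum(xs)                               # sum of x
    sum_y = sum(ys)                               # sum of y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of xy products
    sum_x2 = sum(x * x for x in xs)               # sum of squared x
    sum_y2 = sum(y * y for y in ys)               # sum of squared y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = linear_correlation(xs, ys)
print(round(r, 3))  # 0.775
```

Note the difference between `sum_x2` (square first, then add) and `sum_x ** 2` (add first, then square); mixing them up is the most common slip when applying the formula by hand.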

Computer software or calculators can compute r.

Interpreting r

Using Table A-6: If the absolute value of the computed value of r, denoted |r|, exceeds the value in Table A-6, conclude that there is a linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation.

Using Software: If the computed P-value is less than or equal to the significance level, conclude that there is a linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation.

Figure: Correlation of incidence of death due to breast cancer with animal fat intake (Ken Carroll, 1975)

Figure: Correlation of incidence of death due to breast cancer with vegetable fat intake (Ken Carroll, 1975)

Figure: From Francis Anscombe, 1973. Each plot has the same mean of y (7.5), variance of y (4.12), r (0.816), and linear regression line (y = 3 + 0.5x).
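The summary statistics quoted for Anscombe's plots can be checked numerically. The sketch below uses the first of Anscombe's four published data sets; the values are typed in from the 1973 quartet rather than taken from these slides, and the variable names are illustrative only.

```python
import math

# Anscombe's data set I (x shared with sets II and III)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-products about the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

var_y = syy / (n - 1)              # sample variance of y
sd_y = math.sqrt(var_y)            # sample standard deviation of y
r = sxy / math.sqrt(sxx * syy)     # linear correlation coefficient
slope = sxy / sxx                  # least-squares slope
intercept = mean_y - slope * mean_x

print(round(mean_y, 2), round(var_y, 2), round(sd_y, 2))
print(round(r, 3), round(intercept, 2), round(slope, 2))
```

For this data set the value near 4.1 is the sample variance of y; the sample standard deviation is close to 2.03. The other three sets give nearly identical numbers even though their scatterplots look completely different, which is the point of the figure.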

df     p = 0.10   p = 0.05   p = 0.02   p = 0.01   r2
…
8      0.549      0.632      0.716      0.765      0.585
9      0.521      0.602      0.685      0.735      0.540
10     0.497      0.576      0.658      0.708      0.501
11     0.476      0.553      0.634      0.684      0.467
12     0.458      0.532      0.612      0.661      0.437
13     0.441      0.514      0.592      0.641      0.410
14     0.426      0.497      0.574      0.623      0.388
15     0.412      0.482      0.558      0.606      0.367
16     0.400      0.468      0.542      0.590      0.348
17     0.389      0.456      0.528      0.575      0.331
18     0.378      0.444      0.516      0.561      0.315
19     0.369      0.433      0.503      0.549      0.288
20     0.360      0.423      0.492      0.537      0.651
30     0.296      0.349      0.409      0.449      0.202
40     0.257      0.304      0.358      0.393      0.154
50     0.231      0.273      0.322      0.354      0.125
60     0.211      0.250      0.295      0.325      0.106
80     0.183      0.217      0.256      0.283      0.080
100    0.164      0.195      0.230      0.254      0.065

From the previous slide: note that with df of 9, r = 0.816 is significant at p < 0.01.
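The table lookup described above can be sketched as a small dictionary check. Only a few rows of the critical-value table are copied here, and the names `CRITICAL_R` and `is_significant` are hypothetical. The sketch assumes df = n - 2 for paired data, so Anscombe's 11 pairs give df = 9.

```python
# Critical values of |r| keyed by degrees of freedom, then by significance level.
# Rows copied from the critical-value table; extend as needed.
CRITICAL_R = {
    8:  {0.10: 0.549, 0.05: 0.632, 0.02: 0.716, 0.01: 0.765},
    9:  {0.10: 0.521, 0.05: 0.602, 0.02: 0.685, 0.01: 0.735},
    10: {0.10: 0.497, 0.05: 0.576, 0.02: 0.658, 0.01: 0.708},
    20: {0.10: 0.360, 0.05: 0.423, 0.02: 0.492, 0.01: 0.537},
    30: {0.10: 0.296, 0.05: 0.349, 0.02: 0.409, 0.01: 0.449},
}

def is_significant(r, df, alpha=0.05):
    """Return True when |r| exceeds the tabled critical value."""
    critical = CRITICAL_R[df][alpha]   # KeyError if the row is not copied in
    return abs(r) > critical

n = 11                                  # pairs in each Anscombe data set
print(is_significant(0.816, df=n - 2, alpha=0.01))  # True
```

A `True` result only says the data are inconsistent with zero linear correlation; it says nothing about whether a straight line is a sensible description of the scatterplot.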

A meaningless correlation in anyone’s book!!!

Interpreting the Linear Correlation Coefficient r

Critical Values from Table A-6 and the Computed Value of r

Interpreting r: Explained Variation

The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.
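One way to see why r² is read as explained variation is to fit the least-squares line and compare the variation of the predicted values with the total variation in y. This is a sketch with hypothetical data and a hypothetical function name, not a method taken from the slides.

```python
def explained_variation(xs, ys):
    """Return (explained / total variation, r squared) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)           # total variation in y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b1 = sxy / sxx                                      # least-squares slope
    b0 = mean_y - b1 * mean_x                           # least-squares intercept
    predicted = [b0 + b1 * x for x in xs]

    # Variation of the predicted values about the mean of y
    explained = sum((p - mean_y) ** 2 for p in predicted)
    r_squared = sxy ** 2 / (sxx * syy)
    return explained / syy, r_squared

# Hypothetical paired data
ratio, r_squared = explained_variation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(ratio, 3), round(r_squared, 3))  # 0.6 0.6
```

Here 60% of the variation in y is accounted for by the line, and the remaining 40% is left in the residuals.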

Common Errors Involving Correlation

1. Causation: It is wrong to conclude that correlation implies causality.

2. Averages: Averages suppress individual variation and may inflate the correlation coefficient.

3. Linearity: There may be some relationship between x and y even when there is no linear correlation.
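The linearity error is easy to demonstrate. In the sketch below, y is completely determined by x (y = x²), yet r comes out as zero because the relationship is not linear. The data and the helper name `pearson_r` are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# A perfect, but nonlinear, relationship: y = x squared on symmetric x-values
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # 0.0
```

A scatterplot of these points shows a clear parabola, which is why the requirements call for looking at the plot before trusting r.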

Part 1: Basic Concepts of Regression

Regression

The typical equation of a straight line, y = mx + b, is expressed in the form ŷ = b0 + b1x, where b0 is the y-intercept and b1 is the slope.

The regression equation expresses a relationship between x (called the explanatory variable, predictor variable, or independent variable) and ŷ (called the response variable or dependent variable).

Definitions

Regression Equation

Given a collection of paired data, the regression equation ŷ = b0 + b1x algebraically describes the relationship between the two variables.

Regression Line

The graph of the regression equation is called the regression line (or line of best fit, or least squares line).

Notation for Regression Equation

                                      Population Parameter    Sample Statistic
y-intercept of regression equation    β0                      b0
Slope of regression equation          β1                      b1
Equation of the regression line       y = β0 + β1x            ŷ = b0 + b1x

Requirements

1. The sample of paired (x, y) data is a random sample of quantitative data.

2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern.

3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.

Formulas for b0 and b1

Formula 10-3 (slope):

$$b_1 = r\,\frac{s_y}{s_x}$$

Formula 10-4 (y-intercept):

$$b_0 = \bar{y} - b_1\bar{x}$$

Calculators or computers can compute these values.
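A short sketch of the slope and y-intercept formulas, using the sample standard deviations from Python's `statistics` module. The function name and the data are hypothetical.

```python
import statistics

def regression_coefficients(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_x = statistics.stdev(xs)          # sample standard deviation of x
    s_y = statistics.stdev(ys)          # sample standard deviation of y

    # r written as the sample covariance divided by s_x * s_y
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = cross / ((n - 1) * s_x * s_y)

    b1 = r * s_y / s_x                  # slope
    b0 = y_bar - b1 * x_bar             # y-intercept
    return b0, b1

# Hypothetical paired data
b0, b1 = regression_coefficients([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 3), round(b1, 3))  # 2.2 0.6
```

The y-intercept formula also implies that the fitted line always passes through the point of means (x̄, ȳ).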

Special Property

The regression line fits the sample points best.

Rounding the y-intercept b0 and the Slope b1

Round to three significant digits.

If you use Formulas 10-3 and 10-4, do not round intermediate values.
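Rounding to three significant digits is different from rounding to three decimal places. A minimal sketch, with `round_sig` as a hypothetical helper name, applied only to the final values of b0 and b1:

```python
def round_sig(value, digits=3):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    # The 'g' format keeps `digits` significant digits
    return float(f"{value:.{digits}g}")

print(round_sig(27.8179))     # 27.8
print(round_sig(0.0123456))   # 0.0123
print(round_sig(1234.5))      # 1230.0
```

Keep full precision in r, the standard deviations, and the means while computing, and apply the rounding only once at the end.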

Figure: From Francis Anscombe, 1973. Each plot has the same mean of y (7.5), variance of y (4.12), r (0.816), and linear regression line (y = 3 + 0.5x).

The regression equation is only an estimate of the true regression equation. It is based on one particular set of sample data, and another sample drawn from the same population would probably lead to a slightly different equation.
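Sampling variability can be illustrated with a small simulation. The population below (y = 3 + 0.5x plus normal noise), the sample size, and the seed are all assumptions made for this sketch; they are not from the slides.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def draw_sample(rng, n=30):
    """Draw n pairs from a hypothetical population y = 3 + 0.5x + noise."""
    xs = [rng.uniform(0, 20) for _ in range(n)]
    ys = [3 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(42)              # fixed seed so the run is repeatable
fit_a = fit_line(*draw_sample(rng))  # first sample
fit_b = fit_line(*draw_sample(rng))  # second sample, same population

print("sample A: b0 = %.3f, b1 = %.3f" % fit_a)
print("sample B: b0 = %.3f, b1 = %.3f" % fit_b)
```

Both fits land near the population line, but neither reproduces it exactly, and they do not agree with each other.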

Using the Regression Equation for Predictions

1. Use the regression equation for predictions only if the graph of the regression line on the scatterplot confirms that the regression line fits the points reasonably well (?).

2. Use the regression equation for predictions only if the linear correlation coefficient r indicates that there is a linear correlation between the two variables (?).

Think… r2

       % Fat based on actual BD    % Fat based on predicted BD
A      25.96%                      13.92%
B      7.91%                       13.92%

Be VERY skeptical of individual “results” based on linear regression equations being used to make multiple serial predictions

Summary

- correlation is used to calculate the strength of a linear relationship between paired variables from a sample

- an r value above or below 0 indicates a non-zero association between the 2 variables

- a scatter plot is used to visualize the relationship between the 2 variables

- r CANNOT indicate a causal relationship

- interpret correlation by using r², the proportion of variation in variable A accounted for by the variation in variable B; a "significant" r only means the association is NOT zero

- regression equation describes the association between the 2 variables

- regression line is a graph of the regression equation describing the best fit line of the scatter plot

- only with confidence limits can you properly interpret a linear regression line
