BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Upload: jonas-byrd

Post on 13-Jan-2016


Page 1: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

BIOL 582

Lecture Set 11

Bivariate Data

Correlation

Regression

Page 2: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Thus far, we have considered whether means of a response variable differ among groups. Sometimes it is of interest to know whether a variable covaries with another variable, or whether the value of one variable can predict the value of another.

With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs (xi, yi). The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, xi is the independent (predictor) variable and yi is the dependent (response) variable.

Although it might not be readily apparent, we have been working all along with qualitative (nominal) independent variables (e.g., grouping variables). Now we shift gears and look at continuous, quantitative independent variables.

BIOL 582 Considering Multiple Variables

Page 3: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression


Bivariate Quantitative variables

[Figure: scatter plot of weight (g) vs. length (mm) for the pupfish data]

BIOL 582 Considering Multiple Variables

Page 4: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

[Figure: scatter plot of weight (g) vs. length (mm) for the pupfish data, annotated as follows]

Variables include units.

Points are ordered pairs (xi, yi), e.g. (21.56, 0.32) and (36.77, 1.36).

Independent (predictor) variable: length, on the x-axis.

Dependent (response) variable: weight, on the y-axis.

BIOL 582 Considering Multiple Variables

Page 5: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

[Figure: scatter plot of weight (g) vs. length (mm) for the pupfish data]

Is there a linear relationship for the data?

BIOL 582 Considering Multiple Variables

Page 6: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

[Figure: example scatter plots of y vs. x, with panels titled "Positive linear relationship", "Negative linear relationship", "No relationship", and "Non-linear relationships"]

BIOL 582 Considering Multiple Variables

Page 7: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Correlation

The Linear Correlation Coefficient, or Pearson Product-Moment Correlation Coefficient, is a measure of the strength of the linear relationship between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient.

Sample Correlation Coefficient

$$ r = \frac{\sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$

where $\bar{x}$ is the sample mean of the predictor variable,

$s_x$ is the sample standard deviation of the predictor variable,

$\bar{y}$ is the sample mean of the response variable,

$s_y$ is the sample standard deviation of the response variable,

$n$ is the number of individual units in the sample.
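The definition above can be sketched directly in code. This is a minimal illustration, not part of the lecture: the function name `pearson_r` and the small data set are assumptions made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(z_products) / (n - 1)

# Made-up data (not from the lecture): y is exactly 2x, so r should be
# 1 up to floating-point error.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Standardizing each variable first is what makes r unitless, so lengths in mm and weights in g can be compared on the same scale.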

BIOL 582 Considering Multiple Variables

Page 8: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Correlation


Sample Correlation Coefficient

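Here is a computationally easier way to calculate r:

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} $$

where $SS_{xx} = \sum x_i^2 - (\sum x_i)^2/n$, $SS_{yy} = \sum y_i^2 - (\sum y_i)^2/n$, and $SS_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n$.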

BIOL 582 Considering Multiple Variables

Page 9: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

BIOL 582 Scatter Diagrams; Correlation

Consider the pupfish example

i    xi (Length, mm)   yi (Weight, g)
1    21.56             0.32
2    28.87             0.81
3    28.50             0.63
4    28.96             0.70
5    27.00             0.55
6    32.50             0.92
7    30.39             0.67
8    36.77             1.36
9    29.39             0.61

[Figure: scatter plot of weight (g) vs. length (mm) for the pupfish data]

Add 3 more columns

Page 10: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Consider the pupfish example

[Figure: scatter plot of weight (g) vs. length (mm) for the pupfish data]

i    xi      yi     xi²    yi²    xiyi
1    21.56   0.32
2    28.87   0.81
3    28.50   0.63
4    28.96   0.70
5    27.00   0.55
6    32.50   0.92
7    30.39   0.67
8    36.77   1.36
9    29.39   0.61

BIOL 582 Scatter Diagrams; Correlation

Page 11: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Consider the pupfish example

[Figure: scatter plot of weight (g) vs. length (mm) for the pupfish data]

i     xi       yi     xi²       yi²    xiyi
1     21.56    0.32   464.83    0.10   6.90
2     28.87    0.81   833.48    0.66   23.38
3     28.50    0.63   812.25    0.40   17.96
4     28.96    0.70   838.68    0.49   20.27
5     27.00    0.55   729.00    0.30   14.85
6     32.50    0.92   1056.25   0.85   29.90
7     30.39    0.67   923.55    0.45   20.36
8     36.77    1.36   1352.03   1.85   50.01
9     29.39    0.61   863.77    0.37   17.93
sum   263.94   6.57   7873.85   5.46   201.56

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}}\,\sqrt{SS_{yy}}} = \frac{8.88}{11.54 \times 0.82} = 0.94 $$
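The pupfish calculation can be checked with a short script. This is a sketch using the computational form, in which each SS is a raw sum minus a correction term divided by n; the variable names are illustrative assumptions, and the data are the nine (length, weight) pairs from the table.

```python
import math

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x_i, mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y_i, g
n = len(length)

# Computational formulas: raw sums of squares and cross-products,
# each corrected by (sum)(sum)/n.
ss_xx = sum(x * x for x in length) - sum(length) ** 2 / n
ss_yy = sum(y * y for y in weight) - sum(weight) ** 2 / n
ss_xy = sum(x * y for x, y in zip(length, weight)) - sum(length) * sum(weight) / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(ss_xx, 3), round(ss_yy, 3), round(ss_xy, 3), round(r, 2))
```

Working from unrounded sums matters here: the table rounds each squared term to two decimals, so recomputing from the printed column totals can differ slightly in the third decimal place.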

BIOL 582 Scatter Diagrams; Correlation

Page 12: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

More on correlation coefficients

r      meaning
1.0    Perfectly positively correlated
0.8    Strongly positively correlated
0.6
0.4    Weakly positively correlated
0.2
0      Not correlated
-0.2
-0.4   Weakly negatively correlated
-0.6
-0.8   Strongly negatively correlated
-1.0   Perfectly negatively correlated

[Figure: scatter plots of y vs. x, to be matched with the values below]

Match: r = 0.1, r = 0.3, r = 0.9, r = 0.7

BIOL 582 Scatter Diagrams; Correlation

Page 13: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

More on correlation coefficients

r      meaning
1.0    Perfectly positively correlated
0.8    Strongly positively correlated
0.6
0.4    Weakly positively correlated
0.2
0      Not correlated
-0.2
-0.4   Weakly negatively correlated
-0.6
-0.8   Strongly negatively correlated
-1.0   Perfectly negatively correlated

[Figure: scatter plots of y vs. x, to be matched with the values below]

Match: r = -0.1, r = -0.3, r = -0.9, r = -0.7

BIOL 582 Scatter Diagrams; Correlation

Page 14: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

More on correlation coefficients WARNINGS

Question: Does a correlation coefficient of 0 mean no association or no relationship?

i    xi    yi    xi²   yi²   xiyi
1    -2    4     4     16    -8
2    -1    1     1     1     -1
3    0     0     0     0     0
4    1     1     1     1     1
5    2     4     4     16    8

r = 0

yi = xi²

Thus, r = 0 could mean no association or a non-linear relationship.
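The example above is easy to verify numerically. The following sketch (variable names are assumptions) computes r for the five points on the parabola y = x², where y is completely determined by x and yet the linear correlation vanishes.

```python
import math

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y is an exact, but non-linear, function of x
n = len(xs)

ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

# The positive and negative cross-products cancel, so SS_xy is 0.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(ss_xy, r)
```

The cancellation comes from the symmetry of the parabola about x = 0, which is why a scatter plot should be inspected before r is interpreted.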

BIOL 582 Scatter Diagrams; Correlation

Page 15: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

More on correlation coefficients WARNINGS

Question: How do extreme points affect correlation?

Data set A (r = 0):

i    xi    yi
1    1     1
2    2     2
3    3     3
4    4     4
5    5     0

Data set B (r > 0.99):

i    xi    yi
1    1     1
2    1     2
3    2     1
4    2     2
5    14    14

BIOL 582 Scatter Diagrams; Correlation

Page 16: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

More on correlation coefficients WARNINGS

Question: How do extreme points affect correlation?

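The same two data sets, with the correlation recomputed after dropping the extreme point (i = 5) from each:

Data set A (r = 1 without point 5):

i    xi    yi
1    1     1
2    2     2
3    3     3
4    4     4
5    5     0

Data set B (r = 0 without point 5):

i    xi    yi
1    1     1
2    1     2
3    2     1
4    2     2
5    14    14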
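The effect of a single extreme point can be reproduced with the two small data sets from these slides. This is a sketch; the helper `pearson_r` and the names `a_x`, `a_y`, `b_x`, `b_y` are assumptions introduced for the example.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from sums of squares and cross-products."""
    n = len(xs)
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# First data set: four points on a line plus one point far below it.
a_x, a_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 0]
# Second data set: a structureless 2 x 2 cluster plus one distant point.
b_x, b_y = [1, 1, 2, 2, 14], [1, 2, 1, 2, 14]

print("A, all five points:", pearson_r(a_x, a_y))
print("A, without point 5:", pearson_r(a_x[:4], a_y[:4]))
print("B, all five points:", pearson_r(b_x, b_y))
print("B, without point 5:", pearson_r(b_x[:4], b_y[:4]))
```

In the first data set one point hides a perfect linear relationship; in the second, one point manufactures a strong correlation from a cluster that has none.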

BIOL 582 Scatter Diagrams; Correlation

Page 17: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

More on correlation coefficients WARNINGS

Question: Does correlation mean causation?

Pupfish data (MR = metabolic rate, mgO2/hr)

i    xi (Length)   yi (Weight)   zi (MR)
1    21.56         0.32          0.18
2    28.87         0.81          0.44
3    28.50         0.63          0.54
4    28.96         0.70          0.53
5    27.00         0.55          0.46
6    32.50         0.92          0.53
7    30.39         0.67          0.43
8    36.77         1.36          1.20
9    29.39         0.61          0.32

Length and weight: r = 0.94. Weight and MR: r = 0.92.

But, the correlation between length and MR is also strong:

r = 0.84

Neither length nor weight "causes" an increase in MR. MR happens to be positively associated with weight for biological reasons, and weight also happens to be positively associated with length. Thus, length and MR appear to be related even though they are not directly related.

Remember, causation can only be inferred from an experimental approach.
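The three pairwise correlations quoted above can be recomputed from the table. A minimal sketch, with `pearson_r` as an assumed helper name:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from sums of squares and cross-products."""
    n = len(xs)
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # g
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]               # mgO2/hr

r_length_weight = pearson_r(length, weight)
r_weight_mr = pearson_r(weight, mr)
r_length_mr = pearson_r(length, mr)
print(round(r_length_weight, 2), round(r_weight_mr, 2), round(r_length_mr, 2))
```

All three pairs are strongly correlated, which is the point of the warning: the numbers alone cannot say which associations are direct and which arise through a shared third variable.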

BIOL 582 Scatter Diagrams; Correlation

Page 18: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

We have considered whether or not there is a linear relationship between two variables. Now let's consider how to describe the relationship.

i    xi (Length)   yi (Weight)   zi (MR)
1    21.56         0.32          0.18
2    28.87         0.81          0.44
3    28.50         0.63          0.54
4    28.96         0.70          0.53
5    27.00         0.55          0.46
6    32.50         0.92          0.53
7    30.39         0.67          0.43
8    36.77         1.36          1.20
9    29.39         0.61          0.32

Length and weight: r = 0.94. Weight and MR: r = 0.92.

[Figure: MR (mgO2/hr) vs. weight (g) in pupfish, with the fitted line y = 0.90x - 0.14]

This is a line of “best fit” for the linear relationship. It is usually found by Least-Squares Regression.

This is the equation of the line.

BIOL 582 Least-Squares Regression

Page 19: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

We have considered whether or not there is a linear relationship between two variables. Now let's consider how to describe the relationship.

Least-Squares Regression Criterion

The least-squares regression line is the one that minimizes the sum of squared errors. It is the line that minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line, $\hat{y}$ ("y-hat"). We represent this as:

$$ \text{Minimize } \sum \text{residuals}^2 = \sum (y_i - \hat{y}_i)^2 $$

[Figure: MR (mgO2/hr) vs. weight (g) in pupfish, with the fitted line y = 0.90x - 0.14]

BIOL 582 Least-Squares Regression

Page 20: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

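[Figure: a data point and the regression line, showing the observed value, the predicted value, and the vertical distance between them]

Observed: $y_i$

Predicted: $\hat{y}_i$

Residual $= y_i - \hat{y}_i$

Note: Some residuals are positive and some are negative, so they tend to cancel when summed. Therefore, we minimize Σ residuals² instead. Squaring will (1) make every term non-negative, so that we minimize a sum of non-negative values, and (2) be analogous to calculating variance.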

BIOL 582 Least-Squares Regression

Page 21: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

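[Figure: the same data with a different candidate line, showing the observed value $y_i$, the predicted value $\hat{y}_i$, and the residual $y_i - \hat{y}_i$]

Why is this not a better line?

Although not readily apparent, Σ residuals² for this line > Σ residuals² for the least-squares line.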
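The comparison of candidate lines can be made concrete by summing squared residuals. The sketch below uses the pupfish MR vs. weight data and the fitted line reported on the earlier slide (y = 0.90x - 0.14); the second line is a hypothetical alternative chosen only for illustration, and the function name `sse` is an assumption.

```python
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]  # x, g
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]      # y, mgO2/hr

def sse(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(weight, mr))

sse_fitted = sse(-0.14, 0.90)  # line reported on the slide (rounded coefficients)
sse_other = sse(-0.20, 1.00)   # hypothetical alternative line, for illustration only
print(round(sse_fitted, 3), round(sse_other, 3), sse_fitted < sse_other)
```

Any line other than the least-squares line gives a larger sum, although for lines that look similar by eye the difference can be small.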

BIOL 582 Least-Squares Regression

Page 22: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

So how do we find the “best fit” line to describe our linear relationship?

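[Figure: a line through the points (x1, y1) and (x2, y2), with the rise Δy and the run Δx marked]

$$ \text{slope} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} $$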

BIOL 582 Least-Squares Regression

Page 23: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

So how do we find the “best fit” line to describe our linear relationship?

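[Figure: a line through the points (x1, y1) and (x2, y2), with the rise Δy and the run Δx marked, and the y-intercept marked where the line crosses the y-axis]

$$ \text{slope} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} $$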

BIOL 582 Least-Squares Regression

Page 24: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

So how do we find the “best fit” line to describe our linear relationship?

Any line can be described as y = b0 + b1x, where b0 is the y-intercept and b1 is the slope of the line.

In Least-Squares Regression, we define the linear relationship as:

$$ \hat{y} = b_0 + b_1 x $$

What this equation means is that for any value of x, we can predict a value of y (called y-hat), if we know the y-intercept, b0, and the slope, b1. We can find the slope and intercept (in succession) with the following formulae:

$$ b_1 = r\,\frac{s_y}{s_x} = \frac{SS_{xy}}{SS_{xx}} $$

$$ b_0 = \bar{y} - b_1 \bar{x} $$

The resulting equation minimizes the sum of squared residuals!
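The two slope formulae and the intercept formula can be sketched in code. This example applies them to the pupfish length and weight data used throughout the lecture; all variable names are assumptions made for the illustration.

```python
import math

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x, mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y, g
n = len(length)

x_bar = sum(length) / n
y_bar = sum(weight) / n
ss_xx = sum((x - x_bar) ** 2 for x in length)
ss_yy = sum((y - y_bar) ** 2 for y in weight)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(length, weight))

# Route 1: slope from the correlation and the two standard deviations.
s_x = math.sqrt(ss_xx / (n - 1))
s_y = math.sqrt(ss_yy / (n - 1))
r = ss_xy / math.sqrt(ss_xx * ss_yy)
b1_from_r = r * s_y / s_x

# Route 2: slope directly from the sums of squares.
b1 = ss_xy / ss_xx

# The intercept follows from the slope, which is why the two are found in succession.
b0 = y_bar - b1 * x_bar
print(round(b1_from_r, 4), round(b1, 4), round(b0, 3))
```

The two routes are algebraically identical because the (n - 1) terms cancel, so any difference between them in hand calculation is rounding error.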

BIOL 582 Least-Squares Regression

Page 25: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

So how do we find the “best fit” line to describe our linear relationship?

Let’s consider the pupfish example:

i    xi (Length)   yi (Weight)
1    21.56         0.32
2    28.87         0.81
3    28.50         0.63
4    28.96         0.70
5    27.00         0.55
6    32.50         0.92
7    30.39         0.67
8    36.77         1.36
9    29.39         0.61

We need to calculate:

$\bar{x}$, $\bar{y}$, $s_x$, $s_y$, $r$

-or-

$\bar{x}$, $\bar{y}$, $SS_{xx}$, $SS_{yy}$, $SS_{xy}$

BIOL 582 Least-Squares Regression

Page 26: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

So how do we find the “best fit” line to describe our linear relationship?

Let’s consider the pupfish example:

i    xi (Length)   xi²       yi (Weight)   yi²    xiyi
1    21.56         464.83    0.32          0.10   6.90
2    28.87         833.48    0.81          0.66   23.38
3    28.50         812.25    0.63          0.40   17.96
4    28.96         838.68    0.70          0.49   20.27
5    27.00         729.00    0.55          0.30   14.85
6    32.50         1056.25   0.92          0.85   29.90
7    30.39         923.55    0.67          0.45   20.36
8    36.77         1352.03   1.36          1.85   50.01
9    29.39         863.77    0.61          0.37   17.93
Σ    263.94        7873.85   6.57          5.46   201.56

Here is something to think about…

$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N}, \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$

$$ \sigma^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / N}{N}, \qquad s^2 = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1} $$

The numerator is the "Sum of Squares".

BIOL 582 Least-Squares Regression

Page 27: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

So how do we find the “best fit” line to describe our linear relationship?

Let’s consider the pupfish example:

Thus, it should be straightforward that

$$ s_x = \sqrt{\frac{SS_{xx}}{n - 1}}, \qquad s_y = \sqrt{\frac{SS_{yy}}{n - 1}}, \qquad s_{xy} = \frac{SS_{xy}}{n - 1} $$

And each is easy to calculate with our data

$$ SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} $$

$$ SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} $$

$$ SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} $$

i    xi (Length)   xi²       yi (Weight)   yi²    xiyi
1    21.56         464.83    0.32          0.10   6.90
2    28.87         833.48    0.81          0.66   23.38
3    28.50         812.25    0.63          0.40   17.96
4    28.96         838.68    0.70          0.49   20.27
5    27.00         729.00    0.55          0.30   14.85
6    32.50         1056.25   0.92          0.85   29.90
7    30.39         923.55    0.67          0.45   20.36
8    36.77         1352.03   1.36          1.85   50.01
9    29.39         863.77    0.61          0.37   17.93
Σ    263.94        7873.85   6.57          5.46   201.56

BIOL 582 Least-Squares Regression

Page 28: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

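i    xi (Length)   xi²       yi (Weight)   yi²    xiyi
1    21.56         464.83    0.32          0.10   6.90
2    28.87         833.48    0.81          0.66   23.38
3    28.50         812.25    0.63          0.40   17.96
4    28.96         838.68    0.70          0.49   20.27
5    27.00         729.00    0.55          0.30   14.85
6    32.50         1056.25   0.92          0.85   29.90
7    30.39         923.55    0.67          0.45   20.36
8    36.77         1352.03   1.36          1.85   50.01
9    29.39         863.77    0.61          0.37   17.93
Σ    263.94        7873.85   6.57          5.46   201.56

$$
\begin{aligned}
\bar{x} &= 263.94 / 9 = 29.327 \\
\bar{y} &= 6.57 / 9 = 0.73 \\
SS_{xx} &= 7873.85 - 263.94^2 / 9 = 133.369 \\
SS_{yy} &= 5.46 - 6.57^2 / 9 = 0.669 \\
SS_{xy} &= 201.56 - (263.94)(6.57) / 9 = 8.881 \\
s_x &= \sqrt{133.369 / (9 - 1)} = 4.083 \\
s_y &= \sqrt{0.669 / (9 - 1)} = 0.289 \\
s_{xy} &= 8.881 / (9 - 1) = 1.110 \\
r &= \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{8.881}{\sqrt{133.369 \times 0.669}} = 0.940 \\
b_1 &= r\,\frac{s_y}{s_x} = 0.94 \times 0.289 / 4.083 = 0.066 \\
\text{or}\quad b_1 &= \frac{SS_{xy}}{SS_{xx}} = 8.881 / 133.369 = 0.066 \\
b_0 &= \bar{y} - b_1 \bar{x} = 0.730 - 0.066 \times 29.327 = -1.223 \\
\hat{y} &= b_0 + b_1 x = -1.223 + 0.066(x)
\end{aligned}
$$

Intermediate values are carried at full precision: the sums of squares come from the unrounded column totals, and the intercept uses the unrounded slope (about 0.0666, shown here as 0.066).

Thus, it should be straightforward that

$$ s_x = \sqrt{\frac{SS_{xx}}{n - 1}}, \qquad s_y = \sqrt{\frac{SS_{yy}}{n - 1}}, \qquad s_{xy} = \frac{SS_{xy}}{n - 1} $$

And each is easy to calculate with our data

$$ SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \qquad SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} $$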

BIOL 582 Least-Squares Regression

Page 29: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

Review: The steps of Least-Squares Regression:

1. Plot bivariate data

2. Calculate means for xi and yi.

3. Calculate SS, standard deviations (or both), and correlation coefficient.

4. Calculate slope.

5. Calculate y-intercept.

6. Describe linear equation

7. Calculate the Coefficient of Determination.

BIOL 582 Least-Squares Regression

Page 30: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

The Coefficient of Determination, R², measures the proportion (often reported as a percentage) of total variation in the response variable that is explained by the least-squares regression line.

Recall the least-squares regression criterion: the least-squares regression line minimizes the sum of squared errors (Σ residuals²).

R² is a value between 0 and 1, and for simple linear regression it is the same as r². (It is not the same as r² for multiple or non-linear regression.)

An R² of 0 means that none of the total variation is explained by the regression line (plot A) and an R² of 1 means all of the variation is explained by the regression line (plot B). A value in between describes the proportion of explained variation.

[Figure: plot A, R² = 0; plot B, R² = 1]

BIOL 582 The Coefficient of Determination

Page 31: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

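The Coefficient of Determination, R², measures the proportion (often reported as a percentage) of total variation in the response variable that is explained by the least-squares regression line.

So what is meant by "explained" and "unexplained" variation?

Consider this example:

[Figure: scatter plot of y vs. x for the points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10), showing the observed values $y_i$, the predicted values $\hat{y}_i$ on the fitted line, and a horizontal line at $\bar{y}$]

$$ \hat{y} = 2.36x - 1.08, \qquad R^2 = 0.9148 $$

$$ \bar{y} = \frac{\sum y_i}{n} = \frac{2 + 2.2 + 6 + 9.8 + 10}{5} = 6 $$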

BIOL 582 The Coefficient of Determination

Page 32: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

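[Figure: the same scatter plot of the points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10) with the fitted line $\hat{y} = 2.36x - 1.08$ ($R^2 = 0.9148$) and the horizontal line at $\bar{y}$; three vertical distances are marked for one point: $y_i - \bar{y}$, $\hat{y}_i - \bar{y}$, and $y_i - \hat{y}_i$]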

BIOL 582 The Coefficient of Determination

Page 33: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

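$$ \underbrace{(y_i - \bar{y})}_{\text{Total deviation}} = \underbrace{(y_i - \hat{y}_i)}_{\text{Residual (unexplained deviation)}} + \underbrace{(\hat{y}_i - \bar{y})}_{\text{Explained deviation}} $$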

Analogously, but algebraically too difficult to worry about,

Total Variation = Unexplained variation + Explained variation

SS(Total) = SS(error) + SS(R)

Where R stands for “regression” (Note: sometimes M is used for “model”)
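Although the algebra is skipped, the partition of variation is easy to confirm numerically. The sketch below uses the five-point example from the preceding slides; the variable names are assumptions, and the line is refit from the data rather than taken from the rounded equation.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 2.2, 6, 9.8, 10]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = ss_xy / ss_xx        # slope
b0 = y_bar - b1 * x_bar   # intercept
y_hat = [b0 + b1 * x for x in xs]

ss_total = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ss_error = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained variation
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)       # explained variation

r_squared = 1 - ss_error / ss_total
print(round(b1, 2), round(b0, 2), round(r_squared, 4))
print(round(ss_total, 3), round(ss_error + ss_regression, 3))
```

The partition holds exactly only for the least-squares line; for any other line the cross-product term does not vanish and the two pieces no longer sum to the total.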

BIOL 582 The Coefficient of Determination

Page 34: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

The pupfish data…

$$ \hat{y}_i = -1.223 + 0.066\,x_i $$

i    xi (Length)   xi²       yi (Weight)   yi²    xiyi     ŷi     (yi - ŷi)   (yi - ŷi)²   (yi - ȳ)   (yi - ȳ)²
1    21.56         464.83    0.32          0.10   6.90     0.20   0.12        0.01         -0.41      0.17
2    28.87         833.48    0.81          0.66   23.38    0.68   0.13        0.02         0.08       0.01
3    28.50         812.25    0.63          0.40   17.96    0.66   -0.03       0.00         -0.10      0.01
4    28.96         838.68    0.70          0.49   20.27    0.69   0.01        0.00         -0.03      0.00
5    27.00         729.00    0.55          0.30   14.85    0.56   -0.01       0.00         -0.18      0.03
6    32.50         1056.25   0.92          0.85   29.90    0.92   0.00        0.00         0.19       0.04
7    30.39         923.55    0.67          0.45   20.36    0.78   -0.11       0.01         -0.06      0.00
8    36.77         1352.03   1.36          1.85   50.01    1.20   0.16        0.02         0.63       0.40
9    29.39         863.77    0.61          0.37   17.93    0.72   -0.11       0.01         -0.12      0.01
Σ    263.94        7873.85   6.57          5.46   201.56          0.16        0.08 (SSE)   0.00       0.67 (SST)

R² = 1 - SSE/SST

= 1 - 0.08/0.67 = 0.88

Note: This is the same as

r² = 0.94² = 0.88

BIOL 582 The Coefficient of Determination

Page 35: BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

BIOL 582 Final Comments

• One can only square the correlation coefficient to get the coefficient of determination in the case of simple linear regression.

• If one does multiple regression, or ANCOVA (a combination of regression and factorial ANOVA), then the full or partial coefficient of determination is for the SS of all effects or one of the effects, respectively, with respect to the total SS. Values will not be the same as squaring correlation coefficients.

• ANOVA on regression models is pretty much the same as before. For simple linear regression, randomization can be used. Simply randomize values of y and hold x constant. This will be demonstrated next time.
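The randomization idea in the last bullet can be sketched as follows. This is only one plausible form of the procedure, not the demonstration promised for the next lecture: the choice of r as the test statistic, the two-sided comparison, the 9999 permutations, and the seed are all assumptions made for the illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation from sums of squares and cross-products."""
    n = len(xs)
    ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x, held constant
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y, randomized

observed = pearson_r(length, weight)

rng = random.Random(582)  # arbitrary seed so the sketch is repeatable
n_perm = 9999             # assumed number of random permutations
shuffled = weight[:]
count = 0
for _ in range(n_perm):
    rng.shuffle(shuffled)  # break any association between x and y
    if abs(pearson_r(length, shuffled)) >= abs(observed):
        count += 1

# The observed arrangement counts as one of the possible outcomes.
p_value = (count + 1) / (n_perm + 1)
print(round(observed, 2), p_value)
```

Shuffling y while holding x fixed generates the distribution of the statistic expected if the two variables were unrelated; the p-value is the fraction of arrangements at least as extreme as the one observed.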