biol 582 lecture set 11 bivariate data correlation regression

BIOL 582

Lecture Set 11

Bivariate Data

Correlation

Regression

Thus far, we have considered whether means of a response variable differ among groups. Sometimes it is of interest to know whether a variable covaries with another variable, or whether the value of one variable can predict the value of another.

With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs (x_i, y_i). The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, x_i is the independent (predictor) variable and y_i is the dependent (response) variable.

Although it might not be readily apparent, we have been working all along with qualitative (nominal) independent variables (e.g., grouping variables). Now we are going to shift gears and look at continuous quantitative independent variables.

BIOL 582 Considering Multiple Variables


Bivariate Quantitative variables

Scatter Plot: Weight vs. Length for pupfish data

[Figure: scatter plot of Weight (g), axis 0 to 1.6, against Length (mm), axis 0 to 40]


Weight vs. Length for pupfish data

[Figure: annotated scatter plot of Weight (g) against Length (mm)]

Variables include units.

Points are ordered pairs (x_i, y_i), for example (21.56, 0.32) and (36.77, 1.36).

The x-axis shows the independent (predictor) variable, Length (mm).

The y-axis shows the dependent (response) variable, Weight (g).



[Figure: scatter plot of Weight (g) against Length (mm) for the pupfish data]

Is there a linear relationship for the data?


[Figure: five scatter plots of y against x illustrating a positive linear relationship, a negative linear relationship, no relationship, and non-linear relationships]


Correlation

The Linear Correlation Coefficient, or Pearson Product-Moment Correlation Coefficient, is a measure of the strength of the linear relationship between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient.

Sample Correlation Coefficient

$$ r = \frac{\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$

where $\bar{x}$ is the sample mean of the predictor variable,

$s_x$ is the sample standard deviation of the predictor variable,

$\bar{y}$ is the sample mean of the response variable,

$s_y$ is the sample standard deviation of the response variable,

$n$ is the number of individual units in the sample.
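As a minimal sketch of the definition above, the following Python computes r as the sum of products of standardized values divided by n - 1, using the pupfish lengths and weights that appear later in this lecture. The function names (`mean`, `sample_sd`, `pearson_r`) are illustrative choices, not part of the course materials.

```python
from math import sqrt

# Pupfish data from the lecture: length (mm) is x, weight (g) is y.
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)


r = pearson_r(length, weight)
print(round(r, 2))
```

Rounded to two decimals, this should agree with the r = 0.94 reported for the pupfish example.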


Here is a computationally easier way to calculate r:

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} $$

where

$$ SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \qquad SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} $$

BIOL 582 Scatter Diagrams; Correlation

Consider the pupfish example

i    x_i = Length (mm)    y_i = Weight (g)
1    21.56                0.32
2    28.87                0.81
3    28.50                0.63
4    28.96                0.70
5    27.00                0.55
6    32.50                0.92
7    30.39                0.67
8    36.77                1.36
9    29.39                0.61

[Figure: Weight (g) vs. Length (mm) for pupfish data]

Add 3 more columns: x_i², y_i², and x_i·y_i.


i      x_i      y_i     x_i²       y_i²    x_i·y_i
1      21.56    0.32    464.83     0.10    6.90
2      28.87    0.81    833.48     0.66    23.38
3      28.50    0.63    812.25     0.40    17.96
4      28.96    0.70    838.68     0.49    20.27
5      27.00    0.55    729.00     0.30    14.85
6      32.50    0.92    1056.25    0.85    29.90
7      30.39    0.67    923.55     0.45    20.36
8      36.77    1.36    1352.03    1.85    50.01
9      29.39    0.61    863.77     0.37    17.93
sum    263.94   6.57    7873.85    5.46    201.56

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}} \, \sqrt{SS_{yy}}} = \frac{8.88}{11.55 \times 0.82} = 0.94 $$
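The column sums in the table are all that the computational formula needs. A short Python sketch of that calculation follows; the variable names are illustrative, and the sums are computed from the unrounded products, so they can differ from the table in the last printed digit.

```python
from math import sqrt

# Pupfish data: length (mm) is x, weight (g) is y.
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
n = len(length)

# Column sums, as in the table.
sum_x = sum(length)
sum_y = sum(weight)
sum_x2 = sum(x * x for x in length)
sum_y2 = sum(y * y for y in weight)
sum_xy = sum(x * y for x, y in zip(length, weight))

# Sums of squares and cross-products from the column sums.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

r = ss_xy / sqrt(ss_xx * ss_yy)
print(f"SSxx={ss_xx:.3f}  SSyy={ss_yy:.3f}  SSxy={ss_xy:.3f}  r={r:.2f}")
```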


More on correlation coefficients

r meaning

1.0 Perfectly positively correlated

0.8 Strongly positively correlated

0.6

0.4 Weakly positively correlated

0.2

0 Not Correlated

-0.2

-0.4 Weakly negatively correlated

-0.6

-0.8 Strongly negatively correlated

-1.0 Perfectly negatively correlated

[Figure: four scatter plots of y against x]

Match each plot to one of: r = 0.1, r = 0.3, r = 0.7, r = 0.9


[Figure: four more scatter plots of y against x]

Match each plot to one of: r = -0.1, r = -0.3, r = -0.7, r = -0.9


More on correlation coefficients WARNINGS

Question: Does a correlation coefficient of 0 mean no association or no relationship?

i    x_i    y_i    x_i²    y_i²    x_i·y_i
1    -2     4      4       16      -8
2    -1     1      1       1       -1
3    0      0      0       0       0
4    1      1      1       1       1
5    2      4      4       16      8

Here y_i = x_i², yet r = 0.

Thus, r = 0 could mean no association or a non-linear relationship.
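The warning can be checked directly. In this small Python sketch (the helper name `pearson_r` is an illustrative choice), y is completely determined by x, yet the linear correlation is zero because SS_xy is zero.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample correlation via sums of squares: SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / sqrt(ss_xx * ss_yy)


x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # perfect, but non-linear, relationship

r = pearson_r(x, y)
print(r)
```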



Question: How do extreme points affect correlation?

Two small data sets, each with one extreme point (i = 5):

     Data set 1      Data set 2
i    x_i    y_i      x_i    y_i
1    1      1        1      1
2    2      2        1      2
3    3      3        2      1
4    4      4        2      2
5    5      0        14     14

With all five points: r = 0 for data set 1 and r > 0.99 for data set 2.

Without the extreme point (i = 5): r = 1 for data set 1 and r = 0 for data set 2.
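A quick way to see the effect is to compute r for each data set with and without its fifth point. The sketch below does this in Python; the helper `pearson_r` and the variable names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample correlation via sums of squares: SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / sqrt(ss_xx * ss_yy)


# The two data sets from the slide; the last point in each is the extreme one.
set1_x, set1_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 0]
set2_x, set2_y = [1, 1, 2, 2, 14], [1, 2, 1, 2, 14]

r1_all = pearson_r(set1_x, set1_y)             # extreme point included
r2_all = pearson_r(set2_x, set2_y)
r1_trim = pearson_r(set1_x[:-1], set1_y[:-1])  # extreme point dropped
r2_trim = pearson_r(set2_x[:-1], set2_y[:-1])

print(r1_all, r2_all, r1_trim, r2_trim)
```

A single point can therefore hide a perfect linear relationship (data set 1) or create the appearance of one (data set 2).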



Question: Does correlation mean causation?

Pupfish data (MR = metabolic rate, mg O2/hr)

i    Length x_i (mm)    Weight y_i (g)    MR z_i (mg O2/hr)
1    21.56              0.32              0.18
2    28.87              0.81              0.44
3    28.50              0.63              0.54
4    28.96              0.70              0.53
5    27.00              0.55              0.46
6    32.50              0.92              0.53
7    30.39              0.67              0.43
8    36.77              1.36              1.20
9    29.39              0.61              0.32

Correlation between length and weight: r = 0.94. Correlation between weight and MR: r = 0.92.

But the correlation between length and MR is also strong:

r = 0.84

Neither length nor weight “causes” an increase in MR. MR happens to be biologically, positively associated with weight. Weight also happens to have a positive association with length. Thus, length and MR appear to be related even though they are not directly related.

Remember, causation can only be inferred from an experimental approach.
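The three pairwise correlations quoted for the pupfish data can be reproduced with a few lines of Python. This is only a sketch; `pearson_r` and the variable names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample correlation via sums of squares: SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / sqrt(ss_xx * ss_yy)


length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # g
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]               # mg O2/hr

r_length_weight = pearson_r(length, weight)
r_weight_mr = pearson_r(weight, mr)
r_length_mr = pearson_r(length, mr)

print(round(r_length_weight, 2), round(r_weight_mr, 2), round(r_length_mr, 2))
```

All three values are large, which is exactly why a strong r on its own cannot say which variables are directly related.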


We have considered whether or not there is a linear relationship between two variables. Now let’s consider how to describe the relationship.

[Figure: MR (mg O2/hr) vs. Weight (g) in pupfish, with the fitted line y = 0.90x - 0.14]

The fitted line is a line of “best fit” for the linear relationship. It is usually found by Least-Squares Regression.

y = 0.90x - 0.14 is the equation of the line.

BIOL 582 Least-Squares Regression

Least-Squares Regression Criterion

The least-squares regression line is the one that minimizes the sum of squared errors. That is, it minimizes the sum of the squared vertical distances between the observed values of y and the values predicted by the line, ŷ (“y-hat”). We represent this as:

$$ \text{Minimize } \sum \text{residuals}^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 $$

[Figure: MR (mg O2/hr) vs. Weight (g) in pupfish, with the fitted line y = 0.90x - 0.14]


[Figure annotations] Observed value: y_i. Predicted value: ŷ_i. Residual = y_i - ŷ_i.

Note: Some residuals are positive and some are negative, so they can cancel when summed. Therefore, we minimize Σ residuals². Squaring (1) makes every term non-negative, so we minimize a sum of positive values, and (2) is analogous to calculating variance.


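[Figure: the same data with an alternative line drawn through them; observed values y_i, predicted values ŷ_i, and residuals y_i - ŷ_i are marked]

Why is this not a better line?

Although not readily apparent, Σ residuals² for this line > Σ residuals² for the least-squares line.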
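The comparison can be made concrete. The sketch below fits the least-squares line for MR against weight using the standard closed-form estimates, then compares its Σ residuals² with a few alternative lines. The alternative intercepts and slopes are hypothetical, chosen only for illustration; they are not the line drawn on the slide.

```python
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]  # g
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]      # mg O2/hr


def sum_sq_residuals(b0, b1, x, y):
    """Sum of squared vertical distances from the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(mr) / n

# Closed-form least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, mr))
      / sum((x - x_bar) ** 2 for x in weight))
b0 = y_bar - b1 * x_bar

sse_best = sum_sq_residuals(b0, b1, weight, mr)

# Hypothetical competing lines as (intercept, slope) pairs.
candidates = [(-0.14, 1.00), (0.00, 0.70), (-0.30, 1.10)]
sse_candidates = [sum_sq_residuals(c0, c1, weight, mr) for c0, c1 in candidates]

print(f"least squares: b0={b0:.2f}, b1={b1:.2f}, SSE={sse_best:.4f}")
for (c0, c1), sse in zip(candidates, sse_candidates):
    print(f"alternative:   b0={c0:.2f}, b1={c1:.2f}, SSE={sse:.4f}")
```

Whatever alternative is tried, its sum of squared residuals cannot fall below that of the least-squares line.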


So how do we find the “best fit” line to describe our linear relationship?

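[Figure: a line through the points (x1, y1) and (x2, y2), with the rise Δy and the run Δx marked]

slope = Δy/Δx = (y2 - y1)/(x2 - x1)

The y-intercept is the value of y where the line crosses the y-axis (x = 0).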



Any line can be described as y = b0 + b1x, where b0 is the y-intercept and b1 is the slope of the line.

In Least-Squares Regression, we define the linear relationship as:

$$ \hat{y} = b_0 + b_1 x $$

What this equation means is that for any value of x, we can predict a value of y (called y-hat), if we know the y-intercept, b0, and the slope, b1. We can find the slope and intercept (in succession) with the following formulae:

$$ b_1 = r \, \frac{s_y}{s_x} = \frac{SS_{xy}}{SS_{xx}} $$

$$ b_0 = \bar{y} - b_1 \bar{x} $$

The resulting equation minimizes the sum of squared residuals!
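The slope can be written either with r and the standard deviations or with sums of squares, and the two forms are the same quantity. A short check, assuming the relations used elsewhere in this lecture, $r = SS_{xy} / \sqrt{SS_{xx} SS_{yy}}$, $s_x^2 = SS_{xx}/(n-1)$, and $s_y^2 = SS_{yy}/(n-1)$:

$$ r \, \frac{s_y}{s_x} = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} \cdot \frac{\sqrt{SS_{yy}/(n-1)}}{\sqrt{SS_{xx}/(n-1)}} = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} \cdot \sqrt{\frac{SS_{yy}}{SS_{xx}}} = \frac{SS_{xy}}{SS_{xx}} $$

The factors of n - 1 cancel, so either route gives the same slope up to rounding of the intermediate values.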



Let’s consider the pupfish example:


With x = Length and y = Weight, we need to calculate:

$\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r$

-or-

$\bar{x}$, $\bar{y}$, $SS_{xx}$, $SS_{yy}$, and $SS_{xy}$




     Length              Weight
i    x_i      x_i²       y_i     y_i²    x_i·y_i
1    21.56    464.83     0.32    0.10    6.90
2    28.87    833.48     0.81    0.66    23.38
3    28.50    812.25     0.63    0.40    17.96
4    28.96    838.68     0.70    0.49    20.27
5    27.00    729.00     0.55    0.30    14.85
6    32.50    1056.25    0.92    0.85    29.90
7    30.39    923.55     0.67    0.45    20.36
8    36.77    1352.03    1.36    1.85    50.01
9    29.39    863.77     0.61    0.37    17.93
Σ    263.94   7873.85    6.57    5.46    201.56

Here is something to think about...

$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}, \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1} $$

The numerator is the “Sum of Squares”


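Thus, it should be straightforward that

$$ s_x^2 = \frac{SS_{xx}}{n - 1}, \qquad s_y^2 = \frac{SS_{yy}}{n - 1}, \qquad s_{xy} = \frac{SS_{xy}}{n - 1} $$

And each is easy to calculate with our data

$$ SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} $$

$$ SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} $$

$$ SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} $$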


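For the pupfish data (x = Length, y = Weight):

$$ \bar{x} = 263.94 / 9 = 29.327 $$

$$ \bar{y} = 6.57 / 9 = 0.73 $$

$$ SS_{xx} = 7873.849 - 263.94^2 / 9 = 133.369 $$

$$ SS_{yy} = 5.465 - 6.57^2 / 9 = 0.669 $$

$$ SS_{xy} = 201.557 - (263.94)(6.57) / 9 = 8.881 $$

$$ s_x = \sqrt{133.369 / (9 - 1)} = 4.083 $$

$$ s_y = \sqrt{0.669 / (9 - 1)} = 0.289 $$

$$ s_{xy} = 8.881 / (9 - 1) = 1.110 $$

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} = \frac{8.881}{\sqrt{133.369 \times 0.669}} = 0.940 $$

$$ b_1 = r \, \frac{s_y}{s_x} = 0.940 \times 0.289 / 4.083 \approx 0.0665 $$

or

$$ b_1 = \frac{SS_{xy}}{SS_{xx}} = 8.881 / 133.369 \approx 0.0666 $$

$$ b_0 = \bar{y} - b_1 \bar{x} = 0.730 - 0.0666 \times 29.327 = -1.223 $$

$$ \hat{y} = b_0 + b_1 x = -1.223 + 0.0666 \, x $$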



Review: The steps of Least-Squares Regression:

1. Plot bivariate data.

2. Calculate means for xi and yi.

3. Calculate SS, standard deviations (or both), and correlation coefficient.

4. Calculate slope.

5. Calculate y-intercept.

6. Describe linear equation.

7. Calculate the Coefficient of Determination.
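Steps 2 through 7 above can be sketched in Python for the pupfish length and weight data. This is one possible arrangement under the formulas given in this lecture; the variable and function names are illustrative, and step 1 (plotting) is left out.

```python
from math import sqrt

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x, mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y, g
n = len(length)

# Step 2: means.
x_bar = sum(length) / n
y_bar = sum(weight) / n

# Step 3: sums of squares, standard deviations, correlation coefficient.
ss_xx = sum((x - x_bar) ** 2 for x in length)
ss_yy = sum((y - y_bar) ** 2 for y in weight)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(length, weight))
s_x = sqrt(ss_xx / (n - 1))
s_y = sqrt(ss_yy / (n - 1))
r = ss_xy / sqrt(ss_xx * ss_yy)

# Step 4: slope.
b1 = ss_xy / ss_xx

# Step 5: y-intercept.
b0 = y_bar - b1 * x_bar


# Step 6: the linear equation, y-hat = b0 + b1 * x.
def predict(x):
    return b0 + b1 * x


# Step 7: coefficient of determination (equals r squared in simple linear regression).
r_squared = r ** 2

print(f"y-hat = {b0:.3f} + {b1:.4f} x,  r = {r:.2f},  R^2 = {r_squared:.2f}")
```

As a usage example, `predict(30.0)` gives the predicted weight in grams for a 30 mm fish, which is within the range of lengths observed in the sample.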


The Coefficient of Determination, R², measures the proportion of total variation in the response variable that is explained by the least-squares regression line.

Recall the least-squares regression criterion: the least-squares regression line minimizes the sum of squared errors (Σ residuals²).

R² is a value between 0 and 1, and for simple linear regression it is the same as r². (It is not the same as r² for multiple or non-linear regression.)

An R² of 0 means that none of the total variation is explained by the regression line (plot A) and an R² of 1 means all of the variation is explained by the regression line (plot B). A value in between describes the proportion of explained variation.

[Figure: plot A, R² = 0; plot B, R² = 1]

BIOL 582 The Coefficient of Determination


So what is meant by “explained” and “unexplained” variation?

Consider this example:

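[Figure: scatter plot of y against x for the points (1, 2), (2, 2.2), (3, 6), (4, 9.8), and (5, 10), showing observed values y_i, predicted values ŷ_i on the fitted line, and a horizontal line at ȳ]

ŷ = 2.36x - 1.08

R² = 0.9148

$$ \bar{y} = \frac{\sum y_i}{n} = \frac{2 + 2.2 + 6 + 9.8 + 10}{5} = 6 $$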


[Figure: the same plot, with the deviations y_i - ȳ, y_i - ŷ_i, and ŷ_i - ȳ marked]

$$ (y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) $$

Total deviation = Residual (unexplained deviation) + Explained deviation

Analogously, but algebraically too difficult to worry about,

Total Variation = Unexplained variation + Explained variation

SS(Total) = SS(error) + SS(R)

Where R stands for “regression” (Note: sometimes M is used for “model”)
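The sums-of-squares identity can be verified numerically for the five-point example. The sketch below fits the least-squares line, then computes SS(Total), SS(error), and SS(R) separately; the variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]
y = [2, 2.2, 6, 9.8, 10]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained

r_squared = ss_reg / ss_total
print(f"b0={b0:.2f} b1={b1:.2f} SST={ss_total:.3f} "
      f"SSE={ss_error:.3f} SSR={ss_reg:.3f} R^2={r_squared:.4f}")
```

The identity holds for the least-squares line specifically; for an arbitrary line the cross-product term does not vanish and the two parts need not add up to SS(Total).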


The pupfish data (x = Length, y = Weight), with predictions from ŷ_i = -1.223 + 0.066x_i:

i    x_i      y_i     ŷ_i     y_i - ŷ_i    (y_i - ŷ_i)²    y_i - ȳ    (y_i - ȳ)²
1    21.56    0.32    0.20    0.12         0.01            -0.41      0.17
2    28.87    0.81    0.68    0.13         0.02            0.08       0.01
3    28.50    0.63    0.66    -0.03        0.00            -0.10      0.01
4    28.96    0.70    0.69    0.01         0.00            -0.03      0.00
5    27.00    0.55    0.56    -0.01        0.00            -0.18      0.03
6    32.50    0.92    0.92    0.00         0.00            0.19       0.04
7    30.39    0.67    0.78    -0.11        0.01            -0.06      0.00
8    36.77    1.36    1.20    0.16         0.02            0.63       0.40
9    29.39    0.61    0.72    -0.11        0.01            -0.12      0.01
Σ                             0.16         0.08 (SSE)      0.00       0.67 (SST)

The residuals sum to 0.16 rather than 0 because the slope was rounded to 0.066 before the predictions were made.

R² = 1 - SSE/SST

= 1 - 0.08/0.67 = 0.88

Note: This is the same as

r² = 0.94² = 0.88
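The same R² can be obtained without the intermediate rounding. This Python sketch computes SSE and SST from the unrounded least-squares fit and compares 1 - SSE/SST with r²; the names are illustrative.

```python
from math import sqrt

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x, mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y, g
n = len(length)

x_bar = sum(length) / n
y_bar = sum(weight) / n
ss_xx = sum((x - x_bar) ** 2 for x in length)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(length, weight))

# Unrounded least-squares fit.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in length]

sse = sum((y - yh) ** 2 for y, yh in zip(weight, predicted))  # unexplained
sst = sum((y - y_bar) ** 2 for y in weight)                   # total

r_squared = 1 - sse / sst
r = ss_xy / sqrt(ss_xx * sst)

print(f"SSE={sse:.3f}  SST={sst:.3f}  R^2={r_squared:.2f}  r^2={r ** 2:.2f}")
```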


BIOL 582 Final Comments

• One can square the correlation coefficient to get the coefficient of determination only in the case of simple linear regression.

• If one does multiple regression, or ANCOVA (combination of regression and factorial ANOVA), then the full or partial coefficient of determination is for the SS of all effects or one of the effects, respectively, with respect to the total SS. Values will not be the same as squaring correlation coefficients.

• ANOVA on regression models is pretty much the same as before. For simple linear regression, randomization can be used. Simply randomize values of y and hold x constant. This will be demonstrated next time.
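The randomization idea in the last point can be sketched as follows. This is not the course's own demonstration: the test statistic (the absolute value of r), the number of permutations, and the seed are all assumptions made for illustration. The values of y are shuffled while x is held constant, and the P-value is the proportion of arrangements, counting the observed one, whose statistic is at least as large as the observed statistic.

```python
import random
from math import sqrt

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x, mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y, g


def pearson_r(x, y):
    """Sample correlation via sums of squares: SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / sqrt(ss_xx * ss_yy)


def permutation_p_value(x, y, n_perm=9999, seed=582):
    """Two-sided randomization P-value for r; n_perm and seed are arbitrary."""
    rng = random.Random(seed)        # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)        # randomize y, hold x constant
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # The observed arrangement counts as one of the possible outcomes.
    return (hits + 1) / (n_perm + 1)


p_value = permutation_p_value(length, weight)
print(p_value)
```

Because shuffling y changes only SS_xy, ranking the random arrangements by |r| is equivalent to ranking them by the absolute slope in simple linear regression.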
