biol 582 lecture set 11 bivariate data correlation regression
TRANSCRIPT
BIOL 582
Lecture Set 11
Bivariate Data
Correlation
Regression
Thus far, we have considered whether means of a response variable differ among groups. Sometimes it is of interest to know whether a variable covaries with another variable, or whether the value of one variable can predict the value of another.
With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs (xi, yi). The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, xi is the independent (predictor) variable and yi is the dependent (response) variable.
Although it might not be readily apparent, we have been working all along with qualitative (nominal) independent variables (e.g., grouping variables). Now we are going to shift gears and look at continuous, quantitative independent variables.
BIOL 582 Considering Multiple Variables
Bivariate Quantitative variables
[Figure: scatter plot of Weight (g) vs. Length (mm) for the pupfish data]
[Figure: Weight (g) vs. Length (mm) for the pupfish data, annotated as follows]
• Variables include units.
• Points are ordered pairs (xi, yi), e.g., (21.56, 0.32) and (36.77, 1.36).
• The x-axis shows the independent (predictor) variable; the y-axis shows the dependent (response) variable.
[Figure: Weight (g) vs. Length (mm) for the pupfish data]
Is there a linear relationship for the data?
[Figure: example scatter plots of y vs. x showing a positive linear relationship, a negative linear relationship, no relationship, and non-linear relationships]
Correlation
The linear correlation coefficient, or Pearson product-moment correlation coefficient, is a measure of the strength of the linear relationship between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient.
Sample Correlation Coefficient
$$ r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$
where $\bar{x}$ is the sample mean for the predictor variable,
$s_x$ is the sample standard deviation of the predictor variable,
$\bar{y}$ is the sample mean of the response variable,
$s_y$ is the sample standard deviation of the response variable,
$n$ is the number of individual units in the sample.
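The definition of r can be checked numerically. The following is a minimal sketch, assuming the pupfish length and weight values tabulated elsewhere in this lecture; the function name `pearson_r` is illustrative and not part of the lecture.

```python
import math

# Pupfish data from the lecture: length (mm) and weight (g)
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

def pearson_r(x, y):
    """Sample correlation from the standardized-deviation definition."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Sum of products of standardized deviations, divided by n - 1
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

r = pearson_r(length, weight)
print(f"r = {r:.2f}")  # r = 0.94
```

Each term in the sum is a product of two z-scores, so points that sit above (or below) both means push r upward, while points above one mean and below the other push it downward.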
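Here is a computationally easier way to calculate r:
$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} $$
where
$$ SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}, \quad SS_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}, \quad SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} $$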
BIOL 582 Scatter Diagrams; Correlation
Consider the pupfish example
i    xi (Length, mm)    yi (Weight, g)
1    21.56              0.32
2    28.87              0.81
3    28.50              0.63
4    28.96              0.70
5    27.00              0.55
6    32.50              0.92
7    30.39              0.67
8    36.77              1.36
9    29.39              0.61
[Figure: Weight (g) vs. Length (mm) for the pupfish data]
Add 3 more columns: xi², yi², and xiyi
i      xi        yi      xi²        yi²     xiyi
1      21.56     0.32    464.83     0.10    6.90
2      28.87     0.81    833.48     0.66    23.38
3      28.50     0.63    812.25     0.40    17.96
4      28.96     0.70    838.68     0.49    20.27
5      27.00     0.55    729.00     0.30    14.85
6      32.50     0.92    1056.25    0.85    29.90
7      30.39     0.67    923.55     0.45    20.36
8      36.77     1.36    1352.03    1.85    50.01
9      29.39     0.61    863.77     0.37    17.93
sum    263.94    6.57    7873.85    5.46    201.56
$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}}\,\sqrt{SS_{yy}}} = \frac{8.88}{11.55 \times 0.82} = 0.94 $$
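The column sums in the table are all that the computational formula needs. A minimal sketch, assuming the usual computational forms of the sums of squares (for example, SSxx = Σxi² - (Σxi)²/n); the helper name `sums_of_squares` is illustrative only.

```python
import math

# Pupfish data from the lecture: length (mm) and weight (g)
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

def sums_of_squares(x, y):
    """Return (SSxx, SSyy, SSxy) from raw column sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    ss_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    return ss_xx, ss_yy, ss_xy

ss_xx, ss_yy, ss_xy = sums_of_squares(length, weight)
r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(f"SSxx = {ss_xx:.3f}, SSyy = {ss_yy:.3f}, SSxy = {ss_xy:.3f}, r = {r:.3f}")
```

Because the code works from unrounded sums, its results can differ in the last digit from hand arithmetic on the rounded table entries.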
More on correlation coefficients
r meaning
1.0 Perfectly positively correlated
0.8 Strongly positively correlated
0.6
0.4 Weakly positively correlated
0.2
0 Not Correlated
-0.2
-0.4 Weakly negatively correlated
-0.6
-0.8 Strongly negatively correlated
-1.0 Perfectly negatively correlated
[Figure: four scatter plots of y vs. x with different positive correlations]
Match each plot to one of: r = 0.1, r = 0.3, r = 0.7, r = 0.9
[Figure: four scatter plots of y vs. x with different negative correlations]
Match each plot to one of: r = -0.1, r = -0.3, r = -0.7, r = -0.9
More on correlation coefficients: WARNINGS
Question: Does a correlation coefficient of 0 mean no association or no relationship?
i    xi    yi    xi²    yi²    xiyi
1    -2    4     4      16     -8
2    -1    1     1      1      -1
3    0     0     0      0      0
4    1     1     1      1      1
5    2     4     4      16     8
Here yi = xi², yet r = 0.
Thus, r = 0 could mean no association or a non-linear relationship.
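The r = 0 result for this perfectly quadratic relationship can be reproduced directly. A minimal sketch; the function name `correlation` is illustrative and uses the sums-of-squares form of r.

```python
import math

def correlation(x, y):
    """Pearson r computed from sums of squares and cross-products."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # y is completely determined by x

r = correlation(x, y)
print(r)  # 0.0
```

The positive and negative cross-products cancel exactly, so r reports no linear association even though y depends entirely on x.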
More on correlation coefficients: WARNINGS
Question: How do extreme points affect correlation?
First data set (r = 0):
i    xi    yi
1    1     1
2    2     2
3    3     3
4    4     4
5    5     0
Second data set (r > 0.99):
i    xi    yi
1    1     1
2    1     2
3    2     1
4    2     2
5    14    14
With the extreme point (i = 5) removed from each data set, the correlations change to r = 1 for the first data set and r = 0 for the second.
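The effect of a single extreme point can be checked by computing r with and without the fifth observation in each data set. A minimal sketch; the variable names and the `correlation` helper are illustrative.

```python
import math

def correlation(x, y):
    """Pearson r computed from sums of squares and cross-products."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# First data set: four points on a line plus one extreme point (5, 0)
first_x, first_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 0]
# Second data set: four uncorrelated points plus one extreme point (14, 14)
second_x, second_y = [1, 1, 2, 2, 14], [1, 2, 1, 2, 14]

r_first_all = correlation(first_x, first_y)
r_first_trimmed = correlation(first_x[:-1], first_y[:-1])
r_second_all = correlation(second_x, second_y)
r_second_trimmed = correlation(second_x[:-1], second_y[:-1])

print(r_first_all, r_first_trimmed)
print(round(r_second_all, 3), r_second_trimmed)
```

One extreme point is enough to hide a perfect linear relationship in the first data set and to manufacture a strong one in the second, which is why a scatter plot should accompany any reported r.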
More on correlation coefficients: WARNINGS
Question: Does correlation mean causation?
Pupfish data (MR = metabolic rate, mgO2/hr)
i    Length (xi)    Weight (yi)    MR (zi)
1    21.56          0.32           0.18
2    28.87          0.81           0.44
3    28.50          0.63           0.54
4    28.96          0.70           0.53
5    27.00          0.55           0.46
6    32.50          0.92           0.53
7    30.39          0.67           0.43
8    36.77          1.36           1.20
9    29.39          0.61           0.32
Length vs. Weight: r = 0.94
Weight vs. MR: r = 0.92
But the correlation between length and MR is also strong:
r = 0.84
Neither length nor weight “causes” an increase in MR. MR is biologically, positively associated with weight, and weight is positively associated with length. Thus, length and MR appear to be related even though they are not directly related.
Remember, causation can only be inferred from an experimental approach.
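The three pairwise correlations quoted for the pupfish data can be recomputed from the table. A minimal sketch; the `correlation` helper is illustrative, and nothing in the code addresses causation, only the strength of each linear association.

```python
import math

# Pupfish data from the lecture
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # mm
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # g
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]               # mgO2/hr

def correlation(x, y):
    """Pearson r computed from sums of squares and cross-products."""
    n = len(x)
    ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

r_length_weight = correlation(length, weight)
r_weight_mr = correlation(weight, mr)
r_length_mr = correlation(length, mr)

print(f"length vs. weight: {r_length_weight:.2f}")
print(f"weight vs. MR:     {r_weight_mr:.2f}")
print(f"length vs. MR:     {r_length_mr:.2f}")
```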
We have considered whether or not there is a linear relationship between two variables. Now let’s consider how to describe the relationship.
[Figure: MR (mgO2/hr) vs. Weight (g) in pupfish, with a fitted line labeled y = 0.90x - 0.14]
The line is a line of “best fit” for the linear relationship. It is usually found by Least-Squares Regression.
The label y = 0.90x - 0.14 is the equation of the line.
BIOL 582 Least-Squares Regression
Least-Squares Regression Criterion
The least-squares regression line is the one that minimizes the sum of squared errors. It is the line that minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line, $\hat{y}$ (“y-hat”). We represent this as:
$$ \text{Minimize } \sum \text{residuals}^2 = \sum (y_i - \hat{y}_i)^2 $$
[Figure: MR (mgO2/hr) vs. Weight (g) in pupfish, with the fitted line y = 0.90x - 0.14]
For each point:
Observed: $y_i$
Predicted: $\hat{y}_i$
Residual: $y_i - \hat{y}_i$
Note: Some residuals are positive, some are negative. Therefore, we try to minimize Σ residuals². This will (1) minimize a sum of positive values and (2) be analogous to calculating variance.
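[Figure: the same kind of plot with a different candidate line; the observed value $y_i$, predicted value $\hat{y}_i$, and residual $y_i - \hat{y}_i$ are marked as before]
Why is this not a better line?
Although not readily apparent, Σ residuals² for this line > Σ residuals² for the least-squares line.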
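The comparison between candidate lines can be made concrete by computing Σ residuals² for each. A minimal sketch using the pupfish weight and MR data; the alternative line (intercept -0.20, slope 1.00) is a hypothetical choice for illustration and is not taken from the slides.

```python
# Pupfish data from the lecture: weight (g) and metabolic rate (mgO2/hr)
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_xx = sum((a - x_bar) ** 2 for a in x)
    slope = ss_xy / ss_xx
    return y_bar - slope * x_bar, slope

def sum_squared_residuals(intercept, slope, x, y):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

b0, b1 = least_squares(weight, mr)
sse_best = sum_squared_residuals(b0, b1, weight, mr)
# Hypothetical alternative line, chosen only for comparison
sse_other = sum_squared_residuals(-0.20, 1.00, weight, mr)
# Residuals of the least-squares line cancel when summed without squaring
residual_sum = sum(b - (b0 + b1 * a) for a, b in zip(weight, mr))

print(f"least-squares line: {sse_best:.4f}")
print(f"alternative line:   {sse_other:.4f}")
```

Any line other than the least-squares line gives a larger sum of squared residuals, even when it looks like a reasonable fit by eye.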
So how do we find the “best fit” line to describe our linear relationship?
[Figure: a line through the points (x1, y1) and (x2, y2), with Δy, Δx, and the y-intercept marked]
$$ \text{slope} = \frac{\Delta y}{\Delta x} $$
Any line can be described as y = b0 + b1x, where b0 is the y-intercept and b1 is the slope of the line.
In Least-Squares Regression, we define the linear relationship as:
$$ \hat{y} = b_0 + b_1 x $$
What this equation means is that for any value of x, we can predict a value of y (called y-hat), if we know the y-intercept, b0, and the slope, b1. We can find the slope and intercept (in succession) with the following formulae:
$$ b_1 = r \frac{s_y}{s_x} = \frac{SS_{xy}}{SS_{xx}} $$
$$ b_0 = \bar{y} - b_1 \bar{x} $$
The resulting equation minimizes the sum of squared residuals!
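Both slope formulas, and the intercept that follows from them, can be evaluated on the pupfish length and weight data. A minimal sketch, assuming sample standard deviations with divisor n - 1; all variable names are illustrative.

```python
import math

# Pupfish data from the lecture: length (mm) and weight (g)
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

n = len(length)
x_bar = sum(length) / n
y_bar = sum(weight) / n

ss_xx = sum((x - x_bar) ** 2 for x in length)
ss_yy = sum((y - y_bar) ** 2 for y in weight)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(length, weight))

# Slope from the sums of squares
b1 = ss_xy / ss_xx

# Slope from r and the two sample standard deviations
r = ss_xy / math.sqrt(ss_xx * ss_yy)
s_x = math.sqrt(ss_xx / (n - 1))
s_y = math.sqrt(ss_yy / (n - 1))
b1_from_r = r * s_y / s_x

# Intercept: the line passes through the point (x_bar, y_bar)
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.4f}, b0 = {b0:.3f}")
```

The two slope expressions are algebraically identical, so any difference between them in hand calculations comes from rounding r, s_x, and s_y before multiplying.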
Let’s consider the pupfish example:
i    xi (Length)    yi (Weight)
1    21.56          0.32
2    28.87          0.81
3    28.50          0.63
4    28.96          0.70
5    27.00          0.55
6    32.50          0.92
7    30.39          0.67
8    36.77          1.36
9    29.39          0.61
We need to calculate $\bar{x}$ and $\bar{y}$, together with either
$SS_{xx}$, $SS_{yy}$, $SS_{xy}$
-or-
$s_x$, $s_y$, $r$
Let’s consider the pupfish example:
i    xi (Length)    xi²        yi (Weight)    yi²     xiyi
1    21.56          464.83     0.32           0.10    6.90
2    28.87          833.48     0.81           0.66    23.38
3    28.50          812.25     0.63           0.40    17.96
4    28.96          838.68     0.70           0.49    20.27
5    27.00          729.00     0.55           0.30    14.85
6    32.50          1056.25    0.92           0.85    29.90
7    30.39          923.55     0.67           0.45    20.36
8    36.77          1352.03    1.36           1.85    50.01
9    29.39          863.77     0.61           0.37    17.93
Σ    263.94         7873.85    6.57           5.46    201.56
Here is something to think about…
$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N}, \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $$
$$ \sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}, \qquad s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} $$
The numerator is the “Sum of Squares”.
Thus, it should be straightforward that
$$ s_x = \sqrt{\frac{SS_{xx}}{n - 1}}, \qquad s_y = \sqrt{\frac{SS_{yy}}{n - 1}}, \qquad s_{xy} = \sqrt{\frac{SS_{xy}}{n - 1}} $$
And each is easy to calculate with our data
$$ SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} $$
$$ SS_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} $$
$$ SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} $$
Using the column sums for the pupfish data (n = 9, Σxi = 263.94, Σxi² = 7873.85, Σyi = 6.57, Σyi² = 5.46, Σxiyi = 201.56). Results are computed from unrounded values, so they may differ slightly from arithmetic on the rounded numbers shown.
$$
\begin{aligned}
\bar{x} &= 263.94 / 9 = 29.327 \\
\bar{y} &= 6.57 / 9 = 0.730 \\
SS_{xx} &= 7873.85 - 263.94^2 / 9 = 133.369 \\
SS_{yy} &= 5.46 - 6.57^2 / 9 = 0.669 \\
SS_{xy} &= 201.56 - (263.94)(6.57) / 9 = 8.881 \\
s_x &= \sqrt{133.369 / (9 - 1)} = 4.083 \\
s_y &= \sqrt{0.669 / (9 - 1)} = 0.289 \\
s_{xy} &= \sqrt{8.881 / (9 - 1)} = 1.054 \\
r &= \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{8.881}{\sqrt{133.369 \times 0.669}} = 0.940 \\
b_1 &= r \frac{s_y}{s_x} = 0.940 \times \frac{0.289}{4.083} = 0.0666 \\
\text{or} \quad b_1 &= \frac{SS_{xy}}{SS_{xx}} = \frac{8.881}{133.369} = 0.0666 \\
b_0 &= \bar{y} - b_1 \bar{x} = 0.730 - 0.0666 \times 29.327 = -1.223 \\
\hat{y} &= b_0 + b_1 x = -1.223 + 0.0666x
\end{aligned}
$$
Review: The steps of Least-Squares Regression:
1. Plot bivariate data
2. Calculate means for xi and yi.
3. Calculate SS, standard deviations (or both), and correlation coefficient.
4. Calculate slope.
5. Calculate y-intercept.
6. Describe linear equation
7. Calculate the Coefficient of Determination.
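Steps 2 through 7 above can be collected into one small function (step 1, plotting, is left out). This is a sketch under the lecture's definitions rather than a reference implementation; the function name `simple_linear_regression` and the dictionary keys are illustrative. It is applied here to the pupfish weight and MR data.

```python
import math

# Pupfish data from the lecture: weight (g) and metabolic rate (mgO2/hr)
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]

def simple_linear_regression(x, y):
    n = len(x)
    # Step 2: means
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Step 3: sums of squares and the correlation coefficient
    ss_xx = sum((a - x_bar) ** 2 for a in x)
    ss_yy = sum((b - y_bar) ** 2 for b in y)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    # Step 4: slope
    slope = ss_xy / ss_xx
    # Step 5: y-intercept
    intercept = y_bar - slope * x_bar
    # Step 7: coefficient of determination, 1 - SSE/SST (SST equals SSyy)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1 - sse / ss_yy
    return {"slope": slope, "intercept": intercept, "r": r, "r_squared": r_squared}

fit = simple_linear_regression(weight, mr)
# Step 6: describe the linear equation
print(f"MR = {fit['intercept']:.2f} + {fit['slope']:.2f} * weight")  # MR = -0.14 + 0.90 * weight
print(f"r = {fit['r']:.2f}, R^2 = {fit['r_squared']:.2f}")
```

The printed equation matches the line shown on the MR vs. Weight plot, and for this simple linear regression the returned R² equals r squared.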
The Coefficient of Determination, R², measures the proportion of total variation in the response variable that is explained by the least-squares regression line.
Recall the least-squares regression criterion: the least-squares regression line minimizes the sum of squared errors (Σ residuals²).
R² is a value between 0 and 1, and for simple linear regression it is the same as r². (It is not the same as r² for multiple or non-linear regression.)
An R² of 0 means that none of the total variation is explained by the regression line (plot A) and an R² of 1 means all of the variation is explained by the regression line (plot B). A value in between describes the proportion of explained variation.
[Figure: plot A, a scatter plot with R² = 0; plot B, a scatter plot with R² = 1]
BIOL 582 The Coefficient of Determination
So what is meant by “explained” and “unexplained” variation?
Consider this example:
[Figure: scatter plot of y vs. x for the points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10), with the fitted line $\hat{y}$ = 2.36x - 1.08 (R² = 0.9148) and the mean $\bar{y}$; the observed value $y_i$ and predicted value $\hat{y}_i$ are marked for one point]
$$ \bar{y} = \frac{\sum y_i}{n} = \frac{2 + 2.2 + 6 + 9.8 + 10}{5} = 6 $$
[Figure: the same plot of the points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10) with the fitted line $\hat{y}$ = 2.36x - 1.08 (R² = 0.9148), with the deviations $y_i - \bar{y}$, $\hat{y}_i - \bar{y}$, and $y_i - \hat{y}_i$ marked for one point]
$$ (y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) $$
Total deviation = Residual (unexplained deviation) + Explained deviation
Analogously, but algebraically too difficult to worry about,
Total Variation = Unexplained variation + Explained variation
SS(Total) = SS(error) + SS(R)
Where R stands for “regression” (Note: sometimes M is used for “model”)
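The partition SS(Total) = SS(error) + SS(R) can be verified numerically on the five-point example. A minimal sketch; variable names are illustrative, and the line is fitted with the least-squares formulas for slope and intercept.

```python
# Five-point example from the lecture
x = [1, 2, 3, 4, 5]
y = [2, 2.2, 6, 9.8, 10]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

ss_total = sum((b - y_bar) ** 2 for b in y)               # total variation
ss_error = sum((b - p) ** 2 for b, p in zip(y, y_hat))    # unexplained variation
ss_regression = sum((p - y_bar) ** 2 for p in y_hat)      # explained variation

r_squared = ss_regression / ss_total

print(f"y-hat = {b1:.2f}x + ({b0:.2f})")
print(f"SST = {ss_total:.3f}, SSE = {ss_error:.3f}, SSR = {ss_regression:.3f}")
print(f"R^2 = {r_squared:.4f}")
```

Dividing the explained sum of squares by the total gives the same R² as 1 - SSE/SST, because the two pieces add up to the total.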
The pupfish data…
$$ \hat{y}_i = -1.223 + 0.066 x_i $$
i    xi (Length)    yi (Weight)    ŷi      yi - ŷi    (yi - ŷi)²    yi - ȳ    (yi - ȳ)²
1    21.56          0.32           0.20    0.12       0.01          -0.41     0.17
2    28.87          0.81           0.68    0.13       0.02          0.08      0.01
3    28.50          0.63           0.66    -0.03      0.00          -0.10     0.01
4    28.96          0.70           0.69    0.01       0.00          -0.03     0.00
5    27.00          0.55           0.56    -0.01      0.00          -0.18     0.03
6    32.50          0.92           0.92    0.00       0.00          0.19      0.04
7    30.39          0.67           0.78    -0.11      0.01          -0.06     0.00
8    36.77          1.36           1.20    0.16       0.02          0.63      0.40
9    29.39          0.61           0.72    -0.11      0.01          -0.12     0.01
Σ    263.94         6.57                   0.16       0.08 (SSE)    0.00      0.67 (SST)
R² = 1 - SSE/SST
= 1 - 0.08/0.67 = 0.88
Note: This is the same as
r² = 0.94² = 0.88
BIOL 582 Final Comments
• One can only square the correlation coefficient to get the coefficient of determination for the case of simple linear regression.
• If one does multiple regression, or ANCOVA (a combination of regression and factorial ANOVA), then the full or partial coefficient of determination is based on the SS of all effects or of one of the effects, respectively, relative to the total SS. These values will not be the same as squared correlation coefficients.
• ANOVA on regression models is pretty much the same as before. For simple linear regression, randomization can be used: simply randomize the values of y and hold x constant. This will be demonstrated next time.
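The randomization idea in the last point can be sketched as follows. This is one possible version, not necessarily the procedure to be demonstrated in the next lecture: the test statistic (the absolute value of the slope), the number of permutations, the seed, and the function names are all assumptions made for illustration.

```python
import random

# Pupfish data from the lecture: length (mm) and weight (g)
length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

def slope(x, y):
    """Least-squares slope, SSxy / SSxx."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_xx = sum((a - x_bar) ** 2 for a in x)
    return ss_xy / ss_xx

def randomization_p_value(x, y, n_permutations=9999, seed=1):
    """Shuffle y while holding x fixed; count slopes at least as extreme."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(slope(x, y))
    y_shuffled = list(y)
    count = 1  # the observed arrangement counts as one possible outcome
    for _ in range(n_permutations):
        rng.shuffle(y_shuffled)
        if abs(slope(x, y_shuffled)) >= observed:
            count += 1
    return count / (n_permutations + 1)

p_value = randomization_p_value(length, weight)
print(f"observed slope = {slope(length, weight):.4f}, p = {p_value:.4f}")
```

Shuffling y breaks any association with x, so the shuffled slopes show what slopes look like when there is no relationship; a small p-value means the observed slope is unusual under that scenario.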