Correlation
Association between 2 variables
Suppose we wished to graph the relationship between foot length and height.
[Figure: empty scatterplot axes. x-axis: Foot Length (4 to 14); y-axis: Height (58 to 74)]
In order to create the graph, which is called a scatterplot or scattergram, we need the foot length and height for each of our 20 subjects.
Assume our first subject had a 12 inch foot and was 70 inches tall.
1. Find 12 inches on the x-axis.
2. Find 70 inches on the y-axis.
3. Locate the intersection of 12 and 70.
4. Place a dot at the intersection of 12 and 70.
Assume that our second subject had an 8 inch foot and was 62 inches tall.
5. Find 8 inches on the x-axis.
6. Find 62 inches on the y-axis.
7. Locate the intersection of 8 and 62.
8. Place a dot at the intersection of 8 and 62.
9. Continue to plot points for each pair of scores.
[Figure: scatterplot of Height against Foot Length with all points plotted]
Notice how the scores cluster to form a pattern.
The more closely they cluster to a line that is drawn through them, the stronger the linear relationship between the two variables is (in this case foot length and height).
If the points on the scatterplot have an upward movement from left to right, we say the relationship between the variables is positive.
If the points on the scatterplot have a downward movement from left to right, we say the relationship between the variables is negative.
A positive relationship means that high scores on one variable are associated with high scores on the other variable. It also indicates that low scores on one variable are associated with low scores on the other variable.
A negative relationship means that high scores on one variable are associated with low scores on the other variable. It also indicates that low scores on one variable are associated with high scores on the other variable.
Not only do relationships have direction (positive and negative), they also have strength (from 0.00 to 1.00 and from 0.00 to –1.00).
The more closely the points cluster toward a straight line, the stronger the relationship is.
A set of scores with r = –0.60 has the same strength as a set of scores with r = 0.60 because both sets cluster similarly.
For this procedure, we use Pearson’s r (also known as a Pearson Product Moment Correlation Coefficient). This statistical procedure can only be used when BOTH variables are measured on a continuous scale and you wish to measure a linear relationship.
[Figure: two scatterplot panels, "Linear Relationship" and "Curvilinear Relationship"; the curvilinear panel is marked "NO Pearson r"]
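Formula for correlations
$$r = \frac{\mathrm{Cov}_{xy}}{SD_x \, SD_y} = \frac{\sum (x - \bar{x})(y - \bar{y}) / n}{S_x S_y}$$
or
$$r = \frac{1}{n} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$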
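The two forms of the formula can be checked against each other with a short Python sketch. Only the first two (foot length, height) pairs come from the slides; the remaining pairs are made up for illustration. Because the formula divides by n, the sketch assumes the standard deviations are population standard deviations (also divided by n).

```python
import math

# First two pairs are from the slides; the rest are hypothetical illustration data.
foot = [12, 8, 10, 9, 11, 13, 7, 10]
height = [70, 62, 67, 64, 69, 72, 60, 66]


def mean(values):
    return sum(values) / len(values)


def pop_sd(values):
    """Population standard deviation (divides by n, matching the formula)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def r_from_covariance(xs, ys):
    """r = Cov_xy / (SD_x * SD_y)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (pop_sd(xs) * pop_sd(ys))


def r_from_z_scores(xs, ys):
    """r = (1/n) * sum of the products of paired z-scores."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = pop_sd(xs), pop_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n


r1 = r_from_covariance(foot, height)
r2 = r_from_z_scores(foot, height)
print(round(r1, 3), round(r2, 3))  # the two forms should print the same value
```

Both functions compute the same quantity; the second simply standardizes each score before multiplying.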
Assumptions of the PMCC
1. The measures are approximately normally distributed
2. The variance of the two measures is similar (homoscedasticity) -- check with scatterplot
3. The relationship is linear -- check with scatterplot
4. The sample represents the population
5. The variables are measured on an interval or ratio scale
Example
• We’ll use data from the class questionnaire in 2005 to see if a relationship exists between the number of times per week respondents eat fast food and their weight
• What’s your guess (hypothesis) about how the results of this test will turn out? .5? .8? ???
Example
• To get a correlation coefficient:
• Slide the variables over...
Example
• SPSS output
The red is our correlation coefficient. The blue is our level of significance resulting from the test… what does that mean?
Digression - Hypotheses
• Many research designs involve statistical tests, which involve accepting or rejecting a hypothesis
• Null (statistical) hypotheses assume no relationship between two or more variables.
• Statistics are used to test null hypotheses
  – E.g. We assume that there is no relationship between weight and fast food consumption until we find statistical evidence that there is
Probability
• Probability is the odds that a certain event will occur
• In research, we deal with the odds that patterns in data have emerged by chance vs. they are representative of a real relationship
• Alpha (α) is the probability level (or significance level) set, in advance, by the researcher as the odds that something occurs by chance
Probability
• Alpha levels (cont.)
  – E.g. α = .05 means accepting a 5% chance of calling a finding significant when the pattern arose by chance rather than from a relationship in the data
  – The lower the α the better, but… the α level must be set in advance
Probability
• Most statistical tests produce a p-value that is then compared to the α-level to accept or reject the null hypothesis
• E.g. Researcher sets significance level at .05 a priori; test results show p = .02.
• Researcher can then reject the null hypothesis and conclude the result was not due to chance but to there being a real relationship in the data
• How about p = .051, when α-level = .05?
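The comparison of a p-value to a preset α-level can be sketched as a small Python function. The function name and return strings are hypothetical, and the sketch assumes the common convention that the null is rejected when p ≤ α.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with an alpha level that was set in advance.

    Assumes the convention "reject when p <= alpha"; the wording of the
    returned strings is illustrative only.
    """
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "do not reject the null hypothesis"


print(decide(0.02))   # p = .02 is below alpha = .05
print(decide(0.051))  # p = .051 is above alpha = .05, however narrowly
```

Under this rule p = .051 does not lead to rejection at α = .05, because the cutoff was fixed before the data were analyzed.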
Error
• Significance levels (e.g. α = .05) are set in order to avoid error
  – Type I error = rejection of the null hypothesis when it was actually true
    • Conclusion = relationship; there wasn’t one (false positive) (= α)
  – Type II error = acceptance of the null hypothesis when it was actually false
    • Conclusion = no relationship; there was one
Error – Truth Table
            Null True       Null False
Accept      (correct)       Type II error
Reject      Type I error    (correct)
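The link between α and the Type I error rate can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original slides: it draws two unrelated normal variables for 20 subjects many times, converts each r to a t statistic with t = r·sqrt((n − 2)/(1 − r²)), and rejects the null when |t| exceeds 2.101, the two-tailed critical value for α = .05 with 18 degrees of freedom. The function names, trial count, and seed are arbitrary.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def type_i_error_rate(trials=2000, n=20, t_crit=2.101, seed=1):
    """Share of trials that reject a null hypothesis that is true by construction."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # x and y are generated independently, so the null really is true.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        r = pearson_r(xs, ys)
        t = r * math.sqrt((n - 2) / (1 - r * r))
        if abs(t) > t_crit:
            rejections += 1  # a false positive (Type I error)
    return rejections / trials


rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The printed rate should land near .05, which is what "Type I error = α" means in practice.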
Back to Our Example
• Conclusion: No relationship exists between weight and fast food consumption with this group of respondents
Really?
• Conclusion: No relationship exists between weight and fast food consumption with this group of subjects
  – Do you believe this? Can you critique it? Construct validity? External validity?
  – Thinking in this fashion will help you adopt a critical stance when reading research
Another Example
• Now let’s see if a relationship exists between weight and the number of piercings a person has
  – What’s your guess (hypothesis) about how the results of this test will turn out?
  – It’s fine to guess, but remember that our null hypothesis is that no relationship exists, until the data shows otherwise
Another Example (continued)
• What can we conclude from this test?
• Does this mean that weight causes piercings, or vice versa, or what?
Correlations and causality
• Correlations only describe the relationship; they do not prove cause and effect
• Correlation is a necessary, but not sufficient, condition for determining causality
• There are three requirements to infer a causal relationship
Correlations and causality
1. A statistically significant relationship between the variables
2. The causal variable occurred prior to the other variable
3. There are no other factors that could account for the cause
Correlation studies do not meet the last requirement and may not meet the second requirement (go back to internal validity – 497)
Correlations and causality
If there is a relationship between weight and # piercings it could be because
• weight → # piercings
• weight ← # piercings
• weight ← some other factor → # piercings
Which do you think is most likely here?
Other Types of Correlations
• Other measures of correlation between two variables:
  – Point-biserial correlation: use when you have a dichotomous variable
    • The formula for computing a PBC is actually just a mathematical simplification of the formula used to compute Pearson’s r, so to compute a PBC in SPSS, just compute r and the result is the same
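The equivalence between the point-biserial correlation and Pearson’s r can be checked with a short Python sketch. The data are hypothetical, the dichotomous variable is assumed to be coded 0/1, and the point-biserial form used here is the textbook one, r_pb = (M1 − M0)/s · sqrt(n1·n0/n²), with s the population standard deviation of the continuous scores.

```python
import math

# Hypothetical data: a 0/1 coded dichotomous variable and a continuous score.
group = [0, 0, 0, 0, 1, 1, 1, 1, 1]
score = [3.0, 4.0, 2.0, 5.0, 6.0, 7.0, 5.0, 8.0, 6.0]


def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(groups, scores):
    """Point-biserial correlation from the two group means."""
    n = len(scores)
    ones = [s for g, s in zip(groups, scores) if g == 1]
    zeros = [s for g, s in zip(groups, scores) if g == 0]
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    m = sum(scores) / n
    sd = math.sqrt(sum((s - m) ** 2 for s in scores) / n)  # population SD
    return (m1 - m0) / sd * math.sqrt(len(ones) * len(zeros) / n ** 2)


print(round(pearson_r(group, score), 4), round(point_biserial(group, score), 4))
```

The two printed values agree, which is why running the ordinary Pearson procedure on a 0/1 variable gives the point-biserial result.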
Other Types of Correlations
• Other measures of correlation between two variables: (cont.)
  – Spearman rho correlation; use with ordinal (rank) data
    • Computed in SPSS the same way as Pearson’s r… simply toggle the Spearman button on the Bivariate Correlations window
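Outside SPSS, Spearman’s rho can be sketched as Pearson’s r applied to ranks. The Python below is a minimal illustration under that assumption; tied values receive the average of the ranks they span, and the example data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as Pearson's r on the ranked data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data: y rises with x, but not along a straight line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(round(pearson_r(xs, ys), 3), round(spearman_rho(xs, ys), 3))
```

Because the ranks of x and y line up exactly here, rho reaches its maximum even though the raw relationship is curved, so Pearson’s r on the raw scores falls short of it.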
Coefficient of Determination
Correlation coefficient squared
Percentage of the variability among scores on one variable that can be attributed to differences in the scores on the other variable
The coefficient of determination is useful because it gives the proportion of the variance of one variable that is predictable from the other variable
Next week we will discuss regression, which builds upon correlation and utilizes this coefficient of determination
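As a quick worked illustration of the coefficient of determination, the Python sketch below squares the two r values used earlier in the slides (0.60 and –0.60). The function name is hypothetical.

```python
def coefficient_of_determination(r):
    """Square the correlation coefficient to get the share of variance explained."""
    return r * r


for r in (0.60, -0.60):
    share = coefficient_of_determination(r)
    print(f"r = {r:+.2f} -> r^2 = {share:.2f} ({share:.0%} of the variance)")
```

Squaring removes the sign, so a positive and a negative correlation of the same strength account for the same proportion of variance.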
Correlation in Excel
Use the function “correl”
The “arguments” (components) of the function are the two arrays
Applets (see applets page)
• http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html
• http://www.stat.sc.edu/~west/applets/clicktest.html
• http://www.stat.sc.edu/~west/applets/rplot.html