1
Matters arising
1. Summary of last week’s lecture
2. The exercises
3. Your queries
2
The Pearson correlation (r)
The PEARSON CORRELATION is a measure of a supposed linear association between two variables.
3
Linear, but imperfect association
• If the scatterplot is elliptical in shape, a linear association is indicated.
• In psychology, all measurement is subject to random error.
• No association between measured variables is ever perfect.
• That is why the points do not all lie on a straight line.
4
The Pearson correlation
Formula for r: the sum of products in the numerator, the sums of squares in the denominator.
5
Explanation
• The numerator of r is known as a SUM OF PRODUCTS (SP).
• It is the sum of products that captures the extent to which X and Y are associated, or CO-VARY.
• The sums of squares in the denominator merely constrain the range of variation of r.
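For reference, the standard formula for r can be written as below, with the sum of products in the numerator and the two sums of squares in the denominator. The symbols SP, SS_X and SS_Y are shorthand introduced here for illustration.

```latex
r \;=\; \frac{\sum (X - \bar{X})(Y - \bar{Y})}
             {\sqrt{\sum (X - \bar{X})^{2}\,\sum (Y - \bar{Y})^{2}}}
  \;=\; \frac{SP}{\sqrt{SS_X \, SS_Y}},
\qquad -1 \le r \le +1
```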
6
The sum of products captures covariation
• Points in the upper right quadrant have positive deviation products; points in the lower left also have positive deviation products (a minus times a minus is a plus).
• Points in the other two quadrants have negative products.
• Since the positive products predominate, we can expect the covariance to be large and positive.
• The negative products are small: the points are near the intersection of the mean lines.
Scatterplot divided into quadrants by the mean lines: Mean Preference score and Mean Actual Violence score.
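The quadrant argument can be checked numerically. The sketch below uses a small set of invented scores (not the lecture's Violence data) and hypothetical names such as `deviation_products` to show that each point's deviation product is positive or negative according to its quadrant, and that r is the sum of those products rescaled by the sums of squares.

```python
# Hypothetical scores, invented for illustration only.
exposure = [2, 4, 5, 7, 9]   # X
violence = [3, 4, 6, 6, 9]   # Y


def mean(values):
    return sum(values) / len(values)


def deviation_products(x, y):
    """One product (X - mean X)(Y - mean Y) per point."""
    mx, my = mean(x), mean(y)
    return [(xi - mx) * (yi - my) for xi, yi in zip(x, y)]


def pearson_r(x, y):
    """r = SP / sqrt(SS_X * SS_Y)."""
    mx, my = mean(x), mean(y)
    sp = sum(deviation_products(x, y))        # sum of products
    ss_x = sum((xi - mx) ** 2 for xi in x)    # sum of squares for X
    ss_y = sum((yi - my) ** 2 for yi in y)    # sum of squares for Y
    return sp / (ss_x * ss_y) ** 0.5


products = deviation_products(exposure, violence)
negative = [p for p in products if p < 0]
r = pearson_r(exposure, violence)
print(products)
print(len(negative), "negative product(s); r =", round(r, 3))
```

With these invented scores only one point falls in a "negative" quadrant, and it lies close to the intersection of the mean lines, so its product is small.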
7
An elliptical scatterplot
• This is fine.
• The elliptical scatterplot indicates that there is indeed a basically linear relationship between variable Y1 and variable X1.
8
No association
• There is NO association between Z and Y.
• The high value of r is driven solely by the presence of a single OUTLIER.
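A minimal sketch of how one outlier can manufacture a high r. The numbers are invented for illustration: five points with no linear association at all, plus a single extreme point.

```python
def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / (ss_x * ss_y) ** 0.5


# Hypothetical data: Z and Y are unrelated in the main cluster.
z = [1, 2, 3, 4, 5]
y = [3, 5, 1, 5, 3]

r_without = pearson_r(z, y)                # main cluster only
r_with = pearson_r(z + [20], y + [20])     # same cluster plus one outlier

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```

The correlation in the main cluster is zero, yet adding the single point at (20, 20) pushes r above 0.9.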
9
Anscombe’s rule
• When you examine a scatterplot (something you should ALWAYS do when interpreting a correlation), ask yourself the following question:
• “Would the removal of one or two points at random affect the basically elliptical shape of the scatterplot? If the shape would remain essentially the same, the value of r accurately reflects the association between the variables.”
10
Summary
• The Pearson correlation r is a measure of the strength of a SUPPOSED linear relationship between 2 variables.
• It is one of the most widely used of statistical measures; but it is also one of the most misused.
• You should always try to see the scatterplot when interpreting a value of r.
11
Exercise
From the Violence data, obtain a scatterplot and calculate the Pearson correlation.
12
Direction of causation
• When we measure two variables and obtain the correlation between them, we nearly always do so because we believe that one variable X causes or influences the other Y.
• We have measured Exposure X and Violence Y because we have the hypothesis that X causes Y.
13
The scatterplot of Y against X
• If we believe that X causes Y, we want to “PLOT Y AGAINST X”.
• We want a scatterplot with Y on the vertical axis and X on the horizontal axis.
14
Ordering the plot
15
The default graph
16
The vertical scale
• Notice that the vertical axis begins at 3, rather than at zero.
• I like to see the whole scale on the vertical axis.
• Double-click on the graph to enter the Chart Editor.
• Double-click on the vertical axis to enter a dialog which will enable you to control the amount of the vertical scale that you can see.
17
Ordering the full Y scale
Uncheck Auto and enter zero into the Custom slot.
18
Final version
19
Why do I like to see the entire scale on the vertical axis?
20
Beware!
• Modern computing packages such as SPSS afford a bewildering variety of attractive graphs and displays to help you bring out the most important features of your results. You should certainly use them.
• But there are pitfalls awaiting the unwary.
21
Performance profiles
• We often want to see how mean performance varies (or not) over various treatment conditions.
• We might want to compare the performance of participants who have ingested different kinds (or dosages) of drugs with that of a comparison or control group.
• There is a set of methods known as Analysis of Variance (ANOVA) which enable us to do that.
22
Ordering a means plot
23
A picture of the results
24
The picture is false!
• The table of means shows minuscule differences among the five group means!
• The graph suggested that there were vast differences among the means!
25
A small scale view
• Only a microscopically small section of the scale is shown on the vertical axis.
• This greatly magnifies even small differences among the group means.
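The magnification can be put into numbers. This sketch assumes five invented group means and a hypothetical helper, `share_of_axis`, which returns the fraction of the vertical axis taken up by the spread of the means. It is only an illustration of the arithmetic, not of anything SPSS computes.

```python
def share_of_axis(means, axis_min, axis_max):
    """Fraction of the vertical axis height spanned by the spread of the means."""
    return (max(means) - min(means)) / (axis_max - axis_min)


# Hypothetical group means that differ only slightly.
group_means = [10.1, 10.3, 10.2, 10.4, 10.2]

truncated = share_of_axis(group_means, 10.0, 10.5)   # axis starts just below the data
full = share_of_axis(group_means, 0.0, 10.5)         # axis starts at zero

print("Truncated axis: spread fills", round(100 * truncated), "% of the height")
print("Full axis:      spread fills", round(100 * full, 1), "% of the height")
```

The same differences fill more than half the picture on the truncated axis and under 3% of it when the axis starts at zero.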
26
Putting things right
• Double-click on the image to get into the Chart Editor.
• Double-click on the vertical axis to access the scale specifications.
27
Putting things right …
• Uncheck the minimum value box and enter zero as the desired minimum point.
• Click Apply.
28
The true picture!
29
The true picture …
• The effect is dramatic.
• The profile now reflects the true situation.
• ALWAYS BE SUSPICIOUS OF GRAPHS THAT DO NOT SHOW THE COMPLETE VERTICAL SCALE!
30
Your queries
• Several of you have e-mailed me asking how to fit a line to a scatterplot.
• Last week, I said that an elliptical scatterplot indicated that the relationship between the variables was basically LINEAR.
• So we want the best-fitting straight line through the points.
• This is known as the REGRESSION LINE.
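Outside SPSS, the best-fitting (least-squares) straight line can be found directly: the slope is the sum of products divided by the sum of squares of X, and the line passes through the point of means. A minimal sketch with invented data; the function name `regression_line` is hypothetical.

```python
def regression_line(x, y):
    """Least-squares line of Y on X: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))   # sum of products
    ss_x = sum((a - mx) ** 2 for a in x)                  # sum of squares of X
    slope = sp / ss_x
    intercept = my - slope * mx    # the line passes through (mean X, mean Y)
    return slope, intercept


# Hypothetical scores, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = regression_line(x, y)
print("Predicted Y =", round(intercept, 2), "+", round(slope, 2), "* X")
```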
31
Drawing the regression line through the points
Choose Fit Line at Total.
To leave the Chart Editor, choose Close from the Edit menu or double-click on the Viewer outside the rectangle around the figure.
32
Finding the value of r
33
Hypothesis testing
• In HYPOTHESIS TESTING, a proposition known as the NULL HYPOTHESIS (H0) is set up.
• H0 is the NEGATION of your scientific hypothesis.
• So if our scientific hypothesis is that there is an association, H0 says there’s NO association.
34
The p-value
• To test H0, we gather our data and calculate the value of a TEST STATISTIC.
• If the null hypothesis is true, how probable would a value of our test statistic at least as extreme as ours have been?
• The answer is given by a probability known as the p-value.
• SPSS calls the p-value the ‘Sig.’, i.e., the SIGNIFICANCE PROBABILITY.
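The logic of the p-value can be illustrated with a permutation test. This is a different technique from the one SPSS uses for a correlation and is shown only as a sketch: if H0 is true, the pairing of X with Y is arbitrary, so we shuffle Y many times and count how often the shuffled data give an r at least as extreme as ours. The data, the number of shuffles and the seed are all assumptions made for the example.

```python
import random


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / (ss_x * ss_y) ** 0.5


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)      # break any real pairing of X with Y
        if abs(pearson_r(x, shuffled)) >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (n_shuffles + 1)


# Hypothetical, strongly associated scores.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

p = permutation_p_value(x, y)
print("r =", round(pearson_r(x, y), 3), " p =", round(p, 4))
```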
35
A “significant” result
• A SIGNIFICANCE LEVEL is a small probability accepted by convention as a criterion for a decision about a statistical test.
• Most commonly, the 0.05 significance level is accepted by psychologists.
• If the p-value of your test statistic is LESS than the 0.05 significance level, your result is said to be ‘significant beyond the 0.05 level’.
36
The result
• Report this result as follows:
• r(27) = 0.89; p < .01
Here 27 is the number of pairs, 0.89 is the value of r, and p < .01 is the p-value.
Never report a p-value like this!
Report the p-value to 2 places of decimals: if it’s less than .01, use the inequality sign <.
The p-value
37
Lecture 9
MORE ON ASSOCIATION
38
We have shown that there is a strong association between a
child’s violence and the amount of violent screen material watched …
39
but have we really gathered evidence for the hypothesis that
exposure to screen violence promotes actual violence?
40
Remember:
CORRELATION
does not necessarily mean
CAUSATION
41
One causal model
• The hypothesis implies this CAUSAL MODEL.
• The results are CONSISTENT with the hypothesis.
• The correlation may indeed arise because exposure to violence causes actual violence.
42
Another causal model
• The child’s tendencies towards, and appetite for, violence lead to his (or her) watching violent programmes as often as possible.
• This model is also consistent with the data.
43
A third causal model
• NEITHER variable causes the other.
• Both are determined by the behaviour of the child’s parents.
44
The choice
• Does exposure cause violence (top model)?
• Does Violence lead to more exposure (middle model)?
• Are both exposure and violence caused by a third, background, variable (bottom model)?
45
A background variable
• Perhaps neither Exposure nor Actual violence causes the other.
• Perhaps they are caused by a background parental behaviour variable.
• We have data on such a variable.
• The background variable correlates highly with both Exposure and Actual violence.
46
Partial correlation
A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.
47
Three variables
• Let X1, X2 and X3 be three variables.
• Let r12 be the Pearson correlation between X1 and X2.
• Let r(12.3) be the partial correlation between X1 and X2 when the covariation of each with X3 has been removed.
48
Partial correlation
49
Explanation
Removes the influence of the third variable.
Rescales with new variances, so that the range is as below.
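For reference, the standard formula for the first-order partial correlation is given below, written in subscript form (r(12.3) in the notation above). The numerator removes the influence of the third variable; the denominator rescales the result so that it stays within the usual range.

```latex
r_{12.3} \;=\; \frac{r_{12} - r_{13}\, r_{23}}
                    {\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}},
\qquad -1 \le r_{12.3} \le +1
```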
50
Obtaining a partial correlation
51
The partial correlation
• The partial correlation fails to reach significance.
• Now that we have taken the background variable into consideration, we see that there is no significant correlation between Exposure and Actual violence.
• It appears that, of the three possible causal models, the ‘third party’ model gives the most convincing account of the data.
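A minimal sketch of how partialling out can shrink a correlation, using the standard first-order partial correlation formula. The three correlations are invented for illustration and are not the values from the lecture's data.

```python
from math import sqrt


def partial_correlation(r12, r13, r23):
    """Correlation between variables 1 and 2 with variable 3 partialled out."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Hypothetical correlations: 1 = Exposure, 2 = Actual violence, 3 = background variable.
r12, r13, r23 = 0.6, 0.8, 0.7

r12_3 = partial_correlation(r12, r13, r23)
print("r12 =", r12, " partial r12.3 =", round(r12_3, 3))
```

Because the background variable correlates highly with both of the others, a moderate r12 of 0.6 falls to under 0.1 once it is partialled out.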
52
Levels of measurement
• There are three levels:
• 1. The SCALE level. The data are measures on an independent scale with units. Heights, weights, performance scores and IQs are scale data. Each score has ‘stand-alone’ meaning.
• 2. The ORDINAL level. Data in the form of RANKS (1st, 3rd, 53rd). A rank has meaning only in relation to the other individuals in the sample. A rank does not express, in units, the extent to which a property is possessed.
• 3. The NOMINAL level. Assignments to categories (so-many males, so-many females.)
53
3. Nominal data
• NOMINAL data relate to qualitative variables or attributes, such as gender or blood group, and are merely records of CATEGORY MEMBERSHIP.
• Nominal data are merely LABELS: they may take the form of numbers, but such numbers are arbitrary code numbers representing, say, the different blood groups or different nationalities. ANY numbers will do, as long as they are all different.
54
A set of nominal data
• A medical researcher wishes to test the hypothesis that people with a certain type of body tissue (Critical) are more likely to show the presence of a potentially harmful antibody.
• Data are obtained on 79 people, who are classified with respect to 2 attributes:
– 1. Tissue Type;
– 2. Whether the antibody is present or absent.
55
The research question
• Do more of the people in the critical group have the antibody?
• We are asking whether there is an ASSOCIATION between the variables of category membership (tissue type) and presence/absence of the antibody.
• This is the SCIENTIFIC hypothesis.
56
The null hypothesis
• The NULL HYPOTHESIS is the negation of the scientific hypothesis.
• The null hypothesis states that there is NO association between tissue type and presence of the antibody.
57
Contingency tables (cross-tabulations)
• When we wish to investigate whether an association exists between qualitative or categorical variables, the starting point is usually a display known as a CONTINGENCY TABLE, whose rows and columns represent the categories of the qualitative variables we are studying.
• Contingency tables are also known as CROSS-TABULATIONS, or CROSSTABS.
58
The contingency table
• Is there an association between Tissue Type and Presence of the antibody?
• It looks as if the antibody is indeed more in evidence in the ‘Critical’ tissue group.
59
The null hypothesis
• The null hypothesis is the negation of our scientific hypothesis, namely, the statement that the two variables are INDEPENDENT.
• In other words, any differences in the relative incidence of the antibody in the different tissue groups have resulted from SAMPLING ERROR.
60
Expected cell frequencies
• The pattern of the OBSERVED FREQUENCIES (O) would suggest that there is a greater incidence of the antibody in the Critical tissue group.
• But the marginal totals showing the frequencies of the various groups in the sample also vary.
• What cell frequencies would we expect under the independence hypothesis?
61
Expected cell frequencies (E)
• According to the null hypothesis, the presence of the antibody and membership of a particular tissue type are independent events.
• The probability of the joint occurrence of independent events is the product of their separate probabilities.
• We find the expected frequencies (E) by multiplying together the marginal totals that intersect at the cells concerned and dividing by the total number of observations.
62
The expected frequencies
• To obtain, say, the value of E for the top left cell, multiply the intersecting marginal totals (36 and 22) and divide by 79 (the total frequency), obtaining (36×22)/79 = 10.03.
• In the Critical group, there seem to be large differences between O and E: fewer No’s than expected and more Yes’s.
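The rule for E is easy to check by machine. The sketch below first reproduces the top-left cell from the marginal totals quoted above, then applies the same rule to every cell of a small table. That 2 × 2 table is invented for illustration (it is not the lecture's tissue-type table), and `expected_frequencies` is a hypothetical helper name.

```python
def expected_frequencies(observed):
    """E for each cell = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# The single cell worked out in the lecture: marginal totals 36 and 22, N = 79.
top_left = 36 * 22 / 79
print("E for the top left cell:", round(top_left, 2))

# Hypothetical observed frequencies, for illustration only.
table = [[10, 20],
         [30, 40]]
expected = expected_frequencies(table)
print(expected)
```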
63
The chi-square (χ²) statistic
• We need a statistic which summarises the differences between O and E, so that a large value will cast doubt upon the null hypothesis of independence.
• Such a statistic is CHI-SQUARE (χ²).
64
Formula for chi-square
• The element of chi-square expresses the square of the difference between O and E as a proportion of E.
• Add up these elements for all the cells in the contingency table.
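Written out, with the summation running over every cell of the contingency table, the standard formula is:

```latex
\chi^{2} \;=\; \sum_{\text{cells}} \frac{(O - E)^{2}}{E}
```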
65
The value of chi-square
There are 8 terms in the summation, but only the first two and the last are shown in the calculation below.
66
Degrees of freedom
• To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df.
• If a contingency table has R rows and C columns, the degrees of freedom is given by
• df = (R – 1)(C – 1)
• In our example, R = 4, C = 2 and so
• df = (4 – 1)(2 – 1) = 3.
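Putting the pieces together, here is a sketch that computes chi-square and its degrees of freedom from a table of observed frequencies. The 2 × 2 table is invented for illustration (the lecture's own table has 4 rows and 2 columns), and the function name `chi_square` is hypothetical.

```python
def chi_square(observed):
    """Return (chi-square, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency under independence
            chi2 += (o - e) ** 2 / e                # one term per cell
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical observed frequencies, for illustration only.
table = [[10, 20],
         [30, 40]]

chi2, df = chi_square(table)
print("chi-square =", round(chi2, 3), " df =", df)
```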
67
Significance
• SPSS will tell us that the p-value of a chi-square with a value of 10.655 in the chi-square distribution with three degrees of freedom is .014.
• We should write this result as: χ²(3) = 10.66; p = .01.
• Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
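The p-value quoted above can be checked without SPSS. For three degrees of freedom (and only for three) the upper-tail probability of the chi-square distribution has a closed form in terms of the complementary error function. The sketch below uses that special case; the function name is hypothetical, and a general routine would be needed for other values of df.

```python
from math import erfc, exp, pi, sqrt


def chi_square_p_value_df3(chi2):
    """Upper-tail probability of chi-square with exactly 3 degrees of freedom."""
    return erfc(sqrt(chi2 / 2)) + sqrt(2 * chi2 / pi) * exp(-chi2 / 2)


p = chi_square_p_value_df3(10.655)
print("p =", round(p, 3))
```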
68
Summary
• This week I extended my discussion of statistical association to the topic of partial correlation.
• A partial correlation can help the researcher to choose from different causal models.
• I also considered the analysis of nominal data in the form of contingency tables.
• The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.
69
Multiple-choice example
70
Multiple-choice example
71
Another example