1 Matters arising 1. Summary of last week’s lecture 2. The exercises

Upload: faith-hunt

Post on 28-Mar-2015

Matters arising

1. Summary of last week’s lecture

2. The exercises

Last week

• Last week I extended my discussion of statistical association to the topic of partial correlation.

• A partial correlation can help the researcher to choose from different causal models.

• I also considered the analysis of nominal data in the form of contingency tables.

• The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.

CORRELATION

does not necessarily mean

CAUSATION

The choice

• A strong positive correlation between Exposure and Actual violence was obtained.

• But at least three CAUSAL MODELS are compatible with that result.

A background variable

• Fortunately, we had information on a third variable, a measure of parental orientation towards violence.

• Both Exposure and Actual violence correlated highly with this Background variable.

Partial correlation

A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.

Partial correlation

Removes the influence of the third variable.

Rescales with new variances, so that the range is from −1 to +1.
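
The formula shown on this slide did not survive in the transcript. As a minimal sketch, the standard first-order partial correlation can be computed from the three pairwise Pearson correlations. The function name and the example correlations are illustrative assumptions, not values from the violence study.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y with Z partialled out (first-order)."""
    # Remove the part of r_xy that is attributable to Z.
    numerator = r_xy - r_xz * r_yz
    # Rescale by the variance left in X and in Y once Z is removed.
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical correlations, chosen only for illustration.
r_partial = partial_correlation(r_xy=0.6, r_xz=0.8, r_yz=0.7)
print(round(r_partial, 3))
```

With these made-up inputs, a sizeable raw correlation shrinks to nearly zero once the third variable is partialled out, which is the pattern the violence example describes.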

The partial correlation

• When the correlations with Background are taken into account, the original correlation is no longer significant.

• The third model seems the most convincing.

A medical question

• Is there an association between the type of body tissue one has and the presence of a potentially harmful antibody?

• This is a question of whether two QUALITATIVE VARIABLES are associated.

A contingency table

• The pattern of frequencies in the CONTINGENCY TABLE suggests that there is indeed an association between Presence and Tissue Type.

• The null hypothesis is that the two variables are INDEPENDENT.

Expected cell frequencies (E)

• The EXPECTED FREQUENCY E in each cell of the table is calculated from the MARGINAL TOTALS of the contingency table on the assumption that Tissue Type and Presence are independent.

• We compare the values of E with the OBSERVED FREQUENCIES O.
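
The calculation of E from the marginal totals can be sketched as follows. The lecture's own table is not reproduced in this transcript, so the 2 x 2 counts below are hypothetical and the function name is an assumption made for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2 x 2 table of observed frequencies (not the lecture's data).
observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)
print(expected)
```

Each expected value depends only on its row total, its column total and the grand total, which is what the independence assumption implies.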

The expected frequencies

• In the Critical group, there seem to be large discrepancies between O and E: fewer No’s than expected and more Yes’s.

Formula for chi-square

• The magnitude of the discrepancies feeds into the value of the CHI-SQUARE statistic.
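
The formula on this slide was an image and is missing from the transcript. The usual Pearson chi-square statistic, written with the observed frequencies O and expected frequencies E defined earlier, is presumably what was shown:

```latex
\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}
```

Each cell contributes its squared discrepancy, scaled by the expected frequency, so large gaps between O and E in any cell push the statistic upwards.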

The value of chi-square

The value of chi-square is 10.66.

Degrees of freedom

• To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df.

• If a contingency table has R rows and C columns, the degrees of freedom is given by df = (R – 1)(C – 1).

• In our example, R = 4 and C = 2, so df = (4 – 1)(2 – 1) = 3.
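
As a small sketch, the rule can be written as a one-line function. The function name is an assumption made for illustration.

```python
def degrees_of_freedom(rows, cols):
    """Degrees of freedom for an R x C contingency table."""
    return (rows - 1) * (cols - 1)

# The 4 x 2 tissue-type table from the lecture.
print(degrees_of_freedom(4, 2))
```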

Significance

• SPSS will tell us that the p-value of a chi-square with a value of 10.655 in the chi-square distribution with three degrees of freedom is .014.

• We should write this result as: χ²(3) = 10.66; p = .01.

• Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
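
SPSS reports the p-value directly. As a rough cross-check outside SPSS, the upper-tail probability of a chi-square distribution with a whole number of degrees of freedom can be computed from standard closed forms. `chi2_sf` is a hypothetical helper written for this sketch, not an SPSS or library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: a finite sum multiplied by exp(-x/2).
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: start from the df = 1 tail and add the half-integer terms.
    tail = math.erfc(math.sqrt(half))
    terms = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return tail + math.exp(-half) * terms

# Values from the lecture: chi-square = 10.655 with 3 degrees of freedom.
p_value = chi2_sf(10.655, 3)
print(round(p_value, 3))
```

The result should agree with the p-value quoted from SPSS to three decimal places.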

Multiple-choice example

Solution

• It isn’t easy to ask a sensible multiple-choice question about partial correlation.

• C is obviously the correct answer.

Example

Solution

• A is wrong: we usually hope the null hypothesis will be falsified.

• B is wrong: it’s the null hypothesis that is tested.

• C is wrong: the p-value must be less than 0.05 for significance.

• D is correct: significance requires a p-value of less than 0.05.

Example

Solution

• df = (R – 1)(C – 1) = (4 – 1)(5 – 1) = 12.

• So the correct answer is B.

Lecture 10

RUNNING CHI-SQUARE TESTS ON SPSS

In Variable View

• In Variable View, Name three variables and assign Values to the code numbers making up the various tissue groups.

• Always assign CLEAR VALUE LABELS to make the output comprehensible.

In Data View

• The third variable, Count, contains the frequencies of occurrence of the antibody in the different groups.

• When entering the data, it’s helpful to be able to view the value labels.

What the rows in Data View represent

• SPSS assumes that, in Data View, each row contains information on just ONE participant or CASE.

• In our example, each row contains information about SEVERAL people.

• At some point, SPSS must be informed of this.

• You do this by WEIGHTING THE CASES with the frequencies.

Weighting the cases

• Select Weight Cases from the Data menu.

• Complete the Weight Cases dialog by transferring Count to the Frequency Variable slot.

• Click OK to weight the cases with the frequencies.

Another approach

• We could have dispensed with the Count variable and simply entered the data on each of the 79 people in the study.

• Here are 8 of the 79 cases.

• You don’t need the Weight cases procedure here.
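
The equivalence of the two data layouts can be sketched outside SPSS. The group names and counts below are hypothetical and are not the lecture's data; the point is only that weighting rows by a count gives the same table as listing one row per person.

```python
from collections import Counter

# Hypothetical aggregated rows: (tissue type, presence, count).
weighted_rows = [
    ("Critical", "Yes", 3),
    ("Critical", "No", 1),
    ("Other", "Yes", 2),
    ("Other", "No", 4),
]

# Layout 1: one row per cell, weighted by the count.
table_from_weights = Counter()
for tissue, presence, count in weighted_rows:
    table_from_weights[(tissue, presence)] += count

# Layout 2: one row per person, so no weighting is needed.
cases = [(tissue, presence)
         for tissue, presence, count in weighted_rows
         for _ in range(count)]
table_from_cases = Counter(cases)

print(table_from_weights == table_from_cases)
```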

Selecting the chi-square test

The chi-square test is available in Crosstabs, on the Descriptive Statistics menu.

The Crosstabs dialog

• We want the columns to represent the Presence variable, as in the contingency table.

Clustered bar charts

Check the box labelled ‘Display clustered bar charts’

Crosstabs: Statistics

• Choose Chi-square.

• The Chi-square statistic itself is not suitable as a measure of the strength of an association, because it is affected by the size of the data set.

• Click ‘Phi and Cramer’s V’. These are measures of the STRENGTH of the association between tissue type and the incidence of the antibody.

Crosstabs: Cell Display

• Check the Observed and Expected buttons.

• Since the columns represent Yes’s and No’s, it will be useful to have the column PERCENTAGES.

The output: contingency table

The percentages are useful: they show a marked predominance of Presence of the antibody in the Critical tissue group only.

The clustered bar chart

• The figure shows the trend apparent from inspection of the column percentages.

• There is a marked presence of the antibody in the Critical tissue group.

Result of the chi-square test

• The p-value is in the column headed ‘Asymp. Sig.’: p = .014.

• Write the result as: χ²(3) = 10.655; p = .01.

• Notice the information about the number of cells with values of E less than 5.

• When there are too many, the usual p-value cannot be trusted.

Strength of the association

• Unlike a correlation, the value of chi-square is partly determined by the sample size and is therefore unsuitable as a measure of association strength.

• Interpret either Phi or Cramer’s statistic as the extent to which the incidence of the antibody can be accounted for by tissue type. Cramer’s V can take values in the range from 0 to +1.
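
As a hedged sketch of how Cramer's V removes the dependence on sample size, the standard formula divides chi-square by N and by the smaller of (R – 1) and (C – 1). The function name is an assumption, and N = 79 assumes that the 79 people mentioned earlier make up the larger data set; the SPSS output itself is not reproduced in this transcript.

```python
import math

def cramers_v(chi_square, n, rows, cols):
    """Cramer's V = sqrt(chi-square / (N * min(R - 1, C - 1)))."""
    return math.sqrt(chi_square / (n * min(rows - 1, cols - 1)))

# chi-square = 10.655 and the 4 x 2 table are from the lecture;
# N = 79 is an assumption about which data set that value came from.
v = cramers_v(10.655, n=79, rows=4, cols=2)
print(round(v, 3))
```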

A smaller data set

• Is there an association between Tissue Type and Presence of the antibody?

• The antibody is indeed more in evidence in the ‘Critical’ tissue group.

High incidence in Critical category

Result of the chi-square test

• How disappointing! It looks as if we haven’t demonstrated a significant association.

• Under the column headed ‘Asymp. Sig.’ is the p-value, which is given as .060 .

Sampling distributions

• Because of sampling variability, the values of the statistics we calculate from our data would vary were the data-gathering exercise to be repeated.

• The distribution of a statistic is known as its SAMPLING DISTRIBUTION.

• Test statistics such as t, F and chi-square have known sampling distributions.

• You must know the sampling distribution of any statistic to produce an accurate p-value.

The familiar chi-square formula

The true definition of chi-square

• The familiar formula is not the defining formula for chi-square.

• Chi-square is NOT defined in the context of nominal data, but in terms of continuously distributed, independent standard normal variables Z as follows:

True definition of chi-square
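
The defining formula on this slide was an image and is missing from the transcript. The standard definition, which is presumably what was shown, writes chi-square on n degrees of freedom as a sum of squares of n independent standard normal variables Z:

```latex
\chi^2(n) = Z_1^2 + Z_2^2 + \cdots + Z_n^2 = \sum_{i=1}^{n} Z_i^2
```

The symbol n for the degrees of freedom is a notational assumption made here.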

An approximation

• The familiar chi-square statistic is only APPROXIMATELY distributed as chi-square.

• The approximation is good, provided that the expected frequencies E are adequately large.

The meaning of ‘Asymptotic’

• The term ASYMPTOTIC denotes the limiting distribution of a statistic as the sample size approaches infinity.

• The ‘asymptotic’ p-value of a statistic is its p-value under the assumption that the statistic has the limiting distribution.

• That assumption may be false.

Goodness of the approximation…

• In the SPSS output, the column headed ‘Asymp. Sig.’ contains a p-value calculated on the assumption that the approximate chi-square statistic behaves like the real chi-square statistic.

• But underneath the table there is a warning about low frequencies, indicating that the ‘asymptotic’ p-value cannot be relied upon.

Warning about low expected frequencies.

Exact tests

• Fortunately, there are available EXACT TESTS, which do not make the assumption that the approximation is good.

• There are the Fisher exact tests, designed by R. A. Fisher many years ago; and there are modern ‘brute force’ methods requiring massive computation.
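
To give a feel for what an exact test does, here is a minimal sketch of the Fisher exact test for the simplest case, a 2 x 2 table. It enumerates every table with the same marginal totals and adds up the probabilities of those no more likely than the observed one. This is not the procedure SPSS runs on the 4 x 2 tissue table; the function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # given the fixed marginal totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical counts, for illustration only.
print(round(fisher_exact_2x2(3, 1, 1, 3), 4))
```

No large-sample approximation is involved: the p-value comes from counting, which is why it remains trustworthy when expected frequencies are low.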

Ordering an exact test

• Click the Exact… button at the bottom of the Crosstabs dialog box.

• Check the Exact radio button in the Exact Tests dialog.

A better result!

• The exact test has shown that we DO have evidence for an association between tissue type and incidence of the antibody.

• The exact p-value is markedly lower than the asymptotic value.

Regression

The violence study scatterplot

Linear association

• If two variables have a PERFECT linear relationship, the graph of one against the other is a straight line.

• The graph of temperature in degrees Fahrenheit against the equivalent Celsius temperature is a straight line.

A perfect positive linear relationship

[Figure: graph of Degrees Fahrenheit against Degrees Celsius, a straight line with equation $F = \frac{9}{5}C + 32$. The intercept is 32 and the slope is $Q/P = 9/5$.]

The slope of the line

• The COEFFICIENT 9/5 in front of the Celsius variable is the SLOPE of the straight line.

• When the Celsius temperature increases by FIVE degrees, the Fahrenheit temperature increases by NINE degrees. When the Celsius temperature increases by one degree, the Fahrenheit temperature increases by 1.8 degrees.
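
The slope and intercept can be checked with a short sketch. The function name is an assumption made for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """F = (9/5) * C + 32: slope 9/5, intercept 32."""
    return celsius * 9 / 5 + 32

# A rise of five Celsius degrees gives a rise of nine Fahrenheit degrees.
rise = celsius_to_fahrenheit(25) - celsius_to_fahrenheit(20)
print(rise)
```

The intercept is the value returned for 0 degrees Celsius, and the slope is the rise in Fahrenheit per one-degree rise in Celsius.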

A strong linear association

• A narrowly elliptical scatterplot like this indicates a strong positive association between the two variables.

• The Pearson correlation is + 0.89 .

Regression

• Regression is a set of techniques for exploiting the presence of statistical association among variables to make PREDICTIONS of values of one variable (the DV or CRITERION) from knowledge of the values of other variables (the IVs or REGRESSORS).

Simple and multiple regression

• In the simplest case, there is just one IV or regressor. This is known as SIMPLE regression.

• In MULTIPLE regression, there are two or more IVs.

The regression line

The regression line of Violence upon Preference

The REGRESSION LINE is the line that fits the points best from the point of view of predicting Actual Violence from Preference. There is a precise criterion for the ‘best-fitting’ line.

The regression equation

F is a linear function of C

[Figure: graph of Degrees Fahrenheit against Degrees Celsius, a straight line with equation $F = \frac{9}{5}C + 32$. The intercept is 32 and the slope is $Q/P = 9/5$.]

The regression line

[Figure: the regression line of Y (Violence) upon X (Exposure), with equation $Y' = 2.091 + 0.736X$. The slope is $Q/P = 0.736$.]

Using the equation

Predicting a score

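[Figure: predicting a score from the regression line of Y (Violence) upon X (Exposure), which has intercept 2.091. At X = 9 the predicted Violence score is $Y' = 8.7$ and the true Violence score is $Y = 8.0$; the error in prediction $e = Y - Y'$ has magnitude 0.7.]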
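
Using the equation amounts to substituting a value of X. The sketch below uses the intercept 2.091 and slope 0.736 from the lecture; the Exposure score of 9 is an assumed reading of the figure's labels, and the function name is hypothetical.

```python
INTERCEPT = 2.091  # regression constant from the lecture
SLOPE = 0.736      # regression coefficient from the lecture

def predict_violence(exposure):
    """Predicted Violence score Y' for a given Exposure score X."""
    return INTERCEPT + SLOPE * exposure

# Assumed Exposure score of 9, read from the figure.
predicted = predict_violence(9)
print(round(predicted, 1))
```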

The error in prediction

Simple regression

• B is the slope and B0 is the intercept.

• Y′ is the y-coordinate of the point on the line above the value of X.

• An increase of one unit on variable X will result in an estimated increase of B units on variable Y.

• A NEGATIVE value of B would mean that an increase of one unit on variable X will result in an estimated REDUCTION of |B| units on Y.

[Equation: $Y' = B_0 + BX$, where $B_0$ is the regression constant (intercept) and $B$ is the regression coefficient (slope).]

The ‘least-squares’ criterion

• In ORDINARY LEAST SQUARES (OLS) REGRESSION, a RESIDUAL score e is the difference between the real value Y and the estimate Y′ from the regression equation.

• e = (Fred’s real violence – Fred’s predicted violence from the regression equation).

• OLS regression minimizes the sum of the squares of the residuals, $\sum (Y - Y')^2 = \sum e^2$.
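
The least-squares criterion can be illustrated with a short sketch that fits a line using the standard closed-form solution and then confirms that a slightly different line gives a larger sum of squared residuals. The data and function names are hypothetical and are not taken from the violence study.

```python
def fit_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys upon xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of e^2, where e = Y - Y' for each case."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical scores, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b = fit_ols(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b)
# A line with a slightly different slope does worse.
worse = sum_squared_residuals(xs, ys, b0, b + 0.1)
print(best < worse)
```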

Finding the values of b0 and b
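
The formulas on this slide are missing from the transcript. The standard least-squares solutions, which are presumably what was shown, are given below; $M_X$ and $M_Y$ denote the means of X and Y.

```latex
b = \frac{\sum (X - M_X)(Y - M_Y)}{\sum (X - M_X)^2},
\qquad
b_0 = M_Y - b\,M_X
```

The second formula is the same relation between intercept and slope that the lecture states for the intercept-only case.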

Regression line with independence

• When the variables show no association, the slope of the regression line is zero and the line runs horizontally through the mean $M_Y$ of the criterion or dependent variable.

• The intercept (B0) is $M_Y$ in this case.

Intercept-only prediction

• In OLS regression, the intercept B0 is related to the regression coefficient B1 according to $B_0 = M_Y - B_1 M_X$.

• When X and Y are independent, the slope of the regression line is zero and $B_0 = M_Y$.

• In that case, the best we can do with regression is to draw a horizontal line at $Y = M_Y$ through the middle of the cloud of points.

• Whatever the degree of association between X and Y, the INTERCEPT-ONLY prediction is $Y' = M_Y$.

Improved prediction

• There is a strong linear association here.

• The regression line makes much more accurate predictions than simply using the mean score on Actual violence as your prediction whatever the Preference value.
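
The gain in accuracy can be sketched by comparing the squared prediction errors of the two approaches on the same data. The scores below are hypothetical, chosen only for illustration, and the helper name is an assumption.

```python
def sum_of_squares(actual, predicted):
    """Sum of squared differences between actual and predicted scores."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical scores, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Intercept-only prediction: the mean of Y for every case.
error_mean_only = sum_of_squares(ys, [mean_y] * len(ys))
# Regression prediction: Y' = intercept + slope * X.
error_regression = sum_of_squares(ys, [intercept + slope * x for x in xs])

print(error_mean_only, error_regression)
```

The stronger the linear association, the larger the gap between the two error totals.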

Summary

• How to run chi-square tests of association on SPSS.

• When the data are scarce, the usual chi-square test can give a misleading result.

• Run an EXACT TEST if there are warnings about low expected frequencies.

• REGRESSION is a set of techniques for predicting a target (dependent) variable from a regressor or independent variable.

An exercise

• I have placed the larger and smaller data sets for the Tissue and Antibody example on my website.

• Try running the chi-square tests.