Discover… The Centre for Academic Success
http://library.bcu.ac.uk/learner/
Statistical Analysis of
Independent Groups in
SPSS
Based on materials provided by Coventry University and
Loughborough University under a National HE STEM
Programme Practice Transfer Adopters grant
Peter Samuels
30th October 2015
Overview
Lab session teaching you how to analyse differences in the means/medians of two or more independent samples of a single scale variable
Common student activity
Self-contained: only a finite number of possibilities
Workshop outline
Two groups:
Descriptives
Assumption checking (for parametric tests)
Independent samples t-test
Mann-Whitney U test
Several groups:
Descriptives
Assumption checking (for parametric tests)
One-way ANOVA
Kruskal-Wallis test
Post hoc testing
The data analysis process
for 2 independent groups
Descriptive statistics → Assumption checking
Pass: Parametric testing (t-test)
Fail: Nonparametric testing (Mann-Whitney U test)
Example 1: 2 stool designs
A research project involving two different designs of stool
Tested by 40 people
Each person was assigned to assess one product, providing an overall performance score out of 100
20 people per stool
Create an error bar chart
Open the file TwoStools.sav
Graphs > Legacy Dialogs > Error Bar…
Click on Define
Put PerformanceScore as the Variable and Design as the Category Axis
Click OK
Go to the output window
Interpretation:
Confidence intervals of the means of the performance scores
The circles mark the sample means
We are 95% confident that the population means lie between the whiskers
As the intervals overlap, we should suspect the test will come back negative (an informal indication, not failsafe!)
Also observe that the intervals are roughly equal in width
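The 95% confidence intervals that the error bar chart draws can be reproduced outside SPSS. A minimal sketch in Python using scipy (not part of the workshop; the scores below are made up and merely stand in for one design's 20 values):

```python
# 95% confidence interval for a group mean, as drawn by the error bar
# chart. The scores below are hypothetical, not the workshop data.
import numpy as np
from scipy import stats

scores = np.array([62, 71, 55, 68, 74, 60, 66, 59, 72, 65,
                   58, 70, 63, 67, 61, 69, 64, 56, 73, 57])

mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean (uses n - 1)
# 95% CI based on the t distribution with n - 1 degrees of freedom
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```

The whiskers in the chart correspond to `lo` and `hi` for each group.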
Robustness
Parameter-based statistical tests make certain assumptions in their underlying models
However, they often work well in other situations where these assumptions are violated
This is known as robustness
Robustness conditions depend upon the test being used
There are different opinions on robustness conditions
Assumption checking
Parametric tests are more sensitive than nonparametric tests but require certain assumptions to hold to be used
Thus we need to check these assumptions first
Not required with this test for equal group sizes ≥ 25 due to robustness exceptions (Sawilowsky and Blair, 1992)
Here our groups were equal but only of size 20 so we need to test for normality
For small sample sizes the best test is Shapiro-Wilk
Reference: Sawilowsky, S. S. and Blair, R. C. (1992) A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111(2), pp. 352–360.
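The same Shapiro-Wilk check can be run outside SPSS. A sketch in Python using scipy, with a simulated sample standing in for one design's 20 scores (an illustration, not the workshop data):

```python
# Shapiro-Wilk normality check, as SPSS performs in Explore.
# The sample is simulated, standing in for one design's 20 scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group = rng.normal(loc=65, scale=10, size=20)  # hypothetical scores

w_stat, p_value = stats.shapiro(group)
# p > 0.05: retain H0 (no evidence against normality), so a
# parametric test such as the t-test is reasonable for this group
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
```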
Assumption checking in SPSS
Analyze > Descriptive Statistics > Explore
Put PerformanceScore in the Dependent List and Design in the Factor List and select Plots…
Remove Stem-and-leaf, select Histogram and Normality plots with tests
Add a fitted normal curve to
the histograms
Double click on the first histogram in the output window – this opens the Chart Editor window
Select this button
Close the Properties window and the Chart Editor window
Repeat with the other histogram
Design 1 histogram appears to be approximately normally distributed
Design 2 histogram appears to be a bit skewed to the right. However, its skewness is less than twice its standard error.
The null and alternative
hypotheses
Statistical testing is about making a decision about the significance of a data feature or summary statistic
We usually assume that this was just a random event and then seek to measure how unlikely such an event was
The statement of this position is known as the null hypothesis and is written H0
In statistical testing we make a decision about whether to retain or reject the null hypothesis based on the probability (or ‘P-’) value of the test statistic
The logical opposite of the null hypothesis is known as the alternative hypothesis
Standard significance levels
and the null hypothesis (H0)
P-value of test statistic   Significant?    Formal action                    Informal interpretation             Example
> 0.1                       No              Retain H0                        No evidence to reject H0            Chris Froome
< 0.1 and > 0.05            No              Retain H0                        Weak evidence to reject H0          Plebgate libel trial
< 0.05 and > 0.01           Yes: at 95%     Reject H0 at 95% confidence      Evidence to reject H0               Climate change
< 0.01 and > 0.001          Yes: at 99%     Reject H0 at 99% confidence      Strong evidence to reject H0        Plebgate police trial
< 0.001                     Yes: at 99.9%   Reject H0 at 99.9% confidence    Very strong evidence to reject H0   Higgs boson
The Shapiro-Wilk test is negative for both designs as the “Sig.” (or probability) values are both > 0.05
Therefore we can use the appropriate parametric test (the independent samples t-test)
Both these tests are not very sensitive with small sample sizes and over sensitive with larger samples (e.g. > 100)
For large samples the probability values should be interpreted alongside the histograms with fitted normal curves and Q-Q plots – see normality checking sheet
The independent samples t-test
Applies to different (independent) groups of subjects, each with a single scale-based data value
Tests the difference between the means of the two samples
The samples can be different sizes
Here: Product scores for Designs 1 and 2
Assumes normality
Null hypothesis (H0): The means of the performance scores for the two designs are equal
Two variants: Depends upon whether the variances of the two designs can be assumed to be equal (use Levene’s test first, H0: Variances are equal) – all done together in SPSS
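The Levene's test followed by the appropriate t-test variant can be sketched outside SPSS in Python with scipy. The two groups below are hypothetical scores, not the workshop data:

```python
# Levene's test followed by the independent samples t-test, mirroring
# how SPSS runs both together. The groups are hypothetical scores.
import numpy as np
from scipy import stats

design1 = np.array([55, 58, 60, 62, 57, 59, 61, 63, 56, 64,
                    58, 60, 62, 55, 59, 61, 57, 63, 60, 58])
design2 = design1 + 10  # same spread, shifted mean

# Levene's test: H0 = the two variances are equal
lev_stat, lev_p = stats.levene(design1, design2)
equal_var = lev_p > 0.05

# Student's t-test if variances look equal, otherwise Welch's variant
t_stat, t_p = stats.ttest_ind(design1, design2, equal_var=equal_var)
print(f"Levene p = {lev_p:.3f}, t-test p = {t_p:.4f}")
```

Here Levene's test is non-significant (identical spreads), so the equal-variances row of the output is the one to read, and the t-test detects the shifted mean.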
Analyze > Compare Means > Independent Samples T-Test…
Add PerformanceScore as the Test Variable and Design as the Grouping Variable
Select Define Groups… and add 1 for Group 1 and 2 for Group 2
Automatically computes Levene’s test and outputs both versions:
Not significant at 95% (so we retain H0)
So equal variances can be assumed (now look at this row)
t-test significant at 95% (between 0.05 and 0.01)
Interpretation: There is evidence that the mean performance scores for the stool designs are different (this is different from our informal interpretation of the error bar chart)
Nonparametric testing
A type of statistical inference which does not assume that the data come from a particular distribution
Often applies to category-based data (nominal and ordinal) but can also apply to scale-based data if test assumptions are not met
Advantage: no need to check assumptions
Disadvantages:
Results are generally less sensitive (higher p-values)
Cannot handle more complex data structures (such as two-way ANOVA)
Appropriate test here is the Mann-Whitney U test
Mann-Whitney U Test
A non-parametric test of two independent samples of ordinal or scale-based data
Tests whether values in one sample tend to be higher or lower than values in the other
Need at least about 10 data categories for ordinal variables, otherwise use the Chi-squared test
Alternative to an independent samples t-test for scale-based data if the assumptions are not met (not the case here – just shown for illustration purposes)
Samples can be different sizes
Null hypothesis: Design 2 performance scores are equally likely to be higher or lower than Design 1 performance scores
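The same test is available in scipy; a sketch with hypothetical two-group data (not the workshop scores):

```python
# Mann-Whitney U test in scipy. The groups are hypothetical scores
# standing in for the two designs.
import numpy as np
from scipy import stats

design1 = np.array([55, 58, 60, 62, 57, 59, 61, 63, 56, 64,
                    58, 60, 62, 55, 59, 61, 57, 63, 60, 58])
design2 = design1 + 10  # every Design 2 score exceeds every Design 1 score

u_stat, p_value = stats.mannwhitneyu(design1, design2,
                                     alternative="two-sided")
# Small p: Design 2 scores tend to be higher than Design 1 scores,
# so we reject the null hypothesis of no ordering between the samples
print(f"U = {u_stat}, p = {p_value:.4f}")
```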
Running the Mann-Whitney U
test
Select: Analyze – Nonparametric Tests – Independent Samples…
On the Fields tab, add PerformanceScore in the Test Fields list and Design in the Groups list
Select Run
The correct test has been run
The “Sig.” value is about the same as that for the independent samples t-test (we expected it to be higher)
Helpfully states the null hypothesis decision
Unhelpfully states the default significance level used (can be misleading)
The data analysis process for
several independent groups
Descriptive statistics → Assumption checking
Pass: Parametric testing
Fail: Nonparametric testing
If significant differences are found: Post hoc testing
Example 2: 3 stool designs
A research project involving three different designs of a new product
Tested by 60 people
Each person was assigned to assess one product, providing an overall performance score out of 100
20 people per product
Open the file ThreeStools.sav
Create descriptive statistics and an error bar chart as before
What is one-way ANOVA?
An extension of t-tests to several groups
Usually independent measures
Accounts for variations both within and between groups
95% confidence intervals for 3 groups of measurements
These confidence intervals do not overlap, but does this mean we can conclude they are not all from the same population?
Initial observations
There appear to be differences between the sample means, i.e. variation between groups
But there is also variation within groups
Can we conclude that there are differences between groups (i.e. that they come from populations with different means)?
We need a systematic objective approach – this is known as ANOVA
Called ANOVA from ANalysis Of VAriance
(The name is a bit confusing because it sounds like a variance test, not a means test)
Introduction to ANOVA
Better than doing lots of two sample tests, e.g. 6 groups would require 15 two sample tests
For every test, there is a 0.05 probability that we reject H0 when it should be retained (assuming H0 is true)
Doing several tests increases the probability of making a wrong inference of significance (Type I error)
E.g. the probability of a Type I error with 6 groups, assuming they are all equally randomly distributed, is 1 − 0.95^15 = 1 − 0.463 = 0.537, i.e. more than 1 in 2
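The familywise error figure above can be checked in a couple of lines of Python (an illustration, not part of the workshop):

```python
# Familywise error calculation: with 6 groups there are C(6, 2) = 15
# pairwise tests, each with a 0.05 false positive rate under H0.
from math import comb

k_groups = 6
n_tests = comb(k_groups, 2)          # 15 two-sample tests
p_familywise = 1 - 0.95 ** n_tests   # chance of at least one Type I error
print(n_tests, round(p_familywise, 3))  # 15 0.537
```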
The ANOVA model
y_ij denotes the performance score for the jth measurement of the ith design
The parameter m_i denotes how the performance score for design i differs from the overall mean μ
e_ij denotes the error (or residual) for the jth measurement of the ith design
The ANOVA model assumes that all these errors are normally distributed with zero mean and equal variances
y_ij = μ + m_i + e_ij
Test hypothesis
In our example, we need to test the hypothesis:
H0: m_1 = m_2 = m_3 = 0
Or, more simply, that the product score population means are the same.
Intuitively, this is done by looking at the difference between means relative to the difference between observations, i.e. is the mean-to-mean variation greater than what you would expect by chance?
Assumptions
(Similar to the independent samples t-test assumptions)
1. The measurements for each group are normally distributed. However, if there are many groups there is a danger of Type I errors.
2. The errors for the whole data set are normally distributed (this theoretically follows from Assumption 1, but it is worth testing separately with small samples). To calculate these errors we first need to estimate the group means.
3. The variances of each group are equal (we can still use a version of ANOVA even if this one fails)
Assumption 1: Check
normality of each group
No evidence that individual groups are not
normally distributed
Assumption 2: Testing errors for normality
First create the residuals
Select Analyze > General Linear Model > Univariate…
Add the variables as shown
Select Save…
Choose Unstandardised residuals
Based on estimates of m_i
Select Analyze > Descriptive Statistics > Explore
Add the residual variable as shown but with no factor
Select Plots… and Histogram and Normality plots with tests as before
Then add a normal curve to the histogram as before
Evidence that the residuals are not normally distributed from the Shapiro-Wilk test (p < 0.05) even though the degrees of freedom have been reduced slightly. The Kolmogorov-Smirnov test is even more significant.
Kurtosis (peakedness) looks a bit high
Formally we should compare the absolute value of the kurtosis with twice its standard error – this is significant as it is higher
Assumption 3: Equal variances
Analyze > Compare Means > One-Way ANOVA…
Add PerformanceScore to the Dependent List and Design as the factor
Select Options… and Homogeneity of variance test
Carries out a Levene’s test for homogeneity of variance (similar to the t-test)
Null hypothesis: The variances are equal
Significant at 95% (p-value < 0.05) so we have evidence to reject the assumption of equality of variances
Robustness of ANOVA
ANOVA is quite robust to changes in skewness but not to changes in kurtosis. Thus, it should not be used when:
Kurtosis > 2 × Standard Error of Kurtosis
for any group or the errors.
Otherwise, provided the group sizes are equal and there are at least 20 degrees of freedom, ANOVA is quite robust to violations of its assumptions
However, the variances must still be equal
Source:
Field, A. (2013) Discovering Statistics using SPSS. 4th edn.
London: SAGE, pp. 444-445.
Robustness calculation
Group      Kurtosis   SE of Kurtosis   Condition met
Design 1   0.493      0.992            Yes
Design 2   0.435      0.992            Yes
Design 3   0.115      0.992            Yes
Errors     1.553      0.608            No
Group sizes are equal
Total degrees of freedom = 20 + 20 + 20 – 1 = 59 > 20
Also standard ANOVA cannot be used because the variances are not equal
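The standard errors of kurtosis in the table depend only on the group size n. A sketch in Python of the usual formula (to our understanding the one SPSS reports; treat this as an assumption):

```python
# Standard error of kurtosis as a function of sample size n,
# using the formula commonly attributed to SPSS.
import math

def se_kurtosis(n: int) -> float:
    return math.sqrt(24 * n * (n - 1) ** 2 /
                     ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

# n = 20 per design, n = 60 for the pooled errors
print(round(se_kurtosis(20), 3))  # 0.992, matching the table
print(round(se_kurtosis(60), 3))  # 0.608
# Errors: kurtosis 1.553 > 2 * 0.608, so the robustness condition fails
print(1.553 > 2 * se_kurtosis(60))  # True
```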
Summary of findings: ANOVA
assumptions
Assumption                 Finding
1. Normality of groups     No evidence of non-normality
2. Normality of errors     Evidence of non-normality
3. Equality of variances   Evidence of non-equality
Robustness                 Kurtosis of errors too high
One-way ANOVA
If all 3 assumptions (or the robustness exceptions to non-normality) are OK then use standard one-way ANOVA
Analyze > Compare Means > One-Way ANOVA
Under Options… select Descriptive
Shown for illustration purposes
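Standard one-way ANOVA is also available in scipy; a sketch with three hypothetical groups of 20 scores (not the workshop data):

```python
# Standard one-way ANOVA via scipy, with three hypothetical groups.
import numpy as np
from scipy import stats

base = np.array([55, 58, 60, 62, 57, 59, 61, 63, 56, 64,
                 58, 60, 62, 55, 59, 61, 57, 63, 60, 58])
design1, design2, design3 = base, base + 8, base + 16  # shifted means

f_stat, p_value = stats.f_oneway(design1, design2, design3)
# Small p: evidence that not all three population means are equal
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```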
Significance level < 0.001
So there is very strong evidence of differences in performance score between the three designs
What if these assumptions
are in doubt?
If normality assumptions (or their robustness exceptions) are in doubt:
Use a nonparametric test: Kruskal-Wallis or the median test if there is no trend in the groups, or Jonckheere-Terpstra if you are looking for a trend (e.g. mean of group 1 < mean of group 2 < mean of group 3, etc.)
Available under Analyze – Nonparametric Tests – Independent Samples…
If equality of variances assumption in doubt:
Use the Brown-Forsythe or Welch test
Select ANOVA and click on Options… button and select the Brown-Forsythe and Welch options
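The Kruskal-Wallis alternative mentioned above can be sketched in scipy, again with hypothetical three-group data:

```python
# Kruskal-Wallis test in scipy: a nonparametric alternative to one-way
# ANOVA. The groups are hypothetical, not the workshop data.
import numpy as np
from scipy import stats

base = np.array([55, 58, 60, 62, 57, 59, 61, 63, 56, 64,
                 58, 60, 62, 55, 59, 61, 57, 63, 60, 58])
groups = [base, base + 8, base + 16]  # three clearly shifted groups

h_stat, p_value = stats.kruskal(*groups)
# Small p: evidence of differences between the groups, without
# assuming normality or equal variances
print(f"H = {h_stat:.1f}, p = {p_value:.2e}")
```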
Nonparametric one-way ANOVA
We should use the Kruskal-Wallis or median tests as there is no trend to observe between these designs
The median test is cruder than Kruskal-Wallis and should only be preferred when ranges of extreme values have been summarised together, which was not the case here (see http://tinyurl.com/median-KW)
Select Analyze – Nonparametric Tests – Independent Samples…
Add PerformanceScore as the Test Field and Design as the Groups variable on the Fields tab
Select the Settings tab, then Customize tests and the Kruskal-Wallis test
Then select Run
Returns a significance value < 0.001 (ignore the note below the result, as before)
Very strong evidence that there are differences between the groups (as before)
ANOVA with unequal variances
Our data set violated the normality of errors assumption but there were also differences in variances
The Brown-Forsythe and Welch tests should only be used with unequal variances if the data and errors are normally distributed (shown for illustration purposes here)
Under Options… select Brown-Forsythe and Welch tests
Both tests are again significant at 99.9%
Very strong evidence that the means are not equal
Generally the Welch test is slightly better unless there is one group with an extreme mean and a large variance (which was not the case here, so the Welch test should be preferred) – see (Field, 2013: 443)
Multiple comparisons
What if we conclude there are differences between the groups?
We don’t know which pairs are different
We can do post-hoc tests to compare each pair of groups
Similar to 2-sample tests but adjusted significance levels for the multiple testing issue
Note: You should only run post hoc tests if you obtain a positive result from the ANOVA (or equivalent) test
Which post hoc test?
For equal group sizes and similar variances, use Tukey (HSD) or REGWQ, or for guaranteed control over Type I errors (more conservative), use Bonferroni
For slightly different group sizes, use Gabriel
For very different group sizes, use Hochberg’s GT2
For unequal variances, use Games-Howell (also recommended as a backup in other circumstances)
Source: (Field, 2013: 459)
Our data set violated the normality of errors assumption but there were also significant differences in variances
Try using the Games-Howell post hoc test (shown for illustration purposes only)
Run the One-Way ANOVA as before
Select Post Hoc… and Games-Howell
Very strong evidence of differences between groups 1 and 3
Evidence of differences between groups 1 and 2
Weak evidence of differences between groups 2 and 3
Nonparametric post hoc testing
This is the correct post hoc testing for our data set
Double click on this output box in the output window:
This opens the Model Viewer window
Change the View to Pairwise Comparisons
The output should then look like this. Concentrate on the Adjusted Sig. values:
Weak evidence of a difference between Design 1 and Design 2
Very strong evidence of a difference between Design 1 and Design 3
No evidence of a difference between Design 2 and Design 3
Note
SPSS version 22 does not use the Mann-Whitney U test in its Kruskal-Wallis post hoc testing but a variant called the Dunn-Bonferroni test
The Sig. values given by the pairwise comparisons in the Model Viewer are higher than those for the Mann-Whitney U test (e.g. the value for Designs 1 and 2 was found earlier to be 0.013; note: you need one more decimal place to calculate the correction)
However, we can still use their relative sizes to decide which pairs to test individually with the Mann-Whitney U test
For our dataset, we do not need to run Designs 2 and 3 because we know it will be non-significant even after the correction, but we should run Designs 1 and 3
To obtain the adjusted Sig. values, multiply the Sig. value by the number of pairs
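This adjustment is a one-liner; a sketch in Python (the 0.013 value is the Mann-Whitney result for Designs 1 and 2 quoted above, with 3 pairs):

```python
# Bonferroni adjustment: multiply each pairwise Sig. value by the
# number of pairs, capping the result at 1.
def bonferroni_adjust(p: float, n_pairs: int) -> float:
    return min(p * n_pairs, 1.0)

# e.g. the Mann-Whitney p of 0.013 for Designs 1 and 2, with 3 pairs
print(round(bonferroni_adjust(0.013, 3), 3))  # 0.039
print(bonferroni_adjust(0.5, 3))              # 1.0 (capped)
```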
Legacy Mann-Whitney U test
The Mann-Whitney U test will not work in the new dialog with three groups
Use the legacy dialog instead: Analyze > Nonparametric Tests > Legacy Dialogs > 2 Independent Samples…
Choose groups 1 and 3
Mann-Whitney U is the default test
Returns the value 0.000
Double click on the Exact Sig. output to check it to one more decimal place
Example 2: Summary of results
Pair      Games-Howell   Kruskal-Wallis (Dunn-Bonferroni)   Pairwise Mann-Whitney U with Bonferroni adjustment
1 and 2   0.035          0.066                              0.040
1 and 3   < 0.001        < 0.001                            < 0.001
2 and 3   0.086          0.287                              Not tested
According to the preferred (second and third) tests there is:
Very strong evidence of a difference between Designs 1 and 3
(Weak) evidence of a difference between Designs 1 and 2
No evidence of a difference between Designs 2 and 3
Recap: We have considered:
Two groups:
Descriptives
Assumption checking (for parametric tests)
Independent samples t-test
Mann-Whitney U test
Several groups:
Descriptives
Assumption checking (for parametric tests)
One-way ANOVA
Kruskal-Wallis test
Post hoc testing
statstutor resources
www.statstutor.ac.uk
Normality checking (draft – electronic copy provided)
Normality checking solutions (draft – electronic copy provided)
Independent samples t-test (paper copy provided)
Mann-Whitney U test (available from statstutor website)
One way ANOVA (paper copy provided)
One way ANOVA additional material (available from statstutor website)
Kruskal-Wallis test (draft – electronic copy provided)
References
IBM (2014) Post hoc comparisons for the Kruskal-Wallis test. http://www-01.ibm.com/support/docview.wss?uid=swg21477370.
IBM developerWorks (2015) Bonferroni with Mann-Whitney? https://www.ibm.com/developerworks/community/forums/html/topic?id=51942182-1ad0-4f26-9a49-56849775ac4f.
Field, A. (2013) Discovering Statistics using SPSS: (And sex and drugs and rock 'n' roll). 4th edn. London: SAGE.
Sawilowsky, S. S. and Blair, R. C. (1992) A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111(2), pp. 352–360.
Statistica (n.d.) Statistica Help: Nonparametric Statistics Notes – Kruskal-Wallis ANOVA by Ranks and Median Test. http://tinyurl.com/median-KW.