Exploring Assumptions Chapter 5

Upload: rhoda-jennings

Post on 18-Dec-2015

Page 1: Exploring Assumptions Chapter 5. Assumptions of Parametric Data Parametric tests are based on the normal distributions and the following must be met:

Exploring Assumptions

Chapter 5

Assumptions of Parametric Data
• Parametric tests are based on the normal distribution, and the following must be met:
1. Normally distributed data.
2. Homogeneity of variance:
– Variances should be equal throughout the data.
– The groups should have equal variance.
– In correlation: the variance in one variable (x) should be equal across all levels of the other variable (y).
3. Dependent variable should be interval or ratio.
4. Independence: for independent designs the data from different subjects should be independent; for repeated designs the time variable will be dependent, but different subjects should be independent.

Assumption of Normality

• Many statistical tests (t, ANOVA) assume that the sampling distribution is normally distributed.

• This is a problem, because we don't have access to the sampling distribution.

• But, according to the central limit theorem, if the sample data are approximately normal, then the sampling distribution will be approximately normal too.

• Also from the central limit theorem, in large samples (n > 30) the sampling distribution tends to be normal, regardless of the shape of the data in our sample.
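The large-sample point can be illustrated with a small simulation. This is only a sketch: the exponential population, the sample size of 30, the 2000 replications, and the helper names `skewness` and `sample_means` are all assumptions chosen for illustration, not part of the slides.

```python
import random
import statistics


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


def sample_means(rng, n, replications):
    """Means of `replications` samples of size n drawn from a skewed (exponential) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replications)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw_scores = [rng.expovariate(1.0) for _ in range(2000)]  # strongly positively skewed
means_n30 = sample_means(rng, n=30, replications=2000)    # simulated sampling distribution

skew_raw = skewness(raw_scores)
skew_means = skewness(means_n30)
print(f"skew of raw scores:   {skew_raw:.2f}")
print(f"skew of sample means: {skew_means:.2f}")
```

The raw scores stay heavily skewed, while the means of samples of 30 should come out much closer to symmetric.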

Explore Command

In SPSS, use Analyze – Descriptive Statistics – Explore to get to the Explore command.

The P-P plot is a probability-probability plot.

From the Explore command, it compares the variable's cumulative probability to the cumulative probability of the normal distribution.

If the circles fall on the line, the data are approximately normal.

Day 2 & Day 3 are NOT normal. It is clear from the histograms that they are both positively skewed.
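What a P-P plot draws can be sketched outside SPSS. The function names `pp_points` and `max_gap`, the plotting position (rank - 0.5) / n, and the two small data sets are assumptions for illustration; SPSS may use a different proportion estimate.

```python
from statistics import NormalDist, fmean, stdev


def pp_points(values):
    """Return (observed, expected) cumulative-probability pairs for a normal P-P plot."""
    ordered = sorted(values)
    n = len(ordered)
    normal = NormalDist(fmean(ordered), stdev(ordered))  # normal curve fitted to the data
    points = []
    for rank, score in enumerate(ordered, start=1):
        observed = (rank - 0.5) / n   # assumed plotting position for the observed proportion
        expected = normal.cdf(score)  # cumulative probability under the fitted normal
        points.append((observed, expected))
    return points


def max_gap(points):
    """Largest vertical distance between the plotted points and the diagonal line."""
    return max(abs(observed - expected) for observed, expected in points)


symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 2, 2, 3, 10]  # hypothetical positively skewed scores
print(max_gap(pp_points(symmetric)), max_gap(pp_points(skewed)))
```

Points that sit on the diagonal have a gap near zero; the skewed scores drift much further from the line.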

Day 1 is platykurtic (flat), Days 2 & 3 are leptokurtic (peaked)

Interpretation of Deviations of Skew and Kurtosis based on Z Scores

|Z| > 1.96 is significant at p < .05
|Z| > 2.58 is significant at p < .01
|Z| > 3.29 is significant at p < .001
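A minimal sketch of the Z score check, assuming Z is the skew or kurtosis statistic divided by its standard error. The function names `z_score` and `significance` are hypothetical; the input values are the skewness and kurtosis reported for the Jump variable in the Descriptives table later in this deck.

```python
def z_score(statistic, std_error):
    """Convert a skew or kurtosis statistic to a Z score."""
    return statistic / std_error


def significance(z):
    """Label a Z score using the two-tailed cutoffs 1.96, 2.58 and 3.29."""
    size = abs(z)
    if size > 3.29:
        return "p < .001"
    if size > 2.58:
        return "p < .01"
    if size > 1.96:
        return "p < .05"
    return "not significant"


# Statistic and standard error pairs for the Jump variable.
z_skew = z_score(1.176, 0.427)
z_kurt = z_score(3.783, 0.833)
print(f"skew:     Z = {z_skew:.2f}, {significance(z_skew)}")
print(f"kurtosis: Z = {z_kurt:.2f}, {significance(z_kurt)}")
```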

Large Samples and Z Scores

• Significance tests of skew and kurtosis should not be used in large samples, because they are likely to be significant even when skew and kurtosis are not too different from normal.

|Z| > 1.96 is significant at p < .05
|Z| > 2.58 is significant at p < .01
|Z| > 3.29 is significant at p < .001

With the combined groups (Duncetown & Sussex) Numeracy looks positively skewed.

You can split a file by groups (as in the next 2 slides) or just use existing groups and add the grouping variable to the factor list.

Just as in the P-P plot, in the Q-Q plot the data should fall on the line if the data set is normal. Both of these variables have a problem with normality.

The plot on the left shows unequal variance. This would be evident when looking at the SDs, the variances, and Levene's Test of Homogeneity of Variance.

Numeracy from SPSSExam.sav
                        Variance
Duncetown University    4.271
Sussex University       9.432

N = 50
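One rough way to put a number on how unequal these two variances are is the ratio of the largest to the smallest. The slides do not name this check, so treat it as an assumption; `variance_ratio` is a hypothetical helper, and it does not replace Levene's test.

```python
def variance_ratio(variances):
    """Largest group variance divided by the smallest."""
    variances = list(variances)
    return max(variances) / min(variances)


# Numeracy variances as reported on the slide.
group_variances = {"Duncetown University": 4.271, "Sussex University": 9.432}
ratio = variance_ratio(group_variances.values())
print(f"largest / smallest variance = {ratio:.2f}")
```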

Outliers and Extreme Scores

KINE 5305 Applied Statistics in Kinesiology

SPSS – Explore BoxPlot

• The top of the box is the upper fourth or 75th percentile.

• The bottom of the box is the lower fourth or 25th percentile.

• 50 % of the scores fall within the box or interquartile range.

• The horizontal line is the median.

• The ends of the whiskers represent the largest and smallest values that are not outliers.

• An outlier, O, is defined as a value that is more than 1.5 box-lengths below the bottom (or above the top) of the box.

• An extreme value, E, is defined as a value that is more than 3 box-lengths below the bottom (or above the top) of the box.

• Normally distributed scores typically have whiskers that are about the same length and the box is typically smaller than the whiskers.
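The boxplot rules above can be sketched as follows. The function name `classify_boxplot` and the example scores are hypothetical, and the quartiles come from Python's default `statistics.quantiles` rule, which is an assumption and may differ slightly from the hinges SPSS uses.

```python
from statistics import quantiles


def classify_boxplot(values):
    """Split scores beyond the box into outliers (1.5 to 3 box-lengths) and extremes (over 3)."""
    q1, _, q3 = quantiles(values, n=4)  # assumed quartile rule; SPSS may differ slightly
    box_length = q3 - q1                # interquartile range
    result = {"outliers": [], "extremes": []}
    for score in sorted(values):
        # Distance beyond the nearer edge of the box (zero or negative when inside the box).
        distance = max(q1 - score, score - q3)
        if distance > 3 * box_length:
            result["extremes"].append(score)
        elif distance > 1.5 * box_length:
            result["outliers"].append(score)
    return result


scores = [10, 12, 12, 13, 14, 15, 15, 16, 17, 18, 19, 30, 60]  # hypothetical data
print(classify_boxplot(scores))
```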

Choosing a Z Score to Define Outliers

Z Score    Proportion Above +Z    Proportion Beyond ±Z

3.0 0.0013 0.0026

3.1 0.0010 0.0020

3.2 0.0007 0.0014

3.3 0.0005 0.0010

3.4 0.0003 0.0006
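The tail areas in this table can be approximated from the standard normal distribution. This is a sketch using Python's `statistics.NormalDist`; the function names `upper_tail` and `two_tailed` are hypothetical, and the last decimal may differ from the slide depending on rounding.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def upper_tail(z):
    """Proportion of a normal distribution lying above +z."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


def two_tailed(z):
    """Proportion lying beyond +z or below -z."""
    return 2.0 * upper_tail(z)


for z in (3.0, 3.1, 3.2, 3.3, 3.4):
    print(f"{z:.1f}  {upper_tail(z):.4f}  {two_tailed(z):.4f}")
```

A larger cutoff leaves a smaller proportion of valid scores at risk of being labelled outliers.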

Decisions for Extremes and Outliers
1. Check your data to verify all numbers are entered correctly.
2. Verify your devices (data testing machines) are working within manufacturer specifications.
3. Use non-parametric statistics; they don't require a normal distribution.
4. Develop criteria to label outliers and remove them from the data set. You must report these in your methods section.
   1. If you remove outliers, consider including a statistical analysis of the results with and without the outlier(s). In other words, report both; see Stevens (1990), Detecting outliers.
5. Do a log transformation.
   1. If your data have negative numbers, you must shift the numbers to the positive scale (e.g., add 20 to each).
   2. Try a natural log transformation first; in SPSS use LN().
   3. Try a log base 10 transformation; in SPSS use LG10().
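The shift-then-log step can be sketched outside SPSS as below. The function name `log_transform`, its parameters, and the example scores are assumptions for illustration; `math.log` and `math.log10` play the roles of LN() and LG10().

```python
import math


def log_transform(values, shift=0.0, base="ln"):
    """Add a constant to every score, then take natural logs ("ln") or base 10 logs ("log10")."""
    shifted = [v + shift for v in values]
    if min(shifted) <= 0:
        # Logs are undefined for zero and negative numbers, so the shift must be large enough.
        raise ValueError("shift must make every score positive before taking logs")
    log = math.log if base == "ln" else math.log10
    return [log(v) for v in shifted]


scores = [-5, 0, 5, 40]  # hypothetical scores that include negative numbers
print(log_transform(scores, shift=10))                # natural log, tried first
print(log_transform(scores, shift=10, base="log10"))  # log base 10, tried second
```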

Transform - Compute

Add 10 to the data. Then log transform.

Add 10 to each data point.

Try Natural Log.

Last option, use Log10.

Add 10 to each data point, since you cannot take the log of zero or a negative number.

First try a Natural Log Transformation

If Natural Log doesn’t work try Log10 Transformation.

Outlier Criteria: 1.5 * Interquartile Range from the Median

Milner CE, Ferber R, Pollard CD, Hamill J, Davis IS. Biomechanical Factors Associated with Tibial Stress Fracture in Female Runners. Med Sci Sport Exer. 38(2):323-328, 2006.

Statistical analysis. Boxplots were used to identify outliers, defined as values >1.5 times the interquartile range away from the median. Identified outliers were removed from the data before statistical analysis of the differences between groups. A total of six data points fell outside this defined range and were removed as follows: two from the RTSF group for BALR, one from the CTRL group for ASTIF, one from each group for KSTIF, and one from the CTRL group for TIBAMI.

Outlier Criteria: 1.5 * Interquartile Range from the Median (using SPSS)

Descriptives: Jump
                                       Statistic    Std. Error
Mean                                   -.0785       .53292
95% Confidence Interval for Mean
  Lower Bound                          -1.1684
  Upper Bound                          1.0115
5% Trimmed Mean                        -.2540
Median                                 -.3549
Variance                               8.520
Std. Deviation                         2.91893
Minimum                                -5.27
Maximum                                10.00
Range                                  15.27
Interquartile Range                    3.39
Skewness                               1.176        .427
Kurtosis                               3.783        .833

Outliers = values more than 1.5 * 3.39 = 5.085 above or below the median of -.3549, i.e., outside the range -5.44 to 4.73.

Outlier Criteria: ± 3 standard deviations from the mean

Tremblay MS, Barnes JD, Copeland JL, Esliger DW. Conquering Childhood Inactivity: Is the Answer in the Past? Med Sci Sport Exer. 37(7):1187-1194, 2005.

Data analyses. The normality of the data was assessed by calculating skewness and kurtosis statistics. The data were considered within the limits of a normal distribution if the dividend of the skewness and kurtosis statistics and their respective standard errors did not exceed ± 2.0. If the data for a given variable were not normally distributed, one of two steps was taken: either a log transformation (base 10) was performed or the outliers were identified (± 3 standard deviations from the mean) and removed. Log transformations were performed for push-ups and minutes of vigorous physical activity per day. Outliers were removed from the data for the following variables: sitting height, body mass index (BMI), handgrip strength, and activity counts per minute.

Computing and Saving Z Scores

Check this box and SPSS creates and saves the z scores for all selected variables. The z scores in this case will be named zLight1…

Now you can identify and remove raw scores whose z scores are above 3 or below -3 if you want to remove outliers.

Comparison of Outlier Methods: Median ± 1.5 * Interquartile Range

27 ± 1.5 * 7 gives a range of 16.5 – 37.5

TABLE 1.1 Newcomb's measurements of the passage time of light

28 22 36 26 28 28

26 24 32 30 27 24

33 21 36 32 31 25

24 25 28 36 27 32

34 30 25 26 26 25

-44 23 21 30 33 29

27 29 28 22 26 27

16 31 29 36 32 28

40 19 37 23 32 29

-2 24 25 27 24 16

29 20 28 27 39 23

Comparison of Outlier Methods ± 2 SDs (Notice the 16s are not removed)

Comparison of Outlier Methods ± 3 SDs (Step 1, then run Z scores again)

Comparison of Outlier Methods ± 3 SDs (Step 2: after the -44 has been removed, the -2 has a z score of -4.68, so it is removed as well)

Conclusions?

• The median ± 1.5 * interquartile range criterion appears to be too liberal.

• ± 2 SDs may also be too liberal, and statisticians may not approve.

• An iterative process, where you remove points more than 3 SDs above or below the mean and then re-check the distribution, may be the most conservative and acceptable method.

• Choosing 3.1, 3.2, or 3.3 as the Z score cutoff increases the protection against removing a score that is potentially valid and should be retained.
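The three outlier rules compared on the preceding slides can be re-run on Newcomb's data with a short sketch. The function names are hypothetical, and the quartiles use Python's default `statistics.quantiles` rule, which is an assumption (it gives an interquartile range of 7 here, matching the slide).

```python
from statistics import fmean, median, quantiles, stdev

# Newcomb's 66 measurements, copied row by row from TABLE 1.1.
NEWCOMB = [
    28, 22, 36, 26, 28, 28,
    26, 24, 32, 30, 27, 24,
    33, 21, 36, 32, 31, 25,
    24, 25, 28, 36, 27, 32,
    34, 30, 25, 26, 26, 25,
    -44, 23, 21, 30, 33, 29,
    27, 29, 28, 22, 26, 27,
    16, 31, 29, 36, 32, 28,
    40, 19, 37, 23, 32, 29,
    -2, 24, 25, 27, 24, 16,
    29, 20, 28, 27, 39, 23,
]


def outside_median_iqr(values, k=1.5):
    """Scores more than k * IQR away from the median."""
    q1, _, q3 = quantiles(values, n=4)  # assumed quartile rule
    centre, spread = median(values), k * (q3 - q1)
    return sorted(v for v in values if abs(v - centre) > spread)


def outside_sd(values, k):
    """Scores more than k standard deviations away from the mean (single pass)."""
    mean, sd = fmean(values), stdev(values)
    return sorted(v for v in values if abs(v - mean) > k * sd)


def iterative_sd_trim(values, k=3.0):
    """Repeat the k-SD rule, recomputing mean and SD, until nothing more is flagged."""
    kept, removed = list(values), []
    while True:
        flagged = outside_sd(kept, k)
        if not flagged:
            return kept, removed
        removed.extend(flagged)
        kept = [v for v in kept if v not in flagged]


print("median ± 1.5 IQR:", outside_median_iqr(NEWCOMB))
print("± 2 SD:", outside_sd(NEWCOMB, 2))
print("iterative ± 3 SD:", iterative_sd_trim(NEWCOMB)[1])
```

On these data the median rule flags six values (-44, -2, both 16s, 39 and 40), a single ± 2 SD pass flags only -44 and -2, and the iterative ± 3 SD rule removes -44 on the first pass and -2 on the second, then stops.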

Data Transformations and Their Uses

Data Transformation                        Can Correct For
Log Transformation (log(X))                Positive Skew, Unequal Variances
Square Root Transformation (sqrt(X))       Positive Skew, Unequal Variances
Reciprocal Transformation (1/X)            Positive Skew, Unequal Variances
Reverse Score Transformation               Negative Skew

Reverse Score Transformation: all of the above can correct for negative skew, but you must first reverse the scores. Just subtract each score from (the highest score in the data set + 1).
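The transformations in this table, including the reverse-score step, can be sketched as below. The names `reverse_scores`, `TRANSFORMS` and `transform` are hypothetical, and base 10 is assumed for the log because the table does not state a base.

```python
import math


def reverse_scores(values):
    """Subtract each score from (highest score + 1), turning negative skew into positive skew."""
    top = max(values)
    return [(top + 1) - v for v in values]


# Each transformation requires positive scores; the reciprocal also fails on zero.
TRANSFORMS = {
    "log": math.log10,  # assumed base 10
    "sqrt": math.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}


def transform(values, kind, reverse=False):
    """Apply one of the transformations, optionally reversing the scores first."""
    scores = reverse_scores(values) if reverse else list(values)
    return [TRANSFORMS[kind](v) for v in scores]


negatively_skewed = [1, 6, 8, 9, 9, 10, 10, 10]  # hypothetical scores
print(reverse_scores(negatively_skewed))
print(transform(negatively_skewed, "sqrt", reverse=True))
```

After reversing, the smallest score is always 1, so the log and reciprocal are defined; remember that high transformed values then correspond to low original scores.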