outliers and their effect on distribution assessment - minitab · outliers and their effect on...

45
Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016

Upload: ngodieu

Post on 10-May-2019

225 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Outliers and Their Effect on Distribution Assessment

Larry Bartkus September 2016

Page 2: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Topics of Discussion

What is an Outlier? (Definition)

How can we use outliers

Analysis of Outlying Observation

The Standards

Tests for Outliers in Minitab

Importance of Data Distributions

How Outliers Affect Distribution Assessment

Examples

Page 3: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

What are Outlying Observations?

Do they have value?

Page 4: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

What are Outliers?

An outlier is an observation point that is distant

from other observations.

An outlier may be due to variability in the

measurement or it may indicate experimental

error: the latter are sometimes excluded from the

data set.

An outlier may be caused by a defective unit or a

problem in the process.

ISO 16269 defines it as “member of a small subset

of observations that appears to be inconsistent

with the remainder of a given sample."

Page 5: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Anomaly Detection

Anomaly detection (also known as outlier

detection)is the search for items or events which

do not conform to an expected pattern.

Anomalies are also referred to as outliers, change,

deviation, surprise, aberrant, intrusion, etc.

An outlier may be caused by a defective unit or a

problem in the process.

Page 6: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Basic Understanding of Outliers

There is no rigid mathematical definition of what

constitutes an outlier; determining whether or not

an observation is an outlier is ultimately a

subjective exercise.

Every effort must be made to determine what is

causing the outlier to exist. Testing, equipment,

operator error, materials, etc. all must be reviewed

and assessment made.

Page 7: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

What to do with outliers

• Outliers must be investigated

• Use the tools found in root cause analysis

• Interview operators

• Evaluate the measuring system

• Try to duplicate the reading

• Don’t make assumptions or jump to conclusions

• Do not remove outliers simply because they are

outliers

Page 8: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

The Importance of Outliers

Note that outliers are not necessarily "bad"

data-points; indeed they may well be the most

important, most information rich, part of the

dataset. Under no circumstances should they

be automatically removed from the dataset.

Outliers may deserve special consideration:

they may be the key to the phenomenon under

study or the result of human blunders.

Page 9: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

From ASTM E178-08

1.1.1 An outlying observation may be merely an

extreme manifestation of the random variability inherent

in the data. If this is true, the value should be retained

and processed in the same manner as the other

observations in the sample.

1.1.2 On the other hand, an outlying observation may

be the result of gross deviation from prescribed

experimental procedure or an error in calculating or

recording the numerical value. In such cases , it may be

desirable to institute an investigation to ascertain the

reason for the aberrant value.

Page 10: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

From The Complete Idiot’s

Guide to Statistics

Outliers Extreme values in a data set that should be

discarded before analysis.

WHAT???

Page 11: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Considerations before collecting data

Page 12: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

What distribution is expected

for continuous data? 3 distributions cover most circumstances

Random variables which are added together follow a

normal distribution

Human height, weight, many dimensions

Random variables which are multiplied together

follow a lognormal distribution

Also useful for data which are bounded at zero like tensile

strength, burst strength. These are skewed right because

there are no negative values. Cycle time and microbiologic

data often follow the lognormal distribution.

Weibull distribution is versatile for skewed data.

Time to failure, fatigue life.

Page 13: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Analysis of Outliers

The Standards and Minitab

Page 14: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Standards Relating to Outliers

ASTM E178-08 “Standard Practice for Dealing

with Outlying Observations”

ISO 16269-4:2010 “Statistical interpretation of

data – Part 4: Detection and treatment of outliers”

Page 15: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

ASTM E178-08

18 page document that focuses on Dixon Test.

Includes table for Tietjen-Moore Critical Values

(One-Sided Test) for Lk

Offers an example with aas laboratory analysis

References Wilk-Shapiro W Statistic as a possible

test for outliers

Page 16: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

ISO 16269-4:2010

54 Page document that offers a myriad of methods

for detecting and treating outliers.

Methods include Box plot, Cochran’s test,

Greenwood’s test.

Full explanation of GESD (Generalized Extreme

Studentized Deviate) method

Demonstrates various graphical representations of

data (Box plot, histogram, stem-and-leaf and

probability plot)

Much emphasis put on distribution understanding

including Box-Cox plot and analysis.

Page 17: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Testing for outliers

[Q1 –k(Q3-Q1), Q3 + k(Q3 – Q1)]

Other methods flag observations based on measures

such as the interquartile range. For example, if Q1

and Q3 are the lowest and upper quartiles

respectively, then one could define an outlier to be

any observation outside the range:

Page 18: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

The Box Plot

http://www.physics.csbsju.edu/stats/box2.html

Page 19: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Interpreting the Box Plot

Page 20: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Testing for outliers

Model-based methods which are commonly used

for identification assume that the data are from a

normal distribution, and identify observations

which are deemed “unlikely” based on mean and

standard deviation.

Page 21: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Outlier Testing in Minitab

Three good tests for outliers found in Minitab:

• Box plot

• Grubbs Test

• Dixon’s Test

• Dixon’s Q ratio

• Dixon’s 11 ratio

• Dixon’s 12 ratio

• Dixon’s 20 ratio

• Dixon’s 21 ratio

• Dixon’s 22 ratio

Page 22: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Stat> Basic Statistics> Outlier Test

Outlier Testing in Minitab

First determine the test you will

use. The Dixon’s numbered ratios

are based on sample size

Next select if you want to

evaluate the smallest, largest or

smallest or largest value to

evaluate

Page 23: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

27262524232221201 91 8

1 8.26 26.1 0 3.42 0.001

Min Max G P

Grubbs' Test

Outlier

Outlier Plot of Outlier

Grubbs test

20.28, 18.75, 21.07, 20.33,18.96, 19.35, 19.24, 20.45, 20.61,

18.64, 20.15, 18.26, 20.55, 22.76, 26.10, 19.66, 20.50, 19.93,

22.02, 20.51, 18.67, 20.06, 21.37, 21.05, 22.32

Page 24: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Dixon’s Q Ratio

27262524232221201 91 8

1 8.26 26.1 0 0.43 0.002

Min Max r1 0 P

Dixon's Q Test

Outlier

Outlier Plot of Outlier Dixon’s Q Ratio test for the same

data set.

Page 25: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

27

26

25

24

23

22

21

20

1 9

1 8

Ou

tlie

r

Boxplot of Outlier

Simple Boxplot

Page 26: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Other Tests for Outliers

Chauvenet’s criterion

Cochran’s C Test – Variance Outlier Test

Peircece’s criterion

The Jackknife Method

Greenwood’s Test

Page 27: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

The Importance of Distribution Fitting

Impact on Statistical Tests

Page 28: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Why test for normality?

Reliability predictions are based on model

(distribution)

If model is inappropriate, misleading

conclusions can be drawn.

Page 29: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Sensitive to normality

Non-Sensitive

• Confidence intervals on means

• T-tests

• ANOVA, including DOE and Regression

• X-bar Charts

Sensitive

• Confidence Intervals on standard deviations

• Tolerance Intervals, Reliability Intervals

• Variance tests

• Cpk, Cp, Ppk, Pp

• I-Charts

29

Page 30: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Reasons for failing normality

Measurement resolution

Data shift

Multiple sources of data

Truncated data

Outliers

Too much data

The underlying distribution is not normal

Page 31: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

The Effect of Outlierson Distribution Assessment

Some Examples

Page 32: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Difference of a Data Point

12011010090

Median

Mean

116114112110108

1st Q uartile 104.40

Median 112.90

3rd Q uartile 116.80

Maximum 122.40

107.48 113.94

107.73 116.08

6.27 11.03

A -Squared 0.70

P-V alue 0.060

Mean 110.71

StDev 7.99

V ariance 63.87

Skewness -1.17564

Kurtosis 1.99926

N 26

Minimum 86.20

A nderson-Darling Normality Test

95% C onfidence Interv al for Mean

95% C onfidence Interv al for Median

95% C onfidence Interv al for StDev95% Confidence Intervals

Summary for EW Data 2

Project: NORMALITY EXAMPLES 2011-06-22.MPJ

Page 33: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

With Outlier Removed

120115110105100

Median

Mean

116114112110

1st Q uartile 104.70

Median 113.00

3rd Q uartile 116.80

Maximum 122.40

109.06 114.32

109.24 116.44

4.97 8.85

A -Squared 0.61

P-V alue 0.099

Mean 111.69

StDev 6.36

V ariance 40.51

Skewness -0.353437

Kurtosis -0.992428

N 25

Minimum 100.40

A nderson-Darling Normality Test

95% C onfidence Interv al for Mean

95% C onfidence Interv al for Median

95% C onfidence Interv al for StDev95% Confidence Intervals

Summary for EW Data 2 No Outlier

Project: Normality Examples 2011-06-22.MPJ

Page 34: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Summary Statistics Output

1612840

Median

Mean

65432

1st Q uartile 1.3757

Median 2.5724

3rd Q uartile 4.8224

Maximum 16.4155

2.4541 5.5224

1.8680 3.7050

3.2721 5.5232

A -Squared 2.77

P-V alue < 0.005

Mean 3.9883

StDev 4.1086

V ariance 16.8803

Skewness 1.90297

Kurtosis 3.02385

N 30

Minimum 0.3286

A nderson-Darling Normality Test

95% C onfidence Interv al for Mean

95% C onfidence Interv al for Median

95% C onfidence Interv al for StDev95% Confidence Intervals

Summary for C2

Project: NORMALITY EXAMPLES 2011-06-22.MPJ

Page 35: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Investigate Non-Normal Distributions

20100-10

99

95

80

50

20

5

1

C2

Pe

rce

nt

1001010.1

99

95

80

50

20

5

1

C2

Pe

rce

nt

Goodness of F it Test

Normal

A D = 2.774

P-V alue < 0.005

Lognormal

A D = 0.208

P-V alue = 0.852

Probability Plot for C2

Project: NORMALITY EXAMPLES 2011-06-22.MPJ

Normal - 95% CI Lognormal - 95% CI

Page 36: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Outlier or Not?

Data set – 200, 440, 220, 1100, 80, 55, 60, 1700, 100, 200, 160, 110, 170, 210,

3600, 90, 85, 35, 430, 180, 5, 15, 1, 30, 10, 10, 20, 15, 20, 15.

4000

3000

2000

1 000

0

Resu

lts

Boxplot of Results

Page 37: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Dixon’s Q Ratio

4000300020001 0000

1 .00 3600.00 0.53 0.000

Min Max r1 0 P

Dixon's Q Test

Results

Outlier Plot of Results

Page 38: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Grubbs Test

4000300020001 0000

1 .00 3600.00 4.60 0.000

Min Max G P

Grubbs' Test

Results

Outlier Plot of Results

Page 39: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Outlier or Not?

Data set – 200, 440, 220, 1100, 80, 55, 60, 1700, 100, 200, 160, 110, 170, 210,

3600, 90, 85, 35, 430, 180, 5, 15, 1, 30, 10, 10, 20, 15, 20, 15.

3000200010000

Median

Mean

6005004003002001000

1st Q uartile 18.75

Median 87.50

3rd Q uartile 202.50

Maximum 3600.00

45.15 579.18

31.14 177.71

569.49 961.29

A -Squared 6.23

P-V alue < 0.005

Mean 312.17

StDev 715.08

V ariance 511335.66

Skewness 3.8720

Kurtosis 16.3024

N 30

Minimum 0.00

A nderson-Darling Normality Test

95% C onfidence Interv al for Mean

95% C onfidence Interv al for Median

95% C onfidence Interv al for StDev95% Confidence Intervals

Summary for Aerobes

Page 40: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Review the Probability Plot

Page 41: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Try a transformation

3210

Median

Mean

2.252.101.951.801.651.50

1st Q uartile 1.2698

Median 1.9418

3rd Q uartile 2.3063

Maximum 3.5563

1.5873 2.1633

1.4924 2.2496

0.6143 1.0369

A -Squared 0.25

P-V alue 0.710

Mean 1.8753

StDev 0.7713

V ariance 0.5950

Skewness -0.069983

Kurtosis 0.319587

N 30

Minimum 0.0000

A nderson-Darling Normality Test

95% C onfidence Interv al for Mean

95% C onfidence Interv al for Median

95% C onfidence Interv al for StDev95% Confidence Intervals

Summary for Log Base 10

Conclusion

The sample with a reading of 3600 is expected to occur approximately 1 out of every 67 readings on an average and

should not be considered extreme based on the distribution of the data. The sample with this count of Aerobes should

be identified through a gram stain to determine the nature of its source, but unless abnormal concerns are raised

based on the type of microorganism, no additional action should be taken.

Page 42: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Probability Plot with and without Outliers

The plot on the left shows a typical normal distribution. The one on the right

contains two data points as outlying observations.

Probability plots can give us great insight into the distribution from our process.

Page 43: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Conclusion

We make conclusions about processes based on

the distribution models we use. Know your

process distribution.

Outliers can be a useful source of data, so

investigate causes of outliers.

Slow down on your assumptions. Look at the data

several ways including a probability plot. Try to

understand why your data is the way it is.

When writing up your results, reference standards

or technical literature.

Page 44: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

References

– ISO 16269-4:2010 “Statistical interpretation of data – Part 4 Detection and treatment of outliers”

– ASTM E178-08 “Standard Practice for Dealing With Outlying Observations”

– Minitab 17 Training Material “Analysis of Nonnormal Data for Quality”

– Acheson J. Duncan “Quality Control and Industrial Statistics” Fourth Edition 1974 pp 206-207, pp 977

– W.J. Dixon “Processing Data for Outliers,” Biometrics, Vol. IX (1953), pp. 74-89.

Page 45: Outliers and Their Effect on Distribution Assessment - Minitab · Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016. Topics of Discussion ... Sk ew

Questions?