a contaminated Normal distribution. Some authors [8,9], however,
claim that the distribution, even in the absence of outliers, may be
leptokurtic, i.e. exhibiting heavier tails than the Normal distribution.
Thus, comparing the two types of estimating approaches for the specific classes of symmetric and unimodal distributions, represented by the Normal and Student's t-distributions, may be of importance.
This study was designed to address three simple questions: (1) What is the false positive rate of Z-score estimation methods in non-contaminated samples from the Normal and Student's t distributions? (2) What is the true positive rate of Z-score estimation methods in contaminated samples from the same distributions? (3) What are the accuracy and precision of the different variability estimators for the Normal distribution?
2. Materials and methods
A total of 1000 random samples was generated from a Normal distribution with mean and standard deviation arbitrarily set at μ = 10 and σ = 0.5. Data were generated for sample sizes ranging from n = 3 to 20.
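As an illustration, the sampling scheme described here (including the Student's t variant described below) can be sketched in Python. The location-scale parametrisation of the t samples is our assumption; the paper states only the 5 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
MU, SIGMA = 10.0, 0.5           # population parameters used in the study

def simulate_samples(n, n_sim=1000, dist="normal", df=5):
    """Draw n_sim samples of size n from N(MU, SIGMA**2) or, as an assumed
    parametrisation, a Student's t with df degrees of freedom shifted to
    centre MU and scaled by SIGMA."""
    if dist == "normal":
        return rng.normal(MU, SIGMA, size=(n_sim, n))
    return MU + SIGMA * rng.standard_t(df, size=(n_sim, n))

# One simulated data series: 1000 samples of size 6 from the Normal distribution
samples = simulate_samples(n=6)
```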
Subsequently, to obtain data from a leptokurtic distribution, similar simulations were performed using a Student's t-distribution with 5 degrees of freedom. Only samples for which all values fell within the interval [μ − 3σ, μ + 3σ] were retained. Next, the samples were contaminated by adding an outlier at μ + 3σ, at μ + 5σ and at μ + 7σ separately, resulting in a sample of size n + 1. Samples of size n = 3 without added outliers were not taken into account. Z-scores were calculated on each sample following five different approaches.

The first approach used the Grubbs test [10,11] to remove outliers in a first step. This test calculates the distance between the most extreme point and the centre of the distribution; if this distance is too large with respect to the standard deviation, the point is flagged as an outlier. The test uses a predefined false alarm rate, which was kept small (α = 0.05). If an outlier was found, the test was repeated on the rest of the sample at the same level α until no outlier remained. In a second step, Z-scores were calculated from the classical average and standard deviation of the data that were not marked as outliers in the first step.

The second approach used the Dixon test [12] to remove outliers in a first step. The method is based on the calculation of ranges between the lowest and highest sample values, and subranges between the most extreme sample values on either side. Like the Grubbs test, it is a hypothesis test; here too, outliers were removed until the null hypothesis of absence of outliers was accepted (α = 0.05), and Z-scores were subsequently obtained from the classical average and standard deviation of the data that were not removed.
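A minimal sketch of the iterative Grubbs filtering described above, assuming the standard two-sided Grubbs critical value; the exact routine used in the study is not given.

```python
import numpy as np
from scipy import stats

def grubbs_filter(x, alpha=0.05):
    """Repeatedly remove the most extreme point while the two-sided Grubbs
    test rejects the hypothesis of no outliers at level alpha."""
    x = np.asarray(x, dtype=float)
    while x.size >= 3:
        n = x.size
        mean, sd = x.mean(), x.std(ddof=1)
        idx = np.argmax(np.abs(x - mean))        # most extreme point
        g = abs(x[idx] - mean) / sd              # Grubbs statistic
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        if g <= g_crit:                          # no outlier found: stop
            break
        x = np.delete(x, idx)                    # remove and retest the rest
    return x

def z_scores(x, centre, scale):
    """Z-scores with respect to a given centre and scale estimate."""
    return (np.asarray(x, dtype=float) - centre) / scale
```

After filtering, Z-scores are computed from the classical mean and standard deviation of the retained points, as the text describes.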
The third approach, often called the Tukey approach [13,14], calculates a robust estimator of scale by dividing the interquartile distance by the interquartile distance of a standard Normal distribution (D = 1.34898) and uses the median as the estimator of the centre of the distribution.
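The Tukey scaling can be written compactly; this sketch assumes simple linear-interpolation quartiles, as the paper does not state which quartile definition was used.

```python
import numpy as np

D = 1.34898  # interquartile distance of the standard Normal distribution

def tukey_z(x):
    """Z-scores using the median as centre and IQR / D as a robust scale."""
    x = np.asarray(x, dtype=float)
    scale = (np.percentile(x, 75) - np.percentile(x, 25)) / D
    return (x - np.median(x)) / scale
```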
The fourth approach uses Qn [15,16] as a robust estimator of scale and the median as the estimator of the centre. The Qn estimator is approximately the median value of all pairwise differences between the values, rescaled to reflect the standard deviation of a Normal distribution with a fixed value D. Finally, the robust estimators of scale and centre according to ISO 13528 were calculated. Algorithm A of ISO 13528 [2] is based on calculating the classical average and standard deviation of a Winsorized sample. Winsorizing, i.e. replacing values beyond a certain limit by the limit itself, was applied to values deviating by more than 1.5 (δ) standard deviations from the centre.
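The two robust estimators just described can be sketched as follows. For Qn we use Rousseeuw and Croux's order-statistic definition (the text's "median of pairwise differences" is a shorthand) without small-sample correction factors; for Algorithm A the convergence tolerance and iteration limit are our choices.

```python
from itertools import combinations
import numpy as np

def qn_scale(x):
    """Qn scale estimator: the k-th order statistic of the pairwise absolute
    differences, scaled for consistency with the Normal standard deviation.
    Small-sample correction factors are omitted in this sketch."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diffs = np.sort([abs(b - a) for a, b in combinations(x, 2)])
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * diffs[k - 1]

def algorithm_a(x, tol=1e-7, max_iter=200):
    """ISO 13528 Algorithm A: start from median and scaled MAD, then iterate
    Winsorizing at centre +/- 1.5 * scale until the estimates stabilise."""
    x = np.asarray(x, dtype=float)
    x_star = np.median(x)
    s_star = 1.483 * np.median(np.abs(x - x_star))
    for _ in range(max_iter):
        delta = 1.5 * s_star
        w = np.clip(x, x_star - delta, x_star + delta)  # Winsorized sample
        new_x = w.mean()
        new_s = 1.134 * w.std(ddof=1)
        if abs(new_x - x_star) < tol and abs(new_s - s_star) < tol:
            break
        x_star, s_star = new_x, new_s
    return x_star, s_star
```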
The ability of the various approaches to flag outliers when they exist, and not to flag them when they do not exist, can be assessed in a way similar to the evaluation of diagnostic tests. For this purpose, the Negative Predictive Value (NPV) and Positive Predictive Value (PPV) were calculated for each approach by varying a specific parameter that can be changed to over- or under-estimate the standard deviation and, as a consequence, respectively decrease or increase the number of Z-scores above 3. The NPV was calculated as the ratio between the True Negatives (i.e. the samples to which no outlier was added and that showed no Z-scores beyond 3) and the number of samples for which no Z-score beyond 3 was found (= True + False Negatives). Likewise, the PPV was calculated as the ratio between the True Positives (i.e. the samples to which an outlier was added and that showed a Z-score beyond 3) and the number of samples for which a Z-score beyond 3 was found (= True + False Positives). For the Grubbs and Dixon test-based approaches, the P-value below which outliers are excluded (α) was varied. For the Tukey and Qn approaches, D was varied: lower values of D result in lower standard deviations, higher Z-scores and hence a higher flagging rate. For the ISO 13528 approach, δ was varied. The NPV and PPV for each value of the varying parameter were recorded and displayed graphically. Finally, for each simulated data series generated from the Normal distribution, the variability estimator obtained by every approach was recorded and its mean and standard error calculated.
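The NPV and PPV bookkeeping described above amounts to a simple contingency count per simulated sample:

```python
def predictive_values(flagged, contaminated):
    """flagged[i]: sample i produced a Z-score beyond 3;
    contaminated[i]: an outlier was actually added to sample i.
    Returns (NPV, PPV) as defined in the text."""
    tp = sum(f and c for f, c in zip(flagged, contaminated))
    fp = sum(f and not c for f, c in zip(flagged, contaminated))
    tn = sum(not f and not c for f, c in zip(flagged, contaminated))
    fn = sum(not f and c for f, c in zip(flagged, contaminated))
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return npv, ppv
```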
3. Results
3.1. False positives
A representative part of the False Positive (FP) rates obtained is depicted in the upper part of Table 1 (no outlier). Among all approaches, the Tukey method showed the most distinctive behaviour; while all FP rates were below 15% for the Normal distribution and below 30% for the Student's t-distribution, Tukey's approach had a rate above 20% for almost all sample sizes. The Dixon and ISO approaches showed the lowest FP rates. In addition, for samples of size 6 or larger, the FP rate of each approach (except ISO) stabilised for the Normal distribution. By contrast, all FP rates increased with increasing sample size for the Student's t-distribution.
3.2. True positives
The True Positive (TP) rates when adding an outlier at a distance μ + 3σ or μ + 5σ are shown in Table 1. For all outlier distances, differences between the approaches were similar for the Normal and Student's t-distribution. For outliers at μ + 3σ, none of the approaches was able to flag the outliers in more than half of the cases for any sample size. Tukey's approach had the highest performance, reaching a flagging rate of nearly 50% as soon as the sample size was 6 or larger. The other approaches had much weaker performance; the ISO approach had flagging rates below 10% for very small samples, and all other approaches exhibited outlier-finding rates of roughly 10–30%. In addition, these results point to a clear improvement of flagging rates for all approaches with increasing sample size and outlier distance, with a probability of detection close to 100% for outliers at μ + 7σ. The ISO and Dixon approaches, however, still had a weak performance for very small sample sizes.
3.3. Negative and positive predictive value
The results of the NPV and PPV for sample size n = 6 are shown in Fig. 1. As in Receiver Operating Characteristic (ROC) analysis, the perfect approach would flag no Z-scores larger than 3 when they do not exist (negative prediction) and would flag them all when they do exist (positive prediction); this would correspond to a curve made up of a vertical line coinciding with the Y-axis and a horizontal line intersecting the Y-axis at the value 1. The further the curve departs from this perfect curve, the worse the performance of the approach. For outliers at μ + 3σ, curves were located far from the ideal line, so that NPV and PPV did not reach high levels for any of the approaches: only a combination of positive and negative predictive values of about 60% was feasible, and although the Grubbs approach tended to perform better, there was not much difference between the approaches. For increasing outlier distance, however, the curves tend to approach the perfect curve. The Qn approach consistently performed the worst. The outlier-searching
algorithms showed a slightly better performance, mainly for outliers at moderate distance from the centre (μ + 5σ). There was almost no difference between the results for data generated from the Normal and from the Student's t-distribution.
A similar trend was seen for a sample size n = 8 (Fig. 2). All algorithms exhibited a weak performance for outliers at μ + 3σ. For outliers at μ + 5σ, however, the Grubbs approach performed better than the other approaches. This difference became less clear for more distant outliers, where all approaches showed almost perfect positive and negative predictive values. Focussing on the Grubbs approach, the search for the optimal P-value for excluding outliers (α) was made for different combinations of sample size and outlier distance. The optimal α decreased when outliers became more distant and with increasing sample size. In case of outliers at a small distance from the distribution, the optimal α was 0.2 for all sample sizes. This value decreased when the outliers were further away from the distribution (0.02–0.1 for outliers at μ + 5σ, 0.007–0.06 for outliers at μ + 7σ).
3.4. Variability and bias of standard deviation
Results concerning the variability and bias of the estimated standard deviations are depicted in Table 2. In the absence of outliers, the Tukey and outlier search-based approaches showed a larger deviation between the estimated and the actual population standard deviation of 0.5, consistently underestimating the standard deviation. The reverse occurred when outliers were present: the Tukey, Dixon and Grubbs approaches had the best accuracy, with the latter performing better when outliers became more distant. The Qn and ISO approaches tended to overestimate the standard deviation consistently in the presence of outliers and for all
Table 1
False and true outlier rates, expressed as percentages, for the five different approaches and a representative selection of investigated sample sizes. False outlier rates are shown in the 'no outlier' rows, true outlier rates in the other rows. The first five approach columns refer to the Normal distribution (N), the last five to the Student's t distribution (t).

| Sample size | Outlier distance | Grubbs (N) | Dixon (N) | Tukey (N) | ISO (N) | Qn (N) | Grubbs (t) | Dixon (t) | Tukey (t) | ISO (t) | Qn (t) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | no outlier | 10.9 | 5.7 | 29.6 | 4.2 | 15 | 12.8 | 6.7 | 33.5 | 5.3 | 18.7 |
| 6 | no outlier | 8.1 | 4.0 | 22.0 | 4.6 | 8.1 | 15.1 | 9.1 | 29.1 | 9.7 | 14.8 |
| 7 | no outlier | 10.2 | 4.0 | 20.7 | 5.0 | 10.0 | 15.0 | 9.0 | 28.0 | 9.2 | 15.9 |
| 8 | no outlier | 9.3 | 4.2 | 20.8 | 4.7 | 8.5 | 18.2 | 10.0 | 32.2 | 11.8 | 14.8 |
| 20 | no outlier | 9.8 | 4.5 | 21.2 | 8.4 | 9.1 | 31.6 | 24.3 | 46.6 | 32.2 | 31.5 |
| 5 | μ+3σ | 21.2 | 11.9 | 52.1 | 8.0 | 25.4 | 19.9 | 10.2 | 44.7 | 7.9 | 23.1 |
| 6 | μ+3σ | 25.7 | 13.9 | 44.7 | 13.1 | 20.3 | 22.7 | 13.6 | 42.5 | 12.1 | 17.6 |
| 7 | μ+3σ | 30.0 | 15.5 | 47.0 | 15.3 | 24.7 | 22.6 | 12.9 | 41.4 | 13.8 | 23.3 |
| 8 | μ+3σ | 30.9 | 13.7 | 46.5 | 16.5 | 20.9 | 22.4 | 12.0 | 40.6 | 13.9 | 18.4 |
| 20 | μ+3σ | 35.9 | 17.3 | 50.7 | 28.6 | 28.5 | 20.5 | 11.0 | 43.2 | 19.7 | 19.9 |
| 5 | μ+5σ | 60.0 | 37.1 | 85.8 | 19.5 | 52.3 | 51.5 | 33.4 | 78.9 | 17.4 | 46.5 |
| 6 | μ+5σ | 73.4 | 52.9 | 86.6 | 43.8 | 56.1 | 67.7 | 47.6 | 83.4 | 39.0 | 49.7 |
| 7 | μ+5σ | 83.2 | 62.2 | 91.3 | 59.2 | 64.2 | 73.2 | 53.2 | 83.8 | 50.9 | 57.7 |
| 8 | μ+5σ | 89.7 | 60.0 | 90.1 | 70.3 | 69.2 | 79.4 | 48.8 | 85.8 | 59.2 | 60.9 |
| 20 | μ+5σ | 100 | 95.2 | 99.0 | 99.3 | 98.5 | 99.2 | 81.1 | 96.7 | 93.8 | 92.9 |
| 5 | μ+7σ | 87.7 | 66.6 | 97.7 | 35.2 | 76.7 | 80.1 | 58.7 | 95.3 | 29.7 | 70.0 |
| 6 | μ+7σ | 96.7 | 85.1 | 99.1 | 76.9 | 83.5 | 93.8 | 78.5 | 98.2 | 71.6 | 82.0 |
| 7 | μ+7σ | 99.5 | 93.1 | 99.1 | 91.9 | 91.2 | 97.4 | 86.3 | 98.3 | 84.1 | 83.2 |
| 20 | μ+7σ | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
[Figure 1 shows four panels (Normal and Student's t, with the outlier at 3σ and at 5σ) plotting Positive predictive value against Negative predictive value for the Grubbs, Dixon, Tukey, ISO and Qn approaches.]
Fig. 1. Negative and positive predictive values for the five different approaches, based on samples of size n = 6, for Normal and Student's t-distributions.
sample sizes. Precision was similar for all approaches and increased
with increasing sample size.
4. Discussion
The findings of the present study illustrate that, as far as symmetric unimodal distributions are concerned, the behaviour of the different approaches for estimating Z-scores does not really depend on the kurtosis (peakedness) of the distribution: similar performances were found for the data generated from the Normal and from the Student's t-distribution. Although the Normal and t-distributions cover a wide range of distributions that describe data reported in EQA surveys, distributions may be multimodal or exhibit skewness in some cases, and the Z-scores may then become unreliable. Unimodality is a prerequisite for obtaining reliable Z-scores and, in the light of the presence of matrix effects [17], the performance of a laboratory should be assessed with respect to its peers by so-called peer group comparisons; this can only be assured by grouping data according to equal or similar methodology. As a result, peer groups may be small. For example, half of the peer groups in Belgian EQA programmes for chemistry and immunoassays contain 10 laboratories or fewer.
Apart from avoiding multimodality by the EQA set-up, post hoc controls for unimodality and symmetry may be applied as well. Formal tests have been described to test whether the data are unimodal [18]. They are based on kernel density estimation, and standard errors and significance of multimodality can be obtained by bootstrapping. In addition, asymmetry of the data distribution can be assessed by measuring skewness after removing spurious results.
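A post hoc symmetry check of this kind can be as simple as the following sketch; the flagging threshold of 1 is an illustrative choice, not taken from the paper.

```python
import numpy as np
from scipy import stats

def symmetry_check(x, threshold=1.0):
    """Return the sample skewness and whether its magnitude exceeds an
    illustrative threshold, marking the distribution as suspect."""
    g1 = stats.skew(np.asarray(x, dtype=float))
    return g1, abs(g1) > threshold
```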
Regarding PPV, we observed that, for small sample sizes (n < 10) and for outliers close to the centre, the Tukey and, to a lesser extent, the Grubbs approaches performed better than the other ones. Note that the ISO approach has, in comparison with the other approaches, low outlier-finding capacities for sample sizes below 10. There is, however, not much difference between the various approaches when sample size increases and/or when outliers are located further away from the centre, so that the question of which approach to select for Z-scores only needs to be addressed for small sample sizes.
[Figure 2 shows four panels (Normal and Student's t, with the outlier at 3σ and at 5σ) plotting Positive predictive value against Negative predictive value for the Grubbs, Dixon, Tukey, ISO and Qn approaches.]
Fig. 2. Negative and positive predictive values for the five different approaches, based on samples of size n = 8, for Normal and Student's t-distributions.
Table 2
Standard error and mean of the variability estimate obtained by the different approaches, for a representative selection of investigated sample sizes. Better estimates have lower values in the left part of the table (standard error of the estimated standard deviation) and values as close as possible to the true σ = 0.5 in the right part (mean of the estimated standard deviation). Results are obtained from a Normal distribution. Outlier distances are in units of σ.

| Sample size | Outlier distance | SE Grubbs | SE Dixon | SE Tukey | SE ISO | SE Qn | Mean Grubbs | Mean Dixon | Mean Tukey | Mean ISO | Mean Qn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0.424 | 0.423 | 0.454 | 0.475 | 0.579 | 0.478 | 0.490 | 0.430 | 0.553 | 0.571 |
| 6 | 0 | 0.394 | 0.394 | 0.412 | 0.442 | 0.486 | 0.483 | 0.492 | 0.438 | 0.541 | 0.544 |
| 7 | 0 | 0.393 | 0.390 | 0.415 | 0.437 | 0.508 | 0.484 | 0.496 | 0.455 | 0.538 | 0.553 |
| 8 | 0 | 0.380 | 0.377 | 0.397 | 0.418 | 0.445 | 0.484 | 0.492 | 0.451 | 0.530 | 0.531 |
| 20 | 0 | 0.290 | 0.287 | 0.346 | 0.315 | 0.319 | 0.488 | 0.493 | 0.484 | 0.511 | 0.510 |
| 5 | 3 | 0.561 | 0.529 | 0.559 | 0.585 | 0.798 | 0.732 | 0.763 | 0.549 | 0.857 | 0.846 |
| 6 | 3 | 0.526 | 0.494 | 0.513 | 0.562 | 0.639 | 0.686 | 0.721 | 0.559 | 0.771 | 0.771 |
| 7 | 3 | 0.504 | 0.471 | 0.477 | 0.528 | 0.627 | 0.658 | 0.696 | 0.546 | 0.726 | 0.737 |
| 8 | 3 | 0.477 | 0.441 | 0.482 | 0.501 | 0.542 | 0.634 | 0.673 | 0.548 | 0.682 | 0.700 |
| 20 | 3 | 0.339 | 0.316 | 0.350 | 0.331 | 0.339 | 0.547 | 0.566 | 0.508 | 0.553 | 0.566 |
| 5 | 5 | 0.845 | 0.847 | 0.559 | 0.859 | 0.958 | 0.822 | 0.982 | 0.549 | 1.168 | 0.921 |
| 6 | 5 | 0.741 | 0.783 | 0.513 | 0.724 | 0.761 | 0.685 | 0.825 | 0.559 | 0.871 | 0.837 |
| 7 | 5 | 0.645 | 0.720 | 0.477 | 0.599 | 0.678 | 0.605 | 0.740 | 0.546 | 0.760 | 0.760 |
| 8 | 5 | 0.565 | 0.681 | 0.482 | 0.538 | 0.596 | 0.547 | 0.726 | 0.548 | 0.697 | 0.727 |
| 20 | 5 | 0.281 | 0.346 | 0.350 | 0.331 | 0.344 | 0.486 | 0.505 | 0.508 | 0.553 | 0.568 |
| 5 | 7 | 0.940 | 1.118 | 0.559 | 1.128 | 0.969 | 0.703 | 1.001 | 0.549 | 1.367 | 0.923 |
| 6 | 7 | 0.663 | 0.896 | 0.513 | 0.756 | 0.770 | 0.535 | 0.718 | 0.559 | 0.881 | 0.839 |
| 7 | 7 | 0.461 | 0.717 | 0.477 | 0.600 | 0.678 | 0.490 | 0.594 | 0.546 | 0.760 | 0.760 |
| 8 | 7 | 0.393 | 0.726 | 0.482 | 0.538 | 0.596 | 0.481 | 0.618 | 0.548 | 0.697 | 0.727 |
| 20 | 7 | 0.281 | 0.278 | 0.35 | 0.331 | 0.344 | 0.486 | 0.489 | 0.508 | 0.553 | 0.568 |
For the NPV, Tukey's approach demonstrated the worst performance, so that, in line with its underestimation of variability in the absence of outliers, this approach has a much higher flagging rate than the other approaches, regardless of the contamination of the sample. Further, for leptokurtic data, more values will be wrongly flagged when the sample size increases. The latter can easily be explained by the higher frequency of data in the tails of the distribution. This finding runs counter to Thienpont's suggestion [19] to make the threshold for flagging Z-scores dependent on the sample size. The explanation lies in the fact that all tests assume a Normal distribution, and increasing the threshold value with decreasing sample size would only work for normally distributed data [20]. Nevertheless, changing the threshold has an inverse effect on the NPV and PPV, and it is therefore important to consider NPV and PPV together. The analysis of NPV and PPV shows that the difference between the algorithms disappears with increasing sample size and for outliers further away from the centre. For outliers relatively close to the centre and for smaller sample sizes, however, the outlier search-based algorithms tend to perform better than the robust algorithms.
When the estimated standard deviation is used not only for Z-scores but also for a follow-up of the performance of the different peer groups, this standard deviation appears to be overestimated by every approach when outliers are present at a small distance from the centre. The latter can be explained by the low performance of all the algorithms with respect to outliers relatively close to the centre. We see, however, that here too the outlier search-based algorithms perform better than the robust algorithms when the outliers are more distant from the centre of the distribution. The low efficacy of robust estimators for sample sizes up to 6 has already been noted by Rousseeuw [21]. For the particular objective of this study, it can be added that robust estimators also underperform for samples of larger size, and that the stability of the estimated standard deviation is quite similar across the different approaches.
When considering NPV, PPV and the bias of the variability estimators together, we would recommend the outlier search-based algorithms over the robust approaches, certainly when sample sizes are small. If, however, a robust approach is preferred, we would recommend the Tukey approach for its simplicity, its unbiased estimator of variability and its high flagging rate when outliers are present, although its relatively low negative predictive value may make it useless for punitive EQA programmes.
To answer the question of the minimal sample size of a peer group before its members can be evaluated, two antagonistic arguments are involved. Firstly, NPV, PPV and the accuracy of the estimated standard deviation increase with increasing sample size; hence, larger sample sizes are preferred. Secondly, when only large peer groups are evaluated, many laboratories will escape evaluation; hence, from this perspective, smaller sample sizes are preferred. In our opinion, the Grubbs and Tukey approaches find the best compromise between these antagonistic arguments with a minimal sample size of 6, which is in line with previously published results [5]. The Grubbs approach should be applied with a high α (0.2) when sample sizes are small and a lower α (0.01–0.02) when sample sizes are larger (n ≥ 10). If the EQA organiser favours the ISO approach, we would definitely not recommend it for sample sizes below 10.
In conclusion, this study focussed on small sample sizes with one outlier added. When sample sizes increase and the probability of encountering multiple outliers becomes high, the outlier-searching algorithms applied here may suffer from masking effects when the data contain more outliers, i.e. the presence of an outlier may escape notice if a larger outlier is present. In this case, masking-free modifications of the Grubbs and Dixon tests may be applied [22,23].
References
[1] Plebani M. External quality assessment programs: past, present and future. Jugoslav Med Biohem 2005;24:201–6.
[2] International Organization for Standardization. ISO 13528:2005. Statistical methods for use in proficiency testing by interlaboratory comparisons; 2005.
[3] Thompson M, Ellison SLR, Wood R. The International Harmonised Protocol for the Proficiency Testing of Analytical Chemistry Laboratories. Pure Appl Chem 2006;78:145–96.
[4] Shiffler RE. Maximum Z scores and outliers. Am Stat 1988;42:79–80.
[5] Hund E, Massart DL, Smeyers-Verbeke J. Inter-laboratory studies in analytical chemistry. Anal Chim Acta 2000;423:145–65.
[6] Healy M. Outliers in clinical chemistry quality-control schemes. Clin Chem 1979;25:675–7.
[7] Rocke D. Robust statistical analysis of interlaboratory studies. Biometrika 1983;70:421–31.
[8] Heydorn K. The distribution of interlaboratory comparison data. Accredit Qual Assur 2008;13:723–4.
[9] Duewer DL. The distribution of interlaboratory comparison data: response to the contribution by K. Heydorn. Accredit Qual Assur 2008;13:725–6.
[10] Grubbs FE. Procedures for detecting outlying observations in samples. Technometrics 1969;11:1–21.
[11] Rosario P, Martínez JL, Silván JM. Comparison of different statistical methods for evaluation of proficiency test data. Accredit Qual Assur 2008;13:493–9.
[12] Dixon WJ. Analysis of extreme values. Ann Math Stat 1950:488–506.
[13] Tukey JW. Exploratory data analysis. Reading, MA; 1977.
[14] Sciacovelli L, Secchiero S, Zardo L, Plebani M. External Quality Assessment Schemes: need for recognised requirements. Clin Chim Acta 2001;309:183–99.
[15] Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. J Am Stat Assoc 1993;88(424):1273–83.
[16] Wilrich P-T. Robust estimates of the theoretical standard deviation to be used in interlaboratory precision experiments. Accredit Qual Assur 2007;12:231–40.
[17] Miller WG. Specimen materials, target values and commutability for external quality assessment (proficiency testing) schemes. Clin Chim Acta 2003;327:25–37.
[18] Lowthian PJ, Thompson M. Bump-hunting for the proficiency tester: searching for multimodality. Analyst 2002;127:1359–64.
[19] Thienpont LMR, Steyaert HLC, De Leenheer AP. A modified statistical approach for the detection of outlying values in external quality control: comparison with other techniques. Clin Chim Acta 1987;168:337–46.
[20] Zhou Q, Xu J, Xie W, Li S, Li X. Use of robust ZB and ZW to evaluate proficiency testing data. Clin Chim Acta 2011;412:936–9.
[21] Rousseeuw PJ, Verboven S. Robust estimation in very small samples. Comput Stat Data Anal 2002;40:741–58.
[22] Rosner B. On the detection of many outliers. Technometrics 1975;17:221–7.
[23] Jain RB. A recursive version of Grubbs test for detecting multiple outliers in environmental and chemical data. Clin Biochem 2010;43:1030–3.
586 W. Coucke et al. / Clinica Chimica Acta 413 (2012) 582–586