a contaminated Normal distribution. Some authors [8,9], however,
claim that the distribution, even in the absence of outliers, may be
leptokurtic, i.e. exhibiting heavier tails than the Normal distribution.
Thus, comparing the two types of estimating approaches for the specific classes of symmetric and unimodal distributions, represented by the Normal and Student's t-distributions, may be of importance.
This study was designed to address three simple questions: (1) What is the false positive rate of Z-score estimation methods in non-contaminated samples from the Normal and Student's t distributions? (2) What is the true positive rate of Z-score estimation methods in contaminated samples from the same distributions? (3) What are the accuracy and precision of the different variability estimators for the Normal distribution?
2. Materials and methods
A total of 1000 random samples was generated from a Normal distribution with mean and standard deviation arbitrarily set at μ = 10 and σ = 0.5. Data were generated for sample sizes ranging from n = 3 to 20.
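As an illustration, the sampling scheme described here (including the Student's t variant described below) can be sketched in Python. The location-scale parametrisation of the t samples is our assumption; the paper states only the 5 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
MU, SIGMA = 10.0, 0.5           # population parameters used in the study

def simulate_samples(n, n_sim=1000, dist="normal", df=5):
    """Draw n_sim samples of size n from N(MU, SIGMA**2) or, as an assumed
    parametrisation, a Student's t with df degrees of freedom shifted to
    centre MU and scaled by SIGMA."""
    if dist == "normal":
        return rng.normal(MU, SIGMA, size=(n_sim, n))
    return MU + SIGMA * rng.standard_t(df, size=(n_sim, n))

# One simulated data series: 1000 samples of size 6 from the Normal distribution
samples = simulate_samples(n=6)
```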
Subsequently, to obtain data from a leptokurtic distribution, similar simulations were performed using a Student's t-distribution with 5 degrees of freedom. Only samples for which all values fell within the interval [μ − 3σ, μ + 3σ] were retained. Next, the samples were contaminated by adding an outlier at μ + 3σ, at μ + 5σ and at μ + 7σ separately, resulting in a sample of size n + 1. Samples of size n = 3 without added outliers were not taken into account. Z-scores were calculated on each sample following five different approaches.

The first approach used the Grubbs test [10,11] to remove outliers in a first step. This test calculates the distance between the most extreme point and the centre of the distribution; if this distance is too large with respect to the standard deviation, the point is flagged as an outlier. The test uses a predefined false alarm rate, which was kept small (α = 0.05). If an outlier was found, the test was repeated on the rest of the sample at the same level α until no outlier remained. In a second step, Z-scores were calculated from the classical average and standard deviation of the data that were not marked as outliers in the first step.

The second approach used the Dixon test [12] to remove outliers in a first step. The method is based on the calculation of ranges between the lowest and highest sample values, and subranges between the most extreme sample values on either side. Like the Grubbs test, it is a hypothesis test; here too, outliers were removed until the null hypothesis of absence of outliers was accepted (α = 0.05), and Z-scores were subsequently obtained from the classical average and standard deviation of the data that were not removed.
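A minimal sketch of the iterative Grubbs filtering described above, assuming the standard two-sided Grubbs critical value; the exact routine used in the study is not given.

```python
import numpy as np
from scipy import stats

def grubbs_filter(x, alpha=0.05):
    """Repeatedly remove the most extreme point while the two-sided Grubbs
    test rejects the hypothesis of no outliers at level alpha."""
    x = np.asarray(x, dtype=float)
    while x.size >= 3:
        n = x.size
        mean, sd = x.mean(), x.std(ddof=1)
        idx = np.argmax(np.abs(x - mean))        # most extreme point
        g = abs(x[idx] - mean) / sd              # Grubbs statistic
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        if g <= g_crit:                          # no outlier found: stop
            break
        x = np.delete(x, idx)                    # remove and retest the rest
    return x

def z_scores(x, centre, scale):
    """Z-scores with respect to a given centre and scale estimate."""
    return (np.asarray(x, dtype=float) - centre) / scale
```

After filtering, Z-scores are computed from the classical mean and standard deviation of the retained points, as the text describes.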
The third approach, often called the Tukey approach [13,14], calculates a robust estimator of scale by dividing the interquartile distance by the interquartile distance of a standard Normal distribution (D = 1.34898) and uses the median as the estimator of the centre of the distribution.
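The Tukey scaling can be written compactly; this sketch assumes simple linear-interpolation quartiles, as the paper does not state which quartile definition was used.

```python
import numpy as np

D = 1.34898  # interquartile distance of the standard Normal distribution

def tukey_z(x):
    """Z-scores using the median as centre and IQR / D as a robust scale."""
    x = np.asarray(x, dtype=float)
    scale = (np.percentile(x, 75) - np.percentile(x, 25)) / D
    return (x - np.median(x)) / scale
```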
The fourth approach uses Qn [15,16] as a robust estimator of scale and the median as the estimator of the centre. The Qn estimator is approximately the median value of all pairwise differences between the values, rescaled to reflect the standard deviation of a Normal distribution with a fixed value D. Finally, the robust estimators of scale and centre according to ISO 13528 were calculated. Algorithm A of ISO 13528 [2] is based on calculating the classical average and standard deviation of a Winsorized sample. Winsorizing, i.e. replacing values beyond a certain limit by the limit itself, was applied to values deviating by more than 1.5 (δ) standard deviations from the centre.
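The two robust estimators just described can be sketched as follows. For Qn we use Rousseeuw and Croux's order-statistic definition (the text's "median of pairwise differences" is a shorthand) without small-sample correction factors; for Algorithm A the convergence tolerance and iteration limit are our choices.

```python
from itertools import combinations
import numpy as np

def qn_scale(x):
    """Qn scale estimator: the k-th order statistic of the pairwise absolute
    differences, scaled for consistency with the Normal standard deviation.
    Small-sample correction factors are omitted in this sketch."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diffs = np.sort([abs(b - a) for a, b in combinations(x, 2)])
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * diffs[k - 1]

def algorithm_a(x, tol=1e-7, max_iter=200):
    """ISO 13528 Algorithm A: start from median and scaled MAD, then iterate
    Winsorizing at centre +/- 1.5 * scale until the estimates stabilise."""
    x = np.asarray(x, dtype=float)
    x_star = np.median(x)
    s_star = 1.483 * np.median(np.abs(x - x_star))
    for _ in range(max_iter):
        delta = 1.5 * s_star
        w = np.clip(x, x_star - delta, x_star + delta)  # Winsorized sample
        new_x = w.mean()
        new_s = 1.134 * w.std(ddof=1)
        if abs(new_x - x_star) < tol and abs(new_s - s_star) < tol:
            break
        x_star, s_star = new_x, new_s
    return x_star, s_star
```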
The ability of the various approaches to flag outliers when they exist, and not to flag them when they do not exist, can be assessed in a way similar to the evaluation of diagnostic tests. For this purpose, the Negative Predictive Value (NPV) and Positive Predictive Value (PPV) were calculated for each approach by varying a specific parameter that can be changed to over- or under-estimate the standard deviation and, as a consequence, respectively decrease or increase the number of Z-scores above 3. The NPV was calculated as the ratio between the True Negatives (i.e. the samples to which no outlier was added and that showed no Z-scores beyond 3) and the number of samples for which no Z-score beyond 3 was found (= True + False Negatives). Likewise, the PPV was calculated as the ratio between the True Positives (i.e. the samples to which an outlier was added and that showed a Z-score beyond 3) and the number of samples for which a Z-score beyond 3 was found (= True + False Positives). For the Grubbs and Dixon test-based approaches, the P-value below which outliers are excluded (α) was varied. For the Tukey and Qn approaches, D was varied: lower values of D result in lower standard deviations, higher Z-scores and hence a higher flagging rate. For the ISO 13528 approach, δ was varied. The NPV and PPV for each value of the varying parameter were recorded and displayed graphically. Finally, for each simulated data series generated from the Normal distribution, the variability estimator obtained by every approach was recorded and its mean and standard error calculated.
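The NPV and PPV bookkeeping described above amounts to a simple contingency count per simulated sample:

```python
def predictive_values(flagged, contaminated):
    """flagged[i]: sample i produced a Z-score beyond 3;
    contaminated[i]: an outlier was actually added to sample i.
    Returns (NPV, PPV) as defined in the text."""
    tp = sum(f and c for f, c in zip(flagged, contaminated))
    fp = sum(f and not c for f, c in zip(flagged, contaminated))
    tn = sum(not f and not c for f, c in zip(flagged, contaminated))
    fn = sum(not f and c for f, c in zip(flagged, contaminated))
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return npv, ppv
```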
3. Results
3.1. False positives
A representative part of the False Positive (FP) rates obtained is depicted in the upper part of Table 1 (no outlier). Among all approaches, the Tukey method showed the most distinctive behaviour; while all FP rates were below 15% for the Normal distribution and below 30% for the Student's t-distribution, Tukey's approach had a rate above 20% for almost all sample sizes. The Dixon and ISO approaches showed the lowest FP rates. In addition, for samples of size 6 or larger, the FP rate of each approach (except ISO) stabilised for the Normal distribution. By contrast, all FP rates increased with increasing sample size for the Student's t-distribution.
3.2. True positives
The True Positive (TP) rates when adding an outlier at a distance μ + 3σ or μ + 5σ are shown in Table 1. For all outlier distances, differences between the approaches were similar for the Normal and Student's t-distribution. For outliers at μ + 3σ, none of the approaches was able to flag the outliers in more than half of the cases for any sample size. Tukey's approach had the highest performance, reaching a flagging rate of nearly 50% as soon as the sample size was 6 or larger. The other approaches had much weaker performance; the ISO approach had flagging rates below 10% for very small samples, and all other approaches exhibited outlier-finding rates of roughly 10–30%. In addition, these results point to a clear improvement of flagging rates for all approaches with increasing sample size and outlier distance, with a probability of detection close to 100% for outliers at μ + 7σ. The ISO and Dixon approaches, however, still had a weak performance for very small sample sizes.
3.3. Negative and positive predictive value
The results of the NPV and PPV for sample size n = 6 are shown in Fig. 1. As in Receiver Operating Characteristic (ROC) analysis, the perfect approach would flag no Z-scores larger than 3 when they do not exist (negative prediction) and would flag them all when they do exist (positive prediction); this would correspond to a curve made up of a vertical line coinciding with the Y-axis and a horizontal line intersecting the Y-axis at the value 1. The further the curve departs from this perfect curve, the worse the performance of the approach. For outliers at μ + 3σ, curves were located far from the ideal line, so that NPV and PPV did not reach high levels for any of the approaches: only a combination of positive and negative predictive values of about 60% was feasible, and although the Grubbs approach tended to perform better, there was not much difference between the approaches. For increasing outlier distance, however, the curves tend to approach the perfect curve. The Qn approach consistently performed the worst. The outlier-searching
algorithms showed a slightly better performance, mainly for outliers at moderate distance from the centre (μ + 5σ). There was almost no difference between the results for data generated from the Normal and from the Student's t-distribution.
A similar trend was seen for a sample size n = 8 (Fig. 2). All algorithms exhibited a weak performance for outliers at μ + 3σ. For outliers at μ + 5σ, however, the Grubbs approach performed better than the other approaches. This difference became less clear for more distant outliers, where all approaches showed almost perfect positive and negative predictive values. Focussing on the Grubbs approach, the search for the optimal P-value for excluding outliers (α) was made for different combinations of sample size and outlier distance. The optimal α decreased when outliers became more distant and with increasing sample size. In case of outliers at a small distance from the distribution, the optimal α was 0.2 for all sample sizes. This value decreased when the outliers were further away from the distribution (0.02–0.1 for outliers at μ + 5σ, 0.007–0.06 for outliers at μ + 7σ).
3.4. Variability and bias of standard deviation
Results concerning the variability and bias of the estimated standard deviations are depicted in Table 2. In the absence of outliers, the Tukey and outlier search-based approaches showed a larger deviation between the estimated and the actual population standard deviation of 0.5, consistently underestimating the standard deviation. The reverse occurred when outliers were present: the Tukey, Dixon and Grubbs approaches had the best accuracy, with the latter performing better when outliers became more distant. The Qn and ISO approaches tended to overestimate the standard deviation consistently in the presence of outliers and for all
Table 1
False and true outlier rates, expressed as percentages, for the five different approaches and a representative selection of investigated sample sizes. False outlier rates are shown in the 'no outlier' rows, true outlier rates in the other rows. The first five approach columns refer to the Normal distribution (N), the last five to the Student's t distribution (t).

| Sample size | Outlier distance | Grubbs (N) | Dixon (N) | Tukey (N) | ISO (N) | Qn (N) | Grubbs (t) | Dixon (t) | Tukey (t) | ISO (t) | Qn (t) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | no outlier | 10.9 | 5.7 | 29.6 | 4.2 | 15 | 12.8 | 6.7 | 33.5 | 5.3 | 18.7 |
| 6 | no outlier | 8.1 | 4.0 | 22.0 | 4.6 | 8.1 | 15.1 | 9.1 | 29.1 | 9.7 | 14.8 |
| 7 | no outlier | 10.2 | 4.0 | 20.7 | 5.0 | 10.0 | 15.0 | 9.0 | 28.0 | 9.2 | 15.9 |
| 8 | no outlier | 9.3 | 4.2 | 20.8 | 4.7 | 8.5 | 18.2 | 10.0 | 32.2 | 11.8 | 14.8 |
| 20 | no outlier | 9.8 | 4.5 | 21.2 | 8.4 | 9.1 | 31.6 | 24.3 | 46.6 | 32.2 | 31.5 |
| 5 | μ+3σ | 21.2 | 11.9 | 52.1 | 8.0 | 25.4 | 19.9 | 10.2 | 44.7 | 7.9 | 23.1 |
| 6 | μ+3σ | 25.7 | 13.9 | 44.7 | 13.1 | 20.3 | 22.7 | 13.6 | 42.5 | 12.1 | 17.6 |
| 7 | μ+3σ | 30.0 | 15.5 | 47.0 | 15.3 | 24.7 | 22.6 | 12.9 | 41.4 | 13.8 | 23.3 |
| 8 | μ+3σ | 30.9 | 13.7 | 46.5 | 16.5 | 20.9 | 22.4 | 12.0 | 40.6 | 13.9 | 18.4 |
| 20 | μ+3σ | 35.9 | 17.3 | 50.7 | 28.6 | 28.5 | 20.5 | 11.0 | 43.2 | 19.7 | 19.9 |
| 5 | μ+5σ | 60.0 | 37.1 | 85.8 | 19.5 | 52.3 | 51.5 | 33.4 | 78.9 | 17.4 | 46.5 |
| 6 | μ+5σ | 73.4 | 52.9 | 86.6 | 43.8 | 56.1 | 67.7 | 47.6 | 83.4 | 39.0 | 49.7 |
| 7 | μ+5σ | 83.2 | 62.2 | 91.3 | 59.2 | 64.2 | 73.2 | 53.2 | 83.8 | 50.9 | 57.7 |
| 8 | μ+5σ | 89.7 | 60.0 | 90.1 | 70.3 | 69.2 | 79.4 | 48.8 | 85.8 | 59.2 | 60.9 |
| 20 | μ+5σ | 100 | 95.2 | 99.0 | 99.3 | 98.5 | 99.2 | 81.1 | 96.7 | 93.8 | 92.9 |
| 5 | μ+7σ | 87.7 | 66.6 | 97.7 | 35.2 | 76.7 | 80.1 | 58.7 | 95.3 | 29.7 | 70.0 |
| 6 | μ+7σ | 96.7 | 85.1 | 99.1 | 76.9 | 83.5 | 93.8 | 78.5 | 98.2 | 71.6 | 82.0 |
| 7 | μ+7σ | 99.5 | 93.1 | 99.1 | 91.9 | 91.2 | 97.4 | 86.3 | 98.3 | 84.1 | 83.2 |
| 20 | μ+7σ | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
[Figure 1 shows four panels (Normal and Student's t, with the outlier at 3σ and at 5σ) plotting Positive predictive value against Negative predictive value for the Grubbs, Dixon, Tukey, ISO and Qn approaches.]
Fig. 1. Negative and positive predictive values for the five different approaches, based on samples of size n = 6, for Normal and Student's t-distributions.
sample sizes. Precision was similar for all approaches and increased
with increasing sample size.
4. Discussion
The findings of the present study illustrate that, as far as symmetric unimodal distributions are concerned, the behaviour of the different approaches for estimating Z-scores does not really depend on the kurtosis (peakedness) of the distribution: similar performances were found for the data generated from the Normal and from the Student's t-distribution. Although the Normal and t-distributions cover a wide range of distributions that describe data reported in EQA surveys, distributions may be multimodal or exhibit skewness in some cases, and the Z-scores may then become unreliable. Unimodality is a prerequisite for obtaining reliable Z-scores and, in the light of the presence of matrix effects [17], the performance of a laboratory should be assessed with respect to its peers by so-called peer group comparisons; this can only be assured by grouping data according to equal or similar methodology. As a result, peer groups may be small. For example, half of the peer groups in Belgian EQA programmes for chemistry and immunoassays contain 10 laboratories or fewer.
Apart from avoiding multimodality by the EQA set-up, post hoc controls for unimodality and symmetry may be applied as well. Formal tests have been described to test whether the data are unimodal [18]. They are based on kernel density estimation, and standard errors and significance of multimodality can be obtained by bootstrapping. In addition, asymmetry of the data distribution can be assessed by measuring skewness after removing spurious results.
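A post hoc symmetry check of this kind can be as simple as the following sketch; the flagging threshold of 1 is an illustrative choice, not taken from the paper.

```python
import numpy as np
from scipy import stats

def symmetry_check(x, threshold=1.0):
    """Return the sample skewness and whether its magnitude exceeds an
    illustrative threshold, marking the distribution as suspect."""
    g1 = stats.skew(np.asarray(x, dtype=float))
    return g1, abs(g1) > threshold
```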
Regarding PPV, we observed that, for small sample sizes (n < 10) and for outliers close to the centre, the Tukey and, to a lesser extent, the Grubbs approaches performed better than the other ones. Note that the ISO approach has, in comparison with the other approaches, low outlier-finding capacities for sample sizes below 10. There is, however, not much difference between the various approaches when sample size increases and/or when outliers are located further away from the centre, so that the question of which approach to select for Z-scores only needs to be addressed for small sample sizes.
[Figure 2 shows four panels (Normal and Student's t, with the outlier at 3σ and at 5σ) plotting Positive predictive value against Negative predictive value for the Grubbs, Dixon, Tukey, ISO and Qn approaches.]
Fig. 2. Negative and positive predictive values for the five different approaches, based on samples of size n = 8, for Normal and Student's t-distributions.
Table 2
Standard error and mean of the variability estimate obtained by the different approaches, for a representative selection of investigated sample sizes. Better estimates have lower values in the left part of the table (standard error of the estimated standard deviation) and values as close as possible to the true σ = 0.5 in the right part (mean of the estimated standard deviation). Results are obtained from a Normal distribution. Outlier distances are in units of σ.

| Sample size | Outlier distance | SE Grubbs | SE Dixon | SE Tukey | SE ISO | SE Qn | Mean Grubbs | Mean Dixon | Mean Tukey | Mean ISO | Mean Qn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0.424 | 0.423 | 0.454 | 0.475 | 0.579 | 0.478 | 0.490 | 0.430 | 0.553 | 0.571 |
| 6 | 0 | 0.394 | 0.394 | 0.412 | 0.442 | 0.486 | 0.483 | 0.492 | 0.438 | 0.541 | 0.544 |
| 7 | 0 | 0.393 | 0.390 | 0.415 | 0.437 | 0.508 | 0.484 | 0.496 | 0.455 | 0.538 | 0.553 |
| 8 | 0 | 0.380 | 0.377 | 0.397 | 0.418 | 0.445 | 0.484 | 0.492 | 0.451 | 0.530 | 0.531 |
| 20 | 0 | 0.290 | 0.287 | 0.346 | 0.315 | 0.319 | 0.488 | 0.493 | 0.484 | 0.511 | 0.510 |
| 5 | 3 | 0.561 | 0.529 | 0.559 | 0.585 | 0.798 | 0.732 | 0.763 | 0.549 | 0.857 | 0.846 |
| 6 | 3 | 0.526 | 0.494 | 0.513 | 0.562 | 0.639 | 0.686 | 0.721 | 0.559 | 0.771 | 0.771 |
| 7 | 3 | 0.504 | 0.471 | 0.477 | 0.528 | 0.627 | 0.658 | 0.696 | 0.546 | 0.726 | 0.737 |
| 8 | 3 | 0.477 | 0.441 | 0.482 | 0.501 | 0.542 | 0.634 | 0.673 | 0.548 | 0.682 | 0.700 |
| 20 | 3 | 0.339 | 0.316 | 0.350 | 0.331 | 0.339 | 0.547 | 0.566 | 0.508 | 0.553 | 0.566 |
| 5 | 5 | 0.845 | 0.847 | 0.559 | 0.859 | 0.958 | 0.822 | 0.982 | 0.549 | 1.168 | 0.921 |
| 6 | 5 | 0.741 | 0.783 | 0.513 | 0.724 | 0.761 | 0.685 | 0.825 | 0.559 | 0.871 | 0.837 |
| 7 | 5 | 0.645 | 0.720 | 0.477 | 0.599 | 0.678 | 0.605 | 0.740 | 0.546 | 0.760 | 0.760 |
| 8 | 5 | 0.565 | 0.681 | 0.482 | 0.538 | 0.596 | 0.547 | 0.726 | 0.548 | 0.697 | 0.727 |
| 20 | 5 | 0.281 | 0.346 | 0.350 | 0.331 | 0.344 | 0.486 | 0.505 | 0.508 | 0.553 | 0.568 |
| 5 | 7 | 0.940 | 1.118 | 0.559 | 1.128 | 0.969 | 0.703 | 1.001 | 0.549 | 1.367 | 0.923 |
| 6 | 7 | 0.663 | 0.896 | 0.513 | 0.756 | 0.770 | 0.535 | 0.718 | 0.559 | 0.881 | 0.839 |
| 7 | 7 | 0.461 | 0.717 | 0.477 | 0.600 | 0.678 | 0.490 | 0.594 | 0.546 | 0.760 | 0.760 |
| 8 | 7 | 0.393 | 0.726 | 0.482 | 0.538 | 0.596 | 0.481 | 0.618 | 0.548 | 0.697 | 0.727 |
| 20 | 7 | 0.281 | 0.278 | 0.35 | 0.331 | 0.344 | 0.486 | 0.489 | 0.508 | 0.553 | 0.568 |
For the NPV, Tukey's approach demonstrated the worst performance, so that, in line with its underestimation of variability in the absence of outliers, this approach has a much higher flagging rate than the other approaches, regardless of the contamination of the sample. Further, for leptokurtic data, more values will be wrongly flagged when the sample size increases. The latter can easily be explained by the higher frequency of data in the tails of the distribution. This finding runs counter to Thienpont's suggestion [19] to make the threshold for flagging Z-scores dependent on the sample size. The explanation lies in the fact that all tests assume a Normal distribution, and increasing the threshold value with decreasing sample size would only work for normally distributed data [20]. Nevertheless, changing the threshold has an inverse effect on the NPV and PPV, and it is therefore important to consider NPV and PPV together. The analysis of NPV and PPV shows that the difference between the algorithms disappears with increasing sample size and for outliers further away from the centre. For outliers relatively close to the centre and for smaller sample sizes, however, the outlier search-based algorithms tend to perform better than the robust algorithms.
When the estimated standard deviation is used not only for Z-scores but also for a follow-up of the performance of the different peer groups, this standard deviation appears to be overestimated by every approach when outliers are present at a small distance from the centre. The latter can be explained by the low performance of all the algorithms with respect to outliers relatively close to the centre. We see, however, that here too the outlier search-based algorithms perform better than the robust algorithms when the outliers are more distant from the centre of the distribution. The low efficacy of robust estimators for sample sizes up to 6 has already been noted by Rousseeuw [21]. For the particular objective of this study, it can be added that robust estimators also underperform for samples of larger size, and that the stability of the estimated standard deviation is quite similar across the different approaches.
When considering NPV, PPV and the bias of the variability estimators together, we would recommend the outlier search-based algorithms over the robust approaches, certainly when sample sizes are small. If, however, a robust approach is preferred, we would recommend the Tukey approach for its simplicity, its unbiased estimator of variability and its high flagging rate when outliers are present, although its relatively low negative predictive value may make it useless for punitive EQA programmes.
To answer the question of the minimal sample size of a peer group before its members can be evaluated, two antagonistic arguments are involved. Firstly, NPV, PPV and the accuracy of the estimated standard deviation increase with increasing sample size; hence, larger sample sizes are preferred. Secondly, when only large peer groups are evaluated, many laboratories will escape evaluation; hence, from this perspective, smaller sample sizes are preferred. In our opinion, the Grubbs and Tukey approaches find the best compromise between these antagonistic arguments with a minimal sample size of 6, which is in line with previously published results [5]. The Grubbs approach should be applied with a high α (0.2) when sample sizes are small and a lower α (0.01–0.02) when sample sizes are larger (n ≥ 10). If the EQA organiser favours the ISO approach, we would definitely not recommend it for sample sizes below 10.
In conclusion, this study focussed on small sample sizes with one outlier added. When sample sizes increase and the probability of encountering multiple outliers becomes high, the outlier-searching algorithms applied here may suffer from masking effects when the data contain more outliers, i.e. the presence of an outlier may escape notice if a larger outlier is present. In this case, masking-free modifications of the Grubbs and Dixon tests may be applied [22,23].
References
[1] Plebani M. External quality assessment programs: past, present and future. Jugoslav Med Biohem 2005;24:201–6.
[2] International Organization for Standardization. ISO 13528:2005. Statistical methods for use in proficiency testing by interlaboratory comparisons; 2005.
[3] Thompson M, Ellison SLR, Wood R. The International Harmonised Protocol for the Proficiency Testing of Analytical Chemistry Laboratories. Pure Appl Chem 2006;78:145–96.
[4] Shiffler RE. Maximum Z scores and outliers. Am Stat 1988;42:79–80.
[5] Hund E, Massart DL, Smeyers-Verbeke J. Inter-laboratory studies in analytical chemistry. Anal Chim Acta 2000;423:145–65.
[6] Healy M. Outliers in clinical chemistry quality-control schemes. Clin Chem 1979;25:675–7.
[7] Rocke D. Robust statistical analysis of interlaboratory studies. Biometrika 1983;70:421–31.
[8] Heydorn K. The distribution of interlaboratory comparison data. Accredit Qual Assur 2008;13:723–4.
[9] Duewer DL. The distribution of interlaboratory comparison data: response to the contribution by K. Heydorn. Accredit Qual Assur 2008;13:725–6.
[10] Grubbs FE. Procedures for detecting outlying observations in samples. Technometrics 1969;11:1–21.
[11] Rosario P, Martínez JL, Silván JM. Comparison of different statistical methods for evaluation of proficiency test data. Accredit Qual Assur 2008;13:493–9.
[12] Dixon WJ. Analysis of extreme values. Ann Math Stat 1950:488–506.
[13] Tukey JW. Exploratory data analysis. Reading, MA; 1977.
[14] Sciacovelli L, Secchiero S, Zardo L, Plebani M. External Quality Assessment Schemes: need for recognised requirements. Clin Chim Acta 2001;309:183–99.
[15] Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. J Am Stat Assoc 1993;88(424):1273–83.
[16] Wilrich P-T. Robust estimates of the theoretical standard deviation to be used in interlaboratory precision experiments. Accredit Qual Assur 2007;12:231–40.
[17] Miller WG. Specimen materials, target values and commutability for external quality assessment (proficiency testing) schemes. Clin Chim Acta 2003;327:25–37.
[18] Lowthian PJ, Thompson M. Bump-hunting for the proficiency tester: searching for multimodality. Analyst 2002;127:1359–64.
[19] Thienpont LMR, Steyaert HLC, De Leenheer AP. A modified statistical approach for the detection of outlying values in external quality control: comparison with other techniques. Clin Chim Acta 1987;168:337–46.
[20] Zhou Q, Xu J, Xie W, Li S, Li X. Use of robust ZB and ZW to evaluate proficiency testing data. Clin Chim Acta 2011;412:936–9.
[21] Rousseeuw PJ, Verboven S. Robust estimation in very small samples. Comput Stat Data Anal 2002;40:741–58.
[22] Rosner B. On the detection of many outliers. Technometrics 1975;17:221–7.
[23] Jain RB. A recursive version of Grubbs test for detecting multiple outliers in environmental and chemical data. Clin Biochem 2010;43:1030–3.
586 W. Coucke et al. / Clinica Chimica Acta 413 (2012) 582–586