Outlier treatment (robust estimation)
Version 1.0 29-02-2012
Theme: Process Step: Estimation
5. Outlier treatment (robust estimation)
0. General information
0.1 Module name
5. Outlier treatment (robust estimation)
0.2 Module type
5. Outlier treatment (robust estimation)
0.3 Module code
Method:
Version history
Version  Date        Description of changes  Author          NSI  Person-id
1.0      29-02-2012  First version           Grazyna Dehnel  GUS  GD
2.0      31-05-2012  Second version          Grazyna Dehnel  GUS  GD
Template version used: 1.0 d.d. 25-3-2011
Print date: 21-5-2023 22:38
Contents
Outlier treatment (robust estimation)
1 General description
2 Examples
3 Glossary
4 Literature
Specific description
A.1 Purpose of the method
A.2 Recommended use of the method
A.3 Possible disadvantages of the method
A.4 Variants of the method
A.5 Input data sets
A.6 Logical preconditions
A.7 Tuning parameters
A.8 Recommended use of the individual variants of the method
A.9 Output data sets
A.10 Properties of the output data sets
A.11 Unit of processing
A.12 User interaction - not tool specific
A.13 Logging indicators
A.14 Quality indicators of the output data
A.15 Actual use of the method
A.16 Relationship with other modules
Outlier treatment (robust estimation)
1. General description
Databases created on the basis of statistical surveys very often contain values that differ
markedly from the rest of the data. The target variable may be highly skewed, and there may be
a large proportion of non-responses or extreme values. This means that one of the major problems
involved in estimation may be the distribution of the variables.
The occurrence of outliers and their detection is an important element of statistical analysis.
The effect of outliers on estimation can be significant, since in such situations estimators do
not retain properties such as unbiasedness or efficiency.
For this reason, in addition to using the classical approach, work should be carried out to
develop more robust methods that are not affected by the presence of non-typical data or
outliers.
This is true especially when estimation is carried out at a low level of aggregation. Outliers,
non-typical data or null values, however, are an integral part of each population and cannot be
dismissed in the analysis.
The module presents some nonstandard, robust estimation techniques. The following
methods of estimation are analyzed: modification of generalized regression estimator (GREG)
proposed by R. Chambers, H. Falvey, D. Hedlin, P. Kokic (2001), Winsor estimator as discussed
by C. Mackin, J. Preston (2002) and local regression presented by J.Y. Kim, F.J. Breidt, and J.D.
Opsomer (2001).
GREG estimation is frequently used in practice (see the module 2.3). In business statistics, the
perceived main advantage of GREG is that auxiliary information usually improves precision
considerably. But its application is also connected with a danger arising from outliers, whose
presence results in a significant bias of estimates (see Hedlin D., 2004). Several, more or less
direct modifications of GREG estimators are applied and discussed. Among those closely related
are model-based estimation using inverse transformation and Winsor estimation introducing
‘border’ points to distinguish outlier observations. As concerns the ‘further’ family, the module
presents local regression which can be used to accommodate local departures from the linear
model.
The modification of GREG estimator
In business statistics, which is characterised by huge diversification, direct Horvitz-
Thompson estimation procedures do not provide satisfactory results. Applying GREG estimation
that uses auxiliary information is perceived as a solution, which usually increases precision
considerably. Despite this advantage, D. Hedlin (2004) draws attention to how important good
modelling is. He indicates that for a set of real data, different GREG estimators might produce
wildly different results. The difference between them lies entirely in the choice of the model.
One of the assumptions underlying the linear regression model is that residuals have the
same variance, regardless of the value of the covariate. It is often the case, however, that this
condition doesn’t hold, which implies heteroscedasticity. It leads to inefficient estimates
of regression parameters.
This is why attempts are being made to develop new estimation methods that account
for heteroscedasticity. Violation of the assumptions of the linear regression model can
sometimes be avoided by a kind of inverse transformation. It can be exemplified by a
modification of the GREG estimator proposed by R. Chambers, H. Falvey, D. Hedlin, P. Kokic
[2001].
The modification of the GREG estimator involves including an additional covariate $z_i^{\gamma}$
in the model. Under the classical model, the GREG estimator of the total for the study variable is
$$\hat{Y}_{GREG} = \sum_{i \in U} \hat{y}_i + \sum_{i \in s} w_i e_i \tag{1}$$
where $\hat{y}_i = x_i' \hat{\beta}$ and $e_i = y_i - \hat{y}_i$ is the residual. The model parameter
$\beta$ is estimated on the basis of a modified formula, which accounts for an additional
covariate $z$ [Chambers, Falvey, Hedlin and Kokic, 2001]:
$$\hat{\beta} = \Big(\sum_{i \in s} w_i x_i x_i' / z_i^{\gamma}\Big)^{-1} \Big(\sum_{i \in s} w_i x_i y_i / z_i^{\gamma}\Big) \tag{2}$$
where:
$z_i^{\gamma}$ - covariate in the regression model of y on x assuming heteroscedasticity,
γ - parameter which describes the degree of heteroscedasticity.
The dispersion of points determined on the basis of empirical values of the variables x
and y on the scatter plot affects the value of parameter γ. An increasing value of parameter γ
corresponds to a stronger tendency to scatter away from the regression line. The practice of
statistical surveys indicates that in a number of populations one is faced with
heteroscedasticity, which means that an increase in the value of the covariate is accompanied
by a rise in the residual variance. This means that parameter γ is greater than zero. Survey
results indicate that it usually assumes values within the interval [1, 2] [Brewer, 1963]. Its size
is determined on the basis of earlier surveys, a pilot survey or sample-based estimates [Särndal,
Swensson, Wretman, 1992, p. 255].
A modification of the GREG estimator expressed by formula (1) can be presented in the form
labeled (3), which has the same form as the classical GREG estimator:
$$\hat{Y}_{GREG} = \sum_{i \in s} w_i g_i y_i \tag{3}$$
where:
$g_i$ – weight of the ith individual observation defined as:
$$g_i = 1 + (X - \hat{X}_{DIR})' \Big(\sum_{i \in s} w_i x_i x_i' / z_i^{\gamma}\Big)^{-1} (x_i / z_i^{\gamma}) \tag{4}$$
$\hat{Y}_{GREG}$ – estimate of the total obtained by applying the GREG estimator,
$\hat{X}_{DIR}$ – Horvitz-Thompson estimate of the total value of the auxiliary variable x,
$X$ – total value of the auxiliary variable x,
x, z – auxiliary variables,
γ – coefficient characterising the degree of heteroscedasticity; for γ = 0 the original GREG
estimator is obtained.
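Formulas (3) and (4) can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function names `modified_greg_weights` and `modified_greg_total` and the argument layout are assumptions made here, not part of the module.

```python
import numpy as np

def modified_greg_weights(x, z, w, x_total, gamma):
    """g-weights of formula (4).

    x: (n, p) auxiliary values of sampled units, z: (n,) positive covariate,
    w: (n,) sampling weights, x_total: (p,) known population totals of x.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    scaled = x / (z ** gamma)[:, None]       # rows x_i / z_i^gamma
    a = (w[:, None] * x).T @ scaled          # sum_i w_i x_i x_i' / z_i^gamma
    x_dir = w @ x                            # Horvitz-Thompson estimate of X
    correction = np.linalg.solve(a, np.asarray(x_total, dtype=float) - x_dir)
    return 1.0 + scaled @ correction

def modified_greg_total(y, x, z, w, x_total, gamma):
    """Estimate of the total according to formula (3)."""
    g = modified_greg_weights(x, z, w, x_total, gamma)
    return float(np.sum(np.asarray(w, dtype=float) * g * np.asarray(y, dtype=float)))
```

With these weights the weighted sample totals of x reproduce the known totals X for any γ, which is a convenient sanity check when experimenting with γ = 0, 1, 1.5 and 2.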
The only difference is the definition of the weight $g_i$, which depends on the value of the
auxiliary variable x for sampled observations.
Since other surveys suggest that coefficient γ should lie in the interval 1 ≤ γ ≤ 2
[Särndal, Swensson, Wretman, 1992, p. 255], the following estimators are analyzed:
(a) $\hat{Y}_{DIR}$ – direct Horvitz-Thompson estimator (HT),
(b) $\hat{Y}_{GREG_0}$ – γ = 0 ⇒ $z_i^{0}$, regression estimator based on the linear regression model
assuming homoscedasticity (GREG estimator),
(c) $\hat{Y}_{GREG_1}$ – γ = 1 ⇒ $z_i^{1}$,
(d) $\hat{Y}_{GREG_{1.5}}$ – γ = 1.5 ⇒ $z_i^{1.5}$,
(e) $\hat{Y}_{GREG_2}$ – γ = 2 ⇒ $z_i^{2}$.
The estimators denoted by (c), (d) and (e) are regression estimators based on the linear
regression model assuming heteroscedasticity. The Horvitz-Thompson estimator (HT) was
used as the reference when comparing the effectiveness of the GREG estimator and
its modifications. The second estimator, (b) $\hat{Y}_{GREG_0}$, takes the form of the GREG
estimator, but its value differs slightly from that given by the original GREG formula. The
difference results from the fact that not all observations drawn to the sample are used for
estimation: observations for which the auxiliary variable ‘z’ is equal to zero¹ are omitted. In the
case of the remaining three estimators, the linear regression model assumes heteroscedasticity of a
degree indicated by the γ coefficient. The modification assumes that each of the auxiliary variables
might take both roles, of ‘x’ and ‘z’, provided that ‘z’ does not equal zero.
¹ The variable ‘z’ is assumed not to take the value zero. Nevertheless, it happens in practice that zero values occur for variables for which they are not expected (for example, the revenue of an operating enterprise might be equal to zero).
As R.L. Chambers, H. Falvey, D. Hedlin, P. Kokic [2001] point out, regression estimators
accounting for heteroscedasticity can sometimes produce significantly biased estimates. The
correction term, also known as ‘bias adjustment’, can be much larger than the model-based
component. In extreme cases, this can produce negative estimates for variables which, by
definition, should have positive values. This situation can be due to model misspecification, on
the one hand, and a strong effect of outliers, on the other. Outliers significantly affect model
parameters and, consequently, absolute values of weights assigned to outliers are
disproportionately large. Various solutions are proposed in order to prevent this. They can only
be used, however, once outliers have been identified. To this end, one can use a measure of
distance [Konarski, 2004]. The most common measure is $DFBETA_i$:
$$DFBETA_i = \Big(\sum_{i \in s} x_i x_i'\Big)^{-1} x_i \, \frac{e_i}{1 - x_i' \big(\sum_{i \in s} x_i x_i'\big)^{-1} x_i} \tag{5}$$
Its standardized form is given by the following formula [Belsley, Kuh, Welsch, 1980]:
$$DFBETAS_i = \frac{\hat{\beta} - \hat{\beta}_{(-i)}}{\sqrt{MSE_{(-i)}}\,\hat{\beta}} = \frac{DFBETA_i}{\sqrt{MSE_{(-i)}}\,\hat{\beta}} \tag{6}$$
where:
$\sqrt{MSE_{(-i)}}$ - estimation error for the regression model after removing the i-th observation,
$\hat{\beta}$ - an estimate of parameter β,
$\hat{\beta}_{(-i)}$ - an estimate of parameter β after removing the i-th observation.
An observation is regarded as an outlier if it meets the condition $|DFBETAS_i| > 2$ [Fox, 1991],
or $|DFBETAS_i| > 2/\sqrt{n}$ for large samples [Belsley, Kuh, Welsch, 1980].
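Formula (5) gives the change in the estimated coefficients caused by deleting one observation without refitting the model. A minimal sketch in Python with NumPy follows; the function name `dfbeta` is an assumption made for illustration, and the design matrix is assumed to already contain any constant column.

```python
import numpy as np

def dfbeta(x, y):
    """Row i holds beta_hat - beta_hat_(-i), computed with formula (5).

    x: (n, p) design matrix (include a column of ones for an intercept),
    y: (n,) study variable.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(x.T @ x)
    beta = xtx_inv @ x.T @ y
    e = y - x @ beta                                 # residuals e_i
    h = np.einsum("ij,jk,ik->i", x, xtx_inv, x)      # x_i'(X'X)^{-1} x_i
    return (x @ xtx_inv) * (e / (1.0 - h))[:, None]
```

Each row can be compared with the difference obtained by actually refitting the model without the corresponding unit; the two should agree up to rounding.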
In the modified version of GREG estimation proposed by R.L. Chambers, H. Falvey, D. Hedlin,
P. Kokic [2001], DFBETA is strongly correlated with weight $g_i$. This means that the measure of
distance affects the degree of influence on variable y. DFBETA for the i-th observation in the
modified GREG estimator can be determined using the following formula [Chambers, Falvey,
Hedlin and Kokic, 2001]:
$$DFBETA_i = \Big(\sum_{i \in s} x_i x_i' / z_i^{\gamma}\Big)^{-1} \Big(\frac{x_i}{z_i^{\gamma}}\Big) \Big(\frac{e_i}{1 - h_i}\Big) \tag{7}$$
where:
$$h_i = (x_i / z_i^{\gamma})' \Big(\sum_{i \in s} x_i x_i' / z_i^{\gamma}\Big)^{-1} (x_i / z_i^{\gamma}) \tag{8}$$
Presented below are three measures that can be used to identify outliers, which account for
the number of degrees of freedom in the regression model. One of them was introduced by R.
D. Cook [1977]:
$$D_i = \frac{e_i^2}{MSE\,(k+1)} \cdot \frac{h_i}{(1 - h_i)^2} \tag{9}$$
where:
k – number of auxiliary variables,
$h_i$ - distance of the ith observation from the mean value of variable x, defined as:
$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \tag{10}$$
Observations are regarded as outliers if they meet the condition $D_i > \frac{4}{n-k-1}$.
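For a single auxiliary variable (k = 1), formulas (9) and (10) and the cut-off 4/(n−k−1) can be sketched as follows. The function name `cooks_distance` and the restriction to a simple regression with an intercept are assumptions made for this illustration.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance (9) with leverage (10) for a simple regression of y on x.

    Returns the distances and a boolean array marking D_i > 4 / (n - k - 1).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = len(x), 1
    dx = x - x.mean()
    h = 1.0 / n + dx**2 / np.sum(dx**2)                    # formula (10)
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
    intercept = y.mean() - slope * x.mean()
    e = y - (intercept + slope * x)                        # residuals
    mse = np.sum(e**2) / (n - k - 1)
    d = e**2 / (mse * (k + 1)) * h / (1.0 - h) ** 2        # formula (9)
    return d, d > 4.0 / (n - k - 1)
```

Calling `cooks_distance(x, y)` on data where one unit lies far from the line fitted to the others should flag that unit, since both its residual and its leverage enter the measure.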
The second formula to identify outliers was developed by D. A. Belsley, E. Kuh and R. E. Welsch
[1980]:
$$DFFITS_i = \frac{e_i \sqrt{\dfrac{h_i}{1 - h_i}}}{\sqrt{MSE_{(-i)}}\,\sqrt{1 - h_i}} \tag{11}$$
where $h_i$ denotes the distance of the i-th observation from the mean value of variable x defined as
(10). An observation is regarded as an outlier if it meets the condition
$|DFFITS_i| > 2\sqrt{(k+1)/(n-k-1)}$.
The third measure accounts for the influence of the ith observation on the results of
regression analysis through its effect on MSE values of estimated regression coefficients
[Belsley, Kuh, Welsch, 1980].
$$COVRATIO_i = \frac{1}{\left(\dfrac{n-k-2+t_i^2}{n-k-1}\right)^{k+1}} \cdot \frac{1}{(1 - h_i)} \tag{12}$$
where $h_i$ is given by formula (10) and
$$t_i = \frac{e_i}{\sqrt{MSE_{(-i)}}\,\sqrt{1 - h_i}} \tag{13}$$
If $COVRATIO_i < 1$, deleting the ith observation will reduce the MSE values of estimated
regression coefficients. On the other hand, if $COVRATIO_i > 1$, deleting the ith observation will
result in higher MSE values of estimated regression coefficients. It is recommended that critical
values be used which account for sample size, i.e. $|COVRATIO_i - 1| > 3(k+1)/n$.
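Formulas (11) to (13) can be evaluated without refitting the model n times, because the leave-one-out MSE follows from the full-sample residuals and leverages. The sketch below assumes a regression with an intercept and k auxiliary variables; the function name `dffits_covratio` is an assumption, and the leverage is taken from the hat matrix, which coincides with formula (10) when there is a single auxiliary variable.

```python
import numpy as np

def dffits_covratio(x, y):
    """DFFITS (11) and COVRATIO (12) for a regression of y on x with intercept.

    x: (n,) or (n, k) auxiliary variables without the constant column, y: (n,).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    n, k = x.shape
    design = np.column_stack([np.ones(n), x])
    xtx_inv = np.linalg.inv(design.T @ design)
    e = y - design @ (xtx_inv @ design.T @ y)               # residuals
    h = np.einsum("ij,jk,ik->i", design, xtx_inv, design)   # leverages h_i
    mse = e @ e / (n - k - 1)
    # Leave-one-out MSE from the standard deletion identity (no refitting).
    mse_del = ((n - k - 1) * mse - e**2 / (1.0 - h)) / (n - k - 2)
    t = e / np.sqrt(mse_del * (1.0 - h))                    # formula (13)
    dffits = t * np.sqrt(h / (1.0 - h))                     # formula (11)
    covratio = 1.0 / (((n - k - 2 + t**2) / (n - k - 1)) ** (k + 1) * (1.0 - h))
    return dffits, covratio
```

The critical values quoted in the text can then be applied directly, for example `np.abs(dffits) > 2 * np.sqrt((k + 1) / (n - k - 1))` and `np.abs(covratio - 1) > 3 * (k + 1) / n`.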
Applying one of the above measures makes it possible to identify outliers. As was
mentioned earlier, in the presence of outliers, the modified GREG estimator may grossly
underestimate or overestimate the population totals. Various solutions are used to cope with
this problem, such as limiting the weight variability, replacing outliers with the median or the
use of post-stratification [Chambers, Falvey, Hedlin and Kokic, 2001].
Post-stratification involves a subjective division of a sample into 3 strata, the first of which
contains outliers with respect to variable ‘x’ (high values of variable ’x’, low values of variable
’y’), the third one contains outliers with respect to ‘y’ (high values of variable ‘y’, low values of
variable ‘x’), whereas the second one contains the remaining observations. Optimally, the first
and third stratum should be as small as possible (considering problems of estimating variance,
the number of observations should not be less than 20) while the second stratum should be as
large as possible. Estimation is conducted independently for each stratum.
The Winsor estimation
Another approach suggested in the literature consists in modifying values in the sample
so that the estimator becomes robust and isn’t affected by large residuals [Kokic, Bell, 1994;
Chambers 1996]. Sampled units whose values lie outside certain preset cut-off values are
transformed in order to bring them closer to the cut-off value. This approach is exemplified by
Winsor estimation, which was applied for the first time in a survey conducted by D.T. Searls
[1966]. Sampled units are divided into two groups. One of them contains typical observations,
which are used to fit the model; the other contains observations regarded as outliers. The
division is made on the basis of one or two preset cut-off values. Then, values of the study
variable outside the cut-off values are transformed so that they are no longer regarded as
outliers. It should be stressed, however, that the modified values are artificial and can sometimes
be unacceptable. As a result of Winsor estimation, we obtain a "new" sample, in which
untypical observations have been replaced with typical ones. Further calculations are conducted
for the modified sample. Any kind of estimation can be used at this stage, such as
Horvitz-Thompson or GREG.
The use of Winsor estimation reduces estimator variance while, at the same time, it may
introduce bias. However, the decline in variance is big enough to offset the contribution of the
bias to the MSE [Hedlin, 2004].
The main difficulty, then, lies in the choice of cut-off values for dividing observations in
the sample. The optimum selection has a strong bearing on estimation quality.
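The Winsor estimator can be expressed as:
$$\hat{Y}_{win} = \sum_{i \in s} \tilde{w}_i y_i^{*} \tag{14}$$
where modified values of the study variable $y_i^{*}$ are calculated in the following manner
[Gross, Bode, Taylor, Lloyd-Smith, 1986]:
$$y_i^{*} = \begin{cases}
\Big(\dfrac{1}{\tilde{w}_i}\Big) y_i + \Big(1 - \dfrac{1}{\tilde{w}_i}\Big) K_{Ui} & \text{if } y_i > K_{Ui} \\
y_i & \text{if } K_{Li} \le y_i \le K_{Ui} \\
\Big(\dfrac{1}{\tilde{w}_i}\Big) y_i + \Big(1 - \dfrac{1}{\tilde{w}_i}\Big) K_{Li} & \text{if } y_i < K_{Li}
\end{cases} \tag{15}$$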
where:
$U = \{1, \ldots, i, \ldots, N\}$ - general population of size N,
$\tilde{w}_i = w_i g_i$,
$w_i$ - sampling weights,
$g_i$ - weights dependent on the value of an auxiliary variable for sampled units,
$K_{Ui}$ - upper cut-off value,
$K_{Li}$ - lower cut-off value.
Based on formula (14), a unit drawn into the sample can be regarded as representing itself and
$(\tilde{w}_i - 1)$ non-sampled units. Hence, according to formula (15), an
observation regarded as an outlier contributes its unweighted value, while the non-sampled
units, represented by the remainder of the weight, $(\tilde{w}_i - 1)$, contribute the preset upper
or lower cut-off value.
Two kinds of Winsor estimators can be distinguished: Type I and Type II. The difference between
them consists in the treatment of outliers. In the case of the Type II estimator, as weight $\tilde{w}_i$
decreases, the contribution of outliers increases: the modified value of the study variable
"approaches" the value of the outlier, i.e. the real value of the variable. The Type I Winsor
estimator is based on an arbitrary assumption whereby the ratio $1/\tilde{w}_i$ is equal to zero
(cf. formula 15) and, as a result, any outlier always takes on the cut-off value [Kokic, Bell, 1994].
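The transformation in formula (15) and the two types of winsorization can be sketched as below. The function names `winsorize` and `winsor_total` and the `type_two` switch are assumptions made for this illustration; the cut-off values are taken as given.

```python
import numpy as np

def winsorize(y, w_tilde, k_lower, k_upper, type_two=True):
    """Modified values y* of formula (15).

    Type II keeps the share 1/w~ of the observed value; Type I sets that
    share to zero, so an outlier is replaced by the cut-off value itself.
    """
    y = np.asarray(y, dtype=float)
    w_tilde = np.asarray(w_tilde, dtype=float)
    keep = 1.0 / w_tilde if type_two else np.zeros_like(w_tilde)
    y_star = np.where(y > k_upper, keep * y + (1.0 - keep) * k_upper, y)
    y_star = np.where(y < k_lower, keep * y + (1.0 - keep) * k_lower, y_star)
    return y_star

def winsor_total(y, w_tilde, k_lower, k_upper, type_two=True):
    """Winsor estimate of the total, formula (14)."""
    w_tilde = np.asarray(w_tilde, dtype=float)
    return float(np.sum(w_tilde * winsorize(y, w_tilde, k_lower, k_upper, type_two)))
```

For example, with a weight of 4 and an upper cut-off of 20, an observed value of 100 becomes 100/4 + (3/4)·20 = 40 under Type II and 20 under Type I.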
Cut-off values are calculated to minimize MSE [Preston, Mackin, 2002]:
$$K_{Ui} = \mu_i^{*} - \frac{B_U}{(\tilde{w}_i - 1)} \tag{16}$$
$$K_{Li} = \mu_i^{*} - \frac{B_L}{(\tilde{w}_i - 1)} \tag{17}$$
where:
$B_U = E[\hat{Y}_{winU} - \hat{Y}_{DIR}]$ - bias of the estimator $\hat{Y}_{winU}$,
$B_L = E[\hat{Y}_{winL} - \hat{Y}_{DIR}]$ - bias of the estimator $\hat{Y}_{winL}$,
$\hat{Y}_{winU}$ - Winsor estimator of the population total when only upper winsorization is performed,
$\hat{Y}_{winL}$ - Winsor estimator of the population total when only lower winsorization is performed.
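Considering the fact that $\mu_i^{*}$ is difficult to estimate, in practice it is assumed that
$\mu_i^{*} = \hat{\mu}_i = \hat{\beta} x_i$ [Preston, Mackin, 2002]. Hence, cut-off values are estimated
based on the following formulas:
$$\hat{K}_{Ui} = \hat{\mu}_i - \frac{B_U}{(\tilde{w}_i - 1)} = \hat{\mu}_i + \frac{G}{(\tilde{w}_i - 1)}, \quad \text{where } G = -B_U \tag{18}$$
$$\hat{K}_{Li} = \hat{\mu}_i - \frac{B_L}{(\tilde{w}_i - 1)} = \hat{\mu}_i + \frac{H}{(\tilde{w}_i - 1)}, \quad \text{where } H = -B_L \tag{19}$$
where:
$\hat{\mu}_i = \hat{\beta} x_i$ - a robust estimate of regression parameter $\mu_i^{*}$.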
To calculate them, one can use the function $\psi_U(\hat{D}_{(k)})$ [Kokic, Bell, 1994]:
$$\psi_U(\hat{D}_{(k)}) = \hat{D}_{(k)} - \sum_{i \in s} \max\{\hat{D}_i - \hat{D}_{(k)}, 0\} = (k+1)\,\hat{D}_{(k)} - \sum_{j=1}^{k} \hat{D}_{(j)} \tag{20}$$
where:
$\hat{D}_i = (Y_i - \hat{\mu}_i)(\tilde{w}_i - 1)$ - an estimate of the weighted residual
$D_i = (Y_i - \mu_i^{*})(\tilde{w}_i - 1)$ for unit i drawn into the sample,
$(k)$ - a number assigned to the unit drawn into the sample after ordering all units in the sample
according to non-ascending estimated residuals $\hat{D}_i$.
Solving the equation $\psi_U(\hat{D}_{(k)}) = 0$ requires $\hat{D}_i$ to be ordered in such a way that
$\hat{D}_{(1)} \ge \hat{D}_{(2)} \ge \ldots \ge 0 \ge \ldots$. Consecutive ordered values of estimated
residuals are assigned numbers $(j)$, where $(j) = (1), \ldots, (k)$.
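The second equality in formula (20) follows from the ordering of the residuals. The short derivation below is added as a worked step under that ordering; it is not taken from the cited sources.

```latex
\begin{aligned}
\sum_{i \in s} \max\{\hat{D}_i - \hat{D}_{(k)}, 0\}
  &= \sum_{j=1}^{k} \bigl(\hat{D}_{(j)} - \hat{D}_{(k)}\bigr)
   = \sum_{j=1}^{k} \hat{D}_{(j)} - k\,\hat{D}_{(k)}, \\
\psi_U(\hat{D}_{(k)})
  &= \hat{D}_{(k)} - \sum_{j=1}^{k} \hat{D}_{(j)} + k\,\hat{D}_{(k)}
   = (k+1)\,\hat{D}_{(k)} - \sum_{j=1}^{k} \hat{D}_{(j)}, \\
\psi_U(\hat{D}_{(k+1)}) - \psi_U(\hat{D}_{(k)})
  &= (k+1)\bigl(\hat{D}_{(k+1)} - \hat{D}_{(k)}\bigr) \le 0 .
\end{aligned}
```

Only the k largest residuals exceed or equal the k-th ordered value, which gives the first line; the last line shows that the function is non-increasing in k, so it changes sign at most once.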
By solving $\psi_U(G) = 0$ one can obtain the value of G. In practice, since it is difficult to find the
right solution of the equation, two methods are proposed. According to the first one, G is
estimated using the formula:
$$\hat{G} = \frac{1}{(k^{*}+1)} \sum_{j=1}^{k^{*}} \hat{D}_{(j)} \tag{21}$$
where $k^{*}$ is the last value of k for which the value of $\psi_U(\hat{D}_{(k)})$ is non-negative.
The second approach, which was used in this study, involves using linear interpolation between
$\frac{1}{(k^{*}+1)} \sum_{j=1}^{k^{*}} \hat{D}_{(j)}$ and $\frac{1}{(k^{*}+2)} \sum_{j=1}^{k^{*}+1} \hat{D}_{(j)}$.
Then, G can be expressed as [Preston, Mackin, 2002]:
$$\hat{G} = \frac{\psi_U(\hat{D}_{(k^{*}+1)}) \Big[\frac{1}{(k^{*}+1)} \sum_{j=1}^{k^{*}} \hat{D}_{(j)}\Big] - \psi_U(\hat{D}_{(k^{*})}) \Big[\frac{1}{(k^{*}+2)} \sum_{j=1}^{k^{*}+1} \hat{D}_{(j)}\Big]}{\psi_U(\hat{D}_{(k^{*}+1)}) - \psi_U(\hat{D}_{(k^{*})})} \tag{22}$$
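Both ways of estimating G can be sketched as follows. The function name `estimate_g` is an assumption made here; the input is the vector of estimated weighted residuals, and the sketch assumes that at least one of them is non-negative.

```python
import numpy as np

def estimate_g(d_hat):
    """Return (G from formula (21), G from formula (22)).

    d_hat: estimated weighted residuals D_i = (Y_i - mu_i) * (w~_i - 1).
    """
    d = np.sort(np.asarray(d_hat, dtype=float))[::-1]   # non-ascending order
    csum = np.cumsum(d)                                 # sums of D_(1..k)
    k = np.arange(1, len(d) + 1)
    psi = (k + 1) * d - csum                            # formula (20)
    if psi[0] < 0:
        raise ValueError("all weighted residuals are negative")
    k_star = int(np.nonzero(psi >= 0)[0][-1]) + 1       # last k with psi >= 0
    g_first = csum[k_star - 1] / (k_star + 1)           # formula (21)
    if k_star == len(d):
        return g_first, g_first                         # nothing to interpolate with
    g_next = csum[k_star] / (k_star + 2)
    psi_k, psi_next = psi[k_star - 1], psi[k_star]
    g_interp = (psi_next * g_first - psi_k * g_next) / (psi_next - psi_k)  # (22)
    return g_first, g_interp
```

For the residuals 10, 4, 1 and −2 the function changes sign after the first ordered value, so the first method gives 10/2 = 5 and the interpolated value lies between 14/3 and 5.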
The value of H can be computed analogously. Estimates of weighted residuals
$\hat{D}_i = (Y_i - \hat{\mu}_i)(\tilde{w}_i - 1)$ are ordered ascendingly,
$\hat{D}_{[1]} \le \hat{D}_{[2]} \le \ldots \le 0 \le \ldots$. Function $\psi_L(\hat{D}_{[m]})$ assumes the
form given by the formula:
$$\psi_L(\hat{D}_{[m]}) = \hat{D}_{[m]} - \sum_{i \in s} \min\{\hat{D}_i - \hat{D}_{[m]}, 0\} = (m+1)\,\hat{D}_{[m]} - \sum_{l=1}^{m} \hat{D}_{[l]} \tag{23}$$
where:
$[l]$ - a number assigned to the unit drawn into the sample after ordering all units in the sample
by estimated residuals $\hat{D}_i$, $[l] = [1], \ldots, [m]$.
H can thus be expressed as (cf. formula (22)) [Preston, Mackin, 2002]:
$$\hat{H} = \frac{\psi_L(\hat{D}_{[m^{**}+1]}) \Big[\frac{1}{(m^{**}+1)} \sum_{l=1}^{m^{**}} \hat{D}_{[l]}\Big] - \psi_L(\hat{D}_{[m^{**}]}) \Big[\frac{1}{(m^{**}+2)} \sum_{l=1}^{m^{**}+1} \hat{D}_{[l]}\Big]}{\psi_L(\hat{D}_{[m^{**}+1]}) - \psi_L(\hat{D}_{[m^{**}]})} \tag{24}$$
where $m^{**}$ is the last value of m for which the value of $\psi_L(\hat{D}_{[m]})$ is non-positive.
In order to estimate cut-off values $K_{Ui}$ and $K_{Li}$, in addition to the above bias parameters
$G = -B_U$ and $H = -B_L$ it is necessary to compute $\hat{\mu}_i = \hat{\beta} x_i$, which is an estimate
of $\mu_i^{*}$. For this purpose robust regression methods can be used. Those recommended in the
literature [Preston, Mackin, 2002] include: Trimmed least squares (TLS), Trimmed least absolute
value (LAV), Sample Splitting, Least median of squares (LMS).
The method of Trimmed least squares (TLS) involves fitting an Ordinary Least Squares (OLS)
regression model to minimise the function:
$$F = \sum_{i \in s} (y_i - \beta^{T} x_i)^2 \tag{25}$$
Based on the model, theoretical values are calculated, and then residuals. In the next step,
units with the largest positive and negative residuals are removed. As a rule, the sample is
reduced by about 5%. A new regression model is fitted to the reduced sample in order to
estimate the value of $\mu_i^{*}$. One advantage of TLS is that it is quick to run and simple.
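The TLS steps described above can be sketched with NumPy's least-squares solver. The function name `trimmed_least_squares`, the inclusion of an intercept and the rule of removing the same number of units from each tail are assumptions made for this illustration.

```python
import numpy as np

def trimmed_least_squares(x, y, trim_fraction=0.05):
    """Fit OLS, drop the largest positive and negative residuals, refit.

    Returns the refitted coefficients (intercept first) and the fitted
    values mu_hat for all units of the original sample.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    # Assumption: half of the trimmed units come from each tail, at least one per tail.
    per_tail = max(1, int(round(trim_fraction * len(y) / 2)))
    order = np.argsort(residuals)
    keep = order[per_tail:len(y) - per_tail]
    beta_trimmed, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
    return beta_trimmed, design @ beta_trimmed
```

The fitted values returned for all units can serve as the estimates of the expected values needed when the cut-off values are computed.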
Another method used in robust regression is Trimmed least absolute value (LAV). It consists
in fitting a regression model to minimise the function:
$$F = \sum_{i \in s} |y_i - \beta^{T} x_i| \tag{26}$$
The model is used to calculate theoretical values and then residuals. As is the case in the TLS
method, units with the largest positive and negative residuals are removed. A new regression
model is fitted to the reduced sample. It is expected that the LAV method is a more robust
regression model than the TLS technique because large residuals, which are not squared, have
less influence on the regression parameters.
Another example of robust regression is Sample Splitting Technique based on Ordinary Least
Squares (OLS). It is applied to data that has been randomly split into two halves. A regression
model is fitted to each half of the data while the residuals are calculated using the model
applied to the half of the data that was not used to fit the model. Then, after merging the data,
units with the largest positive and negative residuals are removed. The process is repeated until
a certain percentage of data has been deleted. The SS technique is expected to be more robust
than TLS because the residuals used to remove the ‘outlier’ units are not calculated from a
regression model that has been generated using these ‘outlier’ units.
The list of robust regression techniques cannot be complete without the Least median of
squares (LMS) technique. It was described by Rousseeuw and Leroy [2003]. It resembles the
bootstrap method. It involves drawing subsamples of size n – 1 from a sample of size n using
simple random sampling with replacement. For each subsample trial regression model
parameters are calculated and then their squared residuals, which are used to calculate the
median. The model with the smallest median of squared residuals is selected. The LMS
technique should be more robust than TLS because an OLS regression model is fitted in the
absence of "outlier" units, without totally removing these ‘outlier’ units.
Local regression estimation
Apart from the modified GREG or Winsor estimation, among other methods that are less
sensitive to outliers, the local regression technique is often discussed.
Unlike classical regression, where one model is fitted to the whole sample, local regression
consists in fitting many local models for different ranges of observations in the sample based on
kernel regression. The standard local estimator can be obtained by replacing the first term of
the GREG estimator, which is an estimator of a study variable based on the regression model
($\hat{y}_i$), with an estimate of the study variable obtained using kernel regression
($\hat{y}_{loc,i}$). Thanks to this solution, the estimator is less sensitive to outliers and to a
non-linear relationship between the study and the auxiliary variable. A standard local regression
estimator can be expressed as [Breidt, Opsomer, 2000]:
$$\hat{Y}_{loc} = \sum_{i \in U} \hat{y}_{loc,i} + \sum_{i \in s} w_i (y_i - \hat{y}_{loc,i}) \tag{27}$$
or as an estimator based on a model where estimates are made for each observation from the
general population [Chambers, Dorfman, Wehrly, 1993; Dorfman, 2000]:
$$\hat{y}_{loc,i} = c_j' (D_i' W_i D_i)^{-1} D_i' W_i y_s, \quad i = 1, 2, \ldots, N \tag{28}$$
where: U denotes the population and s is the sample,
$c_j'$ is the vector with 1 in the jth position and 0s otherwise,
$D_i$, i = 1, 2, …, N, are n x 2 matrices, each with $[1 \;\; (x_j - x_i)]$ in the jth row, j = 1, 2, …, n,
$W_i$, for i = 1, 2, …, N, are n x n diagonal matrices with $w_i b_i^{-1} K[(x_j - x_i) b_i^{-1}]$ in cell (j, j), where
$K(\cdot)$ is the kernel function and $b_i$ is the bandwidth for the ith observation.
The local regression estimator is based on the value of $\hat{y}_{loc,i}$, which, in many cases, is close
to $\hat{y}_i$, calculated using the standard GREG estimator. The difference between them is that using
kernel regression with $\hat{y}_{loc,i}$, which relies on many models fitted to subsamples, it is possible to
account for local changes in the study variable, which can’t be done in the case of a single
regression model of the GREG estimator.
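This prediction (28) can be written as [Hedlin, 2004]:
$$\hat{y}_{loc,i} = \bar{y}_{loc,i} + (x_i - \bar{x}_{loc}) \frac{\sum_{j \in s} q_{ji} (x_j - \bar{x}_{loc})\, y_j}{\sum_{j \in s} q_{ji} (x_j - \bar{x}_{loc})^2} \tag{29}$$
$$q_{ji} = \max\Big[0, \tfrac{3}{4}\Big(1 - \Big(\frac{x_j - x_i}{b_i}\Big)^2\Big)\Big] \tag{30}$$
where:
n – sample size, i = 1, 2, …, n,
$q_{ji}$ - diagonal elements in the matrix $W_i$,
$$\bar{y}_{loc,i} = \sum_{j \in s} q_{ji} y_j \Big(\sum_{j \in s} q_{ji}\Big)^{-1} \tag{31}$$
$$\bar{x}_{loc} = \sum_{j \in s} q_{ji} x_j \Big(\sum_{j \in s} q_{ji}\Big)^{-1} \tag{32}$$
It is worth noting that $\bar{y}_{loc,i}$ is an estimate of the study variable for the ith observation made
on the basis of local estimation without the x-variable.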
The effectiveness of local regression strongly depends on the choice of the kernel function
and the bandwidth. Different approaches are suggested in the literature [Chambers, Dorfman,
Wehrly, 1993; Chambers, 1996; Kim, Breidt, Opsomer, 2001]. The following forms of the
function of variable u can be used in local regression:
- Uniform: $K(u) = 1$
- Triangular: $K(u) = (1 - |u|)$
- Quartic (biweight): $K(u) = \frac{15}{16}(1 - u^2)^2$
- Cosine: $K(u) = \frac{\pi}{4} \cos\big(\frac{\pi}{2} u\big)$
- Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} u^2}$
- Epanechnikov: $K(u) = \frac{3}{4}(1 - u^2)$
The most commonly used one, however, is the Epanechnikov kernel in the form [Hedlin, 2004]:
$$K(u_{ji}) = \max\Big[0, \tfrac{3}{4}(1 - u_{ji}^2)\Big] = \max\Big[0, \tfrac{3}{4}\Big(1 - \Big(\frac{x_j - x_i}{b_i}\Big)^2\Big)\Big] \tag{33}$$
where: $u_{ji} = (x_j - x_i)/b_i$ for i = 1, 2, …, n and j = 1, 2, …, n,
$b_i$ - bandwidth for the ith unit.
The kernel function defines a window for each sampled unit. Observations outside this window
aren’t used to estimate $m_i$. It is worth noting that:
$$K(u_{ji}) = 0 \quad \text{if } |u_{ji}| = |x_j - x_i| / b_i \ge 1 \tag{34}$$
Thus, the weight selection process is repeated for all observations in the sample. First, a
window is defined for each observation. Observations outside the window for the ith unit are
assigned a weight equal to 0. Other observations, included in the window for the ith unit, are
given positive weights. The weight value depends on how much the auxiliary variable differs
from its corresponding value for the ith unit. The largest weights are assigned to units with
auxiliary variable values close to $x_i$.
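The weighting scheme and the local prediction in formulas (29) to (33) can be sketched for one auxiliary variable as below. The function names `epanechnikov` and `local_prediction` are assumptions; following formulas (29) to (32), the sketch uses the kernel weights only and leaves out the sampling weights.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel of formula (33); zero outside the window |u| < 1."""
    u = np.asarray(u, dtype=float)
    return np.maximum(0.0, 0.75 * (1.0 - u**2))

def local_prediction(x_i, x, y, bandwidth):
    """Local prediction of y at x_i following formulas (29)-(32)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    q = epanechnikov((x - x_i) / bandwidth)          # q_ji, formula (30)
    if q.sum() == 0.0:
        raise ValueError("no sampled unit falls inside the window")
    x_loc = np.sum(q * x) / np.sum(q)                # formula (32)
    y_loc = np.sum(q * y) / np.sum(q)                # formula (31)
    sxx = np.sum(q * (x - x_loc) ** 2)
    if sxx == 0.0:
        return float(y_loc)                          # one distinct x in the window
    slope = np.sum(q * (x - x_loc) * y) / sxx
    return float(y_loc + (x_i - x_loc) * slope)      # formula (29)
```

Because units outside the window receive zero weight, an extreme value far from `x_i` on the x-axis does not affect the prediction at `x_i`.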
The bandwidth is determined using different methods, which can be divided into 2 groups.
The first group of them includes methods where only one bandwidth is set, which is fixed for
the whole sample. The second group includes methods where several bandwidths are
recommended, depending on values of the auxiliary variable for sampled observations.
The most common ways of bandwidth selection for the ith unit include:
- $b_i = \frac{1}{4}(x_{max} - x_{min})$
- $b_i = x_{i+10} - x_{i-10}$
- $b_i = x_{i+20} - x_{i-20}$
- $b_i = x_{i+40} - x_{i-40}$
The first one is an example of the first group, where the bandwidth is constant. It
constitutes ¼ of the range of the auxiliary variable. The remaining three methods, referred to as
nearest-neighbor, treat bandwidth as variable. The parameter $b_i$ expresses the difference
between values of the auxiliary variable x for two observations selected from the frame sorted
by $x_i$ in ascending order and situated 10, 20 or 40 positions on either side of unit i.
If the number of a sampled unit (with index i, where i = 1, …, n) is so small that no unit with
the number i−10, i−20 or i−40 can be specified, and hence the x value $x_{i-10}$, $x_{i-20}$ or
$x_{i-40}$ doesn’t exist, the minimum value of x is used instead. A similar procedure is applied
with respect to $x_{i+10}$, $x_{i+20}$ and $x_{i+40}$, with the maximum value being used
[Hedlin, 2004].
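Both groups of bandwidth rules can be sketched as follows. The function names `fixed_bandwidth` and `nearest_neighbour_bandwidths` and the parameter `m` (10, 20 or 40 in the text) are assumptions made for this illustration.

```python
import numpy as np

def fixed_bandwidth(x):
    """Constant bandwidth: one quarter of the range of x."""
    x = np.asarray(x, dtype=float)
    return 0.25 * (x.max() - x.min())

def nearest_neighbour_bandwidths(x, m=10):
    """Variable bandwidths b_i = x_(i+m) - x_(i-m) on x sorted in ascending order.

    Where i - m or i + m falls outside the data, the minimum or maximum
    of x is used instead. Returns the sorted x values and their bandwidths.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    idx = np.arange(len(xs))
    lower = xs[np.maximum(idx - m, 0)]             # x_(i-m), or min(x)
    upper = xs[np.minimum(idx + m, len(xs) - 1)]   # x_(i+m), or max(x)
    return xs, upper - lower
```

The bandwidths are returned in the order of the sorted x values, so they have to be matched back to the units by that ordering.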
The local regression estimator is called flexible, since for a long bandwidth it is similar to the
GREG and for a shorter bandwidth it captures local model departures.
2. Examples
The module presents an attempt to apply administrative data for a more effective use of
the SP-3 survey sampling of small businesses (see Dehnel, Gołata, 2010). The analysis refers to
data from the year 2001 for which databases were made available by the Polish Central
Statistical Office (CSO). Auxiliary variables come from two administrative registers: Database of
Statistical Units (BJS) created by CSO on the basis of the register of economic units called
REGON and the Ministry of Finance’s tax register. Because of a high fraction of nonresponse in
the SP-3 survey, the incompleteness of the tax register and outdated information in BJS, all
constituting major problems in terms of data availability, special matching procedures were
applied to improve the completeness of the data set for purposes of estimation. Estimation was
conducted for the following basic variables: the average number of employees under a contract
employed in one company, the amount of average monthly salary paid over one year in one
company, average revenue per company and average costs per company. The domain of study
was defined as an ‘intersection’ of the joint distribution of economic entities by region and
classification of economic activity (PKD section).
Estimation was conducted for 176 domains: 16 regions (R) and 11 PKD sections (S)
(Region&Section). Since the number of domains taken into account in the study is quite
considerable, we present only selected results for one of the estimated variables - Gross Wage
in the region of West Pomerania province and in selected sections: Manufacturing, Construction,
Trade, Hotels and Restaurants. However, to evaluate results obtained in the study, synthetic
characteristics across all domains were provided.
The results obtained from the study were compared with the GREG approach. To determine
the estimation precision, an approximate method based on subsamples and the bootstrap was
applied. The bootstrap method presented by Efron (1979) for simple random sampling needs
modification when applied to complex samples. We applied the approach described in detail by
Shao and Tu (1995). The simulation study consisted of 500 iterations (see Klimanek, Paradysz,
2006). In each iteration a subsample of size n−1 was drawn and modified weights were determined
for it in order to estimate the unknown parameter by $\hat{Y}^{*}_{b}$, where (*) stands for the
different robust estimators used in the study. The empirical variance was obtained according to
the standard approach

$$Var(\hat{Y}^{*}) = \frac{1}{500-1} \sum_{b=1}^{500} \left( \hat{Y}^{*}_{b} - \hat{Y}^{*} \right)^{2} \qquad (35)$$

Estimation precision was evaluated on the basis of the estimator’s coefficient of variation:

$$CV(\hat{Y}^{*}) = \frac{\sqrt{Var(\hat{Y}^{*})}}{\hat{Y}^{*}} \qquad (36)$$

The coefficient $RedCV(\hat{Y}^{*})$ was used to measure the degree of CV reduction obtained by the
robust estimators applied:

$$RedCV(\hat{Y}^{*}) = \frac{CV(\hat{Y}^{*}) - CV(\hat{Y}_{DIR})}{CV(\hat{Y}_{DIR})} \qquad (37)$$

Additionally, the design effect coefficient deff (proposed by Kish (1995)) was used to
measure the effectiveness of the target estimator in comparison with the direct one:

$$deff(\hat{Y}^{*}) = \frac{Var(\hat{Y}^{*})}{Var(\hat{Y}_{DIR})} \qquad (38)$$

All values smaller than one indicate that the method applied is more effective.
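The precision measures in equations (35) to (38) can be computed as in the following minimal Python sketch. The function names and the replicate values are hypothetical; only five replicates are used here instead of the 500 iterations of the study, and the drawing of subsamples and the reweighting are not shown.

```python
import math
from typing import Sequence


def bootstrap_variance(replicates: Sequence[float], estimate: float) -> float:
    """Equation (35): squared deviations of replicates from the estimate, over B - 1."""
    b = len(replicates)
    return sum((y_b - estimate) ** 2 for y_b in replicates) / (b - 1)


def cv(variance: float, estimate: float) -> float:
    """Equation (36): coefficient of variation."""
    return math.sqrt(variance) / estimate


def red_cv(cv_robust: float, cv_direct: float) -> float:
    """Equation (37): relative change of CV with respect to the direct estimator."""
    return (cv_robust - cv_direct) / cv_direct


def deff(var_robust: float, var_direct: float) -> float:
    """Equation (38): ratio of variances, robust over direct."""
    return var_robust / var_direct


# Hypothetical bootstrap replicates for one domain (both estimates equal 100).
robust_reps = [98.0, 102.0, 100.0, 101.0, 99.0]
direct_reps = [90.0, 110.0, 100.0, 105.0, 95.0]
var_robust = bootstrap_variance(robust_reps, 100.0)
var_direct = bootstrap_variance(direct_reps, 100.0)
cv_robust = cv(var_robust, 100.0)
cv_direct = cv(var_direct, 100.0)
print(red_cv(cv_robust, cv_direct), deff(var_robust, var_direct))
```

With equation (37) as written, a negative RedCV corresponds to a CV smaller than that of the direct estimator, and a deff below one points in the same direction.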
Modifications of GREG estimator
The estimation was conducted for Gross Wage (y). As auxiliary variables (x) and (z) we
used: (i) the number of employees from BJS and (ii) revenue from the Tax Register.
The synthetic characteristics summarising estimates of Gross Wage by domain are
presented in Table 1. They show that the highest precision was obtained for $\hat{Y}_{GREG}^{0}$,
the estimator ignoring observations for which ‘z’ was equal to zero. As γ increases, the
maximum, median and average values of CV usually increase as well. The largest proportion of
domains for which the relative dispersion of the estimator was smaller than in the case of the
direct one was also observed for $\hat{Y}_{GREG}^{0}$, followed by $\hat{Y}_{GREG}^{1}$,
$\hat{Y}_{GREG}^{1.5}$ and $\hat{Y}_{GREG}^{2}$. Only for the model with Revenue as an auxiliary
variable was the order of the estimators changed, which can be explained by a weaker relation
between the estimated and auxiliary variables.
We cannot unequivocally determine which of the estimators is characterised by the
highest precision. In each case an appropriate model and an appropriate definition of ‘x’ and
‘z’ are necessary. Model misspecification may lead to negative estimates. Such an atypical
situation was observed for the Trade domain in Lower Silesia province (see Figure 1.B).
Table 1. Characteristics of the CV distribution, modified GREG estimates of Gross Wage, all
domains, small business, Poland, 2001

Estimator columns: DIR is $\hat{Y}_{DIR}$; GREG γ=0, 1, 1.5, 2 are $\hat{Y}_{GREG}^{\gamma}$.

Characteristic                                     DIR           GREG γ=0      GREG γ=1      GREG γ=1.5    GREG γ=2
x: number of employees, z: number of employees
  min                                              0.0721        0.0665        0.0670        0.0671        0.0673
  max                                              1.0964        1.0402        1.0522        1.1754        1.3406
  average                                          0.3153        0.3019        0.3020        0.3030        0.3061
  median                                           0.2861        0.2739        0.2773        0.2799        0.2811
  proportion of domains for which CV < CV_DIR (%)  -             73.30         72.16         71.59         68.18
x: revenue, z: number of employees
  min                                              0.073         0.070         0.073         0.073         0.073
  max                                              0.779         0.990         0.863         0.795         0.979
  average                                          0.304         0.297         0.306         0.310         0.312
  median                                           0.281         0.266         0.283         0.286         0.286
  proportion of domains for which CV < CV_DIR (%)  -             68.75         56.25         46.02         45.45
x: revenue, z: revenue
  min                                              0.070         0.070         -1.731        -1.388        -1.332
  max                                              0.793         13.556        5.093         3.203         11.370
  average                                          0.305         0.421         0.358         0.316         0.401
  median                                           0.280         0.274         0.276         0.280         0.302
  proportion of domains for which CV < CV_DIR (%)  -             57.95         65.34         63.07         55.11

Source: Own estimation based on SP3 survey data, BJS, Tax Register
Defining auxiliary variables ‘x’ and ‘z’ as Revenue, the estimators $\hat{Y}_{GREG}^{1.5}$ and
$\hat{Y}_{GREG}^{2}$ produced negative estimates. As can be seen from Figure 1.A, outliers in both
dimensions (of the estimated as well as of the auxiliary variable) influence estimation results.
The regression lines corresponding to $\hat{Y}_{GREG}^{0}$, $\hat{Y}_{GREG}^{1}$,
$\hat{Y}_{GREG}^{1.5}$ and $\hat{Y}_{GREG}^{2}$ presented on the diagram show the following
regularity (which was also observed for other domains). The coefficient γ determines the degree
to which outliers in the dimension of ‘y’ are included in the model. The linear regression
$\hat{Y}_{GREG}^{0}$ with γ = 0 includes outliers in the dimension of ‘x’. As γ increases, outliers
in the dimension of ‘y’ are more accounted for by assigning to them greater $g_i$ weights, and
less importance is attached to outliers in the dimension of ‘x’. The value of γ also influences
the range of weights. The smallest dispersion of weights occurs for γ = 0
($g_i^{0} \in (-0.13; 2.4)$), and the largest range is observed for γ = 2
($g_i^{2} \in (-35; 55)$).
The $g_i$ weights are sample dependent and should not diverge much from one. Negative
weights are particularly undesirable as they tend to increase the estimator variance. That is why
some methods for restricting $g_i$ weights were developed. There is also a close relationship
between the $g_i$ weight of a sample observation and its influence on $\beta$. The most common
measure of this influence is DFBETA, defined as the change in the estimate of $\beta$ when a unit
is excluded from the sample:

$$DFBETA_i = \left( \sum_{j \in S} x_j x_j' / z_j^{\gamma} \right)^{-1} \left( \frac{x_i}{z_i^{\gamma}} \right) \left( \frac{e_i}{1 - h_i} \right) \qquad (39)$$
The main reason for negative estimates is outliers, i.e. those companies in which
Gross Wage is very large and Revenue very small. They influence model parameters and result
in disproportionately large weights assigned to those observations. DFBETA was used to identify
the influential units. The outlying observations were modified before conducting further estimation.
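To make equation (39) concrete, the following Python sketch evaluates it for the simplest case of a single auxiliary variable x without an intercept. Several points are assumptions rather than statements from the study: the regression weights are taken as $1/z_i^{\gamma}$, sampling weights are ignored, $e_i$ is read as the residual and $h_i$ as the leverage of unit i, and the data and function names are hypothetical.

```python
from typing import List, Sequence


def wls_slope(x: Sequence[float], y: Sequence[float],
              z: Sequence[float], gamma: float) -> float:
    """Slope of y on x through the origin with weights 1 / z**gamma."""
    sxx = sum(xi * xi / zi ** gamma for xi, zi in zip(x, z))
    sxy = sum(xi * yi / zi ** gamma for xi, yi, zi in zip(x, y, z))
    return sxy / sxx


def dfbeta(x: Sequence[float], y: Sequence[float],
           z: Sequence[float], gamma: float) -> List[float]:
    """Equation (39) for scalar x: change in the slope when unit i is dropped."""
    w = [1.0 / zi ** gamma for zi in z]
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    beta = wls_slope(x, y, z, gamma)
    result = []
    for xi, yi, wi in zip(x, y, w):
        e_i = yi - beta * xi          # residual (assumed meaning of e_i)
        h_i = wi * xi * xi / sxx      # leverage (assumed meaning of h_i)
        result.append((wi * xi / sxx) * e_i / (1.0 - h_i))
    return result


# Hypothetical domain: the last unit has small revenue but a very large wage bill.
revenue = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0]
wage = [2.1, 3.9, 6.2, 8.1, 9.8, 30.0]
values = dfbeta(revenue, wage, revenue, 1.0)   # here z is set equal to x
most_influential = max(range(len(values)), key=lambda i: abs(values[i]))
print(most_influential)
```

Under these assumptions the closed form agrees with refitting the model without unit i, so the units with the largest absolute values are the candidates for treatment.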
Figure 1. Regression lines for $\hat{Y}_{GREG}^{\gamma}$ estimators (γ = 0, 1, 1.5, 2) in Lower Silesia province:
A) domain defined as Industry
B) domain defined as Trade
C) domain defined as Trade after substituting the outliers with median
[Figure: each panel plots Gross Wage (ths. PLN) against Revenue (ths. PLN)]
Source: Own estimation based on SP3 survey data and Tax Register
To avoid negative estimates, one proposed solution is to substitute outliers with median
values or implement post-stratification (Chambers et al. (2001a)). Observations were divided
into three groups: outliers in the dimension of ‘x’, outliers in the dimension of ‘y’ and the
remaining ones (ideally this group should be the largest). The results obtained by substitution
are presented in Figure 1.C. The modification considerably changed the model parameters for
$\hat{Y}_{GREG}^{1.5}$ and $\hat{Y}_{GREG}^{2}$. Adopting both methods improved the estimation
precision and reduced the variation of the estimator.
The analysis conducted so far leads to the following conclusions:
– Implementation of Chambers et al. (2001a) modification produces results with much
differentiation depending on auxiliary variables and the model type
– Introducing the additional auxiliary variable ‘z’ makes it possible to move the estimation
problem from disjunctive values of the auxiliary variable to disjunctive values of the
estimated variable,
– Application of the additional variable ‘z’ also creates the possibility of estimating basic
economic information for small businesses at a lower aggregation level,
– Weights from the modified model may be characterized by much variation, which is
contradictory to the general assumption that the product of weights $g_i$ and $w_i$ should be
close to $w_i$,
– Due to outliers in the sample data, weights designated from the modified model may take
on values less than zero, which might result in negative estimates,
– Gain in estimation precision is greater for domains of a smaller size (number of
observations) and in the case of a stronger correlation between the estimated and the
auxiliary variable,
– A significant improvement in estimation precision may be obtained by selecting an
adequate model including variables ‘x’ and ‘z’, with properly designated auxiliary variables,
which in the case of a large number of small domains makes application of the modified
model rather difficult
– Weights assigned to the GREG estimator are strongly related to the distance measure
DFBETA, which enables identification of outliers
– Post-stratification is proposed for domains, in which case selecting a suitable model is
difficult.
Winsor estimation
To evaluate the results of the research, as in the case of the modified GREG, the
reference values were obtained by using direct Horvitz-Thompson (HT) estimates. A comparison
of Winsor estimators with the modified GREG is presented in Table 2. This time the
estimated variable was defined as Revenue. To preserve comparability, we applied an identical
set of variables in both the modified GREG and Winsor estimators. The results show that the most
efficient GREG estimates are obtained for the parameter γ = 1.5: both the mean and the median
values of the coefficient of variation are the lowest. For the larger value γ = 2, efficiency
decreases. If the median is taken into account, a very high precision (close to the one for
γ = 1.5) is also observed for γ = 0 and γ = 1. Regardless of the γ parameter, the GREG estimator
provides more precise estimates (in view of the median) than the direct one. The highest fraction
of domains for which GREG estimates were more efficient is observed for γ = 1, but as γ
increases further this percentage decreases.
Basic distribution characteristics of the variation of the Winsor estimator across all domains
(Region&Section) show similar CV for the TLS, LAV and LMS techniques and the direct HT
estimator. The mean lies in the interval (0.263 – 0.303) and the median in (0.210 – 0.232). The
smallest variation is observed for the TSS technique (mean equal to 0.225 and median 0.172).
TSS also shows the highest fraction of domains for which it provided more efficient estimates
(almost 74%). This result is quite close to GREG with γ = 1, and much better than for
GREG with γ = 1.5. It also produces clearly better estimates in terms of other characteristics
of the CV distribution, such as min, average and median.
Table 2. Characteristics of CV distribution: Winsor and modified GREG estimators, revenue, all
domains, 2001

GREG estimator (DIR is $\hat{Y}_{DIR}$; GREG γ=0, 1, 1.5, 2 are $\hat{Y}_{GREG}^{\gamma}$)
y: Revenue, x: Cost from TAX register, z: Revenue from TAX register
Characteristic                                     DIR           GREG γ=0      GREG γ=1      GREG γ=1.5    GREG γ=2
  min                                              0.088         0.074         0.075         -7.340        -3.834
  max                                              0.902         1.407         1.417         8.148         4.474
  average                                          0.276         0.273         0.278         0.230         0.298
  median                                           0.232         0.212         0.213         0.210         0.226
  proportion of domains for which CV < CV_DIR (%)  -             72.16         74.43         70.45         59.09

WINSOR estimator (DIR is $\hat{Y}_{DIR}$; TLS, LAV, TSS, LMS are $\hat{Y}_{TLS}$, $\hat{Y}_{LAV}$, $\hat{Y}_{TSS}$, $\hat{Y}_{LMS}$)
y: Revenue, x: Cost from TAX register
Characteristic                                     DIR           TLS           LAV           TSS           LMS
  min                                              0.088         0.071         0.070         0.033         0.072
  max                                              0.902         1.371         1.352         1.485         1.578
  average                                          0.276         0.263         0.270         0.225         0.303
  median                                           0.232         0.211         0.210         0.172         0.228
  proportion of domains for which CV < CV_DIR (%)  -             64.77         64.20         73.86         50.00

Source: Own estimation based on SP3 survey data and Tax Register
The appraisal of Winsor estimation showed that:
– Depending on the type of estimated characteristic, different methods might be applied to
determine the cut-off values used in the delimitation process: two (upper and lower) or one
(upper). Both cases were examined, showing that the estimation precision is higher for two
cut-off values,
– Simulation research demonstrated the relation between efficiency and the type of robust
regression technique used: the more robust the regression technique applied, the more
efficient the estimates produced. This was shown by comparing TSS estimates with those
obtained by TLS and LAV,
– Cut-off values used in the delimitation process, and therefore the type of robust regression
technique, influence the precision of the Winsor estimator. The highest precision in our study
was observed for the Sample Splitting Technique (TSS).
Local regression estimation
The comparison of robust estimation in business surveys was supplemented by the analysis
of how the bandwidth definition influences the estimation precision. To preserve comparability,
we applied an identical set of variables as in the case of modified GREG and Winsor estimation.
The estimation was conducted for Gross Wage (y). We used Revenue from the Tax Register as
auxiliary variable (x).
Table 3. Characteristics of the CV distribution, local regression estimator, all domains, 2001

Estimator columns: DIR is $\hat{Y}_{DIR}$; loc(·) are the local regression estimators $\hat{Y}_{loc(\cdot)}$.

Characteristic                                     DIR           loc(10)       loc(20)       loc(40)       loc(max,min)
  min                                              0.10          0.09          0.08          0.09          0.08
  max                                              0.63          0.60          1.77          1.78          0.73
  average                                          0.27          0.25          0.36          0.36          0.27
  median                                           0.25          0.23          0.21          0.21          0.24
  proportion of domains for which CV < CV_DIR (%)  -             73.30         71.59         72.16         59.66

Source: Own estimation based on SP3 survey data and Tax Register
The characteristics of the relative dispersion distribution for the direct and local estimators
are comparable, except for the maximum. The highest precision is observed for the
$\hat{Y}_{loc(10)}$ estimator, both in terms of the characteristics of the distribution and as
concerns the proportion of domains for which CV is smaller than that obtained for the direct
estimator (see Table 3).
The analysis of local regression estimation led to the following observations:
– As the bandwidth increases, the local regression estimates resemble GREG estimates more
closely,
– The highest precision, in terms of CV, was obtained for local regression with the bandwidths
used in $\hat{Y}_{loc(10)}$ and $\hat{Y}_{loc(20)}$,
– For a narrow bandwidth, estimation is based on many local models, which increases the
computing time,
– As the bandwidth gets narrower, the local change in the estimated variable is
taken into account to a greater degree. As the bandwidth increases, the smoothing effect
becomes more pronounced,
– The weights designated by the kernel function do not depend on the values of the
estimated variable but on the auxiliary variables. This means that they can be applied to
many estimated variables, provided the set of auxiliary variables is unchanged.
Comparison of the results and conclusions
GREG estimation is frequently used. But its application to business statistics is connected
with a danger arising from outliers, whose presence results in a significant bias of estimates (see
Hedlin (2004)). Several modifications belonging to the ‘closer’ and ‘further’ family of GREG
estimators were applied and discussed. Among those closely related we relied on model-based
estimation using inverse transformation and Winsor estimation introducing a ‘border’ point to
distinguish outlier observations. As concerns the ‘further’ family, the module presents local
regression, which can be used to accommodate local departures from the linear model. The
estimation precision was evaluated on the basis of the estimator’s coefficient of variation CV,
the RedCV coefficient measuring the degree of CV reduction, and the deff coefficient (see figures 2 – 4).
Figure 2. Characteristics of CV distribution across all domains, different robust estimators, 2001
[Figure: MIN, Q1, Q2, MEAN, Q3 and MAX of CV (scale 0.0 to 2.0) for the estimators GREG, GREG0, GREG1, GREG15, GREG2, TLS, LAV, TSS, LMS, LOC10, LOC20, LOC40, LOCmax]
Source: Own estimation based on SP3 survey data, BJS and Tax Register
The highest estimation precision, expressed by the average value and the quartiles of the
estimator’s CV across all domains, is observed for the Winsor $\hat{Y}_{TSS}$ and local
$\hat{Y}_{loc(10)}$ estimators. This remark does not take into account the extreme value observed
for Winsor TSS, which was due to high dispersion in one of the bootstrap subsamples. For the
remaining Winsor estimators, local regression and the modified GREG, the relative dispersion, as
measured by the quartiles and the average, is very similar to, and only slightly smaller than,
that of the original GREG estimator.
The narrowest range of the CV is observed for two local estimators, $\hat{Y}_{loc(10)}$ and
$\hat{Y}_{loc(max,min)}$, as well as the modified GREG by Chambers et al. For the remaining
estimators examined, the variation is much more spread out, but does not exceed the maximum
range observed for the original GREG estimator (CV=12%).
Figure 3. Characteristics of RedCV distribution across all domains, different robust estimators, 2001
[Figure: MIN, Q1, Q2, MEAN, Q3 and MAX of RedCV (scale -100% to 100%) for the estimators GREG, GREG0, GREG1, GREG15, GREG2, TLS, LAV, TSS, LMS, LOC10, LOC20, LOC40, LOCmax]
Source: Own estimation based on SP3 survey data, BJS and Tax Register
Coefficient RedCV (%) represents the degree to which the variation of the estimators is
reduced in comparison with the direct HT estimator (see Figure 3). If the average value across all
domains of study, or the quartiles of the distribution, are considered, a reduction was
obtained for the Winsor TSS estimator (on average by 33%, but for the median by 61%) and the
local regression estimators: $\hat{Y}_{loc(10)}$ (average by 10%, median by 9%) and
$\hat{Y}_{loc(max,min)}$ (average by 4%, median by 8%). For the remaining estimation techniques
applied, the average value of CV is larger than in the case of the direct HT estimator. This means
that the techniques used did not improve estimation precision. Generally, this is due to extreme
values of CV observed for some atypical domains, which influence the mean. If the median, as a
measure insensitive to outliers, is considered, a reduction can be observed for all Winsor and
local regression estimators.
Another coefficient used to evaluate the gain in efficiency of an estimator is deff, which
compares the variation of the estimator considered with that of the direct estimator (see
Figure 4). A value smaller than one means that the estimation technique applied is more
efficient than the direct one. Evaluating the precision in terms of the deff coefficient, it can be
observed that the greatest gain was obtained by the Winsor sample splitting estimator TSS (on
average by 50%). The average variation across all domains, for the whole set of remaining
estimators, is greater than that obtained using the direct one. An especially high value of the
deff coefficient is observed for all the local regression estimators. To explain this, we should
bear in mind that although local models are constructed, in the estimation process all the
observations are taken into account, including outliers. This has a direct influence on the high
variation of local estimators.
Figure 4. Characteristics of deff coefficient distribution across all domains, different estimators, 2001
[Figure: MIN, Q1, Q2, MEAN, Q3 and MAX of deff (scale 0.0 to 5.0) for the estimators GREG, GREG0, GREG1, GREG15, GREG2, TLS, LAV, TSS, LMS, LOC10, LOC20, LOC40, LOCmax]
Source: Own estimation based on SP3 survey data, BJS and Tax Register
In assessing estimators, one has to specify their bias. Since we don’t know the real value of
the estimated parameter, this is difficult. The present study relies on a method applied by R.
Chambers, G. Brown, P. Heady, D. Heasman [2001] and E. Gołata [2004] in a study which
focused on the use of indirect estimation in the analysis of unemployment. The method is based
on one of the properties of the direct estimator. It is assumed that despite high variability, it is
an unbiased estimator. Hence, a regression model fitted to real data and based on direct
estimates will have the form y=x. Since real values are unknown, it was assumed that if
estimates obtained using a given estimator are close to real values, the estimator can be treated
as an unbiased predictor of the direct estimator. [Chambers, Brown, Heady, Heasman (2001)].
Estimates obtained using the unbiased direct estimator can be regarded as values of a random
variable, whose expectation is equal to the estimate calculated by applying the estimator of
interest. If the estimator under assessment is unbiased, the regression coefficient used in the
model based on its estimates and direct estimates will be 1. As the estimator bias increases, the
regression coefficient will increasingly differ from 1.
To test bias, parameters of the regression function y=ax were estimated, with estimates from
the direct estimator² on the OY axis and estimates obtained using the other estimators analysed
in the study on the OX axis, see Fig. 5.
A significance test was then used to verify the hypothesis that the difference between
the estimated regression coefficient and 1 is not statistically significant. Additionally, to
illustrate how the estimates of the unbiased direct estimator are distributed relative to those
of the estimators under assessment, scatter plots were used. A random distribution of
estimates for the direct and other estimators is demonstrated by the concentration of
observations along the identity line. Since the size of study domains differs significantly,
estimates were transformed by applying a power function with an exponent of 0.5. This was
necessary to meet the homoscedasticity assumption of OLS [Chambers, Brown, Heady,
Heasman (2001)], see Figs. 5 – 8.
² In the study a direct regression estimator was used, which is based on auxiliary variables to offset the small sample size across domains.
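The bias check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function name and the four domain estimates are hypothetical, all estimates are assumed to be non-negative so that the square root exists, and the standard error is the usual OLS formula for a regression through the origin with n − 1 degrees of freedom.

```python
import math
from typing import Sequence, Tuple


def slope_test_against_one(direct: Sequence[float],
                           indirect: Sequence[float]) -> Tuple[float, float]:
    """Fit sqrt(direct) = a * sqrt(indirect) and return (a, t) for H0: a = 1."""
    y = [math.sqrt(v) for v in direct]      # OY axis: direct estimates
    x = [math.sqrt(v) for v in indirect]    # OX axis: estimator under assessment
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    a = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - a * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 1) / sxx)     # standard error of the slope
    return a, (a - 1.0) / se


# Hypothetical domain estimates of gross wage.
direct_est = [100.0, 400.0, 900.0, 1600.0]
indirect_est = [121.0, 441.0, 961.0, 1681.0]
a_hat, t_value = slope_test_against_one(direct_est, indirect_est)
print(a_hat, t_value)
```

The resulting t-value would then be compared with a Student's t distribution to obtain the p-value; that step is left out here because it needs a statistics library.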
Fig. 5. The relationship between estimates of gross wage obtained by the direct estimator and GREG estimator across all domains
[Figure: scatter plot of transformed estimates, both axes scaled 0 to 1500. Fitted line: $\hat{Y}_{GREG} = 0.9746x$, $R^2 = 0.9925$]
Source: Own estimation based on SP3 survey data and the Tax Register
Fig. 6. The relationship between estimates of gross wage obtained by the direct estimator
and the modified GREG estimator across all domains
[Figure: scatter plot of transformed estimates, both axes scaled 0 to 1500. Fitted lines: $\hat{Y}_{GREG}^{0} = 0.9739x$, $R^2 = 0.9911$; $\hat{Y}_{GREG}^{1} = 0.9766x$, $R^2 = 0.9952$; $\hat{Y}_{GREG}^{1.5} = 0.9765x$, $R^2 = 0.9732$; $\hat{Y}_{GREG}^{2} = 0.9766x$, $R^2 = 0.9924$]
Source: Own estimation based on SP3 survey data and the Tax Register
Fig. 7. The relationship between estimates of gross wage obtained by the direct estimator
and the Winsor estimator across all domains
[Figure: scatter plot of transformed estimates, both axes scaled 0 to 1500. Fitted lines: $\hat{Y}_{TLS} = 1.0064x$, $R^2 = 0.9842$; $\hat{Y}_{LAV} = 1.0084x$, $R^2 = 0.9855$; $\hat{Y}_{TSS} = 0.9986x$, $R^2 = 0.9906$; $\hat{Y}_{LMS} = 0.9342x$, $R^2 = 0.9522$]
Source: Own estimation based on SP3 survey data and the Tax Register
Fig. 8. The relationship between estimates of gross wage obtained by the direct estimator
and local regression estimators across all domains
[Figure: scatter plot of transformed estimates, both axes scaled 0 to 1500. Fitted lines: $\hat{Y}_{loc(10)} = 0.8400x$, $R^2 = 0.9867$; $\hat{Y}_{loc(20)} = 0.8515x$, $R^2 = 0.9832$; $\hat{Y}_{loc(40)} = 0.8617x$, $R^2 = 0.9811$; $\hat{Y}_{loc(max,min)} = 0.7943x$, $R^2 = 0.9779$]
Source: Own estimation based on SP3 survey data and the Tax Register
The Student’s t-value based on the model for the standard GREG estimator justifies the
rejection of the null hypothesis that the regression coefficient is equal to 1, see Table 4.
A similar result was obtained for the modified GREG estimators. Based on the p-value for
the t-test statistic, the difference between estimates obtained by the direct and GREG
estimators cannot be treated as due to chance, see Figs. 5 and 6 and Table 4. This result is
surprising, considering the fact that GREG estimators are regarded as unbiased. However, if the
sample includes some observations with extreme values of the auxiliary variable (outliers), then
such estimators can produce under- or overestimation of the study variable.
Different results of the significance test were obtained for the Winsor estimators, where
outliers are modified, see Fig. 7 and Table 4. The t-test confirmed the compatibility between
estimates obtained by the direct estimator and three indirect estimators: $\hat{Y}_{TLS}$,
$\hat{Y}_{LAV}$ and $\hat{Y}_{TSS}$. The null hypothesis that the regression coefficient equals 1
was rejected for the estimator $\hat{Y}_{LMS}$.
The last group to be analysed included local regression estimators. They are significantly
different from the direct estimator. The difference is evident from the slope of the regression
line and the significance test results (see Fig. 8 and Table 4). Local estimators are
characterised by the largest absolute t-values. For all cases, the p-value is less than 0.001.
This indicates that these estimators are biased. Hence, the differences between estimates
produced by the direct and local estimators cannot be attributed to chance.
To sum up the verification of the random character of the difference between direct
unbiased estimates and model-based estimates, it can be said that the “best” results were
obtained using Winsor estimation. The main difference between Winsor estimators and the other
estimators assessed in the study is that outliers are modified to be as close as possible to cut-off
values. In contrast, auxiliary variable values are not changed in modified GREG estimators and
local estimators. What changes is only their influence on the estimation of the study variable,
which depends on weight values. The verification study reveals a negative influence of outliers
on estimation precision. This seems to be confirmed by the degree of correlation between direct
and model-based estimates, see Table 5. In the case of Winsor estimators the correlation is
very strong, in contrast to GREG and local estimators, where it is much weaker. If unbiased
direct estimates are treated as a reference point, correlation analysis can demonstrate to what
extent model-based estimates obtained by other estimators are different. The weakest
correlation can be observed for the three local estimators ($r \in (0.88; 0.89)$). The correlation is
stronger for the local estimator $\hat{Y}_{loc(10)}$ ($r = 0.986$) and the Winsor estimator
$\hat{Y}_{LMS}$ ($r = 0.977$), with the strongest correlation observed for the three remaining
Winsor estimators $\hat{Y}_{TLS}$, $\hat{Y}_{LAV}$, $\hat{Y}_{TSS}$ and the GREG estimator
($r \in (0.991; 0.998)$).
Table 4. Results of the study to verify the null hypothesis that the regression coefficient equals 1 for the relationship between indirect and direct estimates ($\hat{Y}_{DIR}$) of gross wage in 2001.

Estimator                   Regression coefficient  t-value    p-value   Coefficient of determination
GREG estimation
$\hat{Y}_{GREG}$            0.9746                  -6.0787    < 0.001   0.9968
The modified GREG estimation
$\hat{Y}_{GREG}^{0}$        0.9739                  -5.7249    < 0.001   0.9962
$\hat{Y}_{GREG}^{1}$        0.9766                  -6.9514    < 0.001   0.9979
$\hat{Y}_{GREG}^{1.5}$      0.9765                  -5.8704    < 0.001   0.9971
$\hat{Y}_{GREG}^{2}$        0.9766                  -5.5502    < 0.001   0.9967
Winsor estimation
$\hat{Y}_{TLS}$             1.0064                  1.0222     0.30812   0.9932
$\hat{Y}_{LAV}$             1.0084                  1.3979     0.16391   0.9938
$\hat{Y}_{TSS}$             0.9986                  -0.2986    0.76563   0.9960
$\hat{Y}_{LMS}$             0.9342                  -6.4495    < 0.001   0.9795
Local regression estimation
$\hat{Y}_{loc(10)}$         0.8400                  -21.7436   < 0.001   0.9867
$\hat{Y}_{loc(20)}$         0.8515                  -17.6389   < 0.001   0.9832
$\hat{Y}_{loc(40)}$         0.8617                  -15.3053   < 0.001   0.9811
$\hat{Y}_{loc(max,min)}$    0.7943                  -22.7796   < 0.001   0.9779

Source: Own estimation based on SP3 survey data and the Tax Register
Table 5. A correlation matrix between estimates of gross wage obtained using estimators
under assessment by province and activity class in 2001.

Columns are numbered in the same way as the rows; column (1) is $\hat{Y}_{DIR}$.

Estimator                      (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)    (12)    (13)    (14)
GREG estimation
(2) $\hat{Y}_{GREG}$           0.9976  1
The modified GREG estimation
(3) $\hat{Y}_{GREG}^{0}$       0.9979  0.9966  1
(4) $\hat{Y}_{GREG}^{1}$       0.9920  0.9938  0.9903  1
(5) $\hat{Y}_{GREG}^{1.5}$     0.9911  0.9928  0.9890  0.9997  1
(6) $\hat{Y}_{GREG}^{2}$       0.9909  0.9926  0.9887  0.9995  1.0000  1
Winsor estimation
(7) $\hat{Y}_{TLS}$            0.9923  0.9933  0.9911  0.9902  0.9893  0.9891  1
(8) $\hat{Y}_{LAV}$            0.9930  0.9935  0.9915  0.9909  0.9900  0.9897  0.9994  1
(9) $\hat{Y}_{TSS}$            0.9972  0.9984  0.9962  0.9925  0.9915  0.9913  0.9976  0.9977  1
(10) $\hat{Y}_{LMS}$           0.9771  0.9792  0.9759  0.9724  0.9716  0.9714  0.9755  0.9758  0.9792  1
Local regression estimation
(11) $\hat{Y}_{loc(10)}$       0.9856  0.9813  0.9844  0.9763  0.9757  0.9756  0.9891  0.9894  0.9870  0.9614  1
(12) $\hat{Y}_{loc(20)}$       0.8904  0.8831  0.8936  0.8578  0.8568  0.8569  0.8995  0.8988  0.8946  0.8759  0.9289  1
(13) $\hat{Y}_{loc(40)}$       0.8958  0.8888  0.8988  0.8639  0.8630  0.8632  0.9052  0.9044  0.9002  0.8817  0.9339  0.9985  1
(14) $\hat{Y}_{loc(max,min)}$  0.8733  0.8637  0.8759  0.8363  0.8351  0.8352  0.8774  0.8762  0.8741  0.8562  0.9125  0.9866  0.9834  1

Source: Own estimation based on SP3 survey data and the Tax Register
Evaluation of estimators based on CV, RedCV deff and the significance test for the
regression coefficient indicate that the Winsor estimator with sample splitting (SST) is
characterised by the greatest precision. Rating the remaining estimators in terms of estimation
precision is rather difficult, since it changes depending on estimation parameters , while the
validity of criteria is relative. It is much easier to choose the ‘best’ estimator within each
method of estimation. In the case of modified GREG estimation, the best choice is the YGREG1,5
estimator, which accounts for an auxiliary variable ‘z’, which specifies heteroscedasticity in
regression of y on x in proportion to the value of zi1,5
. In the group of local regression
35
Outlier treatment (robust estimation) Version 1.0 29-02-2012Theme: Process Step: Estimation
estimators, it is the Y loc(10) estimator that displays the best precision, since it has the narrowest
bandwidth for constructing local kernel functions. The bandwidth is determined separately for
each sampled unit using the “nearest neighbour method”. In the case of the i-th unit the
bandwidth comprises units with numbers i ± 10. It should be stressed that the bandwidth,
which is used to construct the kernel function, affects the computation time: an increased
bandwidth slows down the estimation process. That is why, even though processing
large datasets is nowadays not a major consideration for analysts, local estimation methods are seldom
used in practice. The third group of estimators analysed in this study consists of Winsor
estimators. As mentioned above, it is the estimator Y TSS, which is based on sample
splitting, that turned out to be the best. The technique involves constructing two regression
models for sampled units, randomly divided into two sets. Residuals for the first set are
computed using the model fitted to the other one, and vice versa. The final regression model used in Winsor
estimation is constructed after rejecting the units with the largest residuals. This approach
increases the robustness of the sample splitting technique in comparison with TLS or LAV,
since the residuals used to delete data are not computed from the regression model that was
fitted to those same units.
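The unit-rejection step of the sample splitting technique can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the study: it uses a single auxiliary variable, ordinary least squares, and a hypothetical parameter reject_share for the share of units rejected. It covers the cross-fitted residuals and the final regression only, not the Winsorised estimate of the total.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def sample_splitting_fit(xs, ys, reject_share=0.1, seed=0):
    """Refit a line after rejecting units with the largest cross-fitted residuals.

    reject_share and seed are illustrative assumptions, not values from the module.
    Returns (intercept, slope, sorted list of rejected unit indices).
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # random division into two sets
    half = len(idx) // 2
    set_a, set_b = idx[:half], idx[half:]

    resid = {}
    for own, other in ((set_a, set_b), (set_b, set_a)):
        # The model is fitted on the other set; residuals are taken on this one.
        a, b = fit_line([xs[i] for i in other], [ys[i] for i in other])
        for i in own:
            resid[i] = abs(ys[i] - (a + b * xs[i]))

    n_reject = int(round(reject_share * len(idx)))
    ranked = sorted(idx, key=lambda i: resid[i], reverse=True)
    rejected = sorted(ranked[:n_reject])
    rejected_set = set(rejected)
    kept = [i for i in idx if i not in rejected_set]

    # Final regression model, fitted without the rejected units.
    a, b = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
    return a, b, rejected
```

For example, with twenty units lying on the line y = 1 + 2x and one grossly inflated value, calling sample_splitting_fit(xs, ys, reject_share=0.05) rejects that single unit and recovers the underlying line, because its residual is measured against a model that the unit itself did not influence.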
Glossary
Term Definition
outliers Extreme values that differ from the other data; values of a variable
which are essentially “impossible” under the given model
Literature:
Belsley D. A., Kuh E., Welsch R. E., 1980. Regression diagnostics, Identifying influential data and sources of co-linearity, New York, Wiley.
Breidt, F.J., Opsomer, J.D. (2000) Local Polynomial Regression Estimation in Survey Sampling. The Annals of Statistics, 28, 1026 – 1053.
Brewer K., 1963, Ratio estimation and finite populations, some results deducible from the assumption of an underlying stochastic process, Australian Journal of Statistics 5, pp. 93-105.
Chambers, R.L. (1996) Robust case-weighting for multipurpose establishment Surveys, Journal of Official Statistics, Vol.12, No.1, 3 – 32.
Chambers, R., Dorfman, A.H., Wehrly, T.E. (1993) Bias Robust Estimation in Finite Populations Using Nonparametric Calibration. Journal of the American Statistical Association, 88, 268 – 277.
Chambers, R., Kokic, P., Smith, P. and Cruddas, M. (2000) Winsorization for Identifying and Treating Outliers in Business Surveys, Proceedings of the Second International Conference on Establishment Surveys (ICES II), 687 – 696.
Chambers R., Brown G., Heady P., Heasman D. (2001) Evaluation of Small Area Estimation Methods – an Application to Unemployment Estimates from the UK LFS, Proceedings of Statistics Canada Symposium 2001, Achieving Data Quality in a Statistical Agency: a Methodological Perspective.
Chambers, R.L, Falvey, H., Hedlin, D., Kokic P. (2001a) Does the Model Matter for GREG Estimation? A Business Survey Example, Journal of Official Statistics, Vol.17, No.4, 527 – 544.
Cook R. D., 1977, Deletion of influential observations in linear regression, Technometrics, 19, pp. 351-361.
Dehnel G., Gołata E. (2010) On some robust estimators for Polish Business Survey, Statistics in Transition- new series, Vol.11, number 2, Warszawa 2010. s. 287-312 (Central Statistical Office and Polish Statistical Association), 58 – 71, Summ. - Bibliogr. ISBN 978-83-7027-431-3 (in Polish).
Dorfman, A.H. (2000) Non-Parametric Regression for Estimating Totals in Finite Populations. Proceedings of the Survey Research Methods. American Statistical Association, 47 – 54.
Efron, B. (1979) Bootstrap methods: Another look at the jackknife, [in:] Annals of Statistics 7, 1979, 1 – 26.
Gołata E., 2004, Estymacja pośrednia bezrobocia na lokalnym rynku pracy, Wydawnictwo Akademii Ekonomicznej w Poznaniu, Poznań.
Gross W.F., Bode G., Taylor J.M., Lloyd–Smith C.W., 1986, Some finite population estimators which reduce the contribution of outliers, [in:] Proceedings of the Pacific Statistical Conference, 20–24 May 1985, Auckland, New Zealand.
Hedlin D. (2004) Business Survey Estimation, R&D, Sweden.
Hidiroglou, M.H., Srinath, K.P. (1981) Some estimators of population total from simple random samples containing large units, JASA, 76, 690 – 695.
Kim, J.Y., Breidt, F.J. and Opsomer, J.D. (2001) Local polynomial regression estimation in two-stage sampling. Proceedings of the Section on Survey Research Methods, American Statistical Association, 55 – 61.
Kish L. (1995) Methods for design effects, Journal of Official Statistics, 11, 55 – 77.
Klimanek T., Paradysz J. (2006) Adaptation of EURAREA experience in business statistics, “Statistics in Transition”, Vol.7, No. 4.
Kokic, P.N., Bell, P.A. (1994) Optimal winsorizing cutoffs for a stratified finite population estimator, Journal of Official Statistics, 10, 419 – 435.
Konarski R., 2004, Regresja wielokrotna, diagnostyka i selekcja modułu regresji, web page, www.pbsdga.pl/x.php?x=160/Regresja-wielokrotna.html.
Mackin, C., Preston J. (2002) Winsorization for Generalised Regression Estimation, Australian Bureau of Statistics.
Pawlowska Z. (2005) Role of small and medium enterprises in creating a demand on work, [in:] “Wiadomosci Statystyczne”, No.2, 34 – 46 (in Polish).
Preston J., Mackin C., 2002, Winsorization for Generalised Regression Estimation, Paper for the Methodological Advisory Committee, November 2002, Australian Bureau of Statistics.
Särndal C.E., Swensson B., Wretman J. (1992) Model Assisted Survey Sampling, Springer Verlag, New York.
Searls D.T., 1966, An estimator which reduces large true observations, JASA, 61, pp. 1200–1204.
Shao J., Tu D., 1995, The jackknife and bootstrap, New York, Springer Verlag.
Specific description – Method: A synthetic estimator for small area estimation
A.1 Purpose of the method
Presentation of some alternative estimation techniques which are less sensitive to outliers.
A.2 Recommended use of the method
The methods presented in this module are recommended for use when the study
variable(s) are highly skewed, when there is a large proportion of zero responses and some negative
values, and when several auxiliary variables are available that can be used to improve estimation in the presence of outliers.
Such a situation is common in business statistics. The growing use of auxiliary information from
administrative registers and the need to substantially reduce sample sizes or to produce more
effective estimates have increased the importance of recognizing and dealing with this data
problem.
A.3 Possible disadvantages of the method
It is difficult to rank estimators, as their evaluation changes depending on the criterion. It
would be much easier to indicate the ‘best’ or ‘worst’ estimator in each group. The modified
GREG can often produce negative weights. As for local regression, it should be stressed that the
bandwidth significantly influences the computing time.
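The link between bandwidth and computing time can be illustrated with a minimal sketch of a local linear fit that uses a nearest-neighbour window of k units on each side of unit i (units ordered by x). The Epanechnikov-type kernel weights, the single auxiliary variable and the function name local_linear are assumptions made for this illustration; the module does not prescribe them. The returned work counter is simply the number of unit visits and grows with k.

```python
def local_linear(xs, ys, k):
    """Local linear fit with a nearest-neighbour window of k units on each side.

    Returns (fitted values in the original unit order, number of unit visits).
    The kernel and the work counter are illustrative assumptions.
    """
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])   # units ordered by x
    fitted = [0.0] * n
    work = 0
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - k), min(n, pos + k + 1)
        window = order[lo:hi]                       # units pos-k .. pos+k
        work += len(window)

        # Bandwidth slightly above the largest distance in the window,
        # so that every unit in the window receives a positive weight.
        h = 1.001 * max(abs(xs[j] - xs[i]) for j in window)
        if h == 0.0:
            # All x values in the window coincide: fall back to the mean.
            fitted[i] = sum(ys[j] for j in window) / len(window)
            continue

        w = [1.0 - ((xs[j] - xs[i]) / h) ** 2 for j in window]
        sw = sum(w)
        mx = sum(wj * xs[j] for wj, j in zip(w, window)) / sw
        my = sum(wj * ys[j] for wj, j in zip(w, window)) / sw
        sxx = sum(wj * (xs[j] - mx) ** 2 for wj, j in zip(w, window))
        sxy = sum(wj * (xs[j] - mx) * (ys[j] - my) for wj, j in zip(w, window))
        slope = sxy / sxx if sxx > 0.0 else 0.0
        fitted[i] = my + slope * (xs[i] - mx)       # weighted fit evaluated at unit i
    return fitted, work
```

Each sampled unit requires its own weighted regression over about 2k + 1 units, so doubling k roughly doubles the work for units away from the ends of the ordering.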
For Winsor estimators, the one implementing the Sample Splitting Technique, Y TSS, provided the
most efficient estimates. In this method two regression models were estimated for sample
observations randomly divided into two groups. Evaluating these models in terms of residuals
led to the rejection of the most outlying observations. This approach makes the TSS technique more
robust in comparison with others such as TLS or LAV.
A.4 Variants of the method
The variants are: the modification of the GREG estimator proposed by R. Chambers, H. Falvey, D. Hedlin and P. Kokic
(2001a), the Winsor estimator discussed by C. Mackin and J. Preston (2002), and local regression
presented by J.Y. Kim, F.J. Breidt and J.D. Opsomer (2001).
A.5 Input data sets
The input data set depends on the estimators considered and the source of information. It
can contain individual information for all units in the sample, as well as
information from auxiliary sources, e.g. an administrative register. Specific
software (e.g. SAS) may require a different structure of the input data set for the procedure
of robust estimation.
A.6 Logical preconditions
The level of aggregation of study variables and auxiliary variables should be the same.
A.7 Tuning parameters
The tuning parameters depend on the estimators used, e.g. the maximum number of iterations and the convergence criterion.
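How a maximum number of iterations and a convergence criterion enter an iterative robust fit can be sketched as below. The example uses a Huber-type iteratively reweighted least squares fit of a ratio model y = b*x; this estimator, the constant c = 1.345, the MAD-based scale and the parameter names max_iter and tol are assumptions chosen for illustration, not the procedures evaluated in this module.

```python
import statistics


def huber_slope(xs, ys, c=1.345, max_iter=50, tol=1e-8):
    """Huber-type IRLS estimate of b in y = b*x.

    max_iter and tol play the role of the tuning parameters named above.
    Returns (b, number of iterations performed, converged flag).
    """
    # Starting point: ordinary least squares through the origin.
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    for n_iter in range(1, max_iter + 1):
        res = [y - b * x for x, y in zip(xs, ys)]
        # Robust scale: median absolute residual, rescaled for normal data.
        scale = statistics.median([abs(r) for r in res]) / 0.6745
        if scale == 0.0:
            return b, n_iter, True
        cut = c * scale
        # Huber weights: 1 inside the cut-off, shrinking beyond it.
        w = [1.0 if abs(r) <= cut else cut / abs(r) for r in res]
        b_new = (sum(wi * x * y for wi, x, y in zip(w, xs, ys))
                 / sum(wi * x * x for wi, x in zip(w, xs)))
        # Convergence criterion: relative change of the coefficient.
        if abs(b_new - b) <= tol * max(1.0, abs(b)):
            return b_new, n_iter, True
        b = b_new
    return b, max_iter, False
```

The returned iteration count and convergence flag are the kind of information that can be logged for each run; a run that stops at max_iter without converging signals that the tuning parameters should be reviewed.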
A.8 Recommended use of the individual variants of the method
Hedlin D., 2004, Business Survey Estimation, R&D, Sweden.
Preston J., Mackin C., 2002, Winsorization for Generalised Regression Estimation, Paper for the
Methodological Advisory Committee, November 2002, Australian Bureau of Statistics.
A.9 Output data sets
An output dataset may contain a table with the following information: estimates for small area,
MSE, bias (depending on the software used).
A.10 Properties of the output data sets
The user should check the quality of the estimates based on their knowledge of the investigated phenomenon and on the MSE and bias of the estimates.
A.11 Unit of processing
Processing of unit-level data and domain-level variables for computation of the sample size
dependent estimator and its MSE.
A.12 User interaction - not tool specific
1. Select method of robust estimation.
2. Choose auxiliary variables to be included in robust estimation.
3. Establish the level of aggregation.
4. Establish tuning parameters (convergence criteria, starting point, stopping point).
5. After robust estimation has been applied, check and verify the quality indicators (e.g. MSE) in order
to evaluate the final results.
A.13 Logging indicators
1. Run time of the application.
2. Number of iterations to reach convergence in the estimation process.
3. Characteristics of the input data.
A.14 Quality indicators of the output data
1. MSE
2. Bias
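Both indicators can be computed directly when a reference value is available, for instance in a simulation study where the population total is known. The sketch below assumes such a setting; the function name bias_and_mse and its arguments are hypothetical. In a real survey the true value is unknown and both quantities have to be estimated.

```python
def bias_and_mse(estimates, true_value):
    """Empirical bias, variance and MSE of replicated estimates.

    Assumes true_value is known, e.g. in a simulation study.
    Returns (bias, variance, mse), where mse = variance + bias**2.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, variance, mse
```

The decomposition of the MSE into variance plus squared bias is why both indicators are reported: a robust estimator typically accepts some bias in exchange for a lower variance.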
A.15 Actual use of the method
National accounts, business surveys.
A.16 Relationship with other modules
2. Weighting
2.1. Basic weights with examples
2.2 Calibration
2.3 GREG