
Outlier treatment (robust estimation)
Version 1.0, 29-02-2012
Theme: Process Step: Estimation

5. Outlier treatment (robust estimation)

0. General information
0.1 Module name: 5. Outlier treatment (robust estimation)
0.2 Module type: 5. Outlier treatment (robust estimation)
0.3 Module code
Method:

Version history

Version   Date         Description of changes   Author           NSI   Person-id
1.0       29-02-2012   First version            Grazyna Dehnel   GUS   GD
2.0       31-05-2012   Second version           Grazyna Dehnel   GUS   GD

Template version used: 1.0 d.d. 25-3-2011
Print date: 21-5-2023 22:38


Contents

Outlier treatment (robust estimation)
1 General description
2 Examples
3 Glossary
4 Literature
Specific description
A.1 Purpose of the method
A.2 Recommended use of the method
A.3 Possible disadvantages of the method
A.4 Variants of the method
A.5 Input data sets
A.6 Logical preconditions
A.7 Tuning parameters
A.8 Recommended use of the individual variants of the method
A.9 Output data sets
A.10 Properties of the output data sets
A.11 Unit of processing
A.12 User interaction - not tool specific
A.13 Logging indicators
A.14 Quality indicators of the output data
A.15 Actual use of the method
A.16 Relationship with other modules


Outlier treatment (robust estimation)

1. General description

Databases created on the basis of statistical surveys very often contain values that differ markedly from the remaining observations. The target variable may be highly skewed, and there may be a large proportion of non-responses or extreme values. This means that one of the major problems involved in estimation may be the distribution of the variables.

The occurrence of outliers and their detection is an important element of statistical analysis. The effect of outliers on estimation can be significant, since in such situations estimators do not retain properties such as unbiasedness or efficiency.

For this reason, in addition to using the classical approach, work should be carried out to develop more robust methods that are not affected by the presence of non-typical data or outliers. This is true especially when estimation is carried out at a low level of aggregation. Outliers, non-typical data or null values, however, are an integral part of each population and cannot be dismissed in the analysis.

The module presents some nonstandard, robust estimation techniques. The following

methods of estimation are analyzed: modification of generalized regression estimator (GREG)

proposed by R. Chambers, H. Falvey, D. Hedlin, P. Kokic (2001), Winsor estimator as discussed

by C. Mackin, J. Preston (2002) and local regression presented by J.Y. Kim, F.J. Breidt, and J.D.

Opsomer (2001).

GREG estimation is frequently used in practice (see the module 2.3). In business statistics, the

perceived main advantage of GREG is that auxiliary information usually improves precision

considerably. But its application is also connected with a danger arising from outliers, whose

presence results in a significant bias of estimates (see Hedlin D., 2004). Several, more or less

direct modifications of GREG estimators are applied and discussed. Among those closely related

are model-based estimation using inverse transformation and Winsor estimation introducing

‘border’ points to distinguish outlier observations. As concerns the ‘further’ family, the module

presents local regression which can be used to accommodate local departures from the linear

model.

The modification of GREG estimator

In business statistics, which is characterised by huge diversification, direct Horvitz-

Thompson estimation procedures do not provide satisfactory results. Applying GREG estimation

that uses auxiliary information is perceived as a solution, which usually increases precision

considerably. Despite this advantage, D. Hedlin (2004) draws attention to how important good

modelling is. He indicates that for a set of real data, different GREG estimators might produce

wildly different results. The difference between them lies entirely in the choice of the model.

One of the assumptions underlying the linear regression model is that residuals have the same variance, regardless of the value of the covariate. It is often the case, however, that this condition does not hold, which implies heteroscedasticity and leads to inefficient estimates of regression parameters.

This is why attempts are being made to develop new estimation methods that account for heteroscedasticity. Violation of the assumptions of the linear regression model can sometimes be avoided by a kind of inverse transformation. It can be exemplified by a modification of the GREG estimator proposed by R. Chambers, H. Falvey, D. Hedlin, P. Kokic [2001].

The modification of the GREG estimator involves including an additional covariate $z_i^{\gamma}$ in the model. Under the classical model, the GREG estimator of the total of the study variable is:

$$\hat{Y}_{GREG}=\sum_{i\in U}\hat{y}_i+\sum_{i\in s}w_i e_i \tag{1}$$

where $\hat{y}_i=x_i'\hat{\beta}$. In the modified version, the model parameter $\beta$ is estimated on the basis of a modified formula, which accounts for the additional covariate $z$ [Chambers, Falvey, Hedlin and Kokic, 2001]:

$$\hat{\beta}=\Big(\sum_{i\in s}w_i x_i x_i'/z_i^{\gamma}\Big)^{-1}\Big(\sum_{i\in s}w_i x_i y_i/z_i^{\gamma}\Big) \tag{2}$$

where:

$z_i^{\gamma}$ - covariate in the regression model of y on x assuming heteroscedasticity,

$\gamma$ - parameter which describes the degree of heteroscedasticity.

The dispersion of points determined on the basis of empirical values of the variables x and y on the scatter plot affects the value of parameter $\gamma$. An increasing value of $\gamma$ corresponds to a stronger tendency of the points to scatter away from the regression line. The practice of statistical surveys indicates that in many populations one is faced with heteroscedasticity, which means that an increase in the value of the covariate is accompanied by a rise in the residual variance. This means that parameter $\gamma$ is greater than zero. Survey results indicate that it usually assumes values within the interval $\langle 1;2\rangle$ [Brewer, 1963]. Its size is determined on the basis of earlier surveys, a pilot survey or sample-based estimates [Särndal, Swensson, Wretman, 1992, p. 255].

The modification of the GREG estimator expressed by formula (1) can be presented in the form labeled (3), which has the same form as the classical GREG estimator:

$$\hat{Y}_{GREG}=\sum_{i\in s}w_i g_i y_i \tag{3}$$

where $g_i$ is the weight of the ith observation, defined as:

$$g_i=1+(X-\hat{X}_{DIR})'\Big(\sum_{j\in s}w_j x_j x_j'/z_j^{\gamma}\Big)^{-1}\big(x_i/z_i^{\gamma}\big) \tag{4}$$

and:

$\hat{Y}_{GREG}$ – estimate of the total obtained by applying the GREG estimator,

$\hat{X}_{DIR}$ – Horvitz-Thompson estimate of the total of the auxiliary variable x,

$X$ – total of the auxiliary variable x,

x, z – auxiliary variables,

$\gamma$ – coefficient characterising the degree of heteroscedasticity; for $\gamma=0$ the original GREG estimator is obtained.
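The weights in formula (4) can be illustrated with a short sketch. It is a minimal illustration restricted, as an assumption, to a single auxiliary variable x, so that the matrix in (4) reduces to a scalar; the function name `modified_greg_total` and its arguments are hypothetical and do not come from the cited papers.

```python
def modified_greg_total(y, x, z, w, x_total, gamma):
    """Modified GREG estimate of a total for ONE auxiliary variable x.

    Illustrative sketch of formulas (3)-(4); all names are assumptions.
    Units with z = 0 have to be dropped before calling the function.
    """
    if any(zi <= 0 for zi in z):
        raise ValueError("z must be strictly positive")
    # Scalar version of sum_j w_j x_j x_j' / z_j^gamma from formula (4)
    a = sum(wi * xi * xi / zi ** gamma for wi, xi, zi in zip(w, x, z))
    # Horvitz-Thompson estimate of the total of x
    x_dir = sum(wi * xi for wi, xi in zip(w, x))
    # g_i = 1 + (X - X_DIR) * a^{-1} * x_i / z_i^gamma
    g = [1.0 + (x_total - x_dir) * xi / (zi ** gamma * a) for xi, zi in zip(x, z)]
    # Formula (3): weighted sum of the study variable
    total = sum(wi * gi * yi for wi, gi, yi in zip(w, g, y))
    return total, g


# Made-up data: four sampled units, equal design weights, z taken equal to x
y = [12.0, 20.0, 33.0, 41.0]
x = [4.0, 7.0, 10.0, 14.0]
w = [5.0, 5.0, 5.0, 5.0]
total_gamma1, g_gamma1 = modified_greg_total(y, x, x, w, 200.0, 1)
```

Whatever the value of $\gamma$, the weights $w_i g_i$ reproduce the known total of x. With $\gamma=1$ and z = x the weights $g_i$ are all equal, and the estimator coincides with the ratio estimator.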

The only difference is the definition of the weight $g_i$, which depends on the value of the auxiliary variable x for sampled observations.

Since other surveys indicate that coefficient $\gamma$ should lie in the interval $1\le\gamma\le 2$ (Särndal (1992), p. 255), the following estimators are analyzed:

(a) $\hat{Y}_{DIR}$ – direct Horvitz-Thompson estimator (HT),

(b) $\hat{Y}_{GREG}^{0}$ – $\gamma=0\Rightarrow z_i^{0}$, regression estimator based on the linear regression model assuming homoscedasticity (GREG estimator),

(c) $\hat{Y}_{GREG}^{1}$ – $\gamma=1\Rightarrow z_i^{1}$,

(d) $\hat{Y}_{GREG}^{1.5}$ – $\gamma=1.5\Rightarrow z_i^{1.5}$,

(e) $\hat{Y}_{GREG}^{2}$ – $\gamma=2\Rightarrow z_i^{2}$.

The estimators denoted by (c), (d) and (e) are regression estimators based on the linear regression model assuming heteroscedasticity. The Horvitz-Thompson estimator (HT) was considered as the reference value when comparing the effectiveness of the GREG estimator and its modifications. The second estimator, (b) $\hat{Y}_{GREG}^{0}$, takes the form of the GREG estimator, but its value differs slightly from that of the original GREG formula. The difference results from the fact that not all observations drawn to the sample are used for estimation: observations for which the auxiliary variable ‘z’ is equal to zero¹ are omitted. In the case of the remaining three estimators, the linear regression model assumes heteroscedasticity of a degree indicated by the coefficient $\gamma$. The modification assumes that each of the auxiliary variables may take either role, ‘x’ or ‘z’, provided that ‘z’ does not equal zero.

¹ The variable ‘z’ is assumed not to take zero value. Nevertheless, it happens in practice that zero values occur for variables for which they are not expected (for example the revenue might be equal to zero for an operating enterprise).

As R.L. Chambers, H. Falvey, D. Hedlin, P. Kokic [2001] point out, regression estimators accounting for heteroscedasticity can sometimes produce significantly biased estimates. The correction term, also known as the ‘bias adjustment’, can be much larger than the model-based component. In extreme cases, this can produce negative estimates for variables which, by definition, should have positive values. This situation can be due to model misspecification, on the one hand, and a strong effect of outliers, on the other. Outliers significantly affect model parameters and, consequently, absolute values of weights assigned to outliers are disproportionately large. Various solutions are proposed in order to prevent this. They can only be used, however, once outliers have been identified. To this end, one can use a measure of distance [Konarski, 2004]. The most common measure is $DFBETA_i$:

$$DFBETA_i=\Big(\sum_{j\in s}x_j x_j'\Big)^{-1}x_i\,\frac{e_i}{1-x_i'\big(\sum_{j\in s}x_j x_j'\big)^{-1}x_i} \tag{5}$$

Its standardized form is given by the following formula [Belsley, Kuh, Welsch, 1980]:

$$DFBETAS_i=\frac{\hat{\beta}-\hat{\beta}_{(-i)}}{\sqrt{MSE_{(-i)}(\hat{\beta})}}=\frac{DFBETA_i}{\sqrt{MSE_{(-i)}(\hat{\beta})}} \tag{6}$$

where:

$\sqrt{MSE_{(-i)}(\hat{\beta})}$ - estimation error for the regression model after removing the ith observation,

$\hat{\beta}$ - an estimate of parameter $\beta$,

$\hat{\beta}_{(-i)}$ - an estimate of parameter $\beta$ after removing the ith observation.

An observation is regarded as an outlier if it meets the condition $|DFBETAS_i|>2$ [Fox, 1991], or $|DFBETAS_i|>2/\sqrt{n}$ for large samples [Belsley, Kuh, Welsch, 1980].
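Formula (5) gives the change in the estimated coefficient caused by deleting one observation without refitting the model. The sketch below illustrates this for the simplest case, assumed here for brevity: a regression through the origin with one covariate, so that all matrices reduce to scalars. The function name `dfbeta_through_origin` and the data are hypothetical.

```python
def dfbeta_through_origin(x, y):
    """DFBETA_i from formula (5) for the model y = beta * x (one covariate, no intercept).

    Illustrative sketch only; returns the fitted beta and one DFBETA per unit.
    """
    sxx = sum(xi * xi for xi in x)                  # scalar version of sum_j x_j x_j'
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    values = []
    for xi, yi in zip(x, y):
        e = yi - beta * xi                          # residual of unit i
        h = xi * xi / sxx                           # x_i' (sum_j x_j x_j')^{-1} x_i
        values.append(xi * e / (sxx * (1.0 - h)))
    return beta, values


# Made-up data: the last unit lies far above the line followed by the others
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 25.0]
beta_hat, dfbeta_values = dfbeta_through_origin(x, y)
```

Each value equals the difference between the coefficient estimated from the full sample and the coefficient estimated after removing the corresponding unit, so the unit with the largest absolute value is the one that moves the fit most.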

In the modified version of GREG estimation proposed by R.L. Chambers, H. Falvey, D. Hedlin, P. Kokic [2001], DFBETA is strongly correlated with the weight $g_i$. This means that the measure of distance affects the degree of influence on variable y. DFBETA for the ith observation in the modified GREG estimator can be determined using the following formula [Chambers, Falvey, Hedlin and Kokic, 2001]:

$$DFBETA_i=\Big(\sum_{j\in s}x_j x_j'/z_j^{\gamma}\Big)^{-1}\Big(\frac{x_i}{z_i^{\gamma}}\Big)\Big(\frac{e_i}{1-h_i}\Big) \tag{7}$$

where:

$$h_i=\big(x_i/z_i^{\gamma}\big)'\Big(\sum_{j\in s}x_j x_j'/z_j^{\gamma}\Big)^{-1}\big(x_i/z_i^{\gamma}\big) \tag{8}$$

Presented below are three measures that can be used to identify outliers, which account for the number of degrees of freedom in the regression model. One of them was introduced by R. D. Cook [1977]:

$$D_i=\frac{e_i^2}{MSE\,(k+1)}\cdot\frac{h_i}{(1-h_i)^2} \tag{9}$$

where:

k – number of auxiliary variables,

$h_i$ – distance of the ith observation from the mean value of variable x, defined as:

$$h_i=\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{j}(X_j-\bar{X})^2} \tag{10}$$

Observations are regarded as outliers if they meet the condition $D_i>\frac{4}{n-k-1}$.

The second formula to identify outliers was developed by D. A. Belsley, E. Kuh and R. E. Welsch [1980]:

$$DFFITS_i=\frac{e_i\sqrt{\dfrac{h_i}{1-h_i}}}{\sqrt{MSE_{(-i)}}\sqrt{1-h_i}} \tag{11}$$

where $h_i$ denotes the distance of the ith observation from the mean value of variable x, defined by (10). An observation is regarded as an outlier if it meets the condition $|DFFITS_i|>2\sqrt{(k+1)/(n-k-1)}$.

The third measure accounts for the influence of the ith observation on the results of regression analysis through its effect on MSE values of estimated regression coefficients [Belsley, Kuh, Welsch, 1980]:

$$COVRATIO_i=\frac{1}{\Big(\dfrac{n-k-2+t_i^2}{n-k-1}\Big)^{k+1}}\cdot\frac{1}{(1-h_i)} \tag{12}$$

where $h_i$ is given by formula (10) and

$$t_i=\frac{e_i}{\sqrt{MSE_{(-i)}}\sqrt{1-h_i}} \tag{13}$$

If $COVRATIO_i<1$, deleting the ith observation will reduce MSE values of estimated regression coefficients. On the other hand, if $COVRATIO_i>1$, deleting the ith observation will result in higher MSE values of estimated regression coefficients. It is recommended that critical values be used which account for sample size, i.e. $|COVRATIO_i-1|>3(k+1)/n$.
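The three measures (9), (11) and (12) and their critical values can be computed together. The following sketch assumes a regression with an intercept and a single auxiliary variable (k = 1), so that $h_i$ can be taken directly from formula (10); $MSE_{(-i)}$ is obtained from the full-sample fit through the standard deletion identity rather than by refitting. The function name `influence_measures` and the data are hypothetical.

```python
import math


def influence_measures(x, y):
    """Cook's D (9), DFFITS (11) and COVRATIO (12) for y = a + b*x, i.e. k = 1.

    Illustrative sketch; returns one dictionary per observation.
    """
    n, k = len(x), 1
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - k - 1)
    rows = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx                   # formula (10)
        mse_del = (sse - e * e / (1.0 - h)) / (n - k - 2)       # MSE without unit i
        t = e / math.sqrt(mse_del * (1.0 - h))                  # formula (13)
        cook = e * e / (mse * (k + 1)) * h / (1.0 - h) ** 2     # formula (9)
        dffits = t * math.sqrt(h / (1.0 - h))                   # formula (11)
        covratio = 1.0 / (((n - k - 2 + t * t) / (n - k - 1)) ** (k + 1) * (1.0 - h))
        rows.append({
            "cook": cook,
            "dffits": dffits,
            "covratio": covratio,
            "cook_flag": cook > 4.0 / (n - k - 1),
            "dffits_flag": abs(dffits) > 2.0 * math.sqrt((k + 1) / (n - k - 1)),
            "covratio_flag": abs(covratio - 1.0) > 3.0 * (k + 1) / n,
        })
    return rows


# Made-up data: a clean linear trend with small alternating noise and one shifted unit
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
y[19] += 30.0
measures = influence_measures(x, y)
```

In this example the shifted unit is the last one; it has the largest Cook's distance and exceeds all three critical values.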

Applying one of the above measures makes it possible to identify outliers. As was mentioned earlier, in the presence of outliers, the modified GREG estimator may grossly underestimate or overestimate the population totals. Various solutions are used to cope with this problem, such as limiting the weight variability, replacing outliers with the median or the use of post-stratification [Chambers, Falvey, Hedlin and Kokic, 2001].

Post-stratification involves a subjective division of a sample into 3 strata, the first of which

contains outliers with respect to variable ‘x’ (high values of variable ’x’, low values of variable

’y’), the third one contains outliers with respect to ‘y’ (high values of variable ‘y’, low values of

variable ‘x’), whereas the second one contains the remaining observations. Optimally, the first

and third stratum should be as small as possible (considering problems of estimating variance,

the number of observations should not be less than 20) while the second stratum should be as

large as possible. Estimation is conducted independently for each stratum.

The Winsor estimation

Another approach suggested in the literature consists in modifying values in the sample so that the estimator becomes robust and is not affected by large residuals [Kokic, Bell, 1994; Chambers 1996]. Sampled units whose values lie outside certain preset cut-off values are transformed in order to bring them closer to the cut-off value. This approach is exemplified by Winsor estimation, which was applied for the first time in a survey conducted by D.T. Searls [1966]. Sampled units are divided into two groups. One of them contains typical observations, which are used to fit the model; the other contains observations regarded as outliers. The division is made on the basis of one or two preset cut-off values. Then, values of the study variable outside the cut-off values are transformed so that they are no longer regarded as outliers. It should be stressed, however, that the modified values are artificial and can sometimes be unacceptable. As a result of Winsor estimation, we obtain a "new" sample, in which untypical observations have been replaced with typical ones. Further calculations are conducted for the modified sample. Any kind of estimation can be used at this stage, such as Horvitz-Thompson or GREG.

The use of Winsor estimation reduces estimator variance, while, at the same time, it may introduce bias. However, the decline in variance is large enough to offset the contribution of the bias to the MSE [Hedlin, 2004].

The main difficulty, then, lies in the choice of cut-off values for dividing observations in the sample. The optimum selection has a strong bearing on estimation quality.

The Winsor estimator can be expressed as:

$$\hat{Y}_{win}=\sum_{i\in s}\tilde{w}_i y_i^{*} \tag{14}$$

where modified values of the study variable $y_i^{*}$ are calculated in the following manner [Gross, Bode, Taylor, Lloyd-Smith, 1986]:

$$y_i^{*}=\begin{cases}\dfrac{1}{\tilde{w}_i}y_i+\Big(1-\dfrac{1}{\tilde{w}_i}\Big)K_{Ui} & \text{if } y_i>K_{Ui}\\[6pt] y_i & \text{if } K_{Li}\le y_i\le K_{Ui}\\[6pt] \dfrac{1}{\tilde{w}_i}y_i+\Big(1-\dfrac{1}{\tilde{w}_i}\Big)K_{Li} & \text{if } y_i<K_{Li}\end{cases} \tag{15}$$

where:

$U=\{1,\dots,i,\dots,N\}$ - general population of size N,

$\tilde{w}_i=w_i g_i$,

$w_i$ - sampling weights,

$g_i$ - weights dependent on the value of an auxiliary variable for sampled units,

$K_{Ui}$ - upper cut-off value,

$K_{Li}$ - lower cut-off value.

Based on formula (14), a unit drawn into the sample can be regarded as representing itself and $(\tilde{w}_i-1)$ non-sampled units. Hence, according to formula (15), an observation regarded as an outlier contributes its unweighted value, while the non-sampled units, represented by the remainder of the weight, $(\tilde{w}_i-1)$, contribute the preset upper or lower cut-off value.
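The transformation in formula (15) and the total in formula (14) can be sketched as follows. The treatment of values below the lower cut-off mirrors the upper case, which is an assumption based on the definitions of $K_{Li}$ and $K_{Ui}$; the function names and the data are hypothetical.

```python
def winsorize(y, w_tilde, k_lower, k_upper):
    """Modified values y* in the spirit of formula (15); illustrative sketch only."""
    y_star = []
    for yi, wi, kl, ku in zip(y, w_tilde, k_lower, k_upper):
        if yi > ku:
            # The unit keeps its own value, the (w - 1) units it represents get K_U
            y_star.append(yi / wi + (1.0 - 1.0 / wi) * ku)
        elif yi < kl:
            # Assumed mirror image of the upper case for the lower cut-off
            y_star.append(yi / wi + (1.0 - 1.0 / wi) * kl)
        else:
            y_star.append(yi)
    return y_star


def winsor_total(y, w_tilde, k_lower, k_upper):
    """Winsor estimate of the total, formula (14)."""
    y_star = winsorize(y, w_tilde, k_lower, k_upper)
    return sum(wi * ysi for wi, ysi in zip(w_tilde, y_star))


# Made-up data: the second value exceeds the upper cut-off, the third is below the lower one
y = [10.0, 100.0, 1.0]
w_tilde = [5.0, 5.0, 2.0]
k_lower = [2.0, 2.0, 2.0]
k_upper = [50.0, 50.0, 50.0]
y_star = winsorize(y, w_tilde, k_lower, k_upper)
total = winsor_total(y, w_tilde, k_lower, k_upper)
```

For the second unit the weighted contribution is 5 · 60 = 300 = 100 + 4 · 50, i.e. the observed value plus four times the cut-off, which is the interpretation given in the text.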

Two kinds of Winsor estimators can be distinguished: Type I and Type II. The difference between them consists in the treatment of outliers. In the case of the Type II estimator, as the weight $\tilde{w}_i$ decreases, the contribution of outliers increases: the modified value of the study variable "approaches" the value of the outlier, i.e. the real value of the variable. The Type I Winsor estimator is based on an arbitrary assumption whereby the ratio $1/\tilde{w}_i$ is equal to zero (cf. formula (15)) and, as a result, any outlier always takes on the cut-off value [Kokic, Bell, 1994].

Cut-off values are calculated to minimize MSE [Preston, Mackin, 2002]:

$$K_{Ui}=\mu_i^{*}-\frac{B_U}{(\tilde{w}_i-1)} \tag{16}$$

$$K_{Li}=\mu_i^{*}-\frac{B_L}{(\tilde{w}_i-1)} \tag{17}$$

where:

$B_U=E[\hat{Y}_{winU}-\hat{Y}_{DIR}]$ - bias of the estimator $\hat{Y}_{winU}$,

$B_L=E[\hat{Y}_{winL}-\hat{Y}_{DIR}]$ - bias of the estimator $\hat{Y}_{winL}$,

$\hat{Y}_{winU}$ - Winsor estimator of the population total when only upper winsorization is performed,

$\hat{Y}_{winL}$ - Winsor estimator of the population total when only lower winsorization is performed.

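Considering the fact that $\mu_i^{*}$ is difficult to estimate, in practice it is assumed that $\mu_i^{*}=\hat{\mu}_i=\hat{\beta}x_i$ [Preston, Mackin, 2002]. Hence, cut-off values are estimated based on the following formulas:

$$\hat{K}_{Ui}=\hat{\mu}_i-\frac{B_U}{(\tilde{w}_i-1)}=\hat{\mu}_i+\frac{G}{(\tilde{w}_i-1)},\quad\text{where } G=-B_U \tag{18}$$

$$\hat{K}_{Li}=\hat{\mu}_i-\frac{B_L}{(\tilde{w}_i-1)}=\hat{\mu}_i+\frac{H}{(\tilde{w}_i-1)},\quad\text{where } H=-B_L \tag{19}$$

where $\hat{\mu}_i=\hat{\beta}x_i$ is a robust estimate of the regression parameter $\mu_i^{*}$.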

To calculate them, one can use the function $\psi_U(\hat{D}_{(k)})$ [Kokic, Bell, 1994]:

$$\psi_U(\hat{D}_{(k)})=\hat{D}_{(k)}-\sum_{i\in s}\max\{\hat{D}_i-\hat{D}_{(k)},0\}=(k+1)\hat{D}_{(k)}-\sum_{j=1}^{k}\hat{D}_{(j)} \tag{20}$$

where:

$\hat{D}_i=(Y_i-\hat{\mu}_i)(\tilde{w}_i-1)$ - an estimate of the weighted residual $D_i=(Y_i-\mu_i^{*})(\tilde{w}_i-1)$ for unit i drawn into the sample,

(k) - a number assigned to the unit drawn into the sample after ordering all units in the sample according to non-ascending estimated residuals $\hat{D}_i$.

Solving the equation $\psi_U(\hat{D}_{(k)})$ requires $\hat{D}_i$ to be ordered in such a way that $\hat{D}_{(1)}\ge\hat{D}_{(2)}\ge\dots\ge 0\ge\dots$. Consecutive ordered values of estimated residuals are assigned numbers (j), where (j) = (1), ..., (k).

By solving $\psi_U(G)=0$ one can obtain the value of G. In practice, since it is difficult to find the right solution of the equation, two methods are proposed. According to the first one, G is estimated using the formula:

$$G=\frac{1}{(k^{*}+1)}\sum_{j=1}^{k^{*}}\hat{D}_{(j)} \tag{21}$$

where $k^{*}$ is the last value of k for which the value of $\psi_U(\hat{D}_{(k)})$ is non-negative.

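The second approach, which was used in this study, involves using linear interpolation between $\frac{1}{(k^{*}+1)}\sum_{j=1}^{k^{*}}\hat{D}_{(j)}$ and $\frac{1}{(k^{*}+2)}\sum_{j=1}^{k^{*}+1}\hat{D}_{(j)}$. Then, G can be expressed as [Preston, Mackin, 2002]:

$$G=\frac{\psi_U(\hat{D}_{(k^{*}+1)})\Big[\dfrac{1}{(k^{*}+1)}\sum\limits_{j=1}^{k^{*}}\hat{D}_{(j)}\Big]-\psi_U(\hat{D}_{(k^{*})})\Big[\dfrac{1}{(k^{*}+2)}\sum\limits_{j=1}^{k^{*}+1}\hat{D}_{(j)}\Big]}{\psi_U(\hat{D}_{(k^{*}+1)})-\psi_U(\hat{D}_{(k^{*})})} \tag{22}$$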
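The search for $k^{*}$ and the two ways of obtaining G, formulas (21) and (22), can be sketched as below. The function names `psi_upper` and `estimate_g` and the residuals are hypothetical; the sketch assumes that the weighted residuals have already been computed and that $\psi_U$ changes sign within the sample.

```python
def psi_upper(t, d):
    """psi_U(t) = t - sum_i max(D_i - t, 0), evaluated at an arbitrary point t."""
    return t - sum(max(di - t, 0.0) for di in d)


def estimate_g(d):
    """Return (k_star, G from formula (21), G from formula (22)); illustrative sketch."""
    d_sorted = sorted(d, reverse=True)              # D_(1) >= D_(2) >= ...
    n = len(d_sorted)
    cums, running = [], 0.0
    for value in d_sorted:
        running += value
        cums.append(running)                        # cums[k-1] = sum_{j<=k} D_(j)
    # Formula (20): psi_U(D_(k)) = (k + 1) D_(k) - sum_{j<=k} D_(j)
    psi = [(k + 1) * d_sorted[k - 1] - cums[k - 1] for k in range(1, n + 1)]
    k_star = max((k for k in range(1, n + 1) if psi[k - 1] >= 0), default=0)
    if k_star == 0 or k_star == n:
        raise ValueError("psi_U does not change sign on the sample")
    g_first = cums[k_star - 1] / (k_star + 1)       # formula (21)
    psi_k, psi_k1 = psi[k_star - 1], psi[k_star]    # psi at D_(k*) and D_(k*+1)
    g_interp = (psi_k1 * g_first - psi_k * cums[k_star] / (k_star + 2)) / (psi_k1 - psi_k)
    return k_star, g_first, g_interp                # g_interp follows formula (22)


# Made-up weighted residuals
residuals = [4.0, -5.0, 10.0, 1.0, -2.0]
k_star, g_first, g_interp = estimate_g(residuals)
```

For these residuals $k^{*}=1$, formula (21) gives G = 5 and formula (22) gives G = 85/18 ≈ 4.72. An estimated upper cut-off then follows from formula (18) as $\hat{\mu}_i+G/(\tilde{w}_i-1)$.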

The value of H can be computed analogously. Estimates of weighted residuals $\hat{D}_i=(Y_i-\hat{\mu}_i)(\tilde{w}_i-1)$ are ordered ascendingly, $\hat{D}_{[1]}\le\hat{D}_{[2]}\le\dots\le 0\le\dots$. The function $\psi_L(\hat{D}_{[m]})$ assumes the form given by the formula:

$$\psi_L(\hat{D}_{[m]})=\hat{D}_{[m]}-\sum_{i\in s}\min\{\hat{D}_i-\hat{D}_{[m]},0\}=(m+1)\hat{D}_{[m]}-\sum_{l=1}^{m}\hat{D}_{[l]} \tag{23}$$

where:

[l] - a number assigned to the unit drawn into the sample after ordering all units in the sample by estimated residuals $\hat{D}_i$, [l] = [1], ..., [m].

H can thus be expressed as (cf. formula (22)) [Preston, Mackin, 2002]:

$$H=\frac{\psi_L(\hat{D}_{[m^{**}+1]})\Big[\dfrac{1}{(m^{**}+1)}\sum\limits_{l=1}^{m^{**}}\hat{D}_{[l]}\Big]-\psi_L(\hat{D}_{[m^{**}]})\Big[\dfrac{1}{(m^{**}+2)}\sum\limits_{l=1}^{m^{**}+1}\hat{D}_{[l]}\Big]}{\psi_L(\hat{D}_{[m^{**}+1]})-\psi_L(\hat{D}_{[m^{**}]})} \tag{24}$$

where $m^{**}$ is the last value of m for which the value of $\psi_L(\hat{D}_{[m]})$ is non-positive.

In order to estimate cut-off values $K_{Ui}$ and $K_{Li}$, in addition to the above bias parameters $G=-B_U$ and $H=-B_L$, it is necessary to compute $\hat{\mu}_i=\hat{\beta}x_i$, which is an estimate of $\mu_i^{*}$. For this purpose robust regression methods can be used. Those recommended in the literature [Preston, Mackin, 2002] include: Trimmed least squares (TLS), Trimmed least absolute value (LAV), Sample Splitting (SS), Least median of squares (LMS).

The method of Trimmed least squares (TLS) involves fitting an Ordinary Least Squares (OLS) regression model to minimise the function:

$$F=\sum_{i\in s}(y_i-\beta^{T}x_i)^2 \tag{25}$$

Based on the model, theoretical values are calculated, and then residuals. In the next step, units with the largest positive and negative residuals are removed. As a rule, the sample is reduced by about 5%. A new regression model is fitted to the reduced sample in order to estimate the value of $\mu_i^{*}$. One advantage of TLS is that it is simple and quick to run.
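The TLS steps described above can be sketched for a regression with an intercept and one covariate. Splitting the 5% equally between the largest positive and the largest negative residuals is an assumption made for this illustration, as are the function names and the data.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b


def trimmed_least_squares(x, y, trim=0.05):
    """Fit OLS, drop the most extreme residuals in both tails, refit (illustrative sketch)."""
    a, b = ols_line(x, y)
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    n = len(x)
    n_cut = max(1, round(trim * n / 2))             # units removed from each tail (assumption)
    order = sorted(range(n), key=lambda i: resid[i])
    keep = sorted(order[n_cut:n - n_cut])           # indices of the retained units
    coef = ols_line([x[i] for i in keep], [y[i] for i in keep])
    return coef, keep


# Made-up data: an exact line y = 2x with one grossly shifted unit
x = [float(i) for i in range(1, 41)]
y = [2.0 * xi for xi in x]
y[39] += 200.0
ols_coef = ols_line(x, y)
tls_coef, kept = trimmed_least_squares(x, y)
```

In this example the shifted unit has the largest positive residual in the first fit, so it is removed and the second fit recovers the line followed by the remaining units, whereas the plain OLS slope is pulled upwards.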

Another method used in robust regression is Trimmed least absolute value (LAV). It consists in fitting a regression model to minimise the function:

$$F=\sum_{i\in s}|y_i-\beta^{T}x_i| \tag{26}$$

The model is used to calculate theoretical values and then residuals. As is the case in the TLS method, units with the largest positive and negative residuals are removed. A new regression model is fitted to the reduced sample. The LAV method is expected to be more robust than the TLS technique because large residuals, which are not squared, have less influence on the regression parameters.

Another example of robust regression is Sample Splitting Technique based on Ordinary Least

Squares (OLS). It is applied to data that has been randomly split into two halves. A regression

model is fitted to each half of the data while the residuals are calculated using the model

applied to the half of the data that was not used to fit the model. Then, after merging the data,

units with the largest positive and negative residuals are removed. The process is repeated until

a certain percentage of data has been deleted. The SS technique is expected to be more robust

than TLS because the residuals used to remove the ‘outlier’ units are not calculated from a

regression model that has been generated using these ‘outlier’ units.

The list of robust regression techniques cannot be complete without the Least median of

squares (LMS) technique. It was described by Rousseeuw and Leroy [2003]. It resembles the

bootstrap method. It involves drawing subsamples of size n – 1 from a sample of size n using

simple random sampling with replacement. For each subsample trial regression model

parameters are calculated and then their squared residuals, which are used to calculate the

median. The model with the smallest median of squared residuals is selected. The LMS

technique should be more robust than TLS because an OLS regression model is fitted in the

absence of "outlier" units, without totally removing these ‘outlier’ units.

Local regression estimation

Apart from the modified GREG or Winsor estimation, among other methods that are less

sensitive to outliers, the local regression technique is often discussed.

Unlike classical regression, where one model is fitted to the whole sample, local regression

consists in fitting many local models for different ranges of observations in the sample based on

kernel regression. The standard local estimator can be obtained by replacing the model-based estimates of the study variable in the first term of the GREG estimator ($\hat{y}_i$) with estimates obtained using kernel regression ($\hat{y}_{loc,i}$). Thanks to this solution, the estimator is less sensitive to outliers and to a non-linear relationship between the study and the auxiliary variable. A standard local regression estimator can be expressed as [Breidt, Opsomer, 2000]:

$$\hat{Y}_{loc}=\sum_{i\in U}\hat{y}_{loc,i}+\sum_{i\in s}w_i(y_i-\hat{y}_{loc,i}) \tag{27}$$

or as an estimator based on a model where estimates are made for each observation from the general population [Chambers, Dorfman, Wehrly, 1993; Dorfman, 2000]:

$$\hat{y}_{loc,i}=c_j'(D_i'W_iD_i)^{-1}D_i'W_i y_s,\quad i=1,2,\dots,N \tag{28}$$

where:

U denotes the population and s is the sample,

$c_j'$ is the vector with 1 in the jth position and 0s otherwise,

$D_i$, i = 1, 2, …, N, are n x 2 matrices, each with $[\,1\ \ (x_j-x_i)\,]$ in the jth row, j = 1, 2, …, n,

$W_i$, i = 1, 2, …, N, are n x n diagonal matrices with $w_i b_i^{-1}K[(x_j-x_i)b_i^{-1}]$ in cell (j, j), where $K(\cdot)$ is the kernel function and $b_i$ is the bandwidth for the ith observation.

The local regression estimator is based on the value of $\hat{y}_{loc,i}$, which, in many cases, is close to $\hat{y}_i$, calculated using the standard GREG estimator. The difference between them is that kernel regression, which relies on many models fitted to subsamples, makes it possible to account for local changes in the study variable, which cannot be done with the single regression model of the GREG estimator.

This prediction (28) can be written as [Hedlin, 2004]:

$$\hat{y}_{loc,i}=\bar{y}_{loc,i}+(x_i-\bar{x}_{loc,i})\frac{\sum_{j\in s}q_{ji}(x_j-\bar{x}_{loc,i})\,y_j}{\sum_{j\in s}q_{ji}(x_j-\bar{x}_{loc,i})^2} \tag{29}$$

$$q_{ji}=\max\Big[0,\frac{3}{4}\Big(1-\Big(\frac{x_j-x_i}{b_i}\Big)^2\Big)\Big] \tag{30}$$

where:

n – sample size, i = 1, 2, …, n,

$q_{ji}$ – diagonal elements of the matrix $W_i$,

$$\bar{y}_{loc,i}=\sum_{j\in s}q_{ji}y_j\Big(\sum_{j\in s}q_{ji}\Big)^{-1} \tag{31}$$

$$\bar{x}_{loc,i}=\sum_{j\in s}q_{ji}x_j\Big(\sum_{j\in s}q_{ji}\Big)^{-1} \tag{32}$$

It is worth noting that $\bar{y}_{loc,i}$ is an estimate of the study variable for the ith observation made on the basis of local estimation without the x-variable.
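The local prediction for a single unit can be sketched directly from the kernel weights, the local means and the local slope. The sketch uses the Epanechnikov form of the weights and ignores sampling weights, as in the formulas attributed to Hedlin [2004]; the function names and the data are hypothetical.

```python
def epanechnikov(u):
    """Kernel weight max[0, 3/4 (1 - u^2)]."""
    return max(0.0, 0.75 * (1.0 - u * u))


def local_linear_prediction(x0, x, y, b):
    """Local linear prediction at x0 with bandwidth b (illustrative sketch)."""
    q = [epanechnikov((xj - x0) / b) for xj in x]
    q_sum = sum(q)
    if q_sum == 0.0:
        raise ValueError("no observation falls inside the window")
    x_loc = sum(qj * xj for qj, xj in zip(q, x)) / q_sum     # local mean of x
    y_loc = sum(qj * yj for qj, yj in zip(q, y)) / q_sum     # local mean of y
    sxx = sum(qj * (xj - x_loc) ** 2 for qj, xj in zip(q, x))
    if sxx == 0.0:
        return y_loc                                         # one distinct x in the window
    slope = sum(qj * (xj - x_loc) * yj for qj, xj, yj in zip(q, x, y)) / sxx
    return y_loc + (x0 - x_loc) * slope


# Made-up data lying exactly on the line y = 3 + 2x
x = [0.5 * i for i in range(21)]
y = [3.0 + 2.0 * xi for xi in x]
pred_centre = local_linear_prediction(5.0, x, y, 2.0)
pred_edge = local_linear_prediction(0.3, x, y, 2.0)
```

When the data follow a straight line exactly, the local prediction reproduces that line for any bandwidth that leaves at least two distinct x values in the window, which gives a simple check of an implementation.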

The effectiveness of local regression strongly depends on the choice of the kernel function and the bandwidth. Different approaches are suggested in the literature [Chambers, Dorfman, Wehrly, 1993; Chambers, 1996; Kim, Breidt, Opsomer, 2001]. The following forms of the function of variable u can be used in local regression:

- Uniform $K(u)=1$

- Triangular $K(u)=1-|u|$

- Quartic (biweight) $K(u)=\frac{15}{16}(1-u^2)^2$

- Cosine $K(u)=\frac{\pi}{4}\cos\big(\frac{\pi}{2}u\big)$

- Gaussian $K(u)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2}$

- Epanechnikov $K(u)=\frac{3}{4}(1-u^2)$

The most commonly used one, however, is the Epanechnikov kernel in the form [Hedlin, 2004]:

$$K(u_{ji})=\max\Big[0,\frac{3}{4}\big(1-u_{ji}^2\big)\Big]=\max\Big[0,\frac{3}{4}\Big(1-\Big(\frac{x_j-x_i}{b_i}\Big)^2\Big)\Big] \tag{33}$$

where:

$u_{ji}=(x_j-x_i)/b_i$ for i = 1, 2, …, n and j = 1, 2, …, n,

$b_i$ - bandwidth for the ith unit.

The kernel function defines a window for each sampled unit. Observations outside this window are not used to estimate $m_i$. It is worth noting that:

$$K(u_{ji})=0\quad\text{if } |u_{ji}|=|x_j-x_i|/b_i\ge 1 \tag{34}$$

Thus, the weight selection process is repeated for all observations in the sample. First, a window is defined for each observation. Observations outside the window for the ith unit are assigned a weight equal to 0. Other observations, included in the window for the ith unit, are given positive weights. The weight value depends on how much the auxiliary variable differs from its corresponding value for the ith unit. The largest weights are assigned to units with auxiliary variable values close to $x_i$.


The bandwidth is determined using different methods, which can be divided into 2 groups. The first group includes methods where only one bandwidth is set, which is fixed for the whole sample. The second group includes methods where several bandwidths are recommended, depending on values of the auxiliary variable for sampled observations.

The most common ways of bandwidth selection for the ith unit include:

- $b_i=\frac{1}{4}(x_{max}-x_{min})$

- $b_i=x_{i+10}-x_{i-10}$

- $b_i=x_{i+20}-x_{i-20}$

- $b_i=x_{i+40}-x_{i-40}$

The first one is an example of the first group, where the bandwidth is constant. It constitutes ¼ of the range of the auxiliary variable. The remaining three methods, referred to as nearest-neighbor, treat the bandwidth as variable. The parameter $b_i$ expresses the difference between values of the auxiliary variable x for two observations selected from the frame sorted by $x_i$ in ascending order and situated at a distance of 10, 20 or 40 observations from $x_i$. If the number of a sampled unit (with index i, where i = 1, …, n) is so small that no unit with the number i−10, i−20 or i−40 can be specified, and hence the value $x_{i-10}$, $x_{i-20}$ or $x_{i-40}$ does not exist, the minimum value of x is used instead. A similar procedure is applied with respect to $x_{i+10}$, $x_{i+20}$ and $x_{i+40}$, with the maximum value being used [Hedlin, 2004].
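The bandwidth rules listed above can be sketched as follows, assuming that the values of x are already sorted in ascending order. The function names and the data are hypothetical.

```python
def fixed_bandwidth(x_sorted):
    """Constant bandwidth: one quarter of the range of x."""
    return 0.25 * (x_sorted[-1] - x_sorted[0])


def nearest_neighbour_bandwidths(x_sorted, m=10):
    """b_i = x_{i+m} - x_{i-m}, replacing missing neighbours by the minimum or maximum of x."""
    n = len(x_sorted)
    bandwidths = []
    for i in range(n):
        lower = x_sorted[max(i - m, 0)]         # x_{i-m}, or the minimum near the lower end
        upper = x_sorted[min(i + m, n - 1)]     # x_{i+m}, or the maximum near the upper end
        bandwidths.append(upper - lower)
    return bandwidths


# Made-up auxiliary variable: 100 equally spaced values 0, 1, ..., 99
x_sorted = [float(i) for i in range(100)]
b_fixed = fixed_bandwidth(x_sorted)
b_nn10 = nearest_neighbour_bandwidths(x_sorted, m=10)
```

For equally spaced values the nearest-neighbor bandwidth is constant in the middle of the range and becomes narrower towards both ends, where the minimum or maximum of x replaces the missing neighbour.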

The local regression estimator is called flexible, since for a large bandwidth it is similar to the GREG estimator and for a smaller bandwidth it captures local model departures.


2. Examples

The module presents an attempt to apply administrative data for a more effective use of

the SP-3 survey sampling of small businesses (see Dehnel, Gołata, 2010). The analysis refers to

data from the year 2001 for which databases were made available by the Polish Central

Statistical Office (CSO). Auxiliary variables come from two administrative registers: Database of

Statistical Units (BJS) created by CSO on the basis of the register of economic units called

REGON and the Ministry of Finance’s tax register. Because of a high fraction of nonresponse in

the SP-3 survey, the incompleteness of the tax register and outdated information in BJS, all

constituting major problems in terms of data availability, special matching procedures were

applied to improve the completeness of the data set for purposes of estimation. Estimation was

conducted for the following basic variables: the average number of employees under a contract

employed in one company, the amount of average monthly salary paid over one year in one

company, average revenue per company and average costs per company. The domain of study

was defined as an ‘intersection’ of the joint distribution of economic entities by region and

classification of economic activity (PKD section).

Estimation was conducted for 176 domains: 16 regions (R) and 11 PKD sections (S)

(Region&Section). Since the number of domains taken into account in the study is quite

considerable, we present only selected results for one of the estimated variables - Gross Wage

in the region of West Pomerania province and in selected sections: Manufacturing, Construction,

Trade, Hotels and Restaurants. However, to evaluate results obtained in the study, synthetic

characteristics across all domains were provided.

The results obtained from the study were compared with the GREG approach. To determine the estimation precision, an approximate method based on bootstrap resampling was applied. The bootstrap method presented by Efron (1979) for simple random sampling needs modification when applied to complex samples. We applied the approach described in detail by Shao and Tu (1995). The simulation study consisted of 500 iterations (see Klimanek, Paradysz, 2006). In each iteration a subsample of size n-1 was drawn and modified weights were determined to estimate the unknown parameter Y*_b, where (*) stands for the different robust estimators used in the study. The empirical variance was obtained according to the standard approach

$$ \operatorname{Var}(\hat{Y}^{*}) = \frac{1}{500-1} \sum_{b=1}^{500} \left( \hat{Y}^{*}_{b} - \hat{Y}^{*} \right)^{2} \tag{35} $$

Estimation precision was evaluated on the basis of the estimator's coefficient of variation:

$$ CV(\hat{Y}^{*}) = \frac{\sqrt{\operatorname{Var}(\hat{Y}^{*})}}{\hat{Y}^{*}} \tag{36} $$

The coefficient RedCV was used to measure the degree of CV reduction obtained by the robust estimators applied:

$$ RedCV(\hat{Y}^{*}) = \frac{CV(\hat{Y}^{*}) - CV(\hat{Y}_{DIR})}{CV(\hat{Y}_{DIR})} \tag{37} $$

Additionally, the design effect coefficient deff (proposed by Kish (1995)) was used to measure the effectiveness of the target estimator in comparison with the direct one:

$$ deff(\hat{Y}^{*}) = \frac{\operatorname{Var}(\hat{Y}^{*})}{\operatorname{Var}(\hat{Y}_{DIR})} \tag{38} $$

All values of deff smaller than one indicate that the method applied is more effective than direct estimation.
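The four precision measures (35)-(38) can be illustrated with a minimal Python sketch. The replicate values below are made-up toy numbers, not results from the study, and the function names are hypothetical; only five replicates are used instead of the 500 iterations described above.

```python
import math

def bootstrap_variance(replicates, estimate):
    """Eq. (35): squared deviations of the replicates from the estimate, divided by B - 1."""
    b = len(replicates)
    return sum((r - estimate) ** 2 for r in replicates) / (b - 1)

def cv(variance, estimate):
    """Eq. (36): coefficient of variation of the estimator."""
    return math.sqrt(variance) / estimate

def red_cv(cv_robust, cv_direct):
    """Eq. (37): relative change of CV; negative when the robust CV is lower."""
    return (cv_robust - cv_direct) / cv_direct

def deff(var_robust, var_direct):
    """Eq. (38): variance of the robust estimator relative to the direct one."""
    return var_robust / var_direct

# Toy bootstrap replicates (assumed values for illustration only).
robust_reps = [98.0, 101.0, 100.0, 102.0, 99.0]
direct_reps = [94.0, 106.0, 100.0, 108.0, 92.0]
robust_est = 100.0
direct_est = 100.0

var_r = bootstrap_variance(robust_reps, robust_est)
var_d = bootstrap_variance(direct_reps, direct_est)
cv_r = cv(var_r, robust_est)
cv_d = cv(var_d, direct_est)
red = red_cv(cv_r, cv_d)
d = deff(var_r, var_d)
print(var_r, var_d, round(cv_r, 4), round(cv_d, 4), round(red, 4), d)
```

With definition (37) as written, a robust estimator that is more precise than the direct one yields a negative RedCV and a deff below one.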

Modifications of GREG estimator

The estimation was conducted for Gross Wage (y). As auxiliary variables (x) and (z) we

used: (i) the number of employees from BJS and (ii) revenue from the Tax Register.

The synthetic characteristics summarising estimates of Gross Wage by domain are presented in Table 1. They show that the highest precision was obtained for Y_GREG0, the estimator ignoring observations for which 'z' was equal to zero. As γ increases, the maximum, median and average values of CV usually increase as well. The biggest proportion of domains for which the relative dispersion of the estimator was smaller than for the direct one was also observed for Y_GREG0, followed by Y_GREG1, Y_GREG1.5 and Y_GREG2. Only for the model with Revenue as an auxiliary variable was the order of the estimators changed, which can be explained by a weaker relation between the estimated and auxiliary variables.



We cannot unequivocally determine which of the estimators is characterized by the highest precision. In each case an appropriate model and appropriate definitions of 'x' and 'z' are necessary. Model misspecification may lead to negative estimates. Such an untypical situation was observed for the domain Trade in Lower Silesia province (see Figure 1.B).

Table 1. Characteristics of the CV distribution, modified GREG estimates of Gross Wage, all domains, small business, Poland, 2001

Characteristics                     Y_DIR    Y_GREG0   Y_GREG1   Y_GREG1.5  Y_GREG2

x - number of employees, z - number of employees
min                                 0.0721   0.0665    0.0670    0.0671     0.0673
max                                 1.0964   1.0402    1.0522    1.1754     1.3406
average                             0.3153   0.3019    0.3020    0.3030     0.3061
median                              0.2861   0.2739    0.2773    0.2799     0.2811
proportion of domains for which
CV < CV_DIR (%)                     -        73.30     72.16     71.59      68.18

x - revenue, z - number of employees
min                                 0.073    0.070     0.073     0.073      0.073
max                                 0.779    0.990     0.863     0.795      0.979
average                             0.304    0.297     0.306     0.310      0.312
median                              0.281    0.266     0.283     0.286      0.286

x - revenue, z - revenue
min                                 0.070    0.070     -1.731    -1.388     -1.332
max                                 0.793    13.556    5.093     3.203      11.370
average                             0.305    0.421     0.358     0.316      0.401
median                              0.280    0.274     0.276     0.280      0.302

Source: Own estimation based on SP3 survey data, BJS, Tax Register

When the auxiliary variables 'x' and 'z' were both defined as Revenue, the estimators Y_GREG1.5 and Y_GREG2 produced negative estimates. As can be seen from Figure 1.A, outliers in both dimensions (of the estimated as well as of the auxiliary variable) influence estimation results. The regression lines corresponding to Y_GREG0, Y_GREG1, Y_GREG1.5 and Y_GREG2 presented in the diagram show the following regularity (which was also observed for other domains). The coefficient γ determines the degree to which outliers in the dimension of 'y' are included in the model. The linear regression for Y_GREG0, with γ=0, includes outliers in the dimension of 'x'. As γ increases, outliers in the dimension of 'y' are accounted for to a greater degree by assigning greater g_i weights to them, and less importance is attached to outliers in the dimension of 'x'. The value of γ also influences the range of the weights. The smallest dispersion of weights is observed for γ=0 (g_i ∈ (-0.13; 2.4)), and the biggest range for γ=2 (g_i ∈ (-35; 55)).
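How γ changes the g_i weights can be illustrated with a small Python sketch. It assumes the standard single-regressor GREG form with residual variance proportional to z^γ, g_i = 1 + (t_x - Σ w_j x_j) (x_i / z_i^γ) / Σ w_j x_j² / z_j^γ, which may differ in detail from the formula used in the study. The data, the known total `t_x` and the function name are hypothetical.

```python
def g_weights(x, z, w, t_x, gamma):
    """g-weights for a single auxiliary variable x, assuming variance ~ z**gamma.

    x, z : auxiliary values for the sampled units (z must be positive)
    w    : design weights
    t_x  : known population total of x (assumed available from a register)
    """
    t_hat = sum(wi * xi for wi, xi in zip(w, x))
    denom = sum(wi * xi * xi / zi ** gamma for wi, xi, zi in zip(w, x, z))
    return [1.0 + (t_x - t_hat) * (xi / zi ** gamma) / denom
            for xi, zi in zip(x, z)]

# Toy sample with one unit that is large in both x and z (assumed values).
x = [2.0, 5.0, 9.0, 40.0]
z = [1.0, 2.0, 4.0, 50.0]
w = [10.0, 10.0, 10.0, 10.0]
t_x = 600.0

results = {}
for gamma in (0, 1, 1.5, 2):
    g = g_weights(x, z, w, t_x, gamma)
    # Calibration check: the weighted total of x reproduces the known total.
    calibrated = sum(wi * gi * xi for wi, gi, xi in zip(w, g, x))
    results[gamma] = (g, calibrated)
    print(gamma, [round(gi, 3) for gi in g], round(calibrated, 6))
```

Whatever the value of γ, the products w_i g_i reproduce the known total of x; what changes is how the adjustment is spread over units, which is why the range of the weights depends on γ.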

The g_i weights are sample dependent and should not diverge much from one. Negative weights are particularly undesirable as they tend to increase the estimator variance. That is why some methods for restricting g_i weights were developed. There is also a close relationship between the g_i weight of a sample observation and its influence on Beta. The most common measure of this influence is DFBETA, defined as the change in the estimate of Beta when a unit is excluded from the sample:

$$ DFBETA_i = \left( \sum_{j \in S} \frac{x_j x_j'}{z_j^{\gamma}} \right)^{-1} \left( \frac{x_i}{z_i^{\gamma}} \right) \left( \frac{e_i}{1-h_i} \right) \tag{39} $$

The main reason for negative estimates is the presence of outliers, i.e. companies in which the Gross Wage is very large and the Revenue very small. They influence model parameters and result in disproportionately large weights being assigned to those observations. DFBETA was used to identify the influential units. The outlying observations were modified before conducting further estimation.
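A minimal sketch of formula (39) for the scalar case (one regressor, no intercept) is given below. It assumes that e_i is the residual from the weighted fit and that h_i is the usual weighted leverage, h_i = (x_i² / z_i^γ) / Σ x_j² / z_j^γ; these definitions are not spelled out in this section. The data are invented, with the last unit combining a very small Revenue with a very large Gross Wage.

```python
def wls_slope(x, y, z, gamma):
    """Slope of y = beta * x fitted with weights 1 / z**gamma."""
    num = sum(xi * yi / zi ** gamma for xi, yi, zi in zip(x, y, z))
    den = sum(xi * xi / zi ** gamma for xi, zi in zip(x, z))
    return num / den

def dfbeta(x, y, z, gamma):
    """Eq. (39) in the scalar case: change in beta when unit i is left out."""
    a = sum(xi * xi / zi ** gamma for xi, zi in zip(x, z))
    beta = wls_slope(x, y, z, gamma)
    out = []
    for xi, yi, zi in zip(x, y, z):
        e = yi - beta * xi                 # residual of unit i
        h = (xi * xi / zi ** gamma) / a    # weighted leverage (assumed definition)
        out.append((xi / zi ** gamma) * e / (1.0 - h) / a)
    return out

# Toy data: revenue (x and z) and gross wage (y); the last unit is the outlier.
x = [10.0, 20.0, 30.0, 40.0, 5.0]
y = [12.0, 19.0, 33.0, 41.0, 60.0]
z = list(x)
gamma = 1

values = dfbeta(x, y, z, gamma)
most_influential = max(range(len(values)), key=lambda i: abs(values[i]))
print([round(v, 4) for v in values], most_influential)
```

In this toy sample the unit with small revenue and large wage has the largest absolute DFBETA, which is the pattern the paragraph above describes.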

Figure 1. Regression lines for Y_GREGγ estimators (γ = 0, 1, 1.5, 2) in Lower Silesia province: A) domain defined as Industry; B) domain defined as Trade; C) domain defined as Trade after substituting the outliers with the median. Each panel plots Gross Wage (ths. PLN) against Revenue (ths. PLN).

Source: Own estimation based on SP3 survey data and Tax Register

To avoid negative estimates, one proposed solution is to substitute outliers with median values or to implement post-stratification (Chambers et al. (2001a)). Observations were divided into three groups: outliers in the dimension of 'x', outliers in the dimension of 'y' and the remaining ones (ideally this group should be the largest). The results obtained by substitution are presented in Figure 1.C. The modification considerably changed the model parameters for Y_GREG1.5 and Y_GREG2. Applying both methods improved the estimation precision and reduced the variation of the estimator.

The analysis conducted so far leads to the following conclusions:

– Implementation of the Chambers et al. (2001a) modification produces results that vary considerably depending on the auxiliary variables and the model type.

– Introducing the additional auxiliary variable 'z' makes it possible to move the estimation problem from outlying values of the auxiliary variable to outlying values of the estimated variable.

– Application of the additional variable 'z' also creates the possibility of estimating basic economic information for small businesses at a lower aggregation level.

– Weights from the modified model may vary considerably, which contradicts the general assumption that the product of the weights g_i and w_i should be close to w_i.

– Due to outliers in the sample data, weights obtained from the modified model may take on values less than zero, which might result in negative estimates.

– The gain in estimation precision is greater for domains of a smaller size (number of observations) and in the case of a stronger correlation between the estimated and the auxiliary variable.

– A significant improvement in estimation precision may be obtained by selecting an adequate model including variables 'x' and 'z', with properly chosen auxiliary variables. With a large number of small domains, however, this makes application of the modified model rather difficult.

– Weights assigned to the GREG estimator are strongly related to the distance measure DFBETA, which enables identification of outliers.

– Post-stratification is proposed for domains in which selecting a suitable model is difficult.

Winsor estimation

To evaluate the results of the research, as in the case of the modified GREG, the reference values were obtained by using direct Horvitz-Thompson (HT) estimates. A comparison of Winsor estimators with the modified GREG is presented in Table 2. This time the estimated variable was defined as Revenue. To preserve comparability, we applied an identical set of variables in both the modified GREG and the Winsor estimator. The results show that the most efficient GREG estimates are obtained for γ=1.5: both the mean and the median of the coefficient of variation are the lowest. For the larger value γ=2, efficiency decreases. If the median is taken into account, a very high precision (close to that for γ=1.5) is also observed for γ=0 and γ=1. Regardless of the value of γ, the GREG estimator provides more precise estimates (in terms of the median) than the direct one. The highest fraction of domains for which GREG estimates were more efficient is observed for γ=1, and this percentage decreases as γ increases further.

Basic distribution characteristics of the variation of the Winsor estimator across all domains (Region&Section) show similar CV values for the TLS, LAV and LMS techniques and the direct HT estimator. The mean lies in the interval (0.263 – 0.303) and the median in (0.210 – 0.232). The smallest variation is observed for the TSS technique (mean equal to 0.225 and median to 0.172). TSS also had the highest fraction of domains for which it provided more efficient estimates than the direct estimator (almost 74%). This result is quite close to that for GREG with γ=1, and better than for GREG with γ=1.5. TSS also produces better estimates in terms of other characteristics of the CV distribution, such as the minimum, average and median.

Table 2. Characteristics of CV distribution: Winsor and modified GREG estimators, revenue, all domains, 2001

GREG estimator
Characteristics                     Y_DIR    Y_GREG0   Y_GREG1   Y_GREG1.5  Y_GREG2

y - Revenue, x - Cost from TAX register, z - Revenue from TAX register
min                                 0.088    0.074     0.075     -7.340     -3.834
max                                 0.902    1.407     1.417     8.148      4.474
average                             0.276    0.273     0.278     0.230      0.298
median                              0.232    0.212     0.213     0.210      0.226
proportion of domains for which
CV < CV_DIR (%)                     -        72.16     74.43     70.45      59.09

WINSOR estimator
Characteristics                     Y_DIR    Y_TLS     Y_LAV     Y_TSS      Y_LMS

y - Revenue, x - Cost from TAX register
min                                 0.088    0.071     0.070     0.033      0.072
max                                 0.902    1.371     1.352     1.485      1.578
average                             0.276    0.263     0.270     0.225      0.303
median                              0.232    0.211     0.210     0.172      0.228
proportion of domains for which
CV < CV_DIR (%)                     -        64.77     64.20     73.86      50.00


The appraisal of Winsor estimation showed that:

– Depending on the type of estimated characteristic, different methods might be applied to determine the cut-off values used in the delimitation process: two (upper and lower) or one (upper). Both cases were examined, showing that the estimation precision is higher for two cut-off values.

– Simulation research demonstrated a relation between efficiency and the type of robust regression technique used. The more robust the regression technique applied, the more efficient the estimates produced. This was shown by comparing TSS estimates with those obtained by TLS and LAV.

– The cut-off values used in the delimitation process, and therefore the type of robust regression technique, influence the precision of the Winsor estimator. The highest precision in our study was observed for the Sample Splitting Technique (TSS).
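The difference between one and two cut-off values can be sketched as follows. This is a simplified illustration that clips each observation to a band around its fitted value; the cut-off distances `k_upper` and `k_lower`, the fitted values and the function name are assumptions, and the Winsor estimator used in the study may define the modified values differently (for example by also involving the design weights).

```python
def winsorize(y, fitted, k_upper, k_lower=None):
    """Pull observations back to cut-off values placed around the fitted values.

    With k_lower=None only the upper cut-off is applied (one-sided case);
    otherwise values below fitted - k_lower are raised as well (two-sided case).
    """
    out = []
    for yi, fi in zip(y, fitted):
        yi_w = min(yi, fi + k_upper)          # upper cut-off
        if k_lower is not None:
            yi_w = max(yi_w, fi - k_lower)    # lower cut-off
        out.append(yi_w)
    return out

# Toy data (assumed): observed values and values fitted by a robust regression.
y = [10.0, 50.0, 3.0, 21.0]
fitted = [12.0, 20.0, 15.0, 20.0]

one_sided = winsorize(y, fitted, k_upper=10.0)
two_sided = winsorize(y, fitted, k_upper=10.0, k_lower=8.0)
print(one_sided)
print(two_sided)
```

In the one-sided case only the unusually large value is modified; the two-sided case also raises the unusually small one.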




The comparison of robust estimation in business surveys was supplemented by an analysis of how the bandwidth definition influences the estimation precision of local regression. To preserve comparability, we applied a set of variables identical to that used for modified GREG and Winsor estimation. The estimation was conducted for Gross Wage (y). We used Revenue from the Tax Register as the auxiliary variable (x).

Table 3. Characteristics of the CV distribution, local regression estimator, all domains, 2001

Characteristics    Y_DIR    Y_loc(10)   Y_loc(20)   Y_loc(40)   Y_loc(max,min)
min                0.10     0.09        0.08        0.09        0.08
max                0.63     0.60        1.77        1.78        0.73
average            0.27     0.25        0.36        0.36        0.27
median             0.25     0.23        0.21        0.21        0.24



The characteristics of the relative dispersion distribution for the direct and local estimators are comparable, except for the maximum. The highest precision is observed for the Y_loc(10) estimator, both in terms of the characteristics of the distribution and in terms of the proportion of domains for which CV is smaller than for the direct estimator (see Table 3).

The analysis of local regression estimation led to the following observations:

– As the bandwidth increases, the local regression estimates resemble GREG estimates more closely.

– The highest precision, in terms of CV, was obtained for local regression with the bandwidths used in Y_loc(10) and Y_loc(20).

– For a narrow bandwidth, estimation is based on many local models, which increases the computing time.

– As the bandwidth gets narrower, local changes in the estimated variable are taken into account to a greater degree. As the bandwidth increases, the smoothing effect becomes more pronounced.

– The weights designated by the kernel function do not depend on the values of the estimated variable but on the auxiliary variables. This means that they can be applied to many estimated variables, provided the set of auxiliary variables is unchanged.
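The last observation can be illustrated with a short Python sketch of a nearest-neighbour local linear fit. The kernel shape (quadratic), the 10% widening of the bandwidth and the function names are assumptions made for illustration, design weights are omitted, and the units are assumed to be sorted by distinct values of x. The point is that `kernel_weights` takes only the auxiliary variable as input.

```python
def kernel_weights(x, i, k):
    """Kernel weights for unit i using its k nearest neighbours on each side.

    Depends only on the auxiliary variable x (sorted, distinct values assumed).
    """
    lo, hi = max(0, i - k), min(len(x) - 1, i + k)
    # Bandwidth: slightly wider than the furthest neighbour, so all weights stay positive.
    h = 1.1 * max(abs(x[j] - x[i]) for j in range(lo, hi + 1))
    return {j: 1.0 - (abs(x[j] - x[i]) / h) ** 2 for j in range(lo, hi + 1)}

def local_linear_fit(x, y, i, k):
    """Fitted value at x[i] from a kernel-weighted straight line."""
    kw = kernel_weights(x, i, k)
    sw = sum(kw.values())
    sx = sum(c * x[j] for j, c in kw.items())
    sy = sum(c * y[j] for j, c in kw.items())
    sxx = sum(c * x[j] * x[j] for j, c in kw.items())
    sxy = sum(c * x[j] * y[j] for j, c in kw.items())
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    intercept = (sy - slope * sx) / sw
    return intercept + slope * x[i]

# Toy data: an exactly linear relation, which any local linear fit reproduces.
x = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0]
y = [2.0 + 3.0 * xi for xi in x]

fitted = [local_linear_fit(x, y, i, k=2) for i in range(len(x))]
print([round(f, 6) for f in fitted])
```

Because the weights are computed once from x, the same dictionary of weights could be reused for a second study variable observed on the same units.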

Comparison of the results and conclusions

GREG estimation is frequently used. Its application to business statistics, however, carries a danger arising from outliers, whose presence results in a significant bias of estimates (see Hedlin (2004)). Several modifications belonging to the 'closer' and 'further' families of GREG estimators were applied and discussed. Among those closely related, we relied on model-based estimation using inverse transformation and Winsor estimation, which introduces a 'border' point to distinguish outlying observations. As concerns the 'further' family, the module presents local regression, which can be used to accommodate local departures from the linear model. The estimation precision was evaluated on the basis of the estimator's coefficient of variation CV, the coefficient RedCV measuring the degree of CV reduction, and the deff coefficient (see Figures 2–4).

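Figure 2. Characteristics of CV distribution across all domains, different robust estimators, 2001. The chart shows the MAX, Q3, Q2, MEAN, Q1 and MIN of CV (vertical axis from 0.0 to 2.0) for the estimators GREG, GREG0, GREG1, GREG1.5, GREG2, TLS, LAV, TSS, LMS, LOC10, LOC20, LOC40 and LOCmax.

Source: Own estimation based on SP3 survey data, BJS and Tax Register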

The highest estimation precision, expressed by the average value and the quartiles of the estimator's CV across all domains, is observed for the Winsor Y_TSS and local Y_loc(10) estimators. This remark does not take into account the extreme value observed for Winsor TSS, which was due to high dispersion in one of the bootstrap subsamples. For the remaining Winsor estimators, local regression and the modified GREG, the relative dispersion, as measured by the quartiles and the average, is very similar to, and only slightly smaller than, that of the original GREG estimator.

The narrowest range of the CV is observed for two local estimators, Y_loc(10) and Y_loc(max,min), as well as the modified GREG by Chambers et al. For the remaining estimators examined, the CV values are much more spread out, but do not exceed the maximum range observed for the original GREG estimator (CV=12%).

Figure 3. Characteristics of RedCV distribution across all domains, different robust estimators, 2001. The chart shows the MAX, Q3, Q2, MEAN, Q1 and MIN of RedCV (vertical axis from -100% to 100%).


The coefficient RedCV (%) represents the degree to which the variation of the estimators is reduced in comparison with the direct HT estimator (see Figure 3). If the average value across all domains of study, or the quartiles of the distribution, are considered, a reduction was obtained for the Winsor TSS estimator (on average by 33%, and by 61% for the median) and the local regression estimators Y_loc(10) (average by 10%, median by 9%) and Y_loc(max,min) (average by 4%, median by 8%). For the remaining estimation techniques applied, the average value of CV is bigger than for the direct HT estimator, which means that on average these techniques did not improve estimation precision. Generally, this is due to extreme values of CV observed for individual domains, which influence the mean. If the median, a measure resistant to outliers, is considered instead, a reduction is observed for all Winsor and local regression estimators.

Another coefficient used to evaluate the gain in efficiency of an estimator is deff, which compares the variance of the estimator considered with that of the direct estimator (see Figure 4). A value smaller than one means that the estimation technique applied is more efficient than the direct one. In terms of the deff coefficient, the greatest gain was obtained by the Winsor sample splitting estimator TSS (on average by 50%). For all the remaining estimators, the average variation across all domains is greater than that obtained using the direct one. An especially high value of the deff coefficient is observed for all the local regression estimators. To explain this, we should bear in mind that although local models are constructed, all the observations, including outliers, are taken into account in the estimation process. This directly contributes to the high variation of local estimators.

Figure 4. Characteristics of deff coefficient distribution across all domains, different estimators, 2001. The chart shows the MAX, Q3, Q2, MEAN, Q1 and MIN of deff (vertical axis from 0.0 to 5.0).


In assessing estimators, one has to determine their bias. Since the real value of the estimated parameter is unknown, this is difficult. The present study relies on a method applied by R. Chambers, G. Brown, P. Heady, D. Heasman [2001] and E. Gołata [2004] in a study which focused on the use of indirect estimation in the analysis of unemployment. The method is based on one of the properties of the direct estimator: it is assumed that despite its high variability, it is an unbiased estimator. Hence, a regression model relating real values to direct estimates will have the form y=x. Since real values are unknown, it was assumed that if estimates obtained using a given estimator are close to real values, the estimator can be treated as an unbiased predictor of the direct estimator [Chambers, Brown, Heady, Heasman (2001)]. Estimates obtained using the unbiased direct estimator can be regarded as values of a random variable whose expectation is equal to the estimate calculated by applying the estimator of interest. If the estimator under assessment is unbiased, the regression coefficient in the model based on its estimates and direct estimates will be 1. As the estimator bias increases, the regression coefficient will increasingly differ from 1.

To test bias, parameters of the regression function y=ax were estimated for the direct estimator (the OY axis; see footnote 2) and estimates obtained using the other estimators analysed in the study (the OX axis), see Fig. 5.

A significance test was then used to verify the hypothesis that the difference between the estimate of the regression coefficient and 1 is not statistically significant. Additionally, scatter plots were used to illustrate how the direct estimates are distributed relative to the estimates of the estimators under assessment. When the differences between the direct and other estimates are random, observations concentrate along the identity line. Since the size of study domains differs significantly, estimates were transformed by applying a power function with an exponent of 0.5. This was necessary to meet the homoscedasticity assumption of OLS [Chambers, Brown, Heady, Heasman (2001)], see Figs. 5 – 8.

Footnote 2: In the study a direct regression estimator was used, which is based on auxiliary variables to offset the small sample size across domains.
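The bias check described above can be sketched in a few lines of Python: both sets of domain estimates are square-root transformed, a line through the origin y = ax is fitted with the direct estimates on the OY axis, and the t-statistic for the hypothesis a = 1 is computed. The residual variance with n - 1 degrees of freedom is the usual choice for a one-parameter model and is an assumption here, as are the toy estimates and the function name. The p-value is left out because it requires the Student's t distribution, which the Python standard library does not provide.

```python
import math

def slope_test(direct, indirect):
    """Fit sqrt(direct) = a * sqrt(indirect) and return (a, t) for H0: a = 1."""
    y = [math.sqrt(v) for v in direct]     # OY axis: direct estimates
    x = [math.sqrt(v) for v in indirect]   # OX axis: estimator under assessment
    sxx = sum(xi * xi for xi in x)
    a = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    # Residual variance with n - 1 degrees of freedom (one fitted parameter).
    s2 = sum((yi - a * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - 1)
    se = math.sqrt(s2 / sxx)
    return a, (a - 1.0) / se

# Toy domain estimates (assumed values, not taken from the study).
direct = [100.0, 400.0, 900.0, 1600.0, 2500.0]
indirect = [81.0, 441.0, 841.0, 1681.0, 2401.0]

a, t = slope_test(direct, indirect)
print(round(a, 4), round(t, 4))
```

A coefficient close to 1 with a small absolute t-value, as in this toy case, is the pattern reported for the TLS, LAV and TSS estimators in Table 4.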


Fig. 5. The relationship between estimates of gross wage obtained by the direct estimator and GREG estimator across all domains

[Scatter plot of transformed estimator values, both axes from 0 to 1500. Fitted line: Y_GREG = 0.9746x, R² = 0.9925.]

Source: Own estimation based on SP3 survey data and the Tax Register

Fig. 6. The relationship between estimates of gross wage obtained by the direct estimator and the modified GREG estimator across all domains

[Scatter plot of transformed estimator values for the modified GREG (Chambers model) estimators, both axes from 0 to 1500. Fitted lines: Y_GREG0 = 0.9739x, R² = 0.9911; Y_GREG1 = 0.9766x, R² = 0.9952; Y_GREG1.5 = 0.9765x, R² = 0.9732; Y_GREG2 = 0.9766x, R² = 0.9924.]


Fig. 7. The relationship between estimates of gross wage obtained by the direct estimator and the Winsor estimator across all domains

[Scatter plot of transformed estimator values for the Winsor estimators, both axes from 0 to 1500. Fitted lines: Y_TLS = 1.0064x, R² = 0.9842; Y_LAV = 1.0084x, R² = 0.9855; Y_TSS = 0.9986x, R² = 0.9906; Y_LMS = 0.9342x, R² = 0.9522.]



Fig. 8. The relationship between estimates of gross wage obtained by the direct estimator and local regression estimators across all domains

[Scatter plot of transformed estimator values for the local estimators, both axes from 0 to 1500. Fitted lines: Y_loc(10) = 0.8400x, R² = 0.9867; Y_loc(20) = 0.8515x, R² = 0.9832; Y_loc(40) = 0.8617x, R² = 0.9811; Y_loc(max,min) = 0.7943x, R² = 0.9779.]


The Student's t-value based on the model for the standard GREG estimator justifies the rejection of the null hypothesis that the regression coefficient is equal to 1 (see Table 4). A similar result was obtained for the other, modified GREG estimators. Based on the p-value for the t-test statistic, the difference between estimates obtained by the direct and GREG estimators cannot be treated as due to chance, see Figs. 5 – 6 and Table 4. This result is surprising, considering the fact that GREG estimators are regarded as unbiased. However, if the sample includes some observations with extreme values of the auxiliary variable (outliers), then such estimators can under- or overestimate the study variable.

Different results of the significance test were obtained for the Winsor estimators, where outliers are modified, see Fig. 7 and Table 4. The t-test confirmed the compatibility between estimates obtained by the direct estimator and three indirect estimators: Y_TLS, Y_LAV and Y_TSS. The null hypothesis that the regression coefficient is equal to 1 was rejected for the estimator Y_LMS.

The last group to be analysed included local regression estimators. They are significantly different from the direct estimator. The difference is evident from the slope of the regression line and the significance test results (see Fig. 8 and Table 4). Local estimators are characterised by the largest absolute t-values. For all cases, the p-value is less than 0.001. This indicates that these estimators are biased. Hence, the differences between estimates produced by the direct and local estimators cannot be attributed to chance.

To sum up the verification of the random character of the difference between direct unbiased estimates and model-based estimates, it can be said that the "best" results were obtained using Winsor estimation. The main difference between Winsor estimators and the other estimators assessed in the study is that outliers are modified to be as close as possible to cut-off values. In contrast, auxiliary variable values are not changed in modified GREG estimators and local estimators. What changes is only their influence on the estimation of the study variable, which depends on weight values. The verification study reveals a negative influence of outliers on estimation precision. This seems to be confirmed by the degree of correlation between direct and model-based estimates, see Table 5. In the case of Winsor estimators the correlation is very strong, in contrast to most local estimators, where it is much weaker. If unbiased direct estimates are treated as a reference point, correlation analysis can demonstrate to what extent model-based estimates obtained by other estimators are different. The weakest correlation can be observed for three local estimators, Y_loc(20), Y_loc(40) and Y_loc(max,min) (r ∈ (0.87; 0.90)). The correlation is stronger for the local estimator Y_loc(10) (r=0.986) and the Winsor estimator Y_LMS (r=0.977), with the strongest correlation observed for the three remaining Winsor estimators Y_TLS, Y_LAV, Y_TSS and the GREG estimators (r ∈ (0.991; 0.998)).

Table 4. Results of the study to verify the null hypothesis about the regression coefficient equal to 1 for the relationship between indirect and direct estimates (Y_DIR) of gross wage in 2001.

Estimator          Regression coefficient   t-value     p-value    Coefficient of determination

GREG estimation
Y_GREG             0.9746                   -6.0787     < 0.001    0.9968

The modified GREG estimation
Y_GREG0            0.9739                   -5.7249     < 0.001    0.9962
Y_GREG1            0.9766                   -6.9514     < 0.001    0.9979
Y_GREG1.5          0.9765                   -5.8704     < 0.001    0.9971
Y_GREG2            0.9766                   -5.5502     < 0.001    0.9967

Winsor estimation
Y_TLS              1.0064                   1.0222      0.30812    0.9932
Y_LAV              1.0084                   1.3979      0.16391    0.9938
Y_TSS              0.9986                   -0.2986     0.76563    0.9960
Y_LMS              0.9342                   -6.4495     < 0.001    0.9795

Local regression estimation
Y_loc(10)          0.8400                   -21.7436    < 0.001    0.9867
Y_loc(20)          0.8515                   -17.6389    < 0.001    0.9832
Y_loc(40)          0.8617                   -15.3053    < 0.001    0.9811
Y_loc(max,min)     0.7943                   -22.7796    < 0.001    0.9779

Source: Own estimation based on SP3 survey data and the Tax Register


Table 5. A correlation matrix between estimates of gross wage obtained using estimators under assessment by province and activity class in 2001.

Column labels: DIR = Y_DIR, GREG = Y_GREG, G0 / G1 / G1.5 / G2 = Y_GREG0 / Y_GREG1 / Y_GREG1.5 / Y_GREG2, TLS, LAV, TSS, LMS = Winsor estimators, L10 / L20 / L40 / Lmm = Y_loc(10) / Y_loc(20) / Y_loc(40) / Y_loc(max,min).

Estimator       DIR     GREG    G0      G1      G1.5    G2      TLS     LAV     TSS     LMS     L10     L20     L40     Lmm

GREG estimation
Y_GREG          0.9976  1

The modified GREG estimation
Y_GREG0         0.9979  0.9966  1
Y_GREG1         0.9920  0.9938  0.9903  1
Y_GREG1.5       0.9911  0.9928  0.9890  0.9997  1
Y_GREG2         0.9909  0.9926  0.9887  0.9995  1.0000  1

Winsor estimation
Y_TLS           0.9923  0.9933  0.9911  0.9902  0.9893  0.9891  1
Y_LAV           0.9930  0.9935  0.9915  0.9909  0.9900  0.9897  0.9994  1
Y_TSS           0.9972  0.9984  0.9962  0.9925  0.9915  0.9913  0.9976  0.9977  1
Y_LMS           0.9771  0.9792  0.9759  0.9724  0.9716  0.9714  0.9755  0.9758  0.9792  1

Local regression estimation
Y_loc(10)       0.9856  0.9813  0.9844  0.9763  0.9757  0.9756  0.9891  0.9894  0.9870  0.9614  1
Y_loc(20)       0.8904  0.8831  0.8936  0.8578  0.8568  0.8569  0.8995  0.8988  0.8946  0.8759  0.9289  1
Y_loc(40)       0.8958  0.8888  0.8988  0.8639  0.8630  0.8632  0.9052  0.9044  0.9002  0.8817  0.9339  0.9985  1
Y_loc(max,min)  0.8733  0.8637  0.8759  0.8363  0.8351  0.8352  0.8774  0.8762  0.8741  0.8562  0.9125  0.9866  0.9834  1


Evaluation of the estimators based on CV, RedCV, deff and the significance test for the regression coefficient indicates that the Winsor estimator with sample splitting (TSS) is characterised by the greatest precision. Rating the remaining estimators in terms of estimation precision is rather difficult, since it changes depending on the estimation parameters, while the validity of the criteria is relative. It is much easier to choose the 'best' estimator within each method of estimation. In the case of modified GREG estimation, the best choice is the Y_GREG1.5 estimator, which accounts for an auxiliary variable 'z' specifying heteroscedasticity in the regression of y on x in proportion to the value of z_i^1.5. In the group of local regression estimators, it is the Y_loc(10) estimator that displays the best precision, since it has the narrowest bandwidth for constructing local kernel functions. The bandwidth is determined separately for each sampled unit using the "nearest neighbour method". In the case of the ith unit the bandwidth comprises units with numbers i ± 10. It should be stressed that the bandwidth, which is used to construct the kernel function, affects the computation time. An increased bandwidth slows down the estimation process. That is why, even though processing large datasets is no longer a major consideration for analysts, local estimation methods are seldom used in practice.

The third group of estimators analysed in this study consists of Winsor estimators. As was mentioned above, it is the estimator Y_TSS, which is based on sample splitting, that turned out to be the best. The technique involves constructing two regression models for sampled units randomly divided into two sets. Residuals for the first set are computed using the model fitted to the other one. The final regression model used in Winsor estimation is constructed after rejecting units whose residuals are the highest. This approach increases the robustness of the sample splitting technique in comparison with TLS or LAV, since the residuals used to reject units are not computed from a regression model that was fitted to those same units.
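The sample splitting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: ordinary least squares lines are used for both halves, residuals are computed crosswise for both sets, and a fixed number of units (`n_reject`, a hypothetical parameter) with the largest absolute residuals is rejected. The rejection rule used in the study is not specified in this section.

```python
import random

def ols_line(x, y):
    """Intercept and slope of an ordinary least squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def random_split(n, seed=0):
    """Randomly divide unit indices 0..n-1 into two sets of (almost) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[: n // 2], idx[n // 2:]

def sample_splitting_fit(x, y, set_a, set_b, n_reject):
    """Cross-residuals, rejection of the largest ones, then the final fit."""
    resid = {}
    for own, other in ((set_a, set_b), (set_b, set_a)):
        # Residuals for `own` come from the model fitted to `other`.
        a, b = ols_line([x[j] for j in other], [y[j] for j in other])
        for j in own:
            resid[j] = abs(y[j] - (a + b * x[j]))
    ranked = sorted(resid, key=lambda j: resid[j])
    keep = sorted(ranked[: len(ranked) - n_reject])
    return ols_line([x[j] for j in keep], [y[j] for j in keep]), keep

# Toy data (assumed): y = 1 + 2x with one unit shifted upwards by 40.
x = [float(v) for v in range(1, 11)]
y = [1.0 + 2.0 * xi for xi in x]
y[4] += 40.0

set_a, set_b = random_split(len(x), seed=1)
(intercept, slope), kept = sample_splitting_fit(x, y, set_a, set_b, n_reject=1)
print(round(intercept, 3), round(slope, 3), kept)
```

With very small sets, a model distorted by an outlier in one half can produce large cross-residuals for clean units in the other half, so the choice of how many units to reject matters in practice.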

3. Glossary

Term        Definition
outliers    Extreme values that differ from the other data; values of a variable which are essentially "impossible" under the given model

Literature:Belsley D. A., Kuh E., Welsch R. E., 1980. Regression diagnostics, Identifying influential data and

sources of co-linearity, New York, Wiley.Breidt, F.J., Opsomer, J.D. (2000) Local Polynomial Regression Estimation in Survey Sampling.

The Annals of Statistics, 28, 1026 – 1053.Brewer K., 1963, Ratio estimation and finite population, some results deductible from the

assumption of an underlying stochastic process, Australian Journal of Statistics 5, s. 93-105.Chambers, R.L. (1996) Robust case-weighting for multipurpose establishment Surveys, Journal of

Official Statistics, Vol.12, No.1, 3 – 32.

36


Chambers, R., Dorfman, A.H., Wehrly, T.E. (1993) Bias Robust Estimation in Finite Populations Using Nonparametric Calibration. Journal of the American Statistical Association, 88, 268 – 277.

Chambers, R., Kokic, P., Smith, P. and Cruddas, M. (2000) Winsorization for Identifying and Treating Outliers in Business Surveys, Proceedings of the Second International Conference on Establishment Surveys (ICES II), 687 – 696.

Chambers R., Brown G., Heady P., Heasman D. (2001) Evaluation of Small Area Estimation Methods – an Application to Unemployment Estimates from the UK LFS, Proceedings of Statistics Canada Symposium 2001, Achieving Data Quality in a Statistical Agency: a Methodological Perspective.

Chambers, R.L, Falvey, H., Hedlin, D., Kokic P. (2001a) Does the Model Matter for GREG Estimation? A Business Survey Example, Journal of Official Statistics, Vol.17, No.4, 527 – 544.

Cook R. D., 1977, Deletion of influential observations in linear regression, Technometrics, 19, s.351-361.

Dehnel G., Gołata E. (2010) On some robust estimators for Polish Business Survey, Statistics in Transition- new series, Vol.11, number 2, Warszawa 2010. s. 287-312 (Central Statistical Office and Polish Statistical Association), 58 – 71, Summ. - Bibliogr. ISBN 978-83-7027-431-3 (in Polish).

Dorfman, A.H. (2000) Non-Parametric Regression for Estimating Totals in Finite Populations. Proceedings of the Survey Research Methods. American Statistical Association, 47 – 54.

Efron, B. (1979) Bootstrap methods: Another look at the jackknife, [in:] Annals of Statistics 7, 1979, 1 – 26.

Gołata E., 2004, Estymacja pośrednia bezrobocia na lokalnym rynku pracy, Wydawnictwo Akademii Ekonomicznej w Poznaniu, Poznań.

Gross W.F., Bode G., Taylor J.M., Lloyd–Smith C.W., 1986, Some finite population estimators which reduce the contribution of outliers, [in:] Proceedings of the Pacific Statistical Conference, 20–24 May 1985, Auckland, New Zealand.

Hedlin D. (2004) Business Survey Estimation, R&D, Sweden.

Hidiroglou, M.H., Srinath, K.P. (1981) Some estimators of population total from simple random samples containing large units, JASA, 76, 690 – 695.

Kim, J.Y., Breidt, F.J. and Opsomer, J.D. (2001) Local polynomial regression estimation in two-stage sampling. Proceedings of the Section on Survey Research Methods, American Statistical Association, 55 – 61.

Kish L. (1995) Methods for design effects, Journal of Official Statistics, 11, 55 – 77.

Klimanek T., Paradysz J. (2006) Adaptation of EURAREA experience in business statistics, “Statistics in Transition”, Vol.7, No.4.

Kokic, P.N., Bell, P.A. (1994) Optimal winsorizing cutoffs for a stratified finite population estimator, Journal of Official Statistics, 10, 419 – 435.

Konarski R., 2004, Regresja wielokrotna, diagnostyka i selekcja modułu regresji, website, www.pbsdga.pl/x.php?x=160/Regresja-wielokrotna.html.

Mackin, C., Preston J. (2002) Winsorization for Generalised Regression Estimation, Australian Bureau of Statistics.

Pawlowska Z. (2005) Role of small and medium enterprises in creating a demand on work, [in:] “Wiadomosci Statystyczne”, No.2, 34 – 46 (in Polish).

Preston J., Mackin C., 2002, Winsorization for Generalised Regression Estimation, Paper for the Methodological Advisory Committee, November 2002, Australian Bureau of Statistics.

Särndal C.E., Swensson B., Wretman J. (1992) Model Assisted Survey Sampling, Springer Verlag, New York.

Searls D.T., 1966, An estimator which reduces large true observations, JASA, 61, pp. 1200–1204.

Shao J., Tu D., 1995, The jackknife and bootstrap, New York, Springer Verlag.

Specific description – Method: A synthetic estimator for small area estimation

A.1 Purpose of the method

To present alternative estimation techniques that are less sensitive to outliers.

A.2 Recommended use of the method

The methods presented in this module are recommended when the study variable(s) are highly skewed, there is a large proportion of zero responses, there are some negative values, and several auxiliary variables are available that can be used to improve estimation in the presence of outliers. Such a situation is common in business statistics. The growing use of auxiliary information from administrative registers, together with the need to substantially reduce sample sizes or to produce more effective estimates, has increased the importance of recognizing and dealing with this data problem.

A.3 Possible disadvantages of the method

It is difficult to rank the estimators, because their evaluation changes depending on the criterion used. It would be much easier to indicate the ‘best’ or ‘worst’ estimator in each group. The modified GREG estimator can often produce negative weights. As for local regression, it should be stressed that the bandwidth significantly influences the computing time.

Among the Winsor estimators, the one implementing the Sample Splitting Technique (TSSY) provided the most efficient estimates. In this method two regression models were estimated for sample observations randomly divided into two groups. Evaluating these models in terms of their residuals led to the rejection of the most outlying observations. This makes the TSS technique more robust than alternatives such as TLS or LAV.
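The sample-splitting idea can be illustrated with a small sketch. The Python code below is not the procedure of Mackin and Preston (2002); it is a simplified illustration under explicit assumptions: a single auxiliary variable, ordinary least squares fitted in each half, residuals of each half computed from the model fitted on the other half, and a cutoff of k times the normalised median absolute deviation. The names fit_line, split_sample_flags, k and seed are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def split_sample_flags(xs, ys, k=3.0, seed=0):
    """Return sorted indices whose cross-half residual exceeds the cutoff.

    Assumption: each unit is judged by the model fitted on the OTHER half,
    so an outlier cannot pull its own regression line towards itself.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # random division into two groups
    half = len(idx) // 2
    groups = (idx[:half], idx[half:])
    flagged = []
    for fit_grp, eval_grp in (groups, groups[::-1]):
        a, b = fit_line([xs[i] for i in fit_grp], [ys[i] for i in fit_grp])
        res = {i: ys[i] - (a + b * xs[i]) for i in eval_grp}
        med = statistics.median(res.values())
        mad = statistics.median(abs(r - med) for r in res.values())
        cutoff = k * 1.4826 * mad             # robust scale times k
        flagged += [i for i, r in res.items() if abs(r - med) > cutoff]
    return sorted(flagged)


# Illustrative data: a near-linear relation with one gross outlier at index 12.
xs = list(range(1, 21))
ys = [2.0 * x + 0.1 * ((7 * i) % 5 - 2) for i, x in enumerate(xs)]
ys[12] += 100.0
print(split_sample_flags(xs, ys))
```

The half that contains the outlier yields a distorted fit, so some regular units in the other half may also be flagged; in practice the flagged units would be reviewed or the split repeated rather than rejected automatically.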

A.4 Variants of the method

The variants are: the modification of the GREG estimator proposed by R. Chambers, H. Falvey, D. Hedlin and P. Kokic (2001a), the Winsor estimator discussed by C. Mackin and J. Preston (2002), and local regression as presented by J.Y. Kim, F.J. Breidt and J.D. Opsomer (2001).

A.5 Input data sets

The input data set depends on the estimators considered and on the source of information. It can contain individual information for all units in the sample, as well as information from auxiliary sources, e.g. an administrative register. Specific software (e.g. SAS) may require a different structure of the input data set for its robust estimation procedures.

A.6 Logical preconditions

The level of aggregation of the study variables and the auxiliary variables should be the same.

A.7 Tuning parameters

The tuning parameters depend on the estimators used; examples are the maximum number of iterations and the convergence criterion.
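To show how these two tuning parameters typically enter an iterative robust procedure, here is a minimal sketch in Python. It is an illustration only, not an estimator prescribed by this module: it computes a Huber M-estimate of location by iteratively reweighted averaging. The function name huber_location, the tuning constant c = 1.345 and the defaults for max_iter and tol are assumptions chosen for the example.

```python
import statistics


def huber_location(values, c=1.345, max_iter=50, tol=1e-8):
    """Huber M-estimate of location by iteratively reweighted averaging.

    max_iter is the maximum number of iterations; tol is the convergence
    criterion (relative to the robust scale).
    Returns (estimate, iterations_used, converged).
    """
    mu = statistics.median(values)            # robust starting point
    # Robust scale: normalised median absolute deviation, held fixed.
    scale = 1.4826 * statistics.median(abs(v - mu) for v in values)
    if scale == 0:
        return mu, 0, True
    for iteration in range(1, max_iter + 1):
        # Huber weights: 1 inside the band, c*scale/|residual| outside it.
        weights = [min(1.0, c * scale / abs(v - mu)) if v != mu else 1.0
                   for v in values]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) <= tol * scale:   # convergence criterion met
            return new_mu, iteration, True
        mu = new_mu
    return mu, max_iter, False                # stopped by the iteration cap


values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0]   # one extreme value
estimate, n_iter, converged = huber_location(values)
print(estimate, n_iter, converged)
```

The returned iteration count and convergence flag are the kind of information that can be logged for each run; a run that stops at max_iter without converging signals that the tuning parameters or the starting point should be reconsidered.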

A.8 Recommended use of the individual variants of the method

Recommendations for the individual variants are discussed in:

Hedlin D., 2004, Business Survey Estimation, R&D, Sweden.

Preston J., Mackin C., 2002, Winsorization for Generalised Regression Estimation, Paper for the Methodological Advisory Committee, November 2002, Australian Bureau of Statistics.

A.9 Output data sets

An output data set may contain a table with the following information: estimates for small areas, MSE and bias (depending on the software used).

A.10 Properties of the output data sets

The user should check the quality of the estimates using their knowledge of the investigated phenomenon together with the MSE and bias of the estimates.

A.11 Unit of processing

Unit-level data and domain-level variables are processed to compute the sample-size-dependent estimator and its MSE.

A.12 User interaction - not tool specific

1. Select the method of robust estimation.

2. Choose the auxiliary variables to be included in robust estimation.

3. Establish the level of aggregation.

4. Establish the tuning parameters (convergence criteria, starting point, stopping point).

5. After robust estimation, check and verify the quality indicators (e.g. MSE) in order to evaluate the final results.

A.13 Logging indicators

1. Run time of the application.

2. Number of iterations needed to reach convergence in the estimation process.

3. Characteristics of the input data.

A.14 Quality indicators of the output data

1. MSE

2. Bias
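Both indicators can be computed directly when replicated estimates and a reference value are available, for example in a simulation study where the true population value is known. The sketch below assumes exactly that setting; the function name bias_and_mse and the example numbers are hypothetical. It also illustrates the identity MSE = variance + bias squared.

```python
def bias_and_mse(estimates, true_value):
    """Empirical bias, variance and MSE of replicated estimates.

    Assumes true_value is known, as in a simulation study.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, variance, mse


bias, variance, mse = bias_and_mse([9.0, 10.0, 11.0, 12.0], true_value=10.0)
print(bias, variance, mse)   # 0.5 1.25 1.5, and 1.25 + 0.5**2 == 1.5
```

A robust estimator usually accepts some bias in exchange for a smaller variance, so the MSE is the natural summary when comparing it with a non-robust alternative.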

A.15 Actual use of the method

National accounts, business surveys.

A.16 Relationship with other modules

2. Weighting

2.1 Basic weights with examples

2.2 Calibration

2.3 GREG
