Calibrated imputation of numerical data under
linear edit restrictions
Jeroen Pannekoek
Natalie Shlomo
Ton de Waal
Missing data
Data may be missing from collected data sets
Unit non-response: data from entire units are missing; often dealt with by means of weighting
Item non-response: some items from units are missing; usually dealt with by means of imputation
Linear edit restrictions
Data often have to satisfy edit restrictions
For numerical data most edits are linear
Balance equations: a_1 x_1 + a_2 x_2 + … + a_n x_n + b = 0
Inequalities: a_1 x_1 + a_2 x_2 + … + a_n x_n + b ≥ 0
Totals
Sometimes totals are also known
x_11  x_12  x_13
x_21  x_22  x_23
…     …     …
x_r1  x_r2  x_r3
X_1   X_2   X_3   (column totals)
Eliminating balance equations
We can “eliminate balance equations”
Example: set of edits
net + tax – gross = 0
net ≥ tax
net ≥ 0
Eliminating the balance equation, using net = gross – tax:
gross – tax ≥ tax
gross – tax ≥ 0
Eliminating balance equations
By eliminating all balance equations we only have to deal with inequality edits
If we sequentially impute variables, we only have to ensure that each imputed value lies in an interval L_i ≤ x_i ≤ U_i
We can now focus on satisfying totals
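As an illustration of how such an interval can be obtained, here is a minimal Python sketch. It assumes the inequality edits are written as A x + b ≥ 0 and that all variables except the one being imputed are fixed. The function name feasible_interval, the variable ordering (gross, tax), the gross value of 3000 and the extra edit tax ≥ 0 are assumptions made for this example only.

```python
import numpy as np

def feasible_interval(A, b, x, i):
    """Interval [lo, hi] for x[i] such that A @ x + b >= 0,
    with all other entries of x held fixed."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.asarray(x, dtype=float)
    others = np.arange(A.shape[1]) != i
    # Contribution of the fixed variables plus the constant term, per edit.
    rest = A[:, others] @ x[others] + b
    lo, hi = -np.inf, np.inf
    for c, r in zip(A[:, i], rest):
        if c > 0:        # c * x_i + r >= 0  ->  x_i >= -r / c
            lo = max(lo, -r / c)
        elif c < 0:      # dividing by a negative coefficient flips the inequality
            hi = min(hi, -r / c)
    return lo, hi

# Variables ordered as (gross, tax); net was eliminated via net = gross - tax.
A = [[1, -2],   # gross - tax >= tax
     [1, -1],   # gross - tax >= 0
     [0,  1]]   # tax >= 0 (assumed extra edit for this example)
b = [0, 0, 0]
lo, hi = feasible_interval(A, b, x=[3000.0, np.nan], i=1)
print(lo, hi)
```

With gross fixed at 3000, the edits leave tax free to vary between 0 and 1500.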
Imputation methods
Adjusted predicted mean imputation
Adjusted predicted mean imputation with random residuals
MCMC approach
Adjusted predicted mean imputation
We use sequential imputation
All missing values for a variable (the target variable) are imputed simultaneously
We impute target column x_t
We use the model x_t = β_0 + β x_p + e
We impute x_t = β_0 + β x_p
Imputed values satisfy neither the edits nor the totals
Satisfying totals
The totals of the missing data for the target variable (X_{t,mis}) as well as the predictor (X_{p,mis}) are known
We construct the following model
for the observed data: x_{t,obs} = β_0 + β x_{p,obs} + e
for the totals of the missing data: X_{t,mis} = β_1 m + β X_{p,mis}
m is the number of missing values
We apply OLS to estimate the model parameters
We impute x_{t,mis} = β_1 + β x_{p,mis}
The sum of the imputed values then equals the known value of this total
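The benchmarking step can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation. Because the separate intercept for the missing data appears only in the totals equation, that equation can be fitted exactly, so the slope is taken from an ordinary regression on the observed data and the intercept is then solved from the known total. The function name benchmarked_predicted_mean and the simulated data are hypothetical.

```python
import numpy as np

def benchmarked_predicted_mean(xt_obs, xp_obs, xp_mis, Xt_mis):
    """Impute missing target values so that their sum equals the known total Xt_mis."""
    xt_obs = np.asarray(xt_obs, dtype=float)
    xp_obs = np.asarray(xp_obs, dtype=float)
    xp_mis = np.asarray(xp_mis, dtype=float)
    # OLS on the observed records: np.polyfit returns (slope, intercept) for deg=1.
    beta, beta0 = np.polyfit(xp_obs, xt_obs, deg=1)
    m = len(xp_mis)
    # Intercept for the missing records, solved from Xt_mis = beta1 * m + beta * Xp_mis.
    beta1 = (Xt_mis - beta * xp_mis.sum()) / m
    return beta1 + beta * xp_mis

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
xp_obs = rng.uniform(1000, 5000, size=50)
xt_obs = 200 + 0.7 * xp_obs + rng.normal(0, 50, size=50)
xp_mis = np.array([1500.0, 2500.0, 4000.0])
imputed = benchmarked_predicted_mean(xt_obs, xp_obs, xp_mis, Xt_mis=6000.0)
print(imputed, imputed.sum())
```

The imputed values add up to the known total by construction, but nothing yet forces them to respect the interval constraints.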
Satisfying totals and intervals (edits)
We impute x_{t,mis} = β_1 + β x_{p,mis} + a_t
The a_{t,i} are chosen in such a way that
imputed values lie in their feasible intervals
Σ_i a_{t,i} = 0
Appropriate values for a_{t,i} can be found by means of an operations research technique
For a simple alternative technique, see the paper
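One simple way to find such adjustments is sketched below in Python. It is not the operations research technique or the alternative from the paper, only an illustrative heuristic: clip the predictions to their intervals, then spread the remaining gap to the total over the records in proportion to the room they still have. The function name adjust_to_intervals_and_total and the numbers are hypothetical.

```python
import numpy as np

def adjust_to_intervals_and_total(pred, lo, hi, total):
    """Return adjusted values and adjustments a = adjusted - pred such that
    lo <= adjusted <= hi and adjusted.sum() == total (heuristic sketch)."""
    pred, lo, hi = (np.asarray(v, dtype=float) for v in (pred, lo, hi))
    x = np.clip(pred, lo, hi)            # step 1: enforce the intervals
    gap = total - x.sum()                # step 2: what is still missing from the total
    # Room left to move upwards (gap > 0) or downwards (gap <= 0) per record.
    slack = (hi - x) if gap > 0 else (x - lo)
    if abs(gap) > slack.sum():
        raise ValueError("total cannot be reached within the intervals")
    if gap != 0:
        # Each record moves by at most its own slack, so it stays inside its interval.
        x = x + gap * slack / slack.sum()
    return x, x - pred

pred = [100.0, 250.0, 400.0]             # predictions that already sum to the total
lo = [0.0, 0.0, 0.0]
hi = [150.0, 200.0, 500.0]
adjusted, a = adjust_to_intervals_and_total(pred, lo, hi, total=750.0)
print(adjusted, a.sum())
```

Because the predictions already sum to the total, the adjustments sum to zero, as required above.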
Satisfying totals and intervals (edits)
Alternatively, draw m residuals by acceptance/rejection sampling from a normal distribution (zero mean, variance equal to the residual variance of the regression model) so that they satisfy the interval constraints
Adjust the random residuals to meet the sum constraint, as is done for a_{t,i}
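The acceptance/rejection step can be sketched in Python as follows. This is only an illustration under assumed inputs: each residual interval is taken to be the feasible interval of the value minus its predicted mean, and the function name draw_residuals, the residual standard deviation and the bounds are hypothetical.

```python
import numpy as np

def draw_residuals(lo, hi, sigma, rng, max_tries=10_000):
    """Draw one N(0, sigma^2) residual per record, redrawing until it lies in [lo_i, hi_i]."""
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    res = np.empty(len(lo))
    pending = np.ones(len(lo), dtype=bool)
    for _ in range(max_tries):
        idx = np.flatnonzero(pending)
        if idx.size == 0:
            break
        draw = rng.normal(0.0, sigma, size=idx.size)
        ok = (draw >= lo[idx]) & (draw <= hi[idx])   # accept draws inside the interval
        res[idx[ok]] = draw[ok]
        pending[idx[ok]] = False                     # rejected records are redrawn
    if pending.any():
        raise RuntimeError("some intervals were never hit; they may be too narrow")
    return res

rng = np.random.default_rng(42)
lo = [-100.0, -50.0, 0.0]
hi = [100.0, 200.0, 30.0]
residuals = draw_residuals(lo, hi, sigma=80.0, rng=rng)
print(residuals)
```

The accepted residuals respect the intervals but will generally not sum to zero, which is why the sum adjustment is still needed afterwards.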
MCMC approach
Start with a pre-imputed consistent dataset
Randomly select two records
We select a variable in these records
Note that we know the sum of the two values of this variable for the two records
MCMC approach
We then apply the following two steps
1. We determine intervals for the two values.
2. We then draw a value for one missing value. The other value then immediately follows.
Now, repeat Steps 1 and 2 until “convergence”.
In Step 2 we draw a value from a posterior predictive distribution implied by a linear regression model under an uninformative prior, conditional on the fact that it has to lie inside the corresponding interval
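A single update of this kind can be sketched in Python as follows. The sketch makes a simplifying assumption that should be read as such: instead of the posterior predictive distribution from the regression model, it draws from a normal distribution with a given mean and standard deviation, truncated to the interval. The function name draw_pair and all numbers are hypothetical.

```python
import random
from statistics import NormalDist

def draw_pair(total, interval_i, interval_j, mu_i, sigma, rng):
    """Redraw the values of two records whose sum `total` must be preserved."""
    # Step 1: x_i must lie in its own interval, and x_j = total - x_i in its interval too.
    lo = max(interval_i[0], total - interval_j[1])
    hi = min(interval_i[1], total - interval_j[0])
    if lo > hi:
        raise ValueError("no feasible value for this pair")
    # Step 2: inverse-CDF draw from a normal truncated to [lo, hi]
    # (a stand-in for the posterior predictive distribution).
    dist = NormalDist(mu_i, sigma)
    u = rng.uniform(dist.cdf(lo), dist.cdf(hi))
    u = min(max(u, 1e-12), 1 - 1e-12)          # inv_cdf needs 0 < u < 1
    x_i = min(max(dist.inv_cdf(u), lo), hi)    # guard against rounding at the edges
    return x_i, total - x_i                    # the other value follows from the sum

rng = random.Random(1)
x_i, x_j = draw_pair(5000.0, (1000.0, 4000.0), (500.0, 3000.0),
                     mu_i=2600.0, sigma=400.0, rng=rng)
print(x_i, x_j)
```

Repeating this for randomly chosen pairs and variables keeps every known total fixed while the individual values move within their feasible intervals.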
Evaluation study: methods
Evaluated imputation methods:
UPMA: unbenchmarked simple predictive mean imputation, with adjustments to the imputations so that they satisfy interval constraints
BPMA: benchmarked predictive mean imputation, with adjustments to the imputations so that they satisfy interval constraints and totals
MCMC: the MCMC approach, with the BPMA imputations (including adjustments) used as the pre-imputed data set
Evaluation study: data set
11,907 individuals aged 15 and over who responded to all questions in the 2005 Israel Income Survey and earned more than 1000 Israeli Shekels in monthly gross income
Item non-response was introduced randomly to income variables
20% of records were selected randomly and their net income variable deleted
20% of records were selected randomly and their tax variable deleted, while 10% of those records were in common with the missing net income variable
Totals of each of the income variables are known
Evaluation study: data set
We focus on three variables from the Income Survey:
gross: gross income from earnings
net: net income from earnings
tax: tax paid
Edits:
net + tax = gross
net ≥ tax
gross ≥ 3 × tax
gross ≥ 0, net ≥ 0, tax ≥ 0
Log transform was carried out on variables to ensure normality of data
Evaluation criteria
dL1: average distance between imputed and true values
Z: number of imputed records on boundary of feasible region defined by edits
K-S (Kolmogorov-Smirnov): compares empirical distribution of original values to empirical distribution of imputed values
Sign: sign test carried out on difference between original value and imputed value
Kappa: Kappa statistic for 2-dimensional contingency table; compares agreement against that which might be expected by chance
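Two of these criteria can be sketched in Python as follows. The definitions used here are assumptions: dL1 is taken as the mean absolute difference between imputed and true values, and the K-S criterion as the maximum distance between the two empirical distribution functions, which lies between 0 and 1. K-S values reported above 1 would therefore rest on a different scaling, which the slides do not specify. The function names and the data are hypothetical.

```python
import numpy as np

def d_l1(true, imputed):
    """Mean absolute difference between true and imputed values (assumed definition)."""
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return np.abs(true - imputed).mean()

def ks_distance(true, imputed):
    """Maximum gap between the two empirical distribution functions (unscaled)."""
    a = np.sort(np.asarray(true, dtype=float))
    b = np.sort(np.asarray(imputed, dtype=float))
    grid = np.concatenate([a, b])        # the gap can only change at data points
    F_a = np.searchsorted(a, grid, side="right") / len(a)
    F_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(F_a - F_b).max()

true = [10.0, 20.0, 30.0]
imputed = [12.0, 18.0, 33.0]
print(d_l1(true, imputed), ks_distance(true, imputed))
```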
Results
Net
       UPMA     BPMA       MCMC
dL1    2266.1   2132.6     4304.8
Z      204      11         1
K-S    3.535    5.129      9.100
Sign   0.0147   < 0.0001   0.0001
Kappa  0.161    0.178      0.117
Results
Tax
       UPMA       BPMA       MCMC
dL1    786.8      821.7      1393.7
Z      123        12         0
K-S    3.521      9.129      11.158
Sign   < 0.0001   < 0.0001   < 0.0001
Kappa  0.418      0.421      0.226
Conclusions
MCMC approach is doing worse than other methods on all criteria except number of records that lie on boundary
However, MCMC allows multiple imputation in order to take imputation uncertainty into account in variance estimation
BPMA appears to be slightly better compared to UPMA except for K-S statistic
Number of records that lie on boundary for UPMA is cause for concern
MCMC approach is doing slightly better than BPMA approach in this respect
Future research
Improving MCMC approach
Carrying out multiple imputation using MCMC approach to obtain proper variance estimation
In our study a log transformation was carried out on variables to ensure normality of data
Correction factor was introduced into constant term of regression model to correct for this log transformation
Better approach to this problem will be investigated
Extending problem to situations where one has non-equal sampling weights