A General Approach to Variance Estimation under Imputation for Missing Survey Data
J.N.K. Rao, Carleton University, Ottawa, Canada
Joint work with J.K. Kim at Iowa State University.
Workshop on Survey Sampling in honor of Jean-Claude Deville, Neuchatel, Switzerland, June 24-26, 2009
Outline
- Item Nonresponse
- Deterministic imputation: Population model approach
- Imputed estimator
- Linearization variance estimator
- Examples: Domain estimation, Composite imputation
- Stochastic imputation: variance estimation
- Examples: Multiple imputation, binary response
- Simulation results
- Doubly robust approach
- Extensions
Survey Data
- Design features: clustering, stratification, unequal probability of selection
- Sources of error:
  1. Sampling errors
  2. Non-sampling errors: nonresponse (missing data), noncoverage, measurement errors
Types of nonresponse
- Unit (or total) nonresponse: refusal, not-at-home
  Remedy: weight adjustment within classes
- Item nonresponse: sensitive item, answer not known, inconsistent answer
  Remedy: imputation (fill in missing data)
Advantages of imputation
- Complete data file: standard complete data methods
- Different analyses consistent with each other
- Reduce nonresponse bias
- Auxiliary x observed can be used to get "good" imputed values
- Same survey weight for all items
Commonly used imputation methods
- Marginal imputation methods:
  1. Business surveys: ratio, regression, nearest neighbor (NN)
  2. Socio-economic surveys: random donor (within classes), stochastic ratio or regression, fractional imputation (FI), multiple imputation (MI)
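Complete response set-up
- Population total: $\theta_N = \sum_{i=1}^{N} y_i$
- NHT estimator: $\hat{\theta}_n = \sum_{i \in s} d_i y_i$, where $d_i = \pi_i^{-1}$ is the design weight and $\pi_i = \Pr(i \in s)$ is the inclusion probability
- Variance estimator: $\hat{V}_n = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} y_i y_j$, where $\Omega_{ij}$ depends on the joint inclusion probabilities $\pi_{ij} > 0$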
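A minimal numerical sketch of the complete-response estimators, assuming simple random sampling without replacement and the Horvitz-Thompson form $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j) / (\pi_{ij} \pi_i \pi_j)$ with $\pi_{ii} = \pi_i$. The slides do not specify $\Omega_{ij}$, so that form, the sizes and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 50                       # hypothetical population and sample sizes
y = rng.normal(10.0, 2.0, size=n)     # hypothetical sampled y-values

pi_i = n / N                          # first-order inclusion probability (SRS)
pi_ij = n * (n - 1) / (N * (N - 1))   # joint inclusion probability, i != j
d = np.full(n, 1.0 / pi_i)            # design weights d_i = 1 / pi_i

theta_hat = np.sum(d * y)             # NHT estimator of the total

# Assumed Horvitz-Thompson form of Omega_ij, with pi_ii = pi_i on the diagonal
omega = np.full((n, n), (pi_ij - pi_i**2) / (pi_ij * pi_i**2))
np.fill_diagonal(omega, (1.0 - pi_i) / pi_i**2)

v_hat = y @ omega @ y                 # sum_i sum_j Omega_ij y_i y_j

# Closed form for SRS without replacement, for comparison
v_srs = N**2 * (1 - n / N) * y.var(ddof=1) / n
```

Under these assumptions the double sum reduces to the familiar closed form $N^2 (1 - n/N) s_y^2 / n$, so `v_hat` and `v_srs` agree up to rounding.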
Deterministic imputation
- Population model approach (Deville and Särndal, 1994): $E_\zeta(y_i \mid x_i) = m(x_i, \beta_0)$
- $a_i = 1$ if $y_i$ is observed when $i \in s$, and $a_i = 0$ otherwise, for $i \in U = \{1, 2, \ldots, N\}$
- MAR: distribution of $a_i$ depends only on $x_i$
- Imputed value: $\hat{y}_i = m(x_i, \hat{\beta})$
- $\hat{\beta}$: unique solution of the estimating equation (EE)
  $$\hat{U}(\beta) = \sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$$
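Model specification
- Further model specification: $\mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2 q(x_i, \beta_0)$
- $h(x_i, \beta) = \dot{m}(x_i, \beta) / q(x_i, \beta) \equiv h_i$, where $\dot{m}(x_i, \beta) = \partial m(x_i, \beta) / \partial \beta$
- Examples: commonly used imputations
  1. Ratio imputation: $h_i = 1$
     $E_\zeta(y_i \mid x_i) = \beta_0 x_i$, $\mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2 x_i$
  2. Linear regression imputation: $h_i = x_i$
     $E_\zeta(y_i \mid x_i) = x_i' \beta_0$, $\mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2$
  3. Logistic regression imputation ($y_i = 0$ or $1$): $h_i = x_i$
     $\log\{m_i / (1 - m_i)\} = x_i' \beta_0$, $\mathrm{Var}_\zeta(y_i \mid x_i) = m_i (1 - m_i)$, where $m_i = E_\zeta(y_i \mid x_i)$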
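Imputed estimator
- Imputed estimator of the total $\theta_N$:
  $$\hat{\theta}_{Id} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) m(x_i, \hat{\beta})\} \equiv \sum_{i \in s} d_i \tilde{y}_i$$
- Examples
  1. Ratio imputation: $m(x_i, \hat{\beta}) = x_i \hat{\beta}$, where $\hat{\beta} = \left(\sum_{i \in s} d_i a_i x_i\right)^{-1} \sum_{i \in s} d_i a_i y_i$
  2. Linear regression imputation: $m(x_i, \hat{\beta}) = x_i' \hat{\beta}$, where $\hat{\beta} = \left(\sum_{i \in s} d_i a_i x_i x_i'\right)^{-1} \sum_{i \in s} d_i a_i x_i y_i$
  3. Logistic regression imputation: $\hat{\beta}$ is the solution to $\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$
- Imputed estimator of the domain total $\theta_z = \sum_{i=1}^{N} z_i y_i$:
  $$\hat{\theta}_{I,z} = \sum_{i \in s} d_i z_i \tilde{y}_i$$
  where $z_i = 1$ if $i \in D$ and $z_i = 0$ otherwise.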
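The ratio-imputation case can be sketched on a toy sample. The data are hypothetical and chosen so that $y_i = 2 x_i$ exactly, which makes the fitted slope and the totals easy to check by hand.

```python
import numpy as np

# Hypothetical sample: y is exactly 2 * x so the result is easy to verify
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                       # in practice y is unknown where a == 0
a = np.array([1, 1, 0, 1, 0])     # response indicators a_i
d = np.full(5, 20.0)              # design weights (equal here)

# beta_hat = (sum d a x)^{-1} sum d a y uses respondents only
beta_hat = np.sum(d * a * y) / np.sum(d * a * x)

# y_tilde_i = a_i y_i + (1 - a_i) m(x_i, beta_hat)
y_tilde = np.where(a == 1, y, beta_hat * x)

theta_Id = np.sum(d * y_tilde)          # imputed estimator of the total

z = np.array([1, 0, 1, 0, 0])           # domain indicator z_i
theta_Iz = np.sum(d * z * y_tilde)      # imputed estimator of the domain total
```

Because the same completed values `y_tilde` serve the total and the domain total, the domain need not be known at the imputation stage.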
Variance estimation
- Treating imputed values as if observed: underestimation if $\tilde{y}_i$ is used in $\hat{V}_n$ in place of $y_i$
- Methods that account for imputation:
  - Adjusted jackknife: Rao and Shao (1992)
  - Linearization (population model): Deville and Särndal (1994)
  - Fractional imputation method: Fuller and Kim (2005)
  - Bootstrap: Shao and Sitter (1996)
  - Reverse approach: Shao and Steel (1999)
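Variance estimation (Cont'd)
- Linearization method:
  Theorem 1 (Kim and Rao, 2009): Under regularity conditions,
  $$n^{1/2} N^{-1} \left(\hat{\theta}_{Id} - \tilde{\theta}_{Id}\right) = o_p(1)$$
  where
  $$\tilde{\theta}_{Id} = \sum_{i \in s} w_i \eta_i,$$
  $$\eta_i = m(x_i; \beta_0) + a_i (1 + c' h_i) \{y_i - m(x_i; \beta_0)\},$$
  $$c = \left\{\sum_{i=1}^{N} a_i \dot{m}(x_i; \beta_0) h_i'\right\}^{-1} \sum_{i=1}^{N} (1 - a_i) \dot{m}(x_i; \beta_0).$$
  Here $w_i$ denotes the design weight $d_i$ and $\dot{m} = \partial m / \partial \beta$.
- Reference distribution: joint distribution of the population model and the sampling mechanism, conditional on the realized $(x_i, a_i)$ in the population.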
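Variance estimation (Cont'd)
- Reverse approach:
  1. $$\hat{V}_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat{\eta}_i \hat{\eta}_j$$
     where $\hat{\eta}_i = \eta_i(\hat{\beta})$.
  2. $$\hat{V}_{2d} = \sum_{i \in s} d_i a_i (1 + \hat{c}' \hat{h}_i)^2 \{y_i - m(x_i; \hat{\beta})\}^2$$
     $\hat{V}_{2d}$ is valid even if $V_\zeta(y_i \mid x_i)$ is misspecified.
  3. Variance estimator of $\hat{\theta}_{Id}$ ($\tilde{\theta}_{Id}$):
     $$\hat{V}_d = \hat{V}_{1d} + \hat{V}_{2d}$$
- $\hat{V}_d$ is approximately design-model unbiased.
- If the overall sampling rate is negligible: $\hat{V}_d \cong \hat{V}_{1d}$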
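The two variance components can be sketched for ratio imputation, where $m = \beta x$, $\dot{m} = x$ and $h = 1$. This sketch assumes simple random sampling, so $\hat{V}_{1d}$ uses the SRS closed form, and it takes $\hat{c}$ as the design-weighted sample analogue of $c$. The sizes, data and uniform response mechanism are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5000, 200                             # hypothetical sizes
x = rng.uniform(1.0, 5.0, size=n)
y = 2.0 * x + rng.normal(0.0, np.sqrt(x))    # ratio model, variance proportional to x
a = rng.binomial(1, 0.7, size=n)             # hypothetical uniform 70% response
d = np.full(n, N / n)                        # SRS design weights

# Ratio imputation: m = beta * x, m_dot = x, h = 1
beta_hat = np.sum(d * a * y) / np.sum(d * a * x)
m_hat = beta_hat * x
theta_Id = np.sum(d * np.where(a == 1, y, m_hat))

# c_hat = (sum d a m_dot h)^{-1} sum d (1 - a) m_dot  (assumed sample analogue of c)
c_hat = np.sum(d * (1 - a) * x) / np.sum(d * a * x)

resid = a * (y - m_hat)                      # zero for nonrespondents
eta_hat = m_hat + (1.0 + c_hat) * resid      # linearized values eta_i(beta_hat)

# V1d: complete-data variance formula applied to eta_hat (SRS closed form)
V1d = N**2 * (1 - n / N) * eta_hat.var(ddof=1) / n
# V2d: sum d a (1 + c'h)^2 (y - m)^2
V2d = np.sum(d * (1.0 + c_hat) ** 2 * resid**2)
Vd = V1d + V2d
```

A useful check is that `np.sum(d * eta_hat)` reproduces `theta_Id`, since the respondent residuals sum to zero under the estimating equation. With a sampling fraction of 4%, `V2d` is small relative to `V1d`.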
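Variance estimation (Cont'd)
- Domain estimation:
  1. $\hat{\theta}_{I,z}$: design-model unbiased for $\theta_z$
  2. Use
     $$\hat{V}_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat{\eta}_{iz} \hat{\eta}_{jz}$$
     where
     $$\hat{\eta}_{iz} = z_i m(x_i; \hat{\beta}) + a_i (z_i + \hat{c}_z' \hat{h}_i) \{y_i - m(x_i; \hat{\beta})\},$$
     $$\hat{c}_z = \left\{\sum_{i \in s} d_i a_i \dot{m}(x_i; \hat{\beta}) \hat{h}_i'\right\}^{-1} \sum_{i \in s} d_i z_i (1 - a_i) \dot{m}(x_i; \hat{\beta})$$
     with $\dot{m} = \partial m / \partial \beta$.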
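Composite imputation
- Variables $x$, $y$, $z$, with $z$ always observed
  $$s = s_{RR} \cup s_{RM} \cup s_{MR} \cup s_{MM}, \qquad \theta_N = \sum_{i=1}^{N} y_i$$
  $s_{RM}$: $x$ observed and $y$ missing; $s_{MM}$: $x$ and $y$ both missing
- Imputation model:
  $$E_\zeta(y_i \mid x_i, z_i) = \beta_{y|x} x_i, \qquad E_\zeta(x_i \mid z_i) = \beta_{x|z} z_i$$
- Imputed estimator:
  $$\hat{\theta}_{Id} = \sum_{i \in s_{+R}} d_i y_i + \sum_{i \in s_{RM}} d_i \left(\hat{\beta}_{y|x} x_i\right) + \sum_{i \in s_{MM}} d_i \left(\hat{\beta}_{y|x} \hat{\beta}_{x|z} z_i\right)$$
  where $s_{+R} = s_{RR} \cup s_{MR}$ is the set of units with $y$ observed.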
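Composite imputation (Cont'd)
- $\hat{\beta}_{y|x}$ and $\hat{\beta}_{x|z}$ are solutions of the estimating equations:
  $$\hat{U}_1(\beta_{y|x}) = \sum_{i \in s_{RR}} d_i (y_i - \beta_{y|x} x_i) = 0$$
  $$\hat{U}_2(\beta_{x|z}) = \sum_{i \in s_{R+}} d_i (x_i - \beta_{x|z} z_i) = 0$$
- Taylor linearization of the imputed estimator:
  $$\hat{\theta}_{Id}(\hat{\beta}) \cong \hat{\theta}_{Id}(\beta) - \left(\frac{\partial \hat{\theta}_{Id}}{\partial \beta}\right)' \left(\frac{\partial \hat{U}}{\partial \beta}\right)^{-1} \hat{U}(\beta)$$
  where $\hat{U} = (\hat{U}_1, \hat{U}_2)'$ and $\beta = (\beta_{y|x}, \beta_{x|z})'$.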
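The composite imputed estimator and its two estimating equations can be sketched on a toy sample. The data are hypothetical, with $y = 2x$ and $x = 3z$ exactly so the result can be checked by hand. The sketch reads $s_{R+}$ as all units with $x$ observed and $s_{+R}$ as all units with $y$ observed.

```python
import numpy as np

# Hypothetical sample with exact relationships y = 2x and x = 3z
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x = 3.0 * z
y = 2.0 * x
x_obs = np.array([True, True, True, False, False, True])   # x observed?
y_obs = np.array([True, False, True, True, False, True])   # y observed?
d = np.full(6, 10.0)                                        # design weights

s_RR = x_obs & y_obs          # x and y observed
s_Rplus = x_obs               # x observed (s_RR union s_RM)
s_RM = x_obs & ~y_obs         # x observed, y missing
s_MM = ~x_obs & ~y_obs        # x and y missing

# U1 = 0: beta_{y|x} from units with both x and y observed
b_yx = np.sum(d[s_RR] * y[s_RR]) / np.sum(d[s_RR] * x[s_RR])
# U2 = 0: beta_{x|z} from units with x observed
b_xz = np.sum(d[s_Rplus] * x[s_Rplus]) / np.sum(d[s_Rplus] * z[s_Rplus])

theta_Id = (
    np.sum(d[y_obs] * y[y_obs])                  # observed y (s_{+R})
    + np.sum(d[s_RM] * b_yx * x[s_RM])           # y imputed from x
    + np.sum(d[s_MM] * b_yx * b_xz * z[s_MM])    # y imputed from z through x
)
```

Boolean masks keep unobserved values out of every sum, so missing entries never enter the arithmetic.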
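Stochastic imputation
- $y_i^*$ = imputed value of $y_i$ such that $E_I(y_i^*) = m(x_i, \hat{\beta}) \equiv \hat{m}_i$
- Imputed estimator of $\theta_N$:
  $$\hat{\theta}_I = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) y_i^*\}, \qquad E_I(\hat{\theta}_I) = \hat{\theta}_{Id}$$
- Variance estimator of $\hat{\theta}_I$:
  $$\hat{V}_I = \hat{V}_d + \hat{V}^*$$
  where
  $$\hat{V}^* = \sum_{i \in s} d_i^2 (1 - a_i) (y_i^* - \hat{m}_i)^2$$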
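Multiple imputation: Rubin
- $y_i^{*(1)}, \ldots, y_i^{*(M)}$ = imputed values of $y_i$ ($M \ge 2$)
  $$\hat{\theta}_I^{(k)} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) y_i^{*(k)}\}$$
- Imputed estimator:
  $$\hat{\theta}_{MI} = M^{-1} \sum_{k=1}^{M} \hat{\theta}_I^{(k)}$$
- Rubin's variance estimator:
  $$\hat{V}_R = W_M + \frac{M + 1}{M} B_M$$
  where $W_M$ is the average of the $M$ naive variance estimators and
  $$B_M = (M - 1)^{-1} \sum_{k=1}^{M} \left(\hat{\theta}_I^{(k)} - \hat{\theta}_{MI}\right)^2$$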
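Rubin's combining rules translate directly into a few lines of code. The helper name `rubin_combine` and the input numbers are hypothetical, used only for illustration.

```python
import numpy as np

def rubin_combine(estimates, naive_variances):
    """Combine M point estimates and M naive variance estimates by Rubin's rules."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(naive_variances, dtype=float)
    M = est.size
    if M < 2:
        raise ValueError("need at least M = 2 imputations")
    theta_MI = est.mean()              # MI point estimator
    W_M = var.mean()                   # average of the naive variance estimators
    B_M = est.var(ddof=1)              # between-imputation variance, divisor M - 1
    V_R = W_M + (M + 1) / M * B_M      # Rubin's variance estimator
    return theta_MI, W_M, B_M, V_R

# Hypothetical inputs for M = 3 imputed data sets
theta_MI, W_M, B_M, V_R = rubin_combine([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

For these inputs the point estimate is 2, $W_M = 1$, $B_M = 1$, and $\hat{V}_R = 1 + (4/3) \cdot 1 = 7/3$.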
Multiple imputation (Cont'd)
- $\hat{V}_R$ is theoretically justified when
  $$V(\hat{\theta}_{Id}) = V(\hat{\theta}_n) + V(\hat{\theta}_{Id} - \hat{\theta}_n) \qquad \text{(A)}$$
  (congeniality assumption)
- $\hat{V}_R$ is seriously biased if assumption (A) is violated.
- (A) is not satisfied for domain estimation when the domains are not specified at the imputation stage.
- Our proposal: $\hat{V}_{MI} = \hat{V}_d + M^{-1} B_M$
- $\hat{V}_{MI}$ is valid for $\hat{\theta}_{Id}$ as well as $\hat{\theta}_{I,z}$ without (A).
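Binary response
- Model: $y_i \mid x_i \sim \text{Bernoulli}(m_i)$ with $m_i = m(x_i, \beta_0)$,
  $\text{logit}(m_i) = x_i' \beta_0$; $q(x_i, \beta_0) = m_i (1 - m_i) \equiv q_i$
- $\hat{m}_i = m(x_i, \hat{\beta})$, where $\hat{\beta}$ is the solution to
  $$\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$$
- Stochastic hot deck imputation: $y_i^* = 1$ with probability $\hat{m}_i$, and $y_i^* = 0$ with probability $1 - \hat{m}_i$
- $\hat{\eta}_i = \hat{m}_i + a_i (1 + \hat{c}' x_i)(y_i - \hat{m}_i)$, with
  $$\hat{c} = \left(\sum_{i \in s} d_i a_i \hat{q}_i x_i x_i'\right)^{-1} \sum_{i \in s} d_i (1 - a_i) \hat{q}_i x_i.$$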
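A sketch of the binary-response steps above: solve the weighted score equation, draw the stochastic imputations, and form $\hat{\eta}_i$. The slides do not say how the equation is solved, so Newton-Raphson is an assumption here, and the data, sizes and uniform response mechanism are hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(2)
n, N = 500, 10_000                                # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(3.0, 1.0, n)])   # x_i with intercept
y = rng.binomial(1, expit(X @ np.array([-2.0, 0.5])))
a = rng.binomial(1, 0.7, n)                       # hypothetical response indicators
d = np.full(n, N / n)                             # design weights

# Newton-Raphson (assumed solver) for sum d a {y - m(x, beta)} x = 0
beta = np.zeros(2)
for _ in range(50):
    m = expit(X @ beta)
    score = X.T @ (d * a * (y - m))
    info = (X * (d * a * m * (1 - m))[:, None]).T @ X
    step = np.linalg.solve(info, score)
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

m_hat = expit(X @ beta)
q_hat = m_hat * (1 - m_hat)
score_at_solution = X.T @ (d * a * (y - m_hat))

# Stochastic imputation: y*_i ~ Bernoulli(m_hat_i), used for nonrespondents only
y_star = rng.binomial(1, m_hat)
y_imp = np.where(a == 1, y, y_star)

# c_hat = (sum d a q x x')^{-1} sum d (1 - a) q x
A = (X * (d * a * q_hat)[:, None]).T @ X
b = X.T @ (d * (1 - a) * q_hat)
c_hat = np.linalg.solve(A, b)
eta_hat = m_hat + a * (1.0 + X @ c_hat) * (y - m_hat)

theta_Id = np.sum(d * np.where(a == 1, y, m_hat))   # deterministic counterpart
```

Because the fitted score is zero, `np.sum(d * eta_hat)` equals `theta_Id`, which is a convenient check that $\hat{c}$ and $\hat{\eta}_i$ were formed consistently.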
Binary response (Cont'd)
- Fractional imputation (FI): eliminate the imputation variance $\hat{V}^*$ by FI
- $M = 2$ fractions: impute $y_i^* = 1$ with fractional weight $\hat{m}_i$, and $y_i^* = 0$ with fractional weight $1 - \hat{m}_i$
- Data file reports real values 1 and 0 with associated fractions $\hat{m}_i$ and $1 - \hat{m}_i$.
- $\hat{\theta}_{FI} = \hat{\theta}_{Id}$: $\hat{V}^*$ eliminated
- Estimation of domain total and mean: $\hat{\theta}_{FI,z}$, $\left(\sum_{i \in s} d_i z_i\right)^{-1} \hat{\theta}_{FI,z}$
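A small sketch of a fractionally imputed data file for binary $y$. The fitted probabilities `m_hat` are taken as given, and the file layout (unit, value, weight, domain) is a hypothetical choice for illustration.

```python
import numpy as np

# Hypothetical inputs: fitted probabilities m_hat are taken as given
y = np.array([1, 0, 1, 0, 1])                   # y-values (used only where a == 1)
a = np.array([1, 1, 0, 1, 0])                   # response indicators
d = np.full(5, 10.0)                            # design weights
z = np.array([1, 0, 1, 1, 0])                   # domain indicator
m_hat = np.array([0.6, 0.3, 0.8, 0.4, 0.25])    # assumed fitted m(x_i, beta_hat)

# Build the fractionally imputed file: one row per respondent,
# two rows (values 1 and 0) per nonrespondent with fractional weights
rows = []
for i in range(len(y)):
    if a[i] == 1:
        rows.append((i, y[i], d[i], z[i]))
    else:
        rows.append((i, 1, d[i] * m_hat[i], z[i]))
        rows.append((i, 0, d[i] * (1.0 - m_hat[i]), z[i]))
fi = np.array(rows, dtype=float)

theta_FI = np.sum(fi[:, 1] * fi[:, 2])               # weighted total over the file
theta_Id = np.sum(d * np.where(a == 1, y, m_hat))    # deterministic imputation

theta_FI_z = np.sum(fi[:, 1] * fi[:, 2] * fi[:, 3])  # domain total
mean_FI_z = theta_FI_z / np.sum(d * z)               # domain mean
```

The file holds only the real values 0 and 1, yet its weighted total matches the deterministic imputed estimator (both equal 20.5 here), so no imputation variance is added.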
Binary response (Cont'd)
- Multiple imputation (MI):
  1. Generate $\beta^* \sim N\left(\hat{\beta}, \left(\sum_{i \in s} a_i \hat{q}_i x_i x_i'\right)^{-1}\right)$
  2. Generate $y_i^* \sim \text{Bernoulli}(m_i^*)$ with $m_i^* = m(x_i, \beta^*)$
  3. Repeat steps 1 and 2 independently $M$ times.
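The three MI steps can be sketched as follows. The function name `multiply_impute` and the data are hypothetical, and `beta_hat` is assumed to come from a prior logistic fit rather than being estimated here.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def multiply_impute(X, y, a, beta_hat, M, rng):
    """Return an (M, n) array of completed y-vectors following steps 1 to 3."""
    m_hat = expit(X @ beta_hat)
    q_hat = m_hat * (1.0 - m_hat)
    # Covariance for step 1: (sum a q x x')^{-1}, as stated on the slide
    cov = np.linalg.inv((X * (a * q_hat)[:, None]).T @ X)
    cov = 0.5 * (cov + cov.T)                      # remove rounding asymmetry
    completed = np.empty((M, len(y)), dtype=int)
    for k in range(M):
        beta_star = rng.multivariate_normal(beta_hat, cov)    # step 1
        y_star = rng.binomial(1, expit(X @ beta_star))        # step 2
        completed[k] = np.where(a == 1, y, y_star)            # keep observed y
    return completed

# Hypothetical data; beta_hat is assumed known from an earlier fit
rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(3.0, 1.0, n)])
beta_hat = np.array([-2.0, 0.5])
y = rng.binomial(1, expit(X @ beta_hat))
a = rng.binomial(1, 0.7, n)
completed = multiply_impute(X, y, a, beta_hat, M=5, rng=rng)
```

Each row of `completed` is one imputed data set. Observed values are identical across rows, and only the nonrespondent entries vary.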
Simulation Study: Binary response
- Finite population of size $N = 10{,}000$ from
  - $x_i \sim N(3, 1)$
  - $y_i \mid x_i \sim \text{Bernoulli}(m_i)$, where $\text{logit}(m_i) = 0.5 x_i - 2$
  - $z_i \sim \text{Bernoulli}(0.4)$ ($z_i$: domain indicator)
- SRS of size $n = 100$
- $x_i$ and $z_i$ are always observed; $y_i$ is subject to missingness.
- Missing response mechanism:
  $$a_i \sim \text{Bernoulli}(\pi_i); \quad \text{logit}(\pi_i) = \phi_0 + \phi_1 (x_i - 3) + \phi_2 |x_i - 3|$$
  (a) $\phi_1 = 0$, $\phi_2 = 0$; (b) $\phi_1 = 1$, $\phi_2 = 0$; (c) $\phi_1 = 0$, $\phi_2 = 1$
  $\phi_0$ is determined to achieve a 70% response rate.
- Two variance estimates for multiple imputation are computed.
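A sketch of the simulation set-up. The slides do not say how $\phi_0$ is determined, so bisection on the population-average response probability is an assumption, as are the seed and the helper name `calibrate_phi0`.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
N, n = 10_000, 100
x = rng.normal(3.0, 1.0, N)
y = rng.binomial(1, expit(0.5 * x - 2.0))
z = rng.binomial(1, 0.4, N)                        # domain indicator

def calibrate_phi0(x, phi1, phi2, target=0.70):
    """Bisection for phi0 so the mean response probability equals target."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        rate = expit(mid + phi1 * (x - 3.0) + phi2 * np.abs(x - 3.0)).mean()
        if rate < target:
            lo = mid                               # rate increases with phi0
        else:
            hi = mid
    return 0.5 * (lo + hi)

cases = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}
phi0 = {k: calibrate_phi0(x, p1, p2) for k, (p1, p2) in cases.items()}

# One replicate for case (b): SRS of size n, then response indicators
s = rng.choice(N, size=n, replace=False)
p1, p2 = cases["b"]
pi_resp = expit(phi0["b"] + p1 * (x[s] - 3.0) + p2 * np.abs(x[s] - 3.0))
a = rng.binomial(1, pi_resp)
```

In case (a) the response probability is constant, so the calibrated value is $\phi_0 = \log(0.7 / 0.3)$. In cases (b) and (c) it depends on the realized $x$-values.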
Simulation Study (Cont'd)
Table: Relative bias (RB) of Rubin's variance estimator (R) and the proposed variance estimator (KR) for multiple imputation

Parameter         Response mechanism   RB (%) of R   RB (%) of KR
Population mean   Case 1                      1.07           2.90
Population mean   Case 2                     -0.29           1.42
Population mean   Case 3                     -3.96          -2.09
Domain mean       Case 1                     34.25           2.37
Domain mean       Case 2                     31.08           2.28
Domain mean       Case 3                     27.55          -3.41

Conclusion:
1. KR has small |RB| in all cases.
2. R leads to large |RB| for the domain mean: 28% to 34%.
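Doubly robust method
Case 1: $p_i$ known ($p_i$ = probability of response)
- Let $\hat{\beta}$ be the solution to
  $$\hat{U}(\beta) = \sum_{i \in s} d_i a_i \left(\frac{1}{p_i} - 1\right) \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$$
- Imputed estimator:
  $$\hat{\theta}_{Id} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) m(x_i, \hat{\beta})\}$$
- If 1 is an element of $h_i$, then
  $$\hat{\theta}_{Id} = \sum_{i \in s} d_i \left\{\frac{a_i}{p_i} y_i + \left(1 - \frac{a_i}{p_i}\right) m(x_i, \hat{\beta})\right\}$$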
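The equality of the two forms of $\hat{\theta}_{Id}$ can be checked numerically. This sketch assumes linear regression imputation with an intercept, so that 1 is an element of $h_i = x_i$. The data and the known response probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 300, 6000                                    # hypothetical sizes
X = np.column_stack([np.ones(n), rng.uniform(0.0, 2.0, n)])  # h_i = x_i contains 1
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, n)
p = 1.0 / (1.0 + np.exp(-(0.2 + 0.5 * X[:, 1])))   # known response probabilities
a = rng.binomial(1, p)
d = np.full(n, N / n)                               # design weights

# Solve sum d a (1/p - 1)(y - x'beta) x = 0, a weighted least squares problem
w = d * a * (1.0 / p - 1.0)
beta_hat = np.linalg.solve((X * w[:, None]).T @ X, X.T @ (w * y))
m_hat = X @ beta_hat

# Form 1: usual imputed estimator
theta_imp = np.sum(d * np.where(a == 1, y, m_hat))
# Form 2: inverse-probability-weighted form with a regression correction
theta_dr = np.sum(d * (a / p * y + (1.0 - a / p) * m_hat))
```

The two forms differ by $\sum_{i \in s} d_i a_i (1/p_i - 1)(y_i - \hat{m}_i)$, which is the intercept component of the estimating equation and is therefore zero at $\hat{\beta}$.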
Doubly robust method (Cont'd)
- Properties of $\hat{\theta}_{Id}$:
  1. Under the assumed response model, $E_R(\hat{\theta}_{Id}) \cong \hat{\theta}_n$ regardless of the choice of $m(x_i, \beta)$.
  2. Under the imputation model, $E_\zeta(\hat{\theta}_{Id} - \hat{\theta}_n) \cong 0$.
- (1) and (2) imply that $\hat{\theta}_{Id}$ is doubly robust.
Doubly robust method (Cont'd)
Case 2: $p_i$ unknown ($p_i = p_i(\alpha)$)
- Linearization variance estimator:
  - Haziza and Rao (2006): linear regression imputation
  - Deville (1999), Demnati and Rao (2004) approach: general case
Extensions
- Calibration estimators
  Davison and Sardy (2007): deterministic linear regression imputation, stratified SRS
- Pseudo-empirical likelihood intervals
- Other parameters