a general approach to variance estimation under imputation ... · variance estimation i treating...

A General Approach to Variance Estimation under Imputation for Missing Survey Data

J.N.K. Rao, Carleton University, Ottawa, Canada (1, 2)

(1) Joint work with J.K. Kim at Iowa State University.
(2) Workshop on Survey Sampling in honor of Jean-Claude Deville, Neuchatel, Switzerland, June 24-26, 2009.

Outline

- Item nonresponse
- Deterministic imputation: population model approach
- Imputed estimator
- Linearization variance estimator
- Examples: domain estimation, composite imputation
- Stochastic imputation: variance estimation
- Examples: multiple imputation, binary response
- Simulation results
- Doubly robust approach
- Extensions

Survey Data

- Design features: clustering, stratification, unequal probability of selection
- Sources of error:
  1. Sampling errors
  2. Non-sampling errors: nonresponse (missing data), noncoverage, measurement errors

Types of nonresponse

- Unit (or total) nonresponse: refusal, not-at-home.
  Remedy: weight adjustment within classes
- Item nonresponse: sensitive item, answer not known, inconsistent answer.
  Remedy: imputation (fill in missing data)

Advantages of imputation

- Complete data file: standard complete data methods
- Different analyses consistent with each other
- Reduce nonresponse bias
- Auxiliary $x$ observed can be used to get “good” imputed values
- Same survey weight for all items

Commonly used imputation methods

- Marginal imputation methods:
  1. Business surveys: ratio, regression, nearest neighbor (NN)
  2. Socio-economic surveys: random donor (within classes), stochastic ratio or regression, fractional imputation (FI), multiple imputation (MI)

Complete response set-up

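- Population total: $\theta_N = \sum_{i=1}^{N} y_i$
- NHT estimator:
  $$\hat\theta_n = \sum_{i \in s} d_i y_i$$
  where $d_i = \pi_i^{-1}$ is the design weight and $\pi_i = \Pr(i \in s)$ is the inclusion probability.
- Variance estimator:
  $$\hat V_n = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} y_i y_j$$
  where $\Omega_{ij}$ depends on the joint inclusion probabilities $\pi_{ij} > 0$.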
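The complete-response estimator and its variance estimator can be illustrated with a small sketch. The slides do not give $\Omega_{ij}$ explicitly; the code below assumes the standard Horvitz-Thompson form $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j) / (\pi_{ij} \pi_i \pi_j)$ with $\pi_{ii} = \pi_i$, and assumes simple random sampling without replacement so that the double sum can be checked against the textbook closed form $N^2 (1 - n/N) s_y^2 / n$. The function names and toy data are hypothetical.

```python
def ht_total(y, d):
    """Weighted total: sum of d_i * y_i over the sample."""
    return sum(di * yi for di, yi in zip(d, y))


def omega_srs(i, j, n, N):
    """Omega_ij in the assumed Horvitz-Thompson form, for SRS without replacement."""
    pi = n / N
    if i == j:
        return (1.0 - pi) / pi ** 2
    pij = n * (n - 1) / (N * (N - 1))
    return (pij - pi * pi) / (pij * pi * pi)


def ht_variance_srs(y, N):
    """Double sum of Omega_ij * y_i * y_j over all sample pairs (i, j)."""
    n = len(y)
    return sum(omega_srs(i, j, n, N) * y[i] * y[j]
               for i in range(n) for j in range(n))


# Toy sample (hypothetical values) from a population of size N = 50.
y = [3.0, 5.0, 4.0, 8.0, 6.0]
N = 50
n = len(y)
d = [N / n] * n

theta_hat = ht_total(y, d)
v_hat = ht_variance_srs(y, N)

# Closed form under SRS, used only as a consistency check.
y_bar = sum(y) / n
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
v_closed = N ** 2 * (1.0 - n / N) * s2 / n
```

Under SRS the double sum and the closed form agree up to rounding, which is a convenient sanity check before moving to unequal-probability designs where $\Omega_{ij}$ must come from the actual joint inclusion probabilities.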

Deterministic imputation

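- Population model approach (Deville and Sarndal, 1994):
  $$E_\zeta(y_i \mid x_i) = m(x_i, \beta_0)$$
- Response indicator: $a_i = 1$ if $y_i$ is observed when $i \in s$, and $a_i = 0$ otherwise, for $i \in U = \{1, 2, \ldots, N\}$
- MAR: the distribution of $a_i$ depends only on $x_i$
- Imputed value: $\hat y_i = m(x_i, \hat\beta)$
- $\hat\beta$: unique solution of the estimating equation (EE)
  $$\hat U(\beta) = \sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$$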

Model specification

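- Further model specification:
  $$\mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2 q(x_i, \beta_0)$$
- $h(x_i, \beta) = \dot m(x_i, \beta) / q(x_i, \beta) \equiv h_i$, where $\dot m(x_i, \beta) = \partial m(x_i, \beta) / \partial \beta$
- Examples: commonly used imputations
  1. Ratio imputation: $h_i = 1$
     $$E_\zeta(y_i \mid x_i) = \beta_0 x_i, \quad \mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2 x_i$$
  2. Linear regression imputation: $h_i = x_i$
     $$E_\zeta(y_i \mid x_i) = x_i' \beta_0, \quad \mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2$$
  3. Logistic regression imputation ($y_i = 0$ or $1$): $h_i = x_i$
     $$\log\{m_i / (1 - m_i)\} = x_i' \beta_0, \quad \mathrm{Var}_\zeta(y_i \mid x_i) = m_i (1 - m_i)$$
     where $m_i = E_\zeta(y_i \mid x_i)$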
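As a quick check of the three examples, the stated values of $h_i$ can be recovered by taking the numerator of $h$ to be the derivative $\dot m = \partial m / \partial \beta$ (the reading under which all three listed values come out; $\sigma^2$ is absorbed so that $q$ carries only the variance function):

```latex
\begin{align*}
\text{Ratio: } & m = \beta x_i,\quad q = x_i
  &&\Rightarrow\ \dot m = x_i,\quad h_i = x_i / x_i = 1 \\
\text{Linear regression: } & m = x_i'\beta,\quad q = 1
  &&\Rightarrow\ \dot m = x_i,\quad h_i = x_i \\
\text{Logistic regression: } & m = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)},\quad q = m(1 - m)
  &&\Rightarrow\ \dot m = m(1 - m)\,x_i,\quad h_i = x_i
\end{align*}
```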

Imputed estimator

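- Imputed estimator of the total $\theta_N$:
  $$\hat\theta_{Id} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) m(x_i, \hat\beta)\} \equiv \sum_{i \in s} d_i \tilde y_i$$
- Examples
  1. Ratio imputation: $m(x_i, \hat\beta) = x_i \hat\beta$, where
     $$\hat\beta = \Big(\sum_{i \in s} d_i a_i x_i\Big)^{-1} \sum_{i \in s} d_i a_i y_i$$
  2. Linear regression imputation: $m(x_i, \hat\beta) = x_i' \hat\beta$, where
     $$\hat\beta = \Big(\sum_{i \in s} d_i a_i x_i x_i'\Big)^{-1} \sum_{i \in s} d_i a_i x_i y_i$$
  3. Logistic regression imputation: $\hat\beta$ is the solution to
     $$\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$$
- Imputed estimator of the domain total $\theta_z = \sum_{i=1}^{N} z_i y_i$:
  $$\hat\theta_{I,z} = \sum_{i \in s} d_i z_i \tilde y_i$$
  where $z_i = 1$ if $i \in D$ and $z_i = 0$ otherwise.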
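A minimal sketch of the ratio-imputation case may help fix the notation: it computes $\hat\beta$ from respondents only, fills in $x_i \hat\beta$ for nonrespondents, and forms the imputed total and an imputed domain total. The function names and the toy data are hypothetical and not taken from the slides.

```python
def ratio_impute(y, x, a, d):
    """Return (beta_hat, y_tilde) under deterministic ratio imputation.

    y[i] is ignored (and may be None) whenever a[i] == 0.
    """
    num = sum(di * yi for di, yi, ai in zip(d, y, a) if ai == 1)
    den = sum(di * xi for di, xi, ai in zip(d, x, a) if ai == 1)
    beta_hat = num / den
    y_tilde = [yi if ai == 1 else beta_hat * xi
               for yi, xi, ai in zip(y, x, a)]
    return beta_hat, y_tilde


def weighted_total(values, d, z=None):
    """Sum of d_i * z_i * values_i; z defaults to 1 (whole population)."""
    if z is None:
        z = [1] * len(values)
    return sum(di * zi * vi for di, zi, vi in zip(d, z, values))


# Toy sample (hypothetical): unit 3 is an item nonrespondent.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, None, 5.0]
a = [1, 1, 0, 1]
d = [10.0, 10.0, 10.0, 10.0]
z = [1, 0, 1, 0]  # domain indicator

beta_hat, y_tilde = ratio_impute(y, x, a, d)
theta_id = weighted_total(y_tilde, d)       # imputed estimator of the total
theta_iz = weighted_total(y_tilde, d, z)    # imputed estimator of the domain total
```

Note that the domain indicator enters only at the estimation stage: the same imputed values $\tilde y_i$ serve every domain.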

Variance estimation

- Treating imputed values as if observed: underestimation if $\tilde y_i$ is used in $\hat V_n$ in place of $y_i$
- Methods that account for imputation:
  - Adjusted jackknife: Rao and Shao (1992)
  - Linearization (population model): Deville and Sarndal (1994)
  - Fractional imputation method: Fuller and Kim (2005)
  - Bootstrap: Shao and Sitter (1996)
  - Reverse approach: Shao and Steel (1999)

Variance estimation (Cont’d)

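- Linearization method: Theorem 1 (Kim and Rao, 2009). Under regularity conditions,
  $$n^{1/2} N^{-1} (\hat\theta_{Id} - \tilde\theta_{Id}) = o_p(1)$$
  where
  $$\tilde\theta_{Id} = \sum_{i \in s} w_i \eta_i,$$
  $$\eta_i = m(x_i; \beta_0) + a_i (1 + c' h_i) \{y_i - m(x_i; \beta_0)\},$$
  $$c = \Big\{\sum_{i=1}^{N} a_i \dot m(x_i; \beta_0) h_i'\Big\}^{-1} \sum_{i=1}^{N} (1 - a_i) \dot m(x_i; \beta_0),$$
  and $\dot m(x_i; \beta) = \partial m(x_i; \beta) / \partial \beta$.
- Reference distribution: joint distribution of the population model and the sampling mechanism, conditional on the realized $(x_i, a_i)$ in the population.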

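Variance estimation (Cont’d)

- Reverse approach:
  1. $$\hat V_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat\eta_i \hat\eta_j$$
     where $\hat\eta_i = \eta_i(\hat\beta)$.
  2. $$\hat V_{2d} = \sum_{i \in s} d_i a_i (1 + \hat c' \hat h_i)^2 \{y_i - m(x_i; \hat\beta)\}^2$$
     $\hat V_{2d}$ is valid even if $V_\zeta(y_i \mid x_i)$ is misspecified.
  3. Variance estimator of $\hat\theta_{Id}$ ($\tilde\theta_{Id}$):
     $$\hat V_d = \hat V_{1d} + \hat V_{2d}$$
- $\hat V_d$ is approximately design-model unbiased.
- If the overall sampling rate is negligible: $\hat V_d \cong \hat V_{1d}$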
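The two components can be sketched for the simplest case, ratio imputation under simple random sampling. This is an illustration under stated assumptions, not the authors' code: SRS without replacement is assumed so that the $\Omega_{ij}$ double sum reduces to $N^2 (1 - n/N) s_\eta^2 / n$; for ratio imputation $h_i = 1$ and the derivative of $m$ with respect to $\beta$ is $x_i$, so $\hat c$ is the ratio of the nonrespondent to the respondent $x$-totals; and the second component uses squared residuals. Function and variable names are hypothetical.

```python
def ratio_linearization_srs(y, x, a, N):
    """Sketch of (theta_Id, eta, V1d, V2d) for ratio imputation under SRS."""
    n = len(y)
    d = N / n  # equal design weights under SRS, so d cancels in beta and c
    resp = [i for i in range(n) if a[i] == 1]
    miss = [i for i in range(n) if a[i] == 0]

    sum_ax = sum(x[i] for i in resp)
    beta = sum(y[i] for i in resp) / sum_ax
    c = sum(x[i] for i in miss) / sum_ax

    # Residuals are defined for respondents only (a_i = 1).
    resid = [y[i] - beta * x[i] if a[i] == 1 else 0.0 for i in range(n)]
    eta = [beta * x[i] + (1.0 + c) * resid[i] for i in range(n)]

    theta = d * sum(y[i] if a[i] == 1 else beta * x[i] for i in range(n))

    # V1d: complete-data variance formula applied to the eta values.
    eta_bar = sum(eta) / n
    s2_eta = sum((e - eta_bar) ** 2 for e in eta) / (n - 1)
    v1 = N ** 2 * (1.0 - n / N) * s2_eta / n

    # V2d: weighted sum of squared, inflated respondent residuals.
    v2 = d * (1.0 + c) ** 2 * sum(r ** 2 for r in resid)
    return theta, eta, v1, v2


# Toy sample (hypothetical) of n = 6 units from N = 60.
x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y = [1.2, 2.1, None, 4.3, None, 6.0]
a = [1, 1, 0, 1, 0, 1]
N = 60

theta, eta, v1, v2 = ratio_linearization_srs(y, x, a, N)
v_d = v1 + v2
```

A useful check is that the weighted sum of the estimated linearized values reproduces the imputed estimator exactly for ratio imputation, because the respondent residuals sum to zero under the estimating equation.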

Variance estimation (Cont’d)

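- Domain estimation:
  1. $\hat\theta_{I,z}$: design-model unbiased for $\theta_z$
  2. Use
     $$\hat V_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat\eta_{iz} \hat\eta_{jz}$$
     where
     $$\hat\eta_{iz} = z_i m(x_i; \hat\beta) + a_i (z_i + \hat c_z' \hat h_i) \{y_i - m(x_i; \hat\beta)\},$$
     $$\hat c_z = \Big\{\sum_{i \in s} d_i a_i \dot m(x_i; \hat\beta) \hat h_i'\Big\}^{-1} \sum_{i \in s} d_i z_i (1 - a_i) \dot m(x_i; \hat\beta),$$
     and $\dot m(x_i; \beta) = \partial m(x_i; \beta) / \partial \beta$.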

Stochastic imputation

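- $y_i^*$ = imputed value of $y_i$ such that
  $$E_I(y_i^*) = m(x_i, \hat\beta) \equiv \hat m_i$$
- Imputed estimator of $\theta_N$:
  $$\hat\theta_I = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) y_i^*\}, \qquad E_I(\hat\theta_I) = \hat\theta_{Id}$$
- Variance estimator of $\hat\theta_I$:
  $$\hat V_I = \hat V_d + \hat V^*$$
  where
  $$\hat V^* = \sum_{i \in s} d_i^2 (1 - a_i) (y_i^* - \hat m_i)^2$$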

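Multiple imputation: Rubin

- $y_i^{*(1)}, \ldots, y_i^{*(M)}$ = imputed values of $y_i$ ($M \ge 2$)
  $$\hat\theta_I^{(k)} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) y_i^{*(k)}\}$$
- Imputed estimator:
  $$\hat\theta_{MI} = M^{-1} \sum_{k=1}^{M} \hat\theta_I^{(k)}$$
- Rubin’s variance estimator:
  $$\hat V_R = W_M + \frac{M + 1}{M} B_M$$
  where $W_M$ is the average of the $M$ naive variance estimators and
  $$B_M = (M - 1)^{-1} \sum_{k=1}^{M} \big(\hat\theta_I^{(k)} - \hat\theta_{MI}\big)^2$$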

Multiple imputation (Cont’d)

- $\hat V_R$ is theoretically justified when
  $$V(\hat\theta_{Id}) = V(\hat\theta_n) + V(\hat\theta_{Id} - \hat\theta_n) \qquad \text{(A)}$$
  (congeniality assumption)
- $\hat V_R$ is seriously biased if assumption (A) is violated.
- (A) is not satisfied for domain estimation when the domains are not specified at the imputation stage.
- Our proposal: $\hat V_{MI} = \hat V_d + M^{-1} B_M$
- $\hat V_{MI}$ is valid for $\hat\theta_{Id}$ as well as $\hat\theta_{I,z}$ without (A).
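The two combining rules differ only in what is added to the between-imputation variance $B_M$. The sketch below computes both from $M$ completed-data estimates; the inputs (the per-imputation estimates, the naive variance estimates, and a value for the deterministic-imputation variance estimate $\hat V_d$) are hypothetical numbers supplied for illustration, and the function names are not from the slides.

```python
def mi_point_estimate(estimates):
    """Average of the M completed-data point estimates."""
    return sum(estimates) / len(estimates)


def between_variance(estimates):
    """B_M: sample variance of the M completed-data point estimates."""
    m = len(estimates)
    center = mi_point_estimate(estimates)
    return sum((t - center) ** 2 for t in estimates) / (m - 1)


def rubin_variance(estimates, naive_variances):
    """Rubin's rule: W_M + (M + 1) / M * B_M."""
    m = len(estimates)
    w_m = sum(naive_variances) / m
    return w_m + (m + 1) / m * between_variance(estimates)


def proposed_variance(estimates, v_d):
    """Proposal on the slide: V_d + B_M / M, with V_d supplied by the caller."""
    return v_d + between_variance(estimates) / len(estimates)


# Hypothetical inputs for M = 3 imputations.
estimates = [10.0, 12.0, 11.0]
naive_variances = [4.0, 5.0, 6.0]
v_d = 4.5

theta_mi = mi_point_estimate(estimates)
v_r = rubin_variance(estimates, naive_variances)
v_mi = proposed_variance(estimates, v_d)
```

The design choice worth noting is that `proposed_variance` never touches the naive variance estimates: it relies on a linearization estimate $\hat V_d$ computed once, which is why it does not need assumption (A).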

Binary response

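- Model: $y_i \mid x_i \sim \mathrm{Bernoulli}(m_i)$, with $m_i = m(x_i, \beta_0)$,
  $$\mathrm{logit}(m_i) = x_i' \beta_0; \qquad q(x_i, \beta_0) = m_i (1 - m_i) \equiv q_i$$
- $\hat m_i = m(x_i, \hat\beta)$, where $\hat\beta$ is the solution to
  $$\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$$
- Stochastic hot deck imputation:
  $$y_i^* = \begin{cases} 1 & \text{with probability } \hat m_i \\ 0 & \text{with probability } 1 - \hat m_i \end{cases}$$
- $\hat\eta_i = \hat m_i + a_i (1 + \hat c' x_i)(y_i - \hat m_i)$, where
  $$\hat c = \Big\{\sum_{i \in s} d_i a_i \hat q_i x_i x_i'\Big\}^{-1} \sum_{i \in s} d_i (1 - a_i) \hat q_i x_i$$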

Binary response (Cont’d)

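- Fractional imputation (FI): eliminate the imputation variance $\hat V^*$ by FI
- $M = 2$ fractions: impute
  $$y_i^* = \begin{cases} 1 & \text{with fractional weight } \hat m_i \\ 0 & \text{with fractional weight } 1 - \hat m_i \end{cases}$$
- The data file reports real values 1 and 0 with associated fractions $\hat m_i$ and $1 - \hat m_i$.
- $\hat\theta_{FI} = \hat\theta_{Id}$: $\hat V^*$ eliminated
- Estimation of domain total and mean: $\hat\theta_{FI,z}$ and $\big(\sum_{i \in s} d_i z_i\big)^{-1} \hat\theta_{FI,z}$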
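A small sketch shows what the fractionally imputed data file might look like and why the FI total coincides with the deterministic-imputation total. The record layout `(unit, value, weight)`, the function names, and the fitted probabilities `m_hat` are hypothetical; in practice `m_hat` would come from the fitted logistic model.

```python
def fractional_records(y, a, d, m_hat):
    """Expand the sample into (unit, value, weight) rows.

    Respondents keep one row with weight d_i. Each nonrespondent gets two
    rows, value 1 with weight d_i * m_hat_i and value 0 with weight
    d_i * (1 - m_hat_i), so only real values 0 and 1 appear in the file.
    """
    rows = []
    for i, (yi, ai, di, mi) in enumerate(zip(y, a, d, m_hat)):
        if ai == 1:
            rows.append((i, yi, di))
        else:
            rows.append((i, 1, di * mi))
            rows.append((i, 0, di * (1.0 - mi)))
    return rows


def weighted_total(rows):
    """Sum of value * weight over all rows of the expanded file."""
    return sum(value * weight for _, value, weight in rows)


# Toy sample (hypothetical): units 3 and 5 are item nonrespondents.
y = [1, 0, None, 1, None]
a = [1, 1, 0, 1, 0]
d = [20.0] * 5
m_hat = [0.7, 0.4, 0.6, 0.8, 0.3]  # assumed fitted probabilities

rows = fractional_records(y, a, d, m_hat)
theta_fi = weighted_total(rows)

# Deterministic imputation with m_hat_i as the imputed value, for comparison.
theta_id = sum(di * (yi if ai == 1 else mi)
               for yi, ai, di, mi in zip(y, a, d, m_hat))
```

Because no random draw is involved, repeating the procedure gives the same file and the same estimate, which is the sense in which the imputation variance is eliminated.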

Binary response (Cont’d)

- Multiple imputation (MI):
  1. Generate $\beta^* \sim N\Big[\hat\beta, \big(\sum_{i \in s} a_i \hat q_i x_i x_i'\big)^{-1}\Big]$
  2. Generate $y_i^* \sim \mathrm{Bernoulli}(m_i^*)$ with $m_i^* = m(x_i, \beta^*)$
  3. Repeat steps 1 and 2 independently $M$ times.
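The three steps can be sketched in plain Python for the simplest possible case. To stay dependency-free, the sketch assumes a scalar covariate and a logistic model without an intercept, so that $\hat\beta$ and its variance are scalars and step 1 is a univariate normal draw; the Newton solver, the function names, and the toy data are all hypothetical simplifications rather than the setup used in the slides.

```python
import math
import random


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def score(b, y, x, a, d):
    """Estimating function: sum of d_i a_i {y_i - expit(x_i b)} x_i."""
    return sum(di * (yi - expit(xi * b)) * xi
               for yi, xi, ai, di in zip(y, x, a, d) if ai == 1)


def fit_logistic_scalar(y, x, a, d, iters=50):
    """Newton iterations for the scalar root of the estimating function."""
    b = 0.0
    for _ in range(iters):
        info = sum(di * expit(xi * b) * (1.0 - expit(xi * b)) * xi * xi
                   for xi, ai, di in zip(x, a, d) if ai == 1)
        step = score(b, y, x, a, d) / info
        b += step
        if abs(step) < 1e-12:
            break
    return b


def multiply_impute(y, x, a, d, M, rng):
    """Steps 1 to 3: draw beta*, draw y* for nonrespondents, repeat M times."""
    b_hat = fit_logistic_scalar(y, x, a, d)
    # Unweighted respondent information, matching the variance in step 1.
    info = sum(expit(xi * b_hat) * (1.0 - expit(xi * b_hat)) * xi * xi
               for xi, ai in zip(x, a) if ai == 1)
    sd = info ** -0.5
    completed = []
    for _ in range(M):
        b_star = rng.gauss(b_hat, sd)
        completed.append([yi if ai == 1
                          else (1 if rng.random() < expit(xi * b_star) else 0)
                          for yi, xi, ai in zip(y, x, a)])
    return b_hat, completed


# Toy sample (hypothetical); the last two units are nonrespondents.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 1.5, -1.5]
y = [0, 0, 1, 0, 1, 1, None, None]
a = [1, 1, 1, 1, 1, 1, 0, 0]
d = [5.0] * 8

rng = random.Random(0)
b_hat, completed = multiply_impute(y, x, a, d, M=5, rng=rng)
```

With a vector covariate, step 1 would instead require a multivariate normal draw with the inverse information matrix as covariance; the structure of the loop is otherwise unchanged.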

Simulation Study: Binary response

- Finite population of size $N = 10{,}000$ generated from
  - $x_i \sim N(3, 1)$
  - $y_i \mid x_i \sim \mathrm{Bernoulli}(m_i)$, where $\mathrm{logit}(m_i) = 0.5 x_i - 2$
  - $z_i \sim \mathrm{Bernoulli}(0.4)$ ($z_i$: domain indicator)
- SRS of size $n = 100$
- $x_i$ and $z_i$: always observed; $y_i$ subject to missingness.
- Missing response mechanism:
  $$a_i \sim \mathrm{Bernoulli}(\pi_i); \qquad \mathrm{logit}(\pi_i) = \phi_0 + \phi_1 (x_i - 3) + \phi_2 |x_i - 3|$$
  (a) $\phi_1 = 0, \phi_2 = 0$; (b) $\phi_1 = 1, \phi_2 = 0$; (c) $\phi_1 = 0, \phi_2 = 1$. $\phi_0$ is determined to achieve a 70% response rate.
- Two variance estimates of multiple imputation are computed.
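The population and response mechanism described above can be generated with a few lines of code. The slides do not say how $\phi_0$ was determined; the sketch below assumes it is chosen by bisection so that the average response probability over the generated population equals 0.70. The seed, function names, and tolerance are arbitrary choices made for illustration.

```python
import math
import random


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def mean_response_prob(phi0, phi1, phi2, x):
    """Average of expit(phi0 + phi1 (x - 3) + phi2 |x - 3|) over the population."""
    return sum(expit(phi0 + phi1 * (xi - 3.0) + phi2 * abs(xi - 3.0))
               for xi in x) / len(x)


def calibrate_phi0(x, phi1, phi2, target=0.7, lo=-10.0, hi=10.0, tol=1e-8):
    """Bisection on phi0; the mean response probability is increasing in phi0."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_response_prob(mid, phi1, phi2, x) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(2009)  # arbitrary seed
N = 10000
x = [rng.gauss(3.0, 1.0) for _ in range(N)]
y = [1 if rng.random() < expit(0.5 * xi - 2.0) else 0 for xi in x]
z = [1 if rng.random() < 0.4 else 0 for _ in range(N)]

# (phi1, phi2) for the three response mechanisms (a), (b), (c).
cases = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}
phi0 = {name: calibrate_phi0(x, p1, p2) for name, (p1, p2) in cases.items()}
```

For mechanism (a) the response probability is constant, so the calibrated value should simply equal $\mathrm{logit}(0.7)$, which gives a direct check on the bisection.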

Simulation Study (Cont’d)

Table: Relative bias (RB, in %) of Rubin’s variance estimator (R) and the proposed variance estimator (KR) for multiple imputation

Parameter         Response mechanism    RB (%), R    RB (%), KR
Population mean   Case 1                    1.07         2.90
Population mean   Case 2                   -0.29         1.42
Population mean   Case 3                   -3.96        -2.09
Domain mean       Case 1                   34.25         2.37
Domain mean       Case 2                   31.08         2.28
Domain mean       Case 3                   27.55        -3.41

Conclusion:

1. KR has small |RB| in all cases.
2. R leads to large |RB| for the domain mean: 28% to 34%.

Doubly robust method

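Case 1: $p_i$ known ($p_i$ = probability of response)

- Let $\hat\beta$ be the solution to
  $$\hat U(\beta) = \sum_{i \in s} d_i a_i \Big(\frac{1}{p_i} - 1\Big) \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$$
- Imputed estimator:
  $$\hat\theta_{Id} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) m(x_i, \hat\beta)\}$$
- If 1 is an element of $h_i$, then
  $$\hat\theta_{Id} = \sum_{i \in s} d_i \Big\{\frac{a_i}{p_i} y_i + \Big(1 - \frac{a_i}{p_i}\Big) m(x_i, \hat\beta)\Big\}$$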
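The equivalence of the two forms of the imputed estimator follows in one step from the estimating equation. The derivation below spells it out, writing $\hat m_i = m(x_i, \hat\beta)$ as shorthand:

```latex
\begin{align*}
&\text{If 1 is an element of } h_i, \text{ the corresponding component of } \hat U(\hat\beta) = 0 \text{ gives} \\
&\qquad \sum_{i \in s} d_i a_i \Big(\frac{1}{p_i} - 1\Big) (y_i - \hat m_i) = 0. \\
&\text{Adding this zero term to the imputed estimator:} \\
\hat\theta_{Id}
  &= \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) \hat m_i\}
     + \sum_{i \in s} d_i a_i \Big(\frac{1}{p_i} - 1\Big) (y_i - \hat m_i) \\
  &= \sum_{i \in s} d_i \Big\{\frac{a_i}{p_i} y_i + \Big(1 - \frac{a_i}{p_i}\Big) \hat m_i\Big\}.
\end{align*}
```

The last line collects the coefficients of $y_i$ and $\hat m_i$: $a_i + a_i (1/p_i - 1) = a_i / p_i$ and $(1 - a_i) - a_i (1/p_i - 1) = 1 - a_i / p_i$.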

Doubly robust method (Cont’d)

- Properties of $\hat\theta_{Id}$:
  1. Under the assumed response model, $E_R(\hat\theta_{Id}) \cong \hat\theta_n$ regardless of the choice of $m(x_i, \beta)$.
  2. Under the imputation model, $E_\zeta(\hat\theta_{Id} - \hat\theta_n) \cong 0$.
- (1) and (2) imply that $\hat\theta_{Id}$ is doubly robust.

Doubly robust method (Cont’d)

Case 2: $p_i$ unknown ($p_i = p_i(\alpha)$)

- Linearization variance estimator:
  - Haziza and Rao (2006): linear regression imputation
  - Deville (1999), Demnati and Rao (2004) approach: general case

Extensions

- Calibration estimators.
  Davison and Sardy (2007): deterministic linear regression imputation, stratified SRS
- Pseudo-empirical likelihood intervals
- Other parameters
