A General Approach to Variance Estimation under Imputation for Missing Survey Data

J.N.K. Rao, Carleton University, Ottawa, Canada

Joint work with J.K. Kim at Iowa State University.
Workshop on Survey Sampling in honor of Jean-Claude Deville, Neuchâtel, Switzerland, June 24-26, 2009



Outline

- Item nonresponse
- Deterministic imputation: population model approach
- Imputed estimator
- Linearization variance estimator
- Examples: domain estimation, composite imputation
- Stochastic imputation: variance estimation
- Examples: multiple imputation, binary response
- Simulation results
- Doubly robust approach
- Extensions


Survey Data

- Design features: clustering, stratification, unequal probability of selection
- Sources of error:
  1. Sampling errors
  2. Non-sampling errors: nonresponse (missing data), noncoverage, measurement errors


Types of nonresponse

- Unit (or total) nonresponse: refusal, not-at-home
  Remedy: weight adjustment within classes
- Item nonresponse: sensitive item, answer not known, inconsistent answer
  Remedy: imputation (fill in missing data)


Advantages of imputation

- Complete data file: standard complete-data methods can be applied
- Different analyses are consistent with each other
- Reduces nonresponse bias
- Auxiliary x observed can be used to get “good” imputed values
- Same survey weight for all items


Commonly used imputation methods

- Marginal imputation methods:
  1. Business surveys: ratio, regression, nearest neighbor (NN)
  2. Socio-economic surveys: random donor (within classes), stochastic ratio or regression, fractional imputation (FI), multiple imputation (MI)


Complete response set-up

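- Population total: $\theta_N = \sum_{i=1}^{N} y_i$
- NHT estimator:
  $$\hat{\theta}_n = \sum_{i \in s} d_i y_i$$
  where $d_i = \pi_i^{-1}$ is the design weight and $\pi_i = \Pr(i \in s)$ is the inclusion probability.
- Variance estimator:
  $$\hat{V}_n = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} y_i y_j$$
  where $\Omega_{ij}$ depends on the joint inclusion probabilities $\pi_{ij} > 0$.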
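The complete-response estimator and its quadratic-form variance estimator can be sketched as below. The slide does not spell out $\Omega_{ij}$; this sketch assumes the standard Horvitz-Thompson form $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j) / (\pi_{ij} \pi_i \pi_j)$ with $\pi_{ii} = \pi_i$, and the function names and toy data are illustrative only.

```python
import numpy as np

def ht_total(y, pi):
    """Estimate the population total as sum_i d_i * y_i with d_i = 1 / pi_i."""
    d = 1.0 / pi
    return float(np.sum(d * y))

def ht_variance(y, pi, pi_joint):
    """Quadratic form sum_i sum_j Omega_ij * y_i * y_j.

    pi_joint[i, j] holds the joint inclusion probability pi_ij,
    with the first-order probabilities pi_i on the diagonal.
    Omega_ij is ASSUMED to take the Horvitz-Thompson form.
    """
    pipj = np.outer(pi, pi)
    omega = (pi_joint - pipj) / (pi_joint * pipj)
    return float(y @ omega @ y)

# Illustration: simple random sampling without replacement, n units out of N.
N, n = 1000, 5
y = np.array([3.0, 7.0, 4.0, 9.0, 6.0])
pi = np.full(n, n / N)
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_joint, pi)

theta_hat = ht_total(y, pi)
v_hat = ht_variance(y, pi, pi_joint)

# Under SRS the quadratic form reduces to the textbook expression N^2 (1 - n/N) s^2 / n.
v_srs = N**2 * (1 - n / N) * np.var(y, ddof=1) / n
```

Under simple random sampling the two variance expressions agree, which is a convenient sanity check before moving to more complex designs.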


Deterministic imputation

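- Population model approach (Deville and Särndal, 1994):
  $$E_\zeta(y_i \mid x_i) = m(x_i, \beta_0)$$
- $a_i = 1$ if $y_i$ is observed when $i \in s$, and $a_i = 0$ otherwise, for $i \in U = \{1, 2, \ldots, N\}$
- MAR: the distribution of $a_i$ depends only on $x_i$
- Imputed value: $\hat{y}_i = m(x_i, \hat{\beta})$
- $\hat{\beta}$: unique solution of the estimating equation (EE)
  $$\hat{U}(\beta) = \sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$$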


Model specification

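- Further model specification:
  $$\mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2 q(x_i, \beta_0)$$
- $h(x_i, \beta) = \dot{m}(x_i, \beta) / q(x_i, \beta) \equiv h_i$, where $\dot{m} = \partial m / \partial \beta$
- Examples: commonly used imputations
  1. Ratio imputation: $h_i = 1$
     $$E_\zeta(y_i \mid x_i) = \beta_0 x_i, \quad \mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2 x_i$$
  2. Linear regression imputation: $h_i = x_i$
     $$E_\zeta(y_i \mid x_i) = x_i' \beta_0, \quad \mathrm{Var}_\zeta(y_i \mid x_i) = \sigma^2$$
  3. Logistic regression imputation ($y_i = 0$ or $1$): $h_i = x_i$
     $$\log\{m_i / (1 - m_i)\} = x_i' \beta_0, \quad \mathrm{Var}_\zeta(y_i \mid x_i) = m_i (1 - m_i)$$
     where $m_i = E_\zeta(y_i \mid x_i)$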
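As a check on the three examples, and assuming $h = \dot{m}/q$ with $\dot{m} = \partial m / \partial \beta$ (an interpretation that reproduces each stated value of $h_i$), the ratios work out as follows:

```latex
\begin{align*}
\text{Ratio:}\quad
  & \dot{m} = x_i,\; q = x_i
  && \Rightarrow\; h_i = \frac{x_i}{x_i} = 1, \\
\text{Linear regression:}\quad
  & \dot{m} = x_i,\; q = 1
  && \Rightarrow\; h_i = \frac{x_i}{1} = x_i, \\
\text{Logistic regression:}\quad
  & \dot{m} = m_i (1 - m_i)\, x_i,\; q = m_i (1 - m_i)
  && \Rightarrow\; h_i = x_i.
\end{align*}
```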


Imputed estimator

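- Imputed estimator of the total $\theta_N$:
  $$\hat{\theta}_{Id} = \sum_{i \in s} d_i \left\{ a_i y_i + (1 - a_i)\, m(x_i, \hat{\beta}) \right\} \equiv \sum_{i \in s} d_i \tilde{y}_i$$
- Examples
  1. Ratio imputation: $m(x_i, \hat{\beta}) = x_i \hat{\beta}$ where
     $$\hat{\beta} = \left( \sum_{i \in s} d_i a_i x_i \right)^{-1} \sum_{i \in s} d_i a_i y_i$$
  2. Linear regression imputation: $m(x_i, \hat{\beta}) = x_i' \hat{\beta}$ where
     $$\hat{\beta} = \left( \sum_{i \in s} d_i a_i x_i x_i' \right)^{-1} \sum_{i \in s} d_i a_i x_i y_i$$
  3. Logistic regression imputation: $\hat{\beta}$ is the solution to
     $$\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$$
- Imputed estimator of the domain total $\theta_z = \sum_{i=1}^{N} z_i y_i$:
  $$\hat{\theta}_{I,z} = \sum_{i \in s} d_i z_i \tilde{y}_i$$
  where $z_i = 1$ if $i \in D$ and $z_i = 0$ otherwise.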
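A minimal sketch of the imputed estimator under deterministic ratio imputation, together with the domain estimator, is given below. The function names and the toy data are hypothetical; missing `y` values are stored as `NaN` purely for illustration.

```python
import numpy as np

def ratio_imputed_total(y, x, a, d):
    """Deterministic ratio imputation: fill in x_i * beta_hat where y_i is missing."""
    a = a.astype(bool)
    # beta_hat = (sum d_i a_i x_i)^{-1} sum d_i a_i y_i, computed from respondents only
    beta_hat = np.sum(d[a] * y[a]) / np.sum(d[a] * x[a])
    # y_tilde_i = a_i y_i + (1 - a_i) m(x_i, beta_hat)
    y_tilde = np.where(a, y, x * beta_hat)
    return float(np.sum(d * y_tilde)), float(beta_hat), y_tilde

def domain_total(y_tilde, z, d):
    """Imputed estimator of a domain total: sum_i d_i z_i y_tilde_i."""
    return float(np.sum(d * z * y_tilde))

# Toy sample of five units with equal design weights (illustrative values).
y = np.array([2.0, 4.0, np.nan, 8.0, np.nan])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a = np.array([1, 1, 0, 1, 0])       # response indicators
z = np.array([1, 0, 1, 0, 1])       # domain indicators
d = np.full(5, 10.0)                # design weights

theta_Id, beta_hat, y_tilde = ratio_imputed_total(y, x, a, d)
theta_Iz = domain_total(y_tilde, z, d)
```

The domain estimator reuses the same completed values $\tilde{y}_i$, so the domain need not be known when the imputation is carried out.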


Variance estimation

- Treating imputed values as if observed: underestimation if $\tilde{y}_i$ is used in $\hat{V}_n$ in place of $y_i$
- Methods that account for imputation:
  - Adjusted jackknife: Rao and Shao (1992)
  - Linearization (population model): Deville and Särndal (1994)
  - Fractional imputation method: Fuller and Kim (2005)
  - Bootstrap: Shao and Sitter (1996)
  - Reverse approach: Shao and Steel (1999)


Variance estimation (Cont’d)

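- Linearization method: Theorem 1 (Kim and Rao, 2009). Under regularity conditions,
  $$n^{1/2} N^{-1} \left( \hat{\theta}_{Id} - \tilde{\theta}_{Id} \right) = o_p(1)$$
  where
  $$\tilde{\theta}_{Id} = \sum_{i \in s} w_i \eta_i,$$
  $$\eta_i = m(x_i; \beta_0) + a_i \left( 1 + c' h_i \right) \{ y_i - m(x_i; \beta_0) \},$$
  $$c = \left\{ \sum_{i=1}^{N} a_i\, \dot{m}(x_i; \beta_0)\, h_i' \right\}^{-1} \sum_{i=1}^{N} (1 - a_i)\, \dot{m}(x_i; \beta_0),$$
  with $\dot{m} = \partial m / \partial \beta$ and $w_i$ denoting the design weight $d_i$.
- Reference distribution: joint distribution of the population model and the sampling mechanism, conditional on the realized $(x_i, a_i)$ in the population.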


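Variance estimation (Cont’d)

- Reverse approach:
  1. $$\hat{V}_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat{\eta}_i \hat{\eta}_j$$
     where $\hat{\eta}_i = \eta_i(\hat{\beta})$.
  2. $$\hat{V}_{2d} = \sum_{i \in s} d_i a_i \left( 1 + \hat{c}' \hat{h}_i \right)^2 \left\{ y_i - m(x_i; \hat{\beta}) \right\}^2$$
     $\hat{V}_{2d}$ is valid even if $V_\zeta(y_i \mid x_i)$ is misspecified.
  3. Variance estimator of $\tilde{\theta}_{Id}$ ($\hat{\theta}_{Id}$):
     $$\hat{V}_d = \hat{V}_{1d} + \hat{V}_{2d}$$
- $\hat{V}_d$ is approximately design-model unbiased.
- If the overall sampling rate is negligible:
  $$\hat{V}_d \cong \hat{V}_{1d}$$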
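The two components can be sketched for ratio imputation, where $h_i = 1$ and $\partial m / \partial \beta = x_i$, so that $\hat{c}$ is a scalar. The sketch makes two assumptions not stated on the slide: the design is simple random sampling without replacement (so the quadratic form in $\Omega_{ij}$ reduces to $N^2 (1 - n/N) s_\eta^2 / n$), and the residual in $\hat{V}_{2d}$ enters squared. Function and variable names are illustrative.

```python
import numpy as np

def ratio_imputation_variance_srs(y, x, a, N):
    """Linearization variance components for ratio imputation under SRS (assumed design)."""
    n = len(x)
    d = np.full(n, N / n)                       # SRS design weights
    a = a.astype(bool)
    beta_hat = np.sum(d[a] * y[a]) / np.sum(d[a] * x[a])
    # c_hat = (sum d_i a_i x_i)^{-1} sum d_i (1 - a_i) x_i  (scalar, since h_i = 1)
    c_hat = np.sum(d[~a] * x[~a]) / np.sum(d[a] * x[a])
    # Residuals are defined for respondents only; nonrespondents contribute zero.
    resid = np.where(a, y - x * beta_hat, 0.0)
    eta_hat = x * beta_hat + (1.0 + c_hat) * resid
    # V1: treat eta_hat as the study variable in the complete-data variance formula.
    v1 = N**2 * (1.0 - n / N) * np.var(eta_hat, ddof=1) / n
    # V2: model-variance component built from squared respondent residuals.
    v2 = np.sum(d * (1.0 + c_hat) ** 2 * resid**2)
    return {
        "beta_hat": float(beta_hat),
        "c_hat": float(c_hat),
        "theta_hat": float(np.sum(d * eta_hat)),
        "v1": float(v1),
        "v2": float(v2),
        "v_d": float(v1 + v2),
    }

# Toy sample (illustrative values); NaN marks a missing y.
y = np.array([2.0, 5.0, np.nan, 7.0, np.nan, 12.0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
a = np.array([1, 1, 0, 1, 0, 1])
out = ratio_imputation_variance_srs(y, x, a, N=600)
```

Because the estimating equation forces the weighted respondent residuals to sum to zero, the weighted sum of $\hat{\eta}_i$ reproduces the imputed estimator itself, which gives a quick consistency check on an implementation.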


Variance estimation (Cont’d)

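- Domain estimation:
  1. $\hat{\theta}_{I,z}$: design-model unbiased for $\theta_z$
  2. Use
     $$\hat{V}_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat{\eta}_{iz} \hat{\eta}_{jz}$$
     where
     $$\hat{\eta}_{iz} = z_i\, m(x_i; \hat{\beta}) + a_i \left( z_i + \hat{c}_z' \hat{h}_i \right) \left\{ y_i - m(x_i; \hat{\beta}) \right\},$$
     $$\hat{c}_z = \left\{ \sum_{i \in s} d_i a_i\, \dot{m}(x_i; \hat{\beta})\, \hat{h}_i' \right\}^{-1} \sum_{i \in s} d_i z_i (1 - a_i)\, \dot{m}(x_i; \hat{\beta}),$$
     with $\dot{m} = \partial m / \partial \beta$.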


Composite imputation

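- Variables $x$, $y$, $z$, with $z$ always observed:
  $$s = s_{RR} \cup s_{RM} \cup s_{MR} \cup s_{MM}, \qquad \theta_N = \sum_{i=1}^{N} y_i$$
  $s_{RM}$: $x$ observed and $y$ missing; $s_{MM}$: $x$ and $y$ both missing
- Imputation model:
  $$E_\zeta(y_i \mid x_i, z_i) = \beta_{y|x}\, x_i$$
  $$E_\zeta(x_i \mid z_i) = \beta_{x|z}\, z_i$$
- Imputed estimator:
  $$\hat{\theta}_{Id} = \sum_{i \in s_{+R}} d_i y_i + \sum_{i \in s_{RM}} d_i \left( \hat{\beta}_{y|x}\, x_i \right) + \sum_{i \in s_{MM}} d_i \left( \hat{\beta}_{y|x}\, \hat{\beta}_{x|z}\, z_i \right)$$
  where $s_{+R} = s_{RR} \cup s_{MR}$ is the set of units with $y$ observed.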


Composite imputation (Cont’d)

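- $\hat{\beta}_{y|x}$ and $\hat{\beta}_{x|z}$ are the solutions of the estimating equations
  $$\hat{U}_1(\beta_{y|x}) = \sum_{i \in s_{RR}} d_i \left( y_i - \beta_{y|x}\, x_i \right) = 0$$
  $$\hat{U}_2(\beta_{x|z}) = \sum_{i \in s_{R+}} d_i \left( x_i - \beta_{x|z}\, z_i \right) = 0$$
- Taylor linearization of the imputed estimator:
  $$\hat{\theta}_{Id}(\hat{\beta}) \cong \hat{\theta}_{Id}(\beta) - \left( \frac{\partial \hat{\theta}_{Id}}{\partial \beta} \right)' \left( \frac{\partial \hat{U}}{\partial \beta} \right)^{-1} \hat{U}(\beta)$$
  where $\hat{U} = (\hat{U}_1, \hat{U}_2)'$ and $\beta = (\beta_{y|x}, \beta_{x|z})'$.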
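The composite imputation estimator and its two estimating equations can be sketched as follows. The sketch assumes the second imputation model is $E_\zeta(x_i \mid z_i) = \beta_{x|z} z_i$ (as implied by $\hat{U}_2$), reads $s_{R+}$ as all units with $x$ observed, and uses hypothetical indicator arrays `ax` and `ay` for "x observed" and "y observed".

```python
import numpy as np

def composite_imputed_total(y, x, z, ax, ay, d):
    """Composite ratio imputation: impute x from z, then y from (possibly imputed) x."""
    ax = ax.astype(bool)
    ay = ay.astype(bool)
    rr = ax & ay                                   # s_RR: both x and y observed
    # U1 = 0  =>  beta_{y|x} = sum_{s_RR} d_i y_i / sum_{s_RR} d_i x_i
    b_yx = np.sum(d[rr] * y[rr]) / np.sum(d[rr] * x[rr])
    # U2 = 0  =>  beta_{x|z} = sum_{s_R+} d_i x_i / sum_{s_R+} d_i z_i
    b_xz = np.sum(d[ax] * x[ax]) / np.sum(d[ax] * z[ax])
    # Fill missing x first, then use the completed x to fill missing y.
    x_filled = np.where(ax, x, b_xz * z)
    y_filled = np.where(ay, y, b_yx * x_filled)
    return float(np.sum(d * y_filled)), float(b_yx), float(b_xz)

# One unit from each of s_RR, s_RM, s_MR, s_MM (illustrative values).
z = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([2.0, 4.0, np.nan, np.nan])
y = np.array([6.0, np.nan, 9.0, np.nan])
ax = np.array([1, 1, 0, 0])
ay = np.array([1, 0, 1, 0])
d = np.full(4, 10.0)

theta_Id, b_yx, b_xz = composite_imputed_total(y, x, z, ax, ay, d)
```

Units in $s_{MM}$ receive $\hat{\beta}_{y|x} \hat{\beta}_{x|z} z_i$ because their imputed $x$ is itself fed into the $y$ imputation, which is why the linearization has to carry both estimating equations jointly.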


Stochastic imputation

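- $y_i^*$ = imputed value of $y_i$ such that
  $$E_I(y_i^*) = m(x_i, \hat{\beta}) \equiv \hat{m}_i$$
- Imputed estimator of $\theta_N$:
  $$\hat{\theta}_I = \sum_{i \in s} d_i \left\{ a_i y_i + (1 - a_i)\, y_i^* \right\}, \qquad E_I(\hat{\theta}_I) = \hat{\theta}_{Id}$$
- Variance estimator of $\hat{\theta}_I$:
  $$\hat{V}_I = \hat{V}_d + \hat{V}^*$$
  where
  $$\hat{V}^* = \sum_{i \in s} d_i^2 (1 - a_i) \left( y_i^* - \hat{m}_i \right)^2$$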


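Multiple imputation: Rubin

- $y_i^{*(1)}, \ldots, y_i^{*(M)}$ = imputed values of $y_i$ ($M \ge 2$)
  $$\hat{\theta}_I^{(k)} = \sum_{i \in s} d_i \left\{ a_i y_i + (1 - a_i)\, y_i^{*(k)} \right\}$$
- Imputed estimator:
  $$\hat{\theta}_{MI} = M^{-1} \sum_{k=1}^{M} \hat{\theta}_I^{(k)}$$
- Rubin’s variance estimator:
  $$\hat{V}_R = W_M + \frac{M + 1}{M} B_M$$
  where $W_M$ is the average of the $M$ naive variance estimators and
  $$B_M = (M - 1)^{-1} \sum_{k=1}^{M} \left( \hat{\theta}_I^{(k)} - \hat{\theta}_{MI} \right)^2$$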
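Rubin's combining rule is short enough to state as a small function. The sketch below takes the $M$ point estimates and the $M$ naive variance estimates as given; the function name and the numbers are illustrative only.

```python
import numpy as np

def rubin_variance(point_estimates, naive_variances):
    """Combine M completed-data analyses with Rubin's rule."""
    est = np.asarray(point_estimates, dtype=float)
    M = est.size
    theta_mi = float(est.mean())                          # average of the M point estimates
    w_m = float(np.mean(naive_variances))                 # within-imputation variance W_M
    b_m = float(np.sum((est - theta_mi) ** 2) / (M - 1))  # between-imputation variance B_M
    v_r = w_m + (M + 1) / M * b_m
    return theta_mi, w_m, b_m, v_r

# Three imputations with made-up estimates and naive variances.
theta_mi, w_m, b_m, v_r = rubin_variance([10.0, 12.0, 14.0], [4.0, 5.0, 6.0])
```

Here $W_M = 5$, $B_M = 4$ and $\hat{V}_R = 5 + (4/3) \cdot 4 = 31/3$.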


Multiple imputation (Cont’d)

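- $\hat{V}_R$ is theoretically justified when
  $$V(\hat{\theta}_{Id}) = V(\hat{\theta}_n) + V(\hat{\theta}_{Id} - \hat{\theta}_n) \qquad \text{(A)}$$
  (congeniality assumption)
- $\hat{V}_R$ is seriously biased if assumption (A) is violated.
- (A) is not satisfied for domain estimation when the domains are not specified at the imputation stage.
- Our proposal:
  $$\hat{V}_{MI} = \hat{V}_d + M^{-1} B_M$$
- $\hat{V}_{MI}$ is valid for $\hat{\theta}_{Id}$ as well as $\hat{\theta}_{I,z}$ without (A).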


Binary response

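- Model: $y_i \mid x_i \sim \text{Bernoulli}\{ m_i = m(x_i, \beta_0) \}$
  $$\text{logit}(m_i) = x_i' \beta_0; \qquad q(x_i, \beta_0) = m_i (1 - m_i) \equiv q_i$$
- $\hat{m}_i = m(x_i, \hat{\beta})$, where $\hat{\beta}$ is the solution to
  $$\sum_{i \in s} d_i a_i \{ y_i - m(x_i, \beta) \} x_i = 0$$
- Stochastic hot deck imputation:
  $$y_i^* = \begin{cases} 1 & \text{with probability } \hat{m}_i \\ 0 & \text{with probability } 1 - \hat{m}_i \end{cases}$$
- $\hat{\eta}_i = \hat{m}_i + a_i \left( 1 + \hat{c}' x_i \right) (y_i - \hat{m}_i)$, where
  $$\hat{c} = \left( \sum_{i \in s} d_i a_i \hat{q}_i x_i x_i' \right)^{-1} \sum_{i \in s} d_i (1 - a_i)\, \hat{q}_i x_i$$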
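The binary-response quantities can be computed in a few lines. The sketch below solves the weighted score equation by Newton-Raphson (an implementation choice, not something the slides prescribe), then forms $\hat{c}$, $\hat{\eta}_i$ and one stochastic imputation. The simulated data, seed and function names are all illustrative assumptions.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Newton-Raphson solution of sum_i w_i {y_i - m(x_i, beta)} x_i = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        m = expit(X @ beta)
        score = X.T @ (w * (y - m))
        info = (X * (w * m * (1.0 - m))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def binary_imputation_pieces(X, y, a, d):
    """Return beta_hat, m_hat, c_hat and the linearized values eta_hat."""
    a = a.astype(float)
    y0 = np.where(a == 1, y, 0.0)       # missing y never enters: its weight d_i * a_i is 0
    beta_hat = fit_weighted_logistic(X, y0, d * a)
    m_hat = expit(X @ beta_hat)
    q_hat = m_hat * (1.0 - m_hat)
    lhs = (X * (d * a * q_hat)[:, None]).T @ X
    rhs = X.T @ (d * (1.0 - a) * q_hat)
    c_hat = np.linalg.solve(lhs, rhs)
    eta_hat = m_hat + a * (1.0 + X @ c_hat) * (y0 - m_hat)
    return beta_hat, m_hat, c_hat, eta_hat

# Simulated sample (illustrative): intercept plus one covariate.
rng = np.random.default_rng(0)
n, N = 200, 10_000
x = rng.normal(3.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, expit(0.5 * x - 2.0)).astype(float)
a = rng.binomial(1, 0.7, n)
d = np.full(n, N / n)

beta_hat, m_hat, c_hat, eta_hat = binary_imputation_pieces(X, y, a, d)
y_star = rng.binomial(1, m_hat)                        # stochastic hot deck draw
theta_I = np.sum(d * np.where(a == 1, y, y_star))      # stochastic imputed estimator
theta_Id = np.sum(d * np.where(a == 1, y, m_hat))      # deterministic counterpart
```

Because the score equation holds at $\hat{\beta}$, the weighted sum of $\hat{\eta}_i$ equals the deterministic imputed estimator, which the example can be used to confirm numerically.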


Binary response (Cont’d)

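- Fractional imputation (FI): eliminate the imputation variance $V^*$ by FI
- $M = 2$ fractions: impute
  $$y_i^* = \begin{cases} 1 & \text{with fractional weight } \hat{m}_i \\ 0 & \text{with fractional weight } 1 - \hat{m}_i \end{cases}$$
- The data file reports the real values 1 and 0 with associated fractions $\hat{m}_i$ and $1 - \hat{m}_i$.
- $\hat{\theta}_{FI} = \hat{\theta}_{Id}$: $V^*$ is eliminated
- Estimation of the domain total and mean: $\hat{\theta}_{FI,z}$ and $\left( \sum_{i \in s} d_i z_i \right)^{-1} \hat{\theta}_{FI,z}$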
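The identity between the fractionally imputed estimator and the deterministic one follows by weighting the two imputed values 1 and 0 by their fractions, as this short worked step shows (a sketch in the notation of the slides):

```latex
\begin{align*}
\hat{\theta}_{FI}
  &= \sum_{i \in s} d_i \left[ a_i y_i + (1 - a_i)
       \left\{ \hat{m}_i \cdot 1 + (1 - \hat{m}_i) \cdot 0 \right\} \right] \\
  &= \sum_{i \in s} d_i \left\{ a_i y_i + (1 - a_i)\, \hat{m}_i \right\}
   = \hat{\theta}_{Id}.
\end{align*}
```

No random draw is involved, so there is no imputation variance to estimate.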


Binary response (Cont’d)

- Multiple imputation (MI):
  1. Generate $\beta^* \sim N\left[ \hat{\beta}, \left( \sum_{i \in s} a_i \hat{q}_i x_i x_i' \right)^{-1} \right]$
  2. Generate $y_i^* \sim \text{Bernoulli}(m_i^*)$ with $m_i^* = m(x_i, \beta^*)$
  3. Repeat steps 1 and 2 independently $M$ times.
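The three steps can be sketched as a loop over $M$ imputations. The sketch assumes $\hat{\beta}$ has already been obtained (the value used in the example is a stand-in, not an estimate from the data), and all names and simulated inputs are illustrative.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def multiply_impute_binary(X, y, a, d, beta_hat, M, rng):
    """Generate M completed-data estimates of the total for a binary y."""
    a = a.astype(bool)
    m_hat = expit(X @ beta_hat)
    q_hat = m_hat * (1.0 - m_hat)
    # Covariance of beta*: inverse of sum_i a_i q_i x_i x_i' (respondents only).
    info = (X[a] * q_hat[a][:, None]).T @ X[a]
    cov = np.linalg.inv(info)
    estimates = np.empty(M)
    for k in range(M):
        beta_star = rng.multivariate_normal(beta_hat, cov)   # step 1
        m_star = expit(X @ beta_star)
        y_star = rng.binomial(1, m_star)                     # step 2
        y_completed = np.where(a, y, y_star)
        estimates[k] = np.sum(d * y_completed)
    theta_mi = float(estimates.mean())
    b_m = float(estimates.var(ddof=1))                       # between-imputation variance
    return theta_mi, b_m, estimates

# Simulated inputs (illustrative).
rng = np.random.default_rng(1)
n, N, M = 100, 10_000, 5
x = rng.normal(3.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, expit(0.5 * x - 2.0)).astype(float)
a = rng.binomial(1, 0.7, n)
d = np.full(n, N / n)
beta_hat = np.array([-2.0, 0.5])     # stand-in for the fitted coefficient vector

theta_mi, b_m, estimates = multiply_impute_binary(X, y, a, d, beta_hat, M, rng)
```

The returned `b_m` is the $B_M$ term that enters both Rubin's estimator and the proposed $\hat{V}_{MI} = \hat{V}_d + M^{-1} B_M$.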


Simulation Study : Binary response

- Finite population of size $N = 10{,}000$ generated from
  - $x_i \sim N(3, 1)$
  - $y_i \mid x_i \sim \text{Bernoulli}(m_i)$, where $\text{logit}(m_i) = 0.5 x_i - 2$
  - $z_i \sim \text{Bernoulli}(0.4)$ ($z_i$: domain indicator)
- SRS of size $n = 100$
- $x_i$ and $z_i$ are always observed; $y_i$ is subject to missingness.
- Missing response mechanism:
  $$a_i \sim \text{Bernoulli}(\pi_i); \qquad \text{logit}(\pi_i) = \phi_0 + \phi_1 (x_i - 3) + \phi_2 |x_i - 3|$$
  (a) $\phi_1 = 0, \phi_2 = 0$; (b) $\phi_1 = 1, \phi_2 = 0$; (c) $\phi_1 = 0, \phi_2 = 1$.
  $\phi_0$ is chosen to achieve a 70% response rate.
- Two variance estimates are computed for multiple imputation.
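One way to generate data matching this description is sketched below. The slides do not say how $\phi_0$ was determined; the sketch assumes it is tuned by bisection so that the average response probability over the finite population is 0.70. The seed and function names are arbitrary, and mechanism (b) is used as the example.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def solve_phi0(x, phi1, phi2, target=0.7, lo=-20.0, hi=20.0, n_iter=100):
    """Bisection on phi0 so that the mean response probability hits the target (assumed rule)."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        rate = np.mean(expit(mid + phi1 * (x - 3.0) + phi2 * np.abs(x - 3.0)))
        if rate < target:
            lo = mid          # response rate too low: increase phi0
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2009)
N, n = 10_000, 100

# Finite population.
x = rng.normal(3.0, 1.0, N)
y = rng.binomial(1, expit(0.5 * x - 2.0))
z = rng.binomial(1, 0.4, N)

# Response mechanism (b): phi1 = 1, phi2 = 0.
phi1, phi2 = 1.0, 0.0
phi0 = solve_phi0(x, phi1, phi2)
p_resp = expit(phi0 + phi1 * (x - 3.0) + phi2 * np.abs(x - 3.0))
a = rng.binomial(1, p_resp)

# Simple random sample without replacement.
sample = rng.choice(N, size=n, replace=False)
x_s, y_s, z_s, a_s = x[sample], y[sample], z[sample], a[sample]
```

Repeating the sampling and response steps over many replicates, and comparing the average variance estimate with the Monte Carlo variance, would give relative biases of the kind reported in the results table.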


Simulation Study (Cont’d)

Table: Relative bias (RB) of Rubin’s variance estimator (R) and the proposed variance estimator (KR) for multiple imputation

Parameter         Response mechanism   RB (%) of R   RB (%) of KR
Population mean   Case 1                      1.07           2.90
                  Case 2                     -0.29           1.42
                  Case 3                     -3.96          -2.09
Domain mean       Case 1                     34.25           2.37
                  Case 2                     31.08           2.28
                  Case 3                     27.55          -3.41

Conclusion:

1. KR has small |RB| in all cases.
2. R leads to large |RB| for the domain mean: 28% to 34%.


Doubly robust method

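Case 1: $p_i$ known ($p_i$ = probability of response)

- Let $\hat{\beta}$ be the solution to
  $$\hat{U}(\beta) = \sum_{i \in s} d_i a_i \left( \frac{1}{p_i} - 1 \right) \{ y_i - m(x_i, \beta) \}\, h(x_i, \beta) = 0$$
- Imputed estimator:
  $$\hat{\theta}_{Id} = \sum_{i \in s} d_i \left\{ a_i y_i + (1 - a_i)\, m(x_i, \hat{\beta}) \right\}$$
- If 1 is an element of $h_i$, then
  $$\hat{\theta}_{Id} = \sum_{i \in s} d_i \left\{ \frac{a_i}{p_i} y_i + \left( 1 - \frac{a_i}{p_i} \right) m(x_i, \hat{\beta}) \right\}$$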
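The equivalence of the two forms can be checked numerically. The sketch below uses linear regression imputation with an intercept, so that $h_i = x_i$ contains 1 and the estimating equation becomes a weighted least squares problem with weights $d_i a_i (1/p_i - 1)$. The simulated data, the response model used to produce $p_i$, and the function names are illustrative assumptions.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def doubly_robust_total(X, y, a, d, p):
    """Imputed estimator with beta_hat from the (1/p_i - 1)-weighted estimating equation."""
    a = a.astype(float)
    y0 = np.where(a == 1, y, 0.0)            # missing y carries zero weight below
    w = d * a * (1.0 / p - 1.0)
    # Linear regression imputation, h_i = x_i: solve sum w_i (y_i - x_i'beta) x_i = 0.
    beta_hat = np.linalg.solve((X * w[:, None]).T @ X, X.T @ (w * y0))
    m_hat = X @ beta_hat
    theta_imputed = float(np.sum(d * (a * y0 + (1.0 - a) * m_hat)))
    theta_weighted = float(np.sum(d * (a / p * y0 + (1.0 - a / p) * m_hat)))
    return theta_imputed, theta_weighted, beta_hat

# Simulated sample with known response probabilities (illustrative).
rng = np.random.default_rng(7)
n = 60
x = rng.normal(3.0, 1.0, n)
X = np.column_stack([np.ones(n), x])         # intercept column: 1 is an element of h_i
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
p = expit(0.5 + 0.5 * (x - 3.0))
a = rng.binomial(1, p)
d = np.full(n, 20.0)

theta_imputed, theta_weighted, beta_hat = doubly_robust_total(X, y, a, d, p)
```

The two returned totals agree because their difference is $\sum_{i \in s} d_i a_i (1/p_i - 1)(y_i - \hat{m}_i)$, which is the intercept component of the estimating equation and therefore zero.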


Doubly robust method (Cont’d)

- Properties of $\hat{\theta}_{Id}$:
  1. Under the assumed response model, $E_R(\hat{\theta}_{Id}) \cong \hat{\theta}_n$ regardless of the choice of $m(x_i, \beta)$.
  2. Under the imputation model, $E_\zeta(\hat{\theta}_{Id} - \hat{\theta}_n) \cong 0$.
- (1) and (2) imply that $\hat{\theta}_{Id}$ is doubly robust.


Doubly robust method (Cont’d)

Case 2: $p_i$ unknown ($p_i = p_i(\alpha)$)

- Linearization variance estimator:
  - Haziza and Rao (2006): linear regression imputation
  - Deville (1999), Demnati and Rao (2004) approach: general case


Extensions

- Calibration estimators
  Davison and Sardy (2007): deterministic linear regression imputation, stratified SRS
- Pseudo-empirical likelihood intervals
- Other parameters
