Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products

Nuoo-Ting (Jassy) Molitor 1, Chris Jackson 2, with Nicky Best, Sylvia Richardson 1

1 Department of Epidemiology and Public Health, Imperial College, London

2 MRC Biostatistics Unit, Cambridge

[email protected] [email protected]

http://www.bias-project.org.uk


Page 1: Nuoo-Ting (Jassy) Molitor 1 Chris Jackson 2 With Nicky Best, Sylvia Richardson 1

Nuoo-Ting (Jassy) Molitor 1

Chris Jackson 2

With Nicky Best, Sylvia Richardson 1

1 Department of Epidemiology and Public Health, Imperial College, London

2 MRC Biostatistics Unit, Cambridge

[email protected]@mrc-bsu.cam.ac.uk

http://www.bias-project.org.uk

Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products

Page 2: Outline

• Motivation for combining different data sources

• Case study: Chlorination Study

• Data sources

• Statistical modeling

• Simulation and real data analysis

Page 3: Observational studies

Observational studies are filled with many uncertainties beyond random error.

Sources of uncertainty:

• Missing values

• Unobserved confounders

• Measurement errors

• Selection bias

• Random errors

These uncertainties are hard to identify within a single data set.

Page 4: Combining multiple data sources

Research questions are complicated in nature, and a single data set may not be able to provide a sufficient answer.

Example: Puzzle

Page 5: Case study

Combining birth register, survey and census data to study the effects of water disinfection by-products on the risk of low birth weight

Page 6: Example of combining different data sources – Chlorination Study

Environmental exposure: chlorine by-products (THMs)

Chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic by-products:

• bromate

• chlorite

• haloacetic acids (HAA5)

• total trihalomethanes (THMs)

Outcome: low birth weight (LBW), split by gestation age

• LBW: baby's birth weight is less than 2.5 kg

• LBWP (LBW and pre-term): LBW babies born at less than 37 weeks

• LBWF (LBW and full-term): LBW babies born at 37 weeks or later

Covariates: mother's race/ethnicity, baby's sex, mother's smoking status, mother's age during the pregnancy
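The outcome definitions on this slide can be written as a small classification rule. The sketch below is only an illustration: the function name `classify_birth` is hypothetical, and the 1/2/3 coding is borrowed from the birth weight indicator used later in the deck (1: normal, 2: LBWP, 3: LBWF). It returns `None` when a baby is LBW but gestation age is unavailable, which is the situation in the administrative data.

```python
from typing import Optional

LBW_THRESHOLD_KG = 2.5   # from the slide: LBW means birth weight < 2.5 kg
PRETERM_WEEKS = 37       # from the slide: pre-term means born at < 37 weeks


def classify_birth(weight_kg: float, gestation_weeks: Optional[float]) -> Optional[int]:
    """Return 1 (normal), 2 (LBWP), 3 (LBWF), or None if LBW with unknown gestation age."""
    if weight_kg >= LBW_THRESHOLD_KG:
        return 1                      # not low birth weight
    if gestation_weeks is None:
        return None                   # LBW, but pre-term vs full-term cannot be decided
    return 2 if gestation_weeks < PRETERM_WEEKS else 3
```

For example, `classify_birth(2.1, 35)` falls in category 2 and `classify_birth(2.1, None)` stays unclassified.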

Page 7: Available data sources related to the Chlorination Study: why do we need them?

Administrative data (NBR) deal with:

• The small % of LBW in the population

• The inconclusive link between LBW and THMs

Aggregate data:

• Imputing missing covariates

Survey data (MCS):

• Adjust for important subject-level covariates

• Allow different types of LBW to be examined

Page 8: Summary of data sources

NBR (national birth registry): administrative data (large); power, no selection bias

• Observed postcode

• Missing smoking and race/ethnicity

• Missing baby's gestation age

Aggregate data (UK)

• Observed postcode

• Census 2001: region-level race/ethnicity composition

• Consumer survey (CACI): region-level tobacco expenditure

MCS (Millennium Cohort Study): survey data (subset of NBR); low power, selection bias

• Observed postcode

• Observed smoking and race/ethnicity

• Observed baby's gestation age

Page 9: Building the sub-model: disease sub-model for MCS

m: subject index for MCS; r: region index

Graph nodes (each marked as unknown or known): outcome $y_{rm}$ (normal, LBWP, LBWF), exposure $\mathrm{THM}_{rm}$, covariates $C_{rm}$, disease model parameters

y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)

THM: THM (chlorine by-product) exposure

C: covariates such as race/ethnicity and smoking, which are missing in the NBR and only observed in the MCS

Multinomial logistic regression for MCS:

$$y_{rm} \sim \mathrm{Multinomial}(p_{rm,1:3},\, 1)$$

$$\log\left(\frac{p_{rm,2}}{p_{rm,1}}\right) = b_{10} + b_{11}\,\mathrm{THM}_{rm} + b_{12}\,C_{rm}$$

$$\log\left(\frac{p_{rm,3}}{p_{rm,1}}\right) = b_{20} + b_{21}\,\mathrm{THM}_{rm} + b_{22}\,C_{rm}$$
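To make the multinomial logistic regression concrete, here is a minimal sketch of how the two log-ratio equations determine the three category probabilities, with category 1 (normal) as the baseline. The function name `category_probs` and the coefficient values are hypothetical, and C is treated as a single number for simplicity; nothing here is taken from the fitted model.

```python
import numpy as np


def category_probs(thm, c, b1, b2):
    """Probabilities (p1, p2, p3) for normal, LBWP, LBWF with category 1 as baseline.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) follow the slide's notation.
    """
    eta = np.array([
        0.0,                                # baseline: log(p1 / p1) = 0
        b1[0] + b1[1] * thm + b1[2] * c,    # log(p2 / p1)
        b2[0] + b2[1] * thm + b2[2] * c,    # log(p3 / p1)
    ])
    expo = np.exp(eta - eta.max())          # subtract the max for numerical stability
    return expo / expo.sum()


# Hypothetical coefficient values, chosen only to illustrate the calculation.
p = category_probs(thm=1, c=1, b1=(-3.0, 0.2, 0.5), b2=(-4.0, 0.5, 1.0))
```

Taking `np.log(p[1] / p[0])` recovers the first linear predictor, which is a quick check that the probabilities agree with the equations.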

Page 10: Building the sub-model: disease sub-model for NBR

n: subject index for NBR; r: region index

Graph nodes (each marked as unknown or known): outcome $y_{rn}$ (normal, LBWP, LBWF), exposure $\mathrm{THM}_{rn}$, covariates $C_{rn}$, disease model parameters

LBWP and LBWF are missing in the NBR because gestation age is missing.

C: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)

Multinomial logistic regression for NBR:

$$y_{rn} \sim \mathrm{Multinomial}(p_{rn,1:3},\, 1)$$

$$\log\left(\frac{p_{rn,2}}{p_{rn,1}}\right) = b_{10} + b_{11}\,\mathrm{THM}_{rn} + b_{12}\,C_{rn}$$

$$\log\left(\frac{p_{rn,3}}{p_{rn,1}}\right) = b_{20} + b_{21}\,\mathrm{THM}_{rn} + b_{22}\,C_{rn}$$

Page 11: Missing outcome model: impute LBWP and LBWF for NBR

G-age: gestation age

NBR: birth weight (BW) outcome $y_{rn}$ is observed as normal or LBW; the split of LBW into LBWP and LBWF is unknown because G-age is missing. Other nodes: $\mathrm{THM}_{rn}$, $C_{rn}$, disease model parameters.

MCS: birth weight (BW) outcome $y_{rm}$ is observed as normal, LBWP or LBWF. Other nodes: $\mathrm{THM}_{rm}$, $C_{rm}$, disease model parameters.

Legend: known / unknown

Page 12: Building the sub-model: missing covariate model

Impute $C_{rn}$ in terms of aggregate data and MCS data.

Graph nodes (each marked as unknown or known): $C_{rn}$ (NBR), $C_{rm}$ (MCS), aggregate data $A_r$, missing covariate model parameters

Since our missing covariates, such as race and smoking, are binary variables, we use a multivariate probit model to account for their correlation.

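Page 13: Multivariate probit model (Chib & Greenberg, 1998)

Race: 1 = nonwhite (Asian, Black, Others), 0 = white

Smoke: 1 = yes, 0 = no

Define underlying continuous variables (Smoke*, Race*):

$$\mathrm{Smoke} = I(\mathrm{Smoke}^* > 0), \qquad \mathrm{Race} = I(\mathrm{Race}^* > 0)$$

$$\begin{pmatrix} \mathrm{Smoke}^* \\ \mathrm{Race}^* \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} u_{1,r} \\ u_{2,r} \end{pmatrix},\ \Sigma \right)$$

$$u_{1,r} = \delta_{01,s} + \delta_1^{T} A_r, \qquad u_{2,r} = \delta_{02,s} + \delta_2^{T} A_r, \qquad s = 1, 2, 3$$

Correlation:

$$\Sigma = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}, \qquad -1 \le b \le 1$$

s: sampling stratum; it adjusts for selection bias.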
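The latent-variable construction on this slide can be illustrated by simulation: draw the two underlying continuous variables from a bivariate normal with unit variances and correlation b, then threshold each at zero. This is a sketch under stated assumptions, not the authors' code; the function name `simulate_smoke_race` and the values of `u1`, `u2` and `b` are hypothetical.

```python
import numpy as np


def simulate_smoke_race(u1, u2, b, size, rng):
    """Draw (smoke, race) pairs by thresholding bivariate normal latent variables at zero."""
    mean = np.array([u1, u2])
    cov = np.array([[1.0, b], [b, 1.0]])    # unit variances, correlation b
    latent = rng.multivariate_normal(mean, cov, size=size)
    return (latent > 0).astype(int)         # column 0: smoke, column 1: race


rng = np.random.default_rng(0)
# u1, u2 and b are hypothetical; in the model the means depend on A_r and the stratum s.
draws = simulate_smoke_race(u1=-0.5, u2=-1.0, b=0.3, size=20000, rng=rng)
```

With unit variances, the marginal probability of each binary covariate is the standard normal CDF evaluated at its mean, so roughly 31% of the simulated subjects smoke and roughly 16% are nonwhite under these illustrative values.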

Page 14: Unified model

• NBR disease sub-model: $y_{rn}$ (normal, LBWP, LBWF), $\mathrm{THM}_{rn}$, $C_{rn}$, disease model parameters

• MCS disease sub-model: $y_{rm}$ (normal, LBWP, LBWF), $\mathrm{THM}_{rm}$, $C_{rm}$, disease model parameters

• Missing outcome model

• Missing covar. sub-model: $C_{rn}$, $C_{rm}$, aggregate data $A_r$, missing covar. model parameters

Legend: known / unknown

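Page 15: Unified model equations

1. Disease model (y = {1, 2, 3})

$$y^{obs}_{ri} \sim \mathrm{Multinomial}(p_{ri,1:3},\, 1), \qquad i \notin N^{m}$$

$$\log\left(\frac{p_{ri,u}}{p_{ri,1}}\right) = \beta_{0,u} + \beta_{1,u}\,\mathrm{THM}_{ri} + \beta_{2,u}^{T} X_{ri} + \beta_{3,u}\,\mathrm{Smoke}_{ri} + \beta_{4,u}\,\mathrm{Race}_{ri}, \qquad u = 2, 3$$

2. Missing outcome model

$$y^{miss}_{ri} \sim \mathrm{Multinomial}(p^{*}_{ri,1:3},\, 1), \qquad i \in N^{m}$$

$$p^{*}_{ri,1} = 0, \qquad p^{*}_{ri,u} = \frac{p_{ri,u}}{p_{ri,2} + p_{ri,3}}, \qquad u = 2, 3$$

3. Missing covariates model (multivariate probit)

$$\mathrm{smoke}_{ri} = I(\mathrm{smoke}^{*}_{ri} > 0), \qquad \mathrm{race}_{ri} = I(\mathrm{race}^{*}_{ri} > 0)$$

$$\begin{pmatrix} \mathrm{smoke}^{*}_{ri} \\ \mathrm{race}^{*}_{ri} \end{pmatrix} \sim \text{Multivariate Normal}\left( \begin{pmatrix} u_{1,r} \\ u_{2,r} \end{pmatrix},\ \Sigma \right)$$

$$u_{1,r} = \delta_{01,s} + \delta_1^{T} A_r, \qquad u_{2,r} = \delta_{02,s} + \delta_2^{T} A_r, \qquad s = 1, 2, 3$$

$$\Sigma = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}, \qquad -1 \le b \le 1$$

Notation:

• i: subject index

• $N^{m}$: group of subjects who had a missing outcome ($y^{miss}$)

• r: region

• u: index for the category of outcome

• $y^{obs}$: observed outcome

• X: observed covariates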
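The missing outcome model (part 2) amounts to renormalizing the disease-model probabilities over the two LBW categories, since a subject with a missing outcome is known to be LBW but not which type. A minimal sketch of that single step follows; the function names and the example probabilities are hypothetical, and in a full Bayesian fit this draw would happen inside the sampler rather than as a standalone call.

```python
import numpy as np


def missing_outcome_probs(p):
    """Renormalize (p1, p2, p3) over the two LBW categories; category 1 gets probability 0."""
    p = np.asarray(p, dtype=float)
    lbw_total = p[1] + p[2]
    return np.array([0.0, p[1] / lbw_total, p[2] / lbw_total])


def impute_outcome(p, rng):
    """Draw one imputed category label (2 or 3) for a subject with a missing outcome."""
    return int(rng.choice([1, 2, 3], p=missing_outcome_probs(p)))


# Hypothetical disease-model probabilities for one subject.
p_star = missing_outcome_probs([0.90, 0.06, 0.04])
```

Here the subject's LBW probability mass of 0.10 is split 60/40 between LBWP and LBWF, and the normal category can never be drawn.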

Page 16: Investigating the performance of the unified model

Structure: A (aggregate) → C (0/1) through the missing covariate model; C → Y (1, 2, 3) through the missing outcome model

Good performance of the model depends on:

1. How well the aggregate data can inform C (covariate)

2. How strongly C and Y are linked

We can examine the following 4 data scenarios:

1. Strong (A→C), Strong (C→Y)

2. Strong (A→C), Weak (C→Y)

3. Weak (A→C), Strong (C→Y)

4. Weak (A→C), Weak (C→Y)

Page 17: Simulation study

Step 1: Create data (N = 1333) under the scenarios.

Step 2: Missingness assignment:

- randomly choose 80% of subjects and treat their C as missing

- only 10% of individuals with outcomes in categories 2 or 3 are assigned to be missing

Repeat step 2 to generate 20 replicate samples.

Step 3: Compare the prediction based on an analysis using fully observed data (no imputation) with an analysis using partially observed data (imputation).

Note: partially observed data were analyzed under various models:

1. Covariate sub-model (examining A→C)

2. Outcome sub-model (examining C→Y)

3. Unified model (examining A→C and C→Y)

4. Unified model with cut
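Step 2 of the simulation design can be sketched as below. The reading of the second rule is an assumption: 10% of the subjects whose outcome is in category 2 or 3 have that category label hidden. The function name `assign_missing` and the simulated outcome mix are hypothetical and stand in for the data created in Step 1.

```python
import numpy as np


def assign_missing(y, rng, frac_c_missing=0.8, frac_y_missing=0.1):
    """Return boolean masks (c_missing, y_missing) for one replicate of Step 2."""
    y = np.asarray(y)
    n = y.size
    c_missing = np.zeros(n, dtype=bool)
    c_idx = rng.choice(n, size=int(round(frac_c_missing * n)), replace=False)
    c_missing[c_idx] = True

    # Assumed reading: 10% of subjects in categories 2 or 3 lose their category label.
    lbw_idx = np.flatnonzero((y == 2) | (y == 3))
    y_missing = np.zeros(n, dtype=bool)
    y_idx = rng.choice(lbw_idx, size=int(round(frac_y_missing * lbw_idx.size)), replace=False)
    y_missing[y_idx] = True
    return c_missing, y_missing


rng = np.random.default_rng(2024)
y = rng.choice([1, 2, 3], size=1333, p=[0.9, 0.05, 0.05])   # hypothetical outcome mix
replicates = [assign_missing(y, rng) for _ in range(20)]     # 20 replicate samples
```

Keeping the outcome vector fixed and redrawing only the masks mirrors the instruction to repeat Step 2, so the replicates differ only in which values are hidden.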

Page 18: Examining the imputation of missing covariates: one level (A→C)

Panels: Strong A→C; Weak A→C

The model assigns a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to those whose true pattern is different.

The ability to discriminate the true covariate pattern decreases from the strong A→C scenario to the weak A→C scenario.

Page 19: Examining the imputation of missing covariates: two levels (A→C & C→Y)

Feedback from the outcome model is beneficial to covariate imputation.

The predicted probabilities of covariate pattern (C = 0,0) are better able to discriminate between subjects whose true covariates are C = 0,0 and those whose true covariates are not.

This holds in particular for the weak C scenarios.

Page 20: Examining the impact of the imputation model on the Y-C association

| Scenario | Parameter | EST | Outcome model only: Est (MSE) | Unified model: Est (MSE) | Unified model w/ cut: Est (MSE) |
|---|---|---|---|---|---|
| SYSC | beta.smoke[3] | 0.9 | 0.91 (0.01) | 1.07 (0.27) | 0.25 (0.43) |
| SYSC | beta.race[3] | 1.79 | 1.83 (0.01) | 2.22 (0.25) | 1.12 (0.47) |
| SYWC | beta.smoke[3] | 0.99 | 0.97 (0.00) | 0.97 (0.51) | 0.15 (0.71) |
| SYWC | beta.race[3] | 2.56 | 2.57 (0.01) | 2.71 (0.49) | 0.67 (3.63) |
| WYSC | beta.smoke[3] | -0.02 | 0.05 (0.01) | 0.57 (1.34) | 0.06 (0.07) |
| WYSC | beta.race[3] | 0.32 | 0.41 (0.03) | 0.61 (0.41) | 0.18 (0.09) |
| WYWC | beta.smoke[3] | 0.35 | 0.34 (0.03) | 0.91 (0.89) | 0.09 (0.11) |
| WYWC | beta.race[3] | 1 | 1.06 (0.04) | 1.23 (1.32) | 0.18 (0.84) |

• Outcome vs. unified model: the unified model has a higher MSE than the outcome model (more missing values need to be imputed).

• Unified vs. unified with cut: a strong Y-C association helps reduce the MSE, but a weak Y-C association does not.

Page 21: Real data analysis – a water company in Northern England

[Map of water company regions with a scale bar in kilometers: United Utilities, South West Water, Southern Water, Severn Trent Water, Essex and Suffolk Water, Anglian Water, Yorkshire Water, Northumbrian Water, Welsh Water, Thames Water, Bristol Water, Three Valleys Water. © Imperial College London]

Data:

• Restricted to singleton births

• Period: Sep 2000 – Aug 2001

Subjects: MCS 1333 + NBR 7945 = Total 9278

• Missing % in race and smoke: ~85%

• Missing % in outcome: ~7%

Legend: completely observed information; missing race; missing smoke; missing outcome at levels 2 (LBWP) and 3 (LBWF)

Page 22: Real data analysis – a water company in northern England

Exposure variable: THMs

• It was dichotomized into 2 groups:

• low-medium exposure group (<= 60 µg/l): 57.35%

• high exposure group (> 60 µg/l): 42.65%

• Estimated in a separate model for MCS and NBR (Whitaker et al., 2005)

In addition to race and smoke, we also adjust for baby's sex and mother's age, which are observed in both MCS and NBR.
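The dichotomization of the exposure variable is a simple threshold rule; the sketch below makes the boundary handling explicit (a value of exactly 60 falls in the low-medium group). The function name `thm_high` and the example values are hypothetical.

```python
import numpy as np

THM_CUTOFF = 60.0   # cut-off from the slide, in the slide's concentration units


def thm_high(thm):
    """Return 1 for the high exposure group (> 60) and 0 for low-medium (<= 60)."""
    return (np.asarray(thm, dtype=float) > THM_CUTOFF).astype(int)


groups = thm_high([35.2, 60.0, 60.1, 88.0])   # hypothetical exposure estimates
```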

Page 23: Models for real data analysis: no imputation vs. imputation

a. Multinomial logistic regression model for MCS data (Bayesian): no imputation

b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: imputes missing outcome and covariates

Page 24: Results for the real data analysis (low birth weight full-term vs. normal)

OR (95% CI)*

| Data | Model | Outcome | THMs | Smoke | Non-white |
|---|---|---|---|---|---|
| MCS (1333) | Multinomial logistic (Bayesian) | LBWF | 1.64 (0.8-3.1) | 2.65 (1.2-5.2) | 5.92 (2.2-12.9) |
| MCS+NBR (9278) | Bayesian multiple bias | LBWF | 2.4 (1.1-4.5) | 2.5 (1.1-4.7) | 5.6 (2.6-10.8) |

* 95% Bayesian credible interval

All parameter estimates are adjusted for baby's sex and mother's age.
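Odds ratios with Bayesian credible intervals of this kind are typically obtained by exponentiating posterior draws of a log-odds coefficient and taking quantiles. The sketch below assumes the point estimate is the posterior median, which the slides do not state; the function name `odds_ratio_summary` is hypothetical and the draws are simulated for illustration only, not the study's posterior samples.

```python
import numpy as np


def odds_ratio_summary(beta_draws):
    """Posterior median and 95% credible interval of exp(beta) from MCMC draws of beta."""
    or_draws = np.exp(np.asarray(beta_draws, dtype=float))
    lo, med, hi = np.percentile(or_draws, [2.5, 50.0, 97.5])
    return med, lo, hi


# Simulated draws for illustration only; these are not the study's posterior samples.
rng = np.random.default_rng(1)
beta_thm = rng.normal(loc=0.8, scale=0.35, size=5000)
med, lo, hi = odds_ratio_summary(beta_thm)
```

An interval whose lower limit is above 1 is the usual informal reading of "evidence for an association" on the odds ratio scale.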

Page 25: Conclusion

There is evidence of an association between THM exposure and full-term low birth weight.

Combining the datasets can:

• increase the statistical power of the survey data

• alleviate bias due to confounding in the administrative data

We must allow for the selection mechanism of the survey when combining data.

Page 26: Thanks

Mireille Toledano, Mark Nieuwenhuijsen, James Bennett, Peter Hambly, Daniela Fecht, John Molitor