Can methods that deal with missing data reduce bias or increase precision in longitudinal studies?

Jonathan Sterne, Margaret May, Jon Heron, Ross Harris

ALSPAC / MRC Health Services Research Collaboration, Department of Social Medicine, University of Bristol

Outline

• Missing data in the ALSPAC study

• Commonly used methods for dealing with missing data

• Valid methods to deal with missing data

• Example applications

• Issues and concluding remarks


ALSPAC

• Avon Longitudinal Study of Parents and Children

• Birth cohort study of ~13,000 children and their parents, based in south-west England, established by Prof Jean Golding and colleagues ~1990

• Designed to determine ways in which the individual’s genotype combines with environmental pressures to influence health and development

• Children now aged 14-15; 5-year core support recently agreed by MRC/Wellcome

ALSPAC data

• Self-completion questionnaires

• Hands-on assessments

• Data from external sources

• Biological samples

• DNA

Maintaining Response

• Handling non-response:

– Two reminder letters

– Telephone call

– Visits

• Maintaining study profile:

– Newsletters

– Media coverage

• Discovery club for children

Response rates

• Child-based:

[Figure: child-based response chart; one axis runs from 0 to 14,000, the other from 1 to 19]

Total number of questionnaires returned   Number of children (%)
0                                         857 (6.2)
1                                         561 (4.1)
2                                         418 (3.0)
3                                         329 (2.4)
4                                         386 (2.8)
5                                         372 (2.7)
6                                         331 (2.4)
7                                         348 (2.5)
8                                         343 (2.5)
9                                         350 (2.5)
10                                        401 (2.9)
11                                        390 (2.8)
12                                        451 (3.3)
13                                        491 (3.6)
14                                        619 (4.5)
15                                        854 (6.2)
16                                        1,318 (9.6)
17                                        4,980 (36.1)
Total                                     13,799

Number of questions missed  KB (6 mo)      KD (18 mo)     KF (30 mo)     KJ (42 mo)     KL (54 mo)     KN (69 mo)     KQ (81 mo)
0                           10634 (75.6%)  10700 (76.1%)  9741 (69.3%)   9874 (70.2%)   9186 (65.3%)   8479 (60.3%)   8315 (59.1%)
1                           619 (4.4%)     295 (2.1%)     444 (3.2%)     90 (0.6%)      218 (1.6%)     153 (1.1%)     40 (0.3%)
2                           111 (0.8%)     56 (0.4%)      52 (0.4%)      35 (0.2%)      66 (0.5%)      27 (0.2%)      25 (0.2%)
3                           31 (0.2%)      12 (0.1%)      10 (0.1%)      2 (<0.1%)      2 (<0.1%)      10 (0.1%)      4 (<0.1%)
4                           13 (0.1%)      18 (0.1%)      10 (0.1%)      3 (<0.1%)      18 (0.1%)      6 (<0.1%)      10 (0.1%)
5                           74 (0.5%)      43 (0.3%)      62 (0.4%)      59 (0.4%)      34 (0.2%)      18 (0.1%)      116 (0.8%)
6                           6 (<0.1%)      6 (<0.1%)      30 (0.2%)      -              6 (<0.1%)      4 (<0.1%)      1 (<0.1%)
Missed whole questionnaire  2574 (18.3%)   2932 (20.9%)   3713 (26.4%)   3999 (28.4%)   4532 (32.2%)   5365 (38.2%)   5551 (39.5%)

Missing data in ALSPAC

• An inevitable problem in analyses that use data from multiple time points

– i.e. the analyses for which the cohort was designed

• Analyses based on children with complete data (“available case analyses”) can typically use 50% or fewer of the children in the cohort

• Social background is strongly associated with the probability that data are missing

Analyst’s dilemma

• Exclude subjects with missing data?

• Omit covariates with missing data?

• Deal with missing data?

Consequences of missing data

• Bias: those with complete data may differ from those with incomplete data

– Estimation based on the subset with complete data (“available cases”) may give a biased estimate of the population parameter of interest

• Loss of precision/power

– Missing data reduce sample size
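The loss of precision can be given a rough size. The sketch below assumes that data are missing completely at random and that a standard error is proportional to 1/sqrt(n); under those assumptions it returns the factor by which a standard error grows when only a fraction of subjects is analysed. The function name `se_inflation` is illustrative and is not taken from the slides.

```python
import math


def se_inflation(fraction_retained: float) -> float:
    """Factor by which a standard error grows when only a fraction of
    subjects is analysed, assuming SE is proportional to 1/sqrt(n)."""
    if not 0 < fraction_retained <= 1:
        raise ValueError("fraction_retained must be in (0, 1]")
    return 1 / math.sqrt(fraction_retained)


# Analysing half of the cohort inflates standard errors by about 41%.
print(round(se_inflation(0.5), 3))  # prints 1.414
```

This covers precision only. It says nothing about bias, which depends on why the data are missing.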

Classification of missing data

• Model for distribution of missingness (DoM)

– Introduced by Rubin (1976)

Sets of variables: Z with missing data, X with complete data

• MCAR: missing completely at random

– Probability of Z missing not related to either X or true value of Z

• MAR: missing at random

– Probability of Z missing is not related to unobserved values of Z, but is related to observed values of X

• MNAR: missing not at random

– Probability of Z missing still depends on unobserved values of Z even after allowing for dependence on X

– Statistical analyses cannot deal with this
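The three mechanisms can be illustrated with a small simulation. This is a sketch under invented assumptions, not an analysis from the talk: X is standard normal and always observed, Z = X + noise has true mean 0, and the missingness probabilities (0.4, or a logistic function of X or of Z) are arbitrary choices. For each mechanism the code reports the mean of Z among those observed, and an estimate that uses X through a regression of Z on X fitted to the observed cases.

```python
import math
import random
import statistics


def simulate(mechanism: str, n: int = 20000, seed: int = 1):
    """Return (available-case mean of Z, regression-adjusted mean of Z).

    The true mean of Z is 0. X is fully observed; Z may be missing.
    """
    rng = random.Random(seed)
    xs_all, xs_obs, zs_obs = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = x + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_missing = 0.4                          # unrelated to X and Z
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-2 * x))   # depends on observed X only
        elif mechanism == "MNAR":
            p_missing = 1 / (1 + math.exp(-2 * z))   # depends on Z itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        xs_all.append(x)
        if rng.random() >= p_missing:                # Z is observed
            xs_obs.append(x)
            zs_obs.append(z)

    # Available-case estimate: ignore subjects with Z missing.
    cc_mean = statistics.mean(zs_obs)

    # Adjusted estimate: fit Z on X among observed cases, then predict
    # at the mean of X in the full sample.
    mx = statistics.mean(xs_obs)
    sxz = sum((x - mx) * (z - cc_mean) for x, z in zip(xs_obs, zs_obs))
    sxx = sum((x - mx) ** 2 for x in xs_obs)
    slope = sxz / sxx
    intercept = cc_mean - slope * mx
    adjusted = intercept + slope * statistics.mean(xs_all)
    return cc_mean, adjusted


results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for name, (cc, adj) in results.items():
    print(f"{name}: available-case mean {cc:.2f}, adjusted mean {adj:.2f}")
```

In this setup the available-case mean should be close to 0 only under MCAR. Using X should recover the truth under MAR, and neither estimate should do so under MNAR.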

Simple “ad hoc” missing data methods

• Available case analysis

– Unbiased if data MCAR, but inefficient

• Mean imputation

– Association attenuated

• Last value carried forward (for repeated measures)

– Distorts trends over time

• Missing category indicator

– Always biased (see Vach and Blettner, AJE 1991)

• Single imputation from model for missing data

– Distorts standard errors
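The attenuation caused by mean imputation can be seen in simulated data. The sketch below uses invented values (5,000 subjects, half of the X values removed completely at random) and a hand-written `pearson` helper; none of this comes from the studies discussed in the talk.

```python
import math
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(ssa * ssb)


rng = random.Random(2)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]             # true correlation about 0.71
missing = [rng.random() < 0.5 for _ in range(n)]   # MCAR: half of x is lost

# Available cases: drop subjects with x missing.
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Mean imputation: replace each missing x with the observed mean.
x_mean = statistics.mean(x_obs)
x_imputed = [x_mean if m else xi for xi, m in zip(x, missing)]

r_available = pearson(x_obs, y_obs)
r_imputed = pearson(x_imputed, y)
sd_available = statistics.stdev(x_obs)
sd_imputed = statistics.stdev(x_imputed)

print(f"correlation: available cases {r_available:.2f}, mean-imputed {r_imputed:.2f}")
print(f"SD of x:     available cases {sd_available:.2f}, mean-imputed {sd_imputed:.2f}")
```

Every imputed value sits at the mean, so the spread of x shrinks and the correlation with y is pulled toward zero, even though the data here are MCAR.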

Valid methods to deal with data that are missing at random (MAR)

• Likelihood-based (EM algorithm)

• Multiple imputation

– derive predictive distributions for the missing values

– use these to produce multiple complete datasets

– use standard methods for analysis

– combine results to get valid parameter estimates and standard errors

– This is not “making up data”!!

• Efficient, robust methods that use weighted estimation
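The "combine results" step of multiple imputation is usually done with Rubin's rules: average the estimates, and add the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. A minimal sketch follows. The function name `pool_rubin` and the input numbers are made up for illustration and are not results from any study in this talk.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine m estimates and their squared standard errors using
    Rubin's rules. Returns (pooled estimate, pooled standard error)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, one variance each")
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical log hazard ratios and squared SEs from m = 5 imputed datasets.
estimates = [0.50, 0.55, 0.45, 0.60, 0.40]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate {pooled:.3f}, pooled SE {pooled_se:.3f}")
```

The pooled standard error is larger than the standard error from any single completed dataset (0.2 here), because it carries the extra uncertainty due to the missing values.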

Multiple imputation in practice

• Very rapid software development in recent years

• Two flavours of MI:

– methods based on the multivariate normal distribution (good theoretical foundation, problems with categorical variables)

– “chained equations” (little theoretical foundation, good for categorical variables, becoming widely used)

• Few guidelines for analysts

• Highly complex models in typical situations

• Very difficult to report methods in adequate detail, in applied papers
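To give a feel for the “chained equations” idea, here is a deliberately simplified sketch for two continuous variables: each variable with missing values is regressed in turn on the other, and its missing values are replaced by a prediction plus random noise, cycling several times. It is not the software referred to in the talk, and it omits a step that proper multiple imputation needs (drawing the regression parameters from their posterior distribution). All names and the simulated data are illustrative.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept, slope and residual SD for y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return intercept, slope, statistics.pstdev(resid)


def chained_imputation(rows, n_cycles=10, seed=0):
    """rows: list of [a, b] pairs, with None marking a missing value.
    Returns one completed copy of the data."""
    rng = random.Random(seed)
    missing = [[r[j] is None for r in rows] for j in range(2)]
    data = [list(r) for r in rows]

    # Starting values: fill each gap with the observed mean of its variable.
    for j in range(2):
        obs_mean = statistics.mean([r[j] for r in rows if r[j] is not None])
        for i, row in enumerate(data):
            if missing[j][i]:
                row[j] = obs_mean

    for _ in range(n_cycles):
        for j in range(2):
            k = 1 - j  # index of the other variable, used as predictor
            # Fit on rows where variable j was actually observed.
            xs = [row[k] for i, row in enumerate(data) if not missing[j][i]]
            ys = [row[j] for i, row in enumerate(data) if not missing[j][i]]
            b0, b1, sd = fit_line(xs, ys)
            # Redraw the missing values of variable j.
            for i, row in enumerate(data):
                if missing[j][i]:
                    row[j] = b0 + b1 * row[k] + rng.gauss(0, sd)
    return data


# Simulated example: 200 subjects, about 20% missing in each variable.
rng = random.Random(3)
rows = []
for _ in range(200):
    a = rng.gauss(0, 1)
    b = a + rng.gauss(0, 1)
    u = rng.random()
    if u < 0.2:
        a = None
    elif u < 0.4:
        b = None
    rows.append([a, b])

# Five completed datasets, each from a different random seed.
completed = [chained_imputation(rows, seed=s) for s in range(5)]
```

Each completed dataset would then be analysed with standard methods and the results combined. Real applications involve many variables of mixed types, which is where the model complexity noted above comes from.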

Example 1: predicting mortality in HIV-1 infected people treated with antiretroviral therapy in low income countries

• Data from the ART-LINC collaboration

– 2,725 patients with active follow up in 14 treatment programmes in Africa, Asia and South America

• Prognostic model for patients starting antiretroviral therapy

• Estimate mortality hazard ratio according to whether patients had AIDS at baseline, using methods for missing data

– This information was missing for 649 patients (24%)


[…cut…]

Example 2: prognostic value of anaemia in HIV-1 infected people treated with antiretroviral therapy in developed countries

• Prognostic model already developed

• Want to include anaemia, but this is missing about 30% of the time

• Haemoglobin is strongly associated with other prognostic variables, in particular CD4 cell count

• Can we (a) reduce bias (b) increase precision by using missing data methods?

[…cut…]

Example 3: what can we gain from missing data methods in ALSPAC?

• Estimate:

– the prevalence of wheeze at different times

– associations between wheeze and maternal asthma

• Possible analyses:

– restrict to cases with complete data at all time points

– restrict to cases with complete data at a single time point

– impute using information measured at a particular time

– impute using longitudinal information

[…cut…]

Example 3: what can we gain through using missing data methods?

• Most dramatic changes were between available case analyses

• Prevalence estimates based on missing data methods were plausibly less biased

• Surprisingly small changes in standard errors for associations when missing data methods were used

• We know very little about when estimates of associations are likely to be biased because of missing data

Concluding Remarks

• Analyses restricted to patients with no missing data are widely used, but are biased when data are not missing completely at random (MCAR) and result in a loss of statistical power

• Use of missing data methods may reduce bias and increase precision. However, we pay a price in model complexity

• There is increasing usage of these methods, but they require great care and can produce misleading results

• Guidelines for conduct and reporting are needed

• Never, ever present analyses using missing data methods but not the available case analyses

Final message

It’s better to make the measurement than to try to use statistical methods to compensate for the fact that it is missing
