
Hierarchical models for combining multiple data sources measured at individual and small area levels

Chris Jackson
With Nicky Best and Sylvia Richardson
Department of Epidemiology and Public Health
Imperial College, London

BIAS project
http://www.bias-project.org.uk

Outline

• Infer some individual-level relationship, e.g. influence of individual socio-economic circumstances on risk of ill health.

• Use a combination of datasets, individual and aggregate, to answer the question.

• Multi-level models on multi-level data.

Examples:
• Hospital admission for cardiovascular disease and socio-demographic factors
• Low birth weight and air pollution

Combining different forms of observational data

Aggregate (census, national registers, environmental monitors)
• Advantages: abundant, routinely collected; covers whole population; can study small-area variations
• Disadvantages: ecological bias; distinguishing individual from area-level effects; not many variables

Individual (surveys, cohort studies, case-control, Census SAR)
• Advantages: direct information on exposure-outcome relationship; more variables available
• Disadvantages: low power; little geographical information (confidentiality)

Combined
• Advantages: reduce confounding and bias; maximise power; separate individual and area-level effects
• Disadvantages: conflicts between information from each

Example 1: Cardiovascular hospitalisation

Question
• Socio-demographic predictors of hospitalisation for heart and circulatory disease for individuals
• Is there any evidence of contextual effects (area-level as well as individual predictors)?

Design
Data synthesis using
• Area-level administrative data: hospital episode statistics and census small-area statistics
• Individual-level survey data: Health Survey for England

Issue
• Reduce ecological bias and improve power, compared to using datasets singly.

Example 2: Low birth weight and pollution

Question
• Influence of traffic-related air pollution (PM10, NO2, CO) on risk of intrauterine growth retardation (low birth weight)

Design
Data synthesis using two individual-level datasets
• National births register, 2000 (~600,000 births)
• Millennium Cohort Study (~20,000 births)

Issue
• Geographical identifiers (pollution exposure), and outcome, available for both datasets
• Important confounders (maternal age, smoking, ethnicity…) only available in the small dataset. Combine to increase power.

Multilevel models for individual and area data

Most commonly used to model individual-level outcomes $y_{ij}$ (individual $j$, area $i$) in terms of
• individual-level predictors $x_{ij}$
• group-level (e.g. area-level) predictors $x_i$

Allow baseline risk (possibly also covariate effects) to vary by area:

$y_{ij} \sim \alpha_i + \beta x_{ij} + \gamma x_i$

However, we want to model area-level outcomes $y_i$ as well as individual outcomes $y_{ij}$.

Modelling the area-level outcome

[Diagram: the standard multilevel model links individual exposure $x_{ij}$ and aggregate exposure $x_i$ to the individual outcome $y_{ij}$. The extended version adds the aggregate outcome $y_i$.]

Ecological inference

Determining individual-level exposure-outcome relationships using aggregate data.

A simple ecological model:

$Y_i \sim \text{Binomial}(p_i, N_i), \quad \text{logit}(p_i) = \alpha + \beta X_i$

• $Y_i$ is the number of disease cases in area $i$
• $N_i$ is the population in area $i$
• $X_i$ is the proportion of individuals in area $i$ with e.g. low social class
• $p_i$ is the area-specific disease rate
• $\exp(\beta)$ = odds ratio associated with exposure $X_i$

This is the group-level association. It is not necessarily equal to the individual-level association → ecological bias.

Ecological bias

Bias in ecological studies can be caused by:

• Confounding, as in all observational studies
  • Confounders can be area-level (between-area) or individual-level (within-area).
  • Solution: try to account for confounders.

• Non-linear exposure-response relationship, combined with within-area variability of exposure
  • No bias if exposure is constant in area (contextual effect)
  • Bias increases as within-area variability increases
  • …unless models are refined to account for this hidden variability
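The link between within-area variability and bias can be checked numerically. The sketch below is an illustration only, not the authors' analysis: it assumes an individual-level logistic model with baseline log odds `ALPHA` and log odds ratio `BETA` (names and values chosen for this example), and a normally distributed exposure within the area. It compares the true area-average risk with the risk obtained by plugging the area mean exposure into the individual model.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def area_risk(alpha, beta, mean_x, sd_x, n_grid=2001):
    """Average expit(alpha + beta * x) over x ~ Normal(mean_x, sd_x).

    Uses a midpoint rule on mean_x +/- 8 sd. With sd_x == 0 the exposure
    is constant within the area, so no averaging is needed.
    """
    if sd_x == 0:
        return expit(alpha + beta * mean_x)
    lo = mean_x - 8 * sd_x
    step = 16 * sd_x / n_grid
    total = 0.0
    for k in range(n_grid):
        x = lo + (k + 0.5) * step
        density = math.exp(-0.5 * ((x - mean_x) / sd_x) ** 2) / (sd_x * math.sqrt(2 * math.pi))
        total += expit(alpha + beta * x) * density * step
    return total

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, MEAN_X = -3.0, 1.0, 1.0

# Risk predicted by applying the individual model to the area mean exposure.
naive = expit(ALPHA + BETA * MEAN_X)

# Gap between the true area-average risk and the naive prediction,
# for increasing within-area standard deviation of exposure.
gaps = [abs(area_risk(ALPHA, BETA, MEAN_X, sd) - naive) for sd in (0.0, 0.5, 1.0, 2.0)]
for sd, gap in zip((0.0, 0.5, 1.0, 2.0), gaps):
    print(f"within-area sd = {sd}: gap = {gap:.4f}")
```

With these assumed values the gap is zero when the exposure is constant within the area and grows as the within-area spread increases.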

Improving ecological inference

Alleviate bias associated with within-area exposure variability.

Get some information on the within-area distribution $f_i(x)$ of exposures, e.g. from individual-level exposure data.

Use this to form a well-specified model for ecological data by integrating the underlying individual-level model.

$Y_i \sim \text{Binomial}(p_i, N_i), \quad p_i = \int p_{ik}(x)\, f_i(x)\, dx$

• $p_i$ is average group-level risk
• $p_{ik}(x)$ is individual-level model (e.g. logistic regression)
• $f_i(x)$ is distribution of exposure $x$ within area $i$ (or joint distribution of multiple exposures)
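For a single binary exposure the integral reduces to a two-term weighted average, which makes the difference from the simple ecological model easy to see. The following is a minimal sketch under stated assumptions: an individual-level logistic model with baseline log odds `ALPHA` and log odds ratio `BETA` (hypothetical names and values), and an area in which a proportion `prop_exposed` of individuals is exposed.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def integrated_area_risk(alpha, beta, prop_exposed):
    """Area risk from averaging the individual model over a binary exposure.

    The within-area distribution puts mass prop_exposed on x = 1
    and mass 1 - prop_exposed on x = 0.
    """
    return (1.0 - prop_exposed) * expit(alpha) + prop_exposed * expit(alpha + beta)

def naive_area_risk(alpha, beta, prop_exposed):
    """Simple ecological model: logit(p_i) = alpha + beta * X_i."""
    return expit(alpha + beta * prop_exposed)

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA = -3.0, 1.5

comparison = [
    (prop, integrated_area_risk(ALPHA, BETA, prop), naive_area_risk(ALPHA, BETA, prop))
    for prop in (0.0, 0.25, 0.5, 0.75, 1.0)
]
for prop, integrated, naive in comparison:
    print(f"X_i = {prop:.2f}  integrated = {integrated:.4f}  naive = {naive:.4f}")
```

The two models agree when everyone in the area has the same exposure (proportion 0 or 1) and differ in between, which is where within-area variability exists.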

When ecological inference can work

• Using well-specified model

• Information on within-area distribution of exposure
  • Information, e.g. from a sample of individual exposures, to estimate the unbiased model that accounts for this distribution.

• High between-area contrasts in exposure
  • Information on the variation in outcome between areas with low exposure rates and high exposure rates
  • E.g. to determine ethnic differences in health, better to study areas in London (more diverse) than areas in a rural region.

When there is insufficient information in ecological data:
• May be able to incorporate individual-level exposure-outcome data…
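The value of between-area contrasts can be illustrated with an analogy from ordinary least squares, where the standard error of a slope is the residual standard deviation divided by the square root of the sum of squared deviations of the exposure. The sketch below uses that linear-regression formula as a stand-in (the ecological models in these slides are not linear regressions), with made-up area exposure proportions and a made-up residual standard deviation.

```python
import math

def slope_standard_error(exposures, residual_sd):
    """Standard error of a least-squares slope: residual_sd / sqrt(sum((x - mean)^2))."""
    mean_x = sum(exposures) / len(exposures)
    sxx = sum((x - mean_x) ** 2 for x in exposures)
    return residual_sd / math.sqrt(sxx)

# Hypothetical area-level exposure proportions (not real data).
diverse_areas = [0.05, 0.20, 0.40, 0.60, 0.80]      # wide between-area contrast
homogeneous_areas = [0.01, 0.02, 0.03, 0.04, 0.05]  # narrow between-area contrast

RESIDUAL_SD = 0.02  # assumed residual spread of the area-level outcome

se_diverse = slope_standard_error(diverse_areas, RESIDUAL_SD)
se_homogeneous = slope_standard_error(homogeneous_areas, RESIDUAL_SD)
print(f"slope SE, diverse areas:     {se_diverse:.3f}")
print(f"slope SE, homogeneous areas: {se_homogeneous:.3f}")
```

With the same number of areas and the same residual spread, the narrow-contrast set gives a much less precise slope.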

Hierarchical related regression

Individual-level model
• Logistic regression for individual-level outcome
• Includes individual or area-level predictors
• Use this to
  • model the individual-level data
  • construct correct model for aggregate data

Model for aggregate data
• Based on averaging the individual model over the within-area joint distribution of covariates.
• Alleviates ecological bias.

Combined model
• Individual and aggregate data assumed to be generated by the same baseline and relative risk parameters.
• Estimate these parameters using both datasets simultaneously.

Infer individual-level relationships using both individual and aggregate data.
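The idea of shared parameters can be sketched as a joint likelihood. This is a simplified illustration, not the authors' implementation: it assumes one binary exposure, a common baseline log odds `alpha` for all areas (no random effects), a log odds ratio `beta`, made-up toy data, and a coarse grid search in place of Bayesian estimation. All names are chosen for this example.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def individual_loglik(alpha, beta, individuals):
    """Bernoulli log-likelihood for (x, y) pairs with logit(p) = alpha + beta * x."""
    total = 0.0
    for x, y in individuals:
        p = expit(alpha + beta * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def aggregate_loglik(alpha, beta, areas):
    """Binomial log-likelihood for (cases, population, proportion exposed) triples.

    The area risk averages the individual model over the binary exposure.
    The binomial coefficient is omitted because it does not depend on the parameters.
    """
    total = 0.0
    for cases, population, prop in areas:
        p = (1.0 - prop) * expit(alpha) + prop * expit(alpha + beta)
        total += cases * math.log(p) + (population - cases) * math.log(1.0 - p)
    return total

def combined_loglik(alpha, beta, individuals, areas):
    """Both datasets share the same alpha and beta, so the log-likelihoods add."""
    return individual_loglik(alpha, beta, individuals) + aggregate_loglik(alpha, beta, areas)

# Hypothetical toy data (not from the study).
individuals = [(0, 0)] * 40 + [(0, 1)] * 2 + [(1, 0)] * 16 + [(1, 1)] * 4
areas = [(60, 1000, 0.1), (95, 1000, 0.4), (140, 1000, 0.8)]

# Coarse grid search for the joint maximum-likelihood estimate.
grid = [(-4.0 + 0.1 * a, 0.1 * b) for a in range(31) for b in range(31)]
alpha_hat, beta_hat = max(grid, key=lambda ab: combined_loglik(ab[0], ab[1], individuals, areas))
print(f"alpha_hat = {alpha_hat:.1f}, beta_hat = {beta_hat:.1f}")
```

Adding area-specific random baselines and priors, as a Bayesian hierarchical version would, changes the estimation method but not the structure: one individual-level model, used directly for the survey data and in averaged form for the aggregate data.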

Combining ecological and case-control data

• If outcome is rare, individual-level data from surveys or cohorts will usually contain little information.

• Supplement ecological data with case-control data instead.
  • Haneuse and Wakefield (2005) describe a hybrid likelihood for combination of ecological and case-control data.
  • Even including individual data from the cases only can reduce ecological bias to acceptable levels.

Issues with combining data

• Some variables missing in one dataset
  • e.g. smoking, blood pressure available in survey but not administrative data

• Different but related information in each
  • e.g. self-reported disease versus hospital admission records

• Conflicts between datasets in information on what is nominally the same variable
  • e.g. self-completed and interviewed responses to surveys

• Ideally the individual and aggregate data are from the same source (e.g. census small-area and SAR)

Example: Cardiovascular disease (CVD)

Target: baseline and relative risk of CVD admission for individual

AGGREGATE

Hospital Episode Statistics
• number of CVD admissions in area in 1998, by age group/sex

Census small area statistics
• marginal proportions non-white, social class IV/V,…

Census Samples of Anonymised Records (2%)
• full within-area cross-classification of individuals, age/sex/ethnicity/social class/car ownership - required for correct aggregate model

INDIVIDUAL

Health Survey for England
• Self-reported admission to hospital for CVD (1998 only)
• Self-reported long-term CVD (1997, 1998, 1999, 2000, 2001)
  Multiple imputation for missing hospital admission in years other than 1998.
• individual age and sex
• individual ethnicity
• individual social class
• individual car access

Are aggregate and individual data consistent?

[Figure: Health Survey for England data aggregated over districts, compared with census covariates or Hospital Episode Statistics data.]

Basic illustration of combining individual and aggregate data

[Diagram. Unknowns: area baseline risk (indexed by area $i$) and relative risk for individuals. Data: individual survey data (areas $i$, individuals $j$) with exposure $x_{ij}$ (individual social class) and disease $y_{ij}$ (CVD admission); aggregate census data (areas $i$) with exposure $x_i$ (e.g. proportion low social class) and disease $y_i$ (area admissions count).]

More complex models for disease, more confounders, need another data source.

[Diagram. Unknowns: area/stratum baseline risk (indexed by area $i$ and stratum $k$) and relative risk for exposures. Data: individual survey data (areas $i$, individuals $j$) with exposures $x_{ij}$ and disease $y_{ij}$ (CVD admission); aggregate census data (areas $i$) with disease $y_{ik}$ and exposures $x_{ir}$, $x_{is}$, $x_{ik}$, where $r$ is social class, $s$ is employment status and $k$ indexes age/sex strata; Census Samples of Anonymised Records (areas $i$, individuals $l$) with $x_{il}$, giving the cross-classification of individuals $x_{irs}$ by strata $k$.]

Imputing missing outcomes in individual data

[Diagram. Unknowns: area/stratum baseline risk (indexed by area $i$ and stratum $k$) and relative risk for exposures. Data: survey data (1998) (areas $i$, individuals $j$) with CVD admissions $y_{ij}$; survey data (1997-2001) (areas $i$, individuals $j$) with exposures $x_{ij}$, self-reported CVD $y_{ij}^*$ and CVD admissions $y_{ij}$ including imputed values; aggregate census data (areas $i$) with $y_{ik}$, $x_{ir}$, $x_{is}$, $x_{ik}$, where $r$ is social class, $s$ is employment status and $k$ indexes age/sex strata; Census Samples of Anonymised Records (areas $i$, individuals $l$) with $x_{il}$, giving the cross-classification of individuals $x_{irs}$ by strata $k$.]

[Figure: Estimated coefficients (with 95% CI) for multiple regression model of the risk of hospitalisation. Horizontal axis: log odds ratio, from -1.0 to 2.0. Predictors: Carstairs, no car, social class IV/V, non-white. Panels: Individual (individual data only); District and Ward (aggregate data only); District + individual and Ward + individual (models combining individual and aggregated data).]

Individual and area-level predictors

Area-level covariates in underlying model for hospitalisation risk (Carstairs deprivation index)
• No significant influence of Carstairs, after accounting for individual-level factors

Random effects models
• Random area-level baseline risk quantifies remaining variability between areas.
• After adjusting for covariates, variance partitioned into individual / area-level components
• 4% of residual variance between wards attributable to unobserved area-level factors (2% for districts)

Little evidence of contextual effects
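One common way to express such a variance split for a random-intercept logistic model is the latent-variable variance partition coefficient, in which the individual-level residual variance is taken as that of the standard logistic distribution, pi^2 / 3. The slides do not say how the 4% and 2% figures were computed, so the sketch below is an assumption: it only shows what area-level variances those percentages would correspond to under that particular definition.

```python
import math

# Variance of the standard logistic distribution (individual-level residual
# variance under the latent-variable formulation).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3

def variance_partition(area_variance):
    """Share of residual variance attributable to the area level."""
    return area_variance / (area_variance + LOGISTIC_RESIDUAL_VARIANCE)

def area_variance_for(vpc):
    """Area-level variance implied by a given variance partition coefficient."""
    return vpc * LOGISTIC_RESIDUAL_VARIANCE / (1.0 - vpc)

# Percentages quoted for wards and districts; the mapping to variances is
# an assumption of this sketch, not a result reported in the slides.
ward_variance = area_variance_for(0.04)
district_variance = area_variance_for(0.02)
print(f"implied ward-level variance:     {ward_variance:.3f}")
print(f"implied district-level variance: {district_variance:.3f}")
```

Under this definition, small percentages correspond to random-effect standard deviations well below 1 on the log odds scale, consistent with the conclusion that contextual effects are weak.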

Example: Low birth weight and pollution

• Geographically complete individual dataset from national register, with exposure, outcome but not confounders

• Geographically sparse survey dataset with all variables

→ missing data problem
• Impute missing covariates that are likely to be confounded with the pollution exposure.
• Information for this imputation
  • from aggregate data (e.g. ethnicity, from census)
  • from sparse survey dataset

[Diagram. National register data (large): regression model of low birth weight on pollution, with confounders sex, age and socioeconomic status; the remaining confounders are missing (shown as "?"). Survey data (small): regression model of low birth weight on pollution with the full set of confounders: sex, age, socioeconomic status, smoking, ethnicity, maternal age, etc. Aggregate census data: ethnicity. Regression coefficients are labelled with subscripts e and c.]

Parallel regression models

• Desire unbiased inference on the effect of the primary exposure.

• Available from small dataset with all confounders, but with low power.

• Information for imputation comes from small dataset or ecological data. Is the resulting uncertainty worth the precision gained?

• Work in progress, currently awaiting some data.

Summary

• Combining datasets can increase power and reduce bias, making use of strengths of each.

• Problems may arise when data are incompatible or inconsistent.

• Bayesian hierarchical models useful in cases of conflicts.
  • All our methods can be implemented in WinBUGS.

• More applied studies needed to demonstrate the utility of the approach.

Publications

Our papers available from http://www.bias-project.org.uk

• C. Jackson, N. Best, S. Richardson. Hierarchical related regression for combining aggregate and survey data in studies of socio-economic disease risk factors. Under revision, Journal of the Royal Statistical Society, Series A.

• C. Jackson, N. Best, S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine (2006) 25(12):2136-2159.

• C. Jackson, S. Richardson, N. Best. Studying place effects on health by synthesising area-level and individual data. Submitted.

• S. Haneuse and J. Wakefield. The combination of ecological and case-control data. Submitted.