using small-area spatial data for statistical and ... › media › media_447105_smxx.pdf · in...

39
Using small-area spatial data for statistical and epidemiological research Duncan Lee [email protected] Scottish Government 16th December 2015

Upload: others

Post on 25-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Using small-area spatial data forstatistical and epidemiological research

Duncan Lee [email protected]

Scottish Government 16th December 2015

Page 2: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Acknowledgements

This is joint work with colleagues from two main projects:

Andrew Lawson, Gary Napier, Kevin Pollock and ChrisRobertson on an MRC funded project focusing onmodelling spatio-temporal patterns in disease risk.

Guowen Huang and Marian Scott on an EPSRC fundedproject focusing on the effects of air pollution on humanhealth.

Introduction Measles susceptibility Air pollution 2/39

Page 3: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

IntroductionThe availability of small-area spatial data has dramaticallyincreased in the last decade or so, including:

Scottish Neighbourhood Statistics (SNS).Health and Social Care Information Centre (HSCIC).

This increase has been accompanied by the widespreaddevelopment of statistical methodology and software formapping and modelling these data.

The latter include Geographical Information Systemssoftware and statistical modelling software, such asArcGIS / QGIS, and packages in the statisticalsoftware R such as CARBayes and RINLA.

The analysis of small-area data is performed in manyfields, such as econometrics, epidemiology, and socialscience.

Introduction Measles susceptibility Air pollution 3/39

Page 4: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Example - Intermediate Geographies

Introduction Measles susceptibility Air pollution 4/39

Page 5: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Modelling spatial data

The key feature when modelling spatial data is that of spatialautocorrelation, which is summarised by Tobler’s first law ofgeography which says

Everything is related to everything else, but near thingsare more related than distant things.

Which means that standard regression models that assumeindependence in the residuals are likely to be inappropriate andpotentially result in misleading conclusions.

Introduction Measles susceptibility Air pollution 5/39

Page 6: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

How to model spatial aucorrelation

There are a number of commonly used models for capturingspatial autocorrelation in data / residuals from a regressionmodel, including:

Conditional AutoRegressive (CAR) models.Simultaneous AutoRegressive (SAR) models.Point level (Geostatistical) models based on each areascentroid.

CAR and SAR models are most commonly used for small-areadata.

Introduction Measles susceptibility Air pollution 6/39

Page 7: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Capturing spatial similarity

All models require the spatial closeness between each pair ofareal units to be defined, and CAR and SAR models use anK × K neighbourhood or adjacency matrix W, where K is thenumber of areal units in the data set. Then, the kjth element ofW is typically defined as:

wkj =

{1 Areas (k,j) share a border0 Otherwise

so that if wkj = 1 data in areal units (k, j) are modelled asspatially autocorrelated, while if wkj = 0 they are assumed to beconditionally independent.

Introduction Measles susceptibility Air pollution 7/39

Page 8: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

My research

Most of my research is interested in developing new statisticalmethodology for epidemiological applications:

Identifying trends in disease risk over time and how thosetrends vary in space.Identifying the locations of clusters of small-areas thatexhibit substantially higher disease risk than theirneighbours.Estimating the effects of air pollution on human health.

However, I am also involved in a social science led projectAQMeN (http://aqmen.ac.uk/) looking at changes in urbansegregation over time.

Introduction Measles susceptibility Air pollution 8/39

Page 9: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Motivations

My choice of research topics is motivated by two main goals:

1 Finding an important public health problem for whichexisting statistical models are inadequate, and henceproviding an innovative modelling solution.

2 Developing free to use software in R for others to be ableto apply both standard and novel statistical models forsmall-area data to their own problems.

Examples of the latter include the CARBayes andCARBayesST packages in R for spatial and spatio-temporalmodelling of areal unit data.

Introduction Measles susceptibility Air pollution 9/39

Page 10: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

This talk

In this talk I give 2 examples of statistical research you can dofocusing on addressing key questions in epidemiology, andusing small-area data such as that from the ScottishNeighbourhood Statistics database. The examples are:

1 Identifying temporal trends in measles susceptibility overtime.

2 Estimating the long-term effects of air pollution on healthin Scotland.

Introduction Measles susceptibility Air pollution 10/39

Page 11: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

3. Susceptibility to measles

An article published in 1998 by Andrew WakefieldWakefield et al. (1998) linked the measles, mumps andrubella (MMR) vaccine with an increased risk of autism.

This scare led to a reduction in vaccination rates, which by2003 reached a low of around 80% in some parts of theUK.

Repercussions from these decreased vaccination rates werefelt in 2013, with large outbreaks of measles occurring inEngland and Wales (Pollock et al. 2014).

The Wakefield article was partially retracted in 2004before being fully discredited in 2010 after multipleepidemiological studies failed to find any association.

Introduction Measles susceptibility Air pollution 11/39

Page 12: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

DataSince 1988 individual vaccination records of all children inScotland are kept in the Scottish Immunisation & RecallSystem (SIRS) .

The data analysed here are based on the estimated numbersof children eligible to attend pre-school (aged 2-4) fromnon-overlapping two-year birth cohorts.

We have the estimated number susceptible to measles (Ykt),and the total number of pre-school children (Nkt) in the1235 (indexed by k) intermediate geographies in Scotlandbetween 1998 and 2014 (indexed by t).

Thus θ̂kt = Ykt/Nkt is the estimated proportion of childrenwho are susceptible to measles, where measlessusceptibility is based upon the receipt of one or twovaccinations that each have a failure rate of 10%.

Introduction Measles susceptibility Air pollution 12/39

Page 13: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Covariate factors

We use three covariate factors here to capture the potentialimpacts of socio-economic deprivation and ethnicity:

Median House Price (MHP) in each IG and year.

Percentage of working age people in receipt of Job SeekersAllowance (JSA) in each IG and year.

Percentage of school children from ethnic minorities (EM).

Introduction Measles susceptibility Air pollution 13/39

Page 14: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Data

1998 2000 2002 2004 2006 2008 2010 2012 2014

0.1

0.2

0.3

0.4

0.5

Year of pre−school attendance

Pro

port

ion s

usceptible

Median house price by 100,000 (mean centred)−1 0 1 2 3

0 5 10

0.1

0.2

0.3

0.4

0.5

% of working age population claiming JSA (mean centred)

Pro

port

ion s

usceptible

% of pupils in ethnic minority groups (mean centred)0 20 40 60 80

Introduction Measles susceptibility Air pollution 14/39

Page 15: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Exploratory analysis

Initially a simple binomial logistic regression model was fittedto these data:

Ykt ∼ Binomial(Nkt, θkt),

ln(

θkt

1− θkt

)= β0 + S1(MHPkt) + S2(JSAkt) + S3(EMkt),

where Si(·) =∑3

j=1 Bi(·)βij, i = 1, 2, 3 is a natural cubic splineof each covariate with 3 degrees of freedom to allow for thenon-linear relationships observed above.

Introduction Measles susceptibility Air pollution 15/39

Page 16: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Residual autocorrelation

Spatial autocorrelation in the residuals was assessed bycomputing Moran’s I statistics and performing permutation testson the residuals from the above model separately for each year.

The values of the Moran’s I statistics obtained ranged between0.12 and 0.38, with the statistics generated from 10 000 randompermutations of the data yielding p-values of less than 0.0001 inall cases. Thus spatial autocorrelation has to be modelled.

Introduction Measles susceptibility Air pollution 16/39

Page 17: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Goals of the analysis

The goals of the analysis are as follows:

1 Estimate the magnitude of the increased measlessusceptibility associated with the MMR vaccination scarelinked to the Wakefield article and assess whethersusceptibility has decreased in recent years.

2 Assess whether the magnitude of the inequality, asmeasured by the spatial variability, in measlessusceptibility in Scotland increased at the same time andwhether spatial variation has now decreased.

3 Determine whether any area based covariates, such asdeprivation, have any impact on measles susceptibility inScotland.

Introduction Measles susceptibility Air pollution 17/39

Page 18: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Modelling approach

We propose a Bayesian hierarchical model to answer thesequestions, which is given by:

Ykt|Nkt, θkt ∼ Binomial (Nkt, θkt) ,

ln(

θkt

1− θkt

)= β0 + S1(MHPkt) + S2(JSAkt) + S3(EMkt) + φkt + δt.

where δt is the Scotland-wide temporal trend at time t while φkt

is a spatio-temporal effect for area k and time t.

Introduction Measles susceptibility Air pollution 18/39

Page 19: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Spatio-temporal trend models

The overall temporal trend is modelled by a first orderrandom walk process:

δt ∼ N(δt−1, σ

2) .The spatial trend in year t, φt = (φ1t, . . . , φKt) is modelledby a conditional autoregressive (CAR) model:

φkt|φ−kt,W, ρ, τ 2t ∼ N

(ρ∑K

j=1 wkjφjt

ρ∑K

j=1 wkj + 1− ρ,

τ 2t

ρ∑K

j=1 wkj + 1− ρ

),

where each year has its own variance τ 2t to allow for

temporally varying spatial variability.

Introduction Measles susceptibility Air pollution 19/39

Page 20: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Results - covariate effects

−1 0 1 2 3

0.9

1.0

1.1

1.2

1.3

1.4

1.5

Median house price by 100,000 (mean centred)

Od

ds r

atio

0 5 10

0.9

1.0

1.1

1.2

1.3

1.4

% of working age population claiming Jobseeker’s Allowance (mean centred)

Od

ds r

atio

0 20 40 60 80

0.9

1.0

1.1

1.2

1.3

1.4

1.5

% of pupils from ethnic minority groups (mean centred)

Od

ds r

atio

Introduction Measles susceptibility Air pollution 20/39

Page 21: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Results - Time trends and spatial variability

0.075

0.100

0.125

0.150

0.175

1998 2000 2002 2004 2006 2008 2010 2012 2014

Year

θt

0.075

0.100

0.125

0.150

0.175

1998 2000 2002 2004 2006 2008 2010 2012 2014

Year

θt

0.00

0.02

0.04

0.06

1998 2000 2002 2004 2006 2008 2010 2012 2014

Year

τt2

0.00

0.02

0.04

0.06

1998 2000 2002 2004 2006 2008 2010 2012 2014

Year

τt2

The top row is with covariates and the bottom panel is without.

Introduction Measles susceptibility Air pollution 21/39

Page 22: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Results - Spatial trends

Introduction Measles susceptibility Air pollution 22/39

Page 23: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Conclusions

Measles susceptibility increased to a peak in 2004coinciding with the media coverage surrounding theWakefield article, before dropping dramatically until thepresent day.

Spatial variation in measles susceptibility decreased until2006, whereafter it has stayed relatively constant over time.

Socio-economic deprivation appears to have a U-shapedrelationship with measles susceptibility, with increasingsusceptibility for very poor and very affluent communities.

Spatially, the rural northwest part of Scotland appears tohave the highest rates of susceptibility, as it has stayedconsistently higher than other parts of the country for alltime periods.

Introduction Measles susceptibility Air pollution 23/39

Page 24: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

2. Air pollution and health

Air pollution has long been known to adversely affectpublic health, in both the developed and developing world.

Recent reports by the UK government and the WorldHealth Organisation estimate that:

particulate matter reduces life expectancy by 6 months,with a health cost of £19 billion per year.there were estimated to be over 23,000 premature deathsfrom air pollution in 2010.

Air pollution will remain a key health problem for sometime, as nitrogen dioxide emissions are predicted to exceedEuropean Union limits until after 2020 in key parts of theUK, including in Glasgow.

Introduction Measles susceptibility Air pollution 24/39

Page 25: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Ecological study design

In ecological studies the data relate to populations living ina set of k = 1, . . . ,K non-overlapping areal units fort = 1, . . . ,T time periods, rather than to individuals.

In this study we have K = 1207 Intermediate Geographiesthat make up mainland Scotland, and data are collected forT = 5 years between 2007 and 2011.

For IG k and year t the observed number of hospitaladmissions due to respiratory disease is denoted by Ykt,while the expected number of admissions based onpopulation demographics is denoted by Ekt.

The standardised morbidity ratio is given bySMRkt = Ykt/Ekt, an exploratory measure of disease risk.

Introduction Measles susceptibility Air pollution 25/39

Page 26: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Spatial pattern in SMR

0 50 km0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

Introduction Measles susceptibility Air pollution 26/39

Page 27: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Air pollution

Air pollution concentration data are available from two sources:

Measured data from a small number of monitoring sitessuch as the Automatic Urban and Rural Network (AURN).

Modelled concentrations from a dispersion model on a1kilometre regular grid, such as the data provided byDEFRA.

Neither data source is ideal, as the measured data are spatiallysparse and the modelled data are only estimates and not realmeasurements.

Introduction Measles susceptibility Air pollution 27/39

Page 28: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Modelled nitrogen dioxide (NO2) data

0 50 km 0

5

10

15

20

25

30

35

40

Introduction Measles susceptibility Air pollution 28/39

Page 29: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Locations of measured NO2 data

0 50 km

Triangles are monitors and + are diffusion tubes.

Introduction Measles susceptibility Air pollution 29/39

Page 30: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Goal 1 - Estimating NO2 concentrations

The first goal in this study is to estimate representativeconcentrations of nitrogen dioxide for each IG from these twosets of data. However, there are two natural questions one has toovercome.

How should the different spatial scales of the three datasets be resolved (point, grid square, IG)?

How can we use both the modelled and monitored data toget the best estimate of NO2 concentrations.

Introduction Measles susceptibility Air pollution 30/39

Page 31: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Fusion modelling

The accepted approach in the literature is a statisticalfusion model, which essentially regresses the measuredpollution data against the modelled pollution data.

Let Xt = (Xt(s1), ...,Xt(snt)) denote the vector of nt

measured NO2 concentrations (on the natural log scale) atsites (s1, . . . , snt) in year t

These measured concentrations are related to an nt × pdesign matrix of covariates Zt, such as the modelledconcentrations, site type, etc.

Introduction Measles susceptibility Air pollution 31/39

Page 32: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Pollution model

Then the regression component of the model is given by:

Xt ∼ N(Ztβt, σ2t It) t = 1, ...,T,

The regression parameters βt and the variance parameter σ2t

vary over time via autoregressive processes:

βt ∼ N(β + κ(βt−1 − β), τ 2V

)t = 2, ...,T,

ln (σ2t ) ∼ N(ln (σ2

t−1), σ2) t = 2, ...,T,

Note, no spatial autocorrelation is allowed for here because themodelled concentrations are spatially autocorrelated andaccount for it.

Introduction Measles susceptibility Air pollution 32/39

Page 33: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Prediction

The model is fitted in a Bayesian setting, using MarkovChain Monte Carlo (MCMC) simulation methods.

It predicts NO2 concentrations at each 1km grid square inmainland Scotland.

We aggregate to the IG level by computing the mean andmaximum of the set of 1km grid squares in each IG.

The mean is commonly used in these studies, while themaximum may better represented densely populated partsof each IG.

Introduction Measles susceptibility Air pollution 33/39

Page 34: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Goal 2 - Estimating the health effect of NO2

Recall that (Ykt,Ekt) denote the observed and expected numbersof hospital admissions in the kth IG in year t. Then a Poissongeneralised linear mixed model is typically used:

Ykt | Ekt,Rkt ∼ Poisson(EktRkt),

ln(Rkt) = bTktα+ X̃ktλ+ φkt,

where X̃kt is NO2 (mean or max) and bTkt are other covariates

such as socio-economic deprivation. Finally, φkt is a randomeffect that accounts for any unmeasured spatio-temporalautocorrelation.

Introduction Measles susceptibility Air pollution 34/39

Page 35: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Modelling spatial autocorrelation

Our basic model for spatial autocorrelation is a combination ofan autoregressive time series process of order 1 and aconditional autoregressive spatial process, and is given by

φt | φt−1 ∼ N(γφt−1, ν

2Q(ρ,W)−1) ,where W is a spatial neighbourhood matrix and (γ, ρ) aretemporal and spatial autocorrelation parameters respectively.Again, the model is fitted in a Bayesian framework usingMCMC simulation, and the R package CARBayesST can beused to fit this model.

Introduction Measles susceptibility Air pollution 35/39

Page 36: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

ResultsThe results are presented as relative risks for a standarddeviation increase in each covariates value, which is NO2 6.84µgm−3, Logprice 0.38, JSA 2.35.

Parameter Spatial mean NO2 Spatial max NO2

NO2 0.993 1.021(0.980,1.008) (1.004,1.037)

Logprice 0.920 0.921(0.909,0.929) (0.911,0.930)

JSA 1.200 1.196(1.185,1.215) (1.180,1.214)

ρ 0.926 0.911(0.891,0.956) (0.866,0.946)

γ 0.832 0.830(0.797,0.867) (0.792,0.865)

Introduction Measles susceptibility Air pollution 36/39

Page 37: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Conclusions

In this study the choice of spatially representative measureof NO2 had a large impact on the results, with no effectbeing seen for the spatial mean but a substantial effectbeing observed for the spatial max.

Estimating the effect of air pollution on health is a difficulttask because of factors such as:

Spatial misalignment between the pollution and diseasedata.The difficult task of controlling for residualspatio-temporal autocorrelation.The lack of real exposure data.

Future work in this field should look at personal exposuresand not outdoor measured data at fixed locations.

Introduction Measles susceptibility Air pollution 37/39

Page 38: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

Talk summary

In summary, databases of small-area statistics such as SNSprovide a valuable resource for addressing key questions ofpublic and policy interest.

Although this talk has focused on epidemiology, the SNShas data on topics as diverse as property prices,educational attainment, benefit claimants, access toservices and demography.

A key statistical issue is that of spatial autocorrelation, inthat simple models ignoring this are likely to producedbiased results, particularly in terms of confidence intervals.

A second issue is that of the modifiable areal unit problem(MAUP), which essentially means do your inferencesremain valid if you change spatial scale, e.g. from IG toDZ?

Introduction Measles susceptibility Air pollution 38/39

Page 39: Using small-area spatial data for statistical and ... › media › Media_447105_smxx.pdf · In this study we have K = 1207 Intermediate Geographies that make up mainland Scotland,

References

An integrated Bayesian model for estimating the long-termhealth effects of air pollution by fusing modelled andmeasured pollution data: A case study of nitrogen dioxideconcentrations in Scotland, Guowen Huang, et al., Spatialand Spatio-temporal Epidemiology, 2015, 14-15, 63-74.

A model to estimate the impact of changes in MMR vaccineuptake on inequalities in measles susceptibility inScotland, Gary, Napier, et al, under revision.

CARBayes and CARBayesST R packages are availablefrom https://cran.r-project.org.

Further details about my research can be found athttp://www.gla.ac.uk/schools/mathematicsstatistics/staff/duncanlee/.

Introduction Measles susceptibility Air pollution 39/39