1
Philip Clarke and Denise Silva
Development of Small Area Estimation at ONS
2
Outline
1. Small Area Estimation Problem
2. History and current provision
3. Developments in progress
4. Wider research
5. Consultancy service
3
1. Small Area Estimation Problem
• “Official statistics provide an indispensable element in the information system of a democratic society” (Fundamental Principles of Official Statistics, UNSD)
• Sample surveys are used to provide estimates of target parameters at population (or national) level and also for subpopulations or domains of study
• However, implementation in a small area context is challenging
4
Small Area Estimation Problem
• In small areas/domains, sample sizes are usually not large enough to provide reliable estimates using classical design-based methods.
• The small area estimation problem refers to small sample sizes (or no sample at all) in the domain or area of interest.
5
2. History
• Small area estimation in the UK began as a research project in the late 1990s.
• In response to calls for locally focussed information in many different areas:
– Environmental
– Business
– Social, e.g. health, housing, deprivation, unemployment.
• Also calls for more general domain estimation:
– e.g. cross-classifications by age/sex, occupation.
• Initial experimental studies on mental health estimation for DoH.
6
Developing alternative methodology
• Purpose:
– To enable production of reliable estimates of characteristics of interest for small areas or domains based on very small or no sample.
– To assess the quality (precision) of estimates.
• Several years of research and development (since 1995)
– Partnership work with universities and Statistics Finland
– The EURAREA project: research programme funded by Eurostat to ‘enhance techniques to meet European needs’ (2001 to 2004)
7
Basis of Approach: Relax the Survey Restriction
• ‘Borrow strength’ by removing the isolation of depending solely on the survey and solely on respondents in a given area.
– Widen the class of respondents for a given area by pooling together similar areas.
– Widen the class of respondents by taking past period respondents into account.
– Take advantage of other related data sources which are not sample survey based. These are known as auxiliary data, e.g. administrative data or census data which are available for all areas/domains.
8
Model based estimation
• All approaches detailed are based on an implicit or explicit model.
• The approach currently adopted in the UK uses auxiliary data together with survey data from all areas.
– Borrows strength nationally.
– Uses an explicit statistical model to represent the relationship between the survey variable of interest and auxiliary data.
– Dependent variable is the survey variable of interest.
– Independent variables are certain auxiliary data variables, known as covariates.
– Model is fitted using sample data and assumed to apply generally.
– Model is then used to obtain area/domain estimates.
9
Outline of a model structure
• Suppose the variable of interest, Y, in an area j is linearly related to a single covariate X
• A possible model structure is given by:
$$\bar{Y}_j = \alpha + \beta X_j$$
where $\bar{Y}_j$ is the mean of Y in area j
• This is a deterministic structure, so we need to add some random variability
10
• Obtain
$$\bar{Y}_j = \alpha + \beta X_j + u_j, \qquad u_j \sim N(0, \sigma_u^2)$$
• $u_j$ represent random area differences from the deterministic value.
• $\sigma_u^2$ represents variability between areas.
11
Model fitting
• Fit the model using direct survey estimates $\bar{y}_j$ for each area.
• This introduces additional sampling variability:
$$\bar{y}_j = \alpha + \beta X_j + u_j + \bar{e}_j$$
• Unit level sampling variability
$$e_{ij} \sim N(0, \sigma_e^2)$$
giving rise to additional area level sampling variability
$$\bar{e}_j \sim N(0, \sigma_e^2 / n_j)$$
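The fitting step can be illustrated with a minimal Python sketch. It assumes the area-level model reads direct estimate = α + βX_j + u_j + ē_j, simulates direct estimates for a set of areas and recovers α and β. Ordinary least squares is used here only for brevity; it is a stand-in for the mixed-model fitting described in the slides, and all function names and numeric values are illustrative assumptions.

```python
import random

def simulate_direct_estimates(x, n, alpha, beta, sigma_u, sigma_e, seed=0):
    """Simulate direct area means: ybar_j = alpha + beta*X_j + u_j + ebar_j."""
    rng = random.Random(seed)
    ybar = []
    for x_j, n_j in zip(x, n):
        u_j = rng.gauss(0.0, sigma_u)                   # between-area random effect
        ebar_j = rng.gauss(0.0, sigma_e / n_j ** 0.5)   # sampling error of the area mean
        ybar.append(alpha + beta * x_j + u_j + ebar_j)
    return ybar

def ols_line(x, y):
    """Least squares fit of y = alpha + beta*x (a simplification, not a mixed-model fit)."""
    m = len(x)
    x_mean = sum(x) / m
    y_mean = sum(y) / m
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_mean - beta_hat * x_mean
    return alpha_hat, beta_hat

# Illustrative values only: 200 areas with 25 sampled units each.
x = [i / 20 for i in range(200)]
n = [25] * 200
ybar = simulate_direct_estimates(x, n, alpha=2.0, beta=0.5, sigma_u=0.3, sigma_e=1.0)
alpha_hat, beta_hat = ols_line(x, ybar)
```

Smaller area samples n_j inflate the variance of ē_j, which is why direct estimates alone are unreliable for small areas.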
14
Estimating from the model
• Once the model is fitted, estimate for area j by using parameter estimates:
$$\hat{y}_j = \hat{\alpha} + \hat{\beta} X_j + \hat{u}_j$$
• Estimate of mean squared error given by
$$\widehat{MSE}(\hat{y}_j) = \hat{\sigma}_u^2 + Var(\hat{\alpha}) + X_j^2\,Var(\hat{\beta}) + 2 X_j\,Cov(\hat{\alpha}, \hat{\beta})$$
• Modelling success measured by obtaining estimates with high precision based on low mean squared errors.
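As a worked illustration, the sketch below evaluates the area estimate and its estimated mean squared error. It assumes the slide's expressions read as α̂ + β̂X_j + û_j for the estimate and σ̂_u² + Var(α̂) + X_j² Var(β̂) + 2X_j Cov(α̂, β̂) for the MSE. The function names and the numbers are hypothetical and chosen only to show the arithmetic.

```python
def area_estimate(x_j, alpha_hat, beta_hat, u_hat_j=0.0):
    """Model-based estimate for area j: alpha_hat + beta_hat * X_j + u_hat_j."""
    return alpha_hat + beta_hat * x_j + u_hat_j

def mse_estimate(x_j, sigma_u2_hat, var_alpha, var_beta, cov_alpha_beta):
    """Estimated MSE: between-area variance plus the variance of the fitted line at X_j."""
    return (sigma_u2_hat
            + var_alpha
            + x_j ** 2 * var_beta
            + 2.0 * x_j * cov_alpha_beta)

# Hypothetical parameter estimates for a single area with X_j = 2.
estimate = area_estimate(2.0, alpha_hat=1.0, beta_hat=0.5, u_hat_j=0.1)
mse = mse_estimate(2.0, sigma_u2_hat=0.5, var_alpha=0.1, var_beta=0.01,
                   cov_alpha_beta=-0.02)
```

The last three terms are the usual variance of a fitted straight line at the point X_j, so areas with covariate values far from the bulk of the data get larger MSE estimates.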
15
Current provision
• SAEP: a generic methodology for application to variables from household-based surveys.
– Mean household income based on the Family Resources Survey, published as Experimental Statistics for wards in 1998/99 and 2001/02 and for middle layer super output areas in 2004/05
• Specialised methodology for labour market estimation of unemployment from the Labour Force Survey.
– Unemployment levels and rates routinely published quarterly as National Statistics for Local Authority Districts in Great Britain.
16
SAEP methodology and income estimation
SAEP methodology is:
• derived from the outlined model-based approach,
BUT it
• is based on a unit (household)/area multilevel model;
• borrows strength across areas using multivariate area level auxiliary data (covariates);
• can model a transformation of the variable of interest if required;
• is adapted for estimating at ward/middle layer super output area (MSOA) level from customary ONS clustered design household sample surveys.
17
Application to income estimation- Response Variable
• Income value for each household sampled in the Family Resources Survey (FRS).
– ~ 3,300 MSOAs in England and Wales with sample in 2004/05,
– ~ 21,500 total responding households.
• But not a simple random sample.
– Clustered design with primary sampling units as postcode sectors,
– ~ 1,500 sampled postcode sectors.
18
Coping with design clustering
• Samples are random samples of postcode sectors;
– So random terms are around postcode sectors, indexed by j
• Estimation is required for geographically distinct wards or middle layer super output areas;
– So covariates are for these areas, indexed by d
– For estimation, covariates must be known for all areas, not just sampled areas.
19
SAEP model and estimator structure for income estimation
• Multilevel structure gives rise to a unit level random term $e_{ij}$ replacing the area sampling variability $\bar{e}_j$
• Logarithmic transformation of income taken because of positive skewness of income distribution
• Model:
$$\log(y_{id}) = X_d^T \beta + u_j + e_{ij}$$
20
SAEP model fitting procedure
• Create a dataset containing:
– Variable of interest from individual household responses to the survey.
– Values of a large number of administrative and census variables for the household's area of residence which we believe could impact on the variable of interest, e.g. census variables, DWP social benefit claimant rates, council tax band proportions
21
SAEP model fitting procedure (cont.)
• Starting with a null model, fit covariates in a stepwise manner in order of significance using specialised multilevel software, e.g. MLwiN or SAS PROC MIXED.
• In this way select a set of significant covariates and fit an accepted model.
• Use diagnostic techniques to check the model against its assumptions, e.g. randomness of residuals, unbiasedness of predictions.
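The stepwise idea can be sketched in Python as a forward selection loop. This sketch swaps in a single-level ordinary least squares fit with an F-to-enter rule; it is not the multilevel fitting performed in MLwiN or SAS PROC MIXED, and the function name, the threshold of 4.0 and the toy data are assumptions made for illustration.

```python
def forward_select(y, columns, f_to_enter=4.0):
    """Forward stepwise OLS selection; returns column indices in order of entry."""
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residuals of the null (intercept-only) model
    cands = {}
    for k, col in enumerate(columns):        # centre candidates so the intercept is implicit
        m = sum(col) / n
        cands[k] = [v - m for v in col]
    selected = []
    rss = sum(r * r for r in resid)
    while cands:
        best_k, best_gain = None, 0.0
        for k, z in cands.items():
            zz = sum(v * v for v in z)
            if zz < 1e-12:                   # collinear with covariates already entered
                continue
            zr = sum(a * b for a, b in zip(z, resid))
            gain = zr * zr / zz              # drop in residual sum of squares if k enters
            if gain > best_gain:
                best_k, best_gain = k, gain
        if best_k is None:
            break
        df = n - len(selected) - 2           # residual d.f. after adding one more covariate
        if df <= 0 or rss - best_gain <= 0:
            break
        f_stat = best_gain / ((rss - best_gain) / df)
        if f_stat < f_to_enter:              # most significant candidate is not significant
            break
        z = cands.pop(best_k)
        zz = sum(v * v for v in z)
        coef = sum(a * b for a, b in zip(z, resid)) / zz
        resid = [r - coef * v for r, v in zip(resid, z)]
        rss -= best_gain
        # Orthogonalise remaining candidates against the covariate just entered.
        cands = {k: [a - (sum(p * q for p, q in zip(w, z)) / zz) * b
                     for a, b in zip(w, z)]
                 for k, w in cands.items()}
        selected.append(best_k)
    return selected

# Toy data: y depends on columns 0 and 2 only.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x1 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [3.0 + 2.0 * a - b + e for a, b, e in zip(x0, x2, noise)]
chosen = forward_select(y, [x0, x1, x2])
```

Each pass enters the candidate that most reduces the residual sum of squares and stops once the best remaining candidate fails the entry test, which mirrors fitting "in order of significance".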
23
Estimator and mean squared error
• Estimator on log income scale: a synthetic estimator is used, omitting the random area terms:
$$\log \hat{y}_d = X_d^T \hat{\beta}$$
• Mean squared error
$$\hat{\sigma}_u^2 + X_d^T\,Var(\hat{\beta})\,X_d$$
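A small Python sketch of the log-scale calculation follows. It assumes the synthetic estimator is the inner product of the area covariate vector with the fitted coefficients, and that the mean squared error is σ̂_u² plus the quadratic form X_d' Var(β̂) X_d. The helper names and the two-covariate example values are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def quad_form(x, v):
    """Quadratic form x' V x for a square matrix V given as a list of rows."""
    return sum(x[i] * v[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))

def synthetic_log_estimate(x_d, beta_hat):
    """Synthetic estimator on the log scale (random area term omitted)."""
    return dot(x_d, beta_hat)

def log_scale_mse(x_d, cov_beta, sigma_u2_hat):
    """Between-area variance plus the variance of the fitted linear predictor."""
    return sigma_u2_hat + quad_form(x_d, cov_beta)

# Hypothetical area: a constant term plus one covariate.
x_d = [1.0, 2.0]
beta_hat = [6.0, 0.5]
cov_beta = [[0.04, -0.01], [-0.01, 0.01]]
log_est = synthetic_log_estimate(x_d, beta_hat)
log_mse = log_scale_mse(x_d, cov_beta, sigma_u2_hat=0.02)
```

Because the random area term is left out of the estimator, its variance σ̂_u² appears in full in the mean squared error.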
25
Converting to raw income scale
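• Need to make allowance for
$$\text{mean}(\log) \neq \log(\text{mean})$$
• Area estimate
$$\hat{y}_d = \exp\left(X_d^T \hat{\beta} + \frac{\hat{\sigma}_u^2 + \hat{\sigma}_e^2}{2}\right)$$
• Confidence interval
$$\exp\left(X_d^T \hat{\beta} + \frac{\hat{\sigma}_u^2 + \hat{\sigma}_e^2}{2} \pm 1.96\left[\hat{\sigma}_u^2 + X_d^T\,Var(\hat{\beta})\,X_d\right]^{1/2}\right)$$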
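The conversion back to the raw income scale can be sketched as below. The sketch assumes the area estimate adds half of the total variance (σ̂_u² + σ̂_e²) to the log-scale prediction before exponentiating, and that the interval applies ±1.96 standard errors on the log scale, with squared standard error σ̂_u² + X_d' Var(β̂) X_d. Function and argument names, and the numbers, are illustrative only.

```python
import math

def back_transform(xb, sigma_u2_hat, sigma_e2_hat, var_xb, z=1.96):
    """Return (estimate, lower, upper) on the raw scale.

    xb      : log-scale prediction X_d' beta_hat
    var_xb  : X_d' Var(beta_hat) X_d, the variance of that prediction
    """
    shift = (sigma_u2_hat + sigma_e2_hat) / 2.0    # allowance for mean(log) != log(mean)
    se = math.sqrt(sigma_u2_hat + var_xb)          # log-scale root mean squared error
    estimate = math.exp(xb + shift)
    lower = math.exp(xb + shift - z * se)
    upper = math.exp(xb + shift + z * se)
    return estimate, lower, upper

# Hypothetical values for a single area.
est, lo, hi = back_transform(6.0, sigma_u2_hat=0.02, sigma_e2_hat=0.30, var_xb=0.005)
```

The interval is symmetric on the log scale and therefore asymmetric around the estimate on the raw scale, with the longer arm above the estimate.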
26
Actual model for ward estimation of income in 2004/05
Fitted model for $\widehat{\log(\text{income})}_d$: a constant of 6.01 plus terms in the area covariates $x_d$ listed below. Only the coefficient magnitudes survive in this transcript; their signs, and the further terms elided on the slide (…), are not recoverable.
• phrpman (coefficient magnitude 0.76) = proportion of household reference persons aged 16-74 who are in professional or managerial occupations.
• lnphrpecac (0.18) = logit of proportion of household reference persons aged 16-74 who are economically active.
• lnphhtype1 (0.13) = logit of proportion of one person households.
• engegh (0.58) = proportion of council tax band G&H dwellings for England.
• pcgeo (0.72) = proportion of people aged 60 and over claiming pension credit (guarantee element only).
27
28
Income estimation outputs
• Estimates obtained of sufficient precision for publication and acceptable to user community.
• Accredited as Experimental Statistics
• Placed on Neighbourhood Statistics website together with user guides and technical documentation.
29
Estimation of unemployment at local authority level
BACKGROUND
• Unemployment is a key indicator and is used for policy making and resource allocation
• Official UK measure of unemployment follows the International Labour Organisation (ILO) definition
• ILO unemployment is estimated via the Labour Force Survey (national level)
• Small (local) sample sizes in the LFS for some areas
30
Features of Labour Force Survey
• A rotating panel survey
– Roughly 60,000 households surveyed each quarter
– Each household remains in sample for 5 quarters (waves 1 to 5) then drops out
• Waves 1 and 5 respondents for the last four quarters are used to obtain an annual ‘local labour force survey’ dataset of about 90,000 independent households.
• Unclustered survey design, giving a sample in each LAD.
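The choice of waves 1 and 5 can be checked with a short Python sketch of the rotation pattern. It assumes five equal-sized waves per quarter and identifies each household group by the quarter in which it was first interviewed; both are simplifying assumptions, and the function name is hypothetical.

```python
def annual_dataset(households_per_quarter=60_000, waves=5, quarters=4,
                   used_waves=(1, 5)):
    """Return (number of distinct entry cohorts, households) in the pooled dataset."""
    per_wave = households_per_quarter // waves     # assumes equal-sized waves
    cohorts = set()
    for q in range(quarters):
        for w in used_waves:
            cohorts.add(q - (w - 1))               # quarter in which this group entered
    return len(cohorts), len(cohorts) * per_wave

n_cohorts, n_households = annual_dataset()                 # waves 1 and 5 only
all_waves = annual_dataset(used_waves=(1, 2, 3, 4, 5))     # every wave pooled
```

Under these assumptions, waves 1 and 5 over four quarters give 8 distinct cohorts, so no household is counted twice, and roughly 96,000 households before any losses. That is the same order as the slide's figure of about 90,000; the slides do not explain the difference. Pooling all five waves yields the same 8 cohorts, so the extra interviews would be repeat visits rather than independent households.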
31
Features of unemployment modelling
• Unclustered LFS design means
– direct estimates available for each LAD
– availability of estimated random area terms in LAD estimation
• However
– low precision of direct survey estimates due to small sample sizes
– need for better precision model-based estimates
• Availability of a highly correlated covariate: number of claimants of unemployment benefit/job seekers allowance
– Eliminates need for model fitting to a range of possible covariates on each occasion.
32
The small area estimation model
A LOGISTIC multilevel model by local authority (d) and six age/sex classes (i). It models the probability $p_{id}$ that an individual in age/sex class i of local authority d is unemployed.
Response variable: proportion of unemployed individuals in the LFS in each age/sex class of each local authority (logit transformed).
Covariate data
• Benefit data: the logit of the claimant proportion of job seekers allowance in each age/sex class within each local authority, and also over all age/sex classes;
• The age/sex class: male/female for age groups (16 to 24; 25 to 49; 50 and over)
• Geographical region: the 12 government office regions (GOR)
• ONS area classification: 7 categories under the National Statistics Area Classification for Local Authorities
33
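• The model used to link $p_{id}$ with the auxiliary data is a binomial linear mixed model with a logistic link function
$$\text{logit}(p_{id}) = \ln\left(\frac{p_{id}}{1 - p_{id}}\right) = \mathbf{x}_{id}^T \boldsymbol{\beta} + u_d, \qquad u_d \sim N(0, \sigma_u^2)$$
where $u_d$ is the area random effect.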
34
Estimator from model
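• The model-based estimator of the proportion unemployed in each age/sex group of each LAD is then given, after fitting the model, by:
$$\hat{p}_{id} = \text{antilogit}\left(\mathbf{x}_{id}^T \hat{\boldsymbol{\beta}} + \hat{u}_d\right) = \frac{\exp\left(\mathbf{x}_{id}^T \hat{\boldsymbol{\beta}} + \hat{u}_d\right)}{1 + \exp\left(\mathbf{x}_{id}^T \hat{\boldsymbol{\beta}} + \hat{u}_d\right)}$$
• Note the use of the term $\hat{u}_d$ in the estimator as it is now available for each LAD.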
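The antilogit step can be written as a short Python sketch. It assumes the linear predictor is the inner product of a covariate vector with fitted coefficients plus the estimated area effect; the function names and example values are hypothetical. The two-branch form of `antilogit` is a design choice made here to avoid overflow for large positive or negative predictors.

```python
import math

def antilogit(eta):
    """Inverse of the logit: exp(eta) / (1 + exp(eta)), computed stably."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def predicted_proportion(x_id, beta_hat, u_hat_d):
    """Estimated proportion for one age/sex class in one area."""
    eta = sum(x * b for x, b in zip(x_id, beta_hat)) + u_hat_d
    return antilogit(eta)

# Hypothetical class with linear predictor -3.0 and estimated area effect +0.4.
p_hat = predicted_proportion([1.0, -3.0], [0.0, 1.0], u_hat_d=0.4)
```

A positive estimated area effect raises the predicted proportion above the value implied by the covariates alone, which is how the estimator uses the area term in addition to the covariates.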
35
Model-based estimate for Unemployment
• Model has estimated a proportion for each age/sex group
• This is converted into an estimate of unemployment level at each LAD by:
– multiplying each proportion estimate by the LFS estimate of the unsampled population
– adding those sampled and found unemployed
– summing over the 6 age/sex group estimates
Final estimator for unemployment level for area d is:
$$\hat{Y}_d = \sum_{i=1}^{6} \hat{Y}_{id} = \sum_{i=1}^{6} \left[ y_{sid} + (\hat{N}_{id} - n_{id})\,\hat{p}_{id} \right]$$
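The three steps can be followed in a small Python sketch. It assumes that, for each age/sex group, the inputs are the number of sampled people found unemployed (`y_s`), the LFS population estimate (`N_hat`), the sample size (`n`) and the model-based proportion (`p_hat`). The dictionary keys and the figures are hypothetical, and only two groups are shown for brevity.

```python
def unemployment_level(groups):
    """Sum over groups of: sampled unemployed + unsampled population * model proportion."""
    total = 0.0
    for g in groups:
        unsampled = g["N_hat"] - g["n"]            # population not covered by the sample
        total += g["y_s"] + unsampled * g["p_hat"]
    return total

# Hypothetical inputs for one LAD (a real application would have six groups).
groups = [
    {"y_s": 3, "N_hat": 1000.0, "n": 50, "p_hat": 0.05},
    {"y_s": 1, "N_hat": 2000.0, "n": 80, "p_hat": 0.02},
]
level = unemployment_level(groups)
```

Here the first group contributes 3 + 950 × 0.05 = 50.5 and the second 1 + 1920 × 0.02 = 39.4, giving 89.9 in total.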
36
LAD Estimation of unemployment rate
• The estimate of unemployment rate is obtained using the model-based estimate of unemployment level and the direct estimate of employment:
$$\hat{r}_d = \frac{\hat{Y}_d}{\hat{Y}_d + \hat{E}_d}$$
where $\hat{Y}_d$ is the model-based estimate of unemployment and $\hat{E}_d$ is the direct survey estimate of employment.
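As a worked example of the rate calculation, the Python sketch below divides the unemployment estimate by the sum of the unemployment and employment estimates, assuming that is how the slide's ratio reads. The function name and the figures are hypothetical.

```python
def unemployment_rate(unemployed_model, employed_direct):
    """Rate = unemployed / (unemployed + employed), i.e. a share of the economically active."""
    active = unemployed_model + employed_direct
    if active <= 0:
        raise ValueError("economically active population must be positive")
    return unemployed_model / active

# Hypothetical LAD: 90 unemployed (model-based) and 910 employed (direct estimate).
rate = unemployment_rate(90.0, 910.0)
```

With these figures the rate is 90 / 1000 = 0.09. The economically inactive population does not enter the denominator.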
37
Precision of Estimates
• The mean squared error (MSE) for the unemployment level estimates in LAD d is given by several components
$$MSE(\hat{Y}_d) = G_1 + G_2 + G_3 + G_4 + G_5$$
• G1 and G2 come from the uncertainty in estimating the coefficients $\boldsymbol{\beta}$ and $u$ in the model
• G3 arises because we have estimated the variance of u, $\sigma_u^2$
• G4 is necessary because the model estimates actual values rather than means
• G5 is the additional variance component due to the estimation of population size $(\hat{N}_d)$ in each LAD
38
Unemployment estimates publication
• The standard errors of the model-based estimates were found to be smaller than the corresponding direct standard errors in each LAD.
• Model-based estimates have been accredited as National Statistics and are now published quarterly in Labour Market statistics releases.
(http://www.statistics.gov.uk/StatBase/Product.asp?vlnk=14160)
39
3. Developments in progress
Labour Market area
– Consistent estimation of all three labour market states: employed, not economically active, unemployed
– Currently, Local Authority labour market estimates are:
• Model-based estimates for unemployment
• Direct survey estimates for economic inactivity and employment figures
• Now developing a multivariate model to estimate concurrently the number of unemployed, employed and economically inactive people by local authority
40
Compositional data
• The proportions of individuals classified in each category are bounded between 0 and 1 and subject to a unity-sum constraint.
• A multinomial logistic model relating the labour market probabilities to auxiliary data for all three categories is therefore defined with only 2 equations.
42
Multinomial Logistic Model
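$$\text{mlogit}(p_{id1}) = \ln\left(\frac{p_{id1}}{p_{id3}}\right) = \mathbf{x}_{id}^T \boldsymbol{\beta}_1 + u_{d1}$$
$$\text{mlogit}(p_{id2}) = \ln\left(\frac{p_{id2}}{p_{id3}}\right) = \mathbf{x}_{id}^T \boldsymbol{\beta}_2 + u_{d2}$$
Then:
$$p_{id1} = \frac{\exp\left(\mathbf{x}_{id}^T \boldsymbol{\beta}_1 + u_{d1}\right)}{1 + \sum_{j=1}^{2} \exp\left(\mathbf{x}_{id}^T \boldsymbol{\beta}_j + u_{dj}\right)}$$
$$p_{id2} = \frac{\exp\left(\mathbf{x}_{id}^T \boldsymbol{\beta}_2 + u_{d2}\right)}{1 + \sum_{j=1}^{2} \exp\left(\mathbf{x}_{id}^T \boldsymbol{\beta}_j + u_{dj}\right)}$$
$$p_{id3} = \frac{1}{1 + \sum_{j=1}^{2} \exp\left(\mathbf{x}_{id}^T \boldsymbol{\beta}_j + u_{dj}\right)}$$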
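The step from the two log-ratio equations to three probabilities can be sketched in Python. The sketch assumes category 3 is the reference category with linear predictor 0, and takes the two fitted linear predictors (covariate part plus area effect) as inputs. The function name and the numbers are hypothetical; subtracting the largest predictor before exponentiating is a numerical safeguard that leaves the probabilities unchanged.

```python
import math

def labour_market_probs(eta1, eta2):
    """Probabilities of the three states from two log-ratios against the reference state."""
    m = max(0.0, eta1, eta2)                 # shift to avoid overflow in exp
    e1 = math.exp(eta1 - m)
    e2 = math.exp(eta2 - m)
    e3 = math.exp(0.0 - m)                   # reference category has linear predictor 0
    total = e1 + e2 + e3
    return e1 / total, e2 / total, e3 / total

# Hypothetical linear predictors for one age/sex class in one area.
p1, p2, p3 = labour_market_probs(0.4, -1.2)
```

The three probabilities always sum to one, which is why two equations are enough to describe three labour market states.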
43
The Model
• Relates the probabilities of labour market states to the following predictors:
– Age/sex group, geographical region and ONS area classification
– Benefit data: claimant proportions (JSA) and incapacity benefit
– Other variables will be tested (e.g. income support)
46
Developments in progress (cont.)
Labour Market area
– Unemployment estimation at Parliamentary constituency level
• Non-nested geography but with certain matching areas
• Issue here is to ensure consistency with local authority estimates at comparable areas
• Model developed and estimates likely to become available in the coming year
47
Developments in progress (cont.)
Income estimation
– Estimation at local authority level
• Clustered survey design requires a modification of the SAEP framework
• Currently in development
– Estimation of poverty: proportion of households below a threshold
• Currently being developed for MSOA/local authority level
48
4. Wider research activities
In conjunction with academic partners
– Estimation of change over time
• Current work is confined to single point-in-time estimation, but users would like an indication of progress over time, particularly in relation to funding
– Estimation of poverty using M-quantile modelling
• Research using FRS data by Nikos Tzavidis
– Models incorporating spatial relationships
• Preliminary investigation of spatial relationship in unemployment model in conjunction with Ayoub Saei at Southampton University
• Link with work at Imperial College by Nicky Best and Virgilio Gomez-Rubio
49
5. Methodology Consultancy Service
ONS is currently establishing a methodology consultancy service
– To undertake and support statistical work by other government departments and public sector organisations.
– Resource for assessment/quality improvement
– Currently working with Health and Safety Executive on small area estimation of incidence of work related illness at local authority level.