Post on 20-Dec-2015
Identifying the cash-rich and the cash-poor:
Lessons from the Census Rehearsal
Dr Paul Williamson
Department of Geography
ESRC Census Development Programme
• Most requested addition to 2001 Census
INCOME…
The 2001 Census Geography of income:
Other sources of data on income
• Benefits data
• Government surveys (e.g. GHS, LFS, FES, FRS, NES)
• Commercially-held data [Postcode sector and postcode unit estimates]
• The Census Rehearsal (1999)
Objectives
Evaluation of:
• Extant methods for small-area income estimation
• New approaches
• Utility of non-census information (e.g. council tax; house price; benefits data)
[ • Methods of imputing income band means ]
Definition of ‘income’
• Income ≠ Wealth
• Gross or net income?
• Pre or post housing costs?
• Adult or Household?
• Household?
– Total
– Equivalised [Per capita / OECD / McClements]
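The difference between total, per capita and equivalised household income can be sketched as below. This is an illustration only: the slide does not say which OECD scale is meant, so both the original scale (1.0 / 0.7 / 0.5) and the modified scale (1.0 / 0.5 / 0.3) are shown, and the household is made up. The McClements scale is omitted.

```python
def per_capita(total_income, adults, children):
    """Household income shared equally among all members."""
    return total_income / (adults + children)


def oecd_equivalised(total_income, adults, children, modified=False):
    """Household income divided by an OECD equivalence scale.

    Original scale: 1.0 first adult, 0.7 each further adult, 0.5 per child.
    Modified scale: 1.0 first adult, 0.5 each further adult, 0.3 per child.
    """
    if adults < 1:
        raise ValueError("a household needs at least one adult")
    extra_adult, child = (0.5, 0.3) if modified else (0.7, 0.5)
    scale = 1.0 + extra_adult * (adults - 1) + child * children
    return total_income / scale


# Hypothetical household: two adults, two children, £540 per week in total.
total, adults, children = 540.0, 2, 2
print(per_capita(total, adults, children))
print(oecd_equivalised(total, adults, children))
print(oecd_equivalised(total, adults, children, modified=True))
```

The choice of scale changes the ranking of households with children relative to those without, which is one reason the definition of 'income' has to be fixed before surrogates are compared.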
Surrogates
• Univariate
– % unemployed
– % 2+ car households
– % residents in Social Classes I + II
– % owner-occupation
• Multivariate (deprivation indices)
– Carstairs
– Townsend
– Breadline
– DLTR Index of Multiple Deprivation 2000
– Green (Wealth) [owning 2+ cars; NS-SEC I or II; High qualifications]
• Geodemographic
– SuperProfiles
– MOSAIC
– GB Profiles
• Model
Individual income
– Dale (SOC2000; Economic activity; age; sex; Region)
– Lee (SOC2000; Economic activity)
– Regression (individual and/or ecological)
Household income
– Regression (household and/or ecological)
– Bramley & Smart (H/h comp.; earners; tenure; area level deprivation)
The 1999 Census Rehearsal
Key features
• full census questionnaire + INCOME
• Large achieved sample
– c. 65,000 households
– c. 140,000 individuals
• Spatially contiguous
Clustered sampling strategy: [Excluding NI]
– 7 part districts
– 38 wards
– 650 EDs
Potential problems
• non-response rate
– overall (~50%)
– income (~15%)
– other variables (5-20%)
– full responses for ~55% of achieved sample [individuals and households]
• non-response bias
Income band (per week)   All (%)   No data missing (%)
£0                       20.8      16.9
<£60                     13.2      11.0
£60-£119                 20.5      18.7
£120-£199                15.5      16.1
£200-£299                13.3      15.9
£300-£479                11.3      14.6
£480+                     5.5       6.8
Total                   100.0     100.0
N                      125138     67283

Social Class (1991)      All (%)   No data missing (%)
None                     28.2      25.4
I                         4.0       4.9
II                       18.9      21.4
III(N)                   17.5      19.3
III(M)                   11.5      10.8
IV                       14.7      13.7
V                         5.0       4.3
Army                      0.2       0.2
Total                   100.0     100.0
N                      117010     67283
Correlation coefficient, by Rehearsal sub-set
Indicators (calculated for 1991 Enumeration districts)   Original   Ideal
Townsend index                                           0.82       0.79
% households with
  No car                                                 0.89       0.86
  2+ cars                                                0.87       0.83
% households
  Owner-occupied                                         0.90       0.87
  Social rented                                          0.94       0.92
  Detached                                               0.98       0.97
  Flats                                                  0.92       0.92
% of economically active
  Unemployed                                             0.57       0.55
  Social Class I+II                                      0.58       0.56
• Banding of income question
What is your total current gross income from all sources?
Per week           Per year (approximately)
Nil                Nil
Less than £60      Less than £3,000
£60 to £119        £3,000 to £5,999
£120 to £199       £6,000 to £9,999
£200 to £299       £10,000 to £14,999
£300 to £479       £15,000 to £24,999
£480 or more       £25,000 or more
– Only 10% of adults in top band
– but problem compounded when individual incomes aggregated to estimate household income
– band mid-point ≠ band mean
– value of band means area sensitive?
Income band mean (£ per week), by Council Tax band
Income band (£)   National average   Band A   Band H
0                   0                  0        0
1-60               34                 35       27
61-120             91                 93       86
121-200           156                155      156
201-300           245                241      242
301-480           375                364      391
481+              765                652     1353
Source: FRS 1998/9 (Crown Copyright)
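To see why the choice between band mid-point and band mean matters once individual incomes are summed to household level, consider the sketch below. It uses the national average band means from the FRS 1998/9 table; the mid-points, the value assigned to the open top band (its lower limit) and the example household are illustrative assumptions, not figures from the presentation.

```python
# National average band means (£ per week), FRS 1998/9, as quoted in the table.
BAND_MEAN = {"0": 0, "1-60": 34, "61-120": 91, "121-200": 156,
             "201-300": 245, "301-480": 375, "481+": 765}

# Band mid-points. The open top band has no mid-point, so its lower limit
# is used here purely as an illustrative assumption.
BAND_MIDPOINT = {"0": 0, "1-60": 30.5, "61-120": 90.5, "121-200": 160.5,
                 "201-300": 250.5, "301-480": 390.5, "481+": 481}


def household_income(adult_bands, band_values):
    """Sum the imputed income of every adult in one household."""
    return sum(band_values[band] for band in adult_bands)


# Hypothetical two-adult household: one adult in the top band, one mid-range.
adults = ["481+", "121-200"]
by_mean = household_income(adults, BAND_MEAN)          # 765 + 156
by_midpoint = household_income(adults, BAND_MIDPOINT)  # 481 + 160.5
print(by_mean, by_midpoint)
```

Swapping in the Band A or Band H column for `BAND_MEAN` shows how far the household total can move when band means are area sensitive, with the open top band accounting for most of the difference.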
Digression: modelling income band means
Alternative modelling strategies include:
• National mean
• Sub-group mean (e.g. by council tax band)
• Statistical distributions (log-normal; pareto)
• New variant of log-normal approach with addition of modelled median etc.
Results
• For all bands sub-group mean best
– if possible
• For closed-bands, national mean is next best
• For open (top) band, new proposed log-normal approach is best, particularly where there is evidence of strong spatial clustering
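A generic log-normal estimate of the open top band mean can be sketched as follows. This is not the new variant proposed in the talk (which adds a modelled median and is not specified in the slides); it is a plain least-squares fit of the probit of cumulative band shares against the log of the band limits, applied to the 'All' column of the Rehearsal income table, with nil incomes excluded as an assumption.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Upper limits (£ per week) of the closed positive bands and the percentage of
# respondents in each ('All' column of the Rehearsal income table).
UPPER_LIMITS = [60, 120, 200, 300, 480]
BAND_PCT = [13.2, 20.5, 15.5, 13.3, 11.3]
TOP_PCT = 5.5  # open band, £480 or more


def fit_lognormal(upper_limits, band_pct, top_pct):
    """Fit mu and sigma by regressing probit(cumulative share) on log(limit)."""
    total = sum(band_pct) + top_pct  # nil incomes are left out of the fit
    xs, zs, cumulative = [], [], 0.0
    for limit, pct in zip(upper_limits, band_pct):
        cumulative += pct
        xs.append(log(limit))
        zs.append(STD_NORMAL.inv_cdf(cumulative / total))
    x_bar, z_bar = sum(xs) / len(xs), sum(zs) / len(zs)
    slope = (sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = z_bar - slope * x_bar
    # z = (ln t - mu) / sigma, so sigma = 1 / slope and mu = -intercept / slope
    return -intercept / slope, 1.0 / slope


def mean_above(threshold, mu, sigma):
    """Conditional mean E[X | X > threshold] of a log-normal(mu, sigma)."""
    a = (log(threshold) - mu) / sigma
    overall_mean = exp(mu + sigma ** 2 / 2)
    return overall_mean * (1 - STD_NORMAL.cdf(a - sigma)) / (1 - STD_NORMAL.cdf(a))


mu, sigma = fit_lognormal(UPPER_LIMITS, BAND_PCT, TOP_PCT)
print(round(mean_above(480, mu, sigma)))
```

Fitting the same function to the band shares of a single area, rather than the whole sample, gives an area-specific top band mean, which is where spatial clustering of high incomes would show up.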
• MAUP
– 1991 vs 1998/9 boundaries
– zones with <10 households or 25 residents excluded from analysis
• SOC 2000 / NS-SEC
– Lack of alternative SOC2000 coded data
– Therefore have to use Census Rehearsal data
– Use partitioned data to avoid unduly advantaging SOC2000 based approaches
• Spatial scale
– At what scale does income vary most?
Results
[Figure: Census Rehearsal Income Distribution. Bar chart by Annual Gross Income (£ 000s): Nil, <3, 3-5, 6-9, 10-14, 15-24, 25+; vertical axis 0 to 30.]
• At ward level the % household reps. in top income-band averaged 9.1%
– but ranged from 2.8% to 21.6%
• 89% of EDs contained one or more household reps. in top income-band
– i.e. in top income-decile of the population
Heterogeneity rules OK!
[Figure: Income distribution of household representative (Person 1 on Census Rehearsal form). Panels: All EDs; EDs in lowest, second, middle, fourth and top income quintiles. Horizontal axis: Income bracket (£000 p.a.): Nil, <3, 3-5, 6-9, 10-14, 15-24, 25+.]
Missing data
• Missing data have minimal impact on results
– From ‘Raw’ to ‘Ideal’ data, most correlations change by <0.02
– Very few values change by >0.05
– Exception is NS-SEC 8 [by definition!]
– Correlations lower for ‘Ideal’ than ‘Raw’
• Surrogates calculated direct from Rehearsal
– circumvents data response bias?
Scale
• Higher correlations at higher geographies
• District effect small but significant
– BUT none of districts in SE England
Overfitting
• No significant impact
MAUP
• Correlations vary by up to 0.1 between alternative boundaries at same spatial scale
BUT
• No detectable effect on rankings of surrogate income measures
Adult income (r²), by surrogate and spatial scale
Surrogate               Ward   ED     Post-code
Univariate
  NS-SEC I+II           0.81   0.81   0.64
Multivariate
  Townsend              0.36   0.46   0.38
  Green (wealth)        0.57   0.55   0.50
Geodemographic
  PCA_96                na     0.82   0.69
  Voas                  0.83   0.59   0.48
Model
  Dale                  0.91   0.89   0.90
  Lee                   0.90   0.87   0.88
  Voas (individual)     0.91   0.80   0.83
[See final slide for definition of ‘surrogates’]
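The r² values reported here compare an area-level surrogate with area-level mean income. A minimal sketch of that calculation, for the % NS-SEC I+II surrogate, is given below; the individual records, area labels and field layout are hypothetical, and the Rehearsal analysis itself may have differed in detail.

```python
from collections import defaultdict


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def area_summaries(records):
    """Aggregate (area, in_nssec_1_or_2, weekly_income) records to area level."""
    grouped = defaultdict(list)
    for area, top_class, income in records:
        grouped[area].append((top_class, income))
    pct_top_class, mean_income = [], []
    for members in grouped.values():
        pct_top_class.append(100 * sum(flag for flag, _ in members) / len(members))
        mean_income.append(sum(inc for _, inc in members) / len(members))
    return pct_top_class, mean_income


# Hypothetical adults in three areas.
records = [("A", True, 420), ("A", True, 380), ("A", False, 150),
           ("B", False, 90), ("B", False, 130), ("B", True, 300),
           ("C", False, 60), ("C", False, 110), ("C", False, 95)]
pct, income = area_summaries(records)
print(r_squared(pct, income))
```

Re-running `area_summaries` with ward, ED or postcode identifiers in the first field is how the same surrogate would be scored at each spatial scale.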
Caveats
• ‘Best’ performing surrogates in danger of over-fitting?
– For Dale, Lee and Voas mean occupational income calculated directly from Census Rehearsal dataset (no other SOC2000 sources available at time of analysis)
BUT
– No significant difference if SOC minor or unit codes used
– No significant difference if data partitioned
Household income (r²), by surrogate and spatial scale
Surrogate               Ward   ED     Post-code
Univariate
  NS-SEC I+II           0.82   0.81   0.64
Multivariate
  Townsend              0.48   0.46   0.44
  Green (wealth)        0.61   0.50   0.56
Geodemographic
  PCA_96                na     0.81   0.67
  Voas                  0.81   0.60   0.48
Model
  Dale                  0.90   0.85   0.86
  Lee                   0.87   0.83   0.83
  Voas (household)      0.76   0.74   0.74
[See final slide for definition of ‘surrogates’]
Accuracy
• For many purposes relative, rather than absolute, accuracy is most important, i.e. ranking
[Figures: three scatterplots against observed mean individual income (£ week), horizontal axis 0 to 600.
a) NS-SEC based income surrogate [NSSEC12]: % of economically active in NS-SEC 1+2 (0% to 100%)
b) Regression based estimate [VOASIND]: predicted mean individual income (£ week), 0 to 600
c) Sub-group mean based estimate [LEEINCM]: predicted mean individual income (£ week), 0 to 600]
% ranked in same decile as income
Surrogate/Estimate                 Overall   Within ± 1 decile
% NSSEC 1+2 [NSSEC12]              36        82
Individual Regression [VOASIND]    42        84
Sub-group mean [LEEINCM]           50        89
Ecological Regression              46        92
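The decile comparison can be reproduced for any surrogate along the following lines. This is a sketch under assumptions: deciles are assigned by simple rank order with ties broken by position, and the input values are invented; the slides do not state how ties or unequal decile sizes were handled.

```python
def decile_ranks(values):
    """Assign each value a decile from 1 (lowest) to 10 (highest) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    deciles = [0] * len(values)
    for rank, index in enumerate(order):
        deciles[index] = rank * 10 // len(values) + 1
    return deciles


def decile_agreement(observed, estimated):
    """Return (% of areas in same decile, % of areas within one decile)."""
    obs, est = decile_ranks(observed), decile_ranks(estimated)
    n = len(obs)
    same = sum(o == e for o, e in zip(obs, est))
    near = sum(abs(o - e) <= 1 for o, e in zip(obs, est))
    return 100 * same / n, 100 * near / n


# Hypothetical example: 20 areas, with an estimate that ranks them perfectly.
observed_income = [100 + 15 * i for i in range(20)]
estimated_income = [90 + 12 * i for i in range(20)]
print(decile_agreement(observed_income, estimated_income))
```

Because only ranks enter the calculation, a surrogate on a different scale from income (such as % NS-SEC 1+2) can be compared directly with estimates expressed in £ per week.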
Other data sources
• < 1% of unexplained spatial variation in income attributable to area level effects
• House price has no significant impact
– could be due to data problems
• Council tax band has small but significant effect [for areas of enumeration district size and below]
• Lack of utility counter-intuitive?
– current value ≠ purchase price
– purchase income ≠ current income
Conclusions (I)
• Best approaches capture 80-90% of spatial variation in income, even for smallest spatial units
• But considerable within-area heterogeneity
• Best approaches are regression or sub-group mean based
• Conventional deprivation indices a poor second to % social class / NS-SEC I+II
Conclusions (II)
• Geodemographic classifications at best perform as well as % NS-SEC I+II, and perform best for areas of ward size and above
• Qualified support for use of statistical distributions in modelling top income band means
Implications
Moral for marketers:
• Target people, not places
Moral for policy makers:
• Deprivation indices not the best proxy for income
• ONS ward income estimates (based on ecological regression) likely to perform well
Longer term
• Consider external correlates (e.g. IMD 2000; benefits data)
• Lobby for Census Office to create small-area income estimate
– by imputing income on Census microdata
– include non-census information (?)
Acknowledgements
• House price data were taken from the Experian Limited Postal Sector Data, ESRC/JISC Agreement.
• Grateful thanks are due to the Census Custodians of England, Wales and Scotland for granting permission to access the Census Rehearsal dataset.
• A debt of gratitude is also owed to a number of staff at the Office for National Statistics, in particular Keith Whitfield and Philip Clarke.
• Finally, thanks are due to David Voas for undertaking some of the preparatory work for this project.
• All analyses and conclusions remain my sole responsibility.
Definitions (I)
• NS-SEC I+II: % persons aged 16-74 in NS-SEC I or II
• Townsend: Multiple deprivation indicator based on % economically active unemployed; % overcrowded households; % households with no car and % of households not owner occupied
• Green (Wealth): Affluence indicator based on % households with 2+ cars; % persons aged 16-74 in NS-SEC I and % adults with high educational qualifications
• PCA_96: Geodemographic classification based on principal components analysis of 20 normalised census variables, individuals in each of 96 area types assumed to have mean income of all persons in area type
• Voas: Alternative geodemographic classification, in which five census variables are divided into above or below median, one variable into thirds; with all cross-tabulated to give a total of 96 discrete area types
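As an illustration of how a multivariate indicator such as Townsend is assembled from its four components, the sketch below sums z-scores across areas. The log transformation of the unemployment and overcrowding percentages follows the commonly used form of the index and is an assumption here, since the definition above does not mention it; the three areas and their values are invented.

```python
from math import log
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]


def townsend_scores(areas):
    """Sum of z-scores of the four components; higher means more deprived."""
    unemployed = z_scores([log(a["unemployed"] + 1) for a in areas])
    overcrowded = z_scores([log(a["overcrowded"] + 1) for a in areas])
    no_car = z_scores([a["no_car"] for a in areas])
    not_owned = z_scores([a["not_owner_occupied"] for a in areas])
    return [sum(parts) for parts in zip(unemployed, overcrowded, no_car, not_owned)]


# Hypothetical areas; every value is a percentage.
areas = [
    {"unemployed": 4.0, "overcrowded": 1.5, "no_car": 15.0, "not_owner_occupied": 20.0},
    {"unemployed": 9.0, "overcrowded": 4.0, "no_car": 35.0, "not_owner_occupied": 45.0},
    {"unemployed": 18.0, "overcrowded": 9.0, "no_car": 60.0, "not_owner_occupied": 70.0},
]
scores = townsend_scores(areas)
print([round(score, 2) for score in scores])
```

The Green (Wealth) indicator could be built the same way from its three affluence components, although the slides do not say how those components are combined.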
Definitions (II)
• Dale: Income imputed given mean income for population sub-group defined by sex, SOC 2000 minor group, economic activity (missing; employed full-time; employed part-time; self-employed; other), age (missing; 0-15; 16-19; 20-29; 30-49; 50+) [Maximum of 4860 valid sub-groups]
• Lee: Income imputed given mean income for population sub-group defined by SOC 2000 minor group, economic activity (child; not applicable; employed full-time; employed part-time; self-employed; unemployed; retired; other inactive) [maximum of 649 valid sub-groups]
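The sub-group mean approach behind the Dale and Lee estimates can be sketched as below, using Lee-style groups (occupation minor group by economic activity). The records, occupation codes, field names and the fallback for groups with no observed income are all illustrative assumptions rather than details of either method.

```python
from collections import defaultdict


def subgroup_means(training):
    """Mean income for each (occupation minor group, economic activity) pair."""
    totals = defaultdict(lambda: [0.0, 0])
    for person in training:
        key = (person["soc_minor"], person["activity"])
        totals[key][0] += person["income"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def area_mean_income(people, means, fallback):
    """Impute each person's sub-group mean, then average within each area."""
    by_area = defaultdict(list)
    for person in people:
        key = (person["soc_minor"], person["activity"])
        by_area[person["area"]].append(means.get(key, fallback))
    return {area: sum(values) / len(values) for area, values in by_area.items()}


# Hypothetical records with known income (placeholder occupation codes).
training = [
    {"soc_minor": "231", "activity": "employed full-time", "income": 500.0},
    {"soc_minor": "231", "activity": "employed full-time", "income": 400.0},
    {"soc_minor": "923", "activity": "employed part-time", "income": 100.0},
]
# Hypothetical residents whose income is to be imputed.
people = [
    {"area": "ED1", "soc_minor": "231", "activity": "employed full-time"},
    {"area": "ED1", "soc_minor": "923", "activity": "employed part-time"},
    {"area": "ED2", "soc_minor": "923", "activity": "employed part-time"},
    {"area": "ED2", "soc_minor": None, "activity": "retired"},
]
means = subgroup_means(training)
overall = sum(p["income"] for p in training) / len(training)
print(area_mean_income(people, means, fallback=overall))
```

Adding sex and age band to the grouping key gives a Dale-style version; the finer the grouping, the more sub-groups will be empty and fall back on a coarser mean.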
Definitions (III)
• Voas (individual): Regression model for adult income (children assumed to have 0 income); INCOME^0.5 predicted given: mean income by SOC2000 unit; mean income by Industry category, age, age^2, residents, residents^2, rooms and cars plus dummy variables for sex, white, full-time student, married, Single/Widowed/Divorced, Long-term ill, No qualifications, GCSE or equivalent, A levels or equivalent, Undergraduate degree or equivalent, employed full-time, employed part-time, self-employed, unemployed, retired, permanently sick, other economically inactive excluding pensioners and students, Semi-detached, terrace, flat, caravan, privately rented, social rented, employed manager or supervisor and district of residence
• Voas (household): Regression model for total household income; HHINC^0.5 predicted given same set of predictors as for Voas (individual), but based only upon head of household’s characteristics
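The square-root transformation used in the Voas models can be illustrated with a deliberately reduced example: one predictor instead of the full set listed above, fitted by ordinary least squares, with predictions squared to return to the income scale. The data and the choice of predictor are hypothetical, and how the original analysis back-transformed its predictions is not stated in the slides.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


def fit_sqrt_income(xs, incomes):
    """Regress the square root of income on a single predictor."""
    return fit_line(xs, [sqrt(income) for income in incomes])


def predict_income(x, intercept, slope):
    """Square the predicted root to get back to the income scale."""
    return (intercept + slope * x) ** 2


# Hypothetical adults: predictor is the mean income of the person's occupation.
occupation_mean = [100.0, 225.0, 400.0, 625.0]
income = [81.0, 256.0, 361.0, 676.0]
intercept, slope = fit_sqrt_income(occupation_mean, income)
print(round(predict_income(300.0, intercept, slope), 1))
```

Squaring a predicted root tends to understate the mean on the original scale, because it ignores the residual variance, so area means built this way may sit slightly below the observed means.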