an introduction to the large-scale government surveys & samples of anonymised records jo wathan...

Post on 28-Mar-2015


An introduction to the large-scale Government Surveys & Samples of Anonymised Records

Jo Wathan
ESDS (Government) & SARs support team
CCSR, University of Manchester

Today

• What data is available?
• What is it like?
• Considerations when using the data
• How are they used in research?
• How do you access them?
• Resources & Support

Why should you want to know?

Because the data are...
• Very cost effective: data free of charge to academic researchers
• Saves time: no need to conduct a survey
• Access to high quality, well documented data
• Can provide nationally representative data
– allows generalisation to population
• Allows historical and geographical comparisons to be made
• ESRC funded data support services

What data am I talking about?

• UK is particularly rich in microdata which is available for secondary analysis

• Today focus on cross-sectional microdata from government surveys and the Census
– Samples of Anonymised Records
– ESDS Government Surveys (e.g. LFS, GHS)

• Other major sources:
– Longitudinal data (e.g. LS, BHPS)
– International microdata (e.g. ESS)
– ESDS core function/UK Data Archive
– Aggregate data

The Samples of Anonymised Records (SARs)

• Microdata samples from Census 1991 & 2001

• Available for the first time after research into the confidentiality risk

• More flexible than conventional aggregate tables

SAR Files

                                  Individual          Household                                   Small Area Microdata
1991 (GB/NI)                      2% with SAR area    1% with Region                              -
2001 licensed data                3% with GOR (UK)    1% England & Wales only (special licence)   5% with LA/UA/PC
2001 Controlled Access Microdata  3% with LA/UA/PC    1% with LA/UA/PC                            -

What’s in the SARs?

• UK Census microdata
• Census has high response rate because compulsory
– 1991: only enumerated cases in data
– 2001: missing people are ‘imputed’

• Census topics only – brief self-completion form
– Accommodation, transport, socio-economic characteristics, ethnicity, religion, health

• Anonymised and data limited to ensure confidentiality
– Most restrictive in the end user licence files for 2001, e.g. less geography in the individual and household files, age banded
– Unusual cases perturbed

• Extremely large sample sizes!

ESDS Government Surveys

• General Household Survey
• Labour Force Survey
• Family Resources Survey
• Expenditure and Food Survey (previously the National Food Survey and Family Expenditure Survey)
• ONS Omnibus Survey
• National Travel Survey
• Time Use Survey
• British Crime Survey/Scottish Crime Survey
• British Social Attitudes/Scottish Social Attitudes/Northern Ireland Life & Times/Young People’s Social Attitudes
• Health Survey for England/Wales/Scotland
• Survey of English Housing (England only)

What are ESDS Government data like?

• ‘Nationally’ representative survey microdata

• Large sample sizes (but smaller than the SARs)

• Identifying information is removed

• Most are conducted on an annual basis

• Continuous surveys – always up-to-date

• Cross-sectional (although the LFS has a 5-quarter panel element)

• Specialist topic surveys – more depth than the Census

All of these microdata are:

• Individual information akin to the sort of data you would collect if you were conducting your own survey
• Need to be analysed in an appropriate software package (like SPSS or Stata)
• Cross-sectional snapshots (exception: the LFS is actually 5 snapshots per address!)
• Good quality, collected by a professional data collection organisation
– Office for National Statistics
– National Centre for Social Research
• Collected for policy purposes
• Has good quality documentation & support services

Thinking about using the data?

1. What is your research question?
2. What evidence do you need to answer your research question?
3. Is the evidence you need already available?
• Check the literature and published reports.
4. Is cross-sectional secondary microdata appropriate for your research question?
• Is your question quantitative?
• Do you need to follow individuals over time?
5. Is data available?

Locating and assessing data

• Locating data:
– What data is available for my topic?
– Are the variables I need available?

• Assessing data for analysis:
– What population is the sample drawn from?
– What sampling scheme was used?
– Do I need to weight?

What datasets cover my topic?

• Question Bank http://qb.soc.surrey.ac.uk – has topic guides and a search engine across questionnaires

• Census topics:
– Limited due to legislation, scale & self-completion
– View the codebooks on the SARs web pages to see what data is in which files

• Finding topics in surveys:
– Much wider range of topics from large number of different sources
– ESDS Government topic guides on employment, health, social capital, Scotland
– ESDS/UK Data Archive search engine

What variables are available for my topic?

• To understand the variables you have available
– View the documentation/user guide
– A list of variables & codings should be available
– Information on how derived variables were created should be available
– Double check in the dataset!

What do the variables mean?

Unless...
• You can track your variable back to the question(s) asked on the questionnaire
• You know who the questions were asked of
• And you know what was done with the raw data to turn it into the final data...

...you don’t understand the data

Routeing in the documentation: GHS

Variable Name : ECSTILO
Variable Label : Economic status (harmonised)
Topic : Employment
Population : Adults
Hhld/indiv. level : Individual
Range : 1 to 10
Missing values : -6, -8

1 'Working (incl Unpaid FW'
2 'Gov sch with emp'
3 'Gov sch at coll'
4 'Unemployed (ILO)'
5 'Other Unemployed'
7 'Retired'
6 'Perm unable to work'
8 'Keeping house'
9 'Student'
10 'Other inactive'
-8 'NA, ECSTA not known'
-6 'Child/No int'.

Derived variables

DO IF SCHEDTYP = 3 OR AGE LT 16.
+ COMPUTE ECSTILO = -6.
ELSE.
+ DO IF DVILO3A = 1.
+   DO IF SCHEMEET = 1.
+     DO IF TRN = 1.
+       COMPUTE ECSTILO = 2.
+     ELSE IF TRN = 2.
+       COMPUTE ECSTILO = 3.
+     END IF.
+   ELSE.
+     COMPUTE ECSTILO = 1.
+   END IF.
+ ELSE IF DVILO3A = 2.
+   COMPUTE ECSTILO = 4.
+ ELSE IF DVILO3A = 3.
+   DO IF YINACT = 1.
+     COMPUTE ECSTILO = 9.
+   ELSE IF YINACT = 2.
+     COMPUTE ECSTILO = 8.
+   ELSE IF YINACT = 3.
+     COMPUTE ECSTILO = 10.
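The visible part of the SPSS routine can be read as a decision tree. The Python sketch below mirrors only the branches shown on the slide. The function name `derive_ecstilo` and the use of `None` for any branch the excerpt does not show are assumptions made for illustration; this is not the official GHS derivation.

```python
from typing import Optional


def derive_ecstilo(schedtyp: int, age: int, dvilo3a: int,
                   schemeet: int, trn: int, yinact: int) -> Optional[int]:
    """Mirror the visible branches of the ECSTILO derivation.

    Returns None where the slide excerpt does not show what happens
    (the excerpt is cut off before the remaining categories).
    """
    # Children and non-interviewed cases get the missing code -6.
    if schedtyp == 3 or age < 16:
        return -6
    if dvilo3a == 1:                 # in employment (ILO definition)
        if schemeet == 1:            # on a government scheme
            if trn == 1:
                return 2             # 'Gov sch with emp'
            if trn == 2:
                return 3             # 'Gov sch at coll'
            return None              # not covered by the excerpt
        return 1                     # 'Working'
    if dvilo3a == 2:
        return 4                     # 'Unemployed (ILO)'
    if dvilo3a == 3:                 # economically inactive
        if yinact == 1:
            return 9                 # 'Student'
        if yinact == 2:
            return 8                 # 'Keeping house'
        if yinact == 3:
            return 10                # 'Other inactive'
    return None                      # branches not shown on the slide
```

Reading the routine this way shows which questionnaire variables feed the derived variable, and therefore who could and could not be assigned each code.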

The population base: nation

• Most large scale surveys seek to be nationally representative, but what is a nation?
– Labour Force Survey = UK
– General Household Survey = GB (but strange things can happen North of the Caledonian Canal)
– Health Survey for England = England
– Not always apparent from the name
– Increase of country-specific surveys following devolution

• Over 80% of the population live in England (9% Scotland, 5% Wales, 3% NI), so surveys designed for UK-wide analyses will not generally have large enough samples to analyse separate countries
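To make the sample-size point concrete, the sketch below spreads a UK-wide sample across countries in proportion to population. The Scotland, Wales and NI shares are the ones quoted above; the England share of 83% is an assumption (the remainder after the other three), and the sample size of 10,000 is purely illustrative.

```python
# Population shares: Scotland, Wales and NI as quoted on the slide.
# The England figure is an assumption: 100% minus the other three.
POPULATION_SHARE = {
    "England": 0.83,
    "Scotland": 0.09,
    "Wales": 0.05,
    "Northern Ireland": 0.03,
}


def expected_country_counts(sample_size: int) -> dict:
    """Expected respondents per country under proportionate sampling."""
    return {country: round(sample_size * share)
            for country, share in POPULATION_SHARE.items()}


counts = expected_country_counts(10_000)
for country, n in counts.items():
    print(f"{country}: about {n} respondents")
```

Under these assumptions a 10,000-person UK sample leaves only a few hundred respondents in the smaller countries, before any further breakdown by age, sex or ethnic group.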

Population base: type of survey

• Most large scale surveys are household surveys: they interview 1+ person in private households
– This will exclude people in institutions
– Has knock-on effects for particular topics: health, age etc.

• Surveys tend to gather limited information about children
– May only relate to their existence, age and relationships to other household members
– There may also be other age restrictions on all or part of the survey

Population base - setting

• You may need to subset to obtain a reasonable database
– SARs 1991 could double count visitors (at place of residence AND location on Census night)
– SARs 2001 can double count students (at place of term-time residence AND parental address)
– Need to subset to prevent double counting

The sampling strategy will affect your results

• Few data sources approximate simple random sampling – the SARs does

• Stratification increases the precision of estimates – the Labour Force Survey is stratified

• Clustering reduces the precision of estimates – e.g. the General Household Survey

• Many major surveys use stratification and clustering

• Guidance should be available in the documentation

• PEAS website
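One common way to quantify how clustering reduces precision is Kish's design effect approximation for equal-sized clusters, deff = 1 + (m - 1) * rho, where m is the number of respondents per cluster and rho is the intra-cluster correlation. This formula is not taken from the slides, and the values of m and rho below are illustrative assumptions, not figures for any named survey.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n: int, cluster_size: float, icc: float) -> float:
    """Size of the simple random sample that would give the same precision."""
    return n / design_effect(cluster_size, icc)


# Illustrative values only: 10,000 respondents, 20 per cluster, rho = 0.05.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(10_000, cluster_size=20, icc=0.05)
print(f"design effect {deff:.2f}, effective sample size {n_eff:.0f}")
```

In practice the survey documentation is the place to look for published design factors rather than computing them from assumed values.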

Disproportionate sampling

• The British Social Attitudes survey takes only 1 person per household
– If left like this the chance of selection in the sample would be inversely proportional to the size of one’s household

• Over-sampling in order to obtain satisfactory sample sizes for minority groups (often referred to as ‘boosts’)
– Health Survey for England has done this with ethnic minorities
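When one person is selected per household, a simple correction is to weight each respondent by the number of eligible people in their household, then rescale so the weights sum to the achieved sample size. The sketch below is a minimal illustration of that idea. The function name and the example household sizes are hypothetical, and real survey weights (including those supplied with the BSA) usually involve further adjustments.

```python
from typing import List


def selection_weights(eligible_counts: List[int]) -> List[float]:
    """Weights proportional to household size, scaled to sum to n.

    eligible_counts[i] is the number of people in respondent i's
    household who could have been selected for interview.
    """
    n = len(eligible_counts)
    total = sum(eligible_counts)
    return [count * n / total for count in eligible_counts]


# Four hypothetical respondents from households of 1, 2, 3 and 2 eligible people.
weights = selection_weights([1, 2, 3, 2])
print(weights)
```

Respondents from larger households, who had a smaller chance of selection, receive weights above 1; single-person households receive weights below 1.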

Weighting can be used to prevent bias from disproportionate sampling

Number in household including R? (Q37)

             Weighted               Unweighted
Number   Frequency  % of all    Frequency  % of all
1            759.2      17.1         1326      29.9
2           1608.4      36.3         1522      34.3
3            838.3      18.9          671      15.1
4            774.6      17.5          596      13.4
5            311.3       7            232       5.2
6             91.4       2.1           57       1.3
7             31.4       0.7           16       0.4
8             13.8       0.3            9       0.2
9              1.1       0              1       0
10             1.7       0              1       0
12             1.1       0              1       0
Total       4432.1     100           4432     100

Dataset: British Social Attitudes Survey, 2003
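The percentage columns in the table can be recomputed from the frequency columns, which makes the effect of weighting easy to see. The sketch below uses the frequencies exactly as printed; the function name is hypothetical. Because the printed weighted frequencies are rounded, their sum may differ slightly from the printed total.

```python
# Frequencies as printed in the table (household size -> frequency).
weighted = {1: 759.2, 2: 1608.4, 3: 838.3, 4: 774.6, 5: 311.3, 6: 91.4,
            7: 31.4, 8: 13.8, 9: 1.1, 10: 1.7, 12: 1.1}
unweighted = {1: 1326, 2: 1522, 3: 671, 4: 596, 5: 232, 6: 57,
              7: 16, 8: 9, 9: 1, 10: 1, 12: 1}


def percent_distribution(freqs: dict) -> dict:
    """Share of the total in each category, as a percentage to 1 d.p."""
    total = sum(freqs.values())
    return {size: round(100 * f / total, 1) for size, f in freqs.items()}


weighted_pct = percent_distribution(weighted)
unweighted_pct = percent_distribution(unweighted)
for size in weighted:
    print(size, weighted_pct[size], unweighted_pct[size])
```

The unweighted column over-represents one-person households, because their single member is certain to be the one selected; weighting pulls that share down and raises the shares of larger households.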

Non-response trends – another reason for weighting

Source: Barton in ESDS weighting guide, http://www.esds.ac.uk/government/docs/weighting.pdf

Imputation: 2001 SARs

                 Not ONC imputed   ONC imputed
White                       94.8           5.2
Mixed                       91.5           8.5
Asian                       84.6          15.4
Black                       76.5          13.5
Chinese/Other               85.6          14.4
All                         93.8           6.2

Exercise

Suggest datasets which would fulfil the following criteria, for a range of employment projects:

1. A large up-to-date UK dataset with extensive questions on employment and training
2. The maximum possible sample size for a single time point to allow minority groups to be distinguished in analysis
3. Any 1960s employment microdata
4. A dataset with extensive questions on income from sources other than just earnings
5. A dataset which could be used to look at attitudes to work

What would you use the data for?

• Straightforward secondary analysis
– To assess theoretical accounts
– To quantify characteristics or behaviours
– To challenge official views
– To apply alternative definitions

• Context to your own primary research
– Your research could be quantitative or qualitative
– To assess the national context of an area study
– To assess whether your sample is typical
– To assess the scale of behaviours

Practical research uses of the data

• Looking at change over time

• Look at sub-populations

• Using the flexibility of the data to look at alternative definitions

• Looking within households

Secondary analysis: change for subpopulations

[Figure: Smoking and social class, men. Percentage (axis 0 to 45) by year, 1994 to 2001, with series for all, SC I & II, and SC IV & V. Source: HSE; Marmot, M (2003)]

Using successive cross-sectional data over time

Pros…
• Reasonable amount of comparability
• Can pool years/quarters
• Data is representative at each time point
• Good at looking at impacts on groups

Cons…
• Limits to continuity in the data (e.g. ethnic)
• Cannot establish individual change

Looking at small populations

• Many surveys with 10k+ respondents
– Permits minority groups to be represented
– For rare subpopulations the sample size may be too small… can consider combining years if appropriate

• Largest sample sizes available from the Samples of Anonymised Records
– The Small Area Microdata file contains nearly 3 million records!

Survey data is subject to sampling error!

Example: Pregnancy and Employment

•Using 1998-99 General Household Survey data alone there are only 168 pregnant women aged 16-49

•95% Confidence interval for % pregnant women economically inactive 34.2 – 49.1%

•Combined 3 years’ data to obtain sample of 465 pregnant women

•Confidence interval using 3 years’ data: 34.9 – 43.9%

Combining datasets to increase sample size
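The gain from pooling years can be sketched with the usual normal-approximation confidence interval for a proportion. Two assumptions are made here: the point estimates are taken as the midpoints of the quoted intervals (41.65% and 39.4%), since the slide does not state them, and the interval treats the data as a simple random sample, ignoring any design effect. This is not necessarily how the original intervals were calculated.

```python
import math
from typing import Tuple


def proportion_ci(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# Point estimates are assumed: midpoints of the intervals on the slide.
single_year = proportion_ci(0.4165, 168)
pooled = proportion_ci(0.394, 465)

for label, (low, high) in [("1 year", single_year), ("3 years", pooled)]:
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the single-year interval comes out very close to the one quoted, and the pooled interval is noticeably narrower, which is the point of combining years.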

Using the flexibility of the data to look at alternative definitions

What are ‘hours worked’?
• Is it just paid work? Or unpaid as well?
• Hours usually worked, or actually worked last week?
• In main job, or in any job?
• What about students?
• Overtime – paid?
• Overtime – unpaid?
• Lunch hours?
• Do non-workers work zero hours or should they be excluded?
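Because microdata hold the separate components, each of these choices can be made explicit in the analysis. The sketch below shows how the mean of 'hours worked' shifts with the definition. The record layout, field names and the three example respondents are all hypothetical; they are not variables from any of the surveys named here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Respondent:
    # Hypothetical fields for illustration only.
    in_work: bool
    main_job_hours: float = 0.0
    paid_overtime: float = 0.0
    unpaid_overtime: float = 0.0
    second_job_hours: float = 0.0


def hours_worked(r: Respondent, paid_overtime: bool = True,
                 unpaid_overtime: bool = False,
                 second_job: bool = False) -> float:
    """Total weekly hours under one chosen definition."""
    total = r.main_job_hours
    if paid_overtime:
        total += r.paid_overtime
    if unpaid_overtime:
        total += r.unpaid_overtime
    if second_job:
        total += r.second_job_hours
    return total


def mean_hours(sample: List[Respondent], non_workers_as_zero: bool,
               **definition: bool) -> float:
    """Mean hours, either counting non-workers as zero or excluding them."""
    if non_workers_as_zero:
        values = [hours_worked(r, **definition) if r.in_work else 0.0
                  for r in sample]
    else:
        values = [hours_worked(r, **definition) for r in sample if r.in_work]
    return mean(values)


sample = [
    Respondent(True, main_job_hours=35, paid_overtime=5, unpaid_overtime=2),
    Respondent(True, main_job_hours=20, second_job_hours=8),
    Respondent(False),
]

workers_only = mean_hours(sample, non_workers_as_zero=False)
everyone = mean_hours(sample, non_workers_as_zero=True)
broadest = mean_hours(sample, non_workers_as_zero=False,
                      unpaid_overtime=True, second_job=True)
print(workers_only, everyone, broadest)
```

The same three records give three different averages, so the definition used should always be reported alongside the result.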

Hierarchical data: conceptually

Household 1: North West, Social rented
– Person 1: HoH, Female, 28, GCSE, P/T Work, No LTILL
– Person 2: Son of HoH, Male, 12, N/A, N/A, No LTILL

Household 2: Wales, Owner occupier
– Person 1: HoH, Male, 33, Degree, F/T Employee, No LTILL
– Person 2: Spouse of HoH, Female, 31, Degree, P/T Employee, No LTILL
– Person 3: Parent of HoH, Female, 72, No quals, Econ Inactive, LTILL
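The two example households can be written as nested records to show the two usual operations on hierarchical data: copying household attributes down to each person, and aggregating person attributes up to the household. The class and function names are hypothetical, only some of the person attributes are carried over, and the set of statuses counted as 'in work' is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Person:
    relationship: str
    sex: str
    age: int
    economic_status: str


@dataclass
class Household:
    household_id: int
    region: str
    tenure: str
    people: List[Person] = field(default_factory=list)


# Assumption: these statuses from the example are treated as being in work.
IN_WORK = {"P/T Work", "F/T Employee", "P/T Employee"}

households = [
    Household(1, "North West", "Social rented", [
        Person("HoH", "Female", 28, "P/T Work"),
        Person("Son of HoH", "Male", 12, "N/A"),
    ]),
    Household(2, "Wales", "Owner occupier", [
        Person("HoH", "Male", 33, "F/T Employee"),
        Person("Spouse of HoH", "Female", 31, "P/T Employee"),
        Person("Parent of HoH", "Female", 72, "Econ Inactive"),
    ]),
]


def person_level_rows(hhs: List[Household]) -> List[dict]:
    """One row per person, with household attributes copied down."""
    return [
        {"household_id": h.household_id, "region": h.region,
         "tenure": h.tenure, "relationship": p.relationship,
         "age": p.age, "economic_status": p.economic_status}
        for h in hhs for p in h.people
    ]


def workers_per_household(hhs: List[Household]) -> Dict[int, int]:
    """Aggregate up: number of people in work in each household."""
    return {h.household_id: sum(p.economic_status in IN_WORK for p in h.people)
            for h in hhs}


rows = person_level_rows(households)
workers = workers_per_household(households)
print(len(rows), workers)
```

A household-level count like this is the kind of derived measure behind analyses of workless households: a household with a count of zero has nobody in work.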

Workless households (source: FES, various years 1968-1996)

[Figure: line chart of percentage (of present working age hoh), axis 0 to 25, by year, 68 to 96, with series for workless households and children in workless households]

Source: Richard Dickens, Paul Gregg and Jonathan Wadsworth (2000) ‘New Labour and the Labour Market’, CMPO Working Paper Series 00/19, Table 5

Finding out about what’s been/being done with the data

• User meetings
– General Household Survey
– Labour Force Survey
– Health Surveys
– Samples of Anonymised Records

• ESDS Government
– Publications database
– Usage pages

Accessing & Support Services

• The data teams:
– ESDS Government
– SARs team at CCSR
• Registering to use the data
• Special licence and CAM data
• Getting support

SARs Data team

• CENSUS MICRODATA SUPPORT
• http://www.ccsr.ac.uk/sars
• Register for the data
• Access SARs documentation for all SARs datasets
• Explore data online or download datasets in SPSS, Stata, or tab delimited form for:
– 1991 data, 2001 Individual licensed file, 2001 Small Area Microdata
• Information about 2001 Special Licence Household SAR – link to UK Data Archive for download

ESDS Government

• MAJOR CROSS-SECTIONAL UK SURVEYS
• http://www.esds.ac.uk/government
• Survey pages
• Introductory guides and resources including topic guides, weighting guide, software guides
• Links to relevant external resources
• Links to the UK Data Archive to:
– Register for the data
– Download the data in Stata, SPSS etc.
– Explore the data online in Nesstar
– Access documentation

The licence

• All users need to be licensed
• Academics complete the licence as part of the Census Registration System process
• Non-academic users contact UK Data Archive (Surveys) or CCSR (SARs) to arrange registration – charges may apply
• Cannot pass the data to an unlicensed user
• Cannot attempt to identify an individual

The licence – good practice

• Keep your data password protected
• Destroy your data when you have finished using it
• Remove files before passing on your PC to someone else
• Tell the data team about your publications
• Tell the data team if you leave your institution

Special licence files

• Special licence is a new way of making more detailed data available to social researchers
– Annual Population Survey data
– Household SAR 2001
• Full & legally binding paper registration process – requires institutional signature & ONS approval
• Must agree to extensive data stewardship conditions

Controlled Access Microdata

• SARs Controlled Access Microdata designed for professional researchers who have no other data options open to them
• Access in safe setting only at ONS site
• Specification on SARs website
• Individual file and Household file
• Files contain much more detail, e.g.
– Individual year of age (topcoded at 95)
– Full coding on country of birth
– SOC Unit Group
– Local authority geography
– Index of Deprivation for SOAs
– Index of Deprivation for migrants’ last address
• Further information and appropriate forms at http://www.statistics.gov.uk/census2001/sar_cams.asp
• Contact sars@ons.gsi.gov.uk for more details

User support

SARs:
helpdesk email: sars-helpdesk@manchester.ac.uk
tel: (0161) 275 4262
SARS jiscmail list
http://www.ccsr.ac.uk/sars

ESDS Government:
helpdesk email: govsurveys@esds.ac.uk
tel: (0161) 275 1980
ESDS-Govsurveys jiscmail list
http://www.esds.ac.uk/government
