Anonymising quantitative data

Dr Sharon Bolton

UK Data Service

UK Data Archive, University of Essex

Anonymising Research Data workshop

Dublin, 22 June 2016

The UK Data Service

• Single point of access to a wide range of social science data: ukdataservice.ac.uk

• Funded by the ESRC to serve the academic community: training and guidance; UK Data Archive established 1967

• Used by academic researchers and students; government analysts; charities; business; research centres; think tanks

• Survey microdata; cohort studies; international macrodata; census data; qualitative/mixed methods data

• Support and guide data creators, including disclosure review (anonymisation) and preparation for archiving

Protecting confidentiality: the ‘5 Safes’

Five guiding principles:

• Safe people - educate researchers to use data safely

• Safe projects - research projects for ‘public good’

• Safe settings - SecureLab system for sensitive data

• Safe outputs - SecureLab project outputs are screened

• Safe data - treat the data to protect respondent confidentiality

• For this session, we will concentrate (mostly) on Safe data

Data collection: planning

• Explain to respondents what archiving entails and gain agreement for data sharing – informed consent

• Think about disclosure risks before starting – what kind of information do you need to collect?

• Direct identifiers include: names; addresses; telephone numbers; email addresses; photos; (perhaps) IP addresses; do you really need them?

• Unless explicit consent obtained for sharing, direct identifiers should always be removed from data
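As a minimal sketch of removing direct identifiers before sharing, the snippet below drops a set of identifier fields from each record. The field names (`name`, `email`, and so on) and the sample records are hypothetical, chosen only for illustration; a real dataset will need its own list.

```python
# Hypothetical field names: adjust the set to match your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "telephone", "email", "photo", "ip_address"}


def drop_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without any direct-identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up respondents, for illustration only.
respondents = [
    {"id": 1, "name": "A. Example", "email": "a@example.org", "age": 34, "region": "East"},
    {"id": 2, "name": "B. Example", "email": "b@example.org", "age": 51, "region": "North"},
]

cleaned = drop_direct_identifiers(respondents)
print(cleaned)
```

The function returns new records rather than editing in place, so the original file can be kept securely while only the cleaned copy is shared.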

Anonymising data: indirect identifiers

Indirect identifiers include:

• Sensitive information: health information/medical conditions; crime victimisation/offending; drug/alcohol use etc.

• ‘Less sensitive’ information: age/birth date; educational characteristics; employment details; religious affiliation; household size; geographic area

• Look at demographics in combination (e.g. demographics + geographies)

• Text/string variables – too detailed?
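One way to look at indirect identifiers in combination is to count how many respondents share each combination of values and flag the rare ones. The sketch below assumes records held as dictionaries; the variable names, the sample data and the threshold of 2 are illustrative assumptions, not a recommended standard.

```python
from collections import Counter


def rare_combinations(records, keys, threshold=2):
    """Return value combinations shared by fewer than `threshold` records."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < threshold}


# Made-up records, for illustration only.
sample = [
    {"age_band": "16-20", "religion": "Catholic", "area": "Colchester"},
    {"age_band": "16-20", "religion": "Catholic", "area": "Colchester"},
    {"age_band": "51-64", "religion": "Sikh", "area": "Colchester"},
]

flagged = rare_combinations(sample, keys=("age_band", "religion", "area"))
print(flagged)
```

Combinations that appear only once are the ones most likely to single out a respondent and are candidates for further aggregation.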

Anonymising indirect identifiers

• Aggregate categories to reduce precision

• Band ages, incomes, expenditure, etc. to disguise outliers

• Use standard coding frames – e.g. SOC2010

• Generalise meaning of detailed text

• Document the changes you make

• Talk to other researchers, archives, data services

Published guides:

• UCD Research Data Management Guide http://libguides.ucd.ie/data/ethics

• ONS Disclosure control guidance for microdata produced from social surveys http://www.ons.gov.uk/methodology/methodologytopicsandstatisticalconcepts/disclosurecontrol/policyforsocialsurveymicrodata


Anonymising data: new developments and tools

Statistical Disclosure Control (SDC) software is available:

• mu-Argus

  • Standalone software package recommended by Eurostat for government statisticians

  • Software and manual: http://neon.vb.cbs.nl/casc/mu.htm

• R tool - sdcMicro (GUI)

  • Software and manual: http://www.inside-r.org/packages/cran/sdcMicro/docs/sdcMicro

  • New documentation being developed by UK Data Service, working with R developers


Quiz 1: disclosive text in job title

Job title                                           Frequency  Valid Percent
nurse                                                      73           73.0
carer for elderly man                                       1            1.0
hospital ward cleaner                                       1            1.0
social science researcher                                   1            1.0
head of dental practice                                     2            2.0
cleaner in electronics factory                              1            1.0
Financial Director, Sunnyview Care Home, Colchester         1            1.0
general manager                                             1            1.0
GP                                                          1            1.0
Manager, Cotterill Village Stores                           1            1.0
works in electronics factory                                1            1.0
on benefits, not working                                    1            1.0
police officer                                              2            2.0
consultant, geriatric psychiatry                            1            1.0
Reetired                                                    1            1.0
retired                                                     1            1.0
Retired                                                     1            1.0
retirement                                                  1            1.0
geography teacher                                           2            2.0
Teacher, music                                              2            2.0
Seondary school teeacher                                    1            1.0
unemployed                                                  1            1.0
web designer                                                2            2.0
Total                                                     100          100.0

Quiz 1: jobs coded with SOC2010

Job title: SOC2010          Frequency  Valid Percent
1131: Director, financial           1            1.0
1171: Manager, general              1            1.0
1190: Manager, retail               1            1.0
2231: Nurse                        73           73.0
2426: Researcher                    1            1.0
2215: Dentist                       2            2.0
2211: Doctor, medical               2            2.0
3312: Officer, police               2            2.0
2314: Teacher, secondary            5            5.0
2137: Designer, web                 2            2.0
6145: Carer                         1            1.0
9139: Worker, factory               1            1.0
9233: Cleaner                       2            2.0
Retired                             4            4.0
Unemployed                          2            2.0
Total                             100          100.0
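The recoding from free-text job titles to SOC2010 categories can be sketched as a keyword lookup. The rules below are a hypothetical subset written for this example and cover only a few of the titles; real SOC2010 coding relies on the full coding index, and anything the rules cannot match (including misspellings such as "Reetired") is returned as `None` for manual review.

```python
# Illustrative keyword rules only: (keyword, SOC2010 label). First match wins.
SOC_RULES = [
    ("financial director", "1131: Director, financial"),
    ("nurse", "2231: Nurse"),
    ("teacher", "2314: Teacher, secondary"),
    ("police", "3312: Officer, police"),
    ("retire", "Retired"),
]


def code_job_title(raw_title, rules=SOC_RULES):
    """Return the label of the first matching rule, or None if nothing matches."""
    text = raw_title.strip().lower()
    for keyword, label in rules:
        if keyword in text:
            return label
    return None  # leave for manual coding


for title in ["nurse", "Financial Director, Sunnyview Care Home, Colchester", "Reetired"]:
    print(title, "->", code_job_title(title))
```

Coding to a standard frame removes the employer and place names that made the original text disclosive, while keeping the occupational information researchers need.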

Quiz 2: detailed religion categories

Religious affiliation   Frequency  Valid Percent
1 Protestant                   41           41.4
2 Anglican                      4            4.0
3 Catholic                     26           26.3
4 Muslim                        8            8.1
5 Sikh                          5            5.1
6 Jehovah's Witness             6            6.1
7 Methodist                     1            1.0
8 Mormon                        1            1.0
9 Baptist                       1            1.0
10 Buddhist                     3            3.0
11 None                         1            1.0
12 No religion                  1            1.0
13 Moravian                     1            1.0
Total                          99          100.0

Quiz 2: religion categories aggregated

Religious affiliation   Frequency  Valid Percent
1 Protestant                   49           49.0
3 Catholic                     26           26.0
4 Muslim                        8            8.0
5 Sikh                          5            5.0
6 Other religion               10           10.0
7 No religion                   2            2.0
Total                         100          100.0
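Aggregating detailed categories can be sketched as a lookup from each detailed response to a broader group. The mapping below is an assumption made for illustration (the slides do not state which detailed categories were merged into which group), and the list of responses is made up.

```python
from collections import Counter

# Assumed mapping from detailed to aggregated categories (illustrative only).
RELIGION_GROUPS = {
    "Protestant": "Protestant",
    "Anglican": "Protestant",
    "Methodist": "Protestant",
    "Baptist": "Protestant",
    "Moravian": "Protestant",
    "Catholic": "Catholic",
    "Muslim": "Muslim",
    "Sikh": "Sikh",
    "Jehovah's Witness": "Other religion",
    "Mormon": "Other religion",
    "Buddhist": "Other religion",
    "None": "No religion",
    "No religion": "No religion",
}

# Made-up responses, for illustration only.
responses = ["Anglican", "Methodist", "Catholic", "Mormon", "None", "No religion", "Sikh"]

# A KeyError here signals a response that the mapping does not yet cover.
aggregated = Counter(RELIGION_GROUPS[response] for response in responses)
print(aggregated)
```

Using an explicit mapping also doubles as documentation of the changes made, and it merges duplicate labels such as "None" and "No religion" in the same step.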

Quiz 3: age in years

Age in years  Frequency  Valid Percent
16                    3            3.0
17                    3            3.0
18                    9            9.0
19                    9            9.0
20                   16           16.0
21                    4            4.0
22                    2            2.0
23                    2            2.0
24                    2            2.0
25                    2            2.0
26                    2            2.0
27                    2            2.0
28                    2            2.0
29                    2            2.0
30                    2            2.0
31                    1            1.0
32                    1            1.0
40                   11           11.0
41                    1            1.0
42                    1            1.0
43                    3            3.0
49                    1            1.0
50                   13           13.0
51                    1            1.0
60                    1            1.0
61                    1            1.0
62                    1            1.0
63                    1            1.0
64                    1            1.0
Total               100          100.0

Quiz 3: banded age

Age (banded)  Frequency  Valid Percent
1 16-20              40           40.0
2 21-30              22           22.0
3 31-40              13           13.0
4 41-50              19           19.0
5 51-64               6            6.0
Total               100          100.0
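Banding can be sketched as assigning each age to a non-overlapping range and re-counting. The band boundaries below are an assumption chosen for this example; the age frequencies are those from the "Quiz 3: age in years" table.

```python
from collections import Counter

# Assumed non-overlapping bands; lower and upper bounds are both inclusive.
AGE_BANDS = [(16, 20), (21, 30), (31, 40), (41, 50), (51, 64)]


def band_age(age, bands=AGE_BANDS):
    """Return the label of the band containing `age`."""
    for lower, upper in bands:
        if lower <= age <= upper:
            return f"{lower}-{upper}"
    raise ValueError(f"age {age} falls outside every band")


# Frequencies from the single-year age table: {age: number of respondents}.
age_counts = {
    16: 3, 17: 3, 18: 9, 19: 9, 20: 16, 21: 4, 22: 2, 23: 2, 24: 2, 25: 2,
    26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 1, 32: 1, 40: 11, 41: 1, 42: 1,
    43: 3, 49: 1, 50: 13, 51: 1, 60: 1, 61: 1, 62: 1, 63: 1, 64: 1,
}

banded = Counter()
for age, count in age_counts.items():
    banded[band_age(age)] += count

print(dict(banded))
# {'16-20': 40, '21-30': 22, '31-40': 13, '41-50': 19, '51-64': 6}
```

After banding, no category holds fewer than six respondents, whereas the single-year table contains several ages reported by only one person.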

Access control

• Don’t over-anonymise: find a balance between protecting respondents’ confidentiality and maintaining the research usability of the data

• Can’t fully anonymise data without removing all the useful detail? Go back to the 5 Safes and think about access control: Safe people, Safe settings, Safe outputs

Access control

• At UK Data Service, data are available under 3 access levels:

  • OPEN – open public access

  • SAFEGUARDED – downloadable, but use is traceable

    • Registered users only (agree not to try to identify any individual respondents)

    • Special agreements/licence: permission-only access; approved projects – usage agreed in advance

  • CONTROLLED – accredited users take a further training course

    • Access via on-site safe setting or virtual secure environment (SecureLab)

    • Outputs disclosure-checked before publication

Anonymising quantitative data: summary

• Informed consent

• Think about level of detail needed before data collection

• Remove direct identifiers

• Check and treat indirect identifiers to reduce disclosure risk

• Document your changes

• Balance anonymisation with access control to preserve data usability

Questions?

Guidance on anonymisation:

• UCD: http://libguides.ucd.ie/data/ethics

• UKDS: www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation

• Managing and Sharing Research Data book: https://uk.sagepub.com/en-gb/eur/managing-and-sharing-research-data/book240297

