Statistical Disclosure Control for the 2011 UK Census Jane Longhurst, Caroline Young and Caroline Miller (ONS)

Post on 12-Jan-2016

Page 1: Statistical Disclosure Control for the 2011 UK Census Jane Longhurst, Caroline Young and Caroline Miller (ONS)

Statistical Disclosure Control for the 2011 UK Census

Jane Longhurst, Caroline Young and Caroline Miller (ONS)


Outline

• Context
• Workplan
• Progress
• Short-listing the SDC Methods
• Quantitative Evaluation
• Description of the Methods (Advantages and Disadvantages)
• Example Evaluation (Risk-Utility Framework)
• Summary


Context

• The UK takes a census every 10 years.

• Next census due in 2011.

• This will comprise separate, simultaneous Censuses for England & Wales (ONS), Scotland (GROS) and Northern Ireland (NISRA).


Context

• SDC for 2011 Census outputs is a major concern for users

• Different SDC methodologies were adopted for standard tabular 2001 Census outputs across UK

• Late addition of small cell adjustment by ONS/NISRA resulted in high level of user confusion and dissatisfaction

• Publicised commitment to aim for a common UK SDC methodology for all 2011 Census outputs


Workplan

• Phase 1 (March ’06 – Jan ’07)
– UK agreement of key SDC policy issues

• Phase 2 (Jan ’07 – Sept ’08)
– Evaluation of all methods complying with agreed SDC policy position in terms of risk/utility framework and feasibility of implementation

• Phase 3 (Sept ’08 – Spring/Summer ’09)
– Recommendations and UK agreement of SDC methodologies for 2011 Census tabular outputs

• Phase 4 (Feb ’09 onwards)
– Evaluate and develop SDC methods for microdata, future work on output specification, system specification, development and testing


Progress

• The UK SDC Policy Position (Nov ’06) highlighted:
– Key risk is attribute disclosure
– Consideration of pre-tabular and post-tabular methods
– Small cell counts can be included in tables provided uncertainty about the true value is created
– Different access agreements for tabular outputs that are seriously compromised by SDC

• Tolerable threshold not yet determined, but steer towards less conservative approach


Progress

• Development of SDC Strategy
– UK SDC working group established to take forward methodological work
– UKCDMAC subgroup set up to QA work

• Initial stage of methodological research:
– Review of SDC in census context (May ’07)
– Qualitative evaluation of SDC methods for 2011 Census outputs

• Focus on tabular outputs whilst considering impact on other outputs


Progress

• UK SDC working group met in August
– Produced short-list of SDC methods
– SDC methods assessed against criteria in line with Registrars General policy statement

• Formal QA and sign-off of criteria and short-listed SDC methods

• Short-listed methods will undergo thorough quantitative evaluation and should maximise data utility whilst minimising disclosure risk


Short-listing: Criteria

– Method should prevent new information being derived, prevent disclosure by differencing, and enable flexible table generation

– Could use special access arrangements if disclosure control seriously compromises some tabular outputs

– Table design methods applied alongside chosen method


Short-listing: Criteria

• Trade off between risk and utility needs to be evaluated quantitatively

• Many potential SDC methods could be used, but it is not possible to conduct a quantitative evaluation of each method

• Need to consider qualitative aspect using high-level review of advantages and disadvantages of SDC methods

• Qualitative and subsequent quantitative evaluations used in combination to establish recommended SDC method(s) for 2011 Census


Short-listing: Criteria

• Each method assessed against a set of 7 qualitative criteria (primary and secondary):

• Primary criteria
– Additivity and consistency
– Overall user acceptability
– Protection against differencing
– Feasibility of implementation

• Secondary criteria
– Impact on microdata releases
– Simple to understand
– Easy to account for in analyses


Short-listing: Scoring

• Following methods considered for short-listing:
– Record Swapping
– Over-Imputation
– Data Switching
– Post Randomisation Method (PRAM)
– Sampling
– Conventional Rounding
– Random Rounding
– Small Cell Adjustment
– Controlled Rounding
– Semi-Controlled Rounding
– Suppression
– Barnardisation
– ABS Cell Perturbation Method


Short-listing: Scoring

• For each criterion, each method was assigned a score:
– 0 = method does not meet criterion
– 1 = method partly meets criterion
– 2 = method does meet criterion

• Primary criteria given double weighting

• Overall score and ranking assigned to each method

• Methods failing on primary criteria were discounted
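
The scoring rules above can be sketched in a few lines of Python. This is only an illustration: the criterion keys, the method names `method_a` to `method_c` and all of their scores are made up, and treating "failing on primary criteria" as scoring 0 on any primary criterion is an assumption, since the slides do not define the cut-off.

```python
# Hypothetical keys for the 4 primary and 3 secondary criteria listed earlier.
PRIMARY = ["additivity_consistency", "user_acceptability",
           "differencing_protection", "feasibility"]
SECONDARY = ["microdata_impact", "simple_to_understand", "easy_to_account_for"]


def weighted_score(scores):
    """Sum of 0/1/2 scores, with primary criteria given double weighting."""
    return (2 * sum(scores[c] for c in PRIMARY)
            + sum(scores[c] for c in SECONDARY))


def fails_primary(scores):
    """Assumption: a method 'fails' if it scores 0 on any primary criterion."""
    return any(scores[c] == 0 for c in PRIMARY)


def shortlist(methods):
    """Discount methods failing primary criteria, then rank the rest by score."""
    kept = {name: weighted_score(s)
            for name, s in methods.items() if not fails_primary(s)}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)


def make_scores(primary, secondary):
    """Helper for the made-up example data below."""
    scores = dict(zip(PRIMARY, primary))
    scores.update(zip(SECONDARY, secondary))
    return scores


# Made-up scores for three hypothetical methods (not the working group's results).
methods = {
    "method_a": make_scores([2, 2, 2, 2], [1, 1, 1]),
    "method_b": make_scores([2, 2, 2, 0], [2, 2, 2]),  # fails feasibility
    "method_c": make_scores([1, 1, 1, 1], [2, 2, 2]),
}

print(shortlist(methods))
```

With four primary and three secondary criteria, the maximum possible score under this weighting is 2 × (4 × 2) + 3 × 2 = 22.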


Short-listing: Scoring

• Majority of SDC methods failed primary criteria and were discounted from short-list.

• For example:
– PRAM – difficult to implement and not proven for Census data
– Sampling – low user acceptance of weighted tables
– Rounding – low user acceptance of rounding methods
– Suppression – extremely difficult to implement to protect against differencing


Short-listed SDC Methods

• Record swapping

• Over-imputation

• ABS Cell Perturbation method

• Small cell adjustment with record swapping (to provide comparison with 2001)


Quantitative Evaluation

• Examine how methods protect and manage risk and how they impact on data utility

• Plan to use range of 2001 Census tables, varying parameters, different geographies

• Information Loss software will be used to evaluate each short-listed method

• Consideration will be given to other issues, e.g. comparisons over time, communal establishments, imputation rates


What do the methods do?

The short-list
• Record Swapping
• ABS Cell Perturbation
• Over-imputation


Record Swapping - Summary

• 2001 Random Record Swapping method:
• % households swapped across OAs
• Swap within LA to preserve marginal distributions at this level
• Matches found using control variables
– Age
– Gender
– Hard to Count Index (census enumeration)
– Household Size
• All non-geographic fields swapped
• Random / Targeted
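
A minimal sketch of random record swapping along the lines described above, not the 2001 production method. The field names (`la`, `oa`) and the function `swap_households` are hypothetical, and exchanging the OA codes of two matched households is used here as an equivalent of swapping all of their non-geographic fields.

```python
import random
from collections import defaultdict


def swap_households(households, rate, control_vars, seed=0):
    """Swap OA codes between matched household pairs within the same LA.

    households: list of dicts with hypothetical keys "la", "oa" and the
    control variables. rate: approximate fraction of households to swap.
    Returns the protected copy and the set of swapped record indices.
    """
    rng = random.Random(seed)
    result = [dict(h) for h in households]

    def match_key(h):
        # Partners must share the LA and every control variable.
        return (h["la"],) + tuple(h[v] for v in control_vars)

    pools = defaultdict(list)
    for i, h in enumerate(result):
        pools[match_key(h)].append(i)

    n_target = int(rate * len(result))
    order = list(range(len(result)))
    rng.shuffle(order)
    swapped = set()
    for i in order:
        if len(swapped) >= n_target:
            break
        if i in swapped:
            continue
        h = result[i]
        candidates = [j for j in pools[match_key(h)]
                      if j != i and j not in swapped
                      and result[j]["oa"] != h["oa"]]
        if not candidates:
            continue  # no eligible partner: leave this household in place
        j = rng.choice(candidates)
        result[i]["oa"], result[j]["oa"] = result[j]["oa"], result[i]["oa"]
        swapped.update((i, j))
    return result, swapped
```

Because partners always come from the same LA, LA-level totals for every variable are unchanged, and the number of households in each OA is preserved as well. A targeted variant would choose the households to swap by risk rather than uniformly at random; that selection rule is not specified in the slides.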


Record Swapping - Summary

Advantages

• Consistent and additive

• Some protection against differencing

• Risk of inconsistent / illogical records low

• Flexibility of swapping rates

Disadvantages

• Effects of perturbation hidden and hard to measure or account for

• Tables not visibly perturbed

• Geographic fields such as workplace not swapped (Origin-Destination tables not protected)


ABS Cell Perturbation - Summary

• Developed by the Australian Bureau of Statistics

• In use for their 2006 Census data

• Based on random numbers assigned to each record

• Then each table is adjusted independently in two stages:

– (1) Adding perturbations to each cell
– (2) Restoring additivity of whole table


ABS Cell Perturbation - Summary

• Assign each microdata record a random number between 1 and m, called an rkey

• For each cell in a particular table, calculate the cell key (ckey) as a function of the rkeys of the records in the cell:

$\text{ckey} = \left(\sum \text{rkey}\right) \bmod m$

• Using a look-up table, read off the perturbation to add, where ckeys are the columns and original cell values are the rows of the look-up table

• Perturbation added to original cell value

• ABS additivity module not yet evaluated


Example Look-up Table

Original cell value | Perturbation drawn from following distribution (using the cell key)
0 | No perturbation
1 | Normal (0, 2) truncated at -1 and +5
2 | Normal (0, 2) truncated at -2 and +5
3 | Normal (0, 2) truncated at -3 and +5
4 | Normal (0, 2) truncated at -4 and +5
5+ | Normal (0, 2) truncated at -5 and +5
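
The cell-key mechanism and the example look-up table can be sketched as follows. This is an illustrative reading, not the ABS implementation: it assumes the cell key is the sum of the record keys modulo m, that the second parameter of Normal (0, 2) is a standard deviation, and that draws are rounded to integers and pre-generated once per (row, ckey) entry. The names `M`, `build_lookup`, `cell_key` and `perturb` are hypothetical, and the additivity stage is not covered.

```python
import random

M = 100  # hypothetical key range: rkeys lie between 1 and M


def truncated_normal_int(rng, sd, lo, hi):
    """Draw a rounded Normal(0, sd) value, redrawing until it lies in [lo, hi]."""
    while True:
        x = round(rng.gauss(0, sd))
        if lo <= x <= hi:
            return x


def build_lookup(m, seed=0, sd=2.0):
    """Rows are original cell values 0..5 (5 stands for 5+), columns are ckeys."""
    rng = random.Random(seed)
    table = {0: [0] * m}  # zero cells are never perturbed
    for value in range(1, 6):
        # Truncating at -value keeps perturbed counts non-negative.
        table[value] = [truncated_normal_int(rng, sd, -value, 5)
                        for _ in range(m)]
    return table


def cell_key(rkeys, m):
    """Assumed form of the cell key: sum of the record keys modulo m."""
    return sum(rkeys) % m


def perturb(count, rkeys, lookup, m):
    """Add the looked-up perturbation to the original cell count."""
    row = min(count, 5)
    return count + lookup[row][cell_key(rkeys, m)]


lookup = build_lookup(M, seed=1)
rkeys_in_cell = [3, 57, 99]  # made-up record keys for a cell of three records
print(cell_key(rkeys_in_cell, M), perturb(3, rkeys_in_cell, lookup, M))
```

Because the perturbation depends only on the original count and the keys of the records in the cell, the same cell receives the same perturbation in every table in which it appears, which is where the consistency of the method comes from.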


ABS Cell Perturbation - Summary

Advantages

• Tables consistent

• Protects against differencing

• Efficient – allegedly quick run-time

• Flexible – lookup table can be designed to suit needs

Disadvantages

• After additivity stage, consistency is lost to some extent

• Needs to be applied to each table separately


Over-imputation - Summary

• Involves randomly selecting a percentage of microdata records which then have certain variables erased.

• Donors matching on control variables are selected, and the erased variables are then imputed from them

• Various approaches to over-imputation will be considered
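
One possible approach can be sketched as a simple random hot-deck: erase the chosen variables for a random sample of records, then copy them from a randomly chosen donor that matches on the control variables. This is only an assumption about how over-imputation might be done, not the imputation software referred to in the slides; `over_impute` and the field names in the usage note are hypothetical.

```python
import random


def over_impute(records, rate, target_vars, control_vars, seed=0):
    """Re-impute target_vars for a random fraction of records from matching donors.

    Returns the protected copy and the set of selected record indices.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    n_selected = int(rate * len(records))
    selected = set(rng.sample(range(len(records)), n_selected))

    def match_key(r):
        return tuple(r[v] for v in control_vars)

    for i in sorted(selected):
        # Donors are unselected records that agree on every control variable.
        donors = [j for j in range(len(records))
                  if j not in selected
                  and match_key(records[j]) == match_key(records[i])]
        if not donors:
            continue  # no donor available: keep the original values
        donor = records[rng.choice(donors)]
        for v in target_vars:
            out[i][v] = donor[v]  # erased value replaced by the donor's value
    return out, selected
```

For example, `over_impute(records, 0.1, ["occupation"], ["age_band", "sex"])` would re-impute a made-up `occupation` field for roughly 10% of records while leaving the control variables untouched. Targeting risky records would replace the uniform sample with a risk-based selection.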


Over-imputation - Summary

Advantages

• Imputation software already in place

• Can target risky records

• Can protect workplace tables if includes geographical fields

• Provides some protection against differencing

Disadvantages

• Errors (bias and variance of estimates) may be introduced

• Difficult to account for impacts e.g. standard errors at high levels of geography

• Can alter association between characteristics of members within same household


Quantitative Evaluation

• An example of how the quantitative evaluation will be carried out

• Preliminary study comparing swapping and ABS cell perturbation using ideas developed by Natalie Shlomo (framework of balancing risk and utility)


Preliminary Evaluation: Tables used

• 2001 UK Census Tables

• EA: Southampton, Eastleigh, Test Valley (SJ)

Table | Variables | Persons in table | Cells in table | Avg cell size | % zero cells | % small cells
A | Religion(9) * Age-sex(6) * OA(1487) | 437,744 | 80,298 | 5.5 | 59.1 | 12.6
B | Sex(2) * LLTI(2) * Econ-Activity(9) * Ward(70) | 317,064 | 2,520 | 125.8 | 16.9 | 9.0


Measuring Disclosure Risk

• Main risk

– small cells in tables

– small cells in differenced tables

• Disclosure Risk = proportion of records in the small cells that have not been perturbed
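
The risk measure defined above can be computed directly once each record's cell and perturbation status are known. In this sketch, treating cells with an original count of 1 or 2 as "small" is an assumption (the slides do not state the threshold), and `disclosure_risk` and its inputs are hypothetical names.

```python
from collections import Counter


def disclosure_risk(cell_of_record, perturbed, small_max=2):
    """Proportion of records in small cells that have not been perturbed.

    cell_of_record: cell label for each record in the original table.
    perturbed: parallel list of booleans, True if the record was perturbed.
    small_max: largest original count still treated as a small cell (assumed 2).
    """
    counts = Counter(cell_of_record)
    in_small = [i for i, cell in enumerate(cell_of_record)
                if counts[cell] <= small_max]
    if not in_small:
        return 0.0
    unperturbed = sum(1 for i in in_small if not perturbed[i])
    return unperturbed / len(in_small)


# Made-up example: cells b, c and d are small and hold 4 records in total;
# only the single record in cell b was perturbed.
cells = ["a", "a", "a", "b", "c", "c", "d"]
flags = [False, False, False, True, False, False, False]
print(disclosure_risk(cells, flags))  # 0.75
```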


Disclosure Risk: OA and Ward

[Figure: bar chart of disclosure risk (%), axis from 0 to 70, at OA and Ward level for Cell Perturbation, 10% Random Swap and 10% Targeted Swap]


Measuring Information Loss

Utility (information loss) measures compare statistical quality of original and protected tables

• Measure distortion to internal cell distributions

• Compare variance of cell counts

• Measure impact on rank correlations


Distance Metrics at Output Area level

[Figure: bar chart of measure of perturbation, axis from 0 to 12, for 10% Random Swap, 10% Targeted Swap and Cell Perturbation; series: Hellinger's Distance, Relative Absolute Distance, Average Absolute Distance]
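
The three distance metrics named in the figure can be sketched as below. The slides do not give formulas, so these are commonly used count-based forms and should be read as assumptions; in particular, the relative absolute distance here skips cells whose original count is zero, and no averaging over output areas is applied.

```python
import math


def hellinger_distance(orig, pert):
    """sqrt( 1/2 * sum (sqrt(pert) - sqrt(orig))^2 ) over the cells."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig, pert)))


def average_absolute_distance(orig, pert):
    """Mean absolute change per cell."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)


def relative_absolute_distance(orig, pert):
    """Sum of absolute changes relative to the original count (zero cells skipped)."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o > 0)


# Made-up cell counts for one small table, before and after protection.
original = [4, 9, 0, 1]
protected = [4, 16, 0, 0]
print(hellinger_distance(original, protected))         # 1.0
print(average_absolute_distance(original, protected))  # 2.0
print(relative_absolute_distance(original, protected))
```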


Variance of Cell Counts: OA and Ward

[Figure: bar chart of variance ratio, axis from 0.4 to 1.4, at OA and Ward level for 10% Random Swap, 10% Targeted Swap and Cell Perturbation]
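
One plausible reading of the ratio plotted here is the variance of the protected cell counts divided by the variance of the original cell counts, so that a value near 1 indicates little distortion. That reading is an assumption, as is the hypothetical helper `variance_ratio`; the slides do not define the measure.

```python
from statistics import pvariance


def variance_ratio(orig, pert):
    """Population variance of protected counts over that of original counts."""
    return pvariance(pert) / pvariance(orig)


# Made-up cell counts before and after protection.
original_counts = [1, 2, 3, 4]
protected_counts = [0, 2, 3, 5]
print(variance_ratio(original_counts, protected_counts))
```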


Impact on Rank Correlations: OA and Ward

[Figure: bar chart of impact on rank correlations (%), axis from 0 to 40, at OA and Ward level for 10% Random Swap, 10% Targeted Swap and Cell Perturbation]


Summary

• Ongoing progress made for 2011 Census

• Thorough quantitative evaluation of short-list over next year, using 2001 method as benchmark

• Important to strike balance between minimising disclosure risk and maximising data utility

• Qualitative and quantitative evaluations used in combination to establish recommended approach to SDC for 2011 Census

• User communication and consultation will take place throughout the work programme

