1 assessing the impact of sdc methods on census frequency tables natalie shlomo southampton...

25
1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

Upload: annabella-smith

Post on 02-Jan-2016

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

1

Assessing the Impact of SDC Methods on Census Frequency Tables

Natalie Shlomo

Southampton Statistical Sciences Research Institute

University of Southampton

Page 2: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

2

Topics:

• Introduction• Disclosure risk • SDC methods for protecting Census frequency tables• Disclosure risk and data utility measures• Description of table• Risk-Utility analysis • Summary of Analysis • Discussion and future work

Page 3: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

3

• Disclosure risk in Census tables:

• Need to protect many tables from one dataset containing population counts which can be linked and differenced

• Need to consider output strategies for standard tables and web based table generating applications

• Need to interact with users and develop SDC framework with a focus on both disclosure risk and data utility

Introduction

Identification Individual Attribute Disclosure

Page 4: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

4

Disclosure Risk

For Census tables: • 1’s and 2’s in cells are disclosive since these

cells lead to identification,

• 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

Consideration of disclosure risk:• Threshold rules (minimum average cell size, ratio of small cells to

zeros, etc.)

• Proportion of high-risk cells (1 or 2)

• Entropy (minimum of 0 if distribution has one non-zero cell and all others zero, maximum of (log K) if all cells are equal).

Page 5: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

5

SDC Methods for Protecting Frequency Tables

1. Pre-tabular methods (special case of PRAM)

Random Record Swapping

Targeted Record Swapping

In a Census context, geographical variables typically swapped to avoid edit failures and minimize biasImplementation:

Randomly select p% of the households

Draw a household matching on set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables Can target records for swapping that are in high-risk cells of size 1 or 2

Page 6: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

6

SDC Methods for Protecting Frequency Tables2. Rounding Unbiased random rounding

Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw

Example: For unbiased random rounding to base 3: 1 0 w.p of 2/3 1 3 w.p 1/3 2 0 w.p of 1/3 2 3 w.p 2/3

Expectation of rounding is 0 Margins and internal cells

rounded separately Small cell rounding: internal cells aggregated to obtain margins

Confidence Interval for Totals

-100-80-60-40-20

020406080

100

0 100 200 300 400 500 600 700 800 900 1000

Number of Perturbed Cells

Inte

rval

of

Err

or

Page 7: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

7

SDC Methods for Protecting Frequency Tables

2. Rounding (cont.)

Semi-controlled unbiased random rounding

Control the selection strategy for entries to round, i.e. use a “without replacement” strategy

Implementation:

- Calculate the expected number of entries to round up

- Draw an srswor sample from among the entries and round up, the rest round down.

Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)

Eliminates extra variance as a result of the rounding

Page 8: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

8

SDC Methods for Protecting Frequency Tables

2. Rounding (cont.)

Controlled rounding

Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)

- Uses linear programming techniques to round entries up or down, results similar to deterministic rounding

- All rounded entries add up to rounded margins

- Method not unbiased and entries can jump a base

Page 9: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

9

SDC Methods for Protecting Frequency Tables3. Cell Suppression

Hypercube method (Giessing, 2004)

Feature in Tau-Argus and suited for large tables

Uses heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell with optimality conditions

Imputing suppressed cells for utility evaluation:

Replace suppressed cell by the average information loss in each row/column.

Example: Two suppressed cells in a row and known margin is 500. The total of non-suppressed cells is 400. Each cell is replaced with a value of 50

Page 10: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

10

Disclosure Risk MeasuresNeed to determine output strategies and SDC together

• Hard-copy tables, non-flexible categories and geographies: can

control SDC methods to suit the tables• Web-based tables and flexible categories and geographies:

need to add noise or round for every query

Disclosure risk measures:• Proportion of high-risk cells (C1 and C2) not protected

• Percent true zeros out of total zeros

21

21

)(

1CC

CCii

n

imputedorperturbednotRI

DR

pertorig

orig

CC

CDR

00

02

Page 11: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

11

• Distance metric - distortion to distributions (Gomatam and

Karr, 2003):

Internal cells:

Let be a table for row k, the number of rows, and the cell frequency for cell c,

Margins:

Let M be the margin, the number of categories, the number of persons in the category:

rn

k kc

korig

kpert

rorigpert cDcD

nDDHD

1

2))()((2

11),(

kD ( )kD crn

Utility Measures

MN

l

lorig

lpertorigpert NNNNHDM

1

2

2

1),(

Mn lNthl

Page 12: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

12

Utility Measures

• Impact on Tests for Independence:

Cramer’s V measure of association: where is the Pearson chi-square statistic

Same utility measure for entropy and the Pearson chi- square statistics

Impact on log linear analysis for multi-dimensional tables, i.e. deviance

)1(),1min(

2

CRnCV

2

( ) ( )

( , ) 100( )

pert origpert orig

orig

CV D CV DRCV D D

CV D

Page 13: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

13

Utility Measures

• “Between” Variance:

Let be a target proportion for a cell c in row k,

and let be the overall

proportion across all rows of the table

The “between” variance is defined as:

and the utility measure is:

( )korigP c

( )( )

( )

korigk

orig korig

c k

D cP c

D c

r

r

n

k kc

korig

n

k

korig

orig

cD

cDcP

1

1

)(

)()(

rn

korig

korig

rorig cPcP

ncPBV

1

2))()((1

1))((

))((

)))(())(((100))(),((

cPBV

cPBVcPBVcPcPBVR

orig

origpertorigpert

Page 14: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

14

Utility Measures

• Variance of Cell Counts:

The variance of the cell count for row k:

)(1

)(1

rn

k

korig

rorig DV

nDV

where is the number of columns

The average variance across all rows:

kn

The utility measure is:

)(

))()((100),(

orig

origpertpertorig DV

DVDVDDRDV

2))((1

1)( k

origkc

korig

k

korig DcD

nDV

Page 15: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

15

Description of Table

• 2001 UK Census Table:

Rows: Output Areas (1,487)

Columns: Economic Activity (9) * Sex (2)* Long- Term Illness (2)

Table includes 317,064 persons between 16-74 in 53,532 internal cells

Average cell size: 5.92 although table is skewed

Number of zeros: 17,915 (33.5%)

Number of small cells: 14,726 (27.5%)

Page 16: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

16

Percent Unperturbed Small Cells

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Original 10%Random

10%Target

20%Random

20%Target

Percent True Zeros

00.1

0.20.3

0.40.50.6

0.70.8

0.91

Page 17: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

17

Hellinger's Distance Margins OAs

0123456789

10

Hellinger's Distance Internal Cells

0123456789

10

Page 18: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

18

Difference in Cramer's V (Original=0.121)

-10

-5

0

5

10

15

20

25

30

Page 19: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

19

Difference in Variance of Cell Counts (Original=188.3)

-3

-2

-1

0

1

2

3

Difference in Between Variance (Original=0.00023)

-15-10-505

10152025303540

Page 20: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

20

Risk-Utility Map

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

00.511.522.533.5

HD

Pro

p.

tru

e ze

ros

RR 3

T 20

R 20

R 10

T 10

RR 5

Sup

Page 21: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

21

Summary of Analysis

• Rounding eliminates small cells but need to protect against disclosure by differencing and linking when random rounding

• Rounding adds more ambiguity into the zero counts

• Random rounding to base 5 has greatest impact on distortions to distribution

• Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells

• Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding

• Cell suppression with simple imputation method has highest utility (no perturbation on large cells) but difficult to implement in a Census

Page 22: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

22

Summary of Analysis

• High percent of true small cells in record swapping and less ambiguity of zero cells

• Record swapping has less distortion to internal cells than rounding which increases with higher swapping rates

• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells

• Column margins of the table have no distortion because of controls in swapping

• Combining record swapping with rounding results in more distortion but provides added protection

Page 23: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

23

Summary of Analysis• Record swapping across geographies attenuates: - loss of association (moving towards independence)

- counts “flattening” out - proportions moving to the overall proportion

• Attenuation increases with higher swapping rates Targeted record swapping has less attenuation than random swapping

• Rounding introduces more zeros: - levels of association are higher - cell counts “sharper” Effects less severe for controlled rounding

• Combing record swapping and rounding cancel out opposing effects depending on the direction and magnitude of each procedure separately

Page 24: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

24

Discussion• Choice of SDC method depends on tolerable risk thresholds and

demands for “fit for purpose” data

• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, i.e. ABS developed microdata keys for consistency in rounding

• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables

• Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)

• Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)

Page 25: 1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

25

Natalie Shlomo

[email protected]