Assessing the Impact of SDC Methods on Census Frequency Tables


Post on 11-Feb-2016


Natalie Shlomo

Southampton Statistical Sciences Research Institute

University of Southampton

Topics:

• Introduction
• Disclosure risk
• SDC methods for protecting Census frequency tables
• Disclosure risk and data utility measures
• Description of table
• Risk-Utility analysis
• Summary of Analysis
• Discussion and future work

Introduction

• Disclosure risk in Census tables: identification and individual attribute disclosure

• Need to protect many tables from one dataset containing population counts which can be linked and differenced

• Need to consider output strategies for standard tables and web-based table-generating applications

• Need to interact with users and develop an SDC framework with a focus on both disclosure risk and data utility

Disclosure Risk

For Census tables:

• 1’s and 2’s in cells are disclosive since these cells lead to identification
• 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

Consideration of disclosure risk:

• Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
• Proportion of high-risk cells (1 or 2)
• Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K if all K cells are equal)
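The last two indicators can be computed directly from the cell counts of a row or table. The following is a minimal sketch, not taken from the slides: the function names `small_cell_proportion` and `entropy` are hypothetical, and the natural logarithm is assumed for the entropy.

```python
import math

def small_cell_proportion(cells):
    """Share of cells holding a count of 1 or 2 (the high-risk cells)."""
    return sum(1 for c in cells if c in (1, 2)) / len(cells)

def entropy(cells):
    """Shannon entropy of the distribution of counts over the K cells."""
    total = sum(cells)
    # Zero cells contribute nothing; p * log(1/p) is the usual entropy term.
    return sum((c / total) * math.log(total / c) for c in cells if c > 0)

one_cell = [0, 0, 12, 0]   # a single non-zero cell: entropy at its minimum
flat = [5, 5, 5, 5]        # all K = 4 cells equal: entropy at its maximum
print(entropy(one_cell), entropy(flat), math.log(4))
print(small_cell_proportion([0, 1, 2, 7]))
```

A row with entropy close to 0 concentrates its population in one category, which is the attribute-disclosure situation described for zeros.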

SDC Methods for Protecting Frequency Tables

1. Pre-tabular methods (special case of PRAM)

• Random Record Swapping
• Targeted Record Swapping

In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias

Implementation:

• Randomly select p% of the households
• Draw a household matching on a set of key variables (e.g. household size and broad sex-age distribution) and swap all geographical variables
• Can target records for swapping that are in high-risk cells of size 1 or 2
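The random swapping steps can be sketched as follows. This is an illustrative assumption about the data layout, not the procedure used in the study: households are plain dictionaries, the field names `area`, `size` and `age_sex` are hypothetical, and the targeted variant (selecting households from cells of size 1 or 2) is not shown.

```python
import random

def swap_households(households, rate, keys=("size", "age_sex"), seed=0):
    """Swap the geography of about `rate` of the households with a matched partner."""
    rng = random.Random(seed)
    out = [dict(h) for h in households]          # work on copies
    selected = rng.sample(range(len(out)), round(rate * len(out)))
    used = set()                                 # households already swapped
    for i in selected:
        if i in used:
            continue
        # Partners: another area, identical values on the key variables.
        candidates = [
            j for j in range(len(out))
            if j != i and j not in used
            and out[j]["area"] != out[i]["area"]
            and all(out[j][k] == out[i][k] for k in keys)
        ]
        if not candidates:
            continue                             # no match: leave unswapped
        j = rng.choice(candidates)
        out[i]["area"], out[j]["area"] = out[j]["area"], out[i]["area"]
        used.update((i, j))
    return out
```

Because partners agree on the key variables, area-level counts of those variables are unchanged by the swap; only tables involving other variables are perturbed.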

SDC Methods for Protecting Frequency Tables

2. Rounding

Unbiased random rounding

• Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw

• Example: for unbiased random rounding to base 3:
  1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3
  2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3

• Expectation of the rounding error is 0

• Margins and internal cells rounded separately

• Small cell rounding: internal cells aggregated to obtain margins

[Figure: Confidence Interval for Totals. Horizontal axis: Number of Perturbed Cells (0 to 1000); vertical axis: Interval of Error (-100 to 100)]
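Unbiased random rounding to base 3 can be sketched as below. The function name `random_round` and its `rng` parameter are hypothetical; the only rule assumed is the one in the example above, i.e. an entry with residual r is rounded up with probability r/base, so that the expected rounded value equals the original count.

```python
import random

def random_round(count, base=3, rng=random):
    """Round `count` to an adjacent multiple of `base` with E[result] == count."""
    residual = count % base
    if residual == 0:
        return count                      # already a multiple of the base
    lower = count - residual
    # Round up with probability residual/base, down otherwise.
    return lower + base if rng.random() < residual / base else lower

rng = random.Random(2016)
draws = [random_round(1, base=3, rng=rng) for _ in range(10000)]
print(sum(draws) / len(draws))            # close to 1 on average
```

Each cell is drawn independently here, so the error in a total of many rounded cells grows with the number of perturbed cells, which is what the confidence-interval figure illustrates.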

SDC Methods for Protecting Frequency Tables

2. Rounding (cont.)

Semi-controlled unbiased random rounding

• Control the selection strategy for entries to round, i.e. use a “without replacement” strategy

• Implementation:
  - Calculate the expected number of entries to round up
  - Draw an srswor sample (simple random sample without replacement) from among the entries and round these up; round the rest down

• Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)

• Eliminates extra variance as a result of the rounding
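One way to read the implementation steps is sketched below. It is an assumption, not the documented procedure: entries are grouped by their residual so that each entry keeps its unbiased round-up probability, and a fractional expected number of round-ups is itself randomised. The name `semi_controlled_round` is hypothetical.

```python
import random
from collections import defaultdict

def semi_controlled_round(counts, base=3, seed=0):
    """Round to multiples of `base`, fixing how many entries are rounded up."""
    rng = random.Random(seed)
    rounded = [c - c % base for c in counts]      # start with everything rounded down
    by_residual = defaultdict(list)
    for i, c in enumerate(counts):
        if c % base:
            by_residual[c % base].append(i)
    for residual, idx in by_residual.items():
        expected_up = len(idx) * residual / base  # expected number of round-ups
        n_up = int(expected_up)
        if rng.random() < expected_up - n_up:     # randomise any fractional part
            n_up += 1
        for i in rng.sample(idx, n_up):           # srswor among these entries
            rounded[i] += base
    return rounded

row = [1] * 6 + [2] * 3 + [3, 9]
print(sum(row), sum(semi_controlled_round(row)))  # the row total is preserved here
```

Fixing the number of round-ups removes the variability in the total that independent rounding of each cell would add, which is the sense in which the extra variance is eliminated.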

SDC Methods for Protecting Frequency Tables

2. Rounding (cont.)

Controlled rounding

• Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)
  - Uses linear programming techniques to round entries up or down; results are similar to deterministic rounding
  - All rounded entries add up to rounded margins
  - Method is not unbiased and entries can jump a base

SDC Methods for Protecting Frequency Tables

3. Cell Suppression

Hypercube method (Giessing, 2004)

• Feature in Tau-Argus and suited for large tables
• Uses a heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell, with optimality conditions

Imputing suppressed cells for utility evaluation:

• Replace each suppressed cell by the average information loss in its row/column, i.e. the margin minus the total of the non-suppressed cells, divided by the number of suppressed cells

• Example: two suppressed cells in a row and the known margin is 500. The total of the non-suppressed cells is 400. Each suppressed cell is replaced with a value of 50
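The imputation rule and the worked example can be written out as a short sketch. The function name `impute_suppressed` and the use of `None` to mark a suppressed cell are assumptions made for illustration.

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) by an equal share of the missing total."""
    known = sum(c for c in row if c is not None)
    n_suppressed = sum(1 for c in row if c is None)
    if n_suppressed == 0:
        return list(row)
    share = (margin - known) / n_suppressed   # average information loss per cell
    return [share if c is None else c for c in row]

# Margin 500, non-suppressed cells total 400, two suppressed cells.
print(impute_suppressed([250, None, 150, None], margin=500))  # [250, 50.0, 150, 50.0]
```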

Disclosure Risk Measures

Need to determine output strategies and SDC together:

• Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
• Web-based tables and flexible categories and geographies: need to add noise or round for every query

Disclosure risk measures:

• Proportion of high-risk cells (C1 and C2) not protected:

$$DR_1 = \frac{\sum_{i \in C_1 \cup C_2} I(R_i \text{ not perturbed or imputed})}{n_{C_1 \cup C_2}}$$

• Percent true zeros out of total zeros:

$$DR_2 = \frac{|C_0^{orig}|}{|C_0^{orig} \cup C_0^{pert}|}$$
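The two risk measures can be sketched on flat lists of cell counts. Both readings are assumptions: a high-risk cell is treated as "not perturbed or imputed" when its released value equals its original value, and the zero-cell measure is read as the number of original zeros divided by the number of cells that are zero in either the original or the protected table. The names `dr1` and `dr2` are hypothetical.

```python
def dr1(orig, pert):
    """Proportion of original cells of size 1 or 2 released unchanged."""
    risky = [i for i, c in enumerate(orig) if c in (1, 2)]
    unchanged = sum(1 for i in risky if pert[i] == orig[i])
    return unchanged / len(risky)

def dr2(orig, pert):
    """Original zeros as a share of cells that are zero in either table."""
    zeros_orig = {i for i, c in enumerate(orig) if c == 0}
    zeros_pert = {i for i, c in enumerate(pert) if c == 0}
    return len(zeros_orig) / len(zeros_orig | zeros_pert)

orig = [0, 1, 2, 5, 0, 1]
pert = [0, 0, 3, 6, 0, 3]     # one possible outcome of random rounding to base 3
print(dr1(orig, pert), dr2(orig, pert))
```

In this toy table every small cell has changed, so the first measure is 0, while the extra zero created by rounding lowers the second measure to 2/3.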

Utility Measures

• Distance metric - distortion to distributions (Gomatam and Karr, 2003):

Internal cells:

Let $D^k$ be a table for row $k$, $n_r$ the number of rows, and $D^k(c)$ the cell frequency for cell $c$:

$$HD(D_{pert}, D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} \sqrt{\frac{1}{2} \sum_{c \in k} \left( \sqrt{D^k_{pert}(c)} - \sqrt{D^k_{orig}(c)} \right)^2}$$

Margins:

Let $M$ be the margin, $n_M$ the number of categories, and $N^l$ the number of persons in the $l$th category:

$$HDM(N_{pert}, N_{orig}) = \sqrt{\frac{1}{2} \sum_{l=1}^{n_M} \left( \sqrt{N^l_{pert}} - \sqrt{N^l_{orig}} \right)^2}$$
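A minimal sketch of the Hellinger distance measures follows, assuming the usual form of the distance (square root of half the sum of squared differences of square roots), applied to the counts of each row and averaged over rows for internal cells, and applied once to the vector of marginal counts. The function names are hypothetical.

```python
import math

def hellinger(pert, orig):
    """Hellinger distance between two vectors of counts."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for p, o in zip(pert, orig)))

def hd_internal(pert_table, orig_table):
    """Average over rows of the row-wise Hellinger distance."""
    return sum(hellinger(p, o) for p, o in zip(pert_table, orig_table)) / len(orig_table)

orig = [[4, 9], [4, 9]]
pert = [[0, 9], [4, 9]]        # one cell of the first row changed from 4 to 0
print(hd_internal(pert, orig))            # sqrt(2) / 2
print(hellinger([13, 13], [13, 13]))      # identical margins give distance 0
```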

Utility Measures

• Impact on Tests for Independence:

Cramer’s V measure of association:

$$CV = \sqrt{\frac{\chi^2 / n}{\min(R-1,\ C-1)}}$$

where $\chi^2$ is the Pearson chi-square statistic

$$RCV(D_{pert}, D_{orig}) = 100 \times \frac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})}$$

• Same utility measure for entropy and the Pearson chi-square statistic

• Impact on log-linear analysis for multi-dimensional tables, i.e. deviance
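Cramer's V and its relative change can be sketched for a two-way table given as a list of rows. The sketch assumes the standard definition, V = sqrt((chi-square / n) / min(R - 1, C - 1)) with R rows, C columns and n the table total, and assumes no row or column total is zero. The function names are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for a two-way table of counts with non-zero margins."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i, r in enumerate(table):
        for j, observed in enumerate(r):
            expected = row_tot[i] * col_tot[j] / n   # count under independence
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / n / min(len(row_tot) - 1, len(col_tot) - 1))

def rcv(pert, orig):
    """Percent relative difference in Cramer's V."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)

orig = [[10, 0], [0, 10]]     # perfect association, V = 1
pert = [[8, 2], [2, 8]]       # counts moved between rows, V = 0.6
print(rcv(pert, orig))        # about -40
```

A negative value indicates attenuation (loss of association), a positive value a strengthening of the association.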

Utility Measures

• “Between” Variance:

Let

$$P^k_{orig}(c) = \frac{D^k_{orig}(c)}{\sum_{c \in k} D^k_{orig}(c)}$$

be a target proportion for a cell $c$ in row $k$, and let

$$P_{orig}(c) = \frac{\sum_{k=1}^{n_r} D^k_{orig}(c)}{\sum_{k=1}^{n_r} \sum_{c \in k} D^k_{orig}(c)}$$

be the overall proportion across all rows of the table

The “between” variance is defined as:

$$BV(P_{orig}(c)) = \frac{1}{n_r - 1} \sum_{k=1}^{n_r} \left( P^k_{orig}(c) - P_{orig}(c) \right)^2$$

and the utility measure is:

$$BVR(P_{pert}(c), P_{orig}(c)) = 100 \times \frac{BV(P_{pert}(c)) - BV(P_{orig}(c))}{BV(P_{orig}(c))}$$
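The "between" variance for a chosen column c can be sketched as follows, assuming the definitions given in words above: the row proportion is the cell count divided by its row total, the overall proportion is the column total divided by the table total, and the squared deviations are averaged with divisor (number of rows - 1). The function names are hypothetical.

```python
def between_variance(table, c):
    """Variance across rows of the proportion in column `c` around the overall proportion."""
    n_r = len(table)
    row_props = [row[c] / sum(row) for row in table]
    overall = sum(row[c] for row in table) / sum(sum(row) for row in table)
    return sum((p - overall) ** 2 for p in row_props) / (n_r - 1)

def bvr(pert, orig, c):
    """Percent relative difference in the between variance."""
    bv_orig = between_variance(orig, c)
    return 100 * (between_variance(pert, c) - bv_orig) / bv_orig

orig = [[1, 3], [3, 1]]       # row proportions 0.25 and 0.75, overall 0.5
pert = [[2, 2], [2, 2]]       # every row now equals the overall proportion
print(between_variance(orig, 0), bvr(pert, orig, 0))   # 0.125 -100.0
```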

Utility Measures

• Variance of Cell Counts:

The variance of the cell count for row $k$:

$$V(D^k_{orig}) = \frac{1}{n_k - 1} \sum_{c \in k} \left( D^k_{orig}(c) - \bar{D}^k_{orig} \right)^2$$

where $n_k$ is the number of columns

The average variance across all rows:

$$V(D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} V(D^k_{orig})$$

The utility measure is:

$$RDV(D_{pert}, D_{orig}) = 100 \times \frac{V(D_{pert}) - V(D_{orig})}{V(D_{orig})}$$
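The variance-of-cell-counts measure can be sketched in a few lines, assuming the sample variance with divisor (number of columns - 1) within each row and a plain average over rows. The function names are hypothetical.

```python
def row_variance(row):
    """Sample variance of the cell counts in one row."""
    mean = sum(row) / len(row)
    return sum((c - mean) ** 2 for c in row) / (len(row) - 1)

def average_variance(table):
    """Average of the row variances across all rows."""
    return sum(row_variance(r) for r in table) / len(table)

def rdv(pert, orig):
    """Percent relative difference in the average variance."""
    v_orig = average_variance(orig)
    return 100 * (average_variance(pert) - v_orig) / v_orig

orig = [[0, 4], [1, 3]]       # row variances 8 and 2, average 5
pert = [[1, 3], [1, 3]]       # first row flattened, average variance 2
print(rdv(pert, orig))        # -60.0
```

A negative value corresponds to counts "flattening" out, a positive value to counts becoming "sharper".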

Description of Table

• 2001 UK Census table:
  Rows: Output Areas (1,487)
  Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)

• Table includes 317,064 persons aged 16-74 in 53,532 internal cells
• Average cell size: 5.92, although the table is skewed
• Number of zeros: 17,915 (33.5%)
• Number of small cells: 14,726 (27.5%)
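The summary figures for the table are mutually consistent, which a few lines of arithmetic confirm. The variable names are chosen here for illustration; all numbers are those quoted in the table description.

```python
rows = 1487                      # Output Areas
columns = 9 * 2 * 2              # Economic Activity x Sex x Long-Term Illness
cells = rows * columns
persons = 317064

print(cells)                             # 53532 internal cells
print(round(persons / cells, 2))         # average cell size 5.92
print(round(100 * 17915 / cells, 1))     # 33.5 percent zeros
print(round(100 * 14726 / cells, 1))     # 27.5 percent small cells
```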

[Figure: two charts, "Percent Unperturbed Small Cells" and "Percent True Zeros", each with a vertical axis from 0 to 1; category labels: Original, 10% Random, 10% Target, 20% Random, 20% Target]

[Figure: two charts, "Hellinger's Distance Margins OAs" and "Hellinger's Distance Internal Cells", each with a vertical axis from 0 to 10]

[Figure: "Difference in Cramer's V (Original=0.121)", vertical axis from -10 to 30]

[Figure: two charts, "Difference in Variance of Cell Counts (Original=188.3)", vertical axis from -3 to 3, and "Difference in Between Variance (Original=0.00023)", vertical axis from -15 to 40]

[Figure: "Risk-Utility Map". Horizontal axis: HD (ticks from 0 to 3.5); vertical axis: Prop. true zeros (0.5 to 1); plotted points: RR 3, RR 5, R 10, R 20, T 10, T 20, Sup]

Summary of Analysis

• Rounding eliminates small cells, but random rounding still needs protection against disclosure by differencing and linking

• Rounding adds more ambiguity into the zero counts

• Random rounding to base 5 has the greatest impact on distortions to distributions

• Semi-controlled rounding has almost no effect on distortions to internal cells but gives less distortion on marginal cells

• Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding

• Cell suppression with the simple imputation method has the highest utility (no perturbation of large cells) but is difficult to implement in a Census

Summary of Analysis

• Record swapping leaves a high percentage of true small cells and less ambiguity in the zero cells

• Record swapping has less distortion to internal cells than rounding; the distortion increases with higher swapping rates

• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells

• Column margins of the table have no distortion because of controls in swapping

• Combining record swapping with rounding results in more distortion but provides added protection

Summary of Analysis

• Record swapping across geographies attenuates:
  - loss of association (moving towards independence)
  - counts “flattening” out
  - proportions moving to the overall proportion

• Attenuation increases with higher swapping rates

• Targeted record swapping has less attenuation than random swapping

• Rounding introduces more zeros:
  - levels of association are higher
  - cell counts “sharper”
  - effects less severe for controlled rounding

• Combining record swapping and rounding cancels out opposing effects, depending on the direction and magnitude of each procedure separately

Discussion

• Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data

• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. ABS developed microdata keys for consistency in rounding

• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables

• Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)

• Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)


Natalie Shlomo
