Assessing the Impact of SDC Methods on Census Frequency Tables
Natalie Shlomo
Southampton Statistical Sciences Research Institute
University of Southampton
2
Topics:
• Introduction
• Disclosure risk
• SDC methods for protecting Census frequency tables
• Disclosure risk and data utility measures
• Description of table
• Risk-Utility analysis
• Summary of Analysis
• Discussion and future work
3
Introduction
• Disclosure risk in Census tables:
• Need to protect many tables from one dataset containing population counts which can be linked and differenced
• Need to consider output strategies for standard tables and web-based table-generating applications
• Need to interact with users and develop an SDC framework with a focus on both disclosure risk and data utility
Disclosure types: Identification, Individual Attribute Disclosure
4
Disclosure Risk
For Census tables:
• 1’s and 2’s in cells are disclosive since these cells lead to identification
• 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)
Consideration of disclosure risk:
• Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
• Proportion of high-risk cells (1 or 2)
• Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K if all K cells are equal)
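The last two considerations can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the entropy is taken to be the Shannon entropy with natural logarithms, which matches the stated bounds of 0 and log K.

```python
import math

def proportion_high_risk(cells):
    """Share of cells in a row (or table) with a count of 1 or 2."""
    return sum(1 for c in cells if c in (1, 2)) / len(cells)

def entropy(cells):
    """Shannon entropy of the cell distribution (natural log).

    Zero cells contribute nothing to the sum.
    """
    total = sum(cells)
    probs = [c / total for c in cells if c > 0]
    return -sum(p * math.log(p) for p in probs)

# One non-zero cell gives the minimum; equal cells give log K.
degenerate_row = [0, 0, 7, 0]
uniform_row = [5, 5, 5, 5]
print(entropy(degenerate_row), entropy(uniform_row), math.log(4))
```

A row with low entropy concentrates its count in few cells, which is the situation where zeros become disclosive.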
5
SDC Methods for Protecting Frequency Tables
1. Pre-tabular methods (special case of PRAM)
Random Record Swapping
Targeted Record Swapping
In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias
Implementation:
Randomly select p% of the households
Draw a household matching on a set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables
Can target records for swapping that are in high-risk cells of size 1 or 2
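A minimal sketch of random record swapping along the lines of the implementation steps above. The record layout (a list of dicts), the function name and the rule that a partner must come from a different geography are assumptions made for illustration; targeting high-risk cells is not shown.

```python
import random

def random_record_swap(households, rate, key_vars, geo_vars, seed=0):
    """Swap geography between randomly selected households and matched partners.

    households: list of dicts (hypothetical layout); rate: share selected;
    key_vars: variables the partner must match on; geo_vars: variables swapped.
    """
    rng = random.Random(seed)
    swapped = [dict(h) for h in households]      # leave the input untouched
    n_select = round(rate * len(swapped))
    selected = rng.sample(range(len(swapped)), n_select)
    used = set()                                 # households already swapped
    for i in selected:
        if i in used:
            continue
        key_i = tuple(swapped[i][v] for v in key_vars)
        geo_i = tuple(swapped[i][g] for g in geo_vars)
        candidates = [
            j for j in range(len(swapped))
            if j != i and j not in used
            and tuple(swapped[j][v] for v in key_vars) == key_i
            and tuple(swapped[j][g] for g in geo_vars) != geo_i
        ]
        if not candidates:
            continue                             # no matching partner available
        j = rng.choice(candidates)
        for g in geo_vars:
            swapped[i][g], swapped[j][g] = swapped[j][g], swapped[i][g]
        used.update((i, j))
    return swapped
```

Because only geography changes hands between households with identical key variables, counts by the key variables within the swapped population are unchanged, while area-level tables are perturbed.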
6
SDC Methods for Protecting Frequency Tables
2. Rounding
Unbiased random rounding
Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw
Example: For unbiased random rounding to base 3:
1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3
2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3
Expected perturbation from rounding is 0
Margins and internal cells rounded separately
Small cell rounding: internal cells aggregated to obtain margins
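The base 3 example generalizes to any base: an entry with residual r (its remainder modulo the base) is rounded up with probability r/base and down otherwise, so the expected rounded value equals the original. A small sketch, with a hypothetical function name:

```python
import random

def random_round(count, base=3, rng=random):
    """Unbiased random rounding of a single count to a multiple of `base`."""
    residual = count % base
    if residual == 0:
        return count                      # already a multiple of the base
    lower = count - residual
    # Round up with probability residual/base, otherwise round down.
    if rng.random() < residual / base:
        return lower + base
    return lower

rng = random.Random(42)
rounded = [random_round(c, base=3, rng=rng) for c in [1, 2, 3, 4, 10]]
print(rounded)
```

For a count of 1 this gives 3 with probability 1/3 and 0 with probability 2/3, as in the slide's example.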
[Figure: Confidence Interval for Totals. x-axis: Number of Perturbed Cells (0 to 1000); y-axis: Interval of Error (-100 to 100)]
7
SDC Methods for Protecting Frequency Tables
2. Rounding (cont.)
Semi-controlled unbiased random rounding
Control the selection strategy for entries to round, i.e. use a “without replacement” strategy
Implementation:
- Calculate the expected number of entries to round up
- Draw an srswor (simple random sample without replacement) from among the entries and round up, the rest round down.
Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)
Eliminates extra variance as a result of the rounding
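The two implementation steps might look as follows. Handling each residual group separately, and rounding the expected number of round-ups to the nearest integer, are assumptions of this sketch rather than details given on the slide.

```python
import random

def semi_controlled_round(cells, base=3, seed=0):
    """Semi-controlled random rounding of a list of counts (sketch).

    For each residual r, the expected number of entries to round up is
    (number of entries with residual r) * r / base; that many entries are
    drawn without replacement and rounded up, the rest are rounded down.
    """
    rng = random.Random(seed)
    rounded = [c - c % base for c in cells]        # start with everything rounded down
    for residual in range(1, base):
        idx = [i for i, c in enumerate(cells) if c % base == residual]
        n_up = round(len(idx) * residual / base)   # expected number of round-ups
        for i in rng.sample(idx, n_up):            # srswor among these entries
            rounded[i] += base
    return rounded

cells = [1] * 6 + [2] * 3 + [3, 0]
print(sum(cells), sum(semi_controlled_round(cells)))
```

In this example the expected numbers of round-ups are whole numbers, so the rounded entries reproduce the original total exactly; applying the function row by row is one way to keep totals consistent on one dimension.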
8
SDC Methods for Protecting Frequency Tables
2. Rounding (cont.)
Controlled rounding
Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)
- Uses linear programming techniques to round entries up or down, results similar to deterministic rounding
- All rounded entries add up to rounded margins
- Method not unbiased and entries can jump a base
9
SDC Methods for Protecting Frequency Tables
3. Cell Suppression
Hypercube method (Giessing, 2004)
Feature in Tau-Argus and suited for large tables
Uses heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell with optimality conditions
Imputing suppressed cells for utility evaluation:
Replace suppressed cell by the average information loss in each row/column.
Example: Two suppressed cells in a row and known margin is 500. The total of non-suppressed cells is 400. Each cell is replaced with a value of 50
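The imputation rule in the example amounts to spreading the unexplained part of the margin evenly over the suppressed cells. A short sketch, where `None` marks a suppressed cell and the function name is hypothetical:

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) with an equal share of the missing total."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

# Margin 500, non-suppressed cells total 400, two suppressed cells.
print(impute_suppressed([150, None, 250, None], margin=500))
# [150, 50.0, 250, 50.0]
```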
10
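Disclosure Risk Measures
Need to determine output strategies and SDC together
• Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
• Web-based tables and flexible categories and geographies: need to add noise or round for every query
Disclosure risk measures:
• Proportion of high-risk cells (C1 and C2) not protected:
\[ DR_1 = \frac{1}{|C_1 \cup C_2|} \sum_{i \in C_1 \cup C_2} I(\text{cell } i \text{ not perturbed or imputed}) \]
• Percent true zeros out of total zeros:
\[ DR_2 = \frac{|C_0^{orig}|}{|C_0^{orig} \cup C_0^{pert}|} \]
where \(C_1\) and \(C_2\) are the sets of cells of size 1 and 2, \(C_0\) is the set of zero cells, and \(I(\cdot)\) is the indicator function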
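A sketch of the two disclosure risk measures for tables flattened to lists of cell counts. The function names are hypothetical, and the second measure is read here as the number of original zeros divided by the number of cells that are zero in the original or the perturbed table, which is an interpretation of "percent true zeros out of total zeros".

```python
def dr_small_cells(orig, pert):
    """Proportion of original cells of size 1 or 2 left unchanged."""
    small = [i for i, c in enumerate(orig) if c in (1, 2)]
    if not small:
        return 0.0
    return sum(1 for i in small if pert[i] == orig[i]) / len(small)

def dr_true_zeros(orig, pert):
    """True (original) zeros as a share of all cells that are zero in either table."""
    zeros_orig = {i for i, c in enumerate(orig) if c == 0}
    zeros_pert = {i for i, c in enumerate(pert) if c == 0}
    zeros_any = zeros_orig | zeros_pert
    if not zeros_any:
        return 1.0
    return len(zeros_orig) / len(zeros_any)

orig = [0, 1, 2, 5, 0, 3]
pert = [0, 0, 3, 6, 0, 3]   # e.g. after rounding to base 3
print(dr_small_cells(orig, pert), dr_true_zeros(orig, pert))
```

In the example both small cells are perturbed, so the first measure is 0, and rounding has created one extra zero, so only two of the three zeros are true zeros.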
11
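Utility Measures
• Distance metric - distortion to distributions (Gomatam and Karr, 2003):
Internal cells:
Let \(D^k\) be a table for row k, \(n_r\) the number of rows, and \(D^k(c)\) the cell frequency for cell c,
\[ HD(D_{pert}, D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} \sqrt{\frac{1}{2} \sum_{c \in k} \left( \sqrt{D^k_{pert}(c)} - \sqrt{D^k_{orig}(c)} \right)^2} \]
Margins:
Let M be the margin, \(n_M\) the number of categories, and \(N^l\) the number of persons in the \(l\)-th category:
\[ HDM(N_{pert}, N_{orig}) = \sqrt{\frac{1}{2} \sum_{l=1}^{n_M} \left( \sqrt{N^l_{pert}} - \sqrt{N^l_{orig}} \right)^2} \]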
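The Hellinger distance measures can be sketched directly on lists of counts. The code assumes the usual Hellinger form (square roots of the counts inside the squared difference, a factor of one half, and an outer square root), averaged over rows for the internal cells; the function names are hypothetical.

```python
import math

def hellinger_row(pert_row, orig_row):
    """Hellinger distance between two vectors of counts."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for p, o in zip(pert_row, orig_row)))

def hd_internal(pert, orig):
    """Average row-wise Hellinger distance over the rows of a table."""
    return sum(hellinger_row(p, o) for p, o in zip(pert, orig)) / len(orig)

def hd_margin(pert_margin, orig_margin):
    """Hellinger distance between perturbed and original margins."""
    return hellinger_row(pert_margin, orig_margin)

orig = [[1, 2, 0], [4, 0, 9]]
pert = [[0, 3, 0], [3, 0, 9]]
print(hd_internal(pert, orig))
```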
12
Utility Measures
• Impact on Tests for Independence:
Cramer’s V measure of association:
\[ CV = \sqrt{\frac{\chi^2 / n}{\min((R-1),(C-1))}} \]
where \(\chi^2\) is the Pearson chi-square statistic, n is the table total, and R and C are the numbers of rows and columns
The utility measure is:
\[ RCV(D_{pert}, D_{orig}) = 100 \times \frac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})} \]
Same utility measure for entropy and the Pearson chi-square statistic
Impact on log linear analysis for multi-dimensional tables, i.e. deviance
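A self-contained sketch of Cramer's V and the relative difference measure for a two-way table given as a list of rows. The Pearson chi-square statistic is computed from expected counts under independence; the function names are hypothetical.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic and table total for a two-way table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:               # skip empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    chi2, n = pearson_chi2(table)
    n_rows, n_cols = len(table), len(table[0])
    return math.sqrt((chi2 / n) / min(n_rows - 1, n_cols - 1))

def rcv(pert, orig):
    """Percent relative difference in Cramer's V, perturbed versus original."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)

orig = [[8, 2], [2, 8]]
pert = [[10, 0], [0, 10]]
print(cramers_v(orig), rcv(pert, orig))
```

A positive value indicates that the perturbed table shows stronger association than the original, a negative value that association has been attenuated.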
13
Utility Measures
• “Between” Variance:
Let
\[ P^k_{orig}(c) = \frac{D^k_{orig}(c)}{\sum_{c \in k} D^k_{orig}(c)} \]
be a target proportion for a cell c in row k, and let
\[ P_{orig}(c) = \frac{\sum_{k=1}^{n_r} D^k_{orig}(c)}{\sum_{k=1}^{n_r} \sum_{c \in k} D^k_{orig}(c)} \]
be the overall proportion across all rows of the table
The “between” variance is defined as:
\[ BV(P_{orig}(c)) = \frac{1}{n_r - 1} \sum_{k=1}^{n_r} \left( P^k_{orig}(c) - P_{orig}(c) \right)^2 \]
and the utility measure is:
\[ BVR(P_{pert}(c), P_{orig}(c)) = 100 \times \frac{BV(P_{pert}(c)) - BV(P_{orig}(c))}{BV(P_{orig}(c))} \]
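The "between" variance for a chosen column c can be sketched as below, with tables as lists of rows and hypothetical function names. The sketch assumes every row has a non-zero total and that the original between variance is non-zero.

```python
def between_variance(table, c):
    """Variance of the row proportions for column c around the overall proportion."""
    n_r = len(table)
    row_props = [row[c] / sum(row) for row in table]
    overall = sum(row[c] for row in table) / sum(sum(row) for row in table)
    return sum((p - overall) ** 2 for p in row_props) / (n_r - 1)

def bvr(pert, orig, c):
    """Percent relative difference in between variance, perturbed versus original."""
    bv_orig = between_variance(orig, c)
    return 100 * (between_variance(pert, c) - bv_orig) / bv_orig

orig = [[1, 3], [3, 1]]
pert = [[2, 2], [2, 2]]     # row proportions pulled to the overall proportion
print(between_variance(orig, 0), bvr(pert, orig, 0))
# 0.125 -100.0
```

The example shows the extreme case where all row proportions collapse to the overall proportion, so the between variance drops to zero.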
14
Utility Measures
• Variance of Cell Counts:
The variance of the cell count for row k:
\[ V(D^k_{orig}) = \frac{1}{n_k - 1} \sum_{c \in k} \left( D^k_{orig}(c) - \bar{D}^k_{orig} \right)^2 \]
where \(n_k\) is the number of columns
The average variance across all rows:
\[ V(D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} V(D^k_{orig}) \]
The utility measure is:
\[ RDV(D_{orig}, D_{pert}) = 100 \times \frac{V(D_{pert}) - V(D_{orig})}{V(D_{orig})} \]
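The variance-based measure follows the same pattern: a sample variance per row, averaged over rows, then compared between the perturbed and original tables. Function names are hypothetical, and each row is assumed to have at least two cells.

```python
def row_variance(row):
    """Sample variance of the cell counts in one row."""
    n_k = len(row)
    mean = sum(row) / n_k
    return sum((x - mean) ** 2 for x in row) / (n_k - 1)

def average_variance(table):
    """Average of the row variances across all rows of the table."""
    return sum(row_variance(row) for row in table) / len(table)

def rdv(orig, pert):
    """Percent relative difference in average variance, perturbed versus original."""
    v_orig = average_variance(orig)
    return 100 * (average_variance(pert) - v_orig) / v_orig

orig = [[1, 3], [2, 2]]
pert = [[2, 2], [2, 2]]     # counts flattened out
print(average_variance(orig), rdv(orig, pert))
# 1.0 -100.0
```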
15
Description of Table
• 2001 UK Census Table:
Rows: Output Areas (1,487)
Columns: Economic Activity (9) * Sex (2) * Long-Term Illness (2)
Table includes 317,064 persons between 16-74 in 53,532 internal cells
Average cell size: 5.92 although table is skewed
Number of zeros: 17,915 (33.5%)
Number of small cells: 14,726 (27.5%)
16
[Figures: Percent Unperturbed Small Cells; Percent True Zeros. Both y-axes run from 0 to 1; category labels: Original, 10% Random, 10% Target, 20% Random, 20% Target]
17
[Figures: Hellinger's Distance Margins OAs; Hellinger's Distance Internal Cells. Both axes run from 0 to 10]
18
[Figure: Difference in Cramer's V (Original=0.121). Axis from -10 to 30]
19
[Figure: Difference in Variance of Cell Counts (Original=188.3). Axis from -3 to 3]
[Figure: Difference in Between Variance (Original=0.00023). Axis from -15 to 40]
20
[Figure: Risk-Utility Map. x-axis: HD (0 to 3.5); y-axis: Prop. true zeros (0.5 to 1); plotted points: RR 3, T 20, R 20, R 10, T 10, RR 5, Sup]
21
Summary of Analysis
• Rounding eliminates small cells, but random rounding still needs protection against disclosure by differencing and linking
• Rounding adds more ambiguity into the zero counts
• Random rounding to base 5 has the greatest impact on distortions to distributions
• Compared with random rounding, semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells
• Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
• Cell suppression with the simple imputation method has the highest utility (no perturbation on large cells) but is difficult to implement in a Census
22
Summary of Analysis
• Record swapping leaves a high percent of true small cells and less ambiguity of zero cells
• Record swapping has less distortion to internal cells than rounding, and the distortion increases with higher swapping rates
• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
• Column margins of the table have no distortion because of controls in swapping
• Combining record swapping with rounding results in more distortion but provides added protection
23
Summary of Analysis
• Record swapping across geographies attenuates:
- loss of association (moving towards independence)
- counts “flattening” out
- proportions moving to the overall proportion
• Attenuation increases with higher swapping rates
• Targeted record swapping has less attenuation than random swapping
• Rounding introduces more zeros:
- levels of association are higher
- cell counts “sharper”
• Effects less severe for controlled rounding
• Combining record swapping and rounding can cancel out opposing effects, depending on the direction and magnitude of each procedure separately
24
Discussion
• Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data

• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. ABS developed microdata keys for consistency in rounding
• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
• Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)
• Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)