RTI International is a trade name of Research Triangle Institute
3040 Cornwallis Road ¦ P.O. Box 12194 ¦ Research Triangle Park, North Carolina, USA 27709 ¦ Phone: 919-541-6990
A Comparative Assessment of Methods for Protecting Confidentiality of Microdata
David Wilson
Joint Statistical Meetings, Minneapolis, MN, August 7-11, 2005
Outline
§ Framing the Comparison Scenario
§ 10 Statistical Disclosure Limitation Methods
§ Comparing 10 methods
§ Summary
Apples and Oranges, Oh my!
§ Driving forces behind Statistical Disclosure Limitation (SDL) are: Risk and Information
§ In order to compare and choose acceptable SDL techniques, one must define acceptable risk and acceptable information loss
§ Requires subjective determinations of “how much risk is acceptable” and “how much information loss is acceptable”
Disclosure: A Balancing Act
DATA CONFIDENTIALITY vs. DATA QUALITY
Means of Comparison
§ Do methods use a common measure of risk? No
§ Do methods use a common measure of information loss? No
§ So how do we compare competing methods?
§ By ease of implementation, by the type of data they can handle, and by their impact on one of several measures of information loss
10 SDL Techniques
§ 10 SDL techniques applicable to microdata will be discussed
§ Global recoding, Local suppression, Rounding, Microaggregation, Noise addition, Sampling, Swapping, PRAM, Imputation, and MASSC
§ Not an exhaustive list
10 SDL Techniques (cont.)
In (rough) order of complexity:
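§ Global Recoding (Top Coding, Bottom Coding)
Global recoding of a variable is the process of combining two or more categories of a variable into one category. Applies to continuous or categorical data. Coarsens data.

Before recoding:

| Age | N |
|---|---|
| 1 | 200 |
| 2 | 4 |
| 3 | 174 |
| 4 | 44 |

After recoding (categories 1 and 2 combined):

| Age | N |
|---|---|
| 1 | 204 |
| 2 | 174 |
| 3 | 44 |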
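A minimal sketch of global recoding, assuming the slide's example counts (age categories 1 to 4 with N = 200, 4, 174, 44). The helper names `global_recode` and `top_code`, and the cap of 80, are hypothetical and chosen only for illustration.

```python
from collections import Counter

def global_recode(values, mapping):
    """Replace each value by its recoded category; unmapped values pass through."""
    return [mapping.get(v, v) for v in values]

def top_code(values, cap):
    """Top coding: every value above `cap` is reported as `cap`."""
    return [min(v, cap) for v in values]

# Rebuild a record-level file from the example counts (an assumed layout).
ages = [1] * 200 + [2] * 4 + [3] * 174 + [4] * 44

# Merge the sparse category 2 into category 1, then renumber the rest.
recoded = global_recode(ages, {1: 1, 2: 1, 3: 2, 4: 3})
counts = Counter(recoded)
print(sorted(counts.items()))  # [(1, 204), (2, 174), (3, 44)]

# Top coding on a numeric age, with an illustrative cap of 80.
print(top_code([13.2, 43.7, 35.3, 12.4, 89.5, 27.9], cap=80))
```

The total number of records is unchanged; only the detail available for the merged categories is lost.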
10 SDL Techniques (cont.)
§ Local Suppression
Local suppression is a record-level process where a value for a variable is replaced by a value that indicates “missingness.” Applied to extreme values, for example. Changes distributions.

Original:

| Obs | Income |
|---|---|
| 1 | $45,000 |
| 2 | $32,000 |
| 3 | $1,500,000 |
| 4 | $45,000 |
| 5 | $32,000 |
| 6 | $32,000 |

After local suppression:

| Obs | Income |
|---|---|
| 1 | $45,000 |
| 2 | $32,000 |
| 3 | . |
| 4 | $45,000 |
| 5 | $32,000 |
| 6 | $32,000 |
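As a rough sketch of local suppression on the six example incomes, the snippet below replaces any value above a threshold with `None`. The function name `suppress_extremes` and the threshold of $1,000,000 are assumptions made for illustration; the slides do not specify a rule.

```python
import statistics

def suppress_extremes(values, upper):
    """Replace values above `upper` with None, used here as the missing marker."""
    return [None if v > upper else v for v in values]

income = [45_000, 32_000, 1_500_000, 45_000, 32_000, 32_000]
released = suppress_extremes(income, upper=1_000_000)
print(released)

# The mean of the released (non-missing) values shows how the distribution shifts.
mean_before = statistics.mean(income)
mean_after = statistics.mean(v for v in released if v is not None)
print(mean_before, mean_after)
```

Comparing the two means makes the "changes distributions" point concrete: removing one extreme value moves the mean substantially.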
10 SDL Techniques (cont.)
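§ Rounding
Values for a variable are replaced with some integer multiple of a rounding base. Applicable to quantitative, continuous variables. Changes distributions.

Original:

| Obs | Age (in years) |
|---|---|
| 1 | 13.2 |
| 2 | 43.7 |
| 3 | 35.3 |
| 4 | 12.4 |
| 5 | 89.5 |
| 6 | 27.9 |

After rounding:

| Obs | Age (in years) |
|---|---|
| 1 | 13 |
| 2 | 44 |
| 3 | 35 |
| 4 | 12 |
| 5 | 90 |
| 6 | 28 |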
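A small sketch of rounding to a base, using the six example ages. The helper `round_to_base` is hypothetical; it rounds halves up, which is one of several possible conventions and matches the example's 89.5 becoming 90.

```python
import math

def round_to_base(value, base):
    """Round to the nearest integer multiple of `base` (halves round up)."""
    return base * math.floor(value / base + 0.5)

ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

print([round_to_base(a, 1) for a in ages])  # [13, 44, 35, 12, 90, 28]
print([round_to_base(a, 5) for a in ages])  # [15, 45, 35, 10, 90, 30]
```

A larger base gives more protection and a coarser distribution, which is the trade-off the deck keeps returning to.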
10 SDL Techniques (cont.)
§ Microaggregation
Microaggregation is the process of replacing values of variables, for a given grouping of records, with an aggregate value derived from that group. “Flattens” distributions. Hides extreme values.

Original:

| Obs | Gender | Age (in years) |
|---|---|---|
| 1 | Male | 13.2 |
| 2 | Female | 43.7 |
| 3 | Male | 35.3 |
| 4 | Female | 12.4 |
| 5 | Male | 89.5 |
| 6 | Female | 27.9 |

After microaggregation (group mean by gender):

| Obs | Gender | Age (in years) |
|---|---|---|
| 1 | Male | 46 |
| 2 | Female | 28 |
| 3 | Male | 46 |
| 4 | Female | 28 |
| 5 | Male | 46 |
| 6 | Female | 28 |
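The example can be reproduced with a short sketch that groups records by gender and replaces each age with its group mean. Grouping by gender follows the slide's example; in practice groups are often formed from records with similar values. The function name `microaggregate` and the record layout are assumptions.

```python
from collections import defaultdict

def microaggregate(records, group_key, value_key):
    """Replace each record's value with the rounded mean of its group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])
    means = {g: sum(vals) / len(vals) for g, vals in groups.items()}
    return [{**rec, value_key: round(means[rec[group_key]])} for rec in records]

records = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]

aggregated = microaggregate(records, "gender", "age")
print([r["age"] for r in aggregated])  # [46, 28, 46, 28, 46, 28]
```

The extreme age of 89.5 disappears into the male group mean, which illustrates both the protection and the flattening.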
10 SDL Techniques (cont.)
§ Noise Addition
Additive noise addition (multiplicative noise is another form) is the record-level process of adding a realization of a random vector, say µ, to the vector of responses for a variable y. Could yield impossible data values if done incorrectly. Changes distributions and multivariate relationships.

Original:

| Obs | Age (in years) |
|---|---|
| 1 | 13.2 |
| 2 | 43.7 |
| 3 | 35.3 |
| 4 | 12.4 |
| 5 | 89.5 |
| 6 | 27.9 |

After noise addition:

| Obs | Age (in years) |
|---|---|
| 1 | 13.9 |
| 2 | 41.3 |
| 3 | 33.1 |
| 4 | 14.8 |
| 5 | 93.2 |
| 6 | 29.1 |
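A hedged sketch of additive noise on the example ages. It assumes independent normal noise with mean zero, and it clamps results to optional bounds as one crude guard against impossible values such as negative ages. The function name `add_noise`, the standard deviation, and the seed are illustrative choices, not values from the talk.

```python
import random

def add_noise(values, sd, lower=None, upper=None, seed=None):
    """Add independent N(0, sd^2) noise to each value, clamped to [lower, upper]."""
    rng = random.Random(seed)
    noisy = []
    for v in values:
        x = v + rng.gauss(0.0, sd)
        if lower is not None:
            x = max(lower, x)
        if upper is not None:
            x = min(upper, x)
        noisy.append(round(x, 1))
    return noisy

ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
print(add_noise(ages, sd=2.0, lower=0.0, seed=2005))
```

Clamping is simple but distorts the noise distribution near the bounds; it is shown only to make the "impossible values" warning concrete.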
10 SDL Techniques (cont.)
§ Sampling
Release a subset of all records contained in the file by sampling from the set of all records. Increases variances and may impact rare responses.

Original:

| Obs | Age (in years) |
|---|---|
| 1 | 13.2 |
| 2 | 43.7 |
| 3 | 35.3 |
| 4 | 12.4 |
| 5 | 89.5 |
| 6 | 27.9 |

After sampling:

| Obs | Age (in years) |
|---|---|
| 1 | 13.2 |
| 2 | 43.7 |
| 3 | 35.3 |
| 6 | 27.9 |
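A minimal sketch of releasing a simple random sample, assuming equal-probability sampling without replacement. Each released record carries a weight equal to the inverse of the sampling fraction so that totals can still be estimated. The function name `sample_records`, the fraction, and the seed are assumptions.

```python
import random

def sample_records(records, fraction, seed=None):
    """Draw a simple random sample and attach an inverse-probability weight."""
    rng = random.Random(seed)
    n = max(1, round(len(records) * fraction))
    chosen = sorted(rng.sample(range(len(records)), n))
    weight = len(records) / n
    return [dict(records[i], weight=weight) for i in chosen]

records = [
    {"obs": 1, "age": 13.2},
    {"obs": 2, "age": 43.7},
    {"obs": 3, "age": 35.3},
    {"obs": 4, "age": 12.4},
    {"obs": 5, "age": 89.5},
    {"obs": 6, "age": 27.9},
]

released = sample_records(records, fraction=2 / 3, seed=2005)
for rec in released:
    print(rec)
```

With four of six records kept, each weight is 1.5, so the weights sum back to the original file size.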
10 SDL Techniques (cont.)
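§ Swapping
Data swapping is the process of choosing two records at random from a microdata set and swapping the values of a set of variables.

Original:

| Obs | Gender | Age (in years) |
|---|---|---|
| 1 | Male | 13.2 |
| 2 | Female | 43.7 |
| 3 | Male | 35.3 |
| 4 | Female | 12.4 |
| 5 | Male | 89.5 |
| 6 | Female | 27.9 |

After swapping Gender:

| Obs | Gender | Age (in years) |
|---|---|---|
| 1 | Female | 13.2 |
| 2 | Male | 43.7 |
| 3 | Female | 35.3 |
| 4 | Male | 12.4 |
| 5 | Male | 89.5 |
| 6 | Female | 27.9 |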
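One way to sketch data swapping, assuming pairs are chosen uniformly at random and only the listed variables are exchanged. The function name `swap_values`, the number of swaps, and the seed are hypothetical.

```python
import random

def swap_values(records, keys, n_swaps, seed=None):
    """Pick two records at random and exchange the values of `keys`; repeat."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # leave the input untouched
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        for k in keys:
            out[i][k], out[j][k] = out[j][k], out[i][k]
    return out

records = [
    {"obs": 1, "gender": "Male", "age": 13.2},
    {"obs": 2, "gender": "Female", "age": 43.7},
    {"obs": 3, "gender": "Male", "age": 35.3},
    {"obs": 4, "gender": "Female", "age": 12.4},
    {"obs": 5, "gender": "Male", "age": 89.5},
    {"obs": 6, "gender": "Female", "age": 27.9},
]

swapped = swap_values(records, keys=["gender"], n_swaps=2, seed=2005)
for rec in swapped:
    print(rec)
```

Swapping preserves the marginal distribution of each swapped variable exactly, but it can distort its relationship with the variables left in place.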
10 SDL Techniques (cont.)
§ PRAM (Post Randomization)
Values of variables for each record in a microdata set are changed according to a known probability mechanism, so values after application of PRAM may or may not differ from the original values.
§ Estimates after PRAM can be adjusted because of the known probability mechanism
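The idea can be sketched for a two-category variable. The transition probabilities in `P` are made up for illustration, and `pram` and `adjust_counts` are hypothetical helper names. `P[k][l]` is taken to be the probability that an original value `k` is released as `l`, so the expected released counts equal the transpose of `P` applied to the original counts, and the adjustment inverts that relationship.

```python
import random

CATS = ["Female", "Male"]
# Hypothetical transition matrix: rows are original values, columns released values.
P = {
    "Female": {"Female": 0.9, "Male": 0.1},
    "Male": {"Female": 0.2, "Male": 0.8},
}

def pram(values, P, seed=None):
    """Independently re-draw each value from the row of P for its original value."""
    rng = random.Random(seed)
    out = []
    for v in values:
        cats = list(P[v])
        probs = [P[v][c] for c in cats]
        out.append(rng.choices(cats, weights=probs, k=1)[0])
    return out

def adjust_counts(observed, P):
    """Estimate original counts by solving observed = P^T * original (2x2 case)."""
    a, b = CATS
    p_aa, p_ab = P[a][a], P[a][b]
    p_ba, p_bb = P[b][a], P[b][b]
    det = p_aa * p_bb - p_ab * p_ba
    t_a, t_b = observed[a], observed[b]
    orig_a = (p_bb * t_a - p_ba * t_b) / det
    orig_b = (-p_ab * t_a + p_aa * t_b) / det
    return {a: orig_a, b: orig_b}

original = ["Female"] * 600 + ["Male"] * 400
released = pram(original, P, seed=2005)
observed = {c: released.count(c) for c in CATS}
print(observed)
print(adjust_counts(observed, P))
```

The adjusted counts are estimates, not a recovery of individual values: the record-level data stay perturbed while aggregate estimates can be corrected.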
10 SDL Techniques (cont.)
§ Multiple/Single Imputation (synthetic data)
§ For single imputation, replace a value in a data set with a value that is either
§ 1) derived from a model of the population from which the data were derived, or
§ 2) chosen by some mathematical method so that the imputed value is “close” to the original value (e.g. nearest neighbor)
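A toy sketch of the nearest-neighbor option, under the assumption that "closeness" is judged on another variable in the file. The records below pair the ages and incomes used in the earlier examples purely for illustration; that pairing, and the function name `nearest_neighbor_impute`, are not from the talk.

```python
def nearest_neighbor_impute(records, target_index, match_key, impute_key):
    """Replace one record's value with that of the donor closest on `match_key`."""
    target = records[target_index]
    donors = [r for i, r in enumerate(records) if i != target_index]
    donor = min(donors, key=lambda r: abs(r[match_key] - target[match_key]))
    out = [dict(r) for r in records]  # leave the input untouched
    out[target_index][impute_key] = donor[impute_key]
    return out

# Illustrative pairing of ages and incomes (an assumption).
records = [
    {"obs": 1, "age": 13.2, "income": 45_000},
    {"obs": 2, "age": 43.7, "income": 32_000},
    {"obs": 3, "age": 35.3, "income": 1_500_000},
    {"obs": 4, "age": 12.4, "income": 45_000},
    {"obs": 5, "age": 89.5, "income": 32_000},
    {"obs": 6, "age": 27.9, "income": 32_000},
]

# Impute the extreme income of observation 3 from its nearest neighbor on age.
imputed = nearest_neighbor_impute(records, target_index=2, match_key="age", impute_key="income")
print(imputed[2])
```

Here the donor is observation 6, the record closest in age, so the released income for observation 3 becomes $32,000.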
10 SDL Techniques (cont.)
§ Multiple/Single Imputation (synthetic data)
§ For “full” multiple imputation:
§ Model the population distribution of the variables contained in the microdata, generate realizations of the microdata under the developed model, and release the set of generated realizations.
§ For “partial” multiple imputation:
§ Model the population distribution of the variables contained in the microdata, generate realizations of “parts” of the microdata under the developed model, and release the set of generated realizations along with the un-imputed data.
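A deliberately simplified sketch of both variants for one variable. It assumes a normal model fitted by the sample mean and standard deviation, and it ignores parameter uncertainty, which a proper multiple-imputation procedure would account for. The function name `synthesize` and its `replace` parameter are hypothetical.

```python
import random
import statistics

def synthesize(values, m, replace=None, seed=None):
    """Return m synthetic copies of `values` drawn from a fitted normal model.

    If `replace` is given, only those positions are synthesized ("partial");
    otherwise every value is synthesized ("full").
    """
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    idx = set(range(len(values))) if replace is None else set(replace)
    return [
        [round(rng.gauss(mu, sigma), 1) if i in idx else v for i, v in enumerate(values)]
        for _ in range(m)
    ]

ages = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]

full_sets = synthesize(ages, m=5, seed=2005)                  # "full" synthesis
partial_sets = synthesize(ages, m=5, replace=[4], seed=2005)  # only the extreme age

for s in partial_sets:
    print(s)
```

A normal model can generate implausible ages, so a real application would need a model suited to the variable; the sketch only shows the release structure of m generated data sets.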
10 SDL Techniques (cont.)
§ MASSC
§ Uses recoding, substitution, and sampling to change the values of key variables
§ Sampling weights are created or adjusted to allow for accurate estimation of “totals”
After all that…how do we compare methods?
§ Some perturb data: All but Global recoding, Swapping, and Local suppression
§ Some require construction of models: Single and Multiple imputation
§ Some are easier to implement than others: Global recoding, Local Suppression, Rounding
After all that…how do we compare methods? (cont.)
Existence of software
§ Not much SDL software available
§ µ-Argus – Publicly Available, Free
§ MASSC – Software exists, provided as service
§ Privacert Appliance – Commercial, appears to use suppression
§ IVEware – General imputation software for SAS
After all that…how do we compare methods? (cont.)
Impact on information loss
§ No generally accepted measure of information loss
§ Some methods provide estimates of information loss: PRAM and MASSC
After all that…how do we compare methods? (cont.)
Analyzing Data After Treatment
§ Global recoding – no adjustments necessary
§ Swapping/Local suppression/rounding – no general adjustments exist
§ Imputation – adjusted variance formulas exist in certain implementations
§ Sampling/MASSC – sampling weights can be used to adjust estimates
§ PRAM – Estimates adjusted because probability mechanism is known
After all that…how do we compare methods? (cont.)
§ Ability to assess Risk
§ PRAM, Imputation, and MASSC all provide some measure of “risk”
§ No single measure of risk exists, though record linkage techniques (probabilistic, distance-based) have been used to compare different methods
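As a rough illustration of distance-based record linkage, the sketch below links each protected value to the closest original value and reports the share of records linked back to their true source. It uses the ages from the earlier noise addition and microaggregation examples; the function name `linkage_rate` and the use of a single variable are simplifying assumptions.

```python
def linkage_rate(original, masked):
    """Share of masked records whose nearest original record is their true source."""
    hits = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - m))
        hits += nearest == i
    return hits / len(masked)

original = [13.2, 43.7, 35.3, 12.4, 89.5, 27.9]
noisy = [13.9, 41.3, 33.1, 14.8, 93.2, 29.1]   # after noise addition
aggregated = [46, 28, 46, 28, 46, 28]          # after microaggregation

print(linkage_rate(original, noisy))       # 5 of 6 records linked correctly
print(linkage_rate(original, aggregated))  # 1 of 6 records linked correctly
```

On this tiny example the noisy file is far easier to link than the microaggregated one, which shows how a linkage rate can put different methods on a common scale.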
Summary
§ There are technical and other motivations (legal, political) that affect which SDL methods are used
§ Find a balance between disclosure risk and information loss
§ No method is right for every situation
§ Consult your local statistician!
There can be only one…Reference
§ There are many, many journal articles dealing with statistical disclosure
§ Too many to list here
§ One good reference:
Willenborg, Leon and de Waal, Ton (2001), Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Springer-Verlag.