Protecting Statistical Databases Against Snoopers
Comparison of two methods
Disclosure vs. Anonymity
Information disclosure necessary for planning and numerical measurements
Anonymity necessary for protection of the individual and the public’s trust in systems
Medical Data
Necessary for:
Measuring effectiveness of current treatments
Finding sources of common medical mistakes
Tracking contagious disease
Government spending planning
Health Insurance Companies
Anonymity: Not as Easy as it Looks
Race
Profession
Sex
Zip code
Birth date
Complete Identification Without Uniquely Identifying Information
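As a rough illustration of the point above, the sketch below counts how many records in a small table are singled out by the combination of race, profession, sex, zip code and birth date alone. The records and the helper name `unique_fraction` are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical records: (race, profession, sex, zip code, birth date).
# No names or ID numbers are stored.
records = [
    ("White", "Nurse", "F", "22903", "1970-03-14"),
    ("White", "Nurse", "F", "22903", "1970-03-14"),
    ("Black", "Teacher", "M", "22903", "1982-11-02"),
    ("Asian", "Engineer", "F", "22904", "1990-07-21"),
]

def unique_fraction(rows):
    """Fraction of rows whose attribute combination appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

# Two of the four hypothetical records are unique on these attributes alone.
fraction = unique_fraction(records)
```

Any record counted here could be re-identified by a snooper who already knows those five attributes about a person.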
Outside Factors Affecting Privacy
Snooper’s supplementary knowledge
Public data sources
Rarity
Comparing Two Methods of Protection
What are the privacy guarantees?
Can useful information be gained?
Sensitivity-based Noise-adding Algorithm
Proposed by Dwork, McSherry, Nissim and Smith
Adds noise to each answer based on the sensitivity of the series of queries
Amount of privacy based on ε, a coefficient in the noise-generating formula
Sensitivity
How much could changing one row change an answer?
MEAN, COUNT, HISTOGRAMS
The sensitivity of a series of queries is the sum of the sensitivities of the queries
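A minimal sketch of the noise-adding idea, assuming Laplace noise with scale equal to the summed sensitivity divided by ε, as in the Dwork et al. paper. The function names, and the per-query sensitivities used in the usage note (2 for a histogram, 1 for a count), are assumptions made for this example.

```python
import random

def series_sensitivity(sensitivities):
    """Sensitivity of a series of queries: the sum of the per-query values."""
    return sum(sensitivities)

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_answers(true_answers, sensitivities, epsilon, rng):
    """Add independent Laplace noise to every answer in the series.

    true_answers holds one value per released number (for example, each
    histogram cell); sensitivities holds one value per query in the series.
    """
    scale = series_sensitivity(sensitivities) / epsilon
    return [answer + laplace_noise(scale, rng) for answer in true_answers]
```

For example, `noisy_answers([120, 45, 35], [2, 1], 0.5, random.Random(0))` would perturb three released numbers using a noise scale of (2 + 1) / 0.5 = 6. Each added query raises the noise on every answer, which is why this method is expected to do best with few queries.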
Coin-flip Algorithm
Proposed by Mishra and Sandler
A way for individuals to publish their own personal data
Amount of privacy based on ε, the bias in the coin-flip
Implementing the Coin-flip Algorithm
The k possible answers to a query are ordered and numbered
If an individual’s answer to the query is the ith answer, the profile is a string of k bits in which the ith bit is a one and the others are zero
To sanitize, each bit is flipped with probability ½ + ε/2
All sanitized profiles resemble a random string of ones and zeros
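The steps above can be sketched as follows. The flip probability is taken exactly as stated on the slide (½ + ε/2). The function names and the generic answer labels are assumptions for illustration, not the authors' implementation.

```python
import random

def make_profile(answers, answer):
    """Build a k-bit profile: one where the answer matches, zero elsewhere."""
    return [1 if a == answer else 0 for a in answers]

def flip_probability(epsilon):
    """Per-bit flip probability as given on the slide: 1/2 + epsilon/2."""
    return 0.5 + epsilon / 2

def sanitize(profile, epsilon, rng):
    """Flip each bit independently with the slide's flip probability."""
    p = flip_probability(epsilon)
    return [bit ^ 1 if rng.random() < p else bit for bit in profile]

# Hypothetical query with three ordered answers.
profile = make_profile(["A", "B", "C"], "B")
sanitized = sanitize(profile, 0.1, random.Random(42))
```

Because every bit is flipped independently, a sanitized profile may contain several ones or none at all, so it no longer points to a single answer.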
Example: HIV status
Ordered possible responses: “POSITIVE, NEGATIVE, UNKNOWN”
The original profile of an HIV+ individual: “1, 0, 0”
Results of coin-flips: “STAY, FLIP, STAY”
Resulting sanitized profile: “1, 1, 0”
What do we know about the individual from the sanitized profile?
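One way to approach the question above: a single sanitized bit says little about its owner, but if the flip probability p is known, the true fraction f of ones in a large population can be estimated from the observed fraction, since observed = f(1 − p) + (1 − f)p. The sketch below inverts that relation. It is standard randomized-response arithmetic rather than anything taken from the slides, and the function names and population sizes are hypothetical.

```python
import random

def estimate_true_fraction(observed_fraction, flip_prob):
    """Invert observed = f*(1 - p) + (1 - f)*p to recover f."""
    if flip_prob == 0.5:
        raise ValueError("a fair coin destroys all information about f")
    return (observed_fraction - flip_prob) / (1 - 2 * flip_prob)

def simulate_observed_fraction(true_bits, flip_prob, rng):
    """Flip each bit with probability flip_prob; return the fraction of ones."""
    flipped = [b ^ 1 if rng.random() < flip_prob else b for b in true_bits]
    return sum(flipped) / len(flipped)

# Hypothetical population: 30% of individuals have a one in this position.
true_bits = [1] * 3000 + [0] * 7000
observed = simulate_observed_fraction(true_bits, 0.6, random.Random(7))
estimate = estimate_true_fraction(observed, 0.6)
```

The estimate approaches the true fraction as the population grows, while any one individual's bit stays close to a coin toss.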
My Research
Compare the total amount of error generated by histogram / frequency queries
Hypothesis: The noise-adding algorithm will generate less error for few queries and the coin-flip algorithm will generate less error for many queries
Research question: Where is the “sweet spot” where the error lines cross on a graph?
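To make the research question concrete, the sketch below finds the first query count at which one error curve drops below the other. The helper `first_sweet_spot` and both toy curves are hypothetical. The curves are chosen only so that they cross, and they are not the measurements from this study.

```python
def first_sweet_spot(noise_curve, coinflip_curve):
    """Return the first query count (1-based) at which the coin-flip total
    error is strictly below the noise-adding total error, or None.

    Each curve holds the total error for 1, 2, 3, ... queries.
    """
    for k, (noise, coin) in enumerate(zip(noise_curve, coinflip_curve), start=1):
        if coin < noise:
            return k
    return None

# Toy curves: one grows quadratically, the other linearly from a higher start.
noise_curve = [k * k for k in range(1, 11)]
coinflip_curve = [5 * k for k in range(1, 11)]
spot = first_sweet_spot(noise_curve, coinflip_curve)
```

With these toy curves the two are equal at 5 queries, and the coin-flip curve is first strictly lower at 6.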
[Chart: Sum of Error. x-axis: Number of Frequency Queries (1 to 381); y-axis: sum of error as percent of n (0% to 4000%); series: Coinflip, Noise Addition]
The “sweet spot” first occurs at 101 queries.
With the smallest histograms first, the first “sweet spot” occurs at 32 queries.
[Chart: Sum of Error. x-axis: Number of Frequency Queries (1 to 381); y-axis: sum of error as percent of n (0% to 4000%); series: Coinflip, Noise Addition]
With the largest histograms first, the first “sweet spot” occurs at 189 queries.
[Chart: Sum of Error. x-axis: Number of Frequency Queries (1 to 381); y-axis: sum of error as percent of n (0% to 4000%); series: Coinflip, Noise Addition]
A Second Look
Range of sensitivity: 2 to 136
Unordered histograms: at first “sweet spot”, sensitivity = 30.
Smallest histograms first: at first “sweet spot”, sensitivity = 32.
Largest histograms first: at first “sweet spot”, sensitivity = 34.
[Three charts, each titled Sum of Error. x-axis: Number of Frequency Queries (1 to 381); y-axis: sum of error as percent of n (0% to 4000%); series: Coinflip, Noise Addition]
[Chart: Difference in Error. x-axis: Sensitivity (2 to 92); y-axis: difference in percent error (-200% to 1600%)]
Conclusions
For histogram / frequency queries, “sweet spots” occur between sensitivity = 30 and sensitivity = 40, so for least error:
If sensitivity < 30, use NOISE-ADDING algorithm
If sensitivity > 40, use COIN-FLIP algorithm
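The rule of thumb above can be written as a small helper. The function name, its return labels, and the treatment of the 30 to 40 band as "either" are assumptions for illustration, since the slides give no recommendation inside that band.

```python
def choose_algorithm(sensitivity, low=30, high=40):
    """Pick the lower-error method for histogram / frequency queries.

    Thresholds default to the sweet-spot range reported in the conclusions.
    """
    if sensitivity < low:
        return "noise-adding"
    if sensitivity > high:
        return "coin-flip"
    return "either"  # inside the sweet-spot band the slides do not pick one
```

For instance, `choose_algorithm(2)` selects the noise-adding method and `choose_algorithm(136)` selects coin-flip, matching the two ends of the sensitivity range studied.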
Quick Bibliography
Survey: N R Adam and J C Wortmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4), December 1989.
Noise-adding algorithm: C Dwork, F McSherry, K Nissim, A Smith. Calibrating noise to sensitivity in private data analysis. 3rd Theory of Cryptography Conference, 2006.
Coin-flip algorithm: N Mishra, M Sandler. Privacy via pseudorandom sketches. Symposium on Principles of Database Systems, 2006.
Professor Alf Weaver, PhD
Professor Nina Mishra, PhD
REU program at UVa, sponsored by the National Science Foundation