Privacy Cognizant Information Systems
Rakesh Agrawal
IBM Almaden Research Center
joint work with Srikant, Kiernan & Xu
GOAL
Build information systems that
– protect the privacy and ownership of information
– do not impede the flow of information
Drivers
Policies and Legislations
– U.S. and international regulations
– Legal proceedings against businesses
Consumer Concerns
– “Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.” Forrester Research, 2001
– Most consumers are “privacy pragmatists.” Westin Surveys
Moral Imperative
– The right to privacy: the most cherished of human freedoms -- Warren & Brandeis, 1890
Outline
Privacy Preserving Data Mining
– How to have your cake and mine it too!
– Assuming no trusted third party.
Hippocratic Databases
– What I may see or hear in the course of treatment … I will keep to myself. - Hippocratic Oath, 8 (circa 400 BC)
Other related topics
– Privacy Preserving Information Integration
– Watermarking & fingerprinting
Data Mining and Privacy
The primary task in data mining:
– development of models about aggregated data.
Can we still develop accurate models, while protecting the privacy of individual records?
Randomization Overview
[Figure, shown in four builds: Alice, Bob, and Chris each hold a record that is sent to a Recommendation Service. The service runs a Mining Algorithm to produce a Data Mining Model. In the later builds the records that reach the service are randomized, and the final build adds a Recovery step alongside the Mining Algorithm.]
Original records: 35 | 95,000 | J.S. Bach | painting | nasa; 45 | 60,000 | B. Spears | baseball | cnn; 42 | 85,000 | B. Marley | camping | microsoft
Randomized records: 50 | 65,000 | Metallica | painting | nasa; 38 | 90,000 | B. Spears | soccer | fox; 32 | 55,000 | B. Marley | camping | linuxware
Inducing Classifiers over Privacy Preserved Numeric Data
[Figure: each person's record passes through a Randomizer before it is collected. Alice's record 30 | 25K | … becomes 65 | 50K | … (her age 30 becomes 65, i.e. 30+35), and John's record 50 | 40K | … becomes 35 | 60K | …. The randomized values feed Reconstruct Age Distribution and Reconstruct Salary Distribution, whose outputs go to the Decision Tree Algorithm, which produces the Model.]
Reconstruction Problem
Original values x1, x2, ..., xn
– from probability distribution X (unknown)
To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
Reconstruction Algorithm
f_X^0 := Uniform distribution
j := 0
repeat
    f_X^{j+1}(a) := update by Bayes’ Rule:
    $$f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz}$$
    j := j+1
until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)
Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
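The iteration above can be sketched in a discretized form. This is only an illustrative sketch, not the authors' implementation: it assumes the domain of X is split into equal-width bins represented by their midpoints, that the noise Y is zero-mean Gaussian, and that the stopping criterion is a fixed number of iterations. The names `reconstruct`, `gaussian_pdf`, `midpoints`, and `iterations` are hypothetical.

```python
import math
import random


def gaussian_pdf(v, sigma):
    """Density of a zero-mean Gaussian; stands in for the known f_Y (an assumption)."""
    return math.exp(-0.5 * (v / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def reconstruct(randomized, midpoints, noise_pdf, iterations=50):
    """Estimate the probability mass of X on each bin from w_i = x_i + y_i.

    noise_pdf must be strictly positive on the values it is called with,
    otherwise the Bayes denominator below can be zero.
    """
    n, k = len(randomized), len(midpoints)
    # likelihood[i][m] = f_Y(w_i - a_m); it does not change between iterations.
    likelihood = [[noise_pdf(w - a) for a in midpoints] for w in randomized]
    f_x = [1.0 / k] * k  # f_X^0 := uniform distribution
    for _ in range(iterations):  # assumed stopping criterion: fixed iteration count
        updated = [0.0] * k
        for row in likelihood:
            # Bayes' rule: posterior over the bins for this one randomized value.
            denom = sum(l * p for l, p in zip(row, f_x))
            for m in range(k):
                updated[m] += row[m] * f_x[m] / denom
        f_x = [u / n for u in updated]  # average the n posteriors
    return f_x


if __name__ == "__main__":
    rng = random.Random(0)
    sigma = 10.0
    ages = [rng.gauss(40.0, 3.0) for _ in range(1000)]      # hidden x_i
    randomized = [x + rng.gauss(0.0, sigma) for x in ages]  # visible x_i + y_i
    midpoints = [2.0 + 4.0 * m for m in range(20)]          # 4-wide bins over [0, 80]
    estimate = reconstruct(randomized, midpoints, lambda v: gaussian_pdf(v, sigma))
    for a, p in zip(midpoints, estimate):
        print(f"{a:5.1f}  {p:.3f}")
```

Each pass averages the per-record posteriors, so the estimate always sums to 1. In the discretized form the bin width cancels between numerator and denominator, which is why it does not appear in the code.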
Works Well
[Figure: Number of People (0 to 1200) plotted against Age (20 to 60) for three distributions: Original, Randomized, Reconstructed.]
Decision Tree Example
Age   Salary   Repeat Visitor?
23    50K      Repeat
17    30K      Repeat
43    40K      Repeat
68    50K      Single
32    70K      Single
20    20K      Repeat
Age < 25?
    Yes: Repeat
    No: Salary < 50K?
        Yes: Repeat
        No: Single
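As a quick check, the tree on this slide can be written as a function and run against the six rows of the table. This is a minimal sketch. The function name `classify` and the choice to express salaries in whole dollars are assumptions made for illustration.

```python
def classify(age, salary):
    """Apply the slide's two-level decision tree to one visitor."""
    if age < 25:
        return "Repeat"
    if salary < 50_000:
        return "Repeat"
    return "Single"


# The six training rows from the slide: (age, salary, label).
TRAINING = [
    (23, 50_000, "Repeat"),
    (17, 30_000, "Repeat"),
    (43, 40_000, "Repeat"),
    (68, 50_000, "Single"),
    (32, 70_000, "Single"),
    (20, 20_000, "Repeat"),
]

if __name__ == "__main__":
    for age, salary, label in TRAINING:
        print(age, salary, label, classify(age, salary) == label)
```

Every row is classified consistently with its label, so the tree fits the example data exactly.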
Accuracy vs. Randomization
[Figure (Fn 3): Accuracy (40 to 100) plotted against Randomization Level (10, 20, 40, 60, 80, 100, 150, 200) for Original, Randomized, and Reconstructed data.]
Hippocratic Databases
Hippocratic Oath, 8 (circa 400 BC)
– What I may see or hear in the course of treatment … I will keep to myself.
What if the database systems were to embrace the Hippocratic Oath?
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. VLDB 2002.
The Ten Principles
Driven by current privacy legislation.
– US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)
Principles:
– Collection Group: Purpose Specification, Consent, Limited Collection
– Use Group: Limited Use, Limited Disclosure, Limited Retention, Accuracy
– Security & Openness Group: Safety, Openness, Compliance
Collection Group
1. Purpose Specification
– For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
2. Consent
– The purposes associated with personal information shall have consent of the donor (person whose information is being stored).
3. Limited Collection
– The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
Use Group
4. Limited Use
– The database shall run only those queries that are consistent with the purposes for which the information has been collected.
5. Limited Disclosure
– Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
Use Group (2)
6. Limited Retention
– Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
7. Accuracy
– Personal information stored in the database shall be accurate and up-to-date.
Security & Openness Group
8. Safety
– Personal information shall be protected by security safeguards against theft and other misappropriations.
9. Openness
– A donor shall be able to access all information about the donor stored in the database.
10. Compliance
– A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.
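To make a few of these principles concrete, the sketch below shows how Purpose Specification, Consent, Limited Use, and Limited Retention might be checked in code. It is purely illustrative and is not taken from the Hippocratic Databases design: the `POLICY` table, the purpose names, the attribute names, and the retention periods are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PurposeRule:
    attributes: frozenset  # attributes collected for this purpose
    retention_days: int    # how long data may be kept for this purpose


# Hypothetical privacy metadata: purpose -> what may be used and for how long.
POLICY = {
    "shipping": PurposeRule(frozenset({"name", "address"}), 30),
    "recommendations": PurposeRule(frozenset({"name", "purchase_history"}), 365),
}


def query_allowed(purpose, requested_attributes, consented_purposes):
    """Limited Use: run a query only if the donor consented to its purpose
    and every requested attribute was collected for that purpose."""
    rule = POLICY.get(purpose)
    if rule is None or purpose not in consented_purposes:
        return False
    return set(requested_attributes) <= rule.attributes


def retention_expired(purpose, collected_on, today):
    """Limited Retention: data is due for deletion once its period has passed."""
    return today > collected_on + timedelta(days=POLICY[purpose].retention_days)


if __name__ == "__main__":
    print(query_allowed("shipping", ["name", "address"], {"shipping"}))
    print(query_allowed("shipping", ["purchase_history"], {"shipping"}))
    print(retention_expired("shipping", date(2003, 1, 1), date(2003, 3, 1)))
```

Keeping the purpose attached to both the query and the stored attributes is what lets the check run per request rather than relying on a one-time access grant.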
Architecture
[Figure: architecture diagram. Labeled inputs: Privacy Policy, Data Collection, Queries, Other. Labeled components: Privacy Metadata Creator, Privacy Constraint Validator, Data Accuracy Analyzer, Query Intrusion Detector, Attribute Access Control, Record Access Control, Data Retention Manager, Data Collection Analyzer, Encryption Support. Labeled storage and logs: Store, Privacy Metadata, Audit Info, Audit Trail.]
Decision-Making Across Sovereign Data Repositories
Separate databases due to statutory, competitive, or security reasons.
Selective, minimal sharing on need-to-know basis.
Example: Among those who took a particular drug, how many had an adverse reaction and have DNA that contains a specific sequence?
Researchers must not learn anything beyond counts.
• Algorithms for computing joins and join counts while revealing minimal additional information.
Minimal Necessary Sharing
R = {a, u, v, x}, S = {b, u, v, y}
R ∩ S = {u, v}
– R must not know that S has b & y
– S must not know that R has a & x
Count (R ∩ S)
– R & S do not learn anything except that the result is 2.
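The slides do not say how the count is computed. One known approach to this kind of problem is commutative encryption, where each party encrypts hashed values with its own secret exponent and the order of encryption does not matter. The sketch below is a toy illustration under that assumption and simulates both parties in one process. The modulus, the hashing scheme, and the function names are hypothetical, and the parameters are not suitable for real use.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime used as a toy modulus; not a secure choice


def hash_to_group(value):
    """Map a value to an integer in [1, P-1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_key():
    """Pick a secret exponent coprime to P-1 so that encryption is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(x, key):
    """Commutative: encrypt(encrypt(x, a), b) == encrypt(encrypt(x, b), a)."""
    return pow(x, key, P)


def intersection_size(r_values, s_values):
    key_r, key_s = new_key(), new_key()  # each party keeps its own key secret
    # Each side encrypts its own hashed values; sorting hides the original order.
    y_r = sorted(encrypt(hash_to_group(v), key_r) for v in set(r_values))
    y_s = sorted(encrypt(hash_to_group(v), key_s) for v in set(s_values))
    # Each side then encrypts the other side's ciphertexts with its own key.
    z_r = {encrypt(y, key_s) for y in y_r}
    z_s = {encrypt(y, key_r) for y in y_s}
    # Doubly encrypted values coincide exactly for values present on both sides.
    return len(z_r & z_s)


if __name__ == "__main__":
    print(intersection_size(["a", "u", "v", "x"], ["b", "u", "v", "y"]))
```

Because only doubly encrypted values are compared, neither side sees the other's plaintext values in this sketch. A real protocol would also need a properly chosen group and an analysis of what the set sizes reveal.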
Watermarking Relational Databases
Goal: Deter data theft and assert ownership of pirated copies.
– Examples: Life Sciences, Electronic Parts.
Watermark
– Intentionally introduced pattern in the data.
– Very unlikely to occur by chance.
– Hard to find => hard to destroy (robust against malicious attacks).
Existing watermarking techniques developed for multimedia (images, sound, text, …) are not applicable to database tables.
– Rows in a table are unordered.
– Rows can be inserted, updated, deleted.
– Attributes can be added, dropped.
New algorithm for watermarking database tables.
– Watermark can be detected using only a subset of the rows and attributes of a table.
– Robust against updates, incrementally updatable.
Watermark Insertion (Database)
1. Choose secret key
2. Specify table/attributes to be marked
3. Pseudo randomly select a subset of the rows for marking
– Function of secret key and attribute values
Watermark Detection (Suspicious Database)
1. Specify secret key
2. Specify table/attributes which should contain marks
3. Identify marked rows/attributes, compare marks with expected mark values
– Requires neither original unmarked data nor the watermark
4. Confirm presence or absence of the watermark
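The insertion and detection steps can be sketched as follows. This is a simplified illustration, not the published algorithm. It assumes each row has a primary key and one integer attribute, that a keyed hash (HMAC-SHA256) of the primary key decides which rows are marked, and that the mark is the least significant bit of the attribute. The constant `GAMMA` and all function names are hypothetical.

```python
import hashlib
import hmac

GAMMA = 2  # hypothetical parameter: roughly 1 in GAMMA rows gets marked


def _mac(secret_key, primary_key):
    """Keyed hash of the row's primary key, as an integer."""
    digest = hmac.new(secret_key, primary_key.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def _is_marked(mac):
    return mac % GAMMA == 0


def _mark_bit(mac):
    return (mac // GAMMA) & 1


def insert_watermark(rows, secret_key):
    """rows maps primary key -> non-negative integer attribute value."""
    marked = {}
    for pk, value in rows.items():
        mac = _mac(secret_key, pk)
        if _is_marked(mac):
            value = (value & ~1) | _mark_bit(mac)  # overwrite the lowest bit
        marked[pk] = value
    return marked


def detect_watermark(rows, secret_key):
    """Return (matching marks, rows expected to be marked).

    Needs only the secret key and the suspicious rows, not the original data.
    """
    total = matches = 0
    for pk, value in rows.items():
        mac = _mac(secret_key, pk)
        if _is_marked(mac):
            total += 1
            if (value & 1) == _mark_bit(mac):
                matches += 1
    return matches, total


if __name__ == "__main__":
    table = {f"part-{i}": 1000 + 7 * i for i in range(200)}
    marked = insert_watermark(table, b"owner-secret")
    print(detect_watermark(marked, b"owner-secret"))
```

Because the selection depends only on the key and each row's own primary key, detection still works on a subset of the rows. A high ratio of matches to expected marks suggests the watermark is present, while an unrelated table should match only about half the time.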
Closing Thoughts
Solutions to complex problems such as privacy require a mix of legislations, societal norms, market forces & technology
By advancing technology, we can change the mix and improve the overall quality of the solution
References
M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
A. Evfimievski, J. Gehrke, R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS, San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int'l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.