

Page 1: Privacy Cognizant Information Systems Rakesh Agrawal IBM Almaden Research Center joint work with Srikant, Kiernan & Xu

Privacy Cognizant Information Systems

Rakesh Agrawal
IBM Almaden Research Center

Joint work with Srikant, Kiernan & Xu

Page 2

Goal

Build information systems that
– protect the privacy and ownership of information
– do not impede the flow of information

Page 3

Drivers

Policies and Legislation
– U.S. and international regulations
– Legal proceedings against businesses

Consumer Concerns
– “Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.” Forrester Research, 2001
– Most consumers are “privacy pragmatists.” Westin Surveys

Moral Imperative
– The right to privacy: the most cherished of human freedoms -- Warren & Brandeis, 1890

Page 4

Outline

Privacy Preserving Data Mining
– How to have your cake and mine it too!
– Assuming no trusted third party.

Hippocratic Databases
– What I may see or hear in the course of treatment … I will keep to myself. - Hippocratic Oath, 8 (circa 400 BC)

Other related topics
– Privacy Preserving Information Integration
– Watermarking & fingerprinting

Page 5

Data Mining and Privacy

The primary task in data mining:
– development of models about aggregated data.

Can we still develop accurate models, while protecting the privacy of individual records?

Page 6

Randomization Overview

[Figure: Alice, Bob, and Chris send their records to the Recommendation Service. The same three records appear on the user side and at the service:]

35 | 95,000 | J.S. Bach | painting | nasa
45 | 60,000 | B. Spears | baseball | cnn
42 | 85,000 | B. Marley | camping | microsoft (Chris)

Page 7

Randomization Overview

[Figure: the same three unmodified records (35 | 95,000 | J.S. Bach | painting | nasa; 45 | 60,000 | B. Spears | baseball | cnn; 42 | 85,000 | B. Marley | camping | microsoft) reach the Recommendation Service, where a Mining Algorithm produces a Data Mining Model.]

Page 8

Randomization Overview

[Figure: Alice, Bob, and Chris keep their original records; the Recommendation Service sees only randomized versions.]

Original records (at the users):
35 | 95,000 | J.S. Bach | painting | nasa
45 | 60,000 | B. Spears | baseball | cnn
42 | 85,000 | B. Marley | camping | microsoft (Chris)

Randomized records (at the service):
50 | 65,000 | Metallica | painting | nasa
38 | 90,000 | B. Spears | soccer | fox
32 | 55,000 | B. Marley | camping | linuxware

Page 9

Randomization Overview

[Figure: the Recommendation Service holds only the randomized records (50 | 65,000 | Metallica | painting | nasa; 38 | 90,000 | B. Spears | soccer | fox; 32 | 55,000 | B. Marley | camping | linuxware). A Recovery step and a Mining Algorithm produce the Data Mining Model.]

Page 10

Inducing Classifiers over Privacy Preserved Numeric Data

[Figure: each user's record passes through a Randomizer before it leaves the user. Alice's age and salary (30 | 25K | …) become 65 | 50K | …; for example, 30 becomes 65 (30+35). John's record (50 | 40K | …) becomes 35 | 60K | …. From the randomized values, the age distribution and the salary distribution are reconstructed and fed to a Decision Tree Algorithm, which outputs the Model.]
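The additive randomization shown on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `randomize`, the uniform noise model, and the range of ±35 are assumptions chosen so that the slide's example (30 becoming 65) is a possible outcome.

```python
import random

def randomize(values, spread, rng):
    """Add independent uniform noise in [-spread, +spread] to each value.

    The uniform noise model is an assumption made for this sketch; the slide
    only shows that a random amount is added to each true value.
    """
    return [v + rng.uniform(-spread, spread) for v in values]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
ages = [30, 50]                 # true ages from the slide (Alice, John)
noisy_ages = randomize(ages, spread=35, rng=rng)
print(noisy_ages)               # only these perturbed values leave the user
```

Only the perturbed values and the noise distribution are disclosed, which is what the reconstruction step on the next slides relies on.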

Page 11

Reconstruction Problem

Original values x1, x2, ..., xn
– from probability distribution X (unknown)

To hide these values, we use y1, y2, ..., yn
– from probability distribution Y

Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y

Estimate the probability distribution of X.

Page 12

Reconstruction Algorithm

fX0 := Uniform distribution
j := 0
repeat
    fXj+1(a) := Bayes’ Rule update (formula below)
    j := j+1
until (stopping criterion met)

Bayes’ Rule update:

$$ f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz} $$

(R. Agrawal & R. Srikant, SIGMOD 2000)

Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
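A discretized version of the iteration above might look like the sketch below. It is an illustration under stated assumptions, not the published implementation: the Gaussian noise model, the ten age bins, the synthetic data, and the names `gaussian_pdf` and `reconstruct` are all hypothetical, and the integral in the denominator is replaced by a sum over bin midpoints.

```python
import math
import random

def gaussian_pdf(d, sigma):
    """Density of zero-mean Gaussian noise Y at offset d (assumed noise model)."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def reconstruct(randomized, noise_pdf, midpoints, max_iter=100, tol=1e-6):
    """Estimate P(X in bin k) from w_i = x_i + y_i by repeated Bayes updates."""
    n_bins = len(midpoints)
    f = [1.0 / n_bins] * n_bins                  # fX0 := uniform distribution
    # lik[i][k] = fY(w_i - a_k); computed once because it never changes
    lik = [[noise_pdf(w - a) for a in midpoints] for w in randomized]
    for _ in range(max_iter):
        new_f = [0.0] * n_bins
        for row in lik:
            denom = sum(l * p for l, p in zip(row, f))   # discretized integral
            if denom == 0.0:
                continue                         # no bin explains this w_i
            for k in range(n_bins):
                new_f[k] += row[k] * f[k] / denom
        total = sum(new_f)
        if total == 0.0:
            break
        new_f = [v / total for v in new_f]       # plays the role of the 1/n factor
        change = max(abs(a - b) for a, b in zip(new_f, f))
        f = new_f
        if change < tol:                         # stopping criterion (an assumption)
            break
    return f

rng = random.Random(1)
sigma = 10.0                                     # noise level, known to the miner
true_ages = [rng.gauss(45, 3) for _ in range(1000)]          # synthetic X
randomized = [x + rng.gauss(0, sigma) for x in true_ages]    # x_i + y_i
midpoints = [5 + 10 * k for k in range(10)]                  # bins 0-10, ..., 90-100
estimate = reconstruct(randomized, lambda d: gaussian_pdf(d, sigma), midpoints)
for a, p in zip(midpoints, estimate):
    print(f"age ~{a:2d}: {p:.3f}")
```

The miner never sees `true_ages`; it works only from `randomized` and the known noise density, which matches the inputs listed on the Reconstruction Problem slide.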

Page 13

Works Well

[Figure: Number of People (0 to 1200) versus Age (20 to 60), with three curves: Original, Randomized, Reconstructed.]

Page 14

Decision Tree Example

Age | Salary | Repeat Visitor?
23 | 50K | Repeat
17 | 30K | Repeat
43 | 40K | Repeat
68 | 50K | Single
32 | 70K | Single
20 | 20K | Repeat

Decision tree:
Age < 25?
– Yes: Repeat
– No: Salary < 50K?
    – Yes: Repeat
    – No: Single
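The tree on this slide can be written as a small function and checked against the six training rows. This is a sketch: the function name `classify` and the encoding of salaries in thousands are assumptions, and the branch layout is read off the flattened slide text.

```python
def classify(age, salary_k):
    """Apply the slide's two-level tree; salary_k is the salary in thousands."""
    if age < 25:
        return "Repeat"
    if salary_k < 50:
        return "Repeat"
    return "Single"

# (age, salary in K, label) rows from the slide's table
training = [
    (23, 50, "Repeat"),
    (17, 30, "Repeat"),
    (43, 40, "Repeat"),
    (68, 50, "Single"),
    (32, 70, "Single"),
    (20, 20, "Repeat"),
]

predictions = [classify(age, salary) for age, salary, _ in training]
print(predictions)
```

Every prediction agrees with the label in the table, so this reading of the tree is consistent with the training data shown.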

Page 15

Accuracy vs. Randomization

[Figure: Accuracy (40 to 100) versus Randomization Level (10, 20, 40, 60, 80, 100, 150, 200) for Fn 3, with three curves: Original, Randomized, Reconstructed.]

Page 16

Hippocratic Databases

Hippocratic Oath, 8 (circa 400 BC)
– What I may see or hear in the course of treatment … I will keep to myself.

What if the database systems were to embrace the Hippocratic Oath?

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. VLDB 2002.

Page 17

The Ten Principles

Driven by current privacy legislation.
– US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)

Principles:
– Collection Group: Purpose Specification, Consent, Limited Collection
– Use Group: Limited Use, Limited Disclosure, Limited Retention, Accuracy
– Security & Openness Group: Safety, Openness, Compliance

Page 18

Collection Group

1. Purpose Specification
– For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.

2. Consent
– The purposes associated with personal information shall have consent of the donor (person whose information is being stored).

3. Limited Collection
– The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.

Page 19

Use Group

4. Limited Use
– The database shall run only those queries that are consistent with the purposes for which the information has been collected.

5. Limited Disclosure
– Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
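One way to picture the Limited Use principle is a check that compares a query's declared purpose with the purposes recorded for each attribute it touches. The sketch below is purely illustrative: the attribute names, purposes, and the `query_allowed` helper are hypothetical and are not taken from the Hippocratic Databases design.

```python
# Hypothetical privacy metadata: attribute -> purposes it was collected for
ALLOWED_PURPOSES = {
    "name": {"purchase", "registration"},
    "shipping_address": {"purchase"},
    "email": {"purchase", "registration"},
    "book_preferences": {"recommendations"},
}

def query_allowed(attributes, purpose, metadata=ALLOWED_PURPOSES):
    """Allow a query only if every attribute it reads permits the purpose.

    Attributes with no recorded purposes are treated as not accessible.
    """
    return all(purpose in metadata.get(attr, set()) for attr in attributes)

print(query_allowed(["name", "shipping_address"], "purchase"))   # True
print(query_allowed(["shipping_address"], "recommendations"))    # False
```

Denying access to attributes that have no recorded purpose is a design choice of this sketch; it keeps the default on the restrictive side.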

Page 20

Use Group (2)

6. Limited Retention
– Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.

7. Accuracy
– Personal information stored in the database shall be accurate and up-to-date.
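Limited Retention can be illustrated with a purge step that drops records once the retention period for their purpose has passed. The retention periods, record layout, and the `purge_expired` helper below are assumptions made for this sketch only.

```python
from datetime import date, timedelta

# Hypothetical retention periods per purpose
RETENTION = {
    "purchase": timedelta(days=30),
    "registration": timedelta(days=365),
}

def purge_expired(records, today, retention=RETENTION):
    """Keep only records still inside the retention period of their purpose."""
    return [
        r for r in records
        if today - r["collected_on"] <= retention[r["purpose"]]
    ]

records = [
    {"donor": "alice", "purpose": "purchase", "collected_on": date(2003, 1, 1)},
    {"donor": "bob", "purpose": "registration", "collected_on": date(2003, 1, 1)},
]
kept = purge_expired(records, today=date(2003, 6, 1))
print([r["donor"] for r in kept])
```

By June the 30-day purchase record has expired while the one-year registration record is still retained.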

Page 21

Security & Openness Group

8. Safety
– Personal information shall be protected by security safeguards against theft and other misappropriations.

9. Openness
– A donor shall be able to access all information about the donor stored in the database.

10. Compliance
– A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.

Page 22

Architecture

[Figure: architecture diagram. Labeled elements: Privacy Policy, Data Collection, Queries, Other, Store, Privacy Metadata Creator, Privacy Metadata, Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info, Audit Trail, Query Intrusion Detector, Attribute Access Control, Record Access Control, Data Retention Manager, Data Collection Analyzer, Encryption Support.]

Page 23

Decision-Making Across Sovereign Data Repositories

Separate databases due to statutory, competitive, or security reasons.

Selective, minimal sharing on need-to-know basis.

Example: Among those who took a particular drug, how many had an adverse reaction and have DNA that contains a specific sequence?

Researchers must not learn anything beyond counts.

Algorithms for computing joins and join counts while revealing minimal additional information.

Minimal Necessary Sharing

[Figure: R = {a, u, v, x}, S = {b, u, v, y}, R ∩ S = {u, v}.]

R ∩ S
– R must not know that S has b & y
– S must not know that R has a & x

Count (R ∩ S)
– R & S do not learn anything except that the result is 2.
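The slide does not spell out the protocol. One known way to compute an intersection count without exchanging raw values is commutative encryption, and the sketch below illustrates that idea on the slide's example. Everything here is an assumption: the modulus, the helper names, and the single-process simulation of both parties. The parameters are toy values and the code is not a secure implementation.

```python
import hashlib
import math
import random

P = 2**127 - 1   # a Mersenne prime, used here as a toy modulus

def hash_to_group(value):
    """Map a string to an integer in 1..P-1."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key(rng):
    """Pick an exponent coprime to P-1 so that x -> x^k mod P is a bijection."""
    while True:
        k = rng.randrange(2, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def encrypt(values, key, rng):
    """Exponentiate each value and shuffle so positions reveal nothing."""
    out = [pow(v, key, P) for v in values]
    rng.shuffle(out)
    return out

def intersection_size(r_values, s_values, rng):
    key_r, key_s = make_key(rng), make_key(rng)   # each party keeps its own key
    # Step 1: each party hashes and encrypts its own values.
    r_once = encrypt([hash_to_group(v) for v in r_values], key_r, rng)
    s_once = encrypt([hash_to_group(v) for v in s_values], key_s, rng)
    # Step 2: the encrypted lists are swapped and encrypted with the other key.
    r_twice = encrypt(r_once, key_s, rng)
    s_twice = encrypt(s_once, key_r, rng)
    # (h^a)^b == (h^b)^a mod P, so equal inputs collide after both keys.
    return len(set(r_twice) & set(s_twice))

count = intersection_size(["a", "u", "v", "x"], ["b", "u", "v", "y"], random.Random(7))
print(count)   # 2
```

Because both exponentiations commute, only values held by both parties produce equal doubly-encrypted results, so counting matches gives the intersection size without either side seeing the other's unmatched values in the clear.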

Page 24

Watermarking Relational Databases

Goal: Deter data theft and assert ownership of pirated copies.
– Examples: Life Sciences, Electronic Parts.

Watermark: Intentionally introduced pattern in the data.
– Very unlikely to occur by chance.
– Hard to find => hard to destroy (robust against malicious attacks).

Existing watermarking techniques developed for multimedia (images, sound, text, …) are not applicable to database tables.
– Rows in a table are unordered.
– Rows can be inserted, updated, deleted.
– Attributes can be added, dropped.

New algorithm for watermarking database tables.
– Watermark can be detected using only a subset of the rows and attributes of a table.
– Robust against updates, incrementally updatable.

Watermark Insertion (Database)
1. Choose secret key
2. Specify table/attributes to be marked
3. Pseudo randomly select a subset of the rows for marking
    – Function of secret key and attribute values

Watermark Detection (Suspicious Database)
1. Specify secret key
2. Specify table/attributes which should contain marks
3. Identify marked rows/attributes, compare marks with expected mark values
    – Requires neither original unmarked data nor the watermark
4. Confirm presence or absence of the watermark
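The insertion and detection steps above can be sketched with a keyed hash. This is a simplified illustration, not the published algorithm: the use of HMAC-SHA256, the `id` primary-key column, the parameter `gamma` (roughly one row in `gamma` is marked), and marking only the least significant bit of one integer attribute are all assumptions.

```python
import hashlib
import hmac

def keyed_hash(secret_key, primary_key):
    """Pseudo-random integer derived from the secret key and the row's key."""
    mac = hmac.new(secret_key, str(primary_key).encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest(), "big")

def insert_watermark(rows, secret_key, attribute, gamma):
    """Return a copy of rows with a key-dependent bit set in selected rows."""
    marked = []
    for row in rows:
        h = keyed_hash(secret_key, row["id"])
        new_row = dict(row)
        if h % gamma == 0:                       # row selected for marking
            bit = (h // gamma) & 1               # expected mark value
            new_row[attribute] = (row[attribute] & ~1) | bit
        marked.append(new_row)
    return marked

def detect_watermark(rows, secret_key, attribute, gamma):
    """Count selected rows whose least significant bit matches the expected mark."""
    total = matches = 0
    for row in rows:
        h = keyed_hash(secret_key, row["id"])
        if h % gamma == 0:
            total += 1
            if (row[attribute] & 1) == ((h // gamma) & 1):
                matches += 1
    return matches, total

table = [{"id": i, "price": 1000 + 7 * i} for i in range(1000)]
key = b"owner-secret"
marked_table = insert_watermark(table, key, "price", gamma=10)

# Detection needs only the key, and works on a subset of the rows.
matches, total = detect_watermark(marked_table[::2], key, "price", gamma=10)
print(matches, total)
```

With the correct key every selected row matches. With a wrong key about half would match by chance, so a real detector would compare the match count against a statistical threshold before confirming the watermark.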

Page 25

Closing Thoughts

Solutions to complex problems such as privacy require a mix of legislation, societal norms, market forces & technology.

By advancing technology, we can change the mix and improve the overall quality of the solution.

Page 26

References

M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.

R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.

A. Evfimievski, J. Gehrke, R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS, San Diego, California, June 2003.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.

R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.

R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int'l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.