UTEP Computer Science Dept. 1 University of Texas at El Paso Privacy in Statistical Databases Dr. Luc Longpré Computer Science Department Spring 2006


University of Texas at El Paso

Privacy in Statistical Databases

Dr. Luc Longpré
Computer Science Department

Spring 2006


Database with Confidential Information

• Examples:
  – census data
  – medical information

• Privacy: protect the confidentiality of individuals

• Usefulness: want to derive meaningful statistics


The Need for Privacy Safeguards

• Per person available disk space:
  – 1983: 0.02 MB
  – 1996: 28 MB
  – 2000: 472 MB

• Equivalent of one page per 3 minutes of life


The Need for Privacy Safeguards

• Misuse of personal health information:
  – banker cross-referencing cancer patients with outstanding loans
  – using medical records to make decisions about employees
  – snooping in hospital computer network
  – 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission


Approaches

• Access control, encryption:
  – Only controls who has access to what
  – Does not protect against disclosures based on inference

• Problem:
  – Sometimes it may be possible to derive confidential information from released information


Examples

• Salary database

• Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?


Examples

• 87% of the US population is unique under an ID made of:
  – 5-digit ZIP
  – gender
  – date of birth


Linking to Re-Identify Data

• Medical database:
  – Ethnicity, visit date, diagnosis, procedure, medication, ZIP, birth date, sex

• Voter list:
  – Name, address, date registered, ZIP, birth date, sex
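
The two lists share ZIP, birth date, and sex, so a join on those three fields can attach a name to a medical record. The following is a minimal sketch of that linking step, not a method taken from the slides: every record, field name, and the `link` helper are hypothetical and invented for illustration.

```python
# Hypothetical toy records; all names and values are invented for illustration.
medical = [
    {"zip": "79912", "birth_date": "1950-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "79912", "birth_date": "1962-07-01", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "79902", "birth_date": "1950-03-14", "sex": "F", "diagnosis": "flu"},
]
voters = [
    {"name": "Alice Example", "zip": "79912", "birth_date": "1950-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "79912", "birth_date": "1962-07-01", "sex": "M"},
    {"name": "Carol Example", "zip": "79930", "birth_date": "1971-11-30", "sex": "F"},
]

# Fields that appear in both lists.
SHARED_FIELDS = ("zip", "birth_date", "sex")


def link(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs for medical rows matching exactly one voter."""
    index = {}
    for voter in voter_rows:
        key = tuple(voter[f] for f in SHARED_FIELDS)
        index.setdefault(key, []).append(voter["name"])

    matches = []
    for row in medical_rows:
        names = index.get(tuple(row[f] for f in SHARED_FIELDS), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(medical, voters)
```

In this toy data two of the three medical rows match exactly one voter; the third has no voter with the same ZIP and stays unlinked.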


Statistical Database

• Data collected with the purpose of releasing statistical information.

• Important for research, policy

• Facing tremendous demand for person-specific data
  – data mining, fraud detection, homeland security


Sample Size

• Possible solution: do not release any statistics on any set of fewer than, say, 10 records


Problem Remains

• Query 1: What’s the average salary of all males age 89 in ZIP code 79912?

• Query 2: What’s the average salary of all people age 89 in ZIP code 79912?
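
Both queries can pass the minimum-size rule and still isolate one person when the two query sets differ by a single record. The sketch below illustrates this under stated assumptions: the data is invented (ten men and one woman), and the hypothetical `query` function returns the size of the query set along with the average, which the slides do not specify. An attacker who learns the counts some other way could do the same arithmetic.

```python
MIN_SET_SIZE = 10  # the threshold suggested on the "Sample Size" slide

# Hypothetical data: ten men and one woman, all age 89 in ZIP code 79912.
people = [
    {"sex": "M", "age": 89, "zip": "79912", "salary": 30000 + 1000 * i}
    for i in range(10)
]
people.append({"sex": "F", "age": 89, "zip": "79912", "salary": 52000})


def query(rows, predicate):
    """Return (count, average salary), or None if the query set is too small."""
    selected = [r["salary"] for r in rows if predicate(r)]
    if len(selected) < MIN_SET_SIZE:
        return None  # refused by the size restriction
    return len(selected), sum(selected) / len(selected)


def in_group(r):
    return r["age"] == 89 and r["zip"] == "79912"


# Asking about the one woman directly is refused.
direct = query(people, lambda r: in_group(r) and r["sex"] == "F")

# Query 1 and Query 2 are both answered.
n_men, avg_men = query(people, lambda r: in_group(r) and r["sex"] == "M")
n_all, avg_all = query(people, in_group)

# Total salary of everyone minus total salary of the men leaves one salary.
inferred_salary = round(n_all * avg_all - n_men * avg_men)
```

The subtraction recovers the woman's salary even though no released statistic covered fewer than ten records.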


K-anonymity

• Release only information where at least k records are identical (work by Sweeney)

• Attacks are still possible:
  – Unsorted matching: use the order of records
    • solution: randomize order


K-anonymity

  – Complementary release: combining k-anonymous releases may not be k-anonymous
    • solution: consider all releases together

  – Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties
    • solution: analyze k-anonymous properties of dynamic data
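
One way to read the k-anonymity condition is that every combination of identifying attribute values in a release must occur in at least k records. The sketch below checks that property and shuffles the rows before release, as suggested for the unsorted matching attack. It is an illustration only: the sample rows, the choice of identifying attributes, and the helper names `is_k_anonymous` and `randomized` are assumptions, not part of the slides.

```python
import random
from collections import Counter

# Hypothetical release with ZIP and birth year already generalized.
release = [
    {"zip": "799**", "birth_year": "1950-1959", "sex": "F", "diagnosis": "asthma"},
    {"zip": "799**", "birth_year": "1950-1959", "sex": "F", "diagnosis": "flu"},
    {"zip": "799**", "birth_year": "1960-1969", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "799**", "birth_year": "1960-1969", "sex": "M", "diagnosis": "asthma"},
]

# Assumed identifying attributes for this example.
IDENTIFYING = ("zip", "birth_year", "sex")


def is_k_anonymous(rows, attributes, k):
    """True if every combination of attribute values occurs in at least k rows."""
    groups = Counter(tuple(row[a] for a in attributes) for row in rows)
    return all(count >= k for count in groups.values())


def randomized(rows, rng=random):
    """Return a shuffled copy so row order carries no information."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


ok_for_2 = is_k_anonymous(release, IDENTIFYING, 2)  # each group has 2 rows
ok_for_3 = is_k_anonymous(release, IDENTIFYING, 3)  # no group has 3 rows
published = randomized(release)
```

Checking a single release this way does not address the complementary release or temporal attacks, which need all releases considered together.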


Other Solutions

• Add noise in the answers

• Add noise in the data

• Limit the kinds of queries allowed to the statistical database
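
The first two options can be sketched in a few lines. This is a minimal illustration assuming zero-mean Gaussian noise; the slides do not name a noise distribution, and the salaries, the `sigma` value, and the function names are hypothetical.

```python
import random

# Hypothetical salaries; values are invented for illustration.
salaries = [41000, 52000, 38500, 61000, 47250]
true_average = sum(salaries) / len(salaries)


def noisy_answer(values, sigma, rng):
    """Noise in the answer: compute the exact average, then perturb the result."""
    return sum(values) / len(values) + rng.gauss(0.0, sigma)


def noisy_data(values, sigma, rng):
    """Noise in the data: perturb each record, then answer from the perturbed copy."""
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(2006)  # fixed seed so the sketch is repeatable
answer = noisy_answer(salaries, sigma=1000.0, rng=rng)
perturbed = noisy_data(salaries, sigma=1000.0, rng=rng)
answer_from_perturbed = sum(perturbed) / len(perturbed)
```

One design consideration: if fresh noise is drawn for every answer, repeating the same query and averaging the results tends to cancel the noise, whereas data perturbed once gives the same answer each time.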


Quantifying Information

• Need a formal model, possibly based on information theory

• Measure entropy in database records before and after a statistical release
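
A toy calculation shows what measuring entropy before and after a release could look like. The model here is an assumption made for illustration, not one given in the slides: two records, each with a salary drawn uniformly from four hypothetical levels, and a release consisting of their sum.

```python
from itertools import product
from math import log2

# Hypothetical salary levels; each record is equally likely to hold any of them.
CANDIDATE_SALARIES = (1, 2, 3, 4)


def entropy_bits(n_outcomes):
    """Entropy in bits of a uniform distribution over n_outcomes possibilities."""
    return log2(n_outcomes)


# Before the release: every pair of salaries is possible.
worlds_before = list(product(CANDIDATE_SALARIES, repeat=2))

# After releasing the sum of the two salaries, only consistent pairs remain.
released_sum = 5
worlds_after = [w for w in worlds_before if sum(w) == released_sum]

h_before = entropy_bits(len(worlds_before))  # 16 possible pairs
h_after = entropy_bits(len(worlds_after))    # 4 pairs sum to 5
bits_revealed = h_before - h_after
```

Here the release removes 2 of the 4 bits of uncertainty about the pair of records.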


Further Complications

• Some data is more sensitive than others
  – Example: bits in salary

• Common knowledge, information from other databases
  – Could define entropy conditional on available information
  – Very impractical in applications

• Some people know some of the records


Non Additivity

• Data sensitivity is non additive
  – Ex: don’t mind any single digit of an SSN being released, but do mind all digits being released

• Privacy loss is non additive
  – Ex: There could be 2 sets of information, each of which, if released alone, gives no information, but which, if released together, reveal all the information
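
One concrete instance of the second example is splitting a value into a random pad and the value XORed with that pad. This construction is a standard technique chosen here for illustration and is not named in the slides; the `split` and `combine` helpers are hypothetical.

```python
import secrets


def split(secret_byte):
    """Split a value in 0..255 into two pieces that are each uniformly random."""
    pad = secrets.randbelow(256)    # first piece: pure randomness
    return pad, secret_byte ^ pad   # second piece: secret masked by the pad


def combine(piece_a, piece_b):
    """Released together, the two pieces reveal the secret exactly."""
    return piece_a ^ piece_b


secret = 0x5A  # hypothetical sensitive value
piece_a, piece_b = split(secret)
recovered = combine(piece_a, piece_b)

# For a fixed secret, the second piece takes every value 0..255 exactly once
# as the pad ranges over 0..255, so on its own it says nothing about the secret.
values_of_second_piece = {secret ^ pad for pad in range(256)}
```

Each piece alone is uniformly distributed whatever the secret is, so the information revealed by the pair is not the sum of what each piece reveals.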


Past Research

• Denning: “Cryptography and data security”, 1982

• Sweeney: Ph.D. thesis, Applications to medical data, 1996

• A few more scattered results; the topic is becoming popular again as “privacy preserving data mining”.