
UTEP Computer Science Dept.

University of Texas at El Paso

Privacy in Statistical Databases

Dr. Luc Longpré

Computer Science Department

Spring 2006


Database with Confidential Information

• Examples:
– census data
– medical information

• Privacy: protect the confidentiality of individuals

• Usefulness: want to derive meaningful statistics


The Need for Privacy Safeguards

• Per person available disk space:
– 1983: 0.02 MB
– 1996: 28 MB
– 2000: 472 MB

• Equivalent of one page per 3 minutes of life


The Need for Privacy Safeguards

• Misuse of personal health information:
– banker cross-referencing cancer patients with outstanding loans
– using medical records to make decisions about employees
– snooping in hospital computer network
– 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission


Approaches

• Access control, encryption:
– Only fixes who has access to what
– Does not protect against disclosures based on inference

• Problem:
– Sometimes it may be possible to derive confidential information from released information


Examples

• Salary database

• Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?


Examples

• 87% of the population of the US are unique under an ID made of:
– 5 digit ZIP
– gender
– date of birth
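A minimal sketch of how such a uniqueness figure could be measured on a table. The records below are made up for illustration, and `fraction_unique` is a hypothetical helper, not something from the slides.

```python
from collections import Counter

# Hypothetical toy records: (5-digit ZIP, gender, date of birth).
records = [
    ("79912", "M", "1960-03-14"),
    ("79912", "M", "1960-03-14"),
    ("79912", "F", "1971-08-02"),
    ("79968", "F", "1985-11-30"),
    ("79968", "M", "1985-11-30"),
]

def fraction_unique(rows):
    """Fraction of rows whose (ZIP, gender, birth date) tuple occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1) / len(rows)

print(fraction_unique(records))  # 0.6: three of the five records are unique
```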


Linking to Re-Identify Data

• Medical database:
– Ethnicity, visit date, diagnosis, procedure, medication, ZIP, Birth date, Sex

• Voter list:
– Name, address, date registered, ZIP, Birth date, Sex
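The two lists share ZIP, birth date, and sex, so they can be joined on those fields. The following is a sketch of that linking step under assumed data: every name and value is invented, and `link` is a hypothetical helper.

```python
# Hypothetical toy tables; all names and values are made up.
medical = [
    {"zip": "79912", "birth_date": "1960-03-14", "sex": "M", "diagnosis": "asthma"},
    {"zip": "79968", "birth_date": "1985-11-30", "sex": "F", "diagnosis": "diabetes"},
]
voters = [
    {"name": "A. Example", "zip": "79912", "birth_date": "1960-03-14", "sex": "M"},
    {"name": "B. Example", "zip": "79968", "birth_date": "1985-11-30", "sex": "F"},
    {"name": "C. Example", "zip": "79968", "birth_date": "1990-01-01", "sex": "F"},
]
KEY = ("zip", "birth_date", "sex")  # fields present in both lists

def link(medical_rows, voter_rows):
    """Pair each medical record with a voter name when the shared key matches exactly one voter."""
    names_by_key = {}
    for voter in voter_rows:
        names_by_key.setdefault(tuple(voter[k] for k in KEY), []).append(voter["name"])
    linked = []
    for record in medical_rows:
        names = names_by_key.get(tuple(record[k] for k in KEY), [])
        if len(names) == 1:  # unambiguous match: the record is re-identified
            linked.append((names[0], record["diagnosis"]))
    return linked

print(link(medical, voters))
```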


Statistical Database

• Data collected with the purpose of releasing statistical information.

• Important for research, policy

• Facing tremendous demand for person-specific data
– data mining, fraud detection, homeland security


Sample Size

• Possible solution: do not release any statistics on any set of fewer than, say, 10 records
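A sketch of such a minimum-size rule, assuming a simple in-memory table. The function name `average_salary`, the field names, and the data are illustrative assumptions; only the threshold of 10 comes from the slide.

```python
MIN_SET_SIZE = 10  # threshold suggested on the slide

def average_salary(rows, predicate, min_size=MIN_SET_SIZE):
    """Average salary of the rows matching predicate, or None if too few rows match."""
    matched = [row["salary"] for row in rows if predicate(row)]
    if len(matched) < min_size:
        return None  # refuse to answer: the query set is too small
    return sum(matched) / len(matched)

# Hypothetical data: 12 people with distinct ages.
rows = [{"age": 30 + i, "salary": 50000 + 1000 * i} for i in range(12)]

print(average_salary(rows, lambda r: True))            # answered: 12 records match
print(average_salary(rows, lambda r: r["age"] == 30))  # None: only 1 record matches
```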


Problem Remains

• Query 1: What’s the average salary of males age 89 in ZIP code 79912?

• Query 2: What’s the average salary of people age 89 in ZIP code 79912?
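A sketch of why the size limit does not help here: if the two query sets differ by a single person, the two answers determine that person's salary. The data are invented, and the sketch assumes the database also reports how many records matched (or that the attacker knows the counts some other way).

```python
# Hypothetical data: ten men and one woman, all age 89 in ZIP code 79912.
people = [{"sex": "M", "salary": 30000 + 1000 * i} for i in range(10)]
people.append({"sex": "F", "salary": 52000})

def answer(rows, predicate, min_size=10):
    """Return (count, average) for the matching rows, or None below the size limit."""
    matched = [row["salary"] for row in rows if predicate(row)]
    if len(matched) < min_size:
        return None
    return len(matched), sum(matched) / len(matched)

n_men, avg_men = answer(people, lambda r: r["sex"] == "M")  # Query 1: 10 records
n_all, avg_all = answer(people, lambda r: True)             # Query 2: 11 records

# Both queries pass the size check, yet their difference isolates one person.
recovered = round(n_all * avg_all - n_men * avg_men)
print(recovered)  # 52000
```

Asking directly for the woman's salary would be refused, since that query set has one record.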


K-anonymity

• Release only data in which every record is identical to at least k − 1 other records on its identifying attributes (work by Sweeney)

• Attacks are still possible:
– Unsorted matching: use the order of records

• solution: randomize order
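A minimal sketch of a k-anonymity check followed by the order randomization mentioned above. It reads "identical" as agreeing on a chosen set of identifying columns; that column choice, the helper `is_k_anonymous`, and the table are assumptions made for illustration.

```python
import random
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())

# Hypothetical released table with generalized ZIP and birth year.
released = [
    {"zip": "799**", "birth_year": "1960", "diagnosis": "asthma"},
    {"zip": "799**", "birth_year": "1960", "diagnosis": "flu"},
    {"zip": "799**", "birth_year": "1985", "diagnosis": "diabetes"},
    {"zip": "799**", "birth_year": "1985", "diagnosis": "flu"},
]
QUASI_IDENTIFIERS = ("zip", "birth_year")

random.shuffle(released)  # randomize order so row position reveals nothing
print(is_k_anonymous(released, QUASI_IDENTIFIERS, 2))  # True
print(is_k_anonymous(released, QUASI_IDENTIFIERS, 3))  # False
```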


K-anonymity

– Complementary release: combining k-anonymous releases may not be k-anonymous

• solution: consider all releases together

– Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties

• solution: analyze k-anonymous properties of dynamic data
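A sketch of the complementary release problem on invented data: two releases of the same four records, each 2-anonymous on (ZIP, age), become fully identifying once joined. The sketch assumes the diagnosis column is released unchanged in both tables and happens to be unique per record, which is what makes the join possible.

```python
from collections import Counter

def min_group_size(rows):
    """Smallest number of rows sharing the same (zip, age) pair; rows are (zip, age, diagnosis)."""
    return min(Counter((z, a) for z, a, _ in rows).values())

# Release 1 keeps the ZIP code and suppresses age.
release_1 = [("79912", "*", "flu"), ("79912", "*", "asthma"),
             ("79968", "*", "diabetes"), ("79968", "*", "cancer")]
# Release 2 keeps age and generalizes the ZIP code.
release_2 = [("799**", "30", "flu"), ("799**", "40", "asthma"),
             ("799**", "30", "diabetes"), ("799**", "40", "cancer")]

# Join the two releases on the diagnosis column (assumed unique here).
age_by_diagnosis = {d: a for _, a, d in release_2}
combined = [(z, age_by_diagnosis[d], d) for z, _, d in release_1]

print(min_group_size(release_1))  # 2
print(min_group_size(release_2))  # 2
print(min_group_size(combined))   # 1: the combination is no longer 2-anonymous
```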


Other Solutions

• Add noise in the answers

• Add noise in the data

• Limit the kinds of queries allowed to the statistical database
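The first two options can be sketched as follows. The slides do not say which noise distribution or magnitude to use; uniform noise and the `scale` parameter are assumptions chosen to keep the sketch simple, and the salaries are made up.

```python
import random

def noisy_average(values, scale, rng):
    """Noise in the answer: the true average plus uniform noise in [-scale, +scale]."""
    return sum(values) / len(values) + rng.uniform(-scale, scale)

def perturb_records(values, scale, rng):
    """Noise in the data: a copy with independent uniform noise added to each value."""
    return [v + rng.uniform(-scale, scale) for v in values]

salaries = [52000, 61000, 48000, 75000]  # hypothetical
rng = random.Random(2006)                # seeded only so the sketch is repeatable

print(noisy_average(salaries, 500, rng))
noisy_data = perturb_records(salaries, 500, rng)
print(sum(noisy_data) / len(noisy_data))
```

In both cases the reported average stays within `scale` of the true value, so statistics remain useful while exact values are blurred.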


Quantifying Information

• Need a formal model, possibly based on information theory

• Measure entropy in database records before and after a statistical release
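One way to make this concrete, as a sketch only: treat the attacker's knowledge as a uniform distribution over the record values still consistent with what has been released, and compare Shannon entropy before and after. The two-record database, the salary levels, and the uniform prior are all assumptions.

```python
import math
from itertools import product

def shannon_entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_entropy(candidates):
    """Entropy when every remaining candidate is equally likely."""
    return shannon_entropy([1 / len(candidates)] * len(candidates))

# Two records, each with an unknown salary level in {1, 2, 3, 4}.
candidates = list(product([1, 2, 3, 4], repeat=2))
before = uniform_entropy(candidates)

# Statistical release: the average salary level is 2.5.
consistent = [c for c in candidates if sum(c) / 2 == 2.5]
after = uniform_entropy(consistent)

print(before, after, before - after)  # 4.0 2.0 2.0
```

Under these assumptions the release costs 2 bits of uncertainty about the pair of records.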


Further Complications

• Some data is more sensitive than others
– Example: bits in salary

• Common knowledge, information from other databases
– Could define entropy conditional on available information
– Very impractical in applications

• Some people know some of the records


Non Additivity

• Data sensitivity is non additive
– Ex: don’t mind any single digit of SSN being released, but not all digits

• Privacy loss is non additive
– Ex: There could be 2 sets of information, each of which, if released alone, gives no information, but which, if released together, reveal all the information
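A standard illustration of the second point, sketched with one secret bit: release a random bit, and separately release the secret XORed with that bit. The setup (a uniformly random pad bit, all four worlds equally likely) is an assumption chosen for the example, not something from the slides.

```python
from itertools import product

# All equally likely worlds: (secret bit, random pad bit).
worlds = list(product([0, 1], repeat=2))

def release_a(secret, pad):
    return pad            # first release: the pad alone

def release_b(secret, pad):
    return secret ^ pad   # second release: secret XOR pad

def release_both(secret, pad):
    return (release_a(secret, pad), release_b(secret, pad))

def possible_secrets(release, observed):
    """Secret values still possible after observing the given release output."""
    return {s for s, p in worlds if release(s, p) == observed}

print(possible_secrets(release_a, 0))          # {0, 1}: nothing learned
print(possible_secrets(release_b, 1))          # {0, 1}: nothing learned
print(possible_secrets(release_both, (0, 1)))  # {1}: secret fully revealed
```

For each single release, both secret values remain equally likely whatever is observed; only the pair pins the secret down.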


Past Research

• Denning: “Cryptography and data security”, 1982

• Sweeney: Ph.D. thesis, Applications to medical data, 1996

• A few more stray results, topics becoming popular again in “privacy preserving data mining”.
