UTEP Computer Science Dept.1
University of Texas at El Paso
Privacy in Statistical Databases
Dr. Luc Longpré
Computer Science Department
Spring 2006
Database with Confidential Information
• Examples:
  – census data
  – medical information
• Privacy: protect the confidentiality of individuals
• Usefulness: want to derive meaningful statistics
The Need for Privacy Safeguards
• Available disk space per person:
  – 1983: 0.02 MB
  – 1996: 28 MB
  – 2000: 472 MB
• Equivalent of one page per 3 minutes of life
The Need for Privacy Safeguards
• Misuse of personal health information:
  – banker cross-referencing cancer patients with outstanding loans
  – using medical records to make decisions about employees
  – snooping in hospital computer network
  – 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission
Approaches
• Access control, encryption:
  – Only fixes who has access to what
  – Does not protect against disclosures based on inference
• Problem
  – Sometimes it may be possible to derive confidential information from released information
Examples
• Salary database
• Query: What’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?
Examples
• 87% of the population of the US are unique under an ID made of:
  – 5-digit ZIP
  – gender
  – date of birth
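A minimal sketch of how such a uniqueness figure can be measured: count how many records have a (ZIP, gender, date of birth) combination that no other record shares. The records below are hypothetical toy values, not census data, and `fraction_unique` is a name introduced only for this illustration.

```python
from collections import Counter

# Hypothetical toy records: (5-digit ZIP, gender, date of birth).
records = [
    ("79912", "M", "1917-03-02"),
    ("79912", "F", "1950-07-14"),
    ("79912", "F", "1950-07-14"),
    ("79968", "M", "1982-11-30"),
]

def fraction_unique(rows):
    """Fraction of rows whose attribute combination occurs exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

print(fraction_unique(records))
```

In this toy table two of the four records share all three attributes, so only the other two are unique.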
Linking to Re-Identify Data
• Medical database:
  – Ethnicity, visit date, diagnosis, procedure, medication, ZIP, Birth date, Sex
• Voter list:
  – Name, address, date registered, ZIP, Birth date, Sex
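The linking step can be sketched as a join on the attributes the two lists share (ZIP, birth date, sex). This is only an illustration under assumed data: every row, field name, and the `link` helper below are hypothetical.

```python
# Hypothetical toy tables; no real people or records.
medical = [
    {"zip": "79912", "birth_date": "1917-03-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "79968", "birth_date": "1982-11-30", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "A. Example", "zip": "79912", "birth_date": "1917-03-02", "sex": "M"},
    {"name": "B. Sample", "zip": "79902", "birth_date": "1975-01-09", "sex": "F"},
]

SHARED = ("zip", "birth_date", "sex")  # attributes present in both lists

def link(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs where the shared attributes match exactly one voter."""
    index = {}
    for v in voter_rows:
        index.setdefault(tuple(v[k] for k in SHARED), []).append(v["name"])
    matches = []
    for m in medical_rows:
        names = index.get(tuple(m[k] for k in SHARED), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches.append((names[0], m["diagnosis"]))
    return matches

print(link(medical, voters))
```

Neither list alone ties a name to a diagnosis; the join does so for any record whose shared attributes are unique in the voter list.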
Statistical Database
• Data collected with the purpose of releasing statistical information.
• Important for research, policy
• Facing tremendous demand for person-specific data
  – data mining, fraud detection, homeland security
Sample Size
• Possible solution: do not release any statistics on any set of fewer than, say, 10 records
Problem Remains
• Query 1: What’s the average salary of every male age 89 in zip code 79912?
• Query 2: What’s the average salary of people age 89 in zip code 79912?
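A minimal sketch of why the two queries defeat a minimum sample size: if both query sets are large enough to be answered, their difference can still isolate a single person. The data, the threshold `MIN_SET_SIZE`, and the assumption that the database also reports the size of each query set are all hypothetical choices made for this illustration.

```python
MIN_SET_SIZE = 10  # assumed threshold: refuse statistics on smaller sets

# Hypothetical data: ten men and one woman, all age 89 in ZIP 79912.
people = [
    {"sex": "M", "age": 89, "zip": "79912", "salary": 30000 + 1000 * i}
    for i in range(10)
]
people.append({"sex": "F", "age": 89, "zip": "79912", "salary": 52000})

def query(rows, predicate):
    """Return (count, average salary) for matching rows, or refuse small sets."""
    selected = [r for r in rows if predicate(r)]
    if len(selected) < MIN_SET_SIZE:
        raise ValueError("query set too small")
    return len(selected), sum(r["salary"] for r in selected) / len(selected)

# Query 1: men age 89 in 79912. Query 2: everyone age 89 in 79912.
n_men, avg_men = query(
    people, lambda r: r["sex"] == "M" and r["age"] == 89 and r["zip"] == "79912"
)
n_all, avg_all = query(people, lambda r: r["age"] == 89 and r["zip"] == "79912")

# Both answers are permitted, yet their difference is one person's salary.
recovered = n_all * avg_all - n_men * avg_men
print(round(recovered))
```

Asking directly for the woman's salary would be refused, since that query set has a single record, but subtracting the two permitted totals yields the same value.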
K-anonymity
• Release only information where at least k records are identical (work by Sweeney)
• Attacks are still possible:
  – Unsorted matching: use the order of records
    • solution: randomize order
K-anonymity
  – Complementary release: combining k-anonymous releases may not be k-anonymous
    • solution: consider all releases together
  – Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties
    • solution: analyze k-anonymous properties of dynamic data
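A minimal sketch of a k-anonymity check, assuming the usual reading that records must be identical on the attributes an attacker could link on (called `quasi_identifiers` here). The release table, the generalized values, and the function name are hypothetical. The last lines illustrate the "randomize order" remedy for unsorted matching.

```python
import random
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical release with generalized ZIP codes and birth years.
release = [
    {"zip": "799**", "birth_year": "1950s", "sex": "F", "diagnosis": "asthma"},
    {"zip": "799**", "birth_year": "1950s", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "799**", "birth_year": "1980s", "sex": "M", "diagnosis": "flu"},
    {"zip": "799**", "birth_year": "1980s", "sex": "M", "diagnosis": "asthma"},
]
QI = ("zip", "birth_year", "sex")

print(is_k_anonymous(release, QI, 2))  # each group has two rows
print(is_k_anonymous(release, QI, 3))  # no group has three rows

# Randomize the row order before release so position carries no information.
random.shuffle(release)
```

The check covers a single static release only; the complementary-release and temporal attacks listed above require applying it to all releases taken together.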
Other Solutions
• Add noise in the answers
• Add noise in the data
• Limit the kinds of queries allowed to the statistical database
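As an illustration of the first option, adding noise in the answers, the sketch below perturbs an average with zero-mean random noise before releasing it. The Gaussian distribution, the `scale` parameter, and the salary values are assumptions made for the example; the slides do not specify a noise mechanism.

```python
import random

def noisy_average(values, scale, rng=random):
    """Return the true average plus zero-mean Gaussian noise (assumed noise model)."""
    true_avg = sum(values) / len(values)
    return true_avg + rng.gauss(0, scale)

# Hypothetical salaries; a seeded generator keeps the example repeatable.
salaries = [41000, 52000, 38500, 60000, 47250]
rng = random.Random(0)
print(noisy_average(salaries, scale=1000.0, rng=rng))
```

A larger `scale` hides individual contributions better but makes each released statistic less accurate, which is the privacy versus usefulness trade-off stated at the start of the talk.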
Quantifying Information
• Need a formal model, possibly based on information theory
• Measure entropy in database records before and after a statistical release
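One way to make the before and after comparison concrete is to model an attacker's belief about a confidential value as a probability distribution and compute its Shannon entropy. The distributions below are hypothetical: eight equally likely salary bands before the release, two after.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Assumed attacker beliefs about one person's salary band.
before = [1 / 8] * 8    # eight equally likely bands before the release
after = [1 / 2, 1 / 2]  # the released statistic rules out all but two

leaked = entropy_bits(before) - entropy_bits(after)
print(entropy_bits(before), entropy_bits(after), leaked)
```

Here the uncertainty drops from 3 bits to 1 bit, so the release is charged with 2 bits of information about that record.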
Further Complications
• Some data is more sensitive than others
  – Example: bits in salary
• Common knowledge, information from other databases
  – Could define entropy conditional on available information
  – Very impractical in applications
• Some people know some of the records
Non Additivity
• Data sensitivity is non-additive
  – Ex: don’t mind any single digit of SSN being released, but not all digits
• Privacy loss is non-additive
  – Ex: there could be 2 sets of information, each of which, if released alone, gives no information, but which, if released together, reveal all the information
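A standard construction with exactly this property is an XOR split with a random pad: each piece alone is uniformly random, yet the two together reconstruct the secret. The sketch below is an illustration chosen here, not an example from the talk; the helper names and the sample value are hypothetical.

```python
import secrets

def split(secret: bytes):
    """Split a secret into two pieces; each piece alone is uniformly random."""
    pad = secrets.token_bytes(len(secret))
    masked = bytes(s ^ p for s, p in zip(secret, pad))
    return pad, masked

def combine(piece_a: bytes, piece_b: bytes) -> bytes:
    """XOR the two pieces back together to recover the secret."""
    return bytes(a ^ b for a, b in zip(piece_a, piece_b))

# Hypothetical 9-digit identifier used only for the demonstration.
pad, masked = split(b"123456789")
print(combine(pad, masked))
```

Releasing `pad` alone or `masked` alone reveals nothing about the identifier, so a privacy measure that simply adds up the loss of each release would report zero for a pair that discloses everything.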
Past Research
• Denning: “Cryptography and Data Security”, 1982
• Sweeney: Ph.D. thesis, Applications to medical data, 1996
• A few more scattered results; the topic is becoming popular again as “privacy-preserving data mining”.