UTEP Computer Science Dept.1
University of Texas at El Paso
Privacy in Statistical Databases
Dr. Luc Longpré
Computer Science Department
Spring 2006
Database with Confidential Information
• Examples:
  – census data
  – medical information
• Privacy: protect the confidentiality of individuals
• Usefulness: want to derive meaningful statistics
The Need for Privacy Safeguards
• Per person available disk space:
  – 1983: 0.02 MB
  – 1996: 28 MB
  – 2000: 472 MB
• Equivalent of one page per 3 minutes of life
The Need for Privacy Safeguards
• Misuse of personal health information:
  – banker cross-referencing cancer patients with outstanding loans
  – using medical records to make decisions about employees
  – snooping in hospital computer network
  – 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission
Approaches
• Access control, encryption:
  – Only fixes who has access to what
  – Does not protect against disclosures based on inference
• Problem:
  – Sometimes it may be possible to derive confidential information from released information
Examples
• Salary database
• Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?
Examples
• 87% of the US population is unique under an ID made of:
  – 5-digit ZIP
  – gender
  – date of birth
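To get a feel for how identifying such a combination can be, one can count how many records share each (ZIP, gender, date of birth) triple. The following is a minimal sketch on a made-up toy table. The records and the helper name `fraction_unique` are illustrative assumptions, not data or code from the study behind the 87% figure.

```python
from collections import Counter

def fraction_unique(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Hypothetical toy records, invented for illustration only.
people = [
    {"zip": "79912", "gender": "F", "dob": "1950-03-14"},
    {"zip": "79912", "gender": "M", "dob": "1950-03-14"},
    {"zip": "79912", "gender": "M", "dob": "1971-08-02"},
    {"zip": "79902", "gender": "F", "dob": "1982-11-30"},
    {"zip": "79902", "gender": "F", "dob": "1982-11-30"},
]

# Three of the five records are unique under the full triple.
print(fraction_unique(people, ["zip", "gender", "dob"]))
# No record is unique under ZIP alone.
print(fraction_unique(people, ["zip"]))
```

In this toy table, each field alone identifies few or no records, while the combination singles out most of them.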
Linking to Re-Identify Data
• Medical database:
  – Ethnicity, visit date, diagnosis, procedure, medication, ZIP, Birth date, Sex
• Voter list:
  – Name, address, date registered, ZIP, Birth date, Sex
• The fields common to both (ZIP, Birth date, Sex) allow records to be linked
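The linking step can be sketched as a join on the fields the two lists have in common. This is a minimal illustration, not a description of any real attack tool. Both tables, the field names, and the function `link` are hypothetical and invented for this example.

```python
# Hypothetical toy tables, invented for illustration only.
medical = [
    {"zip": "79912", "birth_date": "1950-03-14", "sex": "F", "diagnosis": "diagnosis A"},
    {"zip": "79902", "birth_date": "1982-11-30", "sex": "M", "diagnosis": "diagnosis B"},
]
voters = [
    {"name": "Voter One", "zip": "79912", "birth_date": "1950-03-14", "sex": "F"},
    {"name": "Voter Two", "zip": "79902", "birth_date": "1982-11-30", "sex": "M"},
    {"name": "Voter Three", "zip": "79902", "birth_date": "1990-01-01", "sex": "M"},
]

SHARED = ("zip", "birth_date", "sex")  # fields present in both tables

def link(medical_rows, voter_rows):
    """Attach a name to each medical row whose shared fields match exactly one voter."""
    by_key = {}
    for v in voter_rows:
        by_key.setdefault(tuple(v[k] for k in SHARED), []).append(v["name"])
    linked = []
    for m in medical_rows:
        names = by_key.get(tuple(m[k] for k in SHARED), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            linked.append((names[0], m["diagnosis"]))
    return linked

print(link(medical, voters))
```

Neither table contains a name next to a diagnosis, yet the join produces exactly that pairing whenever the shared fields are unique.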
Statistical Database
• Data collected with the purpose of releasing statistical information.
• Important for research, policy
• Facing tremendous demand for person-specific data
  – data mining, fraud detection, homeland security
Sample Size
• Possible solution: do not release any statistics on any set of fewer than, say, 10 records
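A minimal sketch of such a restriction, assuming a query is a predicate over records and the statistic is an average salary. The function name `average_salary`, the record layout, and the toy data are assumptions made for illustration.

```python
MIN_SET_SIZE = 10  # threshold taken from the "say, 10 records" suggestion

def average_salary(records, predicate, min_size=MIN_SET_SIZE):
    """Average salary of matching records, or None if the query set is too small."""
    matched = [r["salary"] for r in records if predicate(r)]
    if len(matched) < min_size:
        return None  # refuse to answer
    return sum(matched) / len(matched)

# Hypothetical toy data: 12 records in one department, 1 in another.
staff = [{"dept": "CS", "salary": 50_000 + 1_000 * i} for i in range(12)]
staff.append({"dept": "Math", "salary": 90_000})

print(average_salary(staff, lambda r: r["dept"] == "CS"))    # 12 records: answered
print(average_salary(staff, lambda r: r["dept"] == "Math"))  # 1 record: refused
```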
Problem Remains
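• Query 1: What’s the average salary of all males age 89 in ZIP code 79912?
• Query 2: What’s the average salary of all people age 89 in ZIP code 79912?
• If the two query sets differ by a single person, that person’s salary follows from the two averages and the set sizes, even though both sets are large enough to be released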
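The two queries above can be combined as follows. This sketch assumes the database also reports the size of each query set (or that the attacker knows it some other way), which the slide does not state. The data and the function `query` are hypothetical.

```python
MIN_SET_SIZE = 10

def query(records, predicate):
    """Return (count, average salary), or None if fewer than MIN_SET_SIZE records match."""
    matched = [r["salary"] for r in records if predicate(r)]
    if len(matched) < MIN_SET_SIZE:
        return None
    return len(matched), sum(matched) / len(matched)

# Hypothetical toy data: ten men and one woman, all age 89 in ZIP 79912.
residents = [
    {"sex": "M", "age": 89, "zip": "79912", "salary": 30_000 + 1_000 * i}
    for i in range(10)
]
residents.append({"sex": "F", "age": 89, "zip": "79912", "salary": 75_000})

def in_group(r):
    return r["age"] == 89 and r["zip"] == "79912"

n_men, avg_men = query(residents, lambda r: in_group(r) and r["sex"] == "M")  # Query 1
n_all, avg_all = query(residents, in_group)                                   # Query 2

# Difference of the two totals isolates the one person not covered by Query 1.
recovered = n_all * avg_all - n_men * avg_men
print(round(recovered))
```

Both queries pass the size check, but a direct query on the one woman would be refused. The subtraction recovers her salary anyway.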
K-anonymity
• Release only information in which every record is identical to at least k − 1 other records on the identifying attributes (work by Sweeney)
• Attacks are still possible:
  – Unsorted matching: use the order of records
    • solution: randomize order
K-anonymity
– Complementary release: combining k-anonymous releases may not be k-anonymous
• solution: consider all releases together
– Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties
• solution: analyze k-anonymous properties of dynamic data
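Checking the basic k-anonymity property of a single release can be sketched as below. The term "quasi-identifiers" (the attributes an attacker could match against outside data) comes from the k-anonymity literature, not from these slides. The function names and the generalized toy release are assumptions for illustration. The sketch does not address the complementary-release or temporal attacks listed above.

```python
from collections import Counter

def anonymity_level(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    return anonymity_level(rows, quasi_identifiers) >= k

# Hypothetical release with ZIP and birth year already generalized.
release = [
    {"zip": "799**", "birth_year": "1950-1959", "diagnosis": "A"},
    {"zip": "799**", "birth_year": "1950-1959", "diagnosis": "B"},
    {"zip": "799**", "birth_year": "1980-1989", "diagnosis": "A"},
    {"zip": "799**", "birth_year": "1980-1989", "diagnosis": "C"},
    {"zip": "799**", "birth_year": "1980-1989", "diagnosis": "B"},
]

qi = ["zip", "birth_year"]
print(anonymity_level(release, qi))    # smallest group has 2 rows
print(is_k_anonymous(release, qi, 2))  # True
print(is_k_anonymous(release, qi, 3))  # False
```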
Other Solutions
• Add noise in the answers
• Add noise in the data
• Limit the kinds of queries allowed to the statistical database
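As an illustration of the first option, the sketch below perturbs an average before releasing it. The slides do not say what kind of noise is meant, so bounded uniform noise is an arbitrary assumption here, as are the function name `noisy_average` and the salary figures.

```python
import random

def noisy_average(values, max_noise, rng):
    """True average plus uniform random noise in [-max_noise, +max_noise]."""
    true_avg = sum(values) / len(values)
    return true_avg + rng.uniform(-max_noise, max_noise)

salaries = [41_000, 52_000, 47_500, 60_000]  # hypothetical data
rng = random.Random(0)  # seeded only so the example is repeatable

answer = noisy_average(salaries, max_noise=500, rng=rng)
print(answer)  # within 500 of the true average, 50125.0
```

One design caveat: if the same query can be repeated and each answer gets fresh independent noise, averaging the answers would tend to cancel the noise, so a real system would need to account for repeated queries.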
Quantifying Information
• Need a formal model, possibly based on information theory
• Measure entropy in database records before and after a statistical release
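A minimal sketch of the before-and-after measurement, using Shannon entropy in bits. It assumes the attacker's uncertainty about one confidential value can be modeled as a probability distribution over candidate values. The specific numbers (8 equally likely candidates before the release, 2 after) are made up for illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform(n):
    """Uniform distribution over n equally likely candidates."""
    return [1 / n] * n

before = entropy_bits(uniform(8))  # attacker cannot tell 8 candidate values apart
after = entropy_bits(uniform(2))   # the release rules out all but 2 candidates

print(before, after, before - after)  # entropy before, after, and the bits lost
```

Under this model, the drop in entropy is one candidate measure of how much the release revealed about the record.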
Further Complications
• Some data is more sensitive than others
  – Example: bits in salary
• Common knowledge, information from other databases
  – Could define entropy conditional on available information
  – Very impractical in applications
• Some people know some of the records
Non Additivity
• Data sensitivity is non-additive
  – Ex: don’t mind any single digit of an SSN being released, but not all digits
• Privacy loss is non-additive
  – Ex: There could be 2 sets of information, each of which reveals nothing when released alone, but which together reveal all the information
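One concrete instance of the second example is XOR secret splitting: a secret is split into two shares, either of which alone is a uniformly random value, while the two together reconstruct the secret. This construction is brought in here for illustration and is not mentioned in the slides. The function `split` and the secret value are hypothetical.

```python
import secrets

def split(secret_byte):
    """Split one byte into two shares whose XOR equals the secret."""
    share_a = secrets.randbelow(256)  # uniformly random, independent of the secret
    share_b = secret_byte ^ share_a   # also uniformly random when viewed alone
    return share_a, share_b

secret = 0b10110010  # hypothetical confidential value
a, b = split(secret)

# Together, the two shares reveal the secret exactly.
print(a ^ b == secret)

# Alone, share a is consistent with every possible secret:
# for each candidate secret there is some second share that would produce it.
candidates = {a ^ other for other in range(256)}
print(len(candidates))
```

Privacy loss measured per release would be zero for each share, yet the loss for the pair is total, which is what non-additivity means here.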
Past Research
• Denning: “Cryptography and data security”, 1982
• Sweeney: Ph.D. thesis, Applications to medical data, 1996
• A few more scattered results; the topic is becoming popular again as “privacy-preserving data mining”.