Making Digital Privacy Operational typically relies on data anonymity
Latanya Sweeney, Ph.D.
Assistant Professor of Computer Science & of Public Policy
Director, Laboratory for International Data Privacy
Carnegie Mellon [email protected]
http://sos.heinz.cmu.edu/dataprivacy/
Two Questions:
1. What kinds of problems do data anonymity tools solve?
2. How is data anonymity different from security and privacy?
Bottom line: data anonymity addresses the identifiability of shared information.
“Can’t release data”
Accuracy, quality Distortion, anonymity
Holder
Recipient
Confidentiality, Privacy, Liability concerns
“Privacy is dead, get over it”
Ann 10/2/61 02139 cardiac
Abe 7/14/61 02139 cancer
Al 3/8/61 02138 liver
Accuracy, quality Distortion, anonymity
Recipient
Holder
Common Public Health reaction
“Share data while guaranteeing anonymity”
Accuracy, quality Distortion, anonymity
Holder
A* 1961 0213* cardiac
A* 1961 0213* cancer
A* 1961 0213* liver
Recipient
Computational solutions
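The step from the identified records (Ann, Abe, Al) to the released records (A* 1961 0213*) can be sketched in a few lines. This is only an illustration of the idea on the slide, not the lab's tooling: the function name `generalize_record` is hypothetical, and the code assumes two-digit years fall in the 1900s.

```python
def generalize_record(name, birth, zip_code, diagnosis):
    """Coarsen the identifying fields of one record; keep the diagnosis intact."""
    _month, _day, yy = birth.split("/")
    year = "19" + yy                      # assumption: two-digit years are 19xx
    return (name[0] + "*", year, zip_code[:4] + "*", diagnosis)


records = [
    ("Ann", "10/2/61", "02139", "cardiac"),
    ("Abe", "7/14/61", "02139", "cancer"),
    ("Al", "3/8/61", "02138", "liver"),
]

released = [generalize_record(*r) for r in records]
for row in released:
    print(" ".join(row))
```

After generalization all three records share the same name, birth and ZIP values, so the diagnosis can no longer be tied to one of the three people by those fields alone.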
This talk
- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples: HIPAA and bioterrorism surveillance
Data Anonymity (new area)
The study of computational solutions for releasing data such that the data remain practically useful while the identities of the subjects of the data are not revealed.
“Useful AND Secure”
Learning information about entities...
Data Linkage (“data detectives”):
combining disparate pieces of entity-specific information to learn more about an entity
Privacy Protection (“data protectors”):
release information such that certain entity-specific properties (such as identity) cannot be inferred; restrict what can be learned
Data Anonymity Lab at CMU
Work with real-world stakeholders:
- public health
- government agencies
- private industry
Kinds of projects currently underway:
- health data
- web data
- video surveillance data
- genetic data
- census surveys
- crime data
- grocery data, and so on…
http://sos.heinz.cmu.edu/dataprivacy/
Technically-empowered Society
[Figure: two charts over the years 1983 to 2003. One shows growth in available disk storage, GDSP (MB/person), axis 0 to 500. The other shows growth in active web servers, Servers (in Millions), axis 0 to 35. Marked years: 1991, 1993 (first WWW conference), 1996, 2001.]
Behavior 1. Collect more
Expand an existing person-specific data collection.

[Figure: Global DSP over Time, years 1983 to 2003, axis 0 to 500]

Examples (in DSP)      1983    1996
Each birth              280   1,864
Each hospital visit       0     663
Each grocery visit       32   1,272

Based on State of Illinois [Sweeney 99]. DSP in bytes/person
Typical Birth Certificate Fields, post 1925
Field name
- Child's first name
- Child's middle name (or sometimes an initial)
- Child's last name
- Day, month and year of birth
- City and/or County of birth (sometimes hospital)
- Father's name
- Mother's name (including maiden name)
- Place of birth (address and town/city)
- Mother's age and address
- Mother's birthplace (town/city, state, county)
- Mother's occupation
- Mother, number of previous children
- Father's age and address
- Father's birthplace (town/city, state, county)
- Father's occupation
Typical Electronic Birth Certificate Fields in 1999, fields 1-60

Field#  Size  Field name
1       1     File Status
2       50    Baby’s First Name
3       50    Baby’s Middle Name
4       50    Baby’s Last Name
5       1     Baby’s Suffix Code
6       3     Baby’s Suffix Text
7       8     Baby’s Date of Birth
8       5     Baby’s Time of Birth
9       1     AM/PM Indicator
10      1     Baby’s Sex
11      3     Blood Type
12      1     Born Here?
13      40    Place of Birth
14      1     Facility Type
16      20    County of Birth
17      6     Certifier’s Code
18      30    Certifier’s Name
19      1     Certifier’s Title
20      30    Attendant’s Name
21      1     Attendant’s Title
22      23    Attendant’s Address
23      19    Attendant’s City
24      2     Attendant’s State
25      10    Attendant’s Zip Code
26      50    Mother’s First Name
27      50    Mother’s Middle Name
28      50    Mother’s Last Name
29      9     Mother’s Social Security Number
30      8     Mother’s Date of Birth
31      3     Mother’s State of Birth
32      7     Mother’s Residence Address
33      2     Mother’s Residence Direction
34      20    Residence Street Address
35      10    Residence Type
36      2     Residence Extension
37      10    Residence Apartment #
38      20    Mother’s Town of Residence
39      1     Mother’s Residence in City Limits
40      14    Mother’s County of Residence
41      3     Mother’s State of Residence
42      10    Mother’s Residence Zip Code
43      38    Mother’s Mailing Address
44      19    Mother’s Mailing City
45      2     Mother’s Mailing State
46      10    Mother’s Mailing Zip Code
47      1     Mother Married?
48      50    Father’s First Name
49      50    Father’s Middle Name
50      50    Father’s Last Name
51      1     Father’s Suffix Code
52      9     Father’s Suffix Text
53      9     Father’s Social Security Number
54      8     Father’s Date of Birth
55      3     Father’s State of Birth
56      14    Mother’s Origin
57      14    Mother’s Race
58      2     Mother’s Elementary Education
59      2     Mother’s College Education
60      11    Mother’s Occupation
Typical Electronic Birth Certificate Fields in 1999, continued, fields 61-150

Field#  Size  Field name
61      11    Mother’s Industry
62      14    Father’s Origin
63      14    Father’s Race
64      2     Father’s Elementary Education
65      2     Father’s College Education
66      11    Father’s Occupation
67      11    Father’s Industry
68      1     Plurality
69      1     Birth Order
70      2     Live Births Still Living
71      2     Live Births Now Dead
72      4     Month/Year Last Live Birth
73      2     Number of Terminations
74      4     Month/Year Last Termination
75      1     Baby’s Weight Unit
76      5     Baby’s Weight
77      6     Date of Last Normal Menses
78      1     Month Prenatal Care Began
79      2     Total Number of Visits
80      2     Apgar Score – 1 Minute
81      2     Apgar Score – 5 Minute
82      2     Estimate of Gestation
83      6     Date of Blood Test
84      22    Laboratory
85      1     Mother Transferred In
86      30    Facility Mother Transferred From
87      1     Baby Transferred Out
88      30    Facility Baby Transferred To
89      1     Tobacco Use During Pregnancy
90      3     Number of Cigarettes/Day
91      1     Alcohol Use During Pregnancy
92      3     Number of Drinks/Week
93      3     Mother’s Weight Gain
94      1     Release Info For SSN
95      6     Operator Code
96      12    Hospital ID
97      1     Sent to Romans
98      1     Sent to APORS
99      16    Other Certifier Specify
100     12    Temporary Audit Number
101     16    Other Facility Specify
102     16    Other Attendant Specify
103     1     Mother’s Race
104     1     Father’s Race
105     2     Mother’s Origin
106     2     Father’s Origin
107     1     Attendant Same YN
108     1     Mailing Address Same YN
109     1     Capture Father’s Info YN
110     2     Mother’s Age
111     2     Father’s Age
112     12    Baby’s Hospital Med. Rec.
113     1     High Risk Pregnancy YN
114     1     Care Giver (For Chicago)
115     1     Record Selected For Download
116     1     Downloaded
117     1     Printed
118     12    Form Number

MEDICAL RISK FACTORS
119     1     Anemia
120     1     Cardiac Disease
121     1     Acute/Chronic Lung Disease
122     1     Diabetes
123     1     Genital Herpes
124     1     Hydramnios/Oligohydramnios
125     1     Hemoglobinopathy
126     1     Hypertension, Chronic
127     1     Hypertension, Preg. Assoc.
128     1     Eclampsia
129     1     Incompetent Cervix
130     1     Previous Infant 4000+ Grams
131     1     Previous Preterm or SGA Infant
132     1     Renal Disease
133     1     Rh Sensitization
134     1     Uterine Bleeding
135     1     No Medical Risk Factors
136     40    Other Medical Risk Factors

OBSTETRIC PROCEDURES
137     1     Amniocentesis
138     1     Electronic Fetal Monitoring
139     1     Induction of Labor
140     1     Stimulation of Labor
141     1     Tocolysis
142     1     Ultrasound
143     1     No Obstetric Procedures
144     40    Other Obstetric Procedures

COMPLICATIONS OF LABOR & D
145     1     Febrile (>100 or 38C)
146     1     Meconium Moderate, Heavy
147     1     Premature Rupture (>12 Hrs)
148     1     Abruptio Placenta
149     1     Placenta Previa
150     1     Other Excessive Bleeding
Typical Electronic Birth Certificate Fields in 1999, continued, fields 151-226

Field#  Size  Field name
151     1     Seizures During Labor
152     1     Precipitous Labor (<3 Hrs)
153     1     Prolonged Labor (>20 Hrs)
154     1     Dysfunctional Labor
155     1     Breech/Malpresentation
156     1     Cephalopelvic Disproportion
157     1     Cord Prolapse
158     1     Anesthetic Complications
159     1     Fetal Distress
160     1     No Complications of L&D
161     40    Other Complications of L&D

METHOD OF DELIVERY
162     1     Vaginal
163     1     Vaginal After Previous C-Section
164     1     Primary C-Section
165     1     Repeat C-Section
166     1     Forceps
167     1     Vacuum

ABNORMAL CONDITIONS OF NEWBO
168     1     Anemia
169     1     Birth Injury
170     1     Fetal Alcohol Syndrome
171     1     Hyaline Membrane Disease/RDS
172     1     Meconium Aspiration Syndrome
173     1     Assisted Ventilation <30
174     1     Assisted Ventilation >30
175     1     Seizures
176     1     No Abnormal Conditions of Newborn
177     40    Other Abnormal Condition of Newborn

CONGENITAL ANOMALIES OF CHILD
178     1     Anencephalus
179     1     Spina Bifida/Meningocele
180     1     Hydrocephalus
181     1     Microcephalus
182     40    Other CNS Anomalies
183     1     Heart Malformations
184     40    Other Circ./Resp. Anomalies
185     1     Rectal Atresia/Stenosis
186     1     Tracheo-Esophageal Fistula/Esophag
187     1     Omphalocele/Gastroschisis
188     40    Other Gastrointestinal Ano.
189     1     Malformed Genitalia
190     1     Renal Agenesis
191     40    Other Urogenital Anomalies
192     1     Cleft Lip/Palate
193     1     Polydactyly/Syndactyly/Adactyly
194     1     Club Foot
195     1     Diaphragmatic Hernia
196     40    Other Musculoskeletal/Integumental A
197     1     Down’s Syndrome
198     40    Other Chromosomal Anomalies
199     1     No Congenital Anomalies
200     40    Other Congenital Anomalies

CODE STRIP
201     1     Record Complete YN
202     1     Record Type
203     4     Facility ID
204     4     City of Birth
205     3     County of Birth
206     2     Mother’s State of Birth
207     2     Mother’s State of Residence
208     4     Mother’s Town of Residence
209     3     Mother’s County of Residence
210     2     Father’s State of Birth
211     14    Certifier’s License Number
212     6     Laboratory ID Number
213     4     Mother Xfer Code
214     3     Mother Xfer County Code
215     4     Baby Xfer Code
216     3     Baby Xfer County Code
217     4     Year of Birth
218     7     Certificate #
219     1     Unique Code
220     8     File Date
221     2     Community Area
222     4     Census Tract
223     2     Century of Last Live Birth
224     2     Century of Last Termination
225     2     Century of Last Menses
226     2     Century of Blood Test
Behavior 2. Collect specifically
Replace an existing aggregate data collection with a person-specific one.

[Figure: Global DSP over Time, years 1983 to 2003, axis 0 to 500]

Examples (in DSP)      1983    1996
Each birth              280   1,864
Each hospital visit       0     663
Each grocery visit       32   1,272

Based on State of Illinois [Sweeney 99]. DSP in bytes/person
Hospital Discharge Data, fields 1-50

#    Field description                    Size
1    HOSPITAL ID NUMBER                   12
2    PATIENT DATE OF BIRTH (MMDDYYYY)     8
3    SEX                                  1
4    ADMIT DATE (MMDDYYYY)                8
5    DISCHARGE DATE (MMDDYYYY)            8
6    ADMIT SOURCE                         1
7    ADMIT TYPE                           1
8    LENGTH OF STAY (DAYS)                4
9    PATIENT STATUS                       2
10   PRINCIPAL DIAGNOSIS CODE             6
11   SECONDARY DIAGNOSIS CODE - 1         6
12   SECONDARY DIAGNOSIS CODE - 2         6
13   SECONDARY DIAGNOSIS CODE - 3         6
14   SECONDARY DIAGNOSIS CODE - 4         6
15   SECONDARY DIAGNOSIS CODE - 5         6
16   SECONDARY DIAGNOSIS CODE - 6         6
17   SECONDARY DIAGNOSIS CODE - 7         6
18   SECONDARY DIAGNOSIS CODE - 8         6
19   PRINCIPAL PROCEDURE CODE             7
20   SECONDARY PROCEDURE CODE - 1         7
21   SECONDARY PROCEDURE CODE - 2         7
22   SECONDARY PROCEDURE CODE - 3         7
23   SECONDARY PROCEDURE CODE - 4         7
24   SECONDARY PROCEDURE CODE - 5         7
25   DRG CODE                             3
26   MDC CODE                             2
27   TOTAL CHARGES                        9
28   ROOM AND BOARD CHARGES               9
29   ANCILLARY CHARGES                    9
30   ANESTHESIOLOGY CHARGES               9
31   PHARMACY CHARGES                     9
32   RADIOLOGY CHARGES                    9
33   CLINICAL LAB CHARGES                 9
34   LABOR-DELIVERY CHARGES               9
35   OPERATING ROOM CHARGES               9
36   ONCOLOGY CHARGES                     9
37   OTHER CHARGES                        9
38   NEWBORN INDICATOR                    1
39   PAYER ID 1                           9
40   TYPE CODE 1                          1
41   PAYER ID 2                           9
42   TYPE CODE 2                          1
43   PAYER ID 3                           9
44   TYPE CODE 3                          1
45   PATIENT ZIP CODE                     5
46   Patient Origin COUNTY                3
47   Patient Origin PLANNING AREA         3
48   Patient Origin HSA                   2
49   PATIENT CONTROL NUMBER
50   HOSPITAL HSA                         2
Hospital Discharge by State, Part 1
Column order: Mandate; Private (Insiders); Semi-Private (Limited); Semi-Public (Deniable); Public (No Restrictions); AHRQ SID

Alabama: N N
Alaska: N N
Arizona: Y Y N Y Y Y
Arkansas: Y Y N N N
California: Y Y N Y Y Y
Colorado: N Y N Y N Y
Connecticut: Y Y N Y Y Y
Delaware: Y Y N N* N*
District of Columbia: N N
Florida: Y N Y Y
Georgia: Y N N N Y
Hawaii: N Y N Y Y Y
Idaho: N N
Illinois: Y Y Y Y Y Y
Indiana: Y Y N N N
Iowa: Y Y N Y Y Y
Kansas: Y Y N Y N Y
Kentucky: Y Y N Y N
Louisiana: N Y N
Maine: Y Y N Y Y
Maryland: Y Y N Y Y Y
Massachusetts: Y Y N Y Y Y
Michigan: N Y N Y N
Minnesota: N Y N Y N
Missouri: Y N Y Y Y
Mississippi: N N

Hospital Discharge by State, Part 2
Column order: Mandate; Private (Insiders); Semi-Private (Limited); Semi-Public (Deniable); Public (No Restrictions); AHRQ SID

Montana: N N
Nebraska: N Y N Y Y
Nevada: Y Y N N Y
New Hampshire: Y Y N Y Y
New Jersey: N Y Y N Y Y
New Mexico: Y Y N N Y
New York: Y Y N Y Y Y
North Carolina: Y Y N N
North Dakota: Y N N Y
Ohio: Y Y N N N
Oklahoma: Y Y N Y N
Oregon: Y N Y Y Y
Pennsylvania: Y Y Y Y Y Y
Rhode Island: Y Y N Y Y
South Carolina: Y Y N Y Y Y
South Dakota: N N
Tennessee: Y Y N Y Y Y
Texas: Y Y N N N
Utah: Y Y N Y Y Y
Vermont: Y Y N Y Y
Virginia: Y Y N Y Y
Washington: Y Y N Y Y Y
West Virginia: Y Y N Y Y
Wisconsin: Y Y N Y Y Y
Wyoming: Y Y N Y N
Behavior 3. Collect it if you can
Given a question or problem to solve or merely provided the opportunity, gather information by starting a new person-specific data collection.

[Figure: Global DSP over Time, years 1983 to 2003, axis 0 to 500]

Examples (in DSP)      1983    1996
Each birth              280   1,864
Each hospital visit       0     663
Each grocery visit       32   1,272

Based on State of Illinois [Sweeney 99]. DSP in bytes/person
Grocery data
Column order: Food Lion; Fresh Fields; Safeway; Star Market

Field name
Name: yes yes yes yes
Home street address: yes yes yes yes
Home city: yes yes yes yes
Home state: yes yes yes yes
Home ZIP: yes yes yes yes
Home phone number: yes yes yes yes
Social Security Number: yes

Additional data sometimes requested
Birth date: yes yes
ZIP code of work place: yes
Other stores where you shop: yes yes
Number of people in household: yes yes
Age each person in household: yes yes
How much do you spend each week: yes yes

Additional data for accepting checks
Bank: yes yes
Bank account number: yes yes
Kinds of data releases
Private (Pr): insiders only
Semi-private (SPr): limited access
Semi-public (SPu): deniable access
Public (Pu): no restrictions
Toward public: larger possible distribution
Toward private: fewer people eligible
Two Questions:
1. What kinds of problems do data anonymity tools solve?
2. How is data anonymity different from security and privacy?
Data anonymity tools address the identifiability of shared information in a setting with lots of available data.
Anonymous data
… implies that the data cannot be manipulated or linked to identify an individual.
De-identified Data
… all explicit identifiers, such as name, address and phone number, are removed, generalized or replaced with a made-up alternative.
De-identifying information provides no guarantee that the result is anonymous.
JLME 97, NRC 98
Health data (GIC example)
Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge, ZIP, Birthdate, Sex

Population data (GIC example)
Voter List: ZIP, Birthdate, Sex, Name, Address, Date registered, Party affiliation, Date last voted

Linking to re-identify data
Medical Data and Voter List overlap on: ZIP, Birthdate, Sex
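The linkage shown above amounts to a join on the shared fields. The following is a minimal sketch with made-up rows; the names and dates echo the example records used earlier in the talk, while the sex values and the helper name `link` are assumptions for illustration only.

```python
KEY = ("zip", "birthdate", "sex")   # fields present in both data sets


def link(medical_rows, voter_rows, key=KEY):
    """Return (name, diagnosis) pairs where a medical row matches exactly one voter."""
    index = {}
    for voter in voter_rows:
        index.setdefault(tuple(voter[k] for k in key), []).append(voter)
    matches = []
    for row in medical_rows:
        candidates = index.get(tuple(row[k] for k in key), [])
        if len(candidates) == 1:    # a unique match re-identifies the record
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


# Hypothetical rows: the medical data carries no names, the voter list does.
medical = [
    {"zip": "02139", "birthdate": "10/2/61", "sex": "f", "diagnosis": "cardiac"},
    {"zip": "02139", "birthdate": "7/14/61", "sex": "m", "diagnosis": "cancer"},
    {"zip": "02138", "birthdate": "3/8/61", "sex": "m", "diagnosis": "liver"},
]
voters = [
    {"name": "Ann", "zip": "02139", "birthdate": "10/2/61", "sex": "f"},
    {"name": "Abe", "zip": "02139", "birthdate": "7/14/61", "sex": "m"},
    {"name": "Al", "zip": "02138", "birthdate": "3/8/61", "sex": "m"},
]

print(link(medical, voters))
```

Only unique matches are reported, because a medical row that matches several voters does not single out one person.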
Uniqueness in Cambridge Voters

Birth date alone                  12%
Birth date & gender               29%
Birth date & 5-digit ZIP          69%
Birth date & full postal code     97%

Birth date includes month, day and year. Total 54,805 voters.
JLME 97
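Percentages like these can be computed by counting how many records have a combination of field values that no other record shares. The sketch below uses four made-up rows rather than the Cambridge voter list, and the function name `fraction_unique` is hypothetical.

```python
from collections import Counter


def fraction_unique(rows, fields):
    """Fraction of rows whose values on `fields` occur exactly once in `rows`."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Toy data for illustration only.
rows = [
    {"birth": "10/2/61", "sex": "f", "zip": "02139"},
    {"birth": "10/2/61", "sex": "m", "zip": "02139"},
    {"birth": "7/14/61", "sex": "m", "zip": "02139"},
    {"birth": "7/14/61", "sex": "m", "zip": "02138"},
]

for fields in (["birth"], ["birth", "sex"], ["birth", "sex", "zip"]):
    print(fields, fraction_unique(rows, fields))
```

As in the table, adding fields to the combination can only keep or raise the share of unique records.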
Few characteristics make a person unique
Birth date includes month, day and year:
365 days x 100 years = 36,500 possibilities
Two genders and five ZIP (5-digit) codes:
2 x 5 x 36,500 = 365,000 possibilities
But the Cambridge Voter list had:
54,805 voters
So in general, using (birth[mon,day,yr], gender, ZIP[5-digit]) provides a unique quasi-identifier.
JLME 97
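The arithmetic on this slide can be checked directly. The variable names below are illustrative; the numbers are the ones given above.

```python
days_per_year = 365
years = 100
birth_dates = days_per_year * years              # 36,500 possible birth dates

genders = 2
zip_codes = 5
combinations = genders * zip_codes * birth_dates  # 365,000 possible combinations

voters = 54_805
print(birth_dates, combinations)
print(f"voters per possible combination: {voters / combinations:.2f}")
```

With far fewer voters than possible combinations, most combinations that occur at all are likely to be held by a single person.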
{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of USA pop.

ZIP 60623: 112,167 people; 11% uniquely identified, not 0%, because an insufficient number of people above the age of 55 live there.

ZIP 11794: 5,418 people, primarily between 19 and 24 (4,666 of 5,418, or 86%); only 13% uniquely identified.
Uniqueness of Demographics in U.S.
Each cell combines Gender with the geographic unit (row) and birth granularity (column).

                 Date of Birth   Mon/Yr Birth   Year of Birth
ZIP (5-digit)    87.1%           3.7%           0.04%
Town/Place       58.4%           3.6%           0.04%
County           18.1%           0.04%          0.00004%
{Year of birth, gender, County} uniquely identifies 0.00004% of U.S. pop.
[Figure: % population identified (0% to 60%) versus county population (0 to 10,000,000). Labeled points: Loving County, Texas, population 107, 53% unique; Yellowstone County, Montana, population 52, 25% unique; King County, Texas, population 354, 6% unique.]
Two Questions:
1. What kinds of problems do data anonymity tools solve?
2. How is data anonymity different from security and privacy?
Having lots of person-specific data available makes it difficult to protect against inferences unique to the subjects.
Cancer registry looks anonymous

Diagnosis           DiagDate   ZIP
Kaposi’s Sarcoma    1/18/91    32555
Kaposi’s Sarcoma    5/12/94    37581
Kaposi’s Sarcoma    3/5/92     32172
Kaposi’s Sarcoma    8/8/93     30158
Neuroblastoma       4/3/91     39164

Cancer registry looks anonymous

Diagnosis           DiagDate   ZIP
Neuroblastoma       7/93       32125
Neuroblastoma       1/92       31752
Neuroblastoma       8/91       38265
Neuroblastoma       5/94       37233
…                   …          …
Disclosure overview
[Diagram, shown in three builds: nested sets Subjects, Population and Universe over the entities Ann, Abe, Al, Dan, Don, Dave, Jcd, Jwq, Jxy; mappings c and f (plus g1 and g2 in the first build, g in the second) relate Private Information, Released Information and External Information.]

Build 1
Private Information: Ann 10/2/61 02139 diagnosis
Released Information: 10/2/61 02139 diagnosis
External Information: Ann 10/2/61 02139 marriage

Build 2
Private Information: Ann 10/2/61 02139 diagnosis
Released Information: Jcd diagnosis
External Information: Ann 10/2/61 02139 marriage

Build 3
Private Information: Ann 10/2/61 02139 diagnosis
Released Information: A* 1961 0213* diagnosis
External Information: Ann 10/2/61 02139 marriage (1); Al 3/8/61 02138 marriage (2)
Techniques are specific to use

Technique            A-Data Mining   B-Statistical
De-identification    depends         depends
Encryption           depends         depends
Suppression          depends         no
Generalize values    depends         no
Swap values          no              yes
Substitution         depends         depends
Outlier to medians   no              depends
Perturbation         no              yes
Rounding             no              yes
Additive noise       no              yes
Sampling             depends         depends
Add tuples           no              yes
Scramble tuples      yes             yes
k-anonymity, enforce on release
- Quasi-identifier, profile {Birth 0.5, ZIP 0.7, Sex 0.3}
- Generalization: 10/27/59 → 1959
- Suppression: 02139 → (suppressed)
- Encryption: 3245123 → 2168582
AMIA 97, IEEE IFIP 97
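A release satisfies k-anonymity when every combination of quasi-identifier values that appears in it is shared by at least k records. The check below is a minimal sketch of that definition, not the enforcement procedure from the cited papers; the function name `is_k_anonymous` and the sample rows are assumptions for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifier, k):
    """True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())


raw = [
    {"birth": "10/2/61", "zip": "02139", "diagnosis": "cardiac"},
    {"birth": "7/14/61", "zip": "02139", "diagnosis": "cancer"},
    {"birth": "3/8/61", "zip": "02138", "diagnosis": "liver"},
]
generalized = [
    {"birth": "1961", "zip": "0213*", "diagnosis": "cardiac"},
    {"birth": "1961", "zip": "0213*", "diagnosis": "cancer"},
    {"birth": "1961", "zip": "0213*", "diagnosis": "liver"},
]

print(is_k_anonymous(raw, ("birth", "zip"), k=2))          # every raw row is unique
print(is_k_anonymous(generalized, ("birth", "zip"), k=2))  # one group of three
```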
Sample Data

SSN        Ethnicity  Birth     Sex  ZIP    Problem
819181496  Black      09/20/65  m    02141  short of breath
195925972  Black      02/14/65  m    02141  chest pain
902750852  Black      10/23/65  f    02138  hypertension
985820581  Black      08/24/65  f    02138  hypertension
209559459  Black      11/07/64  f    02138  obesity
679392975  Black      12/01/64  f    02138  chest pain
819491049  White      10/23/64  m    02138  chest pain
749201844  White      03/15/65  f    02139  hypertension
985302952  White      08/13/64  m    02139  obesity
874593560  White      05/05/64  m    02139  short of breath
703872052  White      02/13/67  m    02138  chest pain
963963603  White      03/21/67  m    02138  chest pain

Datafly results

SSN        Ethnicity  Birth  Sex  ZIP    Problem
902387250  Black      1965   m    0214*  short of breath
197150725  Black      1965   m    0214*  chest pain
486062381  Black      1965   f    0213*  hypertension
235978021  Black      1965   f    0213*  hypertension
214684616  Black      1964   f    0213*  obesity
135243442  Black      1964   f    0213*  chest pain
487620561  White      1964   m    0213*  chest pain
259003630  White      1964   m    0213*  obesity
410968224  White      1964   m    0213*  short of breath
664545413  White      1967   m    0213*  chest pain
860424429  White      1967   m    0213*  chest pain
IEEE IFIP 97, NRC 98
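The shape of the Datafly output can be approximated by generalizing birth dates to years and ZIP codes to four digits, then dropping any record whose quasi-identifier values occur fewer than k times. This is a simplified sketch with fixed generalization levels and k = 2, both assumptions; it is not the Datafly algorithm itself, and it leaves out the SSN column, which the slide shows replaced by different values.

```python
from collections import Counter

# (Ethnicity, Birth, Sex, ZIP, Problem) rows from the Sample Data slide.
SAMPLE = [
    ("Black", "09/20/65", "m", "02141", "short of breath"),
    ("Black", "02/14/65", "m", "02141", "chest pain"),
    ("Black", "10/23/65", "f", "02138", "hypertension"),
    ("Black", "08/24/65", "f", "02138", "hypertension"),
    ("Black", "11/07/64", "f", "02138", "obesity"),
    ("Black", "12/01/64", "f", "02138", "chest pain"),
    ("White", "10/23/64", "m", "02138", "chest pain"),
    ("White", "03/15/65", "f", "02139", "hypertension"),
    ("White", "08/13/64", "m", "02139", "obesity"),
    ("White", "05/05/64", "m", "02139", "short of breath"),
    ("White", "02/13/67", "m", "02138", "chest pain"),
    ("White", "03/21/67", "m", "02138", "chest pain"),
]


def generalize(row):
    """Birth date -> year (assumed 19xx), ZIP -> first four digits plus '*'."""
    ethnicity, birth, sex, zip_code, problem = row
    return (ethnicity, "19" + birth[-2:], sex, zip_code[:4] + "*", problem)


def release(rows, k):
    """Generalize every row, then suppress rows whose quasi-identifier is too rare."""
    generalized = [generalize(r) for r in rows]
    counts = Counter(r[:4] for r in generalized)   # first four fields form the QI
    return [r for r in generalized if counts[r[:4]] >= k]


result = release(SAMPLE, k=2)
for row in result:
    print(*row)
```

On this sample the only suppressed record is the White female born in 1965, leaving eleven rows, the same number as in the Datafly results above.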
µ-Argus Results

SSN  Ethnicity  Birth  Sex  ZIP    Problem
     Black      1965   m    02141  short of breath
     Black      1965   m    02141  chest pain
     Black      1965   f    02138  hypertension
     Black      1965   f    02138  hypertension
     Black      1964   f    02138  obesity
     Black      1964   f    02138  chest pain
     White      1964   m    02138  chest pain
                       f    02139  hypertension
     White      1964   m    02139  obesity
     White      1964   m    02139  short of breath
     White      1967   m    02138  chest pain
     White      1967   m    02138  chest pain
JLME 97, NRC 98
k-similar results

SSN        Ethnicity  Birth  Sex  ZIP    Problem
486753948  Black      1965   m    02141  short of breath
758743753  Black      1965   m    02141  chest pain
976483662             1965   f    0213*  hypertension
845796834             1965   f    0213*  hypertension
497306730  Black      1964   f    02138  obesity
730768597  Black      1964   f    02138  chest pain
348993639  Caucasian  1964   m    0213*  chest pain
459734637             1965   f    0213*  hypertension
385692728  Caucasian  1964   m    0213*  obesity
537387873  Caucasian  1964   m    0213*  short of breath
385346532  Caucasian  1967   m    02138  chest pain
349863628  Caucasian  1967   m    02138  chest pain
Two Questions:
1. What kinds of problems do data anonymity tools solve?
2. How is data anonymity different from security and privacy?
Anonymity tools allow data to be shared with guarantees of anonymity while the data remain practically useful.
Traditional areas of Computer Security
- authorization (can you access what you request)
- authentication (are you who you say you are)

Examples of Authorization in Security
- secure communication or eavesdropping (did anyone else get the info)
- encryption
- file access privileges (read, write, execute)

Examples of Authentication in Security
- secure communication or eavesdropping
- passwords
- encryption
- digital signatures
- authenticity of data (“information assurance”)

Getting Data Into a System
Authentication: login with password
Authorization: allowed to write data
Encryption: to avoid eavesdropping

Getting Data From a System
Authentication: login with password
Authorization: allowed to read data
Encryption: to avoid eavesdropping
Computer Security & Data Sharing
Authentication: login with password
Authorization: allowed to read/write data
Encryption: to avoid eavesdropping
BUT data can re-identify individual!

Data Anonymity Concerns Content
Authentication: login with password
Authorization: allowed to read/write data
Encryption: to avoid eavesdropping
Data can NOT reliably re-identify individual!
Incorrect Computer Security View
[Diagram labels: Computer Security; authentication; authorization; Privacy; Nuisance; Viruses]
Privacy = privacy and confidentiality
Computer security < public safety

Incomplete Computer Security View including non-technology
[Diagram labels: Laws; Regulations; Policies; Computer Security; Privacy; authentication; authorization; Nuisance; Viruses]

Computer Security, Privacy and Anonymity
[Diagram labels: Laws; Regulations; Policies; Computer Security; Privacy; authentication; authorization; anonymity tools; Nuisance; Viruses]

Computer Security and Anonymity are computational tools
[Diagram labels: Laws; Regulation; Policies; Computer Security; Privacy; anonymity tools; authentication; authorization; Nuisance; Viruses]
Merging Computer Security and Anonymity Computational Tools

Z3 = {*****}
Z2 = {021**}
Z1 = {0213*, 0214*}
Z0 = {02138, 02139, 02141, 02142}

DGH_Z0: Z0 → Z1 → Z2 → Z3
VGH_Z0: 02138, 02139 → 0213*; 02141, 02142 → 0214*; 0213*, 0214* → 021**; 021** → *****

What version of the information will you get?
anonymity tools
authentication, authorization
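The ZIP hierarchy above can be written as a function of a generalization level, so that a recipient's clearance selects which version of the value is returned. This is a sketch under the assumption that levels 0 to 3 correspond to Z0 to Z3; the names `DIGITS_KEPT`, `generalize_zip` and `domain` are hypothetical.

```python
# Digits kept at each level: Z0 keeps all five, Z1 four, Z2 three, Z3 none.
DIGITS_KEPT = {0: 5, 1: 4, 2: 3, 3: 0}


def generalize_zip(zip_code, level):
    """Return the ancestor of a 5-digit ZIP at the given hierarchy level."""
    kept = DIGITS_KEPT[level]
    return zip_code[:kept] + "*" * (5 - kept)


def domain(zips, level):
    """Distinct values at a level, i.e. the domain Z_level of the hierarchy."""
    return sorted({generalize_zip(z, level) for z in zips})


Z0 = ["02138", "02139", "02141", "02142"]
for level in range(4):
    print(f"Z{level} =", domain(Z0, level))
```

Which level a given requester receives would be decided by the authentication and authorization layer; that policy is outside this sketch.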
Medical Privacy before HIPAA
No* medical privacy legislation (proposed or drafted) addresses these problems.
- incorrect belief that de-identified implies anonymous
- incorrect belief that linkage and mining are controlled by encryption
- incorrect belief that security is the same as privacy
- inability to enumerate all sources, users and uses of medical data
New technology offers better choices than all or nothing and allows for a spectrum of solutions.
Flow from the Hospital
Holder: Hospital (source shown: Care Provider)
Ann 10/2/61 02139 cardiac
Abe 7/14/61 02139 cancer
Al 3/8/61 02138 liver
Recipients: Public health, Insurance, State discharge, Researchers, Pharmaceutical company, …
Labels on the flows: (economic), (law), (law), (IRB), (IRB)

Flow from the Provider
Holder: Care provider
Ann 10/2/61 02139 cardiac
Abe 7/14/61 02139 cancer
Al 3/8/61 02138 liver
Recipients: Public health, Insurance, State discharge, Pharmacy, Researchers, Transcription service, …
Labels on the flows: (economic), (law), (law)

Secondary Flows from the Hospital
Hospital
Recipients: Public health, Insurance, State discharge, Researchers, Pharmaceutical company, …
Labels on the flows: (economic), (law), (law), (IRB), (IRB)
Virtually no restrictions before HIPAA, some restrictions with HIPAA. Examples: WebMD and Envoy, cancer registries, hospital discharge public data sets.
[Figure: depiction of no data sharing by the data holder]
[Figure: depiction of data holder sharing data with some recipients; recipients labeled 1]
[Figure: depiction of secondary sharing by recipients of the data; recipients labeled 1 through 5]
Medical Privacy and HIPAA
Security
- Audit trails, Authorization, Authentication
- Protected channels of communication
Privacy
- Limited applicability
- horrible distortion
- Increased role of IRB
- Safe harbor: {ZIP3, Year of Birth, Gender}
ELSE use data anonymity tools!
Detect Early using Onset, Coordinate Deaths & Hospital Admits
1979 Sverdlovsk Anthrax Outbreak
[Figure: cumulative cases (0 to 70) versus time in days (0 to 50); series: Onset, Hospital Admits, Deaths]
Based on results reported in Guillemin, 1999.

How can we detect onset?
How early on each can we predict?
How does coordination help?
1979 Sverdlovsk Anthrax Outbreak
[Figure: cases (0 to 70) versus time in days (0 to 50); series: Onset, Hospital Admits, Deaths; y-axis labeled both "Prevalence (cases)" and "Cumulative Cases"]
Continuously Observe Behaviors to Detect Onset of Symptoms
Prodromic surveillance:
How many are acting ill?
Unusual behaviors → syndromes?
Not confirmed diagnoses!
Centralized Surveillance of Secondary Data
[Diagram: data sources feed a central "detect" component]
Data sources: hospitals, schools, labs, groceries, physicians, animals, prescriptions, assisted living, deaths, businesses

Emerging Central Authorities
1. Public health agency
2. Trusted broker of public health agency
3. Law enforcement agency
4. Corporation (for profit)
5. University (non-profit, non-competitor)
Access Instruments
Data sources: hospitals, schools, labs, groceries, physicians, animals, prescriptions, assisted living, deaths, businesses
Instruments shown on the diagram: HIPAA (three sources), education laws (one source), contract (six sources)
*Not including public health law
Gross overview
Datafly Identifiability (0..1): Sufficiently anonymous, Sufficiently de-identified, Identifiable, Readily identifiable, Explicitly identified
Detection Status (0..1): Normal operation, Unusual activity, Suspicious activity, Outbreak suspected, Outbreak detected

- Mechanical distortion decisions typically render data useless.
- Explicitly identified data generate privacy concerns, which may ultimately prohibit data sharing.
- Levels of identifiability matching detection status.
Automated Privacy Module
[Diagram: data sources (hospitals, schools, labs, groceries, physicians, animals, prescriptions, assisted living, deaths, businesses) feed "detect"]

Automated Privacy Module
[Diagram: "detect" sends the data holder a data request with status; the data holder turns raw data into "anonymized" data under a policy agreement]
Levels of Identifiability and Detection Status
Gross overview
Datafly Identifiability (0..1): Sufficiently anonymous, Sufficiently de-identified, Identifiable, Readily identifiable, Explicitly identified
Detection Status (0..1): Normal operation, Unusual activity, Suspicious activity, Outbreak suspected, Outbreak detected

Dynamically Augment the Model When Surveillance Detects Possible Attack
- Lower the privacy threshold when potential attack detected
  - But how often, how quickly, to what level?
  - Can we take advantage of disease-specific processing?
  - Need to flesh out ideas by looking at data
Probable Cause Predicate
Parties: Judge, Officer, Informant (facts)
1. What is the basis of the knowledge?
2. Is the source believable?

Reasonable Cause Predicate
Parties: (Technology, Policy), Detector, {DataSource_i} (facts), Data Holder_j
What is the minimal information needed based on reliable knowledge available?
Automated Privacy Module
[Diagram: data sources (hospitals, schools, labs, groceries, physicians, animals, prescriptions, assisted living, deaths, businesses) feed "detect"]
Transmission uses traditional computer security tools.
Content is based on data anonymity tools.
Overall goal is public safety.
This talk
- New areas in CS
- Fact: lots of data out there
- Fact: few fields uniquely identify a person
- Examples of compromises
- Nature of computational solutions
- Anonymity versus Security and Privacy
- Real-world examples

For more information:
[email protected]
http://sos.heinz.cmu.edu/dataprivacy/