Lions, Zebras and Big Data Anonymization
DESCRIPTION
On a recent safari trip to Tanzania, East Africa, I observed that lions are not interested in attacking human visitors at all. What is the secret behind the safari's (non-existent) security measures for its visitors? How do we determine the optimum trade-off between enjoying the safari and staying safe? How can we quantify the risk? And ultimately, how can we apply these lessons and data anonymization techniques to Big Data?
TRANSCRIPT
Lions, Zebras & Big Data Anonymization
Data anonymization is the process applied to data to prevent the identification of individuals,
making it possible to share and analyze data securely.
Disclaimer: The material shared here is
personal research and does not represent
any organization's policies.
Prof. Khaled El Emam worked on
anonymizing the Heritage Health Prize data
Are we safe?
Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.
The safari rules
1. If you are the lion, you just need to be faster than the slowest zebra.
2. If you are the zebra, you need to be able to escape from all the lions.
3. If you are a safari visitor, you want to get as close as possible to the lions and zebras without getting hurt.
Enjoyment
Security
X Point of stupidity
Max enjoyment: Live with the lions for a week
Max security: Stay at home and watch National Geographic
Determined by risk appetite
How can we apply this to
Data Anonymization?
The data anonymization rules
1. If you are the hacker, you just need to hack through the weakest link.
2. If you are the data, you need to be protected from all the hackers.
3. If you are anonymizing the data, you want to retain as much detail as possible while minimizing risk.
Analytical Usefulness
Security
X Point of stupidity
Max Usefulness: Raw data
Determined by risk appetite
Max security: Lock up data, don’t do any analysis
Known Knowns
Known Unknowns
Unknown Unknowns
Unknown Knowns
Donald’s Matrix
*Important to know
Known Knowns
Most users do not care.
Not all data that can be shared should be shared.
Data policies need updating.
Laws, Standards & Regulations.
Will people abuse their access rights?
What is the damage if data gets compromised?
Motivations of hackers?
Known Unknowns
Unknown Unknowns
Unknown Knowns
?
Minimize risk, find out more
Be prepared
What we should already know
Who has official access?
Resources we have?
Value of data?
What are the identifiers?
Sharing the data?
Different data policies?
Laws, Standards & Regulations.
What are the techniques for Data Anonymization?
‘Hard’ Methods: Strong security, difficult to analyze, dangerous if cracked
Hashing
Encryption
‘Soft’ Methods: Flexible security strength, easier to analyze, anonymized
Lv1: Remove: ---; Reduce: Mr. S; Reclassify: 40+ yrs; Mask: 1234****
Lv2: Black box; Sampling; Add noise / fake data; Shuffle
Lv3: Breaking big data machine learning
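As a minimal sketch of the 'hard' hashing method, the snippet below replaces an identifier with a keyed hash. The choice of HMAC-SHA256, the `SECRET_KEY` value and the `pseudonymize` helper are assumptions made for illustration; the slides do not name a specific algorithm.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


token = pseudonymize("S12345739Y")
# The same input always yields the same token, so records can still be joined,
# but the token itself reveals nothing readable about the original ID.
same_again = pseudonymize("S12345739Y")
other = pseudonymize("S12345739Z")
```

This also hints at why the slide calls hard methods "dangerous if cracked": anyone who obtains the key can recompute the tokens for candidate IDs and undo the protection in one step.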
For best results, use a combination of techniques
Lv1: RRRM: Quick and dirty
Remove: ID S12345739Y -> ----
Reduce: Mr. Smith -> Mr. S; St 21, XY Road, Bedok -> Bedok
Reclassify: 43 yrs old -> 40+; $1,029,199 income -> $1 million+
Mask: 12345678 -> 1234****
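The four Lv1 steps can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the record layout and the field name `number` are assumptions, and the sample values are the ones shown on the slide.

```python
def remove(value: str) -> str:
    """Remove: drop the value entirely and leave a placeholder."""
    return "----"


def reduce_name(title_and_name: str) -> str:
    """Reduce: keep the title and the surname initial ('Mr. Smith' -> 'Mr. S')."""
    title, surname = title_and_name.split(maxsplit=1)
    return f"{title} {surname[0]}"


def reduce_address(address: str) -> str:
    """Reduce: keep only the last, coarsest part of a comma-separated address."""
    return address.split(",")[-1].strip()


def reclassify_age(age: int) -> str:
    """Reclassify: bucket an exact age into its decade (43 -> '40+')."""
    return f"{age // 10 * 10}+"


def mask(value: str, visible: int = 4) -> str:
    """Mask: keep the first `visible` characters and star out the rest."""
    return value[:visible] + "*" * (len(value) - visible)


record = {
    "id": "S12345739Y",
    "name": "Mr. Smith",
    "address": "St 21, XY Road, Bedok",
    "age": 43,
    "number": "12345678",  # hypothetical field name for the masked value
}

anonymized = {
    "id": remove(record["id"]),
    "name": reduce_name(record["name"]),
    "address": reduce_address(record["address"]),
    "age": reclassify_age(record["age"]),
    "number": mask(record["number"]),
}
```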
But these techniques are not good enough
"There are lots of smokers in the health records, but once you
narrow it down to an anonymous male black smoker born in
1965 who presented at the emergency room with aching
joints, it's actually pretty simple to merge the "anonymous"
record with a different "anonymised" database and out pops
the near-certain identity of the patient." ~ Cory Doctorow,
The Guardian
Multi-variable identification
Big Data is a double-edged sword
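The merge described in the quote can be illustrated with a toy linkage sketch. Everything here is made up for illustration: the two datasets, the field names and the choice of quasi-identifiers (`sex`, `birth_year`, `postal`) are assumptions, not data from the talk.

```python
# "Anonymized" records: no names, but several quasi-identifiers remain.
health_records = [
    {"sex": "M", "birth_year": 1965, "postal": "428***", "diagnosis": "aching joints"},
    {"sex": "F", "birth_year": 1980, "postal": "428***", "diagnosis": "flu"},
    {"sex": "M", "birth_year": 1980, "postal": "519***", "diagnosis": "asthma"},
]

# A second dataset that carries names alongside the same attributes.
public_list = [
    {"name": "Adam Smith", "sex": "M", "birth_year": 1965, "postal": "428***"},
    {"name": "Emma Goldman", "sex": "F", "birth_year": 1980, "postal": "428***"},
    {"name": "David Abram", "sex": "M", "birth_year": 1980, "postal": "519***"},
    {"name": "David Hume", "sex": "M", "birth_year": 1980, "postal": "519***"},
]

QUASI_IDENTIFIERS = ("sex", "birth_year", "postal")


def key(row):
    """Combine the quasi-identifiers of a row into one join key."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(records, directory):
    """Return (name, diagnosis) pairs where exactly one person fits a record."""
    names_by_key = {}
    for person in directory:
        names_by_key.setdefault(key(person), []).append(person["name"])
    reidentified = []
    for rec in records:
        candidates = names_by_key.get(key(rec), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            reidentified.append((candidates[0], rec["diagnosis"]))
    return reidentified


matches = link(health_records, public_list)
```

In this toy data the third record stays ambiguous because two people share its attribute combination; the first two do not, which is the multi-variable problem in miniature.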
Lv2: Black Box (No data visibility for data scientist)
Algorithm, Software, System or People
In-house or 3rd Party
Requests -> Summarized Results
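One way to picture the black box is a small wrapper that keeps the rows private and only answers aggregate requests. The `BlackBox` class, its `average` method and the minimum group size are assumptions made for this sketch; the slides only state that requests go in and summarized results come out.

```python
from statistics import mean


class BlackBox:
    """Holds raw rows privately and only returns summarized results."""

    def __init__(self, rows, min_group_size=3):
        self._rows = [dict(r) for r in rows]
        # Assumed safeguard: refuse to summarize very small groups.
        self._min_group_size = min_group_size

    def average(self, field, filter_field, low, high):
        """Average `field` over rows where low <= row[filter_field] < high."""
        selected = [r[field] for r in self._rows if low <= r[filter_field] < high]
        if len(selected) < self._min_group_size:
            raise ValueError("group too small to summarize")
        return mean(selected)


box = BlackBox([
    {"age": 45, "spend": 120.0},
    {"age": 43, "spend": 80.0},
    {"age": 47, "spend": 100.0},
    {"age": 29, "spend": 60.0},
])

# The data scientist sees only the summary, never the individual rows.
avg_spend_40s = box.average("spend", "age", 40, 50)
```

The filter is passed as plain values rather than as a function so that caller code never touches a raw row.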
Lv2: Sampling (lowers accuracy)
Probability: Simple Random, Systematic, Stratified, Probability Proportional to Size, Cluster
Nonprobability (try not to use these): Convenience, Quota, Purposive
All data
Data Collected
Sample
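Two of the probability sampling methods, simple random and stratified, can be sketched with the standard library. The helper names, the toy rows and the fixed seeds are assumptions for illustration.

```python
import random


def simple_random_sample(rows, k, seed=0):
    """Simple random sampling: every row has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


def stratified_sample(rows, stratum_of, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(stratum_of(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy population: 25 rows tagged "F" and 75 rows tagged "M".
rows = [{"id": i, "gender": "F" if i % 4 == 0 else "M"} for i in range(100)]

srs = simple_random_sample(rows, 10)
strat = stratified_sample(rows, lambda r: r["gender"], 0.2)
```

The stratified draw keeps the 1:3 ratio of the population in the sample, whereas a simple random draw only preserves it on average.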
Lv2: Noise, fake & shuffle within data clusters
Lv2: Add noise / fake data (lowers accuracy)
Before: Name: Adam Smith; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13
After: Name: David Hume; Visit1: 19/04/13; Visit2: 26/05/13; Visit3: 06/06/13
Noise: +5 days. Fake name: male, Scottish.
Group visits by the same person together and apply the same amount of noise.
Before: Name: David Abram; Visit1: 01/02/13; Visit2: 11/02/13
After: Name: David Abram; Visit1: 27/01/13; Visit2: 06/02/13
Noise: -5 days. Affects daily / monthly patterns.
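The date noise can be sketched as a per-person shift, so that every visit by the same person moves by the same number of days. The helper names, the `max_days` bound and the seed are assumptions; the dates are the ones from the slide.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"  # the slides write dates as dd/mm/yy


def shift_visits(visits, days):
    """Shift every visit date of one person by the same number of days."""
    shifted = []
    for visit in visits:
        moved = datetime.strptime(visit, DATE_FORMAT) + timedelta(days=days)
        shifted.append(moved.strftime(DATE_FORMAT))
    return shifted


def add_noise(visits_by_person, max_days=5, seed=0):
    """Pick one random non-zero offset per person and apply it to all visits."""
    rng = random.Random(seed)
    offsets = [d for d in range(-max_days, max_days + 1) if d != 0]
    return {
        person: shift_visits(visits, rng.choice(offsets))
        for person, visits in visits_by_person.items()
    }


# Reproducing the two slide examples with fixed offsets of +5 and -5 days.
adam_shifted = shift_visits(["14/04/13", "21/05/13", "01/06/13"], 5)
david_shifted = shift_visits(["01/02/13", "11/02/13"], -5)

noisy = add_noise({"Adam Smith": ["14/04/13", "21/05/13", "01/06/13"]})
```

Because the offset is shared within a person, the gaps between that person's visits are preserved, while day-of-week and month-boundary patterns are disturbed.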
Lv2: Shuffle (may break data relationships but retains trend)
Before:
Name: Adam Smith; Purchase1: Cabbage; Purchase2: Tomato
Name: David Abram; Purchase1: Bread; Purchase2: Sushi
Name: Emma Goldman; Purchase1: Female Hygiene; Purchase2: Strawberry
After shuffle:
Name: Adam Smith; Purchase1: Bread; Purchase2: Sushi (from David)
Name: David Abram; Purchase1: Cabbage; Purchase2: Tomato (from Adam)
Name: Emma Goldman; Purchase1: Female Hygiene; Purchase2: Strawberry (different gender, cannot shuffle with Adam/David)
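A within-cluster shuffle can be sketched as follows: purchase baskets are only exchanged between records that share a cluster value. The `gender` field, the record layout and the function name are assumptions for this sketch.

```python
import random
from collections import defaultdict


def shuffle_within_clusters(records, cluster_field, value_field, seed=None):
    """Shuffle `value_field` only among records sharing the same cluster value."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        clusters[record[cluster_field]].append(index)

    shuffled = [dict(r) for r in records]  # leave the input untouched
    for indices in clusters.values():
        values = [records[i][value_field] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            shuffled[i][value_field] = value
    return shuffled


records = [
    {"name": "Adam Smith", "gender": "M", "purchases": ["Cabbage", "Tomato"]},
    {"name": "David Abram", "gender": "M", "purchases": ["Bread", "Sushi"]},
    {"name": "Emma Goldman", "gender": "F", "purchases": ["Female Hygiene", "Strawberry"]},
]

result = shuffle_within_clusters(records, "gender", "purchases", seed=1)
```

A cluster with a single member, like Emma's here, cannot be shuffled at all, which is a reminder that small clusters gain little protection from this technique.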
Are we safe? RRRM
Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato
After: ID: -----; Name: Mr. S; Age: 40+ yrs; Postal: 428***; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato
Are we safe? Noise / Fake
Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato
After: ID: -----; Name: Mr. H; Age: 40+ yrs; Postal: 428***; Visit1: 15/04/13; Visit2: 26/05/13; Visit3: 06/06/13; Purchase1: Cabbage; Purchase2: Tomato
Are we safe? Shuffle
Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato
After: ID: -----; Name: Mr. H; Age: 40+ yrs; Postal: 428***; Visit1: 15/04/13; Visit2: 26/05/13; Visit3: 06/06/13; Purchase1: Bread; Purchase2: Sushi
Encrypted
Are we safe? Before vs After
Before: ID: S1235930X; Name: Adam Smith; Age: 45; Postal: 428102; Visit1: 14/04/13; Visit2: 21/05/13; Visit3: 01/06/13; Purchase1: Cabbage; Purchase2: Tomato
After: ID: -----; Name: Mr. H; Age: 40+ yrs; Postal: 428***; Visit1: 15/04/13; Visit2: 26/05/13; Visit3: 06/06/13; Purchase1: Bread; Purchase2: Sushi
Not really safe - Netflix case study
Prof. Arvind Narayanan
Sparse data
Even the most prolific Netflix users have rated only a
tiny fraction of Netflix's enormous library. Thus most
columns, each of which represents a particular movie, are
empty. Therefore, the chance of two or more users
giving the same ratings to the same set of movies is
quite small; a user's set of movie ratings can
almost uniquely identify that user.
Credit: Prof. Arvind Narayanan
Best match: David
2nd best match: Adam
Best match: Alice
2nd best match: Lisa
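The best-match idea can be sketched with a deliberately simple score: count how many known ratings an anonymized user reproduces, then compare the best and second-best candidates. This is a simplified illustration and not Prof. Narayanan's published algorithm; the user IDs, movie titles, ratings and function names are all made up.

```python
def similarity(known, candidate):
    """Count the movies that both rated with exactly the same rating."""
    return sum(1 for movie, rating in known.items() if candidate.get(movie) == rating)


def best_matches(known, anonymized):
    """Return the two highest-scoring (score, user) pairs, best first."""
    scored = sorted(
        ((similarity(known, ratings), user) for user, ratings in anonymized.items()),
        reverse=True,
    )
    return scored[:2]


# Sparse "anonymized" ratings: each user rated only a few movies.
anonymized = {
    "user_1": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    "user_2": {"Movie A": 5, "Movie D": 3},
    "user_3": {"Movie E": 2},
}

# Hypothetical ratings that a known person posted publicly elsewhere.
known = {"Movie A": 5, "Movie B": 1, "Movie C": 4}

(best_score, best_user), (second_score, second_user) = best_matches(known, anonymized)
gap = best_score - second_score  # a large gap suggests a confident match
```

Because the data is sparse, the best candidate usually stands well clear of the runner-up, which is why a handful of known ratings can be enough to single someone out.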
Lv3: Breaking Big Data Machine Learning
Lv3: Noise, fake & shuffle across data clusters
Lv3: Add trend-breaking noise / fake data
Before: Name: David Abram; Visit1: 01/02/13 (bought items A, B, C); Visit2: 11/02/13 (bought items D, E)
After: Name: David Abram; Visit1: 26/01/13 (bought items A, B, C, X); Visit2: 05/02/13 (bought items D, E)
Re-order visits and add noise to the dates.
Fake purchase X; findings related to the sequence of visits will be ignored.
Lv3: Trend-breaking shuffle
Before:
Name: Adam Smith; Purchase1: Cabbage; Purchase2: Tomato
Name: David Abram; Purchase1: Bread; Purchase2: Sushi
Name: Emma Goldman; Purchase1: Female Hygiene; Purchase2: Strawberry
After shuffle:
Name: Adam Smith; Purchase1: Bread; Purchase2: Sushi
Name: Emma Goldman; Purchase1: Cabbage; Purchase2: Tomato
Name: David Abram; Purchase1: Female Hygiene; Purchase2: Strawberry
Gender-related findings will be ignored.
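The cross-cluster shuffle on this slide happens to be a rotation: each person receives the next person's basket, regardless of gender. A minimal sketch, assuming a hypothetical `rotate_baskets` helper and record layout:

```python
def rotate_baskets(records, value_field="purchases"):
    """Give every record the next record's basket, ignoring any clusters."""
    values = [r[value_field] for r in records]
    rotated = values[1:] + values[:1]
    return [{**record, value_field: value} for record, value in zip(records, rotated)]


records = [
    {"name": "Adam Smith", "purchases": ["Cabbage", "Tomato"]},
    {"name": "David Abram", "purchases": ["Bread", "Sushi"]},
    {"name": "Emma Goldman", "purchases": ["Female Hygiene", "Strawberry"]},
]

shuffled = rotate_baskets(records)
by_name = {r["name"]: r["purchases"] for r in shuffled}
```

A fixed rotation is used here only to reproduce the slide; a real pipeline would more likely use a random permutation, and a rotation guarantees that nobody keeps their own basket.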
Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.
Kai Xin, Thia
Interesting reads
• Anonymizing Health Data
• Data protection in the EU: the certainty of uncertainty
• Robust De-anonymization of Large Sparse Datasets
• Eccentricity Explained
• A new way to protect privacy in large-scale genome-wide association studies
• Why 'Anonymous' Data Sometimes Isn't
• Has Big Data Made Anonymity Impossible?
• ‘Anonymous’ Netflix Prize data not so anonymous after all
• A Data Broker Offers a Peek Behind the Curtain