Lions, Zebras & Big Data Anonymization

Upload: kai-xin-thia

Post on 27-Jan-2015

DESCRIPTION

On a recent safari trip to Tanzania, East Africa, I observed that lions are not interested in attacking the human visitors at all. What is the secret behind the safari's (nonexistent) security measures for its visitors? How do we determine the optimum tradeoff between enjoying the safari and staying safe? How can we quantify the risk? And ultimately, how can we apply these lessons, together with data anonymization techniques, to Big Data?

TRANSCRIPT

Page 1: Lions, zebras and Big Data Anonymization

Lions, Zebras & Big Data Anonymization

Page 2: Lions, zebras and Big Data Anonymization

Data anonymization is the process applied to data to prevent the identification of individuals, making it possible to share and analyze data securely.

Page 3: Lions, zebras and Big Data Anonymization

Disclaimer: The material shared here is personal research and does not represent any organization's policies.

Prof. Khaled El Emam worked on anonymizing the Heritage Health Prize data.

Page 4: Lions, zebras and Big Data Anonymization

Are we safe?

Page 5: Lions, zebras and Big Data Anonymization

Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.

Page 6: Lions, zebras and Big Data Anonymization

The safari rules

1. If you are the lion, you just need to be faster than the slowest zebra.

2. If you are the zebra, you need to be able to escape from all the lions.

3. If you are the safari visitor, you want to get as close as possible to the lions & zebras without getting hurt.

Page 7: Lions, zebras and Big Data Anonymization

Chart: Enjoyment vs. Security

Max enjoyment: Live with the lions for a week

Max security: Stay at home and watch National Geographic

X: Point of stupidity

Tradeoff determined by risk appetite

Page 8: Lions, zebras and Big Data Anonymization

How can we apply this to

Data Anonymization

Page 9: Lions, zebras and Big Data Anonymization

The data anonymization rules

1. If you are the hacker, you just need to hack through the weakest link.

2. If you are the data, you need to be protected from all the hackers.

3. If you are anonymizing the data, you want to retain as much detail as possible while minimizing risk.

Page 10: Lions, zebras and Big Data Anonymization

Chart: Analytical Usefulness vs. Security

Max usefulness: Raw data

Max security: Lock up the data, don’t do any analysis

X: Point of stupidity

Tradeoff determined by risk appetite

Page 11: Lions, zebras and Big Data Anonymization

Donald’s Matrix

Known Knowns

Known Unknowns

Unknown Unknowns

Unknown Knowns

*Important to know

Page 12: Lions, zebras and Big Data Anonymization

Known Knowns

Most users do not care.

Not all data that can be shared should be shared.

Data policies need updating.

Laws, Standards & Regulations.

Will people abuse their access rights?

What is the damage if data is compromised?

Motivations of hackers?

Known Unknowns

Unknown Unknowns

Unknown Knowns

?

Minimize risk, find out more

Be prepared

What we should already know

Who has official access?

Resources we have?

Value of data?

What are the identifiers?

Sharing the data?

Different data policies?

Laws, Standards & Regulations.

Page 13: Lions, zebras and Big Data Anonymization

What are the techniques for Data Anonymization?

Page 14: Lions, zebras and Big Data Anonymization

‘Hard’ Methods: more difficult to analyze

Hashing

Encryption

‘Soft’ Methods: easier to analyze

Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+ yrs), Mask (1234****)

Lv2: Black box, Sampling, Add noise / fake data, Shuffle

Lv3: Breaking big data machine learning

Page 15: Lions, zebras and Big Data Anonymization

‘Hard’ Methods: strong security, difficult to analyze, dangerous if cracked

Hashing

Encryption

‘Soft’ Methods: flexible security strength, easier to analyze, anonymized

Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+ yrs), Mask (1234****)

Lv2: Black box, Sampling, Add noise / fake data, Shuffle

Lv3: Breaking big data machine learning

For best results, use a combination of techniques

Page 16: Lions, zebras and Big Data Anonymization

Lv1: RRRM: Quick and dirty

Remove: ID S12345739Y -> ----

Reduce: Mr. Smith -> Mr. S; St 21, XY Road, Bedok -> Bedok

Reclassify: 43 yrs old -> 40+; $1,029,199 income -> $1 million+

Mask: 12345678 -> 1234****

But these techniques are not good enough
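The four RRRM steps can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the field names (`id`, `name`, `age`, `number`) are assumptions made for the example, not part of the original slides.

```python
def remove(value: str) -> str:
    """Remove: drop the value entirely and leave a placeholder."""
    return "----"

def reduce_name(title_and_surname: str) -> str:
    """Reduce: keep the title and only the first letter of the surname."""
    title, surname = title_and_surname.split(maxsplit=1)
    return f"{title} {surname[0]}"

def reclassify_age(age: int, band: int = 10) -> str:
    """Reclassify: replace an exact age with the lower bound of its band."""
    return f"{(age // band) * band}+"

def mask(digits: str, visible: int = 4) -> str:
    """Mask: keep the first `visible` characters and star out the rest."""
    return digits[:visible] + "*" * (len(digits) - visible)

# Hypothetical record built from the values shown on the slide.
record = {"id": "S12345739Y", "name": "Mr. Smith", "age": 43, "number": "12345678"}

anonymized = {
    "id": remove(record["id"]),
    "name": reduce_name(record["name"]),
    "age": reclassify_age(record["age"]),
    "number": mask(record["number"]),
}
print(anonymized)
```

Each step works on a single field in isolation, which is why RRRM is quick to apply but does nothing about combinations of fields.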

Page 17: Lions, zebras and Big Data Anonymization

"There are lots of smokers in the health records, but once you narrow it down to an anonymous male black smoker born in 1965 who presented at the emergency room with aching joints, it's actually pretty simple to merge the "anonymous" record with a different "anonymised" database and out pops the near-certain identity of the patient." ~ Cory Doctorow, The Guardian

Multivariable identification

Big Data is a double-edged sword
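One rough way to see multivariable identification is to count how many records become unique as more attributes are combined. The sketch below uses a tiny made-up data set; the records, field names, and the `unique_fraction` helper are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical "anonymized" records: no names, but several attributes remain.
records = [
    {"sex": "M", "birth_year": 1965, "smoker": True,  "complaint": "aching joints"},
    {"sex": "M", "birth_year": 1965, "smoker": True,  "complaint": "cough"},
    {"sex": "F", "birth_year": 1965, "smoker": True,  "complaint": "cough"},
    {"sex": "M", "birth_year": 1970, "smoker": False, "complaint": "aching joints"},
    {"sex": "M", "birth_year": 1965, "smoker": False, "complaint": "cough"},
]

def unique_fraction(rows, keys):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    singles = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return singles / len(rows)

for keys in (["smoker"],
             ["sex", "smoker"],
             ["sex", "birth_year", "smoker", "complaint"]):
    print(keys, unique_fraction(records, keys))
```

In this toy data no record is unique on `smoker` alone, but every record is unique once all four attributes are combined, which is the situation the quote describes.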

Page 18: Lions, zebras and Big Data Anonymization

Lv2: Black Box (no data visibility for the data scientist)

Algorithm, Software, System or People

In-house or 3rd Party

Requests -> Black box -> Summarized results
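A black box of the "algorithm" kind might look roughly like the class below: raw rows stay inside, and callers only receive summarized results. The `BlackBox` class, its `average` method, the minimum group size, and the sample rows (including the area name "Tampines") are hypothetical choices for this sketch, not something the slides specify.

```python
from statistics import mean

class BlackBox:
    """Holds raw rows privately and answers only aggregate requests."""

    def __init__(self, rows, min_group_size=3):
        self._rows = list(rows)               # raw data is never returned
        self._min_group_size = min_group_size  # assumed safeguard, not from the slides

    def average(self, field, **filters):
        """Mean of `field` over rows matching `filters`, or None if too few match."""
        matches = [row[field] for row in self._rows
                   if all(row.get(k) == v for k, v in filters.items())]
        if len(matches) < self._min_group_size:
            return None                       # refuse: result could expose individuals
        return mean(matches)

box = BlackBox([
    {"area": "Bedok", "age": 43},
    {"area": "Bedok", "age": 51},
    {"area": "Bedok", "age": 38},
    {"area": "Tampines", "age": 29},
])
print(box.average("age", area="Bedok"))      # summary over three people
print(box.average("age", area="Tampines"))   # only one person matches, so refused
```

Refusing small groups is one possible design choice; without it, a request that matches a single person would return that person's raw value.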

Page 19: Lions, zebras and Big Data Anonymization

Lv2: Sampling (lowers accuracy)

Probability: Simple Random, Systematic, Stratified, Probability Proportional to Size, Cluster

Nonprobability (try not to use these): Convenience, Quota, Purposive

Page 20: Lions, zebras and Big Data Anonymization

Diagram: All data > Data collected > Sample
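Three of the probability sampling methods listed for this technique can be sketched with Python's standard `random` module. The helper names, the 100-row data set, and the `gender` stratum are assumptions for illustration only.

```python
import random

def simple_random(rows, n, rng):
    """Simple random: every row has the same chance of being picked."""
    return rng.sample(rows, n)

def systematic(rows, step, rng):
    """Systematic: pick a random start, then take every `step`-th row."""
    start = rng.randrange(step)
    return rows[start::step]

def stratified(rows, key, fraction, rng):
    """Stratified: sample the same fraction from each group defined by `key`."""
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [{"id": i, "gender": "F" if i % 4 == 0 else "M"} for i in range(100)]

srs = simple_random(rows, 20, rng)
sys_sample = systematic(rows, 10, rng)
strat = stratified(rows, lambda r: r["gender"], 0.2, rng)
print(len(srs), len(sys_sample), len(strat))
```

Stratified sampling keeps the 25/75 gender split of the full data in the sample, which simple random sampling only does on average.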

Page 21: Lions, zebras and Big Data Anonymization

Lv2: Noise, fake & shuffle within data clusters

Page 22: Lions, zebras and Big Data Anonymization

Lv2: Add noise / fake data (lowers accuracy)

Group visits by the same person together and apply the same amount of noise

Example 1 (noise: +5 days; fake, male Scottish name)

Before: Name: Adam Smith, Visit1: 14/04/13, Visit2: 21/05/13, Visit3: 01/06/13

After: Name: David Hume, Visit1: 19/04/13, Visit2: 26/05/13, Visit3: 06/06/13

Example 2 (noise: -5 days)

Before: Name: David Abram, Visit1: 01/02/13, Visit2: 11/02/13

After: Name: David Abram, Visit1: 27/01/13, Visit2: 06/02/13

Affects daily / monthly pattern
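The date noise in these examples can be reproduced with a small helper that applies one offset to all of a person's visits, so the gaps between visits are preserved. The function names and the `max_days` limit are assumptions for this sketch.

```python
import random
from datetime import datetime, timedelta

FMT = "%d/%m/%y"  # day/month/two-digit year, as on the slide

def shift_dates(visits, days):
    """Shift every date in `visits` by the same number of days."""
    offset = timedelta(days=days)
    return [(datetime.strptime(v, FMT) + offset).strftime(FMT) for v in visits]

def add_noise(visits_by_person, max_days=5, rng=None):
    """Pick one random non-zero offset per person and apply it to all their visits."""
    rng = rng or random.Random()
    choices = [d for d in range(-max_days, max_days + 1) if d != 0]
    return {person: shift_dates(visits, rng.choice(choices))
            for person, visits in visits_by_person.items()}

print(shift_dates(["14/04/13", "21/05/13", "01/06/13"], +5))
# ['19/04/13', '26/05/13', '06/06/13']
print(shift_dates(["01/02/13", "11/02/13"], -5))
# ['27/01/13', '06/02/13']
```

Because every visit of one person moves by the same amount, the interval between visits survives, while the exact days of the week and month do not.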

Page 23: Lions, zebras and Big Data Anonymization

Lv2: Shuffle (may break data relationships but retains trend)

Before:

Name: Adam Smith, Purchase1: Cabbage, Purchase2: Tomato

Name: David Abram, Purchase1: Bread, Purchase2: Sushi

Name: Emma Goldman, Purchase1: Female Hygiene, Purchase2: Strawberry

After shuffle:

Name: Adam Smith, Purchase1: Bread, Purchase2: Sushi (from David)

Name: David Abram, Purchase1: Cabbage, Purchase2: Tomato (from Adam)

Name: Emma Goldman, Purchase1: Female Hygiene, Purchase2: Strawberry (different gender, cannot shuffle with Adam/David)
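Shuffling within a cluster can be sketched as permuting one field among records that share a cluster value. Here the cluster is an assumed `gender` field, and the function name and record layout are hypothetical.

```python
import random

def shuffle_within_clusters(records, cluster_key, field, rng):
    """Permute `field` among records that share the same `cluster_key` value."""
    clusters = {}
    for index, record in enumerate(records):
        clusters.setdefault(record[cluster_key], []).append(index)

    result = [dict(record) for record in records]  # leave the input untouched
    for indices in clusters.values():
        values = [records[i][field] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            result[i][field] = value
    return result

records = [
    {"name": "Adam Smith",   "gender": "M", "purchases": ("Cabbage", "Tomato")},
    {"name": "David Abram",  "gender": "M", "purchases": ("Bread", "Sushi")},
    {"name": "Emma Goldman", "gender": "F", "purchases": ("Female Hygiene", "Strawberry")},
]
shuffled = shuffle_within_clusters(records, "gender", "purchases", random.Random(1))
for row in shuffled:
    print(row)
```

Emma is alone in her cluster, so her purchases stay put, while Adam's and David's may or may not be swapped depending on the random draw. A cluster of one gets no protection from shuffling.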

Page 24: Lions, zebras and Big Data Anonymization

Are we safe?

RRRM

ID: S1235930X

Name: Adam Smith

Age: 45

Postal:428102

Visit1: 14/04/13

Visit2: 21/05/13

Visit3: 01/06/13

Purchase1 : Cabbage

Purchase2 : Tomato

ID: -----

Name: Mr. S

Age: 40+yrs

Postal:428***

Visit1: 14/04/13

Visit2: 21/05/13

Visit3: 01/06/13

Purchase1 : Cabbage

Purchase2 : Tomato

Page 25: Lions, zebras and Big Data Anonymization

Are we safe?

Noise / Fake

ID: S1235930X

Name: Adam Smith

Age: 45

Postal:428102

Visit1: 14/04/13

Visit2: 21/05/13

Visit3: 01/06/13

Purchase1 : Cabbage

Purchase2 : Tomato

ID: -----

Name: Mr. H

Age: 40+yrs

Postal:428***

Visit1: 19/04/13

Visit2: 26/05/13

Visit3: 06/06/13

Purchase1 : Cabbage

Purchase2 : Tomato

Page 26: Lions, zebras and Big Data Anonymization

Are we safe?

Shuffle

ID: S1235930X

Name: Adam Smith

Age: 45

Postal:428102

Visit1: 14/04/13

Visit2: 21/05/13

Visit3: 01/06/13

Purchase1 : Cabbage

Purchase2 : Tomato

ID: -----

Name: Mr. H

Age: 40+yrs

Postal:428***

Visit1: 19/04/13

Visit2: 26/05/13

Visit3: 06/06/13

Purchase1 : Bread

Purchase2 : Sushi

Page 27: Lions, zebras and Big Data Anonymization

Encrypted

Are we safe? Before Vs After

ID: S1235930X

Name: Adam Smith

Age: 45

Postal:428102

Visit1: 14/04/13

Visit2: 21/05/13

Visit3: 01/06/13

Purchase1 : Cabbage

Purchase2 : Tomato

ID: -----

Name: Mr. H

Age: 40+yrs

Postal:428***

Visit1: 19/04/13

Visit2: 26/05/13

Visit3: 06/06/13

Purchase1 : Bread

Purchase2 : Sushi

Page 28: Lions, zebras and Big Data Anonymization

Not really safe - Netflix case study

Prof. Arvind Narayanan

Page 29: Lions, zebras and Big Data Anonymization

Sparse data

Even the most prolific Netflix users have rated only a tiny fraction of Netflix’s enormous library. Thus most columns, each of which represents a particular movie, are empty. The chance of two or more users giving the same ratings to the same set of movies is therefore quite small, so a user’s set of movie ratings can almost uniquely identify that user.

Page 30: Lions, zebras and Big Data Anonymization

Credit: Prof. Arvind Narayanan

Best match: David; 2nd best match: Adam

Best match: Alice; 2nd best match: Lisa
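The idea of ranking a best and a second-best match on sparse ratings can be sketched as follows. The scoring rule here (counting identical ratings) is a deliberately simplified assumption and not the scoring used in the actual study; the user IDs and movie titles are made up.

```python
def score(aux, candidate):
    """Count movies where the candidate's rating equals the known (auxiliary) rating."""
    return sum(1 for movie, rating in aux.items() if candidate.get(movie) == rating)

def best_matches(aux, anonymized, top=2):
    """Rank anonymized users by score and return the `top` (user_id, score) pairs."""
    ranked = sorted(anonymized, key=lambda uid: score(aux, anonymized[uid]), reverse=True)
    return [(uid, score(aux, anonymized[uid])) for uid in ranked[:top]]

# Hypothetical anonymized ratings: user IDs are meaningless, most movies unrated.
anonymized = {
    "user_17": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    "user_42": {"Movie A": 5, "Movie D": 3},
    "user_99": {"Movie E": 2},
}

# Ratings the attacker already knows about one real person, e.g. from a public site.
aux = {"Movie A": 5, "Movie B": 1, "Movie C": 4}

print(best_matches(aux, anonymized))
```

With sparse data the best match usually scores far above the second best, and that gap is what makes an attacker confident the match is the right person.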

Page 31: Lions, zebras and Big Data Anonymization

Lv3: Breaking Big Data Machine Learning

Page 32: Lions, zebras and Big Data Anonymization

Lv3: Noise, fake & shuffle across data clusters

Page 33: Lions, zebras and Big Data Anonymization

Lv3: Add trend-breaking noise / fake data

Before: Name: David Abram, Visit1: 01/02/13 (bought items A, B, C), Visit2: 11/02/13 (bought items D, E)

After: Name: David Abram, Visit1: 26/01/13 (bought items A, B, C, X), Visit2: 05/02/13 (bought items D, E)

Reorder visits, add noise to dates

Fake purchase: X

Findings related to X and to the sequence of visits will be ignored
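Two of the steps on this slide, shifting the visit dates and injecting a fake purchase, can be sketched as below; reordering visits is left out. The `break_trends` function and its record layout are assumptions for illustration. The fake item is returned so that analysts know which findings to ignore.

```python
import random
from datetime import datetime, timedelta

FMT = "%d/%m/%y"

def break_trends(visits, fake_items, days, rng):
    """Shift all visits by `days` and add one fake item to a randomly chosen visit.

    Returns the noisy visits and the injected item (to be ignored in analysis).
    """
    offset = timedelta(days=days)
    noisy = [
        {"date": (datetime.strptime(v["date"], FMT) + offset).strftime(FMT),
         "items": list(v["items"])}           # copy so the input stays unchanged
        for v in visits
    ]
    injected = rng.choice(fake_items)
    rng.choice(noisy)["items"].append(injected)
    return noisy, injected

visits = [
    {"date": "01/02/13", "items": ["A", "B", "C"]},
    {"date": "11/02/13", "items": ["D", "E"]},
]
noisy, injected = break_trends(visits, fake_items=["X"], days=-6, rng=random.Random(0))
print(noisy, "ignore findings about:", injected)
```

A shift of -6 days reproduces the dates in the example (26/01/13 and 05/02/13); which visit receives the fake item depends on the random draw.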

Page 34: Lions, zebras and Big Data Anonymization

Lv3: Trend-breaking shuffle

Before:

Name: Adam Smith, Purchase1: Cabbage, Purchase2: Tomato

Name: David Abram, Purchase1: Bread, Purchase2: Sushi

Name: Emma Goldman, Purchase1: Female Hygiene, Purchase2: Strawberry

After shuffle:

Name: Adam Smith, Purchase1: Bread, Purchase2: Sushi

Name: Emma Goldman, Purchase1: Cabbage, Purchase2: Tomato

Name: David Abram, Purchase1: Female Hygiene, Purchase2: Strawberry

Gender-related findings will be ignored

Page 35: Lions, zebras and Big Data Anonymization

Chart: Analytical Usefulness vs. Security

Max usefulness: Raw data

Max security: Lock up the data, don’t do any analysis

X: Point of stupidity

Tradeoff determined by risk appetite

Page 36: Lions, zebras and Big Data Anonymization

Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.

Page 37: Lions, zebras and Big Data Anonymization
Page 38: Lions, zebras and Big Data Anonymization

Chart: Analytical Usefulness vs. Security

X: Point of stupidity

Donald’s Matrix: Known Knowns, Known Unknowns, Unknown Unknowns, Unknown Knowns

Kai Xin, Thia