Lions, Zebras & Big Data Anonymization

Data anonymization is the process applied to data to prevent the identification of individuals, making it possible to share and analyze data securely.

Disclaimer: The material shared here is personal research and does not represent any organization's policies.

Prof. Khaled El Emam worked on anonymizing the Heritage Health Prize data.

Are we safe?

Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to us.

The safari rules

1. If you are the lion, you just need to be faster than the slowest zebra.

2. If you are the zebra, you need to be able to escape from all the lions.

3. If you are a safari visitor, you want to get as close as possible to the lions & zebras without getting hurt.

Enjoyment

Security

X Point of stupidity

Max enjoyment: Live with the lions for a week

Max security: Stay at home and watch National Geographic

Determined by risk appetite

How can we apply this to Data Anonymization?

The data anonymization rules

1. If you are the hacker, you just need to hack through the weakest link.

2. If you are the data, you need to be protected from all the hackers.

3. If you are anonymizing the data, you want to retain as much detail as possible while minimizing risk.

Analytical Usefulness

Security


Max Usefulness: Raw data


Max security: Lock up data, don’t do any analysis

Known Knowns

Known Unknowns

Unknown Unknowns

Unknown Knowns

Donald’s Matrix

*Important to know

Known Knowns

Most users do not care.

Not all data that can be shared should be shared.

Data policies need updating.

Laws, Standards & Regulations.

Will people abuse their access rights?

What are the damages if data gets compromised?

Motivations of hackers?

Known Unknowns

Unknown Unknowns

Unknown Knowns

?

Minimize risk, find out more

Be prepared

What we should already know

Who has official access?

Resources we have?

Value of data?

What are the identifiers?

Sharing the data?

Different data policies?

Laws, Standards & Regulations.

What are the techniques for Data Anonymization?

‘Hard’ Methods: More difficult to analyze

Hashing

Encryption

‘Soft’ Methods: Easier to analyze

Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+ yrs), Mask (1234****)

Lv2: Black box, Sampling, Add noise / fake data, Shuffle

Lv3: Breaking big data machine learning
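As an illustration of the 'hard' hashing method listed above, here is a minimal Python sketch that replaces an ID with a keyed hash (HMAC-SHA256). The function name `pseudonymize`, the key, and the sample ID are assumptions made for this example; the slides do not specify a particular hashing scheme.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the identifier as a hex string."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key and ID, for illustration only.
key = b"replace-with-a-real-secret"
token = pseudonymize("S1235930X", key)
print(token)
```

The same ID always maps to the same token, so records can still be joined, but the analyst cannot read the original value. A keyed hash is used here because a plain hash of a short, structured ID can be reversed by hashing every possible ID.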

‘Hard’ Methods: Strong security, difficult to analyze, dangerous if cracked

‘Soft’ Methods: Flexible security strength, easier to analyze, anonymized

For best results, use a combination of techniques

Lv1: RRRM (Remove, Reduce, Reclassify, Mask): Quick and dirty

Remove: ID S12345739Y -> ----

Reduce: Mr. Smith -> Mr. S; St 21, XY Road, Bedok -> Bedok

Reclassify: 43 yrs old -> 40+; $1,029,199 income -> $1 million+

Mask: 12345678 -> 1234****
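The four RRRM steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names, the record layout, and the field name `number` are assumptions made for the example.

```python
def remove(value: str) -> str:
    """Remove: drop a direct identifier entirely."""
    return "-----"

def reduce_name(title: str, surname: str) -> str:
    """Reduce: keep the title and only the first letter of the surname."""
    return f"{title} {surname[0]}"

def reclassify_age(age: int, width: int = 10) -> str:
    """Reclassify: replace an exact age with the lower bound of its band."""
    return f"{(age // width) * width}+"

def mask(value: str, keep: int = 4) -> str:
    """Mask: keep the first `keep` characters and star out the rest."""
    return value[:keep] + "*" * (len(value) - keep)

# Hypothetical record built from the values on the slide.
record = {"id": "S12345739Y", "title": "Mr.", "surname": "Smith",
          "age": 43, "number": "12345678"}

anonymized = {
    "id": remove(record["id"]),
    "name": reduce_name(record["title"], record["surname"]),
    "age": reclassify_age(record["age"]),
    "number": mask(record["number"]),
}
print(anonymized)
```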

But these techniques are not good enough

"There are lots of smokers in the health records, but once you narrow it down to an anonymous male black smoker born in 1965 who presented at the emergency room with aching joints, it's actually pretty simple to merge the "anonymous" record with a different "anonymised" database and out pops the near-certain identity of the patient." ~ Cory Doctorow, The Guardian

Multi-variable identification

Big Data is a double-edged sword
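The multi-variable identification described in the quote can be sketched as a join on shared attributes. The two tiny datasets, the column names, and the choice of attributes below are all made up for illustration; they are not taken from any real database.

```python
from collections import defaultdict

# Attributes assumed to appear in both datasets (hypothetical choice).
QUASI_IDENTIFIERS = ("sex", "birth_year", "postal_prefix")

# "Anonymized" records: no names, but the shared attributes are intact.
health = [
    {"sex": "M", "birth_year": 1965, "postal_prefix": "428", "complaint": "aching joints"},
    {"sex": "F", "birth_year": 1980, "postal_prefix": "428", "complaint": "fever"},
]

# A second dataset that does carry names.
public = [
    {"name": "Person A", "sex": "M", "birth_year": 1965, "postal_prefix": "428"},
    {"name": "Person B", "sex": "F", "birth_year": 1980, "postal_prefix": "428"},
    {"name": "Person C", "sex": "F", "birth_year": 1980, "postal_prefix": "428"},
]

def key(row):
    """Build the join key from the shared attributes."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)

def link(anonymized, named):
    """For each anonymized row, list the names that share its attributes."""
    index = defaultdict(list)
    for row in named:
        index[key(row)].append(row["name"])
    return [(row, index.get(key(row), [])) for row in anonymized]

linked = link(health, public)
for row, candidates in linked:
    status = "re-identified" if len(candidates) == 1 else "ambiguous"
    print(row["complaint"], candidates, status)
```

A record is exposed as soon as its combination of attributes matches exactly one named person. The more variables a dataset holds, the more often that happens.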

Lv2: Black Box (No data visibility for data scientist)

Algorithm, Software, System or People

In-house or 3rd Party

Requests -> Summarized Results
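A black box of this kind can be sketched as a small class that keeps the raw rows private and only answers summary requests. The class name, the `average` method, and the minimum group size are assumptions for this example; the slides do not prescribe an interface.

```python
from statistics import mean

class BlackBox:
    """Holds raw rows privately and only returns summarized results."""

    MIN_GROUP_SIZE = 3  # assumed threshold: refuse to summarize tiny groups

    def __init__(self, rows):
        self._rows = list(rows)

    def average(self, field, **filters):
        """Average of `field` over rows matching all filters, or None if too few rows."""
        matches = [r[field] for r in self._rows
                   if all(r.get(k) == v for k, v in filters.items())]
        if len(matches) < self.MIN_GROUP_SIZE:
            return None
        return mean(matches)

# Hypothetical data held inside the box.
box = BlackBox([
    {"age_band": "40+", "spend": 10},
    {"age_band": "40+", "spend": 20},
    {"age_band": "40+", "spend": 30},
    {"age_band": "20+", "spend": 99},
])
print(box.average("spend", age_band="40+"))
print(box.average("spend", age_band="20+"))
```

The second request returns no result because it would describe a single person. Refusing small groups is one simple way to stop a summary from leaking an individual row.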

Lv2: Sampling (lowers accuracy)

Probability: Simple Random, Systematic, Stratified, Probability Proportional to Size, Cluster

Nonprobability (try not to use these): Convenience, Quota, Purposive

All data

Data Collected

Sample
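Three of the probability sampling methods named above (simple random, systematic, stratified) can be sketched with Python's standard library. The function names, the sample data, and the `gender` field used for stratification are assumptions made for this example.

```python
import random
from collections import defaultdict

def simple_random(rows, n, rng):
    """Every row has the same chance of being picked."""
    return rng.sample(rows, n)

def systematic(rows, step, rng):
    """Pick every `step`-th row after a random starting point."""
    start = rng.randrange(step)
    return rows[start::step]

def stratified(rows, key, fraction, rng):
    """Sample the same fraction from each stratum (group) of rows."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [{"id": i, "gender": "F" if i % 4 == 0 else "M"} for i in range(100)]

srs = simple_random(rows, 10, rng)
sys_sample = systematic(rows, 10, rng)
strat = stratified(rows, lambda r: r["gender"], 0.2, rng)
print(len(srs), len(sys_sample), len(strat))
```

Stratified sampling keeps the proportions of each group in the sample, which helps when a small group would otherwise be missed by chance.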

Lv2: Noise, fake & shuffle within data clusters

Lv2: Add noise / fake data (lowers accuracy)

Before: Name: Adam Smith, Visit1: 14/04/13, Visit2: 21/05/13, Visit3: 01/06/13

After: Name: David Hume, Visit1: 19/04/13, Visit2: 26/05/13, Visit3: 06/06/13

Noise: +5 days. Fake name: male, Scottish.

Group visits by the same person together and apply the same amount of noise.

Before: Name: David Abram, Visit1: 01/02/13, Visit2: 11/02/13

After: Name: David Abram, Visit1: 27/01/13, Visit2: 06/02/13

Noise: -5 days. Affects daily / monthly patterns.
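The date noise above can be sketched as follows: all visits by the same person are shifted by one offset, so the gaps between visits stay intact. The function names and the `max_days` parameter are assumptions for this example; replacing names with fake ones is left out.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"  # day/month/two-digit year, as on the slides

def apply_offset(visits, days):
    """Shift every visit date by the same number of days."""
    delta = timedelta(days=days)
    return [(datetime.strptime(v, DATE_FORMAT) + delta).strftime(DATE_FORMAT)
            for v in visits]

def add_date_noise(visits_by_person, max_days, rng):
    """Draw one random non-zero offset per person and apply it to all their visits."""
    offsets = [d for d in range(-max_days, max_days + 1) if d != 0]
    return {person: apply_offset(visits, rng.choice(offsets))
            for person, visits in visits_by_person.items()}

# The two examples from the slide, with the offsets fixed at +5 and -5 days.
print(apply_offset(["14/04/13", "21/05/13", "01/06/13"], 5))
print(apply_offset(["01/02/13", "11/02/13"], -5))

# Random offsets, one per person (hypothetical seed and range).
noisy = add_date_noise({"Adam Smith": ["14/04/13", "21/05/13"]}, 7, random.Random(1))
print(noisy)
```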

Lv2: Shuffle (may break data relationships but retains trend)

Before:

Name: Adam Smith, Purchase1: Cabbage, Purchase2: Tomato

Name: David Abram, Purchase1: Bread, Purchase2: Sushi

Name: Emma Goldman, Purchase1: Female Hygiene, Purchase2: Strawberry (different gender, cannot shuffle with Adam/David)

After shuffle:

Name: Adam Smith, Purchase1: Bread, Purchase2: Sushi (from David)

Name: David Abram, Purchase1: Cabbage, Purchase2: Tomato (from Adam)
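Shuffling within a cluster, as in the example above, can be sketched by grouping records on a cluster attribute and permuting one field inside each group. The function name and the explicit `gender` field are assumptions for this example; the slides only imply the grouping.

```python
import random
from collections import defaultdict

def shuffle_within_clusters(records, cluster_key, field, rng):
    """Permute `field` among records that share the same `cluster_key` value."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[rec[cluster_key]].append(i)
    result = [dict(r) for r in records]  # copy so the input is left untouched
    for indices in groups.values():
        values = [records[i][field] for i in indices]
        rng.shuffle(values)
        for i, v in zip(indices, values):
            result[i][field] = v
    return result

records = [
    {"name": "Adam Smith", "gender": "M", "purchases": ["Cabbage", "Tomato"]},
    {"name": "David Abram", "gender": "M", "purchases": ["Bread", "Sushi"]},
    {"name": "Emma Goldman", "gender": "F", "purchases": ["Female Hygiene", "Strawberry"]},
]
shuffled = shuffle_within_clusters(records, "gender", "purchases", random.Random(0))
for rec in shuffled:
    print(rec["name"], rec["purchases"])
```

Emma is alone in her cluster, so her purchases stay with her. In a cluster this small, a random permutation may also leave the other records unchanged; larger clusters make that unlikely.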

Are we safe?

RRRM

Field      Before      After
ID         S1235930X   -----
Name       Adam Smith  Mr. S
Age        45          40+yrs
Postal     428102      428***
Visit1     14/04/13    14/04/13
Visit2     21/05/13    21/05/13
Visit3     01/06/13    01/06/13
Purchase1  Cabbage     Cabbage
Purchase2  Tomato      Tomato

Are we safe?

Noise / Fake

Field      Before      After
ID         S1235930X   -----
Name       Adam Smith  Mr. H
Age        45          40+yrs
Postal     428102      428***
Visit1     14/04/13    15/04/13
Visit2     21/05/13    26/05/13
Visit3     01/06/13    06/06/13
Purchase1  Cabbage     Cabbage
Purchase2  Tomato      Tomato

Are we safe?

Shuffle

Field      Before      After
ID         S1235930X   -----
Name       Adam Smith  Mr. H
Age        45          40+yrs
Postal     428102      428***
Visit1     14/04/13    15/04/13
Visit2     21/05/13    26/05/13
Visit3     01/06/13    06/06/13
Purchase1  Cabbage     Bread
Purchase2  Tomato      Sushi

Encrypted

Are we safe? Before vs After

Field      Before      After
ID         S1235930X   -----
Name       Adam Smith  Mr. H
Age        45          40+yrs
Postal     428102      428***
Visit1     14/04/13    15/04/13
Visit2     21/05/13    26/05/13
Visit3     01/06/13    06/06/13
Purchase1  Cabbage     Bread
Purchase2  Tomato      Sushi

Not really safe - Netflix case study

Prof. Arvind Narayanan

Sparse data

Even the most prolific Netflix users have rated only a tiny fraction of Netflix’s enormous library. Most columns, each of which represents a particular movie, are therefore empty. The chance of two or more users giving the same ratings to the same set of movies is quite small, so a user’s set of movie ratings can almost uniquely identify that user.

Credit: Prof. Arvind Narayanan

Best match: David

2nd best match: Adam

Best match: Alice

2nd best match: Lisa
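The best match versus second-best match idea can be sketched as below. This is a heavily simplified illustration, not the method used in the Netflix study: the scoring rule (count of exactly matching ratings), the threshold, the function names, and all data are assumptions made for this example.

```python
from statistics import pstdev

def score(aux, candidate):
    """Count the movies where the known rating equals the candidate's rating."""
    return sum(1 for movie, rating in aux.items() if candidate.get(movie) == rating)

def best_match(aux, released, threshold=1.5):
    """Return the best-matching user only if it stands out from the runner-up."""
    scores = sorted(((score(aux, r), user) for user, r in released.items()),
                    reverse=True)
    (best, best_user), (second, _) = scores[0], scores[1]
    spread = pstdev([s for s, _ in scores])
    if spread == 0:
        return None  # every candidate looks the same
    gap = (best - second) / spread  # how far the best match is ahead
    return best_user if gap >= threshold else None

# Hypothetical sparse "anonymized" ratings: most movies are unrated per user.
released = {
    "user_1": {"m1": 5, "m2": 3, "m7": 4},
    "user_2": {"m1": 5, "m3": 2},
    "user_3": {"m4": 1, "m5": 4},
}

# Ratings an attacker knows about one person from another source.
print(best_match({"m1": 5, "m2": 3, "m7": 4}, released))
print(best_match({"m1": 5}, released))
```

With three known ratings the best match is far ahead of the second best, so the sketch names a user. With a single common rating two users tie, and no match is claimed.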

Lv3: Breaking Big Data Machine Learning

Lv3: Noise, fake & shuffle across data clusters

Lv3: Add trend-breaking noise / fake data

Before: Name: David Abram, Visit1: 01/02/13 (bought items A, B, C), Visit2: 11/02/13 (bought items D, E)

After: Name: David Abram, Visit1: 26/01/13 (bought items A, B, C, X), Visit2: 05/02/13 (bought items D, E)

Reorder visits, add noise to the dates, and add fake purchase X. Findings related to X and to the sequence of visits will be ignored.
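Part of the trend-breaking noise above can be sketched as follows: shift the visit dates and add one fake purchase to a randomly chosen visit. Reordering visits is not covered. The function name, the `max_days` range, and the list of fake items are assumptions made for this example.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"  # day/month/two-digit year, as on the slides

def add_trend_breaking_noise(visits, fake_items, max_days, rng):
    """Shift all visit dates by one random non-zero offset and add one fake item."""
    offset = rng.choice([d for d in range(-max_days, max_days + 1) if d != 0])
    noisy = []
    for date, items in visits:
        shifted = datetime.strptime(date, DATE_FORMAT) + timedelta(days=offset)
        noisy.append((shifted.strftime(DATE_FORMAT), list(items)))
    # Append one fake purchase to a randomly chosen visit.
    noisy[rng.randrange(len(noisy))][1].append(rng.choice(fake_items))
    return noisy

visits = [("01/02/13", ["A", "B", "C"]), ("11/02/13", ["D", "E"])]
noisy_visits = add_trend_breaking_noise(visits, fake_items=["X"], max_days=7,
                                        rng=random.Random(0))
print(noisy_visits)
```

Anyone analyzing the output needs to know which items are fake, so that findings built on them can be discarded, as the slide notes.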

Lv3: Trend-breaking shuffle

Before:

Name: Adam Smith, Purchase1: Cabbage, Purchase2: Tomato

Name: David Abram, Purchase1: Bread, Purchase2: Sushi

Name: Emma Goldman, Purchase1: Female Hygiene, Purchase2: Strawberry

After shuffle:

Name: Adam Smith, Purchase1: Bread, Purchase2: Sushi

Name: Emma Goldman, Purchase1: Cabbage, Purchase2: Tomato

Name: David Abram, Purchase1: Female Hygiene, Purchase2: Strawberry

Gender-related findings will be ignored.


Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to us.


Kai Xin, Thia

Interesting reads

• Anonymizing Health Data: http://shop.oreilly.com/product/0636920029229.do

• Data protection in the EU: the certainty of uncertainty: http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous

• Robust De-anonymization of Large Sparse Datasets: http://www.cs.utexas.edu/~shmat/netflix-faq.html

• Eccentricity Explained: http://33bits.org/2008/10/03/eccentricity-explained/

• A new way to protect privacy in large-scale genome-wide association studies: http://bioinformatics.oxfordjournals.org/content/29/7/886

• Why 'Anonymous' Data Sometimes Isn't: http://www.wired.com/politics/security/commentary/securitymatters/2007/12/securitymatters_1213

• Has Big Data Made Anonymity Impossible?: http://www.technologyreview.com/news/514351/has-big-data-made-anonymity-impossible/

• ‘Anonymous’ Netflix Prize data not so anonymous after all: http://www.liesdamnedlies.com/2008/10/anonymous-netfl.html

• A Data Broker Offers a Peek Behind the Curtain: http://www.nytimes.com/2013/09/01/business/a-data-broker-offers-a-peek-behind-the-curtain.html?pagewanted=all
