data anonymization - european commission

Data Anonymization

Sara Szoc, CrossLang

Workshop

Introduction

Data Anonymization

• Concept

• Methods

• Risks

• Practical tips

What is data anonymization?

What?

• Process of removing private or confidential information from raw data

• Results in anonymous data that cannot be associated with any individual or company

Why?

• Protection of identity and private activities

• Financial aspect

How?

• Using one or more anonymization techniques

• Selection and assessment based on use case

Personal data

Personal or identifiable data:

Information that can lead to the identification of an individual (or a group of individuals)

• Direct identifiers: person/company name, surname, email address containing a name, phone number, ID card/social security number, medical record number …

• Indirect identifiers: date of birth, gender, and zip code, which in combination can uniquely identify about 80% of the US population

• Pseudonymous or encrypted data: can be used to re-identify a person and thus remains personal data

Personal data

“Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data.

For data to be truly anonymised, the anonymisation must be irreversible.”

(source: General Data Protection Regulation)

Sensitive data

• Sensitive personal data

  • can cause harm or embarrassment to the individual

  • for limited dissemination only

  • examples: racial/ethnic origin, political/religious beliefs, genetic data, biometric data (fingerprints), health information, sexual orientation … (GDPR)

• Sensitive business information

  • poses a risk to the company in question if discovered

  • examples: trade secrets, acquisition plans, financial data, supplier and customer information

Structured versus unstructured data

• Structured data

  • stored in a structured way

  • easily searchable

  • relational databases, spreadsheets, data in formats such as JSON, XML, CSV …

• Unstructured data

  • anything else

  • difficult to search

  • text files, reports, email messages, audio files, images …

Anonymization methods

• Suppression

• Masking

[Figure: example data before and after anonymization]
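The slide's before/after figure did not survive in this transcript, so the following is only a minimal sketch of what suppression and masking could look like on a single record. The record, the field names, and the helper functions `suppress` and `mask` are hypothetical and chosen for illustration.

```python
def suppress(record, fields):
    """Suppression: drop the listed fields from the record entirely."""
    return {key: value for key, value in record.items() if key not in fields}


def mask(value, visible=2, mask_char="x"):
    """Masking: keep the first `visible` characters and hide the rest."""
    hidden = max(len(value) - visible, 0)
    return value[:visible] + mask_char * hidden


# Hypothetical record, not taken from the slides.
record = {"name": "John", "phone": "0471234567", "illness": "Flu"}

suppressed = suppress(record, {"name"})
masked = {**record, "phone": mask(record["phone"], visible=3)}

print(suppressed)  # {'phone': '0471234567', 'illness': 'Flu'}
print(masked)      # {'name': 'John', 'phone': '047xxxxxxx', 'illness': 'Flu'}
```

Suppression removes the value altogether, while masking keeps its shape and part of its content, which preserves some utility but also leaks more information.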

Anonymization methods

• Classification

[Figure: example data before and after anonymization]

Anonymization methods

• Perturbation

• Swapping

• Generalization

Name    Age  Location  Illness
Luke    39   Belgium   Flu
Ashley  57   Belgium   Multiple Sclerosis
John    81   Germany   Lung cancer
Roman   72   Germany   Multiple Sclerosis

Name    Age  Location  Illness
John    40   Brussels  Flu
Ashley  56   Antwerp   Multiple Sclerosis
Luke    80   Berlin    Lung cancer
Roman   71   Munich    Multiple Sclerosis

Between the two tables, two names are exchanged (swapping), each age differs by one year (perturbation), and cities correspond to countries (generalization).
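The three methods named on this slide can be sketched in a few lines. This is an illustration only: it assumes the table with city-level locations is the original data, and the helper names (`perturb`, `swap`, `generalize`), the city-to-country lookup, and the one-year shift are assumptions made for the example, not part of the slides.

```python
import random

# Assumed lookup used for generalization (city -> country).
CITY_TO_COUNTRY = {
    "Brussels": "Belgium",
    "Antwerp": "Belgium",
    "Berlin": "Germany",
    "Munich": "Germany",
}


def perturb(value, rng, max_shift=1):
    """Perturbation: add a small non-zero random shift to a number."""
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    return value + rng.choice(shifts)


def swap(rows, column, rng):
    """Swapping: shuffle one column's values across the records.

    A random shuffle may leave some values in their original row.
    """
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def generalize(row):
    """Generalization: replace the city by the less specific country."""
    return {**row, "location": CITY_TO_COUNTRY[row["location"]]}


rows = [
    {"name": "John", "age": 40, "location": "Brussels", "illness": "Flu"},
    {"name": "Ashley", "age": 56, "location": "Antwerp", "illness": "Multiple Sclerosis"},
    {"name": "Luke", "age": 80, "location": "Berlin", "illness": "Lung cancer"},
    {"name": "Roman", "age": 71, "location": "Munich", "illness": "Multiple Sclerosis"},
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
result = [generalize(row) for row in swap(rows, "name", rng)]
result = [{**row, "age": perturb(row["age"], rng)} for row in result]

for row in result:
    print(row)
```

Each method trades utility for privacy differently: perturbation keeps aggregate statistics roughly intact, swapping keeps the exact set of values per column, and generalization keeps values truthful but coarser.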

Pseudonymization

• Reversible process by using a key

• Still to be treated as personal data because it enables re-identification

Name    Pseudonymized  Anonymized
John    q0fdGL         xxxxx
Ashley  s8fhPd         xxxxx
Luke    EiuD5j         xxxxx
Roman   qOerd          xxxxx
Luke    EiuD5j         xxxxx
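The difference between the two columns can be made concrete with a small sketch. It assumes that the "key" is a secret lookup table held by the data owner; the `Pseudonymizer` class and the `anonymize` function are hypothetical names, and real systems may use keyed encryption or other schemes instead.

```python
import secrets


class Pseudonymizer:
    """Maps each name to a stable random token; the stored table is the key."""

    def __init__(self):
        self._forward = {}  # name -> token
        self._reverse = {}  # token -> name (this is what allows reversal)

    def pseudonymize(self, name):
        if name not in self._forward:
            token = secrets.token_urlsafe(6)
            while token in self._reverse:  # regenerate on the rare collision
                token = secrets.token_urlsafe(6)
            self._forward[name] = token
            self._reverse[token] = name
        return self._forward[name]

    def reidentify(self, token):
        """Only possible for whoever holds the lookup table."""
        return self._reverse[token]


def anonymize(_name):
    """Irreversible: every name collapses to the same placeholder."""
    return "xxxxx"


pseudonymizer = Pseudonymizer()
names = ["John", "Ashley", "Luke", "Roman", "Luke"]
for name in names:
    print(name, pseudonymizer.pseudonymize(name), anonymize(name))
```

As in the table, the repeated name receives the same pseudonym both times, so records can still be linked and, with the lookup table, re-identified. The anonymized column carries no such link.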

Measuring anonymization and risks

• K-anonymity, differential privacy

• Focus on structured data

Gender  Age    Location  Illness
male    40-50  Belgium   Flu
male    40-50  Belgium   Multiple Sclerosis
female  >50    Germany   Lung cancer
female  >50    Germany   Multiple Sclerosis

Example of 2-anonymous data
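A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below checks this for the table above, assuming that gender, age, and location are the quasi-identifiers and illness is the sensitive attribute; the function name `k_anonymity` is chosen for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


rows = [
    {"gender": "male", "age": "40-50", "location": "Belgium", "illness": "Flu"},
    {"gender": "male", "age": "40-50", "location": "Belgium", "illness": "Multiple Sclerosis"},
    {"gender": "female", "age": ">50", "location": "Germany", "illness": "Lung cancer"},
    {"gender": "female", "age": ">50", "location": "Germany", "illness": "Multiple Sclerosis"},
]

k = k_anonymity(rows, ["gender", "age", "location"])
print(k)  # 2
```

Adding the illness column to the quasi-identifiers would drop k to 1, which shows how strongly the result depends on which attributes an attacker is assumed to know.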

Existing tools

• Tools for structured data

  • ARX

  • Cornell Anonymization Toolkit

• Tools for unstructured data

  • MITRE Identification Scrubber Toolkit (MIST)

  • Natural language processing tools (e.g. OpenNLP or Stanford CoreNLP named entity recognizers)

Practical tips (conclusions)

There is no “one-size-fits-all” solution; several factors need to be taken into consideration:

• Analyze the nature of the data

• Analyze the recipients

• Analyze the risks (de-anonymization risk management)

• Analyze data utility

• Run the anonymization process inside the organization
