data anonymization (1)


Upload: ramya

Post on 08-Jan-2016



Page 1: Data Anonymization (1)

Data Anonymization (1)

Page 2: Data Anonymization (1)

Outline

• Problem concepts
• Algorithms on domain generalization hierarchy
• Algorithms on numerical data

Page 3: Data Anonymization (1)

The Massachusetts Governor Privacy Breach

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex

Voter List: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex

Quasi-identifier (attributes shared by both tables): Zip, Birth date, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. Name linked to Diagnosis.

• Zip, birth date, and sex together uniquely identify 87% of the US population.

Sweeney, IJUFKS 2002

Page 4: Data Anonymization (1)

Definition

• Table: columns are attributes, rows are records
• Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
• K-anonymity: every combination of QI values that occurs in the table occurs in at least k records
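The k-anonymity definition can be checked mechanically by counting how often each quasi-identifier value combination occurs. The following is a minimal sketch; the function name `is_k_anonymous`, the dictionary-based record layout, and the toy table are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

def is_k_anonymous(records, qi, k):
    """Return True if every QI value combination occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in qi) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table; the values are illustrative only.
table = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]

# The ("0214*", "M") combination occurs only once, so the table is not 2-anonymous.
print(is_k_anonymous(table, ["zip", "sex"], 2))
```

Dropping the third record (or generalizing it further) would make the remaining table 2-anonymous with respect to zip and sex.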

Page 5: Data Anonymization (1)

Basic techniques

• Generalization
  • Zip {02138, 02139} → 0213*
  • Domain generalization hierarchy: A0 → A1 → … → An
  • E.g. {02138, 02139} → 0213* → 021* → 02* → 0* → *
  • This hierarchy is a tree structure
• Suppression
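For zip codes, climbing one level of the domain generalization hierarchy amounts to hiding one more trailing digit. A small sketch of that idea follows; `generalize_zip` and its `level` parameter are hypothetical names chosen for this example.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a zip code with a single '*'."""
    if level <= 0:
        return zip_code
    if level >= len(zip_code):
        return "*"
    return zip_code[:-level] + "*"

# Walk the whole hierarchy from the raw value (level 0) to the root (level 5).
chain = [generalize_zip("02138", level) for level in range(6)]
print(" -> ".join(chain))
```

At level 1 both 02138 and 02139 map to 0213*, which is what lets them share one k-anonymity group.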

Page 6: Data Anonymization (1)

Balance

Better privacy guarantee ↔ lower data utility

• There are many schemes satisfying the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.

• Suppression is required if we cannot find a k-anonymity group for a record.
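Suppression can be sketched as a filter applied after generalization: records whose quasi-identifier group is still smaller than k are dropped. The helper name `suppress_small_groups` and the record layout are assumptions for illustration only.

```python
from collections import Counter

def suppress_small_groups(records, qi, k):
    """Drop records whose QI group has fewer than k members.

    Returns the kept records and the number of suppressed records.
    """
    key = lambda r: tuple(r[a] for a in qi)
    counts = Counter(key(r) for r in records)
    kept = [r for r in records if counts[key(r)] >= k]
    return kept, len(records) - len(kept)

# Hypothetical generalized table.
rows = [
    {"zip": "0213*", "sex": "F"},
    {"zip": "0213*", "sex": "F"},
    {"zip": "0214*", "sex": "M"},
]
kept, n_suppressed = suppress_small_groups(rows, ["zip", "sex"], 2)
print(len(kept), n_suppressed)
```

The number of suppressed records is itself a utility cost, which is why the cost metrics penalize it.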

Page 7: Data Anonymization (1)

Criteria

• Minimal generalization
  • The minimal generalization that satisfies the k-anonymization specification
• Minimal table distortion
  • A minimal generalization with minimal utility loss
  • Use precision to evaluate the loss [Sweeney papers]
  • Application-specific utility
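As a rough illustration of a precision-style loss measure, the sketch below assumes the commonly cited form of Sweeney's metric: one minus the average, over all cells, of the generalization level applied divided by the height of that attribute's hierarchy. The formula is not spelled out on the slide, so treat it and the names `precision`, `levels`, and `heights` as assumptions.

```python
def precision(levels, heights):
    """Precision of a generalized table (assumed form of Sweeney's metric).

    levels[j][i]: generalization level applied to attribute i of record j.
    heights[i]:   height (maximum level) of attribute i's hierarchy.
    """
    n_records, n_attrs = len(levels), len(heights)
    distortion = sum(row[i] / heights[i] for row in levels for i in range(n_attrs))
    return 1 - distortion / (n_records * n_attrs)

# Two records, attributes (zip, sex); zip generalized one level out of five.
print(precision([[1, 0], [1, 0]], [5, 1]))
```

An untouched table scores 1 and a table generalized to the hierarchy roots scores 0, so higher precision means less distortion.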

Page 8: Data Anonymization (1)

Complexity of finding the optimal solution on generalization

• NP-hard (Bayardo ICDE05)
• So all proposed algorithms are approximate algorithms

Page 9: Data Anonymization (1)

Shared features in different solutions

• Always satisfy the k-anonymity specification
  • If some records do not, suppress them
• Differences are in the utility loss/cost function
  • Sweeney's precision metric
  • Discernibility & classification metrics
  • Information-privacy metric
• Algorithms
  • Assume the domain generalization hierarchy is given
  • Efficiency
  • Utility maximization

Page 10: Data Anonymization (1)

Metrics to be optimized

Two cost metrics that we want to minimize (Bayardo ICDE05):

• Discernibility: the penalty is based on the # of items in the k-anonymity group
• Classification: the dataset has a class label column, and the goal is to preserve the classification model. The penalty is based on the # of records in minority classes in the group
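The two cost metrics can be sketched as follows, assuming the usual formulation from the cited Bayardo paper: discernibility charges each record the size of its group (or the size of the whole dataset if its group is smaller than k and gets suppressed), and the classification metric counts suppressed records plus records outside the majority class of their group. The function names and the input layout (one list of class labels per group) are assumptions.

```python
from collections import Counter

def discernibility(groups, k):
    """Sum of |E|^2 for groups of size >= k, plus |D| * |E| for suppressed groups."""
    n = sum(len(g) for g in groups)
    return sum(len(g) ** 2 if len(g) >= k else n * len(g) for g in groups)

def classification_penalty(groups, k):
    """Fraction of records that are suppressed or in a minority class of their group."""
    n = sum(len(g) for g in groups)
    penalty = 0
    for labels in groups:
        if len(labels) < k:
            penalty += len(labels)  # whole group is suppressed
        else:
            majority_count = Counter(labels).most_common(1)[0][1]
            penalty += len(labels) - majority_count
    return penalty / n

# Hypothetical groups, each given as the class labels of its records.
groups = [["y", "y", "n"], ["n", "n"], ["y"]]
print(discernibility(groups, 2), classification_penalty(groups, 2))
```

Both costs grow when groups become larger or more mixed than necessary, which is the distortion the algorithms try to avoid.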

Page 11: Data Anonymization (1)

Metrics

• A combination of information loss and anonymity gain (Wang ICDE04)
  • Information loss, anonymity gain
  • Information-privacy metric

Page 12: Data Anonymization (1)

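Metrics: information loss

• The dataset has class labels
• Entropy: for a set S labeled by different classes, entropy is used to calculate the impurity of the labels

  $\mathrm{Info}(S) = -\sum_i p_i \log p_i$, where $p_i$ is the percentage of label $i$

• Information loss of a generalization G: {c1, c2, …, cn} → p

  $I(G) = \mathrm{Info}(S_p) - \sum_i \frac{N_{c_i}}{N_p}\,\mathrm{Info}(R_{c_i})$

  where $N_{c_i}$ and $N_p$ are the numbers of records with values $c_i$ and $p$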
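The entropy-based information loss can be sketched in a few lines: compute the label entropy of the merged parent group and subtract the size-weighted entropy of the child groups. The names `info` and `information_loss` and the use of base-2 logarithms are assumptions; the slide does not fix the base.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of the class labels in a set (base 2 is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(children):
    """I(G) for merging child values c_1..c_n into one parent value p.

    children: one list of class labels per child value.
    """
    parent = [label for child in children for label in child]
    weighted = sum(len(child) / len(parent) * info(child) for child in children)
    return info(parent) - weighted

# Merging two pure groups with different labels destroys one bit per record.
print(information_loss([["y", "y"], ["n", "n"]]))
# Merging two groups that are already equally mixed loses nothing.
print(information_loss([["y", "n"], ["y", "n"]]))
```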

Page 13: Data Anonymization (1)

Anonymity gain

• A(VID): # of records with the VID
• A_G(VID) >= A(VID): generalization improves or does not change A(VID)
• Anonymity gain: P(G) = x − A(VID), where
  • x = A_G(VID) if A_G(VID) <= K
  • x = K otherwise

As long as k-anonymity is satisfied, further generalization of the VID yields no additional gain
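The capping rule in the anonymity gain is easy to see in code. This is a minimal sketch; `anonymity_gain` and its argument names are hypothetical, and it assumes the group is below k before generalization.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at k.

    a_before: A(VID), group size before generalization (assumed < k here).
    a_after:  A_G(VID), group size after generalization.
    """
    x = a_after if a_after <= k else k
    return x - a_before

print(anonymity_gain(2, 4, 5))  # still below k, full gain counted
print(anonymity_gain(2, 9, 5))  # gain capped once k is reached
```

With k = 5, growing a group from 2 to 9 records counts the same as growing it from 2 to 5, so over-generalization is not rewarded.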

Page 14: Data Anonymization (1)

Information-privacy combined metric

IP = info loss / anonymity gain = I(G) / P(G)

• We want to minimize IP
• If P(G) == 0, use I(G) only
• Either a small I(G) or a large P(G) will reduce IP
• If the P(G)s are the same, pick the one with minimum I(G)
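Choosing the next generalization by the information-privacy ratio can be sketched as below. The candidate names and their I(G) and P(G) values are made up for illustration, and `ip_score` is a hypothetical helper.

```python
def ip_score(info_loss, gain):
    """IP = I(G) / P(G); fall back to I(G) alone when P(G) is zero."""
    return info_loss if gain == 0 else info_loss / gain

# Hypothetical candidates: attribute -> (I(G), P(G)).
candidates = {"zip": (0.4, 2), "age": (0.3, 1), "sex": (0.5, 0)}

# Pick the candidate generalization with the smallest IP score.
best = min(candidates, key=lambda g: ip_score(*candidates[g]))
print(best)
```

Here generalizing zip wins: it loses slightly more information than age but buys twice the anonymity gain.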

Page 15: Data Anonymization (1)

Domain-hierarchy based algorithms

• Sweeney's algorithm
• Bayardo's tree pruning algorithm
• Wang's top-down and bottom-up algorithms
• They are all dimension-by-dimension methods

Page 16: Data Anonymization (1)

Multidimensional techniques

• Categorical data?
  • Categories are mapped to numbers (numerized)
  • Bayardo 05 paper
  • Does the order matter? (no research on that)
• Numerical data
  • K-anonymization → n-dim space partitioning
  • Many existing techniques can be applied

Page 17: Data Anonymization (1)

Single-dimensional vs. multidimensional

Page 18: Data Anonymization (1)

The evolving procedure

Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]

→ Numerized categories, single-dimensional [Bayardo05]

→ Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

Page 19: Data Anonymization (1)

Method 1: Mondrian

• Numerize categorical data
• Apply a top-down partitioning process

[Figure: partitioning steps: Step 1, Step 2.1, Step 2.2]

Page 20: Data Anonymization (1)

Allowable cut
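A top-down partitioning in the spirit of Mondrian can be sketched as a recursive median split, where a cut is allowable only if both resulting sides keep at least k records. This is a simplified sketch, not the published algorithm: the function name `mondrian`, the choice of the widest dimension first, and the median split rule are assumptions.

```python
def mondrian(points, k):
    """Recursively split numeric points; returns a list of groups of size >= k.

    points: non-empty list of equal-length numeric tuples; k >= 1.
    """
    def spread(d):
        return max(p[d] for p in points) - min(p[d] for p in points)

    # Try dimensions from widest to narrowest range.
    for d in sorted(range(len(points[0])), key=spread, reverse=True):
        ordered = sorted(points, key=lambda p: p[d])
        median = ordered[len(ordered) // 2][d]
        left = [p for p in ordered if p[d] < median]
        right = [p for p in ordered if p[d] >= median]
        if len(left) >= k and len(right) >= k:  # allowable cut
            return mondrian(left, k) + mondrian(right, k)
    return [points]  # no allowable cut: this partition is a final group

# Hypothetical (age, salary in thousands) records.
data = [(25, 30), (27, 32), (45, 60), (47, 65), (30, 35), (50, 70)]
for group in mondrian(data, 2):
    print(group)
```

With k = 2 the six records split once into two groups of three; neither half can be cut again without leaving a side with a single record.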

Page 21: Data Anonymization (1)

Method 2: spatial indexing

• Multidimensional spatial techniques
  • Kd-tree (similar to the Mondrian algorithm)
  • R-tree and its variations

[Figure: R-tree and R+-tree, showing the leaf layer and the upper layer]

Page 22: Data Anonymization (1)

Compacting bounds

Example:
• uncompacted: age[1-80], salary[10k-100k]
• compacted: age[20-40], salary[10k-50k]

• The original Mondrian does not consider compacting bounds
• For the R+-tree, it is done automatically

Information is better preserved
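Compacting bounds means publishing, for each group, the tightest range that still covers its records rather than the range of the partition cell. A minimal sketch, with the helper name `compact_bounds` and the sample records assumed for illustration:

```python
def compact_bounds(group):
    """Return the minimum bounding (low, high) range per dimension of a group."""
    dims = range(len(group[0]))
    return [(min(r[d] for r in group), max(r[d] for r in group)) for d in dims]

# Hypothetical (age, salary) records falling in the cell age[1-80], salary[10k-100k].
group = [(20, 10_000), (35, 50_000), (40, 30_000)]
print(compact_bounds(group))
```

For these records the published ranges shrink to age[20-40] and salary[10k-50k], matching the example on the slide.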

Page 23: Data Anonymization (1)

Benefits of using the R+-tree

• Scalable: originally designed for indexing disk-based large data
• Multi-granularity k-anonymity: layers
• Better performance
• Better quality

Page 24: Data Anonymization (1)

Performance

Mondrian

Page 25: Data Anonymization (1)

Utility Metrics

• Discernibility penalty
• KL divergence: describes the difference between a pair of distributions (one of them being the anonymized data distribution)
• Certainty penalty
  • T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
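The two less familiar utility metrics can be sketched as follows. The certainty penalty formula is not legible in the transcript, so the code assumes the common form suggested by the legend: for every record t and attribute Ai, add the width of the generalized range t.Ai divided by the width of the total range T.Ai. The KL divergence is the standard discrete definition. All function and parameter names are assumptions.

```python
import math

def certainty_penalty(ranges, total):
    """Sum over records and attributes of |t.Ai| / |T.Ai| (assumed form).

    ranges[t][i]: (low, high) generalized range of attribute i in record t.
    total[i]:     (low, high) range of attribute i over the whole table.
    """
    return sum((hi - lo) / (total[i][1] - total[i][0])
               for record in ranges
               for i, (lo, hi) in enumerate(record))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two records published as age[20-40], salary[10-50] within age[0-80], salary[0-100].
total = [(0, 80), (0, 100)]
ranges = [[(20, 40), (10, 50)], [(20, 40), (10, 50)]]
print(certainty_penalty(ranges, total))
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

A lower certainty penalty means tighter published ranges, and a KL divergence near zero means the anonymized distribution stays close to the original one.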

Page 26: Data Anonymization (1)
Page 27: Data Anonymization (1)

Other issues

• Sparse high-dimensionality
  • Transactional data as a boolean matrix
  • "On the anonymization of sparse high-dimensional data", ICDE08
• Related to the clustering problem of transactional data!
  • The above paper uses matrix-based clustering; item-based clustering (?)

Page 28: Data Anonymization (1)

Other issues

• Effect of numerizing categorical data
  • The ordering of categories may have a certain impact on quality
• General-purpose utility metrics vs. special task-oriented utility metrics
• Attacks on the k-anonymity definition