Data Anonymization (1)

Upload: ashlie-edwards

Post on 03-Jan-2016


Page 1: Data Anonymization (1). Outline  Problem  concepts  algorithms on domain generalization hierarchy  Algorithms on numerical data

Data Anonymization (1)

Page 2: Outline
• Problem
• Concepts
• Algorithms on domain generalization hierarchy
• Algorithms on numerical data

Page 3: The Massachusetts Governor Privacy Breach
• Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex
• Voter List: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex
• Quasi-identifier shared by both tables: Zip, Birth date, Sex
• The Governor of MA was uniquely identified using Zip code, Birth date, and Sex, linking Name to Diagnosis.
• This quasi-identifier is unique for 87% of the US population (Sweeney, IJUFKS 2002).
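The linking step can be illustrated with a minimal sketch. It assumes the released medical table has had Name and SSN removed, and every record, name, and value below is invented for illustration; `link` is a hypothetical helper, not code from the cited paper.

```python
# Hypothetical toy tables; all names and values are invented.
medical = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1962-01-15", "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"name": "Carol Example", "zip": "02141", "birth_date": "1970-03-02", "sex": "F"},
]

QI = ("zip", "birth_date", "sex")  # the quasi-identifier shared by both tables


def link(medical_rows, voter_rows, qi=QI):
    """Join the two tables on the quasi-identifier; return (name, diagnosis) pairs."""
    names_by_qi = {}
    for v in voter_rows:
        names_by_qi.setdefault(tuple(v[a] for a in qi), []).append(v["name"])
    linked = []
    for m in medical_rows:
        names = names_by_qi.get(tuple(m[a] for a in qi), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            linked.append((names[0], m["diagnosis"]))
    return linked


matches = link(medical, voters)
```

A record is only re-identified when its quasi-identifier matches exactly one voter, which is the situation k-anonymity is meant to rule out.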

Page 4: Definition
• Table: columns are attributes, rows are records
• Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
• K-anonymity: every combination of QI values in the table appears at least k times
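The k-anonymity definition can be checked directly by counting quasi-identifier combinations. This is a minimal sketch; the function name `is_k_anonymous` and the dictionary-based row layout are assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """Return True if every QI value combination in `rows` occurs at least k times."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


# Hypothetical, already generalized rows.
rows = [
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
```

Here the first two rows form a group of size 2, while the third row is alone, so the full table is not 2-anonymous on (zip, sex).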

Page 5: Basic techniques
• Generalization
  • Zip: {02138, 02139} → 0213*
  • Domain generalization hierarchy: A0 → A1 → … → An
  • E.g. {02138, 02139} → 0213* → 021** → 02*** → 0****
  • This hierarchy is a tree structure
• Suppression
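For ZIP codes, climbing one level in the domain generalization hierarchy can be sketched as masking one more trailing digit. The fixed-width notation (one `*` per masked digit) and the helper name `generalize_zip` are assumptions for illustration.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a ZIP code with '*'.

    level 0 is the original value; level len(zip_code) masks everything.
    """
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


# Two distinct ZIP codes collapse to the same value after one level.
same_group = generalize_zip("02138", 1) == generalize_zip("02139", 1)
```

Because each value has exactly one parent at the next level, the hierarchy forms a tree.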

Page 6: Balance
• A better privacy guarantee comes with lower data utility.
• Many schemes satisfy the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.
• Suppression is required if we cannot find a k-anonymity group for a record.

Page 7: Criteria
• Minimal generalization: a minimal generalization that satisfies the k-anonymity specification
• Minimal table distortion: a minimal generalization with minimal utility loss
  • Use precision to evaluate the loss [Sweeney papers]
  • Application-specific utility
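A minimal sketch of a precision-style loss measure follows, assuming the commonly stated form: precision is 1 minus the average, over all cells, of the generalization level of the cell divided by the height of that attribute's hierarchy. The function name and the input layout are hypothetical.

```python
def precision(levels, heights):
    """Precision of a generalized table.

    levels[r][a]: generalization level applied to attribute a of record r
    heights[a]:   height of attribute a's generalization hierarchy (> 0)
    Returns 1.0 for an ungeneralized table and 0.0 for a fully generalized one.
    """
    n_records = len(levels)
    n_attrs = len(heights)
    loss = sum(row[a] / heights[a] for row in levels for a in range(n_attrs))
    return 1 - loss / (n_records * n_attrs)


# Two records, two attributes with hierarchy heights 4 and 2;
# only the first attribute is generalized, by one level.
example = precision([[1, 0], [1, 0]], [4, 2])
```

Dividing by the hierarchy height means that one step in a short hierarchy costs more than one step in a tall one.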

Page 8: Complexity
• Finding the optimal generalization is NP-hard (Bayardo, ICDE05).
• So all proposed algorithms are approximate algorithms.

Page 9: Shared features in different solutions
• Always satisfy the k-anonymity specification
  • If some records do not, suppress them
• Differences are in the utility loss/cost function
  • Sweeney's precision metric
  • Discernibility & classification metrics
  • Information-privacy metric
• Algorithms
  • Assume the domain generalization hierarchy is given
  • Efficiency
  • Utility maximization

Page 10: Metrics to be optimized
• Two cost metrics that we want to minimize (Bayardo, ICDE05):
  • Discernibility: based on the number of items in each k-anonymity group
  • Classification: the dataset has a class label column and the goal is to preserve the classification model; based on the number of records in minority classes in each group
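The two cost metrics can be sketched as follows. The slide's formulas did not survive extraction, so this assumes the usual formulation: discernibility charges each record the size of its group (or the table size if the group is smaller than k and is suppressed), and the classification metric counts suppressed and minority-class records, normalized by table size. Function names and the input layout are hypothetical.

```python
from collections import Counter


def discernibility(groups, k, table_size):
    """groups: list of groups, each a list of class labels (one per record)."""
    cost = 0
    for labels in groups:
        size = len(labels)
        if size >= k:
            cost += size * size        # each record pays the size of its group
        else:
            cost += table_size * size  # suppressed records pay the table size
    return cost


def classification_metric(groups, k, table_size):
    """Fraction of records that are suppressed or not in their group's majority class."""
    penalty = 0
    for labels in groups:
        if len(labels) < k:
            penalty += len(labels)     # whole group is suppressed
        else:
            majority = Counter(labels).most_common(1)[0][1]
            penalty += len(labels) - majority
    return penalty / table_size


# Hypothetical grouping of 8 records with class labels "a" and "b".
groups = [["a", "a", "b"], ["a", "b", "b", "b"], ["a"]]
```

With k = 2 the last group is too small and is treated as suppressed in both metrics.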

Page 11: Metrics
• A combination of information loss and anonymity gain (Wang, ICDE04)
  • Information loss, anonymity gain
  • Information-privacy metric

Page 12: Metrics: information loss
• The dataset has class labels.
• Entropy: for a set S labeled by different classes, entropy measures the impurity of the labels:
  $\mathrm{Info}(S) = -\sum_i p_i \log p_i$, where $p_i$ is the percentage of label $i$.
• Information loss of a generalization $G: \{c_1, c_2, \dots, c_n\} \to p$:
  $I(G) = \mathrm{Info}(S_p) - \sum_i \frac{N_{c_i}}{N_p}\,\mathrm{Info}(R_{c_i})$
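The entropy-based information loss can be sketched as below. It assumes base-2 logarithms and reads the loss as the parent's entropy minus the size-weighted entropy of the child values; the function names and the list-of-labels input are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Entropy of a list of class labels (base-2 logarithm assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_loss(children):
    """children: one list of class labels per child value c_i merged into parent p."""
    parent = [label for child in children for label in child]
    n_p = len(parent)
    weighted = sum(len(child) / n_p * info(child) for child in children)
    return info(parent) - weighted


# Merging two pure children loses all class information (1 bit);
# merging two equally mixed children loses none.
pure_loss = information_loss([["y", "y"], ["n", "n"]])
mixed_loss = information_loss([["y", "n"], ["y", "n"]])
```

The quantity is the information gain that a classifier would obtain by splitting p back into its children, which is why it measures what the generalization gives up.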

Page 13: Anonymity gain
• $A(VID)$: number of records with the VID
• $A_G(VID) \ge A(VID)$: generalization improves or does not change $A(VID)$
• Anonymity gain: $P(G) = x - A(VID)$, where $x = A_G(VID)$ if $A_G(VID) \le K$, and $x = K$ otherwise
• Once k-anonymity is satisfied, further generalization of the VID brings no additional gain.

Page 14: Information-privacy combined metric
• $IP = \text{info loss} / \text{anonymity gain} = I(G)/P(G)$
• We want to minimize IP.
• If $P(G) = 0$, use $I(G)$ only.
• Either a small $I(G)$ or a large $P(G)$ reduces IP.
• If the $P(G)$ values are the same, pick the generalization with the minimum $I(G)$.
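A minimal sketch of the anonymity gain and the combined score, following the definitions stated on these slides. It assumes the count before generalization does not exceed K; the function names and the candidate tuple layout are hypothetical.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at k (gain beyond k is not rewarded)."""
    x = a_after if a_after <= k else k
    return x - a_before


def ip_score(info_loss, gain):
    """IP = I(G) / P(G); falls back to I(G) alone when P(G) is 0."""
    return info_loss if gain == 0 else info_loss / gain


def pick_generalization(candidates, k):
    """candidates: (name, info_loss, a_before, a_after) tuples; returns the name with minimum IP."""
    best = min(candidates, key=lambda c: ip_score(c[1], anonymity_gain(c[2], c[3], k)))
    return best[0]


# Hypothetical candidates: generalizing "age" loses less information
# and gains more anonymity than generalizing "zip".
candidates = [("zip", 0.8, 1, 2), ("age", 0.2, 1, 3)]
choice = pick_generalization(candidates, k=3)
```

Capping x at K encodes the remark that generalizing beyond k-anonymity brings no further gain.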

Page 15: Domain-hierarchy based algorithms
• Sweeney's algorithm
• Bayardo's tree pruning algorithm
• Wang's top-down and bottom-up algorithms
• They are all dimension-by-dimension methods

Page 16: Multidimensional techniques
• Categorical data?
  • Categories are mapped to numbers (numerized) [Bayardo 05 paper]
  • Does the order matter? (no research on that)
• Numerical data
  • K-anonymization becomes n-dimensional space partitioning
  • Many existing techniques can be applied
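Numerizing a categorical attribute can be sketched as assigning each category its position in a chosen order. The order itself is an assumption supplied by the caller, and `numerize` is a hypothetical helper.

```python
def numerize(values, order):
    """Map each categorical value to its index in `order`."""
    code = {category: i for i, category in enumerate(order)}
    return [code[v] for v in values]


# Hypothetical occupation column and an arbitrary (alphabetical) order.
codes = numerize(["nurse", "clerk", "nurse"], ["clerk", "engineer", "nurse"])
```

After this step a group of records can be described by a numeric range per attribute, so the chosen order determines which categories can end up in the same range.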

Page 17: Single-dimensional vs. multidimensional

Page 18: The evolving procedure
• Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]
• → Numerized categories, single-dimensional [Bayardo05]
• → Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

Page 19: Method 1: Mondrian
• Numerize categorical data.
• Apply a top-down partitioning process.
[Figure: partitioning steps labeled Step 1, Step 2.1, Step 2.2]

Page 20: Allowable cut
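A top-down partitioning in the spirit of Mondrian can be sketched as follows. It assumes that a cut is allowable when both sides keep at least k records, picks the widest dimension first, and splits at the median; tie handling is simplified, so this is an illustration under those assumptions and not the published algorithm.

```python
def mondrian(records, k):
    """Recursively split `records` (tuples of numbers) into groups of size >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n_dims = len(records[0])

    def width(d):
        column = [r[d] for r in records]
        return max(column) - min(column)

    # Try dimensions from widest to narrowest until an allowable cut is found.
    for d in sorted(range(n_dims), key=width, reverse=True):
        values = sorted(r[d] for r in records)
        median = values[len(values) // 2]
        left = [r for r in records if r[d] < median]
        right = [r for r in records if r[d] >= median]
        if len(left) >= k and len(right) >= k:  # allowable cut
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut: this partition becomes one group


# Hypothetical (age, salary in thousands) records.
data = [(25, 30), (27, 35), (40, 80), (45, 90)]
partitions = mondrian(data, k=2)
```

Each returned group can then be published as one range per attribute, which is what makes every group k-anonymous.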

Page 21: Method 2: spatial indexing
• Multidimensional spatial techniques:
  • Kd-tree (similar to the Mondrian algorithm)
  • R-tree and its variations: R-tree, R+-tree
[Figure: tree with a leaf layer and an upper layer]

Page 22: Compacting bounds
• Example:
  • uncompacted: age [1-80], salary [10k-100k]
  • compacted: age [20-40], salary [10k-50k]
• The original Mondrian does not consider compacting bounds.
• For the R+-tree, it is done automatically.
• Information is better preserved.
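Compacting can be sketched as replacing a group's partition bounds with the tight minimum and maximum of the records it actually contains. The helper name and the sample records are hypothetical; the values are chosen to reproduce the ranges in the example above.

```python
def compact_bounds(group):
    """Return the tight (min, max) per attribute for a group of numeric records."""
    return [(min(column), max(column)) for column in zip(*group)]


# Hypothetical (age, salary) records that fall in the partition
# age [1-80], salary [10k-100k].
group = [(20, 10_000), (35, 50_000), (40, 30_000)]
bounds = compact_bounds(group)
```

The compacted ranges are never wider than the partition bounds, so the published ranges distort the data less.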

Page 23: Benefits of using R+-tree
• Scalable: originally designed for indexing large disk-based data
• Multi-granularity k-anonymity: layers
• Better performance
• Better quality

Page 24: Performance
[Figure: performance results, including Mondrian]

Page 25: Utility metrics
• Discernibility penalty
• KL divergence: describes the difference between a pair of distributions, applied to the anonymized data distribution
• Certainty penalty, defined with T: table, t: record, m: number of attributes, t.A_i: generalized range, T.A_i: total range
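The certainty penalty formula did not survive extraction, so the sketch below assumes the common normalized form: each record is charged, for every attribute, the width of its generalized range t.A_i divided by the width of the total range T.A_i, with all attribute weights set to 1. The function name and the input layout are hypothetical.

```python
def certainty_penalty(groups, total_ranges):
    """Sum of normalized range widths over all records.

    groups:       list of (group_size, [(lo, hi), ...]) with one range per attribute
    total_ranges: [(lo, hi), ...] giving the full range T.A_i of each attribute
    """
    cp = 0.0
    for size, ranges in groups:
        per_record = sum(
            (hi - lo) / (t_hi - t_lo)
            for (lo, hi), (t_lo, t_hi) in zip(ranges, total_ranges)
        )
        cp += size * per_record  # every record in the group shares the same ranges
    return cp


# Hypothetical table with attributes age in [0, 80] and salary (thousands) in [0, 100].
total = [(0, 80), (0, 100)]
groups = [(2, [(20, 40), (10, 60)]), (3, [(0, 80), (0, 100)])]
cp = certainty_penalty(groups, total)
```

A record generalized to the full range of every attribute receives the maximum penalty of m, and an ungeneralized record receives 0.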

Page 26

Page 27: Other issues
• Sparse high-dimensionality
  • Transactional data → boolean matrix
  • "On the anonymization of sparse high-dimensional data", ICDE08
  • Relates to the clustering problem of transactional data
  • The above paper uses matrix-based clustering → item-based clustering (?)

Page 28: Other issues
• Effect of numerizing categorical data
  • The ordering of categories may have a certain impact on quality
• General-purpose utility metrics vs. special task-oriented utility metrics
• Attacks on the k-anonymity definition