
Data Anonymization

Refined privacy models

Outline

The k-anonymity specification is not sufficient
Enhancing privacy: l-diversity, t-closeness, MaxEnt analysis

Linking the dots… Countering the privacy attack

k-anonymity addresses one type of attack: the link attack
What about other types of attacks?

K-anonymity

Quasi-identifier: a set of attributes through which the attacker can find the link using other public data

K-anonymity tries to counter this type of privacy attack: individual -> quasi-identifier -> sensitive attributes

Example: in a 4-anonymized table, at least 4 records share the same quasi-identifier

Typical method: domain-specific generalization
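As a concrete illustration (not from the slides), here is a minimal sketch of a k-anonymity check over a chosen quasi-identifier; the toy table, column names, and the generalized values are assumptions made for the example.

```python
# Minimal k-anonymity check: every combination of quasi-identifier values
# must appear in at least k records of the released table.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Assumed toy table, already generalized (zip codes truncated, ages bucketed).
table = [
    {"zip": "130**", "age": "30-39", "disease": "cancer"},
    {"zip": "130**", "age": "30-39", "disease": "cancer"},
    {"zip": "130**", "age": "30-39", "disease": "cancer"},
    {"zip": "130**", "age": "30-39", "disease": "cancer"},
    {"zip": "148**", "age": "20-29", "disease": "flu"},
    {"zip": "148**", "age": "20-29", "disease": "viral infection"},
    {"zip": "148**", "age": "20-29", "disease": "flu"},
    {"zip": "148**", "age": "20-29", "disease": "heart disease"},
]

print(is_k_anonymous(table, ["zip", "age"], k=4))  # True: each block has 4 records
```

Note that the first block is 4-anonymous yet every record carries the same disease, which is exactly the homogeneity problem discussed next.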

More Privacy Problems

All existing k-anonymity approaches assume that privacy is protected as long as the k-anonymity specification is satisfied. But there are other problems:

Homogeneity in sensitive attributes
Background knowledge on individuals
…

Problem 1: Homogeneity Attack

If Bob lives in zip code 13053 and he is 31 years old -> Bob surely has cancer, because every record in his anonymized block carries the same sensitive value.

We know Bob is in the table…

Problem 2: Background knowledge attack

Background knowledge: Japanese have an extremely low incidence of heart disease!

Umeko, who is Japanese, lives in zip code 13068 and she is 21 years old -> Umeko has a viral infection with high probability.

The cause of these two problems

The sensitive-attribute values in some blocks lack sufficient diversity
Problem 1: no diversity at all
Problem 2: background knowledge helps to reduce the diversity

Major contributions of l-diversity

Formally analyze the privacy model of k-anonymity with the Bayes-Optimal privacy model
Basic idea: increase the diversity of sensitive attribute values in each anonymized block
Instantiation and implementation of the l-diversity concept: entropy l-diversity, recursive l-diversity, more…

Modeling the attacks

What is a privacy attack? Guessing the sensitive values (as a probability)!

Prior belief: without seeing the table, what can we guess?
S: sensitive data, Q: quasi-identifier, prior: P(S=s | Q=q)
Example: Japanese vs. heart disease

Observed belief: with the observed table, our belief changes.
T*: anonymized table, observed: P(S=s | Q=q and T* is known)

Effective privacy attacks: the table T* should change the belief a lot!
Prior is small, observed belief is large -> positive disclosure
Prior is large, observed belief is small -> negative disclosure

The definition of observed belief

A q*-block: a k-anonymized group with q* as the (generalized) quasi-identifier
n(q*): the number of records in the q*-block
n(q*, s): the number of records in the q*-block with S=s, so n(q*, s)/n(q*) is the fraction of the block carrying value s
f(s|q): the background knowledge, i.e. the prior P(S=s | Q=q)
f(s|q*): the prior proportion of s within the generalized group q*

Putting the pieces together, the observed belief is
β(q,s,T*) = [n(q*, s) * f(s|q)/f(s|q*)] / [ Σ over s' of n(q*, s') * f(s'|q)/f(s'|q*) ]
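When the attacker has no informative prior (f(s|q) = f(s|q*) for every s), the observed belief reduces to the block proportion n(q*, s)/n(q*). A minimal sketch of that no-background-knowledge case, with a toy block assumed from the homogeneity example:

```python
# Observed belief when the attacker has no informative prior:
# simply the fraction of records in the q*-block carrying sensitive value s.
from collections import Counter

def observed_belief(block_sensitive_values, s):
    counts = Counter(block_sensitive_values)          # n(q*, s') for each s'
    return counts[s] / len(block_sensitive_values)    # n(q*, s) / n(q*)

# Bob's (assumed) 4-anonymous block from the homogeneity example.
block = ["cancer", "cancer", "cancer", "cancer"]
print(observed_belief(block, "cancer"))  # 1.0 -> positive disclosure
```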

Interpreting the privacy problem of k-anonymity

Derived from the relationship between observed belief and privacy disclosure.

Positive disclosure, extreme situation: β(q,s,T*) -> 1
Possibility 1: n(q*, s') << n(q*, s) for every other s' => lack of diversity
Possibility 2: strong background knowledge helps to eliminate the other items
Knowledge: except for one s, the other s' are not likely true when Q=q, i.e. f(s'|q) -> 0
This minimizes the contribution of the other items, so their terms -> 0 and β(q,s,T*) -> 1

Negative disclosure: β(q,s,T*) -> 0
Requires either n(q*, s) -> 0 or f(s|q) -> 0

The Umeko example: background knowledge (f(heart disease | Japanese) ≈ 0) eliminates the competing value, an instance of Possibility 2.

How to address the problems?

Make sure n(q*, s') << n(q*, s) cannot hold
Then the attacker needs more knowledge to get rid of the other items: "damaging instance-level knowledge" that drives f(s'|q) -> 0
If there are L distinct sensitive values in the q*-block, the attacker needs L-1 pieces of damaging knowledge to rule out the L-1 other possible sensitive values

This is the principle of l-diversity

L-diversity: how to evaluate it?

Entropy l-diversity: every q*-block satisfies the condition

-Σs p(q*, s) log p(q*, s) >= log(L), where p(q*, s) = n(q*, s)/n(q*)

The left-hand side is the entropy of the sensitive values in the q*-block; log(L) is the entropy of L uniformly distributed distinct sensitive values.

* We like a uniform distribution of sensitive values over each block!
** Guarantees every q*-block has at least L distinct sensitive values
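A minimal sketch of the entropy l-diversity check (the toy blocks are assumed): compute the entropy of the sensitive values inside a block and compare it against log(L).

```python
# Entropy l-diversity: each q*-block must have sensitive-value entropy >= log(L).
import math
from collections import Counter

def entropy_l_diverse(block_sensitive_values, l):
    counts = Counter(block_sensitive_values)
    total = len(block_sensitive_values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy >= math.log(l)

print(entropy_l_diverse(["cancer"] * 4, l=2))                      # False: no diversity
print(entropy_l_diverse(["flu", "cancer", "flu", "cancer"], l=2))  # True: entropy = log(2)
```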

Other extensions

Entropy l-diversity is too restrictive:
Some positive disclosures should be allowed; in practice, some sensitive values may have very high frequency and are not actually sensitive, e.g. "normal" in a disease attribute
The log(L) threshold cannot be satisfied in some cases

Principle for relaxing the strong condition:
A uniform distribution of sensitive values is good! When we cannot achieve this, we make most value frequencies as close as possible, especially the most frequent value.

Recursive (c,l)-diversity is proposed

Control the gap between the most frequent value and the least frequent values: with the block frequencies sorted as r1 >= r2 >= … >= rm, require r1 < c(rl + r(l+1) + … + rm)
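A minimal sketch of the recursive (c,l)-diversity condition (toy block assumed), using the sorted block frequencies described above:

```python
# Recursive (c, l)-diversity: with block frequencies sorted as r1 >= r2 >= ... >= rm,
# require r1 < c * (r_l + r_{l+1} + ... + r_m).
from collections import Counter

def recursive_cl_diverse(block_sensitive_values, c, l):
    freqs = sorted(Counter(block_sensitive_values).values(), reverse=True)
    if len(freqs) < l:
        return False                      # fewer than l distinct sensitive values
    return freqs[0] < c * sum(freqs[l - 1:])

block = ["flu", "flu", "flu", "cancer", "hepatitis", "bronchitis"]
print(recursive_cl_diverse(block, c=2, l=3))  # True: r1 = 3 < 2 * (r3 + r4) = 4
```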

Implementation of l-diversity

Build an algorithm with a structure similar to k-anonymity algorithms: use a domain generalization hierarchy, but check l-diversity instead of k-anonymity
The Anatomy approach (publishes the QIs without generalization)

Discussion

Problems not addressed:
Skewed data, a common problem for l-diversity and k-anonymity, makes l-diversity very inefficient

Balance between utility and privacy:
The entropy l-diversity and (c,l)-diversity methods do not guarantee good data utility; the Anatomy method is much better

t-closeness

Addresses two types of attacks: the skewness attack and the similarity attack

Skewness attack

The probability of cancer in the original table is low, but the probability of cancer in some anonymized block is much higher than the global probability

Semantic similarity attack

The salaries in the block are all low

Everyone in the block has some kind of stomach disease

The root of these two problems

For the sensitive values, there is a difference between the global distribution and the local distribution in some block

The proposal of t-closeness

Make the global and local distributions as similar as possible
Evaluate the distribution similarity, accounting for semantic similarity
Density: {3k, 4k, 5k} is denser than {6k, 8k, 11k}
"Earth mover's distance" as the similarity measure
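A minimal sketch of a t-closeness check for a numeric sensitive attribute (the toy salaries and the threshold t are assumed), using scipy's one-dimensional Wasserstein distance as the earth mover's distance between the block's local distribution and the global distribution:

```python
# t-closeness (numeric attribute): the earth mover's distance between the
# sensitive-value distribution of a block and the global distribution
# must not exceed the threshold t.
from scipy.stats import wasserstein_distance

def t_close(block_values, global_values, t):
    return wasserstein_distance(block_values, global_values) <= t

global_salaries = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 11000, 12000]
block_salaries  = [3000, 4000, 5000]          # semantically similar: all low
print(t_close(block_salaries, global_salaries, t=1000.0))  # False: block skewed low
```

The original t-closeness proposal normalizes the ground distance so that the EMD falls in [0, 1]; that normalization step is omitted here for brevity.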

Privacy MaxEnt

Quantify the privacy under background knowledge attacks, so that we know how vulnerable an anonymized dataset is under different assumptions about the attacker's knowledge

All attacks are based on the attacker's background knowledge:
Knowledge from the table: the local/global distributions of sensitive values can always be calculated
Common knowledge: useful common knowledge should be consistent with the knowledge from the table

An attack is an estimate of P(S|Q): find Q -> S associations with high confidence
Both a higher and a lower P(S|Q) than common knowledge reveal information

Quantifying privacy

We need to estimate the conditional probability P(S|Q,B), where B is the bucket; this is the most interesting quantity

Without background knowledge: P(S|Q,B) is estimated by the proportion of S in the bucket B

With background knowledge: complicated… The paper proposes a Maximum Entropy based method to estimate P(S|Q,B), assuming the attacker knows different kinds of background knowledge; the background knowledge is modeled as constraints

Types of background knowledge

Rule-based knowledge: P(s | q) = 1, or P(s | q) = 0
Probability-based knowledge: P(s | q) = 0.2, or P(s | Alice) = 0.2
Vague background knowledge: 0.3 ≤ P(s | q) ≤ 0.5
Miscellaneous types: P(s | q1) + P(s | q2) = 0.7; one of Alice and Bob has "Lung Cancer"

Maximum Entropy Estimation

MaxEnt principle:
If you don't know the distribution, assume it is uniform
If you know part of the distribution, still model the remaining part as uniform

Uniform distribution <=> maximum entropy, so maximizing entropy pushes the distribution toward uniform

MaxEnt for privacy analysis

Maximize the entropy H(S|Q,B)
H(S|Q,B) = H(Q,S,B) - H(Q,B); since H(Q,B) is fixed by the published table, this is equivalent to maximizing H(Q,S,B)
H(Q,S,B) is maximized when P(Q,S,B) is uniform

The problem

Given a table D', solve the following optimization problem:
Find an assignment of P(Q,S,B) for every (Q,S,B) combination that maximizes H(Q,S,B)
Subject to a list of constraints (including the background knowledge)
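A minimal sketch (not the paper's algorithm) of this constrained MaxEnt estimation using a general-purpose solver; the toy domain, the variable names, and the single background-knowledge constraint are assumptions made for the example.

```python
# Maximize the entropy of the joint P(Q, S, B), subject to normalization
# and background-knowledge constraints, via scipy's SLSQP solver.
import numpy as np
from scipy.optimize import minimize

# Toy domain: 2 quasi-identifier values x 2 sensitive values x 1 bucket.
cells = [("q1", "s1", "b1"), ("q1", "s2", "b1"),
         ("q2", "s1", "b1"), ("q2", "s2", "b1")]
n = len(cells)

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)            # avoid log(0)
    return float(np.sum(p * np.log(p)))   # minimizing this maximizes H(Q,S,B)

constraints = [
    # All cell probabilities sum to 1.
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    # Assumed background knowledge: P(s1 | q1) = 0.2,
    # written linearly as P(q1,s1,b1) - 0.2 * (P(q1,s1,b1) + P(q1,s2,b1)) = 0.
    {"type": "eq", "fun": lambda p: p[0] - 0.2 * (p[0] + p[1])},
]

result = minimize(
    neg_entropy,
    x0=np.full(n, 1.0 / n),               # start from the uniform distribution
    bounds=[(0.0, 1.0)] * n,
    constraints=constraints,
    method="SLSQP",
)

for cell, prob in zip(cells, result.x):
    print(cell, round(prob, 3))
```

Constraint rows such as the P(flu | male) example on the next slide would be added to the constraints list in the same linear form, and the attacker's estimate P(S|Q,B) is then read off as P(Q,S,B) / P(Q,B).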

Finding the constraints

Three sources: knowledge about the data distribution, constraints from the data itself, and knowledge about individuals

Modeling background knowledge about distributions

P(S|Qv), where Qv is a part of Q; e.g. P(Breast cancer | male) = 0
P(Qv, S) = P(S|Qv) * P(Qv), expanded over all buckets and over Q- = Q - Qv
In the previous example, if P(flu | male) = 0.3:
P({male,college}, Flu, 1) + P({male,highschool}, Flu, 1) + P({male,college}, Flu, 3) + P({male,graduate}, Flu, 3) = 0.3 * P(male) = 0.3 * 6/10 = 0.18

Constraints from the Data

Identify invariants from the disguised data:
QI-invariant equation
SA-invariant equation
Zero-invariant equation: P(q, s, b) = 0 if q is not in bucket b, or s is not in bucket b

Knowledge about individuals

Can be modeled with similar methods

Alice: (i1, q1), Bob: (i4, q2), Charlie: (i9, q5)

Knowledge 1: Alice has either s1 or s4.
Constraint: P(i1, q1, s1, 1) + P(i1, q1, s1, 2) + P(i1, q1, s4, 2) = P(i1, q1) = 1/N

Knowledge 2: Two people among Alice, Bob, and Charlie have s4.
Constraint: P(i1, q1, s4, 2) + P(i4, q2, s4, 3) + P(i9, q5, s4, 3) = 2/N

Summary

Attack analysis is an important part of anonymization
Background knowledge modeling is the key to attack analysis
But there is no way to enumerate all background knowledge…