Data Security against Knowledge Loss *)
by Zbigniew W. Ras
University of North Carolina, Charlotte, USA
Data Security against Knowledge Discovery
Possible Challenge Problems:
The Centers for Disease Control (CDC) use data mining to identify trends and patterns in disease outbreaks, such as understanding and predicting the progression of a flu epidemic.
Insurance companies have considerable data that would be useful, but are unwilling to disclose it due to patient privacy concerns.
An alternative approach is to have insurance companies provide knowledge extracted from their data that cannot be traced to individual people, but can be used to identify the trends and patterns of interest to the CDC.
Data Security against Knowledge Discovery
Collaborative Corporations
Ford and Firestone shared a problem with a jointly produced product: the Ford Explorer with Firestone tires. Ford and Firestone may have been able to use association rule techniques to detect problems earlier. This would have required extensive data sharing.
Factors such as trade secrets and agreements with other manufacturers stand in the way of the needed data sharing.
Could we obtain the same results by sharing the knowledge, while still preserving the secrecy of each side's data?
Data Security against Knowledge Discovery
Possible Approach (developing a joint classifier):
Lindell and Pinkas (CRYPTO'00) proposed a method that enables two parties to build a decision tree without either party learning anything about the other party's data, except what might be revealed through the final decision tree.
Clifton and Du (SIGMOD'02) proposed a method that enables two parties to build association rules without either party learning anything about the other party's data.
Alternative Approach:
Each site develops a classifier independently; these are used jointly to produce the global classifier. This protects individual entities, but it has to be shown that the individual classifiers do not release private information.
Local Data Security: Knowledge extracted from remote data cannot be traced to objects stored locally and used to reveal secure information about them.
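The alternative approach (each site trains its own classifier; only predictions cross site boundaries) can be sketched as a simple voting scheme. The data and the deliberately trivial local learner below are hypothetical illustrations, not the method from any of the cited papers:

```python
from collections import Counter

def train_local_classifier(records):
    """Train a trivial local classifier: for each attribute value seen
    locally, remember the majority class (a hypothetical scheme)."""
    votes = {}
    for attrs, label in records:
        for a in attrs:
            votes.setdefault(a, Counter())[label] += 1
    return {a: c.most_common(1)[0][0] for a, c in votes.items()}

def global_predict(local_classifiers, attrs):
    """Combine sites by majority vote over their predictions.
    Only predictions cross site boundaries, never raw records."""
    preds = []
    for clf in local_classifiers:
        site_votes = Counter(clf[a] for a in attrs if a in clf)
        if site_votes:
            preds.append(site_votes.most_common(1)[0][0])
    return Counter(preds).most_common(1)[0][0] if preds else None

# Site 1 and Site 2 each hold their own (attributes, class) records.
site1 = train_local_classifier([(("a1", "b2"), "yes"), (("a2", "b1"), "no")])
site2 = train_local_classifier([(("a1", "b1"), "yes"), (("a2", "b2"), "no")])
print(global_predict([site1, site2], ("a1", "b1")))
```

As the slide notes, it would still have to be shown that the local classifiers themselves do not leak private information.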
Data Security against Knowledge Discovery
Secure Multiparty Computation: A computation is secure if at the end of the computation, no party knows anything except its own input and the results [Yao, 1986].
Ontology

(attributes g, a, b, c)
g    a    b    c
g1        b2
g1   a2   b1   c2
g1   a2        c1
g1   a1   b1   c1

S2 (attributes b, a, d, e)
b    a    d    e
     a1   d2
b2   a2   d2   e2
b1   a2   d1   e1
          d1

S1 (attributes a, b, c, d)
a    b    c    d
a1   b2
     b1   c2
a2   b2        d2
a2   b1   c1

Rule support system S with knowledge base KBS (rules r1, r2); query qS = [a : c, d, b].
Ontology

(attributes g, a, b, c)
g    a    b    c
g1        b2
g1   a2   b1   c2
g1   a2        c1
g1   a1   b1   c1

S2 (attributes b, a, d, e)
b    a    d    e
     a1   d2
b2   a2   d2   e2
b1   a2   d1   e1
          d1

S1 (attributes a, b, c, d)
a    b    c    d
a1   b2
     b1   c2
a2   b2        d2
a2   b1   c1

Rule support systems:
rule           support   system
b1 → a2        1         S1
b2*d2 → a2     1         S
b2 → a2        1         S1
c1*b1 → a1     1         S2

Knowledge base KBS of S (rules r1, r2); query qS = [a, c, d : b].
Problem: Give a strategy for identifying the minimal number of cells which additionally have to be hidden at site S (part of DIS) in order to guarantee that the hidden attribute in S cannot be reconstructed by Distributed Knowledge Discovery.

S:
a    b    c    d
a1   b2   c2   d1
a2   b1   c2
a2   b2   c1   d2
a2   b1   c1   d2

Knowledge base KB of S (rule, confidence, system).
Data Security against Knowledge Discovery
Problem: Give a strategy for identifying the minimal number of cells in S which additionally have to be hidden in order to guarantee that attribute a cannot be reconstructed by Distributed Knowledge Discovery.

S (attribute a hidden):
a    b    c    d
     b2   c2   d1
     b1   c2
     b2   c1   d2
     b1   c1   d2

Knowledge base KB of S (rule, confidence, system).
Data Security against Knowledge Discovery
Problem: Give a strategy for identifying the minimal number of cells in S which additionally have to be hidden in order to guarantee that attribute a cannot be reconstructed by Distributed Knowledge Discovery.

S (attribute a hidden):
a    b    c    d
     b2   c2   d1
     b1   c2
     b2   c1   d2
     b1   c1   d2

Knowledge base KB:
rule           confidence   system
b1 → a2        3/4          S2
b2*d1 → a1     1            S2
b2*d2 → a2     1            S1
c1*b1 → a1     1            S2
Data Security against Knowledge Discovery
Original site S:
a    b    c    d
a1   b2   c2   d1
a2   b1   c2
a2   b2   c1   d2
a2   b1   c1   d2

Reconstructed site S:
a    b    c    d
a1   b2   c2   d1
a2   b1   c2
a2   b2   c1   d2
a1   b1   c1   d2

Knowledge base KB:
rule           confidence   system
b1 → a2        3/4          S2
b2*d1 → a1     1            S2
b2*d2 → a2     1            S1
c1*b1 → a1     1            S2
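The reconstruction above can be reproduced by closing each row of the hidden table under the KB rules, Chase-style. A minimal sketch (rows and rules are taken from the slide; confidences are kept only as comments):

```python
# KB rules from the slide: premise set -> predicted value of attribute a.
RULES = [
    ({"b1"}, "a2"),        # b1 -> a2     (confidence 3/4, from S2)
    ({"b2", "d1"}, "a1"),  # b2*d1 -> a1  (confidence 1, from S2)
    ({"b2", "d2"}, "a2"),  # b2*d2 -> a2  (confidence 1, from S1)
    ({"c1", "b1"}, "a1"),  # c1*b1 -> a1  (confidence 1, from S2)
]

def chase(row, rules):
    """Close a row's known values under the rules (fixpoint iteration)."""
    known = set(row)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# Rows of S with attribute a hidden (from the slide above):
for row in [{"b2", "c2", "d1"}, {"b1", "c2"}, {"b2", "c1", "d2"}, {"b1", "c1", "d2"}]:
    restored = sorted(v for v in chase(row, RULES) if v.startswith("a"))
    print(sorted(row), "->", restored)
```

On the last row the rules conflict (b1 predicts a2 with confidence 3/4, c1*b1 predicts a1 with confidence 1); the slide's reconstruction keeps the higher-confidence prediction a1, which is why the reconstructed table differs from the original there.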
KDD Lab

Research Problem: Give a strategy that identifies the minimum number of attribute values that need to be additionally hidden from Information System S to guarantee that a hidden attribute cannot be reconstructed by Local & Distributed Chase.
Disclosure Risk of Confidential Data
Object x6: (a2, b2, c3, sal=$50,000)
Confidential data sal=$50,000 is hidden: x6 = (a2, b2, c3)
Because of the global rule r1 = a2*b2 → sal=$50,000, we additionally hide b2: x6 = (a2, c3)
Due to a local rule r2 = c3 → b2, the hidden value b2 is restored, and then r1 restores the confidential data: x6 = (a2, b2, c3, sal=$50,000)
Chain of predictions by global and local rules.
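The chain above can be reproduced mechanically: closing the disclosed values under the global and local rules shows that hiding sal and b2 is still not enough, because c3 brings b2 back. A minimal sketch:

```python
# r1 (global): a2 * b2 -> sal=$50,000 ; r2 (local): c3 -> b2
rules = [({"a2", "b2"}, "sal=$50,000"), ({"c3"}, "b2")]

def infer(known, rules):
    """Apply rules to a fixpoint, simulating the chain of predictions."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

x6 = {"a2", "c3"}        # x6 with both sal and b2 hidden
print(infer(x6, rules))  # r2 restores b2, then r1 restores sal
```

So guaranteeing secrecy requires additionally hiding a2 or c3, which is exactly the problem the SCIKD algorithm below addresses.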
Algorithm SCIKD (bottom-up strategy)

Rule   A    B    C    D    E    F    G
r1     a1   b1   c1
r2     a1        c1             f1
r3          b1   c1
r4          b1             e1
r5     a1        c1             f1
r6     a1        c1        e1
r7               c1        e1        g1
r8     a1        c1   d1
r9          b1   c1   d1
r10                   d1        f1

e.g., r1 = [ b1*c1 → a1 ], r2 = [ c1*f1 → a1 ]
D – decision attribute; KB – knowledge base
Algorithm SCIKD (bottom-up strategy)
{a1}* = {a1}   unmarked
{b1}* = {b1}   unmarked
{c1}* = {a1, b1, c1, d1, e1} ⊇ {d1}   marked
{e1}* = {b1, e1}   unmarked
{f1}* = {d1, f1} ⊇ {d1}   marked
{g1}* = {g1}   unmarked

{a1, b1}* = {a1, b1}   unmarked
{a1, e1}* = {a1, b1, e1}   unmarked
{a1, g1}* = {a1, g1}   unmarked
{b1, e1}* = {b1, e1}   unmarked
{b1, g1}* = {b1, g1, e1}   unmarked
{e1, g1}* = {a1, b1, c1, d1, e1, g1} ⊇ {d1}   marked

{a1, b1, e1}* = {a1, b1, e1}   unmarked   /maximal subset/
{a1, b1, g1}* = {a1, b1, g1}   unmarked   /maximal subset/
{b1, e1, g1}* ⊇ {e1, g1}*   marked

{a1, b1, e1, g1}* ⊇ {e1, g1}*   marked
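The closure-and-marking step can be sketched as follows. Only r1 and r2, which the slide states explicitly, plus an assumed reading of r10 as f1 → d1, are encoded, so the rule list is illustrative rather than the full KB:

```python
def closure(values, rules):
    """t* : close a set of attribute values under the KB rules."""
    result = set(values)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= result and conclusion not in result:
                result.add(conclusion)
                changed = True
    return result

# r1 and r2 as stated on the slide; f1 -> d1 is an assumed reading of r10.
RULES = [({"b1", "c1"}, "a1"), ({"c1", "f1"}, "a1"), ({"f1"}, "d1")]
HIDDEN = "d1"  # the decision value that must remain hidden

def marked(values):
    """A candidate set is marked (unsafe to disclose) if t* reveals d1."""
    return HIDDEN in closure(values, RULES)

print(closure({"f1"}, RULES))  # {f1}* contains d1, so {f1} is marked
print(marked({"b1", "c1"}))    # closure adds only a1 here, so unmarked
```

SCIKD grows unmarked sets bottom-up, keeping maximal unmarked sets as the values that may safely be disclosed.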
Data security versus knowledge loss

X    A    B    C    D    E    F    G
x1   a1   b1   c1   d1   e1   f1   g1
x2

{a1, b1, e1}* = {a1, b1, e1}   unmarked   /maximal subset/
{a1, b1, g1}* = {a1, b1, g1}   unmarked   /maximal subset/
d1 has to be hidden.

X    A    B    C    D    E    F    G
x1   a1   b1                       g1
x2

X    A    B    C    D    E    F    G
x1   a1   b1             e1
x2

Data security versus knowledge loss
Database D1:
X    A    B    C    D    E    F    G
x1   a1   b1                       g1
x2

Database D2:
X    A    B    C    D    E    F    G
x1   a1   b1             e1
x2

Let Rk = {rk,i : i ∈ Ik} be the set of rules extracted from Dk, and let [ck,i, sk,i] denote the confidence and support of rule rk,i, for all i ∈ Ik, k = 1, 2. With Rk we associate the number

K(Rk) = Σ { ck,i · sk,i : i ∈ Ik }.

D1 contains more knowledge than D2 if K(R1) ≥ K(R2).
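The measure K(Rk) is a direct sum over the extracted rules; a small sketch with illustrative (confidence, support) pairs:

```python
def K(rules):
    """K(R) = sum of confidence * support over the rules extracted from D."""
    return sum(conf * sup for conf, sup in rules)

# Illustrative (confidence, support) pairs for rules extracted from D1 and D2.
R1 = [(1.0, 2), (0.75, 4)]
R2 = [(1.0, 1), (0.5, 2)]

print(K(R1), K(R2))  # the database with the larger K preserves more knowledge
```

Among the candidate databases produced by hiding different maximal unmarked sets, the one with the larger K loses less knowledge.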
Objective Interestingness

Basic measures for a rule t → s:
Domain: card[t]
Support (Strength): card[t ∧ s]
Confidence (Certainty Factor): card[t ∧ s] / card[t]
Coverage Factor: card[t ∧ s] / card[s]
Leverage: card[t ∧ s] – card[t]·card[s]
Lift: card[t ∧ s] / (card[t]·card[s])
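These measures follow directly from the cardinalities. A sketch for a rule t → s over rows represented as sets of attribute values; leverage and lift are computed on raw cardinalities exactly as written above, not the normalized textbook variants, and the sample rows are illustrative:

```python
def measures(rows, t, s):
    """Interestingness measures for the rule t -> s."""
    n_t = sum(1 for r in rows if t <= r)         # card[t]
    n_s = sum(1 for r in rows if s <= r)         # card[s]
    n_ts = sum(1 for r in rows if (t | s) <= r)  # card[t ^ s]
    return {
        "domain": n_t,
        "support": n_ts,
        "confidence": n_ts / n_t,
        "coverage": n_ts / n_s,
        "leverage": n_ts - n_t * n_s,
        "lift": n_ts / (n_t * n_s),
    }

rows = [{"b1", "c1", "a2"}, {"b1", "c2", "a2"},
        {"b2", "c1", "a1"}, {"b1", "c1", "a2"}]
print(measures(rows, {"b1"}, {"a2"}))  # measures for the rule b1 -> a2
```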
Data security versus knowledge loss

Database D1:
X    A    B    C    D    E    F    G
x1   a1   b1                       g1
x2

Database D2:
X    A    B    C    D    E    F    G
x1   a1   b1             e1
x2

Let Rk = {rk,i : i ∈ Ik} be the set of rules extracted from Dk, and let [ck,i, sk,i] denote the coverage factor and support of rule rk,i, for all i ∈ Ik, k = 1, 2. With Rk we associate the number

K(Rk) = Σ { ck,i · sk,i : i ∈ Ik }.

D1 contains more knowledge than D2 if K(R1) ≥ K(R2).
Data security versus knowledge loss

Database D1:
X    A    B    C    D    E    F    G
x1   a1   b1                       g1

Database D2:
X    A    B    C    D    E    F    G
x1   a1   b1             e1

Let Rk = {rk,i : i ∈ Ik} be the set of feasible action rules extracted from Dk, and let [ck,i, sk,i] denote the confidence and support of action rule rk,i, for all i ∈ Ik, k = 1, 2. With Rk we associate the number

K(Rk) = Σ { ck,i · sk,i · [1/costDk(rk,i)] : i ∈ Ik }.

D1 contains more knowledge than D2 if K(R1) ≥ K(R2).

Action rule r: [(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ (d, k1 → k2)(x)

The cost of r in D: costD(r) = Σ { ρD(vi, wi) : 1 ≤ i ≤ p }

Action rule r is feasible in D if costD(r) < ρD(k1, k2).
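The cost and feasibility test can be sketched directly from the definitions above; the cost table ρD below is hypothetical:

```python
# Hypothetical cost table: RHO[(v, w)] = rho_D(v, w), the cost of
# changing value v to w in database D.
RHO = {("v1", "w1"): 2.0, ("v2", "w2"): 1.5, ("k1", "k2"): 5.0}

def cost(premise_changes):
    """cost_D(r) = sum of rho_D(v_i, w_i) over the premise changes of r."""
    return sum(RHO[change] for change in premise_changes)

def feasible(premise_changes, decision_change):
    """r is feasible in D if cost_D(r) < rho_D(k1, k2)."""
    return cost(premise_changes) < RHO[decision_change]

# r: [(b1, v1 -> w1) ^ (b2, v2 -> w2)](x) => (d, k1 -> k2)(x)
print(feasible([("v1", "w1"), ("v2", "w2")], ("k1", "k2")))  # 3.5 < 5.0
```

A rule is feasible when reclassifying an object via its premise changes is cheaper than changing the decision value directly.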