Data Security against Knowledge Loss *)
by Zbigniew W. Ras
University of North Carolina, Charlotte, USA
Data Security against Knowledge Discovery
Possible Challenge Problems:
The Centers for Disease Control (CDC) use data mining to identify trends and patterns in disease outbreaks, such as understanding and predicting the progression of a flu epidemic.
Insurance companies have considerable data that would be useful, but are unwilling to disclose it due to patient privacy concerns.
An alternative approach is to have insurance companies provide knowledge extracted from their data that cannot be traced to individual people, but can be used to identify the trends and patterns of interest to the CDC.
Data Security against Knowledge Discovery
Collaborative Corporations
Ford and Firestone shared a problem with a jointly produced product: the Ford Explorer with Firestone tires. Ford and Firestone may have been able to use association rule techniques to detect problems earlier. This would have required extensive data sharing.
Factors such as trade secrets and agreements with other manufacturers stand in the way of the needed data sharing.
Could we obtain the same results by sharing the knowledge, while still preserving the secrecy of each side's data?
Data Security against Knowledge Discovery
Possible Approach (developing a joint classifier):
Lindell and Pinkas (CRYPTO'00) proposed a method that enables two parties to build a decision tree without either party learning anything about the other party's data, except what might be revealed through the final decision tree.
Clifton and Du (SIGMOD'02) proposed a method that enables two parties to build association rules without either party learning anything about the other party's data.
Alternative Approach:
Each site develops a classifier independently; these are used jointly to produce the global classifier. This protects individual entities, but it has to be shown that the individual classifiers do not release private information.
Local Data Security: Knowledge extracted from remote data cannot be traced to objects stored locally and used to reveal secure information about them.
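The alternative approach (each site trains its own classifier; only predictions cross site boundaries) can be sketched as a simple voting scheme. The data and the deliberately trivial local learner below are hypothetical illustrations, not the method from any of the cited papers:

```python
from collections import Counter

def train_local_classifier(records):
    """Train a trivial local classifier: for each attribute value seen
    locally, remember the majority class (a hypothetical scheme)."""
    votes = {}
    for attrs, label in records:
        for a in attrs:
            votes.setdefault(a, Counter())[label] += 1
    return {a: c.most_common(1)[0][0] for a, c in votes.items()}

def global_predict(local_classifiers, attrs):
    """Combine sites by majority vote over their predictions.
    Only predictions cross site boundaries, never raw records."""
    preds = []
    for clf in local_classifiers:
        site_votes = Counter(clf[a] for a in attrs if a in clf)
        if site_votes:
            preds.append(site_votes.most_common(1)[0][0])
    return Counter(preds).most_common(1)[0][0] if preds else None

# Site 1 and Site 2 each hold their own (attributes, class) records.
site1 = train_local_classifier([(("a1", "b2"), "yes"), (("a2", "b1"), "no")])
site2 = train_local_classifier([(("a1", "b1"), "yes"), (("a2", "b2"), "no")])
print(global_predict([site1, site2], ("a1", "b1")))
```

As the slide notes, it would still have to be shown that the local classifiers themselves do not leak private information.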
Data Security against Knowledge Discovery
Secure Multiparty Computation: A computation is secure if at the end of the computation, no party knows anything except its own input and the results [Yao, 1986].
Ontology

(attributes g, a, b, c)
g    a    b    c
g1        b2
g1   a2   b1   c2
g1   a2        c1
g1   a1   b1   c1

S2 (attributes b, a, d, e)
b    a    d    e
     a1   d2
b2   a2   d2   e2
b1   a2   d1   e1
          d1

S1 (attributes a, b, c, d)
a    b    c    d
a1   b2
     b1   c2
a2   b2        d2
a2   b1   c1

Rule support system S with knowledge base KBS (rules r1, r2); query qS = [a : c, d, b].
Ontology

(attributes g, a, b, c)
g    a    b    c
g1        b2
g1   a2   b1   c2
g1   a2        c1
g1   a1   b1   c1

S2 (attributes b, a, d, e)
b    a    d    e
     a1   d2
b2   a2   d2   e2
b1   a2   d1   e1
          d1

S1 (attributes a, b, c, d)
a    b    c    d
a1   b2
     b1   c2
a2   b2        d2
a2   b1   c1

Rule support systems:
rule           support   system
b1 → a2        1         S1
b2*d2 → a2     1         S
b2 → a2        1         S1
c1*b1 → a1     1         S2

Knowledge base KBS of S (rules r1, r2); query qS = [a, c, d : b].
Problem: Give a strategy for identifying the minimal number of cells which additionally have to be hidden at site S (part of DIS) in order to guarantee that the hidden attribute in S cannot be reconstructed by Distributed Knowledge Discovery.

S:
a    b    c    d
a1   b2   c2   d1
a2   b1   c2
a2   b2   c1   d2
a2   b1   c1   d2

Knowledge base KB of S (rule, confidence, system).
Data Security against Knowledge Discovery
Problem: Give a strategy for identifying the minimal number of cells in S which additionally have to be hidden in order to guarantee that attribute a cannot be reconstructed by Distributed Knowledge Discovery.

S (attribute a hidden):
a    b    c    d
     b2   c2   d1
     b1   c2
     b2   c1   d2
     b1   c1   d2

Knowledge base KB of S (rule, confidence, system).
Data Security against Knowledge Discovery
Problem: Give a strategy for identifying the minimal number of cells in S which additionally have to be hidden in order to guarantee that attribute a cannot be reconstructed by Distributed Knowledge Discovery.

S (attribute a hidden):
a    b    c    d
     b2   c2   d1
     b1   c2
     b2   c1   d2
     b1   c1   d2

Knowledge base KB:
rule           confidence   system
b1 → a2        3/4          S2
b2*d1 → a1     1            S2
b2*d2 → a2     1            S1
c1*b1 → a1     1            S2
Data Security against Knowledge Discovery
Original site S:
a    b    c    d
a1   b2   c2   d1
a2   b1   c2
a2   b2   c1   d2
a2   b1   c1   d2

Reconstructed site S:
a    b    c    d
a1   b2   c2   d1
a2   b1   c2
a2   b2   c1   d2
a1   b1   c1   d2

Knowledge base KB:
rule           confidence   system
b1 → a2        3/4          S2
b2*d1 → a1     1            S2
b2*d2 → a2     1            S1
c1*b1 → a1     1            S2
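The reconstruction above can be reproduced by closing each row of the hidden table under the KB rules, Chase-style. A minimal sketch (rows and rules are taken from the slide; confidences are kept only as comments):

```python
# KB rules from the slide: premise set -> predicted value of attribute a.
RULES = [
    ({"b1"}, "a2"),        # b1 -> a2     (confidence 3/4, from S2)
    ({"b2", "d1"}, "a1"),  # b2*d1 -> a1  (confidence 1, from S2)
    ({"b2", "d2"}, "a2"),  # b2*d2 -> a2  (confidence 1, from S1)
    ({"c1", "b1"}, "a1"),  # c1*b1 -> a1  (confidence 1, from S2)
]

def chase(row, rules):
    """Close a row's known values under the rules (fixpoint iteration)."""
    known = set(row)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# Rows of S with attribute a hidden (from the slide above):
for row in [{"b2", "c2", "d1"}, {"b1", "c2"}, {"b2", "c1", "d2"}, {"b1", "c1", "d2"}]:
    restored = sorted(v for v in chase(row, RULES) if v.startswith("a"))
    print(sorted(row), "->", restored)
```

On the last row the rules conflict (b1 predicts a2 with confidence 3/4, c1*b1 predicts a1 with confidence 1); the slide's reconstruction keeps the higher-confidence prediction a1, which is why the reconstructed table differs from the original there.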
KDD Lab

Research Problem: Give a strategy that identifies the minimum number of attribute values that need to be additionally hidden from Information System S to guarantee that a hidden attribute cannot be reconstructed by Local & Distributed Chase.
Disclosure Risk of Confidential Data
Object x6: (a2, b2, c3, sal=$50,000)
Confidential data sal=$50,000 is hidden: x6 = (a2, b2, c3)
Because of the global rule r1 = a2*b2 → sal=$50,000, we additionally hide b2: x6 = (a2, c3)
Due to a local rule r2 = c3 → b2, the hidden value b2 is restored, and then r1 restores the confidential data: x6 = (a2, b2, c3, sal=$50,000)
Chain of predictions by global and local rules.
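The chain above can be reproduced mechanically: closing the disclosed values under the global and local rules shows that hiding sal and b2 is still not enough, because c3 brings b2 back. A minimal sketch:

```python
# r1 (global): a2 * b2 -> sal=$50,000 ; r2 (local): c3 -> b2
rules = [({"a2", "b2"}, "sal=$50,000"), ({"c3"}, "b2")]

def infer(known, rules):
    """Apply rules to a fixpoint, simulating the chain of predictions."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

x6 = {"a2", "c3"}        # x6 with both sal and b2 hidden
print(infer(x6, rules))  # r2 restores b2, then r1 restores sal
```

So guaranteeing secrecy requires additionally hiding a2 or c3, which is exactly the problem the SCIKD algorithm below addresses.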
Algorithm SCIKD (bottom-up strategy)

Rule   A    B    C    D    E    F    G
r1     a1   b1   c1
r2     a1        c1             f1
r3          b1   c1
r4          b1             e1
r5     a1        c1             f1
r6     a1        c1        e1
r7               c1        e1        g1
r8     a1        c1   d1
r9          b1   c1   d1
r10                   d1        f1

e.g., r1 = [ b1*c1 → a1 ], r2 = [ c1*f1 → a1 ]
D – decision attribute; KB – knowledge base
Algorithm SCIKD (bottom-up strategy)
{a1}* = {a1}   unmarked
{b1}* = {b1}   unmarked
{c1}* = {a1, b1, c1, d1, e1} ⊇ {d1}   marked
{e1}* = {b1, e1}   unmarked
{f1}* = {d1, f1} ⊇ {d1}   marked
{g1}* = {g1}   unmarked

{a1, b1}* = {a1, b1}   unmarked
{a1, e1}* = {a1, b1, e1}   unmarked
{a1, g1}* = {a1, g1}   unmarked
{b1, e1}* = {b1, e1}   unmarked
{b1, g1}* = {b1, g1, e1}   unmarked
{e1, g1}* = {a1, b1, c1, d1, e1, g1} ⊇ {d1}   marked

{a1, b1, e1}* = {a1, b1, e1}   unmarked   /maximal subset/
{a1, b1, g1}* = {a1, b1, g1}   unmarked   /maximal subset/
{b1, e1, g1}* ⊇ {e1, g1}*   marked

{a1, b1, e1, g1}* ⊇ {e1, g1}*   marked
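The closure-and-marking step can be sketched as follows. Only r1 and r2, which the slide states explicitly, plus an assumed reading of r10 as f1 → d1, are encoded, so the rule list is illustrative rather than the full KB:

```python
def closure(values, rules):
    """t* : close a set of attribute values under the KB rules."""
    result = set(values)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise <= result and conclusion not in result:
                result.add(conclusion)
                changed = True
    return result

# r1 and r2 as stated on the slide; f1 -> d1 is an assumed reading of r10.
RULES = [({"b1", "c1"}, "a1"), ({"c1", "f1"}, "a1"), ({"f1"}, "d1")]
HIDDEN = "d1"  # the decision value that must remain hidden

def marked(values):
    """A candidate set is marked (unsafe to disclose) if t* reveals d1."""
    return HIDDEN in closure(values, RULES)

print(closure({"f1"}, RULES))  # {f1}* contains d1, so {f1} is marked
print(marked({"b1", "c1"}))    # closure adds only a1 here, so unmarked
```

SCIKD grows unmarked sets bottom-up, keeping maximal unmarked sets as the values that may safely be disclosed.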
Data security versus knowledge loss

X    A    B    C    D    E    F    G
x1   a1   b1   c1   d1   e1   f1   g1
x2

{a1, b1, e1}* = {a1, b1, e1}   unmarked   /maximal subset/
{a1, b1, g1}* = {a1, b1, g1}   unmarked   /maximal subset/
d1 has to be hidden.

X    A    B    C    D    E    F    G
x1   a1   b1                       g1
x2

X    A    B    C    D    E    F    G
x1   a1   b1             e1
x2

Data security versus knowledge loss
Database D1:
X    A    B    C    D    E    F    G
x1   a1   b1                       g1
x2

Database D2:
X    A    B    C    D    E    F    G
x1   a1   b1             e1
x2

Let Rk = {rk,i : i ∈ Ik} be the set of rules extracted from Dk, and let [ck,i, sk,i] denote the confidence and support of rule rk,i, for all i ∈ Ik, k = 1, 2. With Rk we associate the number

K(Rk) = Σ { ck,i · sk,i : i ∈ Ik }.

D1 contains more knowledge than D2 if K(R1) ≥ K(R2).
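The measure K(Rk) is a direct sum over the extracted rules; a small sketch with illustrative (confidence, support) pairs:

```python
def K(rules):
    """K(R) = sum of confidence * support over the rules extracted from D."""
    return sum(conf * sup for conf, sup in rules)

# Illustrative (confidence, support) pairs for rules extracted from D1 and D2.
R1 = [(1.0, 2), (0.75, 4)]
R2 = [(1.0, 1), (0.5, 2)]

print(K(R1), K(R2))  # the database with the larger K preserves more knowledge
```

Among the candidate databases produced by hiding different maximal unmarked sets, the one with the larger K loses less knowledge.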
Objective Interestingness

Basic measures for a rule t → s:
Domain: card[t]
Support (Strength): card[t ∧ s]
Confidence (Certainty Factor): card[t ∧ s] / card[t]
Coverage Factor: card[t ∧ s] / card[s]
Leverage: card[t ∧ s] – card[t]·card[s]
Lift: card[t ∧ s] / (card[t]·card[s])
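These measures follow directly from the cardinalities. A sketch for a rule t → s over rows represented as sets of attribute values; leverage and lift are computed on raw cardinalities exactly as written above, not the normalized textbook variants, and the sample rows are illustrative:

```python
def measures(rows, t, s):
    """Interestingness measures for the rule t -> s."""
    n_t = sum(1 for r in rows if t <= r)         # card[t]
    n_s = sum(1 for r in rows if s <= r)         # card[s]
    n_ts = sum(1 for r in rows if (t | s) <= r)  # card[t ^ s]
    return {
        "domain": n_t,
        "support": n_ts,
        "confidence": n_ts / n_t,
        "coverage": n_ts / n_s,
        "leverage": n_ts - n_t * n_s,
        "lift": n_ts / (n_t * n_s),
    }

rows = [{"b1", "c1", "a2"}, {"b1", "c2", "a2"},
        {"b2", "c1", "a1"}, {"b1", "c1", "a2"}]
print(measures(rows, {"b1"}, {"a2"}))  # measures for the rule b1 -> a2
```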
Data security versus knowledge loss

Database D1:
X    A    B    C    D    E    F    G
x1   a1   b1                       g1
x2

Database D2:
X    A    B    C    D    E    F    G
x1   a1   b1             e1
x2

Let Rk = {rk,i : i ∈ Ik} be the set of rules extracted from Dk, and let [ck,i, sk,i] denote the coverage factor and support of rule rk,i, for all i ∈ Ik, k = 1, 2. With Rk we associate the number

K(Rk) = Σ { ck,i · sk,i : i ∈ Ik }.

D1 contains more knowledge than D2 if K(R1) ≥ K(R2).
Data security versus knowledge loss

Database D1:
X    A    B    C    D    E    F    G
x1   a1   b1                       g1

Database D2:
X    A    B    C    D    E    F    G
x1   a1   b1             e1

Let Rk = {rk,i : i ∈ Ik} be the set of feasible action rules extracted from Dk, and let [ck,i, sk,i] denote the confidence and support of action rule rk,i, for all i ∈ Ik, k = 1, 2. With Rk we associate the number

K(Rk) = Σ { ck,i · sk,i · [1/costDk(rk,i)] : i ∈ Ik }.

D1 contains more knowledge than D2 if K(R1) ≥ K(R2).

Action rule r: [(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ (d, k1 → k2)(x)

The cost of r in D: costD(r) = Σ { ρD(vi, wi) : 1 ≤ i ≤ p }

Action rule r is feasible in D if costD(r) < ρD(k1, k2).
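The cost and feasibility test can be sketched directly from the definitions above; the cost table ρD below is hypothetical:

```python
# Hypothetical cost table: RHO[(v, w)] = rho_D(v, w), the cost of
# changing value v to w in database D.
RHO = {("v1", "w1"): 2.0, ("v2", "w2"): 1.5, ("k1", "k2"): 5.0}

def cost(premise_changes):
    """cost_D(r) = sum of rho_D(v_i, w_i) over the premise changes of r."""
    return sum(RHO[change] for change in premise_changes)

def feasible(premise_changes, decision_change):
    """r is feasible in D if cost_D(r) < rho_D(k1, k2)."""
    return cost(premise_changes) < RHO[decision_change]

# r: [(b1, v1 -> w1) ^ (b2, v2 -> w2)](x) => (d, k1 -> k2)(x)
print(feasible([("v1", "w1"), ("v2", "w2")], ("k1", "k2")))  # 3.5 < 5.0
```

A rule is feasible when reclassifying an object via its premise changes is cheaper than changing the decision value directly.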