1 privacy preserving data mining haiqin yang extracted from a ppt “secure multiparty computation...
TRANSCRIPT
1
Privacy Preserving Data Mining
Haiqin Yang
Extracted from a ppt “Secure Multiparty Computation and Privacy” Added “Privacy Preserving SVM”
2
Outline
Motivation Privacy Secure computation and privacy Privacy preserving SVM Related to 3D-LBS Challenges
3
Motivation Huge databases exist in various applications
Medical data Consumer purchase data Census data Communication and media-related data Data gathered by government agencies
Can these data be utilized? For medical research For improving customer service For homeland security
4
Motivation Data sharing is necessary for full utilization
Pooling medical data can improve the quality of medical research
Pooling of information from different government agencies can provide a wider picture What is the health status of citizens that are
supported by social welfare? Are there citizens that receive simultaneous
support from different agencies? Data gathered by the government (e.g.,
census data) should be publicly available
5
Motivation
The huge amount of data available means that it is possible to learn a lot of information about individuals from public data Purchasing patterns Family history Medical data …
6
Privacy Human definition:
Privacy and autonomy: information that is personal, confidential or private should not be unnecessarily distributed or publicly known
Privacy and control: Personal or private information should not be misused (whatever that means)
Difficulties in mathematically formulating The same information is classified differently by
different people Legitimate use is interpreted differently by different
people
7
Secure computation and privacy Secure computation
Assume that there is a function that all parties wish to compute
Secure computation shows how to compute that function in the safest way possible
In particular, it guarantees minimal information leakage (the output only)
Privacy Does the function output itself reveal “sensitive
information”, or Should the parties agree to compute this
function?
8
This talk
Secure multiparty computation Trace back to “two millionaires
problem” (A. Yao 82), or earlier Privacy preserving data mining
Privacy preserving SVM (Vaidya et al. KIS 2007)
9
Secure multiparty computation
A set of parties with private inputs Objective: jointly compute a function of the
inputs so that certain security properties (like privacy and correctness) are preserved Applications: secure elections, auctions…
Properties must be ensured even if some of the parties maliciously attack the protocol
10
Secure computation tasks
Examples: Authentication protocols Online payments Auctions Elections Privacy preserving data mining Essentially any task…
11
The real model
x
Protocol output
y
Protocol output
12
The ideal model
x
f1(x,y)
y
f2(x,y)
x
f1(x,y)
y
f2(x,y)
13
IDEALREAL
Trusted party
Protocolinteraction
The security definition
For every real adversary A
there exists anadversary S
14
Why this approach?
General – it captures all applications The specifics of an application are defined
by its functionality, security is defined as above
The security guarantees achieved are easily understood (because the ideal model is easily understood)
We can be confident that we did not “miss” any security requirements
15
The ideal model – More details The definition we gave suffices in the
case of an honest majority When there is no honest majority
Guaranteed output delivery cannot be achieved Fairness cannot be achieved
Changes to ideal model: Corrupted parties receive output first Adversary decides if honest parties receive their
outputs as well This is called security with abort
16
Defects to the ideal model When no honest majority, fairness and guaranteed
output delivery cannot be achieved This “defect” is included into the ideal model
This approach can be used to models of partial information leakage: The parties wish to compute a function f, but more
information is leaked by the protocol This can be modeled by having the trusted party
explicitly leak this information Helps for efficiency considerations
Advantage: explicit defect!
17
Privacy preserving data mining
Setting Data is distributed at different sites These sites may be third parties (e.g.,
hospitals, government bodies) or individuals Aim
Compute the data mining algorithm on the data so that nothing but the output is learned
That is, carry out a secure computation
18
Privacy Security
Secure computation only deals with the process of computing the function It does not ask whether or not the
function should be computed
19
Privacy and secure computation
Secure computation can be used to solve any distributed data-mining problem
A two-stage process Decide that the function/algorithm should be
computed – an issue of privacy Apply secure computation techniques to
compute it securely – security But, not every privacy problem can be cast as a
distributed computation
20
Privacy preserving SVM classification (Vaidya et al. KIS 2007)
21
SVM introduction
Data Objective
x1
x 2
SVM Illustration
Problem: How to pass the kernel matrix among different parties?
22
Linear kernel
Polynomial kernel
RBF kernel
Case 1: On vertically partitioned data
d
D1 D2N
Bank, health insurance company and auto insurance company collect different information about the same people
23
Secure merge Assumption: K parties, Pi holds vi
Procedure P0 chooses a random number R from a
uniform distribution over F P0 sends (R+v0) mod |F | to P1
Pi receives Pi sends to Pi+1
24
Case 2: On horizontally partitioned data
Different banks collect data for their customers
d
D1
D2N
25
Case 3: On arbitrarily partitioned data
26
How to do? Homomorphic encryption
where * is either addition or multiplication (in some abelian group)
Existing cryptosystems Goldwasser–Micali (Blum M, Goldwasser S 1984) Benaloh cryptosystem (Benaloh JC 1986) Naccache–Stern cryptosystem (Naccache D, Stern J
1998) Paillier cryptosystem (Paillier P 1999) Okamoto–Uchiyama cryptosystem (Okamoto T, Uchiyama
S 1998)
27
Algorithm
28
Other security issues
Collusion with the third party Q
29
Related to 3D-LBS Issues
Some data may need be encrypted before research test
May need to work data miners on encrypted data
Problems Lack knowledge of security and
cryptography Limited understanding of privacy
30
Future challenges Understand the real problem and what the real
requirement is A very non-trivial task and one that requires
interdisciplinary cooperation Computer scientists should help to formalize the
notion, lawyers, policy-makers, social scientists should be involved in understanding the concerns
Some challenges Reconciling cultural and legal differences relating to
privacy in different countries Understanding when privacy is “allowed” to be
breached
31
Future challenges
Appropriate modeling for secure computation
Efficient protocols…
32
Conclusion
Privacy-preserving data mining is truly needed Data mining is being used: by security
agencies, governmental bodies and corporations
Privacy advocates and citizen outcry often prevents positive use of data mining
Good solutions (which are still currently out of reach) may be able to resolve the conflict