Privacy-Preserving Data Mining
Rebecca Wright
Computer Science Department
Stevens Institute of Technology
www.cs.stevens.edu/~rwright
PORTIA Site Visit, 12 May 2005
The Data Revolution
• The current data revolution is fueled by the perceived, actual, and potential usefulness of the data.
• Most electronic and physical activities leave some kind of data trail. These trails can provide useful information to various parties.
• However, there are also concerns about appropriate handling and use of sensitive information.
• Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility.
Advantages of Privacy Protection
• protection of personal information
• protection of proprietary or sensitive information
• enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information)
• compliance with legislative policies
Overview
• Introduction
• Primitives
• Higher-level protocols
– Distributed data mining
– Publishable data
– Coping with massiveness
– Beyond privacy-preserving data mining
• Implementation and experimentation
• Lessons learned, conclusions
Models for Distributed Data Mining, I
• Horizontally Partitioned • Vertically Partitioned
[Figure: in the horizontally partitioned model, parties P1, P2, P3 each hold complete records for a disjoint subset of the rows; in the vertically partitioned model, parties P1 and P2 each hold a disjoint subset of the columns for all rows.]
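As a concrete sketch (my illustration, not from the talk), the two partitioning models on this slide can be shown on a toy table; the attribute names are made up:

```python
# One logical table, split between parties in the two models.
records = [
    {"id": 1, "age": 34, "zip": "07030", "diagnosis": "flu"},
    {"id": 2, "age": 51, "zip": "07087", "diagnosis": "cold"},
    {"id": 3, "age": 29, "zip": "07030", "diagnosis": "flu"},
    {"id": 4, "age": 62, "zip": "07302", "diagnosis": "asthma"},
]

# Horizontally partitioned: each party holds all attributes for a subset of rows.
p1_rows = records[:2]
p2_rows = records[2:]

# Vertically partitioned: each party holds a subset of attributes for all rows,
# linked by a common identifier.
p1_cols = [{"id": r["id"], "age": r["age"]} for r in records]
p2_cols = [{"id": r["id"], "zip": r["zip"], "diagnosis": r["diagnosis"]} for r in records]
```

In the horizontal case the parties' row sets together cover the table; in the vertical case their column sets do.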
Models for Distributed Data Mining, II
• Fully Distributed • Client/Server(s)
[Figure: in the fully distributed model, each of parties P1, P2, …, Pn holds its own record; in the client/server model, a CLIENT wishes to compute on the servers' data, and each SERVER holds a database.]
Cryptography vs. Randomization
[Figure: trade-off space with axes for inefficiency, privacy loss, and inaccuracy; the randomization approach and the cryptographic approach occupy different regions of this space.]
Secure Multiparty Computation
• Allows n players to privately compute a function f of their inputs.
• Overhead is polynomial in the size of the inputs and the complexity of f [Yao86, GMW87, BGW88, CCD88, ...].
• In theory, can solve any private distributed data mining problem. In practice, not efficient for large data.
[Figure: parties P1, P2, …, Pn jointly computing f.]
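The core idea can be sketched with additive secret sharing (an illustrative toy of my own, assuming honest-but-curious parties; real SMC protocols such as [BGW88] also handle multiplication gates and malicious behavior):

```python
import random

P = 2**61 - 1  # a prime modulus; all arithmetic is in the field mod P

def share(secret, n):
    """Split `secret` into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three players each share their input with the others; adding shares
# component-wise yields shares of the sum, so the total is computed
# without any single share revealing an individual input.
inputs = [17, 42, 8]
all_shares = [share(x, 3) for x in inputs]
sum_shares = [sum(col) % P for col in zip(*all_shares)]
assert reconstruct(sum_shares) == sum(inputs) % P
```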
Primitives for PPDM
• Common tools include secret sharing, homomorphic encryption, secure scalar product, secure set intersection, secure sums, and other statistics.
• PORTIA work:
– [BGN05]: homomorphic encryption of 2-DNF formulas (arbitrary additions, one multiplication), based on bilinear maps. (P)
– [AMP04]: Medians, kth ranked element. (P)
– [FNP04]: set intersection and cardinality.
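As a toy illustration of the secure-sum primitive listed above (assuming the classic ring-based masking protocol and honest-but-curious parties; the function names are my own):

```python
import random

M = 10**9  # values are summed modulo M, chosen larger than any possible total

def secure_sum(values):
    """Party 1 adds a random mask r; the masked running total is passed
    around the ring, each party adding its value; party 1 removes r at
    the end. No intermediate total reveals any single party's value
    (though colluding neighbors can still isolate a party's input)."""
    r = random.randrange(M)
    running = r
    for v in values:              # the "token" passed around the ring
        running = (running + v) % M
    return (running - r) % M      # party 1 unmasks the total

assert secure_sum([3, 14, 15, 92]) == 124
```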
Higher-Level Protocols
• [LP00]: private protocols for ln x and x ln x
• Various protocols to search remote, encrypted, or access-controlled data (e.g., for keywords, items in common): [BBA03(P), Goh03, FNP04, BCOP04, ABG+05(P), EFH#]
• [YZW05]: frequency mining protocol. (P)
Data Mining Models
• [WY04, YW05]: privacy-preserving construction of Bayesian networks from vertically partitioned data.
• [YZW05]: classification from frequency mining in the fully distributed model (naïve Bayes classification, decision trees, and association rule mining). (P)
• [JW#]: privacy-preserving k-means clustering for arbitrarily partitioned data. (In the vertically partitioned case, similar to two-party [VC03].)
• [AST05]: privacy-preserving computation of multidimensional aggregates on vertically or horizontally partitioned data using randomization.
Privacy-Preserving Bayes Networks [WY04,YW05]
Goal: Cooperatively learn the Bayesian network structure on the combination of DBA and DBB, ideally without either party learning anything except the Bayesian network itself.
[Figure: Alice holds database DBA; Bob holds database DBB.]
K2 Algorithm for BN Learning
• Determining the best BN structure for a given data set is NP-hard, so heuristics are used in practice.
• The K2 algorithm [CH92] is a widely used BN structure-learning algorithm, which we use as the starting point for our solution.
• Considers nodes in sequence. Adds the new parent that most increases a score function f, up to a maximum number of parents per node.
f(i, π(i)) = ∏ α0! α1! / (α0 + α1 + 1)!
Our Solution: Approximate Score
Modified score function: approximates the same relative ordering, and lends itself well to private computation.
• Apply natural log to f and use Stirling’s approximation
• Drop the constant factor and bounded term. The result is:

g(i, π(i)) = ∑ ( (1/2)(ln α0 + ln α1 − ln t) + α0 ln α0 + α1 ln α1 − t ln t )

where t = α0 + α1 + 1
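A quick numeric sanity check (my own, not from the talk) that the approximate score g orders candidate (α0, α1) counts the same way as the log of the exact K2 term:

```python
import math

def ln_f_term(a0, a1):
    # ln[ a0! a1! / (a0 + a1 + 1)! ], computed via log-gamma for stability
    return (math.lgamma(a0 + 1) + math.lgamma(a1 + 1)
            - math.lgamma(a0 + a1 + 2))

def g_term(a0, a1):
    # One summand of the approximate score g from the slide
    t = a0 + a1 + 1
    return (0.5 * (math.log(a0) + math.log(a1) - math.log(t))
            + a0 * math.log(a0) + a1 * math.log(a1) - t * math.log(t))

# Three hypothetical count pairs with the same total: the relative
# ordering under g matches the ordering under the exact log-score.
candidates = [(2, 9), (5, 6), (8, 3)]
exact = sorted(candidates, key=lambda c: ln_f_term(*c))
approx = sorted(candidates, key=lambda c: g_term(*c))
assert exact == approx
```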
Our Solution: Components
Sub-protocols used:
• Privacy-preserving scalar product protocol: based on homomorphic encryption
• Privacy-preserving computation of α-parameters: uses scalar product
• Privacy-preserving score computation: uses α-parameters, [LP00] protocols for ln x and x ln x
• Privacy-preserving score comparison: uses [Yao86]
All intermediate values (scores and parameters) are protected using secret sharing. [YW05] improves on [MSK04] for parameter computation.
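A homomorphic-encryption-based scalar product can be sketched with textbook Paillier encryption (a toy with tiny primes, purely my illustration; the talk does not name a specific scheme, and a real protocol would also mask the result so the key holder learns only a share):

```python
import math, random

# Textbook Paillier keypair with tiny primes (insecure, for illustration only).
p, q = 1019, 1021
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)

def enc(m):
    """Paillier encryption: c = (1+n)^m * r^n mod n^2."""
    r = random.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Paillier decryption: m = L(c^lam mod n^2) * mu mod n, L(u) = (u-1)/n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Alice sends Enc(a_i); Bob raises each to b_i and multiplies the results.
# Since Enc(x)·Enc(y) = Enc(x+y), this yields Enc(Σ a_i b_i), which only
# Alice (the key holder) can decrypt.
alice = [3, 1, 4]
bob = [2, 7, 1]
cs = [enc(a) for a in alice]
prod = 1
for c, b in zip(cs, bob):
    prod = (prod * pow(c, b, n2)) % n2
assert dec(prod) == sum(a * b for a, b in zip(alice, bob))  # 3·2 + 1·7 + 4·1 = 17
```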
Overview
• Introduction
• Primitives
• Higher-level protocols
– Distributed data mining
– Publishable data
– Coping with massiveness
– Beyond privacy-preserving data mining
• Implementation and experimentation
• Lessons learned, conclusions
Publishable Data
• Goal: Modify data before publishing so that results have good privacy and good utility.
– Some situations favor one more than the other.
– May prevent some things from being learned at all.
• [DN04]: Extends privacy definitions of [EGS03, DN03] relating a priori and a posteriori knowledge, and provides solutions in a moderated publishing model.
• [CDMSW04]: provide quantifiable definitions of privacy and utility. One's privacy is guaranteed to the extent that one does not stand out from others.
Publishable Data: k-Anonymity
• Modify the database before publishing so that (the quasi-identifier of) every record in the database is identical to at least k – 1 other records [Swe02, MW04].
• [AFK+05]: optimal k-anonymization is NP-hard even if the data values are ternary. Presents efficient approximation algorithms for k-anonymization. (P)
• [ZYW05]: in two formulations, present solutions for a data publisher to learn a k-anonymized version of a fully distributed database without learning anything else. (P)
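A minimal sketch of the k-anonymity property itself (my illustration; the quasi-identifier and the bucketing-style generalization are assumptions in the spirit of [Swe02]):

```python
from collections import Counter

def quasi_id(record, age_bucket=10, zip_digits=3):
    """Generalized quasi-identifier: an age range plus a zip-code prefix."""
    return (record["age"] // age_bucket, record["zip"][:zip_digits])

def is_k_anonymous(records, k, **gen):
    """True iff every quasi-identifier value is shared by at least k records."""
    counts = Counter(quasi_id(r, **gen) for r in records)
    return all(c >= k for c in counts.values())

records = [
    {"age": 34, "zip": "07030"}, {"age": 36, "zip": "07031"},
    {"age": 51, "zip": "07087"}, {"age": 58, "zip": "07086"},
]
# Raw values (exact age, full zip) single out every record; coarsening to
# 10-year ranges and 3-digit zip prefixes makes the table 2-anonymous.
assert not is_k_anonymous(records, 2, age_bucket=1, zip_digits=5)
assert is_k_anonymous(records, 2, age_bucket=10, zip_digits=3)
```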
Coping with Massiveness
• Data mining on massive data sets is an important field in its own right.
• It is also privacy-relevant, because:
– Massive data sets are likely to be distributed and multiply owned.
– Efficiency improvements are needed in order to have any hope of adding overhead for privacy.
• [FKMSZ05]: Stream algorithms for massive graphs (P)
• [DKM04]: Approximate massive-matrix computations (P)
Beyond Privacy-Preserving Data Mining
• [JW#]: Extends private inference control of [WS04] to work with more complex query functions. Client learns the query result if and only if the inference rule is met (and learns nothing else).
• [KMN05]: Simulatable auditing to ensure that query denials do not leak information. (P)
• [ABG+04]: P4P: Paranoid Platform for Privacy Preferences. Mechanism for ensuring released data is usable only for allowed tasks. (P)
Enforce policies about what kinds of queries or computations on data are allowed.
Overview
• Introduction
• Primitives
• Higher-level protocols
– Distributed data mining
– Publishable data
– Coping with massiveness
– Beyond privacy-preserving data mining
• Implementation and experimentation
• Lessons learned, conclusions
Implementation and Experimentation
• secure scalar product protocol [SWY04]
• MySQL private information retrieval (PIR) [BBFS#]
• Fairplay: a system implementing Yao's two-party secure function evaluation [MNPS04]
• Bayesian network implementation [KRFW#] (D)
• secure computation of surveys using Fairplay, and its use for the Taulbee survey [FPRS04] (P,D)
Survey Software [FPRS04] (P,D)
• User-friendly, open-source, free implementation using Fairplay [MNPS04], suitable for use with CRA's Taulbee salary survey. Not adopted.
• CRA’s reasons:
– Need for data cleaning, multiyear comparisons, unanticipated use
– “Perhaps most member departments will trust us.”
• Provost Offices’ reasons:
– No legal basis for using this privacy-preserving protocol on data that we otherwise don't disclose
– Correctness and security claims are hard and expensive to assess, despite the open-source implementation.
– All-or-none adoption by Ivy+ peer group. Can't make the decision unilaterally.
Future Directions in Experimentation
• Combine these and others into a general-purpose privacy-preserving data mining experimental platform. Useful for:
– fast prototyping of new protocols
– efficiency, accuracy comparisons of different approaches
• Experiment with real data and real uses.
– need to find a user community that has explicitly expressed interest, and that could potentially accomplish something via PPDM that it currently cannot accomplish.
– [Scha04]: genetics researchers may form such a community
Other Future Directions
• Preprocessing of data for PPDM.
• Privacy-preserving data solutions that use both randomization and cryptography in order to gain some of the advantages of both.
• Policies for privacy-preserving data mining: languages, reconciliation, and enforcement.
• Incentive-compatible privacy-preserving data mining.
Conclusions
• Increasing use of computers and networks has led to a proliferation of sensitive data.
• Without proper precautions, this data could be misused.
• Many technologies exist for supporting proper data handling, but much work remains, and some barriers must be overcome in order for them to be deployed.
• Cryptography is a useful component, but not the whole solution.
• Technology, policy, and education must work together.