data-mining on gbytes of encrypted data · data mining • classification • regression •...
TRANSCRIPT
Data-Mining on GBytes of Encrypted DataEncrypted Data
Valeria Nikolaenko
In collaboration withDan Boneh (Stanford); Udi Weinsberg, Stratis Ioannidis,
Marc Joye, Nina Taft (Technicolor).
Outline
• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance
2Valeria Nikolaenko
MotivationUsers
Data
Data mining engineData
Data model
Engine learns nothing more than the model!
3Valeria Nikolaenko
Privacy concern!
Data Mining
• Classification• Regression• Clustering• Summarization
: linear regression
: matrix factorization• Summarization• Dependency modeling
4Valeria Nikolaenko
: matrix factorization
Main challenge: make these algorithms privacy preserving and efficient.
Contribution
• Design of a practical system forprivacy preserving linear regression
• Implementation• Experiments on real datasets• Experiments on real datasets
Comparison to state of the art:• Hall et al.’11: 2 days vs 3 min• Graepel et al.’12: 10 min vs 2 sec
5Valeria Nikolaenko
Outline
• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance
6Valeria Nikolaenko
Computations on Encrypted Data
• 2009, C. Gentry – FHE
• 1979, A. Shamir1988, BGW
(slow for our problems)
Secret sharing1988, BGW
• 1982, A.C. Yao – Garbled circuits
• Our approach: hybrid of Yao and hom. encryption
(huge communication overhead)
7Valeria Nikolaenko
Secret sharing
Yao’s Garbled Circuits
“garbled
C(x) C(x)
8Valeria Nikolaenko
circuit“garbled circuit”
01
01
01
...
...01
G01
G11
G02
G12
G03
G13
...
...G0
nG1
n
x=010...1 Nothing is leaked about x,other than C(x)!
Data Mining System Architecture
User Evaluator
Crypto Service Provider (CSP)User i
CreateĈ
9Valeria Nikolaenko
Create“garbled circuit” ĈĈ
{Gx11, Gx2
2, …, Gxnn }
Compute C(x1, …, xn)
C(x1, …, xn)
Ĉ
{G01, G0
2, …, G0n}
{G11, G1
2, …, G1n}
xi
System Properties
�Evaluator learns the model, not the inputs
Problems:• Not scalable with the number of users• Not scalable with the number of users• Users need to be online
10Valeria Nikolaenko
Outline
• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance
11Valeria Nikolaenko
Linear Regression I
• (f,r) – from users• solve for β
bA =βr
12Valeria Nikolaenko
bA =β
f
r = <f, β>
Linear Regression II
Users
f1, r1Training engine… Training engine
fn, rn
Training engine learns nothing about f’s and r’s,other than β!
β
13Valeria Nikolaenko
…
Outline
• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance
14Valeria Nikolaenko
Construct Matrices
Users
…
f1Tf1
Evaluator
computes A = ∑fiTfi
in the circuit
Phase1: compute and ∑= iT
i ffA ∑= iT
i rfb
15Valeria Nikolaenko
fnTfn
For additions can use homomorphic encryption:
)()](),([ 2121 AAHEAHEAHE +→
in the circuit
Construct Matrices
Users
…
EvaluatorHE(f1
Tf1)
• computesHE(A) = HE(∑f Tf )
Phase1: compute and ∑= iT
i ffA ∑= iT
i rfb
16Valeria Nikolaenko
For additions can use homomorphic encryption:
)()](),([ 2121 AAHEAHEAHE +→
HE(fnTfn)
HE(A) = HE(∑fiTfi )
• decrypts in the circuit
Circuit – independent of the number of usersUsers – offline
Solve Linear System
Input: HE(A),HE(b)
Decrypt A and b
bA =βPhase2: solve for β
Cholesky: compute L, s.t. A= LLT
Solve for β: L(LT β)=b
17Valeria Nikolaenko
β
Privacy Preserving Regression System
Evaluator Crypto Service Provider (CSP)
HEpk(f iTf i)
pk
(pk, sk)←HE.KeyGen
Create “garbled circuit” Ĉ
ComputeHEpk(A = ∑fiTfi),HEpk(b = ∑fiTri)
HEpk(f iTr i)
pk
pkPhase1
{G01, G0
2, …, G0n}
{G11, G1
2, …, G1n}
18Valeria Nikolaenko
Send ĈĈHEpk(f i
Tr i)
β
Garbled inputs
ComputeC(HEpk(A),HEpk(b))
Phase2
System Properties
�Evaluator learns the model, not the inputs�Scalable with the number of users�Users can be offline
19Valeria Nikolaenko
Extensions:• Masking instead of decryption in circuit• Protection against malicious
Evaluator and CSP
Outline
• Motivation• Background on cryptographic tools• Linear regression• Our solution• Our solution• Experiments and performance
20Valeria Nikolaenko
Performance• Hybrid vs. Pure-Yao: 100 times improvement in time!
• For up to 20 features, 1000’s users• time < 3 min• communication < 1GB
21Valeria Nikolaenko
• communication < 1GB
• For 100 million of users, 20 features: 8.75 hours
• Tested on real datasets
Conclusion
• Privacy-preserving data-mining is efficient• Our approach can be used in practice
• Current work: matrix factorization
22Valeria Nikolaenko
• Current work: matrix factorization
• Future work:– implement other data mining algorithms– improving implementation to support high
parallelization