![Page 1: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/1.jpg)
SplitX: High-Performance Private Analytics
Ruichuan Chen (Bell Labs / Alcatel-Lucent)
Istemi Ekin Akkus (MPI-SWS)
Paul Francis (MPI-SWS)
![Page 2: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/2.jpg)
Data analytics is important
Evaluate system performance
Understand user behavior
Discover statistical patterns
![Page 3: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/3.jpg)
Data exposure has become a major concern
Third-party trackers
Smart-phone Apps
![Page 4: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/4.jpg)
User-owned and operated
Data exposure has to be brought under control!
User-owned and operated principle: personal data should be stored in a local host under the user’s control.
![Page 5: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/5.jpg)
Motivation and problem
How to make aggregate queries over distributed private user data while still preserving user privacy?
[Diagram: an analyst and the data held by individual users]
![Page 6: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/6.jpg)
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
![Page 7: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/7.jpg)
A general approach
Based on differential privacy.
Differential privacy adds noise to the output of a computation (i.e., a query).
The noise hides the presence or absence of any single user.
[Diagram: Data, Database, Query Module (add noise), Analyst]
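The idea of adding noise to a query output can be sketched with the Laplace mechanism, a standard way to make a counting query differentially private. This is only an illustration of the general approach, not how SplitX generates noise; the function names and the choice of a counting query with sensitivity 1 are assumptions made for the example.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    # Assumption: a counting query changes by at most 1 when one user is
    # added or removed (sensitivity 1), so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Illustration only: a seeded generator keeps the run reproducible.
rng = random.Random(0)
samples = [noisy_count(100, 1.0, rng) for _ in range(10000)]
mean_estimate = sum(samples) / len(samples)
```

Each individual noisy count differs from the true count, which masks any single user's contribution, while the noise averages out to roughly zero over many samples.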
![Page 8: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/8.jpg)
Previous systems
Servers aggregate answers without seeing individual user data.
Differentially private noise is added to the aggregate result.
[Diagram: Data, Servers, Analyst]
Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
![Page 9: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/9.jpg)
Primary technical problems
Scale poorly
  Require public-key operations or something even more expensive.
  Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
Suffer from answer pollution
  Even a single malicious user can substantially distort the aggregate result through a single answer.
  Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
![Page 10: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/10.jpg)
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
![Page 11: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/11.jpg)
SplitX
A high-performance private analytics system
  2 to 3 orders of magnitude more efficient in bandwidth
  3 to 5 orders of magnitude more efficient in computation
  Resistant to answer pollution
![Page 12: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/12.jpg)
Components & assumptions
[Diagram: clients (Data), servers, and analysts]
Analysts are potentially malicious (violating user privacy).
Clients are user devices. Clients are potentially malicious (distorting the final results).
Servers (1 aggregator and 2 mixes) are honest but curious:
1) Follow the specified protocol
2) Try to exploit additional info that can be learned in so doing
![Page 13: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/13.jpg)
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
![Page 14: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/14.jpg)
Key insights: XOR encryption
How to achieve high performance?
Client wants to send M to the aggregator.
Client splits M, and sends the split messages to the aggregator via the mixes.
Aggregator joins the split messages to recreate M.
[Diagram: the client generates a random R and sends M ⊕ R via Mix1 and R via Mix2; the aggregator XORs the two split messages to recreate M]
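The split and join steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `split` and `join` are hypothetical, messages are treated as byte strings, and the network hops through the mixes are omitted.

```python
import secrets
from typing import Tuple

def split(message: bytes) -> Tuple[bytes, bytes]:
    # Generate a random string R of the same length as M.
    r = secrets.token_bytes(len(message))
    # One split message is M XOR R, the other is R itself.
    masked = bytes(m ^ k for m, k in zip(message, r))
    return masked, r

def join(share1: bytes, share2: bytes) -> bytes:
    # (M XOR R) XOR R == M, so XORing the two split messages recreates M.
    return bytes(a ^ b for a, b in zip(share1, share2))

message = b"answer: 0100"
to_mix1, to_mix2 = split(message)    # each mix sees only one split message
recreated = join(to_mix1, to_mix2)   # done by the aggregator
```

Each split message on its own is a uniformly random string, so a single mix learns nothing about M, and the only operations involved are random-number generation and XOR rather than public-key cryptography.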
![Page 15: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/15.jpg)
Key insights: XOR encryption
How to achieve high performance?
For clarity, in the following diagrams M denotes that the client sends two split messages of M to the aggregator via Mix1 and Mix2.
[Diagram: the two-mix exchange from the previous slide, shown in full and in the abbreviated form]
![Page 16: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/16.jpg)
Key insights: query buckets
How to limit answer pollution?
Solution: ensure that a client cannot arbitrarily manipulate answers.
  Divide the answer’s value range into buckets.
  Enforce a binary answer in each bucket.
![Page 17: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/17.jpg)
Key insights: query buckets
Query: “SELECT age FROM splitx”
4 buckets: 0~19, 20~39, 40~59, and ≥60.
Answers: a ‘1’ or ‘0’ per bucket.
30 years old → 0, 1, 0, 0
Answers are encoded in a bit-vector.
An answer from a malicious client cannot substantially distort the query result!
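The bucket encoding from this example can be sketched as follows. The helper name `encode_answer` and the representation of buckets by their lower bounds are assumptions made for illustration.

```python
# Lower bounds of the four buckets 0~19, 20~39, 40~59, and >=60.
BUCKET_LOWER_BOUNDS = [0, 20, 40, 60]

def encode_answer(age, lower_bounds=BUCKET_LOWER_BOUNDS):
    # Start with a '0' in every bucket.
    bits = [0] * len(lower_bounds)
    # Set a '1' in the last bucket whose lower bound does not exceed the age.
    for i in reversed(range(len(lower_bounds))):
        if age >= lower_bounds[i]:
            bits[i] = 1
            break
    return bits

answer_for_30 = encode_answer(30)
answer_for_75 = encode_answer(75)
```

Because every position holds a single bit, any answer, honest or not, can change a bucket's count by at most 1.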
![Page 18: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/18.jpg)
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
![Page 19: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/19.jpg)
System design
1) Query publish/subscribe
  Analyst publishes its queries
  Client subscribes to an analyst’s queries
2) Query answering
  Client answers queries
  Mixes add differentially private noise
  Mixes shuffle answers
  Aggregator generates query results
![Page 20: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/20.jpg)
1) Query publish/subscribe
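[Diagram: Analyst, Aggregator, Mix1, Mix2, Client; messages: "Query1, Query2, …" (published by the analyst), "Analyst ID" (client subscription), "Query1, Query2, …" (delivered to the client)]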
![Page 21: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/21.jpg)
1) Query publish/subscribe
Query example: age distribution among male users?
QID: 123
SQL: SELECT age FROM splitx WHERE gender='male'
Buckets: 0~19, 20~39, 40~59, and ≥60
DP parameter (ε): 1.0
Tend: 11:59:59PM on Aug 16, 2013
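One way to hold these query fields in code is a small record type. This is a sketch only: the class name, field names, and the `is_open` helper are hypothetical, and treating Tend as the time after which answers are no longer accepted is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass(frozen=True)
class Query:
    qid: int                        # QID
    sql: str                        # SQL run over the client's local data
    bucket_lower_bounds: List[int]  # buckets, given by their lower bounds
    epsilon: float                  # DP parameter
    t_end: datetime                 # Tend

    def is_open(self, now: datetime) -> bool:
        # Assumption: clients may answer up to and including Tend.
        return now <= self.t_end

example_query = Query(
    qid=123,
    sql="SELECT age FROM splitx WHERE gender='male'",
    bucket_lower_bounds=[0, 20, 40, 60],
    epsilon=1.0,
    t_end=datetime(2013, 8, 16, 23, 59, 59),
)
```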
![Page 22: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/22.jpg)
2) Query answering
Client answers queries
Mixes add differentially private noise
Mixes shuffle answers
Aggregator generates query results
![Page 23: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/23.jpg)
Step 1: client answers queries
Client executes query over its local data and generates an answer
‘1’ or ‘0’ per bucket
Encoded as a bit-vector
![Page 24: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/24.jpg)
Step 1: client answers queries
Client splits its answer, and sends the split answers with the query ID to the two mixes, respectively.
[Diagram: the client sends "QID, answer" as split messages to Mix1 and Mix2]
Mix knows which query a client answered. Privacy violation!
![Page 25: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/25.jpg)
Step 2: mixes add DP noise
Each mix individually adds some random bit-vectors as the differentially private noise
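How many bit-vectors are needed? The number, n, depends on c (# clients queried) and ε (DP parameter).
[Diagram: Mix1 holds split answers 0100, 1110, …… and appends random bit-vectors 0111, …… as noise; Mix2 holds split answers 1101, 1001, …… and appends random bit-vectors 0101, …… as noise]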
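The noise step at each mix can be sketched as appending random bit-vectors to the list of split answers. The function names are hypothetical, and the number of noise vectors `n` is passed in as a parameter rather than computed, since its exact formula is not reproduced here.

```python
import secrets

def random_bit_vector(num_buckets):
    # One independent random bit per bucket.
    return [secrets.randbits(1) for _ in range(num_buckets)]

def add_noise(split_answers, n, num_buckets):
    # Append n random bit-vectors after the c split answers from clients.
    noise = [random_bit_vector(num_buckets) for _ in range(n)]
    return split_answers + noise

# Each mix adds its noise individually, without talking to the other mix.
mix1_answers = add_noise([[0, 1, 0, 0], [1, 1, 1, 0]], n=3, num_buckets=4)
mix2_answers = add_noise([[1, 1, 0, 1], [1, 0, 0, 1]], n=3, num_buckets=4)
```

When two position-aligned noise vectors are later XORed, each resulting bit is the XOR of two independent random bits and is therefore itself a uniformly random bit.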
![Page 26: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/26.jpg)
Step 3: mixes shuffle split answers
Each mix maintains c+n split answers.
Mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way.
[Diagram: before shuffling, Mix1 holds 0100, 1110, ……, 0111, …… and Mix2 holds 1101, 1001, ……, 0101, ……; after shuffling, Mix1 holds 1110, 0111, ……, 0100, …… and Mix2 holds 1101, 1101, ……, 0001, ……]
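A synchronized per-column shuffle can be sketched as below. How the two mixes agree on the permutations is not described on the slide; deriving them from a shared seed is an assumption made for this example, and the function names are hypothetical. Python's `random` module is used only for reproducibility and is not cryptographically secure.

```python
import random

def shuffle_columns(split_answers, shared_seed):
    num_rows = len(split_answers)
    num_buckets = len(split_answers[0])
    shuffled = [[0] * num_buckets for _ in range(num_rows)]
    for col in range(num_buckets):
        # Both mixes derive the same permutation for this column
        # from the (hypothetical) shared seed and the column index.
        order = list(range(num_rows))
        random.Random(f"{shared_seed}-{col}").shuffle(order)
        for new_row, old_row in enumerate(order):
            shuffled[new_row][col] = split_answers[old_row][col]
    return shuffled

def column_sums_after_join(mix1_answers, mix2_answers):
    # XOR the aligned bits, then count the 1s in each column (bucket).
    joined = [[a ^ b for a, b in zip(r1, r2)]
              for r1, r2 in zip(mix1_answers, mix2_answers)]
    return [sum(col) for col in zip(*joined)]

mix1 = [[0, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1]]
mix2 = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
seed = "hypothetical-shared-seed"
shuffled1 = shuffle_columns(mix1, seed)
shuffled2 = shuffle_columns(mix2, seed)
sums_before = column_sums_after_join(mix1, mix2)
sums_after = column_sums_after_join(shuffled1, shuffled2)
```

Because both mixes apply the same permutation to the same column, matching split bits stay aligned and the per-bucket sums are unchanged, while the bits of one answer no longer sit in the same row.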
![Page 27: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/27.jpg)
Mixes transmit shuffled answers
Each mix transmits the shuffled split answers to the aggregator.
[Diagram: Mix1 and Mix2 each send their c+n shuffled split answers to the aggregator]
![Page 28: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/28.jpg)
Step 4: aggregator generates query result
Join each bit position in the two split answer arrays.
Sum up the values for each bucket.
Obtain the noisy count for each bucket.
[Diagram: the aggregator joins the two arrays bit by bit:
Mix1: 1110, 0111, ……, 0100, ……
Mix2: 1101, 1101, ……, 0001, ……
Agg: 0011, 1010, ……, 0101, ……]
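The join and sum at the aggregator can be sketched using the three bit-vectors visible in the slide's example. The function names are hypothetical, and only the rows shown on the slide are included.

```python
def join_split_answers(mix1_answers, mix2_answers):
    # XOR each bit position of the two aligned split answer arrays.
    return [[a ^ b for a, b in zip(row1, row2)]
            for row1, row2 in zip(mix1_answers, mix2_answers)]

def bucket_counts(joined_answers):
    # Sum each column to obtain the (noisy) count for each bucket.
    return [sum(column) for column in zip(*joined_answers)]

mix1 = [[1, 1, 1, 0], [0, 1, 1, 1], [0, 1, 0, 0]]
mix2 = [[1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 0, 1]]
joined = join_split_answers(mix1, mix2)
counts = bucket_counts(joined)
```

Here `joined` matches the Agg rows 0011, 1010, and 0101 from the slide, and `counts` holds the per-bucket sums over those three rows.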
![Page 29: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/29.jpg)
Privacy issue at the mixes
Client splits the answer, and sends the split answers with the query ID to the two mixes.
Mix knows which query a specific client answered!
[Diagram: the client sends "QID, answer" as split messages to Mix1 and Mix2]
![Page 30: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/30.jpg)
Solution: double-splitting
[Diagram: Client, Mix1, Mix2, Aggregator, Analyst; message: "QID, answer"]
![Page 31: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/31.jpg)
Duplicate answer detection
A client can answer a query many times!
How to detect and remove duplicate answers?
Triple-splitting is needed
Section 5 in the paper.
![Page 32: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/32.jpg)
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
![Page 33: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/33.jpg)
Computational overhead
Three to five orders of magnitude more efficient in computation than previous systems
[Chart legend: PDDP [NSDI’12]; Akkus et al. [CCS’12], where “A” is the number of buckets that a client reports]
![Page 34: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/34.jpg)
Implementation
Client side
  Google Chrome extension
  Captures webpages browsed, searches made, extensions installed
Server side (mix + aggregator)
  Web services on Jetty
  RPCs defined in Thrift language
![Page 35: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/35.jpg)
Deployment
Query results from a 416-client deployment
Most visited websites: google, facebook, youtube
Most used apps: gmail, youtube, google drive
91% of clients made ≤50 searches / day
70% of clients visited >50 webpages / day
97% of clients visited ≤100 websites / day
![Page 36: SplitX: High-Performance Private Analytics](https://reader035.vdocument.in/reader035/viewer/2022081519/568146a9550346895db3c514/html5/thumbnails/36.jpg)
Conclusion
SplitX: a high-performance private analytics system
  Orders of magnitude more efficient than previous systems
  Resistant to answer pollution
Key insights
  XOR-based encryption
  Query buckets