APPLYING EPSILON-DIFFERENTIAL PRIVATE QUERY LOG RELEASING SCHEME TO DOCUMENT RETRIEVAL Sicong Zhang, Hui Yang, Lisa Singh Georgetown University August 13 th , 2015 @PIR 2015, Santiago, Chile. 1


APPLYING EPSILON-DIFFERENTIAL PRIVATE QUERY LOG RELEASING SCHEME TO

DOCUMENT RETRIEVAL

Sicong Zhang, Hui Yang, Lisa Singh

Georgetown University

August 13th, 2015

@PIR 2015, Santiago, Chile.


Introduction & Motivation

• Web search query logs are important and valuable for IR research. Many recent IR methodologies were developed from, or inspired by, the analysis of user behavior in search query logs.

• However, these query logs contain sensitive data, which makes them difficult to release directly, even for research purposes. In 2006, AOL released a query log without adequate anonymization, which led to severe social and legal issues.

• More companies could release their query logs if adequate privacy protection were in place.


Query Log Releasing

• The big picture of query log releasing for a search engine:


Query Log Releasing

• Existing approaches to query log releasing:

  • Deletion.

    • Log deletion, hashing queries, identifier deletion, hashing identifiers, scrubbing query content, deleting infrequent queries, shortening sessions, etc.

    • Shown by recent work to be not private enough.

  • K-Anonymity.

    • Needs certain assumptions about the adversary.

  • Differential Privacy.

    • A stronger privacy notion.

    • Previous work: approximate differential privacy.

    • This work: pure differential privacy.
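
The two notions contrasted above, approximate and pure differential privacy, are commonly stated as follows. This is the standard textbook formulation rather than something shown on the slide; the symbols \mathcal{M} (a randomized release mechanism), D and D' (neighboring query logs) and S (a set of possible outputs) are notation assumed here.

```latex
% Pure (\varepsilon-)differential privacy:
% for all neighboring logs D, D' and every set S of possible outputs,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S]

% Approximate ((\varepsilon,\delta)-)differential privacy
% relaxes the bound by an additive slack \delta \ge 0:
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta
```

Pure differential privacy is the special case \delta = 0, which is why it is the stricter of the two guarantees.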


Query Log Releasing

• This workshop paper introduces our ongoing research on the privacy-preserving query log releasing problem. We consider a one-time release of the query log in a non-interactive setting.

• We propose a framework that applies differential privacy to query logs and achieves pure (ε-)differential privacy. We then run document retrieval experiments on the released query log to show that it remains very useful for IR tasks while preserving privacy.


Project Framework

• Dataset

  • We use the AOL 2006 query log dataset.

  • It is split into two parts: Q (the input to the release algorithm) and Qtest (for evaluation).

• Query Log Releasing Algorithm

  • Takes Q as input and releases the anonymized query log Q’.

• Document Retrieval

  • Uses Q’ to help document retrieval for queries in Qtest.

  • The documents actually clicked in Qtest form the ground truth table.

• Evaluations

  • Compare the retrieved results with the ground truth table.

  • IR metrics: nDCG@10, Precision@1, Recall@10, etc.
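
The metrics named above can be sketched as follows. This is a minimal illustration, assuming binary relevance (a document is relevant if it was clicked in Qtest); the function names and the binary-gain form of nDCG are assumptions, since the slides do not say how the metrics were computed.

```python
import math

def precision_at_k(ranked, relevant, k=1):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: three retrieved URLs, two of them clicked in Qtest.
ranked = ["url_a", "url_b", "url_c"]
clicked = {"url_a", "url_c"}
print(precision_at_k(ranked, clicked, k=1),
      recall_at_k(ranked, clicked, k=10),
      ndcg_at_k(ranked, clicked, k=10))
```

Each function takes the ranked list produced with the help of Q’ and the set of clicked documents from the ground truth table for one query; per-query scores would then be averaged over the evaluated queries.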


Project Framework


Major Steps in the Releasing Algorithm

• Remove sensitive information.

• Limit the number of search queries for each user in the input query log.

• Extend the query candidates for release with an external query pool.

• Select queries to release based on the query counts with Laplacian noise.

• Release query counts and click counts with Laplacian noise added.

• Release query transition information, which preserves some sequential information of the search sessions.
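
Three of the steps above (limiting queries per user, extending the candidates with an external pool, and selecting by noisy count) can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the names `max_per_user`, `epsilon` and `threshold`, the use of a fixed threshold for selection, and the noise scale `max_per_user / epsilon` are all hypothetical choices, and the sketch reuses one noisy count for both selection and release. It omits sensitive-information removal, click counts and query transitions, and it is not by itself a proof of ε-differential privacy.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def limit_user_queries(log, max_per_user):
    """Keep at most max_per_user (user, query) records per user, in log order."""
    kept, seen = [], Counter()
    for user, query in log:
        if seen[user] < max_per_user:
            seen[user] += 1
            kept.append((user, query))
    return kept

def release_noisy_counts(log, query_pool, max_per_user, epsilon, threshold, rng):
    """Return {query: noisy count} for candidates whose noisy count exceeds threshold."""
    limited = limit_user_queries(log, max_per_user)
    counts = Counter(query for _, query in limited)
    # Candidates: queries seen in the log plus the external pool (assumed input).
    candidates = set(counts) | set(query_pool)
    # Assumption: one user changes the count vector by at most max_per_user.
    scale = max_per_user / epsilon
    released = {}
    for query in sorted(candidates):
        noisy = counts[query] + laplace_noise(scale, rng)
        if noisy > threshold:
            released[query] = noisy
    return released

# Hypothetical toy log of (user, query) pairs.
rng = random.Random(0)
log = [("u1", "weather"), ("u2", "weather"), ("u3", "weather"), ("u1", "my address")]
print(release_noisy_counts(log, ["news"], max_per_user=2, epsilon=1.0,
                           threshold=2.0, rng=rng))
```

Bounding each user's contribution before adding noise is what keeps the noise scale finite; a smaller `epsilon` or a larger `max_per_user` both increase the noise.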


Experiments and Privacy Guarantees

• We proved that the (anonymized) userID attribute of the search logs cannot be released to the public if we want to achieve user-level differential privacy. We also prove that our approach achieves pure differential privacy.

• Our experiments evaluate the document retrieval task using the released query log under varying privacy guarantees. Based on the results, we can make recommendations to commercial search engines about future query log releases using our framework.


Evaluations & Results

• A natural baseline

  • Document retrieval with the original (not private) query log.

• The k-anonymity approach from Carpineto and Romano [5]

  • The parameter k in k-anonymity means that only queries issued by at least k different users can be released.

• “# Evaluated Queries” is the number of queries common to Q’ and Qtest.
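
The k-anonymity rule and the evaluated-query count described above can be sketched as follows. This is a simplified reading of the slide's one-line description, not the full method of Carpineto and Romano [5]; the function names and the (user, query) record layout are assumptions.

```python
from collections import defaultdict

def k_anonymous_queries(log, k):
    """Return the queries issued by at least k distinct users."""
    users_by_query = defaultdict(set)
    for user, query in log:
        users_by_query[query].add(user)
    return {q for q, users in users_by_query.items() if len(users) >= k}

def evaluated_queries(released_queries, test_queries):
    """Queries that appear both in the released log and in the test set."""
    return set(released_queries) & set(test_queries)

# Hypothetical toy log of (user, query) pairs.
log = [("u1", "weather"), ("u2", "weather"), ("u1", "my address"), ("u1", "weather")]
released = k_anonymous_queries(log, k=2)
print(released, len(evaluated_queries(released, ["weather", "news"])))
```

Repeated queries from the same user count once, because the rule is about the number of distinct users rather than the raw query frequency.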


Conclusions

• This project addressed the important security concerns in the query log releasing task. We presented our ε-differentially private algorithm for releasing query logs and ran experiments to examine how useful the released query logs are.

• In this paper, we evaluate the IR utility of our query log releasing scheme on the document retrieval task. Experiments show that our released query log is still very useful for document retrieval, and that it outperforms the k-anonymity releasing scheme in both privacy and utility.


Conclusions (Cont’)

• Further comparative experiments in our project also illustrate the privacy-utility trade-off in the query log releasing process. Specifically, the stricter the privacy standard we require, the lower the utility we can retain from the released query log.

• Since a high level of privacy is already guaranteed by our ε-differentially private query log releasing algorithm, we recommend that commercial search engines use softer parameter settings in our algorithm in order to maintain high utility of the released query log.

• We believe this project is an important step towards a final solution for releasing web search logs.


Thanks!

• Presenter: Jiyun Luo

• 1st Author: Sicong Zhang

• Email: [email protected]

• Georgetown University

• @PIR 2015, Santiago, Chile.


Q&A


Q&A

• The format of the log Q’:

  • query, URL, click counts


Laplacian distribution
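
For reference, the Laplace distribution and the way it is typically used to add noise to a count are given below. These are the standard definitions; the symbols \mu, b, f and \Delta f are notation assumed here, and the talk does not state which sensitivity or ε values were used.

```latex
% Density of the Laplace distribution with location \mu and scale b > 0
\mathrm{Lap}(x \mid \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|x - \mu|}{b}\right)

% Laplace mechanism: for a numeric query f with L1-sensitivity \Delta f,
% add zero-mean Laplace noise with scale b = \Delta f / \varepsilon
\mathcal{M}(D) = f(D) + \eta, \qquad \eta \sim \mathrm{Lap}\!\left(0, \frac{\Delta f}{\varepsilon}\right)
```

A smaller ε gives a larger scale b and therefore noisier released counts, which is the privacy-utility trade-off noted in the conclusions.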