

Publishing Search Logs – A Comparative Study of Privacy Guarantees

Michaela Götz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, and Johannes Gehrke

Abstract: Search engine companies collect the "database of intentions," the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information.

In this paper we analyze algorithms for publishing frequent keywords, queries, and clicks of a search log. We first show how methods that achieve variants of k-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by ε-differential privacy unfortunately does not provide any utility for this problem. We then propose a novel algorithm ZEALOUS and show how to set its parameters to achieve (ε, δ)-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis by Korolova et al. [17] that achieves (ε, δ)-indistinguishability.

Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves k-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to k-anonymity while at the same time achieving much stronger privacy guarantees.

Index Terms: H.2.0.a Security, integrity, and protection < H.2.0 General < H.2 Database Management < H Information Technology and Systems; H.3.0.a Web Search < H.3.0 General < H.3 Information Storage and Retrieval < H Information Technology and Systems

    1 INTRODUCTION

"Civilization is the progress toward a society of privacy. The savage's whole existence is public, ruled by the laws of his tribe. Civilization is the process of setting man free from men." – Ayn Rand.

"My favorite thing about the Internet is that you get to go into the private world of real creeps without having to smell them." – Penn Jillette.

Search engines play a crucial role in the navigation through the vastness of the Web. Today's search engines do not just collect and index webpages, they also collect and mine information about their users. They store the queries, clicks, IP-addresses, and other information about the interactions with users in what is called a search log. Search logs contain valuable information that search engines use to tailor their services better to their users' needs. They enable the discovery of trends, patterns, and anomalies in the search behavior of users, and they can be used in the development and testing of new algorithms to improve search performance and quality. Scientists all around the world would like to tap this gold mine for their own research; search engine companies, however, do not release them because they contain sensitive information about their users, for example searches for diseases, lifestyle choices, personal tastes, and political affiliations.

Michaela Götz is a PhD student in the Computer Science Department at Cornell University in Ithaca, NY. E-mail: [email protected]

Ashwin Machanavajjhala is a research scientist at Yahoo! Research in Silicon Valley. E-mail: [email protected]

Guozhang Wang is a PhD student in the Computer Science Department at Cornell University in Ithaca, NY. E-mail: [email protected]

Xiaokui Xiao is an assistant professor in the School of Computer Engineering at Nanyang Technological University (NTU) in Singapore. E-mail: [email protected]

Johannes Gehrke is a professor in the Computer Science Department at Cornell University in Ithaca, NY. E-mail: [email protected]

The only release of a search log happened in 2007 by AOL, and it went into the annals of tech history as one of the great debacles in the search industry.¹ AOL published three months of search logs of 650,000 users. The only measure to protect user privacy was the replacement of user-ids with random numbers, utterly insufficient protection as the New York Times showed by identifying a user from Lilburn, Georgia [4], whose search queries not only contained identifying information but also sensitive information about her friends' ailments.

The AOL search log release shows that simply replacing user-ids with random numbers does not prevent information disclosure. Other ad-hoc methods have been studied and found to be similarly insufficient, such as the removal of names, age, zip codes and other identifiers [14] and the replacement of keywords in search queries by random numbers [18].

In this paper, we compare formal methods of limiting disclosure when publishing frequent keywords, queries, and clicks of a search log. The methods we study vary in the disclosure limitation guarantees they provide and in the amount of useful information they retain. We first describe two negative results. We show that existing proposals to achieve k-anonymity [23] in search logs [1], [21], [12], [13] are insufficient in the light of attackers who can actively influence the search log. We then turn to differential privacy [9], a much stronger privacy guarantee; however, we show that it is impossible to achieve good utility with differential privacy.

1. http://en.wikipedia.org/wiki/AOL_search_data_scandal describes the incident, which resulted in the resignation of AOL's CTO and an ongoing class action lawsuit against AOL resulting from the data release.

We then describe Algorithm ZEALOUS², developed independently by Korolova et al. [17] and us [10] with the goal of achieving relaxations of differential privacy. Korolova et al. showed how to set the parameters of ZEALOUS to guarantee (ε, δ)-indistinguishability [8], and we here offer a new analysis that shows how to set the parameters of ZEALOUS to guarantee (ε, δ)-probabilistic differential privacy [20] (Section 4.2), a much stronger privacy guarantee, as our analytical comparison shows.

Our paper concludes with an extensive experimental evaluation, where we compare the utility of various algorithms that guarantee anonymity or privacy in search log publishing. Our evaluation includes applications that use search logs for improving both search experience and search performance, and our results show that ZEALOUS' output is sufficient for these applications while achieving strong formal privacy guarantees.

We believe that the results of this research enable search engine companies to make their search logs available to researchers without disclosing their users' sensitive information: Search engine companies can apply our algorithm to generate statistics that are (ε, δ)-probabilistic differentially private while retaining good utility for the two applications we have tested. Beyond publishing search logs, we believe that our findings are of interest when publishing frequent itemsets, as ZEALOUS protects privacy against much stronger attackers than those considered in existing work on privacy-preserving publishing of frequent items/itemsets [19].

The remainder of this paper is organized as follows. We start with some background in Section 2 and then describe our negative results in Section 3. We then describe Algorithm ZEALOUS and its analysis in Section 4, and we compare indistinguishability with probabilistic differential privacy in Section 5. Section 6 shows the results of an extensive study of how to set the parameters of ZEALOUS, and Section 7 contains a thorough evaluation of ZEALOUS in comparison with previous work. We conclude with a discussion of related work and other applications.

    2 PRELIMINARIES

In this section we introduce the problem of publishing frequent keywords, queries, clicks, and other items of a search log.

2. Zearch Log Publishing.

    2.1 Search Logs

Search engines such as Bing, Google, or Yahoo log interactions with their users. When a user submits a query and clicks on one or more results, a new entry is added to the search log. Without loss of generality, we assume that a search log has the following schema:

⟨USER-ID, QUERY, TIME, CLICKS⟩,

where a USER-ID identifies a user, a QUERY is a set of keywords, and CLICKS is a list of URLs that the user clicked on. The user-id can be determined in various ways; for example, through cookies, IP addresses, or user accounts. A user history or search history consists of all search entries from a single user. Such a history is usually partitioned into sessions containing similar queries; how this partitioning is done is orthogonal to the techniques in this paper. A query pair consists of two subsequent queries from the same user that are contained in the same session.

We say that a user history contains a keyword k if there exists a search log entry such that k is a keyword in the query of that entry. A keyword histogram of a search log S records for each keyword k the number of users ck whose search history in S contains k. A keyword histogram is thus a set of pairs (k, ck). We define the query histogram, the query pair histogram, and the click histogram analogously. We classify a keyword, query, consecutive query pair, or click in a histogram as frequent if its count exceeds some predefined threshold τ; when we do not want to specify whether we count keywords, queries, etc., we also refer to these objects as items.
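For concreteness, here is a minimal sketch of this counting in Python (not code from the paper; the log layout is hypothetical). The important detail is that each keyword is counted once per user, not once per occurrence:

    from collections import Counter

    def keyword_histogram(search_log):
        # For each keyword k, count the number of distinct users whose
        # search history contains k.
        users_per_keyword = {}
        for user_id, query, time, clicks in search_log:
            for keyword in query:  # a query is a set of keywords
                users_per_keyword.setdefault(keyword, set()).add(user_id)
        return Counter({k: len(u) for k, u in users_per_keyword.items()})

    log = [("u1", {"honda", "accord"}, 0, []),
           ("u2", {"honda", "civic"}, 1, []),
           ("u1", {"honda"}, 2, [])]
    print(keyword_histogram(log))  # {'honda': 2, 'accord': 1, 'civic': 1}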

With this terminology, we can define our goal as publishing frequent items (utility) without disclosing sensitive information about the users (privacy). We will make both the notion of utility and the notion of privacy more formal in the next sections.

2.2 Disclosure Limitations for Publishing Search Logs

A simple type of disclosure is the identification of a particular user's search history (or parts of the history) in the published search log. The concept of k-anonymity has been introduced to avoid such identifications.

Definition 1 (k-anonymity [23]): A search log is k-anonymous if the search history of every individual is indistinguishable from the history of at least k − 1 other individuals in the published search log.

There are several proposals in the literature to achieve different variants of k-anonymity for search logs. Adar proposes to partition the search log into sessions and then to discard queries that are associated with fewer than k different user-ids. In each session the user-id is then replaced by a random number [1]. We call the output of Adar's Algorithm a k-query anonymous search log.


Motwani and Nabar add or delete keywords from sessions until each session contains the same keywords as at least k − 1 other sessions in the search log [21], followed by a replacement of the user-id by a random number. We call the output of this algorithm a k-session anonymous search log. He and Naughton generalize keywords by taking their prefix until each keyword is part of at least k search histories and publish a histogram of the partially generalized keywords [12]. We call the output a k-keyword anonymous search log. Efficient ways to anonymize a search log are also discussed in work by Yuan et al. [13].

Stronger disclosure limitations try to limit what an attacker can learn about a user. Differential privacy guarantees that an attacker learns roughly the same information about a user whether or not the search history of that user was included in the search log [9]. Differential privacy has previously been applied to contingency tables [3], learning problems [5], [16], synthetic data generation of commuting patterns [20], and more.

Definition 2 (ε-differential privacy [9]): An algorithm A is ε-differentially private if for all search logs S and S′ differing in the search history of a single user and for all output search logs O:

Pr[A(S) = O] ≤ e^ε · Pr[A(S′) = O].

This definition ensures that the output of the algorithm is insensitive to changing or omitting the complete search history of a single user. We will refer to search logs that differ only in the search history of a single user as neighboring search logs. Note that, similar to the variants of k-anonymity, we could also define variants of differential privacy by looking at neighboring search logs that differ only in the content of one session, one query, or one keyword. However, we chose to focus on the strongest definition, in which an attacker learns roughly the same about a user even if that user's whole search history was omitted.
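As an illustration of Definition 2 (a sketch, not code from the paper): for a single counting query to which each user contributes at most once, the count changes by at most 1 between neighboring inputs, so the textbook Laplace mechanism with noise scale 1/ε is ε-differentially private; the densities at any output differ by a factor of at most e^ε:

    import random

    def laplace_noise(scale):
        # Lap(scale): exponential magnitude with a random sign.
        mag = random.expovariate(1.0 / scale)
        return mag if random.random() < 0.5 else -mag

    def release_count(count, epsilon):
        # Assumes a single user changes the count by at most 1
        # (sensitivity 1); Lap(1/epsilon) noise then gives epsilon-DP.
        return count + laplace_noise(1.0 / epsilon)

    print(release_count(1000, epsilon=1.0))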

Differential privacy is a very strong guarantee, and in some cases it can be too strong to be practically achievable. We will review two relaxations that have been proposed in the literature. Machanavajjhala et al. proposed the following probabilistic version of differential privacy.

Definition 3 (probabilistic differential privacy [20]): An algorithm A satisfies (ε, δ)-probabilistic differential privacy if for all search logs S we can divide the output space Ω into two sets Ω1, Ω2 such that

(1) Pr[A(S) ∈ Ω2] ≤ δ, and

for all neighboring search logs S′ and for all O ∈ Ω1:

(2) e^(−ε) · Pr[A(S′) = O] ≤ Pr[A(S) = O] ≤ e^ε · Pr[A(S′) = O].

This definition guarantees that algorithm A achieves ε-differential privacy with high probability (at least 1 − δ). The set Ω2 contains all outputs that are considered privacy breaches according to ε-differential privacy; the probability of such an output is bounded by δ.

The following relaxation has been proposed by Dwork et al. [8].

Definition 4 (indistinguishability [8]): An algorithm A is (ε, δ)-indistinguishable if for all search logs S, S′ differing in one user history and for all subsets O of the output space Ω:

Pr[A(S) ∈ O] ≤ e^ε · Pr[A(S′) ∈ O] + δ.

We will compare these two definitions in Section 5. In particular, we will show that probabilistic differential privacy implies indistinguishability, but the converse does not hold: We show that there exists an algorithm that is (ε, δ)-indistinguishable yet not (ε, δ)-probabilistic differentially private for any ε and any δ < 1, thus showing that (ε, δ)-probabilistic differential privacy is clearly stronger than (ε, δ)-indistinguishability.

2.3 Utility Measures

We will compare the utility of algorithms producing sanitized search logs both theoretically and experimentally.

    2.3.1 Theoretical Utility Measures

For simplicity, suppose we want to publish all items (such as keywords, queries, etc.) with frequency at least τ̃ in a search log; we call such items frequent items, and we call all other items infrequent items. Consider a discrete domain of items D. Each user contributes a set of these items to a search log S. We denote by fd(S) the frequency of item d ∈ D in search log S. We drop the dependency on S when it is clear from the context.

We define the inaccuracy of a (randomized) algorithm as the expected number of items it gets wrong, i.e., the number of frequent items that are not included in the output, plus the number of infrequent items that are included in the output. We do not expect an algorithm to be perfect. It may make mistakes for items with frequency very close to τ̃, and thus we do not take these items into account in our notion of accuracy. We formalize this slack by a parameter ξ, and given ξ, we introduce the following new notions. We call an item d with frequency fd ≥ τ̃ + ξ a very-frequent item and an item d with frequency fd ≤ τ̃ − ξ a very-infrequent item. We will then measure the inaccuracy of an algorithm only through its inability to retain the very-frequent items and its inability to filter out the very-infrequent items.

Definition 5 ((A, S)-inaccuracy): Given an algorithm A and an input search log S, the (A, S)-inaccuracy with slack ξ is defined as

E[ |{d ∈ A(S) : fd(S) < τ̃ − ξ} ∪ {d ∉ A(S) : fd(S) > τ̃ + ξ}| ].

The expectation is taken over the randomness of the algorithm. As an example, consider the simple algorithm that always outputs the empty set; we call this algorithm the baseline algorithm.


On input S, the baseline algorithm has an inaccuracy equal to the number of items with frequency at least τ̃ + ξ.
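In code, this measure for a single run of an algorithm looks as follows (a sketch with hypothetical frequencies; averaging over many runs estimates the expectation):

    def inaccuracy(published, freq, tau_tilde, xi):
        # Definition-5 errors: very-infrequent items that were published
        # plus very-frequent items that were missed; items within +/- xi
        # of tau_tilde are not charged.
        published = set(published)
        false_pos = sum(1 for d in published if freq.get(d, 0) < tau_tilde - xi)
        false_neg = sum(1 for d, c in freq.items()
                        if c > tau_tilde + xi and d not in published)
        return false_pos + false_neg

    freq = {"honda": 60, "car": 58, "mp3": 2}  # hypothetical item frequencies
    print(inaccuracy({"honda"}, freq, tau_tilde=50, xi=5))  # 1: misses "car"
    print(inaccuracy(set(), freq, tau_tilde=50, xi=5))      # baseline: 2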

For the results in the next sections it will be useful to distinguish the error of an algorithm on the very-frequent items and its error on the very-infrequent items. We can rewrite the inaccuracy as:

Σ_{d : fd(S) > τ̃+ξ} (1 − Pr[d ∈ A(S)]) + Σ_{d ∈ D : fd(S) < τ̃−ξ} Pr[d ∈ A(S)].


Lemma 1 follows directly from the definition of ε-differential privacy. Based on Lemma 1, we have the following theorem, which shows that any ε-differentially private algorithm that is accurate for the very-frequent items must be inaccurate for the very-infrequent items. The rationale is that, if given a search log S, an algorithm outputs one very-frequent item d in S, then even if the input to the algorithm is a search log where d is very-infrequent, the algorithm should still output d with a certain probability; otherwise, the algorithm cannot be differentially private.

Theorem 1: Consider an accuracy constant c, a threshold τ̃, a slack ξ, and a very large domain D of size

|D| ≥ 2Um · e^(2ε(τ̃+ξ)/m) / (c(τ̃+ξ)) + Um/(τ̃−ξ+1),

where m denotes the maximum number of items that a user may have in a search log. Let A be an ε-differentially private algorithm that is c-accurate (according to Definition 6) for the very-frequent items. Then, for any input search log, the inaccuracy of A is greater than the inaccuracy of an algorithm that always outputs an empty set.

Proof: Consider an ε-differentially private algorithm A that is c-accurate for the very-frequent items. Fix some input S. We are going to show that for each very-infrequent item d in S the probability of outputting d is at least c/e^(2ε(τ̃+ξ)/m). For each item d ∈ D construct Sd from S by changing τ̃+ξ of the items to d. That way d is very-frequent in Sd (with frequency at least τ̃+ξ) and L1(S, Sd) ≤ 2(τ̃+ξ). By Definition 6, we have that

Pr[d ∈ A(Sd)] ≥ c.

By Lemma 1 it follows that the probability of outputting d is at least c/e^(2ε(τ̃+ξ)/m) for any input database. This means that we can compute a lower bound on the inability to filter out the very-infrequent items in S by summing up this probability over all possible values d ∈ D that are very-infrequent in S. Note that there are at least |D| − Um/(τ̃−ξ+1) very-infrequent items in S. Therefore, the inability to filter out the very-infrequent items is at least

(|D| − Um/(τ̃−ξ+1)) · c/e^(2ε(τ̃+ξ)/m).

For large domains of size at least 2Um · e^(2ε(τ̃+ξ)/m)/(c(τ̃+ξ)) + Um/(τ̃−ξ+1), the inaccuracy is at least 2Um/(τ̃+ξ), which is greater than the inaccuracy of the baseline.

To illustrate Theorem 1, let us consider a search log S where each query contains at most 3 keywords selected from a limited vocabulary of 900,000 words. Let D be the domain of the consecutive query pairs in S. We have |D| ≈ 5.3 × 10^35. Consider the following setting of the parameters: τ̃ + ξ = 50, m = 10, U = 1,000,000, ε = 1, which is typical in practice. By Theorem 1, if an ε-differentially private algorithm A is 0.01-accurate for very-frequent query pairs, then, in terms of overall inaccuracy (for both very-frequent and very-infrequent query pairs), A must be inferior to an algorithm that always outputs an empty set. In other words, no differentially private algorithm can be accurate for both very-frequent and very-infrequent query pairs.
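The arithmetic of this example can be checked directly. The sketch below uses the domain-size bound as reconstructed in Theorem 1 above, bounding the second term crudely by Um (since τ̃ − ξ + 1 ≥ 1):

    import math

    U, m, eps, c = 1_000_000, 10, 1.0, 0.01
    f = 50  # tau_tilde + xi from the example

    # Ordered query pairs over queries of up to 3 keywords drawn from a
    # 900,000-word vocabulary.
    D = (900_000 ** 3) ** 2
    print(f"|D|   = {float(D):.1e}")  # ~5.3e35, as in the text

    # Reconstructed Theorem-1 domain-size bound; U*m upper-bounds the
    # Um/(tau_tilde - xi + 1) term.
    bound = 2 * U * m * math.exp(2 * eps * f / m) / (c * f) + U * m
    print(f"bound = {bound:.1e}")     # ~8.8e11
    print(D > bound)                  # True: the theorem applies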

    4 ACHIEVING PRIVACY

In this section, we introduce a search log publishing algorithm called ZEALOUS that has been independently developed by Korolova et al. [17] and us [10]. ZEALOUS ensures probabilistic differential privacy, and it follows a simple two-phase framework. In the first phase, ZEALOUS generates a histogram of items in the input search log, and then removes from the histogram the items with frequencies below a threshold. In the second phase, ZEALOUS adds noise to the histogram counts, and eliminates the items whose noisy frequencies are smaller than another threshold. The resulting histogram (referred to as the sanitized histogram) is then returned as the output. Figure 1 depicts the steps of ZEALOUS.

Algorithm ZEALOUS for Publishing Frequent Items of a Search Log
Input: Search log S, positive numbers m, λ, τ, τ′

1. For each user u select a set su of up to m distinct items from u's search history in S.³
2. Based on the selected items, create a histogram consisting of pairs (k, ck), where k denotes an item and ck denotes the number of users u that have k in their search history su. We call this histogram the original histogram.
3. Delete from the histogram the pairs (k, ck) with count ck smaller than τ.
4. For each pair (k, ck) in the histogram, sample a random number ηk from the Laplace distribution Lap(λ),⁴ and add ηk to the count ck, resulting in a noisy count: c̃k ← ck + ηk.
5. Delete from the histogram the pairs (k, c̃k) with noisy counts c̃k ≤ τ′.
6. Publish the remaining items and their noisy counts.

To understand the purpose of the various steps one has to keep in mind the privacy guarantee we would like to achieve. Steps 1, 2, and 4 of the algorithm are fairly standard: It is known that adding Laplace noise to histogram counts achieves ε-differential privacy [9]. However, the previous section explained that these steps alone result in poor utility, because for large domains many infrequent items will have high noisy counts. To deal better with large domains, we restrict the histogram to items with counts at least τ in Step 3. This restriction leaks information, and thus the output after Step 4 is not ε-differentially private; one can show that it is not even (ε, δ)-probabilistic differentially private (for δ < 1/2). Step 5 disguises the information leaked in Step 3 in order to achieve probabilistic differential privacy.
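The steps above translate directly into code. The following is a minimal sketch (not the paper's implementation); the per-user selection is simple truncation, one of the data-independent choices that footnote 3 allows:

    import random
    from collections import Counter

    def laplace_noise(scale):
        # Lap(scale): exponential magnitude with a random sign.
        mag = random.expovariate(1.0 / scale)
        return mag if random.random() < 0.5 else -mag

    def zealous(histories, m, lam, tau, tau_prime):
        # histories: {user: iterable of items (keywords, queries, ...)}
        # Steps 1-2: up to m distinct items per user; count users per item.
        hist = Counter()
        for items in histories.values():
            for item in list(dict.fromkeys(items))[:m]:  # data-independent
                hist[item] += 1
        # Step 3: keep only counts of at least tau.
        # Step 4: add Lap(lam) noise to each remaining count.
        noisy = {k: c + laplace_noise(lam) for k, c in hist.items() if c >= tau}
        # Steps 5-6: publish noisy counts that exceed tau_prime.
        return {k: c for k, c in noisy.items() if c > tau_prime}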

In what follows, we will investigate the theoretical performance of ZEALOUS in terms of both privacy and utility. Sections 4.1 and 4.2 discuss the privacy guarantees of ZEALOUS with respect to (ε, δ)-indistinguishability and (ε, δ)-probabilistic differential privacy, respectively. Section 4.3 presents a quantitative analysis of the privacy protection offered by ZEALOUS. Sections 4.4 and 4.5 analyze the utility guarantees of ZEALOUS.

3. These items can be selected in various ways as long as the selection criterion is not based on the data. Random selection is one candidate.

4. The Laplace distribution with scale parameter λ has the probability density function f(x) = (1/(2λ)) · e^(−|x|/λ).


Fig. 1. Privacy-Preserving Algorithm. (The figure traces ZEALOUS on an example search log: up to m items are selected per user, aggregated into a histogram, filtered at threshold τ, perturbed with added noise, and filtered again at threshold τ′ before publishing.)


    4.1 Indistinguishability Analysis

Theorem 2 states how the parameters of ZEALOUS can be set to obtain a sanitized histogram that provides (ε, δ)-indistinguishability.

Theorem 2: [17] Given a search log S and positive numbers m, λ, τ, and τ′, ZEALOUS achieves (ε, δ)-indistinguishability if

λ ≥ 2m/ε, and (1)
τ = 1, and (2)
τ′ ≥ m(1 − (λ/m) ln(2δ/m)). (3)

To publish not only frequent queries but also their clicks, Korolova et al. [17] suggest to first determine the frequent queries and then publish noisy counts of the clicks to their top-100 ranked documents. In particular, if we use ZEALOUS to publish frequent queries in a manner that achieves (ε, δ)-indistinguishability, we can also publish the noisy click distributions of the top-100 ranked documents for each of the frequent queries, by simply adding Laplace noise with scale 2m/ε to the click counts. Together, the sanitized query and click histograms achieve (2ε, δ)-indistinguishability.
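A sketch of that second release step (hypothetical data; the query set would be the frequent queries released by ZEALOUS):

    import random

    def laplace_noise(scale):
        mag = random.expovariate(1.0 / scale)
        return mag if random.random() < 0.5 else -mag

    def noisy_click_histogram(clicks_per_query, m, epsilon):
        # clicks_per_query: {query: click counts, one per top-100 document}
        scale = 2.0 * m / epsilon  # same noise scale as the query release
        return {q: [c + laplace_noise(scale) for c in counts]
                for q, counts in clicks_per_query.items()}

    print(noisy_click_histogram({"honda accord": [50, 12, 3]}, m=5, epsilon=1.0))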

4.2 Probabilistic Differential Privacy Analysis

Given values for ε, δ, and m, the following theorem tells us how to set the parameters λ and τ′ to ensure that ZEALOUS achieves (ε, δ)-probabilistic differential privacy.

TABLE 1
(ε, δ)-indistinguishability vs. (ε, δ)-probabilistic differential privacy: values of δ achieved by ZEALOUS for U = 500k, m = 5.

Privacy guarantee    τ′ = 100                 τ′ = 200
λ = 1 (ε = 10):      δ_prob = 1.3 × 10^−37    δ_prob = 4.7 × 10^−81
                     δ_ind  = 1.4 × 10^−41    δ_ind  = 5.2 × 10^−85
λ = 5 (ε = 2):       δ_prob = 3.2 × 10^−3     δ_prob = 6.5 × 10^−12
                     δ_ind  = 1.4 × 10^−8     δ_ind  = 2.9 × 10^−17

Theorem 3: Given a search log S and positive numbers m, λ, τ, and τ′, ZEALOUS achieves (ε, δ)-probabilistic differential privacy if

λ ≥ 2m/ε, and (4)
τ′ ≥ τ + λ · max{ ln((2 − 2e^(−1/λ))/δ), ln(Um/(2δτ)) }, (5)

where U denotes the number of users in S.

    The proof of Theorem 3 can be found in Appendix A.1.

4.3 Quantitative Comparison of Prob. Diff. Privacy and Indistinguishability for ZEALOUS

In Table 1, we illustrate the levels of (ε, δ)-indistinguishability and (ε, δ)-probabilistic differential privacy achieved by ZEALOUS for various noise and threshold parameters. We fix the number of users to U = 500k and the maximum number of items from a user to m = 5, which is a typical setting that will be explored in our experiments. Table 1 shows the tradeoff between utility and privacy: A larger λ results in a greater amount of noise in the sanitized search log (i.e., decreased data utility), but it also leads to smaller ε and δ (i.e., a stronger privacy guarantee). Similarly, when τ′ increases, the sanitized search log provides less utility (since fewer items are published) but a higher level of privacy protection (as both values of δ decrease).

Interestingly, given λ and τ′, we always have δ_prob > δ_ind. This is due to the fact that (ε, δ)-probabilistic differential privacy is a stronger privacy guarantee than (ε, δ)-indistinguishability, as will be discussed in Section 5.
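Solving the threshold conditions of Theorems 2 and 3 (as reconstructed above) for δ makes the entries of Table 1 easy to recompute; a sketch with τ = 1:

    import math

    def delta_ind(lam, tau_prime, m):
        # Invert condition (3): tau' = m(1 - (lam/m) ln(2 delta / m)).
        return (m / 2.0) * math.exp((m - tau_prime) / lam)

    def delta_prob(lam, tau_prime, m, U, tau=1):
        # Invert condition (5); the larger branch is the binding one.
        decay = math.exp(-(tau_prime - tau) / lam)
        return max((2 - 2 * math.exp(-1.0 / lam)) * decay,
                   U * m / (2.0 * tau) * decay)

    U, m = 500_000, 5
    for lam in (1, 5):
        for tp in (100, 200):
            print(f"lam={lam} tau'={tp}: "
                  f"d_prob={delta_prob(lam, tp, m, U):.1e} "
                  f"d_ind={delta_ind(lam, tp, m):.1e}")
    # Reproduces Table 1 up to rounding in the last digit.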

4.4 Utility Analysis

Next, we analyze the utility guarantee of ZEALOUS in terms of its accuracy (as defined in Section 2.3.1).

Theorem 4: Given parameters τ = τ′ and τ̃ = τ′ + ξ, noise scale λ, and a search log S, the inaccuracy of ZEALOUS with slack ξ equals

Σ_{d : fd(S) > τ̃+ξ} (1/2) · e^(−2ξ/λ) + Σ_{d ∈ D : fd(S) < τ̃−ξ} 0.


Proof: It is easy to see that the accuracy of ZEALOUS in filtering out the very-infrequent items is perfect, since these items are already removed in the first filtering step. Moreover, the probability of outputting a very-frequent item is at least

1 − (1/2) · e^(−2ξ/λ),

which is the probability that the Lap(λ)-distributed noise that is added to the count is at least −2ξ, so that a very-frequent item with count at least τ̃ + ξ remains in the output of the algorithm. This probability is at least 1/2. All in all, ZEALOUS has higher accuracy than the baseline algorithm on all inputs with at least one very-frequent item.

    4.5 Separation Result

Combining the analysis in Sections 3.2 and 4.4, we obtain the following separation result between ε-differential privacy and (ε, δ)-probabilistic differential privacy.

Theorem 5 (Separation Result): Our (ε, δ)-probabilistic differentially private algorithm ZEALOUS is able to retain frequent items with probability at least 1/2 while filtering out all very-infrequent items. On the other hand, for any ε-differentially private algorithm that can retain frequent items with non-zero probability (independent of the input database), its inaccuracy for large item domains is larger than that of an algorithm that always outputs an empty set.

5 COMPARING INDISTINGUISHABILITY WITH PROBABILISTIC DIFFERENTIAL PRIVACY

In this section we study the relationship between (ε, δ)-probabilistic differential privacy and (ε, δ)-indistinguishability. First we will prove that probabilistic differential privacy implies indistinguishability. Then we will show that the converse is not true: There exists an algorithm that is (ε, δ)-indistinguishable yet blatantly non-ε-differentially private (and also not (ε, δ)-probabilistic differentially private for any value of ε and any δ < 1). This fact might convince a data publisher to strongly prefer an algorithm that achieves (ε, δ)-probabilistic differential privacy over one that is only known to achieve (ε, δ)-indistinguishability. It also might convince researchers to analyze the probabilistic privacy guarantees of algorithms that are so far only known to be indistinguishable, as in [8] or [22].

First we show that our definition implies (ε, δ)-indistinguishability.

Proposition 1: If an algorithm A is (ε, δ)-probabilistic differentially private, then it is also (ε, δ)-indistinguishable.

The proof of Proposition 1 can be found in Appendix A.2. The converse of Proposition 1 does not hold. In particular, there exists an algorithm that is (ε, δ)-indistinguishable but not (ε, δ)-probabilistic differentially private for any choice of ε and δ < 1, as illustrated in the following example.

Example 1: Consider the following algorithm that takes as input a search log S with the search histories of U users.

Algorithm A
Input: Search log S ∈ D^U
1. Sample uniformly at random a single search history from the set of all histories, excluding the first user's search history.
2. Return this search history.

The following proposition analyzes the privacy of Algorithm A.

Proposition 2: For any finite domain of search histories D, Algorithm A is (ε, 1/(|D| − 1))-indistinguishable for all ε > 0 on inputs from D^U.

The proof can be found in Appendix A.3. The next proposition shows that every single output of the algorithm constitutes a privacy breach.

Proposition 3: For any search log S, the output of Algorithm A constitutes a privacy breach according to ε-differential privacy for any value of ε.

Proof: Fix an input S and an output O that is different from the search history of the first user. Consider the input S′ differing from S only in the first user history, where S′1 = O. Here

1/(|D| − 1) = Pr[A(S) = O] > e^ε · Pr[A(S′) = O] = 0.

Thus the output O breaches the privacy of the first user according to ε-differential privacy.

Corollary 1: Algorithm A is (ε, 1/(|D| − 1))-indistinguishable for all ε > 0. But it is not (ε, δ)-probabilistic differentially private for any ε and any δ < 1.

By Corollary 1, an algorithm that is (ε, δ)-indistinguishable may not achieve any form of (ε, δ)-probabilistic differential privacy, even if δ is set to an extremely small value of 1/(|D| − 1). This illustrates the significant gap between (ε, δ)-indistinguishability and (ε, δ)-probabilistic differential privacy.

    6 CHOOSING PARAMETERS

Apart from the privacy parameters ε and δ, ZEALOUS requires the data publisher to specify two more parameters: τ, the first threshold used to eliminate keywords with low counts (Step 3), and m, the number of contributions per user. These parameters affect both the noise added to each count and the second threshold τ′. Before we discuss the choice of these parameters, we explain the general setup of our experiments.

Data. In our experiments we work with a search log of user queries from a major search engine collected from 500,000 users over a period of one month. This search log contains about one million distinct keywords, three million distinct queries, three million distinct query pairs, and 4.5 million distinct clicks.


Fig. 2. Effect on statistics by varying m for several values of top-j. (Four columns of panels: (a) Keywords, (b) Queries, (c) Clicks, (d) Query Pairs. The rows plot, as functions of m (contributions per user, 0 to 40): average L-1 distance, KL-divergence, number of zeroes, and fraction of zeroes, with one curve per top-j value.)

TABLE 2
τ′ as a function of τ for m = 2, ε = 1, δ = 0.001.

τ:   1        3        4        5        7        9
τ′:  81.1205  78.7260  78.5753  78.6827  79.3368  80.3316

Privacy Parameters. In all experiments we set δ = 0.001. Thus the probability that the output of ZEALOUS could breach the privacy of any user is very small. We explore different levels of (ε, δ)-probabilistic differential privacy by varying ε.

6.1 Choosing Threshold τ

We would like to retain as much information as possible in the published search log. A smaller value for τ′ immediately leads to a histogram with higher utility, because fewer items and their noisy counts are filtered out in the last step of ZEALOUS. Thus if we choose τ in a way that minimizes τ′, we maximize the utility of the resulting histogram. Interestingly, choosing τ = 1 does not necessarily minimize the value of τ′. Table 2 presents the value of τ′ for different values of τ for m = 2 and ε = 1; it shows that for our parameter settings τ′ is minimized when τ = 4. We can show the following optimality result, which tells us how to choose τ optimally in order to maximize utility.

Proposition 4: For fixed ε, δ, and m, choosing τ = 2m/ε minimizes the value of τ′.

The proof follows from taking the derivative of τ′ as a function of τ (based on Equation (5)) to determine its minimum.
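Using condition (5) as reconstructed above, a few lines of code reproduce Table 2 and confirm the minimum at τ = 2m/ε (a sketch; δ = 0.001 as in our experimental setup):

    import math

    def tau_prime(tau, m, eps, delta, U):
        lam = 2.0 * m / eps  # smallest noise scale allowed by condition (4)
        return tau + lam * max(math.log((2 - 2 * math.exp(-1.0 / lam)) / delta),
                               math.log(U * m / (2.0 * delta * tau)))

    m, eps, delta, U = 2, 1.0, 0.001, 500_000
    for tau in (1, 3, 4, 5, 7, 9):
        print(tau, round(tau_prime(tau, m, eps, delta, U), 4))
    # tau = 4 = 2m/eps yields the smallest tau' (78.5753), matching Table 2.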

6.2 Choosing the Number of Contributions m

Proposition 4 tells us how to set τ in order to maximize utility. Next we will discuss how to set m optimally. We will do so by studying the effect of varying m on the coverage and the precision of the top-j most frequent items in the sanitized histogram. The top-j coverage of a sanitized search log is defined as the fraction of distinct items among the top-j most frequent items in the original search log that also appear in the sanitized search log. The top-j precision of a sanitized search log is defined as the distance between the relative frequencies in the original search log versus the sanitized search log for the top-j most frequent items. In particular, we study two distance metrics between the relative frequencies: the average L-1 distance and the KL-divergence.


Fig. 3. Effect on statistics of varying j in top-j for different values of m. (Four columns of panels: (a) Keywords, (b) Queries, (c) Clicks, (d) Query Pairs. The rows plot, as functions of the top-j parameter: KL-divergence, number of zeroes, and fraction of zeroes, with one curve each for m = 1, 2, 4, 6, 10, 40.)

TABLE 3
(A) Distinct item counts with different m.
m            1     4     8     20    40
keywords     6667  6043  5372  4062  2964
queries      3334  2087  1440  751   408
clicks       2813  1576  1001  486   246
query pairs  331   169   100   40    13

(B) Total item counts (×10^3) with different m.
m            1     4     8     20    40
keywords     329   1157  1894  3106  3871
queries      147   314   402   464   439
clicks       118   234   286   317   290
query pairs  8     14    15    12    7

As a first study of coverage, Table 3 shows the number of distinct items (recall that items can be keywords, queries, query pairs, or clicks) in the sanitized search log as m increases. We observe that coverage decreases as we increase m. Moreover, the decrease in the number of published items is more dramatic for larger domains than for smaller domains: the number of distinct keywords decreases by 55%, while at the same time the number of distinct query pairs decreases by 96%, as we increase m from 1 to 40.

TABLE 4
Average number of items per user in the original search log.
                keywords  queries  clicks  query pairs
avg items/user  56        20       14      7

This trend has two reasons. First, from Theorem 3 and Proposition 4 we see that threshold τ′ increases super-linearly in m. Second, as m increases, the number of keywords contributed by the users increases only sub-linearly in m: fewer users are able to supply m items for increasing values of m. Hence, fewer items pass the threshold as m increases. The reduction is larger for query pairs than for keywords, because the average number of query pairs per user is smaller than the average number of keywords per user in the original search log (shown in Table 4).

To understand how m affects precision, we measure the total sum of the counts in the sanitized histogram as we increase m in Table 3. Higher total counts offer the possibility to match the original distribution at a finer granularity. We observe that as we increase m, the total counts increase until a tipping point is reached, after which they start decreasing again. This effect is as expected, for the following reason: As m increases, each user contributes more items, which leads to higher counts in the sanitized histogram. However, the total count increases only sub-linearly with m (and even


decreases) due to the reduction in coverage we discussed above. We found that the tipping point where the total count starts to decrease corresponds approximately to the average number of items contributed by each user in the original search log (shown in Table 4). This suggests that we should choose m to be smaller than the average number of items, because this offers better coverage and higher total counts, and reduces the noise compared to higher values of m.

Let us take a closer look at the precision and coverage of the histograms of the various domains in Figures 2 and 3. In Figure 2 we vary m between 1 and 40. Each curve plots the precision or coverage of the sanitized search log at various values of the top-j parameter in comparison to the original search log. We vary the top-j parameter but never choose it higher than the number of distinct items in the original search log for the various domains. The first two rows plot precision curves for the average L-1 distance (first row) and the KL-divergence (second row) of the relative frequencies. The lower two rows plot the coverage curves, i.e., the total number of top-j items (third row) and the relative number of top-j items (fourth row) in the original search log that do not appear in the sanitized search log. First, observe that the coverage decreases as m increases, which confirms our discussion about the number of distinct items. Moreover, we see that the coverage gets worse for increasing values of the top-j parameter. This illustrates that ZEALOUS gives better utility for the more frequent items. Second, note that for small values of the top-j parameter, values of m > 1 give better precision. However, when the top-j parameter is increased, m = 1 gives better precision, because the precision of the top-j values degrades as items no longer appear in the sanitized search log due to the increased cutoffs.

Figure 3 shows the same statistics, varying the top-j parameter on the x-axis. Each curve plots the precision for one value of m. Note that m = 1 does not always give the best precision; for keywords, m = 8 has the lowest KL-divergence, and for queries, m = 2 has the lowest KL-divergence. As we can see from these results, there are two regimes for setting the value of m. If we are mainly interested in coverage, then m should be set to 1. However, if we are only interested in a few top-j items, then we can increase precision by choosing a larger value for m; in this case we recommend the average number of items per user.

We will see this dichotomy again in our real applications of search log analysis: The index caching application does not require high coverage because of its storage restriction; however, high precision of the top-j most frequent items is necessary to determine which of them to keep in memory. On the other hand, in order to generate many query substitutions, a larger number of distinct queries and query pairs is required. Thus m should be set to a large value for index caching and to a small value for query substitution.

7 APPLICATION-ORIENTED EVALUATION

In this section we show the results of an application-oriented evaluation of ZEALOUS, with a k-anonymous search log and the original search log as points of comparison. Note that our utility evaluation does not determine the better algorithm, since when choosing an algorithm in practice one has to consider both the utility and the disclosure limitation guarantees of an algorithm. Our results show the price that we have to pay (in terms of decreased utility) when we give the stronger guarantees of (probabilistic versions of) differential privacy as opposed to k-anonymity.

Algorithms. We experimentally compare the utility of ZEALOUS against a representative k-anonymity algorithm by Adar for publishing search logs [1]. Recall that Adar's Algorithm creates a k-query anonymous search log as follows: First, all queries that are posed by fewer than k distinct users are eliminated; then histograms of keywords, queries, and query pairs are computed from the k-query anonymous search log. ZEALOUS can be used to achieve (ε, δ)-indistinguishability as well as (ε, δ)-probabilistic differential privacy. For ease of presentation we only show results for probabilistic differential privacy; using Theorems 2 and 3 it is straightforward to compute the corresponding indistinguishability guarantee. For brevity, we refer to the (ε, δ)-probabilistic differentially private algorithm as "Differential" in the figures.

Evaluation Metrics. We evaluate the performance of the algorithms in two ways. First, we measure how well the output of the algorithms preserves selected statistics of the original search log. Second, we pick two real applications from the information retrieval community to evaluate the utility of ZEALOUS: index caching as a representative application for search performance, and query substitution as a representative application for search quality. Evaluating the output of ZEALOUS with these two applications will help us to fully understand the performance of ZEALOUS in an application context. We first describe our utility evaluation with statistics in Section 7.1 and then our evaluation with real applications in Sections 7.2 and 7.3.

7.1 General Statistics

We explore different statistics that measure the difference between the sanitized histograms and the histograms computed from the original search log. We analyze the histograms of keywords, queries, and query pairs for both sanitization methods. For clicks we only consider ZEALOUS histograms, since a k-query anonymous search log is not designed to publish click data.

In our first experiment we compare the distribution of the counts in the histograms. Note that a k-query anonymous search log will never have query and keyword


counts below k, and similarly a ZEALOUS histogram will never have counts below τ′. We choose ε = 5 and m = 1, for which threshold τ′ ≈ 10; we therefore deliberately set k = 10, such that k ≈ τ′.

Figure 4 shows the distribution of the counts in the histograms on a log-log scale. Recall that the k-query anonymous search log does not contain any click data, and thus it does not appear in Figure 4(c). We see that the power-law shape of the distribution is well preserved. However, the total frequencies are lower for the sanitized search logs than for the original histogram, because the algorithms filter out a large number of items. We also see the cutoffs created by k and τ′. We observe that as the domain increases from keywords to clicks and query pairs, the number of items that are not frequent in the original search log increases. For example, the number of clicks with count equal to one is an order of magnitude larger than the number of keywords with count equal to one.

While the shape of the count distribution is well preserved, we would also like to know whether the counts of frequent keywords, queries, query pairs, and clicks are also preserved, and what impact the privacy parameters and the anonymity parameter k have. Figure 5 shows the average differences to the counts in the original histogram. We scaled up the counts in the sanitized histograms by a common factor so that their total counts were equal to the total counts of the original histogram; then we calculated the average difference between the counts. The average is taken over all keywords that have non-zero count in the original search log. As such, this metric takes both coverage and precision into account.
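In code, the metric just described amounts to the following (a sketch with hypothetical histograms):

    def avg_scaled_difference(original, sanitized):
        # Scale sanitized counts so the totals match, then average the
        # absolute count difference over all items with non-zero original
        # count (missing items count with a sanitized count of 0).
        scale = sum(original.values()) / sum(sanitized.values())
        return sum(abs(original[k] - scale * sanitized.get(k, 0))
                   for k in original) / len(original)

    orig = {"honda": 100, "car": 50, "mp3": 10}
    san = {"honda": 95, "car": 40}
    print(avg_scaled_difference(orig, san))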

As expected, with increasing ε the average difference decreases, since the noise added to each count decreases. Similarly, by decreasing k the accuracy increases, because more queries will pass the threshold. Figure 5 shows that the average difference is comparable for the k-anonymous histogram and the output of ZEALOUS. Note that the output of ZEALOUS for keywords is more accurate than a k-anonymous histogram for all values of ε > 2. For queries we obtain roughly the same average difference for k = 60 and ε = 6. For query pairs the k-query anonymous histogram provides better utility.

We also computed other metrics, such as the root-mean-square value of the differences and the total variation difference; they all reveal similar qualitative trends. Despite the fact that ZEALOUS disregards many search log records (by throwing out all but m contributions per user and by throwing out low-frequency counts), ZEALOUS is able to preserve the overall distribution well.

7.2 Index Caching

In the index caching problem, we aim to cache in memory a set of posting lists that maximizes the hit probability over all keywords (see Section 2.3.2). In our experiments, we use an improved version of the algorithm developed by Baeza-Yates to decide which posting lists should be kept in memory [2].

Fig. 6. Hit probabilities. ((a) Hit probability of the original, anonymous, and differential histograms as k varies from 10 to 100 for k-anonymity and ε varies from 1 to 10 for ε-differential privacy. (b) Hit probability of the original histogram and of ZEALOUS with m = 1 and m = 6 as ε varies.)

Our algorithm first assigns each keyword a score, which equals its frequency in the search log divided by the number of documents that contain the keyword. Keywords are then chosen with a greedy bin-packing strategy in which we sequentially add the posting lists of the keywords with the highest scores until the memory is filled. In our experiments we fixed the memory size to 1 GB and each document posting to 8 bytes (other parameters give comparable results). Our inverted index stores the document posting list for each keyword sorted according to relevance, which allows documents to be retrieved in the order of their relevance. We truncate each list kept in memory to contain at most 200,000 documents. For an incoming query, the search engine retrieves the posting list for each keyword in the query either from memory or from disk. If the intersection of the posting lists happens to be empty, then less relevant documents are retrieved from disk for those keywords for which only the truncated posting list is kept in memory.
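A sketch of this greedy strategy (hypothetical inputs; the frequencies would come from the original or sanitized histogram):

    def greedy_cache(freq, postings, memory_bytes, posting_bytes=8):
        # freq: {keyword: frequency in the search log}
        # postings: {keyword: document ids, sorted by relevance}
        # Score by frequency / number of documents containing the keyword,
        # then greedily cache whole posting lists until memory is filled.
        score = {k: freq[k] / len(postings[k]) for k in freq if k in postings}
        cached, used = [], 0
        for k in sorted(score, key=score.get, reverse=True):
            size = len(postings[k]) * posting_bytes
            if used + size <= memory_bytes:
                cached.append(k)
                used += size
        return cached

    print(greedy_cache({"honda": 90, "mp3": 60},
                       {"honda": [1, 2], "mp3": [2, 3, 4]},
                       memory_bytes=32))
    # "honda" (score 45, 16 bytes) is cached; "mp3" (24 bytes) no longer fits.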

Figure 6(a) shows the hit probabilities of the inverted index constructed using the original search log, the k-anonymous search log, and the ZEALOUS histogram (for m = 6) with our greedy approximation algorithm. We observe that our ZEALOUS histogram achieves better utility than the k-query anonymous search log for a range of parameters. We note that the utility suffers only marginally when increasing the privacy parameter or the anonymity parameter (at least in the range that we have considered). This can be explained by the fact that only a few very frequent keywords are required to achieve a high hit probability. Keywords with a big positive impact on the hit probability are less likely to be filtered out by ZEALOUS than keywords with a small positive impact. This explains the marginal decrease in utility for increased privacy.

As a last experiment we study the effect of varying m on the hit probability in Figure 6(b). We observe that the hit probability for m = 6 is above 0.36, whereas the hit probability for m = 1 is less than 0.33. As discussed, a higher value for m increases the accuracy but reduces the coverage. Index caching requires only roughly the top 85 most frequent keywords, which are still covered when setting m = 6. We also experimented with higher values of m and observed that the hit probability decreases at some point.


Fig. 4. Distributions of counts in the histograms, on a log-log scale (frequency vs. count): (a) Keyword Counts, (b) Query Counts, (c) Click Counts, (d) Query Pair Counts, for the original, k-anonymous, and ε-differentially private histograms (no k-anonymous curve for clicks).

[Figure 5: four plots of Average Difference of Counts against k for k-Anonymity and ε for ε-Differential, with Differential and Anonymity curves; panel (c) shows only the Differential curve.]

(a) Keywords (b) Queries (c) Clicks (d) Query Pairs

Fig. 5. Average difference between counts in the original histogram and the probabilistic differential privacy-preserving histogram, and the anonymous histogram, for varying privacy / anonymity parameters ε and k. Parameter m is fixed to 1.


    7.3 Query Substitution

Algorithms for query substitution examine query pairs to learn how users re-phrase queries. We use an algorithm developed by Jones et al. in which related queries for a query are identified in two steps [15]. First, the query is partitioned into subsets of keywords, called phrases, based on their mutual information. Next, for each phrase, candidate query substitutions are determined based on the distribution of queries.

We run this algorithm to generate ranked substitutions on the sanitized search logs. We then compare these rankings with the rankings produced by the original search log, which serve as ground truth. To measure the quality of the query substitutions, we compute the precision, recall, MAP (mean average precision), and NDCG (normalized discounted cumulative gain) of the top-j suggestions for each query; let us define these metrics next.

Consider a query q and its list of top-j ranked substitutions q'_0, ..., q'_{j-1} computed based on a sanitized search log. We compare this ranking against the top-j ranked substitutions q_0, ..., q_{j-1} computed based on the original search log as follows. The precision of a query q is the fraction of substitutions from the sanitized search log that are also contained in our ground-truth ranking:

\[\mathrm{Precision}(q) = \frac{\bigl|\{q'_0, \ldots, q'_{j-1}\} \cap \{q_0, \ldots, q_{j-1}\}\bigr|}{\bigl|\{q'_0, \ldots, q'_{j-1}\}\bigr|}\]

Note that the number of items in the ranking for a query q can be less than j. The recall of a query q is the fraction of substitutions in our ground truth that are contained in the substitutions from the sanitized search log:

\[\mathrm{Recall}(q) = \frac{\bigl|\{q'_0, \ldots, q'_{j-1}\} \cap \{q_0, \ldots, q_{j-1}\}\bigr|}{\bigl|\{q_0, \ldots, q_{j-1}\}\bigr|}\]

MAP measures the precision of the ranked items for a query as the ratio of true rank and assigned rank:

\[\mathrm{MAP}(q) = \sum_{i=0}^{j-1} \frac{i+1}{\mathrm{rank}\bigl(q_i, [q'_0, \ldots, q'_{j-1}]\bigr) + 1},\]

where the rank of q_i in [q'_0, ..., q'_{j-1}] is the index i' such that q'_{i'} = q_i; in case q_i is not contained in the list, the corresponding summand is zero.
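For concreteness, a minimal Python sketch of the first three metrics follows; the handling of items missing from one of the lists, in particular the zero summand in map_score, reflects our reading of the MAP definition above.

```python
# ranked is the top-j list [q'_0, ..., q'_{j-1}] from a sanitized log,
# truth the top-j list [q_0, ..., q_{j-1}] from the original log.

def precision(ranked, truth):
    # Fraction of sanitized substitutions that appear in the ground truth.
    return len(set(ranked) & set(truth)) / len(ranked) if ranked else 0.0

def recall(ranked, truth):
    # Fraction of ground-truth substitutions recovered by the sanitized log.
    return len(set(ranked) & set(truth)) / len(truth) if truth else 0.0

def map_score(ranked, truth):
    # Sum over ground-truth items q_i of (i + 1) / (rank in sanitized + 1);
    # items missing from the sanitized list contribute zero (our reading
    # of the definition above).
    total = 0.0
    for i, q in enumerate(truth):      # i = rank in the ground truth
        if q in ranked:
            total += (i + 1) / (ranked.index(q) + 1)
    return total

# Example: precision(['a', 'c'], ['a', 'b']) == 0.5
```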

Our last metric, NDCG, measures how the relevant substitutions are placed in the ranking list. It not only compares the ranks of a substitution in the two rankings, but also penalizes substitutions that are highly relevant according to [q_0, ..., q_{j-1}] yet have a very low rank in [q'_0, ..., q'_{j-1}]. Moreover, it takes the length of the actual lists into consideration. We refer the reader to the paper by Chakrabarti et al. [7] for details on NDCG.

The discussed metrics compare rankings for one query. To compare the utility of our algorithms, we average over all queries. For coverage we average over all queries for which the original search log produces substitutions. For all other metrics, which try to capture the precision of a ranking, we average only over the queries for which the sanitized search logs produce substitutions. We generated query substitutions only for the 100,000 most frequent queries of the original search log, since the substitution algorithm only works well given enough information about a query.


[Figure 7: four plots of score against k for k-Anonymity and ε for ε-Differential, with curves ε-Diff Top 2, ε-Diff Top 5, k-Anon Top 2, and k-Anon Top 5.]

(a) NDCG (b) MAP (c) Precision (d) Recall

Fig. 7. Quality of the query substitutions of the privacy-preserving histograms and the anonymous search log.

[Figure 8: coverage against ε for ε-Differential, with curves m = 1 Top 2, m = 1 Top 5, m = 6 Top 2, and m = 6 Top 5.]

Fig. 8. Coverage of the privacy-preserving histograms for m = 1 and m = 6.


In Figure 7 we vary k and ε for m = 1, and we draw the utility curves for top-j for j = 2 and j = 5. We observe that varying ε and k has hardly any influence on performance. On all precision measures, ZEALOUS provides utility comparable to k-query-anonymity. However, the coverage provided by ZEALOUS is not good. This is because the computation of query substitutions relies not only on the frequent query pairs but also on the counts of phrase pairs, which record, for two sets of keywords, how often a query containing the first set was followed by another query containing the second set. Thus a phrase pair can have a high frequency even though all query pairs it is contained in have very low frequency. ZEALOUS filters out these low-frequency query pairs and thus loses many frequent phrase pairs.

As a last experiment, we study the effect of increasing m for query substitutions. Figure 8 plots the average coverage of the top-2 and top-5 substitutions produced by ZEALOUS for m = 1 and m = 6 for various values of ε. It is clear that across the board larger values of m lead to smaller coverage, thus confirming the intuition outlined in the previous section.

8 RELATED WORK

Related work on anonymity in search logs [1], [12], [21], [13] is discussed in Section 3.1.

More recently, there has been work on privacy in search logs. Korolova et al. [17] propose the same basic algorithm that we propose in [10] and review in Section 4 (see footnote 5). They show (ε, δ)-indistinguishability of the algorithm, whereas we show (ε, δ)-probabilistic differential privacy of the algorithm, which is a strictly stronger guarantee; see Section 5. One difference is that our algorithm has two thresholds as opposed to one, and we explain how to set the second threshold optimally; Korolova et al. [17] set this threshold to 1 (which is not the optimal choice in many cases). Our experiments augment and extend the experiments of Korolova et al. [17]. We illustrate the tradeoff of setting the number of contributions m for various domains and statistics, including L1 distance and KL divergence, which extends [17] greatly. Our application-oriented evaluation considers different applications. We compare the performance of ZEALOUS to that of k-query-anonymity and observe that the loss in utility is comparable for anonymity and privacy, while anonymity offers a much weaker guarantee.

    9 BEYOND SEARCH LOGS

While the main focus of this paper is search logs, our results apply to other scenarios as well. For example, consider a retailer who collects customer transactions. Each transaction consists of a basket of products together with their prices and a time-stamp. In this case ZEALOUS can be applied to publish frequently purchased products or sets of products. This information can also be used in a recommender system or in a market-basket analysis to decide on the goods and promotions in a store [11]. Another example concerns monitoring the health of patients. Each time a patient sees a doctor, the doctor records the diseases of the patient and the suggested treatment. It would be interesting to publish frequent combinations of diseases.
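To make this concrete, here is a minimal Python sketch of the two-threshold ZEALOUS recipe applied to transaction data. The step structure (each user contributes at most m items, true counts below a first threshold τ are dropped, Laplace noise is added, noisy counts below a second threshold τ′ are dropped) follows the description of ZEALOUS in this paper; the function and parameter names are ours, and the sketch is illustrative rather than a drop-in implementation.

```python
import math
import random
from collections import Counter

def zealous_frequent_items(users, m, tau, lam, tau_prime):
    """Two-threshold ZEALOUS sketch for publishing frequent items.

    users: list of per-user item lists (e.g., purchased products).
    m: max number of distinct items each user contributes.
    tau: first threshold, applied to true counts.
    lam: scale of the Laplace noise.
    tau_prime: second threshold, applied to noisy counts.
    """
    hist = Counter()
    for items in users:
        # Step 1: each user contributes at most m distinct items.
        for item in list(dict.fromkeys(items))[:m]:
            hist[item] += 1
    published = {}
    for item, count in hist.items():
        if count < tau:            # Step 2: drop infrequent items
            continue
        # Step 3: add Laplace(lam) noise (inverse-CDF sampling).
        u = random.random() - 0.5
        noise = -lam * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        noisy = count + noise
        if noisy >= tau_prime:     # Step 4: drop small noisy counts
            published[item] = noisy
    return published
```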

All of our results apply to the more general problem of publishing frequent items, itemsets, or consecutive itemsets. Existing work on publishing frequent itemsets often only tries to achieve anonymity or makes strong assumptions about the background knowledge of an attacker; see, for example, some of the references in the survey by Luo et al. [19].

5. In order to improve the utility of the algorithm as stated in [17], we suggest first filtering out infrequent keywords using the two-threshold approach of ZEALOUS and then publishing noisy counts of queries consisting of up to 3 frequent keywords and the clicks of their top-ranked documents.



    10 CONCLUSIONS

This paper contains a comparative study of publishing frequent keywords, queries, and clicks in search logs. We compare the disclosure limitation guarantees and the theoretical and practical utility of various approaches. Our comparison includes earlier work on anonymity and (ε, δ)-indistinguishability as well as our proposed solution to achieve (ε, δ)-probabilistic differential privacy in search logs. In our comparison, we revealed interesting relationships between indistinguishability and probabilistic differential privacy which might be of independent interest. Our results (positive as well as negative) can be applied more generally to the problem of publishing frequent items or itemsets.

A topic of future work is the development of algorithms that allow publishing useful information about infrequent keywords, queries, and clicks in a search log.

    ACKNOWLEDGMENTS

We would like to thank our colleagues Filip Radlinski and Yisong Yue for helpful discussions about the usage of search logs.

    REFERENCES

[1] Eytan Adar. User 4xxxxx9: Anonymizing query logs. In WWW Workshop on Query Log Analysis, 2007.
[2] Roberto Baeza-Yates. Web usage mining in search engines. Web Mining: Applications and Techniques, 2004.
[3] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy and consistency too: A holistic solution to contingency table release. In PODS, 2007.
[4] Michael Barbaro and Tom Zeller. A face is exposed for AOL searcher no. 4417749. New York Times, http://www.nytimes.com/2006/08/09/technology/09aol.html?ex=1312776000&en=f6f61949c6da4d38&ei=5090, 2006.
[5] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. In STOC, pages 609-618, 2008.
[6] Justin Brickell and Vitaly Shmatikov. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In KDD, 2008.
[7] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured learning for non-smooth ranking losses. In KDD, pages 88-96, 2008.
[8] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.
[9] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[10] Michaela Götz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, and Johannes Gehrke. Privacy in search logs. CoRR, abs/0904.0682v2, 2009.
[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 1st edition, September 2000.
[12] Yeye He and Jeffrey F. Naughton. Anonymization of set-valued data via top-down, local generalization. PVLDB, 2(1):934-945, 2009.
[13] Yuan Hong, Xiaoyun He, Jaideep Vaidya, Nabil Adam, and Vijayalakshmi Atluri. Effective anonymization of query logs. In CIKM, 2009.
[14] Rosie Jones, Ravi Kumar, Bo Pang, and Andrew Tomkins. I know what you did last summer: query logs and user privacy. In CIKM, 2007.
[15] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. Generating query substitutions. In WWW, 2006.
[16] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? In FOCS, pages 531-540, 2008.
[17] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. Releasing search queries and clicks privately. In WWW, 2009.
[18] Ravi Kumar, Jasmine Novak, Bo Pang, and Andrew Tomkins. On anonymizing query logs via token-based hashing. In WWW, 2007.
[19] Yongcheng Luo, Yan Zhao, and Jiajin Le. A survey on the privacy preserving algorithm of association rule mining. Electronic Commerce and Security, International Symposium, 1:241-245, 2009.
[20] Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008.
[21] Rajeev Motwani and Shubha Nabar. Anonymizing unstructured data. arXiv, 2008.
[22] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.
[23] Pierangela Samarati. Protecting respondents' identities in microdata release. IEEE Trans. on Knowl. and Data Eng., 13(6):1010-1027, 2001.

Michaela Götz is a PhD student in the Computer Science Department at Cornell University in Ithaca, NY. She obtained her Bachelor's degree in Computer Science at Saarland University in 2007.

Ashwin Machanavajjhala is a research scientist at Yahoo! Research in Silicon Valley. He obtained his PhD in Computer Science from Cornell University in 2008 under the supervision of Johannes Gehrke.

Guozhang Wang is a PhD student in the Computer Science Department at Cornell University in Ithaca, NY. He obtained a Bachelor's degree from Fudan University.

Xiaokui Xiao is an assistant professor in the School of Computer Engineering at Nanyang Technological University (NTU) in Singapore. He obtained his PhD degree in Computer Science from the Chinese University of Hong Kong in 2008 under the supervision of Yufei Tao. He spent a year at Cornell University as a postdoctoral associate.

Johannes Gehrke is a professor in the Computer Science Department at Cornell University in Ithaca, NY. He has received a National Science Foundation CAREER Award, a Sloan Fellowship, and an IBM Faculty Award. He is the author of numerous publications in data mining and database systems. He co-authored the undergraduate textbook Database Management Systems.


APPENDIX A: ONLINE APPENDIX

This appendix is available online [10]. We provide it here for the convenience of the reviewers. It is not meant to be part of the final paper.

    A.1 Analysis of ZEALOUS: Proof of Theorem 10

Let H be the keyword histogram constructed by ZEALOUS in Step 2 when applied to S, and let K be the set of keywords in H whose count equals τ. Let Ω be the set of keyword histograms that do not contain any keyword in K. For notational simplicity, let us denote ZEALOUS as a function Z. We will prove the theorem by showing that, given Equations (4) and (5),

\[\Pr[Z(S) \notin \Omega] \le \delta, \tag{6}\]

and for any keyword histogram ω ∈ Ω and for any neighboring search log S′ of S,

\[e^{-\varepsilon} \cdot \Pr[Z(S') = \omega] \;\le\; \Pr[Z(S) = \omega] \;\le\; e^{\varepsilon} \cdot \Pr[Z(S') = \omega]. \tag{7}\]

We will first prove that Equation (6) holds. Assume that the i-th keyword in K has a noisy count \(\tilde{c}_i\) in Z(S) for i ∈ [1, |K|]. Such a keyword has true count τ in H, so it appears in the output only if \(\tilde{c}_i \ge \tau'\). Then,

\[
\begin{aligned}
\Pr[Z(S) \notin \Omega]
&= \Pr\bigl[\exists\, i \in [1, |K|]:\ \tilde{c}_i \ge \tau'\bigr]\\
&= 1 - \Pr\bigl[\forall\, i \in [1, |K|]:\ \tilde{c}_i < \tau'\bigr]\\
&= 1 - \prod_{i \in [1,|K|]} \Bigl(1 - \int_{\tau'-\tau}^{\infty} \tfrac{1}{2\lambda}\, e^{-|x|/\lambda}\, dx\Bigr)
&& \text{(the noise added to } c_i \text{ has to be} \ge \tau' - \tau)\\
&= 1 - \Bigl(1 - \tfrac{1}{2}\, e^{-(\tau'-\tau)/\lambda}\Bigr)^{|K|}\\
&\le \tfrac{|K|}{2}\, e^{-(\tau'-\tau)/\lambda}\\
&\le \tfrac{U m}{2\tau}\, e^{-(\tau'-\tau)/\lambda}
&& \text{(because } |K| \le U m / \tau)\\
&\le \tfrac{U m}{2\tau}\, e^{-\ln\left(\frac{U m}{2\tau\delta}\right)}
&& \text{(by Equation 5)}\\
&= \delta. \tag{8}
\end{aligned}
\]
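Solving the last two lines of this chain for τ′ also shows how the second threshold can be set in practice: τ′ ≥ τ + λ ln(Um/(2τδ)). A quick numeric sanity check of this reconstructed formula, with purely hypothetical parameter values:

```python
import math

def min_tau_prime(U, m, tau, lam, delta):
    # Smallest tau' for which (U*m/(2*tau)) * exp(-(tau'-tau)/lam) <= delta,
    # per the chain of inequalities above.
    return tau + lam * math.log(U * m / (2 * tau * delta))

# Hypothetical parameters: 1e6 users, m = 6 contributions each,
# first threshold tau = 20, noise scale lam = 12, delta = 1e-3.
tp = min_tau_prime(U=1e6, m=6, tau=20, lam=12, delta=1e-3)
bound = (1e6 * 6 / (2 * 20)) * math.exp(-(tp - 20) / 12)
print(round(tp, 2), bound <= 1e-3 + 1e-12)   # the bound meets delta
```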

Next, we will show that Equation (7) also holds. Let S′ be any neighboring search log of S, and let ω be any possible output of ZEALOUS given S such that ω ∈ Ω. To establish Equation (7), it suffices to prove that

\[\frac{\Pr[Z(S) = \omega]}{\Pr[Z(S') = \omega]} \le e^{\varepsilon}, \quad\text{and} \tag{9}\]

\[\frac{\Pr[Z(S') = \omega]}{\Pr[Z(S) = \omega]} \le e^{\varepsilon}. \tag{10}\]

We will derive Equation (9); the proof of (10) is analogous.

Let H′ be the keyword histogram constructed by ZEALOUS in Step 2 when applied to S′. Let ∆ be the set of keywords that have different counts in H and H′. Since S and S′ differ in the search history of a single user, and each user contributes at most m keywords, we have |∆| ≤ 2m. Let k_i (i ∈ [1, |∆|]) be the i-th keyword in ∆, and let \(d_i\), \(d'_i\), and \(\tilde{d}_i\) be the counts of k_i in H, H′, and ω, respectively. Since a user adds at most one to the count of a keyword (see Step 2), we have \(|d_i - d'_i| = 1\) for any i ∈ [1, |∆|]. To simplify notation, let \(E_i\), \(E'_i\), and \(\tilde{E}_i\) denote the events that k_i has count \(d_i\) in H, count \(d'_i\) in H′, and count \(\tilde{d}_i\) in the output, respectively. Therefore,

\[\frac{\Pr[Z(S) = \omega]}{\Pr[Z(S') = \omega]} = \prod_{i \in [1,|\Delta|]} \frac{\Pr[\tilde{E}_i \mid E_i]}{\Pr[\tilde{E}_i \mid E'_i]}.\]

In what follows, we will show that \(\Pr[\tilde{E}_i \mid E_i] / \Pr[\tilde{E}_i \mid E'_i] \le e^{1/\lambda}\) for any i ∈ [1, |∆|]. We differentiate three cases: (i) \(d_i \ge \tau\) and \(d'_i \ge \tau\), (ii) \(d_i < \tau\), and (iii) \(d_i = \tau\) and \(d'_i = \tau - 1\).

Consider case (i), where \(d_i\) and \(d'_i\) are both at least τ. Then, if \(\tilde{d}_i > 0\), we have

\[\frac{\Pr[\tilde{E}_i \mid E_i]}{\Pr[\tilde{E}_i \mid E'_i]} = \frac{\tfrac{1}{2\lambda}\, e^{-|\tilde{d}_i - d_i|/\lambda}}{\tfrac{1}{2\lambda}\, e^{-|\tilde{d}_i - d'_i|/\lambda}} = e^{(|\tilde{d}_i - d'_i| - |\tilde{d}_i - d_i|)/\lambda} \le e^{|d_i - d'_i|/\lambda} = e^{1/\lambda} \quad \text{(because } |d_i - d'_i| = 1 \text{ for any } i\text{)}.\]

On the other hand, if \(\tilde{d}_i = 0\),

\[\frac{\Pr[\tilde{E}_i \mid E_i]}{\Pr[\tilde{E}_i \mid E'_i]} = \frac{\int_{-\infty}^{\tau' - d_i} \tfrac{1}{2\lambda}\, e^{-|x|/\lambda}\, dx}{\int_{-\infty}^{\tau' - d'_i} \tfrac{1}{2\lambda}\, e^{-|x|/\lambda}\, dx} \le e^{1/\lambda}.\]

Now consider case (ii), where \(d_i\) is less than τ. Since \(\tau' \ge \tau\), and ZEALOUS eliminates all counts in H that are smaller than τ, we have \(\tilde{d}_i = 0\), and \(\Pr[\tilde{E}_i \mid E_i] = 1\). On the other hand,

\[\Pr[\tilde{E}_i \mid E'_i] = \begin{cases} 1, & \text{if } d'_i < \tau\\[2pt] 1 - \tfrac{1}{2}\, e^{-(\tau' - d'_i)/\lambda}, & \text{otherwise.} \end{cases}\]

Therefore,

\[\frac{\Pr[\tilde{E}_i \mid E_i]}{\Pr[\tilde{E}_i \mid E'_i]} \le \frac{1}{1 - \tfrac{1}{2}\, e^{-(\tau' - d'_i)/\lambda}} \le \frac{1}{1 - \tfrac{1}{2}\, e^{-(\tau' - \tau)/\lambda}} \le \frac{1}{1 - \tfrac{1}{2}\, e^{\ln(2 - 2e^{-1/\lambda})}} \quad\text{(by Equation 4)} \;=\; \frac{1}{1 - (1 - e^{-1/\lambda})} \;=\; e^{1/\lambda}.\]

Consider now case (iii), where \(d_i = \tau\) and \(d'_i = \tau - 1\). Since ω ∈ Ω, we have \(\tilde{d}_i = 0\). Moreover, since ZEALOUS eliminates all counts in H′ that are smaller than τ, it follows that \(\Pr[\tilde{E}_i \mid E'_i] = 1\). Therefore,

\[\frac{\Pr[\tilde{E}_i \mid E_i]}{\Pr[\tilde{E}_i \mid E'_i]} = \Pr[\tilde{E}_i \mid E_i] \le 1 \le e^{1/\lambda}.\]


In summary, \(\Pr[\tilde{E}_i \mid E_i] / \Pr[\tilde{E}_i \mid E'_i] \le e^{1/\lambda}\). Since |∆| ≤ 2m, we have

\[\frac{\Pr[Z(S) = \omega]}{\Pr[Z(S') = \omega]} = \prod_{i \in [1,|\Delta|]} \frac{\Pr[\tilde{E}_i \mid E_i]}{\Pr[\tilde{E}_i \mid E'_i]} \le \prod_{i \in [1,|\Delta|]} e^{1/\lambda} = e^{|\Delta|/\lambda} \le e^{\varepsilon} \quad\text{(by Equation 4 and } |\Delta| \le 2m\text{)}.\]

This concludes the proof of the theorem.

    A.2 Proof of Proposition 1

Assume that, for all search logs S, we can divide the output space into two sets Ω_1, Ω_2 such that

(1) \(\Pr[A(S) \in \Omega_2] \le \delta\), and

for all search logs S′ differing from S only in the search history of a single user and for all O ∈ Ω_1:

(2) \(\Pr[A(S) = O] \le e^{\varepsilon}\, \Pr[A(S') = O]\) and \(\Pr[A(S') = O] \le e^{\varepsilon}\, \Pr[A(S) = O]\).

Consider any subset \(\hat{O}\) of the output space of A. Let \(\hat{O}_1 = \hat{O} \cap \Omega_1\) and \(\hat{O}_2 = \hat{O} \cap \Omega_2\). We have

\[
\begin{aligned}
\Pr[A(S) \in \hat{O}]
&= \int_{O \in \hat{O}_2} \Pr[A(S) = O]\, dO + \int_{O \in \hat{O}_1} \Pr[A(S) = O]\, dO\\
&\le \int_{O \in \Omega_2} \Pr[A(S) = O]\, dO + e^{\varepsilon} \int_{O \in \hat{O}_1} \Pr[A(S') = O]\, dO\\
&\le \delta + e^{\varepsilon}\, \Pr[A(S') \in \hat{O}_1]\\
&\le \delta + e^{\varepsilon}\, \Pr[A(S') \in \hat{O}].
\end{aligned}
\]

    A.3 Proof of Proposition 2

We have to show that, for all search logs S, S′ differing in one user history and for all sets O:

\[\Pr[A(S) \in O] \le \Pr[A(S') \in O] + 1/(|D| - 1).\]

Since Algorithm 1 neglects all but the first input, this is true for neighboring search logs not differing in the first user's input. We are left with the case of two neighboring search logs S, S′ differing in the search history of the first user. Let us analyze the output distributions of Algorithm 1 under these two inputs S and S′. For all search histories except the search histories of the first user in S, S′, the output probability is 1/(|D| − 1) for either input. Only for the two search histories S_1, S′_1 of the first user do the output probabilities differ: Algorithm 1 never outputs S_1 given S, but it outputs this search history with probability 1/(|D| − 1) given S′. Symmetrically, Algorithm 1 never outputs S′_1 given S′, but it outputs this search history with probability 1/(|D| − 1) given S. Thus, we have for all sets O:

\[
\begin{aligned}
\Pr[A(S) \in O] &= \sum_{d \in O \cap (D \setminus \{S_1\})} 1/(|D| - 1) && (11)\\
&\le 1/(|D| - 1) + \sum_{d \in O \cap (D \setminus \{S'_1\})} 1/(|D| - 1) && (12)\\
&= \Pr[A(S') \in O] + 1/(|D| - 1). && (13)
\end{aligned}
\]
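For reference, a minimal sketch of the algorithm analyzed here: it ignores everything except the first user's history and outputs a uniformly random element of the domain D excluding that history. The function name is ours.

```python
import random

def algorithm_1(search_logs, domain):
    # Sketch of the counterexample algorithm analyzed above: neglect all
    # but the first user's history and output a uniformly random history
    # from the domain D excluding that first history.
    first = search_logs[0]
    candidates = [d for d in domain if d != first]
    return random.choice(candidates)   # each with probability 1/(|D| - 1)
```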
