Practical Volume-Based Attacks on Encrypted Databases Stephanie Wang Rishabh Poddar Jianan Lu Raluca Ada Popa Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2019-50 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-50.html May 16, 2019




Copyright © 2019, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.


Practical Volume-Based Attacks on Encrypted Databases

by Jianan Lu

Research Project

Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II. Approval for the Report and Comprehensive Examination:

Committee:

Professor Raluca Ada Popa Research Advisor

(Date)

* * * * * * *

Professor Alessandro Chiesa Second Reader

(Date)

2019.05.14
May 16, 2019

Abstract

Databases are the key component of most computer systems today. Because of the valuable and sensitive data they store and process, these database systems have become the primary target of digital attacks. For example, confidential information (e.g., social security numbers, home addresses) of over 140 million people was leaked in 2017 from Equifax, one of the largest credit reporting companies in the US.

This prevalence of database breaches has spurred more interest in building secure databases in both academia and industry. A number of proposed systems protect data using advanced encryption schemes or hide query access patterns, albeit at some performance cost. However, recent work has also shown that volume leakage is a significant vulnerability that can be exploited to reconstruct the entire database, even when using state-of-the-art designs with the strongest security guarantees.

In this work, we present new attacks for recovering the content of individual user queries, assuming no leakage from the system except the number of results. Unlike previous volume-based attacks, which rely on assumptions that are either too stringent or unrealistic for many real-world systems, our attacks directly leverage the semantics of real applications running on top of these database systems. The key insight is that, by exploiting the behavior of specific applications, one can immediately mount an attack without making the further assumptions about the underlying system that prior work requires.


Contents

1 Introduction
2 Related Work
  2.1 Cryptographic schemes and systems
  2.2 Related attacks
3 Attack model
  3.1 System Model
  3.2 Application model
    3.2.1 File injection
    3.2.2 Query replay
4 Our Attack
  4.1 Overview
  4.2 Attack Algorithm
5 Evaluation
  5.1 Simulation Using Encrypted Email Databases
    5.1.1 Setup
    5.1.2 Evaluation
  5.2 Case Study of Gmail Inbox Search
    5.2.1 Setup
    5.2.2 Key Characteristics
    5.2.3 End-to-end Attack
    5.2.4 Other Applications
6 Mitigations
  6.1 Disable File Injection
  6.2 Prevent File Measurement
  6.3 Block Query Replay
7 Conclusion
References


Acknowledgements

I am really thankful to my advisor, Professor Raluca Ada Popa, for guiding me through this project and teaching me how to conduct research in system security. I have learned a lot from her: to name a few things, how to think about security problems, how to tackle technical challenges, how to position one’s work in a convincing way, and how to give good deliverables. Without her great advice and support, I would not have been able to complete this work or continue to pursue my academic interest in graduate school!

I am also super grateful to my collaborators, Rishabh Poddar and Stephanie Wang. When I first joined this project, I had little knowledge of computer networks and no experience in launching attacks against real systems. They helped me design my experiments in a step-by-step manner, taught me a lot of technical skills, and answered a lot, if not too many, of my questions.

I am also very thankful to Wen Zhang, who taught me how to use various network attack tools. His assistance directly helped me make progress in this work.

Finally, a big thank-you to Professor Alessandro Chiesa for generously giving me great feedback on this report!


1 INTRODUCTION

In recent years, interest in building databases with stronger security guarantees has increased drastically. One powerful technique is to add a layer of encryption that permits querying of encrypted data. Even when attackers break in and compromise the database server, they cannot steal the data or read its content without access to the correct decryption keys. A number of practical encrypted database systems have been proposed [3, 10, 12, 17, 19, 28, 40, 47, 50, 58] and some have already been integrated into existing infrastructures. For example, Microsoft deployed a new feature, called Always Encrypted [40], for Azure SQL Database or SQL Server databases in 2017. This service gives data owners the freedom to encrypt sensitive columns of their own data such that unauthorized parties who no longer have direct access to the data can still run services on it.

The majority of these schemes rely on property-preserving encryption [6, 7, 33, 37, 46] or searchable encryption [8, 10, 11, 13, 27, 31, 34, 36, 43, 44, 54, 55]. However, most of them leak query access patterns. At a high level, given a query, the query access pattern is the set of records that satisfy the query condition. Consider the example of an email application: a user issues a search query for a keyword over their emails. The mail server typically stores an inverted index (also called a secondary index) for each user’s mailbox, which maps a keyword to a list of emails. When fetching the results of a queried keyword, most of these schemes leak to the attacker the set of email identifiers that match the keyword (i.e., the access patterns of the query), even though the email bodies remain encrypted. A set of recent works [2, 9, 14, 21, 26, 29, 30, 32, 35, 38, 49, 60] has shown that such access patterns leak significant information to a compromised server. In the context of the email example, they were able to reconstruct the keyword that the user searches for, as well as email contents.

Many of these works discuss oblivious protocols, such as ORAM (Oblivious RAM) [24, 57] or PIR (Private Information Retrieval) [20], as a solution to this leakage. These schemes hide access patterns: even an attacker eavesdropping at the database server does not learn which records are accessed and hence cannot infer the queried keywords. These schemes are often regarded as giving a very strong security guarantee, the main downside largely being their slow performance.

However, in seminal work, Kellaris et al. [32] showed that even the strongest schemes today, the ones that enable encryption and provably conceal access patterns, still leak the result count of queries. The attacker neither knows the content of individual queries (which are encrypted), nor does it learn which documents were returned in response (i.e., access patterns remain hidden). Instead, it only observes the total number of returned records, or the volume of query results. This can allow attackers to reconstruct the database counts, i.e., the number of documents in the database containing each particular value. The primary contribution of Kellaris et al. was to show that volume-based attacks were possible. Because their methods depend on certain assumptions about the queries (e.g., amount, type, distribution), their attacks are limited in scope and not yet practical.

In this project, we improve upon the state of the art and show that volume-based attacks are more likely to be practical. Unlike Kellaris et al. and other literature on volume-based attacks, our work targets the application layer without making any assumptions about user queries or the underlying data. In particular, we focus on real applications that allow users to search for keywords over a secondary index, a common data structure in database systems that maps keys to a set of matching records. In the encrypted database literature, this corresponds to the model of searchable encryption schemes [8, 10, 11, 13, 27, 31, 34, 36, 43, 44, 54, 55]. Our aim is to recover the content of individual queries that search for a specific keyword in the database.


Our key insight is that, by directly leveraging real application semantics, one can immediately mount an attack without making any further assumption about the underlying database system. Furthermore, it allows our attack to be more efficient and thus eminently practical.

By and large, real-world applications today leak far more information than just the volume of results. However, privacy-conscious services have begun deploying sophisticated schemes to plug traditional sources of leakage, including access patterns (e.g., the Signal messaging service [1, 39]). The takeaway of our work is that as practitioners take steps to enhance the privacy guarantees of their applications in the future, they must also account for the leakage of result volumes. Application-specific behavior that facilitates easy exploitation of this leakage (as demonstrated by our attacks) must be patched.

In §2, we discuss important works related to encrypted database systems and volume-leakage attacks. §3 describes our threat model in more detail. We survey 11 representative applications and identify two key characteristics that can be exploited by attackers: (i) file injection, and (ii) automatic query replay. In particular, Gmail inbox search fits our setting seamlessly, and satisfies both the aforementioned properties. Given these attacker abilities, we present in §4 an attack algorithm that is able to reconstruct user queries on secondary indices with 100% confidence. Subsequently, we evaluate our attacks on an encrypted database using the Enron email dataset in Section 5.1. We also perform an end-to-end attack on the Gmail web client by simulating a server-side adversary in Section 5.2. Our attack on Gmail completes within a matter of minutes, demonstrating the feasibility of our techniques. Finally, we introduce potential mitigations in §6, and end this report in §7.


2 RELATED WORK

To access or compute on encrypted data, the community has developed a rich set of cryptographic schemes and protocols, as well as encrypted database systems. Each of these schemes and systems makes various tradeoffs between information revealed to the server, supported functionality, and performance. A recent set of attack papers studies the information an attacker can obtain from these schemes and systems, referred to as leakage-abuse attacks by Cash et al. [11]. Most of the attacks in this category leverage leakage from data relations or access patterns, and very few works target these systems relying only on volume leakage, as our work does.

2.1 Cryptographic schemes and systems

There are a multitude of ways to access or compute on encrypted data, such as property-preserving or property-revealing encryption [6, 7, 33, 37, 46] or searchable encryption [8, 10, 11, 13, 27, 31, 34, 36, 43, 44, 54, 55]. For a comprehensive survey, see [53]. We take ORAM and PIR as two representative examples below and will discuss them in more detail in Section 3.1. It is important to note that our work directly applies to all systems that leak the volume of results, and is not limited to ORAM or PIR.

Oblivious RAM techniques [24, 57] and Private Information Retrieval (PIR) [20] schemes enable a client to access data items stored at the server without the server knowing the query requested. These two types of schemes consider different models and employ different techniques, but ultimately, the goal of both is to hide the query from the server.

Many works leverage ORAM for different purposes. For example, ObliviStore [56] and CURIOUS [5] show how to use ORAM for cloud storage. TaoStore [52] shows how to support asynchronicity in multi-user cases. These systems leak the volume of results to the server. Roche et al. [51] propose an ORAM scheme (called vORAM) that supports variable-sized data blocks by including them within an ORAM node (or bucket) on the same path. While such a scheme confers some degree of hiding, it limits the amount of data that can be included on a path in this way, say $L$ files, and the attacker sees how many ORAM paths are fetched. Hence, the attacker can estimate the number of results with an error margin of $L$. In the database setting, this error margin can be made relatively small, because the database fetches the rows that match the keyword (not just the row identifiers), and these cannot all be stored on the same path. Moreover, Naveed [42] demonstrates that, in general, extending ORAM schemes to hide the number of query results is (for a large fraction of queries) slower than streaming the database through the client.

Some works [4, 45, 59] build SQL databases or keyword indices on top of PIR. For example, to perform an index search for a keyword $k$, the client performs PIR retrievals to traverse the index and select every value in the index. The server does not know which data items were fetched, but it still sees the number of results.

2.2 Related attacks

When considering the amount of leakage that attacks exploit, there are at least three categories: attacks exploiting data relations, attacks exploiting access patterns, and attacks exploiting result set size but not access patterns or data relations. The last category is the most challenging because the attacker needs to work with the least amount of information. At the same time, this category is also the least studied. Our attack is in this last category, and we now discuss other volume-based attacks.

Cash et al. [10] point out that if an attacker knows the exact number of times a keyword appears in a victim’s documents, and if that result size is unique to this keyword, the attacker can identify the keyword when seeing the result size. In comparison, our attack does not assume the attacker knows the frequency of each keyword in a victim’s index. Indeed, when attacking a specific user in the email application, the attacker often does not have access to the victim’s mailbox and does not know these counts. Moreover, many keywords do not have unique counts (e.g., 99% of words in the Enron dataset, Section 5.1), so this attack fails for these keywords. Our attack can recover 100% of queries in a dictionary in realistic scenarios without access pattern information.
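To make the limitation of the count-based identification concrete, the following is a minimal Python sketch of it, not code from Cash et al. The function names and the keyword counts are hypothetical and chosen only for illustration; the sketch assumes the attacker already holds the exact per-keyword counts, which is precisely the assumption our attack avoids.

```python
from collections import Counter


def identify_by_count(known_counts, observed_volume):
    """Return the keyword whose known count equals the observed volume,
    or None when zero or several keywords share that count."""
    matches = [w for w, c in known_counts.items() if c == observed_volume]
    return matches[0] if len(matches) == 1 else None


def fraction_with_unique_count(known_counts):
    """Fraction of keywords that a count-only attacker could identify."""
    freq = Counter(known_counts.values())
    unique = sum(1 for c in known_counts.values() if freq[c] == 1)
    return unique / len(known_counts)


# Hypothetical per-keyword result counts for one victim index.
known_counts = {"merger": 7, "lunch": 42, "budget": 7, "enron": 130}

unique_hit = identify_by_count(known_counts, 42)   # count 42 is unique
ambiguous = identify_by_count(known_counts, 7)     # count 7 is shared
```

In this toy index, only the keywords with a unique count can be identified; any keyword that shares its count with another stays ambiguous, which is the situation the text reports for most words in the Enron dataset.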

Kellaris et al. [32] also show how an attacker can reconstruct the contents of a field in the database given only the result size, but their attack differs from ours in assumptions and target. First, Kellaris et al. assume that (1) the user makes range queries that are uniformly distributed on that column, a property on which their algorithm relies crucially; and (2) the user makes $O(N^4 \log N)$ queries, where $N$ is the size of the domain. Such a large number of queries is infeasible for the attacker to observe in many settings. Very recently, Grubbs et al. [25] improved upon the results of Kellaris et al. by demonstrating attacks that do not make assumptions on the distribution of queries, as long as all possible range queries are issued. For queries drawn from a uniform distribution, their attack requires $O(N^2 \log N)$ queries.

In contrast, our attack requires only a single query to be issued by the user, followed by $O(\log |D|)$ replays, which is often fewer than 10 in number (§5). Our attack also makes no assumptions about the query distribution; assuming a uniform range query distribution is not realistic for many applications. On the other hand, unlike the aforementioned works, our attack requires the ability to inject files and replay queries. We validate that this can be achieved in realistic scenarios in Section 3.2. A second difference is that the aforementioned attacks reconstruct the database (out of range queries), but not individual query keywords; we reconstruct queries, but do not target the overall database. However, we note that a similar reconstruction follows as a direct consequence of our attack, where the original counts for each keyword could be determined if queries for all possible keywords are issued.


3 ATTACK MODEL

In this section, we discuss the generic system model (Section 3.1) and the application model (Section 3.2) that are vulnerable to our attacks. We present the three key attack assumptions: (1) that the system leaks volume, (2) that the application allows data injection, and (3) that the application automatically replays queries under certain conditions. We then demonstrate the validity of our assumptions by studying a number of concrete instances for both the system and application models. In particular, we examine 11 popular web applications that allow users to issue keyword search queries over an inverted index. We find that (i) all 11 applications allow attackers to inject data into the victim user’s index; and (ii) 5 of the 11 applications also replay queries automatically without user intervention.

3.1 System Model

We consider systems in which an untrusted server (the adversary in our setting) maintains a secondary index in an encrypted database. The index maps a keyword to a list of documents or database rows (referred to as files, henceforth) that the keyword appears in, and is stored on the server for query efficiency. Whenever the application proxy or the client queries the index for a keyword, the user receives the corresponding list of files containing the keyword. We assume that the query’s execution reveals no information to the adversary except the number of results. That is, there is only volume leakage.

More formally, similar to Kellaris et al. [32], we define a database $D$ as a set of records that associate keywords with the files from a collection $F$ that the keywords appear in:

$$D = \{(w, f) : w \in f,\ f \in F\}$$

A query for word $w$ is a function $q_w$ (where $w$ is private) that maps $D$ to a list of matching files in $F$:

$$q_w(D) = \{f : (w, f) \in D\}.$$

An implementation of such a database may internally use one layer of indirection, so that the first query returns a list of file pointers, or indices into $F$, and the subsequent queries are used to fetch the file contents from $F$.

The adversary’s goal is to identify the private keyword $w$ using only the size of the result set $|q_w(D)|$.

Examples. We now illustrate the relevance of volume-based attacks by discussing concrete examples of cryptographic systems that only leak the volume of results to attackers. In particular, we consider a client-server model where the database stored at the server is encrypted using sophisticated techniques that also hide access patterns, and the server maintains a secondary index over the encrypted data.

ORAM-based systems. In the ORAM model (Figure 1, left), the server stores the secondary index in an ORAM instance, with a trusted proxy containing the ORAM key. An (optional) application server lies in front of the proxy, and clients access the system via the application server. As is the case in many database-backed applications, the application server also implements access control over the data. Now we consider the case where a database administrator backs the database with ORAM, which hides access patterns from the server in addition to the result contents. However, even with a guarantee of this strength, volume leakage is possible for a passive attacker because the size of the result contents is not hidden. In addition, even if a layer of indirection is used, so that the first query only returns a list of file pointers, the number of files returned can still be measured by recording the number of subsequent queries made.


Fig. 1. (Left) ORAM-based system model; both the server and a subset of the clients are untrusted (shaded). (Right) PIR-based model; the server is untrusted (shaded).

PIR-based systems. In the PIR model (Figure 1, right), an untrusted server owns the database and maintains a secondary index on it for fast access. In this case, the server sees all the data, as its role is to maintain and serve publicly available data. There can be multiple clients here too, but each client has an independent interaction with the server, so it suffices to focus on a single client. The untrusted server answers user queries in which the requested item is private [20]. However, since the data is publicly available, the untrusted server can easily learn the size of the query results.

It is important to note that any sophisticated cryptographic scheme that leaks volume is vulnerable to our attacks; ORAM and PIR are just two examples. The takeaway is that content encryption is not sufficient to prevent volume leakage, as we have seen in these examples.
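The system model of Section 3.1 can be illustrated with a small Python sketch. It assumes a toy, unencrypted index and hypothetical file contents; the names `build_index`, `query`, and `observed_volume` are ours and do not come from any of the systems discussed. The point is only to show which quantity the adversary observes.

```python
from collections import defaultdict


def build_index(files):
    """Build the secondary index for D: keyword -> set of ids of files containing it."""
    index = defaultdict(set)
    for file_id, words in files.items():
        for w in words:
            index[w].add(file_id)
    return index


def query(index, w):
    """q_w(D): the set of files that contain keyword w (seen only by the client)."""
    return set(index.get(w, set()))


def observed_volume(index, w):
    """What the adversary learns in this model: |q_w(D)|, neither w nor the file ids."""
    return len(query(index, w))


# Hypothetical file collection F, each file given as its set of keywords.
files = {
    "f1": {"budget", "meeting"},
    "f2": {"meeting", "lunch"},
    "f3": {"budget", "meeting", "lunch"},
}
index = build_index(files)
```

Here two different keywords ("budget" and "lunch") produce the same volume, so the volume alone does not identify the query in this toy index; the attack in Section 4 changes the volumes through injection to remove that ambiguity.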


| Application            | Type of queries         | File injection strategy          | No. of replays observed in 10 minutes |
| Gmail                  | Email keywords          | Send emails to victim            | 6 |
| Facebook               | Names, post keywords    | Create posts in a group page     | 4 |
| Dropbox                | File keywords           | Upload files to a shared folder  | 1 |
| Google Doc             | File keywords           | Upload files to a shared folder  | 1 |
| iCloud Mail            | Email keywords          | Send emails to victim            | 1 |
| Twitter                | Hashtags                | Post tweets with hashtags        | 0 |
| Piazza                 | Post keywords           | Create posts in a class          | 0 |
| Slack                  | Names, message keywords | Send messages to a group channel | 0 |
| Skype                  | Names, message keywords | Send messages to victim          | 0 |
| Yahoo Mail             | Email keywords          | Send emails to victim            | 0 |
| Outlook Mail (Hotmail) | Email keywords          | Send emails to victim            | 0 |

Fig. 2. An empirical assessment of 11 popular web applications. In each case, we list the type of query made by the user, how the attacker can influence the result of this query by application-specific injection, and the number of replays observed. To measure replay, all server responses are dropped for 10 minutes, and we report the number of duplicate queries made within 10 minutes.

3.2 Application model

We assume an application model based on the behavior of actual applications that rely on a secondary index. The model consists of two key assumptions: the ability to inject data into a user’s index and the ability to replay a user query without the involvement of the user. We argue that these two assumptions are in fact often inherent to the application. As evidence, we survey a variety of popular web applications that rely on a secondary index and validate both assumptions in 5 out of the 11 surveyed. As an example, Gmail inbox search fits our setting seamlessly, and satisfies both the aforementioned properties. The user can search for all emails that contain a keyword, an attacker can inject data by simply sending the user an email with a specific keyword, and queries are automatically replayed by the application when the server’s response is delayed. Finally, we describe how these assumptions together make it difficult to detect our attack.

3.2.1 File injection. First, we find that many applications inherently allow other users to inject application data into a victim user’s index. This property of applications has also been noted in prior work [10, 38, 60]. In our setting, it allows the attacker to potentially influence the volume of results returned by a query. For example, to inject into Gmail inbox search, the attacker sends the victim user an email. Injection is especially easy if there is a secondary index that is shared. To inject an entry for a hashtag in Twitter, the attacker uploads a post with that hashtag; in Slack, the attacker simply sends a message. The ability of the attacker to inject in such applications is fundamental because they are inherently designed for multiple users to interact and share data.

3.2.2 Query replay. Second, we assume that the attacker can replay a user’s query a finite number of times. While this assumption is certainly not universal across applications that rely on a secondary index, we find that it is surprisingly common, with many applications replaying queries automatically in the background without user intervention. This is because many applications are written to handle transient errors transparently, to put as little burden on the user as possible. In particular, an application that wants to provide a seamless experience when network connectivity is spotty may retry a query automatically if the response is too slow. Indeed, the HTTP/1.1 RFC [18] specifies that “When an inbound connection is closed prematurely, a client MAY open a new connection and automatically retransmit an aborted sequence of [idempotent] requests.” A compromised server can force this behavior by simply dropping its HTTPS responses, triggering an automatic replay.

Examples. To show that these assumptions are realistic, we surveyed 11 applications, including Gmail, Twitter, and Facebook, and tested for the ability to inject data and replay queries for a target user (Figure 2). To test for injection, we examined the application functionality to determine whether the attacker could inject data into an index searchable by the user. To measure the number of query replays, we drop responses from the server without disturbing the original connection, record all network traffic from the application client, and count the number of duplicate queries that appear within 10 minutes. We find that for all applications, injection is possible, although sometimes only if the attacker and the user share some index (e.g., they are both members of a public Facebook group). We also find that 5 of these applications have query replay.

Upon further investigation of these 5 applications, we find that all retry queries automatically, though the rate of retries varies. Two applications, Gmail and Facebook, retry the query repeatedly. The remaining three, Dropbox, Google Drive, and iCloud Mail, retry the query once. The number of retries is important because the greater the number of retries, the easier it is for the attacker to identify the query. Nevertheless, as we show in Section 5.1.2, even a single replay is sufficient for significantly pruning the space of query possibilities, and, in many cases, for mounting the attack feasibly.

Avoiding detection. Because the attack relies on injection visible to the user, one practical concern in launching the attack is avoiding detection. Fortunately, in many settings, our attack is difficult to detect before it completes. The reason is that once the user issues a query, the attacker can continue to block the responses from the server, causing the web application to retry queries until the attack completes.

We verified this behavior with Gmail: no results are returned to the user during the attack, and to the user it simply appears that he has a bad network connection. That is, once the user initiates the query, the attack will complete without further actions from the victim user.

It is possible that the user later sees the injected emails and realizes from the synthetic content that he is under attack, but this happens only after the attack completes. Further, we note that although services like Gmail may strip suspicious HTML elements during email preprocessing, we can still use style formatting to avoid showing the injected content to the user, to reduce suspicion. The rest of the email could show content that is more user-friendly, e.g., an ad. It is also unlikely that spam filters would detect the injected emails, since the attack targets a specific user. This is just one example, but it illustrates the numerous ways that an attacker could inject data in a way that is difficult to detect before the query is reconstructed.
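The replay measurement described in Section 3.2.2 (drop server responses, record client traffic, count duplicate queries within 10 minutes) can be sketched as follows. This is a hedged illustration, not the tooling used in the study: the log format, the notion of a `fingerprint` that lets an observer recognize a repeated request, and the function name `count_replays` are all assumptions made for this example.

```python
def count_replays(requests, window_seconds=600):
    """Count duplicates of the first observed request within the window.

    `requests` is a time-sorted list of (timestamp_seconds, fingerprint)
    pairs, where `fingerprint` is any hypothetical value that is equal
    for two requests exactly when they carry the same query.
    """
    if not requests:
        return 0
    t0, original = requests[0]
    return sum(
        1
        for t, fp in requests[1:]
        if fp == original and t - t0 <= window_seconds
    )


# Hypothetical capture: the original query at t=0, two retries inside the
# 10-minute window, one unrelated request, and one retry after the window.
capture = [(0, "q"), (5, "q"), (30, "other"), (90, "q"), (700, "q")]
replays = count_replays(capture)
```

Under this sketch, an application that never retries yields a count of 0, matching how the 0 entries in Figure 2 are to be read.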


4 OUR ATTACKGiven the attacker abilities discussed in Section 3.2 in concert with volume leakage, we presentand analyze a �le-injection attack to recover a user’s query on a secondary index. We show thatthis attack can be launched on the generic encrypted database systems described in Section 3.1, aslong as the attacker can view the number of results returned.

4.1 OverviewAt a high level, our attack works by searching on the keyword universe through multiple rounds ofuser query replay. By recording the result counts between rounds, the attacker can narrow downthe keyword search space by a constant factor per round.

The attacker uses �le injection to in�uence the result count between rounds. During each round,the attacker constructs �les from the keyword search space and injects the �les into the user’sindex. The response for each round will then contain some number of injected �les. The attackercan use the new result count to determine the number of �les injected after the previous round. Inthis way, the attacker can determine which subset of the search space contains the user’s query.

The setup of the attack is as follows: A user queries $q_w$ on a database $\mathcal{D}$, as defined in Section 3.1. The response is the set of matching file contents, $q_w(\mathcal{D}) = \{f_1, \ldots, f_n\}$. The goal of the attack is to recover $w$, using only $n = |q_w(\mathcal{D})|$, the number of files returned.

4.2 Attack Algorithm

Algorithm 1 provides pseudocode for our attack RECONSTRUCT. In more detail, the attacker first records the user’s query $q_w$ and the number of files returned, $n_0$. Here, $n_0$ is the number of files that already matched $w$ prior to the attack. This enables the attacker to differentiate user-uploaded files from injected ones.

Next, the attacker proceeds in rounds to reduce the keyword search space. He chooses an initial dictionary $D_0$, a set of words that might contain $w$, and a parameter $k$. During each round $j$, the attacker divides $D_j$, the current dictionary during round $j$, into $k$ equal partitions. He injects $k$ files into the server and distributes the words among them as follows: If a word appears in the $i$-th partition, he adds the word to exactly $i$ out of the $k$ files. Hence, if a word appears in the $k$-th partition, the attacker adds this word to all $k$ files.

The attacker then replays the user’s query $q_w$ on the updated database and records the number of files returned, $n_j$. Assuming that the attacker can block other updates to the secondary index, the number of injected files that match the query in this round is then $i^* = n_j - n_{j-1}$. Thus, $w$ must have been assigned to the $i^*$-th partition during round $j$. The attacker repeats this in rounds, each time using the $i^*$-th partition as the new dictionary, until $|D| = 1$.

This attack converges in a bounded number of rounds since each round is guaranteed to reduce the dictionary size. Furthermore, for a high enough $k$ and a small enough $D$, the number of rounds, i.e., the number of times the attacker has to replay the user’s query, is quite low. We formalize this in the following claim:

Claim 1. For any dictionary $D$ and for any word $w \in D$, let $q_w$ be a private query for $w$, and $k$ be the number of partitions. Then, RECONSTRUCT($q_w$, $k$) returns $w$ after $\lceil \log_k |D| \rceil$ rounds.

Proof. Consider the $j$-th round of the attack, which searches a dictionary $D_j$ that contains $w$. The word $w$ is guaranteed to belong to a partition of the dictionary that has size $|D_j|/k$. Thus, round $j + 1$ of the attack will search a dictionary of size at most $|D_j|/k$ that also contains $w$. The algorithm repeats until the dictionary has size one. At this point, RECONSTRUCT returns the only word in the dictionary, $w$. Thus, it takes $\lceil \log_k |D| \rceil$ rounds to complete the attack, where $D$ is the initial dictionary. □


Algorithm 1. Pseudocode for the base attack. $q_w$ is a private query for a word $w$ on a database $\mathcal{D}$. Each round of the attack partitions the search space by $k$.

procedure RECONSTRUCT($q_w$, $k$)
    $\mathcal{D} \leftarrow$ the initial database
    $D \leftarrow$ keyword universe
    $n \leftarrow |q_w(\mathcal{D})|$
    while $|D| > 1$ do
        for $i$ in $[1, \ldots, k]$ do
            $F_i \leftarrow$ an empty file
            $D_i \leftarrow$ an empty dictionary
        end for
        for $index$ in $[1, \ldots, |D|]$ do
            $w \leftarrow D[index]$
            $i \leftarrow \lceil index / (|D|/k) \rceil$
            Append $w$ to $i$ unique files in $F$
            Add $w$ to dictionary $D_i$
        end for
        $\mathcal{D} \leftarrow$ INJECTFILES($\mathcal{D}$, $F$)
        $n' \leftarrow |q_w(\mathcal{D})|$
        $i \leftarrow n' - n$
        $D \leftarrow D_i$
        $n \leftarrow n'$
    end while
    return $D[1]$
end procedure
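As an illustration, the loop in Algorithm 1 can be simulated end to end against a toy in-memory index. The following is a minimal Python sketch under stated assumptions, not the authors’ implementation: `ToyIndex`, `inject`, `result_count`, and `reconstruct` are hypothetical names, partitions are taken as contiguous slices of the dictionary, the queried word is assumed to be in the dictionary, and no other updates are assumed to reach the index during the attack.

```python
class ToyIndex:
    """Hypothetical stand-in for the secondary index: word -> ids of matching files."""

    def __init__(self):
        self.postings = {}
        self.num_files = 0

    def inject(self, files):
        """Add each file (a list of words) to the index."""
        for words in files:
            for word in set(words):
                self.postings.setdefault(word, set()).add(self.num_files)
            self.num_files += 1

    def result_count(self, word):
        """Volume leakage: the number of files matching the queried word."""
        return len(self.postings.get(word, ()))


def reconstruct(replay_count, inject, dictionary, k):
    """Recover the queried word from result counts alone.

    replay_count() replays the hidden query and returns its result count;
    inject(files) adds attacker-built files to the index.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    words = list(dict.fromkeys(dictionary))  # de-duplicate, keep order
    if not words:
        raise ValueError("dictionary must not be empty")
    n = replay_count()  # n_0: matches that predate the attack
    rounds = 0
    while len(words) > 1:
        size = (len(words) + k - 1) // k  # ceil(|D_j| / k)
        parts = [words[p * size:(p + 1) * size] for p in range(k)]
        files = [[] for _ in range(k)]
        for i, part in enumerate(parts, start=1):
            for f in files[:i]:  # partition i is written to exactly i files
                f.extend(part)
        inject([f for f in files if f])  # skip files left empty by short partitions
        n_new = replay_count()
        i_star = n_new - n  # injected files that matched the hidden query
        if not 1 <= i_star <= k:
            raise ValueError("queried word is not in the dictionary")
        words = parts[i_star - 1]
        n = n_new
        rounds += 1
    return words[0], rounds


if __name__ == "__main__":
    index = ToyIndex()
    index.inject([["merger", "budget"], ["merger", "lunch"]])  # pre-existing files
    secret = "merger"  # known only to the simulated user
    dictionary = [f"w{i}" for i in range(998)] + ["merger", "budget"]
    print(reconstruct(lambda: index.result_count(secret), index.inject, dictionary, k=10))
```

With a 1000-word dictionary and $k = 10$, this toy run recovers the word in three rounds, in line with $\lceil \log_{10} 1000 \rceil = 3$. Because the sketch rounds partition sizes up, a target that lands in a short final partition can finish in fewer rounds than the bound.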

The attacker must also inject a significant number of files. We show that the number of files, along with the file size, measured in number of words, is not too large.

Claim 2. For any dictionary $D$ and for any word $w \in D$, let $q_w$ be a private query for $w$ and $k$ be the number of partitions. Then, the total number of files injected by RECONSTRUCT($q_w$, $k$) is $k \lceil \log_k |D| \rceil$.

Proof. During a single round of the attack, the words in the $i$-th partition of the dictionary must be distributed among $i$ unique files, so that the number of results for the query $q_w$ during the next round will be increased by $i$ if $w$ was in that partition. The maximum value for $i$ is $k$, the number of partitions. Therefore, each round requires injecting at least $k$ files. There are $\lceil \log_k |D| \rceil$ rounds according to Claim 1, so we require a total of $k \lceil \log_k |D| \rceil$ file injections. □
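The bounds in Claims 1 and 2 are easy to tabulate. The sketch below is an illustrative Python helper (the function names `rounds_needed` and `files_injected` are ours, not from the report) that evaluates $\lceil \log_k |D| \rceil$ with exact integer arithmetic, avoiding floating-point logarithms, and multiplies by $k$ for the file count.

```python
def rounds_needed(dict_size: int, k: int) -> int:
    """Return ceil(log_k(dict_size)) using exact integer arithmetic."""
    if dict_size < 1 or k < 2:
        raise ValueError("need dict_size >= 1 and k >= 2")
    rounds, reach = 0, 1
    while reach < dict_size:  # invariant: reach == k ** rounds
        reach *= k
        rounds += 1
    return rounds


def files_injected(dict_size: int, k: int) -> int:
    """Total files injected according to Claim 2: k files in each round."""
    return k * rounds_needed(dict_size, k)


if __name__ == "__main__":
    for dict_size in (100, 1000, 10_000, 340_000):
        for k in (2, 4, 8, 16, 32):
            print(dict_size, k, rounds_needed(dict_size, k), files_injected(dict_size, k))
```

For example, with $|D| = 340\text{K}$, choosing $k = 2$ gives 19 rounds and 38 files, whereas $k = 32$ gives 4 rounds and 128 files.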

Claim 3. For any dictionary $D$ and for any word $w \in D$, let $q_w$ be a private query for $w$ and $k$ be the number of partitions. Then, the total number of words injected by RECONSTRUCT($q_w$, $k$) is $O(k|D|)$.

Proof. A dictionary $D_j$ is searched during round $j$ of the attack. Each word in partition $i$ of the dictionary appears $i$ times during round $j$. Each partition has size $|D_j|/k$. Therefore, the total file size injected during this round is $\frac{|D_j|}{k}(1 + 2 + \cdots + k) = O(k|D_j|)$. Each round reduces the size of the dictionary searched by a factor of $k$, so $|D_{j+1}| = |D_j|/k$. According to Claim 1, there are $\lceil \log_k |D| \rceil$ rounds. Then, if the initial dictionary has size $|D|$, the total file size injected across all rounds is:

$$k|D| + k\left(\frac{|D|}{k}\right) + k\left(\frac{|D|}{k^2}\right) + \cdots + k\left(\frac{|D|}{k^{\lceil \log_k |D| \rceil}}\right) < k|D|\left(1 + \frac{1}{k} + \frac{1}{k^2} + \cdots\right) = O(k|D|)$$

| Notation | Definition |
| --- | --- |
| $\mathcal{D}$ | The database, a secondary index mapping words to the files they are associated with. |
| $q_w$ | A query for the word $w$, where $w$ is hidden. |
| $D$ | The dictionary, a set of words probed by the attacker. |
| $k$ | The number of partitions to search during each round. A higher $k$ means more files injected per round, but fewer rounds total. |
| $n_j$ | $\lvert q_w(\mathcal{D}) \rvert$, or the number of file results for the query on round $j$. For $j = 0$, this is the result count of the user’s initial query. |
| $m$ | A parameter for the single-round attack. A higher $m$ means more files injected, but higher expected accuracy. |
| $s$ | A parameter for the noisy data attack. A higher $s$ means more files injected, but a greater possible amount of noise tolerated. |

Fig. 3. A table of notation for the attacks described.
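The word-count bound of Claim 3 can be checked numerically. The following Python sketch is an illustration with hypothetical function names; it assumes contiguous partitions whose sizes are rounded up and a worst case in which the target always falls in a full-size partition. It counts the words written in every round and compares the total with the closed form of the geometric series, $k|D| \cdot \frac{k}{k-1}$.

```python
def words_injected(dict_size: int, k: int) -> int:
    """Worst-case number of words written across all rounds of the base attack."""
    if dict_size < 1 or k < 2:
        raise ValueError("need dict_size >= 1 and k >= 2")
    total, d = 0, dict_size
    while d > 1:
        size = (d + k - 1) // k  # ceil(d / k): largest partition in this round
        remaining = d
        for i in range(1, k + 1):
            part = min(size, remaining)  # partition i; late partitions may be short or empty
            total += i * part            # each of its words is written to i files
            remaining -= part
        d = size  # worst case: the target lies in a full-size partition
    return total


def geometric_bound(dict_size: int, k: int) -> float:
    """Closed form of k|D|(1 + 1/k + 1/k^2 + ...) = k|D| * k / (k - 1)."""
    return k * dict_size * k / (k - 1)


if __name__ == "__main__":
    for dict_size, k in [(1000, 10), (8, 2)]:
        print(dict_size, k, words_injected(dict_size, k), geometric_bound(dict_size, k))
```

For $|D| = 1000$ and $k = 10$, the sketch counts 6105 injected words, below the series bound of roughly 11111.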

By leveraging the three assumptions presented in §3, our attack can recover a user’s query on a generic secondary index with perfect accuracy. The number of results returned is indeed the only information we assume. Moreover, we do not require any knowledge of the distribution of queries, file contents and other metadata.


5 EVALUATION

In this section, we evaluate the overheads and accuracy for our attack by simulating an encrypted email database in Section 5.1. We also present a case study on Gmail to evaluate the feasibility of the attacker’s capabilities in a real-world application in Section 5.2. We find that a Gmail attacker can perform the necessary injection, replay and file count measurement. In addition, we demonstrate successful attacks on Gmail for a variety of dictionary sizes that complete within a matter of minutes.

5.1 Simulation Using Encrypted Email Databases

5.1.1 Setup. In the following experiments, we use the entire corpus of emails from the Enron email dataset [16] as the queried documents, consisting of ∼500K emails belonging to 151 users and ∼2.5GB in size. We extracted keywords from this dataset by first stemming the words [48], and then removing 675 stopwords. We next filtered out any words that contained non-alphabetic characters, or were ≥ 20 or ≤ 3 characters long. This gave us a total of ∼259K keywords. In our experiments, we only used the top ∼123K keywords (i.e., those that appeared in > 3 documents) in order to remove noise from the dataset.
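The preprocessing steps above can be sketched as a small pipeline. This is a hedged Python illustration rather than the script used for the experiments: the function name `extract_keywords`, the whitespace tokenization with punctuation stripping, and the exact length thresholds are assumptions, and the default `stem` is an identity placeholder where a Porter stemmer [48] would be plugged in.

```python
import string


def extract_keywords(documents, stopwords, stem=lambda word: word,
                     min_len=4, max_len=19, min_docs=4):
    """Return the sorted keywords kept after stemming and filtering.

    min_len/max_len encode an assumed reading of the length filter, and
    min_docs=4 encodes "appeared in > 3 documents".
    """
    doc_freq = {}
    for text in documents:
        kept = set()
        for token in text.lower().split():  # assumed tokenization: whitespace
            word = stem(token.strip(string.punctuation))
            if word in stopwords:
                continue
            if not (word.isascii() and word.isalpha()):
                continue  # drop empty tokens and non-alphabetic words
            if not min_len <= len(word) <= max_len:
                continue  # drop very short and very long words
            kept.add(word)
        for word in kept:  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(word for word, df in doc_freq.items() if df >= min_docs)


if __name__ == "__main__":
    docs = ["The merger lunch abc x2y", "the merger, lunch abc",
            "the Merger lunch", "the merger."]
    print(extract_keywords(docs, stopwords={"the"}))
```

In this toy corpus only `merger` survives: `the` is a stopword, `abc` is too short, `x2y` is not alphabetic, and `lunch` appears in only three documents.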

Since an attacker’s dictionary may contain words that do not exist in the queried documents, we supplemented the Enron keywords with a corpus of English words [15]. Preprocessing the English words in a similar manner yielded a total of ∼257K keywords. The union of both datasets resulted in a universe of ∼342K keywords.

5.1.2 Evaluation. Assuming that the queried word is in the initial dictionary chosen by the attacker, our attack achieves perfect query recovery, with strict bounds on the overheads necessary in number of query replays and data injected. Our simulation of the attack in Figures 4 and 5 confirms the theoretical guarantees in Claim 1 and Claim 2.

In this experiment, we build the attacker’s dictionary $D$ by randomly sampling keywords from the keyword universe. We pick the keyword queried by the user at random from $D$ in order to stress test the effort required by the attacker; a keyword not in $D$ would be trivially detected at the end of a single round without requiring further replays. We then report the number of rounds required to guess the keyword with 100% accuracy for different choices of $k$ in Figure 4, and the total number of files injected across rounds in Figure 5. Recall from Section 4.2 that any instance of the attack converges after exactly $\lceil \log_k |D| \rceil$ replays and $k \lceil \log_k |D| \rceil$ files injected, where $k$ is an integer chosen by the attacker. Thus, with a dictionary of fixed size $|D|$, the parameter $k$ represents a tradeoff between the number of query replays required vs. the number of file injections required. The attacker’s choice of $k$ then depends on the attacker’s ability to replay the query and the rate at which files can be injected for the target application.

We explore this tradeoff with fixed-size dictionaries in Figure 6, which demonstrates how the average number of bytes injected per round increases with $k$ (while the number of rounds decreases). In the worst case where the dictionary comprises the entire keyword universe and $k = 24$, the number of bytes injected per round is still less than 10MB, demonstrating the feasibility of the attack. We also show the maximum number of bytes injected across any round in Figure 6, equivalent to the number of bytes injected during the first round. The number of bytes that the attacker can inject during a single round must be at least as large as this number. We find that even in the worst case, this is approximately 50MB.

Takeaway. The attack can be mounted easily even when queries are replayed at most once, i.e., the attacker can recover the keyword in merely two rounds without having to inject more than several tens of MBs of data. As an example, Gmail limits the size of emails to a comfortable 25MB [23], and the attacker only needs to send 3-4 emails to the victim’s inbox in order to identify the query.

Fig. 4. Number of rounds required to identify a keyword with varying choices of $k$, across different dictionary sizes. [Plot: number of rounds vs. choice of $k$ (2, 4, 8, 16, 32) for $|D|$ = 100, 1000, 10K, and 340K.]

Fig. 5. Total number of files injected with varying choices of $k$, across different dictionary sizes. [Plot: total files injected vs. choice of $k$ (2, 4, 8, 16, 32) for $|D|$ = 100, 1000, 10K, and 340K.]

Fig. 6. Number of bytes injected with varying choices of $k$, across different dictionary sizes: (left) average bytes per round; (right) maximum bytes across rounds. [Plots: bytes injected (100B to 100MB, log scale) vs. choice of $k$ (2, 4, 8, 16, 32).]


Fig. 7. (Gmail) Network setup: The MITM proxy and the real Gmail server together simulate a malicious Gmail server (shaded) that can only learn the result counts of user queries issued by the Gmail client.

5.2 Case Study of Gmail Inbox Search

So far, we have experimentally validated the theoretical performance of our attacks across various parameter choices. In this section, we demonstrate the practical feasibility of our attack in real-world applications by attacking Gmail’s inbox search feature.

We attack Gmail by simulating a server-side adversary using a man-in-the-middle proxy (Figure 7), and we assume that the volume of results is leaked to it by the server. We first show that the attacker can indeed meet the two key requirements of file injection and automatic query replay.

Subsequently, we perform an exhaustive experiment across a wide range of dictionary sizes (10 to 100K) to determine the minimum amount of time required to mount a successful attack on Gmail. The parameters of our attack are governed by the following constraints: (i) the periodicity of replays in the Gmail client; (ii) the time it takes to inject files into a user’s inbox; and (iii) the pagination limit in Gmail (which upper bounds the total number of injections). Despite these constraints, we show that our attack completes within a matter of minutes for the Gmail application. We find that for a small dictionary of size 10, a successful attack can be mounted within 1 minute from start to end; for a large dictionary with 100K words, an attack completes successfully in around 7 minutes (see Figure 9). We now describe our methodology in more detail.

5.2.1 Setup. Since we don’t have control over Gmail servers, we simulate a server-side adversary using a man-in-the-middle (MITM) HTTPS proxy [41]. Specifically, we launch the Gmail web client on a browser within a guest virtual machine, and launch the MITM proxy on the host. We reroute all host network traffic through the MITM proxy. Subsequently, we install the proxy’s certificate in the client browser in order to simulate a server-side adversary. At this point, all TLS network traffic to and from the client browser passes through the MITM proxy, which can then examine and manipulate it. Though the proxy can now read all plaintext data on the HTTP layer, we make sure that during the attack it only processes header information unrelated to the private keywords, e.g., the HTTP method and the Origin and Referer fields.

Note that our implementation does not modify any code running on the real Gmail client or the Gmail server. The MITM proxy is the entity that can see the number of matched emails returned from the server, construct and inject special emails, and drop server responses to trigger client replays. The proxy and the actual Gmail server together thus comprise an untrusted Gmail server (the highlighted grey area in Figure 7) that only learns the volume of keyword search results, similar to our model described in §3.


Fig. 8. (Gmail) CDF measuring the time it takes for files to be injected into the index for Gmail. [Plot: number of files injected (0 to 50) vs. time since injection (10 to 70 seconds).]

5.2.2 Key Characteristics. We then study Gmail’s inbox search features to demonstrate the practical feasibility of our attack.

Query replay. Once a user issues a query, we use the MITM proxy to trigger automatic query replay by simply dropping the HTTP responses returned by the server. After a period of time, the Gmail web client retries the query automatically, without user intervention. Specifically, we find that the client replays the query every 1–3 minutes in the absence of a response. To the user, it simply appears as if the client has a bad network connection.

File injection. File injection in Gmail is simple; the attacker requires a separate Gmail account to send emails to the victim. The attacker must send $k$ emails in each round and also be sure that they are all indexed by the next replay (i.e., within 60s, the shortest replay interval we observed). We determined the rate at which emails could be injected (Figure 8) to show that it is feasible to index a sufficient number of emails. We found that after injecting 40 emails of size 10KB each, 36 were visible in the user’s mailbox 60 seconds later. Thus, within a time window of 60s, the attacker can pick any value less than 36 as a safe option for $k$.

Volume leakage. In this experiment we assume that the proxy directly obtains the exact result set size from the server, since we simulate a server-side adversary. However, we find that Gmail has a maximum pagination limit of 100, i.e., the server returns at most 100 results in response to a query. The pagination limit constrains the parameter regime of our attack, in that it upper bounds the total number of files that can be injected by the attacker over the duration of the attack.

5.2.3 End-to-end Attack. The aim of the experiment is to minimize the time it takes to launch a successful attack. However, the constraints discussed above (the periodicity of replays, the time it takes to inject files, and the pagination limit) restrict the parameter regime within which an attacker can operate. Therefore, we start by computing the optimal parameters required for mounting a successful attack within the space of possible parameters. Next, we attack Gmail using the computed parameters and report the end-to-end duration of our attack.

Since the Gmail application has a fixed periodicity of replays, the attack duration is directly governed by the number of replays required for the attack. Hence, given a dictionary size $|D|$, our aim is to minimize $\log_k |D|$, where $k$ refers to the number of files that need to be injected per round. However, given the pagination limit of $\ell = 100$, we require that the total number of injected files $k \times \log_k |D|$ be less than $\ell$. At the same time, $k$ should be less than 36 given the time it takes to inject files.


We therefore solve the following optimization problem:

$$\begin{aligned} \text{minimize} \quad & \log_k |D| \\ \text{subject to} \quad & k \times \log_k |D| < 100 \\ \text{and} \quad & k < 36 \end{aligned}$$
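Because $k$ is a small integer, this problem can be solved by brute force. The sketch below is one possible Python formulation, not necessarily the procedure used in the report; it assumes that the round count is rounded up to $\lceil \log_k |D| \rceil$ and that ties are broken in favor of the smallest $k$, and the names `rounds_needed` and `best_parameters` are hypothetical.

```python
def rounds_needed(dict_size: int, k: int) -> int:
    """Return ceil(log_k(dict_size)) using exact integer arithmetic."""
    rounds, reach = 0, 1
    while reach < dict_size:
        reach *= k
        rounds += 1
    return rounds


def best_parameters(dict_size: int, page_limit: int = 100, k_limit: int = 36):
    """Return (k, rounds) minimizing rounds subject to both constraints, or None."""
    best = None
    for k in range(2, k_limit):  # enforces k < k_limit
        rounds = rounds_needed(dict_size, k)
        if k * rounds >= page_limit:  # total injections must stay below the page limit
            continue
        if best is None or rounds < best[1]:  # strict '<' keeps the smallest such k
            best = (k, rounds)
    return best


if __name__ == "__main__":
    for dict_size in (10, 100, 1000, 10_000, 100_000):
        print(dict_size, best_parameters(dict_size))
```

Under these assumptions the search returns $(k, \text{rounds})$ = (10, 1), (10, 2), (32, 2), (22, 3), and (18, 4) for $|D|$ = 10, 100, 1K, 10K, and 100K, which agrees with the $k$ and theoretical replay columns of Figure 9.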

| $\lvert D \rvert$ | $k$ | No. of replays (theoretical) | No. of replays (actual) | Total injected emails | Attack duration |
| --- | --- | --- | --- | --- | --- |
| 10 | 10 | 1 | 1 | 10 | 1 min |
| 100 | 10 | 2 | 2 | 20 | 2 min 5s |
| 1K | 32 | 2 | 2 | 63-64 | 2 min 5s |
| 10K | 22 | 3 | 3 | 64 | 5 min 6s |
| 100K | 18 | 4 | 5 | 71 | 7 min 10s |

Fig. 9. (Gmail) Attack parameters and duration for Gmail across various dictionary sizes.

Figure 9 summarizes our findings for varying sizes of the attacker’s dictionary, 10 to 100K. Note that the total number of injected emails is sometimes marginally less than $k \times \log_k |D|$. This is because $\log_k |D|$ is not always an integer, and therefore files of interest across subsequent rounds may sometimes contain fewer than $k$ keywords. Additionally, for $|D| = 100\text{K}$, our attack requires an extra round of replay because the injected files in the first round were large, increasing the time it took for files to be sent by the client and indexed by the server.

Overall, our experiment demonstrates the feasibility of volume-based attacks on Gmail, which can be successfully completed within a matter of minutes depending on the size of the attacker’s dictionary.

In addition, the attack is difficult to detect because during the course of the attack, the user only sees a suspended connection. The user only makes a single query, and the Gmail client automatically replays the query in the background. During this time, emails injected by the attacker are also not delivered to the user’s web client, and only modify the server-side index. The user may later see the injected emails, but only after the attack successfully completes.

5.2.4 Other Applications. Besides Gmail, we also use the setup from Section 5.2.1 to evaluate our attacks on 4 other popular applications: Facebook, Dropbox, Google Docs, and iCloud Mail. Taking Facebook as an example, we consider a large group channel, e.g., free & for sale or course advice, that both the attacker and the user have joined. When the user makes a search within the group, e.g., to buy some necessity, the attacker can inject specially-crafted posts to the group to infer what keywords the user searches for. This is problematic because the reconstructed keywords can help the attacker learn sensitive information about users, such as what items they need at the moment or what kinds of academic advice they are seeking. After conducting a case study on Facebook (similar to the methodology used in Section 5.2.2), we find that Facebook replays group search queries every 1–4 minutes if it does not hear back from the server. Subsequently, we launch our attack with a dictionary size $|D|$ of 10 and partition number $k$ of 10 to demonstrate its practicality. The attack completes in 4 minutes.


6 MITIGATIONS

In this section, we discuss mitigations of our attack. Overall, injection is often fundamental to application functionality, padding is too expensive, and replay could be a legitimate user or application action, as discussed in Section 3.2. Nevertheless, based on our evaluation in §5, we believe that the mitigations proposed below could reduce the extent of our attack by limiting the attacker’s abilities (Section 3.2) or making it too expensive to mount.

Note that while these methods can mitigate the attack, the vulnerability from result count leakage is difficult to eradicate from a system completely because it relies on very little from the system model and on features inherent to the application model. Rather than trying to reduce system leakage, we believe that the most effective mitigation is actually on the application side, although these techniques too may be burdensome because they interfere with application-specific functionality (e.g., disallowing users from sending email).

6.1 Disable File Injection

File injection is arguably the most difficult to defend against, since it is often a part of the target application’s functionality. For example, an email inbox search feature would be of little use if it could only search emails that were sent by the user, not to the user. Thus, we believe that the main defense here is rate-limiting and detection. In the email application (Section 3.1), this would require the server to actively filter out suspicious emails. As we found in Section 5.2, applications such as Gmail already rate-limit emails; however, this was not enough to defeat the attack.

6.2 Prevent File Measurement

Reducing the attacker’s ability to measure the number of files contained in a response is far more effective. However, padding query responses, while effective in hiding the response size, leads to unacceptable bandwidth overheads for the application. We believe that to varying degrees, this is a property of all padding schemes.

A more effective way to hinder the attacker is to inject some noise in the query responses. This requires little overhead in server-client bandwidth compared to the attacker’s overhead: an additive factor of $k$ per query compared to a multiplicative factor of $k$ per attack. This countermeasure is also simple to implement; simply add a random number of dummy files to every response and have the client filter them out.

Another method is to limit the number of results that can be fetched at a time. The user must explicitly request further results if needed. This lowers the feasible dictionary size for the attack, at the cost of user convenience.
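The dummy-file countermeasure described in this section can be sketched as follows. This is a hypothetical Python illustration, not a vetted defense: the names `pad_response`, `client_filter`, and `observed_delta` are ours, responses are modeled as lists of dictionaries, and the `dummy` flag stands in for a marker that in a real deployment would have to sit inside the encrypted payload so the server-side observer cannot read it.

```python
import random


def pad_response(results, max_dummies, rng=random):
    """Server side: append a random number (0..max_dummies) of dummy entries."""
    dummies = [{"dummy": True} for _ in range(rng.randint(0, max_dummies))]
    return list(results) + dummies


def client_filter(response):
    """Client side: discard the dummy entries after decryption."""
    return [item for item in response if not item.get("dummy")]


def observed_delta(true_before, true_after, max_dummies, rng=random):
    """Count difference seen by an observer across two independently padded replays."""
    before = true_before + rng.randint(0, max_dummies)
    after = true_after + rng.randint(0, max_dummies)
    return after - before


if __name__ == "__main__":
    rng = random.Random(7)
    results = [{"id": i} for i in range(5)]
    padded = pad_response(results, max_dummies=10, rng=rng)
    print(len(padded), client_filter(padded) == results)
    # The true increase is 3, but the observed difference now varies per replay.
    print([observed_delta(5, 8, max_dummies=10, rng=rng) for _ in range(5)])
```

Because each replay is padded independently, the difference between consecutive counts no longer equals the partition index, so a single replay per round is no longer enough to identify the partition.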

6.3 Block Query Replay

The most effective way to prevent our attack is to block query replays. Query replays are an important feature of applications such as Gmail that produce the illusion of a seamless connection during limited network connectivity (Section 5.2). A possible countermeasure is to include a unique query ID for each request, so that the server can detect and filter out duplicate requests.

The main disadvantage of such an approach is that the server would then have to record and replay past responses in order to both prevent the attack and keep the application available. Long-running user sessions would have to be garbage-collected, potentially sacrificing correctness. More crucially, web servers are often replicated for performance and fault tolerance. Ensuring consistency for duplicate queries in such settings is well-known to be expensive, if even possible [22].
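The query-ID countermeasure and its storage cost can be illustrated with a small sketch. The Python below is a single-server toy under stated assumptions (the class name `DedupServer` and its interface are hypothetical): a replay carrying an already-seen ID is answered from the recorded response rather than re-evaluated, so files injected in between do not change the count.

```python
class DedupServer:
    """Toy server that answers a replayed query ID from its recorded response."""

    def __init__(self, search):
        self._search = search  # callable: query -> list of results
        self._responses = {}   # query_id -> response first returned (grows until collected)

    def handle(self, query_id, query):
        if query_id not in self._responses:
            self._responses[query_id] = list(self._search(query))
        return self._responses[query_id]


if __name__ == "__main__":
    files = [["merger"], ["lunch"]]

    def search(word):
        return [i for i, words in enumerate(files) if word in words]

    server = DedupServer(search)
    print(len(server.handle("q-1", "merger")))  # first evaluation of the query
    files.extend([["merger"], ["merger"]])      # injections between replays
    print(len(server.handle("q-1", "merger")))  # replay: recorded response, same count
    print(len(server.handle("q-2", "merger")))  # a fresh query sees the new files
```

The `_responses` table is exactly the state the text warns about: it must be retained for as long as replays may arrive, and it would have to be kept consistent across replicas.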


7 CONCLUSION

In this work, we proposed a novel generic attack on encrypted databases that only leverages result size leakage. We demonstrated that our attack can reconstruct queries with 100% confidence for a range of realistic settings, jeopardizing the security guarantees of these systems. We showed the effectiveness of our attack via both theoretical bounds and an empirical evaluation, including a demonstration on the Gmail web application.


REFERENCES

[1] Signal. https://signal.org.
[2] M. A. Abdelraheem, T. Andersson, and C. Gehrmann. Inference and Record-Injection Attacks on Searchable Encrypted Relational Databases. Cryptology ePrint Archive, Report 2017/024, 2017. http://eprint.iacr.org/2017/024.
[3] A. Arasu, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, and R. Venkatesan. A secure coprocessor for database applications. In FPL, 2013.
[4] D. Asonov. Querying Databases Privately. In ISBN 3-540-22441-6 Springer-Verlag Berlin Heidelberg, 2003.
[5] V. Bindschaedler, M. Naveed, X. Pan, X. Wang, and Y. Huang. Practicing Oblivious Access on Cloud Storage: The Gap, the Fallacy, and the New Way Forward. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS), Denver, CO, 2015.
[6] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill. Order-Preserving Symmetric Encryption. In Proceedings of the 28th Annual International Conference on the Theory and Applications of Cryptographic Techniques (Eurocrypt), Cologne, Germany, 2009.
[7] A. Boldyreva, N. Chenette, and A. O’Neill. Order-Preserving Encryption Revisited: Improved Security Analysis and Alternative Solutions. In Proceedings of the 31st International Cryptology Conference (CRYPTO), Santa Barbara, CA, 2011.
[8] R. Bost. Σoφoς: Forward Secure Searchable Encryption. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS), Vienna, Austria, 2016.
[9] D. Cash, P. Grubbs, J. Perry, and T. Ristenpart. Leakage-Abuse Attacks Against Searchable Encryption. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS), Denver, CO, 2015.
[10] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner. Dynamic Searchable Encryption in Very-Large Databases: Data Structures and Implementation. In Proceedings of the 21st Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2014.
[11] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner. Highly-Scalable Searchable Symmetric Encryption with Support for Boolean Queries. In Proceedings of the 33rd International Cryptology Conference (CRYPTO), Santa Barbara, CA, 2013.
[12] CipherCloud: Cloud Data Protection Solution, 2017. http://www.ciphercloud.com.
[13] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky. Searchable symmetric encryption: improved definitions and efficient constructions. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS), Alexandria, VA, 2006.
[14] J. L. Dautrich, Jr. and C. V. Ravishankar. Compromising Privacy in Precise Query Protocols. In International Conference on Extending Database Technology, 2013.
[15] English keywords dataset, 2017. https://github.com/dwyl/english-words.
[16] Enron email dataset, 2017. https://www.cs.cmu.edu/~./enron/.
[17] S. Faber, S. Jarecki, H. Krawczyk, Q. Nguyen, M. Rosu, and M. Steiner. Rich Queries on Encrypted Data: Beyond Exact Matches. In Proceedings of the 20th European Symposium on Research in Computer Security (ESORICS), Vienna, Austria, 2015.
[18] R. Fielding and J. Reschke. Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230, 2014.
[19] B. Fuller, M. Varia, A. Yerukhimovich, E. Shen, A. Hamlin, V. Gadepally, R. Shay, J. D. Mitchell, and R. K. Cunningham. SoK: Cryptographically Protected Database Search. In Proceedings of the 38th IEEE Symposium on Security and Privacy (IEEE S&P), 2017.
[20] W. Gasarch. A survey on private information retrieval. In The Computational Complexity Column, 2007.
[21] M. Giaruad, A. Anzala-Yamajako, O. Bernard, and P. Lafourcase. Practical Passive Leakage-Abuse Attacks Against Symmetric Searchable Encryption. Cryptology ePrint Archive, Report 2017/046, 2017. http://eprint.iacr.org/2017/046.
[22] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.
[23] Gmail size limits, 2017. https://support.google.com/mail/answer/6584.
[24] O. Goldreich and R. Ostrovsky. Software Protection and Simulation on Oblivious RAMs. J. ACM, pages 431–473, 1996.
[25] P. Grubbs, M.-S. Lacharité, B. Minaud, and K. G. Paterson. Pump up the volume: Practical database reconstruction from volume leakage on range queries. In Proceedings of the 25th ACM Conference on Computer and Communications Security (CCS), 2018.
[26] P. Grubbs, R. McPherson, M. Naveed, T. Ristenpart, and V. Shmatikov. Breaking Web Applications Built On Top of Encrypted Data. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS), Vienna, Austria, 2016.
[27] W. He, D. Akhawe, S. Jain, E. Shi, and D. X. Song. ShadowCrypt: Encrypted Web Applications for Everyone. In Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS), Scottsdale, AZ, 2014.
[28] iQrypt: Encrypt and query your database, 2017. http://iqrypt.com/.


[29] M. S. Islam, M. Kuzu, and M. Kantarcioglu. Access Pattern Disclosure on Searchable Encryption: Rami�cation, Attackand Mitigation. In Proceedings of the 19th Network and Distributed System Security Symposium (NDSS), San Diego, CA,2012.

[30] M. S. Islam, M. Kuzu, and M. Kantarcioglu. Inference Attack Against Encrypted Range Queries on OutsourcedDatabases. In Proceedings of the 4th ACM Conference on Data and Application Security and Privacy, San Antonio, TX,2014.

[31] S. Kamara, C. Papamanthou, and T. Roeder. Dynamic Searchable Symmetric Encryption. In Proceedings of the 19thACM Conference on Computer and Communications Security (CCS), Raleigh, NC, 2012.

[32] G. Kellaris, G. Kollios, K. Nissim, and A. O’Neill. Generic Attacks on Secure Outsourced Databases. In Proceedings ofthe 23rd ACM Conference on Computer and Communications Security (CCS), Vienna, Austria, 2016.

[33] F. Kerschbaum and A. Schröpfer. Optimal Average-Complexity Ideal-Security Order-Preserving Encryption. InProceedings of the 21st ACM Conference on Computer and Communications Security (CCS), Scottsdale, AZ, 2014.

[34] K. Kurosawa. Garbled searchable symmetric encryption. In Proceedings of the 18th International Conference onFinancial Cryptography, 2014.

[35] M.-S. Lacharité, B. Minaud, and K. G. Paterson. Improved reconstruction attacks on encrypted data using range queryleakage. In Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&P), 2018.

[36] B. Lau, S. P. Chung, C. Song, Y. Jang, W. Lee, and A. Boldyreva. Mimesis Aegis: A Mimicry Privacy Shield - A System’sApproach to Data Privacy on Public Cloud. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA,2014.

[37] K. Lewi and D. J. Wu. Order-Revealing Encryption: New Constructions, Applications, and Lower Bounds. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS), Vienna, Austria, 2016.

[38] C. Liu, L. Zhu, M. Wang, and Y.-A. Tan. Search pattern leakage in searchable encryption: Attacks and new construction. Inf. Sci., 265:176–188, 2014.

[39] M. Marlinspike. Technology preview: Private contact discovery for signal. https://signal.org/blog/private-contact-discovery/, 2017.

[40] Microsoft SQL Server: Always Encrypted Database Engine, 2017. https://msdn.microsoft.com/en-us/library/mt163865.aspx.

[41] MITM Proxy, 2018. http://mitmproxy.org/.

[42] M. Naveed. The Fallacy of Composition of Oblivious RAM and Searchable Encryption. Cryptology ePrint Archive, Report 2015/668, 2015. http://eprint.iacr.org/2015/668.

[43] M. Naveed, M. Prabhakaran, and C. A. Gunter. Dynamic Searchable Encryption via Blind Storage. In Proceedings of the 35th IEEE Symposium on Security and Privacy (IEEE S&P), 2014.

[44] W. Ogata, K. Koiwa, A. Kanaoka, and S. Matsuo. Toward Practical Searchable Symmetric Encryption. In Proceedings of the 8th International Workshop on Security, 2013.

[45] F. Olumofin and I. Goldberg. Privacy-preserving queries over relational databases. In Proceedings of the 10th Privacy Enhancing Technologies Symposium (PETS), Berlin, Germany, 2010.

[46] R. A. Popa, F. H. Li, and N. Zeldovich. An Ideal-Security Protocol for Order-Preserving Encoding. In Proceedings of the 34th IEEE Symposium on Security and Privacy (IEEE S&P), San Francisco, CA, 2013.

[47] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB: Protecting Confidentiality with Encrypted Query Processing. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), Cascais, Portugal, 2011.

[48] M. F. Porter. An Algorithm for Suffix Stripping. Readings in Information Retrieval, pages 313–316, 1997.

[49] D. Pouliot and C. V. Wright. The Shadow Nemesis: Inference Attacks on Efficiently Deployable, Efficiently Searchable Encryption. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS), Vienna, Austria, 2016.

[50] Cloud Threat Intelligence, Skyhigh Cloud Security labs, Skyhigh Networks, 2017. https://www.skyhighnetworks.com/.

[51] D. S. Roche, A. J. Aviv, and S. G. Choi. A Practical Oblivious Map Data Structure with Secure Deletion and History Independence. In Proceedings of the 37th IEEE Symposium on Security and Privacy (IEEE S&P), 2016.

[52] C. Sahin, V. Zakhary, A. El Abbadi, H. Lin, and S. Tessaro. TaoStore: Overcoming Asynchronicity in Oblivious Data Storage. In Proceedings of the 37th IEEE Symposium on Security and Privacy (IEEE S&P), 2016.

[53] N. P. Smart (Editor). Future Directions in Computing on Encrypted Data. In ECRYPT report, 2015.

[54] D. X. Song, D. Wagner, and A. Perrig. Practical Techniques for Searches on Encrypted Data. In Proceedings of the 21st IEEE Symposium on Security and Privacy (IEEE S&P), Oakland, CA, 2000.

[55] E. Stefanov, C. Papamanthou, and E. Shi. Practical Dynamic Searchable Encryption with Small Leakage. In Proceedings of the 21st Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2014.

[56] E. Stefanov and E. Shi. ObliviStore: High Performance Oblivious Cloud Storage. In Proceedings of the 36th IEEE Symposium on Security and Privacy (IEEE S&P), 2015.

[57] E. Stefanov, M. van Dijk, E. Shi, C. W. Fletcher, L. Ren, X. Yu, and S. Devadas. Path ORAM: an extremely simple oblivious RAM protocol. In Proceedings of the 20th ACM Conference on Computer and Communications Security (CCS), Berlin, Germany, 2013.

[58] S. Tu, M. F. Kaashoek, S. Madden, and N. Zeldovich. Processing Analytical Queries over Encrypted Data. In Proceedings of the 39th International Conference on Very Large Data Bases (VLDB), Riva del Garda, Italy, 2013.

[59] S. Wang, D. Agrawal, and A. E. Abbadi. Generalizing PIR for Practical Private Retrieval of Public Data. In Lecture Notes in Computer Science, volume 6166, 2010.

[60] Y. Zhang, J. Katz, and C. Papamanthou. All Your Queries Are Belong to Us: The Power of File-Injection Attacks on Searchable Encryption. In Proceedings of the 25th USENIX Security Symposium, Austin, TX, 2016.