
1939-1374 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSC.2017.2765323, IEEE Transactions on Services Computing


EliMFS: Achieving Efficient, Leakage-resilient, and Multi-keyword Fuzzy Search on Encrypted Cloud Data

Jing Chen, Kun He, Lan Deng, Quan Yuan, Ruiying Du, Yang Xiang, and Jie Wu

Abstract—Motivated by privacy preservation requirements for outsourced data, keyword searches over encrypted cloud data have become a hot topic. Compared to single-keyword exact searches, multi-keyword fuzzy search schemes attract more attention because of their improvements in search accuracy, typo tolerance, and user experience in general. However, existing multi-keyword fuzzy search solutions are not sufficiently efficient when the file set in the cloud is large. To address this, we propose an Efficient Leakage-resilient Multi-keyword Fuzzy Search (EliMFS) framework over encrypted cloud data. In this framework, a novel two-stage index structure is exploited to ensure that search time is independent of file set size. The multi-keyword fuzzy search function is achieved through a delicate design based on the Gram Counting Order, the Bloom filter, and Locality-Sensitive Hashing. Furthermore, considering the leakages caused by the two-stage index structure, we propose two specific schemes to resist these potential attacks in different threat models. Extensive analysis and experiments show that our schemes are highly efficient and leakage-resilient.

Index Terms—Cloud security, searchable encryption, multi-keyword fuzzy search.

I. INTRODUCTION

Outsourcing data to the cloud has become a prevalent trend in recent years because of the benefits it brings to data owners, including convenient access, decreased costs, and flexible data management. With the impetus of IT giants like Amazon, Google, and Microsoft, more enterprises and individuals are inclined to upload their data to cloud storage [1]. However, data leakages occur frequently, causing huge losses to both cloud providers and cloud users. The lack of data security and privacy protection has become the biggest obstacle to the development of cloud storage technologies. To protect owners’ data privacy, all data, especially sensitive information, must be encrypted before outsourcing, which unfortunately makes traditional plaintext-search-based data utilization services a challenging task.

Searchable Symmetric Encryption (SSE) is a useful cryptographic primitive for searching encrypted data using particular keywords without leaking private information to cloud servers [2]. To support ciphertext searches on encrypted cloud data, researchers have proposed a number of SSE schemes. Considering the huge amount of outsourced documents in the cloud, one important direction is multi-keyword search [3]–[5], which supports searching for multiple (conjunctive or disjunctive) keywords, e.g., “cloud” and “security”, at one time. Compared to single-keyword search [6], [7], multi-keyword search can effectively improve search result accuracy, since single-keyword searches usually return results that are far too coarse. Note that most existing multi-keyword search techniques are unforgiving of spelling errors, since they only support exact keyword matching [8], [9]. Considering that users may occasionally perform searches with typos in their search keywords, fuzzy search has been proposed to achieve approximate keyword matching to a certain extent. However, existing fuzzy search schemes usually rely on edit distance to quantify keyword similarity and support single-keyword searches only.

J. Chen, K. He, and L. Deng are with the State Key Laboratory of Software Engineering, Computer School, Wuhan University, China, 430072. E-mail: chenjing, kunhe, [email protected]

R. Du is with the Collaborative Innovation Center of Geospatial Technology, Wuhan University. E-mail: [email protected]

Q. Yuan is with the Computer School, University of Texas-Permian Basin, TX, USA. E-mail: [email protected]

Y. Xiang is with the School of Information Technology, Deakin University, Australia. E-mail: [email protected]

J. Wu is with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA. E-mail: [email protected]

To enhance the user search experience and protect data privacy, achieving efficient, leakage-resilient, multi-keyword fuzzy search services on encrypted cloud data has become a significant challenge, especially with the increasing number of on-demand data users and the exploding quantities of outsourced cloud data. In [10], the authors implemented a search scheme that satisfies both multi-keyword and fuzzy requirements simultaneously by treating several phrases in a predefined dictionary as one keyword for a search; this approach lacks flexibility in practice. The work in [11] tackled the privacy-preserving multi-keyword fuzzy search problem using a forward index structure. However, forward-index-based schemes need to store a list of keywords for each file and must check every file in the file set when searching [12]; this is unaffordable when the file set is large.

In this paper, we propose an Efficient Leakage-resilient Multi-keyword Fuzzy Search (EliMFS) framework for encrypted cloud data. Unlike existing multi-keyword fuzzy search schemes, we design a novel two-stage index structure, consisting of a forward index and an inverted index, to improve search efficiency. Unlike a forward index, the inverted index stores a list of file identifiers for each keyword, so a user can directly find the files for each keyword in a query [13]. By building the two-stage index and designing the corresponding search algorithm, our search complexity is determined only by the number of files associated with one keyword in the query rather than by the size of the entire file set. Furthermore, note that there is a trade-off: the more complex index structure leads to greater search efficiency, but also causes more leakage. We aim to design a framework that achieves the desired efficiency without hurting system privacy. Therefore, we analyze the possible leakages of our framework in different threat models and design two cryptographic schemes to balance the trade-offs among search time, storage cost, and privacy for different security levels. The multi-keyword fuzzy search function is implemented by a flexible combination of some widely used tools, including bigram vector representation, locality-sensitive hashing (LSH) [14], and the Bloom filter [15]. Our contributions can be summarized as follows:

1) To the best of our knowledge, this is the first multi-keyword fuzzy search scheme for encrypted cloud data whose search complexity is independent of the file set size;

2) Compared to existing multi-keyword fuzzy search solutions [11], [16], we design a novel two-stage index structure and corresponding search algorithms to improve the search efficiency on encrypted cloud data;

3) To protect data privacy, we analyze the possible leakages of our framework in different threat models and we propose two specific schemes to achieve leakage resilience;

4) A comprehensive analysis of privacy and efficiency is given, and experiments on a real-world dataset further show that our design has a low overhead in the search process.

The rest of the paper is organized as follows. Section II presents the formulation of our problem and the preliminaries. Sections III and IV provide detailed descriptions of our EliMFS framework and leakage-resilient schemes, respectively. Sections V and VI present theoretical and experimental analyses, respectively. Section VII summarizes the related work, and the conclusion appears in Section VIII.

II. PROBLEM STATEMENT

Before describing our schemes, we state the efficient multi-keyword fuzzy search problem in the outsourced storage setting and introduce some preliminaries.

A. System Model

Our system consists of three entities: a cloud server, a data owner, and a group of authorized users, as shown in Fig. 1. Initially, the data owner, who holds a file set, creates a secret key, constructs a customized secure index for this file set, and uploads the secure index along with the encrypted file set to the cloud server. When the data owner needs to search his files with keywords $w_1, w_2, \ldots, w_n$, he generates a token based on the queried keywords and his secret key. After the token is sent to the cloud server, the cloud server executes a search with the token and the secure index, and then returns the matched file identifiers to finish the search process. With the search results, the data owner can download the corresponding encrypted files from the cloud server and decrypt them with his secret key. As the gray arrows in Fig. 1 show, other authorized users attempting to search the data owner’s file set can obtain the tokens and file decryption keys shared with them, according to access control strategies.

[Figure: system diagram. Entities: Data Owner, Cloud Server, Other Users. Operations: Initialize(), GenToken(), Search(). Arrow labels: Index, Files, upload, Token, request, results.]

Fig. 1. Structure of encrypted cloud storage system

B. Threat Model & Design Goals

We assume that both the data owner and the authorized users are trusted, but that the cloud server is honest-but-curious. That is, the cloud server follows the designed protocols honestly, but it may collect, analyze, and leak the information of its customers [17]. Regarding privacy requirements, two threat models with different adversary capabilities are considered in this paper, as in other related works [3], [11].

• Known Ciphertext Model: The cloud server can only observe the encrypted file set, the encrypted index, the submitted tokens, and the search results. It has no prior knowledge of the file set or the search keywords.

• Known Background Model: In addition to the knowledge available in the Known Ciphertext Model, the cloud server can also obtain additional background information, such as statistical information about the files, the queried keywords and their corresponding tokens, and the frequency of searched keywords.

Considering the above threat models, our scheme is designed to achieve the following privacy and performance goals.

• Multi-keyword Fuzzy Search: The scheme should support multi-keyword fuzzy searches over outsourced encrypted files. For instance, a search for the misspelled query “clod security” should still return the files relevant to “cloud security.”

• Efficiency: Considering that large file sets may exist in the cloud, our scheme aims to be efficient and practical in terms of file-set-size independence. That is, the search time should be sublinear with respect to the file set size.

• Leakage-resilience: The cloud server should be able neither to infer private information nor to leverage the intermediate results of searches in the targeted threat models.

C. Security Definitions

We recall the semantic security definitions from [11] and [13], and define the notions below:

• History: A History is the information that reflects the interaction between the data owner and the cloud server. It consists of the file set $D$, the keyword set $W$, and the set of queries $Q = \{q_1, q_2, \ldots\}$ being searched.

• View: The View is what the cloud server can actually see. The View of a History consists of the encrypted file set $ED$, the encrypted index $I$, and the tokens $Token_1, \ldots, Token_{|Q|}$.

• Trace: The Trace of a History is defined as the information that can be revealed to the cloud server, such as the file set size and the search results. The Trace of our schemes is listed in detail in Section IV.

Definition 1. For two Views with the same Trace, where one View is generated from a History selected by the cloud server and the other is generated by a simulator, an EliMFS scheme is semantically secure under the known ciphertext model if no probabilistic polynomial-time (PPT) adversary is able to distinguish them with a probability non-negligibly larger than 1/2.

Definition 2. Given a collection of keyword-token pairs and two Views with the same Trace, where one View is generated from a History selected by the cloud server and the other is generated by a simulator, an EliMFS scheme is semantically secure under the known background model if no PPT adversary is able to distinguish them with a probability non-negligibly larger than 1/2.

D. Preliminaries

1) Gram Counting Order: An n-gram is a contiguous sequence of n items from a given sequence of characters [18]. For example, given the keyword “cloud”, the 1-gram sequence is {c, l, o, u, d}, and the 2-gram sequence is {cl, lo, ou, ud}. In our structure, we parse each keyword into its 2-gram sequence and record it in a bit array of size 26 × 26. Each element in the array represents one possible bigram. If a bigram exists in the 2-gram sequence, the corresponding element is set to 1; otherwise it remains 0. In this way, all keywords are transformed into bit arrays that can serve as inputs to Locality-Sensitive Hashing functions.
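The bigram encoding described above can be sketched in a few lines of Python. This is only an illustration: the function names, the lowercase-only alphabet, the row-major position `26 * first + second`, and the choice to drop non-letter characters are assumptions that the text does not specify.

```python
from string import ascii_lowercase

ALPHABET = ascii_lowercase           # 26 letters (assumption: lowercase English only)
DIM = len(ALPHABET) ** 2             # 26 x 26 = 676 possible bigrams


def bigrams(keyword: str) -> list[str]:
    """Return the contiguous 2-gram sequence of a keyword."""
    letters = [ch for ch in keyword.lower() if ch in ALPHABET]  # assumption: drop non-letters
    return ["".join(letters[i:i + 2]) for i in range(len(letters) - 1)]


def bigram_vector(keyword: str) -> list[int]:
    """Map a keyword to a 676-element 0/1 array, one position per possible bigram."""
    vec = [0] * DIM
    for gram in bigrams(keyword):
        # Hypothetical layout: row = first letter, column = second letter.
        vec[ALPHABET.index(gram[0]) * 26 + ALPHABET.index(gram[1])] = 1
    return vec


def hamming(u: list[int], v: list[int]) -> int:
    """Count the positions in which two bit arrays differ."""
    return sum(a != b for a, b in zip(u, v))
```

With this encoding, `bigrams("cloud")` yields `cl, lo, ou, ud`, and the arrays for “cloud” and the misspelling “clod” differ in only 3 of the 676 positions, which is the kind of closeness the LSH step relies on.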

2) Locality-Sensitive Hashing: Locality-Sensitive Hashing (LSH) maps similar items to the same bucket with high probability. In this paper, all keywords are processed by LSH before being attached to indexes or tokens, so that fuzzy search can be achieved. A hash function family $F$ is $(R_1, R_2, P_1, P_2)$-sensitive if, for any inputs $p, q$, the hash functions $h \in F$ satisfy the following conditions:

• if $d(p, q) \le R_1$, then $h(p) = h(q)$ with a probability of at least $P_1$;

• if $d(p, q) \ge R_2$, then $h(p) = h(q)$ with a probability of at most $P_2$;

where $d(p, q)$ denotes a distance (e.g., Euclidean distance or Hamming distance) between $p$ and $q$, $R_1$ and $R_2$ are the distance thresholds, and $P_1$ and $P_2$ are the probability thresholds. There are multiple options for implementing LSH, such as LSH based on bit sampling and LSH based on $P$-stable distributions. To recognize similar keywords, we employ a $P$-stable distribution LSH [19] that maps a $d$-dimensional vector to an integer. The hash functions of a $P$-stable hash family have the form

$$h_{\vec{a},b}(\vec{v}) = \left\lfloor \frac{\vec{a} \cdot \vec{v} + b}{c} \right\rfloor,$$

where $\vec{a}$ is a $d$-dimensional vector whose entries are chosen independently from a $P$-stable distribution, and $b$ is a random real number chosen from $[0, c]$. The $P$-stable distribution LSH first projects $\vec{v}$ onto $\vec{a}$, and then uses $c$ as the width of the intervals into which the projected values are partitioned. Different choices of $\vec{a}$ and $b$ yield different hash functions in the family.
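As a rough illustration of the hash family above, the following Python sketch evaluates $h_{\vec{a},b}(\vec{v})$ directly from its formula. The helper names are hypothetical, and the standard Gaussian distribution is used as one concrete stable distribution (it is 2-stable); the text does not say which distribution or bucket width the scheme actually uses.

```python
import math
import random


def sample_params(dim: int, c: float, rng: random.Random):
    """Draw (a, b): a has i.i.d. Gaussian entries (assumed 2-stable choice), b is uniform on [0, c]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, c)
    return a, b


def pstable_hash(v, a, b: float, c: float) -> int:
    """Evaluate h_{a,b}(v) = floor((a . v + b) / c)."""
    projection = sum(ai * vi for ai, vi in zip(a, v))   # project v onto a
    return math.floor((projection + b) / c)             # bucket of width c
```

For example, with $\vec{a} = (1, 2)$, $b = 0.5$, and $c = 2$, the vector $(1, 1)$ projects to 3 and lands in bucket $\lfloor 3.5 / 2 \rfloor = 1$. In a fuzzy search setting, the input vector would be the 676-element bigram array of a keyword.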

3) Bloom Filter: A Bloom filter [20] uses a bit array of $m$ bits to characterize a set $S$ and to test whether an element is a member of $S$. Each element in $S$ is denoted by $k$ bits randomly chosen from the $m$ array positions. The Bloom filter $Bf_S$ of the set $S$ is constructed as follows:

a. Set $Bf_S$ to be an $m$-bit array of all zeros and choose a hash family $H = \{h_i \mid h_i : a \to [1, m], a \in S, i \in [1, k]\}$.

b. For each $q \in S$, set the positions $h_i(q)$ ($i \in [1, k]$) in $Bf_S$ to 1.

To check several elements, we generate a bit array for them in the same way. We then compute the inner product of this bit array and $Bf_S$. If the result equals $nk$, where $n$ is the number of elements to be checked, these elements are taken to belong to $S$. In our scheme, every file has its own Bloom filter, which corresponds to the keywords the file contains.
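The construction and the inner-product membership check can be sketched as follows. The sketch makes several assumptions that are not in the text: salted SHA-256 stands in for the hash family $H$, positions are 0-based, and elements are plain strings rather than LSH values. It also compares the inner product with the number of set bits in the query array, which equals $nk$ whenever the $n$ queried elements hit $nk$ distinct positions.

```python
import hashlib


def positions(element: str, m: int, k: int) -> list[int]:
    """Derive k positions in [0, m) for an element (salted SHA-256 as a stand-in for h_1..h_k)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{element}".encode()).digest(), "big") % m
        for i in range(k)
    ]


def bloom_filter(elements, m: int, k: int) -> list[int]:
    """Build an m-bit Bloom filter for a collection of elements."""
    bf = [0] * m
    for element in elements:
        for pos in positions(element, m, k):
            bf[pos] = 1
    return bf


def inner_product(u, v) -> int:
    return sum(a * b for a, b in zip(u, v))


def contains_all(bf_set, bf_query) -> bool:
    """True if every bit set in the query array is also set in the set's filter."""
    # sum(bf_query) equals n*k when the queried elements do not collide with each other.
    return inner_product(bf_set, bf_query) == sum(bf_query)
```

A query array built from a subset of the stored elements always passes `contains_all`; as with any Bloom filter, an array built from non-members can occasionally pass too (a false positive).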

4) Secure kNN Computation: Secure kNN (k-nearest neighbour) computation, proposed by Wong et al. [21], is designed to compute the distance between two encrypted database records. In a secure kNN computation, all database records are extended to $m$-dimensional vectors and encrypted using a vector $S$ of $m$ bits and two $m \times m$ invertible matrices $\{M_1, M_2\}$. The algorithms are introduced with our schemes in Section IV. More details of secure kNN computation can be found in [21].
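To give some intuition for why the split vector $S$ does not disturb inner products, the sketch below applies the index-side splitting rule that appears as Step 3 of TABLE II and pairs it with a complementary query-side split. The query-side rule is an assumption taken from the usual secure kNN construction and is not stated in this section. The matrix multiplications are omitted on the further assumption that the query shares are multiplied by the inverse matrices, so that the matrices cancel in the product. Exact fractions are used so the equality can be checked without rounding error.

```python
import random
from fractions import Fraction


def split_index(vec, s, rng):
    """Index side (TABLE II, Step 3): copy the entry where S[j] = 1, split it where S[j] = 0."""
    first, second = [], []
    for x, bit in zip(vec, s):
        if bit == 1:
            first.append(Fraction(x))
            second.append(Fraction(x))
        else:
            r = Fraction(rng.randint(-50, 50), 7)      # arbitrary random offset
            first.append(Fraction(x, 2) + r)
            second.append(Fraction(x, 2) - r)
    return first, second


def split_query(vec, s, rng):
    """Query side (assumed complementary rule): split where S[j] = 1, copy where S[j] = 0."""
    first, second = [], []
    for x, bit in zip(vec, s):
        if bit == 1:
            r = Fraction(rng.randint(-50, 50), 7)
            first.append(Fraction(x, 2) + r)
            second.append(Fraction(x, 2) - r)
        else:
            first.append(Fraction(x))
            second.append(Fraction(x))
    return first, second


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))
```

Under these assumptions, the sum of the two share-wise inner products equals the inner product of the original vectors, so a Bloom filter match score can still be computed from the shares alone.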

III. ELIMFS FRAMEWORK

In this section, we present the definition and main phases of our EliMFS framework and build a two-stage index structure to improve search efficiency.

A. Definition

We first give the definitions of the two-stage index and the EliMFS scheme. The main notations used in this paper are summarized in TABLE I. An example of the two-stage index structure is shown in Fig. 2.

Definition 3 (EliMFS). An Efficient Leakage-resilient Multi-keyword Fuzzy SSE search scheme is based on the two-stage index structure and consists of three polynomial-time algorithms as follows:

• $(SK, Index) \leftarrow Initialize(\lambda, D, W)$: This algorithm takes as inputs the security parameter $\lambda$, the file set $D$, and the corresponding keyword set $W$. It generates the secret key $SK$ and builds the secure index $Index = (I_1, I_2)$. Here, the first-stage index, $I_1$, is an inverted index in which each tag corresponds to a list of file identifiers. The second-stage index, $I_2$, is a forward index in which each file identifier corresponds to the keywords the file contains;

• $Token \leftarrow GenToken(SK, \vec{w})$: Given the secret key $SK$ and the keyword list $\vec{w} = \{w_1, w_2, \ldots, w_n\}$, the data owner computes a $Token$ as a certificate to search for $\vec{w}$ in $D$;

• $\vec{id} \leftarrow Search(Index, Token)$: The cloud server takes as inputs the secure $Index$ and the $Token$ received from the requesting user and outputs the file identifier list, $\vec{id}$, that matches the search keywords in the $Token$.


TABLE I
NOTATIONS

Notation    Meaning

$D$    the file set of a data owner

$W$    the keyword set of $D$

$D(w)$    file identifiers matching the keyword $w$

$q \xleftarrow{\$} S$    sampling an element $q$ randomly from a set $S$

$x \leftarrow F$    an algorithm $F$ with output $x$

$[1, n], n \in \mathbb{N}$    an integer set $\{1, 2, \ldots, n\}$

$I_1, I_2$    the first- and second-stage index

$\lambda$    a security parameter

$SK$    the secret key of the data owner

$w, \vec{w}$    a keyword, and a keyword set

$id, \vec{id}$    a file identifier, and a file identifier set

$\vartheta$    the intermediate value of a keyword

$(Enc, Dec)$    a symmetric encryption algorithm

$F()$    a pseudo-random function

$Bf$    a Bloom filter

$Q$    a set of search queries

$m$    the length of a Bloom filter

$F = \{f_1, \ldots, f_l\}$    an LSH family with $l$ hash functions

$H = \{h_1, \ldots, h_k\}$    the $k$ hash functions of a Bloom filter

$tag$    the entry point of a keyword in $I_1$

$e$    an encrypted file identifier

$thr$    the threshold attached to a token

$ip$    the inner product of two Bloom filters

B. Main Phases

In this section, we introduce the three phases of the EliMFS framework: Initialization, Token Generation, and Search, using a search for the keywords “cloud” and “storage” as an example, as shown in Fig. 2. Two instances of the framework with different efficiency-privacy trade-offs are proposed in Section IV.

1) Initialization: In this phase, the data owner has two major tasks: generating the secret key and constructing a secure index. On the one hand, the secret key can be initialized by existing cryptographic methods, e.g., DES or AES. On the other hand, to implement an efficient multi-keyword fuzzy search, we build a two-stage secure index containing the file identifiers and keywords. In the first-stage index, each keyword is transformed into a tag, which points to the identifiers of all files that contain this keyword. The second-stage index is just the opposite: it maps each file identifier to the keywords that the file contains.

For privacy, each file identifier is encrypted by the algorithm $Enc()$ when this framework is used. For the fuzzy search function, a keyword $w_i$ is transformed into an intermediate value $\vartheta_i$ ($i \in [1, |W|]$) by computing, in sequence, the $26 \times 26$-length bit array of its 2-gram sequence and then the LSH value of that array (introduced in Section II-D). As illustrated in Fig. 2, in the first-stage index, $tag_i$ is created from $\vartheta_i$ and a pseudo-random function $F(x)$. In the second-stage index, each file identifier maps to a Bloom filter [20] derived from the intermediate values of the keywords of that file. In this two-stage index structure, the first stage efficiently executes single-keyword searches and shrinks the search range in multi-keyword searches. The second stage is used to check whether the results of the first stage indeed contain all target keywords.

[Figure: the first-stage index (tags mapped to lists of file identifiers), the second-stage index $I_2$ (file identifiers mapped to Bloom filters $Bf$ built with $h_1 \ldots h_k$), and token generation for two keywords ($tag_x$ and the Bloom filter $Bf_t$), with steps attributed to the Owner and the Server.]

Fig. 2. The framework of EliMFS schemes

2) Token Generation: This phase is also executed by the data owner. In this phase, a token is generated for a keyword search. The token consists of two parts: a tag for searching in the first-stage index and an item for searching in the second-stage index.

The bottom of Fig. 2 shows an example of token generation for the keywords “cloud” and “storage.” The token consists of a tag and a Bloom filter. To calculate these two components, the keywords “cloud” and “storage” are first transformed into two intermediate values $\vartheta_x$ and $\vartheta_y$, as in the initialization phase. Then, the data owner randomly chooses a keyword with a small $D(w)$ and calculates $tag_x$ from its intermediate value and the pseudo-random function $F()$. Assume the selected keyword is “cloud” in this example. Meanwhile, the Bloom filter is generated from both $\vartheta_x$ and $\vartheta_y$ using the hash functions $h_1, \ldots, h_k$.

3) Search: Unlike the two aforementioned phases, the Search phase is executed on the cloud server. The keyword search is first executed in the first-stage index. Having obtained a set of file identifiers from the first-stage search, the cloud server then checks only this relatively small file set (instead of the whole file set) using the second part of the token.

As shown in Fig. 2, after receiving $tag_x$, the cloud server searches the first-stage index to find the encrypted file identifier list (i.e., $E(id_1), E(id_3)$) that matches the keyword “cloud.” Then, the cloud server decrypts each encrypted file identifier in the list with the secret key received from the data owner. In the second-stage index, the Bloom filter verifies whether the file also matches the other keyword, “storage.” As mentioned in Section II-D, the inner products $Bf_1 \cdot Bf_t$ and $Bf_3 \cdot Bf_t$ are calculated as the basis for the decision, where $Bf_i$ is the Bloom filter of $id_i$ and $Bf_t$ is the Bloom filter of the token. If the inner product is approximately equal to $2k$, the corresponding file contains both keywords, and the identifier of this file is added to the search result. Finally, these identifiers are sent to the data owner so that he can retrieve the files from the cloud server.

TABLE II
INITIALIZATION STEPS IN ELIMFS-B

Step 1:

1  $K_1, K_2 \xleftarrow{\$} \{0, 1\}^{\lambda}$, $M_1, M_2 \in \mathbb{R}^{m \times m}$, $S \in \{0, 1\}^m$

2  $SK = \{K_1, K_2, M_1, M_2, S\}$

Step 2: For each $w \in W$

3  $\vartheta_w \leftarrow F(w)$, $tag \leftarrow F(K_1, \vartheta_w)$, $K_e \leftarrow F(K_2, \vartheta_w)$, $h_1(\vartheta_w), \ldots, h_k(\vartheta_w)$

4  For all $id \in D(w)$: append $e \leftarrow Enc(K_e, id)$ to $I_1[tag]$; set $I_2[id][i] = 1$, $i \in \{h_1(\vartheta_w), h_2(\vartheta_w), \ldots, h_k(\vartheta_w)\}$

Step 3: For each $I_2[id] = I$

5  Initialize $I'$, $I''$ to $m$-bit length vectors

6  $I'[j] = I''[j] = I[j]$ if $S[j] = 1$; $I'[j] = \frac{1}{2} I[j] + r$, $I''[j] = \frac{1}{2} I[j] - r$ if $S[j] = 0$, $j \in [1, m]$, $r$ is a random number

7  $I_2[id] = \{M_1^T \cdot I', M_2^T \cdot I''\}$
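The three phases of Section III-B can be sketched end to end in Python, using the “cloud” and “storage” example. This is a simplified illustration and not the EliMFS-B construction: it assumes exact keywords in place of LSH intermediate values, HMAC-SHA256 as a stand-in for the pseudo-random function $F$, unencrypted file identifiers and Bloom filters, an exact rather than approximate inner-product test, and illustrative values for $m$ and $k$. All function names are hypothetical.

```python
import hashlib
import hmac

M, K = 512, 3   # Bloom filter length and hash count (illustrative values)


def prf(key: bytes, value: str) -> str:
    """HMAC-SHA256 as a stand-in for the pseudo-random function F."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()


def bloom_positions(value: str) -> list[int]:
    return [int(hashlib.sha256(f"{i}:{value}".encode()).hexdigest(), 16) % M for i in range(K)]


def bloom(values) -> list[int]:
    bf = [0] * M
    for value in values:
        for pos in bloom_positions(value):
            bf[pos] = 1
    return bf


def initialize(key: bytes, files: dict[str, set[str]]):
    """Build I1 (tag -> file ids) and I2 (file id -> Bloom filter of the file's keywords)."""
    i1: dict[str, list[str]] = {}
    i2: dict[str, list[int]] = {}
    for file_id, keywords in files.items():
        for w in keywords:
            i1.setdefault(prf(key, w), []).append(file_id)
        i2[file_id] = bloom(keywords)
    return i1, i2


def gen_token(key: bytes, query: list[str], first: str):
    """Token = tag of one chosen query keyword plus a Bloom filter of all query keywords."""
    return prf(key, first), bloom(query)


def search(i1, i2, token) -> list[str]:
    tag, bf_t = token
    hits = []
    for file_id in i1.get(tag, []):                          # first stage: candidates only
        ip = sum(a * b for a, b in zip(i2[file_id], bf_t))   # second stage: inner product
        if ip == sum(bf_t):                                  # every query bit is set in the file's filter
            hits.append(file_id)
    return hits
```

With files `id1 = {cloud, storage}`, `id2 = {network}`, and `id3 = {cloud, security}`, a token for the query (“cloud”, “storage”) with “cloud” as the first-stage keyword makes the server examine only `id1` and `id3`; `id1` is always returned, and `id3` would appear only in the event of a Bloom filter false positive. The second-stage cost therefore depends on the number of files that contain the chosen keyword rather than on the size of the whole file set.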

IV. EFFICIENT LEAKAGE-RESILIENT SCHEMES

To resist leakages in different threat models, we instantiate two specific schemes for the EliMFS framework.

A. EliMFS-B: Basic Scheme

1) Basic Leakages: Based on the analysis of various privacy threats in cloud search, we list the basic leakages that need to be resisted:

• Plaintext Leakage: The content of the outsourced files and the searched keywords may be obtained by the cloud server if the schemes are not secure.

• Search Pattern Leakage: The search pattern indicates whether two tokens are generated for the same query. Since most existing schemes use deterministic algorithms, search pattern leakage is usually not well protected. In this paper, we offer semi-protection against search pattern leakage: the cloud server can only estimate whether two queries share the keyword that is used in the first-stage search. For instance, given the tokens of queries $Q = (w_1, w_2, \ldots)$ and $Q' = (w'_1, w'_2, \ldots)$, the cloud server is able to judge whether $w_1 = w'_1$, but cannot know whether $Q = Q'$.

• Subset Leakage: This is a special leakage introduced by the multi-keyword search, since the cloud server may obtain the intermediate search results of the multiple queried keywords. For example, if a multi-keyword search is implemented by searching each keyword in query $\vec{w}$ separately, then the cloud server can obtain the search result of any subset of $\vec{w}$. Given the token of a query $Q = (w_1, w_2, \ldots, w_n)$, the Subset Leakage is the result of a tampered query $Q' = (w_1, \vec{w})$ in our two-stage index. The notation $\vec{w}$ refers to any nonempty subset of $w_2 \ldots w_n$.

2) Scheme Construction: We then detail our solution to the basic leakages: EliMFS-B. For simplicity, we use $w_i$ to represent the 2-gram sequence of keyword $w_i$.

TABLE III
GENTOKEN STEPS IN ELIMFS-B

Step 1:
  1  $\vartheta_{w_1} \leftarrow \mathcal{F}(w_1)$, $tag \leftarrow F(K_1, \vartheta_{w_1})$, $K_e \leftarrow F(K_2, \vartheta_{w_1})$
Step 2:
  2  Initialize $Bf_q$ to be an $m$-bit length array of all zeros
  3  For all $w_i \in \vec{w}$: get $\vartheta_{w_i} \leftarrow \mathcal{F}(w_i)$, $h_1(\vartheta_{w_i}), \ldots, h_k(\vartheta_{w_i})$, and set $Bf_q[j] = 1$, $j \in \{h_1(\vartheta_{w_i}), \ldots, h_k(\vartheta_{w_i})\}$
Step 3:
  4  Initialize $Bf'_q$, $Bf''_q$ to $m$-bit length vectors
  5  $Bf'_q[j] = Bf''_q[j] = Bf_q[j]$ if $S[j] = 0$; $Bf'_q[j] = \frac{1}{2} Bf_q[j] + r'$, $Bf''_q[j] = \frac{1}{2} Bf_q[j] - r'$ if $S[j] = 1$, $j \in [1, m]$, $r'$ is a random number
  6  Choose $thr \le k|\vec{w}|$, $Token = \{tag, M_1^{-1} \cdot Bf'_q, M_2^{-1} \cdot Bf''_q, K_e, thr\}$

The steps of algorithm Initialize($\lambda, D, W$) are shown in TABLE II. Given a security parameter $\lambda$ and a file set $D$ with a keyword set $W$, the data owner generates his secret key, $SK$, and the secure index, $Index = \{I_1, I_2\}$. The algorithm consists of three steps. In Step 1, the data owner generates his secret key, which contains two random sequences ($K_1, K_2$), two invertible matrices ($M_1, M_2$), and a vector ($S$). Step 2 builds the index for the file set $D$. To generate an inverted index $I_1$ and a forward index $I_2$ simultaneously, the data owner initializes $I_1$ and $I_2$ to empty arrays. Then, for each keyword in $W$, the data owner computes the intermediate value, the entry in $I_1$, the secret key for encrypting file identifiers, and the $k$ hash values used in the Bloom filter. Here, $\mathcal{F} = \{f_i \mid f_i : \{0,1\}^{26 \times 26} \to n, n \in \mathbb{Z}, i \in [1, l]\}$, $F : \{0,1\}^{\lambda} \times \{0,1\}^{\lambda} \to \{0,1\}^{\lambda}$, and $H = \{h \mid h : \{0,1\}^{*} \to [1, m]\}$. The data owner encrypts all the file identifiers that match $w$ with $K_e$, attaches them to the set $I_1[tag]$, and then sets all the positions $h_1(\vartheta_w), h_2(\vartheta_w), \ldots, h_k(\vartheta_w)$ in $I_2[id]$ to 1. In the last step, the data owner encrypts the Bloom filters in the second index ($I_2$) using the secure kNN encryption [21] with ($M_1, M_2, S$). The Bloom filter $I$ is split into two vectors $\{I', I''\}$ according to $S$, and then $I_2[id]$ is set to be $\{M_1^{T} \cdot I', M_2^{T} \cdot I''\}$. Finally, the data owner gets $Index = \{I_1, I_2\}$ and sends the index to the cloud server with the corresponding encrypted files.
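Step 2 of Initialize(), which fills the inverted index $I_1$ and the forward index $I_2$ in one pass, might look roughly like the sketch below. Several things are assumptions made only for illustration: HMAC-SHA256 stands in for the PRF $F$, `theta()` is a placeholder for the LSH-based intermediate value, the toy sizes `M = 64` and `K = 3` are arbitrary, and both the identifier encryption $Enc(K_e, id)$ and the kNN encryption of Step 3 are left out.

```python
import hashlib
import hmac

M, K = 64, 3  # toy Bloom filter length m and hash count k (hypothetical values)

def prf(key, data):
    """Assumed stand-in for F(K, .): HMAC-SHA256."""
    return hmac.new(key, data, hashlib.sha256).digest()

def theta(keyword):
    """Placeholder for the intermediate value of a keyword (no LSH here)."""
    return keyword.encode()

def bloom_positions(theta_w):
    """k positions in [0, M); salted SHA-256 is an assumed stand-in for h_1..h_k."""
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + theta_w).digest(), "big") % M
        for i in range(K)
    ]

def initialize(k1, postings):
    """postings maps each keyword w to its identifier list D(w)."""
    i1, i2 = {}, {}
    for w, ids in postings.items():
        t = theta(w)
        tag = prf(k1, t)  # entry of w in the inverted index
        for fid in ids:
            # The scheme appends e = Enc(Ke, id); the plain id is kept in this sketch.
            i1.setdefault(tag, []).append(fid)
            bits = i2.setdefault(fid, [0] * M)
            for p in bloom_positions(t):
                bits[p] = 1
    return i1, i2

i1, i2 = initialize(b"demo-key-1", {"cloud": ["id1", "id3"], "storage": ["id1"]})
```

Because each keyword is visited once and touches only its own posting list, the pass does work proportional to the number of file-keyword pairs.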

To search with keywords $\vec{w} = \{w_1, w_2, \ldots, w_n\}$, the data owner needs to generate a token as a certificate by following the steps in TABLE III. Assuming $w_1$ is the chosen keyword for the first-stage search, the data owner first computes the intermediate value, the entry in $I_1$, and the secret key for decrypting file identifiers. Note that if $n = 1$, the next two steps are skipped and $tag$ is sent to the cloud server directly. In Step 2, a Bloom filter ($Bf_q$) is built from all the keywords in $\vec{w}$. Each keyword is processed with $\mathcal{F}()$ and inserted into $Bf_q$ with $k$ hash functions. In the third step, the Bloom filter $Bf_q$ is encrypted with $M_1, M_2, S$ in a manner similar to that in Initialize(). Finally, the data owner chooses a threshold ($thr$), whose value can be selected freely, subject to $thr \le k|\vec{w}|$, based on the preferred search accuracy. $thr$ is compared with the inner products during the second search stage to judge whether a file ID should be returned. $Token = \{tag, M_1^{-1} \cdot Bf'_q, M_2^{-1} \cdot Bf''_q, K_e, thr\}$ is the final output and is sent to the cloud server.
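The splitting rule in line 5 of TABLE III can be sketched on its own, before any matrix is applied. The range of the random number $r'$ is not given in the text, so drawing it uniformly from $[-1, 1]$ is an assumption, as are the function name and the example vectors.

```python
import random

def split_query(bf_q, s, rng):
    """Split a query Bloom filter into two shares according to S.

    Positions with S[j] = 0 are copied to both shares; positions with
    S[j] = 1 are split into two random shares that sum to Bf_q[j].
    """
    bf1, bf2 = [], []
    for bit, s_j in zip(bf_q, s):
        if s_j == 0:
            bf1.append(float(bit))
            bf2.append(float(bit))
        else:
            r = rng.uniform(-1.0, 1.0)  # assumed range for r'
            bf1.append(bit / 2 + r)
            bf2.append(bit / 2 - r)
    return bf1, bf2

rng = random.Random(42)            # seeded only to make the sketch repeatable
bf_q = [1, 0, 1, 1, 0, 0, 1, 0]    # hypothetical query Bloom filter
s = [0, 1, 1, 0, 1, 0, 1, 1]       # hypothetical split vector S

share_a1, share_a2 = split_query(bf_q, s, rng)
share_b1, share_b2 = split_query(bf_q, s, rng)  # same query, fresh randomness
```

Running the split twice on the same query gives different shares, which is the reason two tokens for one query do not look alike to the server.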

TABLE IV
SEARCH STEPS IN ELIMFS-B

Step 1:
  1  Retrieve $I_1[tag]$
Step 2: For each $e \in I_1[tag]$
  2  $id \leftarrow Dec(K_e, e)$, and get $I_2[id] = \{M_1^{T} \cdot I', M_2^{T} \cdot I''\}$
  3  $M_1^{T} I' \cdot M_1^{-1} Bf'_q + M_2^{T} I'' \cdot M_2^{-1} Bf''_q = I'^{T} \cdot Bf'_q + I''^{T} \cdot Bf''_q = I^{T} \cdot Bf_q$
  4  If $I^{T} \cdot Bf_q \ge thr$, attach $id$ to $\vec{id}$

Upon receiving the token, the cloud server executes Search(), as shown in TABLE IV. First, the cloud server reads all the encrypted identifiers in $I_1[tag]$. If $Token = \{tag\}$, the cloud server returns these identifiers and finishes the algorithm. If $Token = \{tag, M_1^{-1} \cdot Bf'_q, M_2^{-1} \cdot Bf''_q, K_e, thr\}$, the cloud server checks the Bloom filters of $I_1[tag]$ one by one in Step 2. The encrypted identifiers are decrypted by $K_e$, and the corresponding Bloom filters $I_2[id] = \{M_1^{T} \cdot I', M_2^{T} \cdot I''\}$ are retrieved. Here, the secret key $K_e$ is known by the cloud server, but it reveals nothing except the search results of $w_1$. Since the search results of queried keywords are not included in the protected data, we consider this leakage to be acceptable. With $I_2[id]$ and $\{M_1^{-1} \cdot Bf'_q, M_2^{-1} \cdot Bf''_q\}$, the inner product $I^{T} \cdot Bf_q$ can be easily obtained as shown in line 3 of TABLE IV. Then, the server determines the feedback to the data owner according to the comparison results between the inner products and $thr$. If $I^{T} \cdot Bf_q \ge thr$, the cloud server attaches $id$ to the result. The results ($\vec{id}$) are returned to the data owner, and with these identifiers, the data owner can retrieve files from the cloud server as he wishes.
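The identity in line 3 of TABLE IV can be checked numerically. The sketch below uses exact fractions and tiny $6 \times 6$ integer matrices so that the result is exact; the matrix entries, the range of the random numbers, the example vectors, and all helper names are assumptions for illustration only.

```python
from fractions import Fraction
import random

def transpose(a):
    return [list(col) for col in zip(*a)]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def inverse(a):
    """Gauss-Jordan inverse over exact fractions; raises ValueError if singular."""
    n = len(a)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def random_invertible(n, rng):
    """Draw random integer matrices until one is invertible."""
    while True:
        a = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return a, inverse(a)
        except ValueError:
            continue

def split(vec, s, split_when, rng):
    """Split positions where S[j] == split_when into two random shares; copy the rest."""
    a, b = [], []
    for x, s_j in zip(vec, s):
        if s_j == split_when:
            r = Fraction(rng.randint(-9, 9), 7)
            a.append(Fraction(x, 2) + r)
            b.append(Fraction(x, 2) - r)
        else:
            a.append(Fraction(x))
            b.append(Fraction(x))
    return a, b

rng = random.Random(1)
m = 6
I = [1, 0, 1, 1, 0, 0]    # hypothetical file Bloom filter
Bf = [1, 0, 0, 1, 0, 1]   # hypothetical query Bloom filter
S = [rng.randint(0, 1) for _ in range(m)]
M1, M1_inv = random_invertible(m, rng)
M2, M2_inv = random_invertible(m, rng)

I_a, I_b = split(I, S, 0, rng)     # index side: split where S[j] = 0
B_a, B_b = split(Bf, S, 1, rng)    # token side: split where S[j] = 1

enc_index = (matvec(transpose(M1), I_a), matvec(transpose(M2), I_b))
enc_token = (matvec(M1_inv, B_a), matvec(M2_inv, B_b))

# What the server computes from ciphertexts only:
score = dot(enc_index[0], enc_token[0]) + dot(enc_index[1], enc_token[1])
```

The matrices cancel because $(M^{T} x)^{T} (M^{-1} y) = x^{T} y$, and the two splitting rules are complementary, so every position contributes exactly $I[j] \cdot Bf_q[j]$.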

3) Analysis: From the perspective of practicality, both efficiency and security are important considerations.

Regarding efficiency, the search complexity of EliMFS-B is $O(|D(w_1)|)$, the communication cost is a constant $O(1)$, and the storage cost on the cloud server is $O(|W| + |D|)$. A detailed analysis can be found in Section V. Overall, EliMFS-B performs very efficiently, although it can only protect against basic leakages in the known ciphertext model. It is suitable for applications that have a high efficiency requirement and a general privacy requirement: EliMFS-B achieves semantic security with Trace1, as shown below.

For History $H_i = (D, W, Q = \{q_1, q_2, \ldots, q_n\})$, Trace1 is

• Size Leakage: The file set size $|D|$ and the keyword set size $|W|$.

• Result Leakage: This leakage includes all the search results $(R_1 \ldots R_n)$, $R_i = \{(id, ip), id \in D(q_i)\}$, and the results of the first search stage $D(w_{11}) \ldots D(w_{n1})$, where $id$ is the identifier that matches query $q_i$, $ip$ is the inner product $I_2[id] \cdot Bf_{q_i}$, and $w_{i1}$ is the selected keyword in query $q_i$ for the first-stage search.

• Equality Leakage: This leakage indicates whether two queries have the same $w_{i1}$. We number the keywords from 1 to $|W|$, and set Equality Leakage to be a vector ($EL$) of length $|Q|$, where $EL[i]$ is the sequence number of $w_{i1} \in q_i$.

Theorem 1. EliMFS-B is semantically secure with Trace1 under the known ciphertext model if $F()$ is a secure pseudo-random function and $(Enc, Dec)$ is a secure symmetric encryption algorithm.

Proof: To prove this theorem, we have to construct a simulator whose output is indistinguishable from a real instance of our scheme. Recall that the security game between an attacker $\mathcal{A}$ and a challenger $\mathcal{C}$ under the known ciphertext model is defined as follows.

1) $\mathcal{A}$ chooses a History $H_i = (D, W, Q = \{q_1, \ldots, q_n\})$ whose Trace is $Tr$. Then, $\mathcal{A}$ sends $H_i$ and $Tr$ to $\mathcal{C}$.

2) $\mathcal{C}$ randomly chooses a bit $b \xleftarrow{\$} \{0, 1\}$. Then, it calculates $V^i_b = (ED_b, Index_b, Token_{b1}, \ldots, Token_{bn})$ to get the View of $H_i$ and invokes a simulator $\mathcal{S}$ with $Tr$ to obtain a View $V^i_{1-b} = (ED_{1-b}, Index_{1-b}, Token_{(1-b)1}, \ldots, Token_{(1-b)n})$. Both $V^i_b$ and $V^i_{1-b}$ are sent back to $\mathcal{A}$.

3) $\mathcal{A}$ selects $b' \in \{0, 1\}$ as the output of the game.

As described in Definition 1, if there exists a simulator that makes $\Pr[b' = b] = \frac{1}{2} + \varepsilon$ where $\varepsilon$ is negligible, we can prove that Theorem 1 holds.

The simulator is designed as follows. First, $\mathcal{S}$ randomly selects $f'_i \xleftarrow{\$} \{0, 1\}^{|f_i|}$, $i \in [1, |D|]$, and sets $ED' = \{f'_i, i \in [1, |D|]\}$. Since the file set is encrypted by a secure symmetric encryption algorithm, $ED_b$ and $ED'$ are indistinguishable.

After simulating the encrypted file set, the simulator generates the index and the tokens. Since the Bloom filter is encrypted by the kNN encryption, which has been proven to be secure under the known ciphertext model [21], $I_2$ and $I'_2$ are indistinguishable. We then describe how $\mathcal{S}$ simulates $I'_1$:

1) Initialize $I'_1$ to an empty array.

2) For each $w_i \in W$, $i \in [1, |W|]$, select $tag'_i, K'_i \xleftarrow{\$} \{0, 1\}^{\lambda}$ and initialize $I'_1[tag'_i]$ to be an empty array.

3) For each $id_j \in D(w_{i1})$, $i \in [1, n]$, $j \in [1, |D|]$, generate $e'_{ij} \leftarrow Enc(K'_{EL[i]}, id_j)$ and append $e'_{ij}$ to $I'_1[tag'_{EL[i]}]$.

4) Pad $I'_1$ to the required size with random data.

The tokens are constructed as follows. For each $q_i \in Q$, $i \in [1, n]$, get $tag'_{EL[i]}$ and set $Token'_i = (tag'_{EL[i]}, Enc(Bf'_i), K'_{EL[i]}, thr'_i)$. Here, $Enc(Bf'_i)$ is generated by the secure kNN encryption, and $thr'_i = Min(ip)_i$, $ip \in R_i$.

Since the keywords are encrypted by a secure pseudo-random function $F(\cdot)$, file identifiers are encrypted by a secure symmetric encryption algorithm $(Enc, Dec)$, and the Bloom filter is encrypted by the kNN encryption (proven to be secure under the known ciphertext model), the index and tokens of $V^i_b$ and $V^i_{1-b}$ are indistinguishable.

Consequently, Plaintext Leakage is prevented. Due to the random number $r'$, different tokens will be generated from the same query. The cloud server can only know whether two tokens have the same first-stage searched keyword, not whether they are generated from the same query, so the Search Pattern Leakage is semi-protected. Further, all the queried keywords are searched together in the second search stage using an $m$-length array. As a result, it is impossible for the cloud server to know the search result of any subset of $\vec{w}$ except $w_1$, and Subset Leakage is prevented.

B. EliMFS-E: Extended Scheme

1) Advanced threats: Aside from basic leakages, the two-stage structure incurs an additional threat: Cross Leakage.

Given two tokens of different queries $Q_1 = (w_1, w_2, \ldots, w_n)$ and $Q_2 = (w'_1, w'_2, \ldots, w'_n)$, where $w_1$ and $w'_1$ are different, the server is able to search the queries $Q_3 = (w_1, w'_2, \ldots, w'_n)$ and $Q_4 = (w'_1, w_2, \ldots, w_n)$.

The following example illustrates how cross leakage occurs. Suppose the server receives two tokens, $Token_1 = (tag_1, Bf_1, K_{e1}, thr_1)$ for query $\vec{w}_1 = (w_1, w_2)$ and $Token_2 = (tag_2, Bf_2, K_{e2}, thr_2)$ for query $\vec{w}_2 = (w_3, w_4)$. To get the search result of $\vec{w}_3 = (w_3, w_2)$, we focus on the files in $D(w_3)$, namely $I_1[tag_2]$. We divide the files that match $w_3$ into four categories: a) files that match $w_3$, $w_1$, and $w_2$; b) files that match $w_3$ and $w_1$ but not $w_2$; c) files that match $w_3$ and $w_2$ but not $w_1$; d) files that match $w_3$ only. The judgement conditions are as follows:

$$
f \in D(w_3):\quad
\begin{cases}
f \in I_1[tag_1]: &
\begin{cases}
I_2[id_f] \cdot Bf_1 \approx k|\vec{w}_1| & \text{(a)} \\
I_2[id_f] \cdot Bf_1 \ne k|\vec{w}_1| & \text{(b)}
\end{cases} \\[2ex]
f \notin I_1[tag_1]: &
\begin{cases}
I_2[id_f] \cdot Bf_1 \approx k(|\vec{w}_1| - 1) & \text{(c)} \\
I_2[id_f] \cdot Bf_1 \ne k(|\vec{w}_1| - 1) & \text{(d)}
\end{cases}
\end{cases}
$$

The search result of query $(w_3, w_2)$ consists of all the ids in categories a) and c). The result of query $\vec{w}_4 = (w_1, w_4)$ can be obtained similarly.
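The four judgement conditions above amount to a small decision procedure that the server could run on values it already observes. In the sketch below, the tolerance `tol` used to interpret "approximately equal", the function name, and the observed inner products are all hypothetical.

```python
def classify(in_tag1, ip, k, n_keywords, tol=1):
    """Category (a-d) of a file in D(w3).

    in_tag1    -- whether the file also appears in I1[tag1], i.e. matches w1
    ip         -- inner product of the file's Bloom filter with Bf1
    k          -- number of hash functions
    n_keywords -- |w1 vector|, the number of keywords behind Token1
    tol        -- assumed slack for "approximately equal"
    """
    if in_tag1:
        return "a" if abs(ip - k * n_keywords) <= tol else "b"
    return "c" if abs(ip - k * (n_keywords - 1)) <= tol else "d"

# Hypothetical observations for four files in D(w3), with k = 20 and |w1 vector| = 2
observations = {
    "f1": (True, 40),    # matches w3, w1, w2
    "f2": (True, 21),    # matches w3, w1
    "f3": (False, 20),   # matches w3, w2
    "f4": (False, 2),    # matches w3 only
}

categories = {f: classify(in1, ip, 20, 2) for f, (in1, ip) in observations.items()}

# Files in categories a) and c) form the result of the never-issued query (w3, w2).
leaked = sorted(f for f, c in categories.items() if c in ("a", "c"))
```

Nothing in this procedure needs a key, which is what makes the leakage a concern for the basic scheme.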

Although the cloud server is only given a threshold $thr_1$ instead of $|\vec{w}_1|$, it can still obtain the approximate value of $|\vec{w}_1|$ according to the maximum value of the inner products when searching in the second-stage index. Therefore, an extended scheme that can prevent Cross Leakage is necessary.

In the known background model, the kNN encryption may no longer be secure, since the cloud server can obtain more historical information. For instance, if the cloud server collects a large number of keywords and their corresponding tokens, it may utilize linear analyses to break the kNN encryption and link plaintext Bloom filters to encrypted ones [22]. TABLE IV shows the linear equation $I^{T} \cdot Bf_q = M_1^{T} I' \cdot M_1^{-1} Bf'_q + M_2^{T} I'' \cdot M_2^{-1} Bf''_q$, in which only $I$, an $m$-bit length vector, is unknown to the cloud server under the known background model. With enough query-token pairs, the cloud server can construct $m$ similar linear equations to derive the coordinates of $I$.

2) Scheme Construction: The main reason for Cross Leakage is that the encrypted Bloom filter of a file is revealed as soon as any of its keywords is searched in the first-stage index. Therefore, we propose EliMFS-E, which solves the problem by binding the Bloom filters in $I_2$ to both files and keywords. The main process of EliMFS-E is similar to that of EliMFS-B, with several differences.

To prevent linear analyses by the cloud server under the known background model, we replace the hash functions $h_1 \ldots h_k$ with keyed hash functions when generating Bloom filters. When generating his secret key $SK$, the data owner chooses $k$ hash keys ($key_1, \ldots, key_k$) and embeds them into $SK$. To calculate the locations of the $k$ hash values for a keyword $w$ in the Bloom filter, the functions $h_i(\vartheta_w, key_i)$, $i \in [1, k]$, are invoked. The detailed Initialize() algorithm is similar to that of EliMFS-B and is shown in TABLE V.

TABLE V
INITIALIZATION STEPS IN ELIMFS-E

Step 1:
  1  $K_1, K_2, key_1, \ldots, key_k \xleftarrow{\$} \{0,1\}^{\lambda}$, $M_1, M_2 \in \mathbb{R}^{m \times m}$, $S \in \{0,1\}^{m}$
  2  $SK = \{K_1, K_2, key_1, \ldots, key_k, M_1, M_2, S\}$
Step 2: Initialize $I_{temp}$ and for each $w \in W$:
  3  $\vartheta_w \leftarrow \mathcal{F}(w)$, $h_1(\vartheta_w, key_1), h_2(\vartheta_w, key_2), \ldots, h_k(\vartheta_w, key_k)$
  4  For all $id \in D(w)$, set $I_{temp}[id][i] = 1$, $i \in \{h_1(\vartheta_w, key_1), h_2(\vartheta_w, key_2), \ldots, h_k(\vartheta_w, key_k)\}$
Step 3: For each $w \in W$
  5  $tag \leftarrow F(K_1, \vartheta_w)$, $K_e \leftarrow F(K_2, \vartheta_w)$, $S_w = S$, and set $S_w[h_i(\vartheta_w, key_i)] = 1 - S[h_i(\vartheta_w, key_i)]$, $i \in [1, k]$
  6  For each $id \in D(w)$:
     • Append $e \leftarrow Enc(K_e, id)$ to $I_1[tag]$
     • Get $I = I_{temp}[id]$ and initialize $I'$, $I''$ to $m$-bit length vectors
     • $I'[j] = I''[j] = I[j]$ if $S_w[j] = 1$; $I'[j] = \frac{1}{2} I[j] + r$, $I''[j] = \frac{1}{2} I[j] - r$ if $S_w[j] = 0$, $j \in [1, m]$, $r$ is a random number
     • $I_2[e] = \{M_1^{T} \cdot I', M_2^{T} \cdot I''\}$

First, the data owner picks his secret key, which includes $k$ additional keys ($key_1, key_2, \ldots, key_k$). In Step 2, the Bloom filters of all the files are built and stored in a temporary set ($I_{temp}$). Each keyword is transformed into an intermediate value and is inserted into $I_{temp}$ by $h_1(\vartheta_w, key_1), h_2(\vartheta_w, key_2), \ldots, h_k(\vartheta_w, key_k)$, where the functions $h_1$ to $h_k$ are chosen from a hash family $H = \{h \mid h : \{0,1\}^{*} \times \{0,1\}^{\lambda} \to [1, m]\}$. The indexes $I_1, I_2$ are constructed in Step 3. In line 5, the entry in $I_1$, the secret key for encrypting file identifiers, and a specific vector ($S_w$) of each keyword are calculated. Then, for each file that matches $w$, the data owner inserts the encrypted identifier $e$ into $I_1$ and sets $I_2[e]$ as the Bloom filter encrypted by $\{M_1, M_2, S_w\}$. Thus, the data owner obtains the whole secure index, $Index = \{I_1, I_2\}$, and sends it to the cloud server with the corresponding encrypted files.

TABLE VI
GENTOKEN STEPS IN ELIMFS-E

Step 1:
  1  $\vartheta_{w_1} \leftarrow \mathcal{F}(w_1)$, $tag \leftarrow F(K_1, \vartheta_{w_1})$, $K_e \leftarrow F(K_2, \vartheta_{w_1})$, $S_{w_1} = S$, and set $S_{w_1}[h_i(\vartheta_{w_1}, key_i)] = 1 - S[h_i(\vartheta_{w_1}, key_i)]$, $i \in [1, k]$
Step 2:
  2  Initialize $Bf_q$ to be an $m$-bit length array of all zeros
  3  For all $w_i \in \vec{w}$: get $\vartheta_{w_i} \leftarrow \mathcal{F}(w_i)$, $h_1(\vartheta_{w_i}, key_1), \ldots, h_k(\vartheta_{w_i}, key_k)$, and set $Bf_q[j] = 1$, $j \in \{h_1(\vartheta_{w_i}, key_1), \ldots, h_k(\vartheta_{w_i}, key_k)\}$
Step 3:
  4  Initialize $Bf'_q$, $Bf''_q$ to $m$-bit length vectors
  5  $Bf'_q[j] = Bf''_q[j] = Bf_q[j]$ if $S_{w_1}[j] = 0$; $Bf'_q[j] = \frac{1}{2} Bf_q[j] + r'$, $Bf''_q[j] = \frac{1}{2} Bf_q[j] - r'$ if $S_{w_1}[j] = 1$, $j \in [1, m]$, $r'$ is a random number
  6  Choose $thr \le k|\vec{w}|$, $Token = \{tag, M_1^{-1} \cdot Bf'_q, M_2^{-1} \cdot Bf''_q, thr\}$

TABLE VII
SEARCH STEPS IN ELIMFS-E

Step 1:
  1  Retrieve $I_1[tag]$
Step 2: For each $e \in I_1[tag]$
  2  Get $I_2[e] = \{M_1^{T} \cdot I', M_2^{T} \cdot I''\}$
  3  $M_1^{T} I' \cdot M_1^{-1} Bf'_q + M_2^{T} I'' \cdot M_2^{-1} Bf''_q = I'^{T} \cdot Bf'_q + I''^{T} \cdot Bf''_q = I^{T} \cdot Bf_q$
  4  If $I^{T} \cdot Bf_q \ge thr$, attach $id$ to $\vec{id}$

The improvement of the GenToken() algorithm is similar, as shown in TABLE VI. The hash functions $h_i(\vartheta_w)$ are replaced with keyed hash functions $h_i(\vartheta_w, key_i)$, and the Bloom filter of the query $(w_1, w_2, \ldots)$ is encrypted by $(M_1, M_2, S_{w_1})$, where $S_{w_1}$ is calculated from the intermediate value of $w_1$ (namely, $\vartheta_{w_1}$). Because the secret key $K_e$ is no longer sent to the cloud server, the decryption operation in the Search() algorithm is removed (TABLE VII).
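The per-keyword split vector $S_w$ is obtained by copying $S$ and flipping the bits at the keyed-hash positions of the keyword. A minimal sketch of that derivation follows. HMAC-SHA1 matches the keyed hash named in the experimental setup, but the reduction modulo $m$, the 0-based positions, the toy vector length, the keys, and the function names are assumptions.

```python
import hashlib
import hmac

def keyed_positions(theta_w, keys, m):
    """Positions h_i(theta_w, key_i) for i = 1..k, reduced to [0, m)."""
    return [
        int.from_bytes(hmac.new(key, theta_w, hashlib.sha1).digest(), "big") % m
        for key in keys
    ]

def derive_s_w(s, theta_w, keys):
    """Copy S and flip the bit at every keyed-hash position of the keyword."""
    s_w = list(s)
    for p in keyed_positions(theta_w, keys, len(s)):
        s_w[p] = 1 - s[p]   # relative to S, so a repeated position stays flipped
    return s_w

keys = [b"key-1", b"key-2", b"key-3"]   # hypothetical hash keys, k = 3
S = [0] * 32                            # hypothetical split vector, m = 32
S_cloud = derive_s_w(S, b"cloud", keys)
flipped = set(keyed_positions(b"cloud", keys, len(S)))
```

Since the data owner derives the same $S_{w_1}$ when building the token, index and token stay compatible for the keyword actually searched in the first stage.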


3) Analysis: Regarding efficiency, EliMFS-E has a higher storage cost ($O(\sum_{w \in W} |D(w)|)$) than EliMFS-B. The search complexity and communication cost are the same in EliMFS-E as in EliMFS-B: $O(|D(w_1)|)$ and $O(1)$, respectively. EliMFS-E also provides stronger leakage resilience, which means a higher semantic security guarantee with Trace2.

For History $H_i = (D, W, Q = \{q_1, q_2, \ldots, q_n\})$, Trace2 is

• Size Leakage: The file set size $|D|$, the keyword set size $|W|$, the quantity of file-keyword pairs $\sum_{w \in W} |D(w)|$, and the result size of the first search stage $|D(w_{i1})|$, where $w_{i1}$ is the selected keyword in query $q_i$ for the first-stage search.

• Result Leakage: $(R_1 \ldots R_n)$, $R_i = \{(id, ip), e \in D(q_i)\}$, where $id$ is the identifier that matches query $q_i$ and $ip$ is the inner product $I_2[id] \cdot Bf_{q_i}$.

• Equality Leakage: The sequence numbers of the selected keywords of the queries, $(EL[1], \ldots, EL[n])$.

Theorem 2. EliMFS-E is semantically secure with Trace2 under the known background model if $F()$ is a secure pseudo-random function and $(Enc, Dec)$ is a secure symmetric encryption algorithm.

Proof: As shown in the proof of Theorem 1, we have to construct a simulator that produces indistinguishable traces. The difference between the two theorems is that EliMFS-E is designed for the known background model. In the known background model, we allow the cloud server to collect a certain number of keyword-token pairs that are different from the queries used in the game. With these pairs and two Views following the same Trace, the cloud server should not be able to distinguish which one is generated by the simulator. Formally, the security game between an attacker $\mathcal{A}$ and a challenger $\mathcal{C}$ under the known background model is defined as follows.

1) $\mathcal{A}$ chooses $q$ and sends it to $\mathcal{C}$. Then, $\mathcal{C}$ responds with the corresponding tokens. This step can be carried out by $\mathcal{A}$ adaptively a polynomial number of times, and the set of all queries is denoted by $Q'$.

2) $\mathcal{A}$ chooses a History, $H_i = (D, W, Q = \{q_1, \ldots, q_n\})$, whose Trace is $Tr$, and $q_j \notin Q'$ for $1 \le j \le n$. Then $\mathcal{A}$ sends $H_i$ and $Tr$ to $\mathcal{C}$.

3) $\mathcal{C}$ randomly chooses a bit $b \xleftarrow{\$} \{0, 1\}$. Then, it calculates $V^i_b = (ED_b, Index_b, Token_{b1}, \ldots, Token_{bn})$ to get the View of $H_i$ and invokes a simulator $\mathcal{S}$ with $Tr$ to obtain a View $V^i_{1-b} = (ED_{1-b}, Index_{1-b}, Token_{(1-b)1}, \ldots, Token_{(1-b)n})$. Both $V^i_b$ and $V^i_{1-b}$ are sent back to $\mathcal{A}$.

4) $\mathcal{A}$ selects $b' \in \{0, 1\}$ as the output of the game.

$\mathcal{S}$ simulates $ED'$ in the same way here as in the proof of Theorem 1. To simulate the index and tokens, $\mathcal{S}$ first randomly picks $M'_1, M'_2 \in \mathbb{R}^{m \times m}$, $S' \in \{0, 1\}^{m}$. The index is generated as follows.

1) Initialize $I'_1$, $I'_2$, and $I'_{temp}$ to empty arrays, and initialize each item of $I'_{temp}$ to be an $m$-bit length array of all zeros.

2) For each $q_i \in Q$ and $i \in [1, n]$,

   a) Generate a Bloom filter ($Bf'_i$) in which the number of 1 bits equals $Max(ip)$, $ip \in R_i$.

   b) For each $(id_j, ip_j) \in R_i$ and $j \in [1, |D|]$, randomly choose $ip_j$ 1s from $Bf'_i$, construct a bit array $Bf'_{ij}$, and set $I'_{temp}[j]$ as $I'_{temp}[j] + Bf'_{ij}$.

   c) For the elements of $I'_{temp}$ that are bigger than 1, replace them with 1.

3) For each $q_i \in Q$, if there does not exist $EL[j] = EL[i]$ ($j < i$), select $tag'_{EL[i]}, e'_1, \ldots, e'_{|D(w_{i1})|} \xleftarrow{\$} \{0, 1\}^{\lambda}$ and set $I'_1[tag'_{EL[i]}] = (e'_1, \ldots, e'_{|D(w_{i1})|})$.

4) Choose a free $e'$ for each $id_x \in R_i$, set $I'_2[e']$ to be the encrypted $I'_{temp}[x]$, and mark $e'$ as used.

5) For the other free $e'$, generate elements in $I'_2$ for them randomly.

6) Pad $I'_1$ and $I'_2$ to the required size with random data.

With all this done, $I'_1$ and $I'_2$ are built.

Then, $\mathcal{S}$ constructs the tokens. For each $q_i \in Q$ and $i \in [1, n]$, $\mathcal{S}$ gets $tag'_{EL[i]}$ and encrypts $Bf'_i$ with $EL[i]$ and $SK'$. $\mathcal{S}$ sets $Token'_i = (tag'_{EL[i]}, Enc(Bf'_i), thr'_i)$, $thr'_i = Min(ip)_i$, $ip \in R_i$.

Basic leakages are not incurred since the files are encrypted by semantically secure symmetric encryption, keywords are encrypted by a secure pseudo-random function $F()$, and file identifiers are encrypted by a secure symmetric encryption algorithm $(Enc, Dec)$. Due to the indistinguishability of keyed hash functions, no PPT adversary is able to carry out linear analysis on the kNN encryption, so EliMFS-E is secure under the known background model.

Given two tokens of queries $\vec{w}_1 = (w_1, w_2)$ and $\vec{w}_2 = (w_3, w_4)$, $Token_1 = (tag_1, Bf_1, thr_1)$ and $Token_2 = (tag_2, Bf_2, thr_2)$, the server cannot get the correct inner product between the Bloom filters encrypted by $w_3$ and $Bf_1$, which is encrypted by $w_1$. Therefore, the approach for obtaining the search result of $(w_3, w_2)$ mentioned above no longer works, and Cross Leakage is prevented. Note that the search result of the first-stage search can be easily covered up by obfuscating operations like padding each item in $I_1$ to a fixed size with random encrypted file identifiers.
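Why a token bound to $w_1$ fails against filters bound to $w_3$ can be seen from the splitting rule alone, because the matrices $M_1, M_2$ cancel in the inner product and can be left out. The sketch below is an illustration under assumptions: the vectors, the two split vectors, the range of the random numbers, and the helper names are all hypothetical.

```python
from fractions import Fraction
import random

def split(vec, s, split_when, rng):
    """Split positions where S[j] == split_when into two random shares; copy the rest."""
    a, b = [], []
    for x, s_j in zip(vec, s):
        if s_j == split_when:
            r = Fraction(rng.randint(-9, 9), 7)
            a.append(Fraction(x, 2) + r)
            b.append(Fraction(x, 2) - r)
        else:
            a.append(Fraction(x))
            b.append(Fraction(x))
    return a, b

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def score(index_bf, s_index, query_bf, s_query, rng):
    """Inner product the server would recover (matrices omitted since they cancel)."""
    i_a, i_b = split(index_bf, s_index, 0, rng)   # index side splits where S = 0
    q_a, q_b = split(query_bf, s_query, 1, rng)   # token side splits where S = 1
    return dot(i_a, q_a) + dot(i_b, q_b)

rng = random.Random(7)
I = [1, 1, 0, 1, 0, 1, 0, 0]       # hypothetical file Bloom filter
Bf = [1, 0, 0, 1, 0, 1, 1, 0]      # hypothetical query Bloom filter
S_w3 = [0, 1, 1, 0, 1, 0, 0, 1]    # split vector bound to w3
S_w1 = [1, 1, 1, 0, 1, 0, 0, 0]    # split vector bound to w1 (differs at 0 and 7)

matched = score(I, S_w3, Bf, S_w3, rng)      # token and index share the same S_w
mismatched = score(I, S_w3, Bf, S_w1, rng)   # cross use of Token1 on D(w3)
```

At a position where both sides are split, the random terms no longer cancel, so the mismatched score carries noise and does not equal the true inner product.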

V. THEORETICAL ANALYSIS

Before dealing with experiments, we first analyze the performance of our schemes theoretically. As shown in TABLE VIII, we compare the asymptotic performance of our schemes with the schemes of [5], [11] and [23].

TABLE VIII
COMPARISON OF ASYMPTOTIC PERFORMANCE

Scheme       Multi-keyword Search   Fuzzy Search   Search Time    Communication Cost   Storage Cost
paper [23]   ×                      √              O(|W|)         O(1)                 O(|W|)
paper [5]    √                      ×              O(|W|)         O(|W|)               O(|W|^2)
paper [11]   √                      √              O(|D|)         O(1)                 O(|D|)
EliMFS-B     √                      √              O(|D(w1)|)     O(1)                 O(|W| + |D|)
EliMFS-E     √                      √              O(|D(w1)|)     O(1)                 O(Σ_{w∈W} |D(w)|)

The scheme of [23] is a fuzzy keyword searchable encryption scheme with an inverted index of size $O(|W|)$. In this scheme, the entire index must be scanned to find the fuzzy result of a keyword. Thus, its search time complexity is $O(|W|)$. Wang et al. [5] proposed a method of searching multiple keywords over an inverted index. To get the result, the cloud server has to compute the product of the token, whose length is $|W|$, with the entire inverted index, which includes a $|W| \times |W|$ matrix. As a result, the efficiency of this scheme is not quite satisfactory, and it does not support a fuzzy keyword search. The scheme of [11] provides the functions of both multi-keyword and fuzzy keyword searches. However, its search time is linear with the size of the file set because it utilizes a forward index.

The search complexity of EliMFS-B is $O(|D(w_1)|)$, since the search time of the first-stage index is $O(1)$ and the second search stage only takes place in the results of the first search stage. This scheme needs only one round of communication, and its token size is constant. Therefore, its communication cost is a constant $O(1)$, no matter how many keywords are searched at one time. The storage cost of the cloud server is the space for $I_1$ and $I_2$; that is, $O(|W| + |D|)$. In conclusion, EliMFS-B has the best performance in all efficiency aspects with a basic semantic security guarantee.

EliMFS-E has a performance similar to EliMFS-B. It has a larger storage cost, but its leakage resilience is improved. Since $I_1$ is an inverted index and each file-keyword pair corresponds to an item in $I_2$, the storage space for the index is $O(|W| + \sum_{w \in W} |D(w)|)$, which is approximately $O(\sum_{w \in W} |D(w)|)$. Suppose the client has a file set of 10000 files and each file contains 150 keywords; the size of the index of EliMFS-E is about 30 GB when $m = 5000$. For the same file set, the size of the index of EliMFS-B is only about 200 MB. The querying and searching processes of EliMFS-E are similar to those of EliMFS-B. Thus, the communication cost and the search complexity remain the same. However, the search time of EliMFS-E may be even shorter than that of EliMFS-B because all the Bloom filters that match the same keyword can be stored in consecutive positions, which means that only one disk read operation is needed. The decryption operation $id \leftarrow Dec(K_e, e)$ is also omitted in EliMFS-E. The trade-off is worthwhile because the storage space of the cloud server is enormous. In future work, we will improve this scheme by optimizing the storage cost.

VI. EXPERIMENTAL ANALYSIS

To demonstrate practical utility, we tested our schemes on a real-world dataset: 10000 files with 16027 keywords selected from the Enron Email Dataset [24]. We implemented prototypes of EliMFS-B, EliMFS-E, and the basic scheme of [11], which is represented as MFS.

The programs were written in Java and were executed on a desktop equipped with an Intel Core i5-4200H CPU at 2.8 GHz, 8 GB RAM, and 64-bit Windows 10. For simplicity, the AES algorithm was used as the pseudo-random function $F()$ and the CPA-secure symmetric encryption algorithm $(Enc, Dec)$. SHA-256 and HmacSHA1 were utilized in $h_i(\vartheta_w)$ and $h_i(\vartheta_w, key_i)$, respectively.

[Figure: probability (y-axis) versus number of hash functions (x-axis); curves: pr1 (c = 0.2), pr1 (c = 0.5), pr2 (c = 0.2), pr2 (c = 0.5)]

Fig. 3. The performance of keyword fuzzifying with various $l$s and $c$s

A. Search Accuracy

To achieve the best accuracy for the keyword search, the parameters in LSH and the Bloom filter need to be selected prudently. The fuzzy search is realized with Locality-Sensitive Hashing, as described earlier. We employ a 2-stable distribution LSH family ($\mathcal{F}$) which contains $l$ hash functions, and each hash function has the expression

$$h_{(\vec{a}, b)}(\vec{v}) = \left\lfloor \frac{\vec{a} \cdot \vec{v} + b}{c} \right\rfloor$$

Here, $\vec{a}$ is a $26 \times 26 = 676$ dimensional vector, and each of its dimensions is randomly chosen from the distribution defined by $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. $b \in [0, c]$ is a random real number. While $c$ is a fixed real number for one hash family, different choices of $\vec{a}$ and $b$ result in different hash functions.
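A hash family of this form might be sketched as follows. The encoding of a keyword as a 676-dimensional count vector of adjacent letter pairs is an assumption (this section does not spell out the exact encoding), and the seed and function names are hypothetical; the parameters $l = 7$ and $c = 0.3$ are the ones selected later in the experiments.

```python
import math
import random

DIM = 26 * 26  # one dimension per ordered pair of lowercase letters

def two_gram_vector(word):
    """Assumed encoding: count of each adjacent letter pair in the word."""
    v = [0] * DIM
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    for x, y in zip(letters, letters[1:]):
        v[(ord(x) - ord("a")) * 26 + (ord(y) - ord("a"))] += 1
    return v

def make_hash(c, rng):
    """One function h_(a,b)(v) = floor((a . v + b) / c)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(DIM)]   # standard normal entries
    b = rng.uniform(0.0, c)
    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / c)
    return h

def make_family(l, c, seed):
    rng = random.Random(seed)
    return [make_hash(c, rng) for _ in range(l)]

def fuzzify(word, family):
    """Intermediate value of a keyword: the tuple of its l hash values."""
    v = two_gram_vector(word)
    return tuple(h(v) for h in family)

family = make_family(l=7, c=0.3, seed=0)
value = fuzzify("cloud", family)
```

Two keywords collide under such a family only if all $l$ hash values agree, which is why adding hash functions makes collisions between dissimilar keywords rarer and collisions between similar keywords rarer as well.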

To achieve the optimal performance of the fuzzy search, a large number of keywords are involved in our experiment. The intermediate values of these keywords and of the keywords that are similar to them are calculated to evaluate the fuzzy level. In our experiment, we use two probabilities to evaluate the accuracy of the fuzzy keyword search:

• True positive: $pr_1 = \Pr[\mathcal{F}(w_1) = \mathcal{F}(w_2) \mid w_1 \text{ and } w_2 \text{ differ in only one character}]$

• True negative: $pr_2 = \Pr[\mathcal{F}(w_1) \ne \mathcal{F}(w_2) \mid w_1 \text{ and } w_2 \text{ differ in multiple characters}]$

These probabilities should be as high as possible. Fig. 3 shows that $pr_1$ decreases as the number of hash functions in $\mathcal{F}$ grows and increases as $c$ becomes larger. However, $pr_2$ behaves in the opposite way: it increases as the number of hash functions in $\mathcal{F}$ increases and decreases as $c$ grows. Therefore, it is important to make a trade-off between $pr_1$ and $pr_2$. In the following experiments, we set $l = 7$ and $c = 0.3$ so that $pr_1$ and $pr_2$ are both larger than 0.8.

[Figure: false positive rate (y-axis) versus $m$ (x-axis, 1000 to 5000); curves: k = 5, k = 10, k = 15, k = 20]

Fig. 4. The false positive rates of Bloom filter with various $m$s and $k$s

[Figure: time in seconds (y-axis) versus file set size (x-axis, 1000 to 10000); series: Bf encryption, EliMFS-B, MFS]

Fig. 5. The index generation times when $|D|$ ranges

The Bloom filter has a distinct advantage over other data structures in the space and time cost of searching. However, it also has a shortcoming: searching with a Bloom filter produces false positives, and the false positive rate is $P_{fp} = (1 - e^{-kn/m})^k$ [15], where $n$ is the number of elements in the set, $m$ is the length of the bit array, and $k$ is the number of hash functions. In our experiments, 100 to 182 keywords are extracted from one file, so we set $n$ to be 180. As shown in Fig. 4, the false positive rates are tested under different $m$s and $k$s. The result indicates that $P_{fp}$ grows with $k$ when $m$ is small and declines as $k$ increases when $m$ is large. Moreover, $P_{fp}$ is always smaller than 0.0004 when $m$ is larger than 3000. Therefore, in the following experiments, we set $m$ to 5000 and $k$ to 20 for the best search accuracy.
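The closed-form estimate can be evaluated directly for the parameters mentioned above. This is only the theoretical formula, not the measured rates plotted in Fig. 4, and the function name is hypothetical.

```python
import math

def false_positive_rate(n, m, k):
    """Theoretical Bloom filter false positive rate (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Parameters chosen for the experiments: n = 180, m = 5000, k = 20
p_selected = false_positive_rate(n=180, m=5000, k=20)

# Effect of k for a short and a long bit array
small_m = [false_positive_rate(180, 1000, k) for k in (5, 10, 15, 20)]
large_m = [false_positive_rate(180, 5000, k) for k in (5, 10, 15, 20)]
```

For $m = 1000$ the estimate grows with $k$, while for $m = 5000$ it shrinks with $k$, in line with the trend described for Fig. 4.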

Due to the existence of false positives and false negatives, some matched files may have a lower inner product than other files. A trade-off between false positives and false negatives can be made by choosing a proper threshold ($thr$).

B. Efficiency

In this subsection, we mainly analyze the time consumption of our schemes from different aspects.

1) Index building time: In Fig. 5, the columns with stripes show the index generation time of our EliMFS-B and MFS. We can see that the generation time increases linearly with $|D|$. EliMFS-B takes a little more time than MFS because it needs to build two indexes, $I_1$ and $I_2$, while MFS only needs to build one. However, the difference is quite small since the most time-consuming operation is the encryption of the Bloom filter, shown as the purple columns. Thus, faster matrix calculation can shorten the initialization time effectively. Index generation is executed only once when the system starts up, which has little influence on the efficiency of the whole system. Therefore, the extra time consumption of EliMFS-B is acceptable.

[Figure: token generation time in ms (y-axis) versus number of queried keywords (x-axis, 0 to 20); curves: EliMFS-B, EliMFS-E, MFS]

Fig. 6. The token generation times with different $|\vec{w}|$s

2) Token generation time: Fig. 6 shows the token generation time with different numbers of queried keywords. When only one keyword is searched, our schemes show a huge advantage. The token generation time is quite small because only a tag is computed. When multiple keywords are searched, the lines jump up to more than 60 ms because the most time-intensive step in this phase is the encryption of the Bloom filter, which costs nearly 60 ms. The token generation time rises linearly as the number of keywords grows. In addition, the growth rate of EliMFS-E is larger than that of both EliMFS-B and MFS. This is because $|\vec{w}|$ insertions are needed to put all the queried keywords into the Bloom filter, and the keyed hash function $h_i(\vartheta_w, key_i)$ takes more time than $h_i(\vartheta_w)$. Even though EliMFS-B and EliMFS-E take a little more time to generate tokens, they obtain more benefits from the search time, as shown below.
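The linear growth with $|\vec{w}|$ and the extra cost of keyed hashing can be illustrated with a small sketch. It is only an illustration under stated assumptions: HMAC-SHA256 stands in for the keyed hash $h_i(\cdot, key_i)$ and plain SHA-256 for the unkeyed $h_i(\cdot)$, keywords are hashed as strings rather than as the vectors $\vartheta_w$, and the LSH step and the Bloom filter encryption are omitted. All function names are hypothetical.

```python
import hashlib
import hmac
from typing import List, Optional, Sequence

M = 5000  # Bloom filter length, as set in the experiments
K = 20    # number of hash functions, as set in the experiments


def unkeyed_position(keyword: str, i: int) -> int:
    """Stand-in for h_i(w): SHA-256 with the index i mixed into the input."""
    digest = hashlib.sha256(f"{i}:{keyword}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % M


def keyed_position(keyword: str, i: int, key: bytes) -> int:
    """Stand-in for h_i(w, key_i): HMAC-SHA256 under the i-th key."""
    digest = hmac.new(key, f"{i}:{keyword}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % M


def build_query_filter(
    keywords: Sequence[str], keys: Optional[Sequence[bytes]] = None
) -> List[int]:
    """Insert every queried keyword: |w| insertions of K hashes each."""
    bits = [0] * M
    for w in keywords:
        for i in range(K):
            if keys is None:
                pos = unkeyed_position(w, i)
            else:
                pos = keyed_position(w, i, keys[i])
            bits[pos] = 1
    return bits


if __name__ == "__main__":
    demo_keys = [bytes([i]) * 16 for i in range(K)]  # toy keys, not secure
    plain = build_query_filter(["cloud", "storage"])
    keyed = build_query_filter(["cloud", "storage"], demo_keys)
    print(sum(plain), sum(keyed))  # number of set bits in each filter
```

Both variants perform $|\vec{w}| \cdot k$ hash evaluations, so the cost grows linearly with the number of queried keywords; the keyed variant pays an additional constant factor per evaluation.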

3) Search time: As the most important performance metric, search times with various impact factors are presented in Fig. 7. Fig. 7a reflects the relationship between the search time and the file set size. With the growth of the file set size, the curve of MFS ascends linearly, while the curves of EliMFS-B and EliMFS-E grow slowly at a low level. This is because in MFS, the search process always has to go through the entire file set. The search times of our schemes, on the other hand, only rely on the results of the first search stage. When the file set gets larger, the size of the search result in the first stage grows, which leads the second search stage to expend more time. EliMFS-B is a little slower than EliMFS-E because it executes one more operation: $id \leftarrow Dec(K_e, e)$.

The relationship between the search time and three other parameters is displayed in Figs. 7b, 7c, and 7d, with the file set size fixed at 10000. The search time of MFS remains unchanged in these three plots since the file set size is constant. In Fig. 7b, different quantities of keywords are extracted from these files so that various sizes of $W$ are made up. We can see that the line trends of our schemes are similar to those in Fig. 7a: the search times of our schemes are much shorter than that of MFS, and they grow slowly with $|W|$. The more keywords extracted from a file, the more likely a file will match. Therefore, more results can be obtained in the first search stage. This is why the search times of our schemes increase with the growth of the keyword set size.

[Figure 7: four line plots of search time in ms for EliMFS-B, EliMFS-E, and MFS. (a) Search times with different $|D|$s (x-axis: file set size, 0 to 10000). (b) Search times with different $|W|$s (x-axis: keyword set size, 5000 to 20000). (c) Search times with different $|D(w_1)|$s (x-axis: number of files that match $w_1$, 0 to 2000). (d) Search times with different $|\vec{w}|$s (x-axis: number of queried keywords).]

Fig. 7. The search time of EliMFS-B, EliMFS-E, and MFS

Fig. 7c demonstrates the relationship between the search time and $|D(w_1)|$ when searching in 10000 files, where $w_1$ is the selected keyword in a query for the first stage search. The search time of MFS stays constant as mentioned above, and in EliMFS-B and EliMFS-E, the search time linearly increases with $|D(w_1)|$ because the server needs to search over all the files that match $w_1$ to get an accurate result for the queried keyword set. Since $w_1$ is the keyword that matches the fewest files, the size of $D(w_1)$ would not be very large. Thus, EliMFS-B and EliMFS-E search much faster than MFS, except in the extreme case where $w_1$ matches almost all the files in $D$.
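The cost difference between scanning every file and scanning only $D(w_1)$ can be sketched on plaintext data. This is a toy illustration under several assumptions, not the EliMFS construction: the indexes are unencrypted, the first stage is an ordinary inverted-index lookup, SHA-256 replaces the LSH functions, and the threshold is set for exact matching. The names `bloom`, `linear_scan_search`, and `two_stage_search` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List

M, K = 1024, 4  # toy Bloom filter length and hash count (illustrative only)


def bloom(keywords: Iterable[str]) -> List[int]:
    """Build a plaintext Bloom filter bit vector for a set of keywords."""
    bits = [0] * M
    for w in keywords:
        for i in range(K):
            digest = hashlib.sha256(f"{i}:{w}".encode()).digest()
            bits[int.from_bytes(digest[:8], "big") % M] = 1
    return bits


def inner_product(a: List[int], b: List[int]) -> int:
    return sum(x & y for x, y in zip(a, b))


def linear_scan_search(filters: Dict[str, List[int]],
                       query: List[str], thr: int) -> List[str]:
    """Baseline: score every file, so the cost grows with |D|."""
    q = bloom(query)
    return [fid for fid, bf in filters.items() if inner_product(bf, q) >= thr]


def two_stage_search(inverted: Dict[str, List[str]],
                     filters: Dict[str, List[int]],
                     query: List[str], thr: int) -> List[str]:
    # Stage 1: look up w1, the queried keyword that matches the fewest files.
    w1 = min(query, key=lambda w: len(inverted.get(w, [])))
    candidates = inverted.get(w1, [])
    if len(query) == 1:
        return list(candidates)  # the second stage is skipped
    # Stage 2: score only the |D(w1)| candidates against the query filter.
    q = bloom(query)
    return [fid for fid in candidates
            if inner_product(filters[fid], q) >= thr]


files = {
    "f1": ["cloud", "storage", "search"],
    "f2": ["cloud", "privacy"],
    "f3": ["network", "search"],
}
filters = {fid: bloom(ws) for fid, ws in files.items()}
inverted: Dict[str, List[str]] = {}
for fid, ws in files.items():
    for w in ws:
        inverted.setdefault(w, []).append(fid)

query = ["cloud", "search"]
thr = sum(bloom(query))  # exact-match threshold for this toy example
print(two_stage_search(inverted, filters, query, thr))
print(linear_scan_search(filters, query, thr))
```

In this sketch the second stage computes one inner product per candidate in $D(w_1)$, which mirrors the linear dependence on $|D(w_1)|$ described above, while the baseline computes one per file in $D$.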

In Fig. 7d, when the file set size is fixed (10000 files and 16027 keywords), the search time is almost independent of the query size in all three schemes; in such a scenario, our schemes spend far less time than MFS. When only one keyword is queried, the second search stage will be omitted, and as a result, the search time is extremely short. When $|\vec{w}| > 1$, the search time remains stable no matter how many keywords are searched.

In conclusion, our EliMFS schemes, based on LSH, the Bloom filter, and a two-stage index, have a token generation time similar to that of MFS, but reduce the search time remarkably.

VII. RELATED WORK

Searchable Encryption (SE) enables data owners to outsource their private files to a semi-trusted server without revealing the plaintexts while simultaneously guaranteeing keyword search functions. Researchers have proposed many state-of-the-art SE schemes: [2] proposed the first Searchable Symmetric Encryption (SSE) scheme, a full-domain search scheme rather than an index-based scheme; [25] and [26] presented SE schemes based on asymmetric encryption; [27] and [28] dealt with a malicious cloud server and proposed schemes to support verifiable and searchable symmetric encryption; [29] redefined the security of SE schemes based on the adversarial server’s prior knowledge; [6] and [8] proposed schemes that support dynamic updating; [30] improved SE schemes in terms of the index I/O efficiency; [31] utilized attribute-based encryption to achieve verifiable multi-user SSE; [32] proposed a new verifiable database (VDB) framework so that clients can not only retrieve database records, but also detect any attempt by the server to tamper with the data; [33] first applied encrypted keyword searching on de-duplicated data; and [34] and [35] applied searchable encryption to specific applications, such as public key encryption and E-Health Clouds. However, none of these works support multi-keyword or fuzzy keyword searches.

A. Multi-keyword Search

The multi-keyword search schemes can search multiple keywords simultaneously over encrypted data. [3] constructed a binary data vector for each file where each bit in the vector represents a keyword of the file. This structure is similar to schemes based on the Bloom filter when $k = 1$, where $k$ is the number of hash functions. In [36], a tree structure index based on term frequency was used to support ranked searches, and the cosine similarity measure was exploited to check whether the file index contained indicated keywords. [4] utilized an inverted index and a hash table to improve efficiency, but increased communication costs. Furthermore, [9] expanded [4] to a multi-user scenario. [37] supported a pattern-matching string search, a more flexible method than a general boolean search SSE. [5] achieved a multi-keyword search with an inverted index for the first time, but its efficiency was not quite satisfactory. [38] dealt with ranked multi-keyword searches in a multi-owner mode based on a bilinear map and the Decisional Bilinear Diffie-Hellman (DBDH) assumption, but an administration server was needed. All of these schemes consider only exact keyword searches.

B. Fuzzy Keyword Search

To deal with spelling mistakes and searches for similar keywords, multiple fuzzy keyword search schemes were proposed [39], [40]. [7] proposed a wildcard-based fuzzy keyword search over encrypted data. [41] improved it with a smaller index. The work of [42], which proposed a higher security guarantee but with a larger space requirement, was also based on [7]. In [23], fingerprints were extracted from keywords and encrypted with a secure kNN encryption to achieve a top-$k$ fuzzy search. Xu et al. [43] exploited edit distances to quantify keyword similarities in their work, and they also employed a TF-IDF (Term Frequency and Inverse Document Frequency) rule to rank the results. However, most existing works (except [10] and [11]) do not support multi-keyword and fuzzy searches simultaneously.

C. Multi-keyword Fuzzy Search

To implement a fuzzy search, [10] exploited a Bedtree inverted index and realized multi-keyword search functions through pre-defined phrases, e.g., treating “Cloud Storage” as a single keyword. However, in this scheme, one Bloom filter must be built for each edit distance value of each keyword, which leads to a tremendous index size. In [11], Wang et al. used the forward index, the Bloom filter, and Locality-Sensitive Hashing techniques to implement multi-keyword fuzzy search. However, its search time is linear in the file set size in the cloud. Therefore, its performance may be severely degraded when there are huge file sets to search. In [44], Wang et al. proposed a ranked multi-keyword fuzzy search on encrypted data based on a two-layered index structure. Unfortunately, both layers use a forward index, so the search time is still linear in the file set size in the cloud. Fu et al. [16] improved search accuracy via a uni-gram based keyword transformation method, but the search time remains linear in the file set size. Our paper focuses on further improving the search performance without hurting the system’s privacy in terms of leakages.

VIII. CONCLUSION

In this paper, we focus on a multi-keyword fuzzy search over a large encrypted file set in cloud storage. An Efficient Leakage-resilient Multi-keyword Fuzzy Search (EliMFS) framework is proposed for the encrypted cloud data; it consists of a novel two-stage index structure, Locality-Sensitive Hashing, and a Bloom filter. The two-stage index structure ensures that the search time does not scale linearly with the file set size. Meanwhile, the multi-keyword fuzzy search function is implemented based on the Gram Counting Order, the Bloom filter, and Locality-Sensitive Hashing. Regarding the leakages caused by the two-stage index, we present two schemes to handle threats in different threat models. Theoretical analysis and experimental evaluations demonstrate our design’s practicality.

ACKNOWLEDGMENT

This research was supported in part by the Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; by NSF grants CNS 1460971, CNS 1439672, CNS 1301774, ECCS 1231461, ECCS 1128209, and CNS 1138963; by the National Natural Science Foundation of China under grants 61272451, 61572380, and U1536204; by the Major State Basic Research Development Program of China under grant 2014CB340600; and by the National High Technology Research and Development Program (863 Program) of China under grant 2014BAH41B00. The corresponding author is Kun He.

REFERENCES

[1] K. He, J. Chen, R. Du, Q. Wu, G. Xue, and X. Zhang, “DeyPoS: Deduplicatable dynamic proof of storage for multi-user environments,” IEEE Transactions on Computers, vol. 65, no. 12, pp. 3631–3645, 2016.

[2] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for searches on encrypted data,” in 2000 IEEE Symposium on Security and Privacy, (Los Alamitos, CA, USA), IEEE Computer Society, 2000.

[3] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving multi-keyword ranked search over encrypted cloud data,” in Proceedings of 2011 IEEE INFOCOM International Conference on Computer Communications, 2011.

[4] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Highly-scalable searchable symmetric encryption with support for boolean queries,” in Advances in Cryptology - CRYPTO 2013, Lecture Notes in Computer Science, pp. 353–373, Springer Berlin Heidelberg, Jan. 2013.

[5] B. Wang, W. Song, W. Lou, and Y. T. Hou, “Inverted index based multi-keyword public-key searchable encryption with strong privacy guarantee,” in Proceedings of the 2015 IEEE INFOCOM International Conference on Computer Communications, 2015.

[6] M. Naveed, M. Prabhakaran, and C. A. Gunter, “Dynamic searchable encryption via blind storage,” Tech. Rep. 219, 2014.

[7] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, “Fuzzy keyword search over encrypted data in cloud computing,” in Proceedings of the 2010 IEEE INFOCOM International Conference on Computer Communications, 2010.

[8] E. Stefanov, C. Papamanthou, and E. Shi, “Practical dynamic searchable encryption with small leakage,” IACR Cryptology ePrint Archive, 2013.

[9] S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner, “Outsourced symmetric private information retrieval,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, CCS ’13, (New York, NY, USA), pp. 875–888, ACM, 2013.

[10] M. Chuah and W. Hu, “Privacy-aware BedTree based solution for fuzzy multi-keyword search over encrypted data,” in Proceedings of the 2011 ICDCSW International Conference on Distributed Computing Systems, pp. 273–281, June 2011.

[11] B. Wang, S. Yu, W. Lou, and Y. Hou, “Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud,” in Proceedings of the 2014 IEEE INFOCOM International Conference on Computer Communications, 2014.

[12] E.-J. Goh and others, “Secure indexes,” IACR Cryptology ePrint Archive, vol. 2003, p. 216, 2003.

[13] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchable symmetric encryption: Improved definitions and efficient constructions,” in Proceedings of the 13th ACM Conference on Computer and Communications Security, CCS ’06, (New York, NY, USA), pp. 79–88, ACM, 2006.

[14] “Locality-sensitive hashing,” Jan. 2016. Page Version ID: 701793057.

[15] “Bloom filter,” Feb. 2016. Page Version ID: 704138885.

[16] Z. Fu, X. Wu, C. Guan, X. Sun, and K. Ren, “Toward Efficient Multi-Keyword Fuzzy Search Over Encrypted Outsourced Data With Accuracy Improvement,” IEEE Transactions on Information Forensics and Security, vol. 11, pp. 2706–2716, Dec. 2016.

[17] J. Chen, K. He, Q. Yuan, G. Xue, R. Du, and L. Wang, “Batch identification game model for invalid signatures in wireless mobile networks,” IEEE Transactions on Mobile Computing, 2016.

[18] A. Behm, S. Ji, C. Li, and J. Lu, “Space-Constrained Gram-Based Indexing for Efficient Approximate String Search,” in Proceedings of the 25th IEEE ICDE International Conference on Data Engineering, 2009.

[19] M. Datar and P. Indyk, “Locality-sensitive hashing scheme based on p-stable distributions,” in SCG 04: Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262, ACM Press, 2004.

[20] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.

[21] W. K. Wong, D. W.-l. Cheung, B. Kao, and N. Mamoulis, “Secure kNN Computation on Encrypted Databases,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, (New York, NY, USA), pp. 139–152, ACM, 2009.

[22] B. Yao, F. Li, and X. Xiao, “Secure nearest neighbor revisited,” in Proceedings of the 2013 ICDE International Conference on Data Engineering, 2013.

[23] D. Wang, S. Fu, and M. Xu, “A privacy-preserving fuzzy keyword search scheme over encrypted cloud data,” in Proceedings of the 2013 IEEE CloudCom International Conference on Cloud Computing Technology and Science, 2013.

[24] EDRM, “Enron email dataset.”

[25] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in Advances in Cryptology-Eurocrypt 2004, pp. 506–522, Springer, 2004.

[26] M. Bellare, A. Boldyreva, and A. O’Neill, “Deterministic and efficiently searchable encryption,” in Advances in Cryptology - CRYPTO 2007, Lecture Notes in Computer Science, pp. 535–552, Springer Berlin Heidelberg, 2007.

[27] R. Cheng, J. Yan, C. Guan, F. Zhang, and K. Ren, “Verifiable Searchable Symmetric Encryption from Indistinguishability Obfuscation,” in Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, ASIA CCS ’15, (New York, NY, USA), pp. 621–626, ACM, 2015.

[28] W. Sun, X. Liu, W. Lou, Y. T. Hou, and H. Li, “Catch you if you lie to me: Efficient verifiable conjunctive keyword search over large dynamic encrypted cloud data,” in Proceedings of the 2015 IEEE INFOCOM International Conference on Computer Communications, pp. 2110–2118, Apr. 2015.

[29] D. Cash, P. Grubbs, J. Perry, and T. Ristenpart, “Leakage-Abuse Attacks Against Searchable Encryption,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, (New York, NY, USA), pp. 668–679, ACM, 2015.

[30] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Dynamic searchable encryption in very large databases: Data structures and implementation,” in Proceedings of the 2014 NDSS Network and Distributed System Security Symposium, vol. 14, 2014.

[31] Q. Zheng, S. Xu, and G. Ateniese, “VABKS: Verifiable attribute-based keyword search over outsourced encrypted data,” in Proceedings of the 2014 IEEE INFOCOM International Conference on Computer Communications, 2014.

[32] X. Chen, J. Li, X. Huang, and M. Jianfeng, “New Publicly Verifiable Databases With Efficient Updates,” IEEE Transactions on Dependable and Secure Computing, vol. 12, no. 5, pp. 1–1, 2014.

[33] J. Li, X. Chen, F. Xhafa, and L. Barolli, “Secure Deduplication Storage Systems Supporting Keyword Search,” Journal of Computer and System Sciences, vol. 81, no. 8, pp. 1532–1541, 2015.

[34] R. Chen, Y. Mu, G. Yang, F. Guo, and X. Wang, “Dual-Server Public-Key Encryption With Keyword Search for Secure Cloud Storage,” IEEE Transactions on Information Forensics and Security, vol. 11, pp. 789–798, Apr. 2016.

[35] Y. Yang and M. Ma, “Conjunctive Keyword Search With Designated Tester and Timing Enabled Proxy Re-Encryption Function for E-Health Clouds,” IEEE Transactions on Information Forensics and Security, vol. 11, pp. 746–759, Apr. 2016.

[36] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, “Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking,” in Proceedings of the 2013 ACM SIGSAC Symposium on Information, Computer and Communications Security, ASIA CCS ’13, (New York, NY, USA), pp. 71–82, ACM, 2013.

[37] D. Wang, X. Jia, C. Wang, K. Yang, S. Fu, and M. Xu, “Generalized pattern matching string search on encrypted data in cloud systems,” in Proceedings of the 2015 IEEE INFOCOM International Conference on Computer Communications, pp. 2101–2109, Apr. 2015.

[38] W. Zhang, Y. Lin, S. Xiao, J. Wu, and S. Zhou, “Privacy Preserving Ranked Multi-Keyword Search for Multiple Data Owners in Cloud Computing,” IEEE Transactions on Computers, vol. 65, no. 5, pp. 1566–1577, 2016.

[39] J. Wang, H. Ma, Q. Tang, J. Li, H. Zhu, S. Ma, and X. Chen, “Efficient verifiable fuzzy keyword search over encrypted data in cloud computing,” Computer Science and Information Systems, vol. 10, no. 2, pp. 667–684, 2013.

[40] P. Xu, H. Jin, Q. Wu, and W. Wang, “Public-key encryption with fuzzy keyword search: A provably secure scheme under keyword guessing attack,” IEEE Transactions on Computers, vol. 62, no. 11, pp. 2266–2277, 2013.

[41] C. Liu, L. Zhu, L. Li, and Y. Tan, “Fuzzy keyword search on encrypted cloud storage data with small index,” in Proceedings of the 2011 IEEE CCIS International Conference on Cloud Computing and Intelligence Systems, 2011.

[42] A. Boldyreva and N. Chenette, “Efficient Fuzzy Search on Encrypted Data,” in Fast Software Encryption, Lecture Notes in Computer Science, pp. 613–633, Springer Berlin Heidelberg, 2014.

[43] Q. Xu, H. Shen, Y. Sang, and H. Tian, “Privacy-preserving ranked fuzzy keyword search over encrypted cloud data,” in International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 239–245, 2013.

[44] J. Wang, X. Yu, and M. Zhao, “Privacy-Preserving Ranked Multi-keyword Fuzzy Search on Cloud Encrypted Data Supporting Range Query,” Arabian Journal for Science & Engineering (Springer Science & Business Media B.V.), vol. 40, pp. 2375–2388, Aug. 2015.

Jing Chen received his Ph.D. degree in computer science from Huazhong University of Science and Technology, Wuhan. He is currently a full professor at the Computer School at Wuhan University. His research interests are in cloud security and network security. He has published more than 80 research papers in many international journals and conferences, such as IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, IEEE Transactions on Mobile Computing, INFOCOM, SECON, and TrustCom.

Kun He is a Ph.D. student at Wuhan University. His research interests include cryptography, network security, mobile computing, and cloud computing. He has published research papers in IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, IEEE Transactions on Mobile Computing, Security and Communication Networks, and IEEE TRUSTCOM.

Lan Deng received her B.S. degree in information security from Wuhan University, Wuhan, China, in 2013. She is currently a master's degree candidate in the Computer School at Wuhan University. Her research interests include cryptography, network security, secure de-duplication, and searchable encryption. She has published research papers in IEEE TRUSTCOM, 2014. She also served as a reviewer for IEEE Transactions on Information Forensics & Security.

Quan Yuan is an assistant professor in the Department of Math and Computer Science at the University of Texas-Permian Basin, TX, USA. His research interests include mobile computing, routing protocols, peer-to-peer computing, parallel and distributed systems, and computer networks. He has published more than 30 research papers in many international journals and conferences, such as IEEE Transactions on Parallel and Distributed Systems, INFOCOM, MobiHoc, SECON, and TrustCom.

Ruiying Du received her B.S., M.S., and Ph.D. degrees in computer science in 1987, 1994, and 2008, respectively, from Wuhan University, Wuhan, China. She is a full professor at the Computer School at Wuhan University. Her research interests include network security, wireless networks, cloud computing, and mobile computing. She has published more than 80 research papers in many international journals and conferences, such as IEEE Transactions on Parallel and Distributed Systems, the International Journal of Parallel and Distributed Systems, INFOCOM, SECON, TrustCom, and NSS.

Yang Xiang received his Ph.D. in computer science from Deakin University, Australia. He is currently a full professor at the School of Information Technology, Deakin University. His research interests include network and system security, distributed systems, and networking. He has published more than 130 research papers in many international journals and conferences, such as IEEE Transactions on Computers and IEEE Transactions on Information Forensics and Security. He has served as a PC member for more than 60 international conferences in networking and security. He serves as an Associate Editor of IEEE Transactions on Computers and Security and Communication Networks (Wiley).

Jie Wu is the Associate Vice Provost for International Affairs at Temple University. He also serves as Director of the Center for Networked Computing and as Laura H. Carnell professor. He served as Chair of Computer and Information Sciences from 2009 to 2016. Prior to joining Temple University, he was a program director at the National Science Foundation and was a distinguished professor at Florida Atlantic University. His current research interests include mobile computing and wireless networks, routing protocols, cloud and green computing, network trust and security, and social network applications. Dr. Wu regularly publishes in scholarly journals, conference proceedings, and books. He serves on several editorial boards, including IEEE Transactions on Services Computing and the Journal of Parallel and Distributed Computing. Dr. Wu was general co-chair for IEEE MASS 2006, IEEE IPDPS 2008, IEEE ICDCS 2013, ACM MobiHoc 2014, ICPP 2016, and IEEE CNS 2016, as well as program co-chair for IEEE INFOCOM 2011 and CCF CNCC 2013. He was an IEEE Computer Society Distinguished Visitor, ACM Distinguished Speaker, and chair for the IEEE Technical Committee on Distributed Processing (TCDP). Dr. Wu is a CCF Distinguished Speaker and a Fellow of the IEEE. He is the recipient of the 2011 China Computer Federation (CCF) Overseas Outstanding Achievement Award.