

Page 1: Efficient and Secure File Deduplication in Cloud Storagecaislab.kaist.ac.kr/publication/paper_files/2014... · 2018-09-27 · SHIN and KIM: EFFICIENT AND SECURE FILE DEDUPLICATION

184  IEICE TRANS. INF. & SYST., VOL.E97–D, NO.2 FEBRUARY 2014

PAPER

Efficient and Secure File Deduplication in Cloud Storage

Youngjoo SHIN†,††a) and Kwangjo KIM†b), Members

SUMMARY Outsourcing to a cloud storage brings forth new challenges for the efficient utilization of computing resources while simultaneously maintaining privacy and security for the outsourced data. Data deduplication refers to a technique that eliminates redundant data on the storage and the network, and is considered one of the most promising technologies for efficient resource utilization in cloud computing. In terms of data security, however, deduplication obstructs applying encryption to the outsourced data and even creates a side channel through which information can be leaked. Achieving both efficient resource utilization and data security still remains open. This paper addresses this challenging issue and proposes a novel solution that enables data deduplication while also providing the required data security and privacy. We achieve this goal by constructing and utilizing equality predicate encryption schemes, which reveal only equivalence relations between encrypted data. We also adopt a hybrid approach to data deduplication to prevent information leakage through the side channel. The performance and security analyses indicate that the proposed scheme efficiently and securely manages outsourced data in cloud computing.
key words: cloud computing security, data deduplication, predicate encryption, online guessing attack

1. Introduction

Cloud computing is a promising technology that economically enables data outsourcing as a service using Internet technologies with elastic provisioning and usage-based pricing [1]. As demand for data outsourcing increases, the pay-as-you-use cloud paradigm drives the need for cost-efficient storage, specifically for reducing storage space and network bandwidth overhead, which directly translates into financial cost savings. Furthermore, the need for data security and privacy also arises as more and more sensitive data, such as emails and personal health records, are being outsourced to the cloud. For cloud computing to keep growing rapidly, both problems of cost efficiency and data security should be addressed well and resolved simultaneously.

In order to achieve cost savings, commercial service providers utilize their resources efficiently through cross-user data deduplication [2], which refers to a technique that eliminates copies of a redundant file (or data chunk) across users and provides links to the file instead of storing the copies. There are two approaches to deduplication, depending on where it takes place. In server-side (i.e., target-based) deduplication, the cloud storage server mainly handles deduplication. On the other hand, client-side (i.e., source-based) deduplication occurs at the client before a file is transferred to the server. Client-side deduplication improves not only storage space but also network bandwidth utilization. As noted in [3], this technique achieves disk and bandwidth savings of more than 90%, and thus most commercial storage service providers, including DropBox, Mozy and Memopal, take advantage of client-side deduplication.

Manuscript received November 19, 2012.
Manuscript revised October 22, 2013.
†The authors are with the Department of Computer Science, KAIST, Korea.
††The author is with the Attached Institute of ETRI, Korea.
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1587/transinf.E97.D.184

Data security is another challenging issue in cloud computing, which originates from the fact that cloud servers are usually outside the trust domain of the data owners. In recent years, the security research community has mainly addressed privacy and security of outsourced cloud data, specifically focusing on access control [4]–[8] and searchability over encrypted data [9]–[13]. These proposed solutions are usually attained through encryption techniques. Most commercial storage service providers, however, are reluctant to apply encryption to the stored data, because encryption impedes data deduplication [3], [14]. Storage service providers are unlikely to stop using deduplication due to the high cost savings offered by the technique.

Several approaches [15]–[18] proposed ideas to enable data deduplication over encrypted data. Their solutions are commonly based on so-called convergent encryption, which takes a hashed value of a file as an encryption key. Using this technique, identical ciphertexts are always generated from identical plaintexts in a deterministic manner. Hence, the cloud server can perform deduplication over encrypted data without knowing the exact content of the file. From the viewpoint of cryptography, however, it is well understood that deterministic encryption, which naturally includes convergent encryption, is not as secure as randomized encryption [19].
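The convergent-encryption idea can be illustrated with a minimal, deliberately insecure sketch (standard library only; a SHA-256-derived key drives a hash-based XOR keystream standing in for a real block cipher — the function names here are our own, not from [15]–[18]):

```python
import hashlib

def _keystream(key: bytes, length: int) -> bytes:
    # Hash-based counter-mode keystream; a toy stand-in for a real cipher.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(plaintext: bytes):
    # The key is the hash of the content itself, so identical plaintexts
    # deterministically yield identical ciphertexts -- which is exactly
    # what lets the server deduplicate without reading the content.
    key = hashlib.sha256(plaintext).digest()
    ct = bytes(p ^ k for p, k in zip(plaintext, _keystream(key, len(plaintext))))
    return key, ct

def convergent_decrypt(key: bytes, ct: bytes) -> bytes:
    return bytes(c ^ k for c, k in zip(ct, _keystream(key, len(ct))))
```

Two owners of the same file independently derive the same key and ciphertext, so the server stores only one copy; but this determinism is precisely why the approach falls short of randomized-encryption security [19].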

Keeping the outsourced data confidential from the untrusted cloud server is not the only security problem for a system utilizing data deduplication. It has been shown that most storage services using client-side deduplication commonly incur a side channel through which an adversary may learn information about the contents of other users' files [3]. This is because these services share two inherent properties: 1) data transmission in a network can be visible to an adversary, and 2) deterministic and small-sized data, such as a hashed value of a file, is queried to the cloud server before file uploading. A user who does not have access to some file but is curious about it might mount an online guessing attack

Copyright © 2014 The Institute of Electronics, Information and Communication Engineers


Fig. 1  Online guessing attack scenario: for each possible version of the target file, an unauthorized user keeps querying its hashed value until receiving a 'FileExists' message from the cloud server.

exploiting the side channel. The attack is as follows: an adversary first constructs a search domain comprising possible versions of the target file, and then queries a hash of each version to the server until a deduplication event occurs (see Fig. 1). This attack is very relevant in a corporate environment where file content usually has low entropy. That is, most files are small variations of standard templates, and the number of possible versions of the target file is moderate. It is challenging but certainly necessary to prevent such information leakage in a client-side deduplication system.
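A minimal simulation of this attack (class and message names are hypothetical; real services differ in protocol detail) makes the low-entropy problem concrete:

```python
import hashlib

class DedupStorage:
    """Toy cloud server doing hash-based client-side deduplication."""
    def __init__(self):
        self._digests = set()

    def upload(self, data: bytes):
        self._digests.add(hashlib.sha256(data).hexdigest())

    def check(self, digest: str) -> str:
        # The pre-upload existence check is the side channel.
        return "FileExists" if digest in self._digests else "SendFile"

def guessing_attack(server, candidates):
    # Query the hash of every possible version of the target file
    # until the server's dedup response confirms a match (Fig. 1).
    for guess in candidates:
        if server.check(hashlib.sha256(guess).hexdigest()) == "FileExists":
            return guess
    return None
```

For example, a salary letter whose only unknown field is the amount yields a search domain of a few thousand candidates, well within an online attacker's reach.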

To sum up, the two issues of efficient resource utilization and data security in cloud computing have not yet been considered together and well addressed in either academia or industry. It actually still remains open to achieve data security against the untrusted server as well as against unauthorized users capable of online guessing attacks, while enabling data deduplication.

In this paper, we address this issue and propose a novel solution for cloud storage that raises efficiency up to a practical level while also achieving strong security. Concretely, we propose a data deduplication scheme which is secure against the untrusted server as well as unauthorized users. Our proposed scheme primarily utilizes a novel cryptographic primitive, namely equality predicate encryption, which allows the cloud server to know only equivalence relations among ciphertexts without leaking any other information about their plaintexts. Thus data privacy can be kept from the cloud server while data deduplication is still enabled over the encrypted files. In addition, the proposed scheme allows the cloud server to perform data deduplication in a hybrid manner. That is, deduplication occurs either at the server side or at the client side, with some probability. This strategy greatly reduces the risk of information leakage by increasing the attack overhead of online-guessing adversaries. In order to achieve our design goals, we constructed two equality predicate encryption schemes in the symmetric-key setting suitable for our application. The proposed scheme is built upon these constructions, and the required data security is strongly enforced through the underlying provable security of the constructions.

The main contributions of this paper can be summarized as follows: 1) This paper is the first to simultaneously resolve the two issues of data security and efficient resource utilization in cloud computing; 2) The proposed scheme also gives data owners a guarantee that adding their data to the cloud storage has a very limited effect on what an adversary exploiting the side channel may learn about the data; 3) The proposed scheme as well as our constructed equality predicate encryption schemes are proved to be secure under standard cryptographic assumptions.

The rest of this paper is organized as follows: Section 2 presents the system models and an overview of our solution. Section 3 describes our constructions of the equality predicate encryption schemes. In Sect. 4, we describe in detail the proposed secure data deduplication scheme based on these constructions. In Sect. 5, we analyze the security and discuss the performance of the proposed scheme. We describe related work in Sect. 6. Finally, we conclude our work in Sect. 7.

2. Models and Solution Overview

2.1 System and Attack Model

We assume that the system consists of three entities: data owners, users and a cloud server. A data owner wishes to outsource data (or files) to the external storage server and is in charge of encrypting the data before outsourcing it. In order to access the outsourced data, a user should be authorized by the data owner. An authorized user possesses the secret key of the data file and can read the data by decrypting it. The cloud server, which stores outsourced data from data owners, has abundant but limited storage and network capacity. The cloud server is operated by the cloud service provider, who is interested in cost savings by improving disk space and network utilization.

Similar to previous approaches, we assume the cloud server to be honest-but-curious. That is, the cloud server will honestly execute the assigned tasks in the system but may try to learn as much information about the stored data as possible. Unauthorized users would try to access stored data that they are not authorized to read. To achieve this goal, they may perform an online guessing attack using, as an oracle, the side channel incurred by client-side deduplication. In addition, users and other data owners may collude, pooling whatever information they have, to compute the secret information necessary to access the data. Both the cloud server and any unauthorized users mounting these attacks should be prevented from obtaining any information about data on the storage that they are not authorized to access.

We also consider malicious data owners who mount attacks targeting the cloud server or other data owners. When uploading a file, they may try to disturb deduplication or corrupt data owned by others through dishonest behavior such as computing with a manipulated secret key or uploading a fake file. The reliability of the proposed scheme against such attacks should be retained as well.

2.2 Solution Overview

Our goal is to help the cloud server enjoy the cost savings offered by data deduplication, while also giving data owners a guarantee that their data is kept confidential against the


untrusted cloud server and the unauthorized users. Specifically, our motivation is to design an efficient and secure cross-user data deduplication scheme for cloud storage services. To this end, we have constructed, based on PEKS (Public-key Encryption with Keyword Search) [9], two equality predicate encryption schemes in the symmetric-key setting: 1) SEPE (Symmetric-key Equality Predicate Encryption), a basic equality predicate encryption scheme, and 2) SEPEn, an extension of SEPE that has one-to-n token mapping (n ≥ 1). That is, SEPEn allows n possible tokens to match the corresponding encrypted file. Our constructions allow a cloud server in possession of a token associated with each file to learn the equivalence relations between the encrypted files without knowledge of the file content.

In the proposed scheme, a data owner generates ciphertexts of a file and the corresponding tokens using both SEPE and SEPEn before uploading the file to the cloud server. The cloud server then searches the storage for an identical file by comparing these tokens with stored ciphertexts. If an identical file already exists in the storage, the cloud server will always perform server-side deduplication using SEPE, even without knowing the file content. On the other hand, SEPEn enables client-side deduplication with some probability less than 1, because just one out of the n possible tokens will be correctly verified against the stored file. Hence, using SEPEn imposes substantial search complexity on online guessing adversaries, so that no adversary with probabilistic and polynomially bounded computational resources can recover the information. With the help of SEPE and SEPEn, the proposed scheme provides data security while also enabling deduplication either at the server side or at the client side, in a hybrid manner. Some network transmission overhead may be introduced due to the randomized occurrence of client-side deduplication, but the overhead can be minimized while ensuring the required security, as we will discuss later.
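As an illustrative model (not the exact mechanism, which Sect. 4 details), suppose the observable client-side deduplication event fires for only one of the n candidate tokens, i.e. with probability roughly 1/n per upload, while server-side deduplication silently covers the rest. An online guesser, who learns something only from client-side events, then needs about n times more queries on average:

```python
import random

def upload_event(n: int, rng: random.Random) -> str:
    # Only one of the n possible tokens matches, so the observable
    # client-side dedup signal appears with probability about 1/n;
    # otherwise the file is sent and deduplicated server-side.
    return "client-side" if rng.random() < 1.0 / n else "server-side"

def observed_rate(n: int, trials: int = 100000, seed: int = 1) -> float:
    rng = random.Random(seed)
    hits = sum(upload_event(n, rng) == "client-side" for _ in range(trials))
    return hits / trials
```

The same factor governs the bandwidth trade-off: client-side savings shrink as n grows, which is the overhead-versus-security tension mentioned above.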

3. Symmetric-Key Equality Predicate Encryption

In this section, we formally define symmetric-key equality predicate encryption and its security requirement. Then, we construct two encryption schemes, SEPE and SEPEn, which fulfill the definition and the security requirement. The security of the constructed schemes is also rigorously proved under cryptographic complexity assumptions.

3.1 Definitions

3.1.1 Symmetric-Key Equality Predicate Encryption

We first describe the concept of predicate encryption before giving the definition of symmetric-key equality predicate encryption. By the definitions in [20] and [21], predicate encryption is a kind of functional encryption system in which a token associated with a function allows a user to evaluate the function over the encrypted data. In a predicate encryption scheme, an encryption of a plaintext M can be evaluated using a token associated with a predicate to learn whether M satisfies the predicate. Symmetric-key equality predicate encryption is a predicate encryption scheme in the symmetric-key setting that allows evaluating an equality predicate over a ciphertext of M, given a token for another plaintext M′, to learn whether M and M′ are equal. The symmetric-key setting means that both encryption and token generation are computed with the same secret key. Note that PEKS [9] is an equality predicate encryption scheme in the public-key setting.

Definition 1. A symmetric-key equality predicate encryption scheme consists of the following probabilistic polynomial-time algorithms:

Initialize(1^λ): Take as input a security parameter 1^λ and output a global parameter Param. For brevity, the global parameter Param output by the Initialize algorithm is omitted below.
KeyGen(aux): Output a secret key SK. An auxiliary input aux may be used in generating the secret key.
Encrypt(SK, M): Take as input a secret key SK and a plaintext M and output a ciphertext CT.
GenToken(SK, M): Take as input a secret key SK and a plaintext M and output a token TK.
Test(TK, CT): Take as input TK = GenToken(SK, M′) and CT = Encrypt(SK, M), and output 'Yes' if M = M′ and 'No' otherwise.

3.1.2 Security Definition

The required security of a symmetric-key equality predicate encryption scheme is adaptive chosen plaintext security. We define the security against an active adversary who can access an encryption oracle and also obtain tokens TK for any plaintext M of his choice. Even under such an attack, the adversary must be unable to distinguish an encryption of a plaintext M0 from an encryption of a plaintext M1. We now define the security game between a challenger and the adversary A as follows:

Setup: The challenger runs the Initialize algorithm to generate a global parameter Param and the KeyGen algorithm to generate a secret key SK. It gives the global parameter to the adversary and keeps SK to itself.
Phase 1: The adversary can adaptively ask the challenger for a ciphertext CT or a token TK for any plaintext M ∈ {0, 1}∗ of his choice.
Challenge: Once A decides that Phase 1 is over, A sends two challenge plaintexts M0, M1 to the challenger. The restriction is that A did not previously ask for the ciphertexts or the tokens of M0 and M1. The challenger picks a random b ∈ {0, 1} and gives A a CT of Mb.
Phase 2: A continues to ask for ciphertexts CT or tokens TK for any plaintext M of his choice except M0, M1.
Guess: Finally, A outputs b′ ∈ {0, 1} and wins the game if b = b′.


We define the adversary A's advantage in breaking the symmetric-key equality predicate encryption scheme as

AdvA = |Pr[b = b′] − 1/2|.

Definition 2. A symmetric-key equality predicate encryption scheme is semantically secure against an adaptive chosen plaintext attack if for any probabilistic polynomial-time (PPT) adversary A, the advantage of A in winning the game is negligible in λ.

3.2 Cryptographic Background

In this section, we briefly review bilinear maps and the complexity assumption on which our constructions are built.

3.2.1 Bilinear Maps

Our constructions are based on some facts about groups with efficiently computable bilinear maps. Let G and GT be two multiplicative cyclic groups of prime order p, and let g be a generator of G. A bilinear map is a function e : G × G → GT with the following properties:

1. Bilinearity: for all u, v ∈ G and a, b ∈ Z∗p, we have e(u^a, v^b) = e(u, v)^ab.
2. Non-degeneracy: e(g, g) ≠ 1.
3. Computability: there is an efficient algorithm to compute e(u, v) for all u, v ∈ G.

3.2.2 Bilinear Diffie-Hellman (BDH) Assumption

Let a, b, c ∈ Z∗p be chosen at random and let g be a generator of G. The Bilinear Diffie-Hellman problem is to compute e(g, g)^abc ∈ GT given g, g^a, g^b, g^c ∈ G as input. The BDH assumption [22], [23] states that no PPT algorithm can solve the BDH problem with non-negligible advantage.

3.3 Our Constructions

We give two constructions of symmetric-key equality predicate encryption schemes based on the Bilinear Diffie-Hellman assumption in the random oracle model: (1) SEPE, a basic construction of a symmetric-key equality predicate encryption scheme, and (2) SEPEn, which has one-to-n token mapping so that n distinct tokens can match the corresponding ciphertext. We will need the following hash functions: H0 : {0, 1}∗ → Z∗p, H1 : {0, 1}∗ → G, H2^i : {0, 1}∗ → Z∗p, where i ∈ N is an index, and H3 : GT → {0, 1}^log p. H2^i can easily be constructed from a keyed hash algorithm such as a MAC in which the index i is used as a key.

3.3.1 SEPE

The SEPE scheme consists of the following algorithms.

Initialize(1^λ): The initialize algorithm randomly chooses a prime p of bit length λ and generates a bilinear group G of order p with generator g. Then, it outputs a tuple Param = ⟨p, g, G, GT⟩ as a global parameter. Note that Param is implicitly taken as an input by the following algorithms.
KeyGen(aux): The key generation algorithm takes a binary string aux of arbitrary length as an auxiliary input and computes SK = H0(aux) ∈ Z∗p. Then, it outputs SK as a secret key.
Encrypt(SK, M): Let us denote SK = α ∈ Z∗p. The encryption algorithm first chooses r ∈ Z∗p at random and then computes a ciphertext CT = ⟨h^r, g^r, H3(e(h, g^α)^r)⟩, where h = H1(M) ∈ G.
GenToken(SK, M): Let us denote SK = α ∈ Z∗p. The algorithm first chooses k ∈ Z∗p at random and then computes a token of a plaintext M as TK = ⟨h^(α+k), g^k⟩, where h = H1(M) ∈ G.
Test(TK, CT): We assume that CT = ⟨c1, c2, c3⟩ for a plaintext M and TK = ⟨t1, t2⟩ for a plaintext M′. The test algorithm computes

γ = e(t1, c2) / e(c1, t2).

If c3 = H3(γ), it outputs 'Yes'; if not, it outputs 'No'.
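Why Test accepts exactly when the plaintexts match can be seen by expanding γ with bilinearity, writing h = H1(M) for the ciphertext and h′ = H1(M′) for the token:

```latex
\gamma \;=\; \frac{e(t_1, c_2)}{e(c_1, t_2)}
       \;=\; \frac{e\!\left(h'^{\,\alpha+k},\, g^{r}\right)}{e\!\left(h^{r},\, g^{k}\right)}
       \;=\; \frac{e(h', g)^{r(\alpha+k)}}{e(h, g)^{rk}} .
```

If M = M′ then h = h′, the k-dependent factors cancel, and γ = e(h, g)^{rα} = e(h, g^α)^r, so H3(γ) equals c3 and Test outputs 'Yes'; for M ≠ M′ the h′-dependent factor does not cancel, so the check fails except with negligible probability.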

The algorithm KeyGen generates a secret key from aux, which is distributed over {0, 1}∗. Let us denote by H∞^aux the min-entropy of the distribution of aux.

Theorem 1. SEPE described above is semantically secure against an adaptive chosen plaintext attack in the random oracle model, assuming that BDH is intractable and H∞^aux is polynomial in λ.

Proof. Refer to Appendix A for the proof. □

3.3.2 SEPEn

The SEPEn scheme consists of the following algorithms.

Initializen(1^λ′): The initialize algorithm parses an input 1^λ′ in polynomial time as a tuple ⟨1^λ, d⟩ that contains two security parameters 1^λ and d. The algorithm randomly chooses a prime p of bit length λ and generates G of order p with generator g. Then, it outputs Param = ⟨p, g, G, GT, d⟩ as a global parameter. Note that the security parameter d inherently defines a set of permutations Φ whose permutation domain is D = {1, 2, . . . , d} ⊂ N.
KeyGenn(aux): The key generation algorithm takes a binary string aux of arbitrary length as an auxiliary input and outputs a secret key SK = ⟨α, π⟩ by computing α = H0(aux) ∈ Z∗p and choosing π randomly from Φ.
Encryptn(SK, M): Let us denote SK = ⟨α, π⟩, where α ∈ Z∗p and π : D → D ∈ Φ. The encryption algorithm first chooses r ∈ Z∗p at random and then computes h ∈ G by using the two vectors u = (1, 2, . . . , d) and vπ = (H2^π(1)(M), H2^π(2)(M), . . . , H2^π(d)(M)):

h = g^(u·vπ), where u·vπ = Σ_{i∈D} i · H2^π(i)(M).


Table 1  Comparison of predicate encryption schemes.

                            PEKS [9]      Shen et al. [21]   Blundo et al. [24]   SEPE (SEPEn)
Secret key environment      public-key    symmetric-key      symmetric-key        symmetric-key
Predicate type              equality      multiple           multiple             equality
Cross-user deduplication    No            Yes                Yes                  Yes
One-to-n token mapping      No            No                 No                   Yes
Ciphertext size             |G| + log g   ≥ 4|G|             ≥ 2|G| + |GT|        2|G| + log g
Token size                  |G|           ≥ 4|G|             ≥ 2|G|               2|G|
Encrypt algorithm cost      2e + p        ≥ 8m + 8e          ≥ 3e                 3e + p
GenToken algorithm cost     e             ≥ 8m + 8e          ≥ 2e                 2e
Test algorithm cost         p             ≥ 3m + 4p          ≥ 2m + 2p            m + 2p

|G|: size of an element in G, |GT|: size of an element in GT, g: order of G, m: multiplication (division), e: exponentiation, p: pairing

The ciphertext is CT = ⟨h^r, g^r, H3(e(h, g^α)^r)⟩.
GenTokenn(SK, M): Let us denote SK = ⟨α, π⟩. The algorithm first chooses k ∈ Z∗p at random and then computes h = g^(u·vπ) = g^(Σ_{i∈D} i·H2^π(i)(M)). The token of the plaintext M is TK = ⟨h^(k+α), g^k⟩.
Testn(TK, CT): We assume that CT = ⟨c1, c2, c3⟩ for a plaintext M and TK = ⟨t1, t2⟩ for a plaintext M′. The test algorithm computes

γ = e(t1, c2) / e(c1, t2).

If c3 = H3(γ), it outputs 'Yes'; if not, it outputs 'No'.

Theorem 2. SEPEn described above is semantically secure against an adaptive chosen plaintext attack in the random oracle model, assuming that BDH is intractable and H∞^aux is polynomial in λ.

Proof. Refer to Appendix B for the proof. □

3.4 Discussion

We discuss the effectiveness of our constructions, SEPE and SEPEn, by comparing them with several similar predicate encryption schemes. Table 1 summarizes the comparison of the predicate encryption schemes.

PEKS [9] is a public-key equality predicate encryption scheme that allows identifying equality relations between ciphertexts and tokens. PEKS is designed to run in the public-key environment, and thus the equality of ciphertexts that are encrypted with one's public key can be tested only with tokens that are generated with the corresponding private key. This public-key setting is not suitable for cross-user data deduplication, in which the equality test should be performed across ciphertexts and tokens that are generated with different users' secret keys. For cloud storage services, it is more desirable to perform data deduplication across multiple users' outsourced data than to deduplicate only over a single user's data, since doing so gives more chances to eliminate redundancy and saves more of the cloud server's resources.

Cross-user deduplication can be implemented with a symmetric-key type of predicate encryption algorithm. Shen et al. [21] and Blundo et al. [24] proposed symmetric-key predicate encryption schemes that support more general predicates such as a comparison predicate (x ≥ a), a subset predicate (x ∈ S), and arbitrary conjunctive predicates (P1 ∧ P2 ∧ . . . ∧ Pl). Both schemes also support predicate privacy, a security property that a token reveals no information about the predicate it contains. For data deduplication, however, for which the equality predicate alone is sufficient, the feature supporting multiple predicates is unnecessary, and it is inefficient in terms of ciphertext (token) size and the computation time of encryption, token generation and predicate testing. Predicate privacy is also unnecessary for data deduplication, since only one type of predicate is required.

SEPE and SEPEn are designed to be more suitable for data deduplication. The symmetric-key based construction makes them applicable to cross-user deduplication, which eliminates redundancy of data across multiple users. Moreover, SEPE (SEPEn) uses less storage (i.e., smaller ciphertexts and tokens) and incurs less computation than the other symmetric-key predicate encryption schemes, particularly in the Test algorithm, which causes most of the computational burden on the cloud server. The one-to-n token mapping of SEPEn is another feature that makes our constructions more useful for designing secure data deduplication.

4. The Proposed Data Deduplication Scheme

4.1 Definition and Notation

We state our definitions and notations. Table 2 describes the notations used in the proposed scheme. For each file, a data owner assigns a globally unique file id fid and encrypts the file with a set of randomly chosen symmetric encryption keys FEK = {FEK1, FEK2, . . . , FEKl}. Also, FEK is encrypted with another secret key α. The cloud server maintains a search index SI to retrieve a stored file in the storage. SI is a list of entries, each of which logically corresponds to a file entity and comprises a set of search keys and an index value. The search keys include a file id fid and two predicate ciphertexts CT_fid, CTn,fid, which are encrypted with SEPE and SEPEn, respectively. Files in the storage are retrieved by the cloud server with one of the search keys.

The usage of a search key depends on the type of file operation requested by users. The index value contains an address addrFile at which the file is physically stored in the system. Each SI entry has its own address addrIndex and can be


Table 2 Notations used in the proposed scheme description.

Notation            Description
FEK1, . . . , FEKl  symmetric file encryption keys
SI                  search index for file retrieval in the cloud storage
fid                 a unique id assigned to each file
addr_obj            address at which an object obj resides in the system
CT_fid              SEPE ciphertext of a file of fid
CTn,fid             SEPEn ciphertext of a file of fid
CT[i]_fid           i-th element of CT_fid (i = 1, 2, 3)
CT[i]_n,fid         i-th element of CTn,fid (i = 1, 2, 3)
TK_fid              token which corresponds to ciphertext CT_fid
TKn,fid             token which corresponds to ciphertext CTn,fid
TK[i]_fid           i-th element of TK_fid (i = 1, 2)
TK[i]_n,fid         i-th element of TKn,fid (i = 1, 2)
{M}key              ciphertext of M encrypted by a symmetric encryption algorithm with encryption key key
<i1, i2, . . . , in>  tuple of n elements i1, i2, . . . , in

Fig. 2 A search index and a file storage on a cloud server.

referenced through this address. Figure 2 shows the format of an SI entry and the structure of the file storage.

For data reliability, an erasure code is used in the proposed scheme. An erasure (x, y)-code is an encoding function E : X → Y that transforms a message X comprising x symbols into a code Y of x + y redundant symbols such that any x of them are sufficient to reconstruct the original message X. The number of losses that the erasure code E can sustain is y, and the redundancy factor is RD = (x + y)/x.
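The definition above can be illustrated with the simplest case, y = 1. The sketch below is not part of the paper; plain XOR parity serves as a hypothetical stand-in for the unspecified erasure code E, encoding x symbols into x + 1 so that any x of them reconstruct the message:

```python
from functools import reduce

def encode_parity(symbols):
    """Toy (x, 1) erasure code: append one XOR parity symbol.
    Any x of the x+1 output symbols suffice to rebuild the input."""
    parity = reduce(lambda a, b: a ^ b, symbols)
    return symbols + [parity]

def decode_parity(received):
    """received: list of length x+1 with at most one entry set to None."""
    if received.count(None) == 0:
        return received[:-1]
    missing = received.index(None)
    value = reduce(lambda a, b: a ^ b, (s for s in received if s is not None))
    restored = list(received)
    restored[missing] = value
    return restored[:-1]

x, y = 4, 1
message = [7, 13, 42, 99]          # x symbols
code = encode_parity(message)       # x + y symbols
rd = (x + y) / x                    # redundancy factor RD = 1.25

lost = list(code)
lost[2] = None                      # lose one symbol (y = 1 losses tolerated)
assert decode_parity(lost) == message
```

A real deployment would use a code such as Reed-Solomon, which tolerates y > 1 losses; the parity example only demonstrates the (x, y) interface and the redundancy factor.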

4.2 Scheme Description

The proposed scheme is composed of several system-level operations: System Setup, New File Upload, File Access, File Update and File Deletion.

System Setup: Initially, the cloud server and data owners agree on two security parameters λ and d. The cloud server calls the algorithm Initializen(⟨λ, d⟩), which outputs the global parameter Param = ⟨p, g, G, GT, d⟩. The parameters p, g, G and GT are used commonly in both SEPE and SEPEn.

New File Upload: When uploading a file to the cloud server, a data owner proceeds as follows:

1. assign a unique file id fid to this data file.
2. generate a secret key ⟨α, π⟩ by running KeyGenn(File), where File is a binary representation of the file content (α is also used as the secret key of SEPE).
3. compute two tokens TK_fid and TKn,fid by running GenToken(α, File) and GenTokenn(⟨α, π⟩, File), respectively.
4. compute two ciphertexts CT_fid = Encrypt(α, File) and CTn,fid = Encryptn(⟨α, π⟩, File).
5. query the cloud server with the tuple ⟨fid, TK_fid, TKn,fid, CT_fid, CTn,fid⟩.

To ensure that the tuple is correct, the data owner is required to compute the ciphertexts and tokens by sharing random values r, k ∈ Z*_p between SEPE and SEPEn at steps 3 and 4 above. (Thus, the tuple should satisfy CT[2]_fid = CT[2]_n,fid and TK[2]_fid = TK[2]_n,fid.)

(Tuple verification) Upon receiving the tuple, the cloud server first verifies the consistency of the received tuple by checking that both Testn(TKn,fid, CTn,fid) and Test(TK_fid, CT_fid) output 'Yes'. Then, the cloud server continues to test Eqs. (1), (2) and (3).

e(CT[1]_fid, TK[1]_n,fid) = e(CT[1]_n,fid, TK[1]_fid)    (1)
CT[2]_fid = CT[2]_n,fid    (2)
TK[2]_fid = TK[2]_n,fid    (3)

If any of the verification steps fails, the cloud server reports an error and halts the operation.

(Deduplication) If the tuple is successfully verified, the cloud server retrieves entries of the search index SI, comparing TKn,fid with the CTn,fid′ of each SI entry through the algorithm Testn(TKn,fid, CTn,fid′). If a matching entry is found, the cloud server gets the address addrFile from the entry and creates a new entry ⟨fid, CT_fid, CTn,fid, addrFile⟩. The cloud server then appends it to SI and responds to the data owner with a message FileExist (i.e., a client-side deduplication event occurs). If not, the cloud server responds with a message FileNotExist.

Upon receiving a message FileNotExist, the data owner continues to upload the actual file to the cloud server as follows:

1. transform File into F by running the erasure code E, and split F according to the block length b (e.g., b = 512 bits) into l blocks F1, F2, . . . , Fl (padding is added if the length of F is not aligned to b).
2. randomly choose encryption keys FEK1, FEK2, . . . , FEKl from the key space.
3. encrypt the file blocks with a symmetric encryption algorithm: C = ⟨{F1}FEK1, {FEK1}α, . . . , {Fl}FEKl, {FEKl}α⟩, where α is the hashed value of the file described above.

4. send a tuple < f id, C> to the cloud server.
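The encryption steps above can be sketched as follows. This is only an illustration under stated assumptions: the paper uses a cipher such as AES, but to stay dependency-free this sketch substitutes a SHA-256 keystream as a toy (insecure) stand-in; it models the hashed value α directly as the SHA-256 digest of the file, and takes the block length b in bytes (64 bytes = 512 bits):

```python
import hashlib, os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher standing in for AES: XOR with a SHA-256 keystream
    in counter mode. Encryption and decryption are the same operation.
    NOT secure (keystreams repeat across messages); for illustration only."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(c ^ k for c, k in zip(data[i:i + 32], block))
    return bytes(out)

def encrypt_file(file_bytes: bytes, b: int = 64):
    """Steps 1-3 of the actual upload (erasure coding omitted): split into
    l blocks of b bytes, pick a random FEK_i per block, and wrap each
    FEK_i under alpha, the hash of the file."""
    alpha = hashlib.sha256(file_bytes).digest()
    if len(file_bytes) % b:                       # pad to a multiple of b
        file_bytes += b"\x00" * (b - len(file_bytes) % b)
    blocks = [file_bytes[i:i + b] for i in range(0, len(file_bytes), b)]
    C = []
    for Fi in blocks:
        fek = os.urandom(32)                      # random per-block key
        C.append((keystream_xor(fek, Fi), keystream_xor(alpha, fek)))
    return alpha, C

def decrypt_file(alpha: bytes, C):
    """Recover the padded file from C with the decryption key alpha."""
    return b"".join(keystream_xor(keystream_xor(alpha, enc_fek), enc_blk)
                    for enc_blk, enc_fek in C)

alpha, C = encrypt_file(b"hello deduplicated world")
assert decrypt_file(alpha, C).rstrip(b"\x00") == b"hello deduplicated world"
```

Note that because each FEK_i is random, the block ciphertexts are randomized, while α itself is deterministic, exactly the split the scheme relies on.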

Upon receiving the message ⟨fid, C⟩ from the data owner, the cloud server searches for the same file by comparing TK_fid with the CT_fid′ of each SI entry through Test(TK_fid, CT_fid′). If the same file already exists in the storage, the cloud server randomly picks lr (0 ≤ lr ≤ l), and replaces lr arbitrary blocks of the existing file (including their FEKs) by


newly uploaded lr blocks from C. The cloud server then creates a new entry ⟨fid, CT_fid, CTn,fid, addrFile⟩, where addrFile is taken from the matching entry, and appends it to SI (i.e., a server-side deduplication event occurs). If no match is found, the cloud server allocates new storage space at a new address addrFile and puts the encrypted file C into the file system. The server then creates a new SI entry ⟨fid, CT_fid, CTn,fid, addrFile⟩ and appends it to SI. Instead of two distinct searches in this operation, it is possible to perform a single search during uploading by retrieving SI with CT_fid and CTn,fid at once.

As described above, the hashed value α acts not only as a plaintext of SEPE and SEPEn but also as the encryption key of the FEKs associated with the file itself. Using α as an encryption key helps keep the system simple, since data owners in possession of the same file implicitly share the hashed value of the file without any explicit key sharing. If a random encryption key were used instead of the file hash, a costly key management mechanism such as broadcast encryption [25] would be needed to share the key among data owners who hold a common file. This strategy is similar to convergent encryption, but unlike convergent encryption it is not the heart of the technique that enables deduplication.

File Access: A user authorized to access a file stored in the cloud storage should be in possession of the file's ⟨fid, α⟩. The user sends a file access request message with a file id fid. The cloud server finds the stored file by retrieving the search index SI with the search key fid. If found, the cloud server accesses the file in the file system at the address addrFile from the matching entry, and returns C = ⟨{F1}FEK1, {FEK1}α, . . . , {Fl}FEKl, {FEKl}α⟩ to the user. The user recovers the encrypted file with the decryption key α. The search index SI can be ordered by the search key fid and thus built into a tree structure; the search cost with the key fid then becomes O(log n), where n denotes the size of SI.
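The O(log n) lookup by fid can be sketched with a sorted list and binary search (a stand-in for the tree structure; the entry contents below are hypothetical placeholders, not the paper's exact SI layout):

```python
import bisect

class SearchIndex:
    """Toy SI ordered by fid: lookups cost O(log n) via binary search.
    Values stand in for <CT, CTn, addrFile> entries."""
    def __init__(self):
        self._fids, self._entries = [], []

    def insert(self, fid, entry):
        pos = bisect.bisect_left(self._fids, fid)
        self._fids.insert(pos, fid)
        self._entries.insert(pos, entry)

    def lookup(self, fid):
        pos = bisect.bisect_left(self._fids, fid)
        if pos < len(self._fids) and self._fids[pos] == fid:
            return self._entries[pos]
        return None                      # fid not found

si = SearchIndex()
si.insert("f2", {"addrFile": 0x20})
si.insert("f1", {"addrFile": 0x10})
assert si.lookup("f1") == {"addrFile": 0x10}
assert si.lookup("f9") is None
```

No cryptographic operation is involved here; fid comparisons are plain ordering, which is why File Access is cheap relative to the token-based searches.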

File Update: The procedure for updating a file is identical to the New File Upload operation.

File Deletion: The data owner sends a file deletion request with fid. The cloud server searches for the matching SI entry associated with fid. If the referenced file is associated with fids other than the requested one, the cloud server just removes the entry from SI. Otherwise, the server also removes the actual file from the storage.

4.3 Efficiency Improvements

We present how to improve the efficiency of the proposed scheme and enhance its performance.

4.3.1 Auxiliary Search Index

The operations New File Upload and File Update require a search over the search index SI. Since SI entries cannot be sorted with respect to the search keys CT and CTn, the searching cost is linear in the size of SI (i.e., the number of files). In this section, we describe a technique that improves the efficiency of the proposed scheme in terms of searching cost. The idea behind the technique is motivated by the fact that the sizes of uploaded files range from a few bytes to gigabytes, and the cloud server can obtain the size of a file from its encrypted form. Our idea is to use an auxiliary search index (AI) in addition to SI. In AI, the size of a file serves as the search key, and the index value contains addrIndex, the address of the SI entry that indicates the corresponding file. AI entries can be sorted with respect to the search key. With the help of a tree structure such as a B+-tree, item insertion, deletion and searching in AI can be handled in an efficient and scalable manner.

When a data owner uploads (or updates) a file, the data owner sends a tuple ⟨size, fid, TK_fid, TKn,fid, CT_fid, CTn,fid⟩ to the cloud server. The server first looks up the AI entries that correspond to size. If found, the cloud server retrieves the matching SI entries addressed by addrIndex. The number of matching SI entries may be one or more, depending on the density of the file-size distribution. The cloud server then enumerates the matching SI entries and finds an entry for which the Test (or Testn) algorithm outputs 'Yes'. Using AI efficiently limits the search scope of SI and thus reduces the searching cost to sub-linear complexity. To use this technique, the original scheme needs only a slight modification: the size field is added to the above tuple.
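A minimal sketch of the auxiliary-index idea follows; a plain dictionary stands in for the B+-tree, an equality callback stands in for Testn, and all entry contents are illustrative rather than the paper's actual ciphertexts:

```python
from collections import defaultdict

class AuxIndex:
    """Toy AI: file size -> addresses of SI entries (addrIndex).
    A B+-tree would make insertion/search scalable; a dict suffices to
    show how AI narrows the linear Testn scan to one size bucket."""
    def __init__(self):
        self.by_size = defaultdict(list)

    def insert(self, size, addr_index):
        self.by_size[size].append(addr_index)

    def candidates(self, size):
        return self.by_size.get(size, [])

# si_entries stands in for SI; each value is (ciphertext, addrFile).
si_entries = {0: ("ctA", 100), 1: ("ctB", 200), 2: ("ctC", 300)}
ai = AuxIndex()
ai.insert(4096, 0)
ai.insert(4096, 2)
ai.insert(8192, 1)

def dedup_lookup(size, token, test):
    for addr_index in ai.candidates(size):      # only the matching bucket
        ct, addr_file = si_entries[addr_index]
        if test(token, ct):                     # Testn stand-in
            return addr_file
    return None

# Equality on the stand-in "ciphertexts" plays the role of Testn.
assert dedup_lookup(4096, "ctC", lambda tk, ct: tk == ct) == 300
assert dedup_lookup(8192, "ctC", lambda tk, ct: tk == ct) is None
```

Only the 4096-byte bucket is scanned in the first query, so the expensive pairing-based test runs on Ns,a entries instead of all Ns.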

4.3.2 Tradeoff between Security and Network Efficiency

The proposed scheme is designed to resist the online guessing attack. While an unauthorized user has difficulty guessing the content of a stored file, a normal file upload may also have little chance of triggering a client-side deduplication event, which increases the network transmission overhead. To avoid losing network transmission efficiency, the proposed scheme allows the security parameter d of SEPEn to be selected as a tradeoff between security and efficiency. The security parameter d denotes the domain size of the permutation and determines the size of the set of all permutations over a domain D with |D| = d. Hence, a lower value of d gives a higher probability that client-side deduplication occurs, and vice versa.

This tradeoff is reasonable in practical applications. Data owners upload not only files with high privacy requirements that contain personal information but also common files such as software setup files. Uploaded files also vary between high and low entropy. In the proposed scheme, the value of d is chosen globally in the system setup phase. However, it is not difficult to modify the proposed scheme so that d is selected per file in the uploading phase. Each file's security parameter d can then be decided based on its properties, such as its entropy or the required level of privacy. File size can be another decision factor for d, because it is reasonable to assume that file size and entropy are highly correlated [3]. For example, a file that requires no privacy, or whose size is greater than some threshold, may be given security parameter d = 1, which implies that client-side deduplication will always occur (with probability 1) whenever the same file already exists in the cloud server.

A quantitative analysis of the relation between the security parameter d and both security and the network transmission overhead will be presented in Sect. 5.2.2.

4.4 Improving Reliability of the Cloud Storage

By applying a general data replication mechanism to the proposed scheme, we can further improve its reliability against malicious data owners who incur data loss. A data replication mechanism with replication factor (RP) r creates r replicas of a data object and stores them across geographically distributed storage systems.

The New File Upload operation of the proposed scheme is extended to utilize data replication as follows. When accepting ⟨fid, C⟩ from a data owner, the cloud server searches SI using the Test algorithm. If a matching entry is found (i.e., a server-side deduplication event occurs), the cloud server gets addrFile from the SI entry and determines the locations of the existing replicas of the file. If the number of existing replicas is less than r, a new replica is allocated to store C = ⟨{F1}FEK1, {FEK1}α, . . . , {Fl}FEKl, {FEKl}α⟩. Otherwise, the cloud server selects one of them and replaces lr blocks of the chosen replica, where lr is randomly picked from {0, 1, . . . , l}, with new blocks of C. In all other cases, the cloud server follows the original protocol described in the previous section.

With the help of an erasure (x, y)-code, corrupted data can be recovered as long as at least ⌈lx/(x + y)⌉ blocks among the r replicas stored in the storage survive the corruption, where l is the number of blocks of the file.

Note that data replication is orthogonal to data deduplication. The goals of removing data redundancy and improving reliability can be achieved simultaneously by using data replication and data deduplication together.

5. Analysis

In this section, we first present the security analysis of the proposed scheme. Then, we present the storage, communication and computation performance analysis.

5.1 Security Analysis

5.1.1 Data Confidentiality

In the proposed scheme, files in the cloud storage are encrypted with a symmetric encryption algorithm such as AES [26]. Encrypting a file generates a randomized ciphertext, because FEK is randomly chosen from the key space. Therefore, any adversary, including the cloud server, who has only ciphertexts of the file cannot distinguish them from random data. That is, the adversary learns no information about the content of a file.

The cloud server performs data deduplication by running the Test and Testn algorithms with tokens and ciphertexts of SEPE and SEPEn as inputs. Unless tokens are given, the cloud server has negligible advantage in obtaining any information about the plaintext from its ciphertexts, by Theorems 1 and 2. Although the cloud server may learn equivalence relations among the ciphertexts in the search index, Theorem 3 states that no PPT algorithm, including the test algorithms Test and Testn, reveals any partial information to the cloud server other than the equivalence relation.

Note that the cloud server can obviously infer the file size from the corresponding ciphertext, because the file size and its ciphertext size are inherently correlated with each other.

Theorem 3. For any two plaintexts M and M′ such that M ≠ M′, any PPT algorithm given SEPE (or SEPEn) ciphertexts and tokens of M and M′ as inputs cannot output any partial information about the plaintexts M and M′ under the random oracle assumption.

Proof. For the sake of simplicity, assume that the hash functions used in SEPE and SEPEn (i.e., H1, H2) are denoted F : {0, 1}* → G. In the computation of the tokens and the ciphertexts, the plaintexts M and M′ are given as inputs when evaluating the hash function F. By the collision-resistance property of a cryptographic hash function, it is highly unlikely that F(M) = F(M′) for any two different plaintexts M and M′. Moreover, since a cryptographic hash function can be modeled as a random oracle [27], both F(M) and F(M′) are indistinguishable from random elements of G. Therefore, in the Test and Testn algorithms, the intermediate computation result γ is indistinguishable from random data in GT, and the test algorithm eventually produces the meaningless output 'No'. Any other PPT algorithm given ciphertexts and tokens of M and M′ as inputs will likewise output data that looks random from the view of the algorithm. □

5.1.2 Resistance against Online Guessing Attack

Now we analyze security against an adversary who mounts an online guessing attack to learn information about a file of interest in the cloud storage. The adversary possesses neither the file nor its hashed value. However, the adversary may have a relatively small search domain for the file, because the file has an inherently low-entropy property or the adversary has some knowledge of it. For each candidate in the search domain, the adversary generates a token TKn and queries the token to the cloud server. If a matching ciphertext CTn exists such that Testn(CTn, TKn) outputs 'Yes' for the query, the cloud server responds with a message FileExist, indicating that a (client-side) deduplication event occurs. The adversary continues querying the cloud server until receiving FileExist.

In this attack scenario, an adversary may select a correct candidate with high likelihood of matching the target file. However, the adversary must query the correct token TKn for the selected candidate in order to trigger a (client-side) deduplication event. Theorem 4 states that the probability that an adversary computes a correct token TKn such that Testn(CTn, TKn) outputs 'Yes' is negligible in the security parameter d. Hence, the proposed scheme is resistant against online guessing attacks.

Theorem 4. For any PPT algorithm A, the probability that A outputs the corresponding token TKn for a given SEPEn ciphertext CTn is negligible in the security parameter d.

Proof. Let us assume a PPT algorithm A that computes a token TKn for a corresponding SEPEn ciphertext CTn. A is given d, g, α and CTn as inputs and tries to find the correct token TKn. We also assume that CTn = ⟨h^r, g^r, H3(e(h, g^α)^r)⟩, where r is chosen at random from Z*_p. If A could obtain h from h^r, it could easily compute the corresponding token TKn = ⟨h^(k+α), g^k⟩ with the given h, where k is chosen randomly from Z*_p. However, computing h from h^r in a cyclic group G is no easier than the discrete logarithm problem (DLP) on elliptic curves, which is generally believed to be computationally infeasible for any PPT algorithm if p is sufficiently large [23]. Hence, the best way for A to find the token is to choose one of the possible tokens at random and test it by running the Testn algorithm. A can choose a token at random by tossing coins to choose a random permutation π : D → D, where D = {1, 2, . . . , d}. Now we analyze the probability that A outputs the correct token TKn given the input CTn. Let Φ denote the set of all permutations π : D → D. We assume that CTn has been computed with a permutation π; that is, h = g^(u·v_π) in CTn, where u·v_π = Σ_{i∈D} i·H2^{π(i)}(M). When algorithm A randomly chooses a permutation π′ from Φ, h is computed as h = g^(u·v_π′) in TKn, where u·v_π′ = Σ_{i∈D} i·H2^{π′(i)}(M). Note that u·v_π and u·v_π′ can also be represented as u·v_π = pk + r and u·v_π′ = pk′ + r′, where p is the order of G, k, k′ ∈ N and 0 ≤ r, r′ < p. Let E_k denote the event p(k − 1) ≤ u·v_π′ < pk. Then, the probability is

Pr{A outputs the correct TKn}
  = Pr{π′ ←R Φ : Testn(CTn, TKn) = 'Yes'}
  = Pr{π′ ←R Φ : g^(u·v_π) = g^(u·v_π′)}
  = Pr{π′ ←R Φ : u·v_π = u·v_π′ mod p}
  ≤ Σ_{k∈N} Pr{r = r′ | E_k} · Pr{π′ ←R Φ : E_k}
  ≤ Σ_{k∈N} Pr{π′ ←R Φ : π = π′} · Pr{π′ ←R Φ : E_k}
  = Pr{π′ ←R Φ : π = π′} · Σ_{k∈N} Pr{π′ ←R Φ : E_k}
  ≤ Pr{π′ ←R Φ : π = π′} = 1/d! ≤ 1/2^(d−1).

Therefore, the probability that A outputs the corresponding TKn is negligible in the security parameter d. □
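The final loosening of the bound, 1/d! ≤ 1/2^(d−1), is easy to sanity-check numerically:

```python
from math import factorial

# The proof bounds the guessing probability by 1/d!, then loosens it to
# 1/2^(d-1); check that d! >= 2^(d-1), i.e. the loosening is valid.
for d in range(1, 30):
    assert factorial(d) >= 2 ** (d - 1)

# For d = 11 (the value used in the example of Sect. 5.2.2) the success
# probability is already below 2^-24.
assert 1 / factorial(11) < 2 ** -24
```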

5.1.3 Collusion-Resistance

Unauthorized users and even other data owners who do not have an access right to a file of interest can collude to access the file. They must know the correct access token ⟨fid, α⟩ in order to obtain the file via the File Access operation. Hence they will try to compute ⟨fid, α⟩ from information they have, such as secret keys or tokens. Assuming that fid is chosen from a sufficiently large space and kept secret among the authorized users and data owners, the colluding adversaries cannot compute fid, since fid is independent of any information they possess.

They may also try to compute the corresponding tokens TK, TKn to exploit the side channel of client-side deduplication. However, since the permutation π of the CTn corresponding to the file was chosen from Φ at random and is independent of any information they have, it is infeasible for the colluding adversaries to obtain the file, by Theorem 4.

5.1.4 Resistance against Malicious Data Owner

We analyze the security of the proposed scheme against two kinds of attacks, launched by a malicious data owner who does not follow the protocol and behaves dishonestly.

Server-side deduplication disturbing attack: Let us consider a malicious data owner M who tries to disturb server-side deduplication so as to intentionally waste the available storage resources of the cloud server. During the New File Upload operation, M could mount such an attack by composing a wrong tuple ⟨fid, TK′_fid, TKn,fid, CT′_fid, CTn,fid⟩, where TK′_fid, CT′_fid are calculated using a randomly generated secret key α′ (≠ α).

This attack does not succeed against the proposed scheme, since the cloud server verifies the correctness of the received tuple before acceptance. By the definition of the bilinear map (Sect. 3.2.1) and the definition of our construction, it is infeasible for M to compose a wrong tuple such that TK′_fid, CT′_fid are calculated using a randomly generated α′ while also satisfying Eqs. (1)–(3) in Sect. 4.2 and allowing Test(TK′_fid, CT′_fid) to output 'Yes'.

Duplicate faking attack: We consider a malicious data owner M who tries to corrupt other users' data. During the New File Upload operation, M may send a correct tuple ⟨fid, TK_fid, TKn,fid, CT_fid, CTn,fid⟩ for File, but then upload a fake ⟨fid, C′⟩, where C′ is constructed from File′ (≠ File) or with a wrong secret key α′ (≠ α). We analyze security against such an attack in terms of two properties: (i) corruption detectability and (ii) data recoverability.


Fig. 3 File recoverability against fake uploads.

The proposed scheme ensures that the corruption of File is always detected by data owners or their authorized users who handle the same file. Let Dec(k, E) denote the decryption of a ciphertext E with a key k, T[n] the n-th element of a tuple T, and D the decoding function of the erasure code E. Data corruption can be detected as follows.

1. get C = ⟨{F1}FEK1, {FEK1}α, . . . , {Fl}FEKl, {FEKl}α⟩ via the File Access operation.
2. for each Ti = ⟨{Fi}FEKi, {FEKi}α⟩ in C, where 1 ≤ i ≤ l, compute vi = Dec(Dec(α, Ti[2]), Ti[1]).
3. compute w = D(v1||v2|| . . . ||vl), and z = KeyGen(w).
4. test z = α.

If z is not equal to the secret key α, then the file is corrupted.
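The detection steps can be sketched as follows, with stand-ins for the paper's primitives: KeyGen is modeled as SHA-256 of the content, the erasure decode D as the identity, and Dec as a toy XOR stream cipher. The structure of steps 1-4 is what this illustrates, not the actual construction:

```python
import hashlib, os

def xor_ks(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (SHA-256 keystream); enc == dec."""
    out = bytearray()
    for i in range(0, len(data), 32):
        ks = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(c ^ k for c, k in zip(data[i:i + 32], ks))
    return bytes(out)

def detect_corruption(alpha: bytes, C) -> bool:
    """Steps 2-4: decrypt each block, decode (identity here), re-derive
    z = KeyGen(w) and compare with alpha. True means 'corrupted'."""
    w = b"".join(xor_ks(xor_ks(alpha, enc_fek), enc_blk)
                 for enc_blk, enc_fek in C)        # v_1 || ... || v_l
    z = hashlib.sha256(w).digest()                  # KeyGen stand-in
    return z != alpha

# Honest upload: alpha = KeyGen(File), blocks encrypted under random FEKs.
file_bytes = b"a" * 64
alpha = hashlib.sha256(file_bytes).digest()
C = []
for i in range(0, len(file_bytes), 32):
    fek = os.urandom(32)
    C.append((xor_ks(fek, file_bytes[i:i + 32]), xor_ks(alpha, fek)))
assert detect_corruption(alpha, C) is False

# Fake upload: first block replaced by garbage -> detected.
C_fake = [(os.urandom(32), C[0][1])] + C[1:]
assert detect_corruption(alpha, C_fake) is True
```

Because α itself is derived from the file content, the check needs no extra integrity metadata: any tampering with the blocks or the wrapped FEKs changes w and hence z.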

With the help of data replication, the proposed scheme also ensures that, with high probability, a file can be recovered from corruption caused by fake uploads by M. To justify this claim, we conducted a simulation in which the New File Upload operation for the same file (block size b = 512, number of blocks l = 16,384) was performed 1,000 times, including fake uploads among them. The ratio of fake uploads varies from 0 to 0.99. At each New File Upload operation, we checked whether sufficient blocks on the storage remained uncorrupted for recovery.

Figure 3 shows the simulation result. RP refers to the replication factor of the data replication mechanism (i.e., the number of replicas), and RD to the redundancy factor of the erasure code. With RP = 10 and RD = 1.5, the probability of successfully recovering corrupted data is more than 95%.

5.2 Performance Analysis

5.2.1 Storage Overhead

When a data owner uploads a file to the cloud server, data deduplication always occurs either at client side or at server side if the same file exists in the cloud storage. That is, the proposed scheme always removes redundant copies of a file across multiple users, thus keeping the cloud storage optimized in terms of disk space utilization.

Additional storage overhead may be introduced by the search index. The size of an SI or AI entry, however, is negligible compared to the size of the corresponding file.

Employing data replication, as discussed in Sect. 4.4, may increase the required storage capacity. In the proposed scheme, however, data replication is run over deduplicated data, and is thus independent of eliminating the redundancy. The amount of additional storage does not exceed the replication factor of the data replication mechanism.

5.2.2 Network Transmission Overhead

The hybrid approach that prevents the online guessing attack may introduce some network transmission overhead due to unnecessary file uploads. We analyze how much network bandwidth is actually consumed by the proposed scheme.

For this, we compare the proposed scheme with a basic scheme, which is exactly the same as the proposed scheme except that client-side deduplication always occurs (i.e., the security parameter d = 1). Note that the basic scheme incurs no network transmission overhead at all. Thus, the difference Δ between the amount of network bandwidth consumed by the proposed scheme and that consumed by the basic scheme represents the network transmission overhead of the proposed scheme.

Several events in the New File Upload operation are defined as follows:

1. E: the file F exists in the cloud storage.
2. TP: the New File Upload operation transmits the actual copy of F in the proposed scheme.
3. TB: the New File Upload operation transmits the actual copy of F in the basic scheme.

From the security analysis in Sect. 5.1.2, we have Pr{TP | E} ≥ 1 − 1/d!. Without loss of generality, we assume Pr{TP | E} = 1 − 1/d! in this section. Note that Pr{TB | E} = 0. The probability that the event TP occurs is

Pr{TP} = Pr{TP | E}Pr{E} + Pr{TP | ¬E}Pr{¬E}
       = Pr{TP | E}Pr{E} + Pr{¬E}
       = Pr{TP | E}Pr{E} + 1 − Pr{E}
       = (1 − 1/d! − 1)Pr{E} + 1 = −Pr{E}/d! + 1.

Similarly, the probability that the event TB occurs is Pr{TB} = 1 − Pr{E}. The expected amounts of network traffic in uploading the file F for the proposed scheme and the basic scheme are s·Pr{TP} and s·Pr{TB}, respectively, where s is the size of F. Thus, the network transmission overhead of the proposed scheme can be represented by the difference Δ


Fig. 4 Tradeoff between security and the network transmission overhead over the security parameter d (the file size of FA and FB is 90 MB, Pr{EFA} = 0.001 and Pr{EFB} = 0.3).

Δ = | s·Pr{TP} − s·Pr{TB} | = s·Pr{E}(1 − 1/d!).    (4)

For each file in the cloud storage, the probability Pr{E} is distributed between 0 and 1 according to the file's properties, especially the type of its content. It is easily inferred that common files (e.g., software install files) usually have high Pr{E}, while private files that contain sensitive personal information and require high privacy have low Pr{E}. From Eq. (4), we observe that as Pr{E} goes to 0, so does Δ, while Δ approaches s(1 − 1/d!) as Pr{E} goes to 1. This implies that we can substantially reduce the overall network transmission overhead by decreasing the security parameter d only for common files, while keeping d high for private files, as described in Sect. 4.3.2.

Let us consider an example in which a data owner uploads two files FA and FB, representing a private file and a common file, respectively. Suppose that the size of both FA and FB is 90 MB and the probabilities that each file exists in the cloud storage are Pr{EFA} = 0.001 and Pr{EFB} = 0.3. Figure 4 shows the quantitative relation between the security parameter d and both security (i.e., the probability p = 1/d! that an adversary computes the correct TKn) and the network transmission overhead (ΔFA for FA and ΔFB for FB). Setting the parameter d to 11 for both FA and FB raises the overall network transmission overhead (ΔFA + ΔFB) to more than 26 MB. On the other hand, the overall overhead can be reduced to 92 KB by decreasing only the common file FB's parameter d to 1, while keeping the probability p for the private file FA lower than 2^-24.
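The figures quoted above follow directly from Eq. (4); assuming 1 MB = 1024 KB, they can be reproduced as follows:

```python
from math import factorial

def delta(size_mb, pr_e, d):
    """Eq. (4): network transmission overhead
    Delta = s * Pr{E} * (1 - 1/d!)."""
    return size_mb * pr_e * (1 - 1 / factorial(d))

s = 90.0                                  # MB, for both F_A and F_B
d_high = 11

# Both files use d = 11: overall overhead exceeds 26 MB.
overhead_both_high = delta(s, 0.001, d_high) + delta(s, 0.3, d_high)
assert overhead_both_high > 26

# Drop only the common file's d to 1; the private file keeps d = 11.
overhead_mixed = delta(s, 0.001, d_high) + delta(s, 0.3, 1)
assert abs(overhead_mixed * 1024 - 92.16) < 0.01   # about 92 KB

# The private file's guessing probability p = 1/d! stays below 2^-24.
assert 1 / factorial(d_high) < 2 ** -24
```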

5.2.3 Computation Overhead

Computational burden at the cloud server is mainly introduced by searching over the storage. We analyze the computation complexity of the following operations that perform

Table 3 Computation complexity of the proposed scheme considering two strategies: SI and SI with AI (SI+AI).

Operation        SI          SI+AI
New File Upload  O(Ns)       O(log Na + Ns,a)
File Update      O(Ns)       O(log Na + Ns,a)
File Access      O(log Ns)   O(log Ns)

the searching: New File Upload, File Update, File Access. In the New File Upload and File Update operations, the cloud server searches for the matching file in SI via the two search keys CT and CTn. For each SI entry, the cloud server compares it with the given tokens by computing the Test and Testn algorithms, which require two pairings and one multiplication over GT. Let Ns denote the total number of SI entries, Na the total number of AI entries, and Ns,a the number of SI entries that correspond to the matching AI entry. The computation complexities of New File Upload and File Update are shown in Table 3. With the help of AI, the proposed scheme (SI+AI in Table 3) achieves sub-linear complexity in these operations.

In the File Access operation, SI is retrieved via the search key fid. Since SI is sorted according to fid, the searching cost is lower than that of the above operations, as shown in Table 3. We also note that no cryptographic operations are needed for comparisons over fid.

6. Related Work

Existing approaches related to our proposed scheme include 1) secure data outsourcing and 2) data deduplication using convergent encryption.

Most of them address secure data outsourcing in terms of cryptographic access control as well as searching over encrypted data. To achieve fine-grained access control and efficient revocation over outsourced data, various techniques have been proposed based on attribute-based encryption (ABE) [28], [29], a cryptographic primitive that ensures access control over encrypted data. S. Yu et al. addressed the issue of attribute revocation and proposed two solutions, [5] and [4], in CP (Ciphertext Policy)-ABE and KP (Key Policy)-ABE, respectively. J. Hur et al. [6] combine CP-ABE with a broadcast encryption technique [25] to enable more fine-grained access control with efficient and scalable revocation capability. Z. Zhou et al. [7] and M. Green et al. [8] considered the ABE encryption and decryption overhead at the client side, which grows with the complexity of the access formula. They proposed solutions that outsource the computational burden to the cloud server while preserving the privacy of outsourced data.

Besides access control, it is also important to solve the problem of searching over encrypted data in secure data outsourcing. D. Boneh et al. [9] first described the notion of predicate encryption and gave PEKS, a concrete implementation of equality predicate encryption in the public-key setting. Their solution is suitable for a secure e-mail server, allowing the untrusted server to retrieve encrypted mails. Several approaches addressed efficient searching over encrypted data through tree-based indexing. M. Bellare et al. [11] and S. Sedghi et al. [12] proposed deterministic encryption in which the searching cost is O(log n), in the public-key setting and the symmetric-key setting, respectively. Due to their deterministic property, these solutions [11], [12] lose some security, as noted in [10]. R. Curtmola et al. [10] and C. Dong et al. [13] focused on enhancing the security notions of previous solutions.

The notion of convergent encryption was first presented in [16] as a rough idea. Convergent encryption uses a hash value computed from a file as the encryption key of the file itself. Any client encrypting a given data chunk will use the same key to do so; thus identical plaintext values encrypt to identical ciphertext values, regardless of who encrypts them. This allows the storage server to perform data deduplication on the encrypted data. Several solutions after [16] fundamentally follow this strategy. M. Storer et al. [15] further extended [16] and implemented a data deduplication framework on encrypted data storage. L. Marques et al. [17] and P. Anderson et al. [18] utilized convergent encryption for different applications such as mobile devices. The security of convergent encryption cannot be strongly guaranteed due to its inherent deterministic property, and it had not been rigorously analyzed in the previous approaches.
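The convergent-encryption idea can be sketched in a few lines. This is a deliberately simplified, insecure toy: the key is the SHA-256 digest of the file, and the "cipher" is a SHA-256-based XOR keystream standing in for a real deterministic block cipher.

```python
import hashlib

def _keystream(key: bytes, n: int) -> bytes:
    """Toy CTR-style keystream derived from the key (not a real cipher)."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def convergent_encrypt(data: bytes):
    """Derive the key from the plaintext itself, then encrypt
    deterministically, so equal files yield equal ciphertexts."""
    key = hashlib.sha256(data).digest()      # convergent key
    ct = bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))
    return key, ct

def convergent_decrypt(key: bytes, ct: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, len(ct))))

# Two independent clients encrypting the same file get the same
# ciphertext, so the server can deduplicate without seeing plaintext.
k1, c1 = convergent_encrypt(b"same file contents")
k2, c2 = convergent_encrypt(b"same file contents")
assert (k1, c1) == (k2, c2)
assert convergent_decrypt(k1, c1) == b"same file contents"
```

The determinism that enables deduplication is exactly the property that weakens security: anyone who can guess the plaintext can reproduce both the key and the ciphertext, which is the root of the concerns noted above.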

7. Conclusion

This paper aims at achieving both cost efficiency and data security in cloud computing. One challenge in this context is to build a data deduplication system that offers cost savings in terms of disk space and network bandwidth utilization, while also providing data security and privacy against the untrusted cloud server and unauthorized users. In this paper, we proposed an efficient and secure data deduplication scheme to resolve this challenging issue. In order to achieve both efficiency and security, we constructed two equality predicate encryption schemes in the symmetric-key setting, on which the proposed data deduplication scheme is built. Our rigorous security proofs show that the proposed scheme is provably secure under cryptographic security models.

Acknowledgements

This research was funded by the MSIP (Ministry of Science, ICT & Future Planning), Korea, in the ICT R&D Program 2013.

References

[1] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al., "Above the clouds: A Berkeley view of cloud computing," Technical Report UCB/EECS-2009-28, 2009.

[2] D. Russell, "Data deduplication will be even bigger in 2010," Gartner, Feb. 2010.

[3] D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Side channels in cloud services: Deduplication in cloud storage," IEEE Security and Privacy Magazine, vol.8, pp.40–47, Nov. 2010.

[4] S. Yu, C. Wang, K. Ren, and W. Lou, "Achieving secure, scalable, and fine-grained data access control in cloud computing," Proc. IEEE Conf. Information Comm. (INFOCOM'10), pp.534–542, 2010.

[5] S. Yu, C. Wang, K. Ren, and W. Lou, "Attribute based data sharing with attribute revocation," Proc. ACM Symp. Information, Computer and Comm. Security (ASIACCS'10), pp.261–270, 2010.

[6] J. Hur and D.K. Noh, "Attribute-based access control with efficient revocation in data outsourcing systems," IEEE Trans. Parallel Distrib. Syst., vol.22, pp.1214–1221, July 2011.

[7] Z. Zhou and D. Huang, "Efficient and secure data storage operations for mobile cloud computing," Cryptology ePrint Archive, Report 2011/185, 2011. http://eprint.iacr.org/.

[8] M. Green, S. Hohenberger, and B. Waters, "Outsourcing the decryption of ABE ciphertexts," Proc. USENIX Conf. Security (SEC'11), 2011.

[9] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," Proc. Eurocrypt'04, pp.506–522, 2004.

[10] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: Improved definitions and efficient constructions," Proc. ACM Conf. Computer and Comm. Security (CCS'06), pp.79–88, 2006.

[11] M. Bellare, A. Boldyreva, and A. O'Neill, "Deterministic and efficiently searchable encryption," Proc. CRYPTO'07, pp.535–552, 2007.

[12] S. Sedghi, P. van Liesdonk, J.M. Doumen, P.H. Hartel, and W. Jonker, "Adaptively secure computationally efficient searchable symmetric encryption," Technical Report TR-CTIT-09-13, Centre for Telematics and Information Technology, University of Twente, April 2009.

[13] C. Dong, G. Russello, and N. Dulay, "Shared and searchable encrypted data for untrusted servers," Journal of Comput. Secur., vol.19, pp.367–397, Aug. 2011.

[14] M. Mulazzani, S. Schrittwieser, M. Leithner, M. Huber, and E. Weippl, "Dark clouds on the horizon: Using cloud storage as attack vector and online slack space," Proc. USENIX Conf. Security (SEC'11), 2011.

[15] M.W. Storer, K. Greenan, D.D. Long, and E.L. Miller, "Secure data deduplication," Proc. ACM Int'l Workshop on Storage Security and Survivability (StorageSS'08), pp.1–10, 2008.

[16] J.R. Douceur, A. Adya, W.J. Bolosky, D. Simon, and M. Theimer, "Reclaiming space from duplicate files in a serverless distributed file system," Proc. Int'l Conf. Distributed Computing Systems (ICDCS'02), 2002.

[17] L. Marques and C.J. Costa, "Secure deduplication on mobile devices," Proc. Workshop on Open Source and Design of Communication (OSDOC'11), pp.19–26, 2011.

[18] P. Anderson and L. Zhang, "Fast and secure laptop backups with encrypted de-duplication," Proc. Int'l Conf. Large Installation System Administration (LISA'10), pp.1–8, 2010.

[19] S. Goldwasser and S. Micali, "Probabilistic encryption," Journal of Computer and System Sciences, vol.28, no.2, pp.270–299, 1984.

[20] D. Boneh, A. Sahai, and B. Waters, "Functional encryption: Definitions and challenges," Proc. Conf. Theory of Cryptography (TCC'11), pp.253–273, 2011.

[21] E. Shen, E. Shi, and B. Waters, "Predicate privacy in encryption systems," Proc. Conf. Theory of Cryptography (TCC'09), pp.457–473, 2009.

[22] D. Boneh and M.K. Franklin, "Identity-based encryption from the Weil pairing," Proc. CRYPTO'01, pp.213–229, 2001.

[23] A. Joux, "The Weil and Tate pairings as building blocks for public key cryptosystems," Proc. Int'l Symp. Algorithmic Number Theory (ANTS'05), pp.20–32, 2002.

[24] C. Blundo, V. Iovino, and G. Persiano, "Private-key hidden vector encryption with key confidentiality," Proc. Cryptography and Network Security (CANS'09), pp.259–277, 2009.

[25] D. Naor, M. Naor, and J.B. Lotspiech, "Revocation and tracing schemes for stateless receivers," Proc. CRYPTO'01, pp.41–62, 2001.

[26] J. Daemen and V. Rijmen, The Design of Rijndael: AES - The Advanced Encryption Standard, 2002.

[27] M. Bellare and P. Rogaway, "Random oracles are practical: A paradigm for designing efficient protocols," Proc. ACM Conf. Computer and Comm. Security (CCS'93), pp.62–73, 1993.

[28] V. Goyal, O. Pandey, A. Sahai, and B. Waters, "Attribute-based encryption for fine-grained access control of encrypted data," Proc. ACM Conf. Computer and Comm. Security (CCS'06), pp.89–98, 2006.

[29] J. Bethencourt, A. Sahai, and B. Waters, "Ciphertext-policy attribute-based encryption," Proc. IEEE Symp. Security and Privacy (SP'07), pp.321–334, 2007.

Appendix A: Proof of Theorem 1

Proof. Suppose A is an adversary algorithm that tries to break SEPE. A may proceed in two ways: (1) guessing the binary string aux to extract the secret key, or (2) playing the security game described in Sect. 3.1.2.

Let us denote by p(·) a polynomial function. Since the min-entropy of the distribution of aux is H∞^aux = p(λ), A's probability of guessing the correct aux is 2^−p(λ). Thus, extracting the correct secret key is infeasible.

The only remaining way to break SEPE is through the security game. Suppose that A has advantage ε in breaking SEPE by running the security game. We show that an algorithm B that solves the BDH problem can be constructed using algorithm A. Suppose A makes at most qC ciphertext queries, qT token queries, and at most qH3 hash function queries to H3. Then, the advantage of algorithm B is at least ε′ = ε/(e·qH3(qT + qC)), where e is the base of the natural logarithm. If the BDH assumption holds in G, then ε′ is negligible in λ and consequently ε must be negligible in λ. Let g be a generator of G. Algorithm B is given g, u1 = g^α, u2 = g^β, u3 = g^γ ∈ G. B simulates the challenger and interacts with algorithm A to output v = e(g, g)^αβγ ∈ GT as follows:

Setup: Algorithm B gives A the global parameters 〈p, g, G, GT〉.

H1-queries: Algorithm A can make queries to the random oracle H1 at any time. B maintains the H1-list, a list of tuples 〈Mi, hi, ai, ki, ri, ci〉, to respond to these queries. The list is initially empty. Here i denotes the sequence number of the query (1 ≤ i ≤ qC + qT). For each query Mi ∈ {0, 1}*, algorithm B responds as follows:

1. If the query Mi is already in the H1-list in a tuple 〈Mi, hi, ai, ki, ri, ci〉, then B responds with H1(Mi) = hi ∈ G.

2. Otherwise, B picks a random ci ∈ {0, 1} so that Pr{ci = 0} = 1/(qC + qT + 1), and also picks ai, ki, ri ∈ Z*p at random. If ci = 0, then B computes hi = u2·g^ai. If ci = 1, then B computes hi = g^ai.

3. Algorithm B appends the tuple 〈Mi, hi, ai, ki, ri, ci〉 to the H1-list and sends H1(Mi) = hi to A.

H3-queries: To respond to these queries, B maintains a list of tuples 〈ti, Vi〉 called the H3-list. The list is initially empty. If the query ti already appears on the H3-list, then B responds with H3(ti) = Vi. Otherwise, B responds with H3(ti) = V by generating V ∈ GT at random and adds the new tuple 〈ti, V〉 to the H3-list.

Ciphertext queries: When A issues a query for the ciphertext of Mi, B responds as follows:

1. B runs the above algorithm for H1-queries to obtain hi = H1(Mi). If ci = 0 then B outputs ⊥ and halts. If ci = 1 then hi = g^ai ∈ G. B constructs hi^ri, g^ri, and E = e(hi, u1)^ri.

2. B also gets H3(E) for the query E by running the above algorithm and gives the correct ciphertext CT = 〈hi^ri, g^ri, H3(E)〉 to A.

Token queries: When A issues a query for the token corresponding to Mi, B responds as follows:

1. Similar to the ciphertext query, B gets hi = H1(Mi) for Mi by running the above algorithm.

2. If ci = 0 then B outputs ⊥ and halts. If ci = 1 then hi = g^ai ∈ G. B gives the token TK = 〈hi^ki·u1^ai, g^ki〉 to algorithm A. Observe that TK is correct because hi^ki·u1^ai = hi^ki·hi^α = hi^(ki+α).

Challenge: Algorithm A produces a pair of challenge plaintexts M0 and M1. B generates the challenge as follows:

1. Algorithm B runs the above algorithm for responding to H1-queries to obtain h0, h1 ∈ G such that H1(M0) = h0 and H1(M1) = h1. If both c0 = 1 for h0 and c1 = 1 for h1, then B outputs ⊥ and halts.

2. Otherwise, at least one of c0, c1 is equal to 0. Algorithm B picks b ∈ {0, 1} at random such that cb = 0.

3. Algorithm B responds with the challenge ciphertext CT = 〈I, u3, J〉 for a random I ∈ G and J ∈ GT. Observe that the challenge implicitly defines I and J as follows:

I = H1(Mb)^γ,
J = H3(e(H1(Mb), u1^γ)) = H3(e(u2·g^ab, g^αγ)) = H3(e(g, g)^αγ(β+ab)),

where ab denotes the value ai recorded in the H1-list for Mb. Hence, the challenge is a valid ciphertext for Mb.

More ciphertext and token queries: A continues to ask for the ciphertext or the token of any Mi of his choice except M0 and M1. Algorithm B responds to these queries as before.

Output: Finally, A outputs its guess b′ ∈ {0, 1}. Then, B picks a random tuple 〈t, V〉 from the H3-list and outputs t/e(u1, u3)^ab as its guess for e(g, g)^αβγ.

This completes the description of algorithm B. Now we analyze the probability that B does not output ⊥ during the simulation. We define two events: E1 denotes the event that B does not output ⊥ during token or ciphertext queries, and E2 denotes the event that B does not output ⊥ during the challenge phase. The probability that a ciphertext or token query causes B to output ⊥ is Pr{ci = 0} = 1/(qC + qT + 1). Since A makes at most qC + qT queries in total, the probability that event E1 occurs is at least (1 − 1/(qC + qT + 1))^(qC+qT) ≥ 1/e. That is, Pr{E1} ≥ 1/e. Also, B will output ⊥ in the challenge phase if A produces M0, M1 such that c0 = c1 = 1. Since Pr{ci = 0} = 1/(qC + qT + 1) for i = 0, 1, and the two values are independent, Pr{c0 = c1 = 1} = (1 − 1/(qC + qT + 1))^2 ≤ 1 − 1/(qC + qT). Therefore, Pr{E2} = 1 − Pr{c0 = c1 = 1} ≥ 1/(qC + qT). The two events E1 and E2 are independent because A can never issue a ciphertext or token query for the challenge plaintexts M0 and M1. Therefore, the probability that B does not output ⊥ during the simulation is Pr{E1 ∧ E2} = Pr{E1}·Pr{E2} ≥ 1/(e(qC + qT)).

Next, assuming that B does not output ⊥, we argue that during the simulation A issues a query for H3(e(H1(Mb), u1^γ)) with probability at least ε, as shown in the proof of Theorem 3.1 of [9], where Mb is the challenge plaintext. That is, the value e(H1(Mb), u1^γ) = e(g^(β+ab), g)^αγ will appear in some tuple on the H3-list with probability at least ε. Algorithm B will choose the correct tuple with probability at least 1/qH3 and will therefore produce the correct answer with probability at least ε/qH3, assuming that it does not output ⊥.

Consequently, since B does not output ⊥ with probability at least 1/(e(qC + qT)), the probability that algorithm B succeeds overall is at least ε′ = ε/(e·qH3(qC + qT)). □
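The two probability estimates above combine into the claimed bound:

```latex
\Pr[E_1 \wedge E_2] \;=\; \Pr[E_1]\,\Pr[E_2]
  \;\ge\; \frac{1}{e}\cdot\frac{1}{q_C+q_T},
\qquad
\varepsilon' \;\ge\;
  \underbrace{\frac{1}{e\,(q_C+q_T)}}_{\text{no abort}}
  \cdot
  \underbrace{\frac{\varepsilon}{q_{H_3}}}_{\text{correct tuple}}
  \;=\; \frac{\varepsilon}{e\,q_{H_3}(q_C+q_T)}.
```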

Appendix B: Proof of Theorem 2

Proof. The difference between SEPE and SEPEn is that in the SEPEn scheme a security parameter d, a random permutation π, and hash functions H2^i are additionally defined. Without loss of generality, we can assume that the security parameter d is a constant and the random permutation π is fixed. Then the difference between SEPE and SEPEn lies only in the computation of h. In the SEPEn scheme, h is calculated as h = g^s, where s = u⃗·v⃗π = Σ_{i∈D} i·H2^π(i)(M) and H2^i : {0, 1}* → Z*p are d hash functions, while in the SEPE scheme h is computed as h = H1(M), where H1 is a hash function H1 : {0, 1}* → G. In the random oracle model, all hash functions are treated as random oracles and the distribution of hashed values is uniform. In the SEPEn scheme, evaluations of the random oracles H2^i are independent of each other. Hence, the distribution of s is uniform in Z*p, which implies that the distribution of h = g^s is also uniform in G. From the view of an adversary, the two schemes are statistically indistinguishable because in the SEPE scheme the value of the hash function H1 is also distributed uniformly in G. Therefore, assuming the SEPE scheme is semantically secure against an adaptively chosen plaintext attack, the SEPEn scheme is also semantically secure against the same attack. Then, by Theorem 1, the SEPEn scheme is secure. □
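The two ways of deriving the group element h, written out side by side, are:

```latex
% SEPE: h is a direct hash into the group
h \;=\; H_1(M), \qquad H_1 : \{0,1\}^* \to \mathbb{G}.
% SEPE_n: h is an exponentiation by an inner product of d hash values
h \;=\; g^{s}, \qquad
s \;=\; \vec{u}\cdot\vec{v}_{\pi}
  \;=\; \sum_{i \in D} i\, H_2^{\pi(i)}(M), \qquad
H_2^{i} : \{0,1\}^* \to \mathbb{Z}_p^*.
```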

Youngjoo Shin received the B.S. degree in Computer Science and Engineering from Korea University, Korea, in 2006 and the M.S. degree in Computer Science from Korea Advanced Institute of Science and Technology, Korea, in 2008. He is currently a Ph.D. candidate at the Computer Science Department in Korea Advanced Institute of Science and Technology, Korea, and is also a researcher at The Attached Institute of ETRI, Korea. His research interests include cryptography, network security and cloud computing security.

Kwangjo Kim received the B.S. and M.S. degrees of Electronic Engineering in Yonsei University, Korea, and the Ph.D. of Div. of Electrical and Computer Engineering in Yokohama National University, Japan. Currently he is a professor at the Computer Science Department in Korea Advanced Institute of Science and Technology, Korea. He served as the president of the Korean Institute on Information Security and Cryptography (KIISC) in 2009 and as a Board Member of IACR from 2000 to 2004. His research interests include the theory of cryptography and information security and their practice.