
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Privacy-Preserving Collaborative Model Learning: The Case of Word Vector Training

Qian Wang, Member, IEEE, Minxin Du, Xiuying Chen, Yanjiao Chen, Pan Zhou, Member, IEEE, Xiaofeng Chen and Xinyi Huang

Abstract—Nowadays, machine learning is becoming a new paradigm for mining hidden knowledge in big data. The collection and manipulation of big data not only create considerable value, but also raise serious privacy concerns. To protect the large amount of potentially sensitive data, a straightforward approach is to encrypt the data with specialized cryptographic tools. However, it is challenging to utilize or operate on encrypted data, and in particular to run machine learning algorithms over it. In this paper, we investigate the problem of training high-quality word vectors over large-scale encrypted data (from distributed data owners) with privacy-preserving collaborative neural network learning algorithms. We leverage and also design a suite of arithmetic primitives (e.g., multiplication, fixed-point representation, and sigmoid function computation) on encrypted data, which serve as components of our construction. We theoretically analyze the security and efficiency of our proposed construction, and conduct extensive experiments on representative real-world datasets to verify its practicality and effectiveness.

Index Terms—Neural network, Collaborative learning, Homomorphic encryption, Privacy Preservation


1 INTRODUCTION

Over the past two decades, machine learning has become one of the mainstays of information technology.

With the advancement in neural-network-based learning methods, major breakthroughs have been made for classic artificial intelligence tasks, such as speech/image/text recognition, automatic translation, and web page ranking. This is enabled, in part, by the availability of a huge volume of high-quality data used for training neural networks. For example, Google DeepMind developed a program named AlphaGo, which trains on about 30 million positions from expert games to play the board game Go, and has beaten professional human players without handicaps [1].

Compared to training with only a local dataset, collaborative learning can improve the accuracy of the resulting models by incorporating more representative data. Meanwhile, with the increased popularity of public computing infrastructures (e.g., cloud platforms), it has become more convenient than ever for distributed users (across the Internet) to perform collaborative learning through the shared infrastructure.

While the potential benefits of (collaborative) machine learning can be tremendous, the large-scale training data

• Q. Wang, M. Du, and X. Chen are with the Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China, and also with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China. (E-mail: {qianwang,duminxin,xychen}@whu.edu.cn)

• Y. Chen is with the School of Computer Science, Wuhan University, Wuhan 430072, China. (E-mail: [email protected])

• P. Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China. (E-mail: [email protected])

• X. Chen is with the State Key Laboratory of Integrated Service Networks (ISN), Xidian University, Xi'an 710071, China. (E-mail: [email protected])

• X. Huang is with the School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China. (E-mail: [email protected])

may pose substantial privacy risks. In other words, centralized collection of data from different participants may raise great concerns about data confidentiality and privacy. For example, in certain application scenarios such as healthcare, individuals/patients may not be willing to reveal their sensitive data (e.g., protected health information) to anybody else, and the disclosure of such proprietary data is forbidden by laws or regulations such as HIPAA¹. To deal with such privacy issues, a straightforward approach is to encrypt sensitive data before sharing it. However, data encryption hinders data utilization and computation, making it difficult to perform (collaborative) machine learning as efficiently as in the plaintext domain.

1.1 Related Work

Recently, great efforts have been devoted to constructing privacy-preserving machine learning algorithms with specialized cryptographic tools (e.g., homomorphic encryption [2]). Nikolaenko et al. [3] presented a hybrid approach to privacy-preserving ridge regression (a fundamental mechanism adopted in various learning algorithms) that uses both additive homomorphic encryption and Yao garbled circuits [4]. In another work [5], they designed a similar hybrid solution to perform matrix factorization (a widely used collaborative filtering technique) over encrypted data. We note that both [3] and [5] resort to the setting of two non-colluding servers, where one acts as the circuit generator and the other acts as the circuit evaluator. Subsequently, [6] proposed a privacy-preserving matrix factorization protocol with fully homomorphic encryption, which outperforms the protocol proposed in [5] in terms of efficiency. Bost et al. [7] constructed three common machine learning classification protocols, namely hyperplane decision, Naïve Bayes, and

1. http://www.hhs.gov/hipaa/index.html


Fig. 1: System model (users u_1, ..., u_n with private data files f_1, ..., f_n, the remote server, and the secure interactive protocol)

decision trees, leveraging homomorphic encryption to satisfy privacy constraints. Dowlin et al. [8] applied a somewhat homomorphic encryption scheme in the inference stage, and presented a method to convert learned neural networks to CryptoNets.

Another line of work has focused on developing secure algorithms with differential privacy [9]. The work in [10] dealt with the problem of differentially private approximations to principal components analysis (PCA), and the work in [11] presented the functional mechanism, a general differentially private framework designed for a large class of optimization-based analyses (e.g., linear regression and logistic regression). Shokri et al. [12] applied differential privacy to neural network parameter updates using the sparse vector technique, thus mitigating the privacy loss in collaborative deep learning. Later, Abadi et al. [13] proposed several components, including a differentially private stochastic gradient descent algorithm, the moments accountant, and hyperparameter tuning, for the private training of neural networks.

1.2 Key Contributions

In this work, we concentrate on machine learning techniques applied in the domain of natural language processing (NLP). Words are treated as atomic units in many NLP applications and even more complex tasks, and representations of words in a vector space can help learning algorithms achieve better performance. Moreover, the word representations computed using neural networks explicitly encode many linguistic regularities and patterns (e.g., syntactic and semantic relationships).

In light of these observations, we, for the first time, investigate the problem of privacy-preserving collaborative learning of distributed word vectors over large-scale encrypted data. To this end, our construction addresses three major challenges induced by data encryption. First, to overcome the fact that the plaintext space of the cryptosystem adopted in our construction consists only of integers, we introduce a fixed-point data type to deal with the real numbers involved in the computations. Second, to facilitate the efficient evaluation of non-linear operations (e.g., multiplication and exponentiation) in the encrypted form, we leverage a secure multiplication protocol

combined with a packing technique, deployed between two parties. Third, to enable the private computation of the activation function (a transcendental function) used in neural networks, we approximate it with a customized series expansion, which incurs only a small loss in accuracy. Our main contributions are summarized as follows.

• To the best of our knowledge, this paper is the first to provide privacy preservation for neural network learning algorithms that train distributed word vectors (which are of great significance to numerous NLP applications) over large-scale encrypted data from multiple participants.

• We first leverage a couple of arithmetic primitives (e.g., multiplication and fixed-point representation) on encrypted data, and then design a new technique that enables the secure computation of the activation function. These ciphertext arithmetics serve as components of our tailored construction.

• We present a thorough analysis of the privacy and efficiency of our proposed construction. To further demonstrate its practicality and effectiveness, we conduct experimental evaluations on representative real-world datasets and compare the results with those obtained in the plaintext domain.

2 PROBLEM STATEMENT

2.1 System Model

In this paper, we consider the problem of computing continuous vector representations of words over encrypted data files collected by a central remote server.

At a high level, as illustrated in Fig. 1, we target a system composed of three major parties: multiple participating users (i.e., data owners), a central remote server S, and a crypto service provider CSP. More specifically, there are n users in total, denoted as u_i (i = 1, ..., n), each of which owns a private data file f_i and wants to perform collaborative neural network learning with all other participating users. In other words, they contribute their private data files in encrypted form to the central remote server. After receiving the encrypted data, the remote server S performs training over the contributed data and produces high-quality word vectors along with a model, which can later be used for various natural language processing (NLP) tasks. In our system design, the crypto service provider CSP is responsible for initializing the system, i.e., generating and issuing encryption/decryption keys for all other parties. At certain points, the remote server is also required to run secure interactive protocols during the training process.

In the above setting, privacy holds as long as the remote server S and the crypto service provider CSP do not collude. It is worth noting that this architecture of two non-colluding entities has been commonly used in existing works [3], [5], [14].

2.2 Threat Model

Our goal is to ensure that the remote server S and the crypto service provider CSP cannot learn anything about the users' data beyond what is revealed by the results of the learning


algorithm. As discussed above, the remote server S and the CSP are assumed to be non-colluding; namely, they belong to two independent organizations, such as Amazon EC2 and a government agency, so collusion between them is highly unlikely as it would damage their reputations. We further assume that the remote server S and the CSP are both honest-but-curious entities [7], [15], meaning that they run the protocol exactly as specified without any deviations, but try to learn extra information from their views of the protocol.

3 BACKGROUND AND PRELIMINARIES

3.1 Vector Representations of Words Learned by Neural Networks

Representation of words as continuous vectors has a long history [16]. In recent years, it has been shown that word vectors can be used to significantly improve and simplify many natural language processing (NLP) applications [17], [18]. Many different types of models have been proposed for estimating word vectors, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). However, LSA preserves linear regularities less well [19], while LDA incurs prohibitively high computation overhead on large datasets.

Nowadays, with the progress of machine learning techniques, it has become possible to learn high-quality word vectors from massive data by leveraging neural networks. A popular framework for estimating a neural network language model (NNLM) was first introduced in [20], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to jointly learn the word vector representation and a statistical language model.

In this paper, we focus on two now-classic model architectures proposed by Mikolov et al. [21], [22]. A brief introduction is given as follows.

3.1.1 Continuous Bag-of-Words Model (CBOW)

The continuous bag-of-words model, denoted as CBOW, is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words; thus, all words are projected into the same position (the corresponding vectors are averaged). Note that the order of words in the history or from the future does not influence the projection, and the training criterion of CBOW is to correctly classify the current (middle) word.

3.1.2 Continuous Skip-gram Model

The continuous skip-gram model is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize the classification of a word based on another word in the same sentence. More precisely, it uses each current word as an input to a log-linear classifier with a continuous projection layer, and predicts words within a certain range before and after the current word.

The architectures of the two models are shown in Fig. 2. Interested readers can refer to [21], [22] for more details.

Fig. 2: The CBOW and Skip-gram architectures (input, projection, and output layers; context words w_{i−2}, w_{i−1}, w_{i+1}, w_{i+2} and the current word w_i)

3.2 Cryptographic Tools

3.2.1 Paillier Cryptosystem

Homomorphic encryption allows certain computations to be carried out on ciphertexts, generating an encrypted result that, once decrypted, matches the result of the operations performed on the plaintext. Although homomorphic encryption for arbitrary operations is prohibitively slow, homomorphic encryption supporting only addition is efficient. In particular, we adopt the additively homomorphic Paillier cryptosystem [23] in our construction.

In the Paillier cryptosystem, the public (encryption) key is pk_p = (n = pq, g), where g ∈ Z*_{n^2}, and p and q are two large prime numbers (of equivalent length) chosen randomly and independently. The private (decryption) key is sk_p = (φ(n), φ(n)^{-1} mod n). Given a message a, we write the encryption of a as [[a]]_pk, or simply [[a]], where pk is the public key. The encryption of a message x ∈ Z_n is [[x]] = g^x · r^n mod n^2, for some random r ∈ Z*_n. The decryption of the ciphertext is x = L([[x]]^{φ(n)} mod n^2) · φ(n)^{-1} mod n, where L(u) = (u − 1)/n. The homomorphic property of the Paillier cryptosystem is given by [[x_1]] · [[x_2]] = (g^{x_1} · r_1^n) · (g^{x_2} · r_2^n) = g^{x_1 + x_2} · (r_1 r_2)^n mod n^2 = [[x_1 + x_2]].
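To make the above operations concrete, the following Java sketch implements textbook Paillier key generation, encryption, decryption, and the homomorphic addition of two ciphertexts. It is illustrative only: the choice g = n + 1, the randomness handling, and the parameter sizes are simplifications rather than a production implementation.

import java.math.BigInteger;
import java.security.SecureRandom;

// Textbook Paillier: key generation, encryption, decryption and homomorphic addition.
// Illustrative only; g = n + 1 and phi(n)-based decryption follow the description above.
public class Paillier {
    static final SecureRandom RNG = new SecureRandom();
    final BigInteger n, n2, g, phi, mu;   // pk = (n, g), sk = (phi(n), phi(n)^{-1} mod n)

    Paillier(int modulusBits) {
        BigInteger p = BigInteger.probablePrime(modulusBits / 2, RNG);
        BigInteger q = BigInteger.probablePrime(modulusBits / 2, RNG);
        n = p.multiply(q);
        n2 = n.multiply(n);
        g = n.add(BigInteger.ONE);                                   // a common valid choice of g
        phi = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = phi.modInverse(n);
    }

    BigInteger encrypt(BigInteger m) {                               // [[m]] = g^m * r^n mod n^2
        BigInteger r;
        do { r = new BigInteger(n.bitLength() - 1, RNG); }
        while (r.signum() == 0 || !r.gcd(n).equals(BigInteger.ONE)); // r must lie in Z*_n
        return g.modPow(m, n2).multiply(r.modPow(n, n2)).mod(n2);
    }

    BigInteger decrypt(BigInteger c) {                               // m = L(c^phi mod n^2) * mu mod n
        BigInteger u = c.modPow(phi, n2);
        return u.subtract(BigInteger.ONE).divide(n).multiply(mu).mod(n);
    }

    BigInteger addCipher(BigInteger c1, BigInteger c2) {             // [[x1 + x2]] = [[x1]] * [[x2]] mod n^2
        return c1.multiply(c2).mod(n2);
    }

    public static void main(String[] args) {
        Paillier he = new Paillier(1024);
        BigInteger sum = he.addCipher(he.encrypt(BigInteger.valueOf(7)),
                                      he.encrypt(BigInteger.valueOf(35)));
        System.out.println(he.decrypt(sum));                         // prints 42
    }
}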

3.2.2 Pseudo-Random Function (PRF)

Let F : {0,1}^λ × {0,1}* → {0,1}* be a pseudo-random function, i.e., a polynomial-time computable function that cannot be distinguished from a random function by any probabilistic polynomial-time adversary. Readers can refer to [24] for the formal definition and security proof.

3.3 Notations

Here, we outline some notations used throughout the paper. The set of binary strings of length n is denoted as {0,1}^n, and the set of all finite binary strings is denoted as {0,1}*. [n] denotes the set of positive integers less than or equal to n, i.e., [n] = {1, 2, ..., n}. We use x ←$ X to represent a uniform and random sampling of an element x from a set X. Given a sequence of elements (or a vector) v, we refer to its i-th element as v[i] or v_i and to its total number of elements as #v. Given two binary strings s_1, s_2, their concatenation is written as s_1 || s_2. If S is a set, then |S| refers to its cardinality. The whole corpus can be viewed as a collection of n data files C = (f_1, ..., f_n), where file f_i is a sequence of words (w_1, ..., w_{#f_i}) with each word w_j ∈ {0,1}*, and #f_i denotes the total number of words in


f_i. A word universe (which contains all distinct words) extracted from the corpus C is denoted by W = (w_1, ..., w_m). The continuous vector representation of a word w is denoted as a d-dimensional vector v(w), where each entry v(w)_i ∈ R. Besides, we use 0 to denote the all-zero vector of length d.

4 ARITHMETIC PRIMITIVES ON ENCRYPTED DATA

Before delving into the details of our construction, we first introduce several arithmetic primitives on encrypted data (under the Paillier cryptosystem) in this section.

4.1 Representing Real Numbers

It is worth noting that the message space of the Paillier cryptosystem consists of non-negative integers (i.e., Z_n), but the word vectors involved in the computations during the training process are real-valued. To overcome this data type mismatch, we follow the existing works [3], [6], [25] and leverage a fixed-point representation to deal with real numbers.

Given a real number a, its fixed-point representation is given by:

ā = ⌊a · 2^p⌋,

where 2^p is a fixed resolution and ⌊·⌋ is the round-down (floor) function. In our design, the number of bits p for the fractional part can be chosen as a system parameter. This fixed-point data type generates rounding errors that shrink as p grows; namely, a higher accuracy of the system can be achieved by selecting a larger p.

To represent negative numbers, we use the standard two's complement representation. For example, if ā is a negative number and α bits are used for the fixed-point data type, then the corresponding two's complement representation of ā is ā + 2^α. As a consequence, the private input real numbers are first converted to fixed-point representations and then encrypted under the Paillier cryptosystem.
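A minimal sketch of this encoding, assuming α = 64 total bits and p = 26 fractional bits (both constants are illustrative choices of the system parameters):

import java.math.BigInteger;

// Fixed-point encoding applied before encryption; negatives use two's complement (a + 2^alpha).
public class FixedPoint {
    static final int P = 26;                                   // fractional bits
    static final int ALPHA = 64;                               // total bits of the fixed-point type
    static final BigInteger MOD = BigInteger.ONE.shiftLeft(ALPHA);

    static BigInteger encode(double a) {                       // a_bar = floor(a * 2^p), mapped into [0, 2^alpha)
        BigInteger v = BigInteger.valueOf((long) Math.floor(a * (1L << P)));
        return v.mod(MOD);                                     // negative values become v + 2^alpha
    }

    static double decode(BigInteger v) {
        if (v.testBit(ALPHA - 1)) v = v.subtract(MOD);         // undo two's complement
        return v.doubleValue() / (1L << P);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(-3.14159)));          // ~ -3.14159, up to resolution 2^-26
    }
}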

4.2 Addition/Subtraction

Using the above methods, we can reduce all elementary operations (i.e., addition, subtraction, and multiplication) over real numbers to their integer versions. In particular:

Addition/Subtraction: ā ± b̄ gives the fixed-point representation of a ± b.

Therefore, additions and subtractions in the encrypted form can be directly achieved by utilizing the additive homomorphic property, i.e., [[a + b]] = [[a]] · [[b]] and [[a − b]] = [[a]] · [[b]]^{−1}.

4.3 Multiplication and Inner Product

However, implementing multiplication between two numbers in fixed-point representation in the ciphertext domain is not straightforward, for the following reasons. On the one hand, the Paillier cryptosystem can only evaluate linear polynomials efficiently (i.e., it supports any number of additions/subtractions). On the other hand, ā · b̄ = ⌊ab · 2^{2p}⌋ has a resolution of 2^{2p}, and therefore we need to scale the resolution down to 2^p, i.e., to obtain ⌊ā · b̄ / 2^p⌋ without overflow.

To achieve this goal, we require the remote server S to work jointly with the crypto service provider CSP to run a secure multiplication protocol. Specifically, our solution is inspired by [14] and is based on the following property:

a · b = (a + r1) · (b + r2) − a · r2 − b · r1 − r1 · r2.  (1)

In our setting, the remote server S holds two encrypted values [[a]] and [[b]], while the crypto service provider CSP has the secret key sk_p of the Paillier cryptosystem. Without loss of generality, the fixed-point data type is represented with α bits and the resolution is set to 2^p hereafter.

Now, we describe the detailed procedure of the secure multiplication protocol: (i) instead of sending [[a]] and [[b]] immediately to the CSP, S first masks them with two k-bit random numbers r1 and r2 (i.e., [[a + r1]] = [[a]] · [[r1]] and [[b + r2]] = [[b]] · [[r2]]), where k is a security parameter (k > α). (ii) The CSP decrypts the masked ciphertexts, computes the intermediate result d = (a + r1) · (b + r2) on the plaintext, re-encrypts the result as [[d]], and sends it back to S. (iii) S removes the masks based on equation (1) by computing [[d]] · [[a]]^{−r2} · [[b]]^{−r1} · [[r1 · r2]]^{−1} homomorphically, and obtains [[a · b]] with resolution 2^{2p}. (iv) S further blinds [[a · b]] with another k'-bit random number r3, where k' is also a security parameter (k' > α + p), to get [[a · b + r3]], and sends it to the CSP. (v) The CSP decrypts the received ciphertext, computes d' = ⌊(a · b + r3) / 2^p⌋, re-encrypts [[d']], and sends it back to S. (vi) S removes the mask in a similar way as in (iii) and obtains the final result [[⌊a · b / 2^p⌋]] with resolution 2^p.
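The following single-process Java sketch simulates both parties of this interaction, assuming the Paillier helper sketched in Section 3.2.1 (fields n, n2 and methods encrypt/decrypt/addCipher) and non-negative fixed-point operands; the two's-complement corrections for negative values and the ±1 rounding carry of the final step are omitted for brevity.

import java.math.BigInteger;
import java.security.SecureRandom;

// Simulation of the secure multiplication protocol between S and CSP in one process;
// the comments mark which party performs each step.
public class SecureMul {
    static final SecureRandom RNG = new SecureRandom();
    static final int P = 26, K = 80, K_PRIME = 120;            // resolution 2^P; k > alpha, k' > alpha + p

    static BigInteger mul(Paillier he, BigInteger ca, BigInteger cb) {
        // (i) S masks [[a]] and [[b]] with random r1, r2
        BigInteger r1 = new BigInteger(K, RNG), r2 = new BigInteger(K, RNG);
        BigInteger maskedA = he.addCipher(ca, he.encrypt(r1));
        BigInteger maskedB = he.addCipher(cb, he.encrypt(r2));
        // (ii) CSP decrypts, computes d = (a + r1)(b + r2) in the clear, and re-encrypts it
        BigInteger d = he.decrypt(maskedA).multiply(he.decrypt(maskedB)).mod(he.n);
        BigInteger cd = he.encrypt(d);
        // (iii) S removes the masks per Eq. (1): [[a*b]] = [[d]] * [[a]]^{-r2} * [[b]]^{-r1} * [[r1*r2]]^{-1}
        BigInteger cab = cd
                .multiply(ca.modPow(r2, he.n2).modInverse(he.n2))
                .multiply(cb.modPow(r1, he.n2).modInverse(he.n2))
                .multiply(he.encrypt(r1.multiply(r2).mod(he.n)).modInverse(he.n2))
                .mod(he.n2);
        // (iv)-(v) S blinds with r3; CSP decrypts and truncates the lowest p bits
        BigInteger r3 = new BigInteger(K_PRIME, RNG);
        BigInteger dPrime = he.decrypt(he.addCipher(cab, he.encrypt(r3))).shiftRight(P);
        // (vi) S removes r3 / 2^p, obtaining roughly [[ floor(a*b / 2^p) ]] with resolution 2^p
        return he.addCipher(he.encrypt(dPrime), he.encrypt(r3.shiftRight(P)).modInverse(he.n2));
    }
}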

Obviously, to compute the inner product of two encrypted vectors, which only consists of multiplications and additions, we can readily resort to the secure multiplication protocol and leverage the additive homomorphic property. For ease of presentation, we use ⊗ to denote the multiplication of two encrypted values (i.e., [[a · b]] = [[a]] ⊗ [[b]]) or the inner product of two encrypted vectors.

Packing optimization. It is worth noting that the message space of the Paillier cryptosystem is always greater than the space of the masked values exchanged in the secure multiplication protocol. We can therefore greatly improve both computation time and bandwidth by applying the packing technique. The basic idea is to send one ciphertext of the form [[a_{i+1} + r_{i+1} || ... || a_{i+δ} + r_{i+δ}]] instead of δ ciphertexts of the form [[a_i + r_i]], where δ = 1024/k and a 1024-bit modulus is used in the Paillier cryptosystem. More precisely, given two masked ciphertexts [[a_1 + r_1]] and [[a_2 + r_2]], the server S can aggregate them into a single ciphertext of the form [[a_1 + r_1 || a_2 + r_2]] = [[a_1 + r_1]]^{2^k} · [[a_2 + r_2]]. Hence one "packed" ciphertext can be obtained by aggregating δ ciphertexts in the same manner, and multiple data items can be processed with only one encryption or one decryption.
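A sketch of this aggregation, again assuming the Paillier helper above; keeping each masked slot within k bits is the caller's responsibility.

import java.math.BigInteger;

// Packing sketch: aggregate delta masked ciphertexts into a single Paillier ciphertext.
// k is the bit width of each masked slot; with a 1024-bit modulus, delta = 1024 / k slots fit.
public class Packing {
    static BigInteger pack(Paillier he, BigInteger[] maskedCiphertexts, int k) {
        BigInteger packed = he.encrypt(BigInteger.ZERO);
        for (BigInteger c : maskedCiphertexts) {
            packed = packed.modPow(BigInteger.ONE.shiftLeft(k), he.n2)  // shift earlier slots left by k bits
                           .multiply(c).mod(he.n2);                     // append the next slot
        }
        return packed;  // the CSP decrypts once and splits the plaintext into k-bit slots
    }
}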

4.4 Sigmoid Function

There is a variety of activation functions that are commonly used in neural networks. The sigmoid function, also known as the logistic function, is perhaps the most popular activation function chosen in the training stage in order to get


reasonable error terms when running the gradient descent algorithm. The sigmoid function is defined as:

σ(x) = 1 / (1 + e^{−x}).

However, the Paillier cryptosystem does not support exponentiation or division operations over ciphertexts, so we cannot directly compute the sigmoid function in the ciphertext form. Fortunately, it is possible to approximate the sigmoid activation function, and our subsequent experiments show that this approximation has negligible impact on the accuracy of the system and performs comparably with the results obtained in the plaintext domain.

In our construction, we utilize the Maclaurin series expansion [26], [27] to realize the approximation of the sigmoid function. Specifically, the equation used for the estimation is given as:

σ(x) = ∑_{n=0}^{∞} (−1)^n E_n(0) / (2 · n!) · x^n
     = ∑_{n=0}^{∞} (−1)^{n+1} (2^{n+1} − 1) B_{n+1} / (n + 1)! · x^n
     = 1/2 + (1/4)x − (1/48)x^3 + (1/480)x^5 − (17/80640)x^7 + ...,

where E_n(x) is an Euler polynomial and B_n is a Bernoulli number.

It is easy to see that the private computation of σ(x) can be achieved by performing the above secure multiplication protocol and homomorphic addition operations on ciphertexts. Furthermore, due to the properties of the Maclaurin series, we can flexibly choose the number of terms in the expansion to balance the accuracy requirement against efficiency.
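For reference, the plaintext form of the truncated expansion is shown below; the encrypted version evaluates the same polynomial, replacing each multiplication with the secure multiplication protocol and each addition with a homomorphic addition.

// Plaintext form of the truncated Maclaurin expansion of the sigmoid used in our construction.
public class SigmoidApprox {
    // sigma(x) ~ 1/2 + x/4 - x^3/48 + x^5/480 - 17 x^7/80640
    static final double[] COEF = {0.5, 0.25, 0.0, -1.0 / 48, 0.0, 1.0 / 480, 0.0, -17.0 / 80640};

    static double sigmoid(double x, int degree) {
        double acc = 0.0, pow = 1.0;
        for (int i = 0; i < COEF.length && i <= degree; i++) {
            acc += COEF[i] * pow;                         // add the coefficient of x^i
            pow *= x;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.5, 7));              // ~ 0.62246
        System.out.println(1.0 / (1.0 + Math.exp(-0.5))); // exact value, ~ 0.62246
    }
}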

5 OUR CONSTRUCTION

In this section, we propose a number of privacy-preserving word vector learning schemes that leverage the CBOW and skip-gram architectures described in Section 3.

Our ultimate goals are summarized as follows.

• Security. Neither the remote server S nor the CSP can learn anything about the private data (i.e., the original words and their related word vectors) contributed by users beyond what is revealed by the final results (i.e., the trained model).

• Effectiveness. The word vectors learned through our privacy-preserving methods should reach the same accuracy level as that obtained in the plaintext domain.

• Efficiency. The computation and communication costs at both the remote server side and the crypto service provider side should be practically acceptable.

5.1 Initialization

Our construction makes use of the Paillier cryptosystem and a pseudo-random function P, where P is defined as {0,1}^λ × {0,1}* → {0,1}^λ. At the beginning, the CSP takes responsibility for setting up the environment. In particular, given a security parameter λ, the CSP generates the following keys uniformly at random from their respective domains: a PRF key k_1 ←$ {0,1}^λ for P_{k_1}(·) and a key pair (sk_p, pk_p) for the Paillier cryptosystem. Then the CSP publishes the public key pk_p and sends the private keys k_1 and sk_p to all users through a secure channel.

As stated in our system model, each user owns a private data file and wants to perform collaborative neural network learning with all other users. Intuitively, users can protect the privacy of their sensitive data by encryption before outsourcing to the remote server S. More specifically, for every data file f_i, which comprises a sequence of words (w_1, ..., w_{#f_i}), user u_i computes a token τ_j = P_{k_1}(w_j) for each word w_j, ∀j ∈ [#f_i]. Besides, for every de-duplicated initial word vector v(w_j), the user u_i first converts each entry v(w_j)_k ∈ R (k ∈ [d]) to its fixed-point representation and then encrypts it under the Paillier cryptosystem, obtaining [[v(w_j)_k]]. For simplicity, we denote the resulting encrypted word vector by [[v(w_j)]]. Consequently, user u_i obtains the corresponding encrypted form c_i of the data file f_i, where c_i consists of a sequence of tokens τ_j and a collection of encrypted word vectors [[v(w_j)]]. Finally, each user u_i contributes the encrypted data file c_i to the remote server S.

After receiving all ciphertexts from the n users, the remote server S initializes a dictionary D (also known as a hash table or associative array), a data structure storing key-value pairs. Subsequently, S traverses the whole contributed corpus and stores each token together with its frequency count(P_{k_1}(w)) (i.e., the word count in the plaintext corpus) as a key-value pair in D. For instance, if the word w_1 appears five times in the corpus C, then the tuple (P_{k_1}(w_1), 5) is stored in D. At last, D contains m items (i.e., key-value pairs) extracted from the encrypted datasets. It is worth pointing out that this type of statistical information is indispensable for the remote server to conduct the neural network learning process, and we therefore assume that it does not disclose any private information about the original data (i.e., does not allow recovering any plaintext word). For ease of reference, P_{k_1}(w) will simply be written as w hereafter.
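The server-side bookkeeping can be sketched as follows. HMAC matches the PRF used in our experiments (Section 7), while the SHA-256 flavor, the demo key, and the in-memory HashMap are illustrative assumptions.

import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Server-side dictionary D: maps each token P_{k1}(w) to its frequency count.
// The server only ever sees tokens, never plaintext words.
public class TokenDictionary {
    static String prf(byte[] k1, String word) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(k1, "HmacSHA256"));
        return Base64.getEncoder().encodeToString(
                mac.doFinal(word.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        byte[] k1 = "demo-prf-key".getBytes(java.nio.charset.StandardCharsets.UTF_8); // issued by the CSP
        String[] file = {"the", "cat", "sat", "on", "the", "mat"};                    // a toy data file

        Map<String, Integer> dict = new HashMap<>();       // D: token -> count
        for (String w : file) {
            dict.merge(prf(k1, w), 1, Integer::sum);       // count tokens, not plaintext words
        }
        System.out.println(dict.size() + " distinct tokens; 'the' occurs " + dict.get(prf(k1, "the")) + " times");
    }
}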

5.2 Privacy-preserving Learning with CBOW

The continuous bag-of-words (CBOW) architecture, as shown in Fig. 2, predicts the current word from a window of surrounding context words, and the order of context words does not influence the prediction result (the bag-of-words assumption).

5.2.1 Hierarchical Softmax

In the traditional neural network language model (NNLM), we need to compute the conditional probabilities of all words in the corpus given a related context, p(w|context(w)), where w ∈ C and context(w) is a set of words surrounding w, and then perform a normalization (namely, softmax). Obviously, doing such computations for all words for every training sample is incredibly expensive, making it impractical to scale up to massive training corpora. One elegant approach to solving this problem is


Algorithm 1 CBOW with Hierarchical Softmax

Input: A training sample (context(w), w), where context(w) contains 2c encrypted word vectors [[v(context(w)_i)]] (i = 1, ..., 2c).
Output: Updated word vectors [[v(context(w)_i)]] and parameter vectors [[θ_j^w]] (j = 1, ..., l^w − 1).
1: Initialize a vector e = 0
2: Compute the summation [[v_sum]] = ∏_{i=1}^{2c} [[v(context(w)_i)]]
3: for j = 1, ..., l^w − 1 do
4:   Compute the sigmoid function [[s]] = σ([[v_sum]]^T ⊗ [[θ_j^w]])
5:   Compute [[g]] = [[η]] ⊗ ([[1]] · [[α_j^w]]^{−1} · [[s]]^{−1})
6:   Update [[e]] = [[e]] · ([[g]] ⊗ [[θ_j^w]])
7:   Update [[θ_j^w]] = [[θ_j^w]] · ([[g]] ⊗ [[v_sum]])
8: end for
9: for i = 1, ..., 2c do
10:   Update [[v(context(w)_i)]] = [[v(context(w)_i)]] · [[e]]
11: end for

Algorithm 2 CBOW with Negative Sampling

Input: A training sample (context(w), w), where context(w) contains 2c encrypted word vectors [[v(context(w)_i)]] (i = 1, ..., 2c).
Output: Updated word vectors [[v(context(w)_i)]] and parameter vectors [[θ^u]] (u ∈ {w} ∪ NEG(w)).
1: Initialize a vector e = 0
2: Compute the summation [[v_sum]] = ∏_{i=1}^{2c} [[v(context(w)_i)]]
3: for u ∈ {w} ∪ NEG(w) do
4:   Compute the sigmoid function [[s]] = σ([[v_sum]]^T ⊗ [[θ^u]])
5:   Compute [[g]] = [[η]] ⊗ ([[α^w(u)]] · [[s]]^{−1})
6:   Update [[e]] = [[e]] · ([[g]] ⊗ [[θ^u]])
7:   Update [[θ^u]] = [[θ^u]] · ([[g]] ⊗ [[v_sum]])
8: end for
9: for i = 1, ..., 2c do
10:   Update [[v(context(w)_i)]] = [[v(context(w)_i)]] · [[e]]
11: end for

hierarchical softmax [28], [29], and another is negative sampling [22] (to be discussed in the next section).

As the frequency of words works well for obtaining classes in the NNLM [30], the core idea of hierarchical softmax is to use a Huffman binary tree to represent all words in the word universe. In our construction, the remote server S generates the Huffman tree from the exact frequencies of the words using the dictionary D obtained in the initialization phase; that is, frequent words are assigned short binary codes. The obtained Huffman tree has up to m leaf nodes (corresponding to the words in the dictionary D) and m − 1 internal nodes (each associated with a parameter vector initialized to 0), and it serves as the output layer of the CBOW architecture.
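A standard Huffman construction over the token counts in D, shown here as a plaintext Java sketch (the node layout and field names are illustrative):

import java.util.Map;
import java.util.PriorityQueue;

// Huffman tree over the token counts in D: frequent words receive short codes.
// Leaves correspond to the m dictionary tokens; internal nodes carry the parameter vectors theta.
public class Huffman {
    static class Node implements Comparable<Node> {
        final long freq; final String token; final Node left, right;  // token == null for internal nodes
        Node(long f, String t, Node l, Node r) { freq = f; token = t; left = l; right = r; }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    static Node build(Map<String, Long> dict) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<String, Long> e : dict.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {                                       // repeatedly merge the two rarest nodes
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, null, a, b));            // new internal node (vector initialized to 0)
        }
        return pq.poll();                                             // root of the output layer
    }
}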

By leveraging maximum log-likelihood estimation, the objective function for the CBOW model is defined as

L = ∑_{w∈C} log p(w|context(w)),  (2)

Algorithm 3 Skip-gram with Hierarchical Softmax

Input: A training sample (w, context(w)).
Output: Updated word vector [[v(w)]] and parameter vectors [[θ_j^w]] (j = 1, ..., l^w − 1).
1: for u ∈ context(w) do
2:   Initialize a vector e = 0
3:   for j = 1, ..., l^w − 1 do
4:     Compute the sigmoid [[s]] = σ([[v(w)]]^T ⊗ [[θ_j^w]])
5:     Compute [[g]] = [[η]] ⊗ ([[1]] · [[α_j^w]]^{−1} · [[s]]^{−1})
6:     Update [[e]] = [[e]] · ([[g]] ⊗ [[θ_j^w]])
7:     Update [[θ_j^w]] = [[θ_j^w]] · ([[g]] ⊗ [[v(w)]])
8:   end for
9:   Update [[v(w)]] = [[v(w)]] · [[e]]
10: end for

and the training goal is to maximize the above function.

In a training sample (context(w), w), context(w) consists of the c words before and the c words after the word w. Therefore, at the input layer there are 2c word vectors in total, i.e., v(context(w)_1), v(context(w)_2), ..., v(context(w)_{2c}). Then, the summation of the 2c input vectors is computed at the projection layer, i.e., v_sum = ∑_{i=1}^{2c} v(context(w)_i).

The output layer, as described above, is constituted by a Huffman binary tree. For each leaf node, there exists a unique path from the root to it, and this path is used to estimate the probability (i.e., p(w|context(w))) of the word represented by the leaf node. Specifically, the conditional probability of a word being the word in the training sample is defined as:

p(w|context(w)) = ∏_{j=1}^{l^w−1} p(α_j^w | v_sum, θ_j^w),  (3)

where l^w denotes the number of nodes in the path from the root to the leaf node w, θ_j^w denotes the d-dimensional parameter vector associated with the j-th internal node on the above path, and the probability p(α_j^w | v_sum, θ_j^w) is defined as:

p(α_j^w | v_sum, θ_j^w) = σ(v_sum^T · θ_j^w) if α_j^w = 0, and 1 − σ(v_sum^T · θ_j^w) if α_j^w = 1,

where the probability of going to the right child (i.e., α_j^w = 0) at the j-th internal node is σ(v_sum^T · θ_j^w), and the probability of going to the left child (i.e., α_j^w = 1) at the j-th internal node is 1 − σ(v_sum^T · θ_j^w).

Combined with Eq. (3), the objective function (2) can be converted to:

L = ∑_{w∈C} log ∏_{j=1}^{l^w−1} p(α_j^w | v_sum, θ_j^w)
  = ∑_{w∈C} log ∏_{j=1}^{l^w−1} { [σ(v_sum^T · θ_j^w)]^{1−α_j^w} · [1 − σ(v_sum^T · θ_j^w)]^{α_j^w} }
  = ∑_{w∈C} ∑_{j=1}^{l^w−1} { (1 − α_j^w) · log[σ(v_sum^T · θ_j^w)] + α_j^w · log[1 − σ(v_sum^T · θ_j^w)] }.


Algorithm 4 Skip-gram with Negative Sampling

Input: A training sample (w, context(w)).
Output: Updated word vectors [[v(w̃)]] (w̃ ∈ context(w)) and parameter vectors [[θ^u]] (u ∈ {w} ∪ NEG^{w̃}(w)).
1: for w̃ ∈ context(w) do
2:   Initialize a vector e = 0
3:   for u ∈ {w} ∪ NEG^{w̃}(w) do
4:     Compute the sigmoid [[s]] = σ([[v(w̃)]]^T ⊗ [[θ^u]])
5:     Compute [[g]] = [[η]] ⊗ ([[α^w(u)]] · [[s]]^{−1})
6:     Update [[e]] = [[e]] · ([[g]] ⊗ [[θ^u]])
7:     Update [[θ^u]] = [[θ^u]] · ([[g]] ⊗ [[v(w̃)]])
8:   end for
9:   Update [[v(w̃)]] = [[v(w̃)]] · [[e]]
10: end for

For the simplicity of notation, we introduce the following shortened form without any ambiguity:

L(w, j) = (1 − α_j^w) log[σ(v_sum^T θ_j^w)] + α_j^w log[1 − σ(v_sum^T θ_j^w)].

Recall that the training goal is to maximize the objective function L, and we use stochastic gradient descent as the learning algorithm to achieve this goal.

We first take the derivative of L(w, j) with respect to the parameter vector θ_j^w of the j-th internal node and obtain:

∂L(w, j)/∂θ_j^w = {(1 − α_j^w)[1 − σ(v_sum^T θ_j^w)] − α_j^w σ(v_sum^T θ_j^w)} v_sum
               = [1 − α_j^w − σ(v_sum^T θ_j^w)] v_sum,

which results in the following update equation:

θ_j^w = θ_j^w + η [1 − α_j^w − σ(v_sum^T θ_j^w)] v_sum,  (4)

where η denotes a positive learning rate, and the update equation is applied for j = 1, 2, ..., l^w − 1. Next, we take the derivative of L(w, j) with respect to the summation of the input vectors v_sum and obtain:

∂L(w, j)/∂v_sum = [1 − α_j^w − σ(v_sum^T θ_j^w)] θ_j^w,

which results in the update equation for each word context(w)_i, i ∈ [2c]:

v(context(w)_i) = v(context(w)_i) + η ∑_{j=1}^{l^w−1} ∂L(w, j)/∂v_sum.  (5)

Now we summarize in Algorithm 1 the details of how the remote server S performs learning in a secure manner. Given a training sample (context(w), w), the remote server S first computes the summation of the input word vectors [[v(context(w)_i)]] (i ∈ [2c]) by leveraging the additive homomorphic property. In line 4, the remote server and the crypto service provider jointly calculate the inner product of the two encrypted vectors with the packing optimization. To compute the sigmoid function in the ciphertext domain, they resort to the approximation method introduced in Section 4. Then, for each parameter vector to be updated on the path from the root to the target word w, S invokes the secure multiplication protocol according to equation (4). At last, S updates the word vectors analogously based on equation (5).
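For reference, the plaintext counterpart of one such update is sketched below; Algorithm 1 performs exactly these steps, with every dot product, multiplication, and sigmoid evaluation replaced by the ciphertext primitives of Section 4 (parameter and variable names here are illustrative).

// Plaintext counterpart of one CBOW + hierarchical-softmax update (cf. Algorithm 1 and Eqs. (4)-(5)).
public class CbowHsStep {
    // context: the 2c input word vectors; theta: parameter vectors on the path; code: the alpha_j^w bits.
    static void update(double[][] context, double[][] theta, int[] code, double eta) {
        int d = context[0].length;
        double[] vSum = new double[d], e = new double[d];
        for (double[] v : context)                          // projection layer: sum of the 2c input vectors
            for (int k = 0; k < d; k++) vSum[k] += v[k];

        for (int j = 0; j < theta.length; j++) {            // walk the l^w - 1 internal nodes
            double s = 0;
            for (int k = 0; k < d; k++) s += vSum[k] * theta[j][k];
            s = 1.0 / (1.0 + Math.exp(-s));                 // s = sigma(vSum^T theta_j)
            double g = eta * (1 - code[j] - s);             // g = eta (1 - alpha_j - s)
            for (int k = 0; k < d; k++) {
                e[k] += g * theta[j][k];                    // accumulate the gradient for the input vectors
                theta[j][k] += g * vSum[k];                 // parameter update, Eq. (4)
            }
        }
        for (double[] v : context)                          // update every context word vector, Eq. (5)
            for (int k = 0; k < d; k++) v[k] += e[k];
    }
}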

5.2.2 Negative Sampling

As an alternative to hierarchical softmax, negative sampling (NEG) is a simplified version of noise contrastive estimation (NCE) proposed by Gutmann et al. [31]. The idea of negative sampling is more straightforward than hierarchical softmax; namely, we only need to update a sample of output vectors instead of all vectors on the path from the root to the leaf.

Apparently, the target word (i.e., the positive sample) appearing in a training sample should get updated, and we also need to sample a few words as negative samples (hence "negative sampling"). A probability distribution is thus required for the sampling process, and it will be discussed later in this section.

Now, for the training sample (context(w), w), we assume that a set of words (i.e., negative samples), denoted as NEG(w) (NEG(w) ≠ ∅), has been drawn from a certain probability distribution. The goal then is to maximize the following function:

g(w) = ∏_{u∈{w}∪NEG(w)} p(u|context(w)),  (6)

where the probability p(u|context(w)) is defined as:

p(u|context(w)) = σ(v_sum^T · θ^u) if α^w(u) = 1, and 1 − σ(v_sum^T · θ^u) if α^w(u) = 0,

where θ^u denotes a d-dimensional (auxiliary) parameter vector associated with u, α^w(u) = 1 if u is the positive sample (i.e., the word w), and α^w(u) = 0 otherwise (i.e., u ∈ NEG(w)).

The objective function for the CBOW model under negative sampling is given by:

L = ∑_{w∈C} log g(w).  (7)

By plugging Eq. (6) into the above objective function, we obtain:

L = ∑_{w∈C} log ∏_{u∈{w}∪NEG(w)} p(u|context(w))
  = ∑_{w∈C} ∑_{u∈{w}∪NEG(w)} { α^w(u) · log[σ(v_sum^T θ^u)] + [1 − α^w(u)] · log[1 − σ(v_sum^T θ^u)] }
  = ∑_{w∈C} { log[σ(v_sum^T θ^w)] + ∑_{u∈NEG(w)} log[1 − σ(v_sum^T θ^u)] }.

Similarly, we introduce a shortened form without introducing any ambiguity:

L(w, u) = α^w(u) log[σ(v_sum^T θ^u)] + [1 − α^w(u)] log[σ(−v_sum^T θ^u)].

Again, stochastic gradient descent is adopted in the following derivation.

To obtain the update equations of the word vectors, we first take the derivative of L(w, u) with respect to the auxiliary vector θ^u:

∂L(w, u)/∂θ^u = α^w(u) σ(−v_sum^T θ^u) v_sum − [1 − α^w(u)] σ(v_sum^T θ^u) v_sum
             = [α^w(u) − σ(v_sum^T θ^u)] v_sum,


Model | Addition/Subtraction | Multiplication | Sigmoid Approximation
CBOW HS | (4c − 1)·d + 3(log m − 1)·d | 3(log m − 1)·d | log m − 1
CBOW NEG | (4c − 1)·d + 3(|NEG(w)| + 1)·d | 3(|NEG(w)| + 1)·d | |NEG(w)| + 1
Skip-gram HS | 2cd·(3 log m − 2) | 6cd·(log m − 1) | 2c·(log m − 1)
Skip-gram NEG | 2cd·(3|NEG(w)| + 4) | 6cd·(|NEG(w)| + 1) | 2c·(|NEG(w)| + 1)

TABLE 1: Arithmetic Primitives Invoked for a Training Sample

which results in the following update equation:

θ^u = θ^u + η [α^w(u) − σ(v_sum^T θ^u)] v_sum,  (8)

where η denotes a positive learning rate, and the update equation is applied for u ∈ {w} ∪ NEG(w). To update the input word vectors v(context(w)_i), ∀i ∈ [2c], we take the derivative of L(w, u) with respect to the summation of the input vectors v_sum and obtain:

∂L(w, u)/∂v_sum = [α^w(u) − σ(v_sum^T θ^u)] θ^u,

which results in the update equation:

v(context(w)_i) = v(context(w)_i) + η ∑_{u∈{w}∪NEG(w)} ∂L(w, u)/∂v_sum.  (9)

Based on update equations (8) and (9), the details of how the remote server S performs training under negative sampling in a secure manner are shown in Algorithm 2.

The last obstacle is to determine a probability distribution for choosing negative samples. We follow the idea in [22] and generate the distribution from the word frequencies stored in the dictionary D; namely, to obtain the best quality of training results, the remote server S selects the negative samples with the following probabilities:

P_NEG(w_i) = count(w_i)^{3/4} / ∑_{u∈D} count(u)^{3/4},  i ∈ [m].
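A plaintext Java sketch of drawing from this distribution via inverse-CDF sampling over the dictionary counts (redrawing when the sample collides with the target word is left to the caller):

import java.util.Map;
import java.util.Random;

// Draws negative samples from the unigram distribution raised to the 3/4 power,
// built from the counts stored in the dictionary D.
public class NegativeSampler {
    final String[] tokens; final double[] cdf; final Random rng = new Random();

    NegativeSampler(Map<String, Long> dict) {
        tokens = dict.keySet().toArray(new String[0]);
        cdf = new double[tokens.length];
        double total = 0;
        for (int i = 0; i < tokens.length; i++) {
            total += Math.pow(dict.get(tokens[i]), 0.75);   // count(w)^{3/4}
            cdf[i] = total;
        }
        for (int i = 0; i < cdf.length; i++) cdf[i] /= total;  // normalize to a CDF
    }

    String sample() {                                        // caller should redraw if this equals the target word
        double u = rng.nextDouble();
        for (int i = 0; i < cdf.length; i++) if (u <= cdf[i]) return tokens[i];
        return tokens[tokens.length - 1];
    }
}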

5.3 Privacy-preserving Learning with Skip-gram

In contrast to the CBOW model, the continuous skip-gram architecture, as shown in Fig. 2, uses the current word to predict the surrounding window of context words.

5.3.1 Hierarchical Softmax

Different from CBOW with hierarchical softmax, the objective function for the skip-gram model is given as:

L = ∑_{w∈C} log p(context(w)|w).  (10)

Following derivations similar to those presented in Section 5.2.1, we give the details of performing training over encrypted data at the remote server S in Algorithm 3 without additional explanation.

5.3.2 Negative Sampling

The objective function for the skip-gram model under negative sampling is defined as:

L = ∑_{w∈C} ∑_{w̃∈context(w)} ∑_{u∈{w}∪NEG^{w̃}(w)} log p(u|w̃),  (11)

where NEG^{w̃}(w) denotes the collection of negative samples drawn with regard to w̃ when handling w. We give the details of performing privacy-preserving training under negative sampling at the remote server S in Algorithm 4 without further explanation.

6 THEORETICAL ANALYSIS

6.1 Security Analysis

As mentioned before, the purpose of our construction is to guarantee that neither the remote server nor the crypto service provider can learn anything about the data (i.e., the original words and their related word vectors) contributed by users. Based on the assumption that S and the CSP do not collude, we now present our security analysis from the following two perspectives.

At S's side, the encrypted corpus, which comprises a sequence of tokens and a collection of encrypted word vectors, is obtained after the initialization phase. It has been shown in [24] that the output of a PRF is computationally indistinguishable from a random string; thus, the tokens reveal nothing about the corresponding plaintext words. Besides, the word vectors are encrypted under the Paillier cryptosystem, which has been proved in [23] to be semantically secure against chosen-plaintext attacks based on the composite residuosity class problem. In the subsequent training process, the intermediate results, which are generated either via a number of homomorphic operations or during the interactions with the CSP, are also encrypted under the Paillier cryptosystem; therefore, S cannot derive anything meaningful.

At the CSP's side, it has the ability to decrypt the masked ciphertexts (e.g., [[a + r1]] in Section 4.3) sent by S. Note that the masked value is obtained by performing addition over the integers, which is a form of statistical hiding [32]. Specifically, for an α-bit integer a and a k-bit random integer r1, the masked value a + r1 gives statistical security of roughly 2^{α−k} for the potential value a. By choosing the security parameter k properly, we can make the above quantity arbitrarily small, and therefore statistically hide the corresponding entry.

In conclusion, our proposed construction effectively protects the privacy of users' sensitive data.

6.2 Efficiency Analysis

Our efficiency analysis mainly focuses on the training process with hierarchical softmax and with negative sampling, respectively. For clarity, we only take one training sample into account for both cases. With the Huffman binary tree representation of the word universe in the case of hierarchical softmax, the number of units that need to be updated is around log m, i.e., the depth of the Huffman tree or the


Corpus subset | Storage costs (MB) | Training words | Dictionary size
Dataset1 | 2.5 | 453,121 | 8,347
Dataset2 | 17.3 | 3,358,154 | 28,537
Dataset3 | 80.1 | 15,859,854 | 67,246

TABLE 2: The Characteristics of Datasets

number of iterations (for each sample). During each iteration, the number of cryptographic arithmetic primitives that need to be invoked depends on the dimension of the vectors (i.e., d). Therefore, both the computation and communication complexities per training sample are O(d · log m). In the case of negative sampling, the only difference is that the number of iterations (for each sample) is proportional to the number of negative samples (i.e., |NEG(w)|). We can then derive that the computation and communication complexities per training sample are O(d · |NEG(w)|).

Furthermore, we analyze the number of invocations of each cryptographic primitive in the training phase. Taking the CBOW model with hierarchical softmax as an example, (2c − 1) · d addition sub-protocols need to be invoked to compute the sum of the 2c encrypted word vectors contained in the training sample. Subsequently, each iteration starts by computing the approximated sigmoid function, which takes an inner product as input; here d addition and multiplication sub-protocols plus one approximated sigmoid computation are invoked. Then, an additional 2d addition and multiplication sub-protocols are used to perform updates on the encrypted vector e as well as the internal parameter vector θ. At last, the updates of the 2c encrypted word vectors in the training sample call another 2c · d addition sub-protocols. Analogously, we can obtain the number of invocations of each primitive for the remaining three cases, and summarize the results in Table 1.

In practice, we can adopt the packing technique introduced in Section 4.3 to achieve an order-of-magnitude improvement in terms of computation time and bandwidth. Furthermore, the training phase is highly parallelizable; namely, the whole training corpus can be effectively split into multiple parts, each of which can be processed with the same model by a separate thread (or process) in parallel.

6.3 Accuracy Analysis

In this section, we analyze the accuracy loss of each primitive used in our constructions. Since the operands involved in addition and subtraction are of the same fixed-point type, the results keep resolution 2^p without introducing any rounding error. For multiplication, the output with resolution 2^{2p} needs to be scaled down to resolution 2^p, which introduces an absolute error of δ · 2^{−p} due to the truncation of p bits, where 0 ≤ δ < 1. When calculating the inner product of two d-dimensional vectors, the accumulated error can reach d · 2^{−p}. Next, we discuss the formal error bound for approximating the sigmoid function with the Maclaurin series expansion. Denoting the n-th Maclaurin polynomial by P_n(x), we define the absolute error as

E_n(x) = |σ(x) − P_n(x)|.

Window size | Vector dimensions | Negative samples | Learning rate
2c = 4 | d = 100 | |NEG(w)| = 8 | η = 0.025

TABLE 3: Default Parameters for Accuracy Evaluation

Fig. 3: (a) Tradeoffs between error rates and the number of bits used for the fractional part (curves for CBOW_HS, CBOW_NEG, Skip-gram_HS, and Skip-gram_NEG); (b) Tradeoffs between error rates and the number of terms in the Maclaurin series.

That is, the error E_n(x) is the difference between the n-th Maclaurin polynomial and the original sigmoid function, and it can be bounded using the Lagrange form of the remainder as follows:

E_n(x) ≤ M / (n + 1)! · |x|^{n+1},

where M is some value satisfying |σ^{(n+1)}(x)| ≤ M. Using this bound, we can control the maximum allowable approximation error. For example, suppose we want to bound the error of the 2nd-degree Maclaurin polynomial of σ(x) on [−2, 2] centered at 0. We have

E_2(x) ≤ M / 3! · |x|^3,

where M bounds |σ^{(3)}(x)| on the given interval. The third derivative of σ(x) attains its maximum magnitude 1/8 at x = 0, so we have M = 0.125 and can further compute that the maximum possible error bound is (0.125 / 3!) · 2^3 ≈ 0.167.
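The following sketch numerically checks this bound; the closed-form third derivative and the sampling grid are for illustration only.

// Numerical check of the Lagrange bound for the degree-2 Maclaurin polynomial of sigma(x) on [-2, 2].
public class ErrorBound {
    static double sigma(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    public static void main(String[] args) {
        double M = 0, worst = 0;
        for (double x = -2; x <= 2; x += 1e-4) {
            double s = sigma(x);
            double third = s * (1 - s) * (1 - 6 * s + 6 * s * s);  // closed form of sigma'''(x)
            M = Math.max(M, Math.abs(third));
            double p2 = 0.5 + x / 4;                               // degree-2 Maclaurin polynomial (the x^2 term is 0)
            worst = Math.max(worst, Math.abs(s - p2));
        }
        // Expect M ~ 0.125, bound = M/3! * 2^3 ~ 0.167, observed maximum error below the bound
        System.out.printf("M = %.4f, bound = %.4f, observed = %.4f%n", M, M / 6 * 8, worst);
    }
}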

7 EXPERIMENTAL EVALUATION

In this section, we conduct extensive experiments on representative real-world datasets with different scales to evaluate the performance of our construction. The experiments are carried out on separate machines with different configurations. More specifically, each participating user operates on a machine running the Windows 10 operating system with a 2.90GHz 4-core Intel Core CPU and 12GB RAM. Both the remote server and the crypto service provider operate on an Amazon AWS EC2 instance "c4.8xlarge" running the Windows Server 2012R2 operating system with a 2.90GHz 36-core Intel Xeon CPU and 60GB RAM, in Java Runtime Environment (JRE) 1.8. We use the HMAC implementation contained in the default Java library as the PRF.

The following parameter values are used. In the initialization phase, we use the Paillier cryptosystem with a 1024-bit modulus. Besides, the default numbers of iterations and threads adopted in the training phase are set to 5 and 50, respectively.


Window size | Vector dimensions | Negative samples
2c = 4 | d = 100 | |NEG(w)| = 8
Learning rate | Fractional bits | Number of terms
η = 0.025 | p = 26 | n = 5

TABLE 4: Default Parameters for Efficiency Evaluation

 | Add. | Sub. | Multi. | Sigmoid
Plaintext Comp. (ms) | 0.0006 | 0.0005 | 0.0031 | 0.0402
Ciphertext Comp. (ms) | 0.0264 | 0.466 | 13.719 | 124.768
Comm. (KB) | - | - | 1.25 | 11.25

TABLE 5: Performance of Each Arithmetic Primitive

7.1 Datasets

We use a real-world benchmark corpus that contains around one billion training words, publicly available from the Google code project². Specifically, we generate three sub-datasets (in a similar manner to [21]), with scales ranging from hundreds of thousands to tens of millions of training words. Table 2 summarizes the main characteristics of these datasets in our experiments.

7.2 Accuracy

There are two dominant factors that affect the accuracy of our construction. First, the learning algorithms are implemented using fixed-point numbers, which introduces rounding errors. Second, the Maclaurin series expansion used in our construction brings in approximation errors. To quantify the errors, we define the relative error rate as:

Err = |(L − L*) / L|,

where L denotes the objective function computed in the clear without any privacy protection, and L* denotes the result obtained by our solution. The error rate allows us to assess the loss of accuracy due to our approximation methods.

To evaluate the error rate Err with respect to the number of bits for the fractional part (resp., the number of terms in the Maclaurin polynomial), the other factor, i.e., the number of terms (resp., the number of fractional bits), is fixed at 5 (resp., 26). Besides, Table 3 gives the configuration of the additional default parameters needed in the evaluation. Through a broad sweep of experiments on different training data, Fig. 3(a) illustrates the relationship between the number of bits for the fractional part of the fixed-point data type and the resulting error rates. It shows that the average error rate decreases with increasing bit length allocated for the fractional part. Similar results can be observed in Fig. 3(b), which shows the tradeoff between the average error rates and the number of terms applied in the Maclaurin series expansion. Moreover, both plots provide important guidance for system configuration when trying to achieve a desirable accuracy.

7.3 Efficiency

Next, on the basis of the default parameters given in Table 4, we evaluate the overall efficiency in terms of computation time and communication costs by utilizing the three datasets with different scales described above.

2. https://code.google.com/p/1-billion-word-language-modeling-benchmark/.

TABLE 6: Examples of Semantic and Syntactic Questions

               Word pair 1             Word pair 2
  Semantic     Athens Greece           Oslo Norway
               brother sister          grandson granddaughter
  Syntactic    apparent apparently     rapid rapidly
               great greater           tough tougher

TABLE 7: Comparison of the Overall Accuracy

  Model                              Semantic (%)   Syntactic (%)   Total (%)
  Mikolov et al.'s [21]
    CBOW HS                          8.94           19.8            16.38
    CBOW NEG                         10.04          27.55           22.01
    Skip-gram HS                     29.41          28.54           28.81
    Skip-gram NEG                    16.33          25.02           22.28
  Ours
    CBOW HS                          8.61           19.25           15.9
    CBOW NEG                         9.93           26.83           21.55
    Skip-gram HS                     28.67          28.32           28.43
    Skip-gram NEG                    15.07          24.59           21.59
  Relative changing percentage
    CBOW HS                          -0.33          -0.55           -0.48
    CBOW NEG                         -0.11          -0.72           -0.46
    Skip-gram HS                     -0.74          -0.22           -0.38
    Skip-gram NEG                    -1.26          -0.43           -0.69

time and communication costs by utilizing the three datasets with different scales described above.

Fig. 4(a) shows the overall computation time obtained by executing the original training algorithms without privacy guarantees. Since all the operations are performed in the plaintext domain, no communication cost is involved in this case. As shown in Fig. 4(b) and 4(c), the total time consumption and bandwidth overhead between the remote server and the crypto service provider grow with the increasing number of training words and vector dimensions. For Dataset3, it takes about 30 hours to complete the training process and roughly 86.1GB to transfer the encrypted intermediate values for the CBOW model under negative sampling, which are both practically acceptable. Besides, we note that the computation time can be further brought down to around 1 hour by utilizing a modest cluster of 30 nodes.

For clarity, we also run each arithmetic primitive in the ciphertext and plaintext settings, respectively, to demonstrate their performance (averaged over 1,000 executions); the results are given in Table 5. Since both the addition and subtraction primitives can be achieved by using the additive homomorphic property directly, they are computationally efficient and involve no communication cost. The time cost of the multiplication and sigmoid approximation primitives is dominated by the encryption and decryption operations (of the Paillier cryptosystem), which essentially consist of modular exponentiations of big integers. Meanwhile, the communication overhead comes from transferring the encrypted intermediate results.
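The following sketch, which reuses the Paillier class sketched in the setup above, shows why addition and subtraction need no interaction: an encrypted sum is a ciphertext product modulo n², and an encrypted difference multiplies by the second ciphertext raised to n − 1 (i.e., −1 in the exponent). Multiplication and the sigmoid approximation are not shown, since they additionally require an interactive round with the crypto service provider; the class and method names are ours.

```java
import java.math.BigInteger;

// Sketch of the two non-interactive primitives, reusing the Paillier class
// sketched earlier in this section.
public class HomomorphicOps {
    // E(m1 + m2) = E(m1) * E(m2) mod n^2
    static BigInteger add(Paillier pk, BigInteger c1, BigInteger c2) {
        return c1.multiply(c2).mod(pk.nSquared);
    }

    // E(m1 - m2) = E(m1) * E(m2)^(n-1) mod n^2
    // (a negative plaintext result appears modulo n).
    static BigInteger sub(Paillier pk, BigInteger c1, BigInteger c2) {
        return c1.multiply(c2.modPow(pk.n.subtract(BigInteger.ONE), pk.nSquared)).mod(pk.nSquared);
    }

    public static void main(String[] args) {
        Paillier pk = new Paillier(1024);
        BigInteger c1 = pk.encrypt(BigInteger.valueOf(15));
        BigInteger c2 = pk.encrypt(BigInteger.valueOf(4));
        System.out.println(pk.decrypt(add(pk, c1, c2)));   // 19
        System.out.println(pk.decrypt(sub(pk, c1, c2)));   // 11
    }
}
```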

Furthermore, we assume that each participating user owns a data file containing 100,000 training words. The computation time of each user to encrypt the data file is about 4.6 minutes, and the storage cost of the resulting ciphertext is about 6.34MB.

7.4 Effectiveness
To demonstrate the effectiveness, we follow the parameter setting given in Table 4, run both our privacy-preserving



TABLE 8: Total Accuracy with Parameter Tuning

  Window size (default 2c = 4)              2       4       6       8
    Mikolov et al.'s [21] (%)           21.39   22.01   23.60   23.73
    Ours (%)                            21.10   21.55   22.75   23.40
    Relative change (%)                 -0.29   -0.46   -0.85   -0.33

  Vector dimensions (default d = 100)      60      80     100     120
    Mikolov et al.'s [21] (%)           18.42   20.86   22.01   22.93
    Ours (%)                            18.06   20.09   21.55   22.16
    Relative change (%)                 -0.36   -0.77   -0.46   -0.77

  Learning rate (default η = 0.025)     0.015   0.020   0.025   0.030
    Mikolov et al.'s [21] (%)           18.03   20.81   22.01   24.98
    Ours (%)                            17.42   20.48   21.55   24.32
    Relative change (%)                 -0.61   -0.33   -0.46   -0.66

  Negative samples (default |NEG(w)| = 8)   4       6       8      10
    Mikolov et al.'s [21] (%)           22.90   23.27   22.01   23.89
    Ours (%)                            22.61   22.50   21.55   23.07
    Relative change (%)                 -0.29   -0.77   -0.46   -0.82

[Fig. 4: The overall efficiency evaluation. (a) Computation time (min) without privacy preservation; (b) computation time (h) on S and CSP; (c) communication costs (GB) between S and CSP. Each panel covers Dataset1, Dataset2, and Dataset3 for CBOW_HS, CBOW_NEG, Skip-gram_HS, and Skip-gram_NEG.]

[Fig. 5: Relationship between accuracy and the number of nearest words (k). (a) CBOW model; (b) Skip-gram model. Accuracy (%) versus k = 1–4 for the HS and NEG variants of the original schemes and ours.]

construction and the original learning algorithms proposed by Mikolov et al. [21] on Dataset3, and compare the quality of the resulting trained word vectors. To this end, we apply a comprehensive test set defined in [21] to measure the quality of the word vectors. Specifically, this test set contains five types of semantic questions (8,869 questions in total), and nine types of syntactic questions (10,675 questions in total). For ease of presentation, illustrative examples of each type are shown in Table 6.

The comparison of the overall accuracy for all questions on the word vectors obtained by the solutions in [21] and by our schemes is summarized in Table 7. A question is considered correctly answered only if the closest word (measured by cosine distance in vector space) returned by the answering mechanism given in [21] is exactly the same as the correct word in the question. Intuitively, the quality of word vectors is positively correlated with this accuracy metric. Overall, the results show that the quality of the word vectors trained by our schemes with guaranteed privacy is very close to that of Mikolov et al.'s.
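For reference, that answering mechanism works as follows: for a question "a is to b as c is to ?", it forms the vector vec(b) − vec(a) + vec(c) and returns the vocabulary word (excluding a, b, and c) whose vector is closest by cosine similarity. The sketch below captures this procedure; the Map-based vocabulary interface is our own simplification.

```java
import java.util.Map;

// Sketch of the analogy answering mechanism of [21]: for "a : b :: c : ?",
// return the word closest (by cosine similarity) to vec(b) - vec(a) + vec(c),
// excluding the three question words.
public class AnalogyAnswer {
    static double cosine(double[] u, double[] v) {
        double dot = 0, nu = 0, nv = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            nu += u[i] * u[i];
            nv += v[i] * v[i];
        }
        return dot / (Math.sqrt(nu) * Math.sqrt(nv));
    }

    static String answer(Map<String, double[]> vocab, String a, String b, String c) {
        double[] va = vocab.get(a), vb = vocab.get(b), vc = vocab.get(c);
        double[] target = new double[va.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = vb[i] - va[i] + vc[i];
        }
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vocab.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue;  // skip question words
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = w; }
        }
        return best;   // e.g., ideally "Norway" for ("Athens", "Greece", "Oslo")
    }
}
```

The top-k variant discussed next simply keeps the k highest-scoring words instead of only the single best one.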

Next, we evaluate the accuracy of top-k results, that is, a question is considered correctly answered only if the

returned k nearest words (via the answering mechanism) include the correct word in the question. Fig. 5(a) and 5(b) depict the total accuracy increasing with the number of returned nearest words, and the accuracy of the original schemes is slightly higher than ours. Again, the results demonstrate that our schemes only bring in negligible accuracy losses and are suitable for numerous practical applications.

Finally, note that multiple parameters (i.e., window size, vector dimension, learning rate, and the number of negative samples) are applied for the word vector training, and their different configurations affect the quality of the resulting word vectors. Hence, we provide a comprehensive study of the overall accuracy under parameter tuning. For ease of exposition, we only consider the CBOW model under negative sampling, and the results of total accuracy obtained by the same answering mechanism are illustrated in Table 8. As can be seen, the magnitude of the relative change between the original algorithm and ours ranges from 0.29% to 0.85%, which shows that our schemes achieve both good effectiveness and robustness.

8 CONCLUSION AND FUTURE WORK

In this paper, we proposed four new privacy-preserving neural network learning schemes for training word vectors over large-scale encrypted data from multiple participants. In our construction, we strategically used an efficient additively homomorphic encryption scheme, and introduced a suite of arithmetic primitives over encrypted data to serve as components. Finally, theoretical analyses and extensive experimental evaluations on real-world datasets were conducted to show that our schemes are secure, efficient, and practical for use in real-world NLP applications.

We believe our work to be valuable from the following two perspectives. First, it provides a novel design for learning



word vectors over encrypted data, which guarantees the privacy of the (potentially) sensitive training data. Our methodology may also be applied to other research directions in machine learning, such as convolutional neural networks (CNNs). Second, it provides multiple cryptographic primitives, which are building blocks for more complex arithmetic operations (e.g., arbitrary non-linear functions). Hence, it may provide a viable solution for various more general data analytics or mining tasks.

To tackle the computational bottlenecks introduced by the use of homomorphic encryption, our future work is to design a hybrid scheme combining homomorphic encryption with garbled circuits (GC), where GC allow all expensive operations to be moved to a precomputation phase, thus relieving the online computational burden. As another promising direction, we will resort to hardware-assisted trusted execution environments (TEEs), which create a secure isolation in hostile environments so that part of the computation can be performed in the plaintext domain, and can thus improve efficiency significantly.

ACKNOWLEDGMENTS

Qian's research is supported in part by the National Natural Science Foundation of China (Grant Nos. U1636219, 61373167), the Outstanding Youth Foundation of Hubei Province (Grant No. 2017CFA047), and the Key Program of the Natural Science Foundation of Hubei Province (Grant No. 2017CFA007). Yanjiao's research is supported in part by the National Natural Science Foundation of China (Grant No. 61702380), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFB134), and the Technological Innovation Projects of Hubei Province (Grant No. 2017AAA125). Xiaofeng's research is supported by the National Natural Science Foundation of China (Grant No. 61572382), and the Key Project of the Natural Science Basic Research Plan in Shaanxi Province of China (Grant No. 2016JZ021). Xinyi's research is supported by the Distinguished Young Scholars Fund of Fujian, China (Grant No. 2016J06013). Qian Wang is the corresponding author.

REFERENCES

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[2] R. L. Rivest, L. Adleman, and M. L. Dertouzos, "On data banks and privacy homomorphisms," Foundations of Secure Computation, vol. 4, no. 11, pp. 169–180, 1978.

[3] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft, "Privacy-preserving ridge regression on hundreds of millions of records," in Proc. of S&P'13. IEEE, 2013, pp. 334–348.

[4] A. C. Yao, "Protocols for secure computations," in Proc. of FOCS'82. IEEE, 1982, pp. 160–164.

[5] V. Nikolaenko, S. Ioannidis, U. Weinsberg, M. Joye, N. Taft, and D. Boneh, "Privacy-preserving matrix factorization," in Proc. of CCS'13. ACM, 2013, pp. 801–812.

[6] S. Kim, J. Kim, D. Koo, Y. Kim, H. Yoon, and J. Shin, "Efficient privacy-preserving matrix factorization via fully homomorphic encryption," in Proc. of AsiaCCS'16. ACM, 2016, pp. 617–628.

[7] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser, "Machine learning classification over encrypted data," in Proc. of NDSS'15, 2015.

[8] N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, "CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy," in Proc. of ICML'16, vol. 48, 2016, pp. 201–210.

[9] C. Dwork, "Differential privacy," in Proc. of ICALP'06. Springer, 2006, pp. 1–12.

[10] K. Chaudhuri, A. D. Sarwate, and K. Sinha, "A near-optimal algorithm for differentially-private principal components," Journal of Machine Learning Research, vol. 14, no. 1, pp. 2905–2943, 2013.

[11] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett, "Functional mechanism: regression analysis under differential privacy," Proc. of VLDB'12, vol. 5, no. 11, pp. 1364–1375, 2012.

[12] R. Shokri and V. Shmatikov, "Privacy-preserving deep learning," in Proc. of CCS'15. ACM, 2015, pp. 1310–1321.

[13] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proc. of CCS'16. ACM, 2016, pp. 308–318.

[14] Y. Elmehdwi, B. K. Samanthula, and W. Jiang, "Secure k-nearest neighbor query over encrypted data in outsourced environments," in Proc. of ICDE'14. IEEE, 2014, pp. 664–675.

[15] O. Goldreich, Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press, 2009.

[16] G. Hinton, J. McClelland, and D. Rumelhart, "Distributed representations," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986.

[17] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. of ICML'08. ACM, 2008, pp. 160–167.

[18] J. Turian, L. Ratinov, and Y. Bengio, "Word representations: a simple and general method for semi-supervised learning," in Proc. of ACL'10. ACL, 2010, pp. 384–394.

[19] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic regularities in continuous space word representations," in HLT-NAACL, vol. 13, 2013, pp. 746–751.

[20] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003.

[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

[22] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," arXiv preprint arXiv:1310.4546, 2013.

[23] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in Proc. of EUROCRYPT'99. Springer, 1999, pp. 223–238.

[24] J. Katz and Y. Lindell, Introduction to Modern Cryptography. CRC Press, 2014.

[25] M. de Cock, R. Dowsley, A. C. Nascimento, and S. C. Newman, "Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data," in Proc. of AISec'15. ACM, 2015, pp. 3–14.

[26] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Courier Corporation, 1964, vol. 55.

[27] F. Temurtas, A. Gulbag, and N. Yumusak, "A study on neural networks using Taylor series expansion of sigmoid activation function," in Proc. of ICCSA'04. Springer, 2004, pp. 389–397.

[28] F. Morin and Y. Bengio, "Hierarchical probabilistic neural network language model," in AISTATS, vol. 5. Citeseer, 2005, pp. 246–252.

[29] A. Mnih and G. E. Hinton, "A scalable hierarchical distributed language model," in Advances in Neural Information Processing Systems, 2009, pp. 1081–1088.

[30] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, "Extensions of recurrent neural network language model," in Proc. of ICASSP'11. IEEE, 2011, pp. 5528–5531.

[31] M. U. Gutmann and A. Hyvarinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.

[32] D. Evans, Y. Huang, J. Katz, and L. Malka, "Efficient privacy-preserving biometric identification," in Proc. of NDSS'11, 2011.



Qian Wang is a Professor with the School of Cyber Science and Engineering, and also with the School of Computer Science, Wuhan University, China. He received the B.S. degree from Wuhan University, China, in 2003, the M.S. degree from the Shanghai Institute of Microsystem and Information Technology (SIMIT), Chinese Academy of Sciences, China, in 2006, and the Ph.D. degree from the Illinois Institute of Technology, USA, in 2012, all in Electrical Engineering. His research interests include AI security, data storage, search and computation outsourcing security and privacy, wireless systems security, big data security and privacy, and applied cryptography, etc. Qian is an expert under the National "1000 Young Talents Program" of China. He is a recipient of the IEEE Asia-Pacific Outstanding Young Researcher Award 2016. He is also a co-recipient of several Best Paper and Best Student Paper Awards from IEEE ICDCS'17, IEEE TrustCom'16, WAIM'14, and IEEE ICNP'11, etc. He serves as an Associate Editor for the IEEE Transactions on Dependable and Secure Computing (TDSC) and the IEEE Transactions on Information Forensics and Security (TIFS). He is a Member of the IEEE and a Member of the ACM.

Minxin Du received the B.S. degree in Computer Science and Technology from Wuhan University, China, in 2015. He is working towards the Master's degree in the School of Cyber Science and Engineering at Wuhan University. His research interests include cloud security and applied cryptography.

Xiuying Chen is a third-year undergraduate student, working towards the Bachelor's degree in the School of Cyber Science and Engineering at Wuhan University. Her research interests include cloud security and applied cryptography.

Yanjiao Chen received the Ph.D. degree from the Hong Kong University of Science and Technology in 2015, and the Bachelor's degree in Electronic Engineering from Tsinghua University in 2010. She is currently a Professor in the School of Computer Science at Wuhan University, China. Her research interests include spectrum management for femtocell networks, network economics, and quality of experience (QoE) of multimedia in wireless networks.

Pan Zhou is currently an Associate Professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, P.R. China. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology (Georgia Tech), Atlanta, USA, in 2011. He was a senior technical member at Oracle Inc., Boston, MA, USA, from 2011 to 2013, where he worked on Hadoop and distributed storage systems for big data analytics at the Oracle Cloud Platform. His current research interests include communication and information networks, security and privacy, machine learning, and big data.

Xiaofeng Chen received the B.S. and M.S. degrees in mathematics from Northwest University, China, in 1998 and 2000, respectively. He received the Ph.D. degree in cryptography from Xidian University in 2003. He is currently a Professor at Xidian University. His research interests include applied cryptography and cloud computing security. He has published over 100 research papers in refereed international conferences and journals. His work has been cited more than 3,000 times on Google Scholar. He is on the Editorial Board of Security and Communication Networks (SCN), Computing and Informatics (CAI), and the International Journal of Embedded Systems (IJES), etc. He has served as the program/general chair or a program committee member in over 30 international conferences.

Xinyi Huang received the Ph.D. degree from the School of Computer Science and Software Engineering, University of Wollongong, Australia. He is currently a Professor with the School of Mathematics and Computer Science, Fujian Normal University, China, and the Co-Director of the Fujian Provincial Key Laboratory of Network Security and Cryptology. His research interests include applied cryptography and network security. He has authored over 100 research papers in refereed international conferences and journals. His work has been cited over 1,900 times on Google Scholar (H-index: 25). He is an Associate Editor of the IEEE Transactions on Dependable and Secure Computing. He serves on the Editorial Board of the International Journal of Information Security (Springer). He has served as the Program/General Chair or a Program Committee Member in over 60 international conferences.