
Privacy-Preserving Classification on Deep Neural Network

Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, Constance Morel and Emmanuel Prouff

Safran Identity & Security
Email: {firstname.lastname}@safrangroup.com

Abstract—Neural Networks (NN) are today increasingly used in Machine Learning where they have become deeper and deeper to accurately model or classify high-level abstractions of data. Their development however also gives rise to important data privacy risks. This observation motivates Microsoft researchers to propose a framework, called Cryptonets. The core idea is to combine simplifications of the NN with Fully Homomorphic Encryption (FHE) techniques to get both confidentiality of the manipulated data and efficiency of the processing. While efficiency and accuracy are demonstrated when the number of non-linear layers is small (e.g., 2), Cryptonets unfortunately becomes ineffective for deeper NNs, which leaves the problem of privacy-preserving matching open in these contexts. This work successfully addresses this problem by combining the original ideas of Cryptonets' solution with the batch normalization principle introduced at ICML 2015 by Ioffe and Szegedy. We experimentally validate the soundness of our approach with a neural network with 6 non-linear layers. When applied to the MNIST database, it competes with the accuracy of the best non-secure versions, thus significantly improving Cryptonets.

I. INTRODUCTION

Neural networks aim to solve a so-called classification problem, which consists in correctly assigning a label to a new observation on the basis of a training set of data containing observations (or instances) whose labelling is known [31]. It may also be viewed as the problem of approximating (complex) functions that can depend on a huge number of inputs and are generally unknown.¹ The main building blocks of neural network models have existed since the early 1980s but, for a long time, other methods (e.g. Support Vector Machines with kernels [41]) have been preferred since

¹ A classical simple example (e.g. detailed in [34, Chapter 1]) is the recognition of handwritten digits.

they are not impacted by the overfitting issue (when the dimension of the inputs is high and the size of the training database is small) and lead to unique solutions. The situation has recently changed with the increase of both the size of the training databases (e.g. available in the clouds) and the computing performances, and also thanks to technical improvements (e.g. the introduction of the dropout regularization technique [27] and the introduction of the Rectified Linear Unit activation function [24]). These advances have brought neural networks to the forefront in the 21st century. They, for instance, play a central role in face recognition [40], image classification [30] or speech recognition [23] and are thus widely deployed.

Machine learning algorithms are often applied to sensitive information such as medical data. The privacy protection of these sensitive data has been the subject of several articles during the last five years. They may be classified according to the learning algorithms which are involved: linear regression training or prediction [35], [47], linear classifiers [7], [8], [22], decision trees [8] or neural networks [2], [20], [36], [50], [52].

All the previously mentioned machine learning algorithms are composed of two stages: the training phase from a labelled database and the classification of new data. The training phase essentially aims at inferring the algorithm parameters from a labelled (or classified) database by optimizing some training objective (recognize/classify digits, faces, objects, etc.). The classification of a new datum then simply consists in applying the trained algorithm to it (see Appendix B for more details). Like in [48], [20], where Cryptonets' solution is introduced, our article

deals with the privacy-preserving problem for the classification (aka matching) processing in the context of deep Neural Networks. More precisely, we focus on the popular Convolutional Neural Network (CNN), which belongs to the family of multilayer perceptron (MLP) networks that themselves extend the basic concept of the perceptron² to address problems where the classification does not reduce to a simple linear separation. Readers not familiar with the basic concepts of MLP may refer to Appendix A where some basics are recalled. We additionally give in Appendix C some references about the study of the privacy of the training phase.

A. Problem Statement

The problem addressed in this work is formalized hereafter.

Problem 1 (Privacy Preserving Classification): A client has his data X. A server has a trained (deep) neural network. The client obtains the classification (aka label) of his data X from the server, and this classification must not reveal information about the trained neural network. The server obtains nothing (no information about the client input or labelling).

To provide practical solutions to Problem 1, two main requirements have been added:
• Efficiency: the running time to obtain the classification needs to be low in order to be applicable to real world applications.
• Accuracy: the classification performance needs to be close to the classification performance without privacy.

B. Related Work

The problem of privacy-preserving neural networks has been widely studied since the beginning of the century. All existing methods are based on secure multi-party computation (SMC) or homomorphic encryption (HE) or a combination of those methods. The efficiency of these solutions strongly depends on the complexity of the neural network (complexity of the activation function, number of layers of

² A perceptron is a binary classifier that first applies an inner product between real-valued input vectors and a fixed trained/learned vector and then associates the label 0 or 1 to the output depending on a learned threshold.

each type, number of neurons per layer) seen as a classification function.

In 2006, Barni et al. [2] proposed a privacy preserving neural network classification algorithm based on secure multi-party computation and homomorphic encryption. Their neural networks are composed of a succession of scalar products, which are secured on the basis of homomorphic encryption, and activation functions (threshold or sigmoid), secured with protocols based on secure multi-party computation. More precisely, to evaluate the scalar product between x and y, both in R^N and owned respectively by the client and the server, the client encrypts each component of the vector x with Paillier's homomorphic encryption [37] and sends them to the server; the server then evaluates homomorphically the encryption of the scalar product and sends back the encrypted result to the client, who finally decrypts it in the plaintext domain. Thanks to the homomorphic encryption, the server learns nothing about the client's vector x and the client learns nothing about the server's vector y except what he can infer from the scalar product result. The communication complexity of the scalar product evaluation is O(N·n²) where N is the number of components per vector and n is the RSA modulus for the Paillier cryptosystem. Its computation complexity is O(N) exponentiations modulo n². To securely evaluate each threshold activation function processing, Yao's solution to the millionaires' problem [49] is suggested. The communication complexity of this protocol is O((log₂ n)²) where the two integers to compare belong to Z_n. To securely evaluate the sigmoid activation function, the sigmoid is first approximated by a low degree polynomial and then the polynomial is evaluated with private polynomial evaluation [33]. Finally, the values of all intermediate neurons (outputs of scalar products and activation functions) are revealed to the client. The authors noted this issue and keep it for future works. To sum up, this protocol suffers from two major drawbacks: it fulfills neither the privacy requirement nor the efficiency requirement of our Problem 1.
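The additively homomorphic scalar product at the heart of this protocol is easy to sketch. The snippet below is our own toy illustration, not the implementation of [2]; it assumes the python-paillier package (phe), and the vectors are arbitrary examples.

```python
# Toy sketch of a private scalar product with additively homomorphic encryption.
# Assumes the python-paillier package ("pip install phe"); data and names are ours.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

x = [3, 1, 4]   # client's private input vector
y = [2, 7, 5]   # server's private weight vector

# Client side: encrypt each coordinate of x and send the ciphertexts.
enc_x = [public_key.encrypt(xi) for xi in x]

# Server side: compute Enc(<x, y>) using only ciphertext additions and
# ciphertext-by-plaintext multiplications; y is never sent to the client.
enc_dot = enc_x[0] * y[0]
for enc_xi, yi in zip(enc_x[1:], y[1:]):
    enc_dot += enc_xi * yi

# Client side: decrypt the scalar product.
assert private_key.decrypt(enc_dot) == sum(xi * yi for xi, yi in zip(x, y))
```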

Orlandi et al. [36] enhanced the protocol of [2]: intermediate results are no longer revealed to the client. To evaluate scalar products, they use the same protocol as Barni et al. with two improvements. Firstly, they replace Paillier's cryptosystem by its extension by Damgård and Jurik [13]. Secondly, the scalar product results are masked before delegating the evaluation of the activation functions to the client. For instance, if the activation function is the threshold function, the server computes homomorphically Enc(a·(⟨x, y⟩ − δ)) where δ is the threshold of the activation function and a is a non-zero value randomly chosen by the server. Then the client decrypts it and sends Enc(−1) if a·(⟨x, y⟩ − δ) < 0 and Enc(1) otherwise. The client cannot infer information about ⟨x, y⟩ thanks to the random value a. If a is negative, the server has to multiply the encrypted result by (−1) to obtain the correct encrypted result. Unfortunately, this protocol has a computation/communication complexity similar to the Barni et al. solution and thus it does not respect the efficiency requirement.

To overcome this drawback, Gilad-Bachrach et al. [48], [20] proposed a new privacy preserving neural network classification without interaction between the client and the server. Their approach, which applies the principles of Fully Homomorphic Encryption (FHE), is composed of the following steps: the client encrypts his data as specified by the chosen FHE scheme and sends it to the server, then the server processes a version of the classification algorithm which has been adapted to operate on encrypted data without decrypting them, and the encrypted result is eventually sent back to the client who can decrypt it. The privacy is obviously ensured thanks to the data encryption and the communication complexity is very low. These two assets may however come at the cost of an important increase of the processing complexity. The latter depends on the algebraic nature of the classification algorithm, and more precisely on its multiplicative depth³. The main issue is therefore to adapt the classification algorithm to render it compatible with homomorphic encryption while maintaining good accuracy. To achieve this result, Gilad-Bachrach et

³ We recall that the multiplicative depth of a processing is the maximum number of consecutive multiplications that must be computed to get the output.

al. [48], [20] simply propose to replace the activation function by the square function, which has multiplicative depth 1. To maintain good accuracy, the same neural network architecture is used for the training and the classification. Thanks to the low degree polynomial activation function, the multiplicative depth of the neural network is reasonable during the secure classification, and thus the homomorphic classification with the YASHE encryption scheme [6] is efficient. Unfortunately, as acknowledged by the authors themselves, the fact that the square function has an unbounded derivative induces a strange behaviour during the training phase. Therefore this protocol is only adapted to small neural networks and its accuracy is very low for neural networks with more than 2 non-linear layers (which are yet the most common in nowadays applications).

C. Contributions

In this paper, we design and evaluate the first privacy preserving classification for neural networks with depth greater than 2. Like the best state-of-the-art solutions discussed in the previous section, our goal is to end up with a construction that is FHE friendly while keeping accuracy at a high level. This requires in particular that the multiplicative depth of the neural network is maintained reasonably low. To address this issue, the literature suggests replacing the activation function by a low degree non-linear polynomial in both the learning and classification phases. To stay as close as possible to state of the art neural networks, and to therefore respect our accuracy requirement, we followed a different approach and chose to keep a classical deep neural network (with the ReLU activation function) for the training phase; we afterwards only modified the network when it is used for classification. As in [20], the first modification consists in approximating the ReLU by a low degree polynomial. This modification, if applied solely, does not work because the polynomial approximation of the ReLU function is only good on a restricted distribution whereas the activation function applies to inputs with unstable distributions. To circumvent this issue, which leads to classification errors and dramatically degrades the accuracy, our key technical innovation is to combine the latter polynomial approximation with batch normalization [28]. The batch normalization concept has been introduced in 2015 by Sergey Ioffe and Christian Szegedy in order to greatly accelerate the training of deep neural networks. The idea consists in adding a normalization layer before each activation layer in order to have a stable and normal distribution at the level of the activation function inputs. When such a normalization is involved, the polynomial approximation just needs to be accurate on a small and fixed interval.

To validate the soundness of our approach we launched several experiments. In order to both increase our chance of achieving a good accuracy on the MNIST database [32] and to test the robustness of our method for NNs with strictly more than 2 non-linear layers, we started by training a deep neural network with 6 activation layers (see Figure 1). We verified that, without any modification, this trained neural network achieves 99.59% accuracy. For the privacy-preserving classification step, the ReLU layers have been replaced by degree-2 polynomial approximations. To the best of our knowledge, the BGV encryption scheme [9] with the Smart-Vercauteren ciphertext packing techniques [43] and the Gentry-Halevi-Smart optimizations [18] is today the most efficient encryption scheme to evaluate polynomial functions. This is why we chose to use it for our experiments⁴. Finally, we observed that our privacy-preserving CNN achieves 99.30% accuracy, which improves on that of Cryptonets (98.95%). To conclude, our solution respects the three requirements: privacy (our solution is FHE friendly), efficiency (our solution has a low multiplicative depth of 6) and accuracy (the recognition performances with and without privacy are close). Last but not least, our approach is scalable and can be applied to neural networks with an important number of non-linear layers without decreasing the accuracy performances too much (which was not the case with Cryptonets).

⁴ In addition, HElib [26], a software library that implements this homomorphic encryption scheme, is available on GitHub [25]. As in [20], all real values will be mapped to fixed point representations because the BGV encryption scheme can only be applied over finite fields.

D. Paper Organization

We present the required preliminaries in Section II. Our privacy preserving deep neural network classification protocol can be found in Section III. This solution requires the approximation of the ReLU function by a low degree polynomial. Section IV presents this polynomial approximation. Finally, we tested our solution firstly on a light CNN (see Section V) and then on a deeper CNN (see Section VI). Section VII concludes this paper.

II. PRELIMINARIES

In this section, the convolutional neural network concept and homomorphic encryption are presented.

A. Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNN) are particularly tailored for image recognition. For that purpose, the classical principle of the MLP is completed with a new type of layer, called the convolutional layer, based on convolutional filtering. Like a perceptron, this convolutional layer computes a scalar product between some input neurons and some weights which have been previously trained/learned.

A convolutional neural network is a stack of layers that transforms an input layer, holding data to classify, into an output layer, holding the label scores. Each neuron of one layer (except the input layer) is the output of a function applied on the neurons of the previous layer. A few distinct kinds of layers are commonly used, such as the fully connected layer, the convolutional layer, the activation layer and the pooling layer.

a) Fully connected layer: each neuron of this layer is connected to each neuron of the previous layer. The weight on the connection between the j-th neuron of the (ℓ−1)-th layer and the k-th neuron of the ℓ-th layer is denoted by ω^ℓ_{kj}. The bias of the k-th neuron of the ℓ-th layer is denoted by β^ℓ_k and its output x^ℓ_k equals β^ℓ_k + Σ_j ω^ℓ_{kj} · x^{ℓ−1}_j, which is a simple inner product.

Fig. 1: Our deep neural network

Fig. 2: Fully connected layer

b) Convolutional layer: The issue with the fully connected layer is the number of parameters to learn. In fact, the number of weights in one fully connected layer is equal to the number of neurons in the previous layer multiplied by the number of neurons in the current layer. For instance, if a layer and its previous layer contain respectively 100 and 10,000 neurons, then the current layer contains 10⁶ weights and 100 biases to learn. To overcome this issue, convolutional layers are used. These layers are especially tailored for image inputs because they are based on the convolutional filtering used in image processing. In image processing, a filter studies successively each pixel of an image. For each pixel, a convolution is performed by multiplying the pixel and its neighbouring pixels' values by the filter's corresponding weights and then adding all the results. The new pixel is set to this final result value. The same filter is used on each pixel of the image. Figure 3 shows a convolutional filtering.

Fig. 3: Convolutional filtering

Convolutional layers are based on convolutional filtering. The weights within the filter are the weights to learn during the training step. In addition, several filters are applied on the same layer in order to extract different kinds of characteristics. Hence, each layer is a three-dimensional layer: a two-dimensional matrix is obtained from each filter and then the result of each filter is put into a stack to obtain a three-dimensional result. In addition, the filter is also in three dimensions in order to take into account the three dimensions of the previous layer. For instance, if n filters are applied on a layer containing w × h × d neurons, then each filter contains s × s × d weights, where s is the filter size, and the output layer contains w × h × n neurons.

More specifically, the convolutional layer at level ℓ contains w^ℓ × h^ℓ × d^ℓ neurons noted x^ℓ_{i,j,k}, ∀(i, j, k) ∈ [0..w^ℓ−1] × [0..h^ℓ−1] × [0..d^ℓ−1]. The parameters to learn are arranged into filters. Each filter contains s^ℓ × s^ℓ × d^{ℓ−1} weights and one bias and is used to obtain a matrix of w^ℓ × h^ℓ neurons. Mathematically, the (i₀, j₀, k₀)-th neuron of the ℓ-th layer, noted x^ℓ_{i₀,j₀,k₀}, is computed with the k₀-th filter and some neurons of the previous layer with the following formula:

x^ℓ_{i₀,j₀,k₀} = β^ℓ_{k₀} + Σ_{i=0}^{s^ℓ−1} Σ_{j=0}^{s^ℓ−1} Σ_{k=0}^{d^{ℓ−1}−1} ω^{ℓ,k₀}_{i,j,k} · x^{ℓ−1}_{i₀+i−⌊s^ℓ/2⌋, j₀+j−⌊s^ℓ/2⌋, k}.

Figure 4 represents the application of one filter into a convolutional layer.
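To make the formula concrete, here is a naive numpy transcription of one such convolutional layer (zero padding, stride 1); this is our own illustrative reference implementation, not the code used in our experiments.

```python
import numpy as np

def conv_layer(prev, filters, biases):
    """Naive convolutional layer implementing the formula above.

    prev    : previous layer, shape (w, h, d_prev)
    filters : filter bank, shape (n_filters, s, s, d_prev)
    biases  : shape (n_filters,)
    returns : current layer, shape (w, h, n_filters), zero padding, stride 1
    """
    w, h, _ = prev.shape
    n_filters, s, _, _ = filters.shape
    half = s // 2
    padded = np.pad(prev, ((half, half), (half, half), (0, 0)))
    out = np.empty((w, h, n_filters))
    for k0 in range(n_filters):
        for i0 in range(w):
            for j0 in range(h):
                window = padded[i0:i0 + s, j0:j0 + s, :]
                out[i0, j0, k0] = biases[k0] + np.sum(window * filters[k0])
    return out

layer = conv_layer(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3), np.zeros(4))
print(layer.shape)  # (8, 8, 4): one 8x8 feature map per filter
```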

Fig. 4: Application of one filter into a convolutional layer

c) Activation layer: A neural network containing only fully connected and convolutional layers can only classify data linearly. To solve more complex classification problems, the activation layer has been introduced. The same non-linear function, called the activation function, is applied to each neuron of the previous layer to obtain one neuron of the current layer: x^ℓ_{i,j,k} = f(x^{ℓ−1}_{i,j,k}) where f is the activation function. Therefore, an activation layer contains the same number of neurons as its previous layer. The two most common activation functions are the ReLU function f(x) = max(0, x) and the sigmoid function f(x) = (1 + e^{−x})^{−1}. Activation layers are usually used immediately after convolutional or fully connected layers. Figure 5 represents an activation layer.

Fig. 5: Activation layer

d) Pooling layer: Like activation layers, pooling layers are non-linear layers. This layer reduces the spatial size in order to reduce the number of neurons. It partitions the neurons of the previous layer into a set of non-overlapping rectangles and applies a function on each sub-area in order to obtain the value of one neuron of the current layer. The most common pooling functions are max-pooling, which outputs the maximum value within the sub-area, and average-pooling, which outputs the average of the values of the sub-area. In addition, the pooling layer operates independently on every depth slice of the previous layer. Pooling layers are usually used immediately after activation layers. Figure 6 represents a pooling layer.

Fig. 6: Max-pooling layer
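A minimal numpy sketch of such a pooling layer over non-overlapping 2 × 2 windows (our own illustration, covering both the max and the average variants) is given below.

```python
import numpy as np

def pool_2x2(layer, mode="max"):
    """Non-overlapping 2x2 pooling, applied independently to each depth slice.

    layer : shape (w, h, d) with even w and h
    mode  : "max" or "average"
    """
    w, h, d = layer.shape
    blocks = layer.reshape(w // 2, 2, h // 2, 2, d)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.random.randn(4, 4, 3)
print(pool_2x2(x, "max").shape)      # (2, 2, 3)
print(pool_2x2(x, "average").shape)  # (2, 2, 3)
```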

e) Common architectures: The main block of a convolutional neural network is a convolutional layer (CONV) directly followed by an activation layer (ACT), noted CONV → ACT. The convolutional layer extracts local information from the previous layer thanks to filters and the activation layer increases the complexity of the learned classification function thanks to its non-linearity. After some CONV → ACT blocks, a pooling layer (POOL) is usually added to reduce the number of neurons: [CONV → ACT]^p → POOL. This new block is repeated in the neural network until a layer of reasonable size is obtained. Then some fully connected layers (FC) are introduced in order to obtain a global result which depends on the entire input. A common convolutional network can be summarised by the following formula: [[CONV → ACT]^p → POOL]^q → [FC]^r.

A neural network can be viewed as a function taking as inputs some parameters (weights and biases) and one datum, and which outputs the probability of each label for this datum. For instance, suppose the neural network learns how to recognize handwritten digits. Its input layer will have as many neurons as pixels in the image to classify and its output layer will contain ten neurons, one per possible digit. During the training step, the weights and biases will be learned in order to recognize handwritten digits as well as possible. During the classification step, the ten neurons of the output layer will contain the probability of each digit. The position of the highest-valued neuron of the output layer indicates the digit of the input image.

B. Homomorphic Encryption

a) Partially homomorphic cryptosystems: Homomorphic encryption is an encryption which allows computations over encrypted data without knowing the encryption secret key and without learning the raw data. For instance, additive homomorphic encryption allows one to compute additions on encrypted data: given encryptions Enc(m₁), Enc(m₂) of the messages m₁ and m₂, one can efficiently compute a ciphertext that encrypts m₁ + m₂ without knowing m₁, m₂ or the secret key. Homomorphic encryptions allowing only one type of operation (additions or multiplications) are called partially homomorphic encryptions and have existed for many years with, for instance, the Paillier cryptosystem [37] for additive homomorphic encryption and the RSA cryptosystem [38] or the El Gamal cryptosystem [16] for multiplicative homomorphic encryption. These encryptions have been widely used in various applications such as electronic voting [14], [15] or simple statistics (e.g. mean, variance) in data mining.

b) Fully Homomorphic Encryption (FHE): Partially homomorphic encryptions are limited to only one type of operation but in many applications (including ours) we would like to homomorphically perform both additions and multiplications. The first scheme addressing this issue was introduced by Craig Gentry in 2009 [17]. This scheme, called fully homomorphic encryption (FHE), is based on ideal lattices and its construction contains two steps. It starts with a somewhat homomorphic encryption (SWHE) scheme which allows a fixed number of operations (additions and multiplications) in the encrypted domain. A bootstrapping operation is added to the SWHE. This operation refreshes ciphertexts by homomorphically evaluating the decryption circuit. The combination of a SWHE with a bootstrapping operation results in a fully homomorphic encryption: the number of operations is now unlimited. Unfortunately, this scheme is inefficient for practical applications due to its high computation and memory cost.

Since Gentry's breakthrough work, numerous FHE schemes have been introduced in order to make it practical. These schemes can be organised in three categories:

• Schemes based on ideal lattices such as Gentry's original work [17] and its optimizations [44], [19]. These schemes have worse performances than the following systems and so are no longer considered.
• Schemes based on the learning with errors (LWE) or ring learning with errors (RLWE) problems such as Brakerski and Vaikuntanathan's schemes [10], [11] and their famous optimization Brakerski-Gentry-Vaikuntanathan (BGV) [9].
• Schemes which operate over the integers instead of ideal lattices such as the van Dijk, Gentry, Halevi and Vaikuntanathan (DGHV) scheme [46].

Despite the significant progress in the FHE area, no efficient FHE scheme for all generic functions exists. But for some specific applications, very efficient solutions based on FHE can be designed. In our work, we designed an efficient solution for privacy preserving classification on deep neural networks. Our solution is based on the BGV encryption scheme [9] with the Smart-Vercauteren ciphertext packing techniques [43] and the Gentry-Halevi-Smart optimizations [18]. To the best of our knowledge, this scheme is today the most efficient FHE scheme for polynomial evaluations. In addition, HElib [25] is an efficient implementation of this scheme.

c) Complexity of FHE and multiplicative depth: Like many FHE schemes, the BGV encryption scheme consists in hiding the plaintext message with noise in order to create the ciphertext message. The decryption consists in removing the noise from the ciphertext message. The noise level increases with each homomorphic operation. If the noise level exceeds a certain threshold, it is no longer possible to correctly decrypt the message. The noise growth is much more important with multiplications than with additions. This is why only the multiplicative depth (maximum number of multiplications in series) is taken into account when the computational complexity is evaluated.
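As a small illustration of this cost model (ours, not taken from the paper), the multiplicative depth needed to raise an encrypted value to the power deg by repeated squaring is ⌈log₂(deg)⌉, which matches the depths of 1, 2 and 3 quoted for the degree 2, 4 and 6 activations in Section V.

```python
import math

def power_depth(deg):
    """Multiplicative depth to evaluate x**deg: each multiplication at most
    doubles the reachable degree, so ceil(log2(deg)) levels are needed."""
    return 0 if deg <= 1 else math.ceil(math.log2(deg))

for deg in (2, 4, 6):
    print(deg, power_depth(deg))   # 2 -> 1, 4 -> 2, 6 -> 3
```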

III. OUR PRIVACY-PRESERVING DEEP NEURAL NETWORK CLASSIFICATION SOLUTION

This section presents our new proposal for a privacy preserving classification on deep neural networks. This proposal respects our three requirements: privacy (achieved thanks to FHE), efficiency (reasonably low multiplicative depth) and accuracy (close to the state of the art CNN). We recall that the main issue with the current literature is that CNN architectures usually contain ReLU activation functions and max pooling functions, which have a high multiplicative depth and thus are incompatible with our efficiency requirement.

Since our proposal is based on an idea originally introduced in [20] and may be viewed as an extension of this work, we start by recalling the Cryptonets solution hereafter.

A. Cryptonets solution [20]

The Cryptonets solution consists in replacing the high multiplicative depth layers (ReLU and max pooling) of the CNN by low multiplicative depth polynomial layers, without degrading the accuracy of the classification too much. To achieve this goal, the authors suggest replacing the ReLU function by the square function x ↦ x² and replacing the max-pooling by the sum-pooling, which has a null multiplicative depth. The proposal is tested on a small CNN (9 layers with only 2 activation layers) which is applied on the MNIST database. They obtained an accuracy of 98.95% (the state of the art accuracy for this database is 99.77% according to http://yann.lecun.com/exdb/mnist/). The multiplicative depth of the CNN is equal to 2 and thus the classification with FHE (i.e. privacy preserving) is efficient. However, this design respects only two requirements among our three: efficiency and privacy. The non-compliance with the accuracy requirement can be explained by the low depth of the CNN (only 9 layers).

A simple and natural idea to improve the accuracy could be to increase the CNN depth. Unfortunately, as acknowledged by the authors themselves, the squaring activation function has a derivative which is unbounded, which leads to unstable training, and hence important accuracy loss, when the CNN is deep. This observation about the incompatibility of the approach with deep CNNs seems to strongly limit the practical interest of Cryptonets. At least it shows that more investigations are needed to make it practical for deep learning.

In view of the shortcomings of Cryptonets, we chose to consider separately the CNN used for the training phase and the CNN used for the matching:
• for the training phase, the activation function has to be a non-linear function with a bounded derivative in order to have a stable training (especially for deep CNNs), and its multiplicative depth is not an issue since we do not want to make the training privacy preserving (and hence do not need to apply FHE);
• for the classification, the activation function has to be a low degree polynomial in order to respect the efficiency requirement (FHE friendly solution).

B. Our solution

To respect the accuracy requirement, we make only minor modifications in the CNN for the training step compared to state of the art CNNs. Namely, we keep the ReLU activation function, which seems to be a key function in current CNN performances, and we replace the max-pooling by the average-pooling (this has only a small impact on the accuracy and has the advantage of replacing a step which is not FHE-friendly by a step with null multiplicative depth). For the CNN used in the matching, we propose to start with the trained CNN and to replace the ReLU functions by low degree approximations (hence reapplying the same basic idea as in Cryptonets). To deal with the fact that the ReLU function cannot have a good low degree polynomial approximation on the entire real line R, our key innovation is to combine the polynomial approximation of the ReLU with a batch normalization layer [28]. Namely, before each ReLU layer, we add a batch normalization layer in order to have a restricted stable distribution at the entry of the ReLU, thus limiting our need for an accurate polynomial approximation to a small part of R around the point 0. To avoid high accuracy degradation between the training and the classification phases due to too many modifications of the CNN, the batch normalization layers are added to the CNN for both the training and the classification phases.

As a reminder, a common convolutional network can be summarized by the formula [[CONV → ACT]^p → POOL]^q → [FC]^r (see Section II-A). In our solution, convolutional networks have the following architecture: [[CONV → BN → ACT]^p → POOL]^q → [FC]^r, where BN represents a batch normalization layer and POOL is the average-pooling. In addition, the activation layer ACT is the ReLU function for the training phase and a polynomial approximation for the classification phase.
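For illustration, a PyTorch-style sketch of this [[CONV → BN → ACT]^p → POOL]^q → [FC]^r pattern is given below, with the activation passed as a parameter so that the ReLU used for training can be swapped for a degree-2 polynomial at classification time. The layer sizes are placeholders of ours and do not reproduce the exact networks of Figures 10 and 12; the polynomial coefficients are the degree-2 values of Table I.

```python
import torch.nn as nn

class PolyAct(nn.Module):
    """Degree-2 polynomial activation c0 + c1*x + c2*x^2 (classification phase)."""
    def __init__(self, c0=0.1992, c1=0.5002, c2=0.1997):
        super().__init__()
        self.c0, self.c1, self.c2 = c0, c1, c2

    def forward(self, x):
        return self.c0 + self.c1 * x + self.c2 * x * x

def conv_block(c_in, c_out, act):
    # one CONV -> BN -> ACT block
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out),
                         act())

def make_cnn(act=nn.ReLU):
    # [[CONV -> BN -> ACT]^2 -> POOL]^2 -> [FC]^2 with average pooling
    return nn.Sequential(
        conv_block(1, 32, act), conv_block(32, 32, act), nn.AvgPool2d(2),
        conv_block(32, 64, act), conv_block(64, 64, act), nn.AvgPool2d(2),
        nn.Flatten(), nn.Linear(64 * 7 * 7, 256), nn.Linear(256, 10))

training_net = make_cnn(nn.ReLU)   # training phase: ReLU, not FHE friendly
matching_net = make_cnn(PolyAct)   # classification phase: FHE friendly
```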

IV. APPROXIMATION OF RELU BY POLYNOMIALS

The accuracy of our solution is strongly linked to the quality of the polynomial approximation of the ReLU function on the output distribution of the batch normalization layer (see Figure 9). This latter layer is right after a convolutional layer. Thus, according to the central limit theorem, the input distribution of the batch normalization layer is normal, and hence the output distribution is the standard normal one.

Fig. 7: Batch normalization [28, Algorithm 1]
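As a reminder of what this layer computes, a numpy restatement of the batch normalization transform of [28, Algorithm 1] follows (our own sketch; the inference-time variant based on running statistics, and the learning of the scale and shift parameters, are omitted).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization [28, Algorithm 1] over a mini-batch (axis 0):
    y = gamma * (x - mean) / sqrt(var + eps) + beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Unstable pre-activation values are mapped close to a standard normal
# distribution, which is exactly what our ReLU approximation needs.
batch = 3.0 * np.random.randn(64, 128) + 2.0
out = batch_norm(batch)
print(round(out.mean(), 3), round(out.std(), 3))  # approximately 0.0 and 1.0
```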

To compute the polynomial approximation of the ReLU on this standard normal distribution, we used the polynomial regression function polyfit from the Python package numpy. This function takes as inputs a set X = (X₁, ..., X_N), a set Y = (Y₁, ..., Y_N) and a polynomial degree n, and outputs the coefficients of the polynomial P(X) = c₀ + c₁X + c₂X² + ... + c_nX^n such that the squared error ε = Σ_{i=1}^{N} (P(X_i) − Y_i)² is minimized. We applied this function on the set {(X_i, ReLU(X_i))} where the X_i are randomly drawn from a standard normal distribution (see Table I and Figure 8).
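This procedure is easy to reproduce; the short script below is our own re-run of it (the number of samples is our choice, as it is not stated in the text), and its degree-2 output should be close to the first row of Table I.

```python
import numpy as np

np.random.seed(0)
xs = np.random.randn(100_000)      # X_i drawn from the standard normal distribution
ys = np.maximum(0.0, xs)           # Y_i = ReLU(X_i)

for degree in (2, 3, 4, 5, 6):
    coeffs = np.polyfit(xs, ys, degree)       # least-squares fit, highest degree first
    print(degree, np.round(coeffs[::-1], 4))  # printed from c0 up to c_degree
```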

Degree   Polynomial
2        0.1992 + 0.5002X + 0.1997X²
3        0.1995 + 0.5002X + 0.1994X² − 0.0164X³
4        0.1500 + 0.5012X + 0.2981X² − 0.0004X³ − 0.0388X⁴
5        0.1488 + 0.4993X + 0.3007X² + 0.0003X³ − 0.0168X⁴
6        0.1249 + 0.5000X + 0.3729X² − 0.0410X⁴ + 0.0016X⁶

TABLE I: Approximation of the ReLU function by polynomials on the standard normal distribution

Fig. 8: Approximation of the ReLU function by polynomials on the standard normal distribution

We can observe that, for our method, the polynomials of odd degree 2n+1 approximate similarly to the polynomials of even degree 2n. The same trend can be mathematically observed from a Taylor series around the point 0 of a smooth approximation of the ReLU function called softplus, ln(1 + e^x) (see Figure 9): ln(1 + e^x) = ln(2) + x/2 + x²/8 − x⁴/192 + x⁶/2880 − 17x⁸/645120 + 31x¹⁰/14515200 + O(x¹²); apart from the linear term x/2, only even powers of x appear. Thus we only used the even degree polynomial approximations in our solution.
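This expansion can be checked symbolically, e.g. with sympy (a quick verification of ours):

```python
import sympy as sp

x = sp.symbols('x')
print(sp.series(sp.log(1 + sp.exp(x)), x, 0, 12))
# log(2) + x/2 + x**2/8 - x**4/192 + x**6/2880 - 17*x**8/645120 + 31*x**10/14515200 + O(x**12)
```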

Fig. 9: ReLU and softplus

With the standard normal distribution, 99.73% of the values belong to [−3, 3]. When the degree of the polynomial approximation increases, our polynomial approximation improves on [−3, 3] but deteriorates outside [−3, 3].

V. EXPERIMENTATION ON A LIGHT CNN

To begin, we tested our solution on a light CNN (see Figure 10) with the following characteristics:

1) The convolutional layers (Conv1 and Conv2) have respectively 20 and 50 feature maps and their filter sizes are respectively (1 × 5 × 5) and (20 × 5 × 5).
2) Average pooling layers have (2 × 2) window size.
3) The fully connected layers (FC1 and FC2) have respectively 500 and 10 outputs.
4) A batch normalization layer (BN) is present just before the ReLU layer.

This light CNN contains 9 layers and only one ReLU activation layer.

Firstly, we trained this light CNN with the Caffe framework [29] on the MNIST database with the following parameters:
• base learning rate: b_lr = 0.01
• learning rate policy: b_lr · (1 + 10⁻⁴ · iter)^(−0.75)
• momentum: 0.9
• weight decay: 0.0005
• max iter: 10⁴
• solver type: SGD

We obtained an accuracy of 97.95%, which is far from the state of the art for the digit recognition problem (99.77%). This is due to the small size of our light CNN. The goal of this light CNN is to validate our solution by proving that it respects the accuracy requirement. Thus we would like to prove that the replacement of the ReLU layer by a low degree polynomial layer leads to a low accuracy degradation. To limit the accuracy degradation due to this activation layer modification, the polynomial has to approximate the ReLU function very well on the output distribution of the batch normalization layer. To do that, we first analyzed the output distribution of the batch normalization layer (see Figure 11).

As expected, this distribution is close to a standard normal distribution. Then we replaced the ReLU functions by our polynomial approximations and obtained the accuracies of Table II.

Degree   Accuracy
2        97.55%
4        97.84%
6        97.91%

TABLE II: Classification accuracy on the light CNN

As expected, the performances obtained with the private classification (FHE friendly classification with a polynomial activation layer) are similar to the accuracy of 97.95% obtained with the non-private classification (with the ReLU activation layer). Thus, our solution on this light CNN respects the accuracy requirement. In addition, the multiplicative depth with this light CNN is equal to ⌈log₂(deg)⌉ where deg is the degree of the polynomial approximation. Thus our solution applied on this light CNN respects the three requirements: privacy (thanks to FHE), efficiency (with a multiplicative depth of 1, 2 or 3 according to the polynomial approximation degree) and accuracy (thanks to the batch normalization and the accurate polynomial approximation on a standard normal distribution).

VI. EXPERIMENTATION ON A DEEPER CNN

In this section, we tested our solution on a

deeper CNN (see Figure 12) with more hidden layers and more activation layers in order to be closer to state of the art CNNs. This CNN has the following characteristics:

1) In each convolutional layer, the filter size is (n × 3 × 3) with n the number of feature maps of the input layer.

Fig. 10: Our light CNN

Fig. 11: Output distribution of the batch normalization layer

2) Conv1 and Conv2 have 32 feature maps, Conv3 and Conv4 have 64 feature maps, Conv5 and Conv6 have 128 feature maps.

3) Average pooling layers have (2 × 2) window size.
4) The fully connected layers (FC1 and FC2) have respectively 256 and 10 outputs.
5) A batch normalization layer (BN) is present just before each ReLU layer.
6) A dropout is used during the training phase only, to select 50% of the neurons.

This CNN contains 24 layers and six ReLU activation layers.

Firstly, we trained this CNN with the Caffe framework [29] on the MNIST database with the following parameters:
• base learning rate: b_lr = 0.02
• learning rate policy: b_lr · (1 + 10⁻⁴ · iter)^(−0.75) (see the sketch below)
• momentum: 0.9
• weight decay: 0.0005
• max iter: 10⁵
• solver type: SGD
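The learning rate policy above is Caffe's inverse decay ("inv") schedule; a one-line restatement (ours) is:

```python
def inv_lr(base_lr, iteration, gamma=1e-4, power=0.75):
    # Caffe "inv" policy: base_lr * (1 + gamma * iter) ** (-power)
    return base_lr * (1.0 + gamma * iteration) ** (-power)

print(inv_lr(0.02, 0), inv_lr(0.02, 10**5))  # 0.02 at the start, decayed at max iter
```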

We obtained an accuracy of 99.59%, which is similar to the state of the art accuracy (99.77%). Then we replaced the ReLU functions by our polynomial approximations and evaluated the accuracy of these FHE friendly classifications (see Table III).

Degree   Accuracy
2        59.14%
4        97.91%
6        36.94%

TABLE III: Classification accuracy on our CNN

The accuracy degradation is higher than on our light CNN. This degradation can be explained by the number of activation layers in our CNN. During the private classification, some errors appear after the first activation layer due to the replacement of the ReLU functions by our polynomial approximations. These errors lead to a slight distortion of the output distribution of the next batch normalization layer. Our polynomial approximations are not perfectly suited to this new distribution and thus more and bigger errors appear after the second activation layer. The errors spread and strengthen from one layer to the next. To visualize the distribution distortion, we plotted the output distribution of the 6-th batch normalization layer with the degree 2 and degree 4 polynomial approximations (see Figure 13).

Fig. 12: Our CNN

Fig. 13: Output distribution of BN6 with the degree 2 and 4 polynomial approximations

The distribution with the degree 4 polynomial approximation is closer to the standard normal distribution than the distribution with the degree 2 polynomial approximation. That explains the accuracy difference between these two approximations. These two distributions are close to a normal distribution with a standard deviation slightly greater than 1. To improve our accuracy, we suggest building new polynomial approximations learned from a distribution closer to the output distribution of the batch normalization (for instance a normal distribution with a standard deviation slightly higher than 1). Table IV sums up our results with these new polynomial approximations. We limited our polynomial approximations to degrees 2 and 4 to respect the efficiency requirement. Our solution has a multiplicative depth of 6 with the degree 2 polynomial approximation and a multiplicative depth of 12 with the degree 4 polynomial approximation. Both multiplicative depths are reasonable.

Degree   Normal distribution   Accuracy
2        µ = 0, σ = 1.1        87.12%
2        µ = 0, σ = 1.2        90.70%
2        µ = 0, σ = 1.3        82.61%
4        µ = 0, σ = 1.1        98.18%
4        µ = 0, σ = 1.2        97.75%

TABLE IV: Classification accuracy with our new polynomial approximations

To get closer to the non-secure accuracy of 99.59%, we adapted the CNN weights to our polynomial activation function. To do that, we started with the CNN trained with the ReLU activation function. We replaced the ReLU functions by our degree 2 polynomial approximations learned on a normal distribution with µ = 0 and σ = 1.2. Finally we continued the CNN training with a low learning rate (10⁻⁵). With these new learned weights, we obtained an accuracy of 99.28%. Unfortunately this method does not work with degree 4 polynomial approximations because the last training with a degree 4 polynomial activation layer is unstable.

As the distribution after each batch normalization layer depends on the polynomial approximations used, it is difficult to build a polynomial approximation which perfectly fits this distribution. Thus we decided to consider the polynomial approximation coefficients as learnable weights in the last training step. In more detail, as previously, we started with the CNN trained with the ReLU activation function. Then we continued the training after replacing the ReLU functions by our degree 2 polynomial approximation (learned from a normal distribution with µ = 0 and σ = 1.2), whose coefficients now belong to the learnable weights. With this solution, we achieved an accuracy of 99.30%, which is very close to the accuracy of the non-private classification (99.59%).
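One way to realize this last step is to make the polynomial coefficients ordinary parameters of the network. The PyTorch-style sketch below is our interpretation of that idea, not the authors' code; the initial coefficients are the degree-2 values of Table I used as placeholders, since the σ = 1.2 coefficients are not listed in the text.

```python
import torch
import torch.nn as nn

class TrainablePolyAct(nn.Module):
    """Degree-2 activation whose coefficients are learned during fine-tuning."""
    def __init__(self, init=(0.1992, 0.5002, 0.1997)):
        super().__init__()
        # The coefficients become ordinary learnable weights of the CNN.
        self.coeffs = nn.Parameter(torch.tensor(init, dtype=torch.float32))

    def forward(self, x):
        c0, c1, c2 = self.coeffs
        return c0 + c1 * x + c2 * x * x

# Fine-tuning recipe described above: start from the ReLU-trained CNN, replace
# every ReLU layer by a TrainablePolyAct, then continue training with a small
# learning rate so the CNN weights and the polynomial coefficients adapt jointly.
```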

These experiments prove that our solution respects the accuracy requirement. In addition, our last private classification employs only degree 2 polynomial approximations. Thus it respects the efficiency requirement (with a multiplicative depth of 6). Hence, our solution respects the three requirements.

VII. CONCLUSION

Today, convolutional neural networks (especially deep neural networks) provide the best classification methods in areas such as computer vision and speech recognition. These algorithms are often applied on sensitive information and thus require a private implementation. In this area, two challenges have to be solved: privacy preserving learning and privacy preserving classification on CNNs. In this article, we focus on the privacy preserving classification challenge. The main difficulty is to simultaneously respect the following three requirements: privacy, efficiency and accuracy.

To achieve the privacy requirement, our solution is based on FHE. The main difficulty is to obtain a low multiplicative depth classification (to achieve the efficiency requirement) without degrading the accuracy too much. To respect the accuracy requirement, our CNN in the training phase is similar to the state of the art CNN (with the ReLU activation layer). This training CNN can have a high multiplicative depth because we do not apply FHE to it. For the CNN used in the classification phase, we replace the ReLU functions by low degree polynomial approximations. Our private classification accuracy relies on this approximation. Unfortunately, it is not possible to have a good approximation of the ReLU function by a low degree polynomial on the entire real line R. Our key innovation is to combine the polynomial approximation of the ReLU with a batch normalization layer. Thanks to this latter layer, we have a restricted stable distribution at the entry of each activation layer, and thus it is now possible to build a good low degree polynomial approximation of the ReLU on this distribution.

We firstly tested our solution on a light CNN in order to demonstrate the low accuracy degradation between the non-private classification (97.95%) and the private classification (97.91%). Then we tested our solution on a deeper CNN with 24 layers. To respect the accuracy requirement, some tricks have been used: a second training to adapt the CNN weights to the polynomial layers, and a polynomial approximation learned from a normal distribution with a standard deviation slightly higher than 1. We achieved a private classification accuracy of 99.30%, which is better than Cryptonets (98.95%) and close to the non-private accuracy on the same CNN (99.59%). To the best of our knowledge, our solution is the first privacy preserving classification method compatible with deep CNNs.

ACKNOWLEDGMENT

This work has been supported in part by the CRYPTOCOMP FUI17 project. It was also supported by the European Commission through the SECURITY programme under FP7-SEC-2013-1-607049 EKSISTENZ.

REFERENCES

[1] A. Bansal, T. Chen, and S. Zhong. Privacy preserving back-propagation neural network learning over arbitrarily partitioned data. Neural Computing and Applications, 20(1):143–150, 2011.
[2] M. Barni, C. Orlandi, and A. Piva. A privacy-preserving protocol for neural-network-based computation. In Proceedings of the 8th Workshop on Multimedia & Security, MM&Sec 2006, pages 146–151, 2006.
[3] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, Second Edition, pages 437–478, 2012.
[4] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
[5] D. Boneh, E. Goh, and K. Nissim. Evaluating 2-DNF formulas on ciphertexts. In Theory of Cryptography Conference, TCC 2005, pages 325–341, 2005.
[6] J. W. Bos, K. E. Lauter, J. Loftus, and M. Naehrig. Improved security for a ring-based fully homomorphic encryption scheme. In Cryptography and Coding, IMACC 2013, pages 45–64, 2013.
[7] J. W. Bos, K. E. Lauter, and M. Naehrig. Private predictive analysis on encrypted medical data. Journal of Biomedical Informatics, 50:234–243, 2014.
[8] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser. Machine learning classification over encrypted data. IACR Cryptology ePrint Archive, 2014:331, 2014.
[9] Z. Brakerski, C. Gentry, and V. Vaikuntanathan. (Leveled) fully homomorphic encryption without bootstrapping. In Innovations in Theoretical Computer Science 2012, pages 309–325, 2012.
[10] Z. Brakerski and V. Vaikuntanathan. Efficient fully homomorphic encryption from (standard) LWE. In FOCS 2011, 2011.
[11] Z. Brakerski and V. Vaikuntanathan. Fully homomorphic encryption from ring-LWE and security for key dependent messages. In Advances in Cryptology - CRYPTO 2011, 2011.
[12] T. Chen and S. Zhong. Privacy-preserving back-propagation neural network learning. IEEE Transactions on Neural Networks, 20(10):1554–1564, 2009.
[13] I. Damgård and M. Jurik. A generalisation, a simplification and some applications of Paillier's probabilistic public-key system. In Public Key Cryptography, PKC 2001, pages 119–136, 2001.
[14] I. Damgård, J. Groth, and G. Salomonsen. The theory and implementation of an electronic voting system. In Secure Electronic Voting, 2003.
[15] I. Damgård, M. Jurik, and J. B. Nielsen. A generalization of Paillier's public-key system with applications to electronic voting. International Journal of Information Security, 9(6), 2010.
[16] T. El Gamal. A public key cryptosystem and a signature scheme based on discrete logarithms. In Advances in Cryptology, CRYPTO '84, 1984.
[17] C. Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009. crypto.stanford.edu/craig.
[18] C. Gentry, S. Halevi, and N. P. Smart. Homomorphic evaluation of the AES circuit. In Advances in Cryptology - CRYPTO 2012, pages 850–867, 2012.
[19] C. Gentry and S. Halevi. Implementing Gentry's fully-homomorphic encryption scheme. In Advances in Cryptology - EUROCRYPT 2011, 2011.
[20] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. E. Lauter, M. Naehrig, and J. Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In ICML 2016, pages 201–210, 2016.
[21] P. Golik, P. Doetsch, and H. Ney. Cross-entropy vs. squared error training: a theoretical and experimental comparison. In INTERSPEECH 2013, pages 1756–1760, 2013.
[22] T. Graepel, K. E. Lauter, and M. Naehrig. ML confidential: Machine learning on encrypted data. In Information Security and Cryptology - ICISC 2012, pages 1–21, 2012.
[23] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP 2013, pages 6645–6649, 2013.
[24] R. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947–951, 2000.
[25] S. Halevi and V. Shoup. HElib - an implementation of homomorphic encryption. https://github.com/shaih/HElib/.
[26] S. Halevi and V. Shoup. Algorithms in HElib. In Advances in Cryptology - CRYPTO 2014, Part I, pages 554–571, 2014.
[27] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[28] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML 2015, pages 448–456, 2015.
[29] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 1106–1114, 2012.
[31] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, page 319, 1999.
[32] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
[33] M. Naor and B. Pinkas. Oblivious transfer and polynomial evaluation. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing (STOC), pages 245–254, 1999.
[34] M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
[35] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In 2013 IEEE Symposium on Security and Privacy, pages 334–348, 2013.
[36] C. Orlandi, A. Piva, and M. Barni. Oblivious neural network computing via homomorphic encryption. EURASIP Journal on Information Security, 2007, 2007.
[37] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology - EUROCRYPT '99, pages 223–238, 1999.
[38] R. L. Rivest, L. Adleman, and M. L. Dertouzos. On data banks and privacy homomorphisms. Foundations of Secure Computation, Academia Press, 1978.
[39] N. Schlitter. A protocol for privacy preserving neural network learning on horizontally partitioned data. In Privacy in Statistical Databases, PSD 2008, 2008.
[40] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 815–823, 2015.
[41] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
[42] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321, 2015.
[43] N. P. Smart and F. Vercauteren. Fully homomorphic SIMD operations. IACR Cryptology ePrint Archive, 2011:133, 2011.
[44] N. P. Smart and F. Vercauteren. Fully homomorphic encryption with relatively small key and ciphertext sizes. In Public Key Cryptography - PKC 2010, 2010.
[45] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012.
[46] M. van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan. Fully homomorphic encryption over the integers. In Advances in Cryptology - EUROCRYPT 2010, 2010.
[47] D. Wu and J. Haven. Using homomorphic encryption for large scale statistical analysis. 2012.
[48] P. Xie, M. Bilenko, T. Finley, R. Gilad-Bachrach, K. E. Lauter, and M. Naehrig. Crypto-nets: Neural networks over encrypted data. CoRR, abs/1412.6181, 2014.
[49] A. C. Yao. Protocols for secure computations (extended abstract). In 23rd Annual Symposium on Foundations of Computer Science (FOCS), pages 160–164, 1982.
[50] J. Yuan and S. Yu. Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Transactions on Parallel and Distributed Systems, 25(1):212–221, 2014.
[51] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, Part I, pages 818–833, 2014.
[52] Q. Zhang, L. T. Yang, and Z. Chen. Privacy preserving deep computation model on cloud for big data feature learning. IEEE Transactions on Computers, 65(5):1351–1362, 2016.

APPENDIX A
FROM NEURAL NETWORKS TO CONVOLUTIONAL NEURAL NETWORK

The simplest neural network is the perceptron (see Figure 14), which takes some binary inputs ~x and produces one binary output (aka label). To each bit coordinate of ~x, say x_j, is attached a weight ω_j ∈ R expressing the importance of that coordinate for the output value. During the training phase, the weights are learned together with a bias β ∈ R (e.g. thanks to linear regression or gradient descent processing). The classification of a new input ~x (i.e. the assignment of a label h(~x)) is afterwards done by applying the following perceptron rule:

$$h(\vec{x}) = \begin{cases} 0 & \text{if } \sum_j \omega_j x_j + \beta \le 0,\\ 1 & \text{if } \sum_j \omega_j x_j + \beta > 0. \end{cases}$$

Fig. 14: Perceptron (inputs x_1, x_2, x_3 with weights ω_1, ω_2, ω_3, producing the output h(~x)).

The use of a single perceptron only serves to classify data that are linearly separable. The multilayer perceptron (MLP) aims to deal with this limitation by combining several perceptrons and adding hidden/internal layers with a non-linear activation function such as the sigmoid f(x) = (1 + e^{-x})^{-1} or, more recently, the rectified linear unit (ReLU) f(x) = max(0, x). In an MLP, the output of a perceptron is f(∑_j ω_j x_j + β), where f is the non-linear activation function. To model high-level abstractions of data, the MLP is composed of several successive layers (one input layer, one output layer and one or more internal, aka hidden, layers) where each node in one layer is linked to every node in the next layer (see Figure 15). The output layer h(~x) contains as many neurons as there are labels and these neurons represent scores for each label. For instance, for digit recognition, the output layer contains 10 neurons (one for each possible digit) and each output neuron represents the score for one digit label. The index of the largest output neuron indicates the output label. In order to interpret the output layer as probabilities, a softmax activation layer is often added at the end of an MLP. This layer transforms a layer containing n neurons x_1, ..., x_n into a layer containing the same number of neurons y_1, ..., y_n according to:

$$\forall i \in [1..n], \quad y_i = h(x_i) := \frac{e^{x_i}}{\sum_{j \in [1..n]} e^{x_j}}.$$

Fig. 15: Multilayer perceptron (input layer, 1st hidden layer, 2nd hidden layer, output layer; the inputs x_1, x_2, x_3 of ~x feed forward to the output h(~x)).
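For readers who prefer code, here is a minimal feed-forward sketch of a fully connected MLP with ReLU hidden layers and a softmax output, matching the equations above; the layer sizes and random parameters are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def relu(x):
    # ReLU activation: f(x) = max(0, x)
    return np.maximum(0.0, x)

def softmax(x):
    # Softmax: y_i = e^{x_i} / sum_j e^{x_j} (shifted by max(x) for numerical stability)
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mlp_forward(x, layers):
    """Feed-forward through fully connected layers given as (weights, bias) pairs."""
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:      # hidden layers use the non-linear activation
            x = relu(x)
    return softmax(x)                # output scores interpreted as probabilities

# Illustrative 3-64-10 network with random, untrained parameters (hypothetical).
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(64, 3)), np.zeros(64)),
          (rng.normal(size=(10, 64)), np.zeros(10))]
probs = mlp_forward(np.array([0.2, 0.8, 0.5]), layers)
print(probs.argmax())                # index of the largest output = predicted label
```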

APPENDIX B
TRAINING AND CLASSIFICATION PHASES

Like other classical neural networks, a CNN is composed of two stages: the training phase from a labelled database and the classification of new data. Before the training phase, the data scientist has to choose the architecture of the neural network (number of layers, number of neurons per layer, activation function, etc.) and some parameters for the learning algorithm called hyper-parameters (e.g. number of epochs, batch size, learning rate, etc.). Then the neural parameters (weights and biases) are inferred during the training phase from the labelled database and are adapted to the training objective (recognize/classify digits, faces, objects, etc.). They are initialized with random values and are afterwards updated by applying several times the same process. For each epoch (one iteration over the entire training database), the training database is randomly partitioned into small sets of data called batches. Each batch has the same number of data, called the batch size. For each batch, its data go through the neural network to obtain their output scores with the current neural network. This step is called feed-forward. Then an error score is computed to measure the mismatch between the output scores and the expected ones on this batch. We denote the expected scores of one training datum by an n-dimensional vector where n is the number of possible labels. This expected vector contains a 1 in the position corresponding to its label and zeros in the remaining positions. For instance, in digit recognition, the expected vector for the digit 0 is y = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) and the one for the digit 6 is y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0). The mean-squared error is widely used; it is equal to

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( h(\vec{x}_i)_j - y_{i,j} \right)^2,$$

where m denotes the batch size, n the number of possible labels, y_{i,j} is the j-th value of the expected labels vector of the i-th training datum and h(~x_i)_j is the score contained in the j-th output neuron of the CNN for the i-th input vector ~x_i. Today, the cross-entropy cost function, combined with the softmax activation function, is sometimes preferred to the MSE (see e.g. [21]). The cross-entropy cost is equal to

$$\mathrm{CE} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} y_{i,j} \log h(\vec{x}_i)_j.$$

Finally, the error is reduced by modifying some parameters according to gradient descent [34, Chapter 2]. The learning rate controls the degree of modification of the parameters. This parameter updating is the well-known back-propagation step. Then the feed-forward and back-propagation steps are applied on the next batch. Once the pre-defined number of epochs is reached, the training is finished and the last updates of the weights and of the biases are stored; the neural network is ready to assign labels to never-seen data. The classification of a new datum ~x now simply consists in applying the feed-forward step.
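To make the two cost functions concrete, the sketch below computes the batch MSE and cross-entropy from one-hot expected vectors; it is a minimal illustration with made-up scores, not code from the paper.

```python
import numpy as np

def one_hot(label, n=10):
    # Expected vector: 1 at the label position, 0 elsewhere.
    y = np.zeros(n)
    y[label] = 1.0
    return y

def mse(scores, expected):
    # MSE = (1/m) * sum_i sum_j (h(x_i)_j - y_{i,j})^2
    m = scores.shape[0]
    return np.sum((scores - expected) ** 2) / m

def cross_entropy(scores, expected, eps=1e-12):
    # CE = -(1/m) * sum_i sum_j y_{i,j} * log h(x_i)_j
    m = scores.shape[0]
    return -np.sum(expected * np.log(scores + eps)) / m

# Hypothetical batch of two digits (true labels 0 and 6) with softmax-like output scores.
expected = np.stack([one_hot(0), one_hot(6)])
scores = np.stack([np.full(10, 0.05) + 0.5 * one_hot(0),
                   np.full(10, 0.05) + 0.5 * one_hot(6)])
print(mse(scores, expected), cross_entropy(scores, expected))
```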

The main difficulty for a data scientist is to choose, during the training phase, all the parameters discussed previously (in particular the neural network architecture and the hyper-parameters) in order to achieve a satisfying accuracy. Regarding the hyper-parameters, some heuristics [3], [4], [45] have been suggested to improve them from one learning run to the next. They are however rarely sufficient alone and the data scientist's know-how is often crucial: as mentioned in [34, Chapter 3], "[the] difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners". Regarding the neural network architecture, the authors of [51] try to understand the influence of each layer on the neural network accuracy in order to build new neural networks that outperform current ones. To do that, they introduce a visualization technique that gives insight into the influence of intermediate layers. They apply this technique to the Krizhevsky et al. architecture [30] and deduce from their analysis some small modifications of the architecture that improve its accuracy. Such visualization techniques may yield modest improvements of an existing neural network architecture, but they neither permit outstanding improvements nor deliver the information needed to build a neural network architecture from scratch.

APPENDIX C
PRIVACY-PRESERVING PROCESSING OF THE TRAINING PHASE

The two main security problems in the area of deep neural networks are privacy-preserving training and privacy-preserving classification. The training step (without security) is not easy for a data scientist due to all the factors he has to choose: the architecture of the neural network (number of layers, number of neurons per layer, activation functions, etc.), the initialization of the weights and biases, and the hyper-parameters (parameters for the learning algorithm). Due to all these factors, it is almost impossible to obtain a trained neural network with sufficient accuracy at the first try (with the first trained neural network). The data scientist has to analyse, during the training phase, the progress of his neural network in order to understand why he fails, and then modify these factors to obtain better accuracy on his next try. This analysis needs to be made on plaintext observations (without encryption), which makes privacy-preserving training almost impossible for complex classification problems. Several articles [1], [12], [39], [50], [52] suggest methods to secure the first try of the data scientist, but they only apply to simple classification problems and become ineffective when the first try is not sufficient (which is often the case in practice).

Articles on the privacy-preserving learning subject can be split according to the database fragmentation (the decomposition of the database into multiple smaller units called fragments). In a relational database, data are structured in the form of a matrix: each row corresponds to a subject (e.g. a patient) and each column to an attribute (e.g. age, race, blood type). There exist three types of database fragmentation: horizontal, vertical or arbitrary fragmentation. A database is horizontally partitioned if it is split into rows. That means that if a fragment contains one attribute of one subject, it also contains all the attributes of this subject (the entire row). On the contrary, a database is vertically partitioned if it is split into columns. That means that if a fragment contains one attribute of one subject, it also contains the values of this attribute for all subjects (the entire column). Finally, a database is arbitrarily partitioned if it is a mix of vertical and horizontal fragmentations.
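As a toy illustration (not from the paper), the snippet below splits a small subject-by-attribute matrix horizontally (by rows/subjects) and vertically (by columns/attributes); the data values and attribute names are made up.

```python
import numpy as np

# Rows = subjects (patients), columns = attributes (e.g. age, blood pressure, glucose).
database = np.array([[34, 120, 5.1],
                     [58, 140, 6.3],
                     [47, 130, 5.7],
                     [29, 110, 4.9]])

# Horizontal partitioning: each party holds entire rows (all attributes of its subjects).
party_A_rows, party_B_rows = database[:2, :], database[2:, :]

# Vertical partitioning: each party holds entire columns (one attribute for all subjects).
party_A_cols, party_B_cols = database[:, :1], database[:, 1:]

print(party_A_rows.shape, party_B_rows.shape)  # (2, 3) (2, 3)
print(party_A_cols.shape, party_B_cols.shape)  # (4, 1) (4, 2)
```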

In 2008, Schlitter [39] introduced a privacy-preserving neural network learning protocol for horizontally partitioned databases. This protocol, based on additive secret sharing, enables several parties to jointly process a neural network learning without leaking information about their respective data. In this protocol, each party performs almost all the computations (excluding weight updating) on their own data in the plaintext domain, and the parties combine their results to update the weights thanks to the secret sharing properties. This protocol suffers from one major drawback: at each iteration, all the updated weights are revealed to all the participants, which may leak information about the training data. To overcome this, Shokri and Shmatikov [42] enhanced the scheme by proposing a privacy-preserving neural network learning protocol for horizontally partitioned databases with reduced information leakage. In their protocol, each party obtains some parameters by performing computations on their own database but, instead of sharing all parameters with the other parties, they only share a small fraction of them. Then the weights are updated from these shared parameters. This method involves a trade-off between classification accuracy and privacy: the higher the number of shared parameters, the better the classification accuracy but the lower the privacy.
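The following sketch is an assumption-laden simplification, not the authors' implementation of [42]: it only conveys the idea of sharing a fraction of the locally computed gradient updates, here the largest-magnitude entries, so that raising the shared fraction improves accuracy at the cost of privacy.

```python
import numpy as np

def select_shared_gradients(gradients, fraction=0.1):
    """Keep only the largest-magnitude fraction of gradient entries; zero out the rest.

    A larger `fraction` shares more information with the other parties
    (better accuracy, less privacy)."""
    flat = gradients.ravel()
    k = max(1, int(fraction * flat.size))
    threshold = np.sort(np.abs(flat))[-k]      # k-th largest absolute value
    return np.where(np.abs(gradients) >= threshold, gradients, 0.0)

# Hypothetical local gradient matrix of one party.
local_grad = np.random.default_rng(1).normal(size=(4, 5))
print(select_shared_gradients(local_grad, fraction=0.2))
```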

In vertically partitioned databases, the elements of one row belong to different parties. It is therefore no longer possible to process the feed-forward step in the plaintext domain, because it takes a full row as input. Chen and Zhong [12] proposed a scheme that provides a privacy-preserving neural network learning protocol for vertically partitioned databases. This scheme protects training data and intermediate results, and supports only two parties. This protocol is also based on additive secret sharing. All data and parameters are additively shared between the two parties and the computations are performed thanks to the secret sharing properties. Only the sigmoid activation function cannot be efficiently evaluated with the secret sharing properties, due to its high non-linearity. To overcome this issue, the sigmoid activation function is replaced by a piecewise linear approximation. Bansal et al. [1] enhanced this scheme by proposing a privacy-preserving neural network learning protocol for arbitrarily partitioned databases between two parties. Like the work of Chen and Zhong, this protocol only supports the two-party scenario. Unfortunately, directly extending these protocols to the multi-party scenario drastically increases the communication and computation complexity.
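To illustrate what a piecewise linear replacement for the sigmoid can look like (the exact segments used in [12] may differ; this is an assumed three-segment variant chosen only for the example):

```python
import numpy as np

def sigmoid_piecewise_linear(x, a=4.0):
    # Three-segment approximation of the sigmoid:
    # 0 for x <= -a, 1 for x >= a, linear in between.
    return np.clip(0.5 + x / (2 * a), 0.0, 1.0)

xs = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(sigmoid_piecewise_linear(xs))   # [0.    0.375 0.5   0.625 1.   ]
```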

To overcome this limitation, Yuan and Yu [50] proposed to securely outsource the computation to a cloud thanks to homomorphic encryption. Each party encrypts his data before sending them to the cloud. Then the cloud performs most of the learning operations over the ciphertexts and returns the encrypted results to the parties. To be efficiently evaluated over the ciphertexts, the sigmoid function is approximated by a low-degree polynomial obtained through its Maclaurin series. In addition, Yuan and Yu suggest to use the BGN homomorphic encryption scheme [5], which supports one multiplication and an unlimited number of additions. Thus the participants have to decrypt and re-encrypt intermediate values after each multiplication. For privacy preservation, the parties decrypt only random shares of the intermediate values thanks to a secret sharing method. This protocol is not really efficient for deep neural networks because of the multitude of decryptions and re-encryptions. Zhang et al. [52] enhanced this protocol by using a more efficient homomorphic encryption scheme: the BGV scheme [9], which enables an unlimited number of multiplications and additions over ciphertexts. But the efficiency of the homomorphic computations depends on the multiplicative depth. To avoid too large a multiplicative depth, after each iteration the updated weights are sent to the parties to be decrypted and re-encrypted. Thus the communication complexity of the solution is very high.
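To give an idea of the kind of approximation involved, the snippet below compares the sigmoid with its low-degree Maclaurin expansion σ(x) ≈ 1/2 + x/4 − x³/48 + x⁵/480; the degree and the comparison points are illustrative choices, not necessarily those of [50]. Such a polynomial uses only additions and multiplications, so it can be evaluated under homomorphic encryption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_maclaurin(x):
    # Degree-5 Maclaurin (Taylor at 0) expansion of the sigmoid.
    return 0.5 + x / 4 - x**3 / 48 + x**5 / 480

xs = np.linspace(-2, 2, 5)
print(np.round(sigmoid(xs), 4))
print(np.round(sigmoid_maclaurin(xs), 4))   # close for small |x|, degrades for large |x|
```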

In this article, we will not study the privacy-preserving learning problem but will focus on the privacy-preserving classification problem.