Towards Practical Differentially Private Convex Optimization
Roger Iyengar, Carnegie Mellon University
Joseph P. Near, University of California, Berkeley
Dawn Song, University of California, Berkeley
Om Thakkar, Boston University
Abhradeep Thakurta, University of California, Santa Cruz
Lun Wang, Peking University
Abstract—Building useful predictive models often involves learning from sensitive data. Training models with differential privacy can guarantee the privacy of such sensitive data. For convex optimization tasks, several differentially private algorithms are known, but none has yet been deployed in practice.

In this work, we make two major contributions towards practical differentially private convex optimization. First, we present Approximate Minima Perturbation, a novel algorithm that can leverage any off-the-shelf optimizer. We show that it can be employed without any hyperparameter tuning, thus making it an attractive technique for practical deployment. Second, we perform an extensive empirical evaluation of the state-of-the-art algorithms for differentially private convex optimization, on a range of publicly available benchmark datasets, and real-world datasets obtained through an industrial collaboration. We release open-source implementations of all the differentially private convex optimization algorithms considered, and benchmarks on as many as nine public datasets, four of which are high-dimensional.
I. INTRODUCTION
Building useful predictive models often involves learning from sensitive data. To preserve the privacy of such sensitive data, many systems first carry out the learning task over the data, and then release just the final "learned" model. However, as many recent works [1], [2], [3] indicate, a model can leak information about the sensitive data it was trained on, even though the data might have never been made public. To prevent such information leakage, differential privacy (DP) [4], [5] has recently been used as a gold standard for performing learning tasks over sensitive data. It has also been adopted by large-scale corporations like Google [6], Apple [7], etc. Intuitively, DP prevents an adversary from confidently making any conclusions about whether a sample was used in training a model, even while having access to the model and any external side information.
The authors are ordered alphabetically.
With the recent advancements in machine learning and big data, private convex optimization has proven to be useful for large-scale learning over sensitive user data that has been collected by organizations. The objective of this work is to provide insight into practical differentially private convex optimization, with a specific focus on the classical technique of objective perturbation [8], [9]. Our main technical contribution is to design a new algorithm for private convex optimization that is amenable to real-world scenarios, and to provide privacy and utility guarantees for it. In addition, we conduct a broad empirical evaluation of approaches for private convex optimization, including our new approach. Our evaluation is more extensive than prior works [8], [10], [11], [12]; it includes nine public datasets, four of which are high-dimensional. Apart from these, we also consider four real-world use cases, obtained in collaboration with Uber Technologies, Inc. We provide advice and resources for practitioners, including open-source implementations [13] of the algorithms evaluated, and benchmarks on all the public datasets considered.
A. Objective Perturbation and its Practical Feasibility
We focus our attention on the technique of objective perturbation, because prior works [8], [9], as well as our own preliminary empirical results, have hinted at its superior performance. The standard technique of objective perturbation [8] consists of a two-stage process: "perturbing" the objective function by adding a random linear term, and releasing the minima of the perturbed objective. It has been shown [8], [9] that releasing such a minima is sufficient for achieving DP guarantees.

However, objective perturbation provides privacy guarantees only if the output of the mechanism is the exact minima of the perturbed objective. Practical algorithms for convex optimization often involve the use of first-order iterative methods, such as gradient descent or stochastic gradient descent (SGD), due to their scalability. However, such methods typically offer convergence rates that depend on the number of iterations carried out by the method, so they are not guaranteed to reach the exact minima in finite time. As a result, it is not clear if objective perturbation in its current form can be applied in a practical setting, where one is usually constrained by resources such as computing power, and reaching the minima might not be feasible.

We address the fundamental question of whether an alternative to standard objective perturbation exists which provides privacy and utility guarantees when the released model is not necessarily the minima of the perturbed objective. In this work, we answer this question in the positive: it is possible to get privacy and utility guarantees even if the system releases a noisy "approximate" minima of the perturbed objective. A major implication of this result is that one can use first-order iterative methods in combination with such an approach. This can be highly beneficial in terms of the time taken to obtain a private solution, as first-order methods often succeed in finding a "good" solution very quickly.
B. Our Approach: Approximate Minima Perturbation
We propose Approximate Minima Perturbation (AMP), a strengthened alternative to objective perturbation that provides privacy and utility guarantees when the released model is a noisy "approximate" minima for the perturbed objective. We measure the convergence of a model in terms of the Euclidean norm of the gradient of the perturbed objective at the model. The scale of the noise added to the approximate minima contains a parameter representing the maximum tolerable gradient norm. This results in a trade-off between the gradient norm bound (and consequently, the amount of noise to be added to the approximate minima), and the difficulty, in practice, of obtaining an approximate minima within the norm bound. This can be useful in settings where limited computing power is available, which can in turn act as a guide for setting an appropriate norm bound. We note that if the norm bound is set to zero, then this approach reduces to the setting of standard objective perturbation. Approximate Minima Perturbation also brings with it certain distinct advantages, which we describe next.
Approximate Minima Perturbation works for all convex objective functions: While previous works [8], [9] provide privacy guarantees for objective perturbation only when the objective is a loss function of a generalized linear model¹, the guarantees provided for AMP hold for any objective function.² In both cases, the objective functions are assumed to possess standard properties like Lipschitz continuity and smoothness.

Approximate Minima Perturbation is the first feasible approach that can leverage any off-the-shelf optimizer: AMP can accommodate any off-the-shelf optimizer as a black box for carrying out its optimization step. This enables a simple implementation that inherits the scalability properties of the optimizer used, which can be particularly important in situations where high-performance optimizers are available. AMP is the only known feasible algorithm for DP convex optimization that allows the use of any off-the-shelf optimizer.

Approximate Minima Perturbation has a competitive hyperparameter-free variant: To ensure privacy for an algorithm, its hyperparameters must be chosen either independently of the data, or by using a differentially private hyperparameter tuning algorithm. Previous work [8], [14], [15] has shown this to be a challenging task. AMP has only four hyperparameters: one related to the Lipschitz continuity of the objective, and the other three related to splitting the privacy budget within the algorithm. In Section V-A, we present a data-independent method for setting all of them. Our empirical evaluation demonstrates that the resulting hyperparameter-free variant of AMP yields accuracy comparable to the standard variant with its hyperparameters tuned in a data-dependent manner (which may be non-private).
C. Empirical Evaluation, and Resources for Practitioners
This work also reports on an extensive and broad empirical study of the state-of-the-art differentially private convex optimization techniques. In addition to AMP and its hyperparameter-free variant, we evaluate four existing algorithms: private gradient descent with minibatching [16], [15], both variants (convex and strongly convex) of the private Permutation-based SGD algorithm [12], and the private Frank-Wolfe algorithm [17].
Our evaluation is the largest to date, including a total of 13 datasets. We include datasets with a variety of different properties, including four high-dimensional datasets and four real-world use cases represented by datasets obtained in collaboration with Uber.

The results of our empirical evaluation demonstrate three key findings. First, we confirm the expectation that the cost of privacy decreases as dataset size increases.
¹The loss function of a generalized linear model is parameterized by an inner product of the feature vector of the data, and the model.

²Our analysis, in particular, extends to objective perturbation as well.
For all the real-world use cases in our evaluation, we obtain differentially private models that achieve an accuracy within 4% of the non-private baseline, even for very conservative settings of the privacy parameters. For reasonable values of the privacy parameters, the accuracy of the best private model is within 2% of the baseline for two of these datasets, essentially identical to the baseline for one of them, and even slightly higher than the baseline for one of the datasets! This provides empirical evidence to further the claims of previous works [18], [19] that DP can also act as a type of regularization, reducing the generalization error.

Second, our results demonstrate a general ordering of algorithms in terms of empirical accuracy. Our results show that AMP generally outperforms all the other algorithms across all the considered datasets. Under specific conditions, like high dimensionality of the dataset and sparsity of its optimal predictive model, we see that private Frank-Wolfe provides the best performance.

Third, our results show that a hyperparameter-free variant of AMP achieves nearly the same accuracy as the standard variant with its hyperparameters tuned in a data-dependent manner. Approximate Minima Perturbation is therefore simple to deploy in practice, as it can leverage any off-the-shelf optimizer, and it has a competitive variant that does not require any hyperparameter tuning.

We provide an open-source implementation [13] of the algorithms evaluated, including our Approximate Minima Perturbation, and a complete set of benchmarks used in producing our empirical results. In addition to enabling the reproduction of our results, this set of benchmarks will provide a standard point of reference for evaluating private algorithms proposed in the future. Our open-source release represents the first benchmark for differentially private convex optimization.
D. Main Contributions
The main contributions of this work are as follows:
• We propose Approximate Minima Perturbation, a strengthened alternative to objective perturbation that provides privacy guarantees even for an approximate minima of the perturbed objective, and therefore allows the use of any off-the-shelf optimizer. No previous approach provides this capability. Compared to previous approaches, AMP also provides improved utility in practice, and works with any convex loss function under standard assumptions.
• We conduct the largest empirical study to date of state-of-the-art DP convex optimization approaches, including as many as nine public datasets, four of which are high-dimensional. Our results demonstrate that AMP generally provides the best accuracy.
• We evaluate DP convex optimization on four real-world use cases, obtained in collaboration with Uber. Our results suggest that for the large-scale datasets used in practice, privacy-preserving models can obtain essentially the same accuracy as non-private models for reasonable values of the privacy parameters. In one case, we show that a DP model achieves a higher accuracy than the non-private baseline.
• We present a competitive hyperparameter-free variant of AMP, allowing the approach to be deployed without the need for tuning on publicly available datasets, or by a DP hyperparameter tuning algorithm.
• We release open-source implementations [13] of all the algorithms we evaluate, and the first benchmarks for differentially private convex optimization algorithms on as many as nine public datasets.
II. RELATED WORK
Convex optimization in the non-private setting has a long history, with several excellent resources providing a good overview [20], [21]. Many recent advances have also been made in the field of convex Empirical Risk Minimization (ERM). A comprehensive list of works on stochastic convex ERM is provided in [22], whereas [23] provides dimension-dependent lower bounds on the sample complexity required for stochastic convex ERM and uniform convergence.

A large body of existing work examines the problem of differentially private convex ERM. The techniques of output perturbation and objective perturbation were first proposed in [8]. Near dimension-independent risk bounds for both techniques were provided in [11]; however, the bounds are achieved for the standard settings of the techniques, which provide privacy guarantees only for the minima of their respective objective functions. A private SGD algorithm was first given in [10], and optimal risk bounds were provided for a later version of private SGD in [16]. A variant of output perturbation was proposed in [12] that requires the use of permutation-based SGD, and reduces sensitivity using properties of that algorithm. Several works [9], [24] deal with DP convex ERM in the setting of high-dimensional sparse regression, but the algorithms in these works also require obtaining the minima. The Frank-Wolfe algorithm [25] has also seen a resurgence lately [26], [27], [28], [29]. We study the performance of a DP version of Frank-Wolfe [17] in our empirical analysis.

There are also works in DP convex optimization apart from the ERM model. Many recent works [30], [31], [32] examine the setting of online learning, whereas high-dimensional kernel learning is considered in [33]; these settings are quite different from ours, and the results are incomparable. There have also been works [34], [35] on DP regression analysis, a subset of DP convex optimization. However, the privacy guarantees in these hold only if the algorithms are able to find some minima. There have also been advances in DP non-convex optimization, including deep learning [36], [15]. A broad survey of works in DP machine learning is provided in [37].

Previous empirical evaluations have provided limited insight into the practical performance of the various algorithms for DP convex optimization. Output perturbation and objective perturbation are evaluated on two datasets in [8] and [11], and private SGD is evaluated in [10]. Wu et al. [12] perform the broadest comparison, including their own approach and two variants of private SGD [10], [16] on six datasets, but they do not include objective perturbation. No prior evaluation considers the state-of-the-art algorithms from all three major lines of work in the area (output perturbation, objective perturbation, and private SGD). Moreover, none of the prior evaluations considers high-dimensional data; a maximum of 75 dimensions is considered in [12].

Our empirical evaluation is the most complete to date. We evaluate state-of-the-art algorithms from all 3 lines of work on 9 public datasets and 4 real-world use cases. We consider low-dimensional and high-dimensional (as many as 47,236 dimensions) datasets. In addition, we release open-source implementations for all algorithms, and benchmarking scripts to reproduce our results [13].
III. PRELIMINARIES
In this section, we formally state the notation, definitions, and results used in this work.
Given an n-element dataset D = {d₁, d₂, . . . , dₙ}, s.t. dᵢ ∈ 𝒟 for i ∈ [n], the objective is to get a model θ from the following unconstrained optimization problem:

θ ∈ arg min_{θ∈ℝᵖ} L(θ; D),

where L(θ; D) = (1/n)·∑ᵢ₌₁ⁿ ℓ(θ; dᵢ) is the empirical risk, p > 0, and ℓ(θ; dᵢ) is defined as a loss function for dᵢ that is convex in the first parameter θ ∈ ℝᵖ. This formulation falls under the framework of ERM, which is useful in various settings, including the widely applicable problem of classification in machine learning via linear regression, logistic regression, or support vector machines. The notation ‖x‖ is used to represent the L2-norm of a vector x. Next, we define certain basic properties of functions that will be helpful in further sections.
Definition III.1. A function f : ℝᵖ → ℝ:

• is a convex function if for all θ₁, θ₂ ∈ ℝᵖ, f(θ₁) − f(θ₂) ≥ ⟨∇f(θ₂), θ₁ − θ₂⟩.
• is a ξ-strongly convex function if for all θ₁, θ₂ ∈ ℝᵖ, f(θ₁) ≥ f(θ₂) + ⟨∇f(θ₂), θ₁ − θ₂⟩ + (ξ/2)·‖θ₁ − θ₂‖², or equivalently, ⟨∇f(θ₁) − ∇f(θ₂), θ₁ − θ₂⟩ ≥ ξ‖θ₁ − θ₂‖².
• has Lq-Lipschitz constant L if for all θ₁, θ₂ ∈ ℝᵖ, |f(θ₁) − f(θ₂)| ≤ L·‖θ₁ − θ₂‖_q.
• is β-smooth if for all θ₁, θ₂ ∈ ℝᵖ, ‖∇f(θ₁) − ∇f(θ₂)‖ ≤ β·‖θ₁ − θ₂‖.
To establish the notion of DP, we first define neighboring datasets. We will refer to a pair of datasets D, D′ ∈ 𝒟ⁿ as neighbors if D′ can be obtained from D by modifying one sample dᵢ ∈ D for some i ∈ [n].

Definition III.2 ((ε, δ)-Differential Privacy [4], [5]). A (randomized) algorithm M with input domain 𝒟ⁿ and output range R is (ε, δ)-differentially private if for all pairs of neighboring inputs D, D′ ∈ 𝒟ⁿ, and every measurable S ⊆ R, we have (with the probability taken over the coin flips of M):

Pr(M(D) ∈ S) ≤ e^ε · Pr(M(D′) ∈ S) + δ.

One of the most common techniques for achieving differential privacy is the Gaussian mechanism, for which we first need the L2-sensitivity of a function.
Definition III.3 (L2-sensitivity). A function f : 𝒟ⁿ → ℝᵖ has L2-sensitivity Δ if

max_{neighbors D, D′ ∈ 𝒟ⁿ} ‖f(D) − f(D′)‖ = Δ.
Lemma III.1 (Gaussian mechanism [38]). If a function f : 𝒟ⁿ → ℝᵖ has L2-sensitivity Δ, then the mechanism M, which on input D ∈ 𝒟ⁿ outputs f(D) + b, where b ∼ N(0, σ²I_{p×p}) and

σ = (Δ/ε)·(1 + √(2 log(1/δ))),

satisfies (ε, δ)-differential privacy. Here, N(0, σ²I_{p×p}) denotes the p-dimensional zero-mean Gaussian distribution with each dimension having variance σ².
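For concreteness, the following is a minimal Python sketch of the Gaussian mechanism of Lemma III.1; the function name and the example query are illustrative assumptions, not part of our released code.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta):
    """Release value + N(0, sigma^2 I), with sigma calibrated as in Lemma III.1."""
    sigma = (sensitivity / eps) * (1.0 + np.sqrt(2.0 * np.log(1.0 / delta)))
    return value + np.random.normal(scale=sigma, size=value.shape)

# Example: privately release the mean of n vectors, each with L2-norm at most 1.
# Modifying one sample changes the mean by at most 2/n, so the L2-sensitivity is 2/n.
n, p = 10000, 20
X = np.random.randn(n, p)
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce the norm bound
private_mean = gaussian_mechanism(X.mean(axis=0), sensitivity=2.0 / n,
                                  eps=0.1, delta=1.0 / n**2)
```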
Lastly, we define Generalized Linear Models (GLMs).
Definition III.4 (Generalized Linear Model). For a model space θ ∈ ℝᵖ, where p > 0, the sample space 𝒟 in a GLM is defined as the Cartesian product of a p-dimensional feature space X ⊆ ℝᵖ and a label space Y, i.e., 𝒟 = X × Y. Thus, each data sample dᵢ ∈ 𝒟 can be decomposed into a feature vector xᵢ ∈ X, and a label yᵢ ∈ Y. Moreover, the loss function ℓ(θ; dᵢ) for a GLM is a function of xᵢᵀθ and yᵢ.
IV. APPROXIMATE MINIMA PERTURBATION
In this section, we will describe Approximate Minima Perturbation, a strengthened alternative to objective perturbation that provides DP guarantees even when the output of the algorithm is not the actual minima of the perturbed objective function. The perturbed objective takes the form L(θ; D) + (Λ/2n)·‖θ‖² + ⟨b, θ⟩, where b is a random variable drawn from an appropriate distribution, and Λ is an appropriately chosen regularization constant. We make two crucial improvements over the original objective perturbation algorithm [8], [9]:

• The privacy guarantee of objective perturbation holds only at the exact minima of the underlying optimization problem, which is never guaranteed in practice given finite time. We show that AMP provides a privacy guarantee even for an approximate solution.
• Earlier privacy analyses for objective perturbation [8], [9] hold only when the loss function ℓ(θ; d) is a loss for a GLM (see Definition III.4), as they implicitly make a rank-one assumption on the Hessian of the loss ∇²ℓ(θ; d). Via a careful perturbation analysis of the Hessian, we extend the analysis to any convex loss function under standard assumptions. It is important to note that AMP reduces to objective perturbation if the "approximate" minima condition is tightened to getting the actual minima of the perturbed objective.
Algorithmic description: Given a dataset D = {d₁, d₂, . . . , dₙ}, where each dᵢ ∈ 𝒟, we consider (objective) functions of the form L(θ; D) = (1/n)·∑ᵢ₌₁ⁿ ℓ(θ; dᵢ), where θ ∈ ℝᵖ is a model, and the loss ℓ(θ; dᵢ) has L2-Lipschitz constant L for all dᵢ, is convex in θ, has a continuous Hessian, and is β-smooth in both the parameters.

At a high level, Approximate Minima Perturbation provides a convergence-based solution for objective perturbation. In other words, once the algorithm finds a model θ_approx for which the norm of the gradient of the perturbed objective, ‖∇L_priv(θ_approx; D)‖, is within a pre-determined threshold γ, it outputs a noisy version of θ_approx, denoted by θ_out. Since the perturbed objective is strongly convex, it is sufficient to add Gaussian noise to θ_approx, with standard deviation σ₂ having a linear dependence on the norm bound γ, to ensure DP.

Details of AMP are provided in Algorithm 1. Note that although we get a relaxed constraint on the regularization parameter Λ (in Algorithm 1) if the loss function ℓ is a loss for a GLM, the privacy guarantees hold for general convex loss functions as well. The parameters (ε₁, δ₁) within the algorithm represent the amount of the privacy budget dedicated to perturbing the objective, with the rest of the budget (ε₂, δ₂) being used for adding noise to the approximate minima θ_approx. On the other hand, the parameter ε₃ intuitively represents the part of the privacy budget ε₁ allocated to scaling the noise added to the objective function. The remaining budget (ε₁ − ε₃) is used to set the amount of regularization used.
Algorithm 1: Approximate Minima Perturbation
Input: Dataset: D = {d₁, · · · , dₙ}; loss function: ℓ(θ; dᵢ) that has L2-Lipschitz constant L, is convex in θ, has a continuous Hessian, and is β-smooth for all θ ∈ ℝᵖ and all dᵢ; Hessian rank bound parameter: r, which is the minimum of p and twice the upper bound on the rank of ℓ's Hessian; privacy parameters: (ε, δ); gradient norm bound: γ.

1: Set ε₁, ε₂, ε₃, δ₁, δ₂ > 0 such that ε = ε₁ + ε₂, δ = δ₁ + δ₂, and 0 < ε₁ − ε₃ < 1
2: Set Λ ≥ rβ/(ε₁ − ε₃)
3: b₁ ∼ N(0, σ₁²I_{p×p}), where σ₁ = (2L/n)·(1 + √(2 log(1/δ₁)))/ε₃
4: Let L_priv(θ; D) = (1/n)·∑ᵢ₌₁ⁿ ℓ(θ; dᵢ) + (Λ/2n)·‖θ‖² + b₁ᵀθ
5: θ_approx ← θ such that ‖∇L_priv(θ; D)‖ ≤ γ
6: b₂ ∼ N(0, σ₂²I_{p×p}), where σ₂ = (nγ/Λ)·(1 + √(2 log(1/δ₂)))/ε₂
7: Output θ_out = θ_approx + b₂
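To make the algorithm concrete, below is a minimal Python sketch of Algorithm 1 built on SciPy's minimize, in the spirit of our released implementation [13]; the budget-split fractions, the helper names, and the solver configuration are illustrative assumptions, not the exact released code.

```python
import numpy as np
from scipy.optimize import minimize

def amp(loss_grad, n, p, L, beta, r, eps, delta, gamma, f=0.01, f1=0.9):
    """Approximate Minima Perturbation (Algorithm 1), as a sketch.

    loss_grad(theta) returns (L(theta; D), grad L(theta; D)) for the average
    loss over the dataset; f and f1 are illustrative budget-split fractions.
    """
    # Step 1: split the privacy budget; 0 < eps1 - eps3 < 1 must hold.
    eps2, delta2 = f * eps, f * delta
    eps1, delta1 = eps - eps2, delta - delta2
    eps3 = f1 * eps1
    assert 0 < eps1 - eps3 < 1
    # Step 2: regularization constant.
    lam = r * beta / (eps1 - eps3)
    # Step 3: noise for perturbing the objective.
    sigma1 = (2 * L / n) * (1 + np.sqrt(2 * np.log(1 / delta1))) / eps3
    b1 = np.random.normal(scale=sigma1, size=p)

    def priv_obj(theta):  # Step 4: perturbed objective and its gradient.
        val, grad = loss_grad(theta)
        return (val + lam / (2 * n) * theta @ theta + b1 @ theta,
                grad + (lam / n) * theta + b1)

    # Step 5: any off-the-shelf optimizer. Since max_i |g_i| <= gamma/sqrt(p)
    # implies ||g||_2 <= gamma, we set L-BFGS-B's coordinate-wise tolerance.
    res = minimize(priv_obj, np.zeros(p), jac=True, method='L-BFGS-B',
                   options={'gtol': gamma / np.sqrt(p), 'ftol': 0.0,
                            'maxiter': 100000, 'maxfun': 1000000})
    theta_approx = res.x
    assert np.linalg.norm(priv_obj(theta_approx)[1]) <= gamma

    # Steps 6-7: Gaussian noise calibrated to the gradient-norm bound gamma.
    sigma2 = (n * gamma / lam) * (1 + np.sqrt(2 * np.log(1 / delta2))) / eps2
    return theta_approx + np.random.normal(scale=sigma2, size=p)
```

Any optimizer that exposes a gradient-norm stopping criterion can be substituted for L-BFGS-B here; this is what makes the approach optimizer-agnostic.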
Privacy and utility guarantees: Here, we provide the privacy and utility guarantees for Algorithm 1. While we provide a complete privacy analysis (Theorem 1), we only state the utility guarantee (Theorem 2), as it is a slight modification of previous work [9]. We provide a proof for it in Appendix A for completeness.

Theorem 1 (Privacy guarantee). Algorithm 1 is (ε, δ)-differentially private.

Proof Idea. To obtain an (ε, δ)-DP guarantee for Algorithm 1, we first split the output of the algorithm into two parts: one being the exact minima of the perturbed objective, and the other being a combination of the exact minima, the approximate minima obtained in Step 5 of the algorithm, and the Gaussian noise added to it. For the first part, we bound the ratio of the density of the exact minima taking any particular value, under any two neighboring datasets, by e^{ε₁} with probability at least 1 − δ₁. We first simplify such a ratio, as done in [8] via the function inverse theorem, by transforming it into two ratios: one involving only the density of a function of the minima value and the input dataset, and the other involving the determinant of this function's Jacobian. For the former ratio, we start by bounding the sensitivity of the function using the L2-Lipschitz constant L of the loss function. Then, we use the guarantees of the Gaussian mechanism to obtain a high-probability bound (shown in Lemma IV.1). We bound the latter ratio (in Lemma IV.2) via a novel approach that uses the β-smoothness property of the loss. Next, we use the gradient norm bound γ, and the strong convexity of the perturbed objective, to obtain an (ε₂, δ₂)-DP guarantee for the second part of the split output. Lastly, we use the general composition property of DP to get the statement of the theorem.
Proof. Define θ_min = arg min_{θ∈ℝᵖ} L_priv(θ; D). Fix a pair of neighboring datasets D*, D′ ∈ 𝒟ⁿ, and some α ∈ ℝᵖ. First, we will show that:

pdf_{D*}(θ_min = α) / pdf_{D′}(θ_min = α) ≤ e^{ε₁}  w.p. ≥ 1 − δ₁.    (1)

We define b(θ; D) = −∇L(θ; D) − Λθ/n for D ∈ 𝒟ⁿ and θ ∈ ℝᵖ. Changing variables according to the function inverse theorem (Theorem 17.2 in [39]), we get

pdf_{D*}(θ_min = α) / pdf_{D′}(θ_min = α) = [pdf(b(α; D*); ε₁, δ₁, L) / pdf(b(α; D′); ε₁, δ₁, L)] · [|det(∇b(α; D′))| / |det(∇b(α; D*))|]    (2)

We will bound the ratios of the densities and the determinants separately. First, we will show that for ε₃ < ε₁, pdf(b(α; D*); ε₁, δ₁, L) / pdf(b(α; D′); ε₁, δ₁, L) ≤ e^{ε₃} w.p. at least 1 − δ₁, and then we will show that |det(∇b(α; D′))| / |det(∇b(α; D*))| ≤ e^{ε₁−ε₃} if ε₁ − ε₃ < 1.
Lemma IV.1. We define b(θ; D) = −∇L(θ; D) − Λθ/n for D ∈ 𝒟ⁿ, and θ ∈ ℝᵖ. Then, for any pair of neighboring datasets D*, D′ ∈ 𝒟ⁿ, and ε₃ < ε₁, we have

pdf(b(α; D*); ε₁, δ₁, L) / pdf(b(α; D′); ε₁, δ₁, L) ≤ e^{ε₃}  w.p. at least 1 − δ₁.

Here, L_priv(α; ·) is defined as in Algorithm 1.

Proof. Assume w.l.o.g. that dᵢ ∈ D* has been replaced by d′ᵢ in D′. We first bound the L2-sensitivity of b(α; ·):

‖b(α; D*) − b(α; D′)‖ ≤ ‖∇ℓ(α; d′ᵢ) − ∇ℓ(α; dᵢ)‖/n ≤ 2L/n,

where the last inequality follows as ‖∇ℓ(α; ·)‖ ≤ L. Setting σ₁ ≥ (2L/n)·(1 + √(2 log(1/δ₁)))/ε₃ for ε₃ < ε₁, we get the statement of the lemma from the guarantees of the Gaussian mechanism [38].
Lemma IV.2. Let b(θ; D) be defined as in Lemma IV.1. Then for any pair of neighboring datasets D*, D′ ∈ 𝒟ⁿ, if ε₁ − ε₃ < 1, we have

|det(∇b(α; D′))| / |det(∇b(α; D*))| ≤ e^{ε₁−ε₃}.

Proof. Assume w.l.o.g. that dᵢ ∈ D* is replaced by d′ᵢ in D′. Let A = n∇²L(α; D*), and E = ∇²ℓ(α; d′ᵢ) − ∇²ℓ(α; dᵢ). As the (p × p) matrix E is the difference of the Hessians of the loss of two individual samples, we can define a bound r on the rank of E as follows:

r = min{p, 2 · (upper bound on rank of ∇²ℓ(α; ·))}

Let λ₁ ≤ λ₂ ≤ · · · ≤ λ_p be the eigenvalues of A, and λ′₁ ≤ λ′₂ ≤ · · · ≤ λ′_p be the eigenvalues of A + E. Thus,

|det(∇b(α; D′))| / |det(∇b(α; D*))| = det((A + E + ΛI_p)/n) / det((A + ΛI_p)/n) = ∏ᵢ₌₁ᵖ (λ′ᵢ + Λ) / ∏ᵢ₌₁ᵖ (λᵢ + Λ)
= ∏ᵢ₌₁ᵖ (1 + (λ′ᵢ − λᵢ)/(λᵢ + Λ))
≤ ∏ᵢ₌₁ᵖ (1 + |λ′ᵢ − λᵢ|/Λ)
= 1 + ∑ᵢ₌₁ᵖ |λ′ᵢ − λᵢ|/Λ + ∑_{i,j∈[p], i≠j} (∏_{k∈{i,j}} |λ′ₖ − λₖ|)/Λ² + · · ·
≤ 1 + rβ/Λ + (rβ)²/Λ² + · · · ≤ Λ/(Λ − rβ)

The first inequality follows since A is a positive semi-definite matrix (as ℓ is convex), and thus λⱼ ≥ 0 for all j ∈ [p]. The second inequality follows as i) the rank of E is at most r, ii) both A and A + E are positive semi-definite (so λⱼ, λ′ⱼ ≥ 0 for all j ∈ [p]), and iii) the eigenvalues of E are at most β in magnitude, since each per-sample Hessian ∇²ℓ(α; ·) is positive semi-definite with eigenvalues at most β (due to ℓ(θ; dⱼ) being convex in θ, having a continuous Hessian, and being β-smooth), which bounds the eigenvalue perturbations |λ′ᵢ − λᵢ|. The last inequality follows if Λ > rβ. Also, we want Λ/(Λ − rβ) ≤ exp(ε₁ − ε₃), which implies Λ ≥ rβ/(1 − exp(ε₃ − ε₁)) ≥ rβ/(ε₁ − ε₃). Both conditions are satisfied by setting Λ = rβ/(ε₁ − ε₃), as ε₁ − ε₃ < 1.
From Equation 2, and Lemmas IV.1 and IV.2, we get that pdf_{D*}(θ_min = α) / pdf_{D′}(θ_min = α) ≤ e^{ε₁} w.p. ≥ 1 − δ₁. In other words, θ_min is (ε₁, δ₁)-differentially private.

Now, since we can write θ_out as θ_out = θ_min + (θ_approx − θ_min + b₂), we will prove that releasing (θ_approx − θ_min + b₂) is (ε₂, δ₂)-differentially private.
Lemma IV.3. For D ∈ 𝒟ⁿ, let γ ≥ 0 be chosen independent of D, and let θ_min = arg min_{θ∈ℝᵖ} L_priv(θ; D). If θ_approx ∈ ℝᵖ s.t. ‖∇L_priv(θ_approx; D)‖ ≤ γ, then releasing (θ_approx − θ_min + b₂), where b₂ ∼ N(0, σ₂²I_{p×p}) for σ₂ = (nγ/Λ)·(1 + √(2 log(1/δ₂)))/ε₂, is (ε₂, δ₂)-DP.

Proof. We start by bounding the L2-norm of (θ_approx − θ_min):

‖θ_approx − θ_min‖ ≤ n·‖∇L_priv(θ_approx; D) − ∇L_priv(θ_min; D)‖/Λ ≤ nγ/Λ    (3)

The first inequality follows as L_priv is (Λ/n)-strongly convex, and the second inequality follows as ∇L_priv(θ_min; D) = 0 and ‖∇L_priv(θ_approx; D)‖ ≤ γ. Now, setting σ₂ = (nγ/Λ)·(1 + √(2 log(1/δ₂)))/ε₂, we get the statement of the lemma by the properties of the Gaussian mechanism [38].
As ε₁ + ε₂ = ε, and δ₁ + δ₂ = δ, we get the privacy guarantee of Algorithm 1 by Equation 1, Lemma IV.3, and the general composition property of DP [40].

Next, we provide the utility guarantee (in terms of excess empirical risk) for Algorithm 1 in Theorem 2. For completeness, we provide a proof for it in Appendix A.
Theorem 2 (Utility guarantee (adapted from [9])). Let θ* be the true unconstrained minimizer of L(θ; D) = (1/n)·∑ᵢ₌₁ⁿ ℓ(θ; dᵢ), and r = min{p, 2 · (upper bound on rank of ℓ's Hessian)}. In Algorithm 1, if εᵢ = ε/2 for i ∈ {1, 2}, ε₃ = max{ε₁/2, ε₁ − 0.99}, δⱼ = δ/2 for j ∈ {1, 2}, and we set the regularization parameter

Λ = Θ( (1/‖θ*‖) · ( L·√(rp log(1/δ))/ε + √( n²Lγ·√(p log(1/δ))/ε ) ) )

such that it satisfies the constraint in Step 2, then the following is true:

E[ L(θ_out; D) − L(θ*; D) ] = O( ‖θ*‖L·√(rp log(1/δ))/(nε) + ‖θ*‖·√( Lγ·√(p log(1/δ))/ε ) ).
Remark: For loss functions of Generalized Linear Models, we have r = 2. Here, for small values of γ (for example, γ = O(1/n²)), the excess empirical risk of Approximate Minima Perturbation is asymptotically the same as that of objective perturbation [9], and has a better dependence on n than that of Private Permutation-based SGD [12]. Specifically, the dependence is ∝ 1/n for Approximate Minima Perturbation, and ∝ 1/√n for Private PSGD.
Towards Hyperparameter-free Approximate Minima Perturbation: AMP can be considered to have the following hyperparameters: the Lipschitz constant L, and the privacy parameters ε₂, δ₂, and ε₃, which split the privacy budget within the algorithm. A data-independent approach for setting these parameters can eliminate the need for hyperparameter tuning with this approach, making it convenient to deploy in practice.

For practical applications, given a sensitive dataset and a convex loss function, the hyperparameter L can be thought of as a trade-off between the sensitivity of the loss and the amount of external interference required to achieve that sensitivity, for instance, sample clipping (defined in Section V-A) on the data. In the next section, we provide a hyperparameter-free variant of AMP that has performance comparable to the standard variant in which all the hyperparameters are tuned.
V. EXPERIMENTAL RESULTS
Our evaluation seeks to answer two major research questions:

1) What is the cost (to accuracy) of privacy? How close can a DP model come to the non-private baseline? For real-world use cases, is the cost of privacy low enough to make DP learning practical?
2) Which algorithm provides the best accuracy in practice? Is there a total order on the available algorithms? Does this ordering differ for datasets with different properties?

Additionally, we also attempt to answer the following question, which can result in a significant advantage for the deployment of a DP model in practice:

3) Can Approximate Minima Perturbation be deployed without hyperparameter tuning? Can its hyperparameters (L, ε₂, δ₂, and ε₃) be set in a data-independent manner?
Summary of results: Question #1: Our results demonstrate that for datasets of sufficient size, the cost of privacy is negligible. Experiments 1 (on low-dimensional datasets), 2 (on high-dimensional datasets), and 3 (on real-world datasets obtained in collaboration with Uber) evaluate the cost of privacy using logistic loss. Our results show that for large datasets, a DP model exists that approaches the accuracy of the non-private baseline at reasonable privacy budgets. Experiment 3 shows that for the larger datasets common in practical settings, a privacy-preserving model can produce even better accuracy than the non-private one, which suggests that privacy-preserving learning is indeed practical.

We also present the performance of private algorithms using Huber SVM loss (on all the datasets mentioned above) in Appendix B. The general trends from the experiments using Huber SVM loss are identical to those obtained using logistic loss.
Question #2: Our experiments demonstrate that AMP generally provides the best accuracy among all the evaluated algorithms. Moreover, Experiment 2 shows that under specific conditions, private Frank-Wolfe can provide the best accuracy. In all regimes, AMP's results generally show an improvement over the other approaches.

Question #3: Our experiments also demonstrate that a simple data-independent method can be used to set L, ε₂, δ₂, and ε₃ for AMP, and that this method provides good accuracy across datasets. For most values of ε, our data-independent approach provides nearly the same accuracy as the version tuned using a grid search (which may be non-private).
A. Experiment Setup
Algorithms evaluated: Our evaluation includes one algorithm drawn from each of the major approaches to private convex optimization: objective perturbation, output perturbation, private gradient descent, and the private Frank-Wolfe algorithm. For each approach, we select the best-known algorithm and configuration.

For objective perturbation, we implement AMP (Algorithm 1), as it is the only practically feasible objective perturbation approach³. For all the experiments, we tune the values of the hyperparameters L, ε₂, δ₂, and ε₃ using the grid search described later.

We also evaluate a hyperparameter-free variant of AMP that sets the hyperparameters L, ε₂, δ₂, and ε₃ independent of the data. We describe the strategy in detail towards the end of this subsection.

For output perturbation, we implement Private Permutation-based SGD (PSGD) [12], as it is the only practically feasible variant of output perturbation³. We evaluate both of the minibatching variants proposed in [12]: convex (Algorithm 3), and strongly convex (Algorithm 4).

³For all variants pertaining to the standard regime in [8], obtaining some exact minima is necessary for achieving a privacy guarantee.
TABLE I
DATASETS USED IN OUR EVALUATION

Dataset        # Samples   # Dim.   # Classes
Low-Dimensional Datasets (Public)
Synthetic-L    10,000      20       2
Adult          45,220      104      2
KDDCup99       70,000      114      2
Covertype      581,012     54       7
MNIST          65,000      784      10
High-Dimensional Datasets (Public)
Synthetic-H    5,000       5,000    2
Gisette        6,000       5,000    2
Real-sim       72,309      20,958   2
RCV1           50,000      47,236   2
Real-World Datasets (Uber)
Dataset #1     4m          23       2
Dataset #2     18m         294      2
Dataset #3     18m         20       2
Dataset #4     19m         70       2
For the convex variant, we evaluate all three proposed learning rate schemes (constant learning rate, decreasing learning rate, and square-root learning rate). We include results only for the constant learning rate, as our experiments show that this scheme produces the most accurate models.

For private gradient descent, we implement a variant of the private SGD algorithm originally proposed in [16]. Our variant (Algorithm 2) leverages the Moments Accountant [15], incorporates minibatching, and sets the noise parameter based on the desired number of iterations (as compared to a fixed n² iterations in [16]).

For private Frank-Wolfe, we implement the version (Algorithm 5) originally proposed in [17]. This algorithm performs constrained optimization by design. Following the advice of [26], we use a decreasing learning rate for better accuracy guarantees. Unlike the other algorithms, private Frank-Wolfe has nearly dimension-independent error bounds, so it should be expected to perform comparatively better on high-dimensional datasets.

Pseudocode for all the evaluated algorithms is provided in Appendix C for reference. For each algorithm, we evaluate the variant that provides (ε, δ)-differential privacy. Most algorithms can also provide ε-differential privacy at an additional cost to accuracy.
Datasets: Table I lists the public datasets used in our experimental evaluation. Each of these datasets is available for download, and our open-source release contains scripts for downloading and pre-processing these datasets. It also contains scripts for generating both synthetic datasets. As RCV1 has multi-label classification over 103 labels (with most of the labels being used for a very small proportion of the dataset), for this dataset we consider the task of predicting whether a sample is categorized under the most frequently used label or not.

The selected datasets include both low-dimensional and high-dimensional datasets. We define low-dimensional datasets to be ones where n ≫ p (where n is the number of samples and p is the number of dimensions). High-dimensional datasets are defined as those for which n and p are on roughly the same scale, i.e., n ≤ p (or nearly so). We consider the Synthetic-H, Gisette, Real-sim, and RCV1 datasets to be high-dimensional.

To obtain training and testing sets, we randomly shuffle the dataset, take the first 80% as the training set, and the remaining 20% as the testing set.
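As a minimal sketch (the function wrapper and seed handling are our additions), this split is:

```python
import numpy as np

def train_test_split(X, y, train_frac=0.8, seed=0):
    """Shuffle, then take the first 80% for training and the rest for testing."""
    idx = np.random.default_rng(seed).permutation(len(y))
    cut = int(train_frac * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```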
Sample clipping: Each of the algorithms we evaluate requires the loss to have a Lipschitz constant. We can enforce this requirement for the loss functions we consider by bounding the norm of each sample. We can accomplish this by pre-processing the dataset, but it must be done carefully to preserve DP.

For all the algorithms except private Frank-Wolfe, to make the loss have an L2-Lipschitz constant L, we bound the influence of each sample (xᵢ, yᵢ) by clipping the feature vector xᵢ to xᵢ · min(1, L/‖xᵢ‖). This transformation is independent of the other samples, and thus preserves DP; it has also been previously used, e.g., in [15]. As the private Frank-Wolfe algorithm requires the loss to have a relaxed L1-Lipschitz constant L, it suffices (using Theorem 1 from [41]) to bound the L∞-norm of each sample (xᵢ, yᵢ) by L. We achieve this by clipping each dimension xᵢ,ⱼ, where j ∈ [d], to min(xᵢ,ⱼ, L).
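Both pre-processing steps can be sketched as follows. The function names are ours; the symmetric clipping to [-L, L] in the second function is our reading of the L∞ bound for features that may be negative, whereas the text above describes clipping each coordinate to min(xᵢ,ⱼ, L).

```python
import numpy as np

def clip_l2(X, L):
    """Scale each sample to L2-norm at most L: x_i <- x_i * min(1, L / ||x_i||)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.minimum(1.0, L / np.maximum(norms, 1e-12))

def clip_linf(X, L):
    """Bound the L-infinity norm of each sample at L, coordinate-wise,
    for private Frank-Wolfe."""
    return np.clip(X, -L, L)
```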
Hyperparameters: Each of the evaluated algorithms has at least one hyperparameter. The values for these hyperparameters should be tuned to provide the best accuracy, but the tuning should be done privately in order to guarantee end-to-end differential privacy. Although a number of differentially private hyperparameter tuning algorithms have been proposed [8], [14], [15] to address this problem, they add more variance to the performance of each algorithm, thus making it more difficult to compare the performance across different algorithms.

In order to provide a fair comparison between algorithms, we use a grid search to determine the best value for each hyperparameter. Our grid search considers the hyperparameter values listed in Table II. In addition to the standard algorithm hyperparameters (Λ, η, T, k), we tune the clipping parameter L used in pre-processing the datasets, and the constraint on the model space used by private Frank-Wolfe, Private SGD when using regularized loss, and Private strongly convex PSGD.
TABLE II
HYPERPARAMETER & PRIVACY PARAMETER VALUES

Hyperparameter                 Values Considered
Λ (regularization factor)      10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 0
η (learning rate)              0.001, 0.01, 0.1, 1
T (number of iterations)       5, 10, 100, 1000, 5000
k (minibatch size)             50, 100, 300
L (clipping threshold)         0.1, 1, 10, 100
C (model constraint)           1, 10
f (output budget fraction)     0.001, 0.01, 0.1, 0.5
f₁ (privacy budget fraction)   0.9, 0.92, 0.95, 0.98, 0.99

Privacy Parameter              Values Considered
ε                              10^(-2), 10^(-3/2), 10^(-1), 10^(-1/2), 10^0, 10^(1/2), 10^1
δ                              1/n²
The parameter C controls the size of the L1/L2-ball from which models are selected by private Frank-Wolfe/the other algorithms, respectively. For AMP, we set ε₂ = f·ε, δ₂ = f·δ, and tune for f. Here, f denotes the fraction of the budget (ε, δ) that is allocated to (ε₂, δ₂). Also, since the valid range of the hyperparameter ε₃ depends on the value of ε₁, we set ε₃ = f₁·ε₁, and tune for f₁. We also ensure that the constraint on ε₃ in Line 3 of Algorithm 1 is satisfied. Note that tuning hyperparameters may be non-private, but it enables a direct comparison of the algorithms themselves.
We consider a range of values for the privacy parameter ε. Following Wu et al. [12], we set the privacy parameter δ = 1/n², where n is the size of the training data. The complete set of values considered is listed in Table II. For multiclass classification datasets such as MNIST and Covertype, we implement the one-vs-all strategy by training a binary classifier for each class, and split ε and δ equally among the binary classifiers, so that we can achieve an overall (ε, δ)-DP guarantee by using general composition [40].
Algorithm Implementations: The implementations used in our evaluation correspond to the pseudocode listings in Appendix C, are written in Python, and are available in our open-source release [13]. For Approximate Minima Perturbation, we define the loss and gradient according to Algorithm 1, and leverage SciPy's minimize procedure to find the approximate minima.

For all datasets, our implementation is able to achieve γ = 1/n², where n is the size of the training data. For low-dimensional datasets, our implementation of AMP uses SciPy's BFGS solver, for which we can specify the desired norm bound γ. The BFGS algorithm stores the full Hessian of the objective, which does not fit in memory for the sparse high-dimensional datasets in our study. For these, we define an alternative low-memory implementation using SciPy's L-BFGS-B solver, which does not store the full Hessian.
Experiment procedure: Our experiment setup is designed to find the best possible accuracy achievable for a given setting of the privacy parameters. To ensure a fair comparison, we begin every run of each algorithm with the initial model 0ᵖ. Because each of the evaluated algorithms introduces randomness due to noise, we train 10 independent models for each combination of hyperparameter settings. We report the mean accuracy and standard deviation for the combination of hyperparameter settings with the highest mean accuracy over the 10 independent runs.⁴
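A sketch of this procedure, assuming a hypothetical trainer that returns a linear model θ and labels in {−1, +1}:

```python
import itertools
import numpy as np

def best_hyperparameters(train_fn, grid, X_train, y_train, X_test, y_test, runs=10):
    """Train `runs` independent models per hyperparameter combination and
    return the highest mean test accuracy, its std. dev., and the setting."""
    best = (-1.0, 0.0, None)
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        accs = [np.mean(np.sign(X_test @ train_fn(X_train, y_train, **params))
                        == y_test)
                for _ in range(runs)]
        if np.mean(accs) > best[0]:
            best = (float(np.mean(accs)), float(np.std(accs)), params)
    return best
```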
Differences with the setting in [12]: Although both studies have 3 datasets in common (Covertype, KDDCup99, and MNIST), our setting is slightly different from [12] for all 3 of them. For Covertype, our study uses all 7 classes, while [12] uses a binary version. For KDDCup99, we use a 10% sample of the full dataset (as in [42]), while [12] uses the full dataset. For MNIST, we use all 784 dimensions, while [12] uses random projection to reduce the dimensionality to 50.

The results we obtain for both variants of the Private PSGD algorithm [12] are based on faithful implementations of those algorithms. We tune the hyperparameters for both, using the grid search described earlier.

Non-private baseline: Note that one of the main objectives of this study is to determine the cost of privacy in practice for convex optimization. Hence, to provide a point of comparison for our results, we also train a non-private baseline model for each experiment. We use Scikit-learn's LogisticRegression class to train this model on the same training data as the private algorithms, and test its accuracy on the same testing data as the private algorithms. We do not perform sample clipping when training this model.
⁴The results shown are for hyperparameters tuned via the mean test set accuracy. Since all the considered algorithms aim to minimize the empirical loss, we also conducted experiments by tuning via the mean training set accuracy. Both settings provided visibly identical results.

Strategy for Hyperparameter-free Approximate Minima Perturbation: Now, we describe a data-independent approach for setting Approximate Minima Perturbation's only hyperparameters, L, ε₂, δ₂, and ε₃, for both the loss functions we consider (see Section V-B). For L, we find that setting L = 1 achieves a good trade-off between the amount of noise added for perturbing the objective, and the information loss after sample clipping, across all datasets. Next, we consider only the synthetically generated datasets for setting the hyperparameters specific to AMP. Fixing γ = 1/n², we find that setting ε₂ = 0.01·ε and δ₂ = 0.01·δ achieves a good trade-off between the budget for perturbing the objective, and the amount of noise that its approximate minima can tolerate. For setting ε₃, we consider two separate cases:

• For ε₁ = 0.99·ε, and ε₃ = f₁·ε₁, we see that setting f₁ = 0.99 for ε₁ = 0.0099, f₁ = 0.95 for ε₁ ∈ {0.0313, 0.099}, and f₁ = 0.9 for ε₁ ∈ {0.313, 0.99, 3.13, 9.99} yields a good accuracy for Synthetic-L. Hence, we observe that for very low values of ε₁, a good accuracy is yielded by ε₃ close to ε₁ (i.e., most of the budget is used to reduce the scale of the noise, and the influence of regularization is kept large). As ε₁ increases, we see that it is more beneficial to reduce the effects of regularization. We fit a basic polynomial curve of the form y = a + bx⁻ᶜ, where a, b, c > 0, to the above-stated values to get a dependence of f₁ (the privacy budget fraction) in terms of ε₁. We combine it with the lower bound imposed on f₁ by Theorem 1 (for instance, we require f₁ ≥ 0.9 for ε₁ = 9.99) to obtain the following data-independent relationship between ε₁ and ε₃ for low-dimensional datasets:

ε₃ = max{ min{ 0.887 + 0.019/ε₁^0.373, 0.99 }, 1 − 0.99/ε₁ } · ε₁
• For Synthetic-H, we see that setting f₁ = 0.97 yields a good accuracy for all the values of ε₁ considered. Thus, combining it with the lower bound imposed on f₁ by Theorem 1, we obtain the following relationship for high-dimensional datasets:

ε₃ = max{ 0.97, 1 − 0.99/ε₁ } · ε₁
Note that the results for this strategy are consistent for both loss functions across all the public and real-world datasets considered, none of which were used in defining the strategy except for setting the Lipschitz constant L of the loss; these datasets can thus be considered to be effectively serving as test cases for the strategy.
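For reference, the ε₃ rule above can be written compactly as follows; a minimal sketch where the function wrapper and argument names are ours, and the constants are those fitted above.

```python
def eps3_hyperparameter_free(eps1, high_dimensional):
    """Data-independent choice of eps3 given eps1 = 0.99 * eps (Section V-A).
    The 1 - 0.99/eps1 term enforces the lower bound on f1 from Theorem 1,
    keeping eps1 - eps3 <= 0.99 < 1."""
    if high_dimensional:
        f1 = max(0.97, 1.0 - 0.99 / eps1)
    else:
        f1 = max(min(0.887 + 0.019 / eps1**0.373, 0.99), 1.0 - 0.99 / eps1)
    return f1 * eps1
```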
B. Loss Functions
Our evaluation considers the loss functions for two commonly used models: logistic regression and Huber SVM. This section contains results for logistic regression; results for Huber SVM are available in Appendix B.

Logistic regression: The L2-regularized logistic regression loss function on a sample (x, y) with y ∈ {1, −1} is ℓ(θ, (x, y)) = ln(1 + exp(−y⟨θ, x⟩)) + (Λ/2)·‖θ‖². Our experiments consider both the regularized and un-regularized (i.e., Λ = 0) settings. The un-regularized version has L2-Lipschitz constant L when, for each sample x, ‖x‖ ≤ L. It is also L²-smooth. The regularized version has L2-Lipschitz constant L + ΛC when, for each sample x, ‖x‖ ≤ L, and for each model θ, ‖θ‖ ≤ C. It is also (L² + Λ)-smooth, and Λ-strongly convex.
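As a sketch, this loss and its gradient can be implemented as follows, and passed, for example, as the loss_grad callback (via a closure over X, y) in the AMP sketch of Section IV; the function name is ours.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def logistic_loss_grad(theta, X, y, lam=0.0):
    """Average L2-regularized logistic loss and gradient; labels y in {-1, +1}."""
    z = -y * (X @ theta)                                  # z_i = -y_i <theta, x_i>
    loss = np.mean(np.logaddexp(0.0, z)) + 0.5 * lam * theta @ theta
    # d/dtheta ln(1 + exp(-y<theta, x>)) = -y * sigmoid(-y<theta, x>) * x
    grad = X.T @ (-y * expit(z)) / len(y) + lam * theta
    return loss, grad
```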
C. Experiment 1: Low-dimensional Datasets
We present the results of our experiments with logistic regression on low-dimensional data in Figure 2. All four algorithms perform better, relative to the non-private baseline, on binary classification tasks (Synthetic-L, Adult, and KDDCup99) than on multi-class problems (Covertype and MNIST), because ε and δ must be split among the binary classifiers built for each class.

Figure 1 contains precise accuracy numbers for each dataset for reasonably low values of ε. These results provide a more precise comparison between the four algorithms, and quantify the accuracy loss versus the non-private baseline for each one. Across all the datasets, Approximate Minima Perturbation generally provides the most accurate models over the range of ε values.
D. Experiment 2: High-dimensional Datasets
For this experiment, we repeat the procedure in Experiment 1 on high-dimensional data, and present the results in Figure 2. The results are somewhat different in the high-dimensional regime. We observe that although Approximate Minima Perturbation generally outperforms all the other algorithms, the private Frank-Wolfe algorithm performs the best on Synthetic-H. From prior works [11], [17], we know that both objective perturbation and private Frank-Wolfe have near dimension-independent utility guarantees when the loss is of a GLM, and we indeed observe this expected behavior in our experiments. As in Experiment 1, we present precise accuracy numbers for ε = 0.1 in Figure 1.

Private Frank-Wolfe works best when the optimal model is sparse (i.e., a few important features characterize the classification task well), as in the Synthetic-H dataset, which is well-characterized by just ten important features. This is because private Frank-Wolfe adds at most a single feature to the model at each iteration, and noise increases with the number of iterations. However, noise does not increase with the total number of features, since it scales with the bound on the ℓ∞-norm of the samples. This behavior is in contrast to Approximate Minima Perturbation (and the other algorithms considered in our evaluation), for which noise scales with the bound on the ℓ2-norm of the samples.
Dataset        NP baseline   AMP     H-F AMP   P-SGD   P-PSGD   P-SCPSGD   P-FW
Low-Dimensional Binary Datasets (ε = 0.1)
Synthetic-L    94.9          83.1    80.6      81.6    81.7     76.4       81.8
Adult          84.8          79.1    78.7      78.5    77.4     77.2       76.9
KDDCup99       99.1          97.5    97.4      98.0    98.1     95.8       96.8
Low-Dimensional Multi-class Datasets (ε = 1)⁵
Covertype      71.2          64.3    63.5      65.0    62.4     62.2       63.0
MNIST          91.5          71.9    70.5      68.6    68.0     63.2       65.0
High-Dimensional Datasets (ε = 0.1)
Synthetic-H    95.8          53.2    51.4      52.8    53.5     52.0       57.6
Gisette        96.6          62.8    59.7      61.5    62.3     61.3       58.3
Real-sim       93.3          73.1    71.9      66.3    66.1     65.6       69.8
RCV1           93.5          64.2    59.9      55.1    58.9     56.2       64.1
Real-World Datasets (ε = 0.1)
Dataset #1     75.3          75.3⁶   75.3      75.3    75.3     75.3       75.3
Dataset #2     72.2          70.4    70.1      69.8    69.5     68.9       68.6
Dataset #3     73.6          71.9    71.8      71.8    71.4     71.2       71.6
Dataset #4     82.1          81.7    81.7      81.7    81.5     81.3       81.0

Fig. 1. Accuracy results (in %) for logistic regression. For each dataset, the result in bold represents the DP algorithm with the best accuracy for that dataset. A key for the abbreviations used for the algorithms is provided in Table III.
TABLE III
LIST OF ABBREVIATIONS USED FOR ALGORITHMS

Abbreviation   Full form
NP baseline    Non-private baseline
AMP            Approximate Minima Perturbation
H-F AMP        Hyperparameter-free AMP
P-SGD          Private SGD
P-PSGD         Private PSGD
P-SCPSGD       Private strongly convex PSGD
P-FW           Private Frank-Wolfe
Private Frank-Wolfe therefore approaches the non-private baseline better than the other algorithms for high-dimensional datasets with sparse models, even at low values of ε.
E. Experiment 3: Real-world Use Cases
For this experiment, we repeat the procedure in Experiment 1 on real-world use cases, obtained in collaboration with Uber. These use cases are represented by four datasets, each of which has separately been used to train a production model deployed at Uber. The details of these datasets are listed in Table I. The results of this experiment are depicted in Figure 2, with more precise results for ε = 0.1 in Figure 1.

⁵We report the accuracy for ε = 1 for multi-class datasets, as compared to ε = 0.1 for datasets with binary classification, because multi-class classification is a more difficult task than binary classification.

⁶For Dataset #1, AMP slightly outperforms even the NP baseline, as can be seen from Figure 2.
[Figure 2: thirteen accuracy-vs-ε plots, one per dataset: Synthetic-L, Adult, KDDCup99, Covertype, MNIST (low-dimensional); Synthetic-H, Gisette, Real-sim, RCV-1 (high-dimensional); Datasets #1-#4 (real-world). Legend: Non-private baseline, Approximate Minima Perturbation, Hyperparameter-free Approximate Minima Perturbation, Private SGD, Private PSGD, Private Strongly-convex PSGD, Private Frank-Wolfe.]

Fig. 2. Accuracy results for logistic regression on low-dimensional, high-dimensional and real-world datasets. Horizontal axis depicts varying values of ε; vertical axis shows accuracy (in %) on the testing set.
The real-world datasets are much larger than the datasets considered in Experiment 1. The difference in scale is reflected in the results: all of the algorithms converge to the non-private baseline for very low values of ε. These results suggest that in many practical settings, the cost of privacy is negligible. In fact, for Dataset #1, some differentially private models exhibit a slightly higher accuracy than the non-private baseline for a wide range of ε. For instance, even Hyperparameter-free AMP, which is end-to-end differentially private as there is no tuning involved, yields an accuracy of 75.34% for ε = 0.1, versus the non-private baseline of 75.33%. Some prior works [18], [19] have theorized that differential privacy could act as a type of regularization for the system, and improve the generalization error; this empirical result of ours aligns with that claim.
F. Discussion
For large datasets, the cost of privacy is low. Our results confirm the expectation that very accurate differentially private models exist for large datasets. Even for relatively small datasets like Adult and KDDCup99 (where n < 100,000), our results show that a differentially private model has accuracy within 6% of the non-private baseline, even for a conservative privacy setting of ε = 0.1.

For all the larger real-world datasets (n > 1m), the accuracy of the best differentially private model is within 4% of the non-private baseline, even for the most conservative privacy value considered (ε = 0.01). For ε = 0.1, it is within 2% of the baseline for two of these datasets, essentially identical to the baseline for one of them, and even slightly higher than the baseline for one.

These results suggest that for realistic deployments on large datasets (n > 1m, and low-dimensional), a differentially private model can be deployed without much loss in accuracy.
Approximate Minima Perturbation almost always provides the best accuracy, and is easily deployable in practice. Our results in all the experiments demonstrate that among the available algorithms for differentially private convex optimization, our Approximate Minima Perturbation approach almost always produces models with the best accuracy. For four of the five low-dimensional datasets, and all the public high-dimensional datasets we considered, Approximate Minima Perturbation provided consistently better accuracy than the other algorithms. Under some conditions, like high dimensionality of the dataset and sparsity of its optimal predictive model, private Frank-Wolfe does give the best performance. Unlike Approximate Minima Perturbation, however, no hyperparameter-free variant of private Frank-Wolfe exists, and suboptimal hyperparameter values can reduce accuracy significantly for this algorithm.

As mentioned earlier, Approximate Minima Perturbation also has important properties that enable its practical deployment. It can leverage any off-the-shelf optimizer as a black box, allowing implementations to use existing scalable optimizers (our implementation uses SciPy's minimize). None of the other evaluated algorithms has these properties.

Hyperparameter-free Approximate Minima Perturbation provides good utility. As demonstrated by our experimental results, AMP can be deployed without tuning hyperparameters, at little cost to accuracy. Our data-independent approach therefore enables deployment, without significant loss of accuracy, in practical settings where public data may not be available for tuning.
VI. CONCLUSION
This paper takes two important steps towards practi-cal differentially private convex optimization. We havepresented Approximate Minima Perturbation, a novelalgorithm for differentially private convex optimizationthat does not require the optimization process to reachthe true minima. It can leverage any off-the-shelf solver,and can be employed without hyperparameter tuning.Therefore, it is amenable to be deployed in practice.
We have also performed an extensive empirical evaluation of state-of-the-art approaches for differentially private convex optimization. To encourage further development and deployment, we have released the implementations used in our evaluation, along with the benchmarking scripts used to obtain the datasets and perform the experiments. This benchmark provides a standard point of comparison for further advances in differentially private convex optimization.
VII. ACKNOWLEDGEMENTS
The authors would like to thank Adam Smith for the discussions regarding the main privacy proof of AMP, and the anonymous reviewers for their helpful comments. This material is in part based upon work supported by NSF CCF-1740850, DARPA contract #N66001-15-C-4066, the Center for Long-Term Cybersecurity, and Berkeley Deep Drive. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the sponsors.
REFERENCES
[1] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '15. New York, NY, USA: ACM, 2015, pp. 1322–1333.
[2] X. Wu, M. Fredrikson, S. Jha, and J. F. Naughton, "A methodology for formalizing model-inversion attacks," in 2016 IEEE 29th Computer Security Foundations Symposium (CSF), June 2016, pp. 355–370.
[3] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in 2017 IEEE Symposium on Security and Privacy (SP), May 2017, pp. 3–18.
[4] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, "Our data, ourselves: Privacy via distributed noise generation," in EUROCRYPT, 2006.
[5] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
[6] U. Erlingsson, V. Pihur, and A. Korolova, "RAPPOR: Randomized aggregatable privacy-preserving ordinal response," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 1054–1067.
[7] "Apple tries to peek at user habits without violating privacy," The Wall Street Journal, 2016.
[8] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, "Differentially private empirical risk minimization," JMLR, 2011.
[9] D. Kifer, A. Smith, and A. Thakurta, "Private convex empirical risk minimization and high-dimensional regression," Journal of Machine Learning Research, vol. 1, p. 41, 2012.
[10] S. Song, K. Chaudhuri, and A. D. Sarwate, "Stochastic gradient descent with differentially private updates," in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. IEEE, 2013, pp. 245–248.
[11] P. Jain and A. Thakurta, "(Near) dimension independent risk bounds for differentially private learning," in Proceedings of the 31st International Conference on Machine Learning - Volume 32, ser. ICML '14. JMLR.org, 2014, pp. I-476–I-484.
[12] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton, "Bolt-on differential privacy for scalable stochastic gradient descent-based analytics," in Proceedings of the 2017 ACM International Conference on Management of Data, ser. SIGMOD '17. New York, NY, USA: ACM, 2017, pp. 1307–1322.
[13] "Differentially Private Convex Optimization Benchmark," 2017. [Online]. Available: https://github.com/sunblaze-ucb/dpml-benchmark
[14] K. Chaudhuri and S. Vinterbo, "A stability-based validation procedure for differentially private machine learning," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS '13. USA: Curran Associates Inc., 2013, pp. 2652–2660.
[15] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '16. New York, NY, USA: ACM, 2016, pp. 308–318.
[16] R. Bassily, A. Smith, and A. Thakurta, "Private empirical risk minimization: Efficient algorithms and tight error bounds," in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on. IEEE, 2014, pp. 464–473.
[17] K. Talwar, A. Thakurta, and L. Zhang, "Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry," CoRR, vol. abs/1411.5417, 2014.
[18] R. Bassily, A. D. Smith, and A. Thakurta, "Private empirical risk minimization, revisited," CoRR, vol. abs/1405.7085, 2014.
[19] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, "Generalization in adaptive data analysis and holdout reuse," in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS '15. Cambridge, MA, USA: MIT Press, 2015, pp. 2350–2358.
[20] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[21] S. Bubeck et al., "Convex optimization: Algorithms and complexity," Foundations and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
[22] L. Zhang, T. Yang, and R. Jin, "Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds," in Proceedings of the 2017 Conference on Learning Theory, ser. Proceedings of Machine Learning Research, S. Kale and O. Shamir, Eds., vol. 65. Amsterdam, Netherlands: PMLR, 07–10 Jul 2017, pp. 1954–1979.
[23] V. Feldman, "Generalization of ERM in stochastic convex optimization: The dimension strikes back," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 3576–3584.
[24] A. Smith and A. Thakurta, "Differentially private feature selection via stability arguments, and the robustness of the lasso," in COLT, 2013.
[25] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics (NRL), vol. 3, no. 1-2, pp. 95–110, 1956.
[26] M. Jaggi, "Revisiting Frank-Wolfe: Projection-free sparse convex optimization," in Proceedings of the 30th International Conference on Machine Learning - Volume 28. JMLR.org, 2013, pp. I-427.
[27] S. Lacoste-Julien and M. Jaggi, "An affine invariant linear convergence analysis for Frank-Wolfe algorithms," arXiv preprint arXiv:1312.7864, 2013.
[28] ——, "On the global linear convergence of Frank-Wolfe optimization variants," in Advances in Neural Information Processing Systems, 2015, pp. 496–504.
[29] S. Lacoste-Julien, "Convergence rate of Frank-Wolfe for non-convex objectives," Jun. 2016, 6 pages.
[30] P. Jain, P. Kothari, and A. Thakurta, "Differentially private online learning," in COLT, vol. 23, 2012, pp. 24-1.
[31] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Local privacy and statistical minimax rates," in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, 2013, pp. 429–438.
[32] A. G. Thakurta and A. Smith, "(Nearly) optimal algorithms for private online learning in full-information and bandit settings," in Advances in Neural Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds., 2013, pp. 2733–2741.
[33] P. Jain and A. Thakurta, "Differentially private learning with kernels," in ICML, 2013.
[34] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett, "Functional mechanism: Regression analysis under differential privacy," Proc. VLDB Endow., vol. 5, no. 11, pp. 1364–1375, Jul. 2012.
[35] X. Wu, M. Fredrikson, W. Wu, S. Jha, and J. F. Naughton, "Revisiting differentially private regression: Lessons from learning theory and their consequences," CoRR, vol. abs/1512.06388, 2015.
[36] R. Shokri and V. Shmatikov, "Privacy-preserving deep learning," in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept 2015, pp. 909–910.
[37] Z. Ji, Z. C. Lipton, and C. Elkan, "Differential privacy and machine learning: A survey and review," CoRR, vol. abs/1412.7584, 2014.
[38] A. Nikolov, K. Talwar, and L. Zhang, "The geometry of differential privacy: The sparse and approximate cases," in Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, ser. STOC '13. New York, NY, USA: ACM, 2013, pp. 351–360.
[39] P. Billingsley, Probability and Measure, ser. Wiley Series in Probability and Statistics. Wiley, 1995.
[40] C. Dwork, A. Roth et al., "The algorithmic foundations of differential privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
[41] R. Paulavicius and J. Zilinskas, "Analysis of different norms and corresponding Lipschitz constants for global optimization," Ukio Technologinis ir Ekonominis Vystymas, vol. 12, no. 4, pp. 301–306, 2006.
[42] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, "Differentially private empirical risk minimization," Journal of Machine Learning Research, vol. 12, no. Mar, pp. 1069–1109, 2011.
[43] S. Dasgupta and L. Schulman, "A probabilistic analysis of EM for mixtures of separated, spherical Gaussians," JMLR, 2007.
[44] K. Talwar, A. Thakurta, and L. Zhang, "Nearly optimal private LASSO," in NIPS, 2015.
APPENDIX
A. Omitted Proofs

Here, we provide a proof for the utility guarantee of Algorithm 1, which is stated in Theorem 2. To bound the expected risk of the algorithm, we first need to bound its empirical risk (Lemma A.1).
Lemma A.1 (Empirical Risk). Let $\theta^*$ be the minimizer of the objective function $\mathcal{L}(\theta;D) = \frac{1}{n}\sum_{i=1}^{n}\ell(\theta; d_i)$, and let $\theta_{min}$ be the minimizer of the objective function $\mathcal{L}_{priv}(\theta;D) = \mathcal{L}(\theta;D) + \frac{\Lambda}{2n}\|\theta\|^2 + b_1^T\theta$, where $b_1$ is as defined in Algorithm 1. Also, let $\theta_{out}$ be the output of Algorithm 1. We have:
$$\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta^*;D) \le L\left(\frac{n\gamma}{\Lambda} + \|b_2\|\right) + \frac{\Lambda\|\theta^*\|^2}{2n} + \frac{2n\|b_1\|^2}{\Lambda}.$$
Proof. We have
$$\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta^*;D) = \big(\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta_{min};D)\big) + \big(\mathcal{L}(\theta_{min};D) - \mathcal{L}(\theta^*;D)\big) \tag{4}$$

First, we will bound $\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta_{min};D)$. We have:
$$\begin{aligned}
\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta_{min};D) &\le |\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta_{min};D)| \\
&\le L\|\theta_{out} - \theta_{min}\| \\
&= L\|\theta_{approx} - \theta_{min} + b_2\| \\
&\le L\|\theta_{approx} - \theta_{min}\| + L\|b_2\| \\
&\le L\left(\frac{n\gamma}{\Lambda} + \|b_2\|\right)
\end{aligned} \tag{5}$$

The second inequality above follows from the Lipschitz property of $\mathcal{L}(\cdot;D)$. The first equality follows as $\theta_{out} = \theta_{min} + (\theta_{approx} - \theta_{min} + b_2)$, whereas the last inequality follows from inequality 3.
Next, we bound $\mathcal{L}(\theta_{min};D) - \mathcal{L}(\theta^*;D)$ along the lines of the proof of Lemma 3 in [9]. Let $\theta^{\#} = \arg\min_{\theta\in\mathbb{R}^p} \mathcal{L}^{\#}(\theta;D)$, where $\mathcal{L}^{\#}(\theta;D) = \mathcal{L}(\theta;D) + \frac{\Lambda}{2n}\|\theta\|^2$. As a result, $\mathcal{L}_{priv}(\theta;D) = \mathcal{L}^{\#}(\theta;D) + b_1^T\theta$. So, we have:
$$\begin{aligned}
\mathcal{L}(\theta_{min};D) - \mathcal{L}(\theta^*;D) &= \mathcal{L}^{\#}(\theta_{min};D) - \mathcal{L}^{\#}(\theta^{\#};D) \\
&\quad + \mathcal{L}^{\#}(\theta^{\#};D) - \mathcal{L}^{\#}(\theta^*;D) \\
&\quad + \frac{\Lambda\|\theta^*\|^2}{2n} - \frac{\Lambda\|\theta_{min}\|^2}{2n} \\
&\le \mathcal{L}^{\#}(\theta_{min};D) - \mathcal{L}^{\#}(\theta^{\#};D) + \frac{\Lambda\|\theta^*\|^2}{2n}
\end{aligned} \tag{6}$$

The inequality above follows as $\mathcal{L}^{\#}(\theta^{\#};D) \le \mathcal{L}^{\#}(\theta^*;D)$.
Let us now bound $\mathcal{L}^{\#}(\theta_{min};D) - \mathcal{L}^{\#}(\theta^{\#};D)$. To this end, we first observe that since $\mathcal{L}_{priv}$ is $\frac{\Lambda}{n}$-strongly convex in $\theta$, we have that
$$\begin{aligned}
\mathcal{L}_{priv}(\theta^{\#};D) &\ge \mathcal{L}_{priv}(\theta_{min};D) - \nabla\mathcal{L}_{priv}(\theta_{min};D)^T(\theta_{min} - \theta^{\#}) + \frac{\Lambda}{2n}\|\theta^{\#} - \theta_{min}\|^2 \\
&= \mathcal{L}_{priv}(\theta_{min};D) + \frac{\Lambda}{2n}\|\theta^{\#} - \theta_{min}\|^2
\end{aligned} \tag{7}$$

The equality above follows as $\|\nabla\mathcal{L}_{priv}(\theta_{min};D)\| = 0$. Substituting the definition of $\mathcal{L}_{priv}(\cdot;D)$ into inequality 7, we get that
$$\begin{aligned}
\mathcal{L}^{\#}(\theta_{min};D) - \mathcal{L}^{\#}(\theta^{\#};D) &\le b_1^T(\theta^{\#} - \theta_{min}) - \frac{\Lambda}{2n}\|\theta^{\#} - \theta_{min}\|^2 &\quad (8) \\
&\le \|b_1\| \cdot \|\theta^{\#} - \theta_{min}\| &\quad (9)
\end{aligned}$$

Inequality 9 above follows by the Cauchy–Schwarz inequality.

Now, since $\mathcal{L}^{\#}(\theta_{min};D) - \mathcal{L}^{\#}(\theta^{\#};D) \ge 0$, it follows from inequalities 8 and 9 that
$$\|b_1\| \cdot \|\theta^{\#} - \theta_{min}\| \ge \frac{\Lambda}{2n}\|\theta^{\#} - \theta_{min}\|^2 \;\Rightarrow\; \|\theta^{\#} - \theta_{min}\| \le \frac{2n\|b_1\|}{\Lambda} \tag{10}$$

We get the statement of the lemma from equation 4, and inequalities 5, 6, 9, and 10.
Now, we are ready to prove Theorem 2.
Proof of Theorem 2. The proof is along the lines of the proof of Theorem 4 in [9]. First, let us get a high-probability bound on $\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta^*;D)$. To this end, we will first bound $\|b_1\|$ and $\|b_2\|$ w.h.p., where $b_s \sim \mathcal{N}(0, \sigma_s^2\mathbb{I}_{p\times p})$ for $s \in \{1, 2\}$. Using Lemma 2 from [43], we get that w.p. $\ge 1 - \frac{\alpha}{2}$,
$$\|b_s\| \le \sigma_s\sqrt{2p\log\frac{2}{\alpha}}.$$

Substituting this into Lemma A.1, we get that w.p. $\ge 1 - \alpha$,
$$\mathcal{L}(\theta_{out};D) - \mathcal{L}(\theta^*;D) \le L\left(\frac{n\gamma}{\Lambda} + \sigma_2\sqrt{2p\log\frac{2}{\alpha}}\right) + \frac{\Lambda\|\theta^*\|^2}{2n} + \frac{4n\sigma_1^2 p\log\frac{2}{\alpha}}{\Lambda}.$$
It is easy to see that by setting $\epsilon_i = \frac{\epsilon}{2}$ for $i \in \{1, 2\}$, $\epsilon_3 = \max\left\{\frac{\epsilon_1}{2}, \epsilon_1 - 0.99\right\}$, $\delta_j = \frac{\delta}{2}$ for $j \in \{1, 2\}$, and choosing
$$\Lambda = \Theta\left(\frac{L\sqrt{p\log(1/\delta)}}{\epsilon\|\theta^*\|} + \frac{n}{\|\theta^*\|}\sqrt{\frac{L\gamma\sqrt{p\log(1/\delta)}}{\epsilon}}\right)$$
such that it satisfies the constraint in Step 2 of Algorithm 1, we get the statement of the theorem.
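As a quick illustration of the Gaussian norm bound invoked above, the following sketch numerically estimates how often $\|b_s\|$ exceeds $\sigma_s\sqrt{2p\log(2/\alpha)}$. It is purely a sanity check under assumed values of σ, p, and α, and is not part of the proof.

```python
import numpy as np

# Empirical check (illustrative only) of the tail bound
# ||b|| <= sigma * sqrt(2 p log(2/alpha)) for b ~ N(0, sigma^2 I_p).
sigma, p, alpha, trials = 1.0, 100, 0.1, 20_000
bound = sigma * np.sqrt(2 * p * np.log(2 / alpha))
samples = np.random.normal(0.0, sigma, size=(trials, p))
exceed = np.mean(np.linalg.norm(samples, axis=1) > bound)
print(f"empirical exceedance: {exceed:.4f} (bound guarantees at most {alpha / 2})")
```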
B. Results for Huber SVM
This section reports the results of experiments with the Huber SVM loss function. The Huber SVM loss function is a differentiable and smooth approximation of the standard SVM's hinge loss. We define the loss function as in [18]. Defining $z = y\langle x, \theta\rangle$, the Huber SVM loss function is:
$$\ell(\theta, (x, y)) = \begin{cases} 1 - z & \text{if } 1 - z > h \\ 0 & \text{if } 1 - z < -h \\ \frac{(1-z)^2}{4h} + \frac{1-z}{2} + \frac{h}{4} & \text{otherwise} \end{cases}$$
As with logistic regression, the Huber SVM loss function has L2-Lipschitz constant L when, for each sample x, ‖x‖ ≤ L.
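For reference, the piecewise definition above translates directly into code. The sketch below (with illustrative names; it assumes labels y ∈ {−1, +1}) evaluates the loss for a single example.

```python
import numpy as np

def huber_svm_loss(theta, x, y, h=0.1):
    """Smoothed (Huber) SVM loss for one example (x, y), y in {-1, +1}.
    A minimal sketch of the loss defined above; the experiments use h = 0.1."""
    z = y * np.dot(x, theta)
    if 1 - z > h:        # hinge (linear) region
        return 1 - z
    elif 1 - z < -h:     # correctly classified with margin: zero loss
        return 0.0
    else:                # quadratic region smoothing the two pieces together
        return (1 - z) ** 2 / (4 * h) + (1 - z) / 2 + h / 4
```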
We repeat the experiments of Section V with the Huber SVM loss. To ensure that the experiments run to completion for Synthetic-H, we run them on 2,000 samples, each consisting of 2,000 dimensions. For all the experiments, we obtain the non-private baseline using SciPy's minimize procedure with the Huber SVM loss function defined above.
Dataset        NP baseline   AMP     H-F AMP   P-SGD   P-PSGD    P-SCPSGD   P-FW

Low-Dimensional Binary Datasets (ε = 0.1)
Synthetic-L    94.9          89.3    87.8      85.6    86.2      79.4       86.8
Adult          84.8          79.6    77.5      79.0    76.5      76.0       77.8
KDDCup99       99.1          98.7    98.7^7    98.5    98.5      98.1       98.0

Low-Dimensional Multi-class Datasets (ε = 1)^8
Covertype      71.5          66.4    65.3      64.3    62.3      62.7       63.3
MNIST          91.5          74.7    73.7      69.6    72.9      70.6       65.1

High-Dimensional Datasets (ε = 0.1)
Synthetic-H^9  96.5          55.2    54.3      55.0    56.6      55.6       56.0
Gisette        96.6          69.9    67.9      65.7    70.6      66.8       66.8
Real-sim       93.6          78.3    76.7      73.6    71.8      69.7       78.3
RCV-1^9        93.8          74.5    72.9      71.3    70.1      69.7       75.8

Real-World Datasets (ε = 0.1)
Dataset #1     75.3          75.3    75.3      75.3    75.3^10   75.3       75.3
Dataset #2     72.2          70.8    70.6      70.8    70.3      70.2       68.6
Dataset #3     73.6          71.3    71.2      71.2    71.1      71.1       71.1
Dataset #4^9   81.9          81.5    81.3      81.7    81.5      81.2       81.2

Fig. 3. Accuracy results (in %) for Huber SVM. For each dataset, the result in bold represents the DP algorithm with the best accuracy for that dataset. A key for the abbreviations used for the algorithms is provided in Table III.
Following Wu et al. [12], we set h = 0.1. The results are shown in Figure 4, with more precise numbers in Figure 3. They demonstrate a trend similar to the earlier results for logistic regression, with our Approximate Minima Perturbation approach generally providing the highest accuracy, although its advantage is less pronounced in this setting.
^7 H-F AMP can outperform AMP when the data-independent strategy provides a better value for the privacy budget fraction f1 than the specific set of values we consider for tuning in AMP.
^8 We report the accuracy for ε = 1 for multi-class datasets, as compared to ε = 0.1 for datasets with binary classification, since multi-class classification is a more difficult task than binary classification.
^9 The numbers cited here do not reflect the trend for this dataset, as can be seen from Figure 3.
^10 Slightly outperforms even the NP baseline, as can be seen from Figure 2.
[Figure 4 consists of thirteen accuracy-versus-ε plots, one per dataset: Synthetic-L (Low-Dim), Adult (Low-Dim), KDDCup99 (Low-Dim), Covertype (Low-Dim), MNIST (Low-Dim), Synthetic-H (High-Dim), Gisette (High-Dim), Real-sim (High-Dim), RCV-1 (High-Dim), and Datasets #1 through #4 (Real-World), with a color-coded legend covering the non-private baseline, Approximate Minima Perturbation, Hyperparameter-free Approximate Minima Perturbation, Private SGD, Private PSGD, Private Strongly-convex PSGD, and Private Frank-Wolfe.]

Fig. 4. Accuracy results for Huber SVM. Horizontal axis depicts varying values of ε; vertical axis shows accuracy (in %) on the testing set.
C. Pseudocode for the Algorithms Evaluated in Section V
Algorithm 2: Differentially Private Minibatch Stochastic Gradient Descent [16], [15]
Input: Dataset D = {d_1, ..., d_n}; loss function ℓ(θ; d_i) with L2-Lipschitz constant L; privacy parameters (ε, δ); number of iterations T; minibatch size k; learning-rate function η : [T] → ℝ.
1: σ² ← 16L²T log(1/δ) / (n²ε²)
2: θ_1 ← 0^p
3: for t = 1 to T − 1 do
4:   s_1, ..., s_k ← sample k points uniformly with replacement from D
5:   b_t ∼ N(0, σ²I_{p×p})
6:   θ_{t+1} ← θ_t − η(t) [ (1/k) Σ_{i=1}^{k} ∇ℓ(θ_t; s_i) + b_t ]
7: end for
8: Output θ_T
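For concreteness, here is a minimal NumPy sketch of Algorithm 2. The helper names (grad_loss, lr) and the data representation are illustrative assumptions, not the paper's released code, and gradient-norm clipping to L is assumed to be handled inside grad_loss.

```python
import numpy as np

def dp_minibatch_sgd(data, grad_loss, eps, delta, T, k, lr, L, p):
    """Sketch of DP minibatch SGD: add Gaussian noise to each averaged
    minibatch gradient. `grad_loss(theta, d)` returns a per-example
    gradient with L2 norm at most L; `lr(t)` is the learning-rate function."""
    n = len(data)
    sigma2 = 16 * L**2 * T * np.log(1 / delta) / (n**2 * eps**2)
    theta = np.zeros(p)
    for t in range(1, T):
        # sample k points uniformly with replacement
        batch = [data[i] for i in np.random.randint(0, n, size=k)]
        avg_grad = np.mean([grad_loss(theta, d) for d in batch], axis=0)
        noise = np.random.normal(0.0, np.sqrt(sigma2), size=p)
        theta = theta - lr(t) * (avg_grad + noise)
    return theta
```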
Algorithm 3: Differentially Private Permutation-based Stochastic Gradient Descent [12]
Input: Dataset D = {d_1, ..., d_n}; loss function ℓ(θ; d_i) with L2-Lipschitz constant L; privacy parameters (ε, δ); number of passes T; minibatch size k; constant learning rate η.
1: θ ← 0^p
2: Let τ be a random permutation of [n]
3: for t = 1 to T − 1 do
4:   for b = 1 to n/k do
5:     Let s_1 = d_{τ(bk)}, ..., s_k = d_{τ((b+1)k − 1)}
6:     θ ← θ − η · (1/k) Σ_{i=1}^{k} ∇ℓ(θ; s_i)
7:   end for
8: end for
9: σ² ← 8T²L²η² log(2/δ) / (k²ε²)
10: b ∼ N(0, σ²I_{p×p})
11: Output θ_priv = θ + b
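A minimal sketch of this output-perturbation approach follows: plain permutation-based SGD is run on the data, and Gaussian noise is added once, to the final iterate. Names are illustrative, not the released implementation.

```python
import numpy as np

def dp_psgd(data, grad_loss, eps, delta, T, k, eta, L, p):
    """Sketch of output-perturbed permutation-based SGD (Algorithm 3):
    non-private SGD over disjoint minibatches of a random permutation,
    followed by a single Gaussian perturbation of the result."""
    n = len(data)
    theta = np.zeros(p)
    perm = np.random.permutation(n)
    for _ in range(T - 1):                 # passes over the data
        for b in range(n // k):            # disjoint minibatches
            batch = [data[perm[b * k + i]] for i in range(k)]
            theta = theta - eta * np.mean([grad_loss(theta, d) for d in batch], axis=0)
    sigma2 = 8 * T**2 * L**2 * eta**2 * np.log(2 / delta) / (k**2 * eps**2)
    return theta + np.random.normal(0.0, np.sqrt(sigma2), size=p)
```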
Algorithm 4: Differentially Private Strongly Convex Permutation-based Stochastic Gradient Descent [12]
Input: Dataset D = {d_1, ..., d_n}; loss function ℓ(θ; d_i) that is ξ-strongly convex and β-smooth with L2-Lipschitz constant L; privacy parameters (ε, δ); number of passes T; minibatch size k.
1: θ ← 0^p
2: Let τ be a random permutation of [n]
3: for t = 1 to T − 1 do
4:   η_t ← min{1/β, 1/(ξt)}
5:   for b = 1 to n/k do
6:     Let s_1 = d_{τ(bk)}, ..., s_k = d_{τ((b+1)k − 1)}
7:     θ ← θ − η_t · (1/k) Σ_{i=1}^{k} ∇ℓ(θ; s_i)
8:   end for
9: end for
10: σ² ← 8L² log(2/δ) / (ξ²n²ε²)
11: b ∼ N(0, σ²I_{p×p})
12: Output θ_priv = θ + b
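The strongly convex variant differs from Algorithm 3 only in its decaying step size min{1/β, 1/(ξt)} and in the noise scale, which exploits ξ-strong convexity. A sketch, again with illustrative names:

```python
import numpy as np

def dp_sc_psgd(data, grad_loss, eps, delta, T, k, L, beta, xi, p):
    """Sketch of Algorithm 4: permutation-based SGD for a xi-strongly
    convex, beta-smooth loss, with output perturbation calibrated to
    the strong convexity parameter."""
    n = len(data)
    theta = np.zeros(p)
    perm = np.random.permutation(n)
    for t in range(1, T):
        eta_t = min(1 / beta, 1 / (xi * t))    # decaying step size
        for b in range(n // k):
            batch = [data[perm[b * k + i]] for i in range(k)]
            theta = theta - eta_t * np.mean([grad_loss(theta, d) for d in batch], axis=0)
    sigma2 = 8 * L**2 * np.log(2 / delta) / (xi**2 * n**2 * eps**2)
    return theta + np.random.normal(0.0, np.sqrt(sigma2), size=p)
```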
Algorithm 5: Differentially Private Frank-Wolfe [44]
Input: Dataset D = {d_1, ..., d_n}; loss function L(θ; D) = (1/n) Σ_{i=1}^{n} ℓ(θ; d_i) (with L1-Lipschitz constant L for ℓ); privacy parameters (ε, δ); number of iterations T; convex set C = conv(S), with ‖C‖₁ denoting max_{s∈S} ‖s‖₁ and S being the set of corners.
1: Choose an arbitrary θ_1 from C
2: λ² ← 32L²‖C‖₁²T log(1/δ) / (n²ε²)
3: for t = 1 to T − 1 do
4:   ∀s ∈ S: α_s ← ⟨s, ∇L(θ_t; D)⟩ + Lap(λ), where Lap(λ) has density (1/(2λ)) e^{−|x|/λ}
5:   θ̃_t ← argmin_{s∈S} α_s
6:   θ_{t+1} ← (1 − η_t)θ_t + η_t θ̃_t, where η_t = 1/(t+1)
7: end for
8: Output θ_priv = θ_T
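Algorithm 5 depends on the corner set S. The sketch below instantiates it for the common case of the L1 ball of radius ‖C‖₁, whose 2p corners are ±‖C‖₁ · e_j, so the noisy-minimum step reduces to an argmin over 2p perturbed inner products. All names are illustrative assumptions, not the released implementation.

```python
import numpy as np

def dp_frank_wolfe(grad_L, n, p, eps, delta, T, L, C_norm):
    """Sketch of DP Frank-Wolfe over the L1 ball of radius C_norm.
    `grad_L(theta)` returns the gradient of the average loss over the
    n-point dataset; corners of the L1 ball are +/- C_norm * e_j."""
    lam = np.sqrt(32 * L**2 * C_norm**2 * T * np.log(1 / delta) / (n**2 * eps**2))
    theta = np.zeros(p)                      # an arbitrary point of C
    for t in range(1, T):
        g = grad_L(theta)
        # noisy scores <s, g> + Lap(lam) for corners [+C e_1..+C e_p, -C e_1..-C e_p]
        scores = np.concatenate([C_norm * g, -C_norm * g])
        scores += np.random.laplace(scale=lam, size=2 * p)
        j = np.argmin(scores)
        corner = np.zeros(p)
        corner[j % p] = C_norm if j < p else -C_norm
        eta = 1.0 / (t + 1)
        theta = (1 - eta) * theta + eta * corner   # convex combination stays in C
    return theta
```

Because each iterate is a convex combination of at most t corners, the output is sparse when the corner set is the L1 ball, which matches the observation in Section V-F that private Frank-Wolfe does best on high-dimensional problems with sparse optima.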