
Proximity Operators of Discrete Information Divergences

Mireille El Gheche, Giovanni Chierchia, Member, IEEE, and Jean-Christophe Pesquet, Fellow, IEEE

Abstract—While ϕ-divergences have been extensively studied in convex analysis, their use in optimization problems often remains challenging. In this regard, one of the main shortcomings of existing methods is that the minimization of ϕ-divergences is usually performed with respect to one of their arguments, possibly within alternating optimization techniques. In this paper, we overcome this limitation by deriving new closed-form expressions for the proximity operator of such two-variable functions. This makes it possible to employ standard proximal methods for efficiently solving a wide range of convex optimization problems involving ϕ-divergences. In addition, we show that these proximity operators are useful to compute the epigraphical projection of several functions. The proposed proximal tools are numerically validated in the context of optimal query execution within database management systems, where the problem of selectivity estimation plays a central role. Experiments are carried out on small to large scale scenarios.

Index Terms—Convex Optimization, Divergences, Proximity Operator, Proximal Algorithms, Epigraphical Projection.

I. INTRODUCTION

DIVERGENCE measures play a crucial role in evaluating the dissimilarity between two information sources. The idea of quantifying how much information is shared between two probability distributions can be traced back to the work by Pearson [3] and Hellinger [4]. Later, Shannon [5] introduced a powerful mathematical framework that links the notion of information with communications and related areas, laying the foundations for information theory. In this context, a key measure of information is the Kullback-Leibler divergence [6], which can be regarded as an instance of the wider class of ϕ-divergences [7]–[9], including also Jeffreys, Hellinger, Chi-square, Renyi, and Iα divergences [10].

The Kullback-Leibler (KL) divergence has been known to play a prominent role in the computation of channel capacity and rate-distortion functions [11], [12]. These problems can be addressed either with alternating minimization approaches [13], [14] or geometric programming [15]. The KL divergence was also used as a metric for maximizing a log-likelihood in a proximal method generalizing the EM algorithm [16], but here one of its two variables is fixed. The generalized KL divergence (also called I-divergence) is widely used in inverse problems for recovering a signal of interest from an observation degraded by Poisson noise. In such a case, the generalized KL divergence is employed as a data fidelity term, and the resulting optimization approach can be solved through an alternating projection scheme [17]. The problem was formulated in a similar manner by Richardson [18], Lucy [19], and others [20]–[27]. However, in the latter works, one of the two variables of the I-divergence is fixed.

Mireille El Gheche is with the Signal Processing Laboratory (LTS4), Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland.

Giovanni Chierchia is with Universite Paris-Est, LIGM UMR 8049, CNRS, ENPC, ESIEE Paris, UPEM, Noisy-le-Grand, France.

Jean-Christophe Pesquet is with the Center for Visual Computing, CentraleSupelec, University Paris-Saclay, Gif-sur-Yvette, France.

Part of the material in this paper was presented in [1], [2].

The classical symmetrization of the KL divergence, known as the Jeffreys (Jef) divergence [28], was recently used in the k-means algorithm as a replacement of the squared difference [29], [30], yielding an analytical expression of the centroids in terms of the Lambert W function. Moreover, tight bounds for this divergence were recently derived in terms of the total variation distance [31], similarly to the KL divergence [32].

The Hellinger (Hel) divergence was originally introduced in [33] and later rediscovered under different names [34]–[37]. In the field of information theory, the Hel divergence is commonly used for nonparametric density estimation [38], [39], data analytics [40], and machine learning [41].

The Chi-square divergence was originally used to quantitatively assess whether an observed phenomenon tends to confirm or deny a given hypothesis [3]. It was also successfully applied in different contexts, such as information theory and signal processing, as a dissimilarity measure between two probability distributions [9], [42].

The Renyi divergence was introduced as a measure of information related to the Renyi entropy [43], indicating how much a probabilistic mixture of two codes can be compressed [44]. It has been studied and applied in many areas [36], [45], [46], including image registration and alignment problems [47].

The Iα divergence was originally proposed to statistically evaluate the efficiency of a hypothesis test [48]. Subsequently, it was recognized as an instance of the ϕ-divergences [49] and the Bregman divergences [50], and further extended by many researchers [46], [50]–[52]. This divergence was also considered in the context of non-negative matrix factorization [53].

A. Contributions

To the best of our knowledge, existing approaches for optimizing convex criteria involving ϕ-divergences are often restricted to specific cases, such as performing the minimization w.r.t. one of the divergence arguments. In order to take into account both arguments, one may resort to alternating minimization schemes, but only when specific assumptions are met. Otherwise, there exist some approaches that exploit the presence of additional moment constraints [54], or the equivalence between ϕ-divergences and some loss functions [55], but they provide little insight into the numerical procedure for solving the resulting optimization problems.


In the context of proximal methods, there exists no general approach for performing the minimization w.r.t. both arguments of a ϕ-divergence. This limitation can be explained by the fact that only a few closed-form expressions are available for the proximity operator of non-separable convex functions, as opposed to separable ones [56], [57]. Some examples of such functions are the Euclidean norm [58], the squared Euclidean norm composed with an arbitrary linear operator [58], a separable function composed with an orthonormal or semi-orthogonal linear operator [58], the max function [59], the quadratic-over-linear function [60]–[62], and the indicator function of some closed convex sets [58], [63].

In this work, we develop a novel proximal approach that allows us to address more general forms of optimization problems involving ϕ-divergences. Our main contribution is the derivation of new closed-form expressions for the proximity operator of such functions. This makes it possible to employ standard proximal methods [64]–[73] for efficiently solving a wide range of convex optimization problems involving ϕ-divergences. In addition to its flexibility, the proposed approach leads to parallel algorithms that can be efficiently implemented on both multicore and GPGPU architectures [74].

B. Organization

The remainder of the paper is organized as follows. Section II presents the general form of the optimization problem that we aim at solving. Section III studies the proximity operator of ϕ-divergences and some of its properties. Section IV details the closed-form expressions of the aforementioned proximity operators. Section V makes the connection with epigraphical projections. Section VI illustrates the application to selectivity estimation for query optimization in database management systems. Finally, Section VII concludes the paper.

C. Notation

Throughout the paper, Γ0(H) denotes the class of convex functions f defined on a real Hilbert space H, taking their values in ]−∞,+∞], which are lower-semicontinuous and proper (i.e., their domain dom f on which they take finite values is nonempty). ‖·‖ and 〈· | ·〉 denote the norm and the scalar product of H, respectively. The Moreau subdifferential of f at x ∈ H is ∂f(x) = {u ∈ H | (∀y ∈ H) 〈y − x | u〉 + f(x) ≤ f(y)}. If f ∈ Γ0(H) is Gateaux differentiable at x, ∂f(x) = {∇f(x)}, where ∇f(x) denotes the gradient of f at x. The conjugate of f is f* ∈ Γ0(H) such that (∀u ∈ H) f*(u) = sup_{x∈H} (〈x | u〉 − f(x)). The proximity operator of f is the mapping prox_f : H → H defined as [75]

(∀x ∈ H)  prox_f(x) = argmin_{y∈H} f(y) + (1/2)‖x − y‖².   (1)

Let C be a nonempty closed convex subset of H. The indicator function of C is defined as

(∀x ∈ H)  ι_C(x) = 0 if x ∈ C;  +∞ otherwise.   (2)

The elements of a vector x ∈ H = R^N are denoted by x = (x^{(ℓ)})_{1≤ℓ≤N}, whereas I_N is the N×N identity matrix.
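To make definition (1) concrete, here is a minimal numerical sketch (our own illustration, not part of the paper; it assumes NumPy and SciPy are available, and the function names are ours). It evaluates the proximity operator of γ|·| both by direct minimization of (1) and with the standard soft-thresholding closed form:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def prox_numeric(f, x, gamma=1.0):
        """Evaluate prox_{gamma f}(x) for a scalar function f by minimizing (1) numerically."""
        res = minimize_scalar(lambda y: gamma * f(y) + 0.5 * (y - x) ** 2)
        return res.x

    def prox_abs(x, gamma=1.0):
        """Closed-form proximity operator of gamma*|.| (soft thresholding)."""
        return np.sign(x) * max(abs(x) - gamma, 0.0)

    x, gamma = 2.3, 0.7
    print(prox_numeric(abs, x, gamma))  # approximately 1.6
    print(prox_abs(x, gamma))           # 1.6

Both evaluations coincide, which is exactly what (1) expresses: the proximity operator is the unique minimizer of the regularized quadratic problem.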

II. PROBLEM FORMULATION

The objective of this paper is to address convex optimization problems involving a discrete information divergence. In particular, the focus is put on the following formulation.

Problem II.1 Let D be a function in Γ0(R^P × R^P). Let A and B be matrices in R^{P×N}, and let u and v be vectors in R^P. For every s ∈ {1, . . . , S}, let R_s be a function in Γ0(R^{K_s}) and T_s ∈ R^{K_s×N}. We want to

minimize_{x∈R^N}  D(Ax + u, Bx + v) + ∑_{s=1}^{S} R_s(T_s x).   (3)

Note that the functions D and (R_s)_{1≤s≤S} are allowed to take the value +∞, so that Problem II.1 can include convex constraints by letting some of the functions R_s be equal to the indicator function ι_{C_s} of some nonempty closed convex set C_s. In inverse problems, R_s may also model some additional prior information, such as the sparsity of coefficients after some appropriate linear transform T_s.

A. Applications in information theory

A special case of interest in information theory arises by decomposing x into two vectors p ∈ R^{P′} and q ∈ R^{Q′}, that is x = [p⊤ q⊤]⊤ with N = P′ + Q′. Indeed, set u = v = 0, A = [A′ 0] with A′ ∈ R^{P×P′}, B = [0 B′] with B′ ∈ R^{P×Q′} and, for every s ∈ {1, . . . , S}, T_s = [U_s V_s] with U_s ∈ R^{K_s×P′} and V_s ∈ R^{K_s×Q′}. Then, Problem II.1 takes the following form:

Problem II.2 Let A′, B′, (U_s)_{1≤s≤S}, and (V_s)_{1≤s≤S} be matrices as defined above. Let D be a function in Γ0(R^P × R^P) and, for every s ∈ {1, . . . , S}, let R_s be a function in Γ0(R^{K_s}). We want to

minimize_{(p,q)∈R^{P′}×R^{Q′}}  D(A′p, B′q) + ∑_{s=1}^{S} R_s(U_s p + V_s q).   (4)

Several tasks can be formulated within this framework, such as the computation of channel capacity and rate-distortion functions [11], [12], the selection of log-optimal portfolios [76], maximum likelihood estimation from incomplete data [77], soft-supervised learning for text classification [78], simultaneously estimating a regression vector and an additional model parameter [62] or the image gradient distribution and a parametric model distribution [79], as well as image registration [1], deconvolution [2], and recovery [17]. We next detail an important application example in source coding.

Example II.3 Assume that a discrete memoryless source E, taking its values in a finite alphabet {e_1, . . . , e_{P_1}} with probability P(E), is to be encoded by a compressed signal Ē in terms of a second alphabet {ē_1, . . . , ē_{P_2}}. Furthermore, for every j ∈ {1, . . . , P_1} and k ∈ {1, . . . , P_2}, let δ^{(k,j)} be the distortion induced when substituting ē_k for e_j. We wish to find an encoding P(Ē|E) that yields a point on the rate-distortion curve at a given distortion value δ ∈ ]0,+∞[. It is well-known [80] that this amounts to minimizing the mutual information I between E and Ē; more precisely, the rate-distortion function R is given by

R(δ) = min_{P(Ē|E)} I(E, Ē),   (5)

subject to the constraint

∑_{j=1}^{P_1} ∑_{k=1}^{P_2} δ^{(k,j)} P(E = e_j) P(Ē = ē_k | E = e_j) ≤ δ.   (6)

The mutual information can be written as [11, Theorem 4(a)]

min_{P(Ē)} ∑_{j=1}^{P_1} ∑_{k=1}^{P_2} P(E = e_j, Ē = ē_k) ln( P(E = e_j, Ē = ē_k) / (P(E = e_j) P(Ē = ē_k)) ),   (7)

subject to the constraint

∑_{k=1}^{P_2} P(Ē = ē_k) = 1.   (8)

Moreover, the constraint in (6) can be reexpressed as

∑_{j=1}^{P_1} ∑_{k=1}^{P_2} δ^{(k,j)} P(E = e_j, Ē = ē_k) ≤ δ,   (9)

with

(∀j ∈ {1, . . . , P_1})  ∑_{k=1}^{P_2} P(E = e_j, Ē = ē_k) = P(E = e_j).   (10)

The unknown variables are thus the vectors

p = (P(E = e_j, Ē = ē_k))_{1≤j≤P_1, 1≤k≤P_2} ∈ R^{P_1 P_2}   (11)

and

q = (P(Ē = ē_k))_{1≤k≤P_2} ∈ R^{P_2},   (12)

whose optimal values are solutions to the problem:

minimize_{p∈C_2∩C_3, q∈C_1}  D(p, r ⊗ q)   (13)

where r = (P(E = e_j))_{1≤j≤P_1} ∈ R^{P_1}, ⊗ denotes the Kronecker product, D is the Kullback-Leibler divergence, and C_1, C_2, C_3 are the closed convex sets corresponding to the linear constraints (8), (9), (10), respectively. The above formulation is a special case of Problem II.2 in which P = P′ = P_1 P_2, Q′ = P_2, A′ = I_P, B′ is such that (∀q ∈ R^{Q′}) B′q = r ⊗ q, S = 3, V_1 = I_{Q′}, U_2 = U_3 = I_P, U_1 and V_2 = V_3 are null matrices, and (∀s ∈ {1, 2, 3}) R_s is the indicator function of the constraint convex set C_s.

B. Considered class of divergences

We will focus on additive information measures of the form

(∀(p, q) ∈ R^P × R^P)  D(p, q) = ∑_{i=1}^{P} Φ(p^{(i)}, q^{(i)}),   (14)

where Φ ∈ Γ0(R × R) is the perspective function [81] on [0,+∞[ × ]0,+∞[ of a function ϕ: R → [0,+∞] belonging to Γ0(R) and twice differentiable on ]0,+∞[. In other words, Φ is defined as follows: for every (υ, ξ) ∈ R²,

Φ(υ, ξ) =
  ξ ϕ(υ/ξ)                 if υ ∈ [0,+∞[ and ξ ∈ ]0,+∞[
  υ lim_{ζ→+∞} ϕ(ζ)/ζ      if υ ∈ ]0,+∞[ and ξ = 0
  0                        if υ = ξ = 0
  +∞                       otherwise,   (15)

where the above limit is guaranteed to exist [82, Sec. 2.3]. Moreover, if ϕ is a strictly convex function such that

ϕ(1) = ϕ′(1) = 0,   (16)

the function D in (14) belongs to the class of ϕ-divergences [7], [83]. Then, for every (p, q) ∈ [0,+∞[^P × [0,+∞[^P,

D(p, q) ≥ 0   (17)
D(p, q) = 0  ⇔  p = q.   (18)

Examples of ϕ-divergences will be provided in Sections IV-A, IV-B, IV-C, IV-D and IV-F. For a thorough investigation of the rich properties of ϕ-divergences, the reader is referred to [7], [8], [84]. Other divergences (e.g., the Renyi divergence) are expressed as

(∀(p, q) ∈ R^P × R^P)  D_g(p, q) = g(D(p, q))   (19)

where g is an increasing function. Then, provided that g(ϕ(1)) = 0, D_g(p, q) ≥ 0 for every [p⊤ q⊤]⊤ ∈ C with

C = {x ∈ [0, 1]^{2P} | ∑_{i=1}^{P} x^{(i)} = 1 and ∑_{i=1}^{P} x^{(P+i)} = 1}.   (20)

From an optimization standpoint, minimizing D or D_g (possibly subject to constraints) makes no difference, hence we will only address problems involving D in the rest of this paper.

C. Proximity operators

Proximity operators will be fundamental tools in this paper. We first recall some of their key properties.

Proposition II.4 [75], [81] Let f ∈ Γ0(H). Then,

(i) For every x ∈ H, prox_f x ∈ dom f.

(ii) For every (x, x̄) ∈ H²,

x̄ = prox_f(x)  ⇔  x − x̄ ∈ ∂f(x̄).   (21)

(iii) For every (x, z) ∈ H²,

prox_{f(·+z)}(x) = prox_f(x + z) − z.   (22)

(iv) For every (x, z) ∈ H² and for every α ∈ R,

prox_{f+〈z|·〉+α}(x) = prox_f(x − z).   (23)

(v) Let f* be the conjugate function of f. For every x ∈ H and for every γ ∈ ]0,+∞[,

prox_{γf*}(x) = x − γ prox_{f/γ}(x/γ).   (24)


(vi) Let G be a real Hilbert space and let T: G → H be a bounded linear operator, with the adjoint denoted by T*. If TT* = κ Id with κ ∈ ]0,+∞[, then, for all x ∈ G,

prox_{f∘T}(x) = x + (1/κ) T*( prox_{κf}(Tx) − Tx ).   (25)

Numerous additional properties of proximity operators are mentioned in [57], [85].

In this paper, we will be mainly concerned with the determination of the proximity operator of the function D defined in (14) with H = R^P × R^P. The next result emphasizes that this task reduces to the calculation of the proximity operator of a real function of two variables.

Proposition II.5 Let D be defined by (14), where Φ ∈ Γ0(R²), and let γ ∈ ]0,+∞[. Let u ∈ R^P and v ∈ R^P. Then, for every p ∈ R^P and for every q ∈ R^P,

prox_{γD(·+u,·+v)}(p, q) = (p̄ − u, q̄ − v)   (26)

where, for every i ∈ {1, . . . , P},

(p̄^{(i)}, q̄^{(i)}) = prox_{γΦ}(p^{(i)} + u^{(i)}, q^{(i)} + v^{(i)}).   (27)
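As a sketch of how Proposition II.5 can be used in practice (our own illustration, assuming NumPy; prox_phi is a hypothetical placeholder standing for any scalar routine returning prox_{γΦ} of a pair of reals), the proximity operator of γD(·+u, ·+v) is assembled componentwise:

    import numpy as np

    def prox_divergence(prox_phi, p, q, u, v, gamma=1.0):
        """Proximity operator of gamma*D(.+u, .+v), cf. Eqs. (26)-(27).
        prox_phi(a, b, gamma) must return prox_{gamma*Phi}(a, b) for scalar (a, b)."""
        p_bar = np.empty(len(p), dtype=float)
        q_bar = np.empty(len(q), dtype=float)
        for i in range(len(p)):
            # componentwise two-variable proximity operator at the shifted point
            p_bar[i], q_bar[i] = prox_phi(p[i] + u[i], q[i] + v[i], gamma)
        return p_bar - u, q_bar - v

The loop is embarrassingly parallel, which is what makes the approach attractive on multicore or GPGPU architectures, as noted in the introduction.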

Note that, although an extensive list of proximity operators of one-variable real functions can be found in [57], few results are available for real functions of two variables [58], [60], [61], [63]. An example of such a result is provided below.

Proposition II.6 Let ϕ ∈ Γ0(R) be an even differentiable function on R \ {0}. Let Φ: R² → ]−∞,+∞] be defined as: (∀(ν, ξ) ∈ R²)

Φ(ν, ξ) =
  ϕ(ν − ξ)   if (ν, ξ) ∈ [0,+∞[²
  +∞         otherwise.   (28)

Then, for every (ν, ξ) ∈ R²,

prox_Φ(ν, ξ) =
  (1/2)(ν + ξ + π_1, ν + ξ − π_1)   if |π_1| < ν + ξ
  (0, π_2)                          if π_2 > 0 and π_2 ≥ ν + ξ
  (π_3, 0)                          if π_3 > 0 and π_3 ≥ ν + ξ
  (0, 0)                            otherwise,   (29)

with π_1 = prox_{2ϕ}(ν − ξ), π_2 = prox_ϕ(ξ) and π_3 = prox_ϕ(ν).

The above proposition provides a simple characterization of the proximity operators of some distances defined for nonnegative-valued vectors. However, the assumptions made in Proposition II.6 are not satisfied by the class of functions Φ considered in Section II-B.¹

¹ Indeed, none of the considered ϕ-divergences can be expressed as a function of the difference between the two arguments.

III. MAIN RESULT

As shown by Proposition II.5, we need to compute the proximity operator of a scaled version of a function Φ ∈ Γ0(R²) as defined in (15). In the following, Θ denotes a primitive on ]0,+∞[ of the function ζ ↦ ζϕ′(ζ⁻¹). The following functions will subsequently play an important role:

ϑ− : ]0,+∞[ → R : ζ ↦ ϕ′(ζ⁻¹)   (30)
ϑ+ : ]0,+∞[ → R : ζ ↦ ϕ(ζ⁻¹) − ζ⁻¹ϕ′(ζ⁻¹).   (31)

A first technical result is as follows.

Lemma III.1 Let γ ∈ ]0,+∞[, let (υ, ξ) ∈ R², and define

χ− = inf {ζ ∈ ]0,+∞[ | ϑ−(ζ) < γ⁻¹υ}   (32)
χ+ = sup {ζ ∈ ]0,+∞[ | ϑ+(ζ) < γ⁻¹ξ}   (33)

(with the usual convention inf ∅ = +∞ and sup ∅ = −∞). If χ− ≠ +∞, the function

ψ : ]0,+∞[ → R : ζ ↦ ζϕ(ζ⁻¹) − Θ(ζ) + (γ⁻¹υ/2) ζ² − γ⁻¹ξ ζ   (34)

is strictly convex on ]χ−,+∞[. In addition, if

(i) χ− ≠ +∞ and χ+ ≠ −∞,
(ii) lim_{ζ→χ−, ζ>χ−} ψ′(ζ) < 0,
(iii) lim_{ζ→χ+} ψ′(ζ) > 0,

then ψ admits a unique minimizer ζ̄ on ]χ−,+∞[, and ζ̄ < χ+.

Proof. The derivative of ψ is, for every ζ ∈ ]0,+∞[,

ψ′(ζ) = ϕ(ζ⁻¹) − (ζ + ζ⁻¹) ϕ′(ζ⁻¹) + γ⁻¹υ ζ − γ⁻¹ξ
      = ζ (γ⁻¹υ − ϑ−(ζ)) + ϑ+(ζ) − γ⁻¹ξ.   (35)

The function ϑ− is decreasing as the convexity of ϕ yields

(∀ζ ∈ ]0,+∞[)  ϑ−′(ζ) = −ζ⁻² ϕ″(ζ⁻¹) ≤ 0.   (36)

This allows us to deduce that

if {ζ ∈ ]0,+∞[ | ϑ−(ζ) < γ⁻¹υ} ≠ ∅, then ]χ−,+∞[ = {ζ ∈ ]0,+∞[ | ϑ−(ζ) < γ⁻¹υ}.   (37)

Similarly, the function ϑ+ is increasing as the convexity of ϕ yields

(∀ζ ∈ ]0,+∞[)  ϑ+′(ζ) = ζ⁻³ ϕ″(ζ⁻¹) ≥ 0   (38)

which allows us to deduce that

if {ζ ∈ ]0,+∞[ | ϑ+(ζ) < γ⁻¹ξ} ≠ ∅, then ]0, χ+[ = {ζ ∈ ]0,+∞[ | ϑ+(ζ) < γ⁻¹ξ}.   (39)

If (χ−, χ+) ∈ ]0,+∞[², then (35) leads to

ψ′(χ−) = ϑ+(χ−) − γ⁻¹ξ   (40)
ψ′(χ+) = χ+ (γ⁻¹υ − ϑ−(χ+)).   (41)

So, Conditions (ii) and (iii) are equivalent to

ϑ+(χ−) − γ⁻¹ξ < 0   (42)
χ+ (γ⁻¹υ − ϑ−(χ+)) > 0.   (43)

In view of (37) and (39), these inequalities are satisfied if and only if χ− < χ+. This inequality is also obviously satisfied if χ− = 0 or χ+ = +∞. In addition, we have: (∀ζ ∈ ]0,+∞[)

ψ″(ζ) = γ⁻¹υ − ϑ−(ζ) + ζ⁻¹ (1 + ζ⁻²) ϕ″(ζ⁻¹).   (44)

When ζ > χ− ≠ +∞, γ⁻¹υ − ϑ−(ζ) > 0, and the convexity of ϕ yields ψ″(ζ) > 0. This shows that ψ is strictly convex on ]χ−,+∞[.

If Conditions (i)-(iii) are satisfied, due to the continuity of ψ′, there exists ζ̄ ∈ ]χ−, χ+[ such that ψ′(ζ̄) = 0. Because of the strict convexity of ψ on ]χ−,+∞[, ζ̄ is the unique minimizer of ψ on this interval.

The required assumptions in the previous lemma can often be simplified as stated below.

Lemma III.2 Let γ ∈ ]0,+∞[ and (υ, ξ) ∈ R². If (χ−, χ+) ∈ ]0,+∞[², then Conditions (ii) and (iii) in Lemma III.1 are equivalent to: χ− < χ+. If χ− ∈ ]0,+∞[ and χ+ = +∞ (resp. χ− = 0 and χ+ ∈ ]0,+∞[), Conditions (ii)-(iii) are satisfied if and only if lim_{ζ→+∞} ψ′(ζ) > 0 (resp. lim_{ζ→0, ζ>0} ψ′(ζ) < 0).

Proof. If (χ−, χ+) ∈ ]0,+∞[², we have already shown that Conditions (ii) and (iii) are satisfied if and only if χ− < χ+. If χ− ∈ ]0,+∞[ and χ+ = +∞ (resp. χ− = 0 and χ+ ∈ ]0,+∞[), we still have

ψ′(χ−) = ϑ+(χ−) − γ⁻¹ξ < 0   (45)
(resp. ψ′(χ+) = χ+ (γ⁻¹υ − ϑ−(χ+)) > 0),   (46)

which shows that Condition (ii) (resp. Condition (iii)) is always satisfied.

By using the same expressions of χ− and χ+ as in the previous lemmas, we obtain the following characterization of the proximity operator of any scaled version of Φ:

Proposition III.3 Let γ ∈ ]0,+∞[ and (υ, ξ) ∈ R². Then prox_{γΦ}(υ, ξ) ∈ ]0,+∞[² if and only if Conditions (i)-(iii) in Lemma III.1 are satisfied. When these conditions hold,

prox_{γΦ}(υ, ξ) = (υ − γ ϑ−(ζ̄), ξ − γ ϑ+(ζ̄))   (47)

where ζ̄ < χ+ is the unique minimizer of ψ on ]χ−,+∞[.

Proof. For every (υ, ξ) ∈ R² such that Conditions (i)-(iii) in Lemma III.1 hold, let

ῡ = υ − γ ϑ−(ζ̄)   (48)
ξ̄ = ξ − γ ϑ+(ζ̄)   (49)

where the existence of ζ̄ ∈ ]χ−, χ+[ is guaranteed by Lemma III.1. As consequences of (37) and (39), ῡ and ξ̄ are positive. In addition, since

ψ′(ζ̄) = 0  ⇔  ζ̄ (γ⁻¹υ − ϑ−(ζ̄)) = γ⁻¹ξ − ϑ+(ζ̄)   (50)

we derive from (48) and (49) that ζ̄ = ξ̄/ῡ > 0. This allows us to re-express (48) and (49) as

ῡ − υ + γ ϕ′(ῡ/ξ̄) = 0   (51)
ξ̄ − ξ + γ ( ϕ(ῡ/ξ̄) − (ῡ/ξ̄) ϕ′(ῡ/ξ̄) ) = 0,   (52)

that is

ῡ − υ + γ ∂Φ/∂υ (ῡ, ξ̄) = 0   (53)
ξ̄ − ξ + γ ∂Φ/∂ξ (ῡ, ξ̄) = 0.   (54)

The latter equations are satisfied if and only if [57]

(ῡ, ξ̄) = prox_{γΦ}(υ, ξ).   (55)

Conversely, for every (υ, ξ) ∈ R², let (ῡ, ξ̄) = prox_{γΦ}(υ, ξ). If (ῡ, ξ̄) ∈ ]0,+∞[², then (ῡ, ξ̄) satisfies (51) and (52). By setting ζ̄ = ξ̄/ῡ > 0, after simple calculations, we find

ῡ = υ − γ ϑ−(ζ̄) > 0   (56)
ξ̄ = ξ − γ ϑ+(ζ̄) > 0   (57)
ψ′(ζ̄) = 0.   (58)

According to (37) and (39), (56) and (57) imply that χ− ≠ +∞, χ+ ≠ −∞, and ζ̄ ∈ ]χ−, χ+[. In addition, according to Lemma III.1, ψ′ is strictly increasing on ]χ−,+∞[ (since ψ is strictly convex on this interval). Hence, ψ′ has a limit at χ− (which may be equal to −∞ when χ− = 0), and Condition (ii) is satisfied. Similarly, ψ′ has a limit at χ+ (possibly equal to +∞ when χ+ = +∞), and Condition (iii) is satisfied.

Remark III.4 In (15), a special case arises when

(∀ζ ∈ ]0,+∞[)  ϕ(ζ) = ϕ̃(ζ) + ζ ϕ̃(ζ⁻¹)   (59)

where ϕ̃ is a twice differentiable convex function on ]0,+∞[. Then Φ takes a symmetric form, leading to L-divergences. It can then be deduced from (31) that, for every ζ ∈ ]0,+∞[,

ϑ−(ζ) = ϑ+(ζ⁻¹) = ϕ̃(ζ) + ϕ̃′(ζ⁻¹) − ζ ϕ̃′(ζ).   (60)

IV. EXAMPLES

A. Kullback-Leibler divergence

Let us now apply the results in the previous section to the function

Φ(υ, ξ) =
  υ ln(υ/ξ) + ξ − υ   if (υ, ξ) ∈ ]0,+∞[²
  ξ                   if υ = 0 and ξ ∈ [0,+∞[
  +∞                  otherwise.   (61)

This is a function in Γ0(R²) satisfying (15) with

(∀ζ ∈ ]0,+∞[)  ϕ(ζ) = ζ ln ζ − ζ + 1.   (62)


Proposition IV.1 The proximity operator of γΦ with γ ∈ ]0,+∞[ is, for every (υ, ξ) ∈ R²,

prox_{γΦ}(υ, ξ) =
  (ῡ, ξ̄)   if exp(γ⁻¹υ) > 1 − γ⁻¹ξ
  (0, 0)   otherwise,   (63)

where

ῡ = υ + γ ln ζ̄   (64)
ξ̄ = ξ + γ (ζ̄⁻¹ − 1)   (65)

and ζ̄ is the unique minimizer on ]exp(−γ⁻¹υ),+∞[ of

ψ : ]0,+∞[ → R : ζ ↦ (ζ²/2 − 1) ln ζ + (1/2)(γ⁻¹υ − 1/2) ζ² + (1 − γ⁻¹ξ) ζ.   (66)

Proof. For every (υ, ξ) ∈ R², (ῡ, ξ̄) = prox_{γΦ}(υ, ξ) is such that (ῡ, ξ̄) ∈ dom Φ [86]. Let us first note that

ῡ ∈ ]0,+∞[  ⇔  (ῡ, ξ̄) ∈ ]0,+∞[².   (67)

We are now able to apply Proposition III.3, where ψ is given by (66) and, for every ζ ∈ ]0,+∞[,

Θ(ζ) = (ζ²/2)(1/2 − ln ζ) − 1   (68)
ϑ−(ζ) = −ln ζ   (69)
ϑ+(ζ) = 1 − ζ⁻¹.   (70)

In addition,

χ− = exp(−γ⁻¹υ)   (71)
χ+ = (1 − γ⁻¹ξ)⁻¹ if ξ < γ;  +∞ otherwise.   (72)

According to (67) and Proposition III.3, ῡ ∈ ]0,+∞[ if and only if Conditions (i)-(iii) in Lemma III.1 hold. Since χ− ∈ ]0,+∞[ and lim_{ζ→+∞} ψ′(ζ) = +∞, Lemma III.2 shows that these conditions are satisfied if and only if

ξ ≥ γ  or  ( ξ < γ and exp(−υ/γ) < (1 − γ⁻¹ξ)⁻¹ ),   (73)

which is equivalent to

exp(υ/γ) > 1 − γ⁻¹ξ.   (74)

Under this assumption, Proposition III.3 leads to the expressions (64) and (65) of the proximity operator, where ζ̄ is the unique minimizer on ]exp(−υ/γ),+∞[ of the function ψ.

We have shown that ῡ > 0 ⇔ (74). So, ῡ = 0 when (74) is not satisfied. Then, the expression of ξ̄ simply reduces to the asymmetric soft-thresholding rule [87]:

ξ̄ = ξ − γ if ξ > γ;  0 otherwise.   (75)

However, exp(γ⁻¹υ) ≤ 1 − γ⁻¹ξ ⇒ ξ < γ, so that ξ̄ is necessarily equal to 0.

Remark IV.2 More generally, we can derive the proximity operator of

Φ̃(υ, ξ) =
  υ ln(υ/ξ) + κ(ξ − υ)   if (υ, ξ) ∈ ]0,+∞[²
  κξ                     if υ = 0 and ξ ∈ [0,+∞[
  +∞                     otherwise,   (76)

where κ ∈ R. Of particular interest in the literature is the case when κ = 0 [11], [12], [21], [24]. From Proposition II.4(iv), we get, for every γ ∈ ]0,+∞[ and for every (υ, ξ) ∈ R²,

prox_{γΦ̃}(υ, ξ) = prox_{γΦ}(υ + γκ − γ, ξ − γκ + γ),   (77)

where prox_{γΦ} is provided by Proposition IV.1.

Remark IV.3 It can be noticed that

ψ′(ζ) = ζ ln ζ + γ⁻¹υ ζ − ζ⁻¹ + 1 − γ⁻¹ξ = 0   (78)

is equivalent to

ζ⁻¹ exp( ζ⁻¹ (ζ⁻¹ + γ⁻¹ξ − 1) ) = exp(γ⁻¹υ).   (79)

In the case where ξ = γ, the above equation reduces to

2ζ⁻² exp(2ζ⁻²) = 2 exp(2γ⁻¹υ)  ⇔  ζ = ( 2 / W(2e^{2γ⁻¹υ}) )^{1/2}   (80)

where W is the Lambert W function [88]. When ξ ≠ γ, although a closed-form expression of (79) is not available, efficient numerical methods to compute ζ̄ can be developed.
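As a small illustration of the special case (80) (our own sketch, assuming SciPy's scipy.special.lambertw is available), the root can be evaluated directly and checked against the optimality condition (78):

    import numpy as np
    from scipy.special import lambertw

    def zeta_kl_special(upsilon, gamma=1.0):
        """Root of (79) when xi = gamma, via the Lambert W function (Eq. (80))."""
        w = np.real(lambertw(2.0 * np.exp(2.0 * upsilon / gamma)))
        return np.sqrt(2.0 / w)

    # check: zeta should satisfy zeta*ln(zeta) + (upsilon/gamma)*zeta - 1/zeta + 1 - xi/gamma = 0 with xi = gamma
    ups, gam = 0.5, 1.0
    z = zeta_kl_special(ups, gam)
    print(z * np.log(z) + (ups / gam) * z - 1.0 / z + 1.0 - 1.0)   # approximately 0

For large γ⁻¹υ the argument of the exponential may overflow, which is the kind of issue that the asymptotic bounds recalled in Remark IV.7 are meant to address.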

Remark IV.4 To minimize ψ in (66), we need to find the zero on ]exp(−γ⁻¹υ),+∞[ of the function: (∀ζ ∈ ]0,+∞[)

ψ′(ζ) = ζ ln ζ + γ⁻¹υ ζ − ζ⁻¹ + 1 − γ⁻¹ξ.   (81)

This can be performed by Algorithm 1, the convergence of which is proved in Appendix A.

Algorithm 1 Newton method for minimizing (66).
  Set ζ_0 = exp(−γ⁻¹υ)
  For n = 0, 1, . . .
    ζ_{n+1} = ζ_n − ψ′(ζ_n)/ψ″(ζ_n).
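A minimal NumPy sketch of Algorithm 1 for the Kullback-Leibler case follows (our own illustration, not the authors' code; a stopping tolerance is added, and Proposition IV.1 is then used to assemble prox_{γΦ} via (63)-(65)):

    import numpy as np

    def prox_kl(upsilon, xi, gamma=1.0, n_iter=50, tol=1e-12):
        """Proximity operator of gamma*Phi for the Kullback-Leibler case (Proposition IV.1),
        computed with the Newton iteration of Algorithm 1 applied to psi' in (81)."""
        if np.exp(upsilon / gamma) <= 1.0 - xi / gamma:       # condition in (63) fails
            return 0.0, 0.0
        u, x = upsilon / gamma, xi / gamma
        zeta = np.exp(-u)                                     # initialization of Algorithm 1
        for _ in range(n_iter):
            d1 = zeta * np.log(zeta) + u * zeta - 1.0 / zeta + 1.0 - x   # psi'(zeta), Eq. (81)
            d2 = np.log(zeta) + 1.0 + u + 1.0 / zeta**2                  # psi''(zeta), Eq. (133)
            step = d1 / d2
            zeta -= step
            if abs(step) < tol:
                break
        return upsilon + gamma * np.log(zeta), xi + gamma * (1.0 / zeta - 1.0)   # Eqs. (64)-(65)

    # example: prox_kl(1.0, 0.0) is approximately (0.636, 0.439)
    print(prox_kl(1.0, 0.0))

The returned pair can be verified to satisfy the optimality conditions (51)-(52), so the sketch is simply a direct transcription of Proposition IV.1.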

In the following, we provide expressions of the proximity operators of other standard divergences, which are derived from the results in Section III (see [89] for more technical details).

B. Jeffreys divergence

Let us now consider the symmetric form of (61) given by

Φ(υ, ξ) =
  (υ − ξ)(ln υ − ln ξ)   if (υ, ξ) ∈ ]0,+∞[²
  0                      if υ = ξ = 0
  +∞                     otherwise.   (82)


This function belongs to Γ0(R²) and satisfies (15) and (59) with

(∀ζ ∈ ]0,+∞[)  ϕ̃(ζ) = −ln ζ.   (83)

Proposition IV.5 The proximity operator of γΦ with γ ∈ ]0,+∞[ is, for every (υ, ξ) ∈ R²,

prox_{γΦ}(υ, ξ) =
  (ῡ, ξ̄)   if W(e^{1−γ⁻¹υ}) W(e^{1−γ⁻¹ξ}) < 1
  (0, 0)   otherwise,   (84)

where

ῡ = υ + γ (ln ζ̄ + ζ̄ − 1)   (85)
ξ̄ = ξ − γ (ln ζ̄ − ζ̄⁻¹ + 1)   (86)

and ζ̄ is the unique minimizer on ]W(e^{1−γ⁻¹υ}),+∞[ of

ψ : ]0,+∞[ → R : ζ ↦ (ζ²/2 + ζ − 1) ln ζ + ζ³/3 + (1/2)(γ⁻¹υ − 3/2) ζ² − γ⁻¹ξ ζ.   (87)

Remark IV.6 To minimize ψ in (87), we need to find the zero on [χ−, χ+] of the function: (∀ζ ∈ ]0,+∞[)

ψ′(ζ) = (ζ + 1) ln ζ + ζ/2 − ζ⁻¹ + ζ² + (γ⁻¹υ − 3/2) ζ + 1 − γ⁻¹ξ.   (88)

This can be performed by a projected Newton algorithm.

Remark IV.7 From a numerical standpoint, to avoid arithmetic overflow in the exponentiations when γ⁻¹υ or γ⁻¹ξ tend to −∞, one can use the asymptotic approximation of the Lambert W function for large values: for every τ ∈ [1,+∞[,

τ − ln τ + (1/2)(ln τ)/τ ≤ W(e^τ) ≤ τ − ln τ + (e/(e − 1))(ln τ)/τ,   (89)

with equality only if τ = 1 [90].

C. Hellinger divergence

Let us now consider the function of Γ0(R²) given by

Φ(υ, ξ) =
  (√υ − √ξ)²   if (υ, ξ) ∈ [0,+∞[²
  +∞           otherwise.   (90)

This symmetric function satisfies (15) and (59) with

(∀ζ ∈ ]0,+∞[)  ϕ̃(ζ) = ζ − √ζ.   (91)

Proposition IV.8 The proximity operator of γΦ with γ ∈ ]0,+∞[ is, for every (υ, ξ) ∈ R²,

prox_{γΦ}(υ, ξ) =
  (ῡ, ξ̄)   if υ ≥ γ or (1 − υ/γ)(1 − ξ/γ) < 1
  (0, 0)   otherwise,   (92)

where

ῡ = υ + γ (ρ − 1)   (93)
ξ̄ = ξ + γ (ρ⁻¹ − 1)   (94)

and ρ is the unique solution on ]max(1 − γ⁻¹υ, 0),+∞[ of

ρ⁴ + (γ⁻¹υ − 1) ρ³ + (1 − γ⁻¹ξ) ρ − 1 = 0.   (95)

D. Chi square divergence

Let us now consider the function of Γ0(R²) given by

Φ(υ, ξ) =
  (υ − ξ)²/ξ   if υ ∈ [0,+∞[ and ξ ∈ ]0,+∞[
  0            if υ = ξ = 0
  +∞           otherwise.   (96)

This function satisfies (15) with

(∀ζ ∈ ]0,+∞[)  ϕ(ζ) = (ζ − 1)².   (97)

Proposition IV.9 The proximity operator of γΦ with γ ∈ ]0,+∞[ is, for every (υ, ξ) ∈ R²,

prox_{γΦ}(υ, ξ) =
  (ῡ, ξ̄)                if υ > −2γ and ξ > −(υ + υ²/(4γ))
  (0, max{ξ − γ, 0})    otherwise,   (98)

where

ῡ = υ + 2γ (1 − ρ)   (99)
ξ̄ = ξ + γ (ρ² − 1)   (100)

and ρ is the unique solution on ]0, 1 + γ⁻¹υ/2[ of

ρ³ + (1 + γ⁻¹ξ) ρ − γ⁻¹υ − 2 = 0.   (101)

E. Renyi divergence

Let α ∈ ]1,+∞[ and consider the following function of Γ0(R²):

Φ(υ, ξ) =
  υ^α / ξ^{α−1}   if υ ∈ [0,+∞[ and ξ ∈ ]0,+∞[
  0               if υ = ξ = 0
  +∞              otherwise,   (102)

which corresponds to the case when

(∀ζ ∈ ]0,+∞[)  ϕ(ζ) = ζ^α.   (103)

Note that the above function Φ allows us to generate the Renyi divergence up to a log transform and a multiplicative constant.

Proposition IV.10 The proximity operator of γΦ with γ ∈ ]0,+∞[ is, for every (υ, ξ) ∈ R²,

prox_{γΦ}(υ, ξ) =
  (ῡ, ξ̄)            if υ > 0 and γ^{1/(α−1)} ξ/(1 − α) < (υ/α)^{α/(α−1)}
  (0, max{ξ, 0})    otherwise,   (104)

where

ῡ = υ − γα ζ̄^{1−α}   (105)
ξ̄ = ξ + γ(α − 1) ζ̄^{−α}   (106)

and ζ̄ is the unique solution on ](αγυ⁻¹)^{1/(α−1)},+∞[ of

γ⁻¹υ ζ^{1+α} − γ⁻¹ξ ζ^α − αζ² + 1 − α = 0.   (107)


Note that (107) becomes a polynomial equation when α is a rational number. In particular, when α = 2, it reduces to the cubic equation

ρ³ + (2 + γ⁻¹ξ) ρ − γ⁻¹υ = 0   (108)

with ζ̄ = ρ⁻¹.

F. Iα divergence

Let α ∈ ]0, 1[ and consider the function of Γ0(R²) given by

Φ(υ, ξ) =
  αυ + (1 − α)ξ − υ^α ξ^{1−α}   if (υ, ξ) ∈ [0,+∞[²
  +∞                            otherwise,   (109)

which corresponds to the case when

(∀ζ ∈ ]0,+∞[)  ϕ(ζ) = 1 − α + αζ − ζ^α.   (110)

Proposition IV.11 The proximity operator of γΦ with γ ∈ ]0,+∞[ is, for every (υ, ξ) ∈ R²,

prox_{γΦ}(υ, ξ) =
  (ῡ, ξ̄)   if υ ≥ γα or 1 − ξ/(γ(1 − α)) < (1 − υ/(γα))^{α/(α−1)}
  (0, 0)   otherwise,   (111)

where

ῡ = υ + γα (ζ̄^{1−α} − 1)   (112)
ξ̄ = ξ + γ(1 − α)(ζ̄^{−α} − 1)   (113)

and ζ̄ is the unique solution on ](max{1 − υ/(γα), 0})^{1/(1−α)},+∞[ of

αζ² + (γ⁻¹υ − α) ζ^{α+1} + (1 − α − γ⁻¹ξ) ζ^α = 1 − α.   (114)

As for the Renyi divergence, (114) becomes a polynomial equation when α is a rational number.

Remark IV.12 We can also derive the proximity operator of

Φ̃(υ, ξ) =
  κ(αυ + (1 − α)ξ) − υ^α ξ^{1−α}   if (υ, ξ) ∈ [0,+∞[²
  +∞                                otherwise,   (115)

where κ ∈ R. From Proposition II.4(iv), we get, for every γ ∈ ]0,+∞[ and for every (υ, ξ) ∈ R²,

prox_{γΦ̃}(υ, ξ) = prox_{γΦ}( υ + γ(1 − κ)α, ξ + γ(1 − κ)(1 − α) ),   (116)

where prox_{γΦ} is provided by Proposition IV.11.

V. CONNECTION WITH EPIGRAPHICAL PROJECTIONS

Proximal methods iterate a sequence of steps in which proximity operators are evaluated. The efficient computation of these operators is thus essential for dealing with high-dimensional convex optimization problems. In the context of constrained optimization, at least one of the additive terms of the global cost to be minimized is the indicator function of a closed convex set, whose proximity operator reduces to the projection onto this set. However, except for a few well-known cases, such a projection does not admit a closed-form expression. The resolution of large-scale optimization problems involving nontrivial constraints is thus quite challenging. This difficulty can be circumvented when the constraint can be expressed as the lower-level set of some separable function, by making use of epigraphical projection techniques. Such approaches have attracted interest in recent years [27], [63], [91]–[94]. The idea consists of decomposing the constraint of interest into the intersection of a half-space and a number of epigraphs of simple functions. For this approach to be successful, it is mandatory that the projection onto these epigraphs can be computed efficiently.

The next proposition shows that the expressions of the projection onto the epigraph of a wide range of functions can be deduced from the expressions of the proximity operators of ϕ-divergences. In particular, in Table I, for each of the ϕ-divergences presented in Section IV, we list the associated functions ϕ* for which such projections can thus be derived.

Proposition V.1 Let ϕ: R → [0,+∞] be a function in Γ0(R) which is twice differentiable on ]0,+∞[. Let Φ be the function defined by (15) and ϕ* ∈ Γ0(R) the Fenchel conjugate of the restriction of ϕ to [0,+∞[, defined as

(∀ζ* ∈ R)  ϕ*(ζ*) = sup_{ζ∈[0,+∞[} (ζζ* − ϕ(ζ)).   (117)

Let the epigraph of ϕ* be defined as

epi ϕ* = {(υ*, ξ*) ∈ R² | ϕ*(υ*) ≤ ξ*}.   (118)

Then, the projection onto epi ϕ* is: for every (υ*, ξ*) ∈ R²,

P_{epi ϕ*}(υ*, ξ*) = (υ* − p̄, ξ* + q̄),  where (p̄, q̄) = prox_Φ(υ*, −ξ*).   (119)

Proof. The conjugate function of Φ is, for every (υ*, ξ*) ∈ R²,

Φ*(υ*, ξ*) = sup_{(υ,ξ)∈R²} ( υυ* + ξξ* − Φ(υ, ξ) ).   (120)

From the definition of Φ, we deduce that, for every (υ*, ξ*) ∈ R²,

Φ*(υ*, ξ*)
= sup { sup_{(υ,ξ)∈[0,+∞[×]0,+∞[} ( υυ* + ξξ* − ξϕ(υ/ξ) ),  sup_{υ∈]0,+∞[} ( υυ* − lim_{ξ→0, ξ>0} ξϕ(υ/ξ) ),  0 }   (121)
= sup { sup_{(υ,ξ)∈[0,+∞[×]0,+∞[} ( υυ* + ξξ* − ξϕ(υ/ξ) ),  0 }   (122)
= sup { ι_{epi ϕ*}(υ*, −ξ*),  0 }   (123)
= ι_{epi ϕ*}(υ*, −ξ*),   (124)

where the equality in (123) stems from [81, Example 13.8]. Then, (119) follows from the conjugation property of the proximity operator (see Proposition II.4(v)).
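The following self-contained numerical check (our own illustration, assuming SciPy; the helper names are ours) compares a direct projection onto epi ϕ* with the prox-based expression (119) for the Kullback-Leibler case, whose conjugate on [0,+∞[ is e^{ζ*} − 1 (see Table I). Both quantities are computed by generic constrained minimization, so the snippet only validates the relationship, not any particular solver:

    import numpy as np
    from scipy.optimize import minimize

    def Phi_kl(v):
        """Perspective function (15) for phi(z) = z ln z - z + 1 (Kullback-Leibler)."""
        u, x = v
        if u < 0 or x < 0:
            return np.inf
        if x == 0:
            return np.inf if u > 0 else 0.0
        if u == 0:
            return x
        return u * np.log(u / x) + x - u

    def prox_Phi_kl(point):
        """prox_Phi by direct minimization of (1) over the nonnegative orthant."""
        obj = lambda v: Phi_kl(v) + 0.5 * np.sum((v - point) ** 2)
        res = minimize(obj, x0=np.array([1.0, 1.0]),
                       bounds=[(1e-12, None), (1e-12, None)], method="L-BFGS-B")
        return res.x

    def project_epi_conj(point):
        """Direct projection onto epi(phi*) = {(a, b) : exp(a) - 1 <= b}."""
        cons = {"type": "ineq", "fun": lambda w: w[1] - (np.exp(w[0]) - 1.0)}
        res = minimize(lambda w: 0.5 * np.sum((w - point) ** 2),
                       x0=np.zeros(2), constraints=[cons], method="SLSQP")
        return res.x

    ustar, xistar = 1.0, 0.0
    p_bar, q_bar = prox_Phi_kl(np.array([ustar, -xistar]))
    print(np.array([ustar - p_bar, xistar + q_bar]))    # prox-based formula (119)
    print(project_epi_conj(np.array([ustar, xistar])))  # direct projection; should approximately match

For the test point (1, 0), both computations return approximately (0.364, 0.439), a point on the boundary curve ξ* = e^{υ*} − 1.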

VI. EXPERIMENTAL RESULTS

To illustrate the potential of our results, we consider a query optimization problem in database management systems, where the optimal query execution plan depends on the accurate estimation of the proportion of tuples that satisfy the predicates in the query. More specifically, every request formulated by a user can be viewed as an event in a probability space (Ω, T, P), where Ω is a finite set of size N. In order to optimize request fulfillment, it is useful to accurately estimate the probabilities, also called selectivities, associated with each element of Ω. To do so, rough estimations of the probabilities of a certain number P of events can be inferred from the history of formulated requests and some a priori knowledge.

TABLE I
CONJUGATE FUNCTION ϕ* OF THE RESTRICTION OF ϕ TO [0,+∞[ (ζ > 0, ζ* ∈ R).

  Kullback-Leibler:    ϕ(ζ) = ζ ln ζ − ζ + 1;       ϕ*(ζ*) = e^{ζ*} − 1
  Jeffreys:            ϕ(ζ) = (ζ − 1) ln ζ;         ϕ*(ζ*) = W(e^{1−ζ*}) + (W(e^{1−ζ*}))⁻¹ + ζ* − 2
  Hellinger:           ϕ(ζ) = 1 + ζ − 2√ζ;          ϕ*(ζ*) = ζ*/(1 − ζ*) if ζ* < 1;  +∞ otherwise
  Chi square:          ϕ(ζ) = (ζ − 1)²;             ϕ*(ζ*) = ζ*(ζ* + 4)/4 if ζ* ≥ −2;  −1 otherwise
  Renyi, α ∈ ]1,+∞[:   ϕ(ζ) = ζ^α;                  ϕ*(ζ*) = (α − 1)(ζ*/α)^{α/(α−1)} if ζ* ≥ 0;  0 otherwise
  Iα, α ∈ ]0,1[:       ϕ(ζ) = 1 − α + αζ − ζ^α;     ϕ*(ζ*) = (1 − α)((1 − ζ*/α)^{α/(α−1)} − 1) if ζ* ≤ α;  +∞ otherwise

Let x = (x^{(n)})_{1≤n≤N} ∈ R^N be the vector of sought probabilities, and let z = (z^{(i)})_{1≤i≤P} ∈ [0, 1]^P be the vector of roughly estimated probabilities. The problem of selectivity estimation is equivalent to the following constrained entropy maximization problem [95]:

minimize_{x∈R^N}  ∑_{n=1}^{N} x^{(n)} ln x^{(n)}
s.t.  Ax = z,  ∑_{n=1}^{N} x^{(n)} = 1,  x ∈ [0, 1]^N,   (125)

where A ∈ R^{P×N} is a binary matrix establishing the theoretical link between the probabilities of each event and the probabilities of the elements of Ω belonging to it.

Unfortunately, due to the inaccuracy of the estimated probabilities, the intersection between the affine constraint Ax = z and the other ones may be empty, making the above problem infeasible. In order to overcome this issue, we propose to jointly estimate the selectivities and the feasible probabilities. Our idea consists of reformulating Problem (125) by introducing the divergence between Ax and an additional vector y of feasible probabilities. This allows us to replace the constraint Ax = z with an ℓ_k-ball centered at z, yielding

minimize_{(x,y)∈R^N×R^P}  D(Ax, y) + λ ∑_{n=1}^{N} x^{(n)} ln x^{(n)}
s.t.  ‖y − z‖_k ≤ η,  ∑_{n=1}^{N} x^{(n)} = 1,  x ∈ [0, 1]^N,   (126)

where D is defined in (14), λ and η are some positive constants, whereas k ∈ [1,+∞[ (the choice k = 2 yields the Euclidean ball).

To demonstrate the validity of this approach, we compare it with the following methods:

(i) a relaxed version of Problem (125), in which the constraint Ax = z is replaced with a squared Euclidean distance:

minimize_{x∈R^N}  ‖Ax − z‖² + λ ∑_{n=1}^{N} x^{(n)} ln x^{(n)}
s.t.  ∑_{n=1}^{N} x^{(n)} = 1,  x ∈ [0, 1]^N,   (127)

or with a ϕ-divergence D:

minimize_{x∈R^N}  D(Ax, z) + λ ∑_{n=1}^{N} x^{(n)} ln x^{(n)}
s.t.  ∑_{n=1}^{N} x^{(n)} = 1,  x ∈ [0, 1]^N,   (128)

where λ is some positive constant;


TABLE II
COMPARISON OF Q∞-SCORES

  ϕ       Problem (126)   Problem (128)   Problems (129)+(125) [94]   Problem (127)
  KL      2.23            2.95            2.45                        25.84
  Jef     2.44            3.41            2.45                        25.84
  Hel     2.42            89.02           2.45                        25.84
  Chi     2.34            3.20            2.45                        25.84
  I1/2    2.42            89.02           2.45                        25.84

(The last two methods do not involve a ϕ-divergence, hence their scores are identical for all rows.)

Fig. 1. Execution time (in seconds) versus size N in Problem (126).

(ii) the two-step procedure in [94], which consists of finding a solution x̃ to

minimize_{x∈R^N}  Q_1(Ax, z)
s.t.  ∑_{n=1}^{N} x^{(n)} = 1,  x ∈ [0, 1]^N,   (129)

and then solving (125) by replacing z with z̃ = Ax̃. Hereabove, for every y ∈ R^P, Q_1(y, z) = ∑_{i=1}^{P} φ(y^{(i)}/z^{(i)}) is a sum of quotient functions, i.e.,

φ(ξ) =
  ξ      if ξ ≥ 1,
  ξ⁻¹    if 0 < ξ < 1,
  +∞     otherwise.   (130)

For the numerical evaluation, we adopt an approach similar to [94], and we first consider the following low-dimensional setting:

A =
  [ 1 0 1 0 1 0 1
    0 1 1 0 0 1 1
    0 0 0 1 1 1 1
    0 0 1 0 0 0 1
    0 0 1 0 1 0 1
    0 0 0 0 0 1 1 ],    z = [ 0.2114, 0.6331, 0.6312, 0.5182, 0.9337, 0.0035 ]⊤,   (131)

for which there exists no x ∈ [0,+∞[^N such that Ax = z.

To assess the quality of the solutions x* obtained with the different methods, we evaluate the max-quotient between Ax* and z, that is [94]

Q∞(Ax*, z) = max_{1≤i≤P} φ( [Ax*]^{(i)} / z^{(i)} ).   (132)

Table II collects the Q∞-scores (lower is better) obtained with the different approaches. For all the considered ϕ-divergences,² the proposed approach performs favorably with respect to the state of the art, the KL divergence providing the best performance among the panel of considered ϕ-divergences. For the sake of fairness, the hyperparameters λ and η were hand-tuned in order to get the best possible score for each compared method. The good performance of our approach is related to the fact that ϕ-divergences are well suited for the estimation of probability distributions.

Figure 1 next shows the computational time for solving Problem (126) for various dimensions N of the selectivity vector to be estimated, with A and z randomly generated so as to keep the ratio N/P equal to 7/6. To make this comparison, the primal-dual proximal method proposed in [69] was implemented in MATLAB R2015, using the stopping criterion ‖x_{n+1} − x_n‖ < 10⁻⁷ ‖x_n‖. We then measured the execution times on an Intel i5 CPU at 3.20 GHz with 12 GB of RAM. The results show that all the considered ϕ-divergences can be efficiently optimized, with no significant computational time differences between them.

VII. CONCLUSION

In this paper, we have shown how to solve convex optimization problems involving discrete information divergences by using proximal methods. We have carried out a thorough study of the properties of the proximity operators of ϕ-divergences, which has led us to derive new tractable expressions of them. In addition, we have related these expressions to the projection onto the epigraph of a number of convex functions.

Finally, we have illustrated our results on a selectivity estimation problem. In this context, ϕ-divergences appear to be well suited for the estimation of the sought probability distributions. Moreover, computational time evaluations allowed us to show that the proposed numerical methods provide efficient solutions for solving such large-scale optimization problems.

APPENDIX A
CONVERGENCE PROOF OF ALGORITHM 1

We aim at finding the unique zero on ]exp(−γ⁻¹υ),+∞[ of the function ψ′ given by (81), along with its derivatives:

(∀ζ ∈ ]0,+∞[)  ψ″(ζ) = 1 + ln ζ + γ⁻¹υ + ζ⁻²,   (133)
ψ‴(ζ) = ζ⁻¹ − 2ζ⁻³.   (134)

To do so, we employ the Newton method given in Algorithm 1, the convergence of which is here established. Assume that

• (υ, ξ) ∈ R² are such that exp(γ⁻¹υ) > 1 − γ⁻¹ξ,
• ζ̄ is the zero on ]exp(−γ⁻¹υ),+∞[ of ψ′,
• (ζ_n)_{n∈N} is the sequence generated by Algorithm 1,
• ε_n = ζ_n − ζ̄ for every n ∈ N.

We first recall a fundamental property of the Newton method, and then we proceed to the actual convergence proof.

² Note that the Renyi divergence is not suitable for the considered application, because it tends to favor sparse solutions.


Lemma A.1 For every n ∈ N,

ε_{n+1} = ε_n² ψ‴(ϱ_n) / (2ψ″(ζ_n))   (135)

where ϱ_n is between ζ_n and ζ̄.

Proof. The definition of ε_{n+1} yields

ε_{n+1} = ζ_n − ψ′(ζ_n)/ψ″(ζ_n) − ζ̄ = ( ε_n ψ″(ζ_n) − ψ′(ζ_n) ) / ψ″(ζ_n).   (136)

Moreover, for every ζ_n ∈ ]0,+∞[, the second-order Taylor expansion of ψ′ around ζ_n is

ψ′(ζ̄) = ψ′(ζ_n) + ψ″(ζ_n)(ζ̄ − ζ_n) + (1/2) ψ‴(ϱ_n)(ζ̄ − ζ_n)²,   (137)

where ϱ_n is between ζ_n and ζ̄. From the above equality, we deduce that ψ′(ζ̄) = ψ′(ζ_n) − ψ″(ζ_n) ε_n + (1/2) ψ‴(ϱ_n) ε_n² = 0.

Proposition A.2 The sequence (ζ_n)_{n∈N} converges to ζ̄.

Proof. The assumption exp(γ⁻¹υ) > 1 − γ⁻¹ξ implies that ψ′ is negative at the initial value ζ_0 = exp(−γ⁻¹υ), that is

ψ′(ζ_0) = −exp(γ⁻¹υ) + 1 − γ⁻¹ξ < 0.   (138)

Moreover, ψ′ is increasing on [exp(−γ⁻¹υ),+∞[, since

(∀ζ ∈ [exp(−γ⁻¹υ),+∞[)  ψ″(ζ) > 0,   (139)

and √2 is a non-critical inflection point for ψ′, since

(∀ζ ∈ ]√2,+∞[)  ψ‴(ζ) > 0,   (140)
(∀ζ ∈ ]0,√2[)  ψ‴(ζ) < 0.   (141)

To prove the convergence, we consider the following cases:

• Case ζ̄ ≤ √2: ψ′ is increasing and concave on [ζ_0, √2]. Hence, the Newton method initialized at the lower bound of the interval [ζ_0, ζ̄] monotonically increases to ζ̄ [96].

• Case √2 ≤ ζ_0 < ζ̄: ψ′ is increasing and convex on [ζ_0,+∞[. Hence, Lemma A.1 yields ε_1 = ζ_1 − ζ̄ > 0. It then follows from standard properties of the Newton algorithm for an increasing convex function that (ζ_n)_{n≥1} monotonically decreases to ζ̄ [96].

• Case ζ_0 < √2 < ζ̄: as ψ′ is negative and increasing on [ζ_0, ζ̄[, the quantity −ψ′/ψ″ is positive and lower bounded on [ζ_0, √2], that is

(∀ζ ∈ [ζ_0, √2])  −ψ′(ζ)/ψ″(ζ) ≥ −ψ′(√2)/ψ″(ζ_0) > 0.   (142)

There thus exists k > 0 such that ζ_0 < . . . < ζ_k and ζ_k > √2. Then, the convergence of (ζ_n)_{n≥k} follows from the same arguments as in the previous case.

REFERENCES

[1] M. El Gheche, A. Jezierska, J.-C. Pesquet, and J. Farah, "A proximal approach for signal recovery based on information measures," in Proc. Eur. Signal Process. Conf., Marrakech, Morocco, Sept. 2013, pp. 1–5.
[2] M. El Gheche, J.-C. Pesquet, and J. Farah, "A proximal approach for optimization problems involving Kullback divergences," in Proc. Int. Conf. Acoust. Speech Signal Process., Vancouver, Canada, May 2013, pp. 5984–5988.
[3] K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can reasonably be supposed to have arisen from random sampling," Phil. Mag., vol. 50, pp. 157–175, 1900.
[4] E. Hellinger, "Neue Begrundung der Theorie quadratischer Formen von unendlichvielen Veranderlichen," J. fur die reine und angewandte Mathematik, vol. 136, pp. 210–271, 1909.
[5] C. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 623–656, Jul. 1948.
[6] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[7] I. Csiszar, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoffschen Ketten," Magyar Tud. Akad. Mat. Kutato Int. Koozl., vol. 8, pp. 85–108, 1963.
[8] A. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," J. R. Stat. Soc. Series B Stat. Methodol., vol. 28, no. 1, pp. 131–142, 1966.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, New York, USA: Wiley-Interscience, 1991.
[10] I. Sason and S. Verdu, "f-divergence inequalities," IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
[11] R. E. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Inf. Theory, vol. 18, no. 4, pp. 460–473, Jul. 1972.
[12] S. Arimoto, "An algorithm for computing the capacity of arbitrary discrete memoryless channels," IEEE Trans. Inf. Theory, vol. 18, no. 1, pp. 14–20, Jan. 1972.
[13] H. H. Bauschke, P. L. Combettes, and D. Noll, "Joint minimization with alternating Bregman proximity operators," Pac. J. Optim., vol. 2, no. 3, pp. 401–424, Sep. 2006.
[14] P. L. Combettes and Q. V. Nguyen, "Solving composite monotone inclusions in reflexive Banach spaces by constructing best Bregman approximations from their Kuhn-Tucker set," J. Convex Anal., vol. 23, no. 2, May 2016.
[15] M. Chiang and S. Boyd, "Geometric programming duals of channel capacity and rate distortion," IEEE Trans. Inf. Theory, vol. 50, no. 2, pp. 245–258, Feb. 2004.
[16] S. Chretien and A. O. Hero, "On EM algorithms and their proximal generalizations," Control Optim. Calc. Var., vol. 12, pp. 308–326, Jan. 2008.
[17] C. L. Byrne, "Iterative image reconstruction algorithms based on cross-entropy minimization," IEEE Trans. Image Process., vol. 2, no. 1, pp. 96–103, Jan. 1993.
[18] W. Richardson, "Bayesian-based iterative method of image restoration," J. Opt. Soc. Am. A, vol. 62, no. 1, pp. 55–59, Jan. 1972.
[19] L. B. Lucy, "An iterative technique for the rectification of observed distributions," Astron. J., vol. 79, no. 6, pp. 745–754, 1974.
[20] J. A. Fessler, "Hybrid Poisson/polynomial objective functions for tomographic image reconstruction from transmission scans," IEEE Trans. Image Process., vol. 4, no. 10, pp. 1439–1450, Oct. 1995.
[21] F.-X. Dupe, M. J. Fadili, and J.-L. Starck, "A proximal iteration for deconvolving Poisson noisy images using sparse representations," IEEE Trans. Image Process., vol. 18, no. 2, pp. 310–321, Feb. 2009.
[22] R. Zanella, P. Boccacci, L. Zanni, and M. Bertero, "Efficient gradient projection methods for edge-preserving removal of Poisson noise," Inverse Probl., vol. 25, no. 4, 2009.
[23] P. Piro, S. Anthoine, E. Debreuve, and M. Barlaud, "Combining spatial and temporal patches for scalable video indexing," Multimed. Tools Appl., vol. 48, no. 1, pp. 89–104, May 2010.
[24] N. Pustelnik, C. Chaux, and J.-C. Pesquet, "Parallel proximal algorithm for image restoration using hybrid regularization," IEEE Trans. Image Process., vol. 20, no. 9, pp. 2450–2462, Nov. 2011.
[25] T. Teuber, G. Steidl, and R. H. Chan, "Minimization and parameter estimation for seminorm regularization models with I-divergence constraints," Inverse Probl., vol. 29, pp. 1–28, 2013.
[26] M. Carlavan and L. Blanc-Feraud, "Sparse Poisson noisy image deblurring," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1834–1846, Apr. 2012.
[27] S. Harizanov, J.-C. Pesquet, and G. Steidl, "Epigraphical projection for solving least squares Anscombe transformed constrained optimization problems," in Scale-Space and Variational Methods in Computer Vision, A. Kuijper et al., Ed., vol. 7893 of Lect. Notes Comput. Sc., pp. 125–136, 2013.
[28] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proc. R. Soc. Lond. A Math. Phys. Sci., vol. 186, no. 1007, pp. 453–461, 1946.
[29] F. Nielsen and R. Nock, "Sided and symmetrized Bregman centroids," IEEE Trans. Inf. Theory, vol. 55, no. 6, pp. 2882–2904, 2009.
[30] F. Nielsen, "Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms," IEEE Signal Process. Lett., vol. 20, no. 7, pp. 657–660, Jul. 2013.
[31] I. Sason, "Tight bounds for symmetric divergence measures and a refined bound for lossless source coding," IEEE Trans. Inf. Theory, vol. 61, no. 2, pp. 701–707, 2015.
[32] G. L. Gilardoni, "On the minimum f-divergence for given total variation," Comptes Rendus Mathematique, vol. 343, no. 11-12, pp. 763–766, 2006.
[33] R. Beran, "Minimum Hellinger distance estimates for parametric models," Ann. Stat., vol. 5, no. 3, pp. 445–463, 1977.
[34] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall International, 1982.
[35] A. L. Gibbs and F. E. Su, "On choosing and bounding probability metrics," Int. Stat. Rev., vol. 70, no. 3, pp. 419–435, 2002.
[36] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4394–4412, 2006.
[37] T. W. Rauber, T. Braun, and K. Berns, "Probabilistic distance measures of the Dirichlet and beta distributions," Pattern Recogn., vol. 41, no. 2, pp. 637–645, 2008.
[38] L. LeCam, "Convergence of estimates under dimensionality restrictions," Ann. Stat., vol. 1, no. 1, pp. 38–53, 1973.
[39] S. van de Geer, "Hellinger-consistency of certain nonparametric maximum likelihood estimators," Ann. Stat., vol. 21, no. 1, pp. 14–44, 1993.
[40] L. Chang-Hwan, "A new measure of rule importance using Hellinger divergence," in Int. Conf. on Data Analytics, Barcelona, Spain, Sep. 2012, pp. 103–106.
[41] D. Cieslak, T. Hoens, N. Chawla, and W. Kegelmeyer, "Hellinger distance decision trees are robust and skew-insensitive," Data Min. Knowl. Discov., vol. 24, no. 1, pp. 136–158, 2012.
[42] I. Park, S. Seth, M. Rao, and J. C. Principe, "Estimation of symmetric chi-square divergence for point processes," in Proc. Int. Conf. Acoust. Speech Signal Process., Prague, Czech Republic, May 2011, pp. 2016–2019.
[43] A. Renyi, "On measures of entropy and information," in Proc. 4th Berkeley Symp. on Math. Statist. and Prob., Berkeley, California, Jun. 1961, vol. 1, pp. 547–561.
[44] P. Harremoes, "Interpretations of Renyi entropies and divergences," Physica A: Statistical Mechanics and its Applications, vol. 365, no. 1, pp. 57–62, 2006.
[45] I. Vajda, Theory of Statistical Inference and Information, Kluwer, Dordrecht, 1989.
[46] F. Liese and I. Vajda, Convex Statistical Distances, Teubner, 1987.
[47] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Process. Mag., vol. 19, no. 5, pp. 85–95, 2002.
[48] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations," Ann. Stat., vol. 23, pp. 493–507, 1952.
[49] I. Csiszar, "Information measures: A critical survey," IEEE Trans. Inf. Theory, vol. A, pp. 73–86, 1974.
[50] S.-I. Amari, "Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes," IEEE Trans. Inf. Theory, vol. 55, no. 11, pp. 4925–4931, 2009.
[51] J. Zhang, "Divergence function, duality, and convex analysis," Neural Comput., vol. 16, pp. 159–195, 2004.
[52] T. Minka, "Divergence measures and message passing," Tech. Rep., Microsoft Research Technical Report (MSR-TR-2005), 2005.
[53] A. Cichocki, L. Lee, Y. Kim, and S. Choi, "Non-negative matrix factorization with α-divergence," Pattern Recogn. Lett., vol. 29, no. 9, pp. 1433–1440, 2008.
[54] I. Csiszar and F. Matus, "On minimization of multivariate entropy functionals," in IEEE Information Theory Workshop, Jun. 2009, pp. 96–100.
[55] X. L. Nguyen, M. J. Wainwright, and M. I. Jordan, "On surrogate loss functions and f-divergences," Ann. Stat., vol. 37, no. 2, pp. 876–904, 2009.
[56] C. Chaux, P. L. Combettes, J.-C. Pesquet, and V. R. Wajs, "A variational formulation for frame-based inverse problems," Inverse Probl., vol. 23, no. 4, pp. 1495–1518, Jun. 2007.
[57] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds., pp. 185–212, Springer-Verlag, New York, 2011.
[58] P. L. Combettes and J.-C. Pesquet, "A proximal decomposition method for solving convex variational inverse problems," Inverse Probl., vol. 24, no. 6, pp. 065014, Dec. 2008.
[59] L. Condat, "Fast projection onto the simplex and the l1 ball," Math. Program. Series A, 2015, to appear.
[60] J.-D. Benamou and Y. Brenier, "A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem," Numer. Math., vol. 84, no. 3, pp. 375–393, 2000.
[61] J.-D. Benamou, Y. Brenier, and K. Guittet, "The Monge-Kantorovich mass transfer and its computational fluid mechanics formulation," Int. J. Numer. Meth. Fluids, vol. 40, pp. 21–30, 2002.
[62] P. L. Combettes and C. L. Muller, "Perspective functions: Proximal calculus and applications in high-dimensional statistics," J. Math. Anal. Appl., 2016.
[63] G. Chierchia, N. Pustelnik, J.-C. Pesquet, and B. Pesquet-Popescu, "Epigraphical projection and proximal tools for solving constrained convex optimization problems," Signal Image Video P., vol. 9, no. 8, pp. 1737–1749, Nov. 2015.
[64] G. Chen and M. Teboulle, "A proximal-based decomposition method for convex minimization problems," Math. Program., vol. 64, no. 1-3, pp. 81–101, Mar. 1994.
[65] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, "An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems," IEEE Trans. Image Process., vol. 20, no. 3, pp. 681–695, Mar. 2011.
[66] S. Setzer, G. Steidl, and T. Teuber, "Deblurring Poissonian images by split Bregman techniques," J. Visual Communication and Image Representation, vol. 21, no. 3, pp. 193–199, Apr. 2010.
[67] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," J. Math. Imag. Vis., vol. 40, no. 1, pp. 120–145, May 2011.
[68] L. M. Briceno-Arias and P. L. Combettes, "A monotone + skew splitting model for composite monotone inclusions in duality," SIAM J. Opt., vol. 21, no. 4, pp. 1230–1250, Oct. 2011.
[69] P. L. Combettes and J.-C. Pesquet, "Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators," Set-Valued Var. Anal., vol. 20, no. 2, pp. 307–330, Jun. 2012.
[70] B. C. Vu, "A splitting algorithm for dual monotone inclusions involving cocoercive operators," Adv. Comput. Math., vol. 38, no. 3, pp. 667–681, Apr. 2013.
[71] L. Condat, "A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms," J. Optimiz. Theory App., vol. 158, no. 2, pp. 460–479, Aug. 2013.
[72] J.-C. Pesquet and N. Pustelnik, "A parallel inertial proximal optimization method," Pac. J. Optim., vol. 8, no. 2, pp. 273–305, Apr. 2012.
[73] N. Komodakis and J.-C. Pesquet, "Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems," IEEE Signal Process. Mag., vol. 32, no. 6, pp. 31–54, Nov. 2015.
[74] R. Gaetano, G. Chierchia, and B. Pesquet-Popescu, "Parallel implementations of a disparity estimation algorithm based on a proximal splitting method," in Proc. Int. Conf. Visual Commun. Image Process., San Diego, USA, 2012.
[75] J. J. Moreau, "Proximite et dualite dans un espace hilbertien," Bull. Soc. Math. France, vol. 93, pp. 273–299, 1965.
[76] T. Cover, "An algorithm for maximizing expected log investment return," IEEE Trans. Inf. Theory, vol. 30, no. 2, pp. 369–373, 1984.
[77] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc. B, pp. 1–38, Dec. 1977.
[78] A. Subramanya and J. Bilmes, "Soft-supervised learning for text classification," in Proc. of EMNLP, 2008, pp. 1090–1099.
[79] G. Tartavel, G. Peyre, and Y. Gousseau, "Wasserstein loss for image synthesis and restoration," SIAM J. Imaging Sci., vol. 9, no. 4, pp. 1726–1755, 2016.
[80] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 2nd edition, 2006.
[81] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, New York, Jan. 2011.
[82] J.-B. Hiriart-Urruty and C. Lemarechal, Convex Analysis and Minimization Algorithms, Part I: Fundamentals, vol. 305 of Grundlehren der mathematischen Wissenschaften, Springer-Verlag, Berlin, Heidelberg, N.Y., 2nd edition, 1996.
[83] I. Csiszar, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[84] M. Basseville, "Distance measures for signal processing and pattern recognition," European J. Signal Process., vol. 18, no. 4, pp. 349–369, 1989.
[85] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2014.
[86] J. J. Moreau, "Fonctions convexes duales et points proximaux dans un espace hilbertien," C. R. Acad. Sci., vol. 255, pp. 2897–2899, 1962.
[87] P. L. Combettes and J.-C. Pesquet, "Proximal thresholding algorithm for minimization over orthonormal bases," SIAM J. Optim., vol. 18, no. 4, pp. 1351–1376, Nov. 2007.
[88] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the Lambert W function," Adv. Comput. Math., vol. 5, no. 1, pp. 329–359, 1996.
[89] M. El Gheche, G. Chierchia, and J.-C. Pesquet, "Proximity operators of discrete information divergences – extended version," Preprint arXiv:1606.09552, 2017.
[90] A. Hoorfar and M. Hassani, "Inequalities on the Lambert W function and hyperpower function," J. Inequal. Pure Appl. Math., vol. 9, no. 2, pp. 5–9, 2008.
[91] M. Tofighi, K. Kose, and A. E. Cetin, "Signal reconstruction framework based on Projections onto Epigraph Set of a Convex cost function (PESC)," Preprint arXiv:1402.2088, 2014.
[92] S. Ono and I. Yamada, "Second-order Total Generalized Variation constraint," in Proc. Int. Conf. Acoust. Speech Signal Process., Florence, Italy, May 2014, pp. 4938–4942.
[93] P.-W. Wang, M. Wytock, and J. Z. Kolter, "Epigraph projections for fast general convex programming," in Proc. of ICML, 2016.
[94] G. Moerkotte, M. Montag, A. Repetti, and G. Steidl, "Proximal operator of quotient functions with application to a feasibility problem in query optimization," J. Comput. Appl. Math., vol. 285, pp. 243–255, Sept. 2015.
[95] V. Markl, P. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. Tran, "Consistent selectivity estimation via maximum entropy," VLDB J., vol. 16, no. 1, pp. 55–76, 2007.
[96] D. R. Kincaid and E. W. Cheney, Numerical Analysis: Mathematics of Scientific Computing, vol. 2, American Mathematical Soc., 2002.