arXiv:2109.14920v3 [cs.IT] 18 Oct 2021

On the Kullback-Leibler divergence between discrete normal

distributions

Frank Nielsen

Sony Computer Science Laboratories Inc.

Tokyo, Japan

Abstract

Discrete normal distributions are defined as the distributions with prescribed means and covariance matrices which maximize entropy on the integer lattice support. The set of discrete normal distributions forms an exponential family with cumulant function related to the Riemann theta function. In this paper, we present several formulas for common statistical divergences between discrete normal distributions including the Kullback-Leibler divergence. In particular, we describe an efficient approximation technique for calculating the Kullback-Leibler divergence between discrete normal distributions via the Rényi α-divergences or the projective γ-divergences.

Keywords: Exponential family; discrete normal distribution; lattice Gaussian distribution; theta functions; Siegel half space; Sharma-Mittal divergence; Rényi α-divergences; γ-divergence; Cauchy-Schwarz divergence.

Contents

1 Introduction
  1.1 The continuous exponential family of normal distributions
  1.2 The set of discrete normal distributions as a discrete exponential family
  1.3 Discrete normal distributions on full-rank lattices
  1.4 Contributions and paper outline

2 Statistical divergences between discrete normal distributions
  2.1 Rényi divergences
  2.2 Kullback-Leibler divergence: Dual natural and moment parameterizations
  2.3 Sharma-Mittal divergences
  2.4 Chernoff information on the statistical manifold of discrete normal distributions

3 Numerical approximations and estimations of divergences
  3.1 Converting numerically natural to moment parameters and vice versa
  3.2 Some illustrating numerical examples
  3.3 Approximating the Kullback-Leibler divergence via projective γ-divergences

A Code snippet in Julia


1 Introduction

1.1 The continuous exponential family of normal distributions

The d-variate normal distribution N(µ, Σ) is characterized as the unique continuous distribution defined on the support X = R^d with prescribed mean µ and covariance matrix Σ which maximizes Shannon's differential entropy [13]. Let P_d denote the open cone of positive-definite matrices and Λ = {(µ, Σ) : µ ∈ R^d, Σ ∈ P_d} the parameter space of the normal distributions. The probability density function (pdf) of a multivariate normal distribution N(µ, Σ) with parameterization λ = (µ, Σ) ∈ Λ is

$$ q_\lambda(x) = p_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right), \qquad \lambda\in\Lambda,\ x\in\mathbb{R}^d, $$

where |Σ| denotes the determinant of the covariance matrix. The set of normal distributions forms an exponential family [38, 6] with pdfs [33] written canonically as

$$ q_\rho(x) = \frac{1}{Z_R(\rho)} \exp\left(x^\top \rho_1 + \mathrm{tr}\left(-\frac{1}{2} x x^\top \rho_2\right)\right), \qquad (1) $$

where ρ = (ρ_1 = Σ^{-1}µ, ρ_2 = Σ^{-1}) are the natural parameters corresponding to the sufficient statistics t(x) = (x, -½ x x^⊤), and Z_R(ρ) is the partition function which normalizes the positive unnormalized density:

$$ Z_R(\rho) = \int_{\mathbb{R}^d} \exp\left(x^\top\rho_1 - \frac{1}{2}x^\top\rho_2 x\right)\mathrm{d}x = (2\pi)^{\frac{d}{2}}\, |\rho_2^{-1}|^{\frac{1}{2}} \exp\left(\frac{1}{2}\rho_1^\top \rho_2^{-1}\rho_1\right). \qquad (2) $$

Notice that we used the invariance of the matrix trace under cyclic permutations to get the last equality of Eq. 3. The cumulant function¹ F_R(ρ) = log Z_R(ρ) of the multivariate normal distributions is

$$ F_R(\rho) = \frac{1}{2}\left(\rho_1^\top \rho_2^{-1}\rho_1 - \log|\rho_2| + d\log(2\pi)\right). $$
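As a quick sanity check of these conversions, here is a minimal Julia sketch (the helper names natural_parameter and cumulant_continuous are ours, not part of the paper's code) that maps λ = (µ, Σ) to ρ and evaluates F_R(ρ):

using LinearAlgebra

# Natural parameter rho = (rho1, rho2) = (Sigma^{-1} mu, Sigma^{-1})
function natural_parameter(mu::Vector, Sigma::Matrix)
    rho2 = inv(Sigma)
    return (rho2 * mu, rho2)
end

# Cumulant (log-partition) function F_R(rho) of the continuous normal
function cumulant_continuous(rho1::Vector, rho2::Matrix)
    d = length(rho1)
    return 0.5 * (dot(rho1, rho2 \ rho1) - logdet(rho2) + d * log(2pi))
end

mu = [-0.2, -0.2]; Sigma = [0.1 0.0; 0.0 0.2]
rho1, rho2 = natural_parameter(mu, Sigma)
F = cumulant_continuous(rho1, rho2)   # equals log Z_R(rho); for mu = 0, Sigma = I it is (d/2) log(2*pi)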

Thus the pdf of a normal distribution writes canonically as the pdf of an exponential family:

$$ q_\rho(x) = \exp\Big(\underbrace{x^\top\rho_1 - \tfrac{1}{2}x^\top\rho_2 x}_{\langle\rho,\, t(x)\rangle} - \underbrace{\log Z_R(\rho)}_{F_R(\rho)}\Big), \qquad (3) $$

$$ q_\lambda(x) = \exp\big(\langle \rho(\lambda), t(x)\rangle - \log Z_R(\rho(\lambda))\big), \qquad (4) $$

where ⟨ρ, ρ'⟩ is the following compound vector-matrix inner product between ρ = (a, B) and ρ' = (a', B') with a, a' ∈ R^d and B, B' ∈ P_d:

$$ \langle \rho, \rho'\rangle = a^\top a' + \mathrm{tr}(B'B). $$

¹ Also called log-normalizer or log-partition function. The naming "cumulant function" stems from the fact that the cumulant generating function K_X(u) = log E[exp(u^⊤ t(X))] of the normal is K_X(u) = F_R(ρ + u) − F_R(ρ) for X ∼ q_ρ.


1.2 The set of discrete normal distributions as a discrete exponential family

Similarly, the d-variate discrete normal distribution² [29, 27, 2] N_Z(µ, Σ) (or discrete Gaussian distribution [1, 26]) is defined as the unique discrete distribution (Theorem 2.5 of [2]) defined on the integer lattice support X = Z^d with prescribed mean µ and covariance matrix Σ which maximizes Shannon's entropy. Therefore the set of discrete normal distributions is a discrete exponential family with probability mass function (pmf) which can be written canonically as

$$ p_\xi(l) = \frac{1}{Z_\mathbb{Z}(\xi)} \exp\left(2\pi\left(-\frac{1}{2} l^\top\xi_2 l + l^\top\xi_1\right)\right), \qquad l\in\mathbb{Z}^d. \qquad (5) $$

The sufficient statistic³ is t(x) = (2πx, −πxx^⊤) but the natural parameter ξ = (ξ_1, ξ_2) cannot be written easily as a function of the λ = (µ, Σ) ∈ Λ parameters, where µ := E_{p_ξ}[x] and Σ = Cov_{p_ξ}[x] = E_{p_ξ}[(x − µ)(x − µ)^⊤]. It can be shown that the normalizer is related to the Riemann theta function θ_R (Eq. 21.2.1 of [43]) as follows:

$$ Z_\mathbb{Z}(\xi) = \theta_R(-i\xi_1,\, i\xi_2), $$

where the complex-valued theta function is the holomorphic function defined by its Fourier series as follows:

$$ \theta_R : \mathbb{C}^d \times \mathbb{H}_d \to \mathbb{C}, \qquad \theta_R(z, \Omega) := \sum_{l\in\mathbb{Z}^d} \exp\left(2\pi i\left(\frac{1}{2} l^\top\Omega l + l^\top z\right)\right), $$

where H_d denotes the Siegel upper space⁴ [45] of symmetric complex matrices with positive-definite imaginary part:

$$ \mathbb{H}_d = \left\{ R \in M(d,\mathbb{C}) \,:\, R = R^\top,\ \mathrm{Im}(R)\in P_d \right\}, $$

with M(d, C) denoting the set of d × d matrices with complex entries. A matrix R ∈ H_d is called a Riemann matrix. A Riemann matrix can be associated to a plane algebraic curve (the locus of the zeros of a complex polynomial P(x, y) with x, y ∈ C) via a compact Riemann surface [18, 47].

Remark 1 Notice that the parameterization λ = (µ, Σ) of the continuous normal distribution applied to the discrete normal distribution for the pmf

$$ p_\lambda(l) \propto \exp\left(-\frac{1}{2}(l-\mu)^\top\Sigma^{-1}(l-\mu)\right), \qquad l\in\mathbb{Z}^d, $$

yields in general E_{p_λ}[X] ≠ µ and Cov_{p_λ}[X] ≠ Σ. Navarro and Ruiz [30] used the parameterization (a, b) to express the univariate pmf as

$$ p_{a,b}(x) = \frac{\exp\left(-\frac{(x-b)^2}{2a^2}\right)}{c(a,b)}, $$

² The term "discrete normal distribution" was first mentioned in [29], page 22 (1972).
³ The canonical decomposition of exponential families is not unique. We may choose t_s(x) = s t(x) and ξ_s = (1/s) ξ for any non-zero scalar s: the inner product remains invariant, ⟨t(x), ξ⟩ = ⟨t_s(x), ξ_s⟩. Here, we choose s = 2π in order to reveal the Riemann theta function.
⁴ The Siegel upper space generalizes the Poincaré hyperbolic upper plane [35] H = {z = x + iy ∈ C : y > 0} = H_1.



Figure 1: Plot of unnormalized discrete normal distributions. Top: p̃_ξ on the 1D integer lattice Z clipped at [−10, 10] for ξ = (0, 0.3) (left) and ξ = (0.25, 0.15) (right). Notice that when ξ_1 ∈ Z, the discrete normal is symmetric (left) but not for ξ_1 ∉ Z (right). Bottom: p̃_ξ on the 2D integer lattice Z² clipped at [−7, 7] × [−7, 7]: (left) ξ_1 = (0, 0) and ξ_2 = diag(1/10, 1/10); (right) ξ_1 = (0, 0) and ξ_2 = diag(1/10, 1/2).

where c(a, b) := Σ_{x∈Z} exp(−(x − b)²/(2a²)). This expression shows that discrete normal distributions are symmetric around the unique mode b: p(b − x) = p(b + x). Moreover, when b is an integer, we have E_{p_{a,b}}[x] = b, and σ²(a, b) = Var_{p_{a,b}}[x] = a³ c'(a)/c(a), where c(a) = c(a, b) for integer b [30].

In the remainder, let us denote the partition function of the discrete normal distributions N_Z(ξ) by

$$ \theta : \mathbb{R}^d\times P_d \to \mathbb{R}_+, \qquad \xi \mapsto \theta(\xi) := \theta_R(-i\xi_1,\, i\xi_2) = \sum_{l\in\mathbb{Z}^d} \exp\left(2\pi\left(-\frac{1}{2} l^\top\xi_2 l + l^\top\xi_1\right)\right), $$

with the corresponding cumulant function F_Z(ξ) = log θ(ξ). Both the continuous and discrete normal distributions are minimal regular exponential families with open natural parameter spaces and linearly independent sufficient statistic functions t_i. The orders of the R-pmf discrete normal distributions and the C-pmf discrete normal distributions are d(d+3)/2 and d(d+3), respectively. By definition, the standard discrete normal distribution has zero mean and unit variance: its corresponding natural parameter ξ_std can be approximated numerically as ξ_std ≈ (0, 0.1591549 × I) [2], where I denotes the identity matrix. Observe that it is fairly different from the natural parameter ρ_std = (0, I) of the continuous normal distribution.
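Since θ(ξ) is an infinite lattice sum, any implementation must truncate it. The following Julia sketch (our own helper; it uses a plain box truncation rather than the ellipsoid method of [16] used by Theta.jl) approximates θ(ξ):

using LinearAlgebra

# Brute-force approximation of theta(xi) = sum_{l in Z^d} exp(2*pi*(-l'*xi2*l/2 + l'*xi1))
# by truncating the lattice sum to the box [-radius, radius]^d.
function theta_box(xi1::Vector{Float64}, xi2::Matrix{Float64}; radius::Int=20)
    d = length(xi1)
    total = 0.0
    for idx in Iterators.product(ntuple(_ -> -radius:radius, d)...)
        l = collect(float.(idx))
        total += exp(2pi * (-0.5 * dot(l, xi2 * l) + dot(l, xi1)))
    end
    return total
end

xi1 = [0.0, 0.0]; xi2 = [0.1 0.0; 0.0 0.2]
theta_box(xi1, xi2)   # finite-box estimate of theta(xi)

Since ξ_2 is positive-definite, the terms decay quickly and a modest box radius suffices for the parameters used in this paper; the ellipsoid truncation of [16] is far more efficient in higher dimensions.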


Normal distribution | support | natural parameter space | sufficient stats | normalizer
R-pdf continuous q_ρ ∼ N(ρ) | R^d | ρ = (Σ^{-1}µ, Σ^{-1}) ∈ R^d × P_d | (x, −½ xx^⊤) | Z_R(ρ)
R-pmf discrete p_ξ ∼ N_Z(ξ) | Z^d | ξ ∈ R^d × P_d | (2πx, −π xx^⊤) | θ(ξ) = θ_R(−iξ_1, iξ_2)
R-pmf lattice p_ξ ∼ N_Λ(ξ) | Λ = LZ^d | ξ = (a, B) ∈ R^d × P_d | (2πx, −π xx^⊤) | θ_Λ(ξ) = θ_R(−iL^⊤a, iL^⊤BL)
C-pmf discrete p^C_ζ ∼ N_Z(ζ) | Z^d | ζ ∈ C^d × H^right_d | (2πx, −π xx^⊤) | θ_R(ζ)

Table 1: Summary of the ordinary normal, discrete normal and C-pmf discrete normal distributions viewed as natural exponential families.

Let p_ξ(x) = p̃_ξ(x)/θ(ξ), with

$$ \tilde{p}_\xi(x) = \exp\left(2\pi\left(-\frac{1}{2}x^\top\xi_2 x + x^\top\xi_1\right)\right). $$

Figure 1 displays the plots of the unnormalized pmfs of two 1D discrete normal distributions and two 2D discrete normal distributions.

The discrete normal pmf of Eq. 5 (R-valued pmf) can be extended to a complex-valued pmf⁵ p^C_ζ(l) (C-pmf) when the parameter ζ belongs to the set C^d × H^right_d \ Θ_0, where H^right_d is the Siegel right half-space (symmetric complex matrices with positive-definite real parts) and

$$ \Theta_0 = \left\{(a,B)\in\mathbb{C}^d\times\mathbb{H}^{\mathrm{right}}_d \,:\, \theta_R(a,B) = 0\right\} $$

is called the universal theta divisor [15, 2]. The zeros of the Riemann theta function⁶ θ_R form an analytic variety of complex dimension d − 1. Notice that both the probabilities and the parameter space of the complex discrete normal distribution are complex-valued (C-pmf). For example, consider ζ_1 = (0, 0) and ζ_2 = (1 + i)I (where I denotes the identity matrix); then the C-pmf evaluated at l = (0, 0) is 1/θ(ζ) ≈ 1/(4 + 2i), which is a complex number.

The relationship between univariate discrete normal distributions and the Jacobi theta function was first reported in [48]. Studying more generally the C-pmf discrete normal distributions using the Siegel right half-space H^right_d and the Riemann theta function⁷ allows one to obtain results on the real-valued discrete normal distributions more easily via properties of the theta function. For example, Agostini and Améndola [2] (Proposition 3.1) proved the quasiperiodicity⁸ of the complex discrete normal distributions: p^C_{a+iu+Bv, B}(x) = p^C_{a,B}(x − v) for any (u, v) ∈ Z^d × Z^d. We also have p_{(a+Bλ, B)}(x) = p_{(a,B)}(x − λ) for λ ∈ Z^d, and p_{(a,B)}(x) = p_{(−a,B)}(−x) for (real) discrete normal distributions (parity property). Notice that the C-pmf discrete normal distributions are not identifiable, i.e., ζ ↦ p_ζ is not one-to-one (Proposition 3.3 of [2]), but the R-pmf discrete normal distributions are identifiable.

Table 1 displays the three types of normal distributions handled in this paper.

A key property of Gaussian distributions is that the family is invariant under the action of affine automorphisms of R^d. Similarly, the family of discrete Gaussian distributions is invariant under

⁵ Complex-valued probabilities have been explored in quantum physics where the wave function can be interpreted as a complex-valued probability amplitude [51].
⁶ When d = 1, the Riemann theta function is called the Jacobi theta function θ(z, ω) = Σ_{l∈Z} exp(2πilz + πil²ω). More precisely, we have θ_R(a, b) = θ_3(πa, b) where θ_3 denotes the third Jacobi theta function [43].
⁷ By extending θ to the Siegel right half-space.
⁸ Namely, the Riemann theta function enjoys the following quasiperiodicity property: θ_R(z + u, Ω) = θ_R(z, Ω) (periodic in z with integer periods) and θ_R(z + Ωv, Ω) = exp(−2πi(½ v^⊤Ωv + v^⊤z)) θ_R(z, Ω) for any u, v ∈ Z^d. The theta function can be generalized to the Riemann theta function with characteristics which involves a non-integer shift in its argument [47].


Figure 2: Top: Two examples of lattices with their basis defining a fundamental parallelepiped: the left one yields a subset of Z² while the second one coincides with Z². Bottom: Lattice Gaussian N_Λ(ξ) with Λ = LZ² for L = [1 1; 0 1], and ξ_1 = (0, 0) and ξ_2 = diag(0.1, 0.5). The lattice points are displayed in blue and the unnormalized pmf values at the lattice points are shown in red.

the action of affine automorphisms of Z^d (Proposition 3.5 of [2]):

$$ \forall \alpha\in\mathrm{GL}(d,\mathbb{Z}), \qquad \alpha X_\xi = X_{\alpha^{-\top}\xi_1,\ \alpha^{-\top}\xi_2\alpha^{-1}}. $$

The parity property of discrete Gaussians follows (Remark 3.7 of [2]):

$$ X_{-\xi_1,\, \xi_2} \sim -X_\xi. $$

The discrete normal distributions play an important role as the counterpart of the normal distributions in robust implementations on finite-precision arithmetic computers of algorithms in differential privacy [50, 8] and lattice-based cryptography [7]. Recently, the discrete normal distributions have also been used in machine learning for a particular type of Boltzmann machine termed the Riemann-Theta Boltzmann machine [10] (RTBM). RTBMs have continuous visible states and discrete hidden states, and the probability of the hidden states follows a discrete multivariate Gaussian.

Let us mention that there exist other definitions of discrete normal distributions. For example, a discrete normal distribution may be obtained by quantizing the cumulative distribution function of the normal distribution [44]. This approach is also taken when considering mixtures of discrete normal distributions in [31].

1.3 Discrete normal distributions on full-rank lattices

Discrete normal distributions can also be defined on a d-dimensional lattice Λ (also called full-rank lattice Gaussian distributions or lattice Gaussian measures, with support not necessarily the integer lattice Z^d [22, 28]) by choosing a set of linearly independent basis vectors l_1, . . . , l_d arranged in


a basis matrix L = [l_1, . . . , l_d] and defining the lattice Λ = LZ^d = {Ll : l ∈ Z^d}. The pmf of a random variable X ∼ N_Λ(ξ) is

$$ p_\xi(x) = \frac{1}{\theta_\Lambda(\xi)} \exp\left(2\pi\left(-\frac{1}{2}x^\top\xi_2 x + x^\top\xi_1\right)\right), \qquad x\in\Lambda. $$

The above pmf can further be specialized to a random variable X ∼ N_Λ(c, σ) (a lattice Gaussian with variance σ² and center c) with pmf p_ξ(l) = (1/(√(2π)σ)^d) exp(−‖l − c‖²_2/σ²). For a general lattice Λ = LZ^d, we may define the lattice Gaussian distribution N_Λ(ξ) with ξ = (a, B) and normalizer

$$ \theta_\Lambda(\xi) := \sum_{l\in\Lambda} \exp\left(2\pi\left(-\frac{1}{2} l^\top\xi_2 l + l^\top\xi_1\right)\right). $$

When L = I (the identity matrix), the lattice Gaussian distributions are the discrete normal distributions, but other non-identity basis matrices may also generate Z² (see Figure 2). Since θ_Λ(ξ) = Σ_{l∈Z^d} exp(2π(−½(Ll)^⊤ξ_2(Ll) + (Ll)^⊤ξ_1)), we have the following proposition:

Proposition 1 The normalizer of a lattice normal distribution N_Λ(ξ) for Λ = LZ^d and ξ = (a, B) amounts to the following Riemann theta function:

$$ \theta_\Lambda(\xi) = \theta_R(-iL^\top a,\ iL^\top B L). $$
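Proposition 1 can be checked numerically: summing the unnormalized pmf over Lz for z ∈ Z^d is the same as evaluating the integer-lattice theta sum at the transformed parameters (L^⊤a, L^⊤BL). A small Julia sketch under a box truncation (helper name is ours):

using LinearAlgebra

# Box-truncated theta(a, B) = sum_{l} exp(2*pi*(-l'*B*l/2 + l'*a)) over l in [-r, r]^2
theta2d(a, B; r=20) = sum(exp(2pi * (-0.5 * dot([i, j], B * [i, j]) + dot([i, j], a)))
                          for i in -r:r, j in -r:r)

L = [1.0 1.0; 0.0 1.0]                 # lattice basis of Figure 2
a = [0.0, 0.0]; B = [0.1 0.0; 0.0 0.5]
theta2d(L' * a, L' * B * L)            # box estimate of theta_Lambda(xi), cf. Proposition 1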

Last, we can translate the lattice LZ^d by c ∈ R^d (i.e., Λ = LZ^d + c) so that we have the full generic pmf of a lattice Gaussian, which can be written for ξ = (a, B) as

$$ p^\Lambda_\xi(l) = \frac{1}{\theta_\Lambda(\xi)} \exp\left(2\pi\left(-\frac{1}{2} l^\top\xi_2 l + l^\top\xi_1\right)\right), \qquad l\in L\mathbb{Z}^d + c, \qquad (6) $$

where

$$ \theta_\Lambda(\xi) = \sum_{l\in\Lambda = L\mathbb{Z}^d + c} \exp\left(2\pi\left(-\frac{1}{2} l^\top\xi_2 l + l^\top\xi_1\right)\right) \qquad (7) $$
$$ \qquad\quad\; = \sum_{z\in\mathbb{Z}^d} \exp\left(2\pi\left(-\frac{1}{2} (Lz+c)^\top\xi_2 (Lz+c) + (Lz+c)^\top\xi_1\right)\right). \qquad (8) $$

The normalizer is related to the Riemann theta functions with characteristics [43] α, β ∈ R^d:

$$ \theta_R\!\begin{bmatrix}\alpha\\ \beta\end{bmatrix}\!(a, B) := \sum_{l\in\mathbb{Z}^d} e^{2\pi i\left(\frac{1}{2}(l+\alpha)^\top B (l+\alpha) + (l+\alpha)^\top (a+\beta)\right)} = e^{2\pi i\left(\frac{1}{2}\alpha^\top B\alpha + \alpha^\top(a+\beta)\right)}\, \theta_R(a + B\alpha + \beta,\ B). $$

For example, when L = I, we have α = c and β = 0.


1.4 Contributions and paper outline

We summarize our main contributions as follows: We report a formula for the Rényi α-divergences between two discrete normal distributions in Proposition 3, including related results for the Bhattacharyya divergence, the Hellinger divergence, and Amari's α-divergences. We give a formula for the cross-entropy between two discrete normal distributions in Proposition 6 which yields a formula for the Kullback-Leibler divergence (Proposition 5 and Proposition 7). More generally, we extend the formula to Sharma-Mittal divergences in Proposition 8. In Section 3, we show how to implement these formulas using numerical approximations of the theta function. We also propose a fast technique to approximate the Kullback-Leibler divergence between discrete normal distributions relying on γ-divergences [20] (Proposition 9).

2 Statistical divergences between discrete normal distributions

2.1 Rényi divergences

The Rényi α-divergence [49] from pmf r(x) to pmf s(x) on support X is defined for any positive real α ≠ 1 by

$$ D_\alpha[r : s] = \frac{1}{\alpha-1}\log\left(\sum_{x\in\mathcal{X}} r(x)^\alpha s(x)^{1-\alpha}\right) = \frac{1}{\alpha-1}\log E_s\!\left[\left(\frac{r(x)}{s(x)}\right)^\alpha\right], \qquad \alpha>0,\ \alpha\neq 1. $$

When α = 1/2, the Rényi α-divergence amounts to twice the symmetric Bhattacharyya divergence [37]: D_{1/2}[r : s] = 2 D_Bhattacharyya[r, s] with

$$ D_{\mathrm{Bhattacharyya}}[r, s] := -\log\left(\sum_{x\in\mathcal{X}}\sqrt{r(x)s(x)}\right). $$

The Bhattacharyya divergence can be interpreted as the negative logarithm of the Bhattacharyya coefficient:

$$ \rho_{\mathrm{Bhattacharyya}}[r, s] = \sum_{x\in\mathcal{X}}\sqrt{r(x)s(x)}. $$

A divergence related to the Bhattacharyya divergence is the squared Hellinger divergence:

$$ D^2_{\mathrm{Hellinger}}[r, s] = \frac{1}{2}\sum_{x\in\mathcal{X}}\left(\sqrt{r(x)} - \sqrt{s(x)}\right)^2 = 1 - \rho_{\mathrm{Bhattacharyya}}[r, s]. $$

The squared Hellinger divergence is one fourth of the α-divergence for α = 1/2 [4], where the α-divergences are defined by

$$ D_{\mathrm{Amari},\alpha}[r : s] = \frac{1}{\alpha(1-\alpha)}\left(1 - \rho_{\mathrm{Bhattacharyya},\alpha}[r : s]\right). $$

The α-divergences can be calculated from the skewed Bhattacharyya coefficients for α ∈ R \ {0, 1}:

$$ \rho_{\mathrm{Bhattacharyya},\alpha}[r : s] = \sum_{x\in\mathcal{X}} r(x)^\alpha s(x)^{1-\alpha}. $$


Proposition 5 of [8] upper bounds the Rényi α-divergence between discrete normal distributions with the same variance σ² as:

$$ D_\alpha\!\left[N_\mathbb{Z}(\mu_1,\sigma^2) : N_\mathbb{Z}(\mu_2,\sigma^2)\right] \le \frac{\alpha(\mu_1-\mu_2)^2}{2\sigma^2}. $$

Rényi α-divergences are non-decreasing with α [49]. When both pmfs belong to the same discrete exponential family with log-normalizer F(ξ) = log θ(ξ), the Rényi α-divergence [41] amounts to an α-skewed Jensen divergence [37] between the corresponding natural parameters:

$$ D_\alpha[p_\xi : p_{\xi'}] = \frac{1}{1-\alpha} J_{F,\alpha}(\xi : \xi'), $$

where

$$ J_{F,\alpha}(\xi : \xi') := \alpha F(\xi) + (1-\alpha)F(\xi') - F(\alpha\xi + (1-\alpha)\xi'). $$

Indeed, let

$$ I_{\alpha,\beta}[r : s] = \sum_{x\in\mathcal{X}} r(x)^\alpha s(x)^\beta, \qquad \alpha,\beta\in\mathbb{R}. $$

Then we have the following lemma:

Proposition 2 For two pmfs p_ξ and p_{ξ'} of a discrete exponential family with log-normalizer F(ξ), with αξ + βξ' ∈ Ξ, we have

$$ I_{\alpha,\beta}[p_\xi : p_{\xi'}] = \exp\left(F(\alpha\xi + \beta\xi') - \left(\alpha F(\xi) + \beta F(\xi')\right)\right). $$

Proof: We have

$$ I_{\alpha,\beta}[p_\xi : p_{\xi'}] = \sum_{x\in\mathcal{X}} \exp\left(\langle t(x), \alpha\xi\rangle - \alpha F(\xi)\right)\exp\left(\langle t(x), \beta\xi'\rangle - \beta F(\xi')\right) = e^{F(\alpha\xi+\beta\xi') - (\alpha F(\xi) + \beta F(\xi'))}\underbrace{\sum_{x\in\mathcal{X}} e^{\langle t(x),\, \alpha\xi + \beta\xi'\rangle - F(\alpha\xi+\beta\xi')}}_{=1}, $$

since Σ_{x∈X} p_{αξ+βξ'}(x) = 1 when αξ + βξ' ∈ Ξ. Thus we get the following proposition:

Proposition 3 The Rényi α-divergence between two discrete normal distributions p_ξ and p_{ξ'} for α > 0 and α ≠ 1 is

$$ D_\alpha[p_\xi : p_{\xi'}] = \frac{1}{1-\alpha}\left(\alpha\log\frac{\theta(\xi)}{\theta(\alpha\xi + (1-\alpha)\xi')} + (1-\alpha)\log\frac{\theta(\xi')}{\theta(\alpha\xi + (1-\alpha)\xi')}\right). \qquad (9) $$


Figure 3: Approximating the Riemann theta function θ_R by summing over the integer lattice points falling inside an ellipsoid E: θ(ξ) ≈ θ(ξ; E).

Proof: We have

$$ D_\alpha[p_\xi : p_{\xi'}] = \frac{1}{1-\alpha}\left(\alpha\log\theta(\xi) + (1-\alpha)\log\theta(\xi') - \log\theta(\alpha\xi + (1-\alpha)\xi')\right). $$

Plugging log θ(αξ + (1−α)ξ') = (α + 1 − α) log θ(αξ + (1−α)ξ') into the right-hand side yields the result. Notice that we can also express the Rényi divergences as

$$ D_\alpha[p_\xi : p_{\xi'}] = \frac{1}{1-\alpha}\log\frac{\theta(\xi)^\alpha\,\theta(\xi')^{1-\alpha}}{\theta(\alpha\xi + (1-\alpha)\xi')}. $$

See [16, 19, 3] for efficient numerical approximations of the Riemann theta function. Basically, the infinite theta series θ(ξ) is approximated by a finite summation over a region R of integer lattice points:

$$ \theta(\xi; R) := \sum_{x\in R}\exp\left(2\pi\left(-\frac{1}{2}x^\top\xi_2 x + x^\top\xi_1\right)\right). $$

When R = Z^d, we have θ(ξ; R) = θ(ξ). The method proposed in [16] consists in choosing the integer lattice points E_ξ falling inside an ellipsoid to approximate the theta function, as illustrated in Figure 3.

Thus we have the following proposition:

Proposition 4 The squared Hellinger distance between two discrete normal distributions p_ξ and p_{ξ'} is

$$ D^2_{\mathrm{Hellinger}}[p_\xi, p_{\xi'}] = 1 - \frac{\theta\!\left(\frac{\xi+\xi'}{2}\right)}{\sqrt{\theta(\xi)\theta(\xi')}}. $$
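Eq. 9 and Proposition 4 only require a handful of theta evaluations. A minimal Julia sketch for the bivariate case (using a box-truncated theta sum in place of a dedicated theta-function package; all helper names are ours):

using LinearAlgebra

# Box-truncated theta for 2D parameters (a simple stand-in for the ellipsoid method of [16])
theta2d(a, B; r=20) = sum(exp(2pi * (-0.5 * dot([i, j], B * [i, j]) + dot([i, j], a)))
                          for i in -r:r, j in -r:r)

# Renyi alpha-divergence of Eq. (9)
function renyi(alpha, a, B, ap, Bp)
    ai = alpha .* a .+ (1 - alpha) .* ap      # interpolated natural parameter
    Bi = alpha .* B .+ (1 - alpha) .* Bp
    ti = theta2d(ai, Bi)
    (1 / (1 - alpha)) * (alpha * log(theta2d(a, B) / ti) + (1 - alpha) * log(theta2d(ap, Bp) / ti))
end

# Squared Hellinger distance of Proposition 4
hellinger2(a, B, ap, Bp) = 1 - theta2d((a .+ ap) ./ 2, (B .+ Bp) ./ 2) / sqrt(theta2d(a, B) * theta2d(ap, Bp))

a  = [-0.2, -0.2]; B  = [0.1 0.0; 0.0 0.2]
ap = [ 0.2,  0.2]; Bp = [0.15 0.0; 0.0 0.25]
renyi(0.5, a, B, ap, Bp)        # twice the Bhattacharyya divergence
hellinger2(a, B, ap, Bp)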


We can also write

$$ D_\alpha[p_\xi : p_{\xi'}] = \frac{1}{\alpha-1}\log E_{p_{\xi'}}\!\left[\left(\frac{p_\xi}{p_{\xi'}}\right)^\alpha\right] = \frac{1}{\alpha-1}\left(\alpha\log\frac{\theta(\xi')}{\theta(\xi)} + \log E_{p_{\xi'}}\!\left[\left(\frac{\tilde{p}_\xi(x)}{\tilde{p}_{\xi'}(x)}\right)^\alpha\right]\right) = \frac{\alpha}{\alpha-1}\log\frac{\theta(\xi')}{\theta(\xi)} + \frac{1}{\alpha-1}\log\left(\frac{1}{\theta(\xi')}\sum_{l\in\mathbb{Z}^d}\tilde{p}_{\xi'}(l)\left(\frac{\tilde{p}_\xi(l)}{\tilde{p}_{\xi'}(l)}\right)^\alpha\right). $$

This last expression can be numerically estimated.

The Bhattacharyya divergence between two discrete normal distributions p_ξ and p_{ξ'} can be expressed as an equivalent Jensen divergence between their natural parameters:

$$ D_{\mathrm{Bhattacharyya}}[p_\xi, p_{\xi'}] = J_F(\xi, \xi'), $$

where

$$ J_F(\xi : \xi') := \frac{F(\xi) + F(\xi')}{2} - F\!\left(\frac{\xi+\xi'}{2}\right). $$

Thus we have

$$ D_{\mathrm{Bhattacharyya}}[p_\xi, p_{\xi'}] = \log\frac{\sqrt{\theta(\xi)\theta(\xi')}}{\theta\!\left(\frac{\xi+\xi'}{2}\right)}. $$

We can also express the Bhattacharyya divergence using the unnormalized pmfs:

$$ D_{\mathrm{Bhattacharyya}}[p_\xi, p_{\xi'}] = \log\sqrt{\theta(\xi)\theta(\xi')} - \log\left(\sum_{l\in\mathbb{Z}^d}\sqrt{\tilde{p}_\xi(l)\,\tilde{p}_{\xi'}(l)}\right). $$

Consider the transformations τ that leave the θ function invariant: θ(τ(ξ)) = θ(ξ). Then the Rényi α-divergence simplifies to the following formula:

$$ D_\alpha[p_\xi : p_{\tau(\xi)}] = \frac{1}{1-\alpha}\log\frac{\theta(\xi)}{\theta(\alpha\xi + (1-\alpha)\tau(\xi))}. \qquad (10) $$

For example, consider ξ_1 = ξ'_1 ∈ Z^d, ξ_2 = diag(b_1, . . . , b_d) and ξ'_2 = diag(σ(b_1, . . . , b_d)) for a permutation σ ∈ S_d. Then we have θ(ξ') = θ(ξ), and the formula of Eq. 10 applies.

2.2 Kullback-Leibler divergence: Dual natural and moment parameterizations

When α → 1, the Rényi α-divergences tend asymptotically to the Kullback-Leibler divergence (KLD). The KLD between two pmfs r(x) and s(x) defined on the support X is defined by

$$ D_{\mathrm{KL}}[r : s] = \sum_{x\in\mathcal{X}} r(x)\log\frac{r(x)}{s(x)}. $$

In general, the KLD between two pmfs of a discrete exponential family amounts to a reverse Bregman divergence between their natural parameters [34]:

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = B^*_F(\xi : \xi') = B_F(\xi' : \xi), $$


where the Bregman divergence with generator F(ξ) is defined by

$$ B_F(\xi' : \xi) = F(\xi') - F(\xi) - \langle \xi' - \xi, \nabla F(\xi)\rangle, $$

and ⟨ξ, ξ'⟩ is the following compound vector-matrix inner product between ξ = (a, B) and ξ' = (a', B') with a, a' ∈ R^d and B, B' ∈ P_d:

$$ \langle\xi, \xi'\rangle = a^\top a' + \mathrm{tr}(B'B). $$

The gradient ∇F(ξ) = ∇θ(ξ)/θ(ξ) defines the dual parameter η of an exponential family: η = ∇F(ξ). This dual parameter is called the moment parameter (or the expectation parameter) because we have E_{p_ξ}[t(x)] = ∇F(ξ) and therefore η = E_{p_ξ}[t(x)]. A discrete normal distribution can thus be parameterized either by its ordinary parameter λ = (µ, Σ), its natural parameter ξ, or its dual moment parameter η. We write the distributions accordingly: N_Z(λ), N_Z(ξ), and N_Z(η), with corresponding pmfs p_λ(x), p_ξ(x), and p_η(x).

There exists a bijection between the space of natural parameters and the space of moment parameters induced by the Legendre-Fenchel transformation of the cumulant function:

$$ F^*(\eta) = \sup_{\xi\in\Xi}\left\{\langle\xi, \eta\rangle - F(\xi)\right\}, $$

where Ξ = R^d × P_d. The function F* is called the convex conjugate and induces a dual Bregman divergence so that we have B_F(ξ' : ξ) = B_{F*}(η : η') with η' = ∇F(ξ'). The dual parameters are linked as follows: η = ∇F(ξ), ξ = ∇F*(η), and therefore we get:

$$ F^*(\eta) = \langle\xi, \eta\rangle - F(\xi). $$

The convex conjugate of the cumulant function F(ξ) is called the negentropy because it can be shown [6, 39] that

$$ F^*(\eta) = -H[p_\xi] = \sum_{x\in\mathcal{X}} p_\xi(x)\log p_\xi(x), $$

where H[p_ξ] = −Σ_{x∈X} p_ξ(x) log p_ξ(x) denotes Shannon's entropy of the random variable X ∼ p_ξ.

The maximum likelihood estimator (MLE) of a density of an exponential family from n identically and independently distributed samples x_1, . . . , x_n is given by [6]:

$$ \hat{\eta} = \frac{1}{n}\sum_{i=1}^n t(x_i). $$

It follows from the equivariance property of the MLE that we have ξ̂ = ∇F*(η̂). We get the following MLE for the discrete normal family:

$$ \hat{\eta}_1 = \frac{2\pi}{n}\sum_{i=1}^n x_i = 2\pi\hat{\mu}, \qquad \hat{\eta}_2 = -\frac{\pi}{n}\sum_{i=1}^n x_i x_i^\top = -\pi(\hat{\Sigma} + \hat{\mu}\hat{\mu}^\top). $$


The Fenchel-Young inequality for the convex conjugates F(ξ) and F*(η) is

$$ F(\xi) + F^*(\eta') \ge \langle\xi, \eta'\rangle, $$

with equality holding if and only if η' = ∇F(ξ). The Fenchel-Young inequality induces a Fenchel-Young divergence:

$$ Y_{F,F^*}(\xi : \eta') := F(\xi) + F^*(\eta') - \langle\xi, \eta'\rangle = Y_{F^*,F}(\eta' : \xi) \ge 0, $$

such that Y_{F,F*}(ξ : η') = B_F(ξ : ξ'). Thus the Kullback-Leibler divergence between two pmfs of a discrete exponential family can be expressed in the following equivalent ways using the natural/moment parameterizations:

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = B_F(\xi' : \xi) = B_{F^*}(\eta : \eta') = Y_{F^*,F}(\eta : \xi') = Y_{F,F^*}(\xi' : \eta). \qquad (11) $$

Thus using the fact that the KLD amounts to a reverse Bregman divergence for the cumulant function F(ξ) = log θ(ξ), we get the following proposition:

Proposition 5 The Kullback-Leibler divergence between two discrete normal distributions p_ξ and p_{ξ'} with natural parameters ξ and ξ' is

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = \log\frac{\theta(\xi')}{\theta(\xi)} - \frac{1}{\theta(\xi)}\langle\xi' - \xi, \nabla\theta(\xi)\rangle. $$

Some software packages for the Riemann theta function can numerically approximate both the theta function and its derivatives [3]. Using the periodicity property of the theta function for ξ' = (ξ_1 + u, ξ_2) with u ∈ Z^d, we have θ(ξ') = θ(ξ), and therefore D_KL[p_ξ : p_{ξ'}] = (1/θ(ξ)) ⟨ξ − ξ', ∇θ(ξ)⟩.

For the discrete normal distributions, we can express the moment parameter using the ordinary mean-covariance parameters λ = (µ, Σ). Since the sufficient statistic is t(x) = (2πx, −πxx^⊤), we have η_1(ξ) = E_{p_ξ}[2πx] = 2πµ and η_2(ξ) = E_{p_ξ}[−πxx^⊤] = −π(Σ + µµ^⊤).

Proposition 4.4 of [2] reports the entropy of p_ξ as

$$ H[p_\xi] = \log\theta(\xi) - 2\pi\xi_1^\top\mu + \pi\,\mathrm{tr}\left(\xi_2(\Sigma + \mu\mu^\top)\right). $$

We can rewrite the entropy as minus the convex conjugate of the cumulant function:

$$ H[p_\xi] = -F^*(\eta) = F(\xi) - \langle\xi, \eta\rangle. $$

Thus the convex conjugate can be expressed as

$$ F^*(\eta) = -\log\theta(\xi) + 2\pi\mu^\top\xi_1 - \pi\,\mathrm{tr}\left(\xi_2(\Sigma + \mu\mu^\top)\right). \qquad (12) $$

The entropy of p_ξ can be calculated using the unnormalized pmf as follows:

$$ H[p_\xi] = -\sum_{l\in\mathbb{Z}^d} p_\xi(l)\log p_\xi(l) = -E_{p_\xi}[\log p_\xi(l)] = \log\theta(\xi) - \frac{1}{\theta(\xi)}\sum_{l\in\mathbb{Z}^d}\tilde{p}_\xi(l)\log\tilde{p}_\xi(l) > 0. $$


The cross-entropy between two pmfs r(x) and s(x) defined over the support X is

$$ H[r : s] = -\sum_{x\in\mathcal{X}} r(x)\log s(x). $$

Entropy is self cross-entropy: H[r] = H[r : r]. The formula for the cross-entropy of densities of an exponential family [39] can be written as

$$ H[p_\xi : p_{\xi'}] = F(\xi') - \langle\xi', \nabla F(\xi)\rangle = F(\xi') - \langle\xi', \eta\rangle. $$

Thus we get the following proposition:

Proposition 6 The cross-entropy between two discrete normal distributions p_ξ ∼ N_Z(µ, Σ) and p_{ξ'} ∼ N_Z(µ', Σ') is

$$ H[N_\mathbb{Z}(\mu,\Sigma) : N_\mathbb{Z}(\mu',\Sigma')] = \log\theta(\xi') - 2\pi\mu^\top\xi'_1 + \pi\,\mathrm{tr}\left(\xi'_2(\Sigma + \mu\mu^\top)\right). \qquad (13) $$

Notice that the cross-entropy can be written using the unnormalized pmf as

$$ H[p_\xi : p_{\xi'}] = -E_{p_\xi}[\log p_{\xi'}(x)] = \log\theta(\xi') - \frac{1}{\theta(\xi)}\sum_{l\in\mathbb{Z}^d}\tilde{p}_\xi(l)\log\tilde{p}_{\xi'}(l). $$

The KLD can be expressed as the cross-entropy minus the entropy (hence its other name, the relative entropy):

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = H[p_\xi : p_{\xi'}] - H[p_\xi]. $$

It follows that we can compute the KLD between two discrete normal distributions as follows:

Proposition 7 The Kullback-Leibler divergence between two discrete normal distributions p_ξ ∼ N_Z(µ, Σ) and p_{ξ'} ∼ N_Z(µ', Σ') is:

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = \log\frac{\theta(\xi')}{\theta(\xi)} - 2\pi\mu^\top(\xi'_1 - \xi_1) + \pi\,\mathrm{tr}\left((\xi'_2 - \xi_2)(\Sigma + \mu\mu^\top)\right). \qquad (14) $$

Notice that we use mixed (ξ, λ)-parameterizations in the above formula. In practice, we estimate discrete normal distributions and then calculate the corresponding natural parameters by solving a gradient system explained in §3.
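A minimal Julia sketch of Eq. 14 for the bivariate case is given below; it computes µ and Σ of p_ξ by a brute-force truncated lattice sum (our own helpers, standing in for the gradient-system conversion of §3):

using LinearAlgebra

# mu, Sigma and log theta(xi) of a discrete normal, by summing over the box [-r, r]^d
function moments_and_logtheta(xi1, xi2; r=20)
    d = length(xi1)
    Zt = 0.0; m = zeros(d); S = zeros(d, d)
    for idx in Iterators.product(ntuple(_ -> -r:r, d)...)
        l = collect(float.(idx))
        w = exp(2pi * (-0.5 * dot(l, xi2 * l) + dot(l, xi1)))
        Zt += w; m .+= w .* l; S .+= w .* (l * l')
    end
    mu = m ./ Zt
    Sigma = S ./ Zt .- mu * mu'
    return mu, Sigma, log(Zt)
end

# Kullback-Leibler divergence of Proposition 7 (Eq. 14), mixed (xi, lambda) parameterization
function kld_discrete_normal(xi1, xi2, xi1p, xi2p; r=20)
    mu, Sigma, logth = moments_and_logtheta(xi1, xi2; r=r)
    _, _, logthp = moments_and_logtheta(xi1p, xi2p; r=r)
    logthp - logth - 2pi * dot(mu, xi1p - xi1) + pi * tr((xi2p - xi2) * (Sigma + mu * mu'))
end

xi1  = [-0.2, -0.2]; xi2  = [0.1 0.0; 0.0 0.2]
xi1p = [ 0.2,  0.2]; xi2p = [0.15 0.0; 0.0 0.25]
kld_discrete_normal(xi1, xi2, xi1p, xi2p)   # approximates the KLD of Section 3.2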

Notice that the KLD between normal distributions can be decomposed as the sum of a squared Mahalanobis distance and a matrix Burg divergence (see Eq. 5 of [14]). For discrete normal distributions, when ξ = (a, B) with a ∈ Z^d and ξ' = (a + m, B) with m ∈ Z^d, we have θ(ξ) = θ(ξ') and µ = µ', so that the KLD simplifies to the following formula:

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = \frac{1}{\theta(\xi)}\langle(-m, B), \nabla\theta(\xi)\rangle. $$


Notice that the MLE ξ̂_n from n samples x_1, . . . , x_n ∼_{i.i.d.} p_ξ can be interpreted as a KL divergence minimization problem:

$$ \hat{\xi}_n = \arg\min_{\xi\in\Xi} D_{\mathrm{KL}}[p_e : p_\xi], $$

where p_e(x) = (1/n) Σ_{i=1}^n δ(x − x_i) denotes the empirical distribution, with δ(x) the Dirac distribution: δ(x) = 1 if and only if x = 0. Notice that when α → 1, we have (1/(1−α)) J_{F,α}(ξ : ξ') → B_F(ξ' : ξ) [37], and D_α[p_ξ : p_{ξ'}] → D_KL[p_ξ : p_{ξ'}].

2.3 Sharma-Mittal divergences

The Sharma-Mittal divergence [40] D_{α,β}[p : q] between two pmfs p(x) and q(x) defined over the discrete support X unifies the Rényi α-divergences (β → 1) with the Tsallis α-divergences (β → α):

$$ D_{\alpha,\beta}[p : q] := \frac{1}{\beta-1}\left(\left(\sum_{x\in\mathcal{X}} p(x)^\alpha q(x)^{1-\alpha}\right)^{\frac{1-\beta}{1-\alpha}} - 1\right), \qquad \forall\alpha>0,\ \alpha\neq 1,\ \beta\neq 1. $$

Moreover, we have D_{α,β}[p : q] → D_KL[p : q] when α, β → 1. For two pmfs p_ξ and p_{ξ'} belonging to the same exponential family [40], we have:

$$ D_{\alpha,\beta}[p_\xi : p_{\xi'}] = \frac{1}{\beta-1}\left(e^{-\frac{1-\beta}{1-\alpha}J_{F,\alpha}(\xi:\xi')} - 1\right). $$

Thus we get the following proposition:

Proposition 8 The Sharma-Mittal divergence D_{α,β}[p_ξ : p_{ξ'}] between two discrete normal distributions p_ξ and p_{ξ'} is:

$$ D_{\alpha,\beta}[p_\xi : p_{\xi'}] = \frac{1}{\beta-1}\left(\left(\frac{\theta(\xi)^\alpha\,\theta(\xi')^{1-\alpha}}{\theta(\alpha\xi + (1-\alpha)\xi')}\right)^{-\frac{1-\beta}{1-\alpha}} - 1\right) = \frac{1}{\beta-1}\left(\left(\frac{\theta(\xi)}{\theta(\alpha\xi + (1-\alpha)\xi')}\right)^{\frac{\alpha(\beta-1)}{1-\alpha}}\left(\frac{\theta(\xi')}{\theta(\alpha\xi + (1-\alpha)\xi')}\right)^{\beta-1} - 1\right). \qquad (15) $$

2.4 Chernoff information on the statistical manifold of discrete normal distributions

Chernoff information stems from the characterization of the error exponent in Bayesian hypothesis testing (see §11.9 of [13]). The Chernoff information between two pmfs r(x) and s(x) is defined by

$$ D_{\mathrm{Chernoff}}[r, s] := -\min_{\alpha\in[0,1]}\log\left(\sum_{x\in\mathcal{X}} r^\alpha(x)\, s^{1-\alpha}(x)\right), $$


where α* denotes the best exponent: α* = arg min_{α∈[0,1]} Σ_{x∈X} r^α(x) s^{1−α}(x). When r(x) = p_ξ(x) and s(x) = p_{ξ'}(x) are pmfs of a discrete exponential family with cumulant function F(ξ), we have (Theorem 1 of [32]):

$$ D_{\mathrm{Chernoff}}[p_\xi, p_{\xi'}] = B_F(\xi : \xi^*) = B_F(\xi' : \xi^*), $$

where ξ* := α*ξ + (1 − α*)ξ'. Thus calculating the Chernoff information amounts to first finding the best α* and second computing D_KL[p_{ξ*} : p_ξ] or equivalently D_KL[p_{ξ*} : p_{ξ'}]. By modeling the exponential family as a manifold M = {p_ξ : ξ ∈ Ξ} equipped with the Fisher information metric (a Hessian metric expressed in the ξ-coordinate system by ∇²F(ξ), so that the length element ds appears in the Taylor expansion of the KL divergence: D_KL[p_{ξ+dξ} : p_ξ] = ½ ds² = ½ dξ^⊤∇²F(ξ) dξ), we can characterize geometrically the exact α* (Theorem 2 of [32]) as the unique intersection of an exponential geodesic γ_{ξ,ξ'} with a mixture bisector Bi(ξ, ξ'), where

$$ \gamma_{\xi,\xi'} := \left\{ p_{\lambda\xi+(1-\lambda)\xi'} \propto p_\xi^\lambda\, p_{\xi'}^{1-\lambda} \,:\, \lambda\in(0,1)\right\}, \qquad \mathrm{Bi}(\xi, \xi') := \left\{ p_\omega\in M \,:\, D_{\mathrm{KL}}[p_\omega : p_\xi] = D_{\mathrm{KL}}[p_\omega : p_{\xi'}]\right\}. $$

Thus we have p_{ξ*} = γ_{ξ,ξ'} ∩ Bi(ξ, ξ'). This geometric characterization yields a fast numerical bisection technique to obtain α* within a prescribed precision error (see the sketch below). Since the discrete normal distributions form an exponential family, we can apply this technique derived from information geometry⁹ to calculate numerically the Chernoff information. Various statistical inference procedures like estimators in curved exponential families and hypothesis testing can be investigated using the information-geometric dually flat structure of M, called a statistical manifold (see [4, 34] for details).
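A minimal Julia sketch of this bisection (our own brute-force implementation: the pmfs are normalized over a truncated box rather than the theta ellipsoids, and the KLDs are summed directly):

using LinearAlgebra

# Normalized pmf values of N_Z(xi) on a finite box (brute-force truncation of the lattice)
function pmf_box(xi1, xi2; r=20)
    pts = vec([collect(float.(idx)) for idx in Iterators.product(ntuple(_ -> -r:r, length(xi1))...)])
    w = [exp(2pi * (-0.5 * dot(l, xi2 * l) + dot(l, xi1))) for l in pts]
    return pts, w ./ sum(w)
end

kld(p, q) = sum(pk * log(pk / qk) for (pk, qk) in zip(p, q))

# Bisection on alpha: the optimal exponent equalizes KL[p_{xi_alpha} : p_xi] and KL[p_{xi_alpha} : p_xi']
function chernoff_information(xi1, xi2, xi1p, xi2p; r=20, iters=50)
    _, p = pmf_box(xi1, xi2; r=r)
    _, q = pmf_box(xi1p, xi2p; r=r)
    lo, hi, a = 0.0, 1.0, 0.5
    for _ in 1:iters
        a = (lo + hi) / 2
        _, m = pmf_box(a .* xi1 .+ (1 - a) .* xi1p, a .* xi2 .+ (1 - a) .* xi2p; r=r)
        if kld(m, p) > kld(m, q)   # the optimal alpha lies further towards xi
            lo = a
        else
            hi = a
        end
    end
    _, m = pmf_box(a .* xi1 .+ (1 - a) .* xi1p, a .* xi2 .+ (1 - a) .* xi2p; r=r)
    return a, kld(m, p)            # (alpha*, Chernoff information)
end

xi1  = [-0.2, -0.2]; xi2  = [0.1 0.0; 0.0 0.2]
xi1p = [ 0.2,  0.2]; xi2p = [0.15 0.0; 0.0 0.25]
chernoff_information(xi1, xi2, xi1p, xi2p)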

Remark 2 The Fisher information of the univariate discrete normal distributions is I(ξ) = (log θ(ξ))'' = θ''(ξ)/θ(ξ) − (θ'(ξ)/θ(ξ))², where θ' and θ'' denote the first and second derivatives of the Jacobi theta function θ.

Knowing that the KL divergence between two discrete normal distributions amounts to a Bregman divergence is helpful for a number of tasks like clustering [21]: the left-sided KL centroid of n discrete normal distributions p_{ξ_1}, . . . , p_{ξ_n} amounts to a right-sided Bregman centroid, which is always the center of mass of the natural parameters [5]:

$$ \xi^* = \arg\min_\xi \sum_{i=1}^n \frac{1}{n} D_{\mathrm{KL}}[p_\xi : p_{\xi_i}] = \arg\min_\xi \sum_{i=1}^n \frac{1}{n} B_F(\xi_i : \xi) \ \Rightarrow\ \xi^* = \frac{1}{n}\sum_{i=1}^n \xi_i. $$
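In Julia, this left-sided KL centroid is simply the component-wise average of the natural parameters, e.g. (the parameter values are the ones used later in Section 3.2; no theta evaluation is needed):

# Left-sided KL centroid of discrete normals = arithmetic mean of their natural parameters
xis = [([-0.2, -0.2], [0.1 0.0; 0.0 0.2]),
       ([ 0.2,  0.2], [0.15 0.0; 0.0 0.25])]
centroid_xi1 = sum(first.(xis)) ./ length(xis)
centroid_xi2 = sum(last.(xis))  ./ length(xis)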

3 Numerical approximations and estimations of divergences

Although conceptually very similar to the continuous normal distributions as maximum entropy distributions, the discrete normal distributions are mathematically very different to handle. On one hand, the normal distributions are exponential families with all parameter transformations and convex conjugates F_R(ρ) and F*_R(τ) available in closed form [36] (where τ = E_{q_ρ}[t(x)]). On the other hand, the discrete normal distributions with source parameters λ = (µ, Σ) can be converted from/back to the moment parameters, but the conversions between the natural parameters ξ and the expectation parameters η = E_{p_ξ}[t(x)] = ∇F(ξ) are not available in closed form, nor are the cumulant function F(ξ) = log θ(ξ) and its convex conjugate F*(η).

⁹ Information geometry is the field which considers differential-geometric structures of families of probability distributions. Historically, Hotelling [24] first introduced the Fisher-Rao manifold. The term "information geometry" occurred in a paper of Chentsov [11] in 1978.

3.1 Converting numerically natural to moment parameters and vice versa

In practice, we can approximate the conversion procedures ξ ↔ η as follows:

• Given the natural parameter ξ, we may approximate the dual moment parameter η = ∇F(ξ) = E_{p_ξ}[t(x)] as η̂ = (1/m) Σ_{i=1}^m t(x_i), where x_1, . . . , x_m are independently and identically sampled from N_{Z^d}(ξ). Sampling from discrete normal distributions can be done exactly in 1D [8] (requiring constant time on average) but requires sampling heuristics in dimension d > 1. Two common sampling heuristics for handling discrete normal distributions are:

  – H1: Draw a variate x ∼ q_{µ,Σ} from the corresponding normal distribution q_{µ,Σ}, and round it to the closest integer lattice point x̄ of Z^d with respect to the ℓ_1-norm (i.e., x̄ = arg min_{l∈Z^d} ‖l − x‖_1 = Σ_{i=1}^d |l_i − x_i|), where (l_1, . . . , l_d) and (x_1, . . . , x_d) denote the coordinates of l and x, respectively.

  – H2: Consider the integer lattice points E_ξ falling inside the ellipsoid region [16] used for approximating θ(ξ) by θ(ξ; E_ξ) (Figure 3), draw uniformly an integer lattice point l from E_ξ, and accept it with probability p_ξ(l) (acceptance-rejection sampling described in [10]).

• Given the moment parameter η, we may approximate ξ = ∇F*(η) by solving a gradient system (a minimal numerical sketch is given at the end of this subsection). Since the moment generating function (MGF) of an exponential family [6] is m_X(u) := E_X[exp(u^⊤X)] = exp(F(ξ + u) − F(ξ)), we deduce that the MGF of the discrete normal distribution X ∼ p_ξ is

$$ m_\xi(u) = \frac{\theta(\xi + u)}{\theta(\xi)}. $$

The non-central moments of the sufficient statistics (also called raw moments or geometric moments) of an exponential family can be retrieved from the partial derivatives of the MGF. For the discrete normal distributions, Agostini and Améndola [2] obtained the following gradient system:

$$ \eta_1 = E_{p_\xi}[t_1(x)] = \frac{1}{2\pi}\frac{1}{\theta(\xi)}\nabla_{\xi_1}\theta(\xi), \qquad \eta_2 = E_{p_\xi}[t_2(x)] = -\frac{1}{2\pi}\frac{1}{\theta(\xi)}\left(\nabla_{\xi_2}\theta(\xi) + \mathrm{diag}(\nabla_{\xi_2}\theta(\xi))\right). $$

In practice, this gradient system can be solved up to arbitrary machine precision using software packages (initialization can be done from the closed-form conversion of the moment parameter to the natural parameter for the continuous normal distribution). For example, one way to solve the gradient system is by using the technique described in [52], which we summarize as follows:


First, let us choose the following canonical parameterization of the densities of an exponential family:

$$ p_\psi(x) := \exp\left(-\sum_{i=0}^D \psi_i t_i(x)\right). $$

That is, ψ_0 = F(ψ) and ψ_i = −ξ_i for i ∈ {1, . . . , D} (i.e., the parameter ψ is an augmented natural parameter which includes the log-normalizer in its first coefficient).

Let K_i(ψ) := E_{p_ψ}[t_i(x)] = η_i denote the set of D + 1 non-linear equations for i ∈ {0, . . . , D}. The method of [52] converts iteratively p_η to p_ψ. We initialize ψ^{(0)} and calculate numerically ψ_0^{(0)} = F(ψ^{(0)}).

At iteration t with current estimate ψ^{(t)}, we use the following first-order Taylor approximation:

$$ K_i(\psi) \approx K_i(\psi^{(t)}) + (\psi - \psi^{(t)})^\top\nabla K_i(\psi^{(t)}). $$

Let H(ψ) denote the (D + 1) × (D + 1) matrix:

$$ H(\psi) := \left[\frac{\partial K_i(\psi)}{\partial\psi_j}\right]_{ij}. $$

We have

$$ H_{ij}(\psi) = H_{ji}(\psi) = -E_{p_\psi}[t_i(x)t_j(x)]. \qquad (16) $$

We update as follows:

$$ \psi^{(t+1)} = \psi^{(t)} + H^{-1}(\psi^{(t)})\begin{pmatrix}\eta_0 - K_0(\psi^{(t)})\\ \vdots\\ \eta_D - K_D(\psi^{(t)})\end{pmatrix}. \qquad (17) $$

When implementing this method, we need to approximate the H_{ij} of Eq. 16 using the theta ellipsoid points. For d-variate discrete normal distributions with D = d(d+3)/2, we have t_1(x) = x_1, . . . , t_d(x) = x_d, t_{d+1}(x) = −½x_1x_1, t_{d+2}(x) = −½x_1x_2, . . . , t_D(x) = −½x_dx_d.
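The following Julia sketch illustrates this Newton-type scheme in the univariate case (our own minimal implementation under a box truncation of Z and a crude initialization; in general a damped step or a continuous-normal initialization may be needed):

using LinearAlgebra

# 1D sufficient statistics t_0(x) = 1, t_1(x) = x, t_2(x) = -x^2/2 (cf. the redefined t_i above)
tstats(x) = [1.0, x, -0.5 * x^2]

# Augmented-parameter pmf p_psi(x) = exp(-psi' t(x)); psi_0 plays the role of the log-normalizer
ppsi(x, psi) = exp(-dot(psi, tstats(x)))

# Newton iterations of Eqs. (16)-(17), truncating the lattice Z to [-r, r]
function solve_gradient_system(eta; r=50, iters=20)
    xs = -r:r
    psi = [0.0, 0.0, -1.0]                      # rough initialization around a unit-variance shape
    for _ in 1:iters
        K = sum(tstats(x) .* ppsi(x, psi) for x in xs)                 # K_i(psi) = sum_x t_i(x) p_psi(x)
        H = -sum(tstats(x) * tstats(x)' .* ppsi(x, psi) for x in xs)   # H_ij = -sum_x t_i t_j p_psi
        psi += H \ (eta .- K)
    end
    return psi          # psi = (F, -xi~_1, -xi~_2) in this parameterization
end

# Target moments: eta_0 = 1 (normalization), eta_1 = mu, eta_2 = -(Sigma + mu^2)/2
mu, Sigma = 0.0, 1.0
psi = solve_gradient_system([1.0, mu, -0.5 * (Sigma + mu^2)])
xi = -psi[2:3] ./ (2pi)     # back to the paper's convention t(x) = (2*pi*x, -pi*x^2);
                            # for mu = 0, Sigma = 1 this is close to xi_std of Section 1.2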

3.2 Some illustrating numerical examples

To compute numerically the theta functions and their derivatives, we may use the following software packages (available in various programming languages): abelfunctions in SAGE [46], algcurves in Maple® [17], Theta in Python [9], Riemann of jTEM (Java Tools for Experimental Mathematics) in Java [23] (see also [16]), or Theta.jl in Julia [3].

For our experiments, we used Java™ (in-house implementation) and Julia (with the package Theta.jl [3]). We consider the following two discrete normal distributions p_ξ and p_{ξ'} with parameters

$$ \xi = \left((-0.2, -0.2),\ \mathrm{diag}(0.1, 0.2)\right), \qquad \xi' = \left((0.2, 0.2),\ \mathrm{diag}(0.15, 0.25)\right). $$

These bivariate discrete normal distributions are plotted in Figure 4.


Figure 4: Two bivariate discrete normal distributions used to calculate statistical divergences.

We implemented the statistical divergences between discrete normal distributions using an in-house Java™ software and the Julia Theta.jl [3] package (see Appendix A for a code snippet).

For the above discrete normal distributions, we calculated

$$ D_{\mathrm{Bhattacharyya}}[p_\xi, p_{\xi'}] = \frac{1}{2} D_{\frac{1}{2}}[p_\xi : p_{\xi'}] \simeq 1.626, $$

and approximated the KL divergence by the Rényi divergence for α_KL = 1 − 10^{-5} = 0.99999:

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] \simeq D_{\alpha_{\mathrm{KL}}}[p_\xi : p_{\xi'}] = \frac{1}{1-\alpha_{\mathrm{KL}}} J_{F,\alpha_{\mathrm{KL}}}(\xi : \xi') \simeq 7.84. $$

Implementing these formulas requires calculating F(ξ), i.e., evaluating the logarithm of theta functions. The following section describes another efficient method based on a projective divergence, i.e., a divergence which does not require the pmfs to be normalized.

3.3 Approximating the Kullback-Leibler divergence via projective γ-divergences

The γ-divergence [20, 12] between two pmfs p(x) and q(x) defined over the support X for a real γ > 1 is defined by:

$$ D_\gamma[p : q] := \frac{1}{\gamma(\gamma-1)}\log\left(\frac{\left(\sum_{x\in\mathcal{X}} p^\gamma(x)\right)\left(\sum_{x\in\mathcal{X}} q^\gamma(x)\right)^{\gamma-1}}{\left(\sum_{x\in\mathcal{X}} p(x)\, q^{\gamma-1}(x)\right)^\gamma}\right), \qquad (\gamma>1). $$

The γ-divergences are projective divergences, i.e., they satisfy the following identity:

$$ D_\gamma[p : p'] = D_\gamma[\lambda p : \lambda' p'], \qquad \forall\lambda, \lambda'>0. $$

Thus let us rewrite p(x) = p̃(x)/Z_p and q(x) = q̃(x)/Z_q, where p̃(x) and q̃(x) are computationally tractable unnormalized pmfs, and Z_p and Z_q their respective computationally intractable normalizers. Then we have

$$ D_\gamma[p : q] = D_\gamma[\tilde{p} : \tilde{q}]. $$


Let us define

$$ I_\gamma[p : q] := \sum_{x\in\mathcal{X}} p(x)\, q(x)^{\gamma-1}. $$

Then the γ-divergence can be written as:

$$ D_\gamma[p : q] = D_\gamma[\tilde{p} : \tilde{q}] = \frac{1}{\gamma(\gamma-1)}\log\left(\frac{I_\gamma[\tilde{p} : \tilde{p}]\; I_\gamma[\tilde{q} : \tilde{q}]^{\gamma-1}}{I_\gamma[\tilde{p} : \tilde{q}]^\gamma}\right). $$

Consider p = p_ξ and q = p_{ξ'} two pmfs belonging to the lattice Gaussian exponential family, and let

$$ I_\gamma(\xi : \xi') = I_\gamma[\tilde{p}_\xi : \tilde{p}_{\xi'}]. $$

Provided that ξ + (γ − 1)ξ' ∈ Ξ, we have, following the proof of Proposition 2, that

$$ I_\gamma(\xi : \xi') = \sum_{l\in\Lambda}\tilde{p}_\xi(l)\,\tilde{p}_{\xi'}(l)^{\gamma-1} = \sum_{l\in\Lambda}\exp\left(\langle\xi + (\gamma-1)\xi', t(l)\rangle\right) = \exp\left(F_\Lambda(\xi + (\gamma-1)\xi')\right)\underbrace{\sum_{l\in\Lambda} p_{\xi+(\gamma-1)\xi'}(l)}_{=1} = \exp\left(F_\Lambda(\xi + (\gamma-1)\xi')\right), $$

where F_Λ(ξ) = log θ_Λ(ξ) denotes the cumulant function of the Gaussian distributions on the lattice Λ. That is, we have

$$ I_\gamma(\xi : \xi') = \theta_\Lambda(\xi + (\gamma-1)\xi'), $$

and therefore we can express the γ-divergences as

$$ D_\gamma[p_\xi : p_{\xi'}] = \frac{1}{\gamma(\gamma-1)}\log\left(\frac{\theta_\Lambda(\gamma\xi)\,\theta_\Lambda(\gamma\xi')^{\gamma-1}}{\theta_\Lambda(\xi + (\gamma-1)\xi')^\gamma}\right). \qquad (18) $$

Notice that the exact values of the infinite summations I_γ(ξ : ξ') depend on the Riemann theta function.

Now, the γ-divergences tend asymptotically to the Kullback-Leibler divergence between normalized densities when γ → 1 [20, 12]: lim_{γ→1} D_γ[p̃ : q̃] = D_KL[p̃/Z_p : q̃/Z_q]. Let us notice that the KLD is not a projective divergence, and that for small enough γ > 1, we have ξ + (γ − 1)ξ' always falling inside the natural parameter space Ξ. Moreover, we can approximate the infinite summation using a finite region of integer lattice points R_{ξ,ξ'}:

$$ I_{\gamma,R_{\xi,\xi'}}(\xi : \xi') := \sum_{x\in R_{\xi,\xi'}}\tilde{p}_\xi(x)\,\tilde{p}_{\xi'}(x)^{\gamma-1}. $$

For example, we can use the theta ellipsoids [16] E_ξ and E_{ξ'} used to approximate θ(ξ) and θ(ξ'), respectively (Figure 3): we choose R_{ξ,ξ'} = (E_ξ ∪ E_{ξ'}) ∩ Z^d. In practice, this approximation of the I_γ summations scales well in high dimensions. Overall, we get our approximation of the KLD between two lattice Gaussian distributions summarized in the following proposition:


Kullback-Leibler divergence:
  definition: $D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = \sum_{l\in\Lambda} p_\xi(l)\log\frac{p_\xi(l)}{p_{\xi'}(l)}$
  closed form: $D_{\mathrm{KL}}[p_\xi : p_{\xi'}] = \log\left(\frac{\theta_\Lambda(\xi')}{\theta_\Lambda(\xi)}\right) - 2\pi\mu^\top(\xi'_1-\xi_1) + \pi\,\mathrm{tr}\left((\xi'_2-\xi_2)(\Sigma+\mu\mu^\top)\right)$

Squared Hellinger divergence:
  definition: $D^2_{\mathrm{Hellinger}}[p_\xi : p_{\xi'}] = \frac{1}{2}\sum_{l\in\Lambda}\left(\sqrt{p_\xi(l)}-\sqrt{p_{\xi'}(l)}\right)^2$
  closed form: $D^2_{\mathrm{Hellinger}}[p_\xi : p_{\xi'}] = 1 - \frac{\theta_\Lambda\left(\frac{\xi+\xi'}{2}\right)}{\sqrt{\theta_\Lambda(\xi)\theta_\Lambda(\xi')}}$

Rényi α-divergence (α > 0, α ≠ 1):
  definition: $D_\alpha[p_\xi : p_{\xi'}] = \frac{1}{\alpha-1}\log\left(\sum_{l\in\Lambda} p_\xi(l)^\alpha p_{\xi'}(l)^{1-\alpha}\right)$
  closed form: $D_\alpha[p_\xi : p_{\xi'}] = \frac{\alpha}{1-\alpha}\log\frac{\theta_\Lambda(\xi)}{\theta_\Lambda(\alpha\xi+(1-\alpha)\xi')} + \log\frac{\theta_\Lambda(\xi')}{\theta_\Lambda(\alpha\xi+(1-\alpha)\xi')}$, with $\lim_{\alpha\to 1} D_\alpha[p_\xi : p_{\xi'}] = D_{\mathrm{KL}}[p_\xi : p_{\xi'}]$

γ-divergence (γ > 1):
  definition: $D_\gamma[p_\xi : p_{\xi'}] = \frac{1}{\gamma(\gamma-1)}\log\frac{\left(\sum_{l\in\Lambda} p_\xi^\gamma(l)\right)\left(\sum_{l\in\Lambda} p_{\xi'}^\gamma(l)\right)^{\gamma-1}}{\left(\sum_{l\in\Lambda} p_\xi(l)\,p_{\xi'}^{\gamma-1}(l)\right)^\gamma}$
  closed form: $D_\gamma[p_\xi : p_{\xi'}] = \frac{1}{\gamma(\gamma-1)}\log\frac{\theta_\Lambda(\gamma\xi)\,\theta_\Lambda(\gamma\xi')^{\gamma-1}}{\theta_\Lambda(\xi+(\gamma-1)\xi')^\gamma}$, with $\lim_{\gamma\to 1} D_\gamma[p_\xi : p_{\xi'}] = D_{\mathrm{KL}}[p_\xi : p_{\xi'}]$

Hölder divergence (γ > 0, 1/α + 1/β = 1):
  definition: $D^{\mathrm{H}}_{\alpha,\gamma}[r : s] := \left|\log\frac{\sum_{x\in\mathcal{X}} r(x)^{\gamma/\alpha}s(x)^{\gamma/\beta}}{\left(\sum_{x\in\mathcal{X}} r(x)^\gamma\right)^{1/\alpha}\left(\sum_{x\in\mathcal{X}} s(x)^\gamma\right)^{1/\beta}}\right|$
  closed form: $D^{\mathrm{H}}_{\alpha,\gamma}[p_\xi : p_{\xi'}] = \left|\log\frac{\theta_\Lambda(\gamma\xi)^{1/\alpha}\,\theta_\Lambda(\gamma\xi')^{1/\beta}}{\theta_\Lambda\left(\frac{\gamma}{\alpha}\xi+\frac{\gamma}{\beta}\xi'\right)}\right|$

Cauchy-Schwarz divergence (Hölder with α = β = γ = 2):
  definition: $D_{\mathrm{CS}}[r : s] := -\log\frac{\sum_{x\in\mathcal{X}} r(x)s(x)}{\sqrt{\left(\sum_{x\in\mathcal{X}} r^2(x)\right)\left(\sum_{x\in\mathcal{X}} s^2(x)\right)}}$
  closed form: $D_{\mathrm{CS}}[p_\xi : p_{\xi'}] = \log\frac{\sqrt{\theta_\Lambda(2\xi)\,\theta_\Lambda(2\xi')}}{\theta_\Lambda(\xi+\xi')}$

Table 2: Summary of statistical divergences with corresponding formulas for lattice Gaussian distributions with partition function θ_Λ(ξ). Ordinary parameterization λ(ξ) = (µ = E_{p_ξ}[X], Σ = Cov_{p_ξ}[X]) for X ∼ N_Λ(ξ).

Proposition 9 The Kullback-Leibler divergence between two lattice Gaussian distributions p_ξ and p_{ξ'} can be efficiently approximated as

$$ D_{\mathrm{KL}}[p_\xi : p_{\xi'}] \approx D_\gamma[p_\xi : p_{\xi'}] = \frac{1}{\gamma(\gamma-1)}\log\left(\frac{I_{\gamma,R_\xi}(\xi : \xi)\; I_{\gamma,R_{\xi'}}(\xi' : \xi')^{\gamma-1}}{I_{\gamma,R_{\xi,\xi'}}(\xi : \xi')^\gamma}\right), \qquad (19) $$

for γ > 1 close to 1 (say, γ = 1 + 10^{-5}), where R_ξ and R_{ξ'} denote the integer lattice points falling inside the theta ellipsoids E_ξ and E_{ξ'} used to approximate the theta functions [16] θ_Λ(ξ) and θ_Λ(ξ'), respectively.

Table 2 summarizes the various closed-form formulas obtained for the statistical divergences between lattice Gaussian distributions considered in this paper.

Other statistical divergences like the projective Hölder divergences [42] between lattice Gaussian distributions can be obtained similarly in closed form:

$$ D^{\mathrm{H}}_{\alpha,\gamma}[r : s] := \left|\log\left(\frac{\sum_{x\in\mathcal{X}} r(x)^{\gamma/\alpha}\, s(x)^{\gamma/\beta}}{\left(\sum_{x\in\mathcal{X}} r(x)^\gamma\right)^{1/\alpha}\left(\sum_{x\in\mathcal{X}} s(x)^\gamma\right)^{1/\beta}}\right)\right|, \qquad \left(\gamma>0,\ \frac{1}{\alpha}+\frac{1}{\beta}=1\right). $$


The Hölder divergences include the Cauchy-Schwarz divergence [25] for γ = α = β = 2:

$$ D_{\mathrm{CS}}[r : s] := -\log\frac{\sum_{x\in\mathcal{X}} r(x)s(x)}{\sqrt{\left(\sum_{x\in\mathcal{X}} r^2(x)\right)\left(\sum_{x\in\mathcal{X}} s^2(x)\right)}}. $$

Since the natural parameter space Ξ is a cone [42], we get:

$$ D^{\mathrm{H}}_{\alpha,\gamma}[p_\xi : p_{\xi'}] = \left|\log\frac{\theta_\Lambda(\gamma\xi)^{\frac{1}{\alpha}}\,\theta_\Lambda(\gamma\xi')^{\frac{1}{\beta}}}{\theta_\Lambda\!\left(\frac{\gamma}{\alpha}\xi + \frac{\gamma}{\beta}\xi'\right)}\right|. $$

Thus we get the following closed form for the Cauchy-Schwarz divergence between two lattice Gaussian distributions:

$$ D_{\mathrm{CS}}[p_\xi : p_{\xi'}] = \log\frac{\sqrt{\theta_\Lambda(2\xi)\,\theta_\Lambda(2\xi')}}{\theta_\Lambda(\xi + \xi')}. $$
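For completeness, a minimal Julia sketch of the Cauchy-Schwarz divergence between two discrete normals under a box-truncated theta sum (our own helpers):

using LinearAlgebra

# Box-truncated theta for 2D parameters (stand-in for a theta-function package)
theta2d(a, B; r=20) = sum(exp(2pi * (-0.5 * dot([i, j], B * [i, j]) + dot([i, j], a)))
                          for i in -r:r, j in -r:r)

# Cauchy-Schwarz divergence between two discrete normals (Holder with alpha = beta = gamma = 2)
cauchy_schwarz(a, B, ap, Bp) = log(sqrt(theta2d(2 .* a, 2 .* B) * theta2d(2 .* ap, 2 .* Bp)) /
                                   theta2d(a .+ ap, B .+ Bp))

a  = [-0.2, -0.2]; B  = [0.1 0.0; 0.0 0.2]
ap = [ 0.2,  0.2]; Bp = [0.15 0.0; 0.0 0.25]
cauchy_schwarz(a, B, ap, Bp)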

References

[1] Divesh Aggarwal, Daniel Dadush, Oded Regev, and Noah Stephens-Davidowitz. Solving the shortest vector problem in 2^n time using discrete Gaussian sampling. In Proceedings of the forty-seventh annual ACM Symposium on Theory of Computing, pages 733-742, 2015.

[2] Daniele Agostini and Carlos Améndola. Discrete Gaussian distributions via theta functions. SIAM Journal on Applied Algebra and Geometry, 3(1):1-30, 2019.

[3] Daniele Agostini and Lynn Chua. Computing theta functions with Julia. Journal of Software for Algebra and Geometry, 11(1):41-51, 2021.

[4] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.

[5] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005.

[6] Ole Barndorff-Nielsen. Information and exponential families in statistical theory. John Wiley & Sons, 2014.

[7] Alessandro Budroni and Igor Semaev. New Public-Key Crypto-System EHT. arXiv preprint arXiv:2103.01147, 2021.

[8] Clément L. Canonne, Gautam Kamath, and Thomas Steinke. The discrete Gaussian for differential privacy. arXiv preprint arXiv:2004.00010, 2020.

[9] S. Carrazza and D. Krefl. Theta: A Python library for Riemann-Theta function based machine learning. https://doi.org/10.5281/zenodo.1120325.

[10] Stefano Carrazza and Daniel Krefl. Sampling the Riemann-Theta Boltzmann machine. Computer Physics Communications, 256:107464, 2020.

[11] N. N. Cencov. Algebraic foundation of mathematical statistics. Statistics: A Journal of Theoretical and Applied Statistics, 9(2):267-276, 1978.

[12] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532-1568, 2010.

[13] Thomas M. Cover. Elements of information theory. John Wiley & Sons, 1999.

[14] Jason V. Davis and Inderjit Dhillon. Differential entropic clustering of multivariate Gaussians. Advances in Neural Information Processing Systems, 19:337, 2007.

[15] Robin De Jong. Theta functions on the theta divisor. The Rocky Mountain Journal of Mathematics, pages 155-176, 2010.

[16] Bernard Deconinck, Matthias Heil, Alexander Bobenko, Mark Van Hoeij, and Marcus Schmies. Computing Riemann theta functions. Mathematics of Computation, 73(247):1417-1442, 2004.

[17] Bernard Deconinck and Matthew S. Patterson. Computing with plane algebraic curves and Riemann surfaces: the algorithms of the Maple package "algcurves". In Computational approach to Riemann surfaces, pages 67-123. Springer, 2011.

[18] Bernard Deconinck and Mark Van Hoeij. Computing Riemann matrices of algebraic curves. Physica D: Nonlinear Phenomena, 152:28-46, 2001.

[19] Jörg Frauendiener, Carine Jaber, and Christian Klein. Efficient computation of multidimensional theta functions. Journal of Geometry and Physics, 141:147-158, 2019.

[20] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053-2081, 2008.

[21] Vincent Garcia and Frank Nielsen. Simplification and hierarchical representations of mixtures of exponential families. Signal Processing, 90(12):3197-3212, 2010.

[22] Craig Gentry, Chris Peikert, and Vinod Vaikuntanathan. Trapdoors for hard lattices and new cryptographic constructions. In Proceedings of the fortieth annual ACM Symposium on Theory of Computing, pages 197-206, 2008.

[23] Tim Hoffmann and Markus Schmies. jReality, jTEM, and Oorange: a way to do math with computers. In International Congress on Mathematical Software, pages 74-85. Springer, 2006.

[24] Harold Hotelling. Spaces of statistical parameters. Bulletin of the American Mathematical Society, 36:191, 1930. (First mention of hyperbolic geometry for the Fisher-Rao metric of location-scale families.)

[25] Robert Jenssen, Jose C. Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):614-629, 2006.

[26] Angshuman Karmakar, Sujoy Sinha Roy, Oscar Reparaz, Frederik Vercauteren, and Ingrid Verbauwhede. Constant-time discrete Gaussian sampling. IEEE Transactions on Computers, 67(11):1561-1571, 2018.

[27] Adrienne W. Kemp. Characterizations of a discrete normal distribution. Journal of Statistical Planning and Inference, 63(2):223-229, 1997.

[28] Cong Ling and Jean-Claude Belfiore. Achieving AWGN channel capacity with lattice Gaussian coding. IEEE Transactions on Information Theory, 60(10):5918-5929, 2014.

[29] J. H. C. Lisman and M. C. A. Van Zuylen. Note on the generation of most probable frequency distributions. Statistica Neerlandica, 26(1):19-23, 1972.

[30] J. Navarro and J. M. Ruiz. A note on the discrete normal distribution. Advances and Applications in Statistics, 5(2):229-245, 2005.

[31] Eric Nichols and Christopher Raphael. Automatic transcription of music audio through continuous parameter tracking. In International Society for Music Information Retrieval (ISMIR), pages 387-392, 2007.

[32] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269-272, 2013.

[33] Frank Nielsen. On the Jensen-Shannon symmetrization of distances relying on abstract means. Entropy, 21(5):485, 2019.

[34] Frank Nielsen. An elementary introduction to information geometry. Entropy, 22(10):1100, 2020.

[35] Frank Nielsen. The Siegel-Klein Disk: Hilbert Geometry of the Siegel Disk Domain. Entropy, 22(9):1019, 2020.

[36] Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy, 23(4):464, 2021.

[37] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.

[38] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.

[39] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In 2010 IEEE International Conference on Image Processing, pages 3621-3624. IEEE, 2010.

[40] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3):032003, 2011.

[41] Frank Nielsen and Richard Nock. On Rényi and Tsallis entropies and divergences for exponential families. arXiv preprint arXiv:1105.3259, 2011.

[42] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences. Entropy, 19(3):122, 2017.

[43] Frank W. J. Olver, Daniel W. Lozier, Ronald F. Boisvert, and Charles W. Clark. NIST handbook of mathematical functions. Cambridge University Press, 2010.

[44] Dilip Roy. The discrete normal distribution. Communications in Statistics - Theory and Methods, 32(10):1871-1883, 2003.

[45] Carl Ludwig Siegel. Symplectic geometry. Elsevier, 2014.

[46] C. Swierczewski. Abelfunctions: A library for computing with abelian functions, Riemann surfaces, and algebraic curves, 2017. http://github.com/abelfunctions/abelfunctions.

[47] Christopher Swierczewski and Bernard Deconinck. Computing Riemann theta functions in Sage with applications. Mathematics and Computers in Simulation, 127:263-272, 2016.

[48] Paweł J. Szabłowski. Discrete normal distribution and its relationship with Jacobi theta functions. Statistics & Probability Letters, 52(3):289-299, 2001.

[49] Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.

[50] Lun Wang, Ruoxi Jia, and Dawn Song. D2P-Fed: Differentially private federated learning with efficient communication. arXiv preprint arXiv:2006.13039, 2020.

[51] Saul Youssef. Quantum mechanics as Bayesian complex probability theory. Modern Physics Letters A, 9(28):2571-2586, 1994.

[52] Arnold Zellner and Richard A. Highfield. Calculation of maximum entropy distributions and approximation of marginal posterior distributions. Journal of Econometrics, 37(2):195-209, 1988.

A Code snippet in Julia

The Julia language can be freely downloaded from https://julialang.org/. The ellipsoids used to approximate the theta function θ_R are stored in the RiemannMatrix structure of the Theta.jl Julia package.

Executing the code below gives the following result:

julia> BhattacharyyaDistance(v1,M1,v2,M2)
1.6259948590224578

julia> KLDivergence(v1,M1,v2,M2)
7.841371347366552

# in Julia 1.4.2
using Theta

M1 = [0.1 0; 0 0.2];
v1 = [-0.2; -0.2];
M2 = [0.15 0; 0 0.25];
v2 = [0.2; 0.2];

# cumulant function of the discrete normal family
function F(v, M)
    R = RiemannMatrix(im*M);
    log(real(theta(-im*v, R)))
end

# Renyi divergence between two discrete normal distributions
function RenyiDivergence(alpha, v1, M1, v2, M2)
    M12 = alpha*M1 + (1-alpha)*M2;
    v12 = alpha*v1 + (1-alpha)*v2;
    (1/(1-alpha)) * (alpha*F(v1, M1) + (1-alpha)*F(v2, M2) - F(v12, M12))
end

function BhattacharyyaDistance(v1, M1, v2, M2)
    (1/2) * RenyiDivergence(1/2, v1, M1, v2, M2)
end

function KLDivergence(v1, M1, v2, M2)
    alpha = 0.9999999999;
    RenyiDivergence(alpha, v1, M1, v2, M2)
end

BhattacharyyaDistance(v1, M1, v2, M2)
KLDivergence(v1, M1, v2, M2)