

Hierarchical Representations with Poincaré Variational Auto-Encoders

Emile Mathieu† [email protected]
Charline Le Lan† [email protected]
Chris J. Maddison†∗ [email protected]
Ryota Tomioka‡ [email protected]
Yee Whye Teh†∗ [email protected]

† Department of Statistics, University of Oxford, United Kingdom
∗ DeepMind, London, United Kingdom
‡ Microsoft Research, Cambridge, United Kingdom

    Abstract

The variational auto-encoder (VAE) model has become widely popular as a way to learn at once a generative model and embeddings for observations living in a high-dimensional space X ⊂ R^n. In the real world, many such observations may be assumed to be hierarchically structured, for instance data on living organisms, which are related through the evolutionary tree. It has also been shown, theoretically and empirically, that data with hierarchical structure can efficiently be embedded in hyperbolic spaces. We therefore endow VAEs with a hyperbolic geometry and empirically show that they can better generalise to unseen data than their Euclidean counterpart, and can qualitatively recover the hierarchical structure.

    1 Introduction

Learning from unlabelled raw sensory observations, which are often high-dimensional, is a problem of significant importance in machine learning. Variational auto-encoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014) are probabilistic generative models composed of an encoder stochastically embedding observations in a low dimensional latent space Z, and a decoder generating observations x ∈ X from encodings z ∈ Z. After training, the encodings constitute a low-dimensional representation of the original raw observations, which can be used as features for a downstream task (e.g. Huang and LeCun, 2006; Coates et al., 2011) or be interpretable in their own right. Such a representation learning (Bengio et al., 2013) task is of interest for VAEs since a stochastic embedding is learned to faithfully reconstruct observations. Hence, one may as well want such an embedding to be a good representation, i.e. an interpretable representation yielding better generalisation or useful as a pre-processing step. There is empirical evidence showing that neural networks yield progressively more abstract representations of objects as the information is pushed deeper into the layers (Bengio et al., 2013). For example, feeding a dog's image to an encoder neural network yields an abstract vectorial representation of that dog. Since human beings organise objects hierarchically, this motivates the use of hierarchical latent spaces to embed data.

Many domains of natural and human life exhibit patterns of hierarchical organisation and structure. For example, the theory of evolution (Darwin, 1859) implies that features of living organisms are related in a hierarchical manner given by the evolutionary tree.


Also, many grammatical theories from linguistics, such as phrase-structure grammar, involve hierarchy. What is more, almost every system of organisation applied to the world is arranged hierarchically (Kulish, 2002), e.g. governments, families, human needs (Maslow, 1943), science, etc. Hence, explicitly incorporating hierarchical structure in probabilistic models has been a long-standing research topic (e.g. Heller and Ghahramani, 2005). It has been shown that certain classes of symbolic data can be organised according to a latent hierarchy using hyperbolic spaces (Nickel and Kiela, 2017). Trees are intrinsically related to hyperbolic geometry, since they can be embedded with arbitrarily low distortion into the Poincaré disc (Sarkar, 2012). The exponential growth of the Poincaré surface area with respect to its radius is also related to the exponential growth of the number of leaves in a tree with respect to its depth.

Taken together, this suggests that a VAE whose latent space is endowed with a hyperbolic geometry will be more capable of representing and discovering hierarchical features. This is what we consider in this work. Two recent concurrent works (Ovinnikov, 2018; Grattarola et al., 2018) considered a somewhat similar problem and endowed an auto-encoder (AE)-based model with a hyperbolic latent space. We discuss those works in more detail in Section 5. Our goal is twofold: (a) learn a latent representation that is interpretable in terms of a hierarchical relationship between the observations, and (b) yield a more "compact" representation than the Euclidean counterpart for low dimensional latent spaces – therefore yielding better reconstruction – for observations having such an underlying hierarchical structure.

The remainder of this paper is organised as follows. In Section 2, we briefly review Riemannian geometry and the Poincaré ball model of hyperbolic geometry. In Section 3, we introduce the Poincaré VAE model, and discuss its design and how it can be trained. In Section 4, we empirically assess the performance of our model on a synthetic dataset generated from a branching diffusion process, showing that it outperforms its Euclidean counterpart. Finally, in Sections 5 and 6, we discuss related work and future work.

    2 The Poincaré Ball model of hyperbolic geometry

    2.1 Brief Riemannian geometry review

A real, smooth manifold M is a set of points x which is locally similar to a linear space. At each point x of the manifold M, a real vector space of the same dimensionality as M is defined, called the tangent space at x: T_xM. Intuitively, it contains all the possible directions in which one can tangentially pass through x. For each point x of the manifold, the metric tensor g(x) defines an inner product on the associated tangent space: ⟨·, ·⟩_x : T_xM × T_xM → R. A Riemannian manifold is then defined as a tuple (M, g) (Petersen, 2006).

The metric tensor gives local notions of angle, length of curves, surface area and volume, from which global quantities can be derived by integrating local contributions. A norm is induced on T_xM by the inner product: ‖·‖_x = √⟨·, ·⟩_x. The matrix representation G(x) of the Riemannian metric is defined such that for all u, v ∈ T_xM, g(x)(u, v) = ⟨u, v⟩_x = uᵀG(x)v. An infinitesimal volume element is thereby induced on each tangent space T_xM, and thus a measure on the manifold, dM(x) = √|G(x)| dx.

The length of a curve γ : t ↦ γ(t) ∈ M is given by

L(γ) = ∫_0^1 ‖γ′(t)‖_{γ(t)} dt = ∫_0^1 ( ⟨γ′(t), γ′(t)⟩_{γ(t)} )^{1/2} dt.

The concept of straight lines can then be generalised to geodesics, which are constant speed curves giving the shortest path between pairs of points x, y of the manifold: γ = arg min L(γ) with γ(0) = x, γ(1) = y and ‖γ′(t)‖_{γ(t)} = 1. A global distance is thus induced on M, given by d_M(x, y) = inf L(γ). Endowing M with that distance consequently defines a metric space (M, d_M). The concept of moving along a "straight" curve with constant velocity is captured by the exponential map. In particular, there is a unique unit speed geodesic γ satisfying γ(0) = x with initial tangent vector γ′(0) = v. The corresponding exponential map is then defined by exp_x(v) = γ(1), as illustrated in Figure 1 (Left). The logarithm map is its inverse, log_x = exp_x^{−1} : M → T_xM. For geodesically complete manifolds, such as the Poincaré ball, exp_x is well-defined on the full tangent space T_xM. For an offset b ∈ M and a normal vector a ∈ T_bM, the orthogonal projection P_{a,b}(x) is the exponential map of minimal length from x to the set of geodesics with origin b ∈ M and directions {v ∈ T_bM | ⟨a, v⟩_b = 0}.

    2.2 The hyperbolic space

Figure 1: (Left) Geodesics and exponential maps, illustrating exp_0(v_1) and exp_{x_2}(v_2). (Right) An embedded tree.

From a historical perspective, hyperbolic geometry arose in the first half of the nineteenth century in the midst of attempts to understand Euclid's axiomatic basis for geometry (Coxeter, 1942). It is a type of non-Euclidean geometry that discards Euclid's fifth postulate (the parallel postulate) and replaces it with the following one: given a line and a point not on it, there is more than one line going through the given point that is parallel to the given line.

Moreover, in every dimension d ≥ 2 there exists, up to isometries, a unique complete, simply connected Riemannian manifold of constant sectional curvature κ ∈ R. The three cases κ > 0, κ = 0 and κ < 0 correspond respectively to the hypersphere S^d_c, the Euclidean space R^d, and the hyperbolic space H^d_c. In contrast with S^d_c and R^d, the hyperbolic space H^d_c can be constructed using various isometric models (none of which is prevalent), among them the hyperboloid model, the Beltrami-Klein model, the Poincaré half-plane model and the Poincaré ball, which we rely on.

Intuitively, hyperbolic spaces can be thought of as continuous versions of trees – or, vice versa, trees can be thought of as "discrete hyperbolic spaces" – as illustrated in Figure 1 (Right). Indeed, trees can be embedded with arbitrarily low distortion¹ into the Poincaré disc (Sarkar, 2012). In contrast, Bourgain's theorem (Linial et al., 1994) shows that Euclidean space is unable to achieve comparably low distortion for trees, even with an unbounded number of dimensions. This type of structure is naturally suited to hyperbolic spaces, since their volume and surface area grow exponentially with the radius.

    2.3 The Poincaré ball

The Poincaré ball is a model of H^d_c, with curvature −c. Its underlying smooth manifold is the open ball of radius 1/√c in R^d, B^d_c = {x ∈ R^d | c‖x‖²_E < 1}, where ‖·‖_E denotes the Euclidean norm. Its metric tensor is a conformal² one, given by

g^c_p(x) = ( 2/(1 − c‖x‖²) )² g_e = (λ^c_x)² g_e,  (1)

¹A global metric for graph embeddings, measuring the relative difference between pairwise distances in the embedding space and in the original space; the best possible distortion is 0.
²I.e. proportional to the Euclidean metric tensor g_e.

where λ^c_x = 2/(1 − c‖x‖²), x ∈ B^d_c, and g_e denotes the Euclidean metric tensor, i.e. the usual dot product: for all u, v ∈ R^d, g_e(u, v) = ⟨u, v⟩. The Poincaré ball model of hyperbolic space then corresponds to the Riemannian manifold B^d_c = (B^d_c, g^c_p). The induced distance function between x, y ∈ B^d_c is given by

d^c_p(x, y) = (1/√c) cosh⁻¹( 1 + 2 c‖x − y‖² / ((1 − c‖x‖²)(1 − c‖y‖²)) ).  (2)

The framework of gyrovector spaces provides a non-associative algebraic formalism for hyperbolic geometry, analogous to the vector space structure of Euclidean geometry (Ungar, 2008). The Möbius addition of x and y in B^d_c is defined as

x ⊕_c y = ( (1 + 2c⟨x, y⟩ + c‖y‖²) x + (1 − c‖x‖²) y ) / ( 1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖² ).  (3)

One recovers the Euclidean addition of two vectors in R^d as c → 0. Building on this framework, Ganea et al. (2018) derived closed-form formulations for the exponential map,

exp^c_x(v) = x ⊕_c ( tanh( √c λ^c_x ‖v‖/2 ) v/(√c‖v‖) ),   exp^c_0(v) = tanh(√c‖v‖) v/(√c‖v‖),  (4)

    and its inverse, the logarithm map

log^c_x(y) = ( 2/(√c λ^c_x) ) tanh⁻¹( √c ‖−x ⊕_c y‖ ) (−x ⊕_c y)/‖−x ⊕_c y‖,   log^c_0(y) = tanh⁻¹(√c‖y‖) y/(√c‖y‖).  (5)
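To make these operations concrete, the following is a minimal NumPy sketch of Eqs. (2)–(5); the function names (mobius_add, dist, exp_map, log_map) are ours and not part of any released code, and this is only an illustration of the formulas above.

```python
import numpy as np

def mobius_add(x, y, c):
    """Möbius addition x ⊕_c y, Eq. (3)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def dist(x, y, c):
    """Induced distance d_p^c(x, y), Eq. (2)."""
    ratio = np.dot(x - y, x - y) / ((1 - c * np.dot(x, x)) * (1 - c * np.dot(y, y)))
    return np.arccosh(1 + 2 * c * ratio) / np.sqrt(c)

def lam(x, c):
    """Conformal factor λ_x^c = 2 / (1 - c ||x||^2)."""
    return 2.0 / (1 - c * np.dot(x, x))

def exp_map(x, v, c):
    """Exponential map exp_x^c(v), Eq. (4)."""
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return x
    return mobius_add(x, np.tanh(np.sqrt(c) * lam(x, c) * nv / 2) * v / (np.sqrt(c) * nv), c)

def log_map(x, y, c):
    """Logarithm map log_x^c(y), Eq. (5)."""
    u = mobius_add(-x, y, c)
    nu = np.linalg.norm(u)
    return (2 / (np.sqrt(c) * lam(x, c))) * np.arctanh(np.sqrt(c) * nu) * u / nu

# Sanity check: the logarithm map inverts the exponential map.
c, x, v = 1.0, np.array([0.1, -0.2]), np.array([0.3, 0.05])
assert np.allclose(log_map(x, exp_map(x, v, c), c), v)
```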

    3 The Poincaré VAE: Pc-VAE

We consider the problem of mapping an empirical distribution over observations to a lower dimensional Poincaré ball B^d_c, as well as learning a map from this latent space Z = B^d_c to the observation space X. Building on the VAE framework, the Pc-VAE model only differs by the choice of prior distribution p(·) and posterior distribution qφ(·|x), which are defined on B^d_c, and by the encoder gφ and decoder fθ maps, which take into account the geometry of the latent space. Their parameters {θ, φ} are learned by maximising the evidence lower bound (ELBO). Hyperbolic geometry is in a sense an extrapolation of Euclidean geometry; hence our model can also be seen as a generalisation of the usual Euclidean VAE, which we denote by N-VAE, i.e. Pc-VAE → N-VAE as c → 0.

    3.1 Prior and variational posterior distributions

In order to parametrise distributions on the Poincaré ball, we consider a maximum entropy generalisation of normal distributions (Said et al., 2014; Pennec, 2006), i.e. a hyperbolic normal distribution, whose density is given by

dν(x|µ, σ²)/dM(x) = N_{B^d_c}(x|µ, σ²) = (1/Z(σ)) exp( −d^c_p(µ, x)²/(2σ²) ),  (6)

with σ > 0 a dispersion parameter (not equal to the standard deviation), µ ∈ M the Fréchet expectation³, and Z(σ) the normalising constant derived in Appendix B.6, whose radial factor is given by

Z_r(σ) = √(π/2) σ (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} [ 1 + erf( (d−1−2k)√c σ/√2 ) ].  (7)
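As a concrete check of this constant, here is a small Python sketch of Z_r(σ) under our reading of Eq. (7); the function name is ours, and the result is compared against the closed-form special case for c = 1, d = 2 recalled in Appendix B.6.

```python
from math import comb, erf, exp, pi, sqrt

def Z_r(sigma, d, c):
    """Radial normalising constant Z_r(σ) of Eq. (7)."""
    total = sum((-1) ** k * comb(d - 1, k)
                * exp((d - 1 - 2 * k) ** 2 * c * sigma**2 / 2)
                * (1 + erf((d - 1 - 2 * k) * sqrt(c) * sigma / sqrt(2)))
                for k in range(d))
    return sqrt(pi / 2) * sigma * total / (2 * sqrt(c)) ** (d - 1)

# Special case c = 1, d = 2 (Said et al., 2014; cf. Appendix B.6, Eq. (33)).
sigma = 1.3
closed_form = sqrt(pi / 2) * sigma * exp(sigma**2 / 2) * erf(sigma / sqrt(2))
assert abs(Z_r(sigma, d=2, c=1.0) - closed_form) < 1e-10
```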

We discuss such distributions in Appendices B.1 and B.2. The prior distribution is chosen to be a hyperbolic normal distribution with mean zero, p(z) = N_{B^d_c}(·|0, σ_0²), and the variational family to be Q = {N_{B^d_c}(·|µ, σ²) | µ ∈ B^d_c, σ > 0}. Figure 2 shows samples from such hyperbolic normal distributions.

³Generalisation of the expectation to manifolds, defined as a minimiser of σ²(µ) = ∫_M d_M(µ, x)² p(x) dM(x).

Figure 2: Samples from hyperbolic normal distributions with different Fréchet expectations µ = (0, 0), (0.4, 0), (0.8, 0) (black crosses) and dispersions σ = 1.0 and σ = 1.7.

    3.2 Encoder and decoder architecture

We make use of two neural networks, a decoder fθ and an encoder gφ, to respectively parametrise the likelihood p(·|fθ(z)) : Z → P(X) and the variational posterior q(·|gφ(x)) : X → P(Z) ∈ Q. Such functions fθ and gφ are often designed as compositions of decision units φ ∘ f_{a,b}, with f_{a,b} an affine transformation and φ a non-linearity.

Figure 3: Illustration of the orthogonal projection of a point x onto a hyperplane H_{a,b} in the Poincaré disc B²_c (Left) and in the Euclidean plane (Right). These hyperplanes are decision boundaries.

Decoder  Affine transformations f_{a,b} : R^d → R – given by f_{a,b} : x ↦ ⟨a, x − b⟩_E – can be rewritten as being proportional to the distance from x to its orthogonal projection P_{a,b}(x) onto the hyperplane H_{a,b}, which acts as a decision boundary:

f_{a,b} : x ↦ sign(⟨a, x − b⟩_E) ‖a‖_E d_E(x, H_{a,b})  with  H_{a,b} = {x ∈ R^d | ⟨a, x − b⟩_E = 0} = b + {a}^⊥.

As highlighted in Ganea et al. (2018), one can analogously define an operator f_{a,b} : B^d_c → R,

f_{a,b} : x ↦ sign(⟨a, log^c_b(x)⟩_b) ‖a‖_b d^c_p(x, H_{a,b})  (8)

with

H_{a,b} = {x ∈ B^d_c | ⟨a, log^c_b(x)⟩_b = 0} = exp^c_b({a}^⊥).  (9)

A closed-form expression for d^c_p(x, H_{a,b}) has also been derived by Ganea et al. (2018):

d^c_p(x, H_{a,b}) = (1/√c) sinh⁻¹( 2√c |⟨−b ⊕_c x, a⟩| / ((1 − c‖−b ⊕_c x‖²) ‖a‖) ).  (10)

Such hyperplanes – also called gyroplanes – can be interpreted as decision boundaries; they are semi-hyperspheres orthogonal to the Poincaré ball's boundary, as shown in Figure 3, in contrast with the flat hyperplanes of Euclidean space.

We define the decoder's first layer as the concatenation of such operators, (f_{a_1,b_1}(z), . . . , f_{a_k,b_k}(z)), which we call gyroplane units. This output is then fed to a generic neural network.
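For illustration, a NumPy sketch of one such gyroplane unit follows, combining Eqs. (8) and (10); the helper names are ours and this is not the authors' implementation.

```python
import numpy as np

def mobius_add(x, y, c):  # Möbius addition, Eq. (3)
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y) / (1 + 2 * c * xy + c**2 * x2 * y2)

def gyroplane_unit(x, a, b, c):
    """f_{a,b}(x) of Eq. (8), using the closed-form distance to H_{a,b} of Eq. (10)."""
    z = mobius_add(-b, x, c)                              # -b ⊕_c x
    inner = np.dot(z, a)                                  # same sign as <a, log_b^c(x)>_b
    dist_to_H = np.arcsinh(2 * np.sqrt(c) * abs(inner)
                           / ((1 - c * np.dot(z, z)) * np.linalg.norm(a))) / np.sqrt(c)
    lam_b = 2.0 / (1 - c * np.dot(b, b))                  # conformal factor λ_b^c
    return np.sign(inner) * lam_b * np.linalg.norm(a) * dist_to_H   # ||a||_b = λ_b^c ||a||

def gyroplane_layer(z, A, B, c):
    """Decoder first layer: the concatenation (f_{a_1,b_1}(z), ..., f_{a_k,b_k}(z))."""
    return np.array([gyroplane_unit(z, a, b, c) for a, b in zip(A, B)])
```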

Encoder  The encoder outputs gφ(x) must lie within the support of the posterior q's parameters. For the choice qφ(·|x) = N_{B^d_c}(·|µ, σ²), Im(gφ) ⊆ B^d_c × R*_+ is required. Constraining σ > 0 can classically be done via a log-variance parametrisation. The easiest way to ensure that µ ∈ B^d_c is to parametrise it as the image under the exponential map at the origin of a velocity v ∈ T_0B^d_c = R^d. Hence,

v, σ = gφ(x),  µ = exp^c_0(v).  (11)

In the flat curvature limit c → 0, the usual parametrisation µ = 0 + v = v ∈ R^d is recovered. The Pc-VAE's encoder is therefore a generic encoder whose mean output is mapped to the manifold via the exponential map exp^c_0.
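A minimal sketch of this parametrisation, assuming a generic, hypothetical encoder network returning a velocity v and a log-variance, could look as follows.

```python
import numpy as np

def exp_map_at_origin(v, c):
    """exp_0^c(v) = tanh(√c ||v||) v / (√c ||v||), Eq. (4)."""
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return v
    return np.tanh(np.sqrt(c) * nv) * v / (np.sqrt(c) * nv)

def encode(x, encoder_net, c):
    """Posterior parameters (µ, σ) of Eq. (11); encoder_net is a hypothetical callable."""
    v, log_var = encoder_net(x)            # Euclidean outputs of a generic network
    mu = exp_map_at_origin(v, c)           # µ = exp_0^c(v) lies in the Poincaré ball
    sigma = np.exp(0.5 * log_var)          # log-variance parametrisation keeps σ > 0
    return mu, sigma
```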

    3.3 Training

Objective  The evidence lower bound (ELBO) can readily be extended to Riemannian latent spaces by applying Jensen's inequality w.r.t. dM (cf. Appendix A), yielding

L_M(x; θ, φ) ≜ ∫_M [ln p_θ(x|z) + ln p(z) − ln q_φ(z|x)] q_φ(z|x) dM(z) ≤ ln p(x)  (12)
            ≈ (1/K) Σ_{k=1}^K [ln p_θ(x|z_k) + ln p(z_k) − ln q_φ(z_k|x)],  z_k ∼ q_φ(·|x).  (13)

Hence L_M(x; θ, φ) can easily be estimated with Monte Carlo (MC) samples.
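For instance, a single-observation Monte Carlo estimator of Eq. (13) can be sketched as below; every callable argument (the encoder, decoder, log-density functions and posterior sampler) is a placeholder for the corresponding model component, not an actual API.

```python
import numpy as np

def elbo_mc(x, K, encode, decode, log_lik, log_prior, log_post, sample_post):
    """Monte Carlo estimate of L_M(x; θ, φ) with K samples, Eq. (13)."""
    mu, sigma = encode(x)                             # parameters of q_φ(·|x)
    terms = []
    for _ in range(K):
        z = sample_post(mu, sigma)                    # z_k ~ q_φ(·|x)
        terms.append(log_lik(x, decode(z))            # ln p_θ(x | z_k)
                     + log_prior(z)                   # + ln p(z_k)
                     - log_post(z, mu, sigma))        # - ln q_φ(z_k | x)
    return np.mean(terms)
```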

Reparameterisation  In order to obtain samples x ∼ N_{B^d_c}(·|µ, σ²) and to compute gradients with respect to the parameters {µ, σ}, we rely on the reparameterisation given by hyperbolic polar coordinates,

x = exp^c_µ( r α / λ^c_µ ),  (14)

with the direction α uniformly distributed on the (d−1)-hypersphere, α ∼ U(S^{d−1}), and the hyperbolic radius r = d^c_p(µ, x) having density

ρ(r) = (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} (sinh(√c r)/√c)^{d−1},  (15)

as shown in Appendices B.3 and B.4.

Gradients  Gradients ∇_µ x can then easily be computed thanks to the reparameterisation given by Eq. (14). The same holds for gradients ∇_σ x, by additionally relying on an implicit reparameterisation (Figurnov et al., 2018) via the cdf F(r|σ), as shown in Appendix B.7. We consequently rely on this reparameterisation to obtain samples from N_{B^d_c}(·|µ, σ²) through the acceptance-rejection sampling scheme described in Algorithm 1 and Appendix B.7.


Algorithm 1 Sampling z ∼ N_{B^d_c}(·|µ, σ²)
Require: µ, σ², dimension d, curvature c
 1: g : r ↦ (1_{r>0}/Z_g(σ)) e^{−(r−(d−1)√c σ²)²/(2σ²)}              ▷ Truncated Normal((d−1)√c σ², σ²) pdf
 2: ρ : r ↦ (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} (sinh(√c r)/√c)^{d−1}   ▷ Hyperbolic radius pdf
 3: M ← (Z_g(σ)/Z_r(σ)) (1/(2√c)^{d−1}) e^{(d−1)² c σ²/2}            ▷ Acceptance-rejection constant
 4: while sample r not accepted do
 5:   Propose r ∼ g(·)
 6:   Sample u ∼ U([0, 1])                                           ▷ Uniform distribution on the segment [0, 1]
 7:   if u < ρ(r)/(M g(r)) then                                      ▷ Accept-reject step
 8:     Accept sample r
 9:   else
10:     Reject sample r
11:   end if
12: end while
13: Sample direction α ∼ U(S^{d−1})                                  ▷ Uniform distribution on the hypersphere
14: z ← exp^c_µ( r α / λ^c_µ )                                       ▷ Run geodesic from µ with direction α and velocity r
15: Return z
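The following is a NumPy sketch of Algorithm 1 with the truncated normal proposal; the function names are ours, exp_map and mobius_add re-implement Eqs. (3)–(4), and Z_r and Z_g are written out as in Eqs. (7) and (40) even though they cancel in the acceptance ratio.

```python
import numpy as np
from math import comb, erf, exp, pi, sqrt

def mobius_add(x, y, c):                                  # Eq. (3)
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y) / (1 + 2 * c * xy + c**2 * x2 * y2)

def exp_map(x, v, c):                                     # Eq. (4)
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return x
    lam_x = 2.0 / (1 - c * np.dot(x, x))
    return mobius_add(x, np.tanh(sqrt(c) * lam_x * nv / 2) * v / (sqrt(c) * nv), c)

def Z_r(sigma, d, c):                                     # Eq. (7)
    s = sum((-1) ** k * comb(d - 1, k) * exp((d - 1 - 2 * k) ** 2 * c * sigma**2 / 2)
            * (1 + erf((d - 1 - 2 * k) * sqrt(c) * sigma / sqrt(2))) for k in range(d))
    return sqrt(pi / 2) * sigma * s / (2 * sqrt(c)) ** (d - 1)

def sample_hyperbolic_normal(mu, sigma, c, rng):
    """Draw z ~ N_{B_c^d}(·|µ, σ²) following Algorithm 1 (truncated normal proposal)."""
    d = mu.shape[0]
    m = (d - 1) * sqrt(c) * sigma**2                      # proposal mean (d-1)√c σ²
    Zg = sqrt(pi / 2) * sigma * (1 + erf(m / (sqrt(2) * sigma)))
    M = Zg / Z_r(sigma, d, c) / (2 * sqrt(c)) ** (d - 1) * exp((d - 1) ** 2 * c * sigma**2 / 2)

    def rho(r):                                           # hyperbolic radius pdf, Eq. (15)
        return exp(-r**2 / (2 * sigma**2)) * (np.sinh(sqrt(c) * r) / sqrt(c)) ** (d - 1) / Z_r(sigma, d, c)

    def g(r):                                             # truncated normal proposal pdf
        return exp(-(r - m) ** 2 / (2 * sigma**2)) / Zg

    while True:                                           # accept-reject loop on the radius
        r = m + sigma * rng.standard_normal()
        if r <= 0:
            continue                                      # proposal is truncated to r > 0
        if rng.uniform() < rho(r) / (M * g(r)):
            break
    alpha = rng.standard_normal(d)
    alpha /= np.linalg.norm(alpha)                        # direction α ~ U(S^{d-1})
    lam_mu = 2.0 / (1 - c * np.dot(mu, mu))
    return exp_map(mu, (r / lam_mu) * alpha, c)           # z = exp_µ^c(r α / λ_µ^c)

rng = np.random.default_rng(0)
z = sample_hyperbolic_normal(np.array([0.2, 0.1]), 1.0, 1.0, rng)
```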

    4 Experiments

Branching diffusion process  To test the Pc-VAE on data with an underlying hierarchical structure, we generate a synthetic dataset from a branching diffusion process. Nodes x_i ∈ R^n are represented in a vector space and sampled according to the following process:

x_i ∼ N(· | x_{π(i)}, σ_0²)  ∀i ∈ 1, . . . , N,  (16)

with π(i) the index of the i-th node's ancestor and d(i) its depth. Hence, the generated observations have a known hierarchical relationship. Yet, the model does not have access to the hierarchy but only to the vector representations (x_1, . . . , x_N).

To assess the generalisation capacity of the models, noisy observations are then sampled for each node x_i:

y_{i,j} = x_i + ε_{i,j},  ε_{i,j} ∼ N(· | 0, σ_j²)  ∀i ∈ 1, . . . , N, ∀j ∈ 1, . . . , J.  (17)

The root x_0 is set to the null vector for simplicity. The observation dimension is set to n = 50. The dataset (y_{i,j})_{i,j} is centered and normalised to have unit variance; the choice of variance σ_0² therefore does not matter, and it is set to σ_0 = 1. The number of noisy observations is set to J = 5, and their variance to σ_j² = σ_0²/5 = 1/5. The depth is set to 6 and the branching factor to 2. Details concerning the optimisation and architectures of the models can be found in Appendix C. In order to avoid the embeddings lying too close to the center of the ball, and thus not really making use of the hyperbolic geometry, we set the prior distribution's dispersion to σ = 1.7. Indeed, more prior mass is put close to the boundary as the dispersion σ increases, as illustrated in Figure 2. Consequently, the embeddings are "pushed" towards the boundary of the ball by the regularisation term in the ELBO (Equation 13). We restrict ourselves to a latent dimensionality of d = 2, i.e. the Poincaré disc B²_c.
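A sketch of this data-generating process, under the settings just described (depth 6, branching factor 2, n = 50, J = 5 noisy observations per node), might look as follows; the function name and the choice to also observe internal nodes are our own reading of the setup, not the authors' code.

```python
import numpy as np

def branching_diffusion(depth=6, branching=2, n_dim=50, sigma0=1.0,
                        n_noisy=5, noise_var=1.0 / 5, seed=0):
    """Synthetic hierarchical dataset of Eqs. (16)-(17)."""
    rng = np.random.default_rng(seed)
    nodes, parents, frontier = [np.zeros(n_dim)], [None], [0]       # root x_0 = 0
    for _ in range(depth):
        next_frontier = []
        for p in frontier:
            for _ in range(branching):
                nodes.append(rng.normal(nodes[p], sigma0))          # x_i ~ N(x_π(i), σ_0²)
                parents.append(p)
                next_frontier.append(len(nodes) - 1)
        frontier = next_frontier
    X = np.stack(nodes)
    noise = rng.normal(0.0, np.sqrt(noise_var), (X.shape[0], n_noisy, n_dim))
    Y = (X[:, None, :] + noise).reshape(-1, n_dim)                  # y_{i,j} = x_i + ε_{i,j}
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)                        # centre and normalise
    return X, Y, parents

X, Y, parents = branching_diffusion()
print(X.shape, Y.shape)   # (127, 50) (635, 50) for depth 6 and branching factor 2
```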

Both models – Pc-VAE and N-VAE – are trained on the synthetic dataset previously described. The aim is to assess whether the underlying hierarchy of the observations is recovered in the latent representations. Figure 4 shows that a hierarchical structure is indeed somewhat learned by both models, even though the Pc-VAE's hierarchy could be visually clearer. So as to quantitatively assess model performance, we compute the marginal likelihood on the test dataset using IWAE (Burda et al., 2015) estimates (with 1000 samples). We train several Pc-VAEs with decreasing curvatures, along with a Euclidean N-VAE as a baseline.

Figure 4: Latent representations of a P1-VAE (Left) and a N-VAE (Right), trained on the synthetic dataset. Posterior means are represented by black crosses, and coloured dots are posterior samples. Blue lines represent the true underlying hierarchy.

Figure 5: Latent representations of a P1-VAE (Left) and a N-VAE (Right) with a heatmap of the log distance to the hyperplane (in pink).

Table 1: Negative test marginal likelihood estimates on the synthetic dataset, averaged over 20 runs (the smaller the better). Twenty different datasets are generated and the models are trained on each.

            σ     N-VAE         P0.1-VAE      P0.3-VAE      P0.8-VAE      P1.0-VAE      P1.2-VAE
L_IWAE      1.0   57.14 ± .42   57.10 ± .39   57.16 ± .39   56.88 ± .43   56.71 ± .41   56.58 ± .48
L_IWAE      1.3   57.03 ± .43   56.91 ± .39   56.91 ± .38   56.39 ± .41   56.21 ± .42   56.07 ± .47
L_IWAE      1.7   57.00 ± .39   56.77 ± .38   56.62 ± .35   55.90 ± .48   55.70 ± .41   55.60 ± .38

Table 1 shows that, on the synthetic dataset, the Pc-VAE outperforms its Euclidean counterpart in terms of test marginal likelihood. Also, as the curvature decreases, the performance of the N-VAE is recovered. Figure 6 shows latent representations of Pc-VAEs with different curvatures. With "small" curvatures, we observe that the embeddings lie close to the center of the ball, where the geometry is close to being Euclidean.

Figure 6: Latent representations of Pc-VAE with decreasing curvatures c = 1.2, 0.3, 0.1 (Left to Right).

    5 Related work

In the Bayesian nonparametrics literature, explicitly modelling the hierarchical structure of data has been a long-standing trend (Teh et al., 2008; Heller and Ghahramani, 2005; Griffiths et al., 2004; Ghahramani et al., 2010; Larsen et al., 2001). Embedding graphs in hyperbolic spaces has been empirically shown (Nickel and Kiela, 2017, 2018; Chamberlain et al., 2017) to yield a more compact representation compared to Euclidean space, especially in low dimensions. De Sa et al. (2018) studied the trade-offs of tree embeddings in the Poincaré disc.

VAEs with a non-Euclidean latent space were first explored by Davidson et al. (2018), who endowed the latent space with a hyperspherical geometry and parametrised variational posteriors with von Mises-Fisher distributions. Independent works recently considered endowing the latent space of AEs with a hyperbolic geometry in order to yield a hierarchical representation. Grattarola et al. (2018) considered the setting of constant curvature manifolds (CCMs) (i.e. hyperspherical, Euclidean and hyperbolic geometries), making use of the hyperboloid model for the hyperbolic geometry. By building on the adversarial auto-encoder (AAE) framework, observations are generated by mapping prior samples through the decoder, and observations are embedded via a deterministic encoder. Hence, only samples from the prior distribution are needed, which they choose to be a wrapped normal distribution. In order to use generic neural networks, the encoder is regularised so that its outputs lie close to the manifold. On the other hand, the encoder and decoder architectures are therefore not tailored to take into account the geometry of the latent space. The posterior distribution is regularised towards the prior via a discriminator, which requires the design and training of an extra neural network. In a concurrent work, Ovinnikov (2018) recently proposed to similarly endow the VAE latent space with a Poincaré ball model. They train their model by minimising a Wasserstein auto-encoder loss (Tolstikhin et al., 2017) with an MMD regularisation, because they cannot derive a closed-form solution of the ELBO's entropy term. We instead rely on an MC approximation of the ELBO.

    6 Future directions

One drawback of our approach is that it is computationally more demanding than usual VAEs. The gyroplane units are not as fast as their linear counterparts, which benefit from optimised implementations. Also, our sampling scheme for hyperbolic normal distributions does not scale well with the dimensionality: sampling the hyperbolic radius is the only non-trivial part, and even though it is a univariate random variable, its dependency on the dimensionality hurts our rejection bound. One may exploit the log-concavity of the hyperbolic radius pdf by applying an adaptive rejection sampling (ARS) (Gilks and Wild, 1992; Görür and Teh, 2011) scheme, which should scale better.

Another interesting extension would be the use of normalising flows to avoid the previously mentioned difficulties due to hyperbolic normal distributions, while preserving a tractable likelihood. One may consider Hamiltonian flows (Caterini et al., 2018) defined on the Poincaré ball, or generalisations of simple flows (e.g. planar flows).

Further empirical results on real datasets are also needed to strengthen the claim that such models indeed yield interpretable hierarchical representations which are more compact.

    Acknowledgments

We are extremely grateful to Adam Foster, Phillipe Gagnon and Emmanuel Chevallier for their help. EM's and YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement no. 617071, and they acknowledge Microsoft Research and EPSRC for partially funding EM's studentship.

References

Alanis-Lobato, G. and Andrade-Navarro, M. A. (2016). Distance distribution between complex network nodes in hyperbolic space. Complex Systems, pages 223–236.

Asta, D. and Shalizi, C. R. (2014). Geometric network comparison. arXiv.org.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.

Box, G. E. P. and Muller, M. E. (1958). A note on the generation of random normal deviates. Ann. Math. Statist., 29(2):610–611.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv.org.

Caterini, A. L., Doucet, A., and Sejdinovic, D. (2018). Hamiltonian variational auto-encoder. arXiv.org.

Chamberlain, B. P., Clough, J., and Deisenroth, M. P. (2017). Neural embeddings of graphs in hyperbolic space. arXiv.org.

Chevallier, E., Barbaresco, F., and Angulo, J. (2015). Probability density estimation on the hyperbolic space applied to radar processing. In Nielsen, F. and Barbaresco, F., editors, Geometric Science of Information, pages 753–761, Cham. Springer International Publishing.

Coates, A., Lee, H., and Ng, A. (2011). An analysis of single-layer networks in unsupervised feature learning. In Gordon, G., Dunson, D., and Dudík, M., editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215–223. JMLR W&CP.

Coxeter, H. S. M. (1942). Non-Euclidean Geometry. University of Toronto Press.

Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favored Races in the Struggle for Life. Murray, London.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical variational auto-encoders. arXiv.org.

De Sa, C., Gu, A., Re, C., and Sala, F. (2018). Representation tradeoffs for hyperbolic embeddings. arXiv.org.

Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit reparameterization gradients. arXiv.org.

Ganea, O.-E., Bécigneul, G., and Hofmann, T. (2018). Hyperbolic neural networks. arXiv.org.

Ghahramani, Z., Jordan, M. I., and Adams, R. P. (2010). Tree-structured stick breaking for hierarchical data. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 19–27. Curran Associates, Inc.

Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348.

Görür, D. and Teh, Y. W. (2011). Concave-convex adaptive rejection sampling. Journal of Computational and Graphical Statistics, 20(3):670–691.

Grattarola, D., Livi, L., and Alippi, C. (2018). Adversarial autoencoders with constant-curvature latent manifolds. arXiv.org.

Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., and Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, pages 17–24. MIT Press.

Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 297–304, New York, NY, USA. ACM.

Huang, F. J. and LeCun, Y. (2006). Large-scale learning with SVM and convolutional for generic object categorization. In CVPR (1), pages 284–291. IEEE Computer Society.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv.org.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. CoRR, abs/1312.6114.

Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguna, M. (2010). Hyperbolic geometry of complex networks. arXiv.org, (3):253.

Kulish, V. (2002). Hierarchical Methods. Springer Netherlands.

Larsen, J., Szymkowiak, A., and Hansen, L. K. (2001). Probabilistic hierarchical clustering with labeled and unlabeled data.

Linial, N., London, E., and Rabinovich, Y. (1994). The geometry of graphs and some of its algorithmic applications. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, SFCS '94, pages 577–591, Washington, DC, USA. IEEE Computer Society.

Maslow, A. H. (1943). A theory of human motivation. Psychological Review, 50(4):430–437.

Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. arXiv.org.

Nickel, M. and Kiela, D. (2018). Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. arXiv.org.

Ovinnikov, I. (2018). Poincaré Wasserstein autoencoder. NeurIPS Workshop on Bayesian Deep Learning, pages 1–8.

Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127.

Petersen, P. (2006). Riemannian Geometry. Springer-Verlag New York.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv.org.

Said, S., Bombrun, L., and Berthoumieu, Y. (2014). New Riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031.

Sarkar, R. (2012). Low distortion Delaunay embedding of trees in hyperbolic plane. In van Kreveld, M. and Speckmann, B., editors, Graph Drawing, pages 355–366, Berlin, Heidelberg. Springer Berlin Heidelberg.

Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1473–1480. Curran Associates, Inc.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2017). Wasserstein auto-encoders. arXiv.org.

Ungar, A. A. (2008). A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194.

A Evidence Lower Bound

The ELBO can readily be extended to Riemannian latent spaces by applying Jensen's inequality w.r.t. dM, which yields

ln p(x) = ln ∫_{Z=M} p_θ(x, z) dM(z) = ln ∫_M p_θ(x|z) p(z) dM(z)
        = ln ∫_M [ p_θ(x|z) p(z) / q_φ(z|x) ] q_φ(z|x) dM(z)
        ≥ ∫_M ln[ p_θ(x|z) p(z) / q_φ(z|x) ] q_φ(z|x) dM(z)
        = ∫_M [ ln p_θ(x|z) + ln p(z) − ln q_φ(z|x) ] q_φ(z|x) dM(z)
        = E_{z∼q_φ(·|x)}[ ln p_θ(x|z) + ln p(z) − ln q_φ(z|x) ] ≜ L_M(x; θ, φ).

B Hyperbolic normal distributions

B.1 Probability measures on Riemannian manifolds

Probability measures and random vectors can be defined on Riemannian manifolds so as to model uncertainty on non-Euclidean spaces (Pennec, 2006). The Riemannian metric G(x) induces an infinitesimal volume element on each tangent space T_xM, and thus a measure on the manifold,

dM(x) = √|G(x)| dx = λ_x dx,  (18)

with dx being the Lebesgue measure. Random vectors x ∈ M are then naturally characterised by the Radon-Nikodym derivative of a measure ν w.r.t. the Riemannian measure dM(·):

dν(x)/dM(x) = f(x).

Thus one can generalise isotropic normal distributions as parametric measures ν with the following density:

dν(x|µ, σ²)/dM(x) = N_M(x|µ, σ²) = (1/Z) exp( −d_M(µ, x)²/(2σ²) ),  (19)

with d_M the Riemannian distance on the manifold induced by the metric tensor, σ > 0 the dispersion (which is not equal to the standard deviation), and µ ∈ M the Fréchet expectation. Such a class of distributions is an isotropic special case of the maximum entropy characterisation derived in Pennec (2006). It is also the formulation used by Said et al. (2014) on the Poincaré half-plane.

Another generalisation are the wrapped (also called push-forward or exp-map) normal distributions, which consist in mapping a normal distribution through the exponential map of the manifold:

x ∼ N_M(x|µ, σ²) iff x = exp_µ(ε), ε ∼ N(·|0, σ²).  (20)

Such wrapped distributions usually have tractable formulas and easily generalise most standard distributions; this is the reason why Grattarola et al. (2018) rely on such a prior. Yet, for the Gaussian case, they violate the maximum entropy characterisation. The usual (Euclidean) normal distribution is recovered in the limit of zero curvature for both generalisations.

B.2 Maximum entropy hyperbolic normal distributions

In the Poincaré ball B^d_c, the maximum entropy generalisation of the normal distribution from Eq. (19) yields

dν(x|µ, σ²)/dM(x) = N_{B^d_c}(x|µ, σ²) = (1/Z(σ)) exp( −d^c_p(µ, x)²/(2σ²) ).  (21)

Such a pdf can easily be computed pointwise once Z(σ) is known, which we derive in Appendix B.6. A harder task is to obtain samples from such a distribution. We propose a rejection sampling algorithm based on hyperbolic polar coordinates, as derived in Appendix B.7.

    B.3 Hyperbolic polar coordinates

Euclidean polar coordinates (or hyperspherical coordinates for d > 2) express a point x ∈ R^d through a radius r ≥ 0 and a direction α ∈ S^{d−1} such that x = rα. Yet, one could choose another pole (or reference point) µ ≠ 0 such that x = µ + rα; consequently, r = d_E(µ, x). An analogous change of variables can be constructed on Riemannian manifolds by relying on the exponential map as a generalisation of the addition operator. Given a pole µ ∈ B^d_c, the point of hyperbolic polar coordinates x = (r, α) is defined as the point at distance r ≥ 0 from µ lying on the geodesic with initial direction α ∈ S^{d−1}. Hence x = exp^c_µ( r α / λ^c_µ ), with r α / λ^c_µ ∈ T_µB^d_c, since d^c_p(µ, x) = ‖log^c_µ(x)‖_µ = ‖r α / λ^c_µ‖_µ = r.

Below we derive the expression of the Poincaré ball metric in such hyperbolic polar coordinates, for the specific setting µ = 0, i.e. x = exp^c_0(r α / 2). Switching to Euclidean polar coordinates yields

ds²_{B^d_c} = (λ^c_x)² (dx_1² + · · · + dx_d²) = 4/(1 − c‖x‖²)² dx² = 4/(1 − cρ²)² (dρ² + ρ² ds²_{S^{d−1}}).  (22)

Let us define r = d^c_p(0, x) = L(γ), with γ the geodesic joining 0 and x. Since such a geodesic is the segment [0, x], we have

r = ∫_0^ρ λ^c_t dt = ∫_0^ρ 2/(1 − ct²) dt = ∫_0^{√c ρ} 2/(1 − t²) dt/√c = (2/√c) tanh⁻¹(√c ρ).

Plugging ρ = (1/√c) tanh(√c r/2) (and dρ = ((1 − cρ²)/2) dr) into Eq. (22) yields

ds²_{B^d_c} = [4/(1 − cρ²)²] [(1 − cρ²)²/4] dr² + ( 2ρ/(1 − cρ²) )² ds²_{S^{d−1}}
            = dr² + ( 2 (1/√c) tanh(√c r/2) / (1 − c ((1/√c) tanh(√c r/2))²) )² ds²_{S^{d−1}}
            = dr² + ( (1/√c) sinh(√c r) )² ds²_{S^{d−1}}.  (23)

Below we recall the result for the general setting µ ≠ 0. In an orthonormal basis of T_µB^d_c – given by hyperspherical coordinates v = rα with r ≥ 0 and α ∈ S^{d−1} – hyperbolic polar coordinates lead to the following expression for the matrix of the metric (Chevallier et al., 2015):

G(v) = G(r) = diag( 1, (sinh(√c r)/(√c r))² I_{d−1} ).  (24)

Hence

√|G(v)| = √|G(r)| = (sinh(√c r)/(√c r))^{d−1}.  (25)

The fact that the length element ds²_{B^d_c}, and equivalently the metric G, only depends on the radius in hyperbolic polar coordinates is a consequence of the hyperbolic space's isotropy.

B.4 Probability density function

By integrating the density in the tangent space T_µB^d_c we get

∫_{B^d_c} f(x) dM(x) = ∫_{B^d_c} f(x) √|G(x)| dx = ∫_{T_µB^d_c} f(v) √|G(v)| dv
  = ∫_{R+} ∫_{S^{d−1}} f(r) √|G(r)| r^{d−1} dr ds_{S^{d−1}}
  = ∫_{R+} ∫_{S^{d−1}} f(r) (sinh(√c r)/(√c r))^{d−1} r^{d−1} dr ds_{S^{d−1}}
  = ∫_{R+} ∫_{S^{d−1}} f(r) (sinh(√c r)/√c)^{d−1} dr ds_{S^{d−1}}.

Plugging in the hyperbolic normal density f, with the hyperbolic distance r = d^c_p(µ, x), yields

  = ∫_{R+} ∫_{S^{d−1}} (1/Z(σ)) e^{−r²/(2σ²)} (sinh(√c r)/√c)^{d−1} dr ds_{S^{d−1}}
  = (1/Z(σ)) ( ∫_{R+} e^{−r²/(2σ²)} (sinh(√c r)/√c)^{d−1} dr ) ( ∫_{S^{d−1}} ds_{S^{d−1}} ) = (1/Z(σ)) Z_r(σ) A_{S^{d−1}}
  ⇔ Z(σ) = Z_r(σ) A_{S^{d−1}}.

Hence f(x)√|G(x)| factorises as f(x)√|G(x)| = ρ(r) × 1_{S^{d−1}}(α), with α uniformly distributed on the hypersphere S^{d−1}, and the hyperbolic radius r = d^c_p(µ, x) distributed according to the density (derivative w.r.t. the Lebesgue measure)

ρ(r) = (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} (sinh(√c r)/√c)^{d−1}.  (26)

The (Euclidean) normal density is recovered as c → 0, since (sinh(√c r)/√c)^{d−1} → r^{d−1}.

By expanding the sinh term using the binomial formula, we get

ρ(r) = (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} ( (e^{√c r} − e^{−√c r})/(2√c) )^{d−1}
     = (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (d−1 choose k) (e^{√c r})^{d−1−k} (−e^{−√c r})^k
     = (1_{R+}(r)/Z_r(σ)) (1/(2√c)^{d−1}) e^{−r²/(2σ²)} Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)√c r}
     = (1_{R+}(r)/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{−r²/(2σ²) + (d−1−2k)√c r}
     = (1_{R+}(r)/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} e^{−[r − (d−1−2k)√c σ²]²/(2σ²)}.  (27)

B.5 Cumulative density function

Integrating the expanded density of Eq. (27) yields

F_r(r) = ∫_{−∞}^r ρ(u) du
  = (1/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} ∫_0^r e^{−[u − (d−1−2k)√c σ²]²/(2σ²)} du  (28)
  = (1/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} √(2π) σ
      × [ Φ( (r − (d−1−2k)√c σ²)/σ ) + Φ( (d−1−2k)√c σ ) − 1 ]
  = (1/Z_r(σ)) √(π/2) σ (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2}
      × [ erf( (r − (d−1−2k)√c σ²)/(√2 σ) ) + erf( (d−1−2k)√c σ/√2 ) ]  (29)

with Φ : x ↦ (1/2)(1 + erf(x/√2)) the cumulative distribution function of a standard normal distribution.

    B.6 Normalisation constant

As shown in Appendix B.3 – by using hyperbolic polar coordinates – the density of the hyperbolic normal distribution factorises between a hyperbolic radius r and a direction α, thus

Z(σ) = Z_r(σ) A_{S^{d−1}}.  (30)

The surface area of the (d−1)-dimensional hypersphere with radius 1 is given by

Z_α(σ) = A_{S^{d−1}} = 2π^{d/2}/Γ(d/2).  (31)

Also, taking the limit F_r(r) → 1 as r → ∞ in Eq. (29) yields

Z_r(σ) = √(π/2) σ (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} [ 1 + erf( (d−1−2k)√c σ/√2 ) ].  (32)

For the special case c = 1 and d = 2 we recover the formula given in Said et al. (2014):

Z_r(σ) = √(π/2) σ e^{σ²/2} erf(σ/√2).  (33)

    B.7 Sampling

In this section we introduce a rejection sampling scheme to obtain samples from a hyperbolic normal distribution N_{B^d_c}(·|µ, σ²), along with a reparameterisation allowing gradients to be computed with respect to the parameters µ and σ.

Reparametrisation  Analogously to the Box-Muller transform (Box and Muller, 1958) for sampling normally distributed random variables, one can make use of hyperbolic polar coordinates to get the factorised density derived in Appendix B.4. We indeed rely on the reparameterisation of x ∼ N_{B^d_c}(·|µ, σ²) given by

x = exp^c_µ( r α / λ^c_µ ),  (34)

with the direction α uniformly distributed on the (d−1)-hypersphere, α ∼ U(S^{d−1}), and the hyperbolic radius r having density (derivative w.r.t. the Lebesgue measure)

ρ(r) = (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} (sinh(√c r)/√c)^{d−1}.  (35)

Such a density is not that of a well-known distribution, and its cumulative density function does not appear to be analytically invertible. We therefore rely on a rejection sampling scheme to obtain samples of the hyperbolic radius. We propose below two pdfs g such that there exists M > 1 with ρ(r) ≤ M g(r) for all r > 0, and derive M. Thanks to this reparameterisation, one only needs to sample a scalar random variable r, even if the dimension of the space is large. Yet, the performance of sampling r can be affected by the dimensionality, depending on the method used to obtain such samples.

Sampling challenges due to the hyperbolic geometry  Several properties of the Euclidean space do not generalise to the hyperbolic setting, unfortunately hardening the task of obtaining samples from hyperbolic normal distributions. First, in the Euclidean case one can factorise a normal density across the space's dimensions – thanks to the Pythagorean theorem – hence allowing the task to be divided over several subspaces and the samples then concatenated. Such a property does not extend to hyperbolic geometry, thus seemingly preventing us from focusing on 2-dimensional samples. Second, in Euclidean geometry the polar radius r is distributed according to ρ(r) = (1_{R+}(r)/Z_r(σ)) e^{−r²/(2σ²)} r^{d−1}, making it easy, by a linear change of variable, to take into account different scaling values. The non-linearity of sinh prevents us from using such a simple change of variable.

Computing gradients with respect to parameters  So as to compute gradients of samples x of a hyperbolic normal distribution with respect to the parameters µ and σ, we respectively rely on the reparameterisation given by Eq. (34) for ∇_µ x, and on an implicit reparameterisation (Figurnov et al., 2018) of r for ∇_σ x.

We have x = exp^c_µ( r α / λ^c_µ ), with α ∼ U(S^{d−1}) and r ∼ ρ(·). Hence,

∇_µ x = ∇_µ exp^c_µ(v),  (36)

with v = r α / λ^c_µ (actually) independent of µ, and

∇_σ x = ∇_σ exp^c_µ(v) = ∇_v exp^c_µ(v) (α/λ^c_µ) ∇_σ r,  (37)

with ∇_σ r computed via the implicit reparameterisation

∇_σ r = −(∇_r F(r, σ))^{−1} ∇_σ F(r, σ).  (38)
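As a purely numerical illustration of Eq. (38) (the paper differentiates the closed-form CDF of Eq. (29); here F is instead evaluated by quadrature on a grid and ∂F/∂σ by a finite difference, with helper names of our own):

```python
import numpy as np

def radius_density_unnorm(r, sigma, d, c):
    """Unnormalised hyperbolic-radius density of Eq. (35)."""
    return np.exp(-r**2 / (2 * sigma**2)) * (np.sinh(np.sqrt(c) * r) / np.sqrt(c)) ** (d - 1)

def cdf_radius(r, sigma, d, c, r_max=20.0, n=20001):
    """F(r | σ), normalised numerically on a grid (adequate for small d and σ)."""
    grid = np.linspace(0.0, r_max, n)
    dens = radius_density_unnorm(grid, sigma, d, c)
    return dens[grid <= r].sum() / dens.sum()

def dr_dsigma(r, sigma, d, c, eps=1e-4, r_max=20.0, n=20001):
    """Implicit reparameterisation gradient of Eq. (38): dr/dσ = -(∂_σ F)/(∂_r F)."""
    dF_dsigma = (cdf_radius(r, sigma + eps, d, c, r_max, n)
                 - cdf_radius(r, sigma - eps, d, c, r_max, n)) / (2 * eps)
    grid = np.linspace(0.0, r_max, n)
    Z = radius_density_unnorm(grid, sigma, d, c).sum() * (grid[1] - grid[0])
    dF_dr = radius_density_unnorm(r, sigma, d, c) / Z     # ∂F/∂r is the density ρ(r|σ)
    return -dF_dsigma / dF_dr

print(dr_dsigma(r=1.5, sigma=1.0, d=2, c=1.0))
```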

Rejection sampling with truncated normal proposal  The expanded expression of ρ(r) from Eq. (27) highlights the fact that the density can immediately be upper bounded by a truncated normal density:

ρ(r) = (1_{R+}(r)/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} e^{−[r − (d−1−2k)√c σ²]²/(2σ²)}
     ≤ (1_{R+}(r)/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{2k=0}^{d−1} (d−1 choose 2k) e^{(d−1−4k)² cσ²/2} e^{−[r − (d−1−4k)√c σ²]²/(2σ²)}.

Then we choose our proposal g to be the truncated normal distribution associated with k = 0, i.e. with mean (d−1)√c σ² and variance σ²:

g(r) = 1_{r>0} / ( σ (1 − Φ(−(d−1)√c σ²/σ)) ) · (1/√(2π)) e^{−(r − (d−1)√c σ²)²/(2σ²)}
     = √(2/π) · 1_{r>0} / ( σ (1 + erf((d−1)√c σ/√2)) ) · e^{−(r − (d−1)√c σ²)²/(2σ²)}
     = (1_{r>0}/Z_g(σ)) e^{−(r − (d−1)√c σ²)²/(2σ²)}  (39)

with

Z_g(σ) = √(π/2) σ ( 1 + erf( (d−1)√c σ/√2 ) ).  (40)

Computing the ratio of the densities yields

ρ(r)/g(r) = (Z_g(σ)/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} e^{−[r − (d−1−2k)√c σ²]²/(2σ²)} e^{+[r − (d−1)√c σ²]²/(2σ²)}
          = (Z_g(σ)/Z_r(σ)) (1/(2√c)^{d−1}) Σ_{k=0}^{d−1} (−1)^k (d−1 choose k) e^{(d−1−2k)² cσ²/2} e^{2k√c ((d−1−k)√c σ² − r)}.

Hence

ρ(r)/g(r) ≤ M ≜ (Z_g(σ)/Z_r(σ)) (1/(2√c)^{d−1}) e^{(d−1)² cσ²/2}.  (41)

Rejection sampling with Gamma proposal  Now let us consider the following Gamma(2, σ) density:

g(r) = (1_{r>0}/Z_g(σ)) r e^{−r/σ},  with  Z_g(σ) = Γ(2) σ².

The log-ratio of the densities can then be upper bounded as follows:

ln(ρ(r)/g(r)) = ln(Z_g(σ)/Z_r(σ)) − r²/(2σ²) + (d−1) ln(e^{√c r} − e^{−√c r}) − (d−1) ln 2 − ln r + r/σ
             = ln(Z_g(σ)/Z_r(σ)) − (d−1) ln 2 + [ −r²/(2σ²) + ((d−1)√c + 1/σ) r ] + (d−1) ln( (1 − e^{−2√c r})/r )
             ≤ ln(Z_g(σ)/Z_r(σ)) + ((d−1)√c σ + 1)²/2 + (d−1) ln √c,

where the quadratic term in brackets is bounded by ((d−1)√c σ + 1)²/2 and the last term by (d−1) ln(2√c). Hence

ρ(r)/g(r) ≤ M ≜ (Z_g(σ)/Z_r(σ)) c^{(d−1)/2} e^{((d−1)√c σ + 1)²/2}.  (42)

B.8 Uninformative prior

Instead of a standard hyperbolic normal distribution N_{B^d_c}(·|0, I_d), one may prefer an uninformative prior, such as the quasi-uniform distribution on the Poincaré disc (Asta and Shalizi, 2014; Krioukov et al., 2010; Alanis-Lobato and Andrade-Navarro, 2016). In hyperspherical coordinates x = rα, with α ∈ S^{d−1}, its distribution factorises as p(x) = ρ(r; δ) 1_{S^{d−1}}(α), with

ρ(r; δ) = δ sinh(δr) / (cosh(δ) − 1) ≈ δ exp(−δr),

where δ ∈ (0, 1] controls the density of points close to the origin of the hyperbolic disc.

    C Experimental details

    C.1 Synthetic Branching Diffusion Process

In this section we give more details on the design of the architectures and the optimisation scheme used for the experimental results reported in Section 4.

Architectures  Both the N-VAE and Pc-VAE decoders parametrise the mean of a unit variance Gaussian likelihood N(·|f_θ(z), 1). Their encoders parametrise the mean and the log-variance of, respectively, an isotropic normal distribution N(·|g_φ(x)) and an isotropic hyperbolic normal distribution N_{B^d_c}(·|g_φ(x)). The N-VAE's encoder and decoder are composed of two fully-connected layers with a ReLU activation in between, as summarised in Tables 2 and 3. The Pc-VAE's design is similar, the differences being that the encoder's mean output is mapped to the manifold via the exponential map exp^c_0, and that the decoder's first layer is made of the gyroplane units presented in Section 3.2, as summarised in Tables 4 and 5. Observations live in X = R^50 and the latent space dimensionality is set to d = 2.

Table 2: Encoder network for N-VAE
  Layer   Output dim   Activation
  Input   50           Identity
  FC      200          ReLU
  FC      2, 1         Identity

Table 3: Decoder network for N-VAE
  Layer   Output dim   Activation
  Input   2            Identity
  FC      200          ReLU
  FC      50           Identity

Table 4: Encoder network for Pc-VAE
  Layer   Output dim   Activation
  Input   50           Identity
  FC      200          ReLU
  FC      2, 1         exp^c_0, Identity

Table 5: Decoder network for Pc-VAE
  Layer       Output dim   Activation
  Input       2            Identity
  Gyroplane   200          ReLU
  FC          50           Identity

The synthetic datasets are generated as described in Section 4, then centered and normalised to unit variance. They are then randomly split into training and testing datasets with a proportion of 0.7.

Optimisation  The gyroplane offsets b ∈ B^d_c are only implicitly parametrised to live on the manifold, by projecting a real vector: b = exp^c_0(b′). Hence, all parameters {θ, φ} of the model explicitly live in Euclidean spaces, which means that usual optimisation schemes can be applied. We therefore rely on the Adam optimiser (Kingma and Ba, 2014) with parameters β_1 = 0.9, β_2 = 0.999 and a constant learning rate of 1e−3. Models are trained with mini-batches of size 64 for 1000 epochs. The ELBO is approximated with the MC estimate given in Equation 13 with K = 1.

    D More experimental qualitative results

Similarly to Figure 6, Figure 7 illustrates the learned latent representations of Pc-VAE with decreasing curvatures c, highlighting the learned gyroplanes of the decoder.


Figure 7: Latent representations of Pc-VAE with decreasing curvatures c = 1.2, 0.3, 0.1 (Left to Right).
