TRANSCRIPT
Politecnico di Milano
Mathematics Department
Doctoral Programme In
Mathematical Models and Methods in Engineering
Modeling and computational aspects of
dependent completely random measures in
Bayesian nonparametric statistics
Doctoral Dissertation of:
Ilaria Bianchini
Supervisor:
Prof. Alessandra Guglielmi
Co-supervisor:
Prof. Raffaele Argiento
The Chair of the Doctoral Program:
Prof. Irene Sabadini
Year 2017 - XXX Cycle
Abstract
Bayesian nonparametrics is a lively topic in the statistical literature. Thanks to its versatility, the approach applies to a wide range of modern applications, from machine learning to medicine. In particular, an intense research activity has recently been devoted to the development of dependent stochastic processes to be used as priors in nonparametric models. Exchangeability is indeed no longer the proper assumption in all contexts: many datasets contain covariate information that we wish to leverage to improve model performance. In a nutshell, dependent nonparametric processes extend existing nonparametric priors over measures, partitions, sequences, etc. to obtain priors over collections of such mathematical objects; members of these collections are associated with values in some metric covariate space, such as time or external measurements.
This thesis illustrates different modeling strategies motivated by practical problems involving covariates, the main goals being density estimation and clustering. At the beginning, completely random measures are presented, since they are the leitmotif of the thesis and the Bayesian nonparametric models introduced afterwards are mainly built on top of them. In the following chapters (Chapters 2-5) the original contributions are illustrated.
Chapter 2 presents a truncation method to approximate a priori the mixing measure in an infinite mixture model; in particular, we focus on mixtures where the mixing measure is given by normalized completely random measures. Among the illustrative examples, we show how to easily include covariates in the support of the measure. In Chapter 3, a different approach is presented, where covariates enter directly in the prior for the random partition. This model is motivated by a health-care problem: profiling the different behaviors over time of blood donors. In Chapter 4 we address the issue of overestimating the number of components in mixture models, which typically occurs when using Dirichlet process mixtures.
To this end, a model that induces a-priori separation among the group-specific parameters is proposed. A class of determinantal point process mixture models defined via the spectral representation is explored. These models also incorporate dependence on covariates via mixtures of experts. In the final chapter, we introduce a class of time-dependent processes taking values in the space of exponential completely random measures. These processes have an AR(1)-type structure and may be used as building blocks in latent trait models to develop, for instance, time-dependent feature allocation models.
Sommario
Bayesian nonparametric statistics is a very active and lively research area, thanks to its flexibility and to the wide variety of applications that motivate its development, from machine learning to medicine. In particular, part of the most recent literature is devoted to the introduction of dependent stochastic processes (on time, on covariates, . . . ) to be used as priors
in nonparametric models. Indeed, the typical assumption of exchangeability is not always appropriate: in many contexts, additional information, i.e. the covariates, can be exploited to improve the performance of the statistical model. In short, dependent nonparametric processes extend the prior distributions known in the literature, defined for measures, partitions, . . . , to collections of such mathematical objects. Each element of such a collection is associated with values in the covariate space, such as time or other kinds of measurements.
This thesis illustrates several modeling approaches, motivated by real applications, whose goals are density estimation and data clustering. After introducing the basic concepts of Bayesian nonparametric statistics used in the thesis, four chapters containing the original contribution of this work are presented. The first part illustrates an approximation of random probability measures based on an a-priori truncation: these are used in mixture models whose mixing measure is a normalized completely random measure. One of the examples, in particular, considers the case in which the support of the measure depends on the covariates. A very different approach is adopted in the following chapter, where the covariates directly influence the prior on the random partition. The model is motivated by an interesting applied problem in healthcare: analyzing the behavior of blood donors. Next, we tackle the problem of overestimating the number of groups in mixture models: here a model is proposed and illustrated that induces a-priori separation among the parameters of the mixture components, through point processes that depend on the determinant of a certain covariance matrix. Covariates are included in the model through a mixture-of-experts approach. Finally, we introduce a class of time-dependent processes living in the space of particular completely random measures. The processes have an autoregressive-type structure and can easily be used in more complex models.
Acknowledgements
The work contained in this thesis is the result of collaboration with a number of people, to whom goes my deepest gratitude. First of all I want to thank Alessandra Guglielmi and Raffaele Argiento, who believed in me since we first met. You provided me with continuous support and encouragement. I am also grateful to the brilliant collaborators who hosted me during the visiting periods, professors Fernando Quintana and Jim Griffin. Your hospitality and stimulating discussions made my stays abroad memorable experiences. Another thank-you goes to my PhD colleagues who shared lunch time with me.
A huge thank-you also to my parents and to my lifelong friends: the reference points I can always count on. Finally, thanks to Matteo, to whom I dedicate this thesis, who has put up with me and supported me during these years with patience and love.
Contents

Introduction

1 Introduction to completely random measures
  1.1 Completely random measures
  1.2 Bayesian nonparametric models for density estimation and clustering
  1.3 Generalized latent trait models

2 Posterior sampling from ε-approximation of normalized completely random measure mixtures
  2.1 Introduction
  2.2 Preliminaries on normalized completely random measures
  2.3 ε-approximation of normalized completely random measures
  2.4 ε-NormCRM process mixtures
  2.5 Normalized Bessel random measure mixtures: density estimation
  2.6 Linear dependent NGG mixtures: an application to sports data
  2.7 Conclusion
  Appendix 2.A: Details on full-conditionals for the Gibbs sampler
  Appendix 2.B: Proofs of the theorems

3 Covariate driven clustering: an application to blood donors data
  3.1 Introduction
  3.2 A covariate driven model for clustering
  3.3 Simulated data
  3.4 The AVIS data on blood donations
  3.5 Discussion and future work
  Appendix 3.A: Gibbs sampler
  Appendix 3.B: Gibbs sampler for the blood donations application

4 Determinantal point process mixtures via spectral density approach
  4.1 Introduction
  4.2 Using DPPs to induce repulsion
  4.3 Generalization to covariate-dependent models
  4.4 Simulated data and reference datasets
  4.5 Biopic movies dataset
  4.6 Air quality index dataset
  4.7 Conclusion

5 Constructing stationary time series of completely random measures via Bayesian conjugacy
  5.1 Stationary autoregressive type AR(1) models for univariate data
  5.2 Exponential completely random measures
  5.3 Building a stationary time dependent model for a sequence of discrete random measures
  5.4 Application: latent feature model on a synthetic dataset of images
  5.5 Application: Poisson Factor Analysis for time dependent topic modelling
  5.6 Discussion and future developments

Bibliography
Introduction
As scientific problems become more and more complex, models and computational methods for data analysis require increasingly sophisticated statistical tools. In this sense, Bayesian nonparametric (BNP) statistics offers a framework for the development of flexible models with broad-spectrum applicability. This thesis presents advances in BNP models for dealing with dependence on covariates or time from a modelling perspective; new models involving completely random measures are introduced, together with corresponding MCMC algorithms to perform posterior inference. Throughout the thesis, we tackle various applications where the issue of dependence arises, such as healthcare and image analysis. The building block that recurs in this work is completely random measures: tractable mathematical objects that can be employed for building random probability measures and also for modeling latent structures in the observations. Starting from the definition and the main properties of completely random measures, we present different modelling strategies for performing clustering and density estimation through mixture models, as well as latent feature estimation.
The work is developed in 5 self-contained chapters, whose structure is summarized
hereinafter.
Chapter 1: in order to create a coherent framework for the development of the models presented in the dissertation, we include an initial chapter presenting a literature review and basic concepts related to completely random measures; reading it is useful to understand the main motivations of this work.
Chapter 2 is based on a published paper: see Argiento et al. (2016b). In a nutshell, we deal with nonparametric mixture models whose mixing distribution belongs to the class of normalized homogeneous completely random measures. We tackle the issue of the infinite dimensionality of the parameter by proposing a truncation, discarding the weights of the unnormalized measure smaller than a threshold. We provide some theoretical properties of the approximation, such as convergence and a posterior characterization. A general conditional blocked Gibbs sampler is devised in order to sample from the posterior of the model. Illustrative examples, also including covariate information in the location points of the random measure, show the effectiveness of the method.
Chapter 3 illustrates the problem of predicting the next donation time for a blood donor. We consider data on blood donations provided by the Milan department of AVIS (Italian Volunteer Blood-donors Association). With the goal of characterizing the behavior of donors, we analyze gap times between consecutive blood donations. In particular, we take into account population heterogeneity via model-based clustering. The main contribution is the introduction, in an accelerated failure time model with a skew-normal likelihood, of a prior on the random partition that explicitly accounts for covariate information. In particular, we consider a prior for the random partition of product-partition form, with a term that takes into account the distance between covariates in a cluster.
Chapter 4 deals, differently from the others, with finite mixture models with a random number of components. Typically, when using mixture models, finite or infinite, overestimating the number of groups is quite common; hence, there is a need for models inducing a-priori separation among the clusters. We explore a class of determinantal point process (DPP) mixture models defined via the spectral representation, focusing on a power exponential spectral density. In the second part of the chapter we generalize our model to account for the presence of covariates, both in the likelihood, as a linear regression, and in the weights of the mixture, by means of a mixture-of-experts approach. This yields a trade-off between repulsiveness of the locations in the mixture and attraction among subjects with similar covariates. This project was developed during my stay (1.5 months) at Pontificia Universidad Católica de Chile, under the supervision of Prof. Fernando A. Quintana; a preliminary version of the paper can be found in Bianchini et al. (2017).
Chapter 5 aims at developing a new way of flexibly modeling series of completely random measures that exhibit some temporal dependence. These processes might be fruitful in real-life applications, such as latent feature models for the identification of features in images or Poisson factor analysis for topic modelling. In order to achieve convenient mathematical tractability, namely to be able to define a flexible transition kernel for the process, we consider the large class of exponential-family completely random measures. This leads to a simple description of the process, which has an AR(1)-type structure and offers a framework for generalizations to more complicated forms of time dependence. This project was started during my stay (3 months) at the University of Kent, under the supervision of Prof. Jim Griffin.
Each chapter includes details on the implementation of the MCMC methods employed in posterior inference. Most of the statistical analyses have been carried out using R and C++. The algorithms of the last chapter have been coded with Rcpp, an R package providing R functions as well as C++ classes that offer a seamless integration of R and C++.
Gaussian, N(µ, σ²): density (2πσ²)^(−1/2) exp(−(x−µ)²/(2σ²)); mean µ; variance σ².

Gamma, gamma(α, β): density (β^α/Γ(α)) x^(α−1) exp(−βx) I_(0,+∞)(x); mean α/β; variance α/β².

Inverse Gamma, IG(α, β): density (β^α/Γ(α)) x^(−α−1) exp(−β/x) I_(0,+∞)(x); mean β/(α−1); variance β²/((α−1)²(α−2)), if α > 2.

Beta, Beta(a, b): density (1/B(a, b)) x^(a−1) (1−x)^(b−1) I_(0,1)(x); mean a/(a+b); variance ab/((a+b)²(a+b+1)).

Dirichlet, Dirichlet(α1, . . . , αk): density (Γ(Σ_{i=1}^k αi)/(Γ(α1) · · · Γ(αk))) x1^(α1−1) x2^(α2−1) · · · xk^(αk−1) I_{S^(k−1)}(x); E(Xj) = αj/α0; Var(Xj) = αj(α0 − αj)/(α0²(α0 + 1)), with α0 = Σ_{i=1}^k αi and S^(k−1) = {x ∈ R^k : Σ_{i=1}^k xi = 1, 0 < xi < 1 ∀i}.

Table 1: Absolutely continuous distributions: notation and parameterization used throughout the thesis.
Bernoulli, Bernoulli(π): probability mass π^x (1−π)^(1−x) I(x ∈ {0, 1}); mean π; variance π(1−π).

Binomial, Bin(n, π): probability mass (n choose x) π^x (1−π)^(n−x) I(x ∈ {0, 1, . . . , n}); mean nπ; variance nπ(1−π).

Negative binomial, NB(r, π): probability mass (x+r−1 choose x) (1−π)^r π^x I(x ∈ {0, 1, 2, . . .}); mean πr/(1−π); variance πr/(1−π)².

Multinomial, Multin(n; p1, . . . , pk): probability mass (n!/(x1! · · · xk!)) p1^(x1) · · · pk^(xk) I_S(x); E(Xi) = n pi; Var(Xi) = n pi(1−pi); here Σ_{i=1}^k pi = 1 and S = {(x1, . . . , xk) : Σ_{i=1}^k xi = n, xi ∈ {0, . . . , n} ∀i}.

Poisson, Poisson(λ): probability mass (λ^x/x!) e^(−λ) I(x ∈ {0, 1, 2, . . .}); mean λ; variance λ.

Table 2: Discrete distributions: notation and parameterization used throughout the thesis.
Chapter 1
Introduction to
completely random measures
This introductory chapter describes the leitmotiv of the thesis, that is, completely random measures. These are flexible probabilistic tools that can be exploited in a wide variety of situations when dealing with Bayesian nonparametrics: from mixture models for density estimation and clustering, after a suitable normalization of the random measure, to latent factor models, where completely random measures are considered for modeling the presence or absence of features.
The main results available from the literature and useful for the comprehension of the rest of the thesis are reviewed in this chapter.
1.1 Completely random measures
Completely random measures are elegant and mathematically tractable probabilistic tools that offer a useful framework for understanding peculiar characteristics of popular Bayesian nonparametric priors and for the construction of new models. Extensive descriptions of this subject can be found, among others, in Kingman (1993) and Kallenberg (1983).
We start with the definition of completely random measures, first introduced in Kingman (1967). Denote first by M the space of boundedly finite (positive) measures over a complete and separable metric space X endowed with the Borel σ-algebra X, i.e. for any µ in M and any bounded set A in X one has µ(A) < +∞. Moreover, we let M stand for the corresponding Borel σ-algebra on M.
Definition 1.1 (Completely random measure)
Let µ be a measurable mapping from (Ω, F, P) into (M, M), such that for any n ≥ 1 and any collection A1, . . . , An in X, with Ai ∩ Aj = ∅ for any i ≠ j, the random variables µ(A1), . . . , µ(An) are mutually independent. Then µ is termed a completely random measure (CRM).
In order to gain a better intuition about this mathematical object, it is useful to know that realizations of CRMs are discrete measures with probability 1, at least in this work. In general, a CRM can be decomposed into three independent components: a deterministic measure µdet, a countable collection of M non-negative random masses at non-random locations, and a countable collection of non-negative random masses at random locations, µc = Σ_{i≥1} Ji δθi (see Chapter 8 of Kingman (1993) for a more detailed explanation). Accordingly,

µ = µc + Σ_{i=1}^M Vi δψi + µdet

where the fixed location points ψ1, . . . , ψM, with M ∈ {1, 2, . . .} ∪ {∞}, are in X, and the random jumps V1, . . . , VM are mutually independent and independent of µc. In what follows, we will not consider the deterministic component µdet, which can be viewed as a centering measure and is not important when dealing with statistical modelling. Typically, when considering priors in the Bayesian nonparametric context, we assume a completely random measure given only by the component µc, which is easier to characterize, ignoring the component with random jumps at fixed locations. Therefore, it is assumed that a CRM is an a.s. discrete measure with random jumps and random support points.
Every random measure can be characterized through its Laplace exponent, i.e. the expectation of all linear functionals (see Kallenberg, 1983). In particular, a CRM µc is characterized by the Lévy-Khintchine representation, which states that

E[e^{−∫_X f(t) µc(dt)}] = exp(−∫_{R+×X} (1 − e^{−s f(t)}) ν(ds, dt))   (1.1)
where f : X → R is a measurable function such that ∫ |f| dµc < ∞ (almost surely) and ν is a measure on R+ × X such that for any D in X

∫_D ∫_{R+} min(s, 1) ν(ds, dt) < ∞.   (1.2)
The measure ν in (1.1) characterizing µc is referred to as the Lévy intensity of µc: it contains the information about the distribution of the jumps and locations of the random measure. Such a measure will play an important role throughout this work. Moreover, it is often useful to factorize the jump and the location part of ν by writing it as

ν(ds, dx) = ρx(ds) P0(dx)   (1.3)

where P0 is a probability measure on X and ρ is a transition kernel on X × B(R+), namely x ↦ ρx(A) is X-measurable for any A in B(R+) and ρx is a measure on (R+, B(R+)) for any x in X, called the kernel. If ρx = ρ does not depend on x, both the intensity ν and the CRM µc are called homogeneous; in this case, the sequence of jumps {Ji}i≥1 is controlled by the kernel ρ and the location points are independent of the jumps and are independent and identically distributed according to P0.
Example 1 (Gamma process)
A homogeneous CRM λ whose Lévy intensity is given by

ν(ds, dx) = κ (e^{−s}/s) ds P0(dx)

is a gamma process with parameters (κ, P0), κ > 0.
The name gamma process originates from the fact that the Laplace functional (1.1) of f = γ I_D, with γ > 0 and I_D the indicator function of a subset D, reduces to

E[e^{−γλ(D)}] = (1 + γ)^{−κP0(D)}.

Then, it is clear that the random variable λ(D), i.e. the random mass assigned to a subset D, is gamma distributed with parameters (κP0(D), 1).
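As a quick sanity check on this example, the gamma marginal of λ(D) can be verified by Monte Carlo: complete randomness makes the masses of disjoint sets independent, and each mass is gamma(κP0(D), 1) in the parameterization of Table 1. The sets D1, D2, the base measure P0 = N(0, 1), and κ = 2 below are illustrative choices, not taken from the thesis.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
kappa = 2.0

def phi(x):
    # standard normal CDF, so we can compute P0 of intervals under P0 = N(0, 1)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

P0_D1 = phi(0.0) - phi(-1.0)   # P0(D1) for D1 = (-1, 0]
P0_D2 = phi(1.0) - phi(0.0)    # P0(D2) for D2 = (0, 1], disjoint from D1

# complete randomness: masses of disjoint sets are independent gamma variables
lam_D1 = rng.gamma(shape=kappa * P0_D1, scale=1.0, size=100_000)
lam_D2 = rng.gamma(shape=kappa * P0_D2, scale=1.0, size=100_000)
lam_union = lam_D1 + lam_D2    # additivity: λ(D1 ∪ D2) = λ(D1) + λ(D2)

# mean of gamma(α, 1) is α, so sample means should be close to κ·P0(D)
print(lam_D1.mean(), kappa * P0_D1)
print(lam_union.mean(), kappa * (P0_D1 + P0_D2))
```

The union mass is again gamma with shape κ(P0(D1) + P0(D2)), as the Laplace functional predicts for any measurable set.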
Example 2 (Beta process)
A homogeneous CRM η whose Lévy intensity is

ν(ds, dx) = κ s^{−1}(1 − s)^{c−1} ds P0(dx),

with support on (0, 1], is a beta process with parameters (κ, c, P0), where κ, c > 0. This is a degenerate beta density, where κ is the mass parameter and c is called the concentration parameter. This CRM was first introduced in Hjort (1990) for survival analysis. Analogously to the previous example, one can show that, for any D ∈ X, η(D) ∼ Beta(cκP0(D), c(1 − κP0(D))).
Note that η(D) has mean equal to κP0(D) and variance κP0(D)(1 − κP0(D))/(c + 1); thus, the interpretation of c is that of a concentration parameter.
CRMs are closely connected to Poisson processes; before explaining this relationship, it is useful to recall the definition of a Poisson process.
Definition 1.2 (Poisson process)
A Poisson process Π with Lévy intensity ξ(·) on Y is a random countable subset of a separable space Y such that, denoting by N(A) = #(Π ∩ A) the cardinality of Π ∩ A for any subset A of Y:
1. for any disjoint measurable subsets A1, A2, . . . , An of Y, the random variables N(A1), N(A2), . . . , N(An) are independent;
2. N(A) has the Poisson distribution Poisson(ζ), where ζ = ξ(A), whenever 0 < ξ(A) < ∞.
Every CRM µc can be represented as a linear functional of a Poisson process Π defined on the product space Y = R+ × X with Lévy intensity as in (1.3); see Kingman (1967):

µc(A) = ∫_A ∫_{R+} s Π(ds, dx), ∀A ∈ X.

From a constructive viewpoint, the (homogeneous) measure µc = Σ_{i≥1} Ji δθi with Lévy intensity as in (1.3) can be generated by sampling the locations θi according to P0 and the jumps Ji from a Poisson process with intensity ρ(ds). An illustration is reported in Figure 1.1.
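The constructive view can be sketched numerically for the gamma process of Example 1. Since ρ(ds) = κ e^{−s}/s ds has infinite total mass, only the jumps larger than a threshold ε are kept (anticipating the ε-approximation of Chapter 2): their number is Poisson with mean ρ([ε, ∞)), their sizes are i.i.d. with density proportional to ρ on [ε, ∞), and the locations are i.i.d. from P0. The grid-based inverse-CDF sampler, the choice P0 = N(0, 1), and all parameter values below are illustrative assumptions, not the thesis' algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
kappa, eps, s_max = 1.0, 1e-3, 50.0

# grid on [ε, s_max] and Lévy density ρ(s) = κ e^{-s}/s of the gamma process
s = np.logspace(np.log10(eps), np.log10(s_max), 20_000)
rho = kappa * np.exp(-s) / s

# trapezoidal cumulative integral of ρ: total mass ≈ ρ([ε, ∞)), which is finite
widths = np.diff(s)
avg_height = 0.5 * (rho[1:] + rho[:-1])
cdf_unnorm = np.concatenate(([0.0], np.cumsum(widths * avg_height)))
mass = cdf_unnorm[-1]
cdf = cdf_unnorm / mass

n_jumps = rng.poisson(mass)                           # number of jumps above ε
jumps = np.interp(rng.uniform(size=n_jumps), cdf, s)  # inverse-CDF sampling of J_i
locs = rng.normal(size=n_jumps)                       # θ_i i.i.d. from P0 = N(0, 1)

print(n_jumps, jumps.sum())  # random number of atoms and total mass of the truncated CRM
```

The pairs (Ji, θi) are exactly the atoms of a draw from the truncated Poisson process on [ε, ∞) × X, and µc is recovered as ε → 0.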
1.2 Bayesian nonparametric models for density estimation and clustering
In this section we review some of the most popular models for density estimation and clustering. In the framework of Bayesian nonparametric statistics, indeed, these two goals can be pursued simultaneously by means of semi-parametric or nonparametric mixture models, where CRMs or, more precisely, normalized CRMs, are used as the random mixing measure.
1.2.1 Mixture models
A general tool for dening a prior on densities has been rst suggested in Lo
(1984) and Ferguson (1983). The basic idea consists of introducing a sequence of
exchangeable latent variables θnn≥1 generated according to some discrete random
probability measure. At rst, let us start recalling the notion of exchangeability.
Figure 1.1: Representation of the Lévy intensity for a beta process ν(s, x) = κs^{−1}(1 − s)^{c−1}P0(x), with P0 a standard Gaussian distribution, κ = 1 and c = 0.5 (left). A realization of a CRM with this intensity, with jumps J ∼ PP(ρ) and locations θ ∼ P0 (right).
Let (θn)n≥1 be a sequence of observations defined on a probability space (Ω, F, P), where each θi takes values in X, a complete and separable metric space endowed with a σ-algebra X (for instance, X = R^k for some positive integer k and X = B(R^k)). The typical assumption in the Bayesian approach is exchangeability of the sequence of observations. Formally, this means that for every n ≥ 1 and every permutation π of the indices 1, 2, . . . , n,

L(θ1, θ2, . . . , θn) = L(θπ(1), θπ(2), . . . , θπ(n)).
The strength of (infinite) exchangeability lies in the following theorem:
Theorem 1.1 (de Finetti's representation)
If θ1, θ2, . . . is an infinitely exchangeable sequence of variables with probability measure P, then there exists a distribution Q on PX, the set of all probability measures on X, such that the joint distribution of (θ1, θ2, . . . , θn) has the form

p(dθ1, dθ2, . . . , dθn) = ∫_{PX} ∏_{i=1}^n P(dθi) dQ(P),

where the integral is calculated over the space PX of the probability measures on X. Equivalently, one could write

θi | P iid∼ P, i = 1, 2, . . . , n,
P ∼ Q

for any n ≥ 1.
Therefore P is a random element defined on (Ω, F, P) with values in PX, and its distribution Q is the so-called de Finetti measure; it can be interpreted as the prior distribution on an infinite-dimensional object, that is, a random probability distribution.
Figure 1.2: Representation of a mixture of 4 Gaussian components.
Mixture models provide a statistical framework for modeling a collection of (exchangeable) continuous observations (X1, . . . , Xn), where each measurement is supposed to arise from one of k groups, with k possibly unknown, and each group is modeled by a kernel distribution from a suitable parametric family.
This model is usually represented hierarchically in terms of a collection of independent and identically distributed latent random variables (θ1, . . . , θn):

Xi | θi ind∼ K(·|θi), i = 1, . . . , n
θi | P iid∼ P, i = 1, . . . , n
P ∼ Q   (1.4)
where P is a discrete random probability measure, Q is its distribution (i.e. the prior) and K(·|θ) is a probability density function parametrized by the latent random variable θ; for instance, K can be a Gaussian distribution and, in that case, θ = (µ, Σ). Note that finite mixture models are recovered when P has only a finite (fixed or random) number L of random jumps at random location points, i.e. P = Σ_{l=1}^L pl δτl, where usually a Dirichlet prior is given to the vector (p1, . . . , pL) and the atoms are independent and identically distributed (i.i.d.) from a probability distribution P0. On the other hand, in the infinite mixture model case, P has a countable number of atoms, P = Σ_{j≥1} pj δτj.
Model (1.4) is equivalent to assuming that the data X1, . . . , Xn are i.i.d. according to a random probability density that is a convolution of kernel distributions:

X1, . . . , Xn | P iid∼ f(x) = ∫_Θ K(x|θ) P(dθ).   (1.5)
The randomness of P is inherited by the unknown density f: therefore, this approach allows us to put a prior on f, which is often the object of interest when tackling density estimation problems. Note that, since P is discrete, the mixture model can be written as a weighted sum of a countably infinite number of parametric densities,

f(x) = Σ_{j=1}^{+∞} pj K(x|τj),

where the weights (pj)j≥1 represent the relative frequencies of the groups in the population, indexed by the τj's.
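A minimal numerical illustration of this countable-sum representation, with a Gaussian kernel and made-up weights and atoms (truncated to three components for simplicity): the discrete P turns the kernel into a proper continuous density.

```python
import numpy as np

# illustrative (invented) mixture: weights p_j, atoms τ_j, kernel K(x|τ) = N(x | τ, 0.5²)
p = np.array([0.5, 0.3, 0.2])      # mixture weights, summing to 1
tau = np.array([-2.0, 0.0, 3.0])   # atom locations τ_j of the discrete measure P
sigma = 0.5

def f(x):
    # f(x) = Σ_j p_j K(x|τ_j): evaluate every Gaussian component, then mix
    comps = np.exp(-0.5 * ((x[:, None] - tau) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps @ p

# the mixture integrates to 1, so f is itself a probability density
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
total = (f(x) * dx).sum()
print(total)
```

Each pj is the prior probability that an observation is generated by the component with parameter τj, which is what makes the clustering interpretation below possible.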
This approach provides a flexible model for clustering items in a hierarchical setting without the need to specify the exact number of clusters in advance. Indeed, in representation (1.4), given the discreteness of the mixing measure, there can be ties among the latent variables, since P(θi = θj) > 0 for any i ≠ j.
Possible coincidences among the θi's induce a partition structure within the observations. Suppose, for instance, that there are k ≤ n distinct values θ∗1, . . . , θ∗k among θ1, . . . , θn and let Aj := {i : θi = θ∗j} for j = 1, . . . , k. According to such a definition, any two different indices i and l belong to the same group Aj if and only if θi = θl = θ∗j. Hence, the Aj's describe a clustering scheme for the (continuous) observations Xi's: any two observations Xi and Xl belong to the same cluster if and only if i, l ∈ Aj for some j. In particular, the number of distinct values θ∗i among the latent θi's identifies the number of clusters into which the n observations can be partitioned. Within the framework of nonparametric hierarchical mixture models, one might be interested in determining an estimate of the density f and in evaluating the posterior distribution of the number of clusters present in the observed sample. The most popular model of this family is the Dirichlet Process Mixture (DPM) model, where the random probability measure P is the Dirichlet process.
1.2.2 Dirichlet process mixture
The Dirichlet process is a cornerstone in Bayesian nonparametrics since its rst
introduction in Ferguson (1973). Its success can be explained by its mathematical
tractability and the ease of use when devising Markov chain Monte Carlo (MCMC)
techniques. Before describing the mixture model based on the DP, we briey review
the denition and the main properties of the process.
Definition 1.3 (Dirichlet process)
Let P0 be a distribution over Θ and κ a positive real number. A random probability measure P is a Dirichlet process with base distribution P0 and concentration parameter κ, written P ∼ DP(κ, P0), if

(P(S1), . . . , P(Sr)) ∼ Dirichlet(κP0(S1), . . . , κP0(Sr))

for every finite measurable partition S1, . . . , Sr of Θ. See Table 1 in the Introduction for the notation.
The parameters P0 and κ play intuitive roles in the definition of the DP. The base distribution is the mean of the DP: for any measurable set S, we have E(P(S)) = P0(S). On the other hand, the concentration parameter κ can be understood as an inverse variance: Var(P(S)) = P0(S)(1 − P0(S))/(κ + 1). The larger κ is, the smaller the variance, and the DP will concentrate more of its mass around the mean.
The first property of the DP to review is conjugacy. Let P ∼ DP(κ, P0) and let θ1, . . . , θn be a sequence of i.i.d. draws from P. We are interested in the posterior distribution of P given the observed values of (θ1, . . . , θn). Let nk = #{i : θi ∈ Sk} be the number of observed values in Sk. It is straightforward to show that

(P(S1), . . . , P(Sr)) | θ1, . . . , θn ∼ Dirichlet(κP0(S1) + n1, . . . , κP0(Sr) + nr),

and this relationship holds for any r and any partition of the space. By definition of the DP, the posterior of P is then a DP as well, with concentration parameter κ + n and base distribution (κP0 + Σ_{i=1}^n δθi)/(κ + n), where δx is the Dirac delta located at x, i.e.

P | θ1, . . . , θn ∼ DP(κ + n, (κP0 + Σ_{i=1}^n δθi)/(κ + n)).

Hence, the DP provides a conjugate family of priors over distributions that is closed under posterior updates given observations.
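The posterior update can be illustrated with a toy computation (all values below are invented): partition Θ = R into S1 = (−∞, 0] and S2 = (0, ∞), take a symmetric base measure so that P0(S1) = 1/2, and compute the posterior mean of P(S1).

```python
kappa = 4.0
P0_S1 = 0.5                            # P0(S1) = 1/2 for a symmetric base measure
theta = [-0.3, 0.2, 1.1, -2.0, 0.7]    # invented draws θ_1, ..., θ_n
n = len(theta)
n1 = sum(t <= 0.0 for t in theta)      # n_1 = #{i : θ_i ∈ S1}

# posterior Dirichlet parameter for S1 and the implied posterior mean of P(S1)
a1 = kappa * P0_S1 + n1
post_mean_S1 = a1 / (kappa + n)
print(post_mean_S1)
```

The posterior mean (κP0(S1) + n1)/(κ + n) is a weighted average of the prior guess P0(S1) and the empirical frequency n1/n, with κ acting as a prior sample size.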
Another property that plays a fundamental role when developing MCMC algorithms or generalizing the DP to include covariate dependence is the Blackwell-MacQueen urn scheme, which characterizes the sequence of predictive distributions of a sample from a DP. Let θ1, θ2, . . . | P iid∼ P and P ∼ DP(κ, P0). Consider the posterior predictive distribution for a new sample θn+1, conditioning on θ1, . . . , θn, i.e. when P is integrated out.
It is straightforward to prove that, for any n = 1, 2, . . . ,

θn+1 | θ1, . . . , θn ∼ (1/(κ + n)) (κP0 + Σ_{i=1}^n δθi).   (1.6)
See Blackwell and MacQueen (1973). Thus, the posterior base distribution given
θ1, . . . , θn is also the predictive distribution of a new observation θn+1.
We highlight that the predictive distribution (1.6) has point masses located at
the previous draws θ1, . . . , θn. Thus, with positive probability, a sample from the
DP will have ties, regardless of the continuity of the base measure.
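The predictive rule (1.6) translates directly into a Pólya-urn sampler: with probability κ/(κ + n) the new draw is fresh from P0, otherwise it equals one of the previous draws, picked uniformly among them. The sketch below (with κ = 1 and P0 = N(0, 1) as illustrative choices) shows that a DP sample exhibits ties despite the continuous base measure.

```python
import random

random.seed(42)
kappa = 1.0

def polya_urn(n, base=lambda: random.gauss(0.0, 1.0)):
    """Draw θ_1, ..., θ_n sequentially from the predictive rule (1.6)."""
    draws = []
    for i in range(n):  # i previous draws so far
        if random.random() < kappa / (kappa + i):
            draws.append(base())                 # new value, from the κP0 part
        else:
            draws.append(random.choice(draws))   # tie: reuse a previous draw
    return draws

sample = polya_urn(200)
n_unique = len(set(sample))
print(n_unique)  # far fewer distinct values than 200: ties have positive probability
```

The expected number of distinct values grows only logarithmically in n, a first glimpse of the rich-get-richer behavior discussed at the end of this section.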
As a last remark about the properties of the DP, we provide a constructive definition of the process, called the stick-breaking construction. Consider

βk iid∼ Beta(1, κ), τk iid∼ P0,
p1 = β1, pk = βk ∏_{i=1}^{k−1} (1 − βi), k = 2, 3, . . .
Define P = Σ_{j≥1} pj δτj; Sethuraman (1994) proved that P ∼ DP(κ, P0). The construction of the pk's can be understood metaphorically by considering βk as the length of a piece of stick. Starting with a stick of length 1, we break it at β1, assigning p1 to be the length of the piece we just broke off. Now recursively break the remaining portion to obtain p2, p3 and so forth. This simple construction has helped in building posterior inference, as well as in defining extensions of the DP.
A Dirichlet process mixture (DPM) prior (Antoniak, 1974) is (1.5) when P ∼ DP(κ, P0), namely

X_i | P iid∼ f(X_i) = ∫ K(X_i | θ) P(dθ), P ∼ DP(κ, P0). (1.7)

For example, in a univariate setting, a DP location mixture of normals is

X_i | P ∼ ∫_R N(X_i | µ, σ²) P(dµ), P ∼ DP(κ, P0),

where P0 is a base measure defined on R. Working with an infinite number of components is particularly appealing because it ensures that, for appropriate choices of the kernel K(X_i | θ), the DPM model has support on a large class of distributions. For example, Lo (1984) showed that a DP location-scale mixture of normals has full support on the space of absolutely continuous distributions.
We can further rewrite model (1.7) in terms of clusters and unique values of the latent parameters by defining a vector (s1, s2, . . . , sn) of labels indicating to which cluster item i belongs, i.e. s_i = j ⇔ θ_i = θ*_j:

X_i | θ*_j, s_i = j ind∼ K(x_i | θ*_j), i = 1, . . . , n

θ*_j iid∼ P0, j = 1, . . . , K_n

p(s1, . . . , sn) = Γ(κ)/Γ(κ + n) κ^{K_n} Γ(n_1) × · · · × Γ(n_{K_n}) (1.8)

where K_n is the number of distinct values among s1, . . . , sn and n_j = ∑_i I(s_i = j) is the number of indicators s_i equal to j.
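The prior (1.8) is easy to evaluate. The sketch below is our own helper (computed in log scale for numerical stability); with κ = 1 a single cluster of n items has probability Γ(n)/Γ(n + 1) = 1/n, which the example checks for n = 4.

```python
import math
from collections import Counter

def dp_partition_log_prob(labels, kappa):
    """Log of the DP-induced prior (1.8) on the label vector (s_1..s_n):
    log[ Gamma(kappa)/Gamma(kappa+n) * kappa^{K_n} * prod_j Gamma(n_j) ]."""
    n = len(labels)
    sizes = Counter(labels).values()   # cluster cardinalities n_j
    return (math.lgamma(kappa) - math.lgamma(kappa + n)
            + len(sizes) * math.log(kappa)
            + sum(math.lgamma(nj) for nj in sizes))

# kappa = 1, one cluster of size 4: probability Gamma(4)/Gamma(5) = 1/4
p_one_cluster = math.exp(dp_partition_log_prob([1, 1, 1, 1], kappa=1.0))
```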
Previous work on nonparametric Bayesian clustering has paid some attention to the rich-get-richer property implicitly imposed a priori by the Dirichlet process (see, e.g., Wallach et al., 2010). This property leads to partitions consisting of a small number of large clusters: new observations are more likely to join already-large clusters. This is clear from representation (1.8), where the probability of joining an already
existing cluster is proportional to the cardinality of that cluster. Although the rich-get-richer cluster usage may be appropriate for some clustering applications, there are others for which it is undesirable. Thus, there is a need for alternative priors in clustering models. In what follows, we introduce a more general class of mixture models that includes the DPM and partially alleviates the rich-get-richer behavior; with this aim, we consider mixtures with mixing measure given by normalized completely random measures (NormCRMs). As we will see in the next section, NormCRMs are very flexible but still mathematically and computationally tractable, making them a good choice for P in the mixture models.
Another issue when dealing with DPM models is how to manage the presence of covariates: indeed, traditional nonparametric priors such as the Dirichlet process assume that observations are exchangeable. Exchangeability is not a reasonable assumption in every context. For example, in time series or spatial data, we often see correlations between observations that occur close in time or space. Many datasets contain these types of covariate information; we wish to include these covariates as deterministic variables to condition on, in order to exploit this additional information and improve model performance. Typically, it is assumed that the members of these collections of priors are associated with values in some covariate space - usually a metric space representing time or geographic location - and that locations that are close in covariate space tend to generate similar structures. Dependent nonparametric processes have been used in a wide variety of applications. Examples include image processing (see e.g. the R package dpmixsim, da Silva and da Silva, 2012), text analysis (e.g. Blei et al., 2010) and finance, where they are used to construct stochastic volatility models (Delatola and Griffin, 2011, among others). If we wish to model data that depend on some covariate, it makes sense to build a collection of correlated processes (MacEachern, 2000). The goal is thus to induce dependence between random measures, both in terms of the locations and the jumps.
1.2.3 Normalized CRM
The Dirichlet process on a complete and separable metric space X can also be obtained by normalizing the jumps of a gamma CRM µ with parameter (κ, P0) as described in Example 1: the random probability measure Q = µ/µ(X) has the same distribution as the Dirichlet process on X with parameter (κ, P0). Therefore, one might wonder whether a full Bayesian analysis can be performed if, in the above normalization, the gamma process is replaced by any CRM with a generic Lévy intensity as in (1.3). From a Bayesian perspective, the idea of normalization first appeared in Regazzini et al. (2003). A definition stated in terms of CRMs is as follows.
Definition 1.4 (Normalized CRM)
Let µ be a CRM with intensity measure ν such that 0 < µ(X) < ∞ almost surely. Then, the random probability measure

Q = µ / µ(X)

is called a normalized completely random measure on (X, X).

The requirement of a positive and finite total mass µ(X) is satisfied if the corresponding intensity ν = ρ_x P0 (in the non-homogeneous case) is such that

∫_{R+} ρ_x(ds) = +∞, for all x. (1.9)

This means that the jumps of the process form a dense set in (0, +∞) and there are infinitely many masses near the origin: indeed, according to (1.9) the total mass of ρ_x must sum up to +∞, while the second condition in (1.2) forces ρ_x to place this mass near the origin. In this case µ is also called an infinite activity process.
It is important to remark that, apart from the Dirichlet process, NormCRMs are not structurally conjugate. Nonetheless, one can still provide a posterior characterization of NormCRMs in the form of a mixture representation. In the sequel, we will focus on NormCRMs whose underlying Lévy intensity has a non-atomic centering measure P0. It is then useful to introduce an auxiliary variable whose density, conditionally on the sample, can be expressed as

p_X(u) ∝ u^{n−1} e^{−ψ(u)} ∏_{j=1}^k ∫_{R+} s^{n_j} e^{−us} ρ(ds)

where ψ is the Laplace exponent, namely

ψ(u) = ∫_{R+} (1 − e^{−us}) ρ(ds)

(in the homogeneous case). Then, if data are from model (1.4), where Q is the probability distribution of a NormCRM, the posterior of P is still a NormCRM with some fixed location points corresponding to the locations of the observations, conditioning on the auxiliary variable u that we introduced. Details and theorems are given in James et al. (2009). This can be considered a sort of conditional conjugacy property, which makes computation simpler: in fact, when building a Gibbs sampler for the posterior of (1.4), the full conditionals are relatively easy to derive thanks to the presence of u.
In the rest of the thesis, by P ∼ NormCRM(ρ, P0) we denote a homogeneous normalized completely random measure with intensity ν(ds, dx) = ρ(ds)P0(dx), where P0 is a probability measure and ρ(·) is the Lévy intensity controlling the jumps, such that the following two conditions hold:

∫_{R+} min(s, 1) ρ(ds) < +∞ and ∫_{R+} ρ(ds) = +∞.
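As a quick numerical sanity check (our own sketch, not part of the thesis code), one can verify the Laplace exponent ψ(u) = ∫ (1 − e^{−us}) ρ(ds) against the closed form κ log(1 + u) that holds for the gamma CRM of Example 1, whose intensity is ρ(ds) = κ s^{−1} e^{−s} ds. The grid bounds and step count below are arbitrary choices; a log-spaced midpoint rule handles the blow-up of ρ at the origin, where the integrand (1 − e^{−us})ρ(s) nevertheless stays integrable.

```python
import math

def laplace_exponent(u, rho_density, s_min=1e-8, s_max=50.0, steps=100000):
    """Approximate psi(u) = int_0^inf (1 - e^{-u s}) rho(s) ds by a
    midpoint rule on a log-spaced grid (ds = s d(log s))."""
    log_a, log_b = math.log(s_min), math.log(s_max)
    h = (log_b - log_a) / steps
    total = 0.0
    for i in range(steps):
        s = math.exp(log_a + (i + 0.5) * h)
        total += (1.0 - math.exp(-u * s)) * rho_density(s) * s * h
    return total

kappa = 2.0
gamma_rho = lambda s: kappa * math.exp(-s) / s   # gamma CRM intensity
psi_numeric = laplace_exponent(1.5, gamma_rho)
psi_exact = kappa * math.log(1.0 + 1.5)          # known closed form kappa*log(1+u)
```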
1.2.4 Exchangeable partition probability functions and product partition models

Infinite mixture models with NormCRMs as mixing measure are a flexible tool for both density estimation and clustering; however, if the statistical focus is on clustering, we may consider the same model under a different parametrization, starting from the so-called exchangeable partition probability function driving the prior induced on the random partition.
We have already mentioned how the discreteness of P in model (1.4) implies that there might be ties among the latent variables. Correspondingly, define ρn to be a random partition of the integers {1, . . . , n} such that any two integers i and j belong to the same set in ρn if and only if θ_i = θ_j. Observe that ρn is random since (θ1, . . . , θn) is. Let k ∈ {1, . . . , n} and suppose {A1, . . . , Ak} is a partition of {1, . . . , n}, that is, a possible realization of ρn. We already saw in model (1.8) that we can express the DPM through the parameter ρn. In this case, the prior induced on ρn depends only on the cardinalities of the groups, n1, . . . , nk. More generally, under the assumption of exchangeability, a common specification for the probability distribution of ρn consists in assuming that it depends only on the frequencies of each set in the partition, namely it is a function of (n1, . . . , nk) ∈ Π_{n,k} as follows:

p(ρn = {A1, . . . , Ak}) = π^{(n)}_k (n1, . . . , nk) (1.10)

where

Π_{n,k} = {(n1, . . . , nk) : n_i ≥ 1, ∑_{j=1}^k n_j = n}.
Then {π^{(n)}_k : 1 ≤ k ≤ n, n ≥ 1}, with π^{(n)}_k defined in (1.10), is termed the exchangeable partition probability function (EPPF).

The EPPF determines the distribution of a random partition of N. It follows that, for any 1 ≤ k ≤ n and any (n1, . . . , nk) ∈ Π_{n,k}, π^{(n)}_k is a symmetric function of its arguments and satisfies the marginal invariance rule

π^{(n)}_k (n1, . . . , nk) = π^{(n+1)}_{k+1} (n1, . . . , nk, 1) + ∑_{j=1}^k π^{(n+1)}_k (n1, . . . , n_j + 1, . . . , nk). (1.11)

Conversely, as shown in Pitman (1996), every non-negative symmetric function satisfying rule (1.11) is the EPPF of some exchangeable sequence. See Pitman (1996) for a thorough analysis of EPPFs.
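The addition rule (1.11) can be checked numerically for a concrete EPPF. Below we use the DP EPPF π^{(n)}_k(n1, . . . , nk) = κ^k ∏_j (n_j − 1)! Γ(κ)/Γ(κ + n), which is the EPPF implicit in (1.8); the function names and the test configuration are our own.

```python
import math

def dp_eppf(sizes, kappa):
    """EPPF of the Dirichlet process evaluated at cluster sizes (n_1..n_k):
    kappa^k * prod_j (n_j - 1)! * Gamma(kappa) / Gamma(kappa + n)."""
    n = sum(sizes)
    log_p = (len(sizes) * math.log(kappa)
             + sum(math.lgamma(nj) for nj in sizes)
             + math.lgamma(kappa) - math.lgamma(kappa + n))
    return math.exp(log_p)

def addition_rule_rhs(sizes, kappa):
    """Right-hand side of the marginal invariance rule (1.11)."""
    rhs = dp_eppf(sizes + [1], kappa)                      # item n+1 opens a new block
    for j in range(len(sizes)):
        bumped = sizes[:j] + [sizes[j] + 1] + sizes[j+1:]  # item n+1 joins block j
        rhs += dp_eppf(bumped, kappa)
    return rhs

sizes = [3, 1, 2]
lhs = dp_eppf(sizes, kappa=0.7)
rhs = addition_rule_rhs(sizes, kappa=0.7)
```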
A different perspective on priors for random partitions of exchangeable sequences is given by the product partition model (PPM), proposed by Hartigan (1990) and Barry and Hartigan (1993) and popularized in the BNP literature by Quintana and Iglesias (2003). The PPM explicitly defines a probability distribution p(ρn) over partitions by using a non-negative function of A_j, the collection of indices in {1, . . . , n} of data assigned to cluster j; this function is denoted by c(A_j) and is known as the cohesion function. A product partition probability is defined as

p(ρn = {A1, . . . , Ak}) = C ∏_{j=1}^k c(A_j) (1.12)

where C is a suitable normalizing constant, depending on the number of clusters k. Conditional on a given partition, the PPM typically assumes independent sampling across clusters for data X1, . . . , Xn, i.e.

p(X1, . . . , Xn | ρn, θ*_j, j = 1, 2, . . .) = ∏_j p(X*_j | θ*_j) (1.13)

where the θ*_j are cluster-specific parameters and X*_j is the collection of observations belonging to the j-th group. Applications of the PPM often use exchangeability of the X_i across i ∈ A_j by assuming that the X_i, i ∈ A_j, are i.i.d. given θ*_j. One of the appealing characteristics of the PPM is its conjugate nature: the posterior p(ρn | X1, . . . , Xn) is again a product partition model, with updated cohesion functions c(A_j)p(X*_j), where p(X*_j) is the marginal law of the X_i, i ∈ A_j, under partition ρn.
Exchangeable infinite mixture models where the mixing measure is a NormCRM, namely

X_i | θ_i ind∼ f(X_i | θ_i), θ_i | P iid∼ P, P ∼ NormCRM(ρ, P0), (1.14)

can be written as product partition models with likelihood (1.13) and prior on the partition (1.12) through a change of parametrization. In fact, as in the DPM case, if the random probability measure P in model (1.14) is not of interest, it may be marginalized out, yielding:

X_i | {θ*_j}_{j=1}^k, s_i ind∼ f(X_i | θ*_{s_i}), θ*_j iid∼ P0, ρn = (s1, . . . , sn) ∼ p(ρn) (1.15)

where p(ρn) is the induced prior distribution of the partition ρn (see James et al., 2009). Note that we assume that data are i.i.d. within clusters and independent across clusters. In the case of Gibbs-type models (see e.g. De Blasi et al., 2015), as for instance when P in (1.4) is the normalized generalized gamma process (Regazzini et al., 2003), the prior on the partition in (1.15) can be seen as an (exchangeable) product partition model, namely

p(ρn = {A1, . . . , A_{kn}}) = C ∏_{j=1}^{kn} c(A_j),
where C is the normalizing constant. The relationship between exchangeable infinite mixture models and product partition models emphasizes the marginal invariance of the implied sequence of partition distributions with increasing sample size. Loosely speaking, the probability distribution for partitions of {1, 2, . . . , n} is the same as the distribution obtained by marginalizing out item (n + 1) from the probability distribution for partitions of the (n + 1) items {1, 2, . . . , n + 1} (see, for instance, Section 2.4 of Dahl et al., 2017).

Now, we specify the cohesion function for two simple cases of mixture models (1.4). First, examining the Blackwell-MacQueen urn scheme implied by the DP, it is clear that the cohesion function in this case is c(A_j) = κ(n_j − 1)!, where n_j = |A_j| denotes the cardinality of the j-th cluster.
The second special case is given by the normalized generalized gamma process, denoted by NGG(κ, σ, P0). The Lévy intensity of the jumps is

ρ(ds) = κ s^{−1−σ} e^{−s} ds, s > 0,

where κ is a mass parameter and σ a discount parameter, σ ∈ [0, 1). The cohesion function becomes c(A_j) = (1 − σ)_{n_j − 1}, where (α)_n is the Pochhammer symbol, or rising factorial, defined as (α)_n = Γ(α + n)/Γ(α). It is clear that the parameter σ has a deep influence on the clustering behavior. For a thorough discussion of this topic, see Lijoi et al. (2007).
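A quick computation (our own sketch; the helper name and the test values are illustrative) shows how the discount σ tempers the rich-get-richer effect: the cohesion of a large cluster relative to a singleton shrinks as σ grows, so large clusters attract new items less aggressively than under the DP (σ = 0).

```python
import math

def ngg_cohesion(n_j, sigma):
    """NGG cohesion c(A_j) = (1 - sigma)_{n_j - 1}, with the rising
    factorial (alpha)_m = Gamma(alpha + m) / Gamma(alpha); sigma = 0
    recovers the DP cohesion (n_j - 1)! up to the mass parameter."""
    return math.exp(math.lgamma(1.0 - sigma + n_j - 1) - math.lgamma(1.0 - sigma))

# cohesion of a size-10 cluster relative to a singleton, for growing sigma:
# the larger the discount, the weaker the rich-get-richer effect
ratios = [ngg_cohesion(10, s) / ngg_cohesion(1, s) for s in (0.0, 0.3, 0.6)]
```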
As a further step, it is often the case that we want to include some covariate information in the model: for instance, dependence on time, space or subject-specific information. From a modeling viewpoint, it is natural to assume that observations with similar covariates (i.e. close in time or space) are a priori more likely to be clustered together; however, including this behavior in the prior is not straightforward, and for this reason the recent literature is rich in models generalizing to covariate-dependent clustering. Among others, Müller et al. (2011) proposed the PPMx. We start from this model to develop our contributions in Chapter 5.
1.2.5 Marginal representation for NormCRM
When building an MCMC algorithm for posterior sampling (see, for instance, Favaro and Teh, 2013) or when generalizing the model to include some covariate information (Dahl et al., 2017; Müller et al., 2011), it may be useful to consider representation (1.15). Consider a sample θ1, . . . , θn from a NormCRM G, i.e., θ1, . . . , θn | G iid∼ G and G ∼ NormCRM(ρ, P0). The most common example of a marginal process comes from the Dirichlet process (DP). In this case, we have already mentioned in Section 1.2.2 that, if we marginalize the infinite-dimensional parameter G out, then the law of the sample θ1, . . . , θn is uniquely characterized in terms of the random partition and the distinct values. In particular, the joint law of partition and unique values is

L(A1, . . . , Ak, dθ*_1, . . . , dθ*_k) = π^{(n)}_k (n1, . . . , nk) ∏_{j=1}^k P0(dθ*_j), (1.16)

where π^{(n)}_k is the EPPF. The decomposition (1.16) sheds light on the law of a sample from a NormCRM: it factorizes into two parts, the law of the partition ρn and the law of the cluster-specific parameters θ*_1, . . . , θ*_k. The first factor, namely the EPPF, depends only on the Lévy intensity ρ, while the second (conditionally on the number of unique values k) is a product over the centering measure P0.
If we want to draw a sample from model (1.16), we first need to sample a random partition ρn of the data; then, for each of the resulting k clusters, we need a cluster-specific parameter θ*_j, j = 1, . . . , k, sampled i.i.d. from P0. In order to sample a random partition ρn with distribution given by the EPPF π^{(n)}_k, one can use a generalization of the Chinese restaurant process metaphor (Pitman, 2006). The metaphor consists in considering a Chinese restaurant with an (ideally) infinite number of tables, each with infinite capacity. Customer 1 sits at the first table. After n = 1, 2, . . . customers have entered and are seated in the restaurant, there are k occupied tables: the next customer, n + 1, will sit at a new table with probability

π^{(n+1)}_{k+1}(n1, . . . , nk, 1) / π^{(n)}_k(n1, . . . , nk),

while she/he will sit at the j-th occupied table with probability

π^{(n+1)}_k(n1, . . . , n_j + 1, . . . , nk) / π^{(n)}_k(n1, . . . , nk), j = 1, . . . , k.

The resulting sequence is exchangeable, meaning that the order in which the customers sit does not affect the probability of the final configuration.
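The generalized restaurant scheme above can be implemented directly from the EPPF ratios. The sketch below (our own code) instantiates it with the DP EPPF, for which the ratios collapse to the classical probabilities n_j/(κ + n) and κ/(κ + n), but any EPPF with the same signature could be substituted.

```python
import math, random

def dp_eppf_log(sizes, kappa):
    """log EPPF of the DP: k*log(kappa) + sum_j log((n_j - 1)!)
    + log Gamma(kappa) - log Gamma(kappa + n)."""
    n = sum(sizes)
    return (len(sizes) * math.log(kappa)
            + sum(math.lgamma(m) for m in sizes)
            + math.lgamma(kappa) - math.lgamma(kappa + n))

def crp_from_eppf(n, kappa, seed=1):
    """Seat n customers using the predictive ratios pi^{(m+1)}/pi^{(m)}
    of the EPPF; returns the vector of table sizes (n_1, ..., n_k)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        base = dp_eppf_log(sizes, kappa)
        options = [sizes + [1]]                                   # open a new table
        options += [sizes[:j] + [sizes[j] + 1] + sizes[j + 1:]    # join table j
                    for j in range(len(sizes))]
        weights = [math.exp(dp_eppf_log(o, kappa) - base) for o in options]
        sizes = rng.choices(options, weights=weights)[0]
    return sizes

tables = crp_from_eppf(50, kappa=1.0)
```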
1.3 Generalized latent trait models
In multivariate analysis, a fruitful approach to the study of high-dimensional data is the introduction of latent variables; dependence among the observations is thus induced by the relationship between the observations and the latent variables themselves. Such latent models can be used, for instance, as tools for modelling the covariance structure. Under a Gaussian assumption for the observations, this boils down to principal component analysis and factor analysis, well-known statistical tools used to identify low-dimensional structure in the data (see Lawrence, 2005; Tipping and Bishop, 1999). Recently, such models have been extended to allow for a unified estimation method for mixed (continuous, binary and categorical) variables. These extensions rely on the exponential family of distributions. They date back to the work of Moustaki and Knott (2000) and are referred to as generalized latent trait models. See also Dunson (2000, 2003) for further developments.
For the sake of clarity, in the following we describe the two main ingredients needed to build a generalized latent trait model as introduced in Moustaki and Knott (2000). This model will be generalized in Chapter 5 to account for time dependence. Let X_i = (X_{i1}, . . . , X_{iq}) be the i-th q-dimensional response variable. Assume the following conditions:

1. Each component X_{il}, l = 1, . . . , q, is independent of the others and is assumed to have a distribution from the following class, related to the well-known exponential family (see McCulloch and Neuhaus, 2001), i.e. its density has the analytical form

K(x_{il}; θ_{il}, τ_l) = κ_l(x_{il}; τ_l) exp(τ_l(⟨θ_{il}, x_{il}⟩ − A(θ_{il})))
= exp(τ_l(⟨x_{il}, θ_{il}⟩ − A(θ_{il})) + c_l(x_{il}, τ_l)), (1.17)

where

µ_{il} = E(X_{il} | θ_{il}, τ_l) = dA(θ_{il})/dθ_{il},

ν_{il} = Var(X_{il} | θ_{il}, τ_l) = (1/τ_l) d²A(θ_{il})/dθ²_{il},

and τ_l is a scalar dispersion (or nuisance) parameter. For any choice of the dispersion parameter, density (1.17) forms an exponential family with parameter θ_{il}.
2. The canonical parameter θ_{il} is related to covariates and latent variables through the generalized linear model

g_l(µ_{il}(θ_{il})) = η_{il} = ∑_{h=1}^s β_h u_{ih} + ∑_{j=1}^p λ*_j ζ*_{ij} (1.18)

where g_l is a monotonic differentiable link function, η_{il} is the linear predictor and (u_{i1}, . . . , u_{is}) is the vector of available covariates for individual i. The name generalized latent trait model is justified by the following interpretation: λ*_1, . . . , λ*_p is a collection of factor loadings and ζ*_{i1}, . . . , ζ*_{ip} is a vector of scores. The latent variables ζ*_{ij} represent latent traits of individuals, accounting for item-specific response tendencies.
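The moment identities in point 1 are easy to verify numerically. The sketch below is our own (the function name is an assumption): writing the Poisson pmf in form (1.17) with τ_l = 1 gives A(θ) = e^θ, and both the mean and the variance computed by direct summation should equal dA/dθ = d²A/dθ² = e^θ.

```python
import math

def poisson_pmf_canonical(x, theta):
    """Poisson pmf written in the canonical form (1.17) with tau_l = 1:
    K(x; theta) = exp(theta * x - A(theta) + c(x)),
    with A(theta) = e^theta and c(x) = -log(x!)."""
    return math.exp(theta * x - math.exp(theta) - math.lgamma(x + 1))

theta = 0.4
probs = [poisson_pmf_canonical(x, theta) for x in range(200)]
mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
# both should match dA/dtheta = d^2A/dtheta^2 = e^theta
```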
For instance, suppose we observe binary values: then it is natural to model the variables X_{il} with a Bernoulli distribution with expected value π_{il}. The link function may be the logit transformation, that is,

g(π_{il}) = logit(π_{il}) = log(π_{il} / (1 − π_{il})).
On the other hand, when data are counts, the X_{il} are assumed to be Poisson distributed with mean µ_{il}; in this case the link function is the logarithm, g(µ_{il}) = log(µ_{il}). Finally, for continuous observations one assumes a Gaussian distribution: in this case the link function is the identity, recovering the standard linear factor model.
Before moving to the nonparametric generalization, in this work we assume s = 0 in (1.18), i.e. we do not consider the contribution of observable covariates but only the term given by the latent variables: g_l(µ_{il}) = η_{il} = ∑_{j=1}^p λ*_j ζ*_{ij}.
1.3.1 A nonparametric prior for the factors
In the latent variable context, and in particular in generalized latent trait models, choosing the number of factors p is not an easy task and there is no clear strategy to fix this number: see for example Lopes and West (2004) for a thorough discussion. In the Bayesian nonparametric literature, however, the restriction imposed by a fixed number of latent factors p is overcome by letting p (ideally) be infinite. In this context, completely random measures play an important role in defining flexible models.

Following the approach and notation of Broderick et al. (2017), we consider variables coming from Bayesian nonparametric models as being composed of two parts: (i) a collection of traits and the corresponding frequencies or rates and (ii), for each data point, an allocation to different traits. Both parts can be expressed as random measures. Each trait is represented by a point λ in some (Polish) space Ψ of traits. Furthermore, let J_k be the frequency (or rate, in the case of a normalized measure) of the trait represented by λ_k, where k ≥ 1 indexes the countably many traits. In particular, J_k ∈ R+. Then (J_k, λ_k) is a couple consisting of the k-th trait and its frequency. We can represent the full collection of trait/frequency couples by a discrete measure on Ψ that places weight J_k at location λ_k:

G = ∑_{k≥1} J_k δ_{λ_k}.
Next, for each individual i we consider a random measure Θ_i whose distribution depends on G. The measure Θ_i consists of a sum over the traits to which the i-th individual is allocated, together with the degree to which the individual is allocated to each such trait. That is, Θ_i is a discrete measure whose support coincides with the support of G, i.e. {λ_1, λ_2, . . .}, and

Θ_i = ∑_{k≥1} ζ_{ik} δ_{λ_k}, (1.19)

where ζ_{ik} ∈ R+ represents the degree to which the data point belongs to trait λ_k.
Summing up, to characterize the latent structure of n individuals, the following model is assumed:

Θ_1, . . . , Θ_n | G iid∼ L(dΘ | G) (1.20)

G ∼ CRM(ν) (1.21)

where ν := ν(ds, dψ) = ρ(ds)P0(dψ) is the Lévy intensity of the CRM as defined in Section 1.1. However, we still need to fully describe L(dΘ | G), the law of the trait measure given G in (1.21); conditionally on G, the measure Θ_i is a CRM with only a fixed-location component. In particular, the locations of Θ_i are the same as those of G, as in (1.19): ζ_{ik} is drawn according to some distribution H that takes J_k, the weight of G at location λ_k, as a parameter, namely

ζ_{ik} ∼ H(· | J_k) independently across i = 1, . . . , n and k ≥ 1.

Note that while every atom of Θ_i is located at an atom of G, it is not necessarily the case that every atom of G has a corresponding atom in Θ_i. In particular, if ζ_{ik} takes value zero, there is no atom in Θ_i at λ_k.

As far as the likelihood is concerned, each data point can be allocated only to a finite number of traits. Thus, we assume that the number of nonzero weights in every Θ_i is finite. This is achieved by taking H(dζ | J) to be a discrete distribution with support N_0 = {0, 1, 2, . . .}, for any J. We denote by h(ζ | J) the probability mass function of ζ given J. This guarantees that each data point is associated with a subset of the possible latent variables, namely the traits, which we refer to as the latent features of that data point.

Moreover, note that, by construction, the pairs {(J_k, λ_k)}_{k≥1} form a marked Poisson point process with rate measure µ_mark(ds × dx) := ρ(ds)h(x | s), so we assume

∑_{x=1}^∞ ν_x(R+) < +∞ for ν_x(ds) := ρ(ds)h(x | s)

in order to have a finite number of latent variables generating a data point.
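To make the two-layer construction (1.20)-(1.21) concrete, the following sketch simulates a toy finite-activity instance, with ρ(ds) = γ e^{−s} ds and allocation kernel h(· | J) = Poisson(J). This finite-activity choice is an assumption made purely for illustration (the infinite-activity CRMs of the text would require a truncation of the jumps, as in Chapter 2); all names are ours, and the Poisson sampler uses CDF inversion to stay within the standard library.

```python
import math, random

def poisson_draw(rng, lam):
    """Poisson(lam) sample by CDF inversion."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_trait_allocations(n, gamma_mass=5.0, seed=0):
    """Toy finite-activity instance of (1.20)-(1.21):
    rho(ds) = gamma_mass * e^{-s} ds, so the CRM G has
    K ~ Poisson(gamma_mass) atoms with weights J_k ~ Exp(1);
    each individual's allocation counts are zeta_ik ~ Poisson(J_k),
    as in (1.19). Returns the jumps and the n x K count array."""
    rng = random.Random(seed)
    jumps = [rng.expovariate(1.0) for _ in range(poisson_draw(rng, gamma_mass))]
    zeta = [[poisson_draw(rng, J) for J in jumps] for _ in range(n)]
    return jumps, zeta

jumps, zeta = sample_trait_allocations(n=4)
```

Note how atoms of G with ζ_{ik} = 0 simply do not appear in Θ_i, mirroring the remark above.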
1.3.2 Marginal representation
In the Bayesian nonparametric literature we often come across processes that are actually versions of model (1.20)-(1.21) where G has been integrated out: they are, in fact, more interpretable and usually lead to easier MCMC algorithms when sampling from the posterior.

In order to define a generalized latent trait model, we need a joint prior for the vector of scores ζ*_i := (ζ*_{i1}, . . . , ζ*_{ip}), for i = 1, . . . , n, the vector of factor loadings λ* := (λ*_1, . . . , λ*_p) and the parameter p introduced in Section 1.3, which are finite-dimensional objects. On the other hand, in equations (1.20)-(1.21) we introduced a model for traits and frequencies involving infinite-dimensional mathematical objects. First, we recall a marginal characterization of a sample Θ_1, . . . , Θ_n from model (1.20)-(1.21), provided in Theorem 6.1 of Broderick et al. (2017). The marginal distribution of Θ_1, . . . , Θ_n is described by the following construction. For each i = 1, 2, . . . , n,

1. let {λ*_k}_{k=1}^{q_{i−1}} be the union of the atom locations in Θ_1, . . . , Θ_{i−1}, and let ζ*_{j,k} := Θ_j({λ*_k}). Let ζ*_{i,k} denote the weight of Θ_i | Θ_1, . . . , Θ_{i−1} at λ*_k. Then ζ*_{i,k} has distribution described by the probability mass function

h(ζ*_{i,k} = ζ | ζ*_{1,k}, . . . , ζ*_{i−1,k}) = [∫ ρ(dθ) h(ζ | θ) ∏_{m=1}^{i−1} h(ζ*_{m,k} | θ)] / [∫ ρ(dθ) ∏_{m=1}^{i−1} h(ζ*_{m,k} | θ)]; (1.22)

2. for each ζ = 1, 2, . . . , Θ_i has ρ_{i,ζ} new atoms whose weight is ζ, where

ρ_{i,ζ} ind∼ Poisson(∫ ρ(dθ) h(0 | θ)^{i−1} h(ζ | θ)), independently across i, ζ. (1.23)

Moreover, these atoms are located at {λ_{i,ζ,j}}_{j=1}^{ρ_{i,ζ}}, where

λ_{i,ζ,j} iid∼ P0(dλ), independently across i, ζ, j. (1.24)
Henceforth, consider a sample Θ_1, . . . , Θ_n from (1.20)-(1.21): our aim is to show that the sample can be summarized by an n-dimensional array of scores ζ*_1, . . . , ζ*_n and a vector of traits λ*; then, we see how points 1. and 2. above also characterize the marginal law of these objects.

First, we observe that from Theorem 5.1 of Broderick et al. (2017) the marginal law of Θ_1 can be expressed as follows: for each ζ ∈ N, there are ρ_{1,ζ} atoms of Θ_1 with weight ζ, where

ρ_{1,ζ} ind∼ Poisson(∫ ρ(dθ) h(ζ | θ)), independently across ζ.

These atoms have locations {λ_{1,ζ,j}}_{j=1}^{ρ_{1,ζ}}, where λ_{1,ζ,j} iid∼ P0 across ζ, j. Thanks to the independence among the variables in the construction above, we can first let p_1 := ∑_{ζ=1}^∞ ρ_{1,ζ}, and let λ*_1 = {λ*_k}_{k=1}^{p_1} be the disjoint (by assumption) union of the {λ_{1,ζ,j}}_{j=1}^{ρ_{1,ζ}}. Note that p_1 is finite by Assumption A2. Finally, let ζ*_1 = {ζ*_{1,k} := Θ_1({λ*_k})}_{k=1}^{p_1}. It is clear that Θ_1 may be represented by the pair (ζ*_1, λ*_1). Moreover, we can fix the sampling order and assume that λ*_1 =: (λ*_1, . . . , λ*_{p_1}) and ζ*_1 := (ζ*_{1,1}, . . . , ζ*_{1,p_1}). We proceed by induction: for n = 2, 3, . . . , consider a sample (ζ*_1, λ*_1), . . . , (ζ*_{n−1}, λ*_{n−1}), generated as described above. Thanks to points 1. and 2. above, we can characterize the law of (ζ*_n, λ*_n) | (ζ*_1, λ*_1), . . . , (ζ*_{n−1}, λ*_{n−1}). The two vectors ζ*_n and λ*_n have length p_n = p_{n−1} + p*_n, where p_{n−1} is the length of ζ*_{n−1} and p*_n = ∑_{ζ=1}^∞ ρ_{n,ζ} (see (1.23)). The first p_{n−1} entries of ζ*_n are distributed as described in equation (1.22), while the first p_{n−1} entries of λ*_n are equal to λ*_{n−1} (thinning). In addition, the last p*_n entries of the vectors ζ*_n and λ*_n are filled according to equations (1.23) and (1.24), following the sampling order (we call this second part innovation). For ease of notation, we
will assume that all the vectors ζ*_i have the same length p = p_n, with the proviso that ζ*_{i,j} = 0 for j > p_i; moreover, we will let λ* = λ*_n. A metaphor that generalizes the well-known Indian Buffet process can be formulated to describe the marginal law of the ζ*'s; accordingly, one can employ the notation ζ*_1, . . . , ζ*_n ∼ GIB(ν), where GIB stands for generalized Indian Buffet. Consider an Indian buffet, namely a buffet with infinitely many dishes; differently from the usual construction, assume now that customers can take as many portions of each dish as they want. The first customer enters the restaurant and takes 1 portion of ρ_{1,1} dishes, 2 portions of ρ_{1,2} dishes, . . . , ζ portions of ρ_{1,ζ} dishes, and so on. At the end, the first customer will have chosen p_1 dishes, and the vector ζ*_1 = (ζ*_{1,1}, . . . , ζ*_{1,p_1}) reports how many portions of each dish, labelled as λ* = (λ*_1, . . . , λ*_{p_1}), she/he has chosen. Recursively, the n-th customer chooses dishes and numbers of portions in two steps: first, for each dish k = 1, . . . , p_{n−1} already chosen by the previous customers, she/he takes ζ*_{n,k} portions (possibly 0); then she/he takes 1 portion of ρ_{n,1} new dishes, 2 portions of ρ_{n,2} new dishes, . . . , ζ portions of ρ_{n,ζ} new dishes, and so on.
This marginal representation sheds light on a different characteristic of the nonparametric prior on the latent factors that we are considering: it induces a feature model, in the same way the NormCRM induces a model for clustering, i.e. a prior for the random partition ρn. More formally, a feature allocation model f_n of [n] := {1, . . . , n} is a multiset of non-empty subsets of [n], called features, such that no index i belongs to infinitely many features. We write f_n = {A1, . . . , Ap}, where p is the number of features. A partition is a special case of a feature allocation in which the features are restricted to be mutually exclusive and exhaustive. The features of a partition are often referred to as clusters. We note that a partition is always a feature allocation, but the converse does not hold in general. Consider now the random feature allocation of the data indices [n] = {1, . . . , n}, f_n := {A1, . . . , Ap}, given by: i ∈ A_j if and only if ζ*_{i,j} > 0. The law of ζ*_1, . . . , ζ*_n thus characterizes the law of f_n.
1.3.3 Nonparametric generalized latent trait model
We are now ready to write down the general form of a generalized latent trait model as follows:

X_1, . . . , X_n | Θ_1, . . . , Θ_n, τ² ∼ ∏_{i=1}^n ∏_{l=1}^q K(X_{il}; θ_{il}, τ²_l), (1.25)

where g_l(µ_{il}(θ_{il})) = η_{il} = ∫ λ Θ_i(dλ) = ∑_{j≥1} λ_j ζ_{ij},

Θ_1, . . . , Θ_n | G iid∼ L(Θ | G) (1.26)

G ∼ CRM(ν) (1.27)

τ²_1, . . . , τ²_q iid∼ p(τ²),

where K(·; θ, τ²) is a kernel density belonging to some parametric family. We point out that this model includes as special cases the popular infinite latent feature model of Ghahramani and Griffiths (2006) and the latent Poisson factor analysis of Zhou et al. (2012). However, to the best of our knowledge, our formulation is a general representation that has never appeared in the literature.

We can write down a marginal version of the model above by integrating out the infinite-dimensional parameter G and introducing the representation ζ*_1, . . . , ζ*_n and λ* as follows:

X_1, . . . , X_n | ζ*_1, . . . , ζ*_n, λ*, τ² ∼ ∏_{i=1}^n ∏_{l=1}^q K(X_{il}; θ_{il}, τ²_l), (1.28)

g_l(µ_{il}(θ_{il})) = η_{il} = ∑_{j=1}^p λ*_j ζ*_{ij},

ζ*_1, . . . , ζ*_n ∼ GIB(ν) (1.29)

λ*_1, λ*_2, . . . iid∼ P0 (1.30)

τ²_1, . . . , τ²_q iid∼ p(·).
A standard application of model (1.28) is the problem of learning recurrent features in a collection of images: this task is of interest, for instance, when analyzing a video (a sequence of images) and looking for objects that appear frequently. The best-known model is the linear-Gaussian latent feature model, in which the features are binary, as in Griffiths and Ghahramani (2011) and Ghahramani and Griffiths (2006). Common factors are, in this case, images containing specific features that recur over the observations and are usually modeled as i.i.d. Gaussian: λ_j ∼ N_q(0, σ²_Z I), where q is the total number of pixels in the observations and I is the q × q identity matrix. The conditional distribution of an image X_i, namely the kernel K, is

X_i | A, ζ_i, τ² ∼ N_q(A ζ_i, τ² I),

where A is the matrix whose columns are the traits λ_j and τ² is the variance of each component.
The GIB in this case is the Indian Buffet Process itself, described as follows. The first customer (the first observation) starts at the left of the buffet and samples Poisson(α) dishes. The i-th customer moves from left to right, sampling dish k with probability m_k / i, where m_k is the number of customers who have previously sampled dish k. Having reached the end of the previously sampled dishes, she/he tries a number of new dishes that is Poisson(α/i) distributed. If we apply the same ordering scheme to the binary matrix whose entries ζ_{ik} tell us whether or not (0/1) the hidden feature k contributes to the i-th item generated by this process, we recover an exchangeable distribution. It is clear that the number of active features K+ is distributed as ∑_{i=1}^n Poisson(α/i). Figure 1.3 shows a realization of this matrix, where the rows represent the customers (namely the n observations) and the columns are the dishes. Note that the number of dishes tasted grows with the number of customers.
Figure 1.3: Matrix representation of a realization from an Indian Buffet process (image taken from Griffiths and Ghahramani, 2011).
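The generative scheme just described can be sketched in a few lines. The sketch below is illustrative: the inversion-based Poisson sampler and the parameter values are our choices, not part of the original model.

```python
import math
import random

def poisson(lam, rng):
    """Poisson(lam) draw by cdf inversion; adequate for the moderate rates here."""
    u, p, k = rng.random(), math.exp(-lam), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_ibp(n_customers, alpha, rng):
    """One draw from the Indian Buffet Process: row i is the set of dishes
    (features) sampled by customer i."""
    dish_counts = []          # m_k: number of customers that tried dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        tasted = set()
        # previously sampled dishes are re-sampled with probability m_k / i
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / i:
                tasted.add(k)
        # a Poisson(alpha / i) number of brand-new dishes
        for k in range(len(dish_counts), len(dish_counts) + poisson(alpha / i, rng)):
            tasted.add(k)
            dish_counts.append(0)
        for k in tasted:
            dish_counts[k] += 1
        rows.append(tasted)
    return rows
```

With this parametrization the number of active features is $\sum_{i=1}^n \mathrm{Poisson}(\alpha/i)$, so its expectation is $\alpha\sum_{i=1}^n 1/i \approx \alpha\log n$, matching the text above.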
Chapter 2
Posterior sampling from
ε-approximation of normalized
completely random measure mixtures
This chapter is based on Argiento et al. (2016b).
In this chapter we deal with the mixture models introduced in Section 1.2; in particular,
we consider the case where the mixing distribution belongs to the class of normalized
homogeneous completely random measures. However, the issue related to the infinite
dimensionality of the parameter has only been mentioned so far. Here, we address this
computational issue by proposing a truncation method for the mixing distribution. The
idea is to discard the weights of the unnormalized measure that are smaller than a threshold. We
provide some theoretical properties of the approximation, such as convergence and a posterior
characterization. A relatively simple blocked Gibbs sampler is devised in order to sample
from the posterior of the model. In particular, we are able to sample from the posterior of
the truncated mixing measure.
The performance of the proposed approximation is illustrated in two different ap-
plications. In the first, a new random measure, called the normalized Bessel random measure,
is introduced; goodness-of-fit indexes show its good performance as a mixing measure for
density estimation. The second example describes how to incorporate covariates in the
support of the normalized measure, leading to a linear dependent model for regression and
clustering.
In order to keep the chapter self-contained, Section 2.2 recalls the notation used for
homogeneous normalized completely random measures and their approximation.
2.1 Introduction
One of the liveliest topics in Bayesian nonparametrics concerns mixtures of para-
metric densities where the mixing measure is an almost surely discrete random
probability measure. The basic model is the Dirichlet process mixture model, which
first appeared in Lo (1984), where the mixing measure is indeed the Dirichlet process.
Dating back to Ishwaran and James (2001a) and Lijoi et al. (2005), many alterna-
tive mixing measures have been proposed; the former paper replaced the Dirichlet
process with stick-breaking random probability measures, while the latter focused
on normalized completely random measures. These hierarchical mixtures play a
pivotal role in modern Bayesian nonparametrics, and their popularity is mainly due
to their high flexibility in density estimation problems as well as in clustering, which
is naturally embedded in the model.
In some statistical applications, the clustering induced by the Dirichlet process
as mixing measure may be restrictive. In fact, it is well-known that the latter allo-
cates observations to clusters with probabilities depending only on the cluster sizes,
leading to the "rich gets richer" behavior. Within some classes of more general
processes, such as stick-breaking and normalized processes, the probability
of allocating an observation to a specific cluster depends also on extra parameters,
as well as on the number of groups and on the cluster sizes. We refer to Argiento
et al. (2015) for a recent review of the state of the art on Bayesian nonparametric
mixture models and clustering.
Since posterior inference for Bayesian nonparametric mixtures involves an infinite-
dimensional parameter, computational issues may arise. However, there is
a recent prolific literature focusing mainly on two different classes of MCMC algo-
rithms, namely marginal and conditional Gibbs samplers. The former integrates
out the infinite-dimensional parameter (i.e. the random probability measure), resorting to
generalized Pólya urn schemes; see Favaro and Teh (2013) or Lomelí et al. (2017).
The latter includes the nonparametric mixing measure in the state space of the
Gibbs sampler, updating it as a component of the algorithm; this class includes the
slice sampler (see Griffin and Walker, 2011). Among conditional algorithms there
are truncation methods, where the infinite-dimensional parameter (i.e. the mixing measure) is
approximated by truncating the infinite sums defining the process, either a poste-
riori (Argiento et al., 2010; Barrios et al., 2013) or a priori (Argiento et al., 2016a;
Griffin, 2013).
In this work we introduce an almost surely finite-dimensional class of random
probability measures that approximates the wide family of homogeneous normal-
ized completely random measures introduced in Section 1.1; we use this class as
the building block in mixture models and provide a simple but general truncation
algorithm to perform posterior inference. Our approximation is based on the con-
structive definition of the weights of the completely random measure as the points of
a Poisson process on $\mathbb{R}^+$. In particular, we consider only the points larger than a thresh-
old $\varepsilon$, which controls the degree of approximation. Conditionally on $\varepsilon$, our process is
finite dimensional both a priori and a posteriori.
Here we illustrate two applications. In the first one, a new choice for the Lévy
intensity $\rho$, characterizing the normalized completely random measure, is proposed:
the Bessel intensity function, which, to the best of our knowledge, has never been applied
in a statistical framework, but is known in finance (see Barndorff-Nielsen, 2000, for
instance). We call this new process the normalized Bessel random measure. In the
second application, we set $\rho$ to be the well-known generalized gamma intensity and
consider a centering measure $P_{0x}$ depending on a set of covariates $x$, yielding a
linear dependent normalized completely random measure.
In this chapter, since the main objective is the approximation of the nonpara-
metric process arising from the normalization of completely random measures, we
fix $\varepsilon$ to a small value. However, it is worth mentioning that it is possible to choose
a prior for $\varepsilon$, although the computational cost might greatly increase for some intensities
$\rho$.
The new achievements of this chapter can be summarized as follows: (i) a gener-
alization of the $\varepsilon$-approximation given in Argiento et al. (2016a) for the NGG process
to the whole family of normalized homogeneous completely random measures, (ii) a
different technique providing the posterior distribution (and the exchangeable par-
tition probability function) of this new random probability measure, making use of
Palm's formula, and (iii) the introduction of the normalized Bessel random measure
as a mixing measure in Bayesian nonparametric mixtures.
In particular, after the introduction of the finite-dimensional $\varepsilon$-approximation
of a normalized completely random measure, we derive its posterior and show that
the $\varepsilon$-approximation converges to its infinite-dimensional counterpart (Section 2.3).
Then we provide a Gibbs sampler for the $\varepsilon$-approximation hierarchical mixture
model (Section 2.4). Section 2.4.1 illustrates some criteria to choose the approx-
imation parameter $\varepsilon$. Section 2.5.1 is devoted to the introduction of the normalized
Bessel random measure and some of its properties, while Section 2.5.2
discusses an application of $\varepsilon$-Bessel mixture models to both simulated and real
data. Section 2.6 defines the linear dependent $\varepsilon$-NGG processes, and considers linear de-
pendent $\varepsilon$-NGG mixtures to fit the AIS data set. To complete the set-up of the
chapter, Section 2.2 is devoted to a summary of basic notions about homogeneous
normalized completely random measures, and Section 2.7 contains a concluding dis-
cussion.
2.2 Preliminaries on normalized completely random measures
In this section we recall the notation and the main definitions that are useful
in the rest of the chapter. See also Section 1.1 of the introductory chapter. Let
$\Theta \subset \mathbb{R}^m$ for some positive integer $m$. Let $\mu$ be a homogeneous completely random
measure on $\Theta$ with Lévy intensity $\nu(ds, d\tau) = \kappa\rho(ds)P_0(d\tau)$, where $\rho(s)$ is the
density of a non-negative measure on $\mathbb{R}^+$, and $\kappa P_0$ is a finite measure on $\Theta$ with
total mass $\kappa > 0$. Assume that $\rho$ satisfies the regularity conditions
$$\int_0^{+\infty} \min\{1, s\}\,\rho(s)\,ds < +\infty, \qquad (2.1)$$
and
$$\int_0^{+\infty} \rho(s)\,ds = +\infty. \qquad (2.2)$$
This implies that the homogeneous completely random measure can be represented
as $\mu(\cdot) = \sum_{j\geq 1} J_j\delta_{\tau_j}(\cdot)$. Since $\mu$ is homogeneous, the support points $\tau_j$ and the
jumps $J_j$ of $\mu$ are independent; the $\tau_j$'s are independent identically distributed
(iid) random variables from $P_0$, while the $J_j$'s are the points of a Poisson process on
$\mathbb{R}^+$ with mean intensity $\rho$. Moreover, if $T := \mu(\Theta) = \sum_{j\geq 1} J_j$, then by (2.1) and (2.2),
$\mathbb{P}(0 < T < +\infty) = 1$.
Therefore, the corresponding normalized completely random measure $P$ can be
defined through normalization of $\mu$:
$$P := \frac{\mu}{\mu(\Theta)} = \sum_{j=1}^{+\infty} \frac{J_j}{T}\,\delta_{\tau_j} = \sum_{j=1}^{+\infty} P_j\delta_{\tau_j}. \qquad (2.3)$$
We refer to P in (2.3) as a (homogeneous) normalized completely random measure
with parameter (ρ, κP0). As an alternative notation, following James et al. (2009),
P is referred to as a homogeneous normalized measure with independent increments.
An alternative construction of normalized completely random measures can be given
in terms of Poisson-Kingman models as in Pitman (2003).
2.3 ε-approximation of normalized completely random
measures
The goal of this section is the definition of a finite-dimensional random proba-
bility measure that approximates a general normalized completely random
measure with Lévy intensity $\nu(ds, d\tau) = \rho(ds)\kappa P_0(d\tau)$, introduced above.
First of all, by the Restriction Theorem for Poisson processes, for any $\varepsilon > 0$, the
jumps $J_j$ of $\mu$ larger than a threshold $\varepsilon$ are still a Poisson process, with mean
intensity $\gamma_\varepsilon(s) := \kappa\rho(s)\mathbb{1}_{(\varepsilon,+\infty)}(s)$. Moreover, the total number of these points
is Poisson distributed, i.e. $N_\varepsilon \sim \mathrm{Poisson}(\Lambda_\varepsilon)$, where $\Lambda_\varepsilon := \kappa\int_\varepsilon^{+\infty}\rho(s)\,ds$. Since
$\Lambda_\varepsilon < +\infty$ for any $\varepsilon > 0$ by (2.1), $N_\varepsilon$ is almost surely finite. In addition, conditionally
on $N_\varepsilon$, the points $J_1, \dots, J_{N_\varepsilon}$ are iid from the density
$$\rho_\varepsilon(s) = \frac{\gamma_\varepsilon(s)}{\Lambda_\varepsilon} = \frac{\kappa\rho(s)}{\Lambda_\varepsilon}\,\mathbb{1}_{(\varepsilon,+\infty)}(s), \qquad (2.4)$$
thanks to the relationship between Poisson and Bernoulli processes; see, for instance,
Kingman (1993), Section 2.4.
We denote by $\mu_\varepsilon$ the CRM with Lévy intensity
$$\nu_\varepsilon(ds, d\tau) := \rho(s)\mathbb{1}_{(\varepsilon,+\infty)}(s)\,ds\,\kappa P_0(d\tau). \qquad (2.5)$$
This implies that $\mu_\varepsilon = \sum_{j=1}^{N_\varepsilon} J_j\delta_{\tau_j}$. However, it is not worth trying to normalize $\mu_\varepsilon$,
since $\mu_\varepsilon(B) = 0$ for any $B$ if $N_\varepsilon = 0$. We consider, instead, the CRM $\bar{\mu}_\varepsilon$ so defined:
$$\bar{\mu}_\varepsilon(\cdot) \stackrel{d}{=} J_0\delta_{\tau_0}(\cdot) + \mu_\varepsilon(\cdot) \qquad (2.6)$$
where $(J_0, \tau_0)$ is independent of $\{(J_j, \tau_j),\ j \geq 1\}$, and $J_0$ and $\tau_0$ are independent with
density $\rho_\varepsilon$ and distribution $P_0$, respectively. Thus
$$\bar{\mu}_\varepsilon(\cdot) = J_0\delta_{\tau_0}(\cdot) + \sum_{j=1}^{N_\varepsilon} J_j\delta_{\tau_j}(\cdot) = \sum_{j=0}^{N_\varepsilon} J_j\delta_{\tau_j}(\cdot).$$
Summing up, we define:
$$P_\varepsilon(\cdot) = \sum_{j=0}^{N_\varepsilon} P_j\delta_{\tau_j}(\cdot) = \sum_{j=0}^{N_\varepsilon} \frac{J_j}{T_\varepsilon}\,\delta_{\tau_j}(\cdot), \qquad (2.7)$$
where $T_\varepsilon = \sum_{j=0}^{N_\varepsilon} J_j$, $\tau_j \stackrel{iid}{\sim} P_0$, and the $\tau_j$'s and $J_j$'s are independent. We denote $P_\varepsilon$ in
(2.7) by $\varepsilon$-NormCRM and write $P_\varepsilon \sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0)$. When $\rho_\varepsilon(s) = s^{-\sigma-1}\mathrm{e}^{-\omega s}/(\omega^{\sigma}\Gamma(-\sigma, \omega\varepsilon))$, $s > \varepsilon$, $P_\varepsilon$ is the $\varepsilon$-NGG process introduced in Argiento
et al. (2016a), with parameter $(\sigma, \kappa, P_0)$, $0 \leq \sigma \leq 1$, $\kappa \geq 0$.
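A realization of $P_\varepsilon$ can be simulated directly from this definition. The sketch below does so for the generalized gamma case with $\omega = 1$, so that $\rho(s) = s^{-\sigma-1}\mathrm{e}^{-s}/\Gamma(1-\sigma)$; the standard Gaussian base measure, the quadrature grid, the upper cutoff on the jump sizes and the rejection proposal are all illustrative choices of ours, not part of the model.

```python
import math
import random

def sample_eps_ngg(sigma, kappa, eps, rng, s_max=50.0):
    """One realization of P_eps for the generalized gamma intensity
    rho(s) = s^(-1-sigma) e^(-s) / Gamma(1-sigma) (omega = 1), with P0 = N(0,1).
    Returns (normalized weights, atoms); index 0 corresponds to (J0, tau0).
    Jumps above s_max are neglected (their Poisson mass is ~exp(-s_max))."""
    def rho(s):
        return s ** (-1.0 - sigma) * math.exp(-s) / math.gamma(1.0 - sigma)

    # Lambda_eps = kappa * int_eps^infty rho(s) ds, by trapezoids on a log grid
    grid = [eps * (s_max / eps) ** (i / 4000.0) for i in range(4001)]
    lam = kappa * sum(0.5 * (rho(a) + rho(b)) * (b - a)
                      for a, b in zip(grid, grid[1:]))

    def poisson(mean):  # cdf inversion, fine for moderate means
        u, p, k = rng.random(), math.exp(-mean), 0
        cdf = p
        while u > cdf:
            k += 1
            p *= mean / k
            cdf += p
        return k

    def draw_jump():
        # rejection sampling from rho_eps: propose from the density proportional
        # to s^(-1-sigma) on (eps, s_max) (closed-form inverse cdf), then accept
        # with probability exp(-s) <= 1
        while True:
            u = rng.random()
            s = (eps ** -sigma - u * (eps ** -sigma - s_max ** -sigma)) ** (-1.0 / sigma)
            if rng.random() < math.exp(-s):
                return s

    n_eps = poisson(lam)                              # N_eps ~ Poisson(Lambda_eps)
    jumps = [draw_jump() for _ in range(n_eps + 1)]   # the extra jump is J0
    total = sum(jumps)                                # T_eps
    atoms = [rng.gauss(0.0, 1.0) for _ in jumps]      # tau_j iid from P0
    return [j / total for j in jumps], atoms
```

The restriction to jumps larger than $\varepsilon$ is what makes the draw finite: only $N_\varepsilon + 1$ weights are ever generated.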
Increasing Lévy processes are completely random measures for $\Theta = \mathbb{R}$ (or $\mathbb{R}^+$).
Therefore, it is worth mentioning some literature on the $\varepsilon$-approximation of such pro-
cesses in the financial context. In particular, the book by Asmussen and Glynn
(Asmussen and Glynn, 2007, Chapter XII) provides a justification for the approx-
imation of infinite activity Lévy processes by compound Poisson processes: any
Lévy jump process $J$ on $\mathbb{R}$ can be represented as the sum of two independent Lévy
processes
$$J(s) = J_1(s) + J_2(s), \quad s \in \mathbb{R},$$
where the Lévy measures of $J_1$ and $J_2$ are the restrictions of the whole Lévy measure
to $(-\varepsilon, \varepsilon)$ and $(-\infty, -\varepsilon]\cup[\varepsilon, +\infty)$, respectively. When considering the homogeneous
completely random measure $\mu$ under (2.1) and (2.2), as here, this theory yields that $\mu$
is the sum of two independent homogeneous completely random measures $\mu_{(0,\varepsilon]}$ and
$\mu_\varepsilon$, corresponding to mean intensities $\rho(s)\mathbb{1}_{(0,\varepsilon]}(s)$ and $\rho_\varepsilon$ as in (2.4), respectively.
Note that $\mu_\varepsilon$ is the CRM on the right-hand side of (2.6). The basic idea of the
$\varepsilon$-approximation is that, if $\varepsilon$ is small enough, $\mu_{(0,\varepsilon]}$ can be neglected and $\mu$ can be
approximated by $\mu_\varepsilon$; see (Asmussen and Glynn, 2007, Chapter XII) and Trippa and
Favaro (2012).
The approach to the $\varepsilon$-approximation taken here is similar, though not identical,
since we first add the random mass $J_0$ at the random point $\tau_0$ to $\mu_\varepsilon$, defining the
CRM $\bar{\mu}_\varepsilon$ as in (2.6). The random probability measure $P_\varepsilon$ in (2.7) is then defined by
normalization of $\bar{\mu}_\varepsilon$. We will show in Proposition 2.3 that $P_\varepsilon$ converges in distribu-
tion to $P$ as $\varepsilon$ goes to 0, but the basic idea of the approximation is that the point
mass we add to $\mu_\varepsilon$ is negligible; see Section 2.4.1.
Several other methods have been proposed in order to approximate a normalized
measure. First of all, we mention the inverse Lévy measure method, referred to in this context as
the Ferguson-Klass representation (Ferguson and Klass, 1972), which represents
the Poisson process of the jumps of a subordinator as a series of transformed (via
the survival function of the Lévy intensity) points of a unit rate Poisson process. Of
course, to get implementable simulation algorithms, the series expansion has to be
truncated, either at a fixed and large integer $N$, or whenever the new jump to be added to
the series is smaller than a threshold $\varepsilon$. In the latter case, the truncation rule would
yield only jumps of size greater than $\varepsilon$, giving an algorithm that is similar to
the one proposed here (Asmussen and Glynn, 2007, Chapter XII). On the other hand,
Arbel and Prünster (2017) propose a truncation rule of the series representation at
a fixed integer $N$, quantifying the error through a moment-matching criterion, i.e.
evaluating a measure of discrepancy between the actual moments of the whole series
and the moments of the truncated sum based on the simulation output. Further series
representations of the jump process can be considered, with corresponding trunca-
tion rules; see Bondesson (1982) and Rosiński (2001). Alternatively, Trippa and
Favaro (2012) proposed a novel class of r.p.m.'s that is dense in the class of homo-
geneous normalized completely random measures. These authors first approximate
any CRM $\mu$ with $\mu_\varepsilon$ which, as we have already mentioned, has finite Lévy measure.
Then, resorting to the "denseness" of the novel class, they approximate $\mu_\varepsilon$ with
an element of this class, with Lévy intensity given by the weighted sum of a finite
number of intensities of finite activity processes, plus the intensity of the gamma
process.
Let $\theta = (\theta_1, \dots, \theta_n)$ be a sample from $P_\varepsilon$, an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$ as defined
in (2.7), and let $\theta^* = (\theta^*_1, \dots, \theta^*_k)$ be the (observed) distinct values in $\theta$. We call
allocated jumps of the process those values $P_{l^*_1}, P_{l^*_2}, \dots, P_{l^*_k}$ in (2.7) such that there
exists a corresponding location for which $\tau_{l^*_i} = \theta^*_i$, $i = 1, \dots, k$. The remaining
values are non-allocated jumps. We use the superscript $(na)$ for random variables
related to non-allocated jumps. The first result is a characterization of the posterior
law of the random measure $\bar{\mu}_\varepsilon$, not yet normalized; however, we first need to introduce
two more ingredients. We consider an auxiliary random variable $U$ such that
$U \mid \bar{\mu}_\varepsilon \sim \mathrm{Gamma}(n, T_\varepsilon)$, so that the marginal density of $U$ is
$$f_U(u; n) = \frac{u^{n-1}}{\Gamma(n)}\,\mathrm{E}\!\left(T_\varepsilon^n \mathrm{e}^{-T_\varepsilon u}\right) = \frac{u^{n-1}}{\Gamma(n)}(-1)^n\frac{d^n}{du^n}\mathrm{E}\!\left(\mathrm{e}^{-uT_\varepsilon}\right)
= \frac{u^{n-1}}{\Gamma(n)}(-1)^n\frac{d^n}{du^n}\frac{\Lambda_{\varepsilon,u}\,\mathrm{e}^{\Lambda_{\varepsilon,u}}}{\Lambda_\varepsilon\,\mathrm{e}^{\Lambda_\varepsilon}}, \qquad (2.8)$$
where the last equality follows easily from the definition of $T_\varepsilon$ and the Lévy-Khintchine
representation, using the notation defined in (2.11). We also formulate the following
lemma, whose proof is straightforward.
Lemma 2.3.1
Let $\mu_\varepsilon$ be a finite CRM with Lévy intensity $\nu_\varepsilon$ as in (2.5), and let $\bar{\mu}_\varepsilon$ be defined as
in (2.6). Consider a CRM $\mu^\star$ such that
$$\mu^\star(\cdot) \stackrel{d}{=} X\bar{\mu}_\varepsilon(\cdot) + (1 - X)\mu_\varepsilon(\cdot), \qquad (2.9)$$
where $X \sim \mathrm{Bernoulli}(p)$, $p = a/(a+b)$, $a, b > 0$, and $X$ is independent of $\mu_\varepsilon$ and
$(J_0, \tau_0)$. The Laplace functional of $\mu^\star$ is:
$$\Psi[f] = \frac{aA[f] + b}{a + b}\,\exp\left\{-\int_{\mathbb{R}^+\times\Theta}\left(1 - \mathrm{e}^{-f(\tau)s}\right)\nu_\varepsilon(ds, d\tau)\right\}, \qquad (2.10)$$
for any positive $f$, where
$$A[f] := \mathrm{E}\left(\mathrm{e}^{-f(\tau_0)J_0}\right) = \int_{\mathbb{R}^+\times\Theta}\mathrm{e}^{-f(\tau)s}\rho_\varepsilon(s)\,ds\,P_0(d\tau) = \frac{1}{\Lambda_\varepsilon}\int_{\mathbb{R}^+\times\Theta}\mathrm{e}^{-sf(\tau)}\nu_\varepsilon(ds, d\tau)$$
is the Laplace functional of the random measure $J_0\delta_{\tau_0}$.
The posterior distribution of $\bar{\mu}_\varepsilon$ has the following characterization.
Theorem 2.3.1
If $P_\varepsilon$ is an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$, then the conditional distribution of $P_\varepsilon$, given $\theta^*$
and $U = u$, is obtained by normalization of the following random measure:
$$\mu^*_\varepsilon(\cdot) \stackrel{d}{=} \mu^{(na)}_{\varepsilon,u}(\cdot) + \mu^{(a)}_{\varepsilon,u}(\cdot) = \mu^{(na)}_{\varepsilon,u}(\cdot) + \sum_{j=1}^{k} J^{(a)}_j\delta_{\theta^*_j}(\cdot)$$
where
1. the process of non-allocated jumps $\mu^{(na)}_{\varepsilon,u}(\cdot)$ is distributed as the
CRM $\mu^\star$ defined in (2.9), with the Lévy intensity in (2.10) given by
$\mathrm{e}^{-us}\nu_\varepsilon(ds, d\tau)$ and probability of success $p = \Lambda_{\varepsilon,u}/(\Lambda_{\varepsilon,u} + k)$, where
$$\Lambda_{\varepsilon,u} := \kappa\int_\varepsilon^{+\infty}\mathrm{e}^{-us}\rho(s)\,ds, \quad u \geq 0; \qquad (2.11)$$
2. the process of allocated jumps $\mu^{(a)}_{\varepsilon,u}(\cdot)$ has fixed points of discontinuity $\theta^* = (\theta^*_1, \dots, \theta^*_k)$, with weights $J^{(a)}_j$ independently distributed with density proportional to
$s^{n_j}\mathrm{e}^{-us}\rho(s)\mathbb{1}_{(\varepsilon,+\infty)}(s)\,ds$, $j = 1, \dots, k$;
3. $\mu^{(na)}_{\varepsilon,u}(\cdot)$ and $\mu^{(a)}_{\varepsilon,u}(\cdot)$ are independent, conditionally on $l^* = (l^*_1, \dots, l^*_k)$, the
vector of locations of the allocated jumps;
4. the posterior law of $U$ given $\theta^*$ has density on the positive real numbers given
by
$$f_{U|\theta^*}(u \mid \theta^*) \propto u^{n-1}\mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon}\,\frac{\Lambda_{\varepsilon,u} + k}{\Lambda_\varepsilon}\prod_{i=1}^{k}\int_\varepsilon^{+\infty}\kappa s^{n_i}\mathrm{e}^{-us}\rho(s)\,ds, \quad u > 0.$$
The proof of the above theorem, as well as those of all the other results in this section, is in
Appendix 2.B. An immediate consequence of Theorem 2.3.1 is the next corollary.
Corollary 2.3.1
The conditional distribution of $P_\varepsilon$, given $\theta^*$ and $U = u$, satisfies the distributional
equation
$$P^*_\varepsilon(\cdot) \stackrel{d}{=} wP^{(na)}_{\varepsilon,u}(\cdot) + (1 - w)\sum_{j=1}^{k} P^{(a)}_j\delta_{\theta^*_j}(\cdot)$$
where $P^{(na)}_{\varepsilon,u}(\cdot)$ is the null measure if $\mu^{(na)}_{\varepsilon,u}(\Theta) = 0$, $w = \mu^{(na)}_{\varepsilon,u}(\Theta)\big/\big(\mu^{(na)}_{\varepsilon,u}(\Theta) + \sum_{j=1}^{k} J^{(a)}_j\big)$, and the jumps $P^{(a)}_1, \dots, P^{(a)}_k$ associated with the fixed points of discon-
tinuity $\theta^*_1, \dots, \theta^*_k$ are defined as $P^{(a)}_j = J^{(a)}_j\big/\sum_{l=1}^{k} J^{(a)}_l$, $j = 1, \dots, k$.
Theorem 2.3.1 and Corollary 2.3.1 provide the finite-dimensional counterpart
of Proposition 1 in James et al. (2009).
Both the infinite- and finite-dimensional processes defined in (2.3) and (2.7),
respectively, belong to the wide class of species sampling models, deeply investigated
in Pitman (1996), and we use some of the results there to derive ours. Let $(\theta_1, \dots, \theta_n)$
be a sample from (2.3) or (2.7) (or, more generally, from a species sampling model);
since it is a sample from a discrete probability measure, it induces a random partition $p_n := \{C_1, \dots, C_k\}$ of the set $\mathbb{N}_n := \{1, \dots, n\}$, where $C_j = \{i : \theta_i = \theta^*_j\}$ for $j = 1, \dots, k$. If
$\#C_i = n_i$ for $1 \leq i \leq k$, the marginal law of $(\theta_1, \dots, \theta_n)$ has the unique characterization:
$$\mathcal{L}(p_n, \theta^*_1, \dots, \theta^*_k) = p(n_1, \dots, n_k)\prod_{j=1}^{k}\mathcal{L}(\theta^*_j),$$
where $p$ is the EPPF associated with the random probability measure. The EPPF $p$ is a
probability law on the set of the partitions of $\mathbb{N}_n$. The following proposition provides
an expression for the EPPF of a general $\varepsilon$-NormCRM.
Proposition 2.1
Let $(n_1, \dots, n_k)$ be a vector of positive integers such that $\sum_{i=1}^{k} n_i = n$. Then, the
EPPF associated with $P_\varepsilon \sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0)$ is
$$p_\varepsilon(n_1, \dots, n_k) = \int_0^{+\infty}\left[\frac{u^{n-1}}{\Gamma(n)}\,\frac{(k + \Lambda_{\varepsilon,u})}{\Lambda_\varepsilon}\,\mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon}\prod_{i=1}^{k}\int_\varepsilon^{+\infty}\kappa s^{n_i}\mathrm{e}^{-us}\rho(s)\,ds\right]du \qquad (2.12)$$
where $\Lambda_{\varepsilon,u}$ has been defined in (2.11).
A result concerning the EPPF of a generic normalized (homogeneous) completely
random measure can be obtained from Pitman (2003), formulas (36)-(37):
$$p(n_1, \dots, n_k) = \int_0^{+\infty}\frac{u^{n-1}}{\Gamma(n)}\,\mathrm{e}^{\kappa\int_0^{+\infty}(\mathrm{e}^{-us} - 1)\rho(s)\,ds}\left(\prod_{i=1}^{k}\int_0^{+\infty}\kappa s^{n_i}\mathrm{e}^{-us}\rho(s)\,ds\right)du. \qquad (2.13)$$
It follows that the EPPF of (2.7) converges pointwise to that of the corresponding
(homogeneous) normalized completely random measure (2.3) when $\varepsilon$ tends to 0.
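Formula (2.13) can be checked numerically in a case where everything is explicit. For the gamma intensity $\rho(s) = s^{-1}\mathrm{e}^{-s}$ (the Dirichlet process with mass $\kappa$), the Laplace exponent gives $\mathrm{e}^{\kappa\int(\mathrm{e}^{-us}-1)\rho(s)\,ds} = (1+u)^{-\kappa}$ and the inner integral for $n_1 = 2$ equals $\kappa/(1+u)^2$, so the co-clustering probability $p(2)$ must equal $1/(\kappa+1)$, the well-known Dirichlet process value. A quadrature sketch (the grid sizes are arbitrary choices of ours):

```python
def p2_gamma_intensity(kappa, u_max=2000.0, n_grid=200000):
    """p(2) from (2.13) for rho(s) = s^{-1} e^{-s}, i.e.
    p(2) = int_0^inf u * (1+u)^{-kappa} * kappa * (1+u)^{-2} du,
    evaluated with the trapezoidal rule on [0, u_max]."""
    h = u_max / n_grid

    def f(u):
        return kappa * u * (1.0 + u) ** (-kappa - 2.0)

    total = 0.5 * (f(0.0) + f(u_max))
    for i in range(1, n_grid):
        total += f(i * h)
    return total * h
```

The agreement with $1/(\kappa+1)$ for several values of $\kappa$ is a quick sanity check of both (2.13) and its specialization to the Dirichlet process.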
Proposition 2.2
Let $p_\varepsilon(\cdot)$ be the EPPF of an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$. Then for any sequence $n_1, \dots, n_k$
of positive integers with $k > 0$ and $\sum_{i=1}^{k} n_i = n$,
$$\lim_{\varepsilon\to 0}\, p_\varepsilon(n_1, \dots, n_k) = p_0(n_1, \dots, n_k), \qquad (2.14)$$
where $p_0(\cdot)$ is the EPPF of the $\mathrm{NormCRM}(\rho, \kappa P_0)$ as in (2.13).
Convergence of the sequence of EPPFs yields convergence of the sequence of
$\varepsilon$-NormCRMs, generalizing a result obtained for $\varepsilon$-NGG processes.
Proposition 2.3
Let $P_\varepsilon$ be an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$, for any $\varepsilon > 0$. Then
$$P_\varepsilon \stackrel{d}{\to} P \quad \text{as } \varepsilon\to 0,$$
where $P$ is a $\mathrm{NormCRM}(\rho, \kappa P_0)$. Moreover, as $\varepsilon$ tends to $+\infty$, $P_\varepsilon \stackrel{d}{\to} \delta_{\tau_0}$, where
$\tau_0 \sim P_0$.
The proof of the above proposition is along the same lines as the proof of Propo-
sition 1 in Argiento et al. (2016a), and is therefore omitted here.
Furthermore, the $m$-th moment of $P_\varepsilon$, $m = 1, 2, \dots$, is equal to:
$$\mathrm{E}\left[(P_\varepsilon(B))^m\right] = \mathrm{E}\left[(P_0(B))^{K_m}\right] \qquad (2.15)$$
where $B \in \mathcal{B}(\Theta)$ and $K_m$ is the number of distinct values in a sample of size $m$ from
$P_\varepsilon$. In particular, when $m = 2$, $K_m$ assumes values in $\{1, 2\}$, and the probability
that $K_2 = 1$ is the probability that, in a sample of size 2 from $P_\varepsilon$, the sampled values
coincide, i.e. $p_\varepsilon(2)$. Therefore $\mathrm{E}(P_\varepsilon(B)^2) = P_0(B)p_\varepsilon(2) + (P_0(B))^2(1 - p_\varepsilon(2))$, and
consequently
$$\mathrm{Var}(P_\varepsilon(B)) = p_\varepsilon(2)P_0(B)\left(1 - P_0(B)\right). \qquad (2.16)$$
Analogously, the covariance structure of $P_\varepsilon$ is as follows:
$$\mathrm{Cov}(P_\varepsilon(B_1), P_\varepsilon(B_2)) = p_\varepsilon(2)\left(P_0(B_1\cap B_2) - P_0(B_1)P_0(B_2)\right) \qquad (2.17)$$
for any $B_1, B_2 \in \mathcal{B}(\Theta)$. Proofs of (2.15) and (2.17) are given in Appendix 2.B.
2.4 ε-NormCRM process mixtures
We consider mixtures of parametric kernels as the distribution of the data, where
the mixing measure is an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$. The model we assume is the
following:
$$\begin{aligned}
Y_i \mid \theta_i &\stackrel{ind}{\sim} f(\cdot; \theta_i), \quad i = 1, \dots, n\\
\theta_i \mid P_\varepsilon &\stackrel{iid}{\sim} P_\varepsilon, \quad i = 1, \dots, n\\
P_\varepsilon &\sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0),\\
\varepsilon &\sim \pi(\varepsilon),
\end{aligned} \qquad (2.18)$$
where $f(\cdot; \theta)$ is a parametric family of densities on $\mathcal{Y} \subset \mathbb{R}^p$, for all $\theta \in \Theta \subset \mathbb{R}^m$. It
is a special case of model (1.4) in Chapter 1, where the de Finetti measure is given
by the family of $\varepsilon$-NormCRM processes.
Remember that $P_0$ is a non-atomic probability measure on $\Theta$, such that $\mathrm{E}(P_\varepsilon(A)) = P_0(A)$ for all $A \in \mathcal{B}(\Theta)$ and $\varepsilon \geq 0$. Model (2.18) will be referred to here as the
$\varepsilon$-NormCRM hierarchical mixture model.
The design of a Gibbs scheme to sample from the posterior distribution of model
(2.18) is straightforward, once we have augmented the state space with the variable
$u$, by using the posterior characterization in Theorem 2.3.1. The Gibbs sampler
generalizes the one provided in Argiento et al. (2016a) for $\varepsilon$-NGG mixtures, but
it is designed for any Lévy intensity $\rho$ under (2.1) and (2.2). A description of the
full conditionals is given below, and further details can be found in Appendix 2.A.
1. Sampling from $\mathcal{L}(u \mid \mathbf{Y}, \boldsymbol{\theta}, P_\varepsilon, \varepsilon)$: it is clear that, conditionally on $P_\varepsilon$, $u$ is
independent of the other variables and gamma distributed with parameters $(n, T_\varepsilon)$.
2. Sampling from $\mathcal{L}(\boldsymbol{\theta} \mid u, \mathbf{Y}, P_\varepsilon, \varepsilon)$: each $\theta_i$, for $i = 1, \dots, n$, has discrete law
with support $\{\tau_0, \tau_1, \dots, \tau_{N_\varepsilon}\}$ and probabilities $\mathbb{P}(\theta_i = \tau_j) \propto J_jf(Y_i; \tau_j)$.
3. Sampling from $\mathcal{L}(P_\varepsilon, \varepsilon \mid u, \boldsymbol{\theta}, \mathbf{Y})$: this step is not straightforward and can be
split into two consecutive substeps:
3.a Sampling from $\mathcal{L}(\varepsilon \mid u, \boldsymbol{\theta}, \mathbf{Y})$: see Appendix 2.A.
3.b Sampling from $\mathcal{L}(P_\varepsilon \mid \varepsilon, u, \boldsymbol{\theta}, \mathbf{Y})$: via the characterization of the posterior
in Theorem 2.3.1, since this distribution is equal to $\mathcal{L}(P_\varepsilon \mid \varepsilon, u, \boldsymbol{\theta})$. In practice,
we have to sample (i) the number $N^{na}$ of non-allocated
jumps, (ii) the vector of the unnormalized non-allocated jumps $\mathbf{J}^{(na)}$,
(iii) the vector of the unnormalized allocated jumps $\mathbf{J}^{(a)}$, and the support of
the allocated (iv) and non-allocated (v) jumps. See Appendix 2.A for a
fuller description.
We highlight that, when sampling from non-standard distributions, accept-reject
or Metropolis-Hastings steps are exploited.
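As a concrete illustration of step 2 above, the allocation update can be written in a few lines. The Gaussian kernel and the variable names below are our illustrative choices, not part of the general algorithm.

```python
import math
import random

def sample_allocations(Y, jumps, atoms, tau2, rng):
    """Step 2 of the blocked Gibbs sampler: allocate each observation Y_i to
    one of the atoms tau_0, ..., tau_{N_eps}, with
    P(theta_i = tau_j) proportional to J_j * f(Y_i; tau_j),
    here with a Gaussian kernel f(y; tau) = N(y; tau, tau2).
    The Gaussian normalizing constant is omitted: with a common variance
    tau2 it cancels in the normalization of the weights."""
    alloc = []
    for y in Y:
        w = [J * math.exp(-0.5 * (y - t) ** 2 / tau2)
             for J, t in zip(jumps, atoms)]
        u = rng.random() * sum(w)
        c = 0.0
        for j, wj in enumerate(w):
            c += wj
            if u <= c:
                break
        alloc.append(j)
    return alloc
```

Note that the unnormalized jumps $J_j$ enter directly: the total mass $T_\varepsilon$ cancels, so the weights of $P_\varepsilon$ never need to be normalized in this step.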
2.4.1 Some ideas on the choice of ε
We believe that a brief discussion on the choice of the approximation parameter
$\varepsilon$ is worthwhile. We could also consider it random, as we did in Argiento et al.
(2016a), where the $\varepsilon$-NGG mixture model was proposed. In our general view, this
parameter can be considered either as a true parameter, in which case it should be
fixed on the basis of the prior information we have, or as a tuning parameter to
approximate the exact model (normalized completely random measure mixtures).
If we prefer the latter alternative, as we did here, $\varepsilon$ has to be small. However, since the
result on the $\varepsilon$-approximation (Proposition 2.3) concerns the prior distribution in (2.18),
the only suggestions we can give refer to a priori criteria. Here we suggest setting $\varepsilon$
such that the sum of the masses $\mu((0, \varepsilon])$ and $J_0$ with which we perturb $\mu$, obtaining $\bar{\mu}_\varepsilon$,
is small. In particular, since the interest is in normalized random measures, "small"
is fixed with respect to the expectation $\mathrm{E}(T)$ of the total mass of $\mu$, i.e. we choose
$\varepsilon$ such that
$$r(\varepsilon) := \frac{\mathrm{E}(\mu(0, \varepsilon]) + \mathrm{E}(J_0)}{\mathrm{E}(T)} \leq \nu, \qquad (2.19)$$
where $\nu$ is typically a small value. Of course, alternative criteria are available; for
instance, as in Argiento et al. (2016a), we could choose $\varepsilon$ to achieve a prefixed value
for $\mathrm{E}(N_\varepsilon)$ or $\mathrm{Var}(N_\varepsilon)$. As far as (2.19) is concerned, observe that
$$\mathrm{E}(\mu(0, \varepsilon]) = \kappa\int_0^\varepsilon s\rho(s)\,ds, \qquad \mathrm{Var}(\mu(0, \varepsilon]) = \kappa\int_0^\varepsilon s^2\rho(s)\,ds;$$
from (2.1), it follows that
$$\mathrm{E}(\mu(0, \varepsilon])\to 0, \quad \mathrm{Var}(\mu(0, \varepsilon])\to 0 \quad \text{as } \varepsilon\to 0,$$
i.e. the r.v. $\mu(0, \varepsilon]$ converges to 0 in $L^2$, which implies convergence in probability.
Besides, we have that
$$\varepsilon \leq \mathrm{E}(J_0) = \frac{\kappa\int_\varepsilon^{+\infty} s\rho(s)\,ds}{\Lambda_\varepsilon} \leq \frac{\mathrm{E}(T)}{\mathrm{E}(N_\varepsilon)}.$$
Consequently, when $\varepsilon\to 0$, $\mathrm{E}(N_\varepsilon)\to +\infty$ and thus $\mathrm{E}(J_0)$ converges to 0.
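These expressions suggest a direct numerical recipe for calibrating $\varepsilon$. The sketch below evaluates $r(\varepsilon)$ of (2.19) for the generalized gamma intensity with $\omega = 1$ (the example treated next), using the alternating series for $\mathrm{E}(\mu(0,\varepsilon])$ and quadrature for the incomplete gamma functions in $\mathrm{E}(J_0) = \Gamma(1-\sigma, \varepsilon)/\Gamma(-\sigma, \varepsilon)$; in this case $\mathrm{E}(T) = \kappa$. The grid sizes and cutoffs are illustrative choices of ours.

```python
import math

def r_eps(eps, sigma, kappa=1.0, s_max=60.0, n_grid=20000):
    """r(eps) in (2.19) for rho(s) = s^{-1-sigma} e^{-s} / Gamma(1-sigma), omega = 1.
    E(T) = kappa; E(mu(0,eps]) via the alternating series; E(J0) via quadrature."""
    # E(mu(0, eps]) = kappa/Gamma(1-sigma) * sum_n (-1)^n eps^{1-sigma+n} / (n! (1-sigma+n))
    e_small = sum((-1) ** n * eps ** (1.0 - sigma + n)
                  / (math.factorial(n) * (1.0 - sigma + n)) for n in range(30))
    e_small *= kappa / math.gamma(1.0 - sigma)

    # upper incomplete gamma Gamma(a, eps) by trapezoids on a log grid over (eps, s_max)
    grid = [eps * (s_max / eps) ** (i / n_grid) for i in range(n_grid + 1)]

    def upper_inc_gamma(a):
        f = lambda s: s ** (a - 1.0) * math.exp(-s)
        return sum(0.5 * (f(x) + f(y)) * (y - x) for x, y in zip(grid, grid[1:]))

    e_j0 = upper_inc_gamma(1.0 - sigma) / upper_inc_gamma(-sigma)
    return (e_small + e_j0) / kappa
```

Evaluating this function over a grid of $\varepsilon$ values reproduces the qualitative behavior discussed below: $r(\varepsilon)$ decreases as $\varepsilon\to 0$, and smaller $\sigma$ requires a smaller threshold for the same $\nu$.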
As an interesting example, we evaluate the ratio $r(\varepsilon)$ when $\rho(s) = \frac{1}{\Gamma(1-\sigma)}s^{-1-\sigma}\mathrm{e}^{-\omega s}$ for $0 \leq \sigma < 1$, $\kappa > 0$ and $\omega = 1$, that is, when $\mu$ is the gen-
eralized gamma process, i.e. the unnormalized CRM defining NGG processes by
normalization.

Figure 2.1: Values of $r(\varepsilon)$ when $\rho$ is the Lévy intensity of the generalized gamma CRM, with $\kappa = 1$ and different values of $\sigma$ ($0.001$, $0.1$, $0.3$, $0.6$), as a function of $\log_{10}(\varepsilon)$.

By 8.354.2 in Gradshteyn and Ryzhik (2007), we have that
$$\mathrm{E}\left(\mu(0, \varepsilon]\right) = \frac{\kappa}{\Gamma(1-\sigma)}\left(\Gamma(1-\sigma) - \Gamma(1-\sigma; \varepsilon)\right)
= \frac{\kappa}{\Gamma(1-\sigma)}\left(\sum_{n=0}^{+\infty}\frac{(-1)^n\varepsilon^{1-\sigma+n}}{n!(1-\sigma+n)}\right) \stackrel{\varepsilon\to 0}{\sim} \frac{\kappa\varepsilon^{1-\sigma}}{\Gamma(2-\sigma)},$$
and $\mathrm{E}(J_0) = \Gamma(1-\sigma, \varepsilon)/\Gamma(-\sigma, \varepsilon)$. We also mention that $\mathrm{Var}(\mu(0, \varepsilon]) \sim (\kappa\varepsilon^{2-\sigma})/\Gamma(2-\sigma)$ as $\varepsilon$ tends to 0. Figure 2.1 shows $r(\varepsilon)$ when $\mu$ is the generalized gamma process
with $\kappa = 1$ and different values of $\sigma$, as a function of $\varepsilon$. Note that a smaller threshold
$\varepsilon$ is needed in order to obtain the same value of $\nu$ when the parameter $\sigma$ decreases
to 0.
Similar calculations can be derived when µ is the Bessel random measure intro-
duced in the next section.
2.5 Normalized Bessel random measure mixtures: density estimation
In this section we introduce a new normalized process, called the normalized Bessel
random measure. Section 2.5.1 describes theoretical results: in particular, we show
that this family encompasses the well-known Dirichlet process. Then we fit the
mixture model to synthetic and real datasets in Section 2.5.2. Results are illustrated
through a density estimation problem.
2.5.1 Definition
Let us consider a normalized completely random measure corresponding to mean
intensity
$$\rho(s; \omega) = \frac{1}{s}\,\mathrm{e}^{-\omega s}I_0(s), \quad s > 0,$$
where $\omega \geq 1$ and
$$I_\nu(s) = \sum_{m=0}^{+\infty}\frac{(s/2)^{2m+\nu}}{m!\,\Gamma(\nu + m + 1)}$$
is the modified Bessel function of order $\nu \geq 0$ (see Erdélyi et al., 1953, Sect 7.2.2).
It is straightforward to see that, for $s > 0$,
$$\rho(s; \omega) = \frac{1}{s}\,\mathrm{e}^{-\omega s} + \sum_{m=1}^{+\infty}\frac{1}{2^{2m}(m!)^2}\,s^{2m-1}\mathrm{e}^{-\omega s}, \qquad (2.20)$$
so that $\rho$ is the sum of the Lévy intensity of the gamma process with rate parameter
$\omega$ and of the Lévy intensities
$$\rho_m(s; \omega) = \frac{1}{2^{2m}(m!)^2}\,s^{2m-1}\mathrm{e}^{-\omega s}, \quad s > 0, \quad m = 1, 2, \dots \qquad (2.21)$$
corresponding to finite activity Poisson processes. It is simple to check that (2.1)
and (2.2) hold. Hence, following (2.3) in Section 2.2, we introduce the normalized
Bessel random measure $P$, with parameters $(\omega, \kappa)$, where $\omega \geq 1$ and $\kappa > 0$. Thanks
to (2.20) and the Superposition Property of Poisson processes, the total mass $T$ in
(2.3) can be written as
$$T \stackrel{d}{=} T_G + \sum_{m=1}^{+\infty} T_m, \qquad (2.22)$$
where $T_G, T_1, T_2, \dots$ are independent random variables, $T_G$ being the total mass
of the gamma process and $T_m$ the total mass of a completely random measure
corresponding to the intensity $\nu_m(ds, d\tau) = \rho_m(s)\,ds\,\kappa P_0(d\tau)$. In particular, $T_G \sim \mathrm{gamma}(\kappa, \omega)$, while $T_m = \sum_{j=1}^{N_m} J^{(m)}_j$, where $N_m \sim \mathrm{Poisson}\big(\kappa\Gamma(2m)/((2\omega)^{2m}(m!)^2)\big)$,
and the $J^{(m)}_j$ are the points of a Poisson process on $\mathbb{R}^+$ with intensity $\kappa\rho_m$. By this
notation we mean that $T_m$ is equal to 0 when $N_m = 0$, while, conditionally on
$N_m > 0$, $J^{(m)}_j \stackrel{iid}{\sim} \mathrm{gamma}(2m, \omega)$. We can write down the density function of $T$ via
the Lévy-Khintchine representation:
$$\psi(\lambda) := -\log\left(\mathrm{E}\left(\mathrm{e}^{-\lambda T}\right)\right) = \kappa\int_0^{+\infty}(1 - \mathrm{e}^{-\lambda s})\rho(s; \omega)\,ds$$
$$= \kappa\left(\log\left(\frac{\omega + \lambda}{\omega}\right) + \sum_{m=1}^{+\infty}\frac{\Gamma(2m)}{2^{2m}(m!)^2\omega^{2m}} - \sum_{m=1}^{+\infty}\frac{\Gamma(2m)}{2^{2m}(m!)^2(\omega + \lambda)^{2m}}\right)$$
$$= \kappa\log\left(\frac{\omega + \lambda + \sqrt{(\omega + \lambda)^2 - 1}}{\omega + \sqrt{\omega^2 - 1}}\right).$$
The same expression is obtained when $T \sim f_T(t) = \kappa\left(\omega + \sqrt{\omega^2 - 1}\right)^\kappa\frac{\mathrm{e}^{-\omega t}}{t}I_\kappa(t)$,
$t > 0$ (see Gradshteyn and Ryzhik, 2007, formula (17.13.112)). Observe that, when
$\omega = 1$, $f_T$ is called the Bessel function density (Feller, 1971). By (2.13), the EPPF of
the normalized Bessel random measure is:
$$p_B(n_1, \dots, n_k; \omega, \kappa) = \kappa^k\int_0^{+\infty}\frac{u^{n-1}}{\Gamma(n)}\left(\frac{\omega + \sqrt{\omega^2 - 1}}{\omega + u + \sqrt{(\omega + u)^2 - 1}}\right)^{\kappa}\frac{1}{(u + \omega)^n}$$
$$\times\prod_{j=1}^{k}\Gamma(n_j)\,{}_2F_1\!\left(\frac{n_j}{2}, \frac{n_j + 1}{2}; 1; \frac{1}{(u + \omega)^2}\right)du, \qquad (2.23)$$
where
$${}_2F_1(\alpha_1, \alpha_2; \gamma; z) := \sum_{m=0}^{\infty}\frac{(\alpha_1)_m(\alpha_2)_m}{(\gamma)_m}\,\frac{z^m}{m!}, \quad \text{with } (\alpha)_m := \frac{\Gamma(\alpha + m)}{\Gamma(\alpha)},$$
is the hypergeometric series (see Gradshteyn and Ryzhik, 2007, formula (9.100)).
The following proposition shows that the EPPF of the normalized Bessel random
measure converges to the EPPF of the Dirichlet process as the parameter $\omega$ increases.
The proof is given in Appendix 2.B.
Proposition 2.4
Let $(n_1, \dots, n_k)$ be a vector of positive integers such that $\sum_{i=1}^{k} n_i = n$, where
$k = 1, \dots, n$. Then the EPPF (2.23), associated with the normalized Bessel random
measure $P$ with parameter $(\omega, \kappa)$, $\omega \geq 1$, $\kappa > 0$, and mean measure $P_0$, is such that
$$\lim_{\omega\to+\infty} p_B(n_1, \dots, n_k; \omega, \kappa) = p_D(n_1, \dots, n_k; \kappa),$$
where $p_D(n_1, \dots, n_k; \kappa)$ is the EPPF of the Dirichlet process with measure parameter
$(\kappa, P_0)$.
The prior distribution of $K_n$, the number of distinct values in a sample of size
$n$ from the normalized Bessel random measure, could be derived from its EPPF in
(2.23). However, this is not an easy task from a computational point of view, so
we prefer to use a Monte Carlo strategy to simulate from the prior of $K_n$. The
simulation strategy is also useful to understand the meaning of the parameters of
the normalized Bessel random measure: $\kappa$ has the usual interpretation of the mass
parameter, since, when fixing $\omega$, $\mathrm{E}(K_n)$ increases with $\kappa$. On the other hand, the
effect of $\omega$ is quite peculiar: decreasing $\omega$ (thus drifting apart from the Dirichlet
process), with $\kappa$ fixed, the prior distribution of $K_n$ shifts towards smaller values.
However, when $\mathrm{E}(K_n)$ is kept fixed, the distribution has heavier tails if $\omega$ is small
(see Figures 2.2 and 2.4 (a)).
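The Monte Carlo strategy itself is simple: sample the weights of (an approximation of) the random probability measure, draw $n$ values from the resulting discrete distribution, and count the distinct ones. The sketch below illustrates the idea with Dirichlet process stick-breaking weights as a stand-in weight sampler (any $\varepsilon$-NormCRM weight sampler plugs in the same way); in the Dirichlet case the estimate can be checked against the known value $\mathrm{E}(K_n) = \sum_{i=0}^{n-1}\kappa/(\kappa+i)$.

```python
import bisect
import random

def stick_breaking_weights(kappa, rng, tol=1e-10):
    """Dirichlet process weights via stick breaking, V_i ~ Beta(1, kappa),
    truncated once the leftover stick is below tol (a stand-in sampler)."""
    weights, rest = [], 1.0
    while rest > tol:
        v = rng.betavariate(1.0, kappa)
        weights.append(rest * v)
        rest *= 1.0 - v
    return weights

def sample_kn(n, weight_sampler, rng):
    """K_n: number of distinct atoms hit by an iid sample of size n from the
    discrete random probability with the given weights (atom = index)."""
    w = weight_sampler(rng)
    cdf, acc = [], 0.0
    for x in w:
        acc += x
        cdf.append(acc)
    # each draw picks the first index whose cumulative weight exceeds u
    return len({bisect.bisect_left(cdf, rng.random() * acc) for _ in range(n)})
```

Repeating `sample_kn` many times gives a Monte Carlo approximation of the whole prior distribution of $K_n$, not just its mean, which is how plots like Figure 2.2 can be produced.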
The Lévy intensity (2.20) of the normalized Bessel completely random measure
has an expression similar to the intensity corresponding to an element of the class $\mathcal{C}$
in Trippa and Favaro (2012). Both intensities are linear combinations of the intensity
of the gamma process and of intensities of the type $s^{i-1}\mathrm{e}^{-\omega s}\mathbb{1}_{(0,+\infty)}(s)$, corresponding
to finite activity Poisson processes. Here, the intensity of the Bessel random probability
measure corresponds to an infinite mixture with fixed weights, where the indexes
$i$ are even integers (see (2.21)), while Trippa and Favaro (2012) assume a linear
combination of a finite number of components, through a vector of parameters.

Figure 2.2: Prior distribution of $K_n$ under a sample from the $\varepsilon$-NB process with $\varepsilon = 10^{-6}$, $\omega = 1.05$ and several values of $\kappa$ (0.1, 0.5, 1, 3, 5), as reported in the legend.
2.5.2 Application
In this section let us consider the hierarchical mixture model (2.18), where the
mixing measure is Pε, the ε-approximation of the normalized Bessel random measure
introduced above (here called the ε-NB(ω, κP0) mixture model). Of course, when
ε is small, this model approximates the corresponding mixture with mixing
measure P; to the best of our knowledge, this normalized Bessel completely
random measure has never been considered in the Bayesian nonparametric literature.
By decomposition (2.22), we argue that this model is suitable when the
unknown density shows many different components, a few of which are very
spiky (these should correspond to the Lévy intensities (2.21)), while there is a group of
flatter components which are explained by the intensity (1/s)e^{−ωs} of the gamma
process. For this reason, we consider a simulated dataset which is a sample from a
mixture of 5 Gaussian distributions with means and standard deviations equal to
(15, 1.1), (50, 1), (20, 4), (30, 5), (40, 5), and weights proportional to 10, 9, 4, 5, 5. The histogram of the simulated data, for n = 1000, is reported in Figure 2.3.
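For reference, a dataset with the mixture structure just described can be generated as follows; this is a sketch in Python, not the thesis code, and the random seed is an arbitrary choice of mine.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Means, standard deviations and (unnormalized) weights of the 5 components.
means = np.array([15.0, 50.0, 20.0, 30.0, 40.0])
sds = np.array([1.1, 1.0, 4.0, 5.0, 5.0])
weights = np.array([10.0, 9.0, 4.0, 5.0, 5.0])
weights = weights / weights.sum()

def simulate_mixture(n, rng):
    """Draw n observations from the five-component Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], sds[comp])

y = simulate_mixture(1000, rng)
```

Plotting a histogram of `y` reproduces the qualitative shape of Figure 2.3: two sharp spikes (around 15 and 50) over a flatter background.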
We report posterior estimates for different sets of hyperparameters of the ε-NB
mixture model when f(·; θ) is the Gaussian density on R and θ = (µ, σ²) stands for
its mean and variance. Moreover, P0(dµ, dσ²) = N(dµ; yn, σ²/κ0) × IG(dσ²; a, b).
We set κ0 = 0.01, a = 2 and b = 1, as first proposed in Escobar and West (1995). We
consider three sets of hyperparameters in order to assess the sensitivity of the
estimates under different conditions of variability; indeed, each set has a different
value of pε(2), which tunes the a-priori variance of Pε, as reported in (2.16). We
tested three different values for pε(2): pε(2) = 0.9 in set (A), pε(2) = 0.5 in set
(B) and pε(2) = 0.1 in set (C). Moreover, in each scenario we let the parameter
1/ω range in {0.01, 0.25, 0.5, 0.75, 0.95}; note that the extreme case of ω = 100
(or equivalently 1/ω = 0.01) corresponds to an approximation of the DPM model.
Figure 2.3: Density estimate for case A5: posterior mean (line), 90% pointwise credibility intervals (shadowed area), true density (dashed) and the histogram of simulated data.
The mass parameter κ is then fixed to achieve the desired level of pε(2). As far
as the choice of ε is concerned, we set it equal to 10⁻⁶: this provides a pretty good
approximation a priori (see Section 2.4.1); moreover, posterior inference proved to
be fairly robust with respect to ε. In the end, we obtained 15 tests, listed in Table 2.1.
As mentioned before, it is possible to choose a prior for ε, even if, for the ρ in (2.20),
the computational cost would greatly increase due to the evaluation of the functions
₂F₁ in (2.23).
We have implemented our Gibbs sampler in C++. All the tests in Sections 2.5
and 2.6 were run on a laptop with an Intel Core i7 2670QM processor and 6GB of
RAM. Every run produced a final sample size of 5000 iterations, after a thinning of
10 and an initial burn-in of 5000 iterations. Convergence was checked each time
using the standard tools of the R package CODA.
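The post-processing just described can be sketched as follows; note that a final sample of 5000 draws after a burn-in of 5000 and a thinning of 10 implies 55000 total iterations, and the chain below is a placeholder array rather than actual sampler output.

```python
import numpy as np

def thin_chain(chain, burn_in=5000, thin=10):
    """Discard the burn-in iterations, then keep every `thin`-th draw."""
    return chain[burn_in::thin]

# 5000 burn-in + 50000 further iterations, thinned by 10, leave 5000 draws;
# `raw` is a placeholder array, not actual output of the Gibbs sampler.
raw = np.arange(55000)
kept = thin_chain(raw)
```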
Here, we focus on density estimation: all the tests provide similar estimates,
quite faithful to the true density. Figure 2.3 shows the density estimate and pointwise
90% credibility intervals for case A5; the true density is superimposed as a dashed line.
Figure 2.4 displays prior and posterior distributions, respectively, of the number Kn
of groups, i.e. the number of unique values among (θ1, . . . , θn) in (2.18), under two
sets of hyperparameters: A1, representing an approximation of the DPM model, and
A5, where the parameter ω is nearly 1. From Figure 2.4 it is clear that A5 is more
flexible than A1: for case A5, a priori the variance of Kn is larger, and, on the other
hand, the posterior probability mass at 5 (the true value) is larger.
In order to compare different priors, we take into account five predictive
goodness-of-fit indexes: (i) the sum of squared errors (SSE), i.e. the sum of the
squared differences between the yi and the predictive mean E(Yi|data) (yes, we are
using the data twice!); (ii) the sum of standardized absolute errors (SSAE), given by
$\sum_{i} |y_i - \mathrm{E}(Y_i\mid \text{data})| / \sqrt{\mathrm{Var}(Y_i\mid \text{data})}$; (iii) the log-pseudo
marginal likelihood (LPML), quite standard in the Bayesian literature, defined as
the sum of log(CPOi), where CPOi is the conditional predictive ordinate of yi,
i.e. the value of the predictive distribution evaluated at yi, conditioning on the training
sample given by all data except yi. The last two indexes, (iv) WAIC1 and (v)
WAIC2, as denoted here, were proposed in Watanabe (2010) and analyzed in depth
in Gelman et al. (2014): they are generalizations of the AIC, adding two types of
penalization, both accounting for the effective number of parameters. The bias
correction in WAIC1 is similar to the bias correction in the definition of the DIC,
while WAIC2 is based on the sum of the posterior variances of the conditional density of
the data. See Gelman et al. (2014) for their precise definitions.

Figure 2.4: Prior (left) and posterior (right) distributions of the number Kn of groups for test A1 (gray) and A5 (blue).

Table 2.1 shows the
values of the five indexes for each test: the optimal (according to each index) tests
are highlighted in bold for the experiments (A), (B) and (C). It is apparent that
the different tests provide similar values of the indexes, except for SSE, indicating that,
from a predictive viewpoint, there are no significant differences among the priors.
However, especially when the value of κ is small, i.e. in all tests A and B, a model
with a smaller ω tends to outperform the Dirichlet process case (approximately,
when ω = 100). On the other hand, the SSE index shows quite different values
among the tests: it is well known that this index favors complex models and
leads to better results when data are overfitted. Therefore, tests with a higher
value of κ are always preferable according to this criterion.
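As an illustration, the five indexes can be computed from Monte Carlo output roughly as follows. This is a hedged sketch with my own function and argument names (the thesis code is in C++); CPO is estimated with the usual harmonic-mean estimator, and the WAIC penalties follow the pWAIC1/pWAIC2 definitions of Gelman et al. (2014), reported here on the lppd scale (the scale used in the thesis tables may differ).

```python
import numpy as np

def predictive_indexes(y, pred_draws, dens_draws):
    """Sketch of the five goodness-of-fit indexes.

    y          : (n,) observed data
    pred_draws : (S, n) draws from the predictive distribution of each Y_i
    dens_draws : (S, n) conditional densities f(y_i | theta^(s)) per iteration
    """
    pred_mean = pred_draws.mean(axis=0)
    pred_var = pred_draws.var(axis=0)
    sse = np.sum((y - pred_mean) ** 2)
    ssae = np.sum(np.abs(y - pred_mean) / np.sqrt(pred_var))
    # LPML = sum_i log CPO_i, with the harmonic-mean estimator of CPO_i.
    cpo = 1.0 / np.mean(1.0 / dens_draws, axis=0)
    lpml = np.sum(np.log(cpo))
    # lppd and the two penalties pWAIC1, pWAIC2 of Gelman et al. (2014).
    lppd_i = np.log(dens_draws.mean(axis=0))
    p1 = 2.0 * np.sum(lppd_i - np.log(dens_draws).mean(axis=0))
    p2 = np.sum(np.log(dens_draws).var(axis=0))
    lppd = np.sum(lppd_i)
    return {"SSE": sse, "SSAE": ssae, "LPML": lpml,
            "WAIC1": lppd - p1, "WAIC2": lppd - p2}
```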
We fitted our model also to a real dataset, the Hidalgo stamps data of Wilson (1983),
consisting of n = 485 measurements of stamp thickness in millimeters
(here multiplied by 10³). The stamps were printed between 1872 and 1874
on different paper types; see the data histogram in Figure 2.5. This dataset has been
analyzed by different authors in the context of mixture models: see, for instance,
Nieto-Barajas (2013).
Table 2.1: Predictive goodness-of-fit indexes for the simulated dataset.

Test   ω     κ     SSE      SSAE    WAIC1     WAIC2     LPML
A1     100   0.06  6346.59  811.16  -3312.44  -3312.55  -3312.55
A2     4     0.09  5812.86  810.43  -3312.33  -3312.42  -3312.43
A3     2     0.1   6089.19  810.99  -3312.38  -3312.47  -3312.48
A4     1.33  0.11  6498.23  811.29  -3312.54  -3312.62  -3312.63
A5     1.05  0.11  5725.18  810.39  -3312.27  -3312.36  -3312.36
B1     100   0.43  5184.25  809.61  -3311.95  -3312     -3312.01
B2     4     0.67  5125.41  809.7   -3312.19  -3312.25  -3312.26
B3     2     0.81  4610.39  809.42  -3311.92  -3311.98  -3312
B4     1.33  0.93  4246.43  809.07  -3311.75  -3311.83  -3311.84
B5     1.05  1     4571.09  809.08  -3311.96  -3312.05  -3312.06
C1     100   1.56  3707.5   809.36  -3311.73  -3311.86  -3311.88
C2     4     2.67  2194.1   808.8   -3312.02  -3312.23  -3312.26
C3     2     3.64  1223.86  809.28  -3312.62  -3312.96  -3312.99
C4     1.33  5.29  748.85   808.7   -3313.05  -3313.51  -3313.54
C5     1.05  8.95  685      807.96  -3312.9   -3313.36  -3313.38

We report posterior inference for the set of hyperparameters which is most in
agreement with our prior belief: the mean distribution is given by P0(dµ, dσ²) =
N(dµ; yn, σ²/κ0) × IG(dσ²; a, b) as before, with κ0 = 0.005, a = 2 and b = 0.1. The
approximation parameter ε of the ε-NB(ω, κP0) random measure is fixed to 10⁻⁶;
on the other hand, in order to set the parameters ω and κ, we argue as follows: ω ranges
in {1.05, 5, 10, 1000} and we choose the mass parameter κ such that the prior mean
of the number of clusters, i.e. E(Kn), equals the desired value. As noted in Section 2.5.1,
a closed form of the prior distribution of Kn is not available, so we resort to Monte
Carlo simulation to estimate it. Table 2.2 shows the four couples (ω, κ) yielding
E(Kn) = 7: indeed, according to Ishwaran and James (2002) and McAuliffe et al.
(2006) and references therein, there are at least 7 different groups (but the true
number is unknown), corresponding to the number of types of paper used. For an
in-depth discussion about the appropriate number of groups in the Hidalgo stamps data,
we refer the reader to Basford et al. (1997). Table 2.2 also reports prior standard
deviations of Kn: even if the a-priori differences are small, the posteriors appear to
be quite different among the 4 tests. All the posterior distributions of Kn support
the conjecture of at least seven distinct modes in the data; in particular, Figure 2.5
(b) displays the posterior distribution of Kn for Test 4. A modest amount of mass
is given to fewer than 7 groups, and the mode is at 11. Even Test 1, corresponding to
the Dirichlet process case, does not give mass to fewer than 7 groups, and there 9 is the
mode. Density estimates seem pretty good; an example is given in Figure 2.5 (a),
with a 90% credibility band for Test 4.
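In the Dirichlet process limiting case, the prior mean of Kn is available in closed form, E(Kn) = Σᵢ κ/(κ + i − 1), so the κ matching a target E(Kn) can be found by bisection rather than Monte Carlo. The sketch below (my own code, with n = 485 and target 7 as in Table 2.2) illustrates the tuning idea; for the ε-NB tests with finite ω, where no closed form exists, a Monte Carlo estimate of E(Kn) would replace `dp_expected_k`.

```python
def dp_expected_k(kappa, n):
    """Prior mean of K_n under a DP(kappa): sum_{i=1}^n kappa/(kappa + i - 1)."""
    return sum(kappa / (kappa + i) for i in range(n))

def kappa_for_mean_k(target, n, lo=1e-4, hi=100.0, iters=100):
    """Bisection on kappa; dp_expected_k is increasing in kappa."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dp_expected_k(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

kappa = kappa_for_mean_k(7.0, 485)  # mass giving E(K_485) = 7 in the DP case
```

For n = 485 the bisection lands near κ ≈ 1, the same order of magnitude as the κ values of Table 2.2 (which refer to the ε-truncated Bessel measure, not an exact DP, so the numbers need not coincide).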
As in the simulated data example, some predictive goodness-of-fit indexes are
reported in Table 2.2: the optimal value for each index is indicated in bold. The
SSE is significantly lower when ω is small, thus suggesting a greater flexibility of
the model with small values of ω. The other indexes assume their optimal value in
Figure 2.5: Posterior inference for the Hidalgo stamp data for Test 4: histogram of the data, density estimate and 90% pointwise credibility intervals (a); posterior distribution of Kn (b).
Table 2.2: Predictive goodness-of-fit indexes for the Hidalgo stamps data.

Test  ω     κ     E(K)  sd(K)  SSE    SSAE    WAIC1    WAIC2    LPML
1     1000  0.98  7     2.04   15.17  384.1   -713.12  -713.96  -714.12
2     10    0.91  7     2.13   12.85  383.51  -713.22  -714.04  -714.25
3     5     0.92  7     2.18   13.52  383.68  -713.52  -714.3   -714.4
4     1.05  1.02  7     2.32   11.12  383.38  -712.84  -713.66  -714.05
Test 4 as well, even if those values are similar across the tests.
Our ε-approximation method turned out to be accurate and fast when compared
with competitors (the slice sampler and an a-posteriori truncation method) when the
mixing random probability measure is the NGG process and the kernel is Gaussian;
see Argiento et al. (2016a), Section 5.
2.6 Linear dependent NGG mixtures: an application to sports data
Let us consider a regression problem where, for ease of notation, the response Y is
univariate and continuous. We model the relationship (in distributional terms) between
the vector of covariates x = (x1, . . . , xp) and the response Y through a mixture
density, where the mixing measure is a collection {Px, x ∈ X} of ε-NormCRMs,
X being the space of all possible covariates. We follow the same approach as in
MacEachern (2000) and De Iorio et al. (2009) for the dependent Dirichlet process.
We define the dependent ε-NormCRM process {Px, x ∈ X}, conditionally on x,
as

$$ P_x \stackrel{d}{=} \sum_{j=0}^{N_\varepsilon} P_j\,\delta_{\gamma_j(x)}. \qquad (2.24) $$
The weights Pj are the normalized jumps as in (2.7), while the locations γj(x),
j = 1, 2, . . ., are independent stochastic processes with index set X and marginal
distributions P0x. Model (2.24) is such that, marginally, Px follows an ε-NormCRM
process with parameter (ρ, κP0x), where ρ is the intensity of a Poisson process on
R⁺, κ > 0, and P0x is a probability on R. Observe that, since Nε and the Pj do not
depend on x, (2.24) is a generalization of the single-weights dependent Dirichlet
process (see Barrientos et al., 2012, for this terminology). We also assume the
functions x 7→ γj(x) to be continuous.
The dependent ε-NormCRM process in (2.24) takes into account the vector
of covariates x only through γj(x). In particular, when the kernel of the mixture
(2.18) belongs to the exponential family, for each j, γj(x) = γ(x; τj) can be taken
as the link function of a generalized linear model, so that (2.18) specializes to

$$ \begin{aligned} Y_i \mid \theta_i, x_i &\overset{\text{ind}}{\sim} f(y; \gamma(x_i, \theta_i)), & i &= 1, \ldots, n, \\ \theta_i \mid P_\varepsilon &\overset{\text{iid}}{\sim} P_\varepsilon, & i &= 1, \ldots, n, \quad \text{where } P_\varepsilon \sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0). \end{aligned} \qquad (2.25) $$
This last formulation is convenient because it facilitates parameter interpretation
as well as numerical posterior computation.
We analyze the Australian Institute of Sport (AIS) data set (Cook and Weisberg,
1994), which consists of 11 physical measurements on 202 athletes (100 females and
102 males). Here the response is the lean body mass (lbm), while three covariates
are considered: the red cell count (rcc), the height in cm (Ht) and the weight in
kg (Wt). The data set is contained in the R package DPpackage (Jara et al.,
2011). The actual model (2.25) we consider here is the one where f(·; µ, η²) is the Gaussian
density with mean µ and variance η²; moreover, µ = γ(x, θ) = xᵗθ, and the
mixing measure Pε is the ε-NGG(κ, σ, P0), as introduced in Argiento et al. (2016a).
We have considered two cases, where the variance η² is either mixed with respect to the
NGG process or given a parametric density; in both cases,
by linearity of the mean xᵗθ, the model (here called the linear dependent NGG mixture)
can be interpreted as an NGG process mixture model, and inference can be achieved
via an algorithm similar to that in Section 2.4. We set ε = 10⁻⁶, which provides
a moderate value for the ratio r(ε) in (2.19), σ ∈ {0.001, 0.125, 0.25}, and κ such
that E(Kn) ≈ 5 or 10. When the variance η² is included in the location points
of the ε-NGG process, then P0 is N4(b0, Σ0) × IG(ν0/2, ν0η0²/2); on the other
hand, when η² is given a parametric density, then η² ∼ IG(ν0/2, ν0η0²/2). We fixed the
hyperparameters in agreement with the least squares estimate: b0 = (−50, 5, 0, 0),
Σ0 = diag(100, 10, 10, 10), ν0 = 4, η0² = 1. For all the experiments, we computed
the posterior of the number of groups, the predictive densities at different values of
the covariate vectors and the cluster estimate via posterior maximization of Binder's
loss function (see Lau and Green, 2007a).
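A common way to obtain the cluster estimate under Binder's loss (with equal misclassification costs) is to compute the posterior similarity matrix from the MCMC partition draws and search over the sampled partitions for the minimizer of the posterior expected loss. The sketch below is one implementation of this generic recipe, in my own notation; it is not necessarily the exact procedure used in the thesis.

```python
import numpy as np

def binder_estimate(partitions):
    """Pick, among sampled partitions, the minimizer of the posterior
    expected Binder loss (equal misclassification costs).

    partitions : (S, n) integer array; row s holds cluster labels at iteration s.
    """
    S, n = partitions.shape
    # Posterior similarity matrix: pi[i, j] = P(items i, j co-clustered | data).
    pi = np.zeros((n, n))
    for row in partitions:
        pi += row[:, None] == row[None, :]
    pi /= S
    iu = np.triu_indices(n, k=1)
    best, best_loss = None, np.inf
    for row in partitions:
        co = (row[:, None] == row[None, :])[iu].astype(float)
        # Up to a constant, the loss is sum_{i<j} |1{c_i = c_j} - pi_ij|.
        loss = np.sum(np.abs(co - pi[iu]))
        if loss < best_loss:
            best, best_loss = row, loss
    return best
```

For example, with draws `[[0,0,1,1], [0,0,1,1], [0,0,1,1], [0,1,0,1]]` the estimate is `[0,0,1,1]`, the partition most compatible with the similarity matrix.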
Moreover, we compared the different prior settings by computing predictive goodness-of-fit
tools, specifically the log pseudo-marginal likelihood (LPML) and the sum of
squared errors (SSE), as introduced in Section 2.5.2. The minimum value of SSE
among our experiments was achieved when η² is included in the location of the
ε-NGG process, σ = 0.001 and κ = 0.8, so that E(Kn) ≈ 5. On the other hand,
the optimal LPML was achieved when σ = 0.125, κ = 0.4, and E(Kn) ≈ 5.

Figure 2.6: Posterior distributions of the number Kn of groups (left) and cluster estimate (right) under the linear dependent ε-NGG mixture.

The posterior of Kn and the cluster estimate under this last hyperparameter setting are shown in Figure 2.6
(left and right, respectively); in particular, the cluster estimate is displayed in
the scatterplot of Wt vs lbm. In spite of the vague prior, the posterior of Kn is
almost degenerate at 2, giving evidence for the existence of two linear relationships
between lbm and Wt.
Finally, Figure 2.7 displays predictive densities and 95% credibility bands for
3 athletes, a female (Wt=60, rcc=3.9, Ht=176 and lbm=53.71) and two males
(Wt=67.1, 113.7, rcc=5.34, 5.17, Ht=178.6, 209.4 and lbm=62, 97), respectively, under
the same hyperparameter setting as Figure 2.6; the dashed lines are the observed values
of the response. Depending on the value of the covariate, the distribution shows one
or two peaks: this reflects the dependence of the grouping of the data on the value of
x. This figure highlights the versatility of nonparametric priors in a linear regression
setting with respect to customary parametric priors: indeed, the model is able
to capture in detail the behavior of the data, even when several clusters are present.

Figure 2.7: Predictive distributions of lbm for three different athletes: Wt=60, rcc=3.9, Ht=176 (left); Wt=67.1, rcc=5.34, Ht=178.6 (center); Wt=113.7, rcc=5.17, Ht=209.4 (right). The shaded area is the predictive 95% pointwise credible interval, while the dashed vertical line denotes the observed value of the response.

2.7 Conclusion

We have proposed a new model for density and cluster estimation in the Bayesian
nonparametric framework. In particular, a finite-dimensional process, the ε-NormCRM,
has been defined, which converges in distribution to the corresponding normalized
completely random measure as ε tends to 0. Here, the ε-NormCRM is the mixing
measure in a mixture model. In this chapter we have
fixed ε very small, but we could choose a prior for ε and include this parameter in
the Gibbs sampler scheme. Among the achievements of this work, we have generalized
all the theoretical results obtained in the special case of the NGG process in Argiento
et al. (2016a), including the expression of the EPPF for an ε-NormCRM process,
its convergence to the corresponding EPPF of the underlying nonparametric process,
and the posterior characterization of Pε. Moreover, we have provided a general
Gibbs sampler scheme to sample from the posterior of the mixture model. To show
the performance of our algorithm and the flexibility of the model, we have illustrated
two examples via normalized completely random measure mixtures: in the first
application, we have introduced a new normalized completely random measure, named the
normalized Bessel random measure; we have studied its theoretical properties and
used it as the mixing measure in a model to fit simulated and real datasets. The
second example we have dealt with is a linear dependent ε-NGG mixture, where
the dependence lies in the support points of the mixing random probability, fitted
to a well-known dataset.
Appendix 2.A: Details on full-conditionals for the Gibbs sampler
Here, we provide some details about Step 3 of the Gibbs sampler in Section 2.4.
As far as Step 3a is concerned, the full-conditional L(ε|u, θ, Y) is obtained by
integrating out Nε (or equivalently Nna) from the law L(Nε, u, θ, Y), as follows:

$$ \begin{aligned} \mathcal{L}(\varepsilon \mid u, \boldsymbol\theta, \boldsymbol Y) &\propto \sum_{N_{na}=0}^{+\infty} \mathcal{L}(N_{na}, \varepsilon, u, \boldsymbol\theta, \boldsymbol Y) = \sum_{N_{na}=0}^{+\infty} \pi(\varepsilon)\, \mathrm{e}^{-\Lambda_\varepsilon}\, \frac{\Lambda_{\varepsilon,u}^{N_{na}}}{\Lambda_\varepsilon}\, \frac{N_{na}+k}{N_{na}!} \prod_{i=1}^{k} \int_\varepsilon^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds \\ &= \left( \prod_{i=1}^{k} \int_\varepsilon^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds \right) \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon}\, \frac{\Lambda_{\varepsilon,u} + k}{\Lambda_\varepsilon}\, \pi(\varepsilon) = f_\varepsilon(u; n_1, \ldots, n_k)\, \pi(\varepsilon), \end{aligned} $$

where we used the identity $\sum_{N_{na}=0}^{+\infty} \Lambda_{\varepsilon,u}^{N_{na}} (N_{na}+k)/N_{na}! = \mathrm{e}^{\Lambda_{\varepsilon,u}} (\Lambda_{\varepsilon,u}+k)$. Moreover,
fε(u; n1, . . . , nk) is defined in (2.32). This step depends explicitly on the expression of
ρ(s). Step 3b consists in sampling from L(Pε|ε, u, θ) as reported in Corollary 2.3.1. In
order to sample a draw from the posterior distribution of the (unnormalized) measure,
we follow Theorem 2.3.1. The component $\mu^{(a)}_{\varepsilon,u}$ is obtained by generating
independently from $\mathcal{L}(J_{l_i^*}) \propto J_{l_i^*}^{n_i}\, \mathrm{e}^{-u J_{l_i^*}}\, \rho(J_{l_i^*})\, \mathbf{1}_{(\varepsilon,\infty)}(J_{l_i^*})$, i = 1, . . . , k.
On the other hand, $\mu^{(na)}_{\varepsilon,u}$ satisfies the distributional identity described at point 1 of
the proposition, and therefore we simulate it as follows:

1. Draw x from the Bernoulli distribution with parameter p = Λε,u/(Λε,u + k).

2. Draw N⁽ⁿᵃ⁾ from Px(Λε,u), where Px(Λ) denotes the shifted Poisson distribution with support {x, x + 1, x + 2, . . .} and mean Λ + x.

3. If N⁽ⁿᵃ⁾ = 0, let $\mu^{(na)}_{\varepsilon,u}$ be the null measure. Otherwise, draw an iid sample (Jj, τj), j = 1, . . . , N⁽ⁿᵃ⁾, from ρε(s) ds P0(dτ), and set $\mu^{(na)}_{\varepsilon,u} = \sum_{j=1}^{N^{(na)}} J_j \delta_{\tau_j}$.
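The three simulation steps can be sketched as follows; `sample_jump` (a sampler from the normalized jump density implied by ρε) and `sample_atom` (a sampler from P0) are placeholders to be supplied for a concrete intensity ρ.

```python
import numpy as np

def sample_mu_na(lambda_eps_u, k, sample_jump, sample_atom, rng):
    """One draw of the non-allocated part mu^(na), following steps 1-3.

    Returns a list of (jump, atom) pairs; the empty list is the null measure.
    sample_jump / sample_atom are user-supplied samplers (placeholders here).
    """
    # 1. Bernoulli switch with success probability Lambda/(Lambda + k).
    x = rng.binomial(1, lambda_eps_u / (lambda_eps_u + k))
    # 2. Shifted Poisson count: support {x, x+1, ...}, mean Lambda + x.
    n_na = rng.poisson(lambda_eps_u) + x
    # 3. iid jump/atom pairs.
    return [(sample_jump(rng), sample_atom(rng)) for _ in range(n_na)]
```

The expected number of non-allocated jumps is then Λε,u + Λε,u/(Λε,u + k), which follows directly from steps 1 and 2.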
Appendix 2.B: Proofs of the theorems
Proof of Theorem 2.3.1
Conditionally on the unnormalized measure µε (see (2.6)), the law of θ is given by

$$ \mathbb{P}(\theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n \mid \mu_\varepsilon) = \frac{1}{T_\varepsilon^n} \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j}. $$

By considering the variable U in (2.8), we express the joint conditional distribution of θ and U as

$$ \mathbb{P}(\theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) = \frac{u^{n-1}}{\Gamma(n)}\, \mathrm{e}^{-T_\varepsilon u}\, du \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j}. \qquad (2.26) $$
The posterior distribution of µε can be characterized by its Laplace functional; we have

$$ \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)} \,\Big|\, \theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \right) = \frac{\mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)}\, \mathbb{P}(\theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) \right)}{\mathrm{E}\!\left( \mathbb{P}(\theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) \right)}. \qquad (2.27) $$

Let us focus on the numerator in (2.27); by (2.26) we obtain:

$$ \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)}\, \mathbb{P}(\theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) \right) = \frac{u^{n-1} du}{\Gamma(n)}\, \mathrm{E}\!\left( \mathrm{e}^{-J_0 (f(\tau_0) + u)}\, \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k} \left( \mu_\varepsilon(d\theta_j^*) + J_0 \delta_{\tau_0}(d\theta_j^*) \right)^{n_j} \right). \qquad (2.28) $$
Moreover, if P0 is an absolutely continuous probability, then, for each j = 1, . . . , k,

$$ \left( \mu_\varepsilon(d\theta_j^*) + J_0 \delta_{\tau_0}(d\theta_j^*) \right)^{n_j} = \mu_\varepsilon(d\theta_j^*)^{n_j} + J_0^{n_j} \delta_{\tau_0}(d\theta_j^*), $$

so that

$$ \prod_{j=1}^{k} \left( \mu_\varepsilon(d\theta_j^*)^{n_j} + J_0^{n_j} \delta_{\tau_0}(d\theta_j^*) \right) = \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j} + \sum_{l=1}^{k} \delta_{\tau_0}(d\theta_l^*)\, J_0^{n_l} \prod_{j \neq l} \mu_\varepsilon(d\theta_j^*)^{n_j}. $$

Therefore, the expected value on the right-hand side of (2.28) is:

$$ \mathrm{E}\!\left( \mathrm{e}^{-J_0(f(\tau_0)+u)} \right) \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j} \right) + \sum_{l=1}^{k} \mathrm{E}\!\left( \mathrm{e}^{-J_0(f(\tau_0)+u)} J_0^{n_l} \delta_{\tau_0}(d\theta_l^*) \right) \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j \neq l} \mu_\varepsilon(d\theta_j^*)^{n_j} \right). $$
The representation of a CRM via transformation of a Poisson process can be extended
to $\mu_\varepsilon(d\theta_j^*)^{n_j} = \int_{\mathbb{R}^+ \times \Theta} s^{n_j} \delta_\tau(d\theta_j^*)\, N(ds, d\tau)$, where N is a Poisson process with mean
intensity νε(ds, dτ). If we apply Palm's formula (see Daley and Vere-Jones, 2007,
Proposition 13.1.IV) to $\mu_\varepsilon(d\theta_k^*)^{n_k}$, we have that

$$ \begin{aligned} &\mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j} \right) \\ &\quad= \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k-1} \mu_\varepsilon(d\theta_j^*)^{n_j} \int_{\mathbb{R}^+ \times \Theta} s_k^{n_k} \delta_{\tau_k}(d\theta_k^*)\, N(ds_k, d\tau_k) \right) \\ &\quad= \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k-1} \mu_\varepsilon(d\theta_j^*)^{n_j} \right) P_0(d\theta_k^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_k^*)+u)s_k}\, s_k^{n_k}\, \kappa\rho(s_k)\, ds_k \\ &\qquad \text{(iterating Palm's formula further } k-1 \text{ times)} \\ &\quad= \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \right) \prod_{j=1}^{k} \left( P_0(d\theta_j^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_j^*)+u)s_j}\, s_j^{n_j}\, \kappa\rho(s_j)\, ds_j \right) \\ &\quad= \exp\!\left\{ -\int_{\mathbb{R}^+ \times \Theta} \left( 1 - \mathrm{e}^{-s(f(\tau)+u)} \right) \nu_\varepsilon(ds, d\tau) \right\} \prod_{j=1}^{k} P_0(d\theta_j^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_j^*)+u)s_j}\, s_j^{n_j}\, \kappa\rho(s_j)\, ds_j. \end{aligned} $$
In other words, the numerator of (2.27) is equal to

$$ \frac{u^{n-1}}{\Gamma(n)}\, \frac{\int_{\mathbb{R}^+ \times \Theta} \mathrm{e}^{-s(f(\tau)+u)}\, \nu_\varepsilon(ds, d\tau) + k}{\Lambda_\varepsilon}\, \mathrm{e}^{-\int_{\mathbb{R}^+ \times \Theta} \left( 1 - \mathrm{e}^{-s(f(\tau)+u)} \right) \nu_\varepsilon(ds, d\tau)} \times \prod_{j=1}^{k} P_0(d\theta_j^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_j^*)+u)s}\, s^{n_j}\, \kappa\rho(s)\, ds. \qquad (2.29) $$

Observe that, if we plug the function f ≡ 0 into (2.29), we obtain the denominator of the ratio (2.27), that is

$$ \mathbb{P}(d\theta_1, \ldots, d\theta_n, du) = \frac{u^{n-1}}{\Gamma(n)}\, \frac{\Lambda_{\varepsilon,u} + k}{\Lambda_\varepsilon}\, \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon} \prod_{j=1}^{k} P_0(d\theta_j^*)\, k_\varepsilon(u, n_j), \qquad (2.30) $$

where, for n > 0, $k_\varepsilon(u, n) = \int_\varepsilon^{\infty} \mathrm{e}^{-us} s^{n}\, \kappa\rho(s)\, ds = (-1)^n \frac{d^n}{du^n} \psi_\varepsilon(u)$, and $\psi_\varepsilon(u) := -\log\!\left( \mathrm{E}(\mathrm{e}^{-u T_\varepsilon}) \right) = \Lambda_\varepsilon - \Lambda_{\varepsilon,u}$.
We are ready to compute the posterior Laplace functional of µε: substituting
(2.29) and (2.30) into the numerator and denominator of (2.27), we have

$$ \begin{aligned} \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)} \,\Big|\, \theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du \right) = {}&\left\{ \frac{\int_{\mathbb{R}^+ \times \Theta} \mathrm{e}^{-s f(\tau)}\, \mathrm{e}^{-su}\, \nu_\varepsilon(ds, d\tau) + k}{\Lambda_{\varepsilon,u} + k}\, \mathrm{e}^{-\int_{\mathbb{R}^+ \times \Theta} \left( 1 - \mathrm{e}^{-s f(\tau)} \right) \mathrm{e}^{-su}\, \nu_\varepsilon(ds, d\tau)} \right\} \\ &\times \left\{ \prod_{j=1}^{k} \int_0^{\infty} \mathrm{e}^{-s f(\theta_j^*)}\, \frac{\mathrm{e}^{-su}\, s^{n_j}\, \kappa\rho(s)\, \mathbf{1}_{(\varepsilon,\infty)}(s)}{k_\varepsilon(u, n_j)}\, ds \right\}. \qquad (2.31) \end{aligned} $$
This expression shows that the posterior Laplace functional of µε, conditionally on
U ∈ du, factorizes into two terms. This proves the independence property in point 3.
We denote the unnormalized process of non-allocated jumps by $\mu^{(na)}_{u,\varepsilon}$. Its conditional
Laplace transform is given by the first factor (between braces) on the right-hand side
of (2.31). In order to obtain point 1 of the theorem, characterization (2.10) gives
that the law of $\mu^{(na)}_{u,\varepsilon}$ coincides with the law of a process µ* as given in (2.9), with
(exponentially tilted) Lévy intensity $\mathrm{e}^{-su}\nu_\varepsilon(ds, d\tau)$ and probability of success of the
Bernoulli mixing random variable $p = \Lambda_{\varepsilon,u}/(k + \Lambda_{\varepsilon,u})$. As far as point 2 is concerned, the
Laplace functional (2.31) gives that the process of the allocated jumps has fixed
atoms at the observed unique values θ₁*, . . . , θk*, i.e. it can be represented as

$$ \mu^{(a)}_\varepsilon(\cdot) = \sum_{j=1}^{k} J^{(a)}_j \delta_{\theta_j^*}(\cdot). $$

In this case, the weights of the allocated masses $J^{(a)}_j$ are independent and distributed according to

$$ \mathbb{P}\!\left( J^{(a)}_j \in ds \,\Big|\, \theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du \right) = \frac{\mathrm{e}^{-su}\, s^{n_j}\, \kappa\rho(s)\, \mathbf{1}_{(\varepsilon,\infty)}(s)}{k_\varepsilon(u, n_j)}\, ds, $$

for any j = 1, . . . , k. Finally, point 4 follows easily from (2.30).
Proof of Proposition 2.1
This proposition follows from (2.30). In fact, we first observe that
$\mathbb{P}(\theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du) = \mathbb{P}(p_n, \theta_1^* \in d\theta_1^*, \ldots, \theta_k^* \in d\theta_k^*, U \in du)$,
and then integrate out θ₁*, . . . , θk* and U from (2.30) to obtain (2.12).
Proof of Proposition 2.2
By Proposition 2.1, $p_\varepsilon(n_1, \ldots, n_k) = \int_0^{+\infty} f_\varepsilon(u; n_1, \ldots, n_k)\, du$, where

$$ f_\varepsilon(u; n_1, \ldots, n_k) = \frac{u^{n-1}}{\Gamma(n)}\, \frac{k + \Lambda_{\varepsilon,u}}{\Lambda_\varepsilon}\, \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon} \prod_{i=1}^{k} \int_\varepsilon^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds, \qquad (2.32) $$

with u > 0. On the other hand, the EPPF of a NormCRM(ρ, κP0) can be written
as $p_0(n_1, \ldots, n_k) = \int_0^{+\infty} f_0(u; n_1, \ldots, n_k)\, du$, where

$$ f_0(u; n_1, \ldots, n_k) = \frac{u^{n-1}}{\Gamma(n)} \exp\!\left\{ \kappa \int_0^{+\infty} (\mathrm{e}^{-us} - 1)\, \rho(s)\, ds \right\} \prod_{i=1}^{k} \int_0^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds, $$

with u > 0. We first show that

$$ \lim_{\varepsilon \to 0} f_\varepsilon(u; n_1, \ldots, n_k) = f_0(u; n_1, \ldots, n_k) \quad \text{for any } u > 0. \qquad (2.33) $$

In particular, we have that

$$ \lim_{\varepsilon \to 0} \int_\varepsilon^{+\infty} s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds = \int_0^{+\infty} s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds $$

and

$$ \lim_{\varepsilon \to 0} \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon} = \exp\!\left\{ \kappa \int_0^{+\infty} (\mathrm{e}^{-us} - 1)\, \rho(s)\, ds \right\}, $$

this limit being finite for any u > 0. Using standard integrability criteria, it is
straightforward to check that, for any u > 0, $\lim_{\varepsilon \to 0} \Lambda_{\varepsilon,u} = \lim_{\varepsilon \to 0} \Lambda_\varepsilon = +\infty$ and
they are equivalent infinities, i.e.

$$ \lim_{\varepsilon \to 0} \frac{k + \Lambda_{\varepsilon,u}}{\Lambda_\varepsilon} = \lim_{\varepsilon \to 0} \frac{\Lambda_{\varepsilon,u}}{\Lambda_\varepsilon} = 1. $$

We can therefore conclude that (2.33) holds true. The rest of the proof follows
as in the second part of the proof of Lemma 2 in Argiento et al. (2016a), where
we prove that (i) $\lim_{\varepsilon \to 0} \sum_{C \in \Pi_n} p_\varepsilon(n_1, \ldots, n_k) = 1$; (ii) $\liminf_{\varepsilon \to 0} p_\varepsilon(n_1, \ldots, n_k) = p_0(n_1, \ldots, n_k)$ for all C = (C1, . . . , Ck) ∈ Πn, the set of all partitions of {1, 2, . . . , n};
(iii) $\sum_{C \in \Pi_n} p_0(n_1, \ldots, n_k) = 1$. By Lemma 1 in Argiento et al. (2016a), equation
(2.14) follows.
Proof of formula 2.15
First of all, observe that

$$ \begin{aligned} \left( x_1 + \cdots + x_{N_\varepsilon^*} \right)^m &= \sum_{\substack{m_1 + \cdots + m_{N_\varepsilon^*} = m \\ m_1, \ldots, m_{N_\varepsilon^*} \geq 0}} \binom{m}{m_1, \ldots, m_{N_\varepsilon^*}} \prod_{j=1}^{N_\varepsilon^*} x_j^{m_j} \qquad (2.34) \\ &= \sum_{k=1}^{m} \mathbf{1}_{\{1,\ldots,N_\varepsilon^*\}}(k)\, \frac{1}{k!} \sum_{\substack{n_1 + \cdots + n_k = m \\ n_j = 1, 2, \ldots}} \binom{m}{n_1, \ldots, n_k} \sum_{j_1, \ldots, j_k} \prod_{i=1}^{k} x_{j_i}^{n_i}, \end{aligned} $$

where $N_\varepsilon^* = N_\varepsilon + 1$, $x_j^0 = 1$ for all $x_j \geq 0$, and the last summation is over all
positive integers, (2.34) being the multinomial theorem. The second equality follows
straightforwardly from different identifications of the set of all partitions of m (see
Pitman, 2006, Section 1.2). Therefore, for any B ∈ B(Θ), m = 1, 2, . . ., we have
(here, instead of P0 and τ0 as in (2.7), there are $P_{N_\varepsilon^*}$ and $\tau_{N_\varepsilon^*}$):
$$ \begin{aligned} \mathrm{E}(P_\varepsilon(B)^m) &= \mathrm{E}\!\left( \mathrm{E}\!\left( \Big( \sum_{j=1}^{N_\varepsilon^*} P_j \delta_{\tau_j}(B) \Big)^{\!m} \,\Big|\, N_\varepsilon \right) \right) \\ &= \mathrm{E}\!\left( \mathrm{E}\!\left( \sum_{\substack{m_1 + \cdots + m_{N_\varepsilon^*} = m \\ m_1, \ldots, m_{N_\varepsilon^*} \geq 0}} \binom{m}{m_1, \ldots, m_{N_\varepsilon^*}} \prod_{j=1}^{N_\varepsilon^*} (P_j \delta_{\tau_j}(B))^{m_j} \,\Big|\, N_\varepsilon \right) \right) \\ &= \mathrm{E}\!\left( \sum_{k=1}^{m} \mathbf{1}_{\{1,\ldots,N_\varepsilon^*\}}(k)\, \frac{1}{k!} \sum_{\substack{n_1 + \cdots + n_k = m \\ n_j = 1,2,\ldots}} \binom{m}{n_1, \ldots, n_k} \sum_{j_1, \ldots, j_k} \mathrm{E}\Big( \prod_{i=1}^{k} P_{j_i}^{n_i} \,\Big|\, N_\varepsilon \Big) \prod_{i=1}^{k} \mathrm{E}\big( \delta_{\tau_{j_i}}(B) \,\big|\, N_\varepsilon \big) \right) \\ &= \mathrm{E}\!\left( \sum_{k=1}^{m} \mathbf{1}_{\{1,\ldots,N_\varepsilon^*\}}(k)\, \frac{1}{k!} \sum_{\substack{n_1 + \cdots + n_k = m \\ n_j = 1,2,\ldots}} \binom{m}{n_1, \ldots, n_k} p_\varepsilon(n_1, \ldots, n_k)\, (P_0(B))^k \right). \end{aligned} $$

We identify this last expression as $\mathrm{E}\big( \sum_{k=1}^{m} P_0(B)^k\, \mathbb{P}(K_m = k \mid N_\varepsilon) \big)$, where Km is
the number of distinct values in a sample of size m from Pε. Hence, we have proved that

$$ \mathrm{E}(P_\varepsilon(B)^m) = \mathrm{E}\!\left( \mathrm{E}\big( P_0(B)^{K_m} \,\big|\, N_\varepsilon \big) \right) = \mathrm{E}\!\left( P_0(B)^{K_m} \right). $$
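The moment identity just proved can be checked numerically in a toy case of my own construction: a finite discrete measure with symmetric Dirichlet weights and iid uniform atoms, a stand-in rather than the ε-NormCRM itself. For m = 2 both sides reduce to pε(2)P0(B) + (1 − pε(2))P0(B)², and the tie probability E(Σⱼ Pⱼ²) = 2/(J + 1) is known in closed form for Dirichlet(1, . . . , 1) weights.

```python
import numpy as np

rng = np.random.default_rng(2)
J, S = 4, 200_000          # number of atoms, Monte Carlo sample size
B_prob = 0.5               # P0(B) for B = [0, 0.5] under P0 = Uniform(0, 1)

# Left side: Monte Carlo estimate of E(P(B)^2) for P = sum_j P_j delta_{tau_j},
# (P_1, ..., P_J) ~ Dirichlet(1, ..., 1), tau_j iid Uniform(0, 1).
W = rng.dirichlet(np.ones(J), size=S)
T = rng.uniform(size=(S, J)) < B_prob
lhs = np.mean((W * T).sum(axis=1) ** 2)

# Right side: E(P0(B)^{K_2}), with K_2 = 1 iff two draws from P pick the same
# atom, which happens with probability E(sum_j P_j^2) = 2/(J + 1) here.
p_tie = 2.0 / (J + 1)
rhs = p_tie * B_prob + (1 - p_tie) * B_prob ** 2
```

The two quantities agree up to Monte Carlo error (here both are close to 0.35).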
Proof of formula 2.17
Suppose that B1, B2 ∈ B(Θ) are disjoint. Then

$$ \begin{aligned} \mathrm{E}(P_\varepsilon(B_1) P_\varepsilon(B_2)) &= \mathrm{E}\!\left( \mathrm{E}\!\left( \Big( \sum_{j=1}^{N_\varepsilon^*} P_j \delta_{\tau_j}(B_1) \Big) \Big( \sum_{l=1}^{N_\varepsilon^*} P_l \delta_{\tau_l}(B_2) \Big) \,\Big|\, N_\varepsilon \right) \right) \\ &= \mathrm{E}\!\left( \sum_{\substack{l \neq j \\ j,l = 1, \ldots, N_\varepsilon^*}} \mathrm{E}(P_j P_l \mid N_\varepsilon)\, \mathrm{E}(\delta_{\tau_j}(B_1))\, \mathrm{E}(\delta_{\tau_l}(B_2)) \right) \\ &= \mathrm{E}\!\left( P_0(B_1) P_0(B_2) \sum_{\substack{l \neq j \\ j,l = 1, \ldots, N_\varepsilon^*}} \mathrm{E}(P_j P_l \mid N_\varepsilon) \right) = P_0(B_1) P_0(B_2)\, p_\varepsilon(1, 1). \end{aligned} $$
The general case, when B1 and B2 are not disjoint, follows easily:

$$ \begin{aligned} \mathrm{E}(P_\varepsilon(B_1) P_\varepsilon(B_2)) &= \mathrm{E}\!\left( (P_\varepsilon(B_1 \cap B_2))^2 \right) + \mathrm{E}\!\left( P_\varepsilon(B_1 \setminus B_2)\, P_\varepsilon(B_1 \cap B_2) \right) \\ &\quad+ \mathrm{E}\!\left( P_\varepsilon(B_2 \setminus B_1)\, P_\varepsilon(B_1 \cap B_2) \right) + \mathrm{E}\!\left( P_\varepsilon(B_1 \setminus B_2)\, P_\varepsilon(B_2 \setminus B_1) \right), \end{aligned} $$

where now the sets are disjoint. Applying the result above, we find that

$$ \mathrm{E}(P_\varepsilon(B_1) P_\varepsilon(B_2)) = p_\varepsilon(2)\, P_0(B_1 \cap B_2) + (1 - p_\varepsilon(2))\, P_0(B_1) P_0(B_2), $$

and consequently formula 2.17 holds true.
Proof of Proposition 2.4
The EPPF of the Dirichlet process first appeared in Antoniak (1974) (see Pitman, 1996);
anyhow, it is straightforward to derive it from (2.13):

$$ \begin{aligned} p_D(n_1, \ldots, n_k; \kappa) &= \int_0^{+\infty} \frac{u^{n-1}}{\Gamma(n)}\, \mathrm{e}^{-\kappa \log \frac{u+\omega}{\omega}} \prod_{j=1}^{k} \frac{\kappa\, \Gamma(n_j)}{(u+\omega)^{n_j}}\, du \\ &= \kappa^k \int_0^{+\infty} \frac{u^{n-1}}{\Gamma(n)} \left( \frac{\omega}{\omega+u} \right)^{\!\kappa} \frac{1}{(u+\omega)^n} \prod_{j=1}^{k} \Gamma(n_j)\, du = \frac{\Gamma(\kappa)}{\Gamma(\kappa+n)}\, \kappa^k \prod_{j=1}^{k} \Gamma(n_j), \end{aligned} $$

where the last equality follows from formula (3.194.3) in Gradshteyn and Ryzhik
(2007). By definition of the hypergeometric function, we have

$$ 1 \leq {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{(u+\omega)^2} \right) \leq {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{\omega^2} \right). $$

Moreover,

$$ \frac{\omega + \sqrt{\omega^2 - 1}}{(u+\omega) + \sqrt{(u+\omega)^2 - 1}} = \frac{\omega}{u+\omega}\, \frac{1 + \sqrt{1 - 1/\omega^2}}{1 + \sqrt{1 - 1/(u+\omega)^2}} $$

and

$$ \frac{1 + \sqrt{1 - 1/\omega^2}}{2} \leq \frac{1 + \sqrt{1 - 1/\omega^2}}{1 + \sqrt{1 - 1/(u+\omega)^2}} \leq 1, $$

so that

$$ \left( \frac{1 + \sqrt{1 - 1/\omega^2}}{2} \right)^{\!\kappa} p_D(n_1, \ldots, n_k; \kappa) \leq p_B(n_1, \ldots, n_k; \omega, \kappa) \leq \prod_{j=1}^{k} {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{\omega^2} \right) p_D(n_1, \ldots, n_k; \kappa). $$

The left-hand side of these inequalities obviously converges to $p_D(n_1, \ldots, n_k; \kappa)$ as
ω goes to +∞. On the other hand,

$$ {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{\omega^2} \right) \to 1 \quad \text{as } \omega \to +\infty, $$

thanks to the uniform convergence of the hypergeometric series ${}_2F_1(\frac{n_j}{2}, \frac{n_j+1}{2}; 1; z)$
on a disk of radius smaller than 1. We conclude that, for any n1, . . . , nk such that
n1 + · · · + nk = n, k = 1, . . . , n, and any κ > 0,

$$ \lim_{\omega \to +\infty} p_B(n_1, \ldots, n_k; \omega, \kappa) = p_D(n_1, \ldots, n_k; \kappa). $$
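The first display of this proof can be verified numerically: the u-integral representation of the Dirichlet process EPPF should match the closed form κᵏ Γ(κ)Γ(n₁)···Γ(nₖ)/Γ(κ + n) for any ω. Below is a sketch of my own (plain trapezoidal quadrature, no external integrator), using the partition counts (3, 2) as an arbitrary example.

```python
import numpy as np
from math import exp, lgamma, log

def dp_eppf_closed(counts, kappa):
    """Closed form: kappa^k * Gamma(kappa) * prod_j Gamma(n_j) / Gamma(kappa + n)."""
    n, k = sum(counts), len(counts)
    return exp(k * log(kappa) + lgamma(kappa) - lgamma(kappa + n)
               + sum(lgamma(nj) for nj in counts))

def dp_eppf_integral(counts, kappa, omega, upper=500.0, m=500_000):
    """Trapezoidal evaluation of the u-integral representation of the DP EPPF."""
    n, k = sum(counts), len(counts)
    u = np.linspace(1e-9, upper, m)
    log_f = ((n - 1) * np.log(u) - lgamma(n)
             + kappa * (log(omega) - np.log(omega + u))
             - n * np.log(omega + u)
             + k * log(kappa) + sum(lgamma(nj) for nj in counts))
    f = np.exp(log_f)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(u)) / 2.0)
```

Note that the integral does not depend on ω, consistently with the closed form; for counts (3, 2) and κ = 2 both sides equal 1/90.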
Chapter 3
Covariate driven clustering: an application to blood donors data
Blood is an important resource in global healthcare, and an efficient blood
supply chain is therefore required. Predicting arrivals of blood donors is fundamental, since it allows
for better planning of donation sessions; with the goal of characterizing the behaviors of donors,
we analyze gap times between consecutive blood donations. In particular, we take into
account population heterogeneity via model-based clustering.

Defining the model boils down to assigning the prior for the random partition itself and
flexibly assigning the cluster-specific distribution, since, conditionally on the partition, data
are assumed independent and identically distributed within each cluster and independent
between different clusters.

In particular, we aim at taking into account possible patterns within the available covariates,
which can be either continuous or categorical; the additional covariate information should
drive the prior knowledge on the random partition by increasing the probability that two
donors with similar covariates belong to the same cluster. This is done through a covariate-dependent
nonparametric prior, thus departing from the standard exchangeability assumption.
We introduce a covariate-dependent product partition model by modifying the prior on the
partition prescribed by the class of normalized completely random measures. We include
in such a prior a term that takes into account the distance between covariates. After a
brief discussion of the model and a simple illustrative example on simulated data, we fit
our model to a large dataset provided by the Milan department of AVIS (Italian Volunteer
Blood-donors Association), the largest provider of blood donations in Italy.
3.1 Introduction
Section 1.2.4 of Chapter 1 presented the wide family of product partition mod-
els: in particular, they can be seen, under the assumption of exchangeability, as
an alternative parametrization of nonparametric mixture models with NormCRMs
as mixing measures. This is especially useful when the focus of the analysis is on
clustering, since in this case the prior on the random partition is made explicit.
However, exchangeability should not be assumed in presence of item-specic infor-
mation, which should be included in the prior for the partition. In presence of
covariates such as time, space or external measurements, the exchangeability as-
sumption is, indeed, unreasonable. We aim at assuming a model where two subjects
are more likely to co-cluster a-priori if their corresponding covariate values are sim-
ilar, i.e. they are close in time or space, or they have similar characteristics. Thus,
the goal of this chapter is to develop a model for the random partition depending on
covariates: in the Bayesian nonparametric literature, there are various approaches
that can be adopted, whose focus is the random measure or the random partition
(see the review paper by Foti and Williamson (2015)).
The first viewpoint adopted to include covariate dependence in random measures
can be found in MacEachern (1999), where the dependent Dirichlet process appeared
for the first time. The idea is to include dependence on covariates in the support or
in the jumps of the mixing measure G of a mixture model, as in (1.14). Recent papers
have investigated this approach more deeply: among others, we mention Chung and Dunson
(2009), Rodriguez and Dunson (2011) (probit stick-breaking with covariates), Ren
et al. (2011) (logistic stick-breaking) and Di Lucca et al. (2013) (time-dependent
DP). These works originate from the stick-breaking representation and modify the
way the weights of the measure are built. However, it is not clear how the covariates
affect the prior on the random partition, which is our main interest here.
Furthermore, other works are based on the augmentation of the space where
the Lévy measure is defined (see, for instance, Griffin and Leisen (2014), Foti and
Williamson (2015) and Ranganath and Blei (2017)).
However, the main application of this chapter focuses on clustering: for this
reason, we base our model on the work of Müller et al. (2011), who proposed
the PPMx model, a product partition model with covariate information. In that
work, as well as in its generalizations, the cohesion function is restricted to be the
one induced by the Dirichlet process, namely c(Aj) = κ(nj − 1)!. The desired
dependence on covariates is induced by an additional factor, the similarity function,
which depends on the covariates of the items in each cluster: this coefficient multiplies the
cohesion function as follows,

$$ p(\rho_n = \{A_1, \ldots, A_K\}) \propto \prod_{j=1}^{K} c(A_j)\, g(x_j^*), $$

where g(xj*) is a non-negative function that formalizes the similarity among the
covariates in the j-th cluster (recall that xj* denotes the collection of covariates
3.1. Introduction 59
corresponding to items belonging to cluster j). As a default choice, they propose to define the similarity g(·) as the marginal probability in an auxiliary probability model, even if the xi are not considered random. The use of a probability density in the construction of g(·) is convenient since it allows for easy computation: indeed, posterior inference is identical to the posterior inference we would obtain if the covariate vector xi were part of the random response vector Yi, i.e. their model can be rewritten using a DPM formulation on the response and the covariates jointly, so that efficient Gibbs sampling schemes are available. For more details, see Müller et al. (2011) and Müller and Quintana (2010). A similar approach can be found in Park and Dunson (2010), where the dependence on the predictors is included directly in the cohesion function, as

c(Aj, x∗j) = α(nj − 1)! ∫ ∏_{i∈Aj} f(xi|γ) dG0(γ).

Generalizations to achieve variable selection and to include spatial dependence can be found in Quintana et al. (2015) and Barcella et al. (2016), and in Page and Quintana (2015), respectively.
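As a quick numerical illustration of the construction above, the following sketch evaluates the unnormalized PPMx prior for a one-dimensional covariate, using the Dirichlet-process cohesion c(Aj) = κ(nj − 1)! and the default similarity taken as the marginal of an auxiliary conjugate Gaussian model. The values s2, tau2 and kappa are illustrative choices, not the ones used later in the thesis.

```python
import math

def log_gaussian_marginal(xs, s2=1.0, tau2=4.0):
    """log m(x) = log ∫ Π_i N(x_i; mu, s2) N(mu; 0, tau2) dmu (conjugate, closed form)."""
    n = len(xs)
    A = n / s2 + 1.0 / tau2                 # posterior precision of mu
    B = sum(xs) / s2
    return (-0.5 * n * math.log(2 * math.pi * s2)
            - 0.5 * math.log(2 * math.pi * tau2)
            + 0.5 * math.log(2 * math.pi / A)
            - sum(x * x for x in xs) / (2 * s2)
            + B * B / (2 * A))

def log_ppmx_prior(partition, x, kappa=1.0):
    """Unnormalized log prior: sum over clusters of log c(A_j) + log g(x*_j),
    with the DP cohesion c(A_j) = kappa * (n_j - 1)!."""
    total = 0.0
    for cluster in partition:
        total += math.log(kappa) + math.lgamma(len(cluster))   # log(kappa (nj-1)!)
        total += log_gaussian_marginal([x[i] for i in cluster])
    return total

x = [0.0, 0.1, 5.0]
p_close = log_ppmx_prior([[0, 1], [2]], x)   # groups the two similar items
p_far   = log_ppmx_prior([[0, 2], [1]], x)   # groups two distant items
```

For this toy covariate vector the prior assigns higher (unnormalized) mass to the partition that groups the two nearby items, which is exactly the behavior the similarity factor is designed to induce.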
However, there is no need to consider similarities that are marginal laws of an underlying model for the covariates: in general, g can be any non-negative function of some similarity measure such that the prior probability that two items belong to the same cluster increases as their similarity increases. As we will see later, the way we define g does not worsen the complexity of the algorithm for posterior inference. Indeed, we will be able to devise a general MCMC sampler to perform posterior analysis that does not depend on the specific choice of similarity. The full conditionals of the Gibbs sampler are relatively easy to implement thanks to the conjugate structure of the PPMx.
Other non-exchangeable random partition models have recently appeared in the literature. In a recent paper, Dahl et al. (2017) adopt an approach similar to ours; they define a distribution that depends on pairwise similarities, defined in terms of distances among subjects. In particular, their prior is defined sequentially, as a product of conditional probabilities, i.e. the probability of assigning a group to the next subject conditional on the previously allocated items. These probabilities are modified according to the attraction of the current item to the previously allocated items. This model shows valuable properties: the prior on the number of clusters and on their cardinalities is not affected by the covariates, while the prior nonetheless places more probability on partitions that group similar items. A related approach can be found in Dahl (2008) and Blei and Frazier (2011). In particular, in the latter paper the authors propose a variation of the Blackwell-MacQueen urn scheme where the subject assignments are draws from probabilities that depend on distance measurements.
The outline of the chapter is as follows: in Section 3.2 we propose a variation of the product partition model with covariate dependence and discuss the influence of various choices of similarity function. In Section 3.3 we apply our model to a simple simulated dataset. Then, we present the main application that motivated this work: clustering the behavior of blood donors and predicting the time of the next donation. We consider a large dataset provided by the Milan department of AVIS (Italian Volunteer Blood-donors Association), the largest provider of blood donations in Italy (Section 3.4). We conclude with a brief discussion of the achievements and possible future developments. Details about the MCMC algorithms for posterior inference are described in the Appendices.
3.2 A covariate driven model for clustering
Our aim is to estimate the clusters in the data as well as their density. As described earlier, the main novelty here is the elicitation of a prior for the random partition that, on the one hand, exhibits the positive aspects of NormCRM processes, and, on the other hand, is driven by covariate information by means of a similarity function that depends on the distance among subject-specific covariates. We start from parametric densities f(·; θ) and specify a hierarchical model that achieves the goals previously described. In particular, we assume that the data are independent across groups, conditionally on the covariates and the cluster-specific parameters; the latter are i.i.d. from a base distribution P0. The prior on the partition depends on the covariates and can be represented as a mixture of product partition models. Concretely, we propose:
Y1, . . . , Yn | x1, . . . , xn, θ∗1, . . . , θ∗K, ρn ∼ ∏_{j=1}^K f(y∗j | x∗j, θ∗j)   (3.1)

θ∗1, . . . , θ∗K | ρn iid∼ P0

p(ρn = {A1, . . . , AK} | x1, . . . , xn) ∝ ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du   (3.2)

where the notation q∗j stands for the collection of values of a quantity q for all items belonging to cluster Aj, and nj for the cardinality of the j-th group. Note that f(y∗j | x∗j, θ∗j) is a generic regression model, such as f(y | x, θ) = N(y; xᵀβ, σ²) for the linear regression model, or logit(P(y = 1)) = xᵀβ for logistic regression. In the former case θ = (β, σ), while in the latter θ = β.
It is worth noticing that the likelihood specification in (3.1) may be any model: in Section 3.4 we will deal with recurrent events, so that more complex regression models for gap times will be needed. Moreover, the prior p(ρn | x1, . . . , xn) in (3.2) can be equivalently written as

p(ρn | x1, . . . , xn, u) ∝ ∏_{j=1}^K c(u, nj) g(x∗j)

where we implicitly assume that g(∅) = 1 and the prior on the auxiliary variable u is as
follows:

p(u) = (u^{n−1} / Γ(n)) (−1)^n (d^n/du^n) e^{−Ψ(u)}, u > 0,

where Ψ(u) is the Laplace exponent of the Lévy intensity ρ(s) of the NormCRM. In this case, the marginal prior for ρn, given the covariates, is

p(ρn | x1, . . . , xn) = (1 / H(g(x))) ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du,

where H(g(x)) is the intractable normalizing constant of the law of the random partition ρn, defined as

H(g(x)) = ∑_{ρn∈Pn} ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du.   (3.3)
The joint law of ({θ∗j}_{j=1}^K, ρn, {yi}_{i=1}^n) is equal to

∏_{j=1}^K f(y∗j | x∗j, θ∗j) P0(θ∗j) × (1 / H(g(x))) ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du.
The intractability of the normalizing constant H(g(x)) prevents us from treating any parameter of the similarity function g(·) or of the NGG process as a random variable (for instance, to perform variable selection in the similarity function g). The issue has also been raised in Dahl et al. (2017) when comparing their model to the product partition model with covariates of Müller et al. (2011). Finally, the joint law of the data and all parameters, including the auxiliary variable u, is

L({yi}_{i=1}^n, ρn, θ∗1, . . . , θ∗K, u | {xi}_{i=1}^n) = ∏_{j=1}^K f(y∗j | x∗j, θ∗j) P0(θ∗j) × (1 / H(g(x))) D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j)   (3.4)

and

L(ρn, u | {xi}_{i=1}^n) = (1 / H(g(x))) D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j)

is the joint law of the random partition and the auxiliary variable u.
Even if the proposed approach is in fact quite general with respect to the choice of the NormCRM, and thus to the form of the cohesion function, in the following we focus on the specific case of the normalized generalized gamma process, denoted by NGG(κ, σ, P0). The main reason is that it induces a prior on the number of groups which is more dispersed than that induced by the Dirichlet process. Its Lévy intensity is

ρ(ds) = κ s^{−1−σ} e^{−s} ds,
where κ is a mass parameter and σ a discount parameter. The cohesion function becomes c(Aj) = (1 − σ)_{nj−1}, where (α)_n is the Pochhammer symbol, or rising factorial (note that it does not depend on the auxiliary variable u). It is clear that the parameter σ has a strong influence on the clustering behavior. In particular, the discount parameter affects the variance: the larger it is, the more dispersed is the distribution of the number of clusters. This feature mitigates the annoying rich-gets-richer effect, typical of the Dirichlet process and discussed in Section 1.2.1, leading to more homogeneous clusters. For more details on the behavior of σ, see for instance Argiento et al. (2016a), Lijoi et al. (2007) and Argiento et al. (2010). The prior on u becomes, in this case,

p(u) ∝ u^{n−1} (u + 1)^{σK−n} e^{−(κ/σ)((u+1)^σ−1)}, u > 0.
Appendix 3.A reports details about the Gibbs sampler for posterior inference for
the model described in this section.
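To get a feel for this auxiliary-variable density, the sketch below evaluates the unnormalized log density written above and integrates it numerically on a crude grid. The parameter values (κ = 0.3, σ = 0.2, the ones used in Section 3.3, and n = 200, K = 3) are illustrative; the grid bounds are arbitrary choices for the sketch.

```python
import math

def log_p_u_unnorm(u, n, K, kappa=0.3, sigma=0.2):
    """Unnormalized log density of the NGG auxiliary variable:
    p(u) ∝ u^(n-1) (u+1)^(sigma*K - n) exp(-(kappa/sigma)((u+1)^sigma - 1))."""
    if u <= 0.0:
        return float("-inf")
    return ((n - 1) * math.log(u)
            + (sigma * K - n) * math.log(u + 1.0)
            - (kappa / sigma) * ((u + 1.0) ** sigma - 1.0))

def normalizing_constant(n, K, grid_max=2000.0, m=20000):
    """Crude midpoint-rule integration of the unnormalized density on (0, grid_max)."""
    h = grid_max / m
    return h * sum(math.exp(log_p_u_unnorm((i + 0.5) * h, n, K)) for i in range(m))

Z = normalizing_constant(n=200, K=3)
```

The density vanishes near the origin (the factor u^{n−1} dominates there) and has its mass at moderately large u, where the exponential factor eventually takes over.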
3.2.1 The choice of the similarity function
It is quite natural to let the similarity be a non-increasing function of the distance among the covariates in the cluster, namely of

D_{Aj} = ∑_{i∈Aj} d(xi, c_{Aj})   (3.5)

where c_{Aj} is the centroid of the set of covariates in cluster j and d is a suitable distance function, discussed later. Moreover, we assume the similarity takes value 1 if the cardinality of the set Aj is 1, i.e. |Aj| = 1.
Analytical results about specific quantities of interest, such as the probability of having a specific number k of groups or the probability of observing two items in the same cluster, are not easy to compute in closed form. A simple calculation, which can be useful to intuitively understand the behavior of our prior, gives the probability of observing one cluster in the case n = 2:

p({1, 2}; κ, u, σ) = c(u, 2) g(x1, x2) / (c(u, 2) g(x1, x2) + c(u, 1)²) = (1 − σ) g(x1, x2) / ((1 − σ) g(x1, x2) + κ(1 + u)^σ),

which tends to 1 when g(x1, x2) goes to +∞ (i.e., the distance goes to 0) and tends to 0 when g(x1, x2) goes to 0, i.e. when the covariates are far apart.
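The limiting behavior of this n = 2 probability is easy to check numerically; the sketch below implements the formula above, with κ = 0.3, σ = 0.2 and u = 1 as illustrative values.

```python
def prob_one_cluster(g12, kappa=0.3, sigma=0.2, u=1.0):
    """P(the two items form a single cluster), from the n = 2 formula:
    (1 - sigma) g / ((1 - sigma) g + kappa (1 + u)^sigma)."""
    num = (1.0 - sigma) * g12
    return num / (num + kappa * (1.0 + u) ** sigma)
```

The probability is increasing in the similarity g(x1, x2), approaching 1 for highly similar items and 0 for very dissimilar ones.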
However, choosing the similarity function is not an easy task, since it heavily affects the results, as we will see later. For this reason, we propose a list of similarity functions that proved to work reasonably well in practice:

1. gA(x∗j; λ) = e^{−t^α}, for α > 0 (α = 0.5, 1, 2), with t = λ D_{Aj};
Figure 3.1: Proposed similarity functions gA, gB and gC (similarity as a function of the distance).
2. gB(x∗j; λ) = 1/t^α, for α > 0 (α = 0.5, 1, 2), with t = λ D_{Aj};

3. gC(x∗j; λ) = e^{−t log t} if t ≥ 1/e, and e^{1+1/e} t if t < 1/e, with t = λ D_{Aj} (the two branches match at t = 1/e).
The three cases are displayed in Figure 3.1.
Obviously, a deeper understanding of the theoretical properties implied by the choice of the similarity function is needed, and this is part of future work. However, the three cases above gave us quite satisfactory results, at least numerically, as we will show in Section 3.3.
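The three proposed similarity functions can be sketched as follows; the branch of gC for t < 1/e is written here as reconstructed from the continuity requirement at t = 1/e, so it should be read as an illustration rather than a definitive transcription.

```python
import math

def g_A(t, alpha=1.0):
    """gA: exponential decay in t = lambda * D_Aj."""
    return math.exp(-t ** alpha)

def g_B(t, alpha=1.0):
    """gB: power-law decay; blows up as t -> 0."""
    return 1.0 / t ** alpha

def g_C(t):
    """gC: e^(-t log t) for t >= 1/e, linear below; both branches
    equal e^(1/e) at t = 1/e, so the function is continuous."""
    if t >= 1.0 / math.e:
        return math.exp(-t * math.log(t))
    return math.exp(1.0 + 1.0 / math.e) * t
```

Note that gA and gC stay bounded as the distance goes to 0, while gB diverges, which is visible in Figure 3.1.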
In what follows, we provide guidelines for the elicitation of the parameter λ driving the influence of the covariates in the prior of the random partition. This parameter is analogous to the temperature parameter defined in Dahl et al. (2017) but, unlike in their work, the presence of the intractable normalizing constant (3.3) prevents us from assigning a prior to λ. Therefore, some empirical rules to set this parameter are needed. Varying this parameter has the effect of rescaling the range of values at which we evaluate the similarity function, since we move along different parts of the horizontal axis in Figure 3.1. For this reason, in order to select a value for λ, before running the model we should display the histogram of all λ d(xi, xj), i = 1, . . . , n, j = 1, . . . , n, j ≠ i, for some possible values of λ and adjust the range of values the similarity can take (more or less variability, for example). For instance, suppose we adopt the function gA (the blue line in Fig. 3.1): if we choose a very small λ, we concentrate the values of λD around the origin, and hence we obtain similar values of gA(·): in this case, the effect of the covariate information on the prior of ρn will be very mild, since the range of values the similarity can assume is very limited. A similar argument holds for large values of λ. In conclusion, we calibrate λ such that gA is evaluated in a range of, say, (0, 3), for this particular choice of similarity.
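One simple way to operationalize this guideline is sketched below: rescale so that the bulk of the values λ d(xi, xj) falls in the target range (0, 3). Anchoring the largest pairwise distance at 3 is one possible heuristic among many; the distances are toy values.

```python
def calibrate_lambda(distances, target_upper=3.0):
    """Heuristic: choose lambda so the largest pairwise distance lands at
    target_upper, keeping lambda*d inside the range where the similarity
    (e.g. g_A) still varies appreciably."""
    return target_upper / max(distances)

dists = [0.5, 1.0, 1.5, 2.0, 4.0, 8.0, 10.0]   # toy pairwise distances d(x_i, x_j)
lam = calibrate_lambda(dists)
scaled = [lam * d for d in dists]               # values at which g is evaluated
```

In practice one would inspect the histogram of the scaled distances, as suggested above, rather than rely on a single summary.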
In order to define the distance d appearing in (3.5), we consider covariates that are continuous or binary (categorical or ordinal covariates are transformed into dummies, i.e. vectors of binary covariates). If x1 and x2 are vectors of covariates, xj = (xcj, xbj), where xcj is the subvector of all the continuous covariates and xbj is the subvector of all the binary covariates, we define

d(x1, x2) = dc(xc1, xc2) + db(xb1, xb2),

where dc is the Mahalanobis distance between vectors, i.e. the Euclidean distance between standardized vectors of covariates, and db is the Hamming distance between vectors of binary covariates (see Zhang et al., 2006). Instead of the Mahalanobis distance, we could consider the l1-norm, lp-norm or sup-norm distances. As far as sensitivity with respect to the distance d is concerned, we noticed that the choice of the distance only moderately affects the results: in our numerical experiments, we tested different choices of distance (Euclidean vs. Mahalanobis, Hamming vs. a standardized version of it, . . . ). However, a deeper understanding of how to calibrate the distance function is part of future work.
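A minimal sketch of this mixed distance is given below, assuming coordinate-wise standardization (i.e. a diagonal covariance in the Mahalanobis term); the means and standard deviations would be computed from the full sample.

```python
import math

def mixed_distance(x1c, x2c, x1b, x2b, means, sds):
    """d(x1, x2) = Euclidean distance between the standardized continuous parts
    + Hamming distance between the binary parts."""
    dc = math.sqrt(sum(((a - m) / s - (b - m) / s) ** 2
                       for a, b, m, s in zip(x1c, x2c, means, sds)))
    db = sum(a != b for a, b in zip(x1b, x2b))   # number of disagreeing binaries
    return dc + db
```

Note the two terms are simply summed, so their relative scales matter; this is one reason the calibration of λ and of the distance is flagged above as future work.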
Other modeling choices we made are the subject of current research: for instance, note that in formula (3.5) we decided not to normalize the distance with respect to the number of elements. This is because we do not want to introduce undesirable smoothing effects. In fact, dividing the quantity by the number of elements in the group produces a smoothing effect that lowers the influence of the covariates.
Lastly, we note that the similarity function gC mimics the asymptotic behavior of the cohesion function induced by the NGG: the aim is to balance the effect of the similarity function against the specific form of the cohesion of the NGG process.
3.3 Simulated data
The performance of the model described in Section 3.2 is illustrated through a simple example on a simulated dataset.
In particular, we simulated a dataset of points (yi, xi1, . . . , xip) for i = 1, . . . , n, with n = 200 and p = 4. The last two covariates are binary and were generated from Bernoulli distributions, while the first two were generated from Gaussian densities. The responses yi were generated from a linear regression model with linear predictor xiᵀβ0, where β0 := (β00, β01, β02, β03, β04), and variance σe² = 0.5. We generated three different groups by simulating both covariates and responses from distributions with different parameters:
Figure 3.2: (Top) Simulated data: histogram of the responses. (Bottom) Scatterplot of the simulated dataset; different colors represent the three groups the data have been generated from.
Group 1: 75 covariate vectors and responses were independently generated as follows:

(xi1, xi2) ∼ N2(µ1, σ0² I2), µ1 = (−3, 3), σ0² = 0.5
xi3, xi4 iid∼ Bern(0.1)
Yi ∼ N(xiᵀβ0, σe²), β0 = (1, 5, 2, 1, 0), σe² = 0.5

Group 2: 75 covariate vectors and responses were independently generated as follows:

(xi1, xi2) ∼ N2(µ2, σ0² I2), µ2 = (0, 0), σ0² = 0.5
xi3, xi4 iid∼ Bern(0.5)
Yi ∼ N(xiᵀβ0, σe²), β0 = (4, 2, −2, 1, −1), σe² = 0.5

Group 3: 50 covariate vectors and responses were independently generated as follows:

(xi1, xi2) ∼ N2(µ3, σ0² I2), µ3 = (3, 3), σ0² = 0.5
xi3, xi4 iid∼ Bern(0.9)
Yi ∼ N(xiᵀβ0, σe²), β0 = (−1, −5, −2, −1, 1), σe² = 0.5
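The simulation scheme above can be reproduced with a short script; the random seed is an arbitrary choice, and β0[0] plays the role of the intercept.

```python
import random

random.seed(42)

def simulate_group(n, mu, bern_p, beta, sd0=0.5 ** 0.5, sde=0.5 ** 0.5):
    """Draw n (y, x) pairs for one group: two Gaussian covariates around mu,
    two Bernoulli(bern_p) covariates, and a Gaussian linear-model response."""
    out = []
    for _ in range(n):
        x1 = random.gauss(mu[0], sd0)
        x2 = random.gauss(mu[1], sd0)
        x3 = 1.0 if random.random() < bern_p else 0.0
        x4 = 1.0 if random.random() < bern_p else 0.0
        xs = [1.0, x1, x2, x3, x4]                 # leading 1 for the intercept
        y = sum(b * x for b, x in zip(beta, xs)) + random.gauss(0.0, sde)
        out.append((y, x1, x2, x3, x4))
    return out

data = (simulate_group(75, (-3, 3), 0.1, [1, 5, 2, 1, 0])
        + simulate_group(75, (0, 0), 0.5, [4, 2, -2, 1, -1])
        + simulate_group(50, (3, 3), 0.9, [-1, -5, -2, -1, 1]))
```

Since σ0² = 0.5, the Gaussian covariates are drawn with standard deviation √0.5, and likewise for the response noise.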
Figure 3.2 shows the histogram of the responses yi, i = 1, . . . , n: the three groups are clearly visible. The bottom panel displays the scatterplot of covariates and responses. As far as the prior is concerned, we included the whole vector xi in the similarity and assumed the cohesion function of the NGG process with κ = 0.3, σ = 0.2, such that E(Kn) = 5.9. Moreover, the base measure P0 is Np(0, σ²/κ0 Ip×p) × IG(σ²; a, b) with κ0 = 0.01 and (a, b) = (2, 1). Here, λ = 1.
We ran the algorithm described in Appendix 3.A to obtain 5,000 final iterations, after a burn-in of 2,000 and a thinning of 10 iterations. A posteriori, we classified all datapoints according to the optimal partition under the different similarity functions, obtaining the following misclassification rates:

similarity function: gA, gC, g ≡ 1
misclassification rate: 0%, 4%, 16%
where g ≡ 1 stands for the model without covariate dependence in the prior for the partition. By optimal partition we mean the realization, in the MCMC chain, of the random partition ρn which minimizes the posterior expected value of Binder's loss function with equal misclassification weights (Lau and Green, 2007a). In this work we employed this specific loss function; however, the issue of finding an appropriate point estimate of the clustering structure of the data based on the posterior is still a subject of research. Various loss functions have been proposed, satisfying principles such as invariance with respect to permutations of indices and labels. Alternative loss functions may be found in Fritsch and Ickstadt (2009) and Meilă (2007). The latter introduced the Variation of Information criterion, which quantifies the information shared between different partitions. Moreover, the loss function defined in Fritsch and Ickstadt (2009) is inspired by the Rand index.
Recently, Wade and Ghahramani (2017) proposed a new approach that also quantifies uncertainty: a credible region of level (1 − α) is defined as the smallest ball around the point estimate with posterior probability greater than (1 − α). The metric on the space of partitions is induced by the Binder or Variation of Information loss functions.
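The Binder-loss point estimate can be sketched in a few lines: estimate the pairwise co-clustering probabilities from the MCMC samples, then pick the sampled partition minimizing the posterior expected loss (equal weights, up to a constant). The toy samples below are made up for illustration.

```python
from itertools import combinations

def coclustering_probs(samples):
    """pi[i][j] = fraction of MCMC partition samples in which items i and j
    share a cluster; each sample is a list of cluster labels, one per item."""
    n = len(samples[0])
    pi = [[0.0] * n for _ in range(n)]
    for s in samples:
        for i, j in combinations(range(n), 2):
            if s[i] == s[j]:
                pi[i][j] += 1.0
    for i, j in combinations(range(n), 2):
        pi[i][j] /= len(samples)
    return pi

def binder_loss(labels, pi):
    """Posterior expected Binder loss with equal misclassification weights,
    up to a constant: sum over pairs of |1{same cluster} - pi_ij|."""
    return sum(abs(float(labels[i] == labels[j]) - pi[i][j])
               for i, j in combinations(range(len(labels)), 2))

samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
pi = coclustering_probs(samples)
best = min(samples, key=lambda s: binder_loss(s, pi))
```

Restricting the search to the sampled partitions, as done here, is the simple strategy used for the optimal partition above; search over a wider space is also possible.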
Figure 3.3: Posterior distribution of Kn under gA (left), gC (center) and g ≡ 1 (right).
We computed the posterior distribution of Kn, the number of clusters, in the three cases: see Figure 3.3. Figure 3.4 displays the predictive distribution corresponding to the covariates x1 of the first subject. The green vertical line corresponds to the actual observation y1. It is clear that in the last case, i.e. when we do not include covariate information in the prior for the random partition, the predictive law is not able to distinguish which of the three groups the item belongs to (thus, the law shows three peaks). In cases A and C the predictive law exhibits only one main peak: the covariate information helps, in this case, in selecting the right group for the observation. This is also confirmed by the misclassification table above.
Figure 3.4: Predictive distribution of Y1 under gA (left), gC (center) and g ≡ 1 (right); vertical lines denote the true value.
Figure 3.5 reports the cluster estimate under gA (no misclassification error). Compare it with Figure 3.6, where we display the cluster estimate under the PPM model without covariates in the prior, i.e. g ≡ 1.
Figure 3.5: Cluster estimate under gA.
This very simple illustrative example shows the good performance of our model; it is worth including covariate information when eliciting the prior for the random partition, both in terms of clustering and of predictive accuracy. Moreover, the covariates that enter the similarity function need not be the same as those in the regression part of the likelihood: one could, for instance, use some of them to drive the prior on the partition and the others in the regression component. The next section describes the analysis of a real dataset about blood donations.
3.4 The AVIS data on blood donations
The Associazione Volontari Italiani del Sangue (AVIS, Association of Voluntary Italian Blood Donors) is the major Italian non-profit and charitable organization for blood donation, bringing together over a million volunteer blood donors across Italy. The main aim of the association is to foster voluntary, recurring, anonymous and non-remunerated blood donation at the community level. Predicting the arrivals of blood donors is fundamental, since it allows for better planning of donation sessions. In the next sections we propose a model that allows subject-specific prediction of the next donation and, at the same time, computes cluster
Figure 3.6: Cluster estimate under the PPM model, i.e. g ≡ 1.
estimates in order to gain deeper insight into the population of donors. A general framework for the analysis of gap times is briefly described in the following section. For a thorough review of the models available in the literature, see Cook and Lawless (2007).
3.4.1 A framework for recurrent events
Let us consider a single recurrent event process, where the events occur in continuous time, starting without loss of generality at t = 0. Let the sequence T1, T2, . . . , with 0 ≤ T1 ≤ T2 ≤ . . . , denote the event times, where Tk is the time of the k-th event. There are two main ways of describing and modeling this process: through event counts over a certain time interval, or through gap times between successive events; the choice of framework is usually driven by the objective of the analysis. The former approach is often useful when individuals frequently experience the events of interest; the latter is instead used when events are relatively infrequent and prediction of the time to the next event is of interest, which is the case in this application. An important setting in which models based on gap times between successive events are particularly attractive is the one where a subject i is restored to a similar state after each event, in the same way as a system
returns to a new state after a repair. These are known as renewal processes, in which the gaps Wj = Tj − Tj−1 (j = 1, 2, . . . ) are conditionally independent and identically distributed. Assume now that we have several subjects in a sample, each with its own recurrent event process. Moreover, let us assume that individual i is observed over the time interval [0, τi]. If ni events are observed at times 0 < ti1 < ti2 < · · · < tini ≤ τi, let wij = tij − ti,j−1 (j = 1, . . . , ni) and wi,ni+1 = τi − tini, where ti0 = 0. These are the observed gap times for individual i, the last of which is possibly right censored. Let f(·|x) and S(·|x) denote the density and survival functions of a gap time given covariates x. In terms of these functions, the likelihood for N subjects can be written as

L(wij, j = 1, . . . , ni + 1, i = 1, . . . , N | xij) = ∏_{i=1}^N [∏_{j=1}^{ni} fij(wij | xij)] Si(wi,ni+1 | xi,ni+1)   (3.6)

If wi,ni+1 = 0, the observation terminates after the ni-th event, hence the final time is not censored and the term involving the survival function in (3.6) disappears.
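The per-subject contribution to the likelihood (3.6) can be sketched as follows; the unit-rate exponential gap model used in the illustration is an arbitrary choice, not the one adopted later for the AVIS data.

```python
import math

def log_gap_likelihood(gaps, log_density, log_survival):
    """Log contribution of one subject to (3.6): observed gaps enter through the
    density, the final (censored) gap through the survival function; a zero
    final gap means no censoring and the survival term is dropped."""
    *observed, last = gaps
    ll = sum(log_density(w) for w in observed)
    if last > 0.0:
        ll += log_survival(last)
    return ll

# illustrative unit-rate exponential gap model: f(w) = e^{-w}, S(w) = e^{-w}
log_f = lambda w: -w
log_S = lambda w: -w
```

Summing this quantity over the N subjects yields the full log-likelihood corresponding to (3.6).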
Observe that the likelihood in (3.6) is the one used by standard survival analysis methods to model a sample of failure times in which the last observation of each subject is censored. An important inferential model of this kind is the AFT (accelerated failure time) model, in which the response time Tij is such that Yij = log(Tij) has a distribution of the form

Yij = ui + βjᵀ xij + σi εij

where xij is the vector of covariates, fixed within gap times but possibly varying across gaps, the βj are gap-dependent regression coefficients, and the εij are i.i.d. random variables whose distribution does not depend on the covariates. Moreover, ui and σi are a subject-specific random effect and scale parameter, respectively. In order to choose an appropriate likelihood, namely a density (and, consequently, a survival function) for our observations, we first need a brief explorative study of our data.
3.4.2 Data pre-processing and choice of the covariates
We confine our interest to whole blood donations performed between January 1st, 2010 and May 15th, 2016 in the main building of AVIS; donations in the mobile collection centres or within hospitals (which represent a small fraction) are neglected. From the perspective of treating donations as recurrent events, the date May 15th, 2016 is the censoring time of the last observation for almost all donors, except those having their last donation exactly on that date. Before tackling the inferential task that is our main goal, the initial raw dataset containing 18,305 total events was subjected to a filtering and cleaning process. In particular, we removed the donors whose number of donations was greater than 15 (these are very few and would increase the variance of the estimates) and those marked by a definitive suspension, i.e. declared no longer eligible to donate. In the end, we obtained a clean dataset containing 17,198 donations, made by 3,333 donors.
Consistently with Section 3.4.1, we denote by Tij the time (in days) passed from donation j − 1 to donation j for donor i: following the approach of accelerated failure time models, the response variable is the logarithm of the gap time, Yij = log(Tij). However, the modeling scheme presented in Section 3.4.1 is enriched by including a random partition model for clustering donors' trajectories via the mixture model discussed in Section 3.2. Before describing the model we are going to fit to our data, a brief descriptive analysis is due.
Figure 3.7: Histogram of the logarithm of the observed gap times, divided according to gender: men (left) and women (right).
An important premise to highlight is that, according to Italian law, the maximum number of whole blood donations is 4 per year for men and 2 for women, with a minimum of 90 days between consecutive donations; this fact causes the lack of a left tail in the histograms of the gap times (in log scale) for men and women displayed in Figure 3.7. Note that the minimum for men is about 4.5, and that exp(4.5) ≈ 90 days, the minimum time between two donations required by law. For women, the distribution has a mode at approximately 5.3 on the log scale, i.e. about 200 days, corresponding to roughly six and a half months. This means that most donors decide to donate as soon as they are allowed to. That said, one may wonder why there are observed gaps shorter than the imposed minimum. This may happen in situations in which the doctor, given good donor health conditions, requires an anticipated donation; it may also happen that, when planning a vacation, a donor decides to donate earlier rather than skip the donation. The strong asymmetry in the distribution of the logarithm of the gap times motivated the choice of a skew normal distribution for the likelihood; see Section 3.4.3.
In Figure 3.8, the mean and the median of the gap times Tij over i, for each value of j, are reported. Both quantities are higher for small j and then decrease as j increases; a possible explanation is that the more a donor proceeds with the donations, the more loyal to the activity he becomes, developing a kind of regularity in donating.

Figure 3.8: Mean and median of the gap times (in days) Tij, i = 1, . . . , 3333, for each j ∈ {1, . . . , 15}.

Moreover, the number of observations for each j
from 1 to 15 is clearly decreasing: 765, 604, 415, 351, 312, 217, 174, 119, 84, 76, 75, 43, 39, 38, 21. We notice a change at time t∗ = 8: indeed, until that time the average percentage of people who do not return to donate again is around 20%-30%, while after t∗ it decreases to 10%. This can be interpreted as a sort of loyalty of donors, which strengthens after a handful of donations.
As far as the covariates are concerned, the association recorded the following at each blood donation:
- age: continuous (in years);
- BMI: an indicator of body fatness, calculated as a person's weight in kilograms divided by the square of the height in meters;
- gender: 1 if the donor is a woman, 0 if a man;
- blood type: four levels, depending on the blood type (0, A, B, AB);
- Rh factor: 1 if positive, 0 if negative;
- smoking habits: 1 if the donor regularly smokes, 0 otherwise;
- physical activity: 1 if the donor regularly practices a sport, 0 otherwise.
Note that all the covariates not strictly related to blood features (such as weight, height and life habits) are declared by the donor and not measured by a doctor: therefore, they may be quite inaccurate. Table 3.1 shows the empirical frequencies of the static covariates described above.
In what follows, we consider age and body mass index (BMI) as time-varying continuous covariates. On the other hand, gender, blood type, Rh factor, smoking habits and physical activity are treated as static covariates.
Female 31.56% | A 37.86% | B 12.27% | AB 3.81% | 0 46.06% | Rh+ 86.7% | Smoker 32.43% | Activity 69.78%

Table 3.1: Empirical relative frequencies for the different categories of the static covariates.
Figure 3.9: Histogram of the two time varying covariates: age (left) and BMI (right).
Figure 3.9 shows the histograms for the two donation-dependent covariates: the
1st empirical quartile for the covariate BMI is 21.98, and the 3rd quartile is 26.03,
with a median of 23.9. For the age, the 1st empirical quartile is 28, and the 3rd
quartile is 44, with a median of 35.
3.4.3 Skew normal mixture model
Skew normal mixture models have been successfully employed in various contexts: in the Bayesian framework see, for instance, Bayes and Branco (2007), Frühwirth-Schnatter and Pyne (2010), Canale and Scarpa (2013) and Arellano-Valle et al. (2007).
The skew normal distribution is a continuous probability distribution that generalizes the normal distribution to allow for non-zero skewness: for its properties, see Azzalini (2005) and Arellano-Valle and Azzalini (2006). The univariate skew normal distribution has three parameters: location ξ ∈ R, scale ω ∈ R+ and skewness λ ∈ R, and it is denoted by SN(ξ, ω, λ). The probability density function is
f(x) = (2/ω) φ((x − ξ)/ω) Φ(λ (x − ξ)/ω)

where φ and Φ denote the probability density function and the cumulative distribution function of a standard normal random variable, respectively. For λ = 0 the normal distribution N(ξ, ω²) is recovered. Figure 3.10 compares the density for different values of the skewness parameter: it is clear how it drives the asymmetry of the distribution.
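The density above is straightforward to implement with the standard-normal pdf and cdf (the latter via the error function):

```python
import math

def skew_normal_pdf(x, xi=0.0, omega=1.0, lam=0.0):
    """Skew normal density f(x) = (2/omega) * phi(z) * Phi(lam * z),
    with z = (x - xi)/omega."""
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))    # standard normal cdf
    return 2.0 / omega * phi * Phi
```

For λ = 0 the factor Φ(0) = 1/2 cancels the 2, recovering the normal density, and f(x; λ) = f(−x; −λ) gives the mirror-image behavior visible in Figure 3.10.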
Figure 3.10: Probability density function of the skew normal distribution with ξ = 0, ω = 1 and skewness parameter λ ∈ {−4, −1, 0, 1, 4}.
We use a stochastic representation of the skew normal distribution. Let Z and ε be two independent random variables, where Z has a half-normal distribution, Z ∼ TN[0,+∞)(0, 1), and ε ∼ N(0, 1). Then, the random variable X defined by

X = δZ + √(1 − δ²) ε, for any δ ∈ (−1, 1),

is skew-normal distributed with λ = δ/√(1 − δ²): this is a one-to-one correspondence which maps (−1, 1) onto R (i.e. λ ∈ R). In order to also account for location and scale parameters, we may introduce the random variable Y, defined through an affine transformation of X, as

Y = ξ + ωX = ξ + ω(δZ + √(1 − δ²) ε)   (3.7)
and we have that Y ∼ SN(ξ, ω, λ). However, the parameterization adopted in what follows is different (see Frühwirth-Schnatter and Pyne (2010)), and it is convenient when performing posterior inference, as made clear later. Thus, we define

ξ = ξ,  ψ = ωδ = ωλ/√(1 + λ²),  σ² = ω²(1 − δ²) = ω²/(1 + λ²)   (3.8)

and we use the following stochastic representation:

Y = ξ + ψZ + σε,  Y ∼ SN(ξ, ω², λ),  λ = ψ/σ,  ω² = σ² + ψ²,

where ψ ∈ R and σ² ∈ R+. To conclude, we recall that E(Y) = ξ + ψ√(2/π) and Var(Y) = ψ²(1 − 2/π) + σ².
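The stochastic representation above translates directly into a sampler. The following sketch (assuming NumPy is available; function and variable names are ours, not the thesis's) draws from SN(ξ, ω, λ) via the half-normal construction and checks the closed-form moments just recalled.

```python
import numpy as np

def rskewnormal(n, xi=0.0, omega=1.0, lam=0.0, rng=None):
    """Draw n samples from SN(xi, omega, lam) using the representation
    X = delta*Z + sqrt(1 - delta^2)*eps, Y = xi + omega*X, with
    Z half-normal and eps standard normal, independent."""
    rng = rng or np.random.default_rng()
    delta = lam / np.sqrt(1.0 + lam**2)
    z = np.abs(rng.standard_normal(n))       # Z ~ TN_[0,inf)(0, 1)
    eps = rng.standard_normal(n)             # eps ~ N(0, 1)
    return xi + omega * (delta * z + np.sqrt(1.0 - delta**2) * eps)

# Compare empirical moments with E(Y) = xi + psi*sqrt(2/pi) and
# Var(Y) = psi^2*(1 - 2/pi) + sigma2, where psi = omega*delta.
xi, omega, lam = 1.0, 2.0, 4.0
delta = lam / np.sqrt(1.0 + lam**2)
psi, sigma2 = omega * delta, omega**2 * (1.0 - delta**2)
y = rskewnormal(200_000, xi, omega, lam, np.random.default_rng(0))
print(y.mean())   # close to xi + psi*sqrt(2/pi)
print(y.var())    # close to psi^2*(1 - 2/pi) + sigma2
```

The same construction, after adding the regression terms, is what makes the Gibbs sampler of Appendix 3.B conditionally Gaussian.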
3.4. The AVIS data on blood donations 75
The skew normal distribution is chosen to describe the likelihood of the logarithm of the observations (gap times), after accounting for a regression term that considers both static covariates and donation-dependent covariates, as follows:

Yij = log Tij | si = l, βj, β0, ul, ωl, λl ∼ SN(ul + x′ij βj + x′i β0, ω²l, λl),

for j = 1, . . . , (ni + 1), i = 1, . . . , Nd, where each observation is independent of the others, across i and j. Here, Nd denotes the number of donors and ni the number of recorded donations for the i-th subject. This is an AFT model. Note that the cluster-dependent parameters are the intercept ul, the scale ωl and the skewness λl, while the regression coefficients do not vary among the groups. Using the stochastic representation (3.7) and reparametrizing as in (3.8), we obtain the equivalent model for the likelihood:

Yij | si = l, βj, β0, ul, ψl, σ²l, ηij ∼ N(ul + x′ij βj + x′i β0 + ψl ηij, σ²l)
ηij ∼ TN_[0,∞)(0, 1)

for j = 1, . . . , (ni + 1), i = 1, . . . , Nd. The priors we assume are the following:
β0 ∼ Np1(0, Σ0)
β1, . . . , βJ+1 | τ²1, . . . , τ²p2 iid∼ Np2(0, diag(τ²1, . . . , τ²p2))
τ²1, . . . , τ²p2 iid∼ IG(ν0, τ0)
p(ρn = (e1, . . . , en)) ∼ PPMx
(θl = (ul, ψl)T, σ²l), l = 1, . . . , K | K iid∼ N2(θl; θ0, σ²l K0) × IG(σ²l; a, b)   (3.9)

where diag(τ²1, . . . , τ²p2) is a diagonal matrix whose diagonal entries are τ²1, . . . , τ²p2 and p2 is the number of time-varying covariates in the regression. The prior distribution for the vector of unique values (θl = (ul, ψl)T, σ²l), l = 1, . . . , K, is the same as in Frühwirth-Schnatter and Pyne (2010), where the couple (ul, ψl)T is a priori Gaussian distributed with mean θ0 = (ξ0, ψ0)T and variance-covariance matrix σ²l K0 = σ²l diag(κ0, κ1). These priors are conditionally conjugate, helping us when devising the Gibbs sampler of Appendix 3.B.
As far as the covariate information for our case study is concerned, we need to distinguish between what enters as a standard dependent variable through linear regression and what influences the prior on the partition through the similarity function g(·). Clearly, the variables can be repeated, if one thinks that a certain covariate is important from the viewpoint of both regression and clustering. In the regression setting, after a thorough investigation (see Gianoli (2016)), we decided to consider the following covariates: the time-varying continuous covariates are age and the body mass index (BMI), which is defined as the body weight divided by the square of the body height (p2 = 2). Gender, blood type, Rh factor and smoking habits are treated as static covariates (hence, p1 = 6). On the other hand,
Test   λ      κ    σ      (κ0, κ1)   (a, b)          LPML      Epost(K)  Vpost(K)
a      0.01   0.5  0.001  (5, 5)     (2.04, 0.208)   6094.255  5.043     0.041
b      0.01   0.5  0.2    (5, 5)     (2.04, 0.208)   5594.895  5.034     0.033
c      0.01   0.5  0.5    (5, 5)     (2.04, 0.208)   5528.061  5.891     0.446
d      0.005  0.5  0.2    (5, 5)     (2.04, 0.208)   6053.80   4.038     0.038
e      0      0.5  0.5    (5, 5)     (2.04, 0.208)   6397.37   3.6155    0.305
f      0.01   0.5  0.1    (10, 10)   (2.5, 0.3)      5660.43   7.174     1.084

Table 3.2: Test settings for the AVIS dataset: the parameter λ and the pair (κ, σ) refer to the similarity function g(·) and the NGG process, respectively. Moreover, (κ0, κ1) are the diagonal entries of the matrix K0 and (a, b) the parameters of the inverse gamma in (3.9). Vpost denotes posterior variance.
the covariates that enter in the similarity function driving the prior on the partition are all considered static and take the value recorded at the first blood donation: age, gender, BMI, blood type, physical activity habits, Rh factor and smoking habits.
3.4.4 Case study

In order to manage the complexity of the algorithm and the size of the dataset, we implemented the algorithm in C++. Every run of the Gibbs sampler produced a final sample of 2,000 iterations, after a thinning of 5 and an initial burn-in of 2,000 iterations. In all cases, convergence was checked using both visual inspection of the chains and standard diagnostics available in the R package CODA. We fix hyperparameters as follows: for the variance-covariance matrix of the static regression coefficients, Σ0 = diag(1, . . . , 1), and (a0, b0) = (2.25, 0.625), such that the variances τ²1, . . . , τ²p2 have a priori mean 0.5 and variance 1. Moreover, θ0 = (0, 0)T. The other hyperparameters vary as described in Table 3.2. In the tests named a, b, c the parameter σ increases while the others are fixed; in b, d and e, on the other hand, λ varies. Note that, since λ = 0 in test e, we are not considering the effect of the covariates in the prior for the partition. For all the tests, we chose the similarity function gC of Section 3.2.1, with the single parameter λ, which rescales the distances among covariates in a cluster.
As far as the donation-dependent covariates are concerned, all the tests agree, so we report inference for test c in Table 3.2. Posterior summaries of βj, j = 1, . . . , 15, show that the covariate age has little effect: indeed, almost all the 90% posterior credibility intervals (except the 8th) contain 0 (see the left panel of Figure 3.11, for test c). Intuitively, the covariate age changes little over time, since the temporal window of the study is 6 years. The effect of age on the 8th gap time seems slightly negative: young people tend to donate more frequently, once they have become loyal donors. On the other hand, the covariate BMI has a stronger effect, which increases moderately over the subsequent donations. The effect is positive, meaning that donors with a higher BMI usually experience larger gap times; this is due to the fact that donors with high BMI undergo more detailed
Figure 3.11: Posterior distribution of the regression coefficients corresponding to donation-dependent covariates, for Age (left panel) and BMI (right panel), plotted against the donation number with 90% credibility intervals, under test c in Table 3.2.
medical check-ups, such as cardiological examinations, thus extending the gap time. The right panel of Figure 3.11 displays the posterior mean and 90% credibility interval for the donation-dependent coefficients related to the covariate BMI under test c in Table 3.2. Here, the 15th intervals are not displayed because their variance is significantly larger than that of the previous ones.
Concerning the interpretation of static covariates, Figure 3.12 shows posterior estimates and 90% credibility intervals for the 6 regression coefficients. We found that the covariate smoking is of little importance. On the contrary, the effect of the covariate gender is very strong: due to the regulation, women tend to experience longer gap times, as expected. Also the posterior distribution of the Rh factor coefficient has support concentrated on (small) positive values. Donors with a positive Rh factor tend to show slightly longer gap times than donors with a negative Rh factor: the latter are less common (13.3% in our population) and cannot receive blood from those with a positive factor.

Also the blood type seems to have a mild effect: even if the posterior of the parameter related to blood type A contains 0 in all the cases of Table 3.2, blood type B has a positive effect, as does AB. Note that we tried to replace the two covariates related to blood type and Rh factor with their interaction: the results were unsatisfactory because of identifiability issues for β0.
Now, we compare the tests in Table 3.2 in order to gain insight into the role of the parameters. Comparing the posterior of the number of groups K in tests a, b and c, we see that for smaller values of σ the mass is concentrated at 5, while in case c (σ = 0.5) the mode is at 6 and the variance is also larger, giving mass
Figure 3.12: Posterior distribution of the regression coefficients corresponding to static covariates (one panel each for Gender, A, AB, B, RH and Smoke, with 90% credibility intervals marked) under test c in Table 3.2.
to the set {5, 6, 7, 8, 9}. The optimal partition, obtained by minimizing the Binder loss function, becomes more interpretable when σ is large; see Figure 3.13. The six groups have different features in terms of number of donations (the green and purple clusters gather donors with a smaller number of donations with respect to the others). The red cluster clearly contains those donors that are very regular (loyal), since their gap time is approximately constant in time. The blue and orange groups seem to contain donors that become loyal with time, since the trajectories decrease with the donations; however, the orange group shows a faster decrease on average and a slightly smaller number of donations. The small yellow group gathers the donors that are regular in time but have a relatively small number of donations. As far as the effect of covariates on the optimal partition is concerned, it seems that the covariates age, Rh and smoking influence the partition. We also tried to apply the Variation of Information criterion of Meilă (2007) in order to obtain the optimal partition. According to this method, we obtained 4 groups under setting c (not reported here). The cardinalities of the groups are (407, 1180, 1736, 10). Differently from the result obtained with the Binder loss function, the clusters do not differ in terms of number of donations.
Figure 3.13: Optimal partition, according to the Binder loss function minimization method, under test c of Table 3.2. Each of the six panels shows a cluster (gap time in days versus number of donations; cluster sizes 394, 1133, 251, 502, 1041 and 11 elements): the thick black line represents the mean curve within each group, computed for each donation. Different behaviors may be noticed.
The hyperparameters of P0 in test f are different from those in the other tests; in this case, we obtain a larger a-posteriori mean of Kn. This behavior is common among nonparametric mixture models, which are not particularly robust with respect to the choice of the base measure.

The choice of the temperature parameter λ related to the similarity function is not straightforward, since it quite strongly influences the posterior on the number of clusters: decreasing λ leads to a reduction of the number of clusters K. Indeed, test e, corresponding to a prior on the partition that is not influenced by the covariates, puts mass on the set {3, 4, 5, 6} with mode at 4. Figure 3.14 displays the optimal partition under test e. We found 2 large and 2 very small clusters, which are harder to interpret than those in Figure 3.13. In general, we suggest the following approach to fix λ: first, compute all the possible pairwise distances among the set of covariates, then evaluate the range of these values and choose λ so as to rescale the range of the distances into a given interval.
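This heuristic can be sketched in a few lines; in the snippet below the distance (Euclidean, on standardized covariates) and the target upper bound for the rescaled distances are our own illustrative choices, not prescribed by the text.

```python
import numpy as np

def choose_lambda(X, target=1.0):
    """Pick lambda so that the largest pairwise covariate distance,
    once multiplied by lambda, equals `target`; this rescales the
    whole range of distances into (0, target]."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))   # all pairwise distances
    return target / dist.max()

# illustrative covariates: 100 subjects, 3 standardized covariates
X = np.random.default_rng(1).standard_normal((100, 3))
lam = choose_lambda(X, target=1.0)
```

Other quantiles of the distance distribution (instead of the maximum) could serve as the anchor if outlying subjects are a concern.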
Figure 3.14: Optimal partition, according to the Binder loss function minimization method, under test e of Table 3.2 (gap time in days versus number of donations; cluster sizes 1158, 2160, 11 and 4 elements). In this test, the effect of the similarity function in the prior for the partition is null.
Finally, Figure 3.15 shows the predictive distribution for one donor of the population and different donations: the vertical line is the actual observation. The blue lines represent the 90% credibility interval for the predictive law, and the black line is the prediction. Note the skewness of the distributions.

We also mention that we computed cross-validated prediction errors. We randomly removed 333 donors from the dataset and used the remaining 3,000 to compute the posterior distribution of the parameters. Then, we computed the following index of goodness-of-fit:

Σ_{i=1}^{333} Σ_{j=1}^{ni} |ŷij − y^test_ij|,

where the predicted value ŷij is the mean of the predictive law. We obtained the value 39,338. This can be used for comparison with other methods: this will be the object of further investigation.
Figure 3.15: Predictive distribution (density of the log gap time, three panels for Time 1, Time 2 and Time 3) under test a of Table 3.2.
We conclude the analysis of the AVIS dataset by stating the main findings we obtained; indeed, we illustrated our findings to the general manager and the medical director of the Milan department of AVIS and discussed how the policy-makers may take advantage of the results.

From their viewpoint, it is important to understand the different profiles of donors and how covariates affect the donors' behavior, in order to better plan awareness campaigns and to launch surveys investigating why some donors stop donating without any medical reason.
3.5 Discussion and future work
In this chapter we presented a product partition model with dependence on covariates; differently from Müller et al. (2011), the similarity function we assume is a generic non-increasing function of the distances among covariates in a cluster. We proposed three possible choices; as a future development of this work, a thorough investigation of the properties of the prior on the random partition induced by the functional form of the similarity is needed. In particular, it is still not clear how to balance the effect of the covariates against the cohesion function induced by the underlying completely random measure. For this reason, we provided empirical guidelines for choosing the hyperparameters.
Moreover, we mention that we tried to depart from the product partition form of the similarity, by considering g(xA1, . . . , xAK) as a trade-off between compactness and separation, as follows:

g_comp(xA1, . . . , xAK) = (1/K) Σ_{l=1}^K g(xAl),

where the normalization by the number of groups K is made in order not to promote a large number of groups (since we are summing over the groups Al). Essentially, we compute an average of the compactness of the clusters. As far as the separation is concerned, the objective is to encourage clusters that are well separated, i.e. whose centroids are far from each other. We defined the separation as

g_sep(xA1, . . . , xAK) = g_sep(c1, . . . , cK) = (1/K) Σ_{l=1}^K (1 − e^{−λ t*_l})/(1 + e^{−λ t*_l}),

where t*_l = nl d(cl, c̄). Here, nl and cl stand for the cardinality and the centroid of group l, and c̄ represents the global centroid. Again, the normalization with respect to K is made in order not to favor a large K, while the logistic transformation rescales the contribution to (0, 1), to balance the effect of the compactness (which is a number in (0, 1), indeed). Summing up, we have

g(xA1, . . . , xAK) = g_comp(xA1, . . . , xAK) + γ g_sep(xA1, . . . , xAK),

where γ is a weight for the effect of the separation. However, the preliminary numerical tests were unsatisfactory, since the results were very similar to the case without covariates in the prior; this may be due to the averaging effect we introduce by summing over all the components of the partition. However, different ways of defining separation can be employed, and this will be the subject of future research.
As far as the application to the blood donation data is concerned, the main development is to improve the predictive accuracy of our model: indeed, other possible models for the likelihood could be employed.
Appendix 3.A: Gibbs sampler

In this section, we illustrate the Gibbs sampler Pólya urn scheme for the conjugate case, i.e. when the base distribution P0 is conjugate to the mixture kernel f(·|x, θ). In particular, in our case we consider f as the kernel of a Gaussian distribution, θ = (β, σ²), where β is the vector of the regression coefficients, namely f(y; x, β, σ²) = N(y; xTβ, σ²). The base measure P0 is given by

θ = (β, σ²) ∼ Np(β; μ0, σ²B0) × IG(σ²; a0, b0).
The algorithm is an extension of Algorithm 8 in Neal (2000), later generalized to NormCRMs in Favaro and Teh (2013) (see Section 3 there). It is a marginal MCMC sampler in its simplest form, thanks to the above-mentioned conjugacy. In this case, the cluster parameters θ*1, . . . , θ*K can be efficiently marginalized out from the joint distribution (3.4), obtaining

∏_{j=1}^K ( m(y*j) c(u, nj) g(x*j) ) D(u, n),

where m(y*j) is the marginal distribution of the data in the j-th cluster. Therefore, the Gibbs sampler is obtained by repeatedly sampling from the following full conditionals (for simplicity of notation, we use the term rest to denote all the variables except the one on the left of the conditioning bar):
1. For the auxiliary variable u, note that, given the partition ρn, it is independent of the observations. Thus, we sample from L(u|ρn, x, y, θ*1, . . . , θ*K) ∝ u^{n−1} e^{−Ψ(u)} ∏_{j=1}^K c(nj, u), which in the case of the NGG simplifies to

L(du|rest) ∝ u^{n−1} (u + 1)^{σK−n} e^{−(κ/σ)((u+1)^σ − 1)} du.

We use a simple Metropolis-Hastings update with a Gaussian proposal kernel truncated to (0, +∞).
2. Sampling from L(θ*1, . . . , θ*K | rest): each θ*j = (β*j, σ²*j), for j = 1, . . . , K, is updated within each cluster according to the usual parametric update in the conjugate Normal - Normal-inverse gamma case. In particular, for each j = 1, . . . , K, the cluster-specific parameters can be sampled independently from the following distributions:

β*j | σ²*j, rest ∼ N(μ*, σ²*j B*),

where B* = (B0^{−1} + Σ_{i∈Aj} xi xiT)^{−1} and μ* = B*(B0^{−1} μ0 + Σ_{i∈Aj} yi xi), and

σ²*j | rest ∼ IG( a*j = a0 + nj/2,  b*j = b0 + (1/2)( μ0T B0^{−1} μ0 + Σ_{i∈Aj} yi² − μ*T B*^{−1} μ* ) ).
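For reference, the within-cluster conjugate update can be sketched as follows; this is a minimal NumPy implementation under the notation above (function and variable names are ours).

```python
import numpy as np

def nig_update(X, y, mu0, B0, a0, b0):
    """Posterior parameters of the Normal - Normal-inverse gamma model:
    beta | sigma2 ~ N(mu0, sigma2*B0), sigma2 ~ IG(a0, b0),
    y_i | beta, sigma2 ~ N(x_i' beta, sigma2), x_i' the rows of X."""
    B0_inv = np.linalg.inv(B0)
    prec = B0_inv + X.T @ X          # posterior precision (up to sigma2)
    B_star = np.linalg.inv(prec)
    mu_star = B_star @ (B0_inv @ mu0 + X.T @ y)
    a_star = a0 + len(y) / 2.0
    # mu*' B*^{-1} mu* computed with `prec` to avoid a second inversion
    b_star = b0 + 0.5 * (mu0 @ B0_inv @ mu0 + y @ y - mu_star @ prec @ mu_star)
    return mu_star, B_star, a_star, b_star
```

A draw of θ*j is then obtained by sampling σ²* from IG(a*, b*) and β* from N(μ*, σ²* B*).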
3. The random partition ρn can be updated using a form of Gibbs sampling whereby the cluster assignment of one item Yi is updated at a time. Denote by ρ^{(−i)}_{n−1} the partition of the n − 1 items after the i-th item has been removed, and by ei = l the event that Yi is assigned to cluster l, where l varies in {1, . . . , k^{(−i)}_{n−1}, k^{(−i)}_{n−1} + 1} and k^{(−i)}_{n−1} is the number of clusters in the partition without i. Note that k^{(−i)}_{n−1} + 1 is included to consider the case where the item forms a new cluster. Therefore, we have to sequentially sample, for i = 1, . . . , n, from the following conditional distribution:

L(ei = l | u, {xi}, {yi}, ρ^{(−i)}_{n−1}) = L({yi} | u, x, ρ^{(−i)}_{n−1}, ei = l) p(ei = l | ρ^{(−i)}_{n−1}) / L({yi} | u, x, ρ^{(−i)}_{n−1}),   (3.10)

where l = 1, . . . , k^{(−i)}_{n−1} + 1. Moreover, observe that, for any l = 1, . . . , k^{(−i)}_{n−1}, k^{(−i)}_{n−1} + 1, the prior on the partition can be written as:

p(ρ^{(−i)}_{n−1}, ei = l | rest) ∝ D(u, n) ∏_{j=1}^{K} ( c(u, nj) g(x*j) )
 ∝ D(u, n) ∏_{j=1}^{k^{(−i)}_{n−1}} ( c(u, nj) g(x*j) ) · c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) )
 ∝ p(ρ^{(−i)}_{n−1}, u) · c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) ).
Note that g(∅) = 1 and, in the new-cluster case, c(nl, u) = c(0, u) = 1. It is important to highlight the contribution given by the similarity function here, compared to the case without covariates, as in Favaro and Teh (2013): the probability of assigning item i to an already existing cluster l is modified according to the ratio g(x*l ∪ {xi})/g(x*l), which quantifies how the total similarity among the covariates varies when adding i to the group. If xi is very similar to the others, the ratio will be greater than 1; on the other hand, if the covariate is very different, the contribution of the similarity will be less than 1, thus decreasing the probability of assigning the item to that cluster. Moreover, the contribution of the likelihood in (3.10) is:

L({yi} | u, {xi}, ρ^{(−i)}_{n−1}, ei = l) = ∏_{j=1, j≠l}^{k^{(−i)}_{n−1}} m(y*j) · m(y*l ∪ {yi}) = L(y^{(−i)} | ρ^{(−i)}_{n−1}) m(y*l ∪ {yi}) / m(y*l),

where m(∅) = 1 in the case of a new cluster. Therefore, (3.10) becomes

L(ei = l | rest) = L(y^{(−i)} | ρ^{(−i)}_{n−1}) · ( m(y*l ∪ {yi}) / m(y*l) ) · ( 1 / L(y1, . . . , yn | ρ^{(−i)}_{n−1}) ) · ( p(ei = l, ρ^{(−i)}_{n−1}) / p(ρ^{(−i)}_{n−1}) )
 ∝ ( m(y*l ∪ {yi}) / m(y*l) ) · ( c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) ) ),

so that each ei is sequentially assigned according to this law.
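The resulting allocation rule can be sketched generically. In the snippet below, `marg` and `g` are placeholders for the marginal likelihood m(·) and the similarity g(·) (both problem-specific and user-supplied), and the NGG cohesion ratios n_l − σ for an existing cluster and κ(u+1)^σ for a new one are used, as recalled later for the blood donation application.

```python
import numpy as np

def allocation_probs(i, clusters, marg, g, u, kappa=0.5, sigma=0.5):
    """Normalized probabilities of assigning item i to each existing
    cluster (a list of index sets) or to a new one. `marg(idx)` and
    `g(idx)` must return m(y_idx) and g(x_idx), respectively."""
    w = []
    for A in clusters:
        cohesion = len(A) - sigma                 # c(n_l + 1, u) / c(n_l, u)
        similarity = g(A + [i]) / g(A)            # g(x*_l ∪ {x_i}) / g(x*_l)
        likelihood = marg(A + [i]) / marg(A)      # m(y*_l ∪ {y_i}) / m(y*_l)
        w.append(cohesion * similarity * likelihood)
    # new cluster: c(0, u) = g(∅) = m(∅) = 1
    w.append(kappa * (1.0 + u) ** sigma * g([i]) * marg([i]))
    w = np.array(w)
    return w / w.sum()
```

With constant `marg` and `g`, the rule reduces to the usual NGG urn scheme; the similarity ratio is exactly the covariate-driven correction discussed above.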
Appendix 3.B: Gibbs sampler for the blood donations application

In this section we develop a Gibbs sampler to sample from the posterior of our model in Section 3.4: thanks to a careful choice of the priors, most of the full conditionals are conjugate, thus accelerating the computation. This is important, given the cardinality of the sample and the dimensionality of the parameters. Indeed, the state space is given by

{(ηij)_{j=1}^{ni+1}, i = 1, . . . , Nd}, {Y^cens_{i(ni+1)}, i = 1, . . . , Nd}, β0, (βj)_{j=1}^{J+1}, (τ²m)_{m=1}^{p2}, ρn, (ul, ψl, σ²l)_{l=1}^{K}.

The full conditionals are listed below:
The full-conditionals are listed below:
Parameters (ηij)ni+1j=1 , i = 1, .., Nd: each ηij can be independently sampled accord-
ing to
L(ηij |rest) ∝ exp
(− 1
2σ2l
(yij − (ul + x′ijβj + x′iβ0 + ψlηij)
)2 − 1
2η2ij
)I (ηij > 0)
which turns out to be a truncated normal, namely
ηij |rest ∼ T N [0,∞)
(ψl
σ2l + ψ2
l
(yij − (ul + x′ijβj + x′iβ0)
),
σ2l
σ2l + ψ2
l
)for j = 1, . . . , ni + 1 and i = 1, . . . , Nd.
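This truncated-normal draw can be implemented with a simple inverse-cdf step; the sketch below uses only the Python standard library (function name and interface are ours).

```python
import random
from statistics import NormalDist

def sample_eta(y_tilde, psi, sigma2, rng=random):
    """Draw from TN_[0,inf)(m, s^2) with m = psi/(sigma2 + psi^2)*y_tilde
    and s^2 = sigma2/(sigma2 + psi^2), i.e. the full conditional of
    eta_ij; y_tilde is the observation minus the regression terms."""
    m = psi / (sigma2 + psi * psi) * y_tilde
    s = (sigma2 / (sigma2 + psi * psi)) ** 0.5
    nd = NormalDist(m, s)
    lo = nd.cdf(0.0)                      # normal mass below the truncation
    u = lo + rng.random() * (1.0 - lo)    # uniform on [F(0), 1)
    return nd.inv_cdf(u)
```

The same routine, with the lower bound replaced by the censoring time, serves for the censored gap times in the next full conditional.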
Parameters Y^cens_{i(ni+1)}, i = 1, . . . , Nd: the censored observations are independently sampled according to

Y^cens_{i(ni+1)} | rest ∼ TN_{[yi(ni+1),+∞)}( ul + x′ij βj + x′i β0 + ψl ηij, σ²l )

for i = 1, . . . , Nd (with j = ni + 1).
Parameter β0: thanks to conjugacy, the full conditional is a p1-dimensional multivariate Gaussian, with mean β*0 and variance-covariance matrix Σ*0, where

Σ*0 = ( Σ0^{−1} + Σ_{i=1}^{Nd} ((ni + 1)/σ²l) xi xiT )^{−1}

and

β*0 = Σ*0 ( Σ_{i=1}^{Nd} Σ_{j=1}^{ni+1} ( yij − (ul + x′ij βj + ψl ηij) )/σ²l · xi ).
Parameters (βj)_{j=1}^{J+1}: each coefficient vector βj can be sampled independently from a p2-dimensional multivariate Gaussian, with mean β*j and variance-covariance matrix Σ*j, where

Σ*j = ( diag(τ²1, . . . , τ²p2)^{−1} + Σ_{i:(ni+1)≥j} (1/σ²l) xij xijT )^{−1}

and

β*j = Σ*j ( Σ_{i:(ni+1)≥j} ( yij − (ul + x′i β0 + ψl ηij) )/σ²l · xij ).
Parameters (τ²m)_{m=1}^{p2}: each parameter τ²m is independently sampled from

τ²m | rest ∼ IG( ν0 + (J + 1)/2, τ0 + (1/2) Σ_{j=1}^{J+1} β²jm ),  m = 1, . . . , p2.
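Sampling from this inverse gamma full conditional only needs a gamma draw, since X ∼ IG(a, b) if and only if 1/X ∼ Gamma(a, rate b); a standard-library sketch (names ours, default hyperparameters taken from Section 3.4.4):

```python
import random

def sample_tau2(beta_m, nu0=2.25, tau0=0.625):
    """One draw of tau2_m | rest ~ IG(nu0 + (J+1)/2, tau0 + 0.5*sum_j beta_jm^2),
    where beta_m collects the m-th coordinate of beta_1, ..., beta_{J+1}."""
    a = nu0 + len(beta_m) / 2.0
    b = tau0 + 0.5 * sum(v * v for v in beta_m)
    # random.gammavariate takes a scale parameter, so rate b -> scale 1/b
    return 1.0 / random.gammavariate(a, 1.0 / b)
```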
Parameters (ul, ψl, σ²l)_{l=1}^{K}: the likelihood for the data in cluster Al that is used to build the joint distribution of (ul, ψl, σ²l) is proportional to

∏_{i∈Al} ∏_{j=1}^{ni+1} (2πσ²l)^{−1/2} exp( −(1/(2σ²l)) ( ỹij − (ul + ψl ηij) )² ),

with ỹij = yij − x′ij βj − x′i β0. This is equivalent to having data {Ỹi : i ∈ Al}, where each Ỹi has dimension ni + 1. If we denote the stacked vector by Ỹl, of dimension Nl = Σ_{i∈Al}(ni + 1), its conditional distribution is Gaussian with mean Xl θl and variance-covariance matrix σ²l I_{Nl×Nl}. The design matrix Xl has rows γij = (1, ηij), for i ∈ Al and j = 1, . . . , ni + 1. Moreover, θl = (ul, ψl)T.
The prior we chose is conjugate, therefore we only need to update the parameters. In particular, we have

σ²l | rest ∼ IG( a + Nl/2, b + (1/2)( Σ_{i,j} ỹ²ij + θ0T K0^{−1} θ0 − θ*0T K*^{−1} θ*0 ) ),

where K* = ( Σ_{i,j} γij γijT + K0^{−1} )^{−1} and θ*0 = K*( Σ_{i,j} ỹij γij + K0^{−1} θ0 ). The intercept ul and the skewness parameter ψl are no longer independent a posteriori; they are conditionally Gaussian with mean θ*0 and variance-covariance matrix σ²l K*.
Partition ρn: in order to sample the partition, we need to resort to a generalization of Algorithm 8 of Neal (2000), since we are in the non-conjugate case. The steps are very similar to those described in Section 3.2 of Favaro and Teh (2013), except for the presence of the similarity function g(·). In particular, the probability of assigning the i-th subject to cluster l, for l = 1, 2, . . . , k^{(−i)}_{n−1}, is

p(ei = l | rest) ∝ p(ρ^{(−i)}_{n−1}, u) · ( c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) ) ) · ∏_{j=1}^{ni+1} N( yij; ul + xijT βj + xiT β0 + ψl ηij, σ²l ),   (3.11)

where the superscript (−i) denotes a quantity referring to the partition of the n − 1 subjects after i has been removed. Moreover, we need to take into account M possible new clusters, whose parameters {um, ψm, σ²m}, m = 1, . . . , M, are generated from the prior distribution. The probability of allocating the subject to one of these new clusters is as in (3.11) divided by M, with the convention that c(∅, u) = 1 and g(∅) = 1. We recall that, under the NGG assumption,

c(nl + 1, u)/c(nl, u) = nl − σ if nl > 0, and κ(u + 1)^σ otherwise.
Auxiliary parameter u: the auxiliary parameter for the normalized completely random measure can be simply drawn using a Metropolis-Hastings step from

L(u | rest) ∝ u^{n−1} exp(−Ψ(u)) ∏_{l=1}^{Kn} c(nl, u),

which in the NGG case turns out to be

u^{n−1} (1 + u)^{σKn−n} exp( −(κ/σ)((u + 1)^σ − 1) ).
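A single update of u can be sketched as follows, with the truncated-Gaussian proposal mentioned in Appendix 3.A (standard library only; the step width s is a tuning choice of ours, and we write the NGG exponent as σK − n, consistently with the cohesion ratios recalled above).

```python
import math, random

def log_target_u(u, n, K, kappa=0.5, sigma=0.5):
    """Log full conditional of u under the NGG:
    u^(n-1) * (1+u)^(sigma*K - n) * exp(-(kappa/sigma)*((1+u)^sigma - 1))."""
    return ((n - 1) * math.log(u) + (sigma * K - n) * math.log1p(u)
            - (kappa / sigma) * ((1.0 + u) ** sigma - 1.0))

def _phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mh_step_u(u, n, K, s=1.0, kappa=0.5, sigma=0.5):
    """One Metropolis-Hastings step with a Gaussian proposal truncated
    to (0, +inf); s is a tuning parameter, not taken from the thesis."""
    prop = random.gauss(u, s)
    while prop <= 0.0:                    # rejection-sample the truncation
        prop = random.gauss(u, s)
    # the symmetric Gaussian kernels cancel in the acceptance ratio;
    # only the truncation normalizers Phi(u/s) and Phi(prop/s) remain
    log_ratio = (log_target_u(prop, n, K, kappa, sigma)
                 - log_target_u(u, n, K, kappa, sigma)
                 + math.log(_phi(u / s)) - math.log(_phi(prop / s)))
    return prop if math.log(random.random()) < log_ratio else u
```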
Chapter 4
Determinantal point process mixtures
via spectral density approach
This chapter is based on Bianchini et al. (2017).
In this chapter, we consider mixture models with a finite and random number of components, rather than assuming it infinite. However, in the usual framework described in Chapter 1, it is often the case that we observe an overestimation of the number of groups (both in the nonparametric case and in finite mixture models). This motivates the introduction of a model that induces a-priori separation among the location parameters; this can be achieved by dropping the conditional i.i.d. assumption typical of models such as (1.4). We explore a class of determinantal point process (DPP) mixture models defined via spectral representation, which leads to the required repulsion among the points of the process. We focus on a power exponential spectral density, even if the proposed approach is in fact quite general. In the second part of the chapter we generalize our model to account for the presence of covariates, both in the likelihood, as linear regression, and in the weights of the mixture, by means of a mixture-of-experts approach. This yields a trade-off between repulsiveness of the locations in the mixture and attraction among subjects with similar covariates.

We develop full Bayesian inference through a Gibbs sampler involving a reversible jump step. Finally, we evaluate the effectiveness of the proposed model through several simulation scenarios and data illustrations.
4.1 Introduction
As we discussed in Chapter 1, mixture models are an extremely popular class of models that have been successfully used in many applications; for a review, see, e.g., Frühwirth-Schnatter (2006). Such models are typically stated as

yi | k, θ, π iid∼ Σ_{j=1}^k πj f(yi | θj), i = 1, . . . , n,   (4.1)

where the weights π = (π1, . . . , πk) are constrained to be nonnegative and to sum to 1, θ = (θ1, . . . , θk), and 1 ≤ k ≤ ∞, with k = ∞ corresponding to a nonparametric model. A common prior assumption when k < +∞ is that π ∼ Dirichlet(δ1, . . . , δk) and that the components of θ are drawn i.i.d. from some suitable prior p0. However, the weights π may be constructed differently, e.g. using a (finite or infinite) stick-breaking representation, which poses a well-known connection with more general models, including nonparametric ones; see, e.g., Ishwaran and James (2001b) and Miller and Harrison (2017). A popular class of Bayesian nonparametric models is the Dirichlet process mixture (DPM) model, introduced in Ferguson (1983) and Lo (1984). It is well known that this class of mixtures usually overestimates the number of clusters, mainly because of the rich-get-richer property of the Dirichlet process. By this we mean that both prior and posterior distributions are concentrated on a relatively large number of clusters, but a few are very large, and the rest of them have very small sample sizes. Mixture models may even be inconsistent; see Rousseau and Mengersen (2011), where concerns about overfitted mixtures are illustrated, and Miller and Harrison (2013), for the inconsistency of the posterior distribution of k in DPMs.
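Model (4.1) can be simulated in two steps, first a component label and then a draw from the corresponding kernel; a minimal sketch with Gaussian kernels (all numerical values below are illustrative, not from the chapter):

```python
import random

def sample_mixture(n, weights, locs, scale=1.0, rng=random):
    """Draw n observations from (4.1) with f(. | theta_j) = N(theta_j, scale^2):
    pick component j with probability pi_j, then sample from its kernel."""
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(locs[j], scale) for j in labels]

random.seed(0)
y = sample_mixture(1000, weights=[0.5, 0.3, 0.2], locs=[-4.0, 0.0, 4.0], scale=0.5)
```

The repulsive priors discussed below act on the locations `locs`, keeping the components well separated rather than a priori i.i.d.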
Despite their success, mixture models like (4.1) tend to use excessively many mixture components. As pointed out in Xu et al. (2016), this is due to the fact that the component-specific parameters are a priori i.i.d., and therefore free to move. This motivated Petralia et al. (2012), Fúquene et al. (2016) and Quinlan et al. (2017) to explicitly define joint distributions for θ having the property of repulsion among their components, i.e. such that p(θ1, . . . , θk) puts higher mass on configurations where the components are well separated. For a different approach, via sparsity in the prior, see Malsiner-Walli et al. (2016).
Xu et al. (2016) explored a similar way to accomplish separation of mixture components, by means of a determinantal point process (DPP) acting on the parameter space. DPPs have recently received increased attention in the statistical literature (Lavancier et al., 2015). DPPs are point processes having a product density function expressed as the determinant of a certain matrix, constructed using a covariance function evaluated at the pairwise distances among points, in such a way that higher mass is assigned to configurations of well-separated points; we give details below. DPPs have been used to make inference mostly on spatial data. Bardenet and Titsias (2015) and Affandi et al. (2014) applied DPPs to model spatial patterns of nerve fibers in diabetic patients, a basic motivation being that such fibers become more clustered as diabetes progresses. The latter also discussed applications to image search, showing how such processes can be used to study human perception of diversity in different image categories. Similarly, Kulesza et al. (2012) show how DPPs can be applied to various problems that are relevant to the machine learning community, such as finding diverse sets of high-quality search results, building informative summaries by selecting diverse sentences from documents, modeling non-overlapping human poses in images or video, and automatically building timelines of important news stories. More recently, Shirota and Gelfand (2017) described an approximate Bayesian computation method to fit DPPs to spatial point pattern data. The first paper where DPPs were adopted as a prior for statistical inference in mixture models is Affandi et al. (2013). The statistical literature also includes a number of papers illustrating theoretical properties of estimators of DPPs from a non-Bayesian viewpoint; see, for instance, Biscio and Lavancier (2016, 2017) and Bardenet and Titsias (2015).
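The repulsion mechanism can be seen in a few lines: with a Gaussian covariance function (an illustrative kernel choice of ours, not the power exponential spectral density adopted later in the chapter), the determinant, and hence the DPP density, is larger for well-separated configurations than for clumped ones.

```python
import numpy as np

def config_det(points, rho=1.0, alpha=0.5):
    """Determinant of C(x_i, x_j) = rho * exp(-(x_i - x_j)^2 / alpha^2)
    over a one-dimensional point configuration; a DPP assigns density
    proportional to such a determinant, up to normalization."""
    pts = np.asarray(points, dtype=float)
    C = rho * np.exp(-((pts[:, None] - pts[None, :]) ** 2) / alpha**2)
    return np.linalg.det(C)

close = config_det([0.0, 0.1, 0.2])    # nearly coincident points
spread = config_det([0.0, 1.0, 2.0])   # well-separated points
```

Here `spread` is close to 1 while `close` is nearly 0, so near-coincident locations are strongly penalized; this is exactly the property exploited when the DPP is placed on the mixture locations.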
We discuss full Bayesian inference for a class of mixture densities where the
locations follow stationary DPPs. Our rst contribution is the introduction of an
approach that generalizes and extends the model studied in Xu et al. (2016) who
base their analysis on a special case of DPPs called L-ensambles, which consider a
nite state space. Instead, we resort to the spectral representation of the covariance
function dening the determinant as the joint distribution of component-specic
parameters. Our methods can thus be used with any such valid spectral represen-
tation, as described by Lavancier et al. (2015), which implies great generality of the
proposal. The extensions considered here are stated in the context of both uni- and
multi-dimensional responses, and are detailed in Section 4.2.4.
For the sake of concreteness, our illustrations focus on the case of the power exponential spectral representation; see examples with different spectral densities in Section 4.4.2. This particular specification allows for flexible repulsion patterns, and we discuss how to set up different types of prior behavior, shedding light on the practical use of our approach in that particular scenario. Although we limit ourselves to the case of isotropic DPPs, inhomogeneous DPPs can be obtained by transforming or thinning a stationary process (Lavancier et al., 2015). A crucial point in our models and algorithms is the expression of the DPP density, which is only defined for DPPs restricted to compact subsets S of the state space, with respect to the unit rate Poisson process. When this density exists, it explicitly depends on S. A sufficient condition for the existence of the density is that all the eigenvalues of the covariance function, restricted to S, are smaller than 1. We follow the spectral approach and assume that the covariance function defining the DPP has a spectral representation. A basic motivation for our choice is that conditions for the existence of a density become easier to check. We review here the basic theory on DPPs, making an effort to be as clear and concise as possible in the presentation of our subsequent models. We discuss applications in the context of both synthetic and real data.
A second contribution of this work is the extension of the proposed spectral DPP model to incorporate covariate information in the likelihood and also in the assignment to mixture components. In particular, subjects with similar covariates are a priori more likely to co-cluster, just as in mixtures of experts models (see, e.g., McLachlan and Peel, 2005), where weights are defined as normalized exponential functions. From a computational viewpoint, a third contribution of our work is the generalization of the reversible jump (RJ) MCMC posterior simulation scheme proposed by Xu et al. (2016) to the general spectral approach and also to the covariate-dependent extensions we consider. We consider two RJ MCMC versions for uni- and multi-variate responses, as discussed later. In all cases the algorithms require computing the DPP density with respect to the unit rate Poisson process. We explain how to carry out the calculations, and discuss the need to restrict the process to (any) compact subset. When extending the model to incorporate covariate information in both the likelihood and the prior assignment to mixture components, the RJ MCMC algorithm requires modifications, as discussed below.
We explicitly consider the estimation of clusters of subjects in the sample, by considering the partition that minimizes the posterior expectation of Binder's loss function (Binder, 1978) under equal misclassification costs. This is a common choice in the applied Bayesian nonparametric literature (Lau and Green, 2007b). In particular, we emphasize one conceptual advantage of the separation induced by the prior assumption, namely, the reduction in the number of clusters a posteriori compared to the usual mixture models that do not include this feature. Reducing the effective number of clusters a posteriori helps our model scale better than alternatives with no separation when the sample size grows. We illustrate this particular point in our data illustrations.
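The partition estimate used throughout can be sketched as follows: given posterior samples of the label vectors, we estimate the co-clustering probabilities and, as a common practical shortcut, search only among the sampled partitions. This is a minimal sketch; the toy `labels` array stands in for real MCMC output.

```python
import numpy as np

def coclustering(labels):
    """Posterior co-clustering probabilities p_ij from an (M, n) array
    of sampled label vectors."""
    M, n = labels.shape
    P = np.zeros((n, n))
    for s in labels:
        P += (s[:, None] == s[None, :])
    return P / M

def expected_binder_loss(partition, P):
    """Posterior expectation of Binder's loss with equal
    misclassification costs, up to a constant factor."""
    delta = (partition[:, None] == partition[None, :]).astype(float)
    iu = np.triu_indices(len(partition), k=1)
    return np.sum(P[iu] * (1 - delta[iu]) + (1 - P[iu]) * delta[iu])

def binder_estimate(labels):
    """Minimize the expected loss over the sampled partitions only,
    a common practical restriction of the search space."""
    P = coclustering(labels)
    losses = [expected_binder_loss(s, P) for s in labels]
    return labels[int(np.argmin(losses))]

# toy posterior: a dominant labeling plus some noisy draws
labels = np.array([[0, 0, 1, 1]] * 8 + [[0, 1, 1, 1]] * 2)
best = binder_estimate(labels)
```

The restriction to sampled partitions avoids the combinatorial search over all partitions of the data.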
4.2 Using DPPs to induce repulsion
We review here the basic theory on DPPs to the extent required to explain our
mixture model. We use the same notation as in Lavancier et al. (2015), where
further details on this theory may be found.
4.2.1 Basic theory on DPPs
Let B ⊆ R^d; we mainly consider the cases B = R^d and B = S, a compact subset of R^d. By X we denote a simple locally finite spatial point process defined on B, i.e. the number of points of the process in any bounded region is a finite random variable, and there is at most one point at any location. See Daley and Vere-Jones (2003; 2007) for a general presentation of point processes. The class of DPPs we consider is defined in terms of their moments, expressed by their product density functions ρ^{(n)} : B^n → [0, +∞), n = 1, 2, .... Intuitively, for any pairwise distinct points x_1, ..., x_n ∈ B, ρ^{(n)}(x_1, ..., x_n) dx_1 ⋯ dx_n is the probability that X has a point in an infinitesimally small region around x_i of volume dx_i, for each i = 1, ..., n. More formally, X has n-th order product density function ρ^{(n)} : B^n → [0, +∞) if this function is locally integrable (i.e. ∫_S |ρ^{(n)}(x)| dx < +∞ for any compact S) and,
for any Borel-measurable function h : B^n → [0, +∞),

E[ ∑^{≠}_{x_1, ..., x_n ∈ X} h(x_1, ..., x_n) ] = ∫_{B^n} ρ^{(n)}(x_1, ..., x_n) h(x_1, ..., x_n) dx_1 ⋯ dx_n,

where the ≠ sign over the summation means that x_1, ..., x_n are pairwise distinct.
See also Møller and Waagepetersen (2007). Let C : B × B → R denote a covariance function. A simple locally finite spatial point process X on B is called a determinantal point process with kernel C if its product density functions are

ρ^{(n)}(x_1, ..., x_n) = det[C](x_1, ..., x_n), (x_1, ..., x_n) ∈ B^n, n = 1, 2, ...,

where [C](x_1, ..., x_n) is the n × n matrix with entries C(x_i, x_j). We write X ∼ DPP_B(C); when B = R^d we write X ∼ DPP(C).
Note that, if A is a Borel subset of B, then the restriction X_A := X ∩ A of X to A is a DPP with kernel given by the restriction of C to A × A. By Theorem 2.3 in Lavancier et al. (2015), first proved by Macchi (1975), such DPPs exist under the two following conditions:

(i) C is a continuous covariance function; hence, by Mercer's theorem,

C(x, y) = ∑_{k=1}^{+∞} λ_k^S φ_k(x) φ_k(y), (x, y) ∈ S × S, S a compact subset,

where the λ_k^S and φ_k are the eigenvalues and eigenfunctions of C restricted to S × S, respectively;

(ii) λ_k^S ≤ 1 for all compact S ⊂ R^d and all k.
Formula (2.9) in Lavancier et al. (2015) reports the distribution of the number N(S) of points of X in S, for any compact S:

N(S) =^d ∑_{k=1}^{+∞} B_k,  E(N(S)) = ∑_{k=1}^{+∞} λ_k^S,  Var(N(S)) = ∑_{k=1}^{+∞} λ_k^S (1 − λ_k^S),  (4.2)

where the B_k ∼ Be(λ_k^S) are independent Bernoulli random variables with means λ_k^S. When restricted to any compact subset S, the DPP has a density with respect to the unit rate Poisson process which, when λ_k^S < 1 for all k = 1, 2, ..., has the following expression:

f(x_1, ..., x_n) = e^{|S| − D_S} det[C̃](x_1, ..., x_n), n = 1, 2, ...,  (4.3)

where |S| = ∫_S dx, D_S = −∑_{k=1}^{+∞} log(1 − λ_k^S) and

C̃(x, y) = ∑_{k=1}^{+∞} (λ_k^S / (1 − λ_k^S)) φ_k(x) φ_k(y), x, y ∈ S.
When n = 0, the density (as well as the determinant) is defined to be equal to 0. See Møller and Waagepetersen (2007) for a thorough definition of absolute continuity of a spatial process with respect to the unit rate Poisson process. However, note that from the first part of (4.2) we have P(N(S) = 0) = ∏_{k=1}^{+∞} (1 − λ_k^S); this probability can be positive under the assumption λ_k^S < 1 for all k = 1, 2, ....
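The mixed-binomial representation (4.2) is easy to exploit numerically; a small sketch, for a hypothetical set of eigenvalues:

```python
import numpy as np

def counts_from_eigenvalues(lam, size=10_000, rng=None):
    """Sample N(S) = sum_k B_k with B_k ~ Bernoulli(lambda_k^S), as in
    (4.2), and return the samples plus the closed-form summaries."""
    rng = np.random.default_rng(rng)
    lam = np.asarray(lam)
    draws = rng.random((size, lam.size)) < lam   # independent Bernoulli draws
    n = draws.sum(axis=1)
    mean = lam.sum()                             # E(N(S))
    var = (lam * (1 - lam)).sum()                # Var(N(S))
    p_empty = np.prod(1 - lam)                   # P(N(S) = 0)
    return n, mean, var, p_empty

# hypothetical eigenvalues of C restricted to S, all strictly below 1
lam = np.array([0.9, 0.6, 0.3, 0.1])
n, mean, var, p_empty = counts_from_eigenvalues(lam, size=200_000, rng=1)
```

Note that P(N(S) = 0) is strictly positive here, in line with the remark above.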
From now on we restrict our attention to stationary DPPs, that is, C(x, y) = C_0(x − y), where C_0 ∈ L²(R^d) is such that its spectral density ϕ exists, i.e.

C_0(x) = ∫_{R^d} ϕ(y) cos(2π x · y) dy, x ∈ R^d,

and x · y is the scalar product in R^d. If ϕ ∈ L¹(R^d) and 0 ≤ ϕ ≤ 1, then the DPP(C) process exists. Summing up, the distribution of a stationary DPP can be assigned through its spectral density; see Corollary 3.3 in Lavancier et al. (2015).
To explicitly evaluate (4.3) over S = [−1/2, 1/2]^d, we approximate C̃ as suggested in Lavancier et al. (2015). In other words, we approximate the density of X on S by

f_app(x_1, ..., x_n) = e^{|S| − D_app} det[C̃_app](x_1, ..., x_n), {x_1, ..., x_n} ⊂ S,  (4.4)

where

C̃_app(x, y) = C̃_app,0(x − y) = ∑_{k ∈ Z^d} (ϕ(k) / (1 − ϕ(k))) cos(2πk · (x − y)), x, y ∈ S,  (4.5)

D_app = ∑_{k ∈ Z^d} log(1 + ϕ(k) / (1 − ϕ(k))).
To understand why the approximation C(x, y) ≈ C_app,0(x − y) (x − y ∈ S) follows, as well as the corresponding approximation for the tilted versions of these functions, we observe that the exact Fourier expansion of C_0(x − y) on S is as in (4.5) with the real part of ∫_S C_0(y) e^{−2πik·y} dy in place of ϕ(k); if we assume C_0 such that C_0(t) ≈ 0 for t ∉ S, then

Re( ∫_S C_0(y) e^{−2πik·y} dy ) ≈ ϕ(k) := Re( ∫_{R^d} C_0(y) e^{−2πik·y} dy ).
See also Lavancier et al. (2015), Section 4.1. Figure 4.1 displays the value of C_0(t) corresponding to the Gaussian spectral density where s = 0.5 and ρ varies as in the legend; the vertical dashed line marks the right endpoint of the set S = [−1/2, 1/2].

Figure 4.1: Value of C_0(t) corresponding to the Gaussian spectral density when s = 0.5 and ρ is equal to 0.1, 2, 5.

The approximation C_0(t) ≈ 0 for t ∉ S holds very accurately when ρ is small. The higher ρ is, the slower the decay rate of the function C_0(t).
When R is a rectangle in R^d, we can always find an affine transformation T such that T(R) = S = [−1/2, 1/2]^d. Define Y = T(X). If f_Y^app is the approximate density of Y as in (4.4), we can then approximate the density of X_R by

f_app(x_1, ..., x_n) = |R|^{−n} e^{|R| − |S|} f_Y^app(T(x_1), ..., T(x_n)), {x_1, ..., x_n} ⊂ R.  (4.6)

In practice, the summation over Z^d in (4.5) above is truncated to Z_N^d, where Z_N := {−N, −N + 1, ..., 0, ..., N − 1, N} (see Section 4.3 in Lavancier et al., 2015).
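The truncated evaluation of the approximate density can be sketched as follows, for d = 1 and a toy spectral density; any valid ϕ taking values in [0, 1) can be plugged in, and this is an illustration rather than the chapter's exact implementation:

```python
import numpy as np

def f_app(points, phi, N=50):
    """Approximate DPP density on S = [-1/2, 1/2] (d = 1), truncating
    the spectral sum over Z to Z_N = {-N, ..., N}."""
    ks = np.arange(-N, N + 1)
    ph = phi(ks)
    tilt = ph / (1 - ph)                    # phi(k) / (1 - phi(k))
    D_app = np.log1p(tilt).sum()
    x = np.asarray(points, dtype=float)
    diff = x[:, None] - x[None, :]
    C_app = np.tensordot(tilt, np.cos(2 * np.pi * ks[:, None, None] * diff),
                         axes=1)
    sign, logdet = np.linalg.slogdet(C_app)
    return sign * np.exp(1.0 - D_app + logdet)   # |S| = 1 here

# toy Gaussian-type spectral density with phi(0) = 0.5 < 1
phi = lambda k: 0.5 * np.exp(-0.1 * k**2)
spread = f_app([-0.3, 0.0, 0.3], phi)
clumped = f_app([-0.1, 0.0, 0.1], phi)
```

The determinant penalizes nearby points, so well-separated configurations receive higher density than clumped ones; `slogdet` is used because the determinant can be numerically small for nearly coincident points.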
A particular example of spectral density that we found useful is

ϕ(x; ρ, ν) = s^d exp( −(s/√π)^ν (Γ(d/2 + 1) / Γ(d/ν + 1))^{ν/d} ρ^{ν/d} ‖x‖^ν ), ρ, ν > 0,  (4.7)

for fixed s ∈ (0, 1) (e.g. s = 1/2), where ‖x‖ is the Euclidean norm of x ∈ R^d. This function is the spectral density of a power exponential spectral model (see (3.22) in Lavancier et al. (2015) when α = s α_max(ρ, ν)). In this case, we write X ∼ PES-DPP(ρ, ν). The corresponding spatial process is isotropic. When ν = 2, the spectral density is

ϕ(x; ρ, 2) = s^d exp( −(s/√π)² ρ^{2/d} ‖x‖² ), ρ > 0,

corresponding to the Gaussian spectral density. We discuss the choice of (4.7) more specifically later, in Section 4.4.
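A sketch of (4.7) as reconstructed here (defaults s = 1/2 and d = 1); the checks below only use properties stated in the text, namely ϕ(0) = s^d and 0 ≤ ϕ ≤ s^d < 1:

```python
import numpy as np
from math import gamma, pi

def phi_pes(r, rho, nu, s=0.5, d=1):
    """Power exponential spectral density (4.7), evaluated at r = ||x||."""
    c = (s / np.sqrt(pi))**nu * (gamma(d / 2 + 1) / gamma(d / nu + 1))**(nu / d)
    return s**d * np.exp(-c * rho**(nu / d) * np.abs(r)**nu)

r = np.linspace(0.0, 5.0, 101)
vals = phi_pes(r, rho=2.0, nu=2.0)
```

Since the maximum ϕ(0) = s^d is strictly below 1, the existence condition on the spectral density is satisfied for every ρ, ν > 0.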
4.2.2 The mixture model with repulsive means
To deal with the limitations of model (4.1) or DPMs, we consider repulsive mixtures. Our aim is to estimate a random partition of the available subjects, and we want to do so using few groups. By repulsion we mean that cluster locations are a priori encouraged to be well separated, thus inducing fewer clusters than if they were allowed to be selected independently. We start from parametric densities f(·; θ), which we take to be Gaussian, and assume that the collection of location parameters follows a DPP. We specify a hierarchical model that achieves the goals previously described. Concretely, we propose:
y_i | s_i = k, µ_k, σ²_k, K ind∼ N(y_i; µ_k, σ²_k), i = 1, ..., n  (4.8)
X = {µ_1, µ_2, ..., µ_K}, K ∼ PES-DPP(ρ, ν)  (4.9)
(ρ, ν) ∼ π  (4.10)
p(s_i = k) = w_k, k = 1, ..., K, for each i  (4.11)
(w_1, ..., w_K) | K ∼ Dirichlet(δ, δ, ..., δ)  (4.12)
σ²_k | K iid∼ IG(a_0, b_0), k = 1, ..., K,  (4.13)
where the PES-DPP(ρ, ν) assumption (4.9) is regarded as a default choice that could be replaced by any other valid DPP alternative. The choice of π in (4.10) will be discussed below in Section 4.4. We note that, as stated, the prior model may assign positive probability to the case K = 0. This case of course makes no sense from the viewpoint of the model described above. Nevertheless, we adopt the working convention of redefining the prior to condition on K ≥ 1, i.e., truncating the DPP to have at least one point. In practice, the posterior simulation scheme described later simply ignores the case K = 0, which produces the desired result. Note also that we have assumed prior independence among blocks of parameters not involving the locations µ_k.
Model (4.8)-(4.13) is a DPP mixture model along the lines proposed in Xu et al. (2016). Indeed, both works use DPPs as priors for the location points in a mixture of parametric densities. However, the specific DPP priors are different: Xu et al. (2016) restrict to a particular case of DPPs (L-ensembles) and choose a Gaussian covariance function for which eigenvalues and eigenfunctions are analytically available. We adopt instead the more general spectral approach for assigning the prior (4.9). Similarly to Xu et al. (2016), we carry out posterior simulation using a reversible jump step as part of the Gibbs sampler. However, when updating the location points µ_1, ..., µ_K we refer to formulas (4.4)-(4.6). Xu et al. (2016) take advantage of analytical expressions that are not available in our case, and that are also unavailable under other possible specific choices of the spectral density. As a general comment, we underline that the numerical evaluation of the DPP density, involving the computation of the determinant of a K × K matrix, is not particularly expensive, even for a large dataset; in this case, the repulsion property will favor a moderate number K of clusters. See Section 4.4.4, where we describe applications of this model to datasets, using the posterior simulation algorithms described in Section 4.2.4. In our experience, the proposed model scales well compared to mixtures with independent components.
4.2.3 Competitor repulsive models
We briefly introduce the class of parsimonious mixture models in Quinlan et al. (2017), to be used as a competitor model in our applications. Quinlan et al. (2017) exploit the idea of repulsion, i.e. any two mixture components are encouraged to be well separated, as we do. For the sake of comparison, we introduce their model for unidimensional data: similarly to our case, they consider a mixture of K Gaussian components, but assume a fixed value k for K in (4.8) and (4.11)-(4.13). The prior for the location parameters µ_1, ..., µ_k is called a repulsive distribution, denoted by NRep_k(µ, Σ, τ), where µ ∈ R and Σ, τ > 0; see (3.4)-(3.6) in Quinlan et al. (2017). This prior is characterized by a repulsion potential of the following form:

φ_1(r; τ) = −log( 1 − e^{−(1/2)τr²} ) 1_{(0,+∞)}(r), τ > 0.

Petralia et al. (2012) use a similar model, where the repulsion potential is

φ_2(r; τ) = (τ / r²) 1_{(0,+∞)}(r), τ > 0.

Potential φ_2 introduces a stronger repulsion than φ_1, in the sense that in Petralia et al. (2012) locations are encouraged to be further apart than in Quinlan et al. (2017). Note also that, by the nature of the point process, our approach does not require an upper bound on the allowed number of mixture components (similarly to DPM models), contrary to the approaches in Quinlan et al. (2017) and Petralia et al. (2012). The posterior simulation algorithm we propose for our model is described in Section 4.2.4.
4.2.4 Gibbs sampler for model in Section 4.2.2
Posterior inference for our DPP mixture model (4.8)-(4.13) is carried out using a Gibbs sampler algorithm. The full conditionals are outlined below; we provide the details of the computation only when the conditional posterior distribution is not straightforward. In what follows, "rest" refers to the data and all parameters except for the one to the left of the conditioning bar.

The labels s_1, ..., s_n are independently distributed according to a discrete distribution with support {1, 2, ..., K}:

p(s_i = k | rest) ∝ w_k N(y_i; µ_k, σ²_k).  (4.14)
The distribution of the weights w_1, ..., w_K is conjugate: the conditional distribution is still a Dirichlet distribution, with parameters δ + n_k, k = 1, ..., K.

The variances σ²_1, ..., σ²_K of the mixture components are generated independently according to the following distribution:

σ²_k | rest ∼ IG( a_0 + n_k/2, b_0 + (1/2) ∑_{i: s_i=k} (y_i − µ_k)² ), k = 1, ..., K.
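The conjugate updates above can be combined into one Gibbs sweep; a minimal sketch for the univariate model, with the locations held fixed (their update requires the Metropolis step for the DPP factor, discussed next in the text):

```python
import numpy as np

def gibbs_scan_conjugate(y, mu, sigma2, w, delta, a0, b0, rng):
    """One sweep of the conjugate updates: labels, Dirichlet weights,
    inverse-gamma variances. The locations mu are kept fixed here."""
    K = len(mu)
    # labels: p(s_i = k | rest) proportional to w_k N(y_i; mu_k, sigma2_k)
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * (y[:, None] - mu)**2 / sigma2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    s = np.array([rng.choice(K, p=row) for row in p])
    # weights: Dirichlet(delta + n_1, ..., delta + n_K)
    nk = np.bincount(s, minlength=K)
    w = rng.dirichlet(delta + nk)
    # variances: sigma2_k | rest ~ IG(a0 + n_k/2, b0 + 0.5 sum (y_i - mu_k)^2)
    sigma2 = np.array([
        1.0 / rng.gamma(a0 + nk[k] / 2,
                        1.0 / (b0 + 0.5 * np.sum((y[s == k] - mu[k])**2)))
        for k in range(K)])
    return s, w, sigma2

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3, 1, 40), rng.normal(3, 1, 60)])
mu = np.array([-3.0, 3.0])
s, w, sigma2 = gibbs_scan_conjugate(y, mu, np.ones(2), np.ones(2) / 2,
                                    delta=1.0, a0=3.0, b0=3.0, rng=rng)
```

An inverse-gamma draw is obtained as the reciprocal of a gamma draw with the same shape and matching rate.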
Sampling the means µ_1, ..., µ_K needs more care: following the reasoning in Xu et al. (2016), this full conditional can be written as

p(µ_1, ..., µ_K | rest) ∝ det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) ∏_{k=1}^{K} ∏_{i: s_i=k} N(y_i; µ_k, σ²_k)
∝ ∏_{k=1}^{K} ( C̃(µ̃_k, µ̃_k) − b C̃_{−k}^{−1} bᵀ ) ∏_{i: s_i=k} N(y_i; µ_k, σ²_k),

thanks to the Schur determinant identity. Note that det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) in the above expression follows from the expression of the density of a DPP on a compact set; see (4.6). Here b is the vector b = C̃(µ̃_k, µ̃_{−k}), with µ̃_{−k} = {µ̃_j}_{j≠k}, and C̃_{−k} is the (K − 1) × (K − 1) matrix C̃(µ̃_{−k}; ρ, ν). Moreover, µ̃_k = T(µ_k) is the transformed variable that takes values in the set S = [−1/2, 1/2]^d. Typically, the rectangle R such that T(R) = S is fixed so that it is large and comfortably contains all the data points. We update each mean µ_k separately, for k = 1, ..., K, using a Metropolis-Hastings step.
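The single-location Metropolis step can be sketched as follows; this is a self-contained toy version with a hypothetical Gaussian-type kernel, working directly on rescaled locations and ignoring the transformation T:

```python
import numpy as np

def schur_factor(k, mus, kernel):
    """The k-th factor C(mu_k, mu_k) - b C_{-k}^{-1} b^T of the DPP
    determinant, via the Schur identity."""
    C = kernel(mus[:, None], mus[None, :])
    idx = [j for j in range(len(mus)) if j != k]
    if not idx:
        return C[k, k]
    b = C[k, idx]
    return C[k, k] - b @ np.linalg.solve(C[np.ix_(idx, idx)], b)

def mh_update_mu(k, mus, y, s, sigma2, kernel, step, rng):
    """Random-walk Metropolis step for one location; the DPP prior
    enters only through the Schur factor of component k."""
    prop = mus.copy()
    prop[k] += step * rng.standard_normal()
    yk = y[s == k]
    def log_target(m):
        loglik = -0.5 * np.sum((yk - m[k])**2) / sigma2[k]
        return loglik + np.log(max(schur_factor(k, m, kernel), 1e-300))
    if np.log(rng.random()) < log_target(prop) - log_target(mus):
        return prop
    return mus

rng = np.random.default_rng(1)
kernel = lambda x, y: np.exp(-((x - y)**2) / 0.05)   # toy repulsive kernel
mus = np.array([-0.25, 0.0, 0.25])
y = np.concatenate([rng.normal(m, 0.05, 30) for m in mus])
s = np.repeat([0, 1, 2], 30)
new_mus = mh_update_mu(1, mus, y, s, np.full(3, 0.05**2), kernel, 0.02, rng)
```

Updating one location at a time means only one Schur factor changes per step, which is what makes the factorized form above convenient.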
The full conditional for the parameters (ρ, ν) is

p(ρ, ν | rest) ∝ det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) exp( −∑_{k=−N}^{N} log(1 + ϕ(k; ρ, ν)/(1 − ϕ(k; ρ, ν))) ) π(ρ, ν).

The adaptive Metropolis-Hastings algorithm of Roberts and Rosenthal (2009) is employed in this case, in order to obtain better mixing of the chains and to avoid hand-tuning the parameters of the proposal distribution.
In order to sample K we need a reversible jump step: standard proposals to estimate mixtures of densities with a variable number of components are based on moment matching (Richardson and Green, 1997) and have been used relatively often in the literature. The idea is to build a proposal that preserves the first two moments before and after the move, as in Xu et al. (2016). In particular, the only possible moves are the split move, passing from K to K + 1, and the combine move, from K to K − 1.
(i) Choose move type: uniformly choose between the split and the combine move (however, if K = 1 the only possibility is to split).

(ii.a) Combine: randomly select a pair (j_1, j_2) to merge into a new component indexed by j_1. The following relations must hold:

w^{new}_{j_1} = w_{j_1} + w_{j_2}
w^{new}_{j_1} µ^{new}_{j_1} = w_{j_1} µ_{j_1} + w_{j_2} µ_{j_2}
w^{new}_{j_1} ( (µ^{new}_{j_1})² + (σ^{new}_{j_1})² ) = w_{j_1} (µ_{j_1}² + σ_{j_1}²) + w_{j_2} (µ_{j_2}² + σ_{j_2}²)

(ii.b) Split: randomly select a component j to split into two new components. In this case, we impose the following relationships:

w^{new}_{j_1} = α w_j,  w^{new}_{j_2} = (1 − α) w_j
µ^{new}_{j_1} = µ_j − (w^{new}_{j_2}/w^{new}_{j_1})^{1/2} r (σ_j²)^{1/2},  µ^{new}_{j_2} = µ_j + (w^{new}_{j_1}/w^{new}_{j_2})^{1/2} r (σ_j²)^{1/2}
(σ^{new}_{j_1})² = β (1 − r²) (w_j/w^{new}_{j_1}) σ_j²,  (σ^{new}_{j_2})² = (1 − β) (1 − r²) (w_j/w^{new}_{j_2}) σ_j²

where α ∼ Beta(1, 1), β ∼ Beta(1, 1) and r ∼ Beta(2, 2).
(iii) Probability of acceptance: the proposed move is accepted with probability min(1, 1/q(proposed, old)) if we selected a combine step, and min(1, q(old, proposed)) in the split case. In particular,

q(old, proposed) = |det(J)| ( p(K+1, w^{new}, µ^{new}, σ^{2,new} | y) / p(K, w^{old}, µ^{old}, σ^{2,old} | y) ) × ( p^{split}_{K+1} (1/(K+1)) ) / ( (K+1) p^{comb}_K p(α) p(β) p(r) ),

where

|det(J)| = ( w_j⁴ / (w^{new}_{j_1} w^{new}_{j_2})^{3/2} ) (σ_j²)^{3/2} (1 − r²)

and

p(K+1, w^{new}, µ^{new}, σ^{2,new} | y) / p(K, w^{old}, µ^{old}, σ^{2,old} | y) = ( likelihood(w^{new}, µ^{new}, σ^{2,new}) / likelihood(w^{old}, µ^{old}, σ^{2,old}) ) × ( π(σ^{2,new}_{j_1}) π(σ^{2,new}_{j_2}) / π(σ_j²) ) × ( Dirichlet_{K+1}(w^{new}) / Dirichlet_K(w^{old}) ) × ( det(C̃_{K+1}) / det(C̃_K) ).

Moreover, p^{split}_{K+1} = 0.5 if K > 1 and 1 otherwise; p^{comb}_K = 0.5 if K > 1 and 0 otherwise.
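The moment-matching relations in step (ii.b) can be verified numerically. A sketch follows; note the opposite signs on the two proposed locations, which is what preserves the overall mean:

```python
import numpy as np

def split_component(w, mu, sigma2, alpha, beta, r):
    """Moment-matching split of (w, mu, sigma2) into two components,
    in the Richardson-Green style used in Section 4.2.4."""
    w1, w2 = alpha * w, (1 - alpha) * w
    mu1 = mu - np.sqrt(w2 / w1) * r * np.sqrt(sigma2)
    mu2 = mu + np.sqrt(w1 / w2) * r * np.sqrt(sigma2)
    s1 = beta * (1 - r**2) * (w / w1) * sigma2
    s2 = (1 - beta) * (1 - r**2) * (w / w2) * sigma2
    return (w1, mu1, s1), (w2, mu2, s2)

(w1, m1, s1), (w2, m2, s2) = split_component(0.6, 2.0, 1.5,
                                             alpha=0.3, beta=0.7, r=0.4)
# the combine relations of (ii.a) hold exactly for these proposals:
# total weight, weighted mean, and weighted second moment are preserved
```

Because the split is exactly invertible via the combine relations, the pair of moves defines a valid dimension-changing proposal.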
We note that in the case of multidimensional data points, the parameters include covariance matrices Σ_1, Σ_2, ..., Σ_K instead of the scalar variances σ²_1, ..., σ²_K; however, the marginal inverse-Wishart prior distribution is semi-conjugate, yielding an update of these parameters similar to that of a standard Normal-Normal/inverse-Wishart model. The main difficulty again lies in the reversible jump step: we modified points (ii.a), (ii.b) and (iii) described above according to the algorithm in Dellaportas and Papageorgiou (2006), Section 3.1. The basic idea is to build moves on the space of eigenvectors and eigenvalues of the current covariance matrix, so that the proposed covariance matrices are positive definite.
4.3 Generalization to covariate-dependent models
The methods discussed in Section 4.2 were devised for density estimation-like problems. We now extend the previous modeling to the case where p-dimensional covariates x_1, ..., x_n are recorded as well. We do so by allowing the mixture weights to depend on such covariates. In this case, there is a trade-off between the repulsiveness of the locations in the mixture and the attraction among subjects with similar covariates. We also entertain the case where covariate dependence is added to the likelihood part of the model. Our modeling choice here is akin to mixtures of experts models (see, e.g., McLachlan and Peel, 2005), i.e., the weights are defined by means of normalized exponential functions.

Building on the model from Section 4.2.2, we assume the same likelihood (4.8) and the DPP prior for X = {µ_1, µ_2, ..., µ_K}, K in (4.9)-(4.10), but change (4.11) and (4.12) to

p(s_i = k) = w_k(x_i) = exp(β_kᵀ x_i) / ∑_{l=1}^{K} exp(β_lᵀ x_i), k = 1, ..., K  (4.15)

β_2, ..., β_K | K iid∼ N_p(β_0, Σ_0), β_1 = 0,  (4.16)

where the β_1 = 0 assumption ensures identifiability. To complete the model, we assume (4.13) as the conditional marginal for σ²_k; the prior for (ρ, ν) in (4.10) is specified later. Here β_0 ∈ R^p, and to choose Σ_0 we use a g-prior approach, namely Σ_0 = φ × (XᵀX)^{−1}, where φ is fixed, typically of the same order of magnitude as the sample size (see Zellner, 1986).
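The normalized-exponential weights (4.15) and the g-prior choice of Σ_0 can be sketched as follows (the scale φ appears as the `scale` argument):

```python
import numpy as np

def mixture_weights(X, betas):
    """Normalized-exponential weights (4.15); betas is (K, p) with its
    first row fixed to zero for identifiability."""
    eta = X @ betas.T                          # (n, K) linear predictors
    eta -= eta.max(axis=1, keepdims=True)      # numerical stabilization
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

def g_prior_cov(X, scale):
    """Sigma_0 = scale * (X^T X)^{-1}, Zellner's g-prior choice."""
    return scale * np.linalg.inv(X.T @ X)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
betas = np.array([[0.0, 0.0], [1.0, -0.5], [-0.8, 0.3]])   # K = 3, beta_1 = 0
W = mixture_weights(X, betas)
Sigma0 = g_prior_cov(X, scale=100.0)
```

Subtracting the row maximum before exponentiating leaves the weights unchanged but avoids overflow for large linear predictors.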
Assuming (4.8) on top of (4.15)-(4.16) rules out the case of a likelihood explicitly depending on covariates, which would generally achieve a better fit. Of course, there are many ways in which such dependence may be added. For the sake of concreteness, we assume here a Gaussian regression likelihood, where only the intercept parameters arise from the DPP prior. More precisely, we assume

y_i | s_i = k, x_i, µ_k, σ²_k, K ind∼ N(y_i; µ_k + x_iᵀ γ_k, σ²_k), i = 1, ..., n  (4.17)

(γ_1, σ²_1), ..., (γ_K, σ²_K) | K iid∼ N-IG(γ_0, Λ_0, a_0, b_0),  (4.18)

where the γ_k's are p-dimensional regression coefficients. The notation in (4.18) means that γ_k | σ²_k ∼ N_p(γ_0, σ²_k Λ_0) and σ²_k ∼ IG(a_0, b_0), where γ_0 ∈ R^p and Λ_0 is a covariance matrix. The prior for the s_i's and the β_j's is given in (4.15)-(4.16), as in the previous model. Note that (4.17) implies that only the intercept term is distributed according to the repulsive prior. Thus, we allow the response mean to be corrected by a linear combination of the covariates with cluster-specific coefficients, with the repulsion acting only on the residual of this regression. The result is a more flexible model than the repulsive mixture (4.8)-(4.13). Observe that there is no need to assume the same covariate vector in (4.17) and (4.15), but we do so for illustration purposes only.
The Gibbs sampler algorithm employed to carry out posterior inference for this model is detailed in Section 4.3.1. However, it is worth noting that the reversible jump step related to updating the number of mixture components K, and the update of the coefficients β_2, β_3, ..., β_K, are complicated by the presence of the covariates. For the β coefficients, we resort to a Metropolis-Hastings step, with a multivariate Gaussian proposal centered at the current value. For K, we employ an ad hoc reversible jump move.
4.3.1 Gibbs sampler in presence of covariates
The Gibbs sampler algorithm employed to carry out posterior inference for model (4.17)-(4.18), (4.9)-(4.10), (4.15)-(4.16) differs from the one in Section 4.2.4 except for the full conditional of (ρ, ν). The sampling of the labels {s_i}_{i=1}^{n} differs from (4.14), since now

p(s_i = k | rest) ∝ w_k(x_i) N(y_i; µ_k + x_iᵀ γ_k, σ²_k) ∝ exp(β_kᵀ x_i) N(y_i; µ_k + x_iᵀ γ_k, σ²_k).

The sampling of {µ_k}_{k=1}^{K} is similar to the corresponding step in Section 4.2.4, but now

p(µ_1, ..., µ_K | rest) ∝ det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) ∏_{k=1}^{K} ∏_{i: s_i=k} N(y_i − x_iᵀ γ_k; µ_k, σ²_k).
However, the substantial change from the model without covariates to the model with covariates lies in the update of K, the number of components in the mixture, and of β_2, ..., β_K (recall that β_1 = 0 for identifiability reasons); these are indeed complicated by the presence of the covariates. Moreover, the update of σ²_k is now replaced by

p(γ_k, σ²_k | rest) ∝ ∏_{i: s_i=k} N(y_i; µ_k + x_iᵀ γ_k, σ²_k) N_p(γ_k; 0, σ²_k Λ_0) IG(σ²_k; a_0, b_0)
∝ (2πσ²_k)^{−n_k/2} exp( −(1/(2σ²_k)) ∑_{i: s_i=k} (y_i − µ_k − x_iᵀ γ_k)² ) N_p(γ_k; 0, σ²_k Λ_0) IG(σ²_k; a_0, b_0),

where n_k = #{i : s_i = k}; here we assume the prior mean vector of γ_k, γ_0, to be equal to the zero vector. This full conditional is the posterior of the standard conjugate normal likelihood, normal/inverse-gamma regression model. In particular, we have that

γ_k | σ²_k, rest ∼ N_p(m*, σ²_k Λ*)

with Λ* = ( Λ_0^{−1} + ∑_{i: s_i=k} x_i x_iᵀ )^{−1} and m* = Λ* ( ∑_{i: s_i=k} y_i x_i ). Moreover,

σ²_k | rest ∼ IG( a_0 + n_k/2, b_0 + (1/2)( ∑_{i: s_i=k} y_i² − m*ᵀ (Λ*)^{−1} m* ) ).
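The conjugate (γ_k, σ²_k) update above can be sketched as follows; this is an illustration with prior mean γ_0 = 0 as in the text, and the response vector is taken net of the intercept µ_k:

```python
import numpy as np

def update_gamma_sigma2(yk, Xk, Lambda0, a0, b0, rng):
    """Conjugate normal / inverse-gamma update for (gamma_k, sigma2_k),
    mirroring the full conditionals of the regression model."""
    Lstar = np.linalg.inv(np.linalg.inv(Lambda0) + Xk.T @ Xk)
    mstar = Lstar @ (Xk.T @ yk)
    shape = a0 + len(yk) / 2
    rate = b0 + 0.5 * (yk @ yk - mstar @ np.linalg.inv(Lstar) @ mstar)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    gamma_k = rng.multivariate_normal(mstar, sigma2 * Lstar)
    return gamma_k, sigma2

rng = np.random.default_rng(3)
Xk = rng.standard_normal((50, 2))
yk = Xk @ np.array([1.0, -2.0]) + 0.3 * rng.standard_normal(50)
gamma_k, sigma2 = update_gamma_sigma2(yk, Xk, np.eye(2), a0=3.0, b0=3.0, rng=rng)
```

With informative data, the posterior mean m* shrinks only mildly toward the zero prior mean, so the draw concentrates near the data-generating coefficients.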
The full conditional for the coefficients β_k, k = 2, ..., K, is

p(β_2, ..., β_K | rest) ∝ ∏_{k=1}^{K} ∏_{i: s_i=k} ( exp(β_kᵀ x_i) / ∑_{ℓ=1}^{K} exp(β_ℓᵀ x_i) ) N_p(β_k; β_0, Σ_0),

which has no closed form. Therefore we resort to a Metropolis-Hastings step with a multivariate Gaussian proposal, centered at the current value of the vector and with a diagonal covariance matrix ζ I_{p×p}, where ζ is a tuning parameter chosen to guarantee good convergence of the chains.
On the other hand, the update of K requires a reversible jump-type move. However, the approach used in Section 4.2.4 above is difficult to implement when the mixing weights depend on covariates, as in this case, so we need another way to define a transition probability. Our approach is similar to that of Norets (2015), with some differences highlighted below.

As before, there are two available moves: split or combine. The probability of proposing each of them is 0.5, except if K = 1, when only the split move can be proposed.

Split: if this move is picked, K^prop = K + 1, so we need to create a new group and its corresponding parameters (the other parameters are kept fixed):

(i) randomly pick one cluster, say j, containing at least two items;

(ii) randomly divide the data associated to this group, y_j, into two subgroups, y_{j_1} and y_{j_2};
(iii) set γ_{j_1} = γ_j, σ²_{j_1} = σ²_j, β_{j_1} = β_j, µ_{j_1} = µ_j. Now we need to choose values for γ_{j_2}, σ²_{j_2}, β_{j_2} and µ_{j_2}. In Norets (2015), this is done by sampling the new values from the posterior, conditioning also on the other parameters (even if, for practical purposes, Gaussian approximations of the conditional posteriors are used in the implementation of the algorithm). Instead, we sample (µ_{j_2}, γ_{j_2}, σ²_{j_2}) from the posterior of the following auxiliary model:

y_{j_2} | µ, γ, σ² iid∼ N(µ + x_{j_2}ᵀ γ, σ²)
γ̄ = [µ, γ]ᵀ | σ² ∼ N_{p+1}(0, σ² Γ_0)
σ² ∼ IG(ξ_0, ν_0),

where x_{j_2} and y_{j_2} represent covariates and responses in the new group with label j_2, respectively. The parameter β_{j_2} is sampled from a p-dimensional Gaussian distribution with mean β_mode and covariance matrix Σ_mode. In particular, β_mode is the argmax of the following expression:

∏_{i: s_i=j_2} ( exp(β_{j_2}ᵀ x_i) / ( exp(β_{j_2}ᵀ x_i) + ∑_{j≠j_2} E(exp(β_jᵀ x_i)) ) ) N_p(β_{j_2}; β_0, Σ_0),

which corresponds to an approximation of the full conditional of β_{j_2} (we dropped the dependence on the other β_j's by considering the expected value in the denominator). Note that E(exp(β_jᵀ x_i)) is nothing but the moment generating function, and thus equals exp(β_0ᵀ x_i + x_iᵀ Σ_0 x_i / 2).
Combine: here K^prop = K − 1, so it suffices to collapse two groups into one. Specifically, we randomly choose one group to delete, say j_1, and remove the corresponding parameters β_{j_1}, µ_{j_1} and σ²_{j_1}. Then, we choose another group, j_2, and assign all the data y_{j_1} to it.
Acceptance rate: this is simply given by

α(K → K+1) = ( p(y | K+1, θ_{K+1}) π(K+1, θ_{K+1}) / ( p(y | K, θ_K) π(K, θ_K) ) ) × ( 1 / f(µ_{j_2}, γ_{j_2}, σ²_{j_2}, β_{j_2}) ) × ( p^S_{K+1} / p^C_{K+1} ) × ( p_c(j_1, j_2) / p_s(j) )

α(K → K−1) = ( p(y | K−1, θ_{K−1}) π(K−1, θ_{K−1}) / ( p(y | K, θ_K) π(K, θ_K) ) ) × f(µ_{j_1}, γ_{j_1}, σ²_{j_1}, β_{j_1}) × ( p^C_{K−1} / p^S_K ) × ( p_s(j) / p_c(j_1, j_2) )

where θ_K = (σ²_{1:K}, γ_{1:K}, µ_{1:K}, β_{1:K}) and f denotes the proposal density of the newly created (or deleted) parameters. Moreover, p_s(j) is the probability of splitting component j, and similarly for the other terms.
4.4 Simulated data and reference datasets
Before illustrating the application of our models to specific datasets, we discuss some general choices that apply to all examples. Every run of the Gibbs sampler (implemented in R) produced a final sample size of 5,000 or 10,000 iterations (unless otherwise specified), after a thinning of 10 and an initial burn-in of 5,000 iterations. In all cases, convergence was checked using both visual inspection of the chains and the standard diagnostics available in the CODA package. Elicitation of the prior for (ρ, ν) requires some care, as the role of these parameters is difficult to interpret. Therefore, an extensive robustness analysis with respect to π(ρ, ν) was carried out for those datasets; see Sections 4.4.1 and 4.4.3. We point out that an initial prior independence assumption π(ρ, ν) = π(ρ) π(ν) produced bad mixing of the chain. In particular, when ρ is small with respect to ν, the spectral function ϕ(·) has a very narrow support, concentrated near the origin, forcing the covariance function C_app(x, y) to become nearly constant for x, y ∈ S and thus producing nearly singular matrices. We next investigated the case π(ρ, ν) = π(ρ | ν) × π_ν(ν), where

ρ | ν =^d M(s, ε, ν) + ρ_0, ρ_0 ∼ gamma(a_ρ, b_ρ).  (4.19)

Here, M(s, ε, ν) is a constant, namely the maximum value of ρ such that ϕ(2; ρ, ν) > ε (the argument x = 2 serves as a reference point, chosen to avoid too narrow a support), and ε is a
threshold value, assumed to be small (0.05, for instance).

Figure 4.2: Power exponential spectral density ϕ(x; ρ, ν) when ρ is 2 (left) and 100 (right) and ν varies in {1, 2, 5, 10, 30}.

From Figure 4.2 it is clear that ϕ(·; ρ, ν) goes to 0 too fast when ν is small relative to ρ. It follows that, in the case d = 1,

M(s, ε, ν) = ( log(s/ε) )^{1/ν} Γ(1/ν + 1) / s.
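Under the parametrization used here, M(s, ε, ν) for d = 1 and the conditional prior (4.19) can be coded as follows; the agreement with the ν = 2, s = 1/2 special case of (4.20) serves as an internal check:

```python
import numpy as np
from math import gamma, log, pi, sqrt

def M(s, eps, nu):
    """The shift M(s, eps, nu) for d = 1: the largest rho such that
    phi(2; rho, nu) > eps."""
    return log(s / eps)**(1 / nu) * gamma(1 / nu + 1) / s

def sample_rho(nu, s=0.5, eps=0.05, a_rho=1.0, b_rho=1.0, size=1, rng=None):
    """Draw rho | nu = M(s, eps, nu) + rho0 with rho0 ~ gamma(a_rho, b_rho),
    as in (4.19)."""
    rng = np.random.default_rng(rng)
    return M(s, eps, nu) + rng.gamma(a_rho, 1.0 / b_rho, size=size)

rho_draws = sample_rho(nu=2.0, size=1000, rng=4)
```

Shifting the gamma draw by M(s, ε, ν) keeps the prior support away from the degenerate region where the spectral density is too narrow.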
On the other hand, two different choices for π_ν were considered: a gamma distribution, which gave bad chain mixing, and a discrete distribution on V_2 = {0.5, 1, 2, 3, 5, 10, 15, 20, 30, 50} (or on one of its subsets). In this case, the mixing of the chain was better, but the posterior for ν did not discriminate among the values in the support. For this reason, in Sections 5.3, 6 and 7, we assume ν = 2, s = 1/2 and

ρ =^d √( π log(1/(2ε)) ) + ρ_0, ρ_0 ∼ gamma(a_ρ, b_ρ).  (4.20)
4.4.1 Data illustration without covariates: reference datasets
We illustrate our model via two datasets without covariates, with unidimensional (Galaxy data) and bidimensional (Air Quality data) observations, both publicly available in R (galaxy from the DPpackage and airquality in the base version). For the latter dataset we removed 42 incomplete observations.

The popular Galaxy dataset contains n = 82 measured velocities of different galaxies from six well-separated conic sections of space. Values are expressed in km/s, scaled by a factor of 10^{−3}. We set the hyperparameters as follows: for the variance σ²_k of the components, (a_0, b_0) = (3, 3) (so that the prior mean is 1.5 and the prior variance is 9/4), and for the weights w_k the Dirichlet distribution has parameter (1, 1, ..., 1).
The other hyperparameters are varied across the tests, as in Table 4.1, where we report summaries of interest, such as the prior and posterior mean and variance of the number of components K. In addition, we also display the mean squared error (MSE) and the log-pseudo marginal likelihood (LPML) as indices of goodness of fit, defined as MSE = ∑_{i=1}^{n} (y_i − ŷ_i)² and LPML = ∑_{i=1}^{n} log( f(y_i | y^{(−i)}) ), where ŷ_i is the posterior predictive mean and f(y_i | y^{(−i)}) is the i-th conditional predictive ordinate, that is, the predictive distribution evaluated at y_i using the dataset without the i-th observation. Figure 4.3 (left) shows density estimates and the estimated partition of the data, obtained as the partition that minimizes the posterior expectation of Binder's loss function under equal misclassification costs (see Lau and Green, 2007b). The points at the bottom of the plots represent observations, while colors refer to the corresponding cluster. Figure 4.3 (right) displays the posterior distribution of K for Tests 4 and 6 in Table 4.1.
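The LPML can be estimated from MCMC output with the standard harmonic-mean identity for conditional predictive ordinates; this is a generic estimator, not necessarily the exact computation used for Table 4.1:

```python
import numpy as np

def lpml_from_samples(loglik):
    """Monte Carlo LPML from an (M, n) array of pointwise log-likelihoods
    log f(y_i | theta^(m)); each CPO_i is estimated by the harmonic mean
    of the likelihoods across the M posterior draws."""
    M = loglik.shape[0]
    neg = -loglik
    mx = neg.max(axis=0)
    # log CPO_i = log M - logsumexp_m(-loglik[m, i]), computed stably
    log_cpo = np.log(M) - (mx + np.log(np.exp(neg - mx).sum(axis=0)))
    return log_cpo.sum()

def mse(y, y_hat):
    """MSE = sum_i (y_i - yhat_i)^2, with yhat_i the posterior
    predictive mean."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat))**2))

# sanity check: constant log-likelihood -1 gives CPO_i = e^{-1} for each i
lpml_const = lpml_from_samples(np.full((10, 3), -1.0))
```

The log-sum-exp trick avoids underflow when the pointwise likelihoods are very small, which is common for n in the hundreds.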
As a comparison, the same posterior quantities as in Table 4.1 were computed using the DPM, the Repulsive Gaussian Mixture Model (RGMM) of Quinlan et al. (2017), and also the proposal of Petralia et al. (2012). To make results comparable, we assumed the same prior information on the hyperparameters common to all the mixture models. See Table 4.5. From these tables, it is clear that the alternative repulsive models are good competitors to ours, and that they generally achieve a better fit to the dataset. The tests showing the best indices of goodness of fit are typically those overestimating the number of clusters.
Finally, we recall that in Section 4.4.2 we report some further tests on the Galaxy
dataset to show the influence of various choices of spectral density on the inference.
We conclude that there is evidence of robustness with respect to the choice of
Test ρ ν E(K) V(K) Epost(K) Vpost(K) MSE LPML
1 2 2 2 1.67 6.09 1.10 78.95 -171.72
2 5 10 5.00 7.12 6.07 1.09 78.33 -167.96
3 aρ = 1, bρ = 1 2 2.18 1.978 6.10 1.10 73.89 -164.47
4 aρ = 1, bρ = 1 10 2.73 2.15 6.11 1.12 74.93 -162.71
5 aρ = 1, bρ = 1 discr(V1) 2.47 2.21 6.06 1.08 74.02 -172.54
6 aρ = 1, bρ = 1 discr(V2) 2.51 2.27 6.10 1.13 76.64 -170.94
Table 4.1: Prior specification for (ρ, ν) and K and posterior summaries for the Galaxy dataset; (aρ, bρ) appear in (4.20); here V1 = {1, 2, 5, 10, 20} and V2 = {0.5, 1, 2, 3, 5, 10, 15, 20, 30, 50}. V denotes the variance.
Figure 4.3: Density estimates and estimated partition for the Galaxy dataset under Test 4 in Table 4.1, including 90% credibility bands (light blue).
spectral density.
We have considered one further application, this time using the same variables
from the Air Quality dataset (ozone and solar radiation) as considered in
Quinlan et al. (2017). Instead of (4.8), we assume a bidimensional Gaussian
likelihood, with bidimensional mean vectors distributed according to the
PES−DPP(ρ, ν) prior as before, and with covariance matrices Σk independent
and identically distributed according to the inverse-Wishart distribution. See Section
4.2.4 for the changes in the Gibbs sampler required by multidimensional data points, this
time adapted from Dellaportas and Papageorgiou (2006). Table 4.2 reports summaries
of interest for a few tests carried out, including the prior and posterior mean
and variance of the number of components K, and the LPML. As usual in the
context of mixture models, we find that the inference depends on the chosen
hyperparameters. Compared with the corresponding inference in Quinlan et al.
(2017), we obtain lower estimates of K and a better fit of the model to the data. The
posterior predictive densities, not shown here, appear very similar to those in Fig. 9
(b) of Quinlan et al. (2017).
Test ρ ν E(K) Var(K) E(K|data) Var(K|data) LPML
7 3 2 3 2.62 2.18 0.39 -246.81
8 ρ0 ∼ gamma(1, 0.5) 2 2.7 2.37 2.15 0.21 -257.66
Table 4.2: Prior specification for (ρ, ν) and K and posterior summaries for the airquality dataset; ρ0 appears in (4.20).
4.4.2 Different spectral densities: application to the Galaxy dataset
We consider the proposed model with different spectral densities, to check the
robustness of the inference. All the models presented in this chapter are, in fact,
general: in principle any spectral density ϕ(·) satisfying the conditions
for the existence of the DPP can be employed. The choice of the spectral
density in (4.7) is motivated by its strong repulsiveness (see Lavancier et al., 2015).
However, in this section we show inference on the Galaxy dataset obtained when
spectral representations other than the power spectral density drive the DPP.
We choose isotropic covariance functions that are well known in the spatial statistics
literature: the Whittle-Matérn and the generalized Cauchy. Both densities depend
on three parameters: intensity ρ > 0, scale α > 0 and shape ν > 0. To
ensure ϕ(x) < 1 for all x, ρ must be smaller than ρmax = α^(−d) M, where M needs
to be specified for each of the two cases. For the Whittle-Matérn we have
ϕ(x; ρ, α, ν) = ρ Γ(ν + d/2) (2α√π)^d / [Γ(ν) (1 + ‖2παx‖²)^(ν+d/2)],    M = Γ(ν) / [2^d π^(d/2) Γ(ν + d/2)]
and for the generalized Cauchy
ϕ(x; ρ, α, ν) = ρ 2^(1−ν) (α√π)^d / Γ(ν + d/2) · ‖2παx‖^ν Kν(‖2παx‖),    M = Γ(ν + d/2) / [Γ(ν) π^(d/2)]
where d is the dimension of the space where x lives (d = 1 in what follows) and
Kν(·) is the modified Bessel function of the second kind.
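The existence check for the Whittle-Matérn case can be sketched numerically as follows (assuming the density in the form written above; the function names are ours). Since the spectral density attains its maximum at x = 0, setting ρ = ρmax makes ϕ(0) = 1, and any smaller intensity keeps ϕ < 1 everywhere.

```python
import math

def wm_spectral(x, rho, alpha, nu, d=1):
    """Whittle-Matérn spectral density, in the parameterization written above."""
    r = abs(2.0 * math.pi * alpha * x)  # for d = 1; use a Euclidean norm for d > 1
    num = rho * math.gamma(nu + d / 2) * (2.0 * alpha * math.sqrt(math.pi)) ** d
    return num / (math.gamma(nu) * (1.0 + r ** 2) ** (nu + d / 2))

def wm_rho_max(alpha, nu, d=1):
    """Largest admissible intensity: rho_max = alpha^(-d) * M."""
    M = math.gamma(nu) / (2 ** d * math.pi ** (d / 2) * math.gamma(nu + d / 2))
    return alpha ** (-d) * M

alpha, nu = 0.1, 2.0
rho = 0.5 * wm_rho_max(alpha, nu)  # the rho = rho_max / 2 setting used in the tests below
```

The same check applies, with the appropriate M, to the generalized Cauchy case.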
We fix ρ = ρmax/2 and (α, ν) equal to: (i) (0.1, 0.1), (ii) (0.1, 2), (iii) (1, 0.1) in
the tests below. To fit the Galaxy data to the model in Section 4.2.2, the selected
hyperparameter values are δ = 1, the parameter of the Dirichlet, and (a0, b0) =
(3, 3), the parameters of the inverse gamma (see (4.12) and (4.13)).
Table 4.3 displays posterior summaries for the two families of spectral densities
under hyperparameter settings (i), (ii) and (iii). Posterior summaries of the number
of components K and goodness-of-fit values are close to those of Table 4.1. This
Whittle-Matérn
Test E(K) Var(K) E(K | data) Var(K | data) MSE LPML
(i) 10.21 17.29 6.07 1.09 73.67 -167.22
(ii) 2.09 2.15 6.08 1.09 73.89 -167.68
(iii) 3.53 9.87 6.07 1.10 75.80 -167.33
Generalized Cauchy
Test E(K) Var(K) E(K | data) Var(K | data) MSE LPML
(i) 5.65 14.49 6.09 1.09 76.98 -166.60
(ii) 1.84 1.73 6.07 1.10 75.75 -167.42
(iii) 0.25 0.06 6.07 1.12 80.66 -169.84
Table 4.3: Prior mean and variance of K and posterior summaries for the Galaxy dataset with Whittle-Matérn (top) and generalized Cauchy (bottom) spectral densities.
gives evidence of robustness with respect to the choice of the spectral density.
4.4.3 Tests on data from a mixture with 8 components
We simulated a dataset with n = 100 observations from a mixture of 8 components.
Each component is a Gaussian density with mean θk and common variance σ²k = σ² = 0.05;
the means θk are evenly spaced in the interval (−10, 10). In the model (4.8)-(4.13),
we set aρ = 2.0025, bρ = 0.050125 so that E(ρ0) = 0.05 and Var(ρ0) = 1;
again, s = 0.5 and δ = 1. We recall that ρ0 is defined in (4.19).
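The reported values of (aρ, bρ) are consistent with an inverse-gamma prior for ρ0: matching a target mean m = b/(a − 1) and variance v = b²/((a − 1)²(a − 2)) gives a = m²/v + 2 and b = m(a − 1). A sketch of this moment matching (assuming the inverse-gamma parameterization; the function name is ours):

```python
def invgamma_params(mean, var):
    """Shape/scale (a, b) of an inverse-gamma distribution with given mean
    and variance, using mean = b/(a-1) and var = b^2 / ((a-1)^2 (a-2))."""
    a = mean ** 2 / var + 2.0
    b = mean * (a - 1.0)
    return a, b

a, b = invgamma_params(0.05, 1.0)  # reproduces (2.0025, 0.050125) above
```

The same matching applies to the inverse-gamma hyperparameters (a0, b0) used elsewhere in the chapter; for instance (a0, b0) = (3, 3) corresponds to mean 1.5 and variance 9/4.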
Table 4.4 reports hyperparameter values for the different tests and posterior summaries
of interest, as well as the prior mean and variance of K. In particular, we show
the posterior mean and variance of the number of components K (with which we
assess the effectiveness of the model for clustering), the mean squared error (MSE) and
the log-pseudo marginal likelihood (LPML) (which helps quantify the goodness
of fit). In all cases we obtained a quite satisfactory estimate of the exact number
of components, which is 8: the posterior is concentrated around the true value with
a very small variance. See also Figure 4.4.
From the density estimation viewpoint, Table 4.4 shows that both MSE
and LPML are similar across tests, thus indicating robustness with respect to
the prior choice of the parameters ρ and ν. However, Tests S2 and S7 seem
preferable; see Figure 4.5, where density estimates and estimated partitions for these two
cases are displayed. The posterior density of ρ under Tests S2 and S7 is shown in
Figure 4.6.
4.4.4 Comparison to alternative models
We now consider fitting alternative models to the Galaxy and two simulated
datasets, one from the mixture with 8 components introduced in the previous sec-
Prior specification
Test ρ ν E(K) V (K)
S0 9.00 1 8.98 45.12
S1 9 10 9 23.05
S2 aρ = 1, bρ = 1 1 1.94 1.99
S3 aρ = 1, bρ = 1 2 2.18 1.99
S4 aρ = 1, bρ = 1 10 2.74 2.17
S5 aρ = 1, bρ = 1 discr(2,5,20) 2.52 2.11
S6 aρ = 1, bρ = 1 discr(V1) 2.45 2.18
S7 aρ = 1, bρ = 1 discr(V2) 2.5 2.25
Posterior summaries
Test E(K | data) V (K | data) MSE LPML
S0 7.98 0.20 4.65 2.39
S1 7.99 0.19 4.62 3.10
S2 8.00 0.17 4.62 3.66
S3 7.99 0.16 4.62 3.03
S4 7.99 0.17 4.63 2.96
S5 7.99 0.16 4.63 3.61
S6 7.99 0.17 4.65 3.42
S7 7.99 0.18 4.63 3.36
Table 4.4: Prior specification for (ρ, ν) and the corresponding mean and variance induced on K (top). Hyperparameters (aρ, bρ) appear in (4.20), while V1 = {1, 2, 5, 10, 20} and V2 = V1 ∪ {0.5, 3, 15, 30, 50}. Posterior summaries for the simulated dataset from a mixture with 8 components are in the bottom subtable.
Figure 4.4: Posterior distribution of K for the simulated dataset from the mixture of 8 components under Tests S2 (left) and S7 (right) in Table 4.4.
tion, and the second consisting of 10,000 observations generated from a mixture
Figure 4.5: Density estimate and estimated partition for the simulated dataset from the mixture of 8 components under Tests S2 (left) and S7 (right) in Table 4.4. The points at the bottom of the density estimate represent the data, and each color represents one of the eight estimated clusters.
of 20 components. We consider first the gold standard of Bayesian nonparametric
models, the DPM, then the RGMM by Quinlan et al. (2017), and the similar
specification in Petralia et al. (2012). The same prior information on the hyperparameters
common to all the mixture models was assumed, i.e. the same marginal prior
for σ²k and (w1, . . . , wK). The hyperparameter τ in the potentials φ1 and φ2 was set
according to the suggestion in Quinlan et al. (2017) (τ = 5.54). As a comparison, the
DPM
Test α E(K) E(K|data) Var(K|data) MSE LPML
7 gamma(0.5, 1) 2.9 6.166 1.549 62.703 -151.797
8 0.8 4.3 5.936 1.25 61.255 -151.146
9 0.45 3 4.371 1.142 139.659 -169.978
10 gamma(4, 2) 7.7 7.271 1.594 36.708 -149.258
Repulsive models
Model E(K|data) Var(K|data) MSE LPML
Quinlan et al. (2017) 6.462 0.440 38.122 -162.574
Petralia et al. (2012) 7.621 0.757 20.964 -156.522
Table 4.5: Prior specification for α and posterior summaries for the Galaxy dataset using the function DPdensity in DPpackage (top) and repulsive models (bottom).
same posterior quantities as in Table 4.1 were computed; see Tables 4.5 and 4.6.
The DPM was fitted via the function DPdensity available in DPpackage (Jara
et al., 2011), while the code for the alternative repulsive models was kindly provided
Figure 4.6: Posterior distribution of ρ for the simulated dataset from the mixture of 8 components under Tests S2 (left) and S7 (right) in Table 4.4.
by José Quinlan and Garritt Page.
DPM
Test α E(K) E(K|data) Var(K|data) MSE LPML
11 0.43 3 7.961 0.5 4.779 -11.246
12 gamma(4, 2) 8.17 8.665 0.910 4.248 -10.116
Repulsive models
Model E(K|data) Var(K|data) MSE LPML
Quinlan et al. (2017) 10.73 1.407 3.121 -4.754
Petralia et al. (2012) 8.51 0.357 4.152 -4.022
Table 4.6: Posterior summaries for the simulated dataset from the mixture of 8 components using the function DPdensity in DPpackage (top) and repulsive models (bottom).
Comparison of the tables above with Tables 4.1 and 4.4 shows that the alternative
repulsive models are good competitors to ours and, depending on the dataset and
hyperparameter specification, they may achieve a better (Galaxy) or worse (simulated
data) fit to the data. The tests showing the best goodness-of-fit indices
are typically those overestimating the number of clusters. It is well known that,
in general, clustering in the context of nonparametric mixture models such as DPMs is
strongly affected by the base measure (see, e.g., Miller and Harrison, 2017). The
same disadvantage affects the mixture models in Quinlan et al. (2017) and Petralia
et al. (2012). Our model, on the other hand, avoids the delicate choice of the base
measure, leading to more robust estimates of K.
As a further comparison, see also Figure 4.7, which displays the posterior
distribution of K under the DPM and under our model for the Galaxy dataset.
Figure 4.7: Posterior distribution of the number K of components for the Galaxy dataset under Tests 4 (black) and 6 (blue) in Table 4.1 and under the DPM model (red) as in Test 7 in Table 4.5.
For the second simulated dataset, we considered applicability for a moderately
large sample size, generating 10,000 observations from a 20-component mixture, 10
of the components being Gaussian and the rest skew-normal distributions with positive
and negative skewness. The true density is shown in Figure 4.8. To estimate the
true number of clusters, we fitted different alternative models to this dataset: our
model, the repulsive mixture models by Quinlan et al. (2017) and Petralia et al.
(2012), and the finite mixture model implemented in the mclust R package via the
function Mclust (Fraley et al., 2012) with a number of components between 10 and
25. The same prior information on the hyperparameters common to all the Bayesian
mixture models was assumed. The Mclust function returns the estimates of the
number of components corresponding to the best three models, in this case 11, 17
and 18. Though the run-time for this application is around 15 times longer than in
Model E(K|data) Var(K|data) MSE LPML
PES−DPP 16.41 1.38 1356.43 -13239.54
Quinlan et al. (2017) 14.13 0.146 1475.98 -13771.56
Petralia et al. (2012) 20.81 0.564 1002.05 -12940.49
Table 4.7: Posterior summaries for the large simulated dataset.
the case of the Galaxy data, our algorithm reduces the effective number of clusters
Figure 4.8: Histogram, true density (red) and density estimate (black) of the large simulated dataset, including 90% credibility bands (light blue).
a posteriori, thus helping our model scale up. Intuitively, the increase in the run-time
is mostly due to the larger number of mixture components and the much larger
sample size compared with the other datasets illustrated here.
4.4.5 Simulated data with covariates
We consider the same simulated dataset as in Müller et al. (2011), Section 5.2; the
"simulation truth" consists of 12 different distributions, corresponding to different
covariate settings (see Figure 1 of that paper). Model (4.8)-(4.10), (4.13)-(4.16) was
fitted to the dataset, assuming β0 = 0, Σ0 = 400 (XᵀX)⁻¹, aρ = 1, bρ = 1.2, and
a0, b0 such that the prior mean of σ²k is 50 and its variance is 300. Recall also that here
we assume ν = 2.
As an initial step, inference for the complete dataset (1000 observations) was
carried out, yielding a posterior of K, not reported here, mostly concentrated over
the set {8, 9, . . . , 16}, with a mode at 11. Figure 4.9 shows posterior predictive
distributions for the 12 different reference covariate values, along with 90% credibility
intervals. These are in good accordance with the simulation truth (compare Figure
1 in Müller et al., 2011).
To replicate the tests in Müller et al. (2011), a total of M = 100 datasets of size
200 were generated by randomly subsampling 200 out of the 1000 available
observations. The computational burden over multiple repetitions was controlled by limiting
the posterior sample sizes to 2,000. Table 4.8 displays the root MSE for estimating
E(y | x1, x2, x3) for each of the 12 covariate combinations defining the different clusters,
for our model and for the PPMx, as in Table 1 of Müller et al. (2011). The
computations also include evaluation of the root MSE and LPML over all 100 datasets
for the data used to train the model, with MSEtrain = ∑_{i=1}^n (yi − ŷi)²,
Figure 4.9: Predictive distributions corresponding to the 12 different reference values of the covariates. The simulation truth can be found in Figure 1 of Müller et al. (2011).
where ŷi is the expected value of the estimated predictive distribution, and, for a
test dataset of 100 new data points, MSEtest = ∑_{i=1}^n (yi^test − ŷi)². In addition, we report
LPMLtrain, the value of the log-pseudo marginal likelihood for the training dataset.
Table 4.9 shows the values compared to other competitor models, i.e. the linear
dependent Dirichlet process mixture (LDDP) defined in De Iorio et al. (2004), the
product partition model with covariates (PPMx) in Müller et al. (2011) and the linear
dependent tailfree process model (LDTFP) in Jara and Hanson (2011). The best values are in
bold: our model performs well according to the LPML, while the MSE favors the
PPMx or the LDTFP. In general, our model is competitive with respect to other
popular models in the literature. Moreover, in the LDDP case, the
average number of clusters is 20.6 with variance 2.266, indicating a less
parsimonious model compared to ours.
x1 x2 x3 DPP PPMX
-1 0 0 6.1 7.9
0 0 0 6.7 3.9
1 0 0 7.2 2.8
-1 1 0 6.5 5.4
0 1 0 6.5 4.6
1 1 0 6.8 4.0
-1 0 1 6.8 6.1
0 0 1 6.1 4.2
1 0 1 5.7 4.5
-1 1 1 5.9 9.5
0 1 1 6.6 8.3
1 1 1 5.8 6.2
avg 6.4 5.6
Table 4.8: Root MSE for estimating E(y | x1, x2, x3) for the 12 combinations of covariates (x1, x2, x3), with the PPMx as competing model of reference (compare also the results in Table 1 of Müller et al., 2011).
DPPx LDDP PPMX LDTFP
Root MSEtrain 324.531 304.742 278.395 304.374
Root MSEtest 216.675 215.1694 217.2459 212.761
LPMLtrain -871.8 -902.2295 -873.1671 -901.465
Table 4.9: Comparison with competitors for the simulated dataset with covariates; best values according to each index are in bold. DPPx denotes our model, LDDP the linear dependent Dirichlet process mixture, PPMx the product partition model with covariates, and LDTFP the linear dependent tailfree process model.
In summary, our extensive simulations suggest that the proposed approach tends
to require fewer mixture components than other well-known alternative models, while
still providing a reasonably good fit to the data.
4.5 Biopic movies dataset
For this illustrative example we consider the Biopics data available in the R
package fivethirtyeight (Ismay and Chunn, 2017). This dataset is based on the
IMDB database and relates to biographical films released from 1915 through 2014. An
interesting explorative analysis of the data can be found at goo.gl/M2QWFt.
We consider the logarithm of the gross earnings at the US box office as the response
variable, with the following covariates: (i) year of release of the movie (on a suitable
scale, continuous); (ii) a binary variable that indicates whether the main character
is a person of color; and (iii) a categorical variable recording whether the country of
the movie is US, UK or other. After removing the missing data from the dataset,
we were left with n = 437 observations and p = 4 covariates. We
note that 76 biopics have a person of color as subject, and the frequencies of
the category origin are (256, 79, 64) for US, UK and other, respectively; "other"
means mixed productions (e.g. US and Canada, or US and UK). In what follows,
the hyperparameters in model (4.17)-(4.18), (4.9)-(4.10), (4.15)-(4.16) are chosen
as β0 = 0 and (aρ, bρ) = (1, 1). The prior mean and variance of K induced by these
hyperparameters are 2.162 and 1.978, respectively. The scale hyperparameter φ in
the g-prior for β and (a0, b0) vary as reported in Table 4.10, where m and v
denote the prior mean b0/(a0 − 1) and variance b0²/((a0 − 1)²(a0 − 2)), respectively,
of the inverse gamma distribution for σ²k as in (4.18). We also assume γ0 equal to
the vector of all 0's, while Λ0 is such that the marginal a priori variance of γk is
equal to diag(0.01, 0.1, 0.1, 0.1), in accordance with the variances of the corresponding
frequentist estimators.
Test φ m v E(K | data) sd(K |data) MSE LPML
A 50 5 1 4.49 1.10 1126.32 -960.89
B 200 5 10 4.45 1.19 983.55 -954.55
C 50 3 +∞ 5.66 1.27 501.22 -918.74
D 200 10 5 4.21 1.33 1805.83 -980.61
E 100 2 1 5.31 1.21 564.26 -935.56
F 200 2 10 5.51 1.26 557.44 -925.22
Table 4.10: Prior specification for the βk's and σ²k's and posterior summaries for the Biopics dataset; m and v are the prior mean and variance, respectively, of σ²k. The posterior mean and standard deviation of the number K of mixture components are in the fifth and sixth columns, respectively, while the last two columns report MSE and LPML, respectively.
The posterior of K is robust with respect to the choice of prior hyperparameters;
on the other hand, our results show that when covariates are not included in the likelihood,
i.e. setting all γk's equal to 0, inference on K is much more sensitive to the choice
of (a0, b0) (results not shown here).
Predictive inference was also considered, by evaluating the posterior predictive
distribution at the following combinations of covariate values: (i) (mean value for
covariate year, US, white); (ii) (mean value for covariate year, US, color); (iii)
(mean value for covariate year, UK, white); (iv) (mean value for covariate year,
UK, color); (v) (mean value for covariate year, other, white); and (vi) (mean value
for covariate year, other, color). Corresponding plots are shown in Figure 4.10.
Figure 4.10: Predictive distribution of the log gross earnings for cases (i)-(vi) under Test E in Table 4.10 for the Biopics dataset.
These distributions appear to be quite different in the six cases: in particular, we
can observe that in cases (i) and (ii) the posterior is shifted towards higher values.
This is easy to interpret, since the measurements are earnings at the US box
office; therefore, we expect US movies to be, in general, more profitable in that
market. The difference due to race is, on the other hand, less
evident. However, the predictive densities show slightly higher earnings for movies
whose subject is a person of color when the origin is other ((v) and (vi)). Movies
from the UK, on the other hand, exhibit the opposite behavior ((iii) and (iv)).
We report here the posterior cluster estimate for Test B in Table 4.10. We found
three groups, with sizes 10, 193 and 234, respectively; see Figure 4.11 for the estimated
clusters and boxplots of the response. As a comparison, it is useful to report
the overall average value of the response, 15.36, and of the covariates: 7.89 (year),
0.18 (UK), 0.15 (other), 0.83 (white). These three groups have a nice interpretation in
terms of covariates: group 1 is the smallest, with a high average response (17.18),
Figure 4.11: Cluster estimate (left) under our model (Test B in Table 4.10) for the Biopics dataset. Each color represents one of the three estimated clusters. The y coordinate is the response, i.e. the log box-office earnings, while the x coordinate is the covariate year of release. The boxplot of the response per group is in the right panel.
and it is characterized by a high percentage of movies from other countries, with a
person of color as subject. Group 2 also corresponds to a high average response
(16.42), but the average values of UK, other and person of color are similar to the
overall averages (0.14, 0.09 and 0.84, respectively). The average response in group 3 is
smaller (14.40) than the total sample mean, while the average values of UK, other
and person of color are 0.22, 0.17 and 0.84, respectively.
To assess the effectiveness of the proposed model, we compare the results with the
linear dependent Dirichlet process mixture model introduced in De Iorio et al. (2004)
and implemented in the LDDPdensity function of DPpackage (Jara et al., 2011).
Prior information was fixed as follows: for Test G the mass parameter α of
the Dirichlet process is set equal to 0.3, so that E(K) = 2.87 and Var(K) =
1.81, which approximately match the prior information we gave on the parameter K.
Similarly, under Test H, α is distributed according to the gamma(1/4, 1/2), so
that the prior mean of K is 3.6 and the variance is 22.18. The baseline distribution
is a multivariate Gaussian with mean vector 0 and a random covariance matrix with
a non-informative prior; the inverse-gamma distribution for the variances
of the mixture components has parameters such that its mean and variance are equal
to 5 and 1, respectively, similarly to Table 4.10. Posterior summaries can be found in
Table 4.11.
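These mass-parameter calibrations can be reproduced from the standard Chinese-restaurant representation of the Dirichlet process, in which the prior number of clusters K among n observations is a sum of independent Bernoulli(α/(α + i − 1)) indicators, i = 1, . . . , n. A minimal sketch (the function name is ours):

```python
def dp_prior_K_moments(alpha, n):
    """Prior mean and variance of the number of clusters K among n draws
    from a Dirichlet process with mass alpha: K is a sum of independent
    Bernoulli(alpha / (alpha + i - 1)) indicators, i = 1, ..., n."""
    probs = [alpha / (alpha + i) for i in range(n)]  # i = 0, ..., n-1
    mean = sum(p for p in probs)
    var = sum(p * (1.0 - p) for p in probs)
    return mean, var

m, v = dp_prior_K_moments(0.3, 437)  # n = 437 observations in the Biopics dataset
```

For α = 0.3 and n = 437 this returns values close to the (2.87, 1.81) reported above; the same function with α = 0.15 and n = 1147 reproduces the calibration used for the Air Quality Index data in Section 4.6.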
As a comparison between the estimated partitions under our model (Figure 4.11)
and under the LDDP mixture model, Figure 4.12 displays the estimated partition obtained
under the LDDP model for Test G, which has 3 groups with sizes 300, 127 and 10.
Figure 4.12: Cluster estimate obtained under a linear dependent Dirichlet process model with prior specification G in Table 4.11.
Case E(K | data) sd(K | data) MSE LPML
G 2.95 1.03 1282.49 -937.51
H 3.56 2.36 682.98 -914.00
Table 4.11: Posterior summaries for the tests on the Biopics dataset under a linear dependent Dirichlet process mixture.
4.6 Air quality index dataset
The Air Quality Index (AQI) is an index for reporting air quality; see for instance
https://airnow.gov/index.cfm?action=aqibasics.aqi. It describes how clean
or polluted the air is, and what associated health effects might be a concern for the
population. The Environmental Protection Agency calculates the AQI for five major
air pollutants regulated by the Clean Air Act: ground-level ozone, particle pollution,
carbon monoxide, sulfur dioxide, and nitrogen dioxide. Data can be obtained from
several sources, for instance from http://aqicn.org/. For a real-time map, see
https://clarity.io/airmap/.
For the purpose of this illustration, we investigate the spatial relations in
measurements of the AQI made on September 13th, 2015, at 16:00. We consider 1147
locations scattered across North and South America, shown in Figure 4.13 (the
values of the AQI have been standardized). Note that the highest AQI values, indicating
the most polluted air, are depicted in red on the map. We ran the MCMC algorithm
to fit model (4.8)-(4.10), (4.13)-(4.16), with a burn-in of 10,000 iterations, a thinning of 10
and a final sample size of 5,000. As before, β0 = 0 and ν = 2. Table 4.12 displays
different settings of the hyperparameters, for all of which the prior mean of the number of
groups is 1.996 and the prior standard deviation is 1.290 (computed using a Monte
Carlo approach). The hyperparameter settings differ in the specification
of φ, the scale hyperparameter in the g-prior in (4.16), and the prior mean m and
variance v of σ²k; see (4.13).
Figure 4.14 shows the estimated clusters obtained under Test AQ1. The north-east
coast seems to be associated with better environmental conditions, and it is
Figure 4.13: Air quality index dataset, where the number of locations is 1147. Yellow points denote areas with the smallest values of the AQI, while red denotes points with the highest values.
Test φ m v mean(K) sd(K) MSE LPML
AQ1 1000 2 1 6.999 1.469 861.048 -1101.013
AQ2 500 10 +∞ 5.192 1.143 870.689 -1235.988
AQ3 1000 0.1 1 9.045 2.243 840.0685 -1071.931
AQ4 500 5 +∞ 7.811 2.665 835.596 -1160.631
Table 4.12: Prior specification for the Air quality index dataset. The scale parameter φ appears in the g-prior specification of (4.16), while m and v denote the prior mean and variance, respectively, of σ²k as in (4.13).
clear that important urban sprawls are generally grouped together. In more detail,
the Binder loss function method estimated 6 groups characterized by the following
means and standard deviations of the AQI: (0.95, 0.44) in the red group, (-0.27, 0.45)
in the yellow, (-0.70, 0.21) in the green, (1.7, 1.64) in the light blue, (0.28, 0.54) in the
blue, and (-0.51, 0.29) in the pink group; yellow, pink and green points are associated with
lower values of the AQI, while red and light blue points with higher values. The boxplots of the
AQI by cluster in Figure 4.14 are clearly interpretable: the cluster depicted in light
blue gathers the polluted cities in South America and big cities on the West Coast of
the U.S. (Las Vegas, Los Angeles, Seattle, for instance). On the other hand, yellow
and green points indicate the less dangerous environmental conditions that characterize
the North-East coast; however, the small red cluster contains the big cities of
this area (Chicago, New York, Philadelphia, Boston).
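The point-estimation step behind these cluster estimates can be sketched as follows: from the MCMC cluster allocations one estimates the posterior co-clustering probabilities and then selects, for instance among the sampled partitions, the one minimizing the posterior expected Binder loss, which under equal misclassification costs reduces to a sum of squared deviations from the co-clustering matrix. The function below is a generic illustration on toy labels, not the implementation used for the analyses; its name is ours.

```python
from itertools import combinations

def binder_best(partitions):
    """Among sampled partitions (lists of cluster labels), return the one
    minimizing sum_{i<j} (1[c_i = c_j] - p_ij)^2, where p_ij is the posterior
    co-clustering probability estimated from the MCMC samples themselves."""
    n = len(partitions[0])
    M = len(partitions)
    pairs = list(combinations(range(n), 2))
    # posterior similarity matrix: fraction of samples placing i and j together
    p = {(i, j): sum(c[i] == c[j] for c in partitions) / M for i, j in pairs}
    def loss(c):
        return sum(((c[i] == c[j]) - p[(i, j)]) ** 2 for i, j in pairs)
    return min(partitions, key=loss)

samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
best = binder_best(samples)
```

Restricting the search to the sampled partitions is a common practical shortcut; Lau and Green (2007b) discuss more elaborate optimization strategies.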
Figure 4.15 displays three different predictive laws corresponding to different
locations: Sacramento, which shows the lowest predicted values of the AQI; New York,
where the environmental conditions are worse; and Monterrey, which presents an
intermediate situation. Figure 4.16 shows the posterior predictive mean for a grid
Figure 4.14: Estimated partition of the Air Quality index dataset under the hyperparameters of Test AQ1 in Table 4.12. The number of estimated clusters is 6, each denoted by a different color, with sizes 17, 221, 183, 136, 306, 284, respectively.
of locations scattered around North America.
Similarly to the Biopics dataset, we compare the inference under our model
with the linear dependent Dirichlet process mixture model introduced in De Iorio
et al. (2004). Prior information is fixed as follows: α is distributed according to
the gamma(1, 1) distribution for Test AQ5, so that the prior mean and variance of
K are 7.15 and 36, respectively, i.e. the prior of K is vague. On the other hand,
in Test AQ6 the mass parameter α of the Dirichlet process is set equal to 0.15,
so that E(K) = 2.09 and Var(K) = 1.02, which approximately matches the prior
information given on K (mean 1.996 and variance 1.29). The baseline distribution is
Figure 4.15: Predictive distributions corresponding to 3 different locations (New York, Sacramento, Monterrey) under Test AQ4 in Table 4.12 for the Air Quality Index dataset.
Figure 4.16: Prediction over a grid of coordinates for the Air Quality Index dataset under Test AQ4 in Table 4.12.
a multivariate Gaussian with mean vector 0 and a random covariance matrix with
a non-informative prior, and the hyperparameters of the inverse-gamma
distribution for the variances of the mixture components are such that the prior mean
and variance are equal to 5 and 1, respectively. Posterior summaries can be found
in Table 4.13.
Test E(K | data) Var(K | data) MSE LPML
AQ5 5.14 0.38 827.72 -1100.73
AQ6 5.03 0.16 827.04 -1100.06
Table 4.13: Posterior summaries for the Air Quality Index dataset under the linear dependent Dirichlet process mixture for two different prior specifications.
4.7 Conclusion
This work deals with mixture models where the prior has the property of repulsion
across location parameters. Specifically, the discussion is centered on mixtures
built on determinantal point processes (DPPs), which can be constructed using a
general spectral representation. The methods work with any valid spectral density,
but for the sake of concreteness, the illustrations were discussed in the context of the
power exponential case.
Though we limit ourselves to the case of isotropic DPPs, inhomogeneous DPPs
can be obtained by transforming or thinning a stationary process. However, we
believe that this case is not very interesting, unless there is a strong reason to
assume non-homogeneous locations a priori.
Our computational experiments and data illustrations show that the repulsion
induced by the DPP priors indeed tends to eliminate the annoying case of very
small clusters that commonly arises when using models that do not constrain
location/centering parameters. This comes at a very small sacrifice of model fit
compared to the usual mixture models.
Another advantage of our model over DPMs is that we avoid the delicate choice of the base measure of the Dirichlet process, leading to more robust estimates of the number K of components in the mixture.
Chapter 5
Constructing stationary time series of
completely random measures via
Bayesian conjugacy
One flexible approach to building stationary time-dependent processes exploits the mathematical notion of conjugacy in a Bayesian framework. Under this approach, the transition law L(X_t | X_{t−1}) of a process X_t is defined as the predictive distribution of an underlying Bayesian model (see e.g. Pitt and Walker (2005)). Then, if the model is conjugate, the transition kernel can be analytically derived, making the approach particularly appealing. We aim at achieving such convenient mathematical tractability in the context of completely random measures (CRMs), i.e. when the variables exhibiting time dependence are CRMs. In order to take advantage of conjugacy, here we consider the large class of exponential-family completely random measures (see Broderick et al. (2017)). This leads to a simple description of the process, which has an AR(1)-type structure and offers a framework for generalizations to more complicated forms of time dependence. The proposed process can be straightforwardly employed to extend CRM-based Bayesian nonparametric models, such as feature allocation models, to time-dependent data. These processes can be applied to problems from modern real-life applications in very different fields, from computer science to biology. In particular, we develop a dependent latent feature model for the identification of features in images and a dynamic Poisson factor analysis for topic modelling, which are fitted to synthetic and real data.
5.1 Stationary autoregressive-type AR(1) models for univariate data
An intense research activity of the past decades has focused on constructing strictly stationary autoregressive-type (AR-type) models with arbitrary stationary distributions (see, for instance, Mena and Walker (2007), Pitt and Walker (2005), Jørgensen and Song (1998)). We will focus on the approach introduced in Pitt et al. (2002) and later generalized in various frameworks (e.g. more general time dependences and nonparametric approaches). The aim is to build a strictly stationary process X_t whose marginal laws are fixed and denoted by p(x). A suitable auxiliary random variable Y, with conditional distribution p(y|x), is introduced and the transition density driving the AR(1)-type model X_t is obtained as
p(x | x_{t−1}) = ∫ p(x | y) p(y | x_{t−1}) ν(dy)  (5.1)

with

p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) η(dx),
where ν and η are reference measures, such as the Lebesgue or counting measures.
This construction implies that p(·) is the invariant density for the transition in (5.1), i.e.

p(x) = ∫ p(x | x_{t−1}) p(x_{t−1}) η(dx_{t−1}).
Note that the transition density of the process X_t has the interpretation

p(x | x_{t−1}) = E_{Y|X_{t−1}}( p(x | y) | x_{t−1} ),
where the expectation is with respect to p(y | x_{t−1}). The latter can be seen as the posterior distribution under the model X | Y ∼ p(x|y), Y ∼ p(y), where p(x|y) acts as a conditional sampling model (the likelihood) and p(y) as the prior.
In this parametric framework, Pitt et al. (2002) studied a wide class of models for X_t, obtained when p(x|y) belongs to the exponential family and p(y) is the corresponding conjugate prior. One of the advantages of this approach is that the integral defining the transition kernel in (5.1) has a closed analytical form. This characteristic makes these models particularly appealing from an applied point of view, since the latent variable Y amounts to a mathematical trick to build the desired dependence.
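To fix ideas, the construction above can be sketched in a few lines. The Gamma-Poisson pair below is an illustrative choice of ours (a Gamma(a, b) marginal with a Poisson(cX) auxiliary variable, a conjugate pair), not a model used later in the chapter; the transition is exactly the composition "draw p(y | x_{t−1}), then draw from the conjugate posterior p(x | y)".

```python
import numpy as np

rng = np.random.default_rng(0)

def pitt_walker_gamma_chain(T, a=3.0, b=1.0, c=2.0):
    """AR(1)-type chain with invariant Gamma(a, b) marginal, built as in
    Pitt et al. (2002): draw the auxiliary Y | X ~ Poisson(c X), then
    X' | Y ~ Gamma(a + Y, b + c), the conjugate posterior."""
    x = rng.gamma(a, 1.0 / b)                # start from the marginal
    xs = np.empty(T)
    for t in range(T):
        xs[t] = x
        y = rng.poisson(c * x)               # auxiliary draw p(y | x)
        x = rng.gamma(a + y, 1.0 / (b + c))  # posterior draw p(x | y)
    return xs

xs = pitt_walker_gamma_chain(200_000)
# the Gamma(a, b) marginal moments are preserved: mean a/b, variance a/b^2
```

Since E(X' | x) = (a + cx)/(b + c) is linear in x, the chain has an AR(1)-type conditional mean with coefficient c/(b + c).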
In this work we aim at achieving the same mathematical tractability but in a more general context, where the observations are completely random measures, introduced in Chapter 1; nevertheless, it is quite often the case in practical applications that the CRMs are merely latent variables. Other works extending the approach to a nonparametric framework are, among others, Mena and Walker (2005) and Antoniano-Villalobos and Walker (2016). More general time dependences can be found in Mena and Walker (2007) and
Pitt and Walker (2005).
5.2 Exponential completely random measures
One of the main ingredients of the model we are going to propose in Section 5.3 is
the exponential family of CRMs, introduced in Broderick et al. (2017). As mentioned
in Chapter 1, a broad class of Bayesian nonparametric priors can be viewed as models
for the allocation of data points to traits. These processes give us traits paired with
rates or frequencies with which the traits occur in some population. Corresponding
likelihoods assign each data point in the population to some finite subset of traits, conditioned on the trait frequencies. What makes these models nonparametric is that the number of traits in the prior is countably infinite. That is, such a model allows the number of traits in any dataset to grow with the size of the data. Thus, nonparametric models allow for great flexibility but also present many challenges from a computational viewpoint, since an infinite number of parameters is involved. In this sense, having conjugacy is a valuable advantage when dealing with this kind of models in real-life applications. Conjugacy asserts that the posterior belongs to the same family of distributions as the prior: the exponential family of CRMs provides the opportunity of building models with a conjugate structure. Hence, we are able to consider marginal processes, which take a particularly straightforward form, and to avoid handling the infinite-dimensional parameters, namely the prior and the posterior.
prior and the posterior. For instance, in Section 1.3.2 in Chapter 1 we gave a useful
marginal representation of a general class of completely random measures. In order
to keep the chapter self-contained, we recall some basic notions that are needed to
present the family of exponential CRMs.
From now on we will represent each trait by a point ψ in some (Polish) space
Ψ of traits. Further, let Jk be the frequency, or rate, of the trait represented by
ψ_k, where k ≥ 1 indexes the countably many traits; in particular, J_k ∈ ℝ+. Then (J_k, ψ_k) is a pair consisting of the frequency of the k-th trait together with the trait itself. We can represent the full collection of traits paired with their frequencies by a discrete measure on Ψ that places weight J_k at location ψ_k, namely

G = ∑_{k≥1} J_k δ_{ψ_k}.
Next, we form the data point X conditionally on G, viewing X as a discrete measure as well. Each atom of X represents a pair consisting of a trait to which the individual is allocated and the degree to which the individual is allocated to this particular trait. That is, X is a discrete measure whose support coincides with the support of G, and

X = ∑_{k≥1} x_k δ_{ψ_k},  (5.2)

where x_k ∈ ℝ+ represents the degree to which the data point belongs to trait ψ_k.
Recall that any (homogeneous) completely random measure may be uniquely characterized by its Lévy intensity, which can be factorized as

ν(ds × dψ) = ρ(ds) P_0(dψ),

where ρ is a σ-finite deterministic measure on ℝ+ and P_0 is a proper (diffuse) probability distribution on Ψ.
Each jump xk in (5.2) is drawn according to some distribution H that takes Jk,
the weight of G at location ψk, as a parameter; i.e.,
xk ∼ H(dx|Jk) independently across k.
Some assumptions are needed on the prior and the likelihood:

1. ρ(ℝ+) = +∞: we require that the measure has a countably infinite number of atoms.

2. Each data point can be allocated to only a finite number of traits. Thus, we require the number of atoms in every X to be finite, that is, H(dx|J) must be discrete with support ℕ = {0, 1, 2, . . .} for all J, and we write h(x|J) for the probability mass function of x given J. Moreover, note that, by construction, the pairs {(J_k, ψ_k)}_k form a marked Poisson point process with rate measure μ_mark(ds × dx) := ρ(ds) h(x|s), so we assume

∑_{x=1}^∞ ν_x(ℝ+) < +∞, for ν_x(ds) := ρ(ds) h(x|s).
Given these assumptions, the exponential family of completely random measures offers a convenient framework for developing our model. We recall Definition 4.1 of Broderick et al. (2017), discarding the fixed part of the measure, which is not of interest here:
Definition 5.1
We say that a CRM G is an exponential CRM if the ordinary component has rate measure μ(ds × dψ) = ρ(ds) P_0(dψ) for some probability distribution P_0 and weight rate measure ρ of the form

ρ(ds) = γ exp{ ⟨η(s), ξ⟩ − λ A(s) } ds,  (5.3)

where γ > 0, ξ and λ are hyperparameters, η(·) is the natural parameter and A(·) is known as the cumulant function.
Theorem 4.2 of Broderick et al. (2017) states that these random measures are automatically conjugate priors for an exponential CRM likelihood, as follows:

Theorem 5.1
Let G = ∑_{k=1}^∞ J_k δ_{ψ_k}. Let X be generated conditionally on G according to an exponential CRM with fixed-location atoms at {ψ_k}_{k=1}^∞ and no ordinary component. In particular, the distribution of the weight x_k of X at ψ_k has the following density, belonging to the exponential family of parametric distributions, conditioning on the weight J_k of G at ψ_k:

h(x | J_k) = κ(x) exp{ ⟨η(J_k), φ(x)⟩ − A(J_k) }.

Here, κ(x) is a function of the data and φ(x) is the sufficient statistic. Then, a conjugate prior for X is the exponential CRM distribution, with weight rate measure as in (5.3).
As a consequence, very simple marginal and size-biased representations can be derived (for more details, see Broderick et al. (2017)). We also remark that many well-known models in the literature fit within this framework: the Beta-Bernoulli, Poisson-Gamma and Beta-Negative Binomial processes, among others.
5.3 Building a stationary time dependent model for a
sequence of discrete random measures
We start by motivating the model presented here: all the papers mentioned in Section 5.1 focused on flexibly modeling time-dependent univariate continuous or count data. However, it is important to build models that reflect the complexity of the data available nowadays, coming from very different sources (for instance, images, documents, etc.). Driven by this challenge, we propose an extension of those models where we consider discrete measures, in the same framework as Broderick et al. (2017). Some related works are given, among others, by Srebro and Roweis (2005), Williamson et al. (2010) and Caron et al. (2012). We end up with a model that is very flexible, thanks to the nonparametric structure offered by completely random measures, but at the same time mathematically tractable, thanks to the conjugacy property guaranteed by the exponential family of CRMs. The main purpose here is to extend the nonparametric generalized latent trait model discussed in Section 1.3.3 to include time dependence in the underlying process. Therefore, we are going to define a model expressing formulas (1.26)-(1.27) and (1.29)-(1.30) in Chapter 1. Note that in what follows, the process X_t is nothing other than Θ_t in Section 1.3.3.
5.3.1 The model
The aim is to build a model for discrete measures evolving in time of the form (5.2), namely a time series

X_t = ∑_{k=1}^{+∞} x_{tk} δ_{ψ_k},  t = 0, 1, . . . , T,  (5.4)

where x_{tk} ∼ind h(·|J_k) and h has the exponential family form above. We are going to exploit the construction described in Section 5.1; however, the auxiliary random variable that we consider is an exponential CRM. This choice allows us to write down the posterior p(G | X_{t−1}), which is composed of two components:
– the ordinary part, a CRM whose Lévy intensity is updated as

ρ_post(s) = γ κ(0) exp{ ⟨ξ + φ(0), η(s)⟩ − (λ+1) A(s) };

– the fixed-location component, which can be written as

∑_{j=1}^{K_new} J_{new,j} δ_{ψ_{new,j}},

where K_new is the number of components of X_{t−1} that have actually been observed (the number of k such that x_{(t−1)k} > 0) and

J_{new,j} ∼ f_{new,j}(s) ∝ exp{ ⟨ξ + φ(x_{new,j}), η(s)⟩ − (λ+1) A(s) }.
We are also able to compute the transition kernel in (5.1), p(X_t | X_{t−1}), obtained by integrating out the latent random measure G: this is specified in the next proposition.
Proposition 1
The transition kernel for a sequence of discrete random measures belonging to the exponential family can be described by two parts:

1. the values of x_{tk} corresponding to the ψ_k that have been observed in X_{t−1} = ∑_k x_{(t−1)k} δ_{ψ_k} are sampled according to

h_cond(x_{tk} = x | x_{(t−1)k}) = κ(x) exp{ −B(ξ + φ(x_{(t−1)k}), λ+1) + B(ξ + φ(x_{(t−1)k}) + φ(x), λ+2) },

where x_{(t−1)k} > 0 and exp(B(a, b)) = ∫ exp( ⟨a, η(θ)⟩ − b A(θ) ) dθ;

2. for every x = 1, 2, . . ., new atoms are observed: their number is ρ^new_{t,x} ∼ Poisson(M_{t,x}), with locations ψ_{t,x,j} ∼iid P_0, j = 1, . . . , ρ^new_{t,x}. Here, M_{t,x} = γ κ(0) κ(x) exp{ B(ξ + φ(0) + φ(x), λ+2) }.

The result follows easily by applying Corollary 6.2 in Broderick et al. (2017).
Looking closely at p(X_t | X_{t−1}), it is clear that this distribution is given by two contributions: an innovation term, which consists of sampling new items ψ_k according to a thinned Poisson process, and an inserting/deleting (thinning) process, where we re-sample the value x_{tk} related to each location ψ_k that has been observed at time t−1. Thus, we can write the two contributions as follows:

X_t | X_{t−1} =^d ∑_k x^thin_{tk} δ_{ψ_{(t−1)k}} + ∑_{x≥1} ∑_{j=1}^{ρ_{t,x}} x δ_{ψ^new_{j,x}}.  (5.5)
The following proposition specifies the likelihood of the model for (X_0, X_1, . . . , X_T).

Proposition 2
The likelihood of our model is the following:

L(X_0, X_1, . . . , X_T | γ, ξ, λ, Δ) = L(X_0 | γ, ξ, λ, Δ) ∏_{t=1}^T L(X_t | X_{t−1}, γ, ξ, λ, Δ)

∝ ∏_{x≥1} L(ρ^new_{0,x} | γ, ξ, λ) ∏_{j=1}^{ρ_0} P_0(ψ_{0j} | Δ) × ∏_{t=1}^T [ ∏_{l=1}^{ρ_{t−1}} h_cond(x^thin_{tl} | x_{(t−1)l} > 0, ξ, λ) × ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ) ∏_{x≥1} L(ρ^new_{t,x} | ξ, λ, γ) ]

∝ ∏_{x≥1} ∏_{t=0}^T Poisson(ρ^new_{t,x}; M_{t,x}) × ∏_{j=1}^{ρ_0} P_0(ψ_{0j} | Δ) ∏_{t=1}^T ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ)
  × ∏_{t=1}^T ∏_{l=1}^{ρ_{t−1}} κ(x^thin_{tl}) exp{ −B(ξ + φ(x_{(t−1)l}), λ+1) + B(ξ + φ(x_{(t−1)l}) + φ(x^thin_{tl}), λ+2) }

∝ exp{ −∑_{x≥1} ∑_{t=0}^T ( M_{t,x} − ρ^new_{t,x} log(M_{t,x}) ) + ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} ( −B(ξ + φ(x_{(t−1)l}), λ+1) + B(ξ + φ(x_{(t−1)l}) + φ(x^thin_{tl}), λ+2) ) }
  × ∏_{x≥1} ∏_{t=0}^T (1 / ρ^new_{t,x}!) × ∏_{t=1}^T ∏_{l=1}^{ρ_{t−1}} κ(x^thin_{tl}) ∏_{j=1}^{ρ_0} P_0(ψ_{0j} | Δ) ∏_{t=1}^T ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ)

where Δ and (γ, ξ, λ) are the parameters of P_0 and of the Lévy density, respectively. M_{t,x} has been defined above at point 2., and it depends on γ, ξ and λ. Moreover, ρ^new_{0,x} = #{k : x_{0k} = x} and ρ^new_{t,x} = #{k : x_{tk} = x and ψ_{tk} is new}, x = 1, 2, . . ., are the numbers of items with label x observed for the first time at times 0 and t; ρ_0 = ∑_{x≥1} ρ^new_{0,x} and ρ^new_t = ∑_{x≥1} ρ^new_{t,x} are the numbers of items observed for the first time at time t = 0, 1, . . . , T. Lastly, ρ_{t−1} is the number of observed items/traits at the previous time.
As a further analysis of the proposed model, it is interesting to investigate whether an autoregressive relationship between X_t and X_{t−1} holds. In particular, an autoregressive model specifies that the mean depends linearly on previous values. Can we recover a similar relationship? By exploiting relation (5.5), we have

E(X_t(A) | X_{t−1}) = ∫ [ ∑_k x^thin_{tk} δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} ∑_{j=1}^{ρ_{t,x}} x δ_{ψ^new_{j,x}}(A) ] × L( dx^thin_{tk}, k ≥ 1, dψ^new_{j,x}, j = 1, . . . , ρ_{t,x}, dρ_{t,x}, x ≥ 1 | X_{t−1} )

= ∑_{k≥1} E( x^thin_{tk} | x_{(t−1)k} ) δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} ∑_{L=1}^{+∞} x L Poisson(L; M_{t,x}) P_0(A),

so that we end up with

E(X_t(A) | X_{t−1}) = ∑_{k≥1} E( x^thin_{tk} | x_{(t−1)k} ) δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} x M_{t,x} P_0(A).  (5.6)
We are going to specify this relationship for the three special cases below.
Other quantities that may be useful for interpreting and fixing the hyperparameters of the exponential CRM, namely (ξ, λ, γ), are the expected value of the number of features associated with a (strictly) positive weight and the expected value of the total mass of the CRM.
Proposition 3
Let X be a completely random measure defined as in (5.4): then, the expected value of the number of features associated with a positive weight is

E( ∑_{k=1}^∞ I(x_k > 0) ) = ∑_{x=1}^∞ γ κ(x) exp( B(ξ + φ(x), λ+1) )  (5.7)

and the expected value of the total mass is

E( X(Ψ) ) = ∑_{x=1}^∞ γ x κ(x) exp( B(ξ + φ(x), λ+1) ).  (5.8)
Proof. Formula (5.7) can be computed by using the law of total expectation and the size-biased representation of a CRM of Corollary 5.2 in Broderick et al. (2017) with m = 1:

E( ∑_{k=1}^∞ I(x_k > 0) ) = E_G[ E_X( ∑_{i≥1} ∑_{j=1}^{ρ_i} I(x_{i,j} > 0) | G = ∑_{i≥1} ∑_{j=1}^{ρ_i} J_{i,j} δ_{ψ_{i,j}} ) ] = E_G( ∑_{i≥1} ρ_i ) = ∑_{i=1}^∞ γ κ(i) exp( B(ξ + φ(i), λ+1) ),

since ρ_i ∼ Poisson(M_i) and M_i = γ κ(i) exp( B(ξ + φ(i), λ+1) ) (see formula (28) in Broderick et al. (2017)); moreover, the size-biased representation allows us to reorder the atoms so that the weights x_k appear in increasing order, i.e. we have ρ_1 weights taking value 1, ρ_2 weights taking value 2, etc. This helps in the computation. Formula (5.8) is obtained with a similar reasoning, as follows:

E( X(Ψ) ) = E( ∑_{k=1}^∞ x_k ) = E_G[ E_X( ∑_{i≥1} ∑_{j=1}^{ρ_i} x_{i,j} | G = ∑_{i≥1} ∑_{j=1}^{ρ_i} J_{i,j} δ_{ψ_{i,j}} ) ] = E_G( ∑_{i≥1} i ρ_i ) = ∑_{i=1}^∞ i M_i = ∑_{i=1}^∞ γ i κ(i) exp( B(ξ + φ(i), λ+1) ).
The two formulas above are specified for the Poisson-Gamma case in Section 5.3.3. Before looking closely at three special cases, a remark on the growth of the total number of traits is due. We saw that, at each time instant, some new features appear, generated from the base distribution P_0: denote by K_T the total number of features appeared in the process up to time T. Then, the growth rate of K_T is linear. This is due to stationarity, since

ρ^new_t ∼ Poisson( ∑_{x≥1} M_{t,x} ),

where M_{t,x} = γ κ(0) κ(x) exp( B(ξ + φ(0) + φ(x), λ+2) ) does not depend on t. Thus

E(K_T) = ∑_{t=1}^T ∑_{x≥1} M_{t,x} = γ κ(0) T ∑_{x≥1} κ(x) exp( B(ξ + φ(0) + φ(x), λ+2) )

grows linearly with T.
5.3.2 Example 1: Beta - Bernoulli
As prior for G, consider the well-known Beta process, rst introduced by Hjort
(1990), corresponding to the following choice of Lévy intensity,
ρ(s) = γs−1(1− s)c−1, γ > 0, c > 0
and a Bernoulli process likelihood (Thibaux and Jordan (2007)), where
xtk|Jkind∼ Be(Jk)
134
so that the jumps xtk ∈ 0, 1.Regarding the predictive law, simple calculations lead to the following results
for the conditional distribution of xtk given x(t−1)k = 1,
xtk|x(t−1)k = 1 ∼ Be(
1
c+ 1
)and
ρnewt ∼ Poisson(
γ
c+ 1
).
Equation (5.6) turns out to be
E (Xt(A)|Xt−1) =1
c+ 1Xt−1(A) +
γ
c+ 1P0
which is of AR(1) type.
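These two ingredients (Bernoulli thinning with survival probability 1/(c+1) and Poisson(γ/(c+1)) innovations) are easy to simulate forward. The helper below is our own sketch; it starts from the empty measure, so early iterations are burn-in, after which the number of active atoms fluctuates around the fixed point γ/c of the recursion m = m/(c+1) + γ/(c+1).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_beta_bernoulli(T, gamma=5.0, c=0.5):
    """Forward simulation of the Beta-Bernoulli process of Section 5.3.2:
    an atom active at time t-1 survives with probability 1/(c+1)
    (thinning), and Poisson(gamma/(c+1)) brand-new atoms appear at each
    step (innovation). Returns the number of active atoms at each time."""
    p_keep = 1.0 / (c + 1.0)
    rate_new = gamma / (c + 1.0)
    active, next_label = set(), 0
    counts = np.empty(T, dtype=int)
    for t in range(T):
        active = {k for k in active if rng.random() < p_keep}  # thinning
        n_new = rng.poisson(rate_new)                          # innovation
        active |= set(range(next_label, next_label + n_new))
        next_label += n_new
        counts[t] = len(active)
    return counts

counts = simulate_beta_bernoulli(20_000)
# after burn-in the count fluctuates around gamma / c
```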
Moreover, the likelihood is the following:

L(X_0, X_1, . . . , X_T | γ, c, Δ) = Poisson( ρ_0; γ/(c+1) ) ∏_{j=1}^{ρ_0} P_0(ψ_{0,j} | Δ)
  × ∏_{t=1}^T [ ∏_{l=1}^{ρ_{t−1}} Be( x^thin_{tl}; 1/(c+1) ) ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ) Poisson( ρ^new_t | γ/(c+1) ) ]

∝ c^{N_T − S_T} (c+1)^{−N_T − N^new_T} γ^{N^new_T} exp{ −γ(T+1)/(c+1) } ∏_{t=0}^T ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ),

where N_T = ∑_{t=1}^T ρ_{t−1} (total number of observed traits), S_T = ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl} (total number of items that survived the thinning) and N^new_T = ∑_{t=0}^T ρ^new_t.
Figure 5.1 shows some simulated data, where the number of time steps is 6 and the values of γ and c vary. In particular, γ acts as a mass parameter: increasing γ leads to more atoms from the innovation part. On the other hand, if the parameter c increases, the items are less persistent, since the probability of observing them again is smaller.
Now, suppose we assign two independent gamma priors to the parameters γ and c, γ ∼ gamma(a, b) and c ∼ gamma(s, r). In this case, the full-conditional for γ is

γ | data, c, Δ ∼ gamma( a + N^new_T, b + (T+1)/(c+1) )

and the full-conditional for c is proportional to

c^{N_T − S_T + s − 1} (c+1)^{−N^new_T − N_T} exp{ −γ(T+1)/(c+1) − rc } I(c > 0),

so that a Metropolis-Hastings step is needed.
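For the c update, a random-walk Metropolis-Hastings step on log c is one simple option (our choice here; any valid proposal would do). The sufficient-statistic values below are hypothetical toy numbers, only meant to show the shape of the update.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_fc_c(c, gamma, N_T, S_T, N_new_T, T, s=1.0, r=1.0):
    """Log full-conditional of c (up to an additive constant)."""
    if c <= 0:
        return -np.inf
    return ((N_T - S_T + s - 1) * np.log(c)
            - (N_new_T + N_T) * np.log(c + 1.0)
            - gamma * (T + 1) / (c + 1.0)
            - r * c)

def mh_step_c(c, gamma, N_T, S_T, N_new_T, T, step=0.3):
    """Random-walk MH on log(c); the log transform requires the Jacobian
    correction log(c_prop) - log(c) in the acceptance ratio."""
    c_prop = c * np.exp(step * rng.normal())
    log_acc = (log_fc_c(c_prop, gamma, N_T, S_T, N_new_T, T)
               - log_fc_c(c, gamma, N_T, S_T, N_new_T, T)
               + np.log(c_prop) - np.log(c))
    return c_prop if np.log(rng.random()) < log_acc else c

# hypothetical sufficient statistics, for illustration only
c, draws = 1.0, []
for _ in range(5_000):
    c = mh_step_c(c, gamma=3.0, N_T=50, S_T=25, N_new_T=30, T=100)
    draws.append(c)
```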
Figure 5.1: Simulated data, where T = 6, and (γ, c) = (1, 0.1) (left column), (γ, c) = (5, 0.1) (right column).
Figure 5.2: Posterior distribution for c in the two settings: (a, b, s, r) = (3, 1, 1, 1) (left) and (a, b, s, r) = (3, 1, 0.1, 0.05) (right). The vertical line represents the true value of the parameter c.
As a toy example, we simulated from this process a time series with T = 100 time instants, γ = 3 and c = 1. Two different sets of priors have been chosen: first, we assigned (a, b, s, r) = (3, 1, 1, 1), and second, (a, b, s, r) = (3, 1, 0.1, 0.05), so that the a-priori variance for c is first small (1) and then larger (40). After running 10000 iterations of the Gibbs sampler, we obtained the posterior distributions for c shown in Figure 5.2. In both cases, the posterior is concentrated around the exact value. Figure 5.3 shows the predictive distribution for some quantities of interest: the number of new items observed at time t+1, L(ρ^new_{t+1} | data) (left), and the posterior probability that an item is observed s times consecutively, s = 1, 2, . . . (right). The latter quantity, conditionally on c, is given by (1/(c+1))^s.
5.3.3 Example 2: Poisson - Gamma
Consider now another pair of conjugate processes in the exponential CRM family, namely the Poisson likelihood and the Gamma process. Suppose the weight x_k at location ψ_k has support on ℕ and a Poisson density with parameter J_k ∈ ℝ+:

h(x | J_k) = (1/x!) J_k^x e^{−J_k} = (1/x!) e^{x log(J_k) − J_k},

so that

κ(x) = 1/x!,  φ(x) = x,  η(s) = log(s),  A(s) = s.

The conjugate process in this case is the so-called generalized Gamma process, where

ρ(s) = γ s^ξ e^{−λs},  γ > 0, ξ ∈ (−2, −1], λ > 0.
Figure 5.3: Predictive distribution for the number of new items observed at time t+1, L(ρ^new_{t+1} | data) (left), and posterior probability that an item is observed s times consecutively, where s = 1, 2, . . . (right).
From now on, set ξ = −1, as usual in the literature. Broderick et al. (2017) first established the conjugacy of the Poisson-Gamma processes. From an applied viewpoint, this model recently emerged in the literature as a prior in Bayesian nonparametric learning scenarios; in particular, a Poisson-Gamma process may be employed when we assume multiple latent features associated with the observations, where each feature can have multiple occurrences within each data point. See, for instance, Titsias (2008) and Roychowdhury and Kulis (2015). In the latter work, the authors propose a variational algorithm for inference under models involving Gamma processes and derive an error bound for that approximation; the model is then applied to the problem of learning latent topics in document corpora. In Titsias (2008), the issue of learning visual object recognition systems from unlabelled images is investigated.
In our model, we have

h_cond( x_{tk} = x | x_{(t−1)k} = y ) = (x+y−1 choose x) (1 − 1/(λ+2))^y (1/(λ+2))^x,

namely x_{tk} | x_{(t−1)k} ∼ NegBin( x_{(t−1)k}; 1/(λ+2) ), and

ρ^new_{t,x} ∼ Poisson( (γ/x) (λ+2)^{−x} ),  x = 1, 2, . . . .
Figure 5.4 shows two simulated datasets, where the mass parameter γ is fixed to 10 and λ takes values 0.1 (left panel) and 2.5 (right). It is clear that λ controls the thinning part, since a larger value of λ implies fewer repetitions and, in general, smaller values of the degrees.
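One transition of this Poisson-Gamma process is straightforward to simulate. The helper below is our own sketch; note that NumPy's `negative_binomial(n, p)` counts failures before `n` successes with success probability `p`, so the NegBin(y; 1/(λ+2)) of the text corresponds to `negative_binomial(y, (λ+1)/(λ+2))`. The innovation rates (γ/x)(λ+2)^{−x} decay geometrically in x, so truncating at a moderate x_max is harmless.

```python
import numpy as np

rng = np.random.default_rng(3)

def step_poisson_gamma(x_prev, gamma=10.0, lam=2.5, x_max=50):
    """One transition X_{t-1} -> X_t of the Poisson-Gamma process.
    x_prev: dict {atom_label: degree > 0}. Surviving degrees follow
    NegBin(x_{(t-1)k}; 1/(lam+2)); for each x >= 1, a Poisson number of
    new atoms of degree x appears with rate gamma/x * (lam+2)**(-x)."""
    q = 1.0 / (lam + 2.0)                      # NegBin parameter of the text
    state = {}
    for k, y in x_prev.items():                # thinning step
        x = rng.negative_binomial(y, 1.0 - q)
        if x > 0:
            state[k] = x
    label = max(x_prev, default=-1) + 1
    for x in range(1, x_max + 1):              # innovation step
        for _ in range(rng.poisson(gamma / x * (lam + 2.0) ** (-x))):
            state[label] = x
            label += 1
    return state

# conditional mean of the total mass matches (5.6):
# E[X_t(Psi) | X_{t-1}] = X_{t-1}(Psi)/(lam+1) + gamma/(lam+1)
masses = [sum(step_poisson_gamma({0: 5}, gamma=4.0, lam=1.0).values())
          for _ in range(20_000)]
```

With x_{t−1} = 5, γ = 4 and λ = 1 the conditional mean is 5/2 + 4/2 = 4.5, which the Monte Carlo average reproduces.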
Figure 5.4: Simulated data, where T = 6, and (γ, λ) = (10, 0.1) (left column), (γ, λ) = (10, 2.5) (right column).
Relation (5.6) is, in this case,

E(X_t(A) | X_{t−1}) = ∑_{k≥1} (1/(λ+2)) (1 − 1/(λ+2))^{−1} x_{(t−1)k} δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} x (γ/x) (λ+2)^{−x} P_0(A)
  = (1/(λ+1)) X_{t−1}(A) + (γ/(λ+1)) P_0(A),

which is of AR(1) type.
The likelihood is

L(X_0, X_1, . . . , X_T | γ, λ, Δ) = ∏_{x≥1} Poisson( ρ^new_{0,x} | γ, λ ) × ∏_{t=1}^T [ ∏_{x≥1} Poisson( ρ^new_{t,x} | γ, λ ) × ∏_{l=1}^{ρ_{t−1}} NegBin( x^thin_{tl} | x_{(t−1)l} > 0, 1/(λ+2) ) ] × ∏_{t=0}^T ∏_{j=1}^{ρ_t} P_0(ψ_{tj} | Δ)

∝ exp( −γ(T+1) log((λ+2)/(λ+1)) ) γ^{∑_{x≥1} ∑_{t=0}^T ρ^new_{t,x}} (λ+2)^{−∑_{x≥1} x ∑_{t=0}^T ρ^new_{t,x}} × (λ+1)^{N_T} (λ+2)^{−N_T − S_T},

where N_T = ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x_{(t−1)l} and S_T = ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl}. As prior on (λ, γ), we assume

(λ, γ) ∼ gamma(s, r) × gamma(a, b).
In the recent literature there are works on dynamic modeling of count matrices that can be compared to our proposal, among others Acharya et al. (2015) (gamma process dynamic Poisson factor analysis) and Han et al. (2014) (dynamic rank factor model).
We conclude with the computation of formulas (5.7) and (5.8) in this case:

E( ∑_k I(x_k > 0) ) = γ ∑_{i=1}^∞ Γ(ξ+1+i) / ( i! (λ+1)^{ξ+1+i} ) = γ (Γ(ξ+2)/(ξ+1)) ( λ^{−(ξ+1)} − (λ+1)^{−(ξ+1)} ),

which for ξ = −1 reduces to γ log((λ+1)/λ), and

E( X(Ψ) ) = E( ∑_k x_k ) = γ Γ(ξ+2) / λ^{ξ+2},

which for ξ = −1 reduces to γ/λ.
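Both closed forms can be checked numerically: for ξ = −1 the Poisson rate of the number of atoms with weight x is M_x = γ κ(x) exp(B(ξ + φ(x), λ+1)) = γ/(x(λ+1)^x), and summing M_x and x M_x reproduces γ log((λ+1)/λ) and γ/λ.

```python
import math

gamma_, lam = 3.0, 2.0
# Poisson rates of the number of atoms with weight x, for xi = -1
M = {x: gamma_ / (x * (lam + 1.0) ** x) for x in range(1, 200)}

n_active = sum(M.values())                     # E[# atoms with x_k > 0]
total_mass = sum(x * m for x, m in M.items())  # E[X(Psi)]

assert abs(n_active - gamma_ * math.log((lam + 1.0) / lam)) < 1e-10
assert abs(total_mass - gamma_ / lam) < 1e-10
```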
5.3.4 Example 3: Beta prime - Odds Bernoulli
The last example regards another process where x_k ∈ {0, 1}, called the Odds Bernoulli process, introduced in Broderick et al. (2017). In this case, the probability mass function of x_k is

h(x | J_k) = J_k^x (1 + J_k)^{−1} = exp( x log(J_k) − log(1 + J_k) ),

that is, if J = p/(1 − p), where p is the probability of a successful Bernoulli draw, J can be seen as an odds ratio. Then,

κ(x) = 1,  φ(x) = x,  η(s) = log(s),  A(s) = log(1 + s).

The conjugate process is the so-called Beta prime process, with

ρ(s) = γ s^ξ (1 + s)^{−λ},  s > 0, γ > 0, ξ ∈ (−2, −1], λ > ξ + 1.

Simple calculations lead to

exp(−B(a, b)) = Γ(b) / ( Γ(a+1) Γ(b−a−1) ),

so that the conditional law is

x_{tk} | x_{(t−1)k} = 1 ∼ Be( (ξ+2)/(λ+1) )

and

ρ^new_t ∼ Poisson( γ Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) ).
Moreover, relation (5.6) is, in this case,

E(X_t(A) | X_{t−1}) = ((ξ+2)/(λ+1)) X_{t−1}(A) + γ ( Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) ) P_0(A),

which is again of AR(1) type.
We conclude by specifying the likelihood:

L(X_0, X_1, . . . , X_T | γ, ξ, λ, Δ) ∝ exp{ −γ ( Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) ) (T+1) } ( γ Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) )^{N_T}
  × (λ+1)^{−S_T} (ξ+2)^{∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl}} (λ−ξ−1)^{S_T − ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl}},

where S_T = ∑_{t=1}^T ρ_{t−1} and N_T = ∑_{t=0}^T ρ^new_t (with ρ^new_0 = ρ_0).
5.4 Application: latent feature model on a synthetic
dataset of images
The Indian Buffet process (Griffiths and Ghahramani (2011)) and its dependent extensions (Williamson et al. (2010) and Miller et al. (2012), among others) have been used in models for unsupervised learning in which a linear Gaussian latent feature model is employed to investigate the hidden binary features. In particular,
suppose that each datum x_t is generated from a Gaussian distribution with mean z_t A, where A is a feature matrix, so that x_t = z_t A + noise. We refer to the model described in Section 5 of Griffiths and Ghahramani (2011), but a similar example can be found in Ruiz et al. (2014). Consider a simulated dataset consisting of an 8 × 8 image evolving over T time steps. The features that generated our data are represented in Figure 5.5.
Figure 5.5: Features.
The noise is Gaussian with mean zero and variance σ²_X, to be specified later. Note that the features, namely the rows of the matrix A, are the location points of the measure in our representation: a priori, they are Gaussian distributed too, with zero mean and variance given by σ²_A I.
The goal is to build an MCMC algorithm to sample from

L(A, z_1, z_2, . . . , z_T, γ, c, σ²_X, σ²_A | x^(1), x^(2), . . . , x^(T)),

where x^(t), t = 1, . . . , T, are vectors of dimension 64 representing the images; we wish to recover the underlying features of Figure 5.5. The complete model is the following:

L(X | Z, σ²_X) = ∏_{t=1}^T N_D( x^(t) | z^(t) A, σ²_X I )
A ∼ N_{K+ × D}( 0, σ²_A )
σ²_X ∼ invgamma( α_X, β_X )
σ²_A ∼ invgamma( α_A, β_A )
L( z^(1), z^(2), . . . , z^(T) | γ, c ) = L( z^(1) | γ, c ) ∏_{t=2}^T L( z^(t) | z^(t−1), γ, c )
(γ, c) ∼ gamma(a, b) × gamma(s, r)
where K+ is the number of columns of Z = (z_1, z_2, . . . , z_T)^T with sum greater than 0, and D is the dimension of each datum, D = 64 in this case. The prior for the feature matrix A is the Gaussian distribution for matrices of dimension K+ × D: each column has mean 0 and variance-covariance matrix σ²_A I. For the law of the binary matrices assigning the features to data, Z = (z_1, . . . , z_T), see the example in Section 5.3.2.

Note that, in CRM notation, the z_t are the jumps of the measure, and the features a_1, a_2, . . ., namely the rows of A, are the traits. Thus, one could define the underlying time-dependent CRMs of our model as

Y_t = ∑_{k≥1} z_{tk} δ_{a_k}.

The model described above is entailed in the general class of models in Chapter 1, formulas (1.28)-(1.30). There, the kernel K is given by the multivariate Gaussian distribution, the link function is the identity and the base measure P_0 is again multivariate Gaussian. The generalized Indian Buffet (GIB) process has been replaced by the autoregressive lag-1 type process described in Section 5.3.2.
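To illustrate, data from such a linear Gaussian latent feature model can be generated as below. This is our own sketch with K truncated to four fixed features and an ad-hoc re-activation probability standing in for the unbounded innovation part; in the actual model K+ is random and may grow.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in for the synthetic setup of Section 5.4:
# D = 64 pixels (8 x 8 images), K = 4 binary features, x_t = z_t A + noise.
D, K, T = 64, 4, 50
sigma_X, sigma_A = 0.5, 1.0
c = 0.1
p_keep = 1.0 / (c + 1.0)      # persistence probability from Section 5.3.2

A = rng.normal(0.0, sigma_A, size=(K, D))   # rows are feature images
Z = np.zeros((T, K), dtype=int)
Z[0] = rng.integers(0, 2, size=K)
for t in range(1, T):
    survive = rng.random(K) < p_keep        # thinning of active features
    activate = rng.random(K) < 0.1          # ad-hoc innovation for fixed K
    Z[t] = np.where(Z[t - 1] == 1, survive, activate).astype(int)

X = Z @ A + rng.normal(0.0, sigma_X, size=(T, D))  # observed images
```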
5.4.1 Devising a particle Gibbs sampler
Sequential Monte Carlo (SMC) methods are a popular class of algorithms employed for sampling from general high-dimensional probability distributions; they proved to be very efficient when dealing with state-space (or hidden Markov) models, which is in fact our case (see Doucet and Johansen (2009)). Our goal is to sample from a distribution of the form p(θ, A, z_1, . . . , z_T | x_1, . . . , x_T), where the hidden Markov state process is defined on the (ideally) infinite-dimensional vector of binary variables representing the presence or absence of features at a certain time. Moreover, θ = (σ²_X, σ²_A, c, γ) here. We employ the particle Gibbs sampler of Andrieu et al. (2010) to reach this goal. In that work, the authors proposed a valid particle approximation to the Gibbs sampler, where a conditional SMC step is required.
In a nutshell, the algorithm alternates two steps: the update of the static parameters θ through their full-conditionals, θ ∼ p(θ | Z, x_1, . . . , x_T, A) (using Metropolis-Hastings steps when needed), and a run of a conditional SMC algorithm whose target is p_θ(z_1, . . . , z_T, A | x_1, . . . , x_T), conditional on the previously drawn θ and its ancestral lineage. After this step, we have N particle/weight couples (Z^(i), A^(i), w_i), i = 1, . . . , N, that approximate the target distribution, where Z^(i) is a sequence of T binary vectors and A^(i) is a K+(i) × D matrix whose rows contain the features. Sampling from this approximation simply requires drawing an index n from a discrete distribution with weights (w_1, . . . , w_N) and taking the corresponding particle (Z^(n), A^(n)). In the following, we specify the steps of the algorithm that are related to our specific model; for an overview of the method, see Andrieu et al. (2010).
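The distinguishing ingredient of the conditional SMC step is the resampling move in which the retained particle is forced to survive. A minimal multinomial version (our sketch; Andrieu et al. (2010) allow other resampling schemes) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(5)

def conditional_resample(weights, b, rng=rng):
    """Multinomial resampling for a conditional SMC step: every particle
    draws its ancestor from Discrete(weights), except particle b, which
    keeps itself so that the prespecified path survives resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    ancestors = rng.choice(len(w), size=len(w), p=w)
    ancestors[b] = b          # the retained path keeps its own ancestry
    return ancestors

anc = conditional_resample([0.1, 0.5, 0.2, 0.2], b=2)
```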
Before that, we need a sequential representation of the feature matrix $A$. Recall that our target is $\mathcal{L}(A, z_1, \ldots, z_T, \theta \mid x_1, \ldots, x_T)$, which is proportional to the joint law.
A convenient way of expressing this joint law is the following, where each line highlights the dependence of the parameters on a certain time (note that $\theta$ is considered fixed):
$$
\begin{aligned}
t = 1 &: \; \mathcal{L}(z_1)\,\mathcal{L}(A^1 \mid z_1, \sigma^2_A)\,\mathcal{N}(x_1 \mid z_1 A^1, \sigma^2_X I)\\
t = 2 &: \; \mathcal{L}(z_2 \mid z_1)\,\mathcal{L}(A^2 \mid A^1, z_{1:2}, \sigma^2_A)\,\mathcal{N}(x_2 \mid z_2 A^{1:2}, \sigma^2_X I)\\
&\;\;\vdots\\
t = T &: \; \mathcal{L}(z_T \mid z_{T-1})\,\mathcal{L}(A^T \mid A^{1:(T-1)}, z_{1:T}, \sigma^2_A)\,\mathcal{N}(x_T \mid z_T A^{1:T}, \sigma^2_X I)
\end{aligned}
$$
where $A^t$ is the submatrix of $A$ containing $K^+_t$ draws from the prior, $K^+_t$ being the number of new features discovered at time $t$, i.e. appearing for the first time at time $t$. Thus $A^{1:t}$ is the union of all the features observed up to time $t$, namely the union of the rows of $A^1, \ldots, A^t$. Therefore, thanks to the specific choice of prior, $\mathcal{L}(A^t \mid A^{1:(t-1)}, z_{1:T}, \sigma^2_A)$ can be written as $\prod_{d=1}^D \mathcal{N}(a^t_d \mid 0, \sigma^2_A I)$, where $a^t_d$ is the $d$-th column of the matrix $A^t$, of length $K^+_t$. The total number of features is simply $K^+ = \sum_{t=1}^T K^+_t$.
This representation helps us in devising a particle Gibbs sampler. We give details on the conditional SMC step: this update is similar to a standard SMC algorithm, but it ensures that a prespecified path $(Z, A)$ with ancestral lineage $B_{1:T}$ survives all the resampling steps, whereas the remaining $N-1$ particles are generated according to a proposal distribution which is hopefully close to the target:
Time $n = 1$

a. For $i = 1, 2, \ldots, N$, $i \neq B_1$, sample
$$
(z_1^{(i)}, A^{1,(i)}) \sim q_\theta(z_1^{(i)}, A^{1,(i)}) = \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A, \sigma^2_X, X_1)\, \pi(z_1^{(i)} \mid \gamma, c).
$$

b. Compute the unnormalized weight of each particle $i = 1, 2, \ldots, N$:
$$
w_1^{(i)} = \frac{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)\, \pi(z_1^{(i)} \mid \gamma, c)}{\mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A, \sigma^2_X, X_1)}.
$$
Time $n \in \{2, \ldots, T\}$

a. For $i \neq B_n$, resample the particles according to a discrete distribution with weights proportional to $w_{n-1}^{(1)}, w_{n-1}^{(2)}, \ldots, w_{n-1}^{(N)}$, denoted $\mathcal{D}(w_{n-1})$. Define $\xi_{n-1}(i) \sim \mathcal{D}(w_{n-1})$ and update the path of the $i$-th particle, $\xi(i) = (\xi_1(i), \ldots, \xi_{n-1}(i))$.

b. For $i = 1, 2, \ldots, N$, $i \neq B_n$, sample
$$
(z_n^{(i)}, A^{n,(i)}) \sim q_\theta(z_n^{(i)}, A^{n,(i)}) = \mathcal{L}(A^{n,(i)} \mid z_{1:n}^{(i \cup \xi(i))}, A^{1:(n-1),(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n}) \times \pi(z_n^{(i)} \mid z_{n-1}^{(\xi_{n-1}(i))}, \gamma, c).
$$

c. Compute the unnormalized weight of each particle $i = 1, 2, \ldots, N$:
$$
w_n^{(i)} = \frac{\mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i \cup \xi(i))}, \sigma^2_X I)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i \cup \xi(i))}, \sigma^2_A)}{\mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n})}
$$

Here $q_\theta$ is the proposal distribution for the particles, consisting of drawing $z_n$ from its prior distribution (namely applying thinning to the parent particle, plus innovation) and sampling the $K^+_n$ new features of $A$, i.e. $A^n$, from their conditional distribution.
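The resampling in step a., which must leave the conditioned particle untouched, can be sketched as follows. This is illustrative Python (not the thesis implementation), with `b` playing the role of the ancestor index $B_n$ of the reference path:

```python
import random

def conditional_resample(weights, b, rng=random):
    """Draw ancestor indices for N particles; the reference particle keeps itself.

    Returns a list `anc` with anc[b] == b, the other entries drawn from the
    discrete distribution D(w) with probabilities proportional to `weights`.
    """
    N = len(weights)
    total = sum(weights)
    probs = [w / total for w in weights]
    anc = rng.choices(range(N), weights=probs, k=N)
    anc[b] = b  # the conditioned path survives every resampling step
    return anc
```

Forcing `anc[b] = b` is exactly what distinguishes the conditional SMC step from a standard SMC resampling step.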
The distribution $\mathcal{L}(A^{n,(i)} \mid z_{1:n}^{(i \cup \xi(i))}, A^{1:(n-1),(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n})$ can be easily computed by observing that the full conditional of the complete matrix $A^{1:n}$ is
$$
\mathcal{L}(A^{1:n} \mid z_{1:n}, x_{1:n}) = \prod_{d=1}^D \mathcal{N}(A^{1:n}_d \mid \mu^d, \Sigma)
$$
where $A^{1:n}_d$ is the $d$-th column of the matrix, $\Sigma$ is the variance-covariance matrix with inverse given by
$$
\Sigma^{-1} = \frac{1}{\sigma^2_X} \sum_{t=1}^n z_t z_t^T + \frac{1}{\sigma^2_A} I
$$
and the mean is
$$
\mu^d = \Sigma \times \sum_{t=1}^n \frac{x_{td}}{\sigma^2_X}\, z_t.
$$
The properties of Gaussian vectors then guarantee that
$$
\mathcal{L}(A^{n,(i)} \mid z_{1:n}^{(i \cup \xi(i))}, A^{1:(n-1),(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n}) = \prod_{d=1}^D \mathcal{N}(A^n_d \mid \mu^d_n, \Sigma_n),
$$
namely we sample the $D$ columns (of length $K^+_n$) independently from a Gaussian distribution with variance-covariance matrix
$$
\Sigma_n = \Sigma_+ - \Lambda^T \bar{\Sigma}^{-1} \Lambda
$$
and mean
$$
\mu^d_n = \mu^d_2 + \Lambda^T \bar{\Sigma}^{-1} (a^d - \mu^d_1),
$$
where $\Sigma_+$, $\bar{\Sigma}$ and $\Lambda$ are submatrices of $\Sigma$, defined according to Figure 5.6.

Figure 5.6: Decomposition of matrix Σ into submatrices.

Here $\bar{\Sigma}$ has dimensions $K^+_{1:(n-1)} \times K^+_{1:(n-1)}$, $\Lambda$ has dimensions $K^+_{1:(n-1)} \times K^+_n$ and $\Sigma_+$ has dimensions $K^+_n \times K^+_n$, where $K^+_{1:n}$ stands for $\sum_{t=1}^n K^+_t$, the number of features appeared up to time $n$. Moreover, $\mu^d_1$ is the subvector of $\mu^d$ containing the first $K^+_{1:(n-1)}$ elements and $\mu^d_2$ is the subvector of $\mu^d$ containing the last $K^+_n$ elements. Finally, $a^d$ is the subvector of $A^{1:n}_d$ containing the first $K^+_{1:(n-1)}$ elements.
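The conditional mean and covariance above are the standard Gaussian conditioning (Schur complement) formulas. A small illustrative sketch, with our own function name and numpy standing in for the linear algebra:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, a_obs, k_old):
    """Condition N(mu, Sigma) on its first k_old coordinates being a_obs.

    With Sigma partitioned as [[S_bar, Lam], [Lam.T, S_plus]] (as in
    Figure 5.6), the remaining coordinates are Gaussian with
        mean = mu2 + Lam.T @ inv(S_bar) @ (a_obs - mu1)
        cov  = S_plus - Lam.T @ inv(S_bar) @ Lam   (Schur complement).
    """
    S_bar, Lam = Sigma[:k_old, :k_old], Sigma[:k_old, k_old:]
    S_plus = Sigma[k_old:, k_old:]
    mu1, mu2 = mu[:k_old], mu[k_old:]
    W = Lam.T @ np.linalg.inv(S_bar)
    return mu2 + W @ (a_obs - mu1), S_plus - W @ Lam
```

In the sampler, `mu` and `Sigma` would be $\mu^d$ and $\Sigma$, `a_obs` the already-observed block $a^d$, and `k_old` the number $K^+_{1:(n-1)}$ of old features.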
Note that the proposal requires sampling $z_t$ from the prior: we simply sample $z^{(t)}_{\text{prop},k} \sim \text{Bernoulli}(1/(c+1))$ for those $k$ such that $z^{(t-1)}_k = 1$, and then add $\rho^{(t)}_{\text{new}}$ new components to the vector, where $\rho^{(t)}_{\text{new}} \sim \text{Poisson}(\gamma/(c+1))$.
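This prior draw (thinning plus Poisson innovation) can be sketched as follows. The names `poisson_draw` and `propose_z` are illustrative, not from the thesis, and the Poisson sampler is a basic Knuth-style implementation suitable only for small rates:

```python
import math
import random

def poisson_draw(lam, rng=random):
    """Knuth's multiplication method for a Poisson(lam) draw (small lam only)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def propose_z(z_prev, c, gamma_, rng=random):
    """Prior draw of z_t given z_{t-1}: a feature active at t-1 survives with
    probability 1/(c+1) (thinning); dead features stay dead; then
    rho_new ~ Poisson(gamma/(c+1)) brand-new features are appended."""
    p_keep = 1.0 / (c + 1.0)
    kept = [1 if (zk == 1 and rng.random() < p_keep) else 0 for zk in z_prev]
    rho_new = poisson_draw(gamma_ / (c + 1.0), rng)
    return kept + [1] * rho_new
```

Note that entries with $z^{(t-1)}_k = 0$ remain 0, which is exactly the no-reappearance property discussed in Section 5.6.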
As far as the static parameters $\theta$ are concerned, the full conditionals are as follows:
$$
\sigma^2_A \mid \text{rest} \sim \text{inv-gamma}\Big(\alpha_A + \frac{K^+ D}{2},\; \beta_A + \frac{1}{2} \sum_{d=1}^D \|a_d\|^2\Big)
$$
$$
\sigma^2_X \mid \text{rest} \sim \text{inv-gamma}\Big(\alpha_X + \frac{TD}{2},\; \beta_X + \frac{1}{2} \sum_{t=1}^T \|x_t - z_t A\|^2\Big)
$$
$$
\gamma \mid \text{rest} \sim \text{gamma}\Big(a_\gamma + N^{\text{new}}_T,\; b_\gamma + \frac{T}{c+1}\Big), \quad \text{where } N^{\text{new}}_T = \sum_{t=1}^T \rho^{\text{new}}_t;
$$
$$
\mathcal{L}(c \mid \text{rest}) \propto c^{N_T - S_T + a_c - 1} (c+1)^{-N^{\text{new}}_T - N_T} \exp\Big(-\frac{\gamma T}{c+1} - b_c c\Big)
$$
with $N_T = \sum_{t=2}^T \rho_{t-1}$ and $S_T = \sum_{t=2}^T \sum_{l=1}^{\rho_{t-1}} x^{\text{thin}}_{tl}$.
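For illustration, the two inverse-gamma updates can be sketched with Python's standard library. The function names are ours (not the thesis implementation, which is in Rcpp); note that `random.gammavariate` is parameterized by shape and scale, hence the `1/rate`:

```python
import random

def sample_inv_gamma(shape, rate, rng=random):
    """Inverse-gamma(shape, rate) draw as 1 / Gamma(shape, rate)."""
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

def sample_sigma2_A(alpha_A, beta_A, A_cols, D, K_plus, rng=random):
    """Draw sigma^2_A from its inverse-gamma full conditional above;
    A_cols lists the D columns a_d of the feature matrix."""
    shape = alpha_A + K_plus * D / 2.0
    rate = beta_A + 0.5 * sum(sum(a * a for a in col) for col in A_cols)
    return sample_inv_gamma(shape, rate, rng)
```

The $\sigma^2_X$ update is identical with shape $\alpha_X + TD/2$ and rate $\beta_X + \tfrac12 \sum_t \|x_t - z_t A\|^2$; the $c$ update, whose density is not of standard form, would instead use a Metropolis-Hastings step.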
The algorithm has been implemented in Rcpp (see Eddelbuettel et al., 2011), a language extension of R that allows us to combine C++ and R. In this way we are able to improve the performance of the algorithm by rewriting key functions in C++.
The following proposition establishes that the weights do not depend on the features, namely the images:

Proposition 4
The weights of the particle Gibbs sampler do not depend on the features.
Proof.
A convenient characteristic of the particle filter is that the weights $w_n^{(i)}$ do not depend on the new features that appear at time $n$, $n = 1, \ldots, T$. Indeed, for every $n \geq 1$, applying Bayes' theorem to the full conditional of $A^{n,(i)}$ in the denominator,
$$
\begin{aligned}
w_n^{(i)} &= \frac{\mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i\cup\xi(i))}, \sigma^2_X I)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A)}{\mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n})}\\[4pt]
&= \mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i\cup\xi(i))}, \sigma^2_X I)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A)\\
&\quad \times \left[ \frac{\mathcal{L}(X_{1:n} \mid A^{1:n,(i\cup\xi(i))}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A)}{\mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)} \right]^{-1}\\[4pt]
&= \mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i\cup\xi(i))}, \sigma^2_X I)\, \frac{\mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)}{\mathcal{L}(X_{1:n} \mid A^{1:n,(i\cup\xi(i))}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)}\\[4pt]
&= \frac{\mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)}{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)}\\[4pt]
&= \frac{\int \mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, A_+, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)\, \mathcal{L}(\mathrm{d}A_+ \mid z_{1:n}^{(i\cup\xi(i))})}{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)}
\end{aligned}
$$
where the fourth equality uses the fact that, given $A^{1:n}$, the observations are conditionally independent with $X_t \sim \mathcal{N}(z_t A^{1:t}, \sigma^2_X I)$, so the factor for $t = n$ cancels; here $A_+$ is a $(K^+_n \times D)$ matrix whose entries are independent Gaussians with mean $0$ and variance $\sigma^2_A$. Therefore,
$$
\begin{aligned}
w_n^{(i)} &= \frac{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)}{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)} \times \int \mathcal{N}\big(X_n \mid z_n^{(i)} \big[A^{1:(n-1),\xi(i)}, A_+\big], \sigma^2_X I\big)\, \mathcal{L}(\mathrm{d}A_+ \mid z_{1:n}^{(i\cup\xi(i))})\\[4pt]
&= \int \mathcal{N}\big(X_n \mid z_n^{(i)} \big[A^{1:(n-1),\xi(i)}, A_+\big], \sigma^2_X I\big)\, \mathcal{L}(\mathrm{d}A_+ \mid z_{1:n}^{(i\cup\xi(i))})\\[4pt]
&= \int_{\mathbb{R}^{K^+_n \times D}} \prod_{d=1}^D \mathcal{N}\Big(x_{nd} - \sum_{l=1}^{K^+_{1:(n-1)}} z_{n,l} A^{1:(n-1),\xi(i)}_{l,d}\;\Big|\; \sum_{j=1}^{K^+_n} z_{n,K^+_{1:(n-1)}+j} A^+_{j,d},\; \sigma^2_X\Big) \prod_{k=1}^{K^+_n} \prod_{d=1}^D \mathcal{N}(A^+_{k,d} \mid 0, \sigma^2_A)\, \mathrm{d}A_+,
\end{aligned}
$$
which does not involve the newly proposed features $A^{n,(i)}$.
Figure 5.7: Images at times $t \in \{1, 5, 10, 15, 19, 25\}$.

As far as time $n = 1$ is concerned, the unnormalized weight for
the i-th particle is, using Bayes' theorem,
$$
w_1^{(i)} = \frac{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)\, \pi(z_1^{(i)} \mid \gamma, c)}{\mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A, \sigma^2_X, X_1)} = \frac{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)\, \pi(z_1^{(i)} \mid \gamma, c)\, m(X_1 \mid z_1^{(i)})}{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)} = m(X_1 \mid z_1^{(i)})\, \pi(z_1^{(i)} \mid \gamma, c).
$$
5.4.2 Numerical results

To assess the effectiveness of our algorithm, we simulated 3 different datasets: the first (i) with $T = 25$, 4 features (the first 4 of Figure 5.5) and a medium/low noise level, $\sigma^2_X = 0.01$; the second (ii) with a higher noise level, $\sigma^2_X = 0.05$; and the third (iii) with $T = 100$ observations, 6 latent features and a medium noise level, $\sigma^2_X = 0.02$. Figure 5.7 shows a sample of six observations from dataset (i).
We set the parameters of the priors as follows:
$(\alpha_A, \beta_A)$: parameters of an inverse gamma with mean 1 and variance 2;
$(\alpha_X, \beta_X)$: parameters of an inverse gamma with mean 0.1 and variance 0.1;
$(a_\gamma, b_\gamma)$: parameters of a gamma with mean 0.5 and variance 2;
$(a_c, b_c)$: parameters of a gamma with mean 1.1 and variance 1.
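Since the hyperparameters above are specified through means and variances, the corresponding shape/rate pairs follow from the standard moment formulas. A small illustrative helper (our own convenience code, assuming the shape/rate parameterization):

```python
def gamma_from_moments(mean, var):
    """Shape and rate of a gamma distribution with the given mean and variance."""
    rate = mean / var
    return mean * rate, rate

def inv_gamma_from_moments(mean, var):
    """Shape and rate of an inverse gamma with the given mean and variance
    (the variance is finite only for shape > 2)."""
    shape = mean ** 2 / var + 2.0
    rate = mean * (shape - 1.0)
    return shape, rate
```

For instance, an inverse gamma with mean 1 and variance 2 corresponds to shape 2.5 and rate 1.5.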
As far as the MCMC parameters are concerned, we set $N = 2000$ particles for the conditional SMC step and 2000 total iterations after a burn-in of 1000 and a thinning of 5 for all the tests. For the first dataset, the algorithm identified exactly 4 features, represented in Figure 5.8 (the estimated features at the last iteration). The latent states $z_t$ are also perfectly estimated in this case, for every $t = 1, \ldots, 25$.

Estimating features with more robust approaches is not a trivial problem: we are not aware of any established method that estimates the features by minimizing some posterior loss function between true and estimated features, in the spirit of Lau and Green (2007a) for clustering. Recently, an R package called sdols has been released by David Dahl and Peter Müller. The package provides methods for summarizing distributions on feature allocations, based on a sequentially allocated latent structure optimization algorithm that minimizes different loss functions; however, the associated paper is not yet available, so the use of the package is limited by the lack of documentation. Moreover, we would need more general approaches, going beyond feature allocation (see the next example on Poisson Factor Analysis). A simpler idea is the use of a 0-1 loss function, leading to the MAP estimator. However, this approach also presents many disadvantages: in such a high-dimensional parameter space, it is very unlikely that the same parameter value is observed more than once.
Figure 5.8: Features A at the last iteration for simulated data (i).
The traceplots of the parameters $\sigma^2_X$, $\sigma^2_A$, $c$ and $\gamma$ do not exclude convergence and show good mixing (plots not reported here).

Figure 5.9 shows the expected value of the predictive distribution for a new image, at time $T + 1$ (b), and for the image at time $t = 19$ (a).

For the simulated data (ii), where the noise is higher, the algorithm found 6 features, two of which contain only noise. However, the predictive means
Figure 5.9: (a) Predictive distribution (left) and real observation (right) for time t = 19 and (b) for a new time instant T + 1, x_{T+1}. The last image shows the observation at time T.
remain satisfactory for all $t = 1, \ldots, T + 1$.

Finally, under simulation truth (iii), we obtained 7 features at the last iteration: six of them are a good estimate of the features in Figure 5.5 and the last one contains only noise. We note, however, that some observations actually contain only noise, since no features are present: the algorithm is thus able to capture these situations as well.
5.5 Application: Poisson Factor Analysis for time dependent topic modelling

We consider the problem of learning latent topics in a time series of documents. A popular approach considers a binary matrix recording whether a word is present in each document: in this framework, Perrone et al. (2016) provide a model that includes time dependence. Here, on the other hand, the observations are given by $d_{wt}$, the number of times word $w$ appears in the document observed at time $t$; the count matrix $D$ therefore has dimensions $V \times T$, where $V$ is the vocabulary size and $T$ is the number of documents. Our aim is to learn the latent factors, namely the topics, and to investigate how the popularity of each topic changes over time. Taking time dependence into account, we allow for the possibility that a new topic may be discovered, or may stop being relevant, at some point in time.
We model the count matrix $D$ via a Poisson likelihood
$$
D \sim \text{Poisson}(\Phi I) \tag{5.9}
$$
where the notation stands for $d_{wt} \sim \text{Poisson}\big(\sum_{l=1}^{K^+} \phi_{wl} I_{lt}\big)$, independently across $w = 1, \ldots, V$, $t = 1, \ldots, T$. In (5.9), the matrix $\Phi$ has dimensions $V \times K^+$, where $K^+$ is the number of topics appeared up to time $T$, and is called the factor loading matrix; its columns, denoted $\{\phi_k\}_{k=1}^{K^+}$, are vectors of length $V$ representing the topics. Each topic is interpreted as a distribution on the dictionary and therefore modeled as
$$
\phi_1, \ldots, \phi_K \mid \delta \overset{\text{iid}}{\sim} \text{Dirichlet}(\delta, \delta, \ldots, \delta).
$$
This approach is called Poisson Factor Analysis (PFA) and has been successfully applied to topic modeling (in some cases also considering time dependence) in Zhou and Carin (2015), Acharya et al. (2015), Roychowdhury and Kulis (2015) and Zhou et al. (2012), among others.

On the other hand, $I$ is a matrix of dimensions $K^+ \times T$ called the factor counts: its columns $\{I_t\}_{t=1}^T$ are a sequence from a time dependent Poisson-Gamma process, as in Section 5.3.3. It contains the strength, or importance, of each topic at time $t$, $t = 1, \ldots, T$. Therefore, $K^+$ can be interpreted as the random variable representing the number of topics in the observations: ideally, it can grow to infinity as $T$ grows.
We now introduce some notation that will be useful in what follows: $L_t$ denotes the number of words in the $t$-th document, namely $L_t = \sum_{w=1}^V d_{wt}$, $t = 1, \ldots, T$. Moreover, if $x$ is a matrix of dimension $M \times N$, we denote by $x_{\cdot j} = \sum_{i=1}^M x_{ij}$, $j = 1, \ldots, N$, the sum over the rows and by $x_{i \cdot} = \sum_{j=1}^N x_{ij}$, $i = 1, \ldots, M$, the sum over the columns. Finally, we call $K_t$ the number of topics appeared up to time $t$ (so that $K^+ = K_T$).
In order to simplify the MCMC inference, it is common to augment (5.9) as
$$
d_{wt} = \sum_{l=1}^{K_t} d^l_{wt}, \qquad d^l_{wt} \sim \text{Poisson}(\phi_{wl} I_{lt}), \quad \text{ind. } w = 1, \ldots, V,\; t = 1, \ldots, T,\; l = 1, \ldots, K_t \tag{5.10}
$$
so that the entries of $D$ can be explained as a sum of smaller counts, each produced by a hidden topic.
Lemma 5.1
An equivalent representation of the model in (5.10) is the following:
$$
\begin{aligned}
d_{wt} &= \sum_{l=1}^{K_t} d^l_{wt},\\
(d^l_{1t}, d^l_{2t}, \ldots, d^l_{Vt}) \mid d^l_{\cdot t}, \phi_l &\sim \text{Multin}_V(d^l_{\cdot t}; \phi_{1l}, \ldots, \phi_{Vl}),\\
d^l_{\cdot t} \mid I_{lt} &\sim \text{Poisson}(I_{lt}).
\end{aligned} \tag{5.11}
$$
We assume independence across $l = 1, \ldots, K_t$, $t = 1, \ldots, T$.
The equivalence can be readily proved: under (5.10) we have
$$
\begin{aligned}
\mathcal{L}\big(d^l_{wt},\, w = 1, \ldots, V,\, l = 1, \ldots, K_t,\, t = 1, \ldots, T \mid I, \Phi\big) &= \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \text{Poisson}(d^l_{wt}; \phi_{wl} I_{lt})\\
&= \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \left( \frac{(\phi_{wl} I_{lt})^{d^l_{wt}}}{d^l_{wt}!}\, e^{-\phi_{wl} I_{lt}} \right)\\
&= e^{-\sum_{t=1}^T \sum_{l=1}^{K_t} I_{lt}} \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \frac{(\phi_{wl} I_{lt})^{d^l_{wt}}}{d^l_{wt}!},
\end{aligned}
$$
since $\sum_{w=1}^V \phi_{wl} = 1$, while under (5.11)
$$
\begin{aligned}
\mathcal{L}\big(d^l_{wt},\, w = 1, \ldots, V,\, l = 1, \ldots, K_t,\, t = 1, \ldots, T \mid I, \Phi\big) &= \prod_{t=1}^T \prod_{l=1}^{K_t} \text{Multin}_V(d^l_{1t}, \ldots, d^l_{Vt}; d^l_{\cdot t}; \phi_{1l}, \ldots, \phi_{Vl})\, \text{Poisson}(d^l_{\cdot t}; I_{lt})\\
&= \prod_{t=1}^T \prod_{l=1}^{K_t} \left( \frac{d^l_{\cdot t}!}{d^l_{1t}! \cdots d^l_{Vt}!}\, \phi_{1l}^{d^l_{1t}} \cdots \phi_{Vl}^{d^l_{Vt}}\, \frac{I_{lt}^{d^l_{\cdot t}} e^{-I_{lt}}}{d^l_{\cdot t}!} \right)\\
&= e^{-\sum_{t=1}^T \sum_{l=1}^{K_t} I_{lt}} \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \frac{(\phi_{wl} I_{lt})^{d^l_{wt}}}{d^l_{wt}!},
\end{aligned}
$$
which are in fact the same.
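The identity proved above can also be checked numerically for a single topic-time pair; the helper functions below are illustrative code of ours, not from the thesis:

```python
import math

def poisson_pmf(k, lam):
    """Poisson(k; lam) probability mass function."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def multinomial_pmf(counts, n, probs):
    """Multinomial pmf of `counts` given total n and cell probabilities."""
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    p = 1.0
    for c, q in zip(counts, probs):
        p *= q ** c
    return coef * p

def joint_510(counts, phi_l, I_lt):
    """Joint law of one topic's word counts under the independent Poissons (5.10)."""
    p = 1.0
    for c, q in zip(counts, phi_l):
        p *= poisson_pmf(c, q * I_lt)
    return p

def joint_511(counts, phi_l, I_lt):
    """Same law under the multinomial / total-count augmentation (5.11)."""
    n = sum(counts)
    return multinomial_pmf(counts, n, phi_l) * poisson_pmf(n, I_lt)
```

Both functions return the same value for any count vector, topic $\phi_l$ and strength $I_{lt}$, mirroring the algebra above.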
We adopt representation (5.11) for its convenience when building an algorithm for the following model:
$$
\begin{aligned}
d_{wt} &= \sum_{l=1}^{K^+} d^l_{wt}, \qquad d^l_{wt} \sim \text{Poisson}(\phi_{wl} I_{lt}), \quad w = 1, \ldots, V,\; t = 1, \ldots, T\\
\phi_1, \ldots, \phi_K \mid \delta &\overset{\text{iid}}{\sim} \text{Dirichlet}(\delta, \delta, \ldots, \delta)\\
I_1, \ldots, I_T \mid \gamma, \lambda &\sim \text{TD-PoissonGamma}(\gamma, \lambda)\\
(\lambda, \gamma) &\sim \text{gamma}(s, r) \times \text{gamma}(a, b)\\
\delta &\sim \text{gamma}(a_0, b_0)
\end{aligned} \tag{5.12}
$$
The model above can be employed to analyze real-world datasets of texts observed over a time period. For instance, a dataset that has been widely considered in the literature is the State of the Union dataset. It contains the transcripts of 65 US State of the Union addresses, from Truman in 1945 to Bush in 2006. After removing stop words and terms that occur less than 10 times in total, 2755 words are left in our dictionary. Figure 5.10 shows the wordclouds for two presidents, Clinton in 1997 and Bush in 2003: a wordcloud highlights popular or trending terms based on the word frequencies of that particular address.
Figure 5.10: (Left) Wordcloud for the address of Clinton in 1997 and (right) for Bush in 2003.
In the following section, we describe an algorithm for posterior inference under our model.

5.5.1 Particle Gibbs sampler

We specify the same algorithm as in Section 5.4.1, the particle Gibbs sampler of Andrieu et al. (2010), for the model under consideration. The parameters sampled according to their full conditionals are, in this case, $\theta = (\delta, \gamma, \lambda)$, while the parameters addressed by the conditional particle filter step are the topics and their trends, $(\phi_l)$, $l = 1, 2, \ldots$, and $I_t$, $t = 1, \ldots, T$. The full conditionals for $\theta$ are the following (where "rest" denotes all the variables but the one on the left of the expression):
Parameter $\delta$: the full conditional for this parameter is
$$
\mathcal{L}(\delta \mid \text{rest}) \propto \prod_{k=1}^K \left( \frac{\Gamma(\delta V)}{\Gamma(\delta)^V} \prod_{w=1}^V \phi_{kw}^{\delta - 1} \right) \delta^{a_0 - 1} e^{-\delta b_0}, \quad \delta > 0.
$$
Parameter $\lambda$: its full conditional is
$$
\mathcal{L}(\lambda \mid \text{rest}) \propto \left( \frac{\lambda + 2}{\lambda + 1} \right)^{-\gamma T} (\lambda + 1)^{N_T} (\lambda + 2)^{-N_T - S_T - L_T} \lambda^{a_\lambda - 1} e^{-b_\lambda \lambda}, \quad \lambda > 0,
$$
with $N_T = \sum_{t=2}^T \sum_{l=1}^{\rho_{t-1}} I_{(t-1)l}$, $S_T = \sum_{t=2}^T \sum_{l=1}^{\rho_{t-1}} I^{\text{thin}}_{tl}$ and $L_T = \sum_{x \geq 1} x \sum_{t=1}^T \rho^{\text{new}}_{t,x}$. A Metropolis-Hastings step is used, since the distribution is not of a known form.
Parameter $\gamma$: its full conditional is
$$
\mathcal{L}(\gamma \mid \text{rest}) = \text{gamma}\left( a_\gamma + \sum_{x \geq 1} \sum_{t=1}^T \rho^{\text{new}}_{t,x},\; b_\gamma + T \log\left( \frac{\lambda + 2}{\lambda + 1} \right) \right).
$$
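A sketch of the Metropolis-Hastings update for $\lambda$, with the log full conditional transcribed from the display above; the function names and the Gaussian random-walk proposal are illustrative choices of ours, not the thesis implementation:

```python
import math
import random

def log_target_lambda(lam, gamma_, T, N_T, S_T, L_T, a_lam, b_lam):
    """Log of the lambda full conditional above, up to an additive constant."""
    if lam <= 0:
        return -math.inf
    return (-gamma_ * T * math.log((lam + 2.0) / (lam + 1.0))
            + N_T * math.log(lam + 1.0)
            - (N_T + S_T + L_T) * math.log(lam + 2.0)
            + (a_lam - 1.0) * math.log(lam) - b_lam * lam)

def mh_step(value, step, log_target, rng=random):
    """One Gaussian random-walk Metropolis-Hastings update."""
    prop = value + rng.gauss(0.0, step)
    log_acc = log_target(prop) - log_target(value)
    if log_acc >= 0 or rng.random() < math.exp(log_acc):
        return prop
    return value
```

Proposals outside $(0, \infty)$ get log density $-\infty$ and are therefore always rejected.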
As far as the conditional sequential Monte Carlo step is concerned, we first need to write down the law we aim to sample from:
$$
\begin{aligned}
\mathcal{L}(I_t,\, t = 1, \ldots, T,\, \phi_l,\, l = 1, 2, \ldots \mid D, \theta) &\propto \prod_{t=1}^T \left[ \prod_{w=1}^V \Big( \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big)^{d_{wt}} \exp\Big( -\sum_{w=1}^V \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big) \mathcal{L}(I_t \mid I_{t-1}, \gamma, \lambda) \right] \prod_{l=1}^{K_T} \text{Dirichlet}(\phi_l; \delta, \ldots, \delta)\\
&\propto \prod_{t=1}^T \left[ \prod_{w=1}^V \Big( \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big)^{d_{wt}} \exp\Big( -\sum_{w=1}^V \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big) \mathcal{L}(I_t \mid I_{t-1}, \gamma, \lambda) \prod_{l=1}^{\rho_t} \text{Dirichlet}(\phi_l; \delta, \ldots, \delta) \right]
\end{aligned}
$$
where $\rho_t$ stands for the number of new topics appeared at time $t$ (innovation).
Suppose we have $N$ particles.

Time $t = 1$

a. Propose $\big( I_1^{(i)}, (\phi_l^{(i)}),\, l = 1, \ldots, K_1^{(i)} \big)$ as follows:

- Sample $K_1^{(i)} \sim \text{Poisson}\big( \sum_{x \geq 1} M_x \big)$, where $M_x = \dfrac{\gamma}{x} \Big( \dfrac{1}{\lambda + 2} \Big)^x$, $x = 1, 2, \ldots$

- Perform an EM (expectation maximization) step to compute the values $\big( \hat{I}^{(i)}, (\hat{\phi}_l^{(i)}),\, l = 1, \ldots, K_1^{(i)},\, (\hat{d}_{wl}),\, l = 1, \ldots, K_1^{(i)},\, w = 1, \ldots, V \big)$: (i) initialize the $\phi_l$'s and $I_l$'s; (ii) at iteration $m$ calculate
$$
\hat{d}_{wl} = d_{w1}\, \frac{\phi^{(m-1)}_{wl} I^{(m-1)}_{l1}}{\sum_{k=1}^K \phi^{(m-1)}_{wk} I^{(m-1)}_{k1}};
$$
(iii) set $I^{(m)}_{l1} = \sum_{w=1}^V \hat{d}_{wl}$; (iv) set $\phi^{(m)}_{wl} \propto \hat{d}_{wl} / I^{(m)}_{l1}$. Repeat steps (ii), (iii) and (iv) until a convergence criterion is satisfied.

- Propose $I_1^{(i)}$ from a truncated normal of dimension $K_1^{(i)}$:
$$
I_1^{(i)} \sim \mathcal{TN}_{K_1^{(i)}}\big( \hat{I}^{(i)}, \mathcal{I}(\hat{I}^{(i)}) \big)
$$
where $\mathcal{I}(\hat{I}^{(i)}) = \text{diag}(\hat{I}^{(i)})$ is the inverse of the Fisher information matrix. The truncation forces the values to lie in $[0, +\infty)$. In order to obtain integer values, we apply the ceiling function to each element of the vector.

- Propose $\phi_l^{(i)}$ according to its full conditional, i.e.
$$
\phi_l^{(i)} \sim \text{Dirichlet}_V\big( \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)
$$
where the $\hat{d}_{wl}$ are the values computed during the EM step.

b. The (unnormalized) weight of each particle is given by
$$
w_1^{(i)} = \frac{\prod_{w=1}^V \Big( \sum_{l=1}^{K_1^{(i)}} \phi_{wl} I^{(i)}_{l1} \Big)^{d_{w1}} \exp\Big( -\sum_{l=1}^{K_1^{(i)}} I^{(i)}_{l1} \Big)\, \mathcal{L}\big( I_1^{(i)} \mid \gamma, \lambda \big)}{q\big( K_1^{(i)}; \phi_l^{(i)}, l = 1, \ldots, K_1^{(i)}; I_1^{(i)} \big)} \times \prod_{l=1}^{K_1^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta, \ldots, \delta \big)
$$
where $q\big( K_1^{(i)}; \phi_l^{(i)}, l = 1, \ldots, K_1^{(i)}; I_1^{(i)} \big)$ is the proposal distribution that generated the $i$-th particle. It can be computed by evaluating three contributions as follows:
1. $q_K(K_1^{(i)}) = \text{Poisson}\big( K_1^{(i)}; \sum_{x \geq 1} M_x \big)$;
2. $q_I(I_1^{(i)}) = \prod_{l=1}^{K_1^{(i)}} \dfrac{ \Phi\big( I^{(i)}_{l1}; \hat{I}_{l1}, \sqrt{\hat{I}_{l1}} \big) - \Phi\big( I^{(i)}_{l1} - 1; \hat{I}_{l1}, \sqrt{\hat{I}_{l1}} \big) }{ 1 - \Phi\big( 0; \hat{I}_{l1}, \sqrt{\hat{I}_{l1}} \big) }$, where $\Phi(x; \mu, \sigma)$ is the cumulative distribution function of a univariate Gaussian distribution with mean $\mu$ and standard deviation $\sigma$;
3. $q_\phi\big( \phi_l^{(i)}, l = 1, 2, \ldots \big) = \prod_{l=1}^{K_1^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)$.
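Steps (ii)-(iv) of the EM refinement can be sketched for a single document as follows. This is illustrative code of ours; initialization and the convergence criterion are left generic, as in the text:

```python
def em_refine(d_w, phi, I_l, n_iter=100, tol=1e-10):
    """Steps (ii)-(iv) above for one document: d_w are its V word counts,
    phi a list of K topic vectors, I_l the K topic strengths.
    Returns refined (phi, I_l) and the allocations d_wl."""
    V, K = len(d_w), len(I_l)
    d_wl = [[0.0] * K for _ in range(V)]
    for _ in range(n_iter):
        # (ii) expected split of each word count across the K topics
        for w in range(V):
            denom = sum(phi[l][w] * I_l[l] for l in range(K))
            for l in range(K):
                d_wl[w][l] = d_w[w] * phi[l][w] * I_l[l] / denom if denom > 0 else 0.0
        # (iii) topic strengths as column sums of the allocations
        new_I = [sum(d_wl[w][l] for w in range(V)) for l in range(K)]
        # (iv) topics renormalized proportionally to the allocations
        phi = [[d_wl[w][l] / new_I[l] if new_I[l] > 0 else 1.0 / V
                for w in range(V)] for l in range(K)]
        converged = max(abs(a - b) for a, b in zip(new_I, I_l)) < tol
        I_l = new_I
        if converged:
            break
    return phi, I_l, d_wl
```

After the first pass the strengths satisfy $\sum_l I_l = \sum_w d_w$, i.e. every word count is fully allocated to some topic.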
Time $t \in \{2, \ldots, T\}$

a. Propose $\big( I_t^{(i)}, (\phi_l^{(i)}),\, l = 1, \ldots, K_{t,\text{inn}}^{(i)} \big)$ as follows:

- Sample $K_{t,\text{inn}}^{(i)} \sim \text{Poisson}\big( \sum_{x \geq 1} M_x \big)$, where $M_x = \dfrac{\gamma}{x} \Big( \dfrac{1}{\lambda + 2} \Big)^x$, $x = 1, 2, \ldots$;

- Perform the same EM step as for $t = 1$ in order to compute $\big( \hat{I}^{(i)}, (\hat{\phi}_l^{(i)}), (\hat{d}_{wl}),\, l = 1, \ldots, (K_{t-1}^{(\xi(i))} + K_{t,\text{inn}}^{(i)}),\, w = 1, \ldots, V \big)$. Note that the first $K_{t-1}^{(\xi(i))}$ topics are fixed.

- Propose $I_t^{(i)}$ from a truncated normal of dimension $K_t^{(i)} = K_{t-1}^{(\xi(i))} + K_{t,\text{inn}}^{(i)}$:
$$
I_t^{(i)} \sim \mathcal{TN}_{K_t^{(i)}}\big( \hat{I}^{(i)}, \mathcal{I}(\hat{I}^{(i)}) \big)
$$
where $\mathcal{I}(\hat{I}^{(i)}) = \text{diag}(\hat{I}^{(i)})$ is the inverse of the Fisher information matrix. The truncation forces the values to lie in $[0, +\infty)$. In order to obtain integer values, we apply the ceiling function to each element of the vector.

- Propose the new topics appeared at time $t$, $\phi_l^{(i)}$ for $l \in \{1, 2, \ldots, K_{t,\text{inn}}^{(i)}\}$, according to their full conditional, i.e.
$$
\phi_l^{(i)} \sim \text{Dirichlet}_V\big( \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)
$$
where the $\hat{d}_{wl}$ are the values computed during the EM step.

b. The (unnormalized) weight of each particle is given by
$$
w_t^{(i)} = \frac{\prod_{w=1}^V \Big( \sum_{l=1}^{K_t^{(i)}} \phi_{wl} I^{(i)}_{lt} \Big)^{d_{wt}} \exp\Big( -\sum_{l=1}^{K_t^{(i)}} I^{(i)}_{lt} \Big)\, \mathcal{L}\big( I_t^{(i)} \mid I_{t-1}^{\xi(i)}, \gamma, \lambda \big)}{q\big( K_{t,\text{inn}}^{(i)}; \phi_l^{(i)}, l = 1, \ldots, K_{t,\text{inn}}^{(i)}; I_t^{(i)} \big)} \times \prod_{l=1}^{K_{t,\text{inn}}^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta, \ldots, \delta \big)
$$
where $q$ is the proposal distribution that generated the $i$-th particle. It can be computed by evaluating three contributions as follows:
1. $q_K(K_{t,\text{inn}}^{(i)}) = \text{Poisson}\big( K_{t,\text{inn}}^{(i)}; \sum_{x \geq 1} M_x \big)$;
2. $q_I(I_t^{(i)}) = \prod_{l=1}^{K_t^{(i)}} \dfrac{ \Phi\big( I^{(i)}_{lt}; \hat{I}_{lt}, \sqrt{\hat{I}_{lt}} \big) - \Phi\big( I^{(i)}_{lt} - 1; \hat{I}_{lt}, \sqrt{\hat{I}_{lt}} \big) }{ 1 - \Phi\big( 0; \hat{I}_{lt}, \sqrt{\hat{I}_{lt}} \big) }$;
3. $q_\phi\big( \phi_l^{(i)}, l = 1, 2, \ldots \big) = \prod_{l=1}^{K_{t,\text{inn}}^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)$.
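The ceiling-of-truncated-normal proposal and the corresponding mass $q_I$ of point 2. can be sketched componentwise as follows (illustrative code of ours; the identity used is that $\lceil v \rceil = k$ exactly when $v \in (k-1, k]$):

```python
import math
import random

def norm_cdf(x, mu, sd):
    """Phi(x; mu, sd), the Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def propose_I_component(I_hat, rng=random):
    """Ceiling of a draw from N(I_hat, I_hat) truncated to (0, +inf);
    plain rejection sampling is enough for a sketch."""
    sd = math.sqrt(I_hat)
    while True:
        v = rng.gauss(I_hat, sd)
        if v > 0:
            return math.ceil(v)

def q_I_component(k, I_hat):
    """Proposal mass of the integer k, as in point 2. above: the truncated
    normal hits ceil(v) = k when v lies in (k - 1, k]."""
    sd = math.sqrt(I_hat)
    num = norm_cdf(k, I_hat, sd) - norm_cdf(k - 1, I_hat, sd)
    return num / (1.0 - norm_cdf(0, I_hat, sd))
```

Summed over $k = 1, 2, \ldots$ these masses total one, confirming that the ceiling map turns the truncated normal into a proper proposal on the positive integers.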
In order to reduce the computational time, the sequential Monte Carlo step has been implemented in C++ via Rcpp (Eddelbuettel et al., 2011).
5.5.2 Application to a simulated dataset

We simulated a very simple dataset consisting of $T = 23$ documents with three well-separated topics. The true topics are depicted in Figure 5.12 and the
Figure 5.11: Simulated dataset $d_{wt}$, $w = 1, 2, \ldots$, $t = 1, \ldots, 23$: the horizontal axis represents time, the vertical axis the vocabulary. The color purple denotes the value 0.
observations in Figure 5.11, where the horizontal axis shows the time $t \in \{1, 2, \ldots, 23\}$ and the vertical axis represents the vocabulary: brighter colors depict higher counts. It is clear that the first six documents contain the first topic only, then from $t = 7$ to $t = 14$ only the second topic appears, and the last eight documents contain only the third topic (see the simulation truth in the left panel of Figure 5.14). As far as the prior information is concerned, we
Figure 5.12: The real three topics that generated the data in Figure 5.11.
fixed the parameter $\delta$ of the Dirichlet distribution in (5.12) at 0.0001. Moreover, $\lambda \sim \text{gamma}(2, 1)$ and $\gamma$ is gamma distributed with mean 3 and variance 5.

We ran the algorithm described in Section 5.5.1 with $N = 2000$ particles and 1000 final iterations after a burn-in of 100 iterations. The last iteration of the algorithm ended up with $K = 6$ estimated topics, displayed in Figure 5.13; the corresponding trends can be found in the right panel of Figure 5.14. The truth is recovered fairly well, even if there are three topics, $\phi_2$, $\phi_3$, $\phi_5$, that contain only noise, i.e. only one or two words are selected. These are, indeed, associated with trends $I_{lt}$, $l \in \{2, 3, 5\}$,
Figure 5.13: The six estimated topics, as in the last iteration of the algorithm.
with very low values compared to the three main topics, namely $\phi_1$, $\phi_4$, $\phi_6$. On the other hand, the corresponding trends are reasonably recovered.

Finally, Figure 5.15 compares the posterior predictive mean (in purple) and the true observations (in black) at times $t \in \{1, 15, 23\}$: posterior inference under model (5.12) was able to correctly recreate the observed documents.

As far as the test cases are concerned, we still need to perform a sensitivity analysis with respect to the hyperparameters and investigate the behavior of the particle Gibbs sampler when increasing the number of observed documents and/or topics.
5.5.3 Application to the State of the Union dataset

We provide some preliminary studies on the real-data application to the State of the Union dataset mentioned in Section 5.5. We explore the dataset consisting of the full text of 65 speeches of American presidents from 1945 (Truman) to 2006 (G.W. Bush). We pre-processed the data by removing stop words and punctuation and discarded words appearing fewer than 10 times. We ended up with a vocabulary of 2723 words. Our goal is to discover what topics appear in the
Figure 5.14: The real trends for the three topics that generated the data in Figure 5.11 (left). The estimated trends for the six topics obtained at the last iteration of the algorithm (right).
corpus and to track the evolution of their popularity over time.

The particle Gibbs sampler was run for 1000 iterations with a burn-in period of 500 iterations and a thinning of 5. As far as the hyperparameters are concerned, we set $\delta \sim \text{gamma}(1, 5)$, $\lambda \sim \text{gamma}(2, 10)$ and $\gamma \sim \text{gamma}(2, 5)$.

The model estimated 20 topics; some of them are meaningful and easily interpretable. Remember that topics are defined by their distribution over words, so it is possible to label them by looking at their most likely words. Figure 5.16 shows 9 interpretable topics: in the top-left part of each plot, the 10 most likely words of the topic are listed. The thick line represents the estimated temporal evolution of the topic weights. One qualitative advantage of modeling time dependence explicitly is that it yields interesting insights into the importance of topics and how it changes over time: topic (b), for example, refers to the terrorism related to the Iraqi conflict and appears late in time. Similarly, the topic related to the Internet and education, (a), appears just before 2000.

On the other hand, there are topics that persist in time, such as (e) and (f): in particular, by looking at their most representative words, we can deduce that topic (e) is related to money and the economy, and topic (f) to peace and patriotism.
5.6 Discussion and future developments

In this chapter we illustrated a strategy for defining a time dependent process whose values are completely random measures. We provided a simple description
Figure 5.15: Posterior predictive mean for observations at times t ∈ {1, 15, 23} (in purple, from left to right) and real observations (in black).
of the proposed process, which has an AR(1)-type structure and offers a framework for generalizations to more complicated forms of time dependence.

In particular, as a further development of this work, we aim at investigating the extension to $p$-lagged dependence: from the analogy between linear time series processes for real valued random variables and for point processes, an AR($p$) process may be seen as
$$
X_t = \bigcup_{j=1}^p \phi_j(X_{t-j}) \cup \varepsilon_t
$$
where $\phi_1(\cdot), \phi_2(\cdot), \ldots, \phi_p(\cdot)$ are thinning operators and $\varepsilon_t$ is an innovation term. In this framework, we can generalize our model through mixtures of transition distributions as in Mena and Walker (2007):
$$
f(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-p}) = \sum_{k=1}^p w_k f_k(x_t \mid x_{t-k})
$$
where $w_k \geq 0$ and $\sum_k w_k = 1$. Stationarity is preserved (see Proposition 1 in Mena and Walker, 2007). If we consider a transition kernel $f_k$ that is the same for every $k$, we have
$$
f(x_t \mid x_{t-1}, \ldots, x_{t-p}) = \sum_{k=1}^p w_k \int p(x_t \mid G)\, p(\mathrm{d}G \mid x_{t-k}) = \int p(x_t \mid G) \sum_{k=1}^p w_k\, p(\mathrm{d}G \mid x_{t-k}),
$$
which can be interpreted as
$$
X_t = \bigcup_{j=1}^p w_j \phi_j(X_{t-j}) \cup \varepsilon_t
$$
since only the thinning part is affected by the conditioning on $X_{t-k}$.
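The mixture-of-transitions step just described can be sketched as follows; this is a minimal illustration of ours (draw a lag $k$ with probability $w_k$, then apply a generic one-lag kernel standing in for thinning plus innovation):

```python
import random

def mixture_transition(history, weights, transition, rng=random):
    """One step of the p-lag mixture of transitions: draw a lag k with
    probability w_k, then apply the single-lag kernel to x_{t-k}.
    `history` lists the last p states, most recent first; `transition`
    would be the thinning-plus-innovation kernel in our setting."""
    p = len(weights)
    k = rng.choices(range(1, p + 1), weights=weights, k=1)[0]
    return transition(history[k - 1])
```

Because each step conditions on a single randomly chosen lag, the one-lag machinery developed in this chapter could be reused unchanged.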
Another extension we would like to investigate is motivated by the following remark: the time dependent model proposed in this chapter does not allow for the re-appearance of traits. In more detail, suppose that at time $\bar{t}$ we observe a trait $\psi$ and that this trait is deleted by the thinning process, namely at time $\bar{t} + 1$ it is not observed anymore. Then, under our model, trait $\psi$ has null probability of being observed again at any time $t > \bar{t}$. This is due to the fact that the centering measure $P_0$ is absolutely continuous.

However, in real-data applications, allowing for the re-appearance of traits may be of interest: in the case of topic modeling, for example, a topic may disappear and then appear again after a few years.
Finally, as a further development, we aim at providing a general algorithm to tackle posterior inference for the wide class of models we proposed. Consider, indeed, the general framework described in Section 1.3.3; the distribution of the vector of scores in (1.29) can be replaced by the time dependent prior developed in Section 5.3 of this chapter. Then, a generalization of the particle Gibbs samplers devised for the specific applications in Sections 5.4.1 and 5.5.1 may be defined to perform posterior inference.
Figure 5.16: Posterior topic weights over the years 1945-2006 and the 10 most likely words for each topic (State of the Union address data set). The listed words per panel are: (a) hire, forge, classrooms, celebrate, internet, rid, millennium, partnership, easier, teen; (b) saddam, hussein, terrorist, iraqi, iraqs, brutal, murder, never, coalition, throughout; (c) guns, neighborhood, neighborhoods, testing, literally, shes, seniors, violent, richardson; (d) missiles, soviets, extra, technology, advanced, ballistic, lets, missile, diplomacy, intellectual; (e) dollars, year, million, fiscal, expenditures, program, government, billion, economic, years; (f) peace, world, must, america, hope, nations, might, great, entire, hard; (g) low, income, gain, earners, aspirations, suffer, enjoyed, unfortunately, crisis, bear, helped; (h) rulers, stood, ready, skilled, aspects, communist, asia, subversion, tide, russia; (i) tonight, late, police, black, gas, illegal, democrats, says, knew, hatred.
Bibliography
Acharya, A., Ghosh, J., and Zhou, M. (2015). Nonparametric Bayesian factor anal-
ysis for dynamic count matrices. In AISTATS.
Aandi, R. H., Fox, E., Adams, R. P., and Taskar, B. (2014). Learning the param-
eters of determinantal point process kernels. In ICML, pages 12241232.
Aandi, R. H., Fox, E., and Taskar, B. (2013). Approximate inference in continuous
determinantal processes. In Advances in Neural Information Processing Systems,
pages 14301438.
Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle markov chain monte
carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269342.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian
nonparametric problems. The Annals of Statistics, 2:11521174.
Antoniano-Villalobos, I. and Walker, S. G. (2016). A nonparametric model for stationary time series. Journal of Time Series Analysis, 37(1):126–142.
Arbel, J. and Prünster, I. (2017). A moment-matching Ferguson & Klass algorithm. Statistics and Computing, 27(1):3–17.
Arellano-Valle, R. and Azzalini, A. (2006). On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3):561–574.
Arellano-Valle, R., Bolfarine, H., and Lachos, V. (2007). Bayesian inference for skew-normal linear mixed models. Journal of Applied Statistics, 34(6):663–682.
Argiento, R., Bianchini, I., and Guglielmi, A. (2016a). A blocked Gibbs sampler for NGG-mixture models via a priori truncation. Statistics and Computing, 26(3):641–661.
Argiento, R., Bianchini, I., and Guglielmi, A. (2016b). Posterior sampling from ε-approximation of normalized completely random measure mixtures. Electronic Journal of Statistics, 10(2):3516–3547.
Argiento, R., Guglielmi, A., Hsiao, C., Ruggeri, F., and Wang, C. (2015). Modelling the association between clusters of SNPs and disease responses. In Mitra, R. and Mueller, P., editors, Nonparametric Bayesian Methods in Biostatistics and Bioinformatics. Springer.
Argiento, R., Guglielmi, A., and Pievatolo, A. (2010). Bayesian density estimation and model selection using nonparametric hierarchical mixtures. Computational Statistics and Data Analysis, 54:816–832.
Asmussen, S. and Glynn, P. W. (2007). Stochastic simulation: algorithms and
analysis, volume 57. Springer Science & Business Media.
Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2):159–188.
Barcella, W., Iorio, M. D., Baio, G., and Malone-Lee, J. (2016). Variable selection in covariate dependent random partition models: an application to urinary tract infection. Statistics in Medicine, 35(8):1373–1389.
Bardenet, R. and Titsias, M. (2015). Inference for determinantal point processes without spectral knowledge. In Advances in Neural Information Processing Systems, pages 3393–3401.
Barndorff-Nielsen, O. E. (2000). Probability densities and Lévy densities. University of Aarhus, Centre for Mathematical Physics and Stochastics.
Barrientos, A. F., Jara, A., and Quintana, F. A. (2012). On the support of MacEachern's dependent Dirichlet processes and extensions. Bayesian Analysis, 7(2):277–310.
Barrios, E., Lijoi, A., Nieto-Barajas, L. E., and Prünster, I. (2013). Modeling with normalized random measure mixture models. Statistical Science, 28:313–334.
Barry, D. and Hartigan, J. A. (1993). A Bayesian analysis for change point problems. Journal of the American Statistical Association, 88(421):309–319.
Basford, K., McLachlan, G., and York, M. (1997). Modelling the distribution of stamp paper thickness via finite normal mixtures: The 1872 Hidalgo stamp issue of Mexico revisited. Journal of Applied Statistics, 24(2):169–180.
Bayes, C. L. and Branco, M. D. (2007). Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Brazilian Journal of Probability and Statistics, pages 141–163.
Bianchini, I., Guglielmi, A., and Quintana, F. A. (2017). Determinantal point
process mixtures via spectral density approach. arXiv preprint arXiv:1705.05181.
Binder, D. A. (1978). Bayesian cluster analysis. Biometrika, 65:31–38.
Biscio, C. A. N. and Lavancier, F. (2016). Quantifying repulsiveness of determinantal point processes. Bernoulli, 22:2001–2028.
Biscio, C. A. N. and Lavancier, F. (2017). Contrast estimation for parametric stationary determinantal point processes. Scandinavian Journal of Statistics, 44:204–229.
Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, pages 353–355.
Blei, D. M. and Frazier, P. I. (2011). Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461–2488.
Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7.
Bondesson, L. (1982). On simulation from infinitely divisible distributions. Advances in Applied Probability, 14(4):855–869.
Broderick, T., Wilson, A. C., and Jordan, M. I. (2017). Posteriors, conjugacy, and
exponential families for completely random measures. Bernoulli (Forthcoming
papers).
Canale, A. and Scarpa, B. (2013). Informative Bayesian inference for the skew-
normal distribution. arXiv preprint arXiv:1305.3080.
Caron, F., Davy, M., and Doucet, A. (2012). Generalized Pólya urn for time-varying Dirichlet process mixtures. arXiv preprint arXiv:1206.5254.
Chung, Y. and Dunson, D. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104:1646–1660.
Cook, R. D. and Weisberg, S. (1994). An introduction to regression graphics. John
Wiley & Sons.
Cook, R. J. and Lawless, J. (2007). The statistical analysis of recurrent events.
Springer Science & Business Media.
da Silva, A. F. and da Silva, M. A. F. (2012). Package "dpmixsim".
Dahl, D. B. (2008). Distance-based probability distribution for set partitions with
applications to Bayesian nonparametrics. JSM Proceedings. Section on Bayesian
Statistical Science, American Statistical Association.
Dahl, D. B., Day, R., and Tsai, J. W. (2017). Random partition distribution indexed by pairwise information. Journal of the American Statistical Association, pages 1–12.
Daley, D. J. and Vere-Jones, D. (2003). Basic properties of the Poisson process. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods, pages 19–40.
Daley, D. J. and Vere-Jones, D. (2007). An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer.
De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):212–229.
De Iorio, M., Johnson, W. O., Müller, P., and Rosner, G. L. (2009). Bayesian nonparametric nonproportional hazards survival modeling. Biometrics, 65(3):762–771.
De Iorio, M., Müller, P., Rosner, G. L., and MacEachern, S. N. (2004). An ANOVA model for dependent random measures. Journal of the American Statistical Association, 99:205–215.
Delatola, E.-I. and Griffin, J. E. (2011). Bayesian nonparametric modelling of the return distribution with stochastic volatility. Bayesian Analysis, 6(4):901–926.
Dellaportas, P. and Papageorgiou, I. (2006). Multivariate mixtures of normals with unknown number of components. Statistics and Computing, 16(1):57–68.
Di Lucca, M. A., Guglielmi, A., Müller, P., and Quintana, F. A. (2013). A simple class of Bayesian nonparametric autoregression models. Bayesian Analysis, 8(1):63.
Doucet, A. and Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3.
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society: Series B, 62(2):355–366.
Dunson, D. B. (2003). Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association, 98(463):555–563.
Eddelbuettel, D., François, R., Allaire, J., Chambers, J., Bates, D., and Ushey, K. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18.
Erdélyi, A., Magnus, W., Oberhettinger, F., Tricomi, F. G., and Bateman, H.
(1953). Higher transcendental functions, volume 2. McGraw-Hill New York.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588.
Favaro, S. and Teh, Y. (2013). MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, vol. II. John Wiley, New York, second edition.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230.
Ferguson, T. S. (1983). Bayesian density estimation by mixtures of normal distributions. In M. H. Rizvi, J. R. and Siegmund, D., editors, Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday, pages 287–302. Academic Press.
Ferguson, T. S. and Klass, M. (1972). A representation of independent increment processes without Gaussian components. Ann. Math. Statist., 43:1634–1643.
Foti, N. and Williamson, S. (2015). A survey of non-exchangeable priors for Bayesian nonparametric models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37:359–371.
Fraley, C., Raftery, A. E., Murphy, T. B., and Scrucca, L. (2012). mclust (Version 4) for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.
Fritsch, A. and Ickstadt, K. (2009). Improved criteria for clustering based on the
posterior similarity matrix. Bayesian Analysis.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models.
Springer Series in Statistics. Springer, New York.
Frühwirth-Schnatter, S. and Pyne, S. (2010). Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics, 11(2):317–336.
Fúquene, J., Steel, M., and Rossell, D. (2016). On choosing mixture components
via non-local priors. arXiv preprint arXiv:1604.00314.
Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6):997–1016.
Ghahramani, Z. and Griffiths, T. L. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
Gianoli, I. (2016). Analysis of gap times of recurrent blood donations via Bayesian
nonparametric models. Master thesis, Politecnico di Milano, Italy.
Gradshteyn, I. and Ryzhik, L. (2007). Table of Integrals, Series, and Products, seventh edition. Academic Press, San Diego (USA).
Griffin, J. and Walker, S. G. (2011). Posterior simulation of normalized random measure mixtures. Journal of Computational and Graphical Statistics, 20:241–259.
Griffin, J. E. (2013). An adaptive truncation method for inference in Bayesian nonparametric models. arXiv preprint arXiv:1308.2045.
Griffin, J. E. and Leisen, F. (2014). Compound random measures and their use in Bayesian nonparametrics. arXiv preprint arXiv:1410.0611.
Griffiths, T. L. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr):1185–1224.
Han, S., Du, L., Salazar, E., and Carin, L. (2014). Dynamic rank factor model for text streams. In Advances in Neural Information Processing Systems, pages 2663–2671.
Hartigan, J. A. (1990). Partition models. Communications in Statistics - Theory and Methods, 19(8):2745–2756.
Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259–1294.
Ishwaran, H. and James, L. (2001a). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc., 96:161–173.
Ishwaran, H. and James, L. F. (2001b). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173.
Ishwaran, H. and James, L. F. (2002). Approximate Dirichlet process computing in finite normal mixtures. Journal of Computational and Graphical Statistics, 11(3).
Ismay, C. and Chunn, J. (2017). fivethirtyeight: Data and Code Behind the Stories and Interactives at 'FiveThirtyEight'. R package version 0.1.0.
James, L., Lijoi, A., and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36:76–97.
Jara, A. and Hanson, T. E. (2011). A class of mixtures of dependent tail-free
processes. Biometrika, 98:553.
Jara, A., Hanson, T. E., Quintana, F. A., Müller, P., and Rosner, G. L. (2011). DPpackage: Bayesian semi- and nonparametric modeling in R. Journal of Statistical Software, 40(5):1.
Jørgensen, B. and Song, P. X.-K. (1998). Stationary time series models with exponential dispersion model margins. Journal of Applied Probability, pages 78–92.
Kallenberg, O. (1983). Random Measures. Academic Press.
Kingman, J. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1):59–78.
Kingman, J. F. C. (1993). Poisson Processes, volume 3. Oxford University Press.
Kulesza, A., Taskar, B., et al. (2012). Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5:123–286.
Lau, J. W. and Green, P. J. (2007a). Bayesian model based clustering procedures. Journal of Computational and Graphical Statistics, 16:526–558.
Lau, J. W. and Green, P. J. (2007b). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16:526–558.
Lavancier, F., Møller, J., and Rubak, E. (2015). Determinantal point process models and statistical inference: Extended version. Journal of the Royal Statistical Society: Series B, 77:853–877.
Lawrence, N. (2005). Probabilistic non-linear principal component analysis with
Gaussian process latent variable models. Journal of Machine Learning Research,
6.
Lijoi, A., Mena, R. H., and Prünster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association, 100(472):1278–1291.
Lijoi, A., Mena, R. H., and Prünster, I. (2007). Controlling the reinforcement in Bayesian nonparametric mixture models. Journal of the Royal Statistical Society B, 69:715–740.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12:351–357.
Lomelí, M., Favaro, S., and Teh, Y. W. (2017). A marginal sampler for σ-stable Poisson-Kingman mixture models. Journal of Computational and Graphical Statistics.
Lopes, H. F. and West, M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica, pages 41–67.
Macchi, O. (1975). The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122.
MacEachern, S. N. (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, pages 50–55.
MacEachern, S. N. (2000). Dependent Dirichlet processes. Technical report, De-
partment of Statistics, The Ohio State University.
Malsiner-Walli, G., Frühwirth-Schnatter, S., and Grün, B. (2016). Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing, 26:303–324.
McAuliffe, J. D., Blei, D. M., and Jordan, M. I. (2006). Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing, 16(1):5–14.
McCulloch, C. E. and Neuhaus, J. M. (2001). Generalized linear mixed models.
Wiley Online Library.
McLachlan, G. and Peel, D. (2005). Finite Mixture Models. John Wiley & Sons,
Inc.
Meilă, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis.
Mena, R. H. and Walker, S. G. (2005). Stationary autoregressive models via a Bayesian nonparametric approach. Journal of Time Series Analysis, 26(6):789–805.
Mena, R. H. and Walker, S. G. (2007). Stationary mixture transition distribution models via predictive distributions. Journal of Statistical Planning and Inference, 137(10):3103–3112.
Miller, J. W. and Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206.
Miller, J. W. and Harrison, M. T. (2017). Mixture models with a prior on the number
of components. Journal of the American Statistical Association. In Press.
Miller, K. T., Griffiths, T., and Jordan, M. I. (2012). The phylogenetic Indian buffet process: A non-exchangeable nonparametric prior for latent features. arXiv preprint arXiv:1206.3279.
Møller, J. and Waagepetersen, R. P. (2007). Modern statistics for spatial point processes. Scandinavian Journal of Statistics, 34:643–684.
Moustaki, I. and Knott, M. (2000). Generalized latent trait models. Psychometrika, 65(3):391–411.
Müller, P. and Quintana, F. (2010). Random partition models with regression on covariates. Journal of Statistical Planning and Inference, 140(10):2801–2808.
Müller, P., Quintana, F., and Rosner, G. L. (2011). A product partition model with regression on covariates. Journal of Computational and Graphical Statistics, 20:260–278.
Neal, R. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265.
Nieto-Barajas, L. E. (2013). Lévy-driven processes in Bayesian nonparametric inference. Boletín de la Sociedad Matemática Mexicana, 19.
Norets, A. (2015). Optimal retrospective sampling for a class of variable dimension models. Unpublished manuscript, Brown University, available at http://www.brown.edu/Departments/Economics/Faculty/Andriy_Norets/papers/optretrsampling.pdf.
Page, G. L. and Quintana, F. A. (2015). Spatial product partition models. Bayesian
Analysis.
Park, J.-H. and Dunson, D. B. (2010). Bayesian generalized product partition model. Statistica Sinica, pages 1203–1226.
Perrone, V., Jenkins, P. A., Spano, D., and Teh, Y. W. (2016). Poisson random fields for dynamic feature models. arXiv preprint arXiv:1611.07460.
Petralia, F., Rao, V., and Dunson, D. B. (2012). Repulsive mixtures. In Advances
in Neural Information Processing Systems.
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Ferguson, T. S., Shapley, L. S., and MacQueen, J. B., editors, Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, volume 30 of IMS Lecture Notes-Monograph Series, pages 245–267. Institute of Mathematical Statistics, Hayward (USA).
Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: a Festschrift for Terry Speed, volume 40 of IMS Lecture Notes-Monograph Series, pages 1–34. Institute of Mathematical Statistics, Hayward (USA).
Pitman, J. (2006). Combinatorial Stochastic Processes. LNM n. 1875. Springer,
New York.
Pitt, M. K., Chatfield, C., and Walker, S. G. (2002). Constructing first order stationary autoregressive models via latent processes. Scandinavian Journal of Statistics, 29(4):657–663.
Pitt, M. K. and Walker, S. G. (2005). Constructing stationary time series models using auxiliary variables with applications. Journal of the American Statistical Association, 100(470):554–564.
Quinlan, J. J., Quintana, F. A., and Page, G. L. (2017). Parsimonious Hierarchical
Modeling Using Repulsive Distributions. arXiv preprint arXiv:1701.04457.
Quintana, F. A. and Iglesias, P. L. (2003). Bayesian clustering and product partition models. Journal of the Royal Statistical Society: Series B, 65(2):557–574.
Quintana, F. A., Müller, P., and Papoila, A. L. (2015). Cluster-specific variable selection for product partition models. Scandinavian Journal of Statistics.
Ranganath, R. and Blei, D. M. (2017). Correlated random measures. Journal of the
American Statistical Association.
Regazzini, E., Lijoi, A., and Prünster, I. (2003). Distributional results for means of random measures with independent increments. The Annals of Statistics, 31:560–585.
Ren, L., Du, L., Carin, L., and Dunson, D. (2011). Logistic stick-breaking process. The Journal of Machine Learning Research, 12:203–239.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B, 59:731–792.
Roberts, G. O. and Rosenthal, J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18:349–367.
Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1).
Rosiński, J. (2001). Series representations of Lévy processes from the perspective of point processes. In Lévy Processes, pages 401–415. Springer.
Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B, 73:689–710.
Roychowdhury, A. and Kulis, B. (2015). Gamma processes, stick-breaking, and
variational inference. In AISTATS.
Ruiz, F. J., Valera, I., Blanco, C., and Perez-Cruz, F. (2014). Bayesian nonparametric comorbidity analysis of psychiatric disorders. Journal of Machine Learning Research, 15(1):1215–1247.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650.
Shirota, S. and Gelfand, A. E. (2017). Approximate Bayesian Computation and
Model Assessment for Repulsive Spatial Point Processes. Journal of Computa-
tional and Graphical Statistics. In press.
Srebro, N. and Roweis, S. (2005). Time-varying topic models using dependent
Dirichlet processes. Univ. Toronto, Canada, Tech. Rep. TR, 3:2005.
Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In AISTATS.
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3):611–622.
Titsias, M. K. (2008). The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520.
Trippa, L. and Favaro, S. (2012). A class of normalized random measures with an exact predictive sampling scheme. Scandinavian Journal of Statistics, 39(3):444–460.
Wade, S. and Ghahramani, Z. (2017). Bayesian cluster analysis: Point estimation
and credible balls. Bayesian Analysis.
Wallach, H., Jensen, S., Dicker, L., and Heller, K. (2010). An alternative prior process for nonparametric Bayesian clustering. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 892–899.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. The Journal of Machine Learning Research, 11:3571–3594.
Williamson, S., Orbanz, P., and Ghahramani, Z. (2010). Dependent Indian buffet processes. In AISTATS.
Wilson, I. (1983). Add a new dimension to your philately. The American Philatelist, 97:342–349.
Xu, Y., Müller, P., and Telesca, D. (2016). Bayesian inference for latent biological structure with determinantal point processes. Biometrics, 72:955–964.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti, 6:233–243.
Zhang, P., Wang, X., and Song, P. X.-K. (2006). Clustering categorical data based on distance vectors. Journal of the American Statistical Association, 101(473):355–367.
Zhou, M. and Carin, L. (2015). Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307–320.
Zhou, M., Hannah, L., Dunson, D. B., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In International Conference on Artificial Intelligence and Statistics, pages 1462–1471.