TRANSCRIPT
Politecnico di Milano
Mathematics Department
Doctoral Programme In
Mathematical Models and Methods in Engineering
Modeling and computational aspects of
dependent completely random measures in
Bayesian nonparametric statistics
Doctoral Dissertation of:
Ilaria Bianchini
Supervisor:
Prof. Alessandra Guglielmi
Co-supervisor:
Prof. Raffaele Argiento
The Chair of the Doctoral Program:
Prof. Irene Sabadini
Year 2017 - XXX Cycle
Abstract
Bayesian nonparametrics is a lively topic in the statistical literature. Thanks to its versatility, the approach applies to a wide range of modern applications, from machine learning to medicine. In particular, an intense research activity has recently been devoted to the development of dependent stochastic processes to be used as priors in nonparametric models. Exchangeability is indeed no longer the proper assumption in all contexts: many datasets contain covariate information that we wish to leverage to improve model performance. In a nutshell, dependent nonparametric processes extend existing nonparametric priors over measures, partitions, sequences, etc. to obtain priors over collections of such mathematical objects; members of these collections are associated with values in some metric covariate space, such as time or external measurements.
This thesis illustrates different modeling strategies motivated by practical problems involving covariates, the main goals being density estimation and clustering. At the beginning, completely random measures are presented, since they are the leitmotif of the thesis and the Bayesian nonparametric models introduced afterwards are mainly built on top of them. In the following chapters (Chapters 2-5) the original contributions are illustrated.
Chapter 2 presents a truncation method to approximate a priori the mixing measure in an infinite mixture model; in particular, we focus on mixtures where the mixing measure is given by normalized completely random measures. Among the illustrative examples, we show how to easily include covariates in the support of the measure. In Chapter 3, a different approach is presented, where covariates enter directly in the prior for the random partition. This model is motivated by a health-care problem: profiling the different behaviors over time of blood donors. In Chapter 4 we address the issue of overestimating the number of components in mixture models, which typically occurs when using Dirichlet process mixtures.
To this end, a model that induces a-priori separation among the group-specific parameters is proposed. A class of determinantal point process mixture models defined via the spectral representation is explored. These models also incorporate dependence on covariates via mixtures of experts. In the final chapter, we introduce a class of time-dependent processes taking values in the space of exponential completely random measures. These processes have an AR(1)-type structure and may be used as building blocks in latent trait models to develop, for instance, time-dependent feature allocation models.
Sommario
Bayesian nonparametric statistics is a very active and lively research area, thanks to its flexibility and to the wide variety of applications that motivate its development, from machine learning to medicine. In particular, part of the most recent literature is devoted to the introduction of dependent stochastic processes (on time, on covariates, . . . ) to be used as priors
in nonparametric models. Indeed, the typical assumption of exchangeability is not always appropriate: in many contexts, additional information, i.e. the covariates, can be exploited to improve the performance of the statistical model. In short, dependent nonparametric processes extend the prior distributions known in the literature, defined for measures, partitions, . . . , to collections of such mathematical objects. Each element of such a collection is associated with values in the covariate space, such as time or other kinds of measurements.
This thesis illustrates several modeling approaches, motivated by real applications, whose goals are density estimation and data clustering. After introducing the basic concepts of Bayesian nonparametric statistics used in the thesis, four chapters containing the original contribution of this work are presented. The first part illustrates an approximation of random probability measures based on an a-priori truncation: these are used in mixture models whose mixing measure is a normalized completely random measure. One of the examples, in particular, considers the case in which the support of the measure depends on the covariates. A very different approach is adopted in the following chapter, where the covariates directly influence the prior on the random partition. The model is motivated by an interesting applied problem in healthcare: analyzing the behavior of blood donors. Next, we tackle the problem of overestimating the number of groups in mixture models: here a model is proposed and illustrated that induces a-priori separation among the parameters of the mixture components, through point processes that depend on the determinant of a certain covariance matrix. Covariates are included in the model through a mixture-of-experts approach. Finally, we introduce a class of time-dependent processes living in the space of particular completely random measures. The processes have an autoregressive-type structure and can easily be used in more complex models.
Acknowledgements
The work contained in this thesis is the result of collaboration with a number of people, to whom goes my deepest gratitude. First of all I want to thank Alessandra Guglielmi and Raffaele Argiento, who believed in me since we first met. You provided me with continuous support and encouragement. I am also grateful to the brilliant collaborators who hosted me during the visiting periods, professors Fernando Quintana and Jim Griffin. Your hospitality and stimulating discussions made my stays abroad memorable experiences. Another thank-you goes to my PhD colleagues who shared lunch time with me.
A huge thank-you also to my parents and to my lifelong friends: the reference points I can always count on. Finally, thanks to Matteo, to whom I dedicate this thesis, who has put up with me and supported me during these years with patience and love.
Contents

Introduction

1 Introduction to completely random measures
  1.1 Completely random measures
  1.2 Bayesian nonparametric models for density estimation and clustering
  1.3 Generalized latent trait models

2 Posterior sampling from ε-approximation of normalized completely random measure mixtures
  2.1 Introduction
  2.2 Preliminaries on normalized completely random measures
  2.3 ε-approximation of normalized completely random measures
  2.4 ε-NormCRM process mixtures
  2.5 Normalized Bessel random measure mixtures: density estimation
  2.6 Linear dependent NGG mixtures: an application to sports data
  2.7 Conclusion
  Appendix 2.A: Details on full-conditionals for the Gibbs sampler
  Appendix 2.B: Proofs of the theorems

3 Covariate driven clustering: an application to blood donors data
  3.1 Introduction
  3.2 A covariate driven model for clustering
  3.3 Simulated data
  3.4 The AVIS data on blood donations
  3.5 Discussion and future work
  Appendix 3.A: Gibbs sampler
  Appendix 3.B: Gibbs sampler for the blood donations application

4 Determinantal point process mixtures via spectral density approach
  4.1 Introduction
  4.2 Using DPPs to induce repulsion
  4.3 Generalization to covariate-dependent models
  4.4 Simulated data and reference datasets
  4.5 Biopic movies dataset
  4.6 Air quality index dataset
  4.7 Conclusion

5 Constructing stationary time series of completely random measures via Bayesian conjugacy
  5.1 Stationary autoregressive type AR(1) models for univariate data
  5.2 Exponential completely random measures
  5.3 Building a stationary time dependent model for a sequence of discrete random measures
  5.4 Application: latent feature model on a synthetic dataset of images
  5.5 Application: Poisson Factor Analysis for time dependent topic modelling
  5.6 Discussion and future developments

Bibliography
Introduction
As scientific problems become more and more complex, models and computational methods for data analysis require increasingly sophisticated statistical tools. In this sense, Bayesian nonparametric (BNP) statistics offers a framework for the development of flexible models with broad-spectrum applicability. This thesis presents advances in BNP models for dealing with dependence on covariates or time from a modelling perspective; new models involving completely random measures are introduced, together with corresponding MCMC algorithms to perform posterior inference. Throughout the thesis, we tackle various applications where the issue of dependence arises, such as healthcare and image analysis. The building block that recurs in this work is completely random measures: tractable mathematical objects that can be employed for building random probability measures and also for modeling latent structures in the observations. Starting from the definition and the main properties of completely random measures, we present different modelling strategies for performing clustering and density estimation through mixture models, as well as latent feature estimation.
The work is developed in 5 self-contained chapters, whose structure is summarized
hereinafter.
Chapter 1: in order to create a coherent framework for the development of the models presented in the dissertation, we include an initial chapter presenting a literature review and basic concepts related to completely random measures; reading it is useful to understand the main motivations of this work.
Chapter 2 is based on a published paper: see Argiento et al. (2016b). In a nutshell, we deal with nonparametric mixture models whose mixing distribution belongs to the class of normalized homogeneous completely random measures. We tackle the issue of the infinite dimensionality of the parameter by proposing a truncation, discarding the weights of the unnormalized measure smaller than a threshold. We provide some theoretical properties of the approximation, such as convergence and a posterior characterization. A general conditional blocked Gibbs sampler is devised in order to sample from the posterior of the model. Illustrative examples, also including covariate information in the location points of the random measure, show the effectiveness of the method.
Chapter 3 illustrates the problem of predicting the next donation time for a blood donor. We consider data on blood donations provided by the Milan department of AVIS (Italian Volunteer Blood-donors Association). With the goal of characterizing the behavior of donors, we analyze gap times between consecutive blood donations. In particular, we take into account population heterogeneity via model-based clustering. The main contribution is the introduction, in an accelerated failure time model with a skew-normal likelihood, of a prior on the random partition that explicitly accounts for covariate information. In particular, we consider a prior for the random partition of product-partition form, with a term that takes into account the distance between covariates in a cluster.
Chapter 4 deals, differently from the others, with finite mixture models with a random number of components. Typically, when using mixture models, finite or infinite, overestimating the number of groups is quite common; hence, there is a need for models inducing a-priori separation among the clusters. We explore a class of determinantal point process (DPP) mixture models defined via the spectral representation, focusing on a power exponential spectral density. In the second part of the chapter we generalize our model to account for the presence of covariates, both in the likelihood, as a linear regression, and in the weights of the mixture, by means of a mixture-of-experts approach. This yields a trade-off between repulsiveness of the locations in the mixture and attraction among subjects with similar covariates. This project was developed during my stay (1.5 months) at Pontificia Universidad Católica de Chile, under the supervision of Prof. Fernando A. Quintana; a preliminary version of the paper can be found in Bianchini et al. (2017).
Chapter 5 aims at developing a new way of flexibly modeling series of completely random measures that exhibit some temporal dependence. These processes might be fruitful in real-life applications, such as latent feature models for the identification of features in images or Poisson factor analysis for topic modelling. In order to achieve convenient mathematical tractability, namely to be able to define a flexible transition kernel for the process, we consider the large class of exponential-family completely random measures. This leads to a simple description of the process, which has an AR(1)-type structure and offers a framework for generalizations to more complicated forms of time dependence. This project was started during my stay (3 months) at the University of Kent, under the supervision of Prof. Jim Griffin.
Each chapter includes details on the implementation of the MCMC methods employed in posterior inference. Most of the statistical analyses have been carried out using R and C++. The algorithms of the last chapter have been coded with Rcpp, an R package providing R functions as well as C++ classes that offer a seamless integration of R and C++.
Gaussian, N(µ, σ²): density (2πσ²)^(−1/2) exp(−(x−µ)²/(2σ²)); mean µ; variance σ².

Gamma, gamma(α, β): density (β^α/Γ(α)) x^(α−1) exp(−βx) I_(0,+∞)(x); mean α/β; variance α/β².

Inverse Gamma, IG(α, β): density (β^α/Γ(α)) x^(−α−1) exp(−β/x) I_(0,+∞)(x); mean β/(α−1); variance β²/((α−1)²(α−2)), if α > 2.

Beta, Beta(a, b): density (1/B(a, b)) x^(a−1) (1−x)^(b−1) I_(0,1)(x); mean a/(a+b); variance ab/((a+b)²(a+b+1)).

Dirichlet, Dirichlet(α1, . . . , αk): density (Γ(Σ_{i=1}^k αi)/(Γ(α1) · · · Γ(αk))) x1^(α1−1) x2^(α2−1) · · · xk^(αk−1) I_{S^(k−1)}(x); E(Xj) = αj/α0; Var(Xj) = αj(α0 − αj)/(α0²(α0 + 1)), with α0 = Σ_{i=1}^k αi and S^(k−1) = {x ∈ R^k : Σ_{i=1}^k xi = 1, 0 < xi < 1 ∀i}.

Table 1: Absolutely continuous distributions: notation and parameterization used throughout the thesis.
Bernoulli, Bernoulli(π): probability mass π^x (1−π)^(1−x) I(x ∈ {0, 1}); mean π; variance π(1−π).

Binomial, Bin(n, π): probability mass (n choose x) π^x (1−π)^(n−x) I(x ∈ {0, 1, . . . , n}); mean nπ; variance nπ(1−π).

Negative binomial, NB(r, π): probability mass (x+r−1 choose x) (1−π)^r π^x I(x ∈ {0, 1, 2, . . .}); mean πr/(1−π); variance πr/(1−π)².

Multinomial, Multin(n; p1, . . . , pk): probability mass (n!/(x1! · · · xk!)) p1^(x1) · · · pk^(xk) I_S(x); E(Xi) = n pi; Var(Xi) = n pi(1−pi); here Σ_{i=1}^k pi = 1 and S = {(x1, . . . , xk) : Σ_{i=1}^k xi = n, xi ∈ {0, . . . , n} ∀i}.

Poisson, Poisson(λ): probability mass (λ^x/x!) e^(−λ) I(x ∈ {0, 1, 2, . . .}); mean λ; variance λ.

Table 2: Discrete distributions: notation and parameterization used throughout the thesis.
Chapter 1
Introduction to
completely random measures
This introductory chapter describes the leitmotiv of the thesis, that is, completely random measures. These are flexible probabilistic tools that can be exploited in a wide variety of situations when dealing with Bayesian nonparametrics: from mixture models for density estimation and clustering, after a suitable normalization of the random measure, to latent factor models, where completely random measures are considered for modeling the presence or absence of features.
The main results available from the literature and useful for the comprehension of the rest of the thesis are reviewed in this chapter.
1.1 Completely random measures
Completely random measures are elegant and mathematically tractable probabilistic tools that offer a useful framework for understanding peculiar characteristics of popular Bayesian nonparametric priors and for the construction of new models. Extensive descriptions of this subject can be found, among others, in Kingman (1993) and Kallenberg (1983).
We start with the definition of completely random measures, first introduced in Kingman (1967). Denote first by M the space of boundedly finite (positive) measures over a complete and separable metric space X endowed with the Borel σ-algebra X, i.e. for any µ in M and any bounded set A in X one has µ(A) < +∞. Moreover, we let M stand for the corresponding Borel σ-algebra on M.
Definition 1.1 (Completely random measure)
Let µ be a measurable mapping from (Ω, F, P) into (M, M), such that for any n ≥ 1 and any collection A1, . . . , An in X, with Ai ∩ Aj = ∅ for any i ≠ j, the random variables µ(A1), . . . , µ(An) are mutually independent. Then µ is termed a completely random measure (CRM).
In order to gain a better intuition about this mathematical object, it is useful to know that realizations of CRMs are discrete measures with probability 1, at least in this work. In general, a CRM can be decomposed into three independent components: a deterministic measure µdet, a countable collection of M non-negative random masses at non-random locations, and a countable collection of non-negative random masses at random locations, µc = Σ_{i≥1} Ji δθi (see Chapter 8 of Kingman (1993) for a more detailed explanation). Accordingly,

µ = µc + Σ_{i=1}^M Vi δψi + µdet

where the fixed location points ψ1, . . . , ψM, with M ∈ {1, 2, . . .} ∪ {∞}, are in X, and the random jumps V1, . . . , VM are mutually independent and independent of µc. In what follows, we will not consider the deterministic component µdet, which can be viewed as a centering measure and is not important when dealing with statistical modelling. Typically, when considering priors in the Bayesian nonparametric context, we assume a completely random measure given only by the component µc, which is easier to characterize, ignoring the component with random jumps at fixed locations. Therefore, it is assumed that a CRM is an a.s. discrete measure with random jumps and random support points.
Every random measure can be characterized through its Laplace exponent, i.e. the expectation of all linear functionals (see Kallenberg, 1983). In particular, a CRM µc is characterized by the Lévy-Khintchine representation, which states that

E[e^{−∫_X f(t) µc(dt)}] = exp(−∫_{R+×X} (1 − e^{−s f(t)}) ν(ds, dt))   (1.1)
where f : X → R is a measurable function such that ∫ |f| dµc < ∞ (almost surely) and ν is a measure on R+ × X such that for any D in X

∫_D ∫_{R+} min(s, 1) ν(ds, dt) < ∞.   (1.2)
The measure ν in (1.1) characterizing µc is referred to as the Lévy intensity of µc: it contains the information about the distribution of the jumps and locations of the random measure. Such a measure will play an important role throughout this work. Moreover, it is often useful to factorize the jump and the location part of ν by writing it as

ν(ds, dx) = ρx(ds) P0(dx)   (1.3)

where P0 is a probability measure on X and ρ is a transition kernel on X × B(R+), namely x ↦ ρx(A) is X-measurable for any A in B(R+) and ρx is a measure on (R+, B(R+)) for any x in X, called the kernel. If ρx = ρ does not depend on x, both the intensity ν and the CRM µc are called homogeneous; in this case, the sequence of jumps {Ji}i≥1 is controlled by the kernel ρ and the location points are independent of the jumps and are independent and identically distributed according to P0.
Example 1 (Gamma process)
A homogeneous CRM λ whose Lévy intensity is given by

ν(ds, dx) = κ (e^{−s}/s) ds P0(dx)

is a gamma process with parameters (κ, P0), κ > 0.
The name gamma process originates from the fact that the Laplace functional (1.1) of f = γ I_D, with γ > 0 and I_D the indicator function of a subset D, reduces to

E[e^{−γλ(D)}] = (1 + γ)^{−κP0(D)}.

Then, it is clear that the random variable λ(D), i.e. the random mass assigned to a subset D, is gamma distributed with parameters (κP0(D), 1).
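As a quick sanity check on this example, the gamma marginal of λ(D) can be verified by Monte Carlo: complete randomness makes the masses of disjoint sets independent, and each mass is gamma(κP0(D), 1) in the parameterization of Table 1. The sets D1, D2, the base measure P0 = N(0, 1), and κ = 2 below are illustrative choices, not taken from the thesis.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
kappa = 2.0

def phi(x):
    # standard normal CDF, so we can compute P0 of intervals under P0 = N(0, 1)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

P0_D1 = phi(0.0) - phi(-1.0)   # P0(D1) for D1 = (-1, 0]
P0_D2 = phi(1.0) - phi(0.0)    # P0(D2) for D2 = (0, 1], disjoint from D1

# complete randomness: masses of disjoint sets are independent gamma variables
lam_D1 = rng.gamma(shape=kappa * P0_D1, scale=1.0, size=100_000)
lam_D2 = rng.gamma(shape=kappa * P0_D2, scale=1.0, size=100_000)
lam_union = lam_D1 + lam_D2    # additivity: λ(D1 ∪ D2) = λ(D1) + λ(D2)

# mean of gamma(α, 1) is α, so sample means should be close to κ·P0(D)
print(lam_D1.mean(), kappa * P0_D1)
print(lam_union.mean(), kappa * (P0_D1 + P0_D2))
```

The union mass is again gamma with shape κ(P0(D1) + P0(D2)), as the Laplace functional predicts for any measurable set.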
Example 2 (Beta process)
A homogeneous CRM η whose Lévy intensity is

ν(ds, dx) = κ s^{−1}(1 − s)^{c−1} ds P0(dx),

with support on (0, 1], is a beta process with parameters (κ, c, P0), where κ, c > 0. This is a degenerate beta density, where κ is the mass parameter and c is called the concentration parameter. This CRM was first introduced in Hjort (1990) for survival analysis. Analogously to the previous example, one can show that, for any D ∈ X, η(D) ∼ Beta(cκP0(D), c(1 − κP0(D))).
Note that η(D) has mean equal to κP0(D) and variance κP0(D)(1 − κP0(D))/(c + 1); thus, the interpretation of c is that of a concentration parameter.
CRMs are closely connected to Poisson processes; before explaining this relationship, it is useful to recall the definition of a Poisson process.
Definition 1.2 (Poisson process)
A Poisson process Π with Lévy intensity ξ(·) on Y is a random countable subset of a separable space Y such that, denoting by N(A) = #(Π ∩ A) the cardinality of Π ∩ A for any subset A of Y:
1. for any disjoint measurable subsets A1, A2, . . . , An of Y, the random variables N(A1), N(A2), . . . , N(An) are independent;
2. N(A) has the Poisson distribution Poisson(ζ), where ζ = ξ(A), whenever 0 < ξ(A) < ∞.
Every CRM µc can be represented as a linear functional of a Poisson process Π defined on the product space Y = R+ × X with Lévy intensity as in (1.3); see Kingman (1967):

µc(A) = ∫_A ∫_{R+} s Π(ds, dx), ∀A ∈ X.

From a constructive viewpoint, the (homogeneous) measure µc = Σ_{i≥1} Ji δθi with Lévy intensity as in (1.3) can be generated by sampling the locations θi according to P0 and the jumps Ji from a Poisson process with intensity ρ(ds). An illustration is reported in Figure 1.1.
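The constructive view can be sketched numerically for the gamma process of Example 1. Since ρ(ds) = κ e^{−s}/s ds has infinite total mass, only the jumps larger than a threshold ε are kept (anticipating the ε-approximation of Chapter 2): their number is Poisson with mean ρ([ε, ∞)), their sizes are i.i.d. with density proportional to ρ on [ε, ∞), and the locations are i.i.d. from P0. The grid-based inverse-CDF sampler, the choice P0 = N(0, 1), and all parameter values below are illustrative assumptions, not the thesis' algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
kappa, eps, s_max = 1.0, 1e-3, 50.0

# grid on [ε, s_max] and Lévy density ρ(s) = κ e^{-s}/s of the gamma process
s = np.logspace(np.log10(eps), np.log10(s_max), 20_000)
rho = kappa * np.exp(-s) / s

# trapezoidal cumulative integral of ρ: total mass ≈ ρ([ε, ∞)), which is finite
widths = np.diff(s)
avg_height = 0.5 * (rho[1:] + rho[:-1])
cdf_unnorm = np.concatenate(([0.0], np.cumsum(widths * avg_height)))
mass = cdf_unnorm[-1]
cdf = cdf_unnorm / mass

n_jumps = rng.poisson(mass)                           # number of jumps above ε
jumps = np.interp(rng.uniform(size=n_jumps), cdf, s)  # inverse-CDF sampling of J_i
locs = rng.normal(size=n_jumps)                       # θ_i i.i.d. from P0 = N(0, 1)

print(n_jumps, jumps.sum())  # random number of atoms and total mass of the truncated CRM
```

The pairs (Ji, θi) are exactly the atoms of a draw from the truncated Poisson process on [ε, ∞) × X, and µc is recovered as ε → 0.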
1.2 Bayesian nonparametric models for density estimation and clustering
In this section we review some of the most popular models for density estimation and clustering. In the framework of Bayesian nonparametric statistics, indeed, these two goals can be pursued simultaneously by means of semi-parametric or nonparametric mixture models, where CRMs or, more precisely, normalized CRMs, are used as the random mixing measure.
1.2.1 Mixture models
A general tool for dening a prior on densities has been rst suggested in Lo
(1984) and Ferguson (1983). The basic idea consists of introducing a sequence of
exchangeable latent variables θnn≥1 generated according to some discrete random
probability measure. At rst, let us start recalling the notion of exchangeability.
Figure 1.1: Representation of the Lévy intensity for a beta process ν(s, x) = κs^{−1}(1 − s)^{c−1}P0(x), with P0 a standard Gaussian distribution, κ = 1 and c = 0.5 (left). A realization of a CRM with this intensity, with jumps J ∼ PP(ρ) and locations θ ∼ P0 (right).
Let (θn)n≥1 be a sequence of observations defined on a probability space (Ω, F, P), where each θi takes values in X, a complete and separable metric space endowed with a σ-algebra X (for instance, X = R^k for some positive integer k and X = B(R^k)). The typical assumption in the Bayesian approach is exchangeability of the sequence of observations. Formally, this means that for every n ≥ 1 and every permutation π of the indices 1, 2, . . . , n,

L(θ1, θ2, . . . , θn) = L(θπ(1), θπ(2), . . . , θπ(n)).
The strength of (infinite) exchangeability lies in the following theorem:
Theorem 1.1 (de Finetti's representation)
If θ1, θ2, . . . is an infinitely exchangeable sequence of variables with probability measure P, then there exists a distribution Q on PX, the set of all probability measures on X, such that the joint distribution of (θ1, θ2, . . . , θn) has the form

p(dθ1, dθ2, . . . , dθn) = ∫_{PX} ∏_{i=1}^n P(dθi) dQ(P),

where the integral is calculated over the space PX of the probability measures on X. Equivalently, one could write

θi | P iid∼ P, i = 1, 2, . . . , n,
P ∼ Q

for any n ≥ 1.
Therefore P is a random element defined on (Ω, F, P) with values in PX, and its distribution Q is the so-called de Finetti measure; it can be interpreted as the prior distribution on an infinite-dimensional object, that is, a random probability distribution.
Figure 1.2: Representation of a mixture of 4 Gaussian components.
Mixture models provide a statistical framework for modeling a collection of (exchangeable) continuous observations (X1, . . . , Xn), where each measurement is supposed to arise from one of k groups, with k possibly unknown, and each group is modeled by a kernel distribution from a suitable parametric family.
This model is usually represented hierarchically in terms of a collection of independent and identically distributed latent random variables (θ1, . . . , θn):

Xi | θi ind∼ K(·|θi), i = 1, . . . , n
θi | P iid∼ P, i = 1, . . . , n
P ∼ Q   (1.4)
where P is a discrete random probability measure, Q is its distribution (i.e. the prior) and K(·|θ) is a probability density function parametrized by the latent random variable θ; for instance, K can be a Gaussian distribution and, in that case, θ = (µ, Σ). Note that finite mixture models are recovered when P has only a finite (fixed or random) number L of random jumps at random location points, i.e. P = Σ_{l=1}^L pl δτl, where usually a Dirichlet prior is given to the vector (p1, . . . , pL) and the atoms are independent and identically distributed (i.i.d.) from a probability distribution P0. On the other hand, in the infinite mixture model case, P has a countable number of atoms, P = Σ_{j≥1} pj δτj.
Model (1.4) is equivalent to assuming that the data X1, . . . , Xn are i.i.d. according to a random probability density that is a convolution of kernel distributions:

X1, . . . , Xn | P iid∼ f(x) = ∫_Θ K(x|θ) P(dθ).   (1.5)
The randomness of P is inherited by the unknown density f: therefore, this approach allows us to put a prior on f, which is often the object of interest when tackling density estimation problems. Note that, since P is discrete, the mixture model can be written as a weighted sum of a countably infinite number of parametric densities,

f(x) = Σ_{j=1}^{+∞} pj K(x|τj),

where the weights (pj)j≥1 represent the relative frequencies of the groups in the population, indexed by the τj's.
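A minimal numerical illustration of this countable-sum representation, with a Gaussian kernel and made-up weights and atoms (truncated to three components for simplicity): the discrete P turns the kernel into a proper continuous density.

```python
import numpy as np

# illustrative (invented) mixture: weights p_j, atoms τ_j, kernel K(x|τ) = N(x | τ, 0.5²)
p = np.array([0.5, 0.3, 0.2])      # mixture weights, summing to 1
tau = np.array([-2.0, 0.0, 3.0])   # atom locations τ_j of the discrete measure P
sigma = 0.5

def f(x):
    # f(x) = Σ_j p_j K(x|τ_j): evaluate every Gaussian component, then mix
    comps = np.exp(-0.5 * ((x[:, None] - tau) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps @ p

# the mixture integrates to 1, so f is itself a probability density
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
total = (f(x) * dx).sum()
print(total)
```

Each pj is the prior probability that an observation is generated by the component with parameter τj, which is what makes the clustering interpretation below possible.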
This approach provides a flexible model for clustering items in a hierarchical setting without the need to specify the exact number of clusters in advance. Indeed, in representation (1.4), given the discreteness of the mixing measure, there can be ties among the latent variables, since P(θi = θj) > 0 for any i ≠ j.
Possible coincidences among the θi's induce a partition structure within the observations. Suppose, for instance, that there are k ≤ n distinct values θ∗1, . . . , θ∗k among θ1, . . . , θn and let Aj := {i : θi = θ∗j} for j = 1, . . . , k. According to such a definition, any two different indices i and l belong to the same group Aj if and only if θi = θl = θ∗j. Hence, the Aj's describe a clustering scheme for the (continuous) observations Xi's: any two observations Xi and Xl belong to the same cluster if and only if i, l ∈ Aj for some j. In particular, the number of distinct values θ∗i among the latent θi's identifies the number of clusters into which the n observations can be partitioned. Within the framework of nonparametric hierarchical mixture models, one might be interested in determining an estimate of the density f and in evaluating the posterior distribution of the number of clusters present in the observed sample. The most popular model of this family is the Dirichlet Process Mixture (DPM) model, where the random probability measure P is the Dirichlet process.
1.2.2 Dirichlet process mixture
The Dirichlet process is a cornerstone in Bayesian nonparametrics since its rst
introduction in Ferguson (1973). Its success can be explained by its mathematical
tractability and the ease of use when devising Markov chain Monte Carlo (MCMC)
techniques. Before describing the mixture model based on the DP, we briey review
the denition and the main properties of the process.
Definition 1.3 (Dirichlet process)
Let P0 be a distribution over Θ and κ a positive real number. A random probability measure P is a Dirichlet process with base distribution P0 and concentration parameter κ, written P ∼ DP(κ, P0), if

(P(S1), . . . , P(Sr)) ∼ Dirichlet(κP0(S1), . . . , κP0(Sr))

for every finite measurable partition S1, . . . , Sr of Θ. See Table 1 in the Introduction for the notation.
The parameters P0 and κ play intuitive roles in the definition of the DP. The base distribution is the mean of the DP: for any measurable set S, we have E(P(S)) = P0(S). On the other hand, the concentration parameter κ can be understood as an inverse variance: Var(P(S)) = P0(S)(1 − P0(S))/(κ + 1). The larger κ is, the smaller the variance, and the DP will concentrate more of its mass around the mean.
The first property of the DP to review is conjugacy. Let P ∼ DP(κ, P0) and let θ1, . . . , θn be a sequence of i.i.d. draws from P. We are interested in the posterior distribution of P given the observed values of (θ1, . . . , θn). Let nk = #{i : θi ∈ Sk} be the number of observed values in Sk. It is straightforward to show that

(P(S1), . . . , P(Sr)) | θ1, . . . , θn ∼ Dirichlet(κP0(S1) + n1, . . . , κP0(Sr) + nr),

and this relationship holds for any r and any partition of the space. By definition of the DP, the posterior of P is then a DP as well, with concentration parameter κ + n and base distribution (κP0 + Σ_{i=1}^n δθi)/(κ + n), where δx is the Dirac delta located at x, i.e.

P | θ1, . . . , θn ∼ DP(κ + n, (κP0 + Σ_{i=1}^n δθi)/(κ + n)).

Hence, the DP provides a conjugate family of priors over distributions that is closed under posterior updates given observations.
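The posterior update can be illustrated with a toy computation (all values below are invented): partition Θ = R into S1 = (−∞, 0] and S2 = (0, ∞), take a symmetric base measure so that P0(S1) = 1/2, and compute the posterior mean of P(S1).

```python
kappa = 4.0
P0_S1 = 0.5                            # P0(S1) = 1/2 for a symmetric base measure
theta = [-0.3, 0.2, 1.1, -2.0, 0.7]    # invented draws θ_1, ..., θ_n
n = len(theta)
n1 = sum(t <= 0.0 for t in theta)      # n_1 = #{i : θ_i ∈ S1}

# posterior Dirichlet parameter for S1 and the implied posterior mean of P(S1)
a1 = kappa * P0_S1 + n1
post_mean_S1 = a1 / (kappa + n)
print(post_mean_S1)
```

The posterior mean (κP0(S1) + n1)/(κ + n) is a weighted average of the prior guess P0(S1) and the empirical frequency n1/n, with κ acting as a prior sample size.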
Another property that plays a fundamental role when developing MCMC algorithms or generalizing the DP to include covariate dependence is the Blackwell-MacQueen urn scheme, which characterizes the sequence of predictive distributions of a sample from a DP. Let θ1, θ2, . . . | P iid∼ P and P ∼ DP(κ, P0). Consider the posterior predictive distribution for a new sample θn+1, conditioning on θ1, . . . , θn, i.e. when P is integrated out.
It is straightforward to prove that, for any n = 1, 2, . . . ,

θn+1 | θ1, . . . , θn ∼ (1/(κ + n)) (κP0 + Σ_{i=1}^n δθi).   (1.6)
See Blackwell and MacQueen (1973). Thus, the posterior base distribution given
θ1, . . . , θn is also the predictive distribution of a new observation θn+1.
We highlight that the predictive distribution (1.6) has point masses located at
the previous draws θ1, . . . , θn. Thus, with positive probability, a sample from the
DP will have ties, regardless of the continuity of the base measure.
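The predictive rule (1.6) translates directly into a Pólya-urn sampler: with probability κ/(κ + n) the new draw is fresh from P0, otherwise it equals one of the previous draws, picked uniformly among them. The sketch below (with κ = 1 and P0 = N(0, 1) as illustrative choices) shows that a DP sample exhibits ties despite the continuous base measure.

```python
import random

random.seed(42)
kappa = 1.0

def polya_urn(n, base=lambda: random.gauss(0.0, 1.0)):
    """Draw θ_1, ..., θ_n sequentially from the predictive rule (1.6)."""
    draws = []
    for i in range(n):  # i previous draws so far
        if random.random() < kappa / (kappa + i):
            draws.append(base())                 # new value, from the κP0 part
        else:
            draws.append(random.choice(draws))   # tie: reuse a previous draw
    return draws

sample = polya_urn(200)
n_unique = len(set(sample))
print(n_unique)  # far fewer distinct values than 200: ties have positive probability
```

The expected number of distinct values grows only logarithmically in n, a first glimpse of the rich-get-richer behavior discussed at the end of this section.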
As a last remark about the properties of the DP, we provide a constructive definition of the process, called the stick-breaking construction. Consider

βk iid∼ Beta(1, κ), τk iid∼ P0,
p1 = β1, pk = βk ∏_{i=1}^{k−1} (1 − βi), k = 2, 3, . . .
Define P = Σ_{j≥1} pj δτj; Sethuraman (1994) proved that P ∼ DP(κ, P0). The construction of the pk's can be understood metaphorically by considering βk as the length of a piece of stick. Starting with a stick of length 1, we break it at β1, assigning p1 to be the length of the piece we just broke off. Now recursively break the remaining portion to obtain p2, p3 and so forth. This simple construction has helped in building posterior inference, as well as in defining extensions of the DP.
A Dirichlet process mixture (DPM) prior (Antoniak, 1974) is (1.5) when P ∼ DP(κ, P0), namely

X_i | P iid∼ f(X_i) = ∫ K(X_i | θ) P(dθ), P ∼ DP(κ, P0). (1.7)

For example, in a univariate setting, a DP location mixture of normals is

X_i | P ∼ ∫_R N(X_i | µ, σ²) P(dµ), P ∼ DP(κ, P0),

where P0 is a base measure defined on R. Working with an infinite number of components is particularly appealing because it ensures that, for appropriate choices of the kernel K(X_i | θ), the DPM model has support on a large class of distributions. For example, Lo (1984) showed that a DP location-scale mixture of normals has full support on the space of absolutely continuous distributions.
We can further rewrite model (1.7) in terms of clusters and unique values of the latent parameters by defining a vector (s1, s2, . . . , sn) of labels indicating to which cluster item i belongs, i.e. s_i = j ⇔ θ_i = θ*_j:

X_i | θ*_j, s_i = j ind∼ K(x_i | θ*_j), i = 1, . . . , n

θ*_j iid∼ P0, j = 1, . . . , K_n

p(s1, . . . , sn) = Γ(κ)/Γ(κ + n) κ^{K_n} Γ(n_1) × · · · × Γ(n_{K_n}) (1.8)

where K_n is the number of distinct values among s1, . . . , sn and n_j = ∑_i I(s_i = j) is the number of indicators s_i equal to j.
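The prior (1.8) is easy to evaluate. The sketch below is our own helper (computed in log scale for numerical stability); with κ = 1 a single cluster of n items has probability Γ(n)/Γ(n + 1) = 1/n, which the example checks for n = 4.

```python
import math
from collections import Counter

def dp_partition_log_prob(labels, kappa):
    """Log of the DP-induced prior (1.8) on the label vector (s_1..s_n):
    log[ Gamma(kappa)/Gamma(kappa+n) * kappa^{K_n} * prod_j Gamma(n_j) ]."""
    n = len(labels)
    sizes = Counter(labels).values()   # cluster cardinalities n_j
    return (math.lgamma(kappa) - math.lgamma(kappa + n)
            + len(sizes) * math.log(kappa)
            + sum(math.lgamma(nj) for nj in sizes))

# kappa = 1, one cluster of size 4: probability Gamma(4)/Gamma(5) = 1/4
p_one_cluster = math.exp(dp_partition_log_prob([1, 1, 1, 1], kappa=1.0))
```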
Previous work on nonparametric Bayesian clustering has paid some attention to the rich-get-richer property implicitly imposed a priori by the Dirichlet process (see, e.g., Wallach et al., 2010). This property leads to partitions consisting of a small number of large clusters: new observations are more likely to join already-large clusters. This is clear from representation (1.8), where the probability of joining an already
existing cluster is proportional to the cardinality of that cluster. Although the rich-get-richer cluster usage may be appropriate for some clustering applications, there are others for which it is undesirable. Thus, there is a need for alternative priors in clustering models. In what follows, we introduce a more general class of mixture models that includes the DPM and partially alleviates the rich-get-richer behavior; with this aim, we consider mixtures with mixing measure given by normalized completely random measures (NormCRMs). As we will see in the next section, NormCRMs are very flexible but still mathematically and computationally tractable, making them a good choice for P in the mixture models.
Another issue when dealing with DPM models is how to manage the presence of covariates: indeed, traditional nonparametric priors such as the Dirichlet process assume that observations are exchangeable. Exchangeability is not a reasonable assumption in every context. For example, in time series or spatial data, we often see correlations between observations that occur close in time or space. Many datasets contain these types of covariate information; we wish to include these covariates as deterministic variables to condition on, in order to exploit this additional information and improve model performance. Typically, it is assumed that the members of these collections of priors are associated with values in some covariate space - usually a metric space representing time or geographic location - and that locations that are close in covariate space tend to generate similar structures. Dependent nonparametric processes have been used in a wide variety of applications. Examples include image processing (see e.g. the R package dpmixsim, da Silva and da Silva, 2012), text analysis (e.g. Blei et al., 2010) and finance, where they are used to construct stochastic volatility models (Delatola and Griffin, 2011, among others). If we wish to model data that depend on some covariate, it makes sense to build a collection of correlated processes (MacEachern, 2000). The goal is thus to induce dependence between random measures, both in terms of the locations and the jumps.
1.2.3 Normalized CRM
The Dirichlet process on a complete and separable metric space X can also be obtained by normalizing the jumps of a gamma CRM µ with parameter (κ, P0) as described in Example 1: the random probability measure Q = µ/µ(X) has the same distribution as the Dirichlet process on X with parameter (κ, P0). Therefore, one might wonder whether a full Bayesian analysis can be performed if, in the above normalization, the gamma process is replaced by any CRM with a generic Lévy intensity as in (1.3). From a Bayesian perspective, the idea of normalization first appeared in Regazzini et al. (2003). A definition stated in terms of CRMs is as follows.
Definition 1.4 (Normalized CRM)
Let µ be a CRM with intensity measure ν such that 0 < µ(X) < ∞ almost surely. Then, the random probability measure

Q = µ / µ(X)

is called a normalized completely random measure on (X, X).

The requirement of a positive and finite total mass µ(X) is satisfied if the corresponding intensity ν = ρ_x P0 (in the non-homogeneous case) is such that

∫_{R+} ρ_x(ds) = +∞, for all x. (1.9)

This means that the jumps of the process form a dense set in (0, +∞) and there are infinitely many masses near the origin: indeed, according to (1.9) the total mass of ρ_x must sum up to +∞, while the second condition in (1.2) forces ρ_x to place this mass near the origin. In this case µ is also called an infinite activity process.
It is important to remark that, apart from the Dirichlet process, NormCRMs are not structurally conjugate. Nonetheless, one can still provide a posterior characterization of NormCRMs in the form of a mixture representation. In the sequel, we will focus on NormCRMs whose underlying Lévy intensity has a non-atomic centering measure P0. It is then useful to introduce an auxiliary variable whose density, conditionally on the sample, can be expressed as

p_X(u) ∝ u^{n−1} e^{−ψ(u)} ∏_{j=1}^k ∫_{R+} s^{n_j} e^{−us} ρ(ds)

where ψ is the Laplace exponent, namely

ψ(u) = ∫_{R+} (1 − e^{−us}) ρ(ds)

(in the homogeneous case). Then, if data are from model (1.4), where Q is the probability distribution of a NormCRM, the posterior of P is still a NormCRM with some fixed location points corresponding to the locations of the observations, conditioning on the auxiliary variable u that we introduced. Details and theorems are given in James et al. (2009). This can be considered a sort of conditional conjugacy property, which makes computation simpler: in fact, when building a Gibbs sampler for the posterior of (1.4), the full conditionals are relatively easy to derive thanks to the presence of u.
In the rest of the thesis, by P ∼ NormCRM(ρ, P0) we denote a homogeneous normalized completely random measure with intensity ν(ds, dx) = ρ(ds)P0(dx), where P0 is a probability measure and ρ(·) is the Lévy intensity controlling the jumps, such that the following two conditions hold:

∫_{R+} min(s, 1) ρ(ds) < +∞ and ∫_{R+} ρ(ds) = +∞.
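As a quick numerical sanity check (our own sketch, not part of the thesis code), one can verify the Laplace exponent ψ(u) = ∫ (1 − e^{−us}) ρ(ds) against the closed form κ log(1 + u) that holds for the gamma CRM of Example 1, whose intensity is ρ(ds) = κ s^{−1} e^{−s} ds. The grid bounds and step count below are arbitrary choices; a log-spaced midpoint rule handles the blow-up of ρ at the origin, where the integrand (1 − e^{−us})ρ(s) nevertheless stays integrable.

```python
import math

def laplace_exponent(u, rho_density, s_min=1e-8, s_max=50.0, steps=100000):
    """Approximate psi(u) = int_0^inf (1 - e^{-u s}) rho(s) ds by a
    midpoint rule on a log-spaced grid (ds = s d(log s))."""
    log_a, log_b = math.log(s_min), math.log(s_max)
    h = (log_b - log_a) / steps
    total = 0.0
    for i in range(steps):
        s = math.exp(log_a + (i + 0.5) * h)
        total += (1.0 - math.exp(-u * s)) * rho_density(s) * s * h
    return total

kappa = 2.0
gamma_rho = lambda s: kappa * math.exp(-s) / s   # gamma CRM intensity
psi_numeric = laplace_exponent(1.5, gamma_rho)
psi_exact = kappa * math.log(1.0 + 1.5)          # known closed form kappa*log(1+u)
```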
1.2.4 Exchangeable partition probability functions and product partition models

Infinite mixture models with NormCRMs as mixing measure are a flexible tool for both density estimation and clustering; however, if the statistical focus is on clustering, we may consider the same model under a different parametrization, starting from the so-called exchangeable partition probability function driving the prior induced on the random partition.
We have already mentioned how the discreteness of P in model (1.4) implies that there might be ties among the latent variables. Correspondingly, define ρn to be a random partition of the integers {1, . . . , n} such that any two integers i and j belong to the same set in ρn if and only if θ_i = θ_j. Observe that ρn is random since (θ1, . . . , θn) is. Let k ∈ {1, . . . , n} and suppose {A1, . . . , Ak} is a partition of {1, . . . , n}, that is, a possible realization of ρn. We already saw in model (1.8) that we can express the DPM through the parameter ρn. In this case, the prior induced on ρn depends only on the cardinalities of the groups, n1, . . . , nk. More generally, under the assumption of exchangeability, a common specification for the probability distribution of ρn consists in assuming that it depends only on the frequencies of each set in the partition, namely it is a function of (n1, . . . , nk) ∈ Π_{n,k} as follows:

p(ρn = {A1, . . . , Ak}) = π^{(n)}_k (n1, . . . , nk) (1.10)

where

Π_{n,k} = {(n1, . . . , nk) : n_i ≥ 1, ∑_{j=1}^k n_j = n}.
Then {π^{(n)}_k : 1 ≤ k ≤ n, n ≥ 1}, with π^{(n)}_k defined in (1.10), is termed the exchangeable partition probability function (EPPF).

The EPPF determines the distribution of a random partition of N. It follows that, for any 1 ≤ k ≤ n and any (n1, . . . , nk) ∈ Π_{n,k}, π^{(n)}_k is a symmetric function of its arguments and satisfies the marginal invariance rule

π^{(n)}_k (n1, . . . , nk) = π^{(n+1)}_{k+1} (n1, . . . , nk, 1) + ∑_{j=1}^k π^{(n+1)}_k (n1, . . . , n_j + 1, . . . , nk). (1.11)

Conversely, as shown in Pitman (1996), every non-negative symmetric function satisfying rule (1.11) is the EPPF of some exchangeable sequence. See Pitman (1996) for a thorough analysis of EPPFs.
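The addition rule (1.11) can be checked numerically for a concrete EPPF. Below we use the DP EPPF π^{(n)}_k(n1, . . . , nk) = κ^k ∏_j (n_j − 1)! Γ(κ)/Γ(κ + n), which is the EPPF implicit in (1.8); the function names and the test configuration are our own.

```python
import math

def dp_eppf(sizes, kappa):
    """EPPF of the Dirichlet process evaluated at cluster sizes (n_1..n_k):
    kappa^k * prod_j (n_j - 1)! * Gamma(kappa) / Gamma(kappa + n)."""
    n = sum(sizes)
    log_p = (len(sizes) * math.log(kappa)
             + sum(math.lgamma(nj) for nj in sizes)
             + math.lgamma(kappa) - math.lgamma(kappa + n))
    return math.exp(log_p)

def addition_rule_rhs(sizes, kappa):
    """Right-hand side of the marginal invariance rule (1.11)."""
    rhs = dp_eppf(sizes + [1], kappa)                      # item n+1 opens a new block
    for j in range(len(sizes)):
        bumped = sizes[:j] + [sizes[j] + 1] + sizes[j+1:]  # item n+1 joins block j
        rhs += dp_eppf(bumped, kappa)
    return rhs

sizes = [3, 1, 2]
lhs = dp_eppf(sizes, kappa=0.7)
rhs = addition_rule_rhs(sizes, kappa=0.7)
```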
A different perspective on priors for random partitions of exchangeable sequences is given by the product partition model (PPM), proposed by Hartigan (1990) and Barry and Hartigan (1993) and popularized in the BNP literature by Quintana and Iglesias (2003). The PPM explicitly defines a probability distribution p(ρn) over partitions by using a non-negative function of A_j, the collection of indices in {1, . . . , n} of data assigned to cluster j; this function is denoted by c(A_j) and is known as the cohesion function. A product partition probability is defined as

p(ρn = {A1, . . . , Ak}) = C ∏_{j=1}^k c(A_j) (1.12)

where C is a suitable normalizing constant, depending on the number of clusters k. Conditional on a given partition, the PPM typically assumes independent sampling across clusters for data X1, . . . , Xn, i.e.

p(X1, . . . , Xn | ρn, θ*_j, j = 1, 2, . . .) = ∏_j p(X*_j | θ*_j) (1.13)

where the θ*_j are cluster-specific parameters and X*_j is the collection of observations belonging to the j-th group. Applications of the PPM often use exchangeability of the X_i across i ∈ A_j by assuming that the X_i, i ∈ A_j, are i.i.d. given θ*_j. One of the appealing characteristics of the PPM is its conjugate nature: the posterior p(ρn | X1, . . . , Xn) is again a product partition model, with updated cohesion functions c(A_j)p(X*_j), where p(X*_j) is the marginal law of the X_i, i ∈ A_j, under partition ρn.
Exchangeable infinite mixture models where the mixing measure is a NormCRM, namely

X_i | θ_i ind∼ f(X_i | θ_i), θ_i | P iid∼ P, P ∼ NormCRM(ρ, P0), (1.14)

can be written as product partition models with likelihood (1.13) and prior on the partition (1.12) through a change of parametrization. In fact, as in the DPM case, if the random probability measure P in model (1.14) is not of interest, it may be marginalized out, yielding:

X_i | {θ*_j}_{j=1}^k, s_i ind∼ f(X_i | θ*_{s_i}), θ*_j iid∼ P0, ρn = (s1, . . . , sn) ∼ p(ρn) (1.15)

where p(ρn) is the induced prior distribution of the partition ρn (see James et al., 2009). Note that we assume that data are i.i.d. within clusters and independent across clusters. In the case of Gibbs-type models (see e.g. De Blasi et al., 2015), as for instance when P in (1.4) is the normalized generalized gamma process (Regazzini et al., 2003), the prior on the partition in (1.15) can be seen as an (exchangeable) product partition model, namely

p(ρn = {A1, . . . , A_{kn}}) = C ∏_{j=1}^{kn} c(A_j),
where C is the normalizing constant. The relationship between exchangeable infinite mixture models and product partition models emphasizes the marginal invariance of the implied sequence of partition distributions with increasing sample size. Loosely speaking, the probability distribution for partitions of {1, 2, . . . , n} is the same as the distribution obtained by marginalizing out item (n + 1) from the probability distribution for partitions of the (n + 1) items {1, 2, . . . , n + 1} (see, for instance, Section 2.4 of Dahl et al., 2017).

Now, we specify the cohesion function for two simple cases of mixture models (1.4). First, examining the Blackwell-MacQueen urn scheme implied by the DP, it is clear that the cohesion function in this case is c(A_j) = κ(n_j − 1)!, where n_j = |A_j| denotes the cardinality of the j-th cluster.
The second special case is given by the normalized generalized gamma process, denoted by NGG(κ, σ, P0). The Lévy intensity of the jumps is

ρ(ds) = κ s^{−1−σ} e^{−s} ds, s > 0,

where κ is a mass parameter and σ a discount parameter, σ ∈ [0, 1). The cohesion function becomes c(A_j) = (1 − σ)_{n_j − 1}, where (α)_n is the Pochhammer symbol, or rising factorial, defined as (α)_n = Γ(α + n)/Γ(α). It is clear that the parameter σ has a deep influence on the clustering behavior. For a thorough discussion of this topic, see Lijoi et al. (2007).
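A quick computation (our own sketch; the helper name and the test values are illustrative) shows how the discount σ tempers the rich-get-richer effect: the cohesion of a large cluster relative to a singleton shrinks as σ grows, so large clusters attract new items less aggressively than under the DP (σ = 0).

```python
import math

def ngg_cohesion(n_j, sigma):
    """NGG cohesion c(A_j) = (1 - sigma)_{n_j - 1}, with the rising
    factorial (alpha)_m = Gamma(alpha + m) / Gamma(alpha); sigma = 0
    recovers the DP cohesion (n_j - 1)! up to the mass parameter."""
    return math.exp(math.lgamma(1.0 - sigma + n_j - 1) - math.lgamma(1.0 - sigma))

# cohesion of a size-10 cluster relative to a singleton, for growing sigma:
# the larger the discount, the weaker the rich-get-richer effect
ratios = [ngg_cohesion(10, s) / ngg_cohesion(1, s) for s in (0.0, 0.3, 0.6)]
```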
As a further step, it is often the case that we want to include some covariate information in the model: for instance, dependence on time, space or subject-specific information. From a modeling viewpoint, it is natural to assume that observations with similar covariates (i.e. close in time or space) are a priori more likely to be clustered together; however, including this behavior in the prior is not straightforward, and for this reason the recent literature is rich in models generalizing to covariate-dependent clustering. Among others, Müller et al. (2011) proposed the PPMx. We start from this model to develop our contributions in Chapter 5.
1.2.5 Marginal representation for NormCRM
When building an MCMC algorithm for posterior sampling (see, for instance, Favaro and Teh, 2013) or when generalizing the model to include some covariate information (Dahl et al., 2017; Müller et al., 2011), it may be useful to consider representation (1.15). Consider a sample θ1, . . . , θn from a NormCRM G, i.e., θ1, . . . , θn | G iid∼ G and G ∼ NormCRM(ρ, P0). The most common example of a marginal process comes from the Dirichlet process (DP). In this case, we have already mentioned in Section 1.2.2 that, if we marginalize the infinite-dimensional parameter G out, then the law of the sample θ1, . . . , θn is uniquely characterized in terms of the random partition and the distinct values. In particular, the joint law of partition and unique values is

L(A1, . . . , Ak, dθ*_1, . . . , dθ*_k) = π^{(n)}_k (n1, . . . , nk) ∏_{j=1}^k P0(dθ*_j), (1.16)

where π^{(n)}_k is the EPPF. The decomposition (1.16) sheds light on the law of a sample from a NormCRM: it factorizes into two parts, the law of the partition ρn and the law of the cluster-specific parameters θ*_1, . . . , θ*_k. The first factor, namely the EPPF, depends only on the Lévy intensity ρ, while the second (conditionally on the number of unique values k) is a product over the centering measure P0.
If we want to draw a sample from model (1.16), we first need to sample a random partition ρn of the data; then, for each of the resulting k clusters, we need a cluster-specific parameter θ*_j, j = 1, . . . , k, sampled i.i.d. from P0. In order to sample a random partition ρn with distribution given by the EPPF π^{(n)}_k, one can use a generalization of the Chinese restaurant process metaphor (Pitman, 2006). The metaphor consists in considering a Chinese restaurant with an (ideally) infinite number of tables, each with infinite capacity. Customer 1 sits at the first table. After n = 1, 2, . . . customers have entered and are seated in the restaurant, there are k occupied tables: the next customer, n + 1, will sit at a new table with probability

π^{(n+1)}_{k+1}(n1, . . . , nk, 1) / π^{(n)}_k(n1, . . . , nk),

while she/he will sit at the j-th occupied table with probability

π^{(n+1)}_k(n1, . . . , n_j + 1, . . . , nk) / π^{(n)}_k(n1, . . . , nk), j = 1, . . . , k.

The resulting sequence is exchangeable, meaning that the order in which the customers sit does not affect the probability of the final configuration.
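The generalized restaurant scheme above can be implemented directly from the EPPF ratios. The sketch below (our own code) instantiates it with the DP EPPF, for which the ratios collapse to the classical probabilities n_j/(κ + n) and κ/(κ + n), but any EPPF with the same signature could be substituted.

```python
import math, random

def dp_eppf_log(sizes, kappa):
    """log EPPF of the DP: k*log(kappa) + sum_j log((n_j - 1)!)
    + log Gamma(kappa) - log Gamma(kappa + n)."""
    n = sum(sizes)
    return (len(sizes) * math.log(kappa)
            + sum(math.lgamma(m) for m in sizes)
            + math.lgamma(kappa) - math.lgamma(kappa + n))

def crp_from_eppf(n, kappa, seed=1):
    """Seat n customers using the predictive ratios pi^{(m+1)}/pi^{(m)}
    of the EPPF; returns the vector of table sizes (n_1, ..., n_k)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        base = dp_eppf_log(sizes, kappa)
        options = [sizes + [1]]                                   # open a new table
        options += [sizes[:j] + [sizes[j] + 1] + sizes[j + 1:]    # join table j
                    for j in range(len(sizes))]
        weights = [math.exp(dp_eppf_log(o, kappa) - base) for o in options]
        sizes = rng.choices(options, weights=weights)[0]
    return sizes

tables = crp_from_eppf(50, kappa=1.0)
```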
1.3 Generalized latent trait models
In multivariate analysis, a fruitful approach to the study of high-dimensional data is the introduction of latent variables; dependence among the observations is thus induced by the relationship between the observations and the latent variables themselves. Such latent models can be used, for instance, as tools for modelling the covariance structure. Under a Gaussian assumption for the observations, this boils down to principal component analysis and factor analysis, well-known statistical tools used to identify low-dimensional structure in the data (see Lawrence, 2005; Tipping and Bishop, 1999). Recently, such models have been extended to allow for a unified estimation method for mixed (continuous, binary and categorical) variables. These extensions rely on the exponential family of distributions. They date back to the work of Moustaki and Knott (2000) and are referred to as generalized latent trait models. See also Dunson (2000, 2003) for further developments.
For the sake of clarity, in the following we describe the two main ingredients needed to build a generalized latent trait model as introduced in Moustaki and Knott (2000). This model will be generalized in Chapter 5 to account for time dependence. Let X_i = (X_{i1}, . . . , X_{iq}) be the i-th q-dimensional response variable. Assume the following conditions:

1. Each component X_{il}, l = 1, . . . , q, is independent of the others and is assumed to have a distribution from the following class, related to the well-known exponential family (see McCulloch and Neuhaus, 2001), i.e. its density has the analytical form

K(x_{il}; θ_{il}, τ_l) = κ_l(x_{il}; τ_l) exp(τ_l(⟨θ_{il}, x_{il}⟩ − A(θ_{il})))
= exp(τ_l(⟨x_{il}, θ_{il}⟩ − A(θ_{il})) + c_l(x_{il}, τ_l)), (1.17)

where

µ_{il} = E(X_{il} | θ_{il}, τ_l) = dA(θ_{il})/dθ_{il},

ν_{il} = Var(X_{il} | θ_{il}, τ_l) = (1/τ_l) d²A(θ_{il})/dθ²_{il},

and τ_l is a scalar dispersion (or nuisance) parameter. For any choice of the dispersion parameter, density (1.17) forms an exponential family with parameter θ_{il}.
2. The canonical parameter θ_{il} is related to covariates and latent variables through the generalized linear model

g_l(µ_{il}(θ_{il})) = η_{il} = ∑_{h=1}^s β_h u_{ih} + ∑_{j=1}^p λ*_j ζ*_{ij} (1.18)

where g_l is a monotonic differentiable link function, η_{il} is the linear predictor and (u_{i1}, . . . , u_{is}) is the vector of available covariates for individual i. The name generalized latent trait model is justified by the following interpretation: λ*_1, . . . , λ*_p is a collection of factor loadings and ζ*_{i1}, . . . , ζ*_{ip} is a vector of scores. The latent variables ζ*_{ij} represent latent traits of individuals, accounting for item-specific response tendencies.
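The moment identities in point 1 are easy to verify numerically. The sketch below is our own (the function name is an assumption): writing the Poisson pmf in form (1.17) with τ_l = 1 gives A(θ) = e^θ, and both the mean and the variance computed by direct summation should equal dA/dθ = d²A/dθ² = e^θ.

```python
import math

def poisson_pmf_canonical(x, theta):
    """Poisson pmf written in the canonical form (1.17) with tau_l = 1:
    K(x; theta) = exp(theta * x - A(theta) + c(x)),
    with A(theta) = e^theta and c(x) = -log(x!)."""
    return math.exp(theta * x - math.exp(theta) - math.lgamma(x + 1))

theta = 0.4
probs = [poisson_pmf_canonical(x, theta) for x in range(200)]
mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
# both should match dA/dtheta = d^2A/dtheta^2 = e^theta
```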
For instance, suppose we observe binary values: then it is natural to model the variables X_{il} with a Bernoulli distribution with expected value π_{il}. The link function may be the logit transformation, that is,

g(π_{il}) = logit(π_{il}) = log(π_{il} / (1 − π_{il})).
On the other hand, when data are counts, the X_{il} are assumed to be Poisson distributed with mean µ_{il}; in this case the link function is the logarithm, g(µ_{il}) = log(µ_{il}). Finally, for continuous observations one assumes a Gaussian distribution: in this case the link function is the identity, recovering the standard linear factor model.
Before moving to the nonparametric generalization, in this work we assume s = 0 in (1.18), i.e. we do not consider the contribution of observable covariates but only the term given by the latent variables: g_l(µ_{il}) = η_{il} = ∑_{j=1}^p λ*_j ζ*_{ij}.
1.3.1 A nonparametric prior for the factors
In the latent variable context, and in particular in generalized latent trait models, choosing the number of factors p is not an easy task and there is no clear strategy to fix this number: see for example Lopes and West (2004) for a thorough discussion. In the Bayesian nonparametric literature, however, the restriction imposed by a fixed number of latent factors p is overcome by letting p (ideally) be infinite. In this context, completely random measures play an important role in defining flexible models.

Following the approach and notation of Broderick et al. (2017), we consider variables coming from Bayesian nonparametric models as being composed of two parts: (i) a collection of traits and the corresponding frequencies or rates and (ii), for each data point, an allocation to different traits. Both parts can be expressed as random measures. Each trait is represented by a point λ in some (Polish) space Ψ of traits. Furthermore, let J_k be the frequency (or rate, in the case of a normalized measure) of the trait represented by λ_k, where k ≥ 1 indexes the countably many traits. In particular, J_k ∈ R+. Then (J_k, λ_k) is a couple consisting of the k-th trait and its frequency. We can represent the full collection of trait/frequency couples by a discrete measure on Ψ that places weight J_k at location λ_k:

G = ∑_{k≥1} J_k δ_{λ_k}.
Next, for each individual i we consider a random measure Θ_i whose distribution depends on G. The measure Θ_i consists of a sum over the traits to which the i-th individual is allocated, together with the degree to which the individual is allocated to each such trait. That is, Θ_i is a discrete measure whose support coincides with the support of G, i.e. {λ_1, λ_2, . . .}, and

Θ_i = ∑_{k≥1} ζ_{ik} δ_{λ_k}, (1.19)

where ζ_{ik} ∈ R+ represents the degree to which the data point belongs to trait λ_k.
Summing up, to characterize the latent structure of n individuals, the following model is assumed:

Θ_1, . . . , Θ_n | G iid∼ L(dΘ | G) (1.20)

G ∼ CRM(ν) (1.21)

where ν := ν(ds, dψ) = ρ(ds)P0(dψ) is the Lévy intensity of the CRM as defined in Section 1.1. However, we still need to fully describe L(dΘ | G), the law of the trait measure given G in (1.21); conditionally on G, the measure Θ_i is a CRM with only a fixed-location component. In particular, the locations of Θ_i are the same as those of G, as in (1.19): ζ_{ik} is drawn according to some distribution H that takes J_k, the weight of G at location λ_k, as a parameter, namely

ζ_{ik} ∼ H(· | J_k) independently across i = 1, . . . , n and k ≥ 1.

Note that while every atom of Θ_i is located at an atom of G, it is not necessarily the case that every atom of G has a corresponding atom in Θ_i. In particular, if ζ_{ik} takes value zero, there is no atom in Θ_i at λ_k.

As far as the likelihood is concerned, each data point can be allocated only to a finite number of traits. Thus, we assume that the number of nonzero weights in every Θ_i is finite. This is achieved by taking H(dζ | J) to be a discrete distribution with support N_0 = {0, 1, 2, . . .}, for any J. We denote by h(ζ | J) the probability mass function of ζ given J. This guarantees that each data point is associated with a subset of the possible latent variables, namely the traits, which we refer to as the latent features of that data point.

Moreover, note that, by construction, the pairs {(J_k, λ_k)}_{k≥1} form a marked Poisson point process with rate measure µ_mark(ds × dx) := ρ(ds)h(x | s), so we assume

∑_{x=1}^∞ ν_x(R+) < +∞ for ν_x(ds) := ρ(ds)h(x | s)

in order to have a finite number of latent variables generating a data point.
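To make the two-layer construction (1.20)-(1.21) concrete, the following sketch simulates a toy finite-activity instance, with ρ(ds) = γ e^{−s} ds and allocation kernel h(· | J) = Poisson(J). This finite-activity choice is an assumption made purely for illustration (the infinite-activity CRMs of the text would require a truncation of the jumps, as in Chapter 2); all names are ours, and the Poisson sampler uses CDF inversion to stay within the standard library.

```python
import math, random

def poisson_draw(rng, lam):
    """Poisson(lam) sample by CDF inversion."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_trait_allocations(n, gamma_mass=5.0, seed=0):
    """Toy finite-activity instance of (1.20)-(1.21):
    rho(ds) = gamma_mass * e^{-s} ds, so the CRM G has
    K ~ Poisson(gamma_mass) atoms with weights J_k ~ Exp(1);
    each individual's allocation counts are zeta_ik ~ Poisson(J_k),
    as in (1.19). Returns the jumps and the n x K count array."""
    rng = random.Random(seed)
    jumps = [rng.expovariate(1.0) for _ in range(poisson_draw(rng, gamma_mass))]
    zeta = [[poisson_draw(rng, J) for J in jumps] for _ in range(n)]
    return jumps, zeta

jumps, zeta = sample_trait_allocations(n=4)
```

Note how atoms of G with ζ_{ik} = 0 simply do not appear in Θ_i, mirroring the remark above.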
1.3.2 Marginal representation
In the Bayesian nonparametric literature we often come across processes that are actually versions of model (1.20)-(1.21) where G has been integrated out: they are, in fact, more interpretable and usually lead to easier MCMC algorithms when sampling from the posterior.

In order to define a generalized latent trait model, we need a joint prior for the vector of scores ζ*_i := (ζ*_{i1}, . . . , ζ*_{ip}), for i = 1, . . . , n, the vector of factor loadings λ* := (λ*_1, . . . , λ*_p) and the parameter p introduced in Section 1.3, which are finite-dimensional objects. On the other hand, in equations (1.20)-(1.21) we introduced a model for traits and frequencies involving infinite-dimensional mathematical objects. First, we recall a marginal characterization of a sample Θ_1, . . . , Θ_n from model (1.20)-(1.21), provided in Theorem 6.1 of Broderick et al. (2017). The marginal distribution of Θ_1, . . . , Θ_n is described by the following construction. For each i = 1, 2, . . . , n,

1. let {λ*_k}_{k=1}^{q_{i−1}} be the union of the atom locations in Θ_1, . . . , Θ_{i−1}, and let ζ*_{j,k} := Θ_j({λ*_k}). Let ζ*_{i,k} denote the weight of Θ_i | Θ_1, . . . , Θ_{i−1} at λ*_k. Then ζ*_{i,k} has distribution described by the probability mass function

h(ζ*_{i,k} = ζ | ζ*_{1,k}, . . . , ζ*_{i−1,k}) = [∫ ρ(dθ) h(ζ | θ) ∏_{m=1}^{i−1} h(ζ*_{m,k} | θ)] / [∫ ρ(dθ) ∏_{m=1}^{i−1} h(ζ*_{m,k} | θ)]; (1.22)

2. for each ζ = 1, 2, . . . , Θ_i has ρ_{i,ζ} new atoms whose weight is ζ, where

ρ_{i,ζ} ind∼ Poisson(∫ ρ(dθ) h(0 | θ)^{i−1} h(ζ | θ)), independently across i, ζ. (1.23)

Moreover, these atoms are located at {λ_{i,ζ,j}}_{j=1}^{ρ_{i,ζ}}, where

λ_{i,ζ,j} iid∼ P0(dλ), independently across i, ζ, j. (1.24)
Henceforth, consider a sample Θ_1, . . . , Θ_n from (1.20)-(1.21): our aim is to show that the sample can be summarized by an n-dimensional array of scores ζ*_1, . . . , ζ*_n and a vector of traits λ*; then, we see how points 1. and 2. above also characterize the marginal law of these objects.

First, we observe that from Theorem 5.1 of Broderick et al. (2017) the marginal law of Θ_1 can be expressed as follows: for each ζ ∈ N, there are ρ_{1,ζ} atoms of Θ_1 with weight ζ, where

ρ_{1,ζ} ind∼ Poisson(∫ ρ(dθ) h(ζ | θ)), independently across ζ.

These atoms have locations {λ_{1,ζ,j}}_{j=1}^{ρ_{1,ζ}}, where λ_{1,ζ,j} iid∼ P0 across ζ, j. Thanks to the independence among the variables in the construction above, we can first let p_1 := ∑_{ζ=1}^∞ ρ_{1,ζ}, and let λ*_1 = {λ*_k}_{k=1}^{p_1} be the disjoint (by assumption) union of the {λ_{1,ζ,j}}_{j=1}^{ρ_{1,ζ}}. Note that p_1 is finite by Assumption A2. Finally, let ζ*_1 = {ζ*_{1,k} := Θ_1({λ*_k})}_{k=1}^{p_1}. It is clear that Θ_1 may be represented by the pair (ζ*_1, λ*_1). Moreover, we can fix the sampling order and assume that λ*_1 =: (λ*_1, . . . , λ*_{p_1}) and ζ*_1 := (ζ*_{1,1}, . . . , ζ*_{1,p_1}). We proceed by induction: for n = 2, 3, . . . , consider a sample (ζ*_1, λ*_1), . . . , (ζ*_{n−1}, λ*_{n−1}), generated as described above. Thanks to points 1. and 2. above, we can characterize the law of (ζ*_n, λ*_n) | (ζ*_1, λ*_1), . . . , (ζ*_{n−1}, λ*_{n−1}). The two vectors ζ*_n and λ*_n have length p_n = p_{n−1} + p*_n, where p_{n−1} is the length of ζ*_{n−1} and p*_n = ∑_{ζ=1}^∞ ρ_{n,ζ} (see (1.23)). The first p_{n−1} entries of ζ*_n are distributed as described in equation (1.22), while the first p_{n−1} entries of λ*_n are equal to λ*_{n−1} (thinning). In addition, the last p*_n entries of the vectors ζ*_n and λ*_n are filled according to equations (1.23) and (1.24), following the sampling order (we call this second part innovation). For ease of notation, we
will assume that all the vectors ζ*_i have the same length p = p_n, with the proviso that ζ*_{i,j} = 0 for j > p_i; moreover, we will let λ* = λ*_n. A metaphor that generalizes the well-known Indian Buffet process can be formulated to describe the marginal law of the ζ*'s; accordingly, one can employ the notation ζ*_1, . . . , ζ*_n ∼ GIB(ν), where GIB stands for generalized Indian Buffet. Consider an Indian buffet, namely a buffet with infinitely many dishes; differently from the usual construction, assume now that customers can take as many portions of each dish as they want. The first customer enters the restaurant and takes 1 portion of ρ_{1,1} dishes, 2 portions of ρ_{1,2} dishes, . . . , ζ portions of ρ_{1,ζ} dishes, and so on. At the end, the first customer will have chosen p_1 dishes, and the vector ζ*_1 = (ζ*_{1,1}, . . . , ζ*_{1,p_1}) reports how many portions of each dish, labelled as λ* = (λ*_1, . . . , λ*_{p_1}), she/he has chosen. Recursively, the n-th customer chooses dishes and numbers of portions in two steps: first, for each dish k = 1, . . . , p_{n−1} already chosen by the previous customers, she/he takes ζ*_{n,k} portions (possibly 0); then she/he takes 1 portion of ρ_{n,1} new dishes, 2 portions of ρ_{n,2} new dishes, . . . , ζ portions of ρ_{n,ζ} new dishes, and so on.
This marginal representation sheds light on a different characteristic of the nonparametric prior on the latent factors that we are considering: it induces a feature model, in the same way the NormCRM induces a model for clustering, i.e. a prior for the random partition ρn. More formally, a feature allocation model f_n of [n] := {1, . . . , n} is a multiset of non-empty subsets of [n], called features, such that no index i belongs to infinitely many features. We write f_n = {A1, . . . , Ap}, where p is the number of features. A partition is a special case of a feature allocation in which the features are restricted to be mutually exclusive and exhaustive. The features of a partition are often referred to as clusters. We note that a partition is always a feature allocation, but the converse does not hold in general. Consider now the random feature allocation of the data indices [n] = {1, . . . , n}, f_n := {A1, . . . , Ap}, given by: i ∈ A_j if and only if ζ*_{i,j} > 0. The law of ζ*_1, . . . , ζ*_n thus characterizes the law of f_n.
1.3.3 Nonparametric generalized latent trait model
We are now ready to write down the general form of a generalized latent trait model as follows:

X_1, . . . , X_n | Θ_1, . . . , Θ_n, τ² ∼ ∏_{i=1}^n ∏_{l=1}^q K(X_{il}; θ_{il}, τ²_l), (1.25)

where g_l(µ_{il}(θ_{il})) = η_{il} = ∫ λ Θ_i(dλ) = ∑_{j≥1} λ_j ζ_{ij},

Θ_1, . . . , Θ_n | G iid∼ L(Θ | G) (1.26)

G ∼ CRM(ν) (1.27)

τ²_1, . . . , τ²_q iid∼ p(τ²),

where K(·; θ, τ²) is a kernel density belonging to some parametric family. We point out that this model includes as special cases the popular infinite latent feature model of Ghahramani and Griffiths (2006) and the latent Poisson factor analysis of Zhou et al. (2012). However, to the best of our knowledge, our formulation is a general representation that has never appeared in the literature.

We can write down a marginal version of the model above by integrating out the infinite-dimensional parameter G and introducing the representation ζ*_1, . . . , ζ*_n and λ* as follows:

X_1, . . . , X_n | ζ*_1, . . . , ζ*_n, λ*, τ² ∼ ∏_{i=1}^n ∏_{l=1}^q K(X_{il}; θ_{il}, τ²_l), (1.28)

g_l(µ_{il}(θ_{il})) = η_{il} = ∑_{j=1}^p λ*_j ζ*_{ij},

ζ*_1, . . . , ζ*_n ∼ GIB(ν) (1.29)

λ*_1, λ*_2, . . . iid∼ P0 (1.30)

τ²_1, . . . , τ²_q iid∼ p(·).
A standard application of model (1.28) is the problem of learning recurrent features in a collection of images: this task is of interest, for instance, when analyzing a video (a sequence of images) and looking for objects that appear frequently. The best-known model is the linear-Gaussian latent feature model, in which the features are binary, as in Griffiths and Ghahramani (2011) and Ghahramani and Griffiths (2006). Common factors are, in this case, images containing specific features that recur over the observations and are usually modeled as i.i.d. Gaussian: λ_j ∼ N_q(0, σ²_Z I), where q is the total number of pixels in the observations and I is the q × q identity matrix. The conditional distribution of an image X_i, namely the kernel K, is

X_i | A, ζ_i, τ² ∼ N_q(A ζ_i, τ² I),

where A is the matrix whose columns are the traits λ_j and τ² is the variance of each component.
The GIB in this case is the Indian Buffet Process itself, described as follows. The first customer (the first observation) starts at the left of the buffet and samples Poisson(α) dishes. The i-th customer moves from left to right, sampling dish k with probability m_k / i, where m_k is the number of customers who have previously sampled dish k. Having reached the end of the previously sampled dishes, she/he tries a number of new dishes that is Poisson(α/i) distributed. If we apply the same ordering scheme to the binary matrix whose entries ζ_{ik} tell us whether or not (0/1) the hidden feature k contributes to the i-th item generated by this process, we recover an exchangeable distribution. It is clear that the number of active features K+ is distributed as ∑_{i=1}^n Poisson(α/i). Figure 1.3 shows a realization of this matrix, where the rows represent the customers (namely the n observations) and the columns are the dishes. Note that the number of dishes tasted grows with the number of customers.
Figure 1.3: Matrix representation of a realization from an Indian Buffet process (image taken from Griffiths and Ghahramani, 2011).
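The generative scheme just described can be sketched in a few lines. The sketch below is illustrative: the inversion-based Poisson sampler and the parameter values are our choices, not part of the original model.

```python
import math
import random

def poisson(lam, rng):
    """Poisson(lam) draw by cdf inversion; adequate for the moderate rates here."""
    u, p, k = rng.random(), math.exp(-lam), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_ibp(n_customers, alpha, rng):
    """One draw from the Indian Buffet Process: row i is the set of dishes
    (features) sampled by customer i."""
    dish_counts = []          # m_k: number of customers that tried dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        tasted = set()
        # previously sampled dishes are re-sampled with probability m_k / i
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / i:
                tasted.add(k)
        # a Poisson(alpha / i) number of brand-new dishes
        for k in range(len(dish_counts), len(dish_counts) + poisson(alpha / i, rng)):
            tasted.add(k)
            dish_counts.append(0)
        for k in tasted:
            dish_counts[k] += 1
        rows.append(tasted)
    return rows
```

With this parametrization the number of active features is $\sum_{i=1}^n \mathrm{Poisson}(\alpha/i)$, so its expectation is $\alpha\sum_{i=1}^n 1/i \approx \alpha\log n$, matching the text above.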
Chapter 2
Posterior sampling from
ε-approximation of normalized
completely random measure mixtures
This chapter is based on Argiento et al. (2016b).
In this chapter we deal with the mixture models introduced in Section 1.2; in particular,
we consider the case where the mixing distribution belongs to the class of normalized
homogeneous completely random measures. However, the issue related to the infinite
dimensionality of the parameter has only been mentioned so far. Here, we address this
computational issue by proposing a truncation method for the mixing distribution. The
idea is to discard the weights of the unnormalized measure that are smaller than a threshold. We
provide some theoretical properties of the approximation, such as convergence and a posterior
characterization. A relatively simple blocked Gibbs sampler is devised in order to sample
from the posterior of the model. In particular, we are able to sample from the posterior of
the truncated mixing measure.
The performance of the proposed approximation is illustrated in two different ap-
plications. In the first, a new random measure, called the normalized Bessel random measure,
is introduced; goodness-of-fit indexes show its good performance as a mixing measure for
density estimation. The second example describes how to incorporate covariates in the
support of the normalized measure, leading to a linear dependent model for regression and
clustering.
In order to keep the chapter self-contained, Section 2.2 recalls the notation used for
homogeneous normalized completely random measures and their approximation.
2.1 Introduction
One of the liveliest topics in Bayesian nonparametrics concerns mixtures of para-
metric densities where the mixing measure is an almost surely discrete random
probability measure. The basic model is the Dirichlet process mixture model, which
first appeared in Lo (1984), where the mixing measure is indeed the Dirichlet process.
Dating back to Ishwaran and James (2001a) and Lijoi et al. (2005), many alterna-
tive mixing measures have been proposed; the former paper replaced the Dirichlet
process with stick-breaking random probability measures, while the latter focused
on normalized completely random measures. These hierarchical mixtures play a
pivotal role in modern Bayesian nonparametrics, and their popularity is mainly due
to their high flexibility in density estimation problems as well as in clustering, which
is naturally embedded in the model.
In some statistical applications, the clustering induced by the Dirichlet process
as mixing measure may be restrictive. In fact, it is well-known that the latter allo-
cates observations to clusters with probabilities depending only on the cluster sizes,
leading to the "rich gets richer" behavior. Within some classes of more general
processes, such as stick-breaking and normalized processes, the probability
of allocating an observation to a specific cluster depends also on extra parameters,
as well as on the number of groups and on the cluster sizes. We refer to Argiento
et al. (2015) for a recent review of the state of the art on Bayesian nonparametric
mixture models and clustering.
Since posterior inference for Bayesian nonparametric mixtures involves an infinite-
dimensional parameter, computational issues may arise. However, there is
a recent prolific literature focusing mainly on two different classes of MCMC algo-
rithms, namely marginal and conditional Gibbs samplers. The former integrates
out the infinite-dimensional parameter (i.e. the random probability measure), resorting to
generalized Pólya urn schemes; see Favaro and Teh (2013) or Lomelí et al. (2017).
The latter includes the nonparametric mixing measure in the state space of the
Gibbs sampler, updating it as a component of the algorithm; this class includes the
slice sampler (see Griffin and Walker, 2011). Among conditional algorithms there
are truncation methods, where the infinite-dimensional parameter (i.e. the mixing measure) is
approximated by truncating the infinite sums defining the process, either a poste-
riori (Argiento et al., 2010; Barrios et al., 2013) or a priori (Argiento et al., 2016a;
Griffin, 2013).
In this work we introduce an almost surely finite-dimensional class of random
probability measures that approximates the wide family of homogeneous normal-
ized completely random measures introduced in Section 1.1; we use this class as
the building block in mixture models and provide a simple but general truncation
algorithm to perform posterior inference. Our approximation is based on the con-
structive definition of the weights of the completely random measure as the points of
a Poisson process on $\mathbb{R}^+$. In particular, we consider only the points larger than a thresh-
old $\varepsilon$, which controls the degree of approximation. Conditionally on $\varepsilon$, our process is
finite dimensional both a priori and a posteriori.
Here we illustrate two applications. In the first one, a new choice for the Lévy
intensity $\rho$, characterizing the normalized completely random measure, is proposed:
the Bessel intensity function, which, to the best of our knowledge, has never been applied
in a statistical framework, but is known in finance (see Barndorff-Nielsen, 2000, for
instance). We call this new process the normalized Bessel random measure. In the
second application, we set $\rho$ to be the well-known generalized gamma intensity and
consider a centering measure $P_{0x}$ depending on a set of covariates $x$, yielding a
linear dependent normalized completely random measure.
In this chapter, since the main objective is the approximation of the nonpara-
metric process arising from the normalization of completely random measures, we
fix $\varepsilon$ to a small value. However, it is worth mentioning that it is possible to choose
a prior for $\varepsilon$, although the computational cost might greatly increase for some intensities
$\rho$.
The new achievements of this chapter can be summarized as follows: (i) a gener-
alization of the $\varepsilon$-approximation given in Argiento et al. (2016a) for the NGG process
to the whole family of normalized homogeneous completely random measures, (ii) a
different technique providing the posterior distribution (and the exchangeable par-
tition probability function) of this new random probability measure, making use of
Palm's formula, and (iii) the introduction of the normalized Bessel random measure
as a mixing measure in Bayesian nonparametric mixtures.
In particular, after the introduction of the finite-dimensional $\varepsilon$-approximation
of a normalized completely random measure, we derive its posterior and show that
the $\varepsilon$-approximation converges to its infinite-dimensional counterpart (Section 2.3).
Then we provide a Gibbs sampler for the $\varepsilon$-approximation hierarchical mixture
model (Section 2.4). Section 2.4.1 illustrates some criteria to choose the approx-
imation parameter $\varepsilon$. Section 2.5.1 is devoted to the introduction of the normalized
Bessel random measure and some of its properties, while Section 2.5.2
discusses an application of $\varepsilon$-Bessel mixture models to both simulated and real
data. Section 2.6 defines the linear dependent $\varepsilon$-NGG processes, and considers linear de-
pendent $\varepsilon$-NGG mixtures to fit the AIS data set. To complete the set-up of the
chapter, Section 2.2 is devoted to a summary of basic notions about homogeneous
normalized completely random measures, and Section 2.7 contains a concluding dis-
cussion.
2.2 Preliminaries on normalized completely random measures
In this section we recall the notation and the main definitions that are useful
in the rest of the chapter. See also Section 1.1 of the introductory chapter. Let
$\Theta \subset \mathbb{R}^m$ for some positive integer $m$. Let $\mu$ be a homogeneous completely random
measure on $\Theta$ with Lévy intensity $\nu(ds, d\tau) = \kappa\rho(ds)P_0(d\tau)$, where $\rho(s)$ is the
density of a non-negative measure on $\mathbb{R}^+$, and $\kappa P_0$ is a finite measure on $\Theta$ with
total mass $\kappa > 0$. Assume that $\rho$ satisfies the regularity conditions
$$\int_0^{+\infty} \min\{1, s\}\,\rho(s)\,ds < +\infty, \qquad (2.1)$$
and
$$\int_0^{+\infty} \rho(s)\,ds = +\infty. \qquad (2.2)$$
This implies that the homogeneous completely random measure can be represented
as $\mu(\cdot) = \sum_{j\geq 1} J_j\delta_{\tau_j}(\cdot)$. Since $\mu$ is homogeneous, the support points $\tau_j$ and the
jumps $J_j$ of $\mu$ are independent; the $\tau_j$'s are independent identically distributed
(iid) random variables from $P_0$, while the $J_j$'s are the points of a Poisson process on
$\mathbb{R}^+$ with mean intensity $\rho$. Moreover, if $T := \mu(\Theta) = \sum_{j\geq 1} J_j$, then by (2.1) and (2.2),
$\mathbb{P}(0 < T < +\infty) = 1$.
Therefore, the corresponding normalized completely random measure $P$ can be
defined through normalization of $\mu$:
$$P := \frac{\mu}{\mu(\Theta)} = \sum_{j=1}^{+\infty} \frac{J_j}{T}\,\delta_{\tau_j} = \sum_{j=1}^{+\infty} P_j\delta_{\tau_j}. \qquad (2.3)$$
We refer to P in (2.3) as a (homogeneous) normalized completely random measure
with parameter (ρ, κP0). As an alternative notation, following James et al. (2009),
P is referred to as a homogeneous normalized measure with independent increments.
An alternative construction of normalized completely random measures can be given
in terms of Poisson-Kingman models as in Pitman (2003).
2.3 ε-approximation of normalized completely random
measures
The goal of this section is the definition of a finite-dimensional random proba-
bility measure that approximates a general normalized completely random
measure with Lévy intensity $\nu(ds, d\tau) = \rho(ds)\kappa P_0(d\tau)$, introduced above.
First of all, by the Restriction Theorem for Poisson processes, for any $\varepsilon > 0$, the
jumps $J_j$ of $\mu$ larger than a threshold $\varepsilon$ are still a Poisson process, with mean
intensity $\gamma_\varepsilon(s) := \kappa\rho(s)\mathbb{1}_{(\varepsilon,+\infty)}(s)$. Moreover, the total number of these points
is Poisson distributed, i.e. $N_\varepsilon \sim \mathrm{Poisson}(\Lambda_\varepsilon)$, where $\Lambda_\varepsilon := \kappa\int_\varepsilon^{+\infty}\rho(s)\,ds$. Since
$\Lambda_\varepsilon < +\infty$ for any $\varepsilon > 0$ by (2.1), $N_\varepsilon$ is almost surely finite. In addition, conditionally
on $N_\varepsilon$, the points $J_1, \dots, J_{N_\varepsilon}$ are iid from the density
$$\rho_\varepsilon(s) = \frac{\gamma_\varepsilon(s)}{\Lambda_\varepsilon} = \frac{\kappa\rho(s)}{\Lambda_\varepsilon}\,\mathbb{1}_{(\varepsilon,+\infty)}(s), \qquad (2.4)$$
thanks to the relationship between Poisson and Bernoulli processes; see, for instance,
Kingman (1993), Section 2.4.
We denote by $\mu_\varepsilon$ the CRM with Lévy intensity
$$\nu_\varepsilon(ds, d\tau) := \rho(s)\mathbb{1}_{(\varepsilon,+\infty)}(s)\,ds\,\kappa P_0(d\tau). \qquad (2.5)$$
This implies that $\mu_\varepsilon = \sum_{j=1}^{N_\varepsilon} J_j\delta_{\tau_j}$. However, it is not worth trying to normalize $\mu_\varepsilon$,
since $\mu_\varepsilon(B) = 0$ for any $B$ if $N_\varepsilon = 0$. We consider, instead, the CRM $\bar{\mu}_\varepsilon$ so defined:
$$\bar{\mu}_\varepsilon(\cdot) \stackrel{d}{=} J_0\delta_{\tau_0}(\cdot) + \mu_\varepsilon(\cdot) \qquad (2.6)$$
where $(J_0, \tau_0)$ is independent of $\{(J_j, \tau_j),\ j \geq 1\}$, and $J_0$ and $\tau_0$ are independent with
density $\rho_\varepsilon$ and distribution $P_0$, respectively. Thus
$$\bar{\mu}_\varepsilon(\cdot) = J_0\delta_{\tau_0}(\cdot) + \sum_{j=1}^{N_\varepsilon} J_j\delta_{\tau_j}(\cdot) = \sum_{j=0}^{N_\varepsilon} J_j\delta_{\tau_j}(\cdot).$$
Summing up, we define:
$$P_\varepsilon(\cdot) = \sum_{j=0}^{N_\varepsilon} P_j\delta_{\tau_j}(\cdot) = \sum_{j=0}^{N_\varepsilon} \frac{J_j}{T_\varepsilon}\,\delta_{\tau_j}(\cdot), \qquad (2.7)$$
where $T_\varepsilon = \sum_{j=0}^{N_\varepsilon} J_j$, $\tau_j \stackrel{iid}{\sim} P_0$, and the $\tau_j$'s and $J_j$'s are independent. We denote $P_\varepsilon$ in
(2.7) by $\varepsilon$-NormCRM and write $P_\varepsilon \sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0)$. When $\rho_\varepsilon(s) = s^{-\sigma-1}\mathrm{e}^{-\omega s}/(\omega^{\sigma}\Gamma(-\sigma, \omega\varepsilon))$, $s > \varepsilon$, $P_\varepsilon$ is the $\varepsilon$-NGG process introduced in Argiento
et al. (2016a), with parameter $(\sigma, \kappa, P_0)$, $0 \leq \sigma \leq 1$, $\kappa \geq 0$.
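A realization of $P_\varepsilon$ can be simulated directly from this definition. The sketch below does so for the generalized gamma case with $\omega = 1$, so that $\rho(s) = s^{-\sigma-1}\mathrm{e}^{-s}/\Gamma(1-\sigma)$; the standard Gaussian base measure, the quadrature grid, the upper cutoff on the jump sizes and the rejection proposal are all illustrative choices of ours, not part of the model.

```python
import math
import random

def sample_eps_ngg(sigma, kappa, eps, rng, s_max=50.0):
    """One realization of P_eps for the generalized gamma intensity
    rho(s) = s^(-1-sigma) e^(-s) / Gamma(1-sigma) (omega = 1), with P0 = N(0,1).
    Returns (normalized weights, atoms); index 0 corresponds to (J0, tau0).
    Jumps above s_max are neglected (their Poisson mass is ~exp(-s_max))."""
    def rho(s):
        return s ** (-1.0 - sigma) * math.exp(-s) / math.gamma(1.0 - sigma)

    # Lambda_eps = kappa * int_eps^infty rho(s) ds, by trapezoids on a log grid
    grid = [eps * (s_max / eps) ** (i / 4000.0) for i in range(4001)]
    lam = kappa * sum(0.5 * (rho(a) + rho(b)) * (b - a)
                      for a, b in zip(grid, grid[1:]))

    def poisson(mean):  # cdf inversion, fine for moderate means
        u, p, k = rng.random(), math.exp(-mean), 0
        cdf = p
        while u > cdf:
            k += 1
            p *= mean / k
            cdf += p
        return k

    def draw_jump():
        # rejection sampling from rho_eps: propose from the density proportional
        # to s^(-1-sigma) on (eps, s_max) (closed-form inverse cdf), then accept
        # with probability exp(-s) <= 1
        while True:
            u = rng.random()
            s = (eps ** -sigma - u * (eps ** -sigma - s_max ** -sigma)) ** (-1.0 / sigma)
            if rng.random() < math.exp(-s):
                return s

    n_eps = poisson(lam)                              # N_eps ~ Poisson(Lambda_eps)
    jumps = [draw_jump() for _ in range(n_eps + 1)]   # the extra jump is J0
    total = sum(jumps)                                # T_eps
    atoms = [rng.gauss(0.0, 1.0) for _ in jumps]      # tau_j iid from P0
    return [j / total for j in jumps], atoms
```

The restriction to jumps larger than $\varepsilon$ is what makes the draw finite: only $N_\varepsilon + 1$ weights are ever generated.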
Increasing Lévy processes are completely random measures for $\Theta = \mathbb{R}$ (or $\mathbb{R}^+$).
Therefore, it is worth mentioning some literature on the $\varepsilon$-approximation of such pro-
cesses in the financial context. In particular, the book by Asmussen and Glynn
(Asmussen and Glynn, 2007, Chapter XII) provides a justification for the approx-
imation of infinite activity Lévy processes by compound Poisson processes: any
Lévy jump process $J$ on $\mathbb{R}$ can be represented as the sum of two independent Lévy
processes
$$J(s) = J_1(s) + J_2(s), \quad s \in \mathbb{R},$$
where the Lévy measures of $J_1$ and $J_2$ are the restrictions of the whole Lévy measure
to $(-\varepsilon, \varepsilon)$ and $(-\infty, -\varepsilon]\cup[\varepsilon, +\infty)$, respectively. When considering the homogeneous
completely random measure $\mu$ under (2.1) and (2.2), as here, this theory yields that $\mu$
is the sum of two independent homogeneous completely random measures $\mu_{(0,\varepsilon]}$ and
$\mu_\varepsilon$, corresponding to mean intensities $\rho(s)\mathbb{1}_{(0,\varepsilon]}(s)$ and $\rho_\varepsilon$ as in (2.4), respectively.
Note that $\mu_\varepsilon$ is the CRM on the right-hand side of (2.6). The basic idea of the
$\varepsilon$-approximation is that, if $\varepsilon$ is small enough, $\mu_{(0,\varepsilon]}$ can be neglected and $\mu$ can be
approximated by $\mu_\varepsilon$; see (Asmussen and Glynn, 2007, Chapter XII) and Trippa and
Favaro (2012).
The approach to the $\varepsilon$-approximation taken here is similar, though not identical,
since we first add the random mass $J_0$ at the random point $\tau_0$ to $\mu_\varepsilon$, defining the
CRM $\bar{\mu}_\varepsilon$ as in (2.6). The random probability measure $P_\varepsilon$ in (2.7) is then defined by
normalization of $\bar{\mu}_\varepsilon$. We will show in Proposition 2.3 that $P_\varepsilon$ converges in distribu-
tion to $P$ as $\varepsilon$ goes to 0, but the basic idea of the approximation is that the point
mass we add to $\mu_\varepsilon$ is negligible; see Section 2.4.1.
Several other methods have been proposed in order to approximate a normalized
measure. First of all, we mention the inverse Lévy measure method, referred to in this context as
the Ferguson-Klass representation (Ferguson and Klass, 1972), which represents
the Poisson process of the jumps of a subordinator as a series of transformed (via
the survival function of the Lévy intensity) points of a unit rate Poisson process. Of
course, to get implementable simulation algorithms, the series expansion has to be
truncated, either at a fixed and large integer $N$, or whenever the new jump to be added to
the series is smaller than a threshold $\varepsilon$. In the latter case, the truncation rule would
yield only jumps of size greater than $\varepsilon$, giving an algorithm that is similar to
the one proposed here (Asmussen and Glynn, 2007, Chapter XII). On the other hand,
Arbel and Prünster (2017) propose a truncation rule of the series representation at
a fixed integer $N$, quantifying the error through a moment-matching criterion, i.e.
evaluating a measure of discrepancy between the actual moments of the whole series
and the moments of the truncated sum based on the simulation output. Further series
representations of the jump process can be considered, with corresponding trunca-
tion rules; see Bondesson (1982) and Rosiński (2001). Alternatively, Trippa and
Favaro (2012) proposed a novel class of r.p.m.'s that is dense in the class of homo-
geneous normalized completely random measures. These authors first approximate
any CRM $\mu$ with $\mu_\varepsilon$ which, as we have already mentioned, has finite Lévy measure.
Then, resorting to the "denseness" of the novel class, they approximate $\mu_\varepsilon$ with
an element of this class, with Lévy intensity given by the weighted sum of a finite
number of intensities of finite activity processes, plus the intensity of the gamma
process.
Let $\theta = (\theta_1, \dots, \theta_n)$ be a sample from $P_\varepsilon$, an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$ as defined
in (2.7), and let $\theta^* = (\theta^*_1, \dots, \theta^*_k)$ be the (observed) distinct values in $\theta$. We call
allocated jumps of the process those values $P_{l^*_1}, P_{l^*_2}, \dots, P_{l^*_k}$ in (2.7) such that there
exists a corresponding location for which $\tau_{l^*_i} = \theta^*_i$, $i = 1, \dots, k$. The remaining
values are non-allocated jumps. We use the superscript $(na)$ for random variables
related to non-allocated jumps. The first result is a characterization of the posterior
law of the random measure $\bar{\mu}_\varepsilon$, not yet normalized; however, we first need to introduce
two more ingredients. We consider an auxiliary random variable $U$ such that
$U \mid \bar{\mu}_\varepsilon \sim \mathrm{Gamma}(n, T_\varepsilon)$, so that the marginal density of $U$ is
$$f_U(u; n) = \frac{u^{n-1}}{\Gamma(n)}\,\mathrm{E}\!\left(T_\varepsilon^n \mathrm{e}^{-T_\varepsilon u}\right) = \frac{u^{n-1}}{\Gamma(n)}(-1)^n\frac{d^n}{du^n}\mathrm{E}\!\left(\mathrm{e}^{-uT_\varepsilon}\right)
= \frac{u^{n-1}}{\Gamma(n)}(-1)^n\frac{d^n}{du^n}\frac{\Lambda_{\varepsilon,u}\,\mathrm{e}^{\Lambda_{\varepsilon,u}}}{\Lambda_\varepsilon\,\mathrm{e}^{\Lambda_\varepsilon}}, \qquad (2.8)$$
where the last equality follows easily from the definition of $T_\varepsilon$ and the Lévy-Khintchine
representation, using the notation defined in (2.11). We also formulate the following
lemma, whose proof is straightforward.
Lemma 2.3.1
Let $\mu_\varepsilon$ be a finite CRM with Lévy intensity $\nu_\varepsilon$ as in (2.5), and let $\bar{\mu}_\varepsilon$ be defined as
in (2.6). Consider a CRM $\mu^\star$ such that
$$\mu^\star(\cdot) \stackrel{d}{=} X\bar{\mu}_\varepsilon(\cdot) + (1 - X)\mu_\varepsilon(\cdot), \qquad (2.9)$$
where $X \sim \mathrm{Bernoulli}(p)$, $p = a/(a+b)$, $a, b > 0$, and $X$ is independent of $\mu_\varepsilon$ and
$(J_0, \tau_0)$. The Laplace functional of $\mu^\star$ is:
$$\Psi[f] = \frac{aA[f] + b}{a + b}\,\exp\left\{-\int_{\mathbb{R}^+\times\Theta}\left(1 - \mathrm{e}^{-f(\tau)s}\right)\nu_\varepsilon(ds, d\tau)\right\}, \qquad (2.10)$$
for any positive $f$, where
$$A[f] := \mathrm{E}\left(\mathrm{e}^{-f(\tau_0)J_0}\right) = \int_{\mathbb{R}^+\times\Theta}\mathrm{e}^{-f(\tau)s}\rho_\varepsilon(s)\,ds\,P_0(d\tau) = \frac{1}{\Lambda_\varepsilon}\int_{\mathbb{R}^+\times\Theta}\mathrm{e}^{-sf(\tau)}\nu_\varepsilon(ds, d\tau)$$
is the Laplace functional of the random measure $J_0\delta_{\tau_0}$.
The posterior distribution of $\bar{\mu}_\varepsilon$ has the following characterization.
Theorem 2.3.1
If $P_\varepsilon$ is an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$, then the conditional distribution of $P_\varepsilon$, given $\theta^*$
and $U = u$, is obtained by normalization of the following random measure:
$$\mu^*_\varepsilon(\cdot) \stackrel{d}{=} \mu^{(na)}_{\varepsilon,u}(\cdot) + \mu^{(a)}_{\varepsilon,u}(\cdot) = \mu^{(na)}_{\varepsilon,u}(\cdot) + \sum_{j=1}^{k} J^{(a)}_j\delta_{\theta^*_j}(\cdot)$$
where
1. the process of non-allocated jumps $\mu^{(na)}_{\varepsilon,u}(\cdot)$ is distributed as the
CRM $\mu^\star$ defined in (2.9), with the Lévy intensity in (2.10) given by
$\mathrm{e}^{-us}\nu_\varepsilon(ds, d\tau)$ and probability of success $p = \Lambda_{\varepsilon,u}/(\Lambda_{\varepsilon,u} + k)$, where
$$\Lambda_{\varepsilon,u} := \kappa\int_\varepsilon^{+\infty}\mathrm{e}^{-us}\rho(s)\,ds, \quad u \geq 0; \qquad (2.11)$$
2. the process of allocated jumps $\mu^{(a)}_{\varepsilon,u}(\cdot)$ has fixed points of discontinuity $\theta^* = (\theta^*_1, \dots, \theta^*_k)$, with weights $J^{(a)}_j$ independently distributed with density proportional to
$s^{n_j}\mathrm{e}^{-us}\rho(s)\mathbb{1}_{(\varepsilon,+\infty)}(s)\,ds$, $j = 1, \dots, k$;
3. $\mu^{(na)}_{\varepsilon,u}(\cdot)$ and $\mu^{(a)}_{\varepsilon,u}(\cdot)$ are independent, conditionally on $l^* = (l^*_1, \dots, l^*_k)$, the
vector of locations of the allocated jumps;
4. the posterior law of $U$ given $\theta^*$ has density on the positive real numbers given
by
$$f_{U|\theta^*}(u \mid \theta^*) \propto u^{n-1}\mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon}\,\frac{\Lambda_{\varepsilon,u} + k}{\Lambda_\varepsilon}\prod_{i=1}^{k}\int_\varepsilon^{+\infty}\kappa s^{n_i}\mathrm{e}^{-us}\rho(s)\,ds, \quad u > 0.$$
The proof of the above theorem, as well as those of all the other results in this section, is in
Appendix 2.B. An immediate consequence of Theorem 2.3.1 is the next corollary.
Corollary 2.3.1
The conditional distribution of $P_\varepsilon$, given $\theta^*$ and $U = u$, satisfies the distributional
equation
$$P^*_\varepsilon(\cdot) \stackrel{d}{=} wP^{(na)}_{\varepsilon,u}(\cdot) + (1 - w)\sum_{j=1}^{k} P^{(a)}_j\delta_{\theta^*_j}(\cdot)$$
where $P^{(na)}_{\varepsilon,u}(\cdot)$ is the null measure if $\mu^{(na)}_{\varepsilon,u}(\Theta) = 0$, $w = \mu^{(na)}_{\varepsilon,u}(\Theta)\big/\big(\mu^{(na)}_{\varepsilon,u}(\Theta) + \sum_{j=1}^{k} J^{(a)}_j\big)$, and the jumps $P^{(a)}_1, \dots, P^{(a)}_k$ associated with the fixed points of discon-
tinuity $\theta^*_1, \dots, \theta^*_k$ are defined as $P^{(a)}_j = J^{(a)}_j\big/\sum_{l=1}^{k} J^{(a)}_l$, $j = 1, \dots, k$.
Theorem 2.3.1 and Corollary 2.3.1 provide the finite-dimensional counterpart
of Proposition 1 in James et al. (2009).
Both the infinite- and finite-dimensional processes defined in (2.3) and (2.7),
respectively, belong to the wide class of species sampling models, deeply investigated
in Pitman (1996), and we use some of the results there to derive ours. Let $(\theta_1, \dots, \theta_n)$
be a sample from (2.3) or (2.7) (or, more generally, from a species sampling model);
since it is a sample from a discrete probability measure, it induces a random partition $p_n := \{C_1, \dots, C_k\}$ of the set $\mathbb{N}_n := \{1, \dots, n\}$, where $C_j = \{i : \theta_i = \theta^*_j\}$ for $j = 1, \dots, k$. If
$\#C_i = n_i$ for $1 \leq i \leq k$, the marginal law of $(\theta_1, \dots, \theta_n)$ has the unique characterization:
$$\mathcal{L}(p_n, \theta^*_1, \dots, \theta^*_k) = p(n_1, \dots, n_k)\prod_{j=1}^{k}\mathcal{L}(\theta^*_j),$$
where $p$ is the EPPF associated with the random probability measure. The EPPF $p$ is a
probability law on the set of the partitions of $\mathbb{N}_n$. The following proposition provides
an expression for the EPPF of a general $\varepsilon$-NormCRM.
Proposition 2.1
Let $(n_1, \dots, n_k)$ be a vector of positive integers such that $\sum_{i=1}^{k} n_i = n$. Then, the
EPPF associated with $P_\varepsilon \sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0)$ is
$$p_\varepsilon(n_1, \dots, n_k) = \int_0^{+\infty}\left[\frac{u^{n-1}}{\Gamma(n)}\,\frac{(k + \Lambda_{\varepsilon,u})}{\Lambda_\varepsilon}\,\mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon}\prod_{i=1}^{k}\int_\varepsilon^{+\infty}\kappa s^{n_i}\mathrm{e}^{-us}\rho(s)\,ds\right]du \qquad (2.12)$$
where $\Lambda_{\varepsilon,u}$ has been defined in (2.11).
A result concerning the EPPF of a generic normalized (homogeneous) completely
random measure can be obtained from Pitman (2003), formulas (36)-(37):
$$p(n_1, \dots, n_k) = \int_0^{+\infty}\frac{u^{n-1}}{\Gamma(n)}\,\mathrm{e}^{\kappa\int_0^{+\infty}(\mathrm{e}^{-us} - 1)\rho(s)\,ds}\left(\prod_{i=1}^{k}\int_0^{+\infty}\kappa s^{n_i}\mathrm{e}^{-us}\rho(s)\,ds\right)du. \qquad (2.13)$$
It follows that the EPPF of (2.7) converges pointwise to that of the corresponding
(homogeneous) normalized completely random measure (2.3) when $\varepsilon$ tends to 0.
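Formula (2.13) can be checked numerically in a case where everything is explicit. For the gamma intensity $\rho(s) = s^{-1}\mathrm{e}^{-s}$ (the Dirichlet process with mass $\kappa$), the Laplace exponent gives $\mathrm{e}^{\kappa\int(\mathrm{e}^{-us}-1)\rho(s)\,ds} = (1+u)^{-\kappa}$ and the inner integral for $n_1 = 2$ equals $\kappa/(1+u)^2$, so the co-clustering probability $p(2)$ must equal $1/(\kappa+1)$, the well-known Dirichlet process value. A quadrature sketch (the grid sizes are arbitrary choices of ours):

```python
def p2_gamma_intensity(kappa, u_max=2000.0, n_grid=200000):
    """p(2) from (2.13) for rho(s) = s^{-1} e^{-s}, i.e.
    p(2) = int_0^inf u * (1+u)^{-kappa} * kappa * (1+u)^{-2} du,
    evaluated with the trapezoidal rule on [0, u_max]."""
    h = u_max / n_grid

    def f(u):
        return kappa * u * (1.0 + u) ** (-kappa - 2.0)

    total = 0.5 * (f(0.0) + f(u_max))
    for i in range(1, n_grid):
        total += f(i * h)
    return total * h
```

The agreement with $1/(\kappa+1)$ for several values of $\kappa$ is a quick sanity check of both (2.13) and its specialization to the Dirichlet process.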
Proposition 2.2
Let $p_\varepsilon(\cdot)$ be the EPPF of an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$. Then for any sequence $n_1, \dots, n_k$
of positive integers with $k > 0$ and $\sum_{i=1}^{k} n_i = n$,
$$\lim_{\varepsilon\to 0}\, p_\varepsilon(n_1, \dots, n_k) = p_0(n_1, \dots, n_k), \qquad (2.14)$$
where $p_0(\cdot)$ is the EPPF of the $\mathrm{NormCRM}(\rho, \kappa P_0)$ as in (2.13).
Convergence of the sequence of EPPFs yields convergence of the sequence of
$\varepsilon$-NormCRMs, generalizing a result obtained for $\varepsilon$-NGG processes.
Proposition 2.3
Let $P_\varepsilon$ be an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$, for any $\varepsilon > 0$. Then
$$P_\varepsilon \stackrel{d}{\to} P \quad \text{as } \varepsilon\to 0,$$
where $P$ is a $\mathrm{NormCRM}(\rho, \kappa P_0)$. Moreover, as $\varepsilon$ tends to $+\infty$, $P_\varepsilon \stackrel{d}{\to} \delta_{\tau_0}$, where
$\tau_0 \sim P_0$.
The proof of the above proposition is along the same lines as the proof of Propo-
sition 1 in Argiento et al. (2016a), and is therefore omitted here.
Furthermore, the $m$-th moment of $P_\varepsilon$, $m = 1, 2, \dots$, is equal to:
$$\mathrm{E}\left[(P_\varepsilon(B))^m\right] = \mathrm{E}\left[(P_0(B))^{K_m}\right] \qquad (2.15)$$
where $B \in \mathcal{B}(\Theta)$ and $K_m$ is the number of distinct values in a sample of size $m$ from
$P_\varepsilon$. In particular, when $m = 2$, $K_m$ assumes values in $\{1, 2\}$, and the probability
that $K_2 = 1$ is the probability that, in a sample of size 2 from $P_\varepsilon$, the sampled values
coincide, i.e. $p_\varepsilon(2)$. Therefore $\mathrm{E}(P_\varepsilon(B)^2) = P_0(B)p_\varepsilon(2) + (P_0(B))^2(1 - p_\varepsilon(2))$, and
consequently
$$\mathrm{Var}(P_\varepsilon(B)) = p_\varepsilon(2)P_0(B)\left(1 - P_0(B)\right). \qquad (2.16)$$
Analogously, the covariance structure of $P_\varepsilon$ is as follows:
$$\mathrm{Cov}(P_\varepsilon(B_1), P_\varepsilon(B_2)) = p_\varepsilon(2)\left(P_0(B_1\cap B_2) - P_0(B_1)P_0(B_2)\right) \qquad (2.17)$$
for any $B_1, B_2 \in \mathcal{B}(\Theta)$. Proofs of (2.15) and (2.17) are given in Appendix 2.B.
2.4 ε-NormCRM process mixtures
We consider mixtures of parametric kernels as the distribution of the data, where
the mixing measure is an $\varepsilon\text{-NormCRM}(\rho, \kappa P_0)$. The model we assume is the
following:
$$\begin{aligned}
Y_i \mid \theta_i &\stackrel{ind}{\sim} f(\cdot; \theta_i), \quad i = 1, \dots, n\\
\theta_i \mid P_\varepsilon &\stackrel{iid}{\sim} P_\varepsilon, \quad i = 1, \dots, n\\
P_\varepsilon &\sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0),\\
\varepsilon &\sim \pi(\varepsilon),
\end{aligned} \qquad (2.18)$$
where $f(\cdot; \theta)$ is a parametric family of densities on $\mathcal{Y} \subset \mathbb{R}^p$, for all $\theta \in \Theta \subset \mathbb{R}^m$. It
is a special case of model (1.4) in Chapter 1, where the de Finetti measure is given
by the family of $\varepsilon$-NormCRM processes.
Remember that $P_0$ is a non-atomic probability measure on $\Theta$, such that $\mathrm{E}(P_\varepsilon(A)) = P_0(A)$ for all $A \in \mathcal{B}(\Theta)$ and $\varepsilon \geq 0$. Model (2.18) will be referred to here as the
$\varepsilon$-NormCRM hierarchical mixture model.
The design of a Gibbs scheme to sample from the posterior distribution of model
(2.18) is straightforward, once we have augmented the state space with the variable
$u$, by using the posterior characterization in Theorem 2.3.1. The Gibbs sampler
generalizes the one provided in Argiento et al. (2016a) for $\varepsilon$-NGG mixtures, but
it is designed for any Lévy intensity $\rho$ under (2.1) and (2.2). A description of the
full conditionals is given below, and further details can be found in Appendix 2.A.
1. Sampling from $\mathcal{L}(u \mid \mathbf{Y}, \boldsymbol{\theta}, P_\varepsilon, \varepsilon)$: it is clear that, conditionally on $P_\varepsilon$, $u$ is
independent of the other variables and gamma distributed with parameters $(n, T_\varepsilon)$.
2. Sampling from $\mathcal{L}(\boldsymbol{\theta} \mid u, \mathbf{Y}, P_\varepsilon, \varepsilon)$: each $\theta_i$, for $i = 1, \dots, n$, has discrete law
with support $\{\tau_0, \tau_1, \dots, \tau_{N_\varepsilon}\}$ and probabilities $\mathbb{P}(\theta_i = \tau_j) \propto J_jf(Y_i; \tau_j)$.
3. Sampling from $\mathcal{L}(P_\varepsilon, \varepsilon \mid u, \boldsymbol{\theta}, \mathbf{Y})$: this step is not straightforward and can be
split into two consecutive substeps:
3.a Sampling from $\mathcal{L}(\varepsilon \mid u, \boldsymbol{\theta}, \mathbf{Y})$: see Appendix 2.A.
3.b Sampling from $\mathcal{L}(P_\varepsilon \mid \varepsilon, u, \boldsymbol{\theta}, \mathbf{Y})$: via the characterization of the posterior
in Theorem 2.3.1, since this distribution is equal to $\mathcal{L}(P_\varepsilon \mid \varepsilon, u, \boldsymbol{\theta})$. In practice,
we have to sample (i) the number $N^{na}$ of non-allocated
jumps, (ii) the vector of the unnormalized non-allocated jumps $\mathbf{J}^{(na)}$,
(iii) the vector of the unnormalized allocated jumps $\mathbf{J}^{(a)}$, and the support of
the allocated (iv) and non-allocated (v) jumps. See Appendix 2.A for a
fuller description.
We highlight that, when sampling from non-standard distributions, accept-reject
or Metropolis-Hastings steps are exploited.
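As a concrete illustration of step 2 above, the allocation update can be written in a few lines. The Gaussian kernel and the variable names below are our illustrative choices, not part of the general algorithm.

```python
import math
import random

def sample_allocations(Y, jumps, atoms, tau2, rng):
    """Step 2 of the blocked Gibbs sampler: allocate each observation Y_i to
    one of the atoms tau_0, ..., tau_{N_eps}, with
    P(theta_i = tau_j) proportional to J_j * f(Y_i; tau_j),
    here with a Gaussian kernel f(y; tau) = N(y; tau, tau2).
    The Gaussian normalizing constant is omitted: with a common variance
    tau2 it cancels in the normalization of the weights."""
    alloc = []
    for y in Y:
        w = [J * math.exp(-0.5 * (y - t) ** 2 / tau2)
             for J, t in zip(jumps, atoms)]
        u = rng.random() * sum(w)
        c = 0.0
        for j, wj in enumerate(w):
            c += wj
            if u <= c:
                break
        alloc.append(j)
    return alloc
```

Note that the unnormalized jumps $J_j$ enter directly: the total mass $T_\varepsilon$ cancels, so the weights of $P_\varepsilon$ never need to be normalized in this step.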
2.4.1 Some ideas on the choice of ε
We believe that a brief discussion on the choice of the approximation parameter
$\varepsilon$ is worthwhile. We could also consider it random, as we did in Argiento et al.
(2016a), where the $\varepsilon$-NGG mixture model was proposed. In our general view, this
parameter can be considered either as a true parameter, in which case it should be
fixed on the basis of the prior information we have, or as a tuning parameter to
approximate the exact model (normalized completely random measure mixtures).
If we prefer the latter alternative, as we did here, $\varepsilon$ has to be small. However, since the
result on the $\varepsilon$-approximation (Proposition 2.3) concerns the prior distribution in (2.18),
the only suggestions we can give refer to a priori criteria. Here we suggest setting $\varepsilon$
such that the sum of the masses $\mu((0, \varepsilon])$ and $J_0$ with which we perturb $\mu$, obtaining $\bar{\mu}_\varepsilon$,
is small. In particular, since the interest is in normalized random measures, "small"
is fixed with respect to the expectation $\mathrm{E}(T)$ of the total mass of $\mu$, i.e. we choose
$\varepsilon$ such that
$$r(\varepsilon) := \frac{\mathrm{E}(\mu(0, \varepsilon]) + \mathrm{E}(J_0)}{\mathrm{E}(T)} \leq \nu, \qquad (2.19)$$
where $\nu$ is typically a small value. Of course, alternative criteria are available; for
instance, as in Argiento et al. (2016a), we could choose $\varepsilon$ to achieve a prefixed value
for $\mathrm{E}(N_\varepsilon)$ or $\mathrm{Var}(N_\varepsilon)$. As far as (2.19) is concerned, observe that
$$\mathrm{E}(\mu(0, \varepsilon]) = \kappa\int_0^\varepsilon s\rho(s)\,ds, \qquad \mathrm{Var}(\mu(0, \varepsilon]) = \kappa\int_0^\varepsilon s^2\rho(s)\,ds;$$
from (2.1), it follows that
$$\mathrm{E}(\mu(0, \varepsilon])\to 0, \quad \mathrm{Var}(\mu(0, \varepsilon])\to 0 \quad \text{as } \varepsilon\to 0,$$
i.e. the r.v. $\mu(0, \varepsilon]$ converges to 0 in $L^2$, which implies convergence in probability.
Besides, we have that
$$\varepsilon \leq \mathrm{E}(J_0) = \frac{\kappa\int_\varepsilon^{+\infty} s\rho(s)\,ds}{\Lambda_\varepsilon} \leq \frac{\mathrm{E}(T)}{\mathrm{E}(N_\varepsilon)}.$$
Consequently, when $\varepsilon\to 0$, $\mathrm{E}(N_\varepsilon)\to +\infty$ and thus $\mathrm{E}(J_0)$ converges to 0.
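These expressions suggest a direct numerical recipe for calibrating $\varepsilon$. The sketch below evaluates $r(\varepsilon)$ of (2.19) for the generalized gamma intensity with $\omega = 1$ (the example treated next), using the alternating series for $\mathrm{E}(\mu(0,\varepsilon])$ and quadrature for the incomplete gamma functions in $\mathrm{E}(J_0) = \Gamma(1-\sigma, \varepsilon)/\Gamma(-\sigma, \varepsilon)$; in this case $\mathrm{E}(T) = \kappa$. The grid sizes and cutoffs are illustrative choices of ours.

```python
import math

def r_eps(eps, sigma, kappa=1.0, s_max=60.0, n_grid=20000):
    """r(eps) in (2.19) for rho(s) = s^{-1-sigma} e^{-s} / Gamma(1-sigma), omega = 1.
    E(T) = kappa; E(mu(0,eps]) via the alternating series; E(J0) via quadrature."""
    # E(mu(0, eps]) = kappa/Gamma(1-sigma) * sum_n (-1)^n eps^{1-sigma+n} / (n! (1-sigma+n))
    e_small = sum((-1) ** n * eps ** (1.0 - sigma + n)
                  / (math.factorial(n) * (1.0 - sigma + n)) for n in range(30))
    e_small *= kappa / math.gamma(1.0 - sigma)

    # upper incomplete gamma Gamma(a, eps) by trapezoids on a log grid over (eps, s_max)
    grid = [eps * (s_max / eps) ** (i / n_grid) for i in range(n_grid + 1)]

    def upper_inc_gamma(a):
        f = lambda s: s ** (a - 1.0) * math.exp(-s)
        return sum(0.5 * (f(x) + f(y)) * (y - x) for x, y in zip(grid, grid[1:]))

    e_j0 = upper_inc_gamma(1.0 - sigma) / upper_inc_gamma(-sigma)
    return (e_small + e_j0) / kappa
```

Evaluating this function over a grid of $\varepsilon$ values reproduces the qualitative behavior discussed below: $r(\varepsilon)$ decreases as $\varepsilon\to 0$, and smaller $\sigma$ requires a smaller threshold for the same $\nu$.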
As an interesting example, we evaluate the ratio $r(\varepsilon)$ when $\rho(s) = \frac{1}{\Gamma(1-\sigma)}s^{-1-\sigma}\mathrm{e}^{-\omega s}$ for $0 \leq \sigma < 1$, $\kappa > 0$ and $\omega = 1$, that is, when $\mu$ is the gen-
eralized gamma process, i.e. the unnormalized CRM defining NGG processes by
normalization.

Figure 2.1: Values of $r(\varepsilon)$ when $\rho$ is the Lévy intensity of the generalized gamma CRM, with $\kappa = 1$ and different values of $\sigma$ ($0.001$, $0.1$, $0.3$, $0.6$), as a function of $\log_{10}(\varepsilon)$.

By 8.354.2 in Gradshteyn and Ryzhik (2007), we have that
$$\mathrm{E}\left(\mu(0, \varepsilon]\right) = \frac{\kappa}{\Gamma(1-\sigma)}\left(\Gamma(1-\sigma) - \Gamma(1-\sigma; \varepsilon)\right)
= \frac{\kappa}{\Gamma(1-\sigma)}\left(\sum_{n=0}^{+\infty}\frac{(-1)^n\varepsilon^{1-\sigma+n}}{n!(1-\sigma+n)}\right) \stackrel{\varepsilon\to 0}{\sim} \frac{\kappa\varepsilon^{1-\sigma}}{\Gamma(2-\sigma)},$$
and $\mathrm{E}(J_0) = \Gamma(1-\sigma, \varepsilon)/\Gamma(-\sigma, \varepsilon)$. We also mention that $\mathrm{Var}(\mu(0, \varepsilon]) \sim (\kappa\varepsilon^{2-\sigma})/\Gamma(2-\sigma)$ as $\varepsilon$ tends to 0. Figure 2.1 shows $r(\varepsilon)$ when $\mu$ is the generalized gamma process
with $\kappa = 1$ and different values of $\sigma$, as a function of $\varepsilon$. Note that a smaller threshold
$\varepsilon$ is needed in order to obtain the same value of $\nu$ when the parameter $\sigma$ decreases
to 0.
Similar calculations can be derived when µ is the Bessel random measure intro-
duced in the next section.
2.5 Normalized Bessel random measure mixtures: density estimation
In this section we introduce a new normalized process, called the normalized Bessel
random measure. Section 2.5.1 describes theoretical results: in particular, we show
that this family encompasses the well-known Dirichlet process. Then we fit the
mixture model to synthetic and real datasets in Section 2.5.2. Results are illustrated
through a density estimation problem.
2.5.1 Definition
Let us consider a normalized completely random measure corresponding to mean
intensity
$$\rho(s; \omega) = \frac{1}{s}\,\mathrm{e}^{-\omega s}I_0(s), \quad s > 0,$$
where $\omega \geq 1$ and
$$I_\nu(s) = \sum_{m=0}^{+\infty}\frac{(s/2)^{2m+\nu}}{m!\,\Gamma(\nu + m + 1)}$$
is the modified Bessel function of order $\nu \geq 0$ (see Erdélyi et al., 1953, Sect 7.2.2).
It is straightforward to see that, for $s > 0$,
$$\rho(s; \omega) = \frac{1}{s}\,\mathrm{e}^{-\omega s} + \sum_{m=1}^{+\infty}\frac{1}{2^{2m}(m!)^2}\,s^{2m-1}\mathrm{e}^{-\omega s}, \qquad (2.20)$$
so that $\rho$ is the sum of the Lévy intensity of the gamma process with rate parameter
$\omega$ and of the Lévy intensities
$$\rho_m(s; \omega) = \frac{1}{2^{2m}(m!)^2}\,s^{2m-1}\mathrm{e}^{-\omega s}, \quad s > 0, \quad m = 1, 2, \dots \qquad (2.21)$$
corresponding to finite activity Poisson processes. It is simple to check that (2.1)
and (2.2) hold. Hence, following (2.3) in Section 2.2, we introduce the normalized
Bessel random measure $P$, with parameters $(\omega, \kappa)$, where $\omega \geq 1$ and $\kappa > 0$. Thanks
to (2.20) and the Superposition Property of Poisson processes, the total mass $T$ in
(2.3) can be written as
$$T \stackrel{d}{=} T_G + \sum_{m=1}^{+\infty} T_m, \qquad (2.22)$$
where $T_G, T_1, T_2, \dots$ are independent random variables, $T_G$ being the total mass
of the gamma process and $T_m$ the total mass of a completely random measure
corresponding to the intensity $\nu_m(ds, d\tau) = \rho_m(s)\,ds\,\kappa P_0(d\tau)$. In particular, $T_G \sim \mathrm{gamma}(\kappa, \omega)$, while $T_m = \sum_{j=1}^{N_m} J^{(m)}_j$, where $N_m \sim \mathrm{Poisson}\big(\kappa\Gamma(2m)/((2\omega)^{2m}(m!)^2)\big)$,
and the $J^{(m)}_j$ are the points of a Poisson process on $\mathbb{R}^+$ with intensity $\kappa\rho_m$. By this
notation we mean that $T_m$ is equal to 0 when $N_m = 0$, while, conditionally on
$N_m > 0$, $J^{(m)}_j \stackrel{iid}{\sim} \mathrm{gamma}(2m, \omega)$. We can write down the density function of $T$ via
the Lévy-Khintchine representation:
$$\psi(\lambda) := -\log\left(\mathrm{E}\left(\mathrm{e}^{-\lambda T}\right)\right) = \kappa\int_0^{+\infty}(1 - \mathrm{e}^{-\lambda s})\rho(s; \omega)\,ds$$
$$= \kappa\left(\log\left(\frac{\omega + \lambda}{\omega}\right) + \sum_{m=1}^{+\infty}\frac{\Gamma(2m)}{2^{2m}(m!)^2\omega^{2m}} - \sum_{m=1}^{+\infty}\frac{\Gamma(2m)}{2^{2m}(m!)^2(\omega + \lambda)^{2m}}\right)$$
$$= \kappa\log\left(\frac{\omega + \lambda + \sqrt{(\omega + \lambda)^2 - 1}}{\omega + \sqrt{\omega^2 - 1}}\right).$$
The same expression is obtained when $T \sim f_T(t) = \kappa\left(\omega + \sqrt{\omega^2 - 1}\right)^\kappa\frac{\mathrm{e}^{-\omega t}}{t}I_\kappa(t)$,
$t > 0$ (see Gradshteyn and Ryzhik, 2007, formula (17.13.112)). Observe that, when
$\omega = 1$, $f_T$ is called the Bessel function density (Feller, 1971). By (2.13), the EPPF of
the normalized Bessel random measure is:
$$p_B(n_1, \dots, n_k; \omega, \kappa) = \kappa^k\int_0^{+\infty}\frac{u^{n-1}}{\Gamma(n)}\left(\frac{\omega + \sqrt{\omega^2 - 1}}{\omega + u + \sqrt{(\omega + u)^2 - 1}}\right)^{\kappa}\frac{1}{(u + \omega)^n}$$
$$\times\prod_{j=1}^{k}\Gamma(n_j)\,{}_2F_1\!\left(\frac{n_j}{2}, \frac{n_j + 1}{2}; 1; \frac{1}{(u + \omega)^2}\right)du, \qquad (2.23)$$
where
$${}_2F_1(\alpha_1, \alpha_2; \gamma; z) := \sum_{m=0}^{\infty}\frac{(\alpha_1)_m(\alpha_2)_m}{(\gamma)_m}\,\frac{z^m}{m!}, \quad \text{with } (\alpha)_m := \frac{\Gamma(\alpha + m)}{\Gamma(\alpha)},$$
is the hypergeometric series (see Gradshteyn and Ryzhik, 2007, formula (9.100)).
The following proposition shows that the EPPF of the normalized Bessel random
measure converges to the EPPF of the Dirichlet process as the parameter $\omega$ increases.
The proof is given in Appendix 2.B.
Proposition 2.4
Let $(n_1, \dots, n_k)$ be a vector of positive integers such that $\sum_{i=1}^{k} n_i = n$, where
$k = 1, \dots, n$. Then the EPPF (2.23), associated with the normalized Bessel random
measure $P$ with parameter $(\omega, \kappa)$, $\omega \geq 1$, $\kappa > 0$, and mean measure $P_0$, is such that
$$\lim_{\omega\to+\infty} p_B(n_1, \dots, n_k; \omega, \kappa) = p_D(n_1, \dots, n_k; \kappa),$$
where $p_D(n_1, \dots, n_k; \kappa)$ is the EPPF of the Dirichlet process with measure parameter
$(\kappa, P_0)$.
The prior distribution of $K_n$, the number of distinct values in a sample of size
$n$ from the normalized Bessel random measure, could be derived from its EPPF in
(2.23). However, this is not an easy task from a computational point of view, so
we prefer to use a Monte Carlo strategy to simulate from the prior of $K_n$. The
simulation strategy is also useful to understand the meaning of the parameters of
the normalized Bessel random measure: $\kappa$ has the usual interpretation of the mass
parameter, since, when fixing $\omega$, $\mathrm{E}(K_n)$ increases with $\kappa$. On the other hand, the
effect of $\omega$ is quite peculiar: decreasing $\omega$ (thus drifting apart from the Dirichlet
process), with $\kappa$ fixed, the prior distribution of $K_n$ shifts towards smaller values.
However, when $\mathrm{E}(K_n)$ is kept fixed, the distribution has heavier tails if $\omega$ is small
(see Figures 2.2 and 2.4 (a)).
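The Monte Carlo strategy itself is simple: sample the weights of (an approximation of) the random probability measure, draw $n$ values from the resulting discrete distribution, and count the distinct ones. The sketch below illustrates the idea with Dirichlet process stick-breaking weights as a stand-in weight sampler (any $\varepsilon$-NormCRM weight sampler plugs in the same way); in the Dirichlet case the estimate can be checked against the known value $\mathrm{E}(K_n) = \sum_{i=0}^{n-1}\kappa/(\kappa+i)$.

```python
import bisect
import random

def stick_breaking_weights(kappa, rng, tol=1e-10):
    """Dirichlet process weights via stick breaking, V_i ~ Beta(1, kappa),
    truncated once the leftover stick is below tol (a stand-in sampler)."""
    weights, rest = [], 1.0
    while rest > tol:
        v = rng.betavariate(1.0, kappa)
        weights.append(rest * v)
        rest *= 1.0 - v
    return weights

def sample_kn(n, weight_sampler, rng):
    """K_n: number of distinct atoms hit by an iid sample of size n from the
    discrete random probability with the given weights (atom = index)."""
    w = weight_sampler(rng)
    cdf, acc = [], 0.0
    for x in w:
        acc += x
        cdf.append(acc)
    # each draw picks the first index whose cumulative weight exceeds u
    return len({bisect.bisect_left(cdf, rng.random() * acc) for _ in range(n)})
```

Repeating `sample_kn` many times gives a Monte Carlo approximation of the whole prior distribution of $K_n$, not just its mean, which is how plots like Figure 2.2 can be produced.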
The Lévy intensity (2.20) of the normalized Bessel completely random measure
has an expression similar to the intensity corresponding to an element of the class $\mathcal{C}$
in Trippa and Favaro (2012). Both intensities are linear combinations of the intensity
of the gamma process and of intensities of the type $s^{i-1}\mathrm{e}^{-\omega s}\mathbb{1}_{(0,+\infty)}(s)$, corresponding
to finite activity Poisson processes. Here, the intensity of the Bessel random probability
measure corresponds to an infinite mixture with fixed weights, where the indexes
$i$ are even integers (see (2.21)), while Trippa and Favaro (2012) assume a linear
combination of a finite number of components, through a vector of parameters.

Figure 2.2: Prior distribution of $K_n$ under a sample from the $\varepsilon$-NB process with $\varepsilon = 10^{-6}$, $\omega = 1.05$ and several values of $\kappa$ (0.1, 0.5, 1, 3, 5), as reported in the legend.
2.5.2 Application
In this section let us consider the hierarchical mixture model (2.18), where the
mixing measure is Pε, the ε-approximation of the normalized Bessel random measure
introduced above (here called the ε-NB(ω, κP0) mixture model). Of course, when
ε is small, this model approximates the corresponding mixture with mixing
measure P; to the best of our knowledge, this normalized Bessel completely
random measure has never been considered in the Bayesian nonparametric literature.
By decomposition (2.22), we argue that this model is suitable when the
unknown density shows many different components, a few of which are very
spiky (these should correspond to the Lévy intensities (2.21)), while there is a group of
flatter components which are explained by the intensity (1/s)e^{−ωs} of the gamma
process. For this reason, we consider a simulated dataset which is a sample from a
mixture of 5 Gaussian distributions with means and standard deviations equal to
(15, 1.1), (50, 1), (20, 4), (30, 5), (40, 5), and weights proportional to 10, 9, 4, 5, 5. The histogram of the simulated data, for n = 1000, is reported in Figure 2.3.
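For reference, a dataset with the mixture structure just described can be generated as follows; this is a sketch in Python, not the thesis code, and the random seed is an arbitrary choice of mine.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Means, standard deviations and (unnormalized) weights of the 5 components.
means = np.array([15.0, 50.0, 20.0, 30.0, 40.0])
sds = np.array([1.1, 1.0, 4.0, 5.0, 5.0])
weights = np.array([10.0, 9.0, 4.0, 5.0, 5.0])
weights = weights / weights.sum()

def simulate_mixture(n, rng):
    """Draw n observations from the five-component Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], sds[comp])

y = simulate_mixture(1000, rng)
```

Plotting a histogram of `y` reproduces the qualitative shape of Figure 2.3: two sharp spikes (around 15 and 50) over a flatter background.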
We report posterior estimates for different sets of hyperparameters of the ε-NB
mixture model when f(·; θ) is the Gaussian density on R and θ = (µ, σ²) stands for
its mean and variance. Moreover, P0(dµ, dσ²) = N(dµ; yn, σ²/κ0) × IG(dσ²; a, b).
We set κ0 = 0.01, a = 2 and b = 1, as first proposed in Escobar and West (1995). We
consider three sets of hyperparameters in order to assess the sensitivity of the
estimates under different conditions of variability; indeed, each set has a different
value of pε(2), which tunes the a-priori variance of Pε, as reported in (2.16). We
tested three different values for pε(2): pε(2) = 0.9 in set (A), pε(2) = 0.5 in set
(B) and pε(2) = 0.1 in set (C). Moreover, in each scenario we let the parameter
1/ω range in {0.01, 0.25, 0.5, 0.75, 0.95}; note that the extreme case of ω = 100
(or equivalently 1/ω = 0.01) corresponds to an approximation of the DPM model.
Figure 2.3: Density estimate for case A5: posterior mean (line), 90% pointwise credibility intervals (shadowed area), true density (dashed) and the histogram of simulated data.
The mass parameter κ is then fixed to achieve the desired level of pε(2). As far
as the choice of ε is concerned, we set it equal to 10⁻⁶: this provides a pretty good
approximation a priori (see Section 2.4.1); moreover, posterior inference proved to
be fairly robust with respect to ε. In the end, we obtained 15 tests, listed in Table 2.1.
As mentioned before, it is possible to choose a prior for ε, even if, for the ρ in (2.20),
the computational cost would greatly increase due to the evaluation of the functions
₂F₁ in (2.23).
We have implemented our Gibbs sampler in C++. All the tests in Sections 2.5
and 2.6 were run on a laptop with an Intel Core i7 2670QM processor and 6GB of
RAM. Every run produced a final sample size of 5000 iterations, after a thinning of
10 and an initial burn-in of 5000 iterations. Convergence was checked each time
using the standard tools of the R package CODA.
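The post-processing just described can be sketched as follows; note that a final sample of 5000 draws after a burn-in of 5000 and a thinning of 10 implies 55000 total iterations, and the chain below is a placeholder array rather than actual sampler output.

```python
import numpy as np

def thin_chain(chain, burn_in=5000, thin=10):
    """Discard the burn-in iterations, then keep every `thin`-th draw."""
    return chain[burn_in::thin]

# 5000 burn-in + 50000 further iterations, thinned by 10, leave 5000 draws;
# `raw` is a placeholder array, not actual output of the Gibbs sampler.
raw = np.arange(55000)
kept = thin_chain(raw)
```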
Here, we focus on density estimation: all the tests provide similar estimates,
quite faithful to the true density. Figure 2.3 shows the density estimate and pointwise
90% credibility intervals for case A5; the true density is superimposed as a dashed line.
Figure 2.4 displays prior and posterior distributions, respectively, of the number Kn
of groups, i.e. the number of unique values among (θ1, . . . , θn) in (2.18), under two
sets of hyperparameters: A1, representing an approximation of the DPM model, and
A5, where the parameter ω is nearly 1. From Figure 2.4 it is clear that A5 is more
flexible than A1: for case A5, a priori the variance of Kn is larger, and, on the other
hand, the posterior probability mass at 5 (the true value) is larger.
In order to compare different priors, we take into account five predictive
goodness-of-fit indexes: (i) the sum of squared errors (SSE), i.e. the sum of the
squared differences between the yi and the predictive mean E(Yi|data) (yes, we are
using the data twice!); (ii) the sum of standardized absolute errors (SSAE), given by
$\sum_{i} |y_i - \mathrm{E}(Y_i\mid \text{data})| / \sqrt{\mathrm{Var}(Y_i\mid \text{data})}$; (iii) the log-pseudo
marginal likelihood (LPML), quite standard in the Bayesian literature, defined as
the sum of log(CPOi), where CPOi is the conditional predictive ordinate of yi,
i.e. the value of the predictive distribution evaluated at yi, conditioning on the training
sample given by all data except yi. The last two indexes, (iv) WAIC1 and (v)
WAIC2, as denoted here, were proposed in Watanabe (2010) and analyzed in depth
in Gelman et al. (2014): they are generalizations of the AIC, adding two types of
penalization, both accounting for the effective number of parameters. The bias
correction in WAIC1 is similar to the bias correction in the definition of the DIC,
while WAIC2 is based on the sum of the posterior variances of the conditional density of
the data. See Gelman et al. (2014) for their precise definitions.

Figure 2.4: Prior (left) and posterior (right) distributions of the number Kn of groups for test A1 (gray) and A5 (blue).

Table 2.1 shows the
values of the five indexes for each test: the optimal (according to each index) tests
are highlighted in bold for the experiments (A), (B) and (C). It is apparent that
the different tests provide similar values of the indexes, except for SSE, indicating that,
from a predictive viewpoint, there are no significant differences among the priors.
However, especially when the value of κ is small, i.e. in all tests A and B, a model
with a smaller ω tends to outperform the Dirichlet process case (approximately,
when ω = 100). On the other hand, the SSE index shows quite different values
among the tests: it is well known that this index favors complex models and
leads to better results when data are overfitted. Therefore, tests with a higher
value of κ are always preferable according to this criterion.
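As an illustration, the five indexes can be computed from Monte Carlo output roughly as follows. This is a hedged sketch with my own function and argument names (the thesis code is in C++); CPO is estimated with the usual harmonic-mean estimator, and the WAIC penalties follow the pWAIC1/pWAIC2 definitions of Gelman et al. (2014), reported here on the lppd scale (the scale used in the thesis tables may differ).

```python
import numpy as np

def predictive_indexes(y, pred_draws, dens_draws):
    """Sketch of the five goodness-of-fit indexes.

    y          : (n,) observed data
    pred_draws : (S, n) draws from the predictive distribution of each Y_i
    dens_draws : (S, n) conditional densities f(y_i | theta^(s)) per iteration
    """
    pred_mean = pred_draws.mean(axis=0)
    pred_var = pred_draws.var(axis=0)
    sse = np.sum((y - pred_mean) ** 2)
    ssae = np.sum(np.abs(y - pred_mean) / np.sqrt(pred_var))
    # LPML = sum_i log CPO_i, with the harmonic-mean estimator of CPO_i.
    cpo = 1.0 / np.mean(1.0 / dens_draws, axis=0)
    lpml = np.sum(np.log(cpo))
    # lppd and the two penalties pWAIC1, pWAIC2 of Gelman et al. (2014).
    lppd_i = np.log(dens_draws.mean(axis=0))
    p1 = 2.0 * np.sum(lppd_i - np.log(dens_draws).mean(axis=0))
    p2 = np.sum(np.log(dens_draws).var(axis=0))
    lppd = np.sum(lppd_i)
    return {"SSE": sse, "SSAE": ssae, "LPML": lpml,
            "WAIC1": lppd - p1, "WAIC2": lppd - p2}
```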
We fitted our model also to a real dataset, the Hidalgo stamps data of Wilson (1983),
consisting of n = 485 measurements of stamp thickness in millimeters
(here multiplied by 10³). The stamps were printed between 1872 and 1874
on different paper types; see the data histogram in Figure 2.5. This dataset has been
analyzed by different authors in the context of mixture models: see, for instance,
Nieto-Barajas (2013).
Table 2.1: Predictive goodness-of-fit indexes for the simulated dataset.

Test   ω     κ     SSE      SSAE    WAIC1     WAIC2     LPML
A1     100   0.06  6346.59  811.16  -3312.44  -3312.55  -3312.55
A2     4     0.09  5812.86  810.43  -3312.33  -3312.42  -3312.43
A3     2     0.1   6089.19  810.99  -3312.38  -3312.47  -3312.48
A4     1.33  0.11  6498.23  811.29  -3312.54  -3312.62  -3312.63
A5     1.05  0.11  5725.18  810.39  -3312.27  -3312.36  -3312.36
B1     100   0.43  5184.25  809.61  -3311.95  -3312     -3312.01
B2     4     0.67  5125.41  809.7   -3312.19  -3312.25  -3312.26
B3     2     0.81  4610.39  809.42  -3311.92  -3311.98  -3312
B4     1.33  0.93  4246.43  809.07  -3311.75  -3311.83  -3311.84
B5     1.05  1     4571.09  809.08  -3311.96  -3312.05  -3312.06
C1     100   1.56  3707.5   809.36  -3311.73  -3311.86  -3311.88
C2     4     2.67  2194.1   808.8   -3312.02  -3312.23  -3312.26
C3     2     3.64  1223.86  809.28  -3312.62  -3312.96  -3312.99
C4     1.33  5.29  748.85   808.7   -3313.05  -3313.51  -3313.54
C5     1.05  8.95  685      807.96  -3312.9   -3313.36  -3313.38

We report posterior inference for the set of hyperparameters which is most in
agreement with our prior belief: the mean distribution is given by P0(dµ, dσ²) =
N(dµ; yn, σ²/κ0) × IG(dσ²; a, b) as before, with κ0 = 0.005, a = 2 and b = 0.1. The
approximation parameter ε of the ε-NB(ω, κP0) random measure is fixed to 10⁻⁶;
on the other hand, in order to set the parameters ω and κ, we argue as follows: ω ranges
in {1.05, 5, 10, 1000} and we choose the mass parameter κ such that the prior mean
of the number of clusters, i.e. E(Kn), equals the desired value. As noted in Section 2.5.1,
a closed form of the prior distribution of Kn is not available, so we resort to Monte
Carlo simulation to estimate it. Table 2.2 shows the four couples (ω, κ) yielding
E(Kn) = 7: indeed, according to Ishwaran and James (2002) and McAuliffe et al.
(2006) and references therein, there are at least 7 different groups (but the true
number is unknown), corresponding to the number of types of paper used. For an
in-depth discussion about the appropriate number of groups in the Hidalgo stamps data,
we refer the reader to Basford et al. (1997). Table 2.2 also reports prior standard
deviations of Kn: even if the a-priori differences are small, the posteriors appear to
be quite different among the 4 tests. All the posterior distributions of Kn support
the conjecture of at least seven distinct modes in the data; in particular, Figure 2.5
(b) displays the posterior distribution of Kn for Test 4. A modest amount of mass
is given to fewer than 7 groups, and the mode is at 11. Even Test 1, corresponding to
the Dirichlet process case, does not give mass to fewer than 7 groups, and there 9 is the
mode. Density estimates seem pretty good; an example is given in Figure 2.5 (a),
with a 90% credibility band for Test 4.
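In the Dirichlet process limiting case, the prior mean of Kn is available in closed form, E(Kn) = Σᵢ κ/(κ + i − 1), so the κ matching a target E(Kn) can be found by bisection rather than Monte Carlo. The sketch below (my own code, with n = 485 and target 7 as in Table 2.2) illustrates the tuning idea; for the ε-NB tests with finite ω, where no closed form exists, a Monte Carlo estimate of E(Kn) would replace `dp_expected_k`.

```python
def dp_expected_k(kappa, n):
    """Prior mean of K_n under a DP(kappa): sum_{i=1}^n kappa/(kappa + i - 1)."""
    return sum(kappa / (kappa + i) for i in range(n))

def kappa_for_mean_k(target, n, lo=1e-4, hi=100.0, iters=100):
    """Bisection on kappa; dp_expected_k is increasing in kappa."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dp_expected_k(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

kappa = kappa_for_mean_k(7.0, 485)  # mass giving E(K_485) = 7 in the DP case
```

For n = 485 the bisection lands near κ ≈ 1, the same order of magnitude as the κ values of Table 2.2 (which refer to the ε-truncated Bessel measure, not an exact DP, so the numbers need not coincide).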
As in the simulated data example, some predictive goodness-of-fit indexes are
reported in Table 2.2: the optimal value for each index is indicated in bold. The
SSE is significantly lower when ω is small, thus suggesting a greater flexibility of
the model with small values of ω. The other indexes assume their optimal value in
Figure 2.5: Posterior inference for the Hidalgo stamp data for Test 4: histogram of the data, density estimate and 90% pointwise credibility intervals (a); posterior distribution of Kn (b).
Table 2.2: Predictive goodness-of-fit indexes for the Hidalgo stamps data.

Test  ω     κ     E(K)  sd(K)  SSE    SSAE    WAIC1    WAIC2    LPML
1     1000  0.98  7     2.04   15.17  384.1   -713.12  -713.96  -714.12
2     10    0.91  7     2.13   12.85  383.51  -713.22  -714.04  -714.25
3     5     0.92  7     2.18   13.52  383.68  -713.52  -714.3   -714.4
4     1.05  1.02  7     2.32   11.12  383.38  -712.84  -713.66  -714.05
Test 4 as well, even if those values are similar across the tests.
Our ε-approximation method turned out to be accurate and fast when compared
with competitors (the slice sampler and an a-posteriori truncation method) when the
mixing random probability measure is the NGG process and the kernel is Gaussian;
see Argiento et al. (2016a), Section 5.
2.6 Linear dependent NGG mixtures: an application to sports data
Let us consider a regression problem where, for ease of notation, the response Y is
univariate and continuous. We model the relationship (in distributional terms) between
the vector of covariates x = (x1, . . . , xp) and the response Y through a mixture
density, where the mixing measure is a collection {Px, x ∈ X} of ε-NormCRMs,
X being the space of all possible covariates. We follow the same approach as in
MacEachern (2000) and De Iorio et al. (2009) for the dependent Dirichlet process.
We define the dependent ε-NormCRM process {Px, x ∈ X}, conditionally on x,
as

$$ P_x \stackrel{d}{=} \sum_{j=0}^{N_\varepsilon} P_j\,\delta_{\gamma_j(x)}. \qquad (2.24) $$
The weights Pj are the normalized jumps as in (2.7), while the locations γj(x),
j = 1, 2, . . ., are independent stochastic processes with index set X and marginal
distributions P0x. Model (2.24) is such that, marginally, Px follows an ε-NormCRM
process with parameter (ρ, κP0x), where ρ is the intensity of a Poisson process on
R⁺, κ > 0, and P0x is a probability on R. Observe that, since Nε and the Pj do not
depend on x, (2.24) is a generalization of the single-weights dependent Dirichlet
process (see Barrientos et al., 2012, for this terminology). We also assume the
functions x 7→ γj(x) to be continuous.
The dependent ε-NormCRM process in (2.24) takes into account the vector
of covariates x only through γj(x). In particular, when the kernel of the mixture
(2.18) belongs to the exponential family, for each j, γj(x) = γ(x; τj) can be taken
as the link function of a generalized linear model, so that (2.18) specializes to

$$ \begin{aligned} Y_i \mid \theta_i, x_i &\overset{\text{ind}}{\sim} f(y; \gamma(x_i, \theta_i)), & i &= 1, \ldots, n, \\ \theta_i \mid P_\varepsilon &\overset{\text{iid}}{\sim} P_\varepsilon, & i &= 1, \ldots, n, \quad \text{where } P_\varepsilon \sim \varepsilon\text{-NormCRM}(\rho, \kappa P_0). \end{aligned} \qquad (2.25) $$
This last formulation is convenient because it facilitates parameter interpretation
as well as numerical posterior computation.
We analyze the Australian Institute of Sport (AIS) data set (Cook and Weisberg,
1994), which consists of 11 physical measurements on 202 athletes (100 females and
102 males). Here the response is the lean body mass (lbm), while three covariates
are considered: the red cell count (rcc), the height in cm (Ht) and the weight in
kg (Wt). The data set is contained in the R package DPpackage (Jara et al.,
2011). The actual model (2.25) we consider here is the one where f(·; µ, η²) is the Gaussian
density with mean µ and variance η²; moreover, µ = γ(x, θ) = xᵗθ, and the
mixing measure Pε is the ε-NGG(κ, σ, P0), as introduced in Argiento et al. (2016a).
We have considered two cases, where the variance η² is either mixed with respect to the
NGG process or given a parametric density; in both cases,
by linearity of the mean xᵗθ, the model (here called the linear dependent NGG mixture)
can be interpreted as an NGG process mixture model, and inference can be achieved
via an algorithm similar to that in Section 2.4. We set ε = 10⁻⁶, which provides
a moderate value for the ratio r(ε) in (2.19), σ ∈ {0.001, 0.125, 0.25}, and κ such
that E(Kn) ≈ 5 or 10. When the variance η² is included in the location points
of the ε-NGG process, then P0 is N4(b0, Σ0) × IG(ν0/2, ν0η0²/2); on the other
hand, when η² is given a parametric density, then η² ∼ IG(ν0/2, ν0η0²/2). We fixed the
hyperparameters in agreement with the least squares estimate: b0 = (−50, 5, 0, 0),
Σ0 = diag(100, 10, 10, 10), ν0 = 4, η0² = 1. For all the experiments, we computed
the posterior of the number of groups, the predictive densities at different values of
the covariate vectors and the cluster estimate via posterior maximization of Binder's
loss function (see Lau and Green, 2007a).
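A common way to obtain the cluster estimate under Binder's loss (with equal misclassification costs) is to compute the posterior similarity matrix from the MCMC partition draws and search over the sampled partitions for the minimizer of the posterior expected loss. The sketch below is one implementation of this generic recipe, in my own notation; it is not necessarily the exact procedure used in the thesis.

```python
import numpy as np

def binder_estimate(partitions):
    """Pick, among sampled partitions, the minimizer of the posterior
    expected Binder loss (equal misclassification costs).

    partitions : (S, n) integer array; row s holds cluster labels at iteration s.
    """
    S, n = partitions.shape
    # Posterior similarity matrix: pi[i, j] = P(items i, j co-clustered | data).
    pi = np.zeros((n, n))
    for row in partitions:
        pi += row[:, None] == row[None, :]
    pi /= S
    iu = np.triu_indices(n, k=1)
    best, best_loss = None, np.inf
    for row in partitions:
        co = (row[:, None] == row[None, :])[iu].astype(float)
        # Up to a constant, the loss is sum_{i<j} |1{c_i = c_j} - pi_ij|.
        loss = np.sum(np.abs(co - pi[iu]))
        if loss < best_loss:
            best, best_loss = row, loss
    return best
```

For example, with draws `[[0,0,1,1], [0,0,1,1], [0,0,1,1], [0,1,0,1]]` the estimate is `[0,0,1,1]`, the partition most compatible with the similarity matrix.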
Moreover, we compared the different prior settings by computing predictive goodness-of-fit
tools, specifically the log pseudo-marginal likelihood (LPML) and the sum of
squared errors (SSE), as introduced in Section 2.5.2. The minimum value of SSE
among our experiments was achieved when η² is included in the location of the
ε-NGG process, σ = 0.001 and κ = 0.8, so that E(Kn) ≈ 5. On the other hand,
the optimal LPML was achieved when σ = 0.125, κ = 0.4, and E(Kn) ≈ 5.

Figure 2.6: Posterior distributions of the number Kn of groups (left) and cluster estimate (right) under the linear dependent ε-NGG mixture.

The posterior of Kn and the cluster estimate under this last hyperparameter setting are shown in Figure 2.6
(left and right, respectively); in particular, the cluster estimate is displayed in
the scatterplot of Wt vs lbm. In spite of the vague prior, the posterior of Kn is
almost degenerate at 2, giving evidence for the existence of two linear relationships
between lbm and Wt.
Finally, Figure 2.7 displays predictive densities and 95% credibility bands for
3 athletes, a female (Wt=60, rcc=3.9, Ht=176 and lbm=53.71) and two males
(Wt=67.1, 113.7, rcc=5.34, 5.17, Ht=178.6, 209.4 and lbm=62, 97), respectively, under
the same hyperparameter setting as Figure 2.6; the dashed lines are the observed values
of the response. Depending on the value of the covariate, the distribution shows one
or two peaks: this reflects the dependence of the grouping of the data on the value of
x. This figure highlights the versatility of nonparametric priors in a linear regression
setting with respect to customary parametric priors: indeed, the model is able
to capture in detail the behavior of the data, even when several clusters are present.

Figure 2.7: Predictive distributions of lbm for three different athletes: Wt=60, rcc=3.9, Ht=176 (left); Wt=67.1, rcc=5.34, Ht=178.6 (center); Wt=113.7, rcc=5.17, Ht=209.4 (right). The shaded area is the predictive 95% pointwise credible interval, while the dashed vertical line denotes the observed value of the response.

2.7 Conclusion

We have proposed a new model for density and cluster estimation in the Bayesian
nonparametric framework. In particular, a finite-dimensional process, the ε-NormCRM,
has been defined, which converges in distribution to the corresponding normalized
completely random measure as ε tends to 0. Here, the ε-NormCRM is the mixing
measure in a mixture model. In this chapter we have
fixed ε very small, but we could choose a prior for ε and include this parameter in
the Gibbs sampler scheme. Among the achievements of this work, we have generalized
all the theoretical results obtained in the special case of the NGG process in Argiento
et al. (2016a), including the expression of the EPPF for an ε-NormCRM process,
its convergence to the corresponding EPPF of the underlying nonparametric process,
and the posterior characterization of Pε. Moreover, we have provided a general
Gibbs sampler scheme to sample from the posterior of the mixture model. To show
the performance of our algorithm and the flexibility of the model, we have illustrated
two examples via normalized completely random measure mixtures: in the first
application, we have introduced a new normalized completely random measure, named the
normalized Bessel random measure; we have studied its theoretical properties and
used it as the mixing measure in a model to fit simulated and real datasets. The
second example we have dealt with is a linear dependent ε-NGG mixture, where
the dependence lies in the support points of the mixing random probability, fitted
to a well-known dataset.
Appendix 2.A: Details on full-conditionals for the Gibbs sampler
Here, we provide some details about Step 3 of the Gibbs sampler in Section 2.4.
As far as Step 3a is concerned, the full-conditional L(ε|u, θ, Y) is obtained by
integrating out Nε (or equivalently Nna) from the law L(Nε, u, θ, Y), as follows:

$$ \begin{aligned} \mathcal{L}(\varepsilon \mid u, \boldsymbol\theta, \boldsymbol Y) &\propto \sum_{N_{na}=0}^{+\infty} \mathcal{L}(N_{na}, \varepsilon, u, \boldsymbol\theta, \boldsymbol Y) = \sum_{N_{na}=0}^{+\infty} \pi(\varepsilon)\, \mathrm{e}^{-\Lambda_\varepsilon}\, \frac{\Lambda_{\varepsilon,u}^{N_{na}}}{\Lambda_\varepsilon}\, \frac{N_{na}+k}{N_{na}!} \prod_{i=1}^{k} \int_\varepsilon^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds \\ &= \left( \prod_{i=1}^{k} \int_\varepsilon^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds \right) \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon}\, \frac{\Lambda_{\varepsilon,u} + k}{\Lambda_\varepsilon}\, \pi(\varepsilon) = f_\varepsilon(u; n_1, \ldots, n_k)\, \pi(\varepsilon), \end{aligned} $$

where we used the identity $\sum_{N_{na}=0}^{+\infty} \Lambda_{\varepsilon,u}^{N_{na}} (N_{na}+k)/N_{na}! = \mathrm{e}^{\Lambda_{\varepsilon,u}} (\Lambda_{\varepsilon,u}+k)$. Moreover,
fε(u; n1, . . . , nk) is defined in (2.32). This step depends explicitly on the expression of
ρ(s). Step 3b consists in sampling from L(Pε|ε, u, θ) as reported in Corollary 2.3.1. In
order to sample a draw from the posterior distribution of the (unnormalized) measure,
we follow Theorem 2.3.1. The component $\mu^{(a)}_{\varepsilon,u}$ is obtained by generating
independently from $\mathcal{L}(J_{l_i^*}) \propto J_{l_i^*}^{n_i}\, \mathrm{e}^{-u J_{l_i^*}}\, \rho(J_{l_i^*})\, \mathbf{1}_{(\varepsilon,\infty)}(J_{l_i^*})$, i = 1, . . . , k.
On the other hand, $\mu^{(na)}_{\varepsilon,u}$ satisfies the distributional identity described at point 1 of
the proposition, and therefore we simulate it as follows:

1. Draw x from the Bernoulli distribution with parameter p = Λε,u/(Λε,u + k).

2. Draw N⁽ⁿᵃ⁾ from Px(Λε,u), where Px(Λ) denotes the shifted Poisson distribution with support {x, x + 1, x + 2, . . .} and mean Λ + x.

3. If N⁽ⁿᵃ⁾ = 0, let $\mu^{(na)}_{\varepsilon,u}$ be the null measure. Otherwise, draw an iid sample (Jj, τj), j = 1, . . . , N⁽ⁿᵃ⁾, from ρε(s) ds P0(dτ), and set $\mu^{(na)}_{\varepsilon,u} = \sum_{j=1}^{N^{(na)}} J_j \delta_{\tau_j}$.
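The three simulation steps can be sketched as follows; `sample_jump` (a sampler from the normalized jump density implied by ρε) and `sample_atom` (a sampler from P0) are placeholders to be supplied for a concrete intensity ρ.

```python
import numpy as np

def sample_mu_na(lambda_eps_u, k, sample_jump, sample_atom, rng):
    """One draw of the non-allocated part mu^(na), following steps 1-3.

    Returns a list of (jump, atom) pairs; the empty list is the null measure.
    sample_jump / sample_atom are user-supplied samplers (placeholders here).
    """
    # 1. Bernoulli switch with success probability Lambda/(Lambda + k).
    x = rng.binomial(1, lambda_eps_u / (lambda_eps_u + k))
    # 2. Shifted Poisson count: support {x, x+1, ...}, mean Lambda + x.
    n_na = rng.poisson(lambda_eps_u) + x
    # 3. iid jump/atom pairs.
    return [(sample_jump(rng), sample_atom(rng)) for _ in range(n_na)]
```

The expected number of non-allocated jumps is then Λε,u + Λε,u/(Λε,u + k), which follows directly from steps 1 and 2.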
Appendix 2.B: Proofs of the theorems
Proof of Theorem 2.3.1
Conditionally on the unnormalized measure µε (see (2.6)), the law of θ is given by

$$ \mathbb{P}(\theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n \mid \mu_\varepsilon) = \frac{1}{T_\varepsilon^n} \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j}. $$

By considering the variable U in (2.8), we express the joint conditional distribution of θ and U as

$$ \mathbb{P}(\theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) = \frac{u^{n-1}}{\Gamma(n)}\, \mathrm{e}^{-T_\varepsilon u}\, du \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j}. \qquad (2.26) $$
The posterior distribution of µε can be characterized by its Laplace functional; we have

$$ \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)} \,\Big|\, \theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \right) = \frac{\mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)}\, \mathbb{P}(\theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) \right)}{\mathrm{E}\!\left( \mathbb{P}(\theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) \right)}. \qquad (2.27) $$

Let us focus on the numerator in (2.27); by (2.26) we obtain:

$$ \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)}\, \mathbb{P}(\theta_1 \in d\theta_1^*, \ldots, \theta_n \in d\theta_n, U \in du \mid \mu_\varepsilon) \right) = \frac{u^{n-1} du}{\Gamma(n)}\, \mathrm{E}\!\left( \mathrm{e}^{-J_0 (f(\tau_0) + u)}\, \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k} \left( \mu_\varepsilon(d\theta_j^*) + J_0 \delta_{\tau_0}(d\theta_j^*) \right)^{n_j} \right). \qquad (2.28) $$
Moreover, if P0 is an absolutely continuous probability, then, for each j = 1, . . . , k,

$$ \left( \mu_\varepsilon(d\theta_j^*) + J_0 \delta_{\tau_0}(d\theta_j^*) \right)^{n_j} = \mu_\varepsilon(d\theta_j^*)^{n_j} + J_0^{n_j} \delta_{\tau_0}(d\theta_j^*), $$

so that

$$ \prod_{j=1}^{k} \left( \mu_\varepsilon(d\theta_j^*)^{n_j} + J_0^{n_j} \delta_{\tau_0}(d\theta_j^*) \right) = \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j} + \sum_{l=1}^{k} \delta_{\tau_0}(d\theta_l^*)\, J_0^{n_l} \prod_{j \neq l} \mu_\varepsilon(d\theta_j^*)^{n_j}. $$

Therefore, the expected value on the right-hand side of (2.28) is:

$$ \mathrm{E}\!\left( \mathrm{e}^{-J_0(f(\tau_0)+u)} \right) \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j} \right) + \sum_{l=1}^{k} \mathrm{E}\!\left( \mathrm{e}^{-J_0(f(\tau_0)+u)} J_0^{n_l} \delta_{\tau_0}(d\theta_l^*) \right) \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j \neq l} \mu_\varepsilon(d\theta_j^*)^{n_j} \right). $$
The representation of a CRM via transformation of a Poisson process can be extended
to $\mu_\varepsilon(d\theta_j^*)^{n_j} = \int_{\mathbb{R}^+ \times \Theta} s^{n_j} \delta_\tau(d\theta_j^*)\, N(ds, d\tau)$, where N is a Poisson process with mean
intensity νε(ds, dτ). If we apply Palm's formula (see Daley and Vere-Jones, 2007,
Proposition 13.1.IV) to $\mu_\varepsilon(d\theta_k^*)^{n_k}$, we have that

$$ \begin{aligned} &\mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k} \mu_\varepsilon(d\theta_j^*)^{n_j} \right) \\ &\quad= \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k-1} \mu_\varepsilon(d\theta_j^*)^{n_j} \int_{\mathbb{R}^+ \times \Theta} s_k^{n_k} \delta_{\tau_k}(d\theta_k^*)\, N(ds_k, d\tau_k) \right) \\ &\quad= \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \prod_{j=1}^{k-1} \mu_\varepsilon(d\theta_j^*)^{n_j} \right) P_0(d\theta_k^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_k^*)+u)s_k}\, s_k^{n_k}\, \kappa\rho(s_k)\, ds_k \\ &\qquad \text{(iterating Palm's formula further } k-1 \text{ times)} \\ &\quad= \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta (f(\tau)+u)\,\mu_\varepsilon(d\tau)} \right) \prod_{j=1}^{k} \left( P_0(d\theta_j^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_j^*)+u)s_j}\, s_j^{n_j}\, \kappa\rho(s_j)\, ds_j \right) \\ &\quad= \exp\!\left\{ -\int_{\mathbb{R}^+ \times \Theta} \left( 1 - \mathrm{e}^{-s(f(\tau)+u)} \right) \nu_\varepsilon(ds, d\tau) \right\} \prod_{j=1}^{k} P_0(d\theta_j^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_j^*)+u)s_j}\, s_j^{n_j}\, \kappa\rho(s_j)\, ds_j. \end{aligned} $$
In other words, the numerator of (2.27) is equal to

$$ \frac{u^{n-1}}{\Gamma(n)}\, \frac{\int_{\mathbb{R}^+ \times \Theta} \mathrm{e}^{-s(f(\tau)+u)}\, \nu_\varepsilon(ds, d\tau) + k}{\Lambda_\varepsilon}\, \mathrm{e}^{-\int_{\mathbb{R}^+ \times \Theta} \left( 1 - \mathrm{e}^{-s(f(\tau)+u)} \right) \nu_\varepsilon(ds, d\tau)} \times \prod_{j=1}^{k} P_0(d\theta_j^*) \int_\varepsilon^{\infty} \mathrm{e}^{-(f(\theta_j^*)+u)s}\, s^{n_j}\, \kappa\rho(s)\, ds. \qquad (2.29) $$

Observe that, if we plug the function f ≡ 0 into (2.29), we obtain the denominator of the ratio (2.27), that is

$$ \mathbb{P}(d\theta_1, \ldots, d\theta_n, du) = \frac{u^{n-1}}{\Gamma(n)}\, \frac{\Lambda_{\varepsilon,u} + k}{\Lambda_\varepsilon}\, \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon} \prod_{j=1}^{k} P_0(d\theta_j^*)\, k_\varepsilon(u, n_j), \qquad (2.30) $$

where, for n > 0, $k_\varepsilon(u, n) = \int_\varepsilon^{\infty} \mathrm{e}^{-us} s^{n}\, \kappa\rho(s)\, ds = (-1)^n \frac{d^n}{du^n} \psi_\varepsilon(u)$, and $\psi_\varepsilon(u) := -\log\!\left( \mathrm{E}(\mathrm{e}^{-u T_\varepsilon}) \right) = \Lambda_\varepsilon - \Lambda_{\varepsilon,u}$.
We are ready to compute the posterior Laplace functional of µε: substituting
(2.29) and (2.30) into the numerator and denominator of (2.27), we have

$$ \begin{aligned} \mathrm{E}\!\left( \mathrm{e}^{-\int_\Theta f(\tau)\,\mu_\varepsilon(d\tau)} \,\Big|\, \theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du \right) = {}&\left\{ \frac{\int_{\mathbb{R}^+ \times \Theta} \mathrm{e}^{-s f(\tau)}\, \mathrm{e}^{-su}\, \nu_\varepsilon(ds, d\tau) + k}{\Lambda_{\varepsilon,u} + k}\, \mathrm{e}^{-\int_{\mathbb{R}^+ \times \Theta} \left( 1 - \mathrm{e}^{-s f(\tau)} \right) \mathrm{e}^{-su}\, \nu_\varepsilon(ds, d\tau)} \right\} \\ &\times \left\{ \prod_{j=1}^{k} \int_0^{\infty} \mathrm{e}^{-s f(\theta_j^*)}\, \frac{\mathrm{e}^{-su}\, s^{n_j}\, \kappa\rho(s)\, \mathbf{1}_{(\varepsilon,\infty)}(s)}{k_\varepsilon(u, n_j)}\, ds \right\}. \qquad (2.31) \end{aligned} $$
This expression shows that the posterior Laplace functional of µε, conditionally on
U ∈ du, factorizes into two terms. This proves the independence property in point 3.
We denote the unnormalized process of non-allocated jumps by $\mu^{(na)}_{u,\varepsilon}$. Its conditional
Laplace transform is given by the first factor (between braces) on the right-hand side
of (2.31). In order to obtain point 1 of the theorem, characterization (2.10) gives
that the law of $\mu^{(na)}_{u,\varepsilon}$ coincides with the law of a process µ* as given in (2.9), with
(exponentially tilted) Lévy intensity $\mathrm{e}^{-su}\nu_\varepsilon(ds, d\tau)$ and probability of success of the
Bernoulli mixing random variable $p = \Lambda_{\varepsilon,u}/(k + \Lambda_{\varepsilon,u})$. As far as point 2 is concerned, the
Laplace functional (2.31) gives that the process of the allocated jumps has fixed
atoms at the observed unique values θ₁*, . . . , θk*, i.e. it can be represented as

$$ \mu^{(a)}_\varepsilon(\cdot) = \sum_{j=1}^{k} J^{(a)}_j \delta_{\theta_j^*}(\cdot). $$

In this case, the weights of the allocated masses $J^{(a)}_j$ are independent and distributed according to

$$ \mathbb{P}\!\left( J^{(a)}_j \in ds \,\Big|\, \theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du \right) = \frac{\mathrm{e}^{-su}\, s^{n_j}\, \kappa\rho(s)\, \mathbf{1}_{(\varepsilon,\infty)}(s)}{k_\varepsilon(u, n_j)}\, ds, $$

for any j = 1, . . . , k. Finally, point 4 follows easily from (2.30).
Proof of Proposition 2.1
This proposition follows from (2.30). In fact, we first observe that
$\mathbb{P}(\theta_1 \in d\theta_1, \ldots, \theta_n \in d\theta_n, U \in du) = \mathbb{P}(p_n, \theta_1^* \in d\theta_1^*, \ldots, \theta_k^* \in d\theta_k^*, U \in du)$,
and then integrate out θ₁*, . . . , θk* and U from (2.30) to obtain (2.12).
Proof of Proposition 2.2
By Proposition 2.1, $p_\varepsilon(n_1, \ldots, n_k) = \int_0^{+\infty} f_\varepsilon(u; n_1, \ldots, n_k)\, du$, where

$$ f_\varepsilon(u; n_1, \ldots, n_k) = \frac{u^{n-1}}{\Gamma(n)}\, \frac{k + \Lambda_{\varepsilon,u}}{\Lambda_\varepsilon}\, \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon} \prod_{i=1}^{k} \int_\varepsilon^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds, \qquad (2.32) $$

with u > 0. On the other hand, the EPPF of a NormCRM(ρ, κP0) can be written
as $p_0(n_1, \ldots, n_k) = \int_0^{+\infty} f_0(u; n_1, \ldots, n_k)\, du$, where

$$ f_0(u; n_1, \ldots, n_k) = \frac{u^{n-1}}{\Gamma(n)} \exp\!\left\{ \kappa \int_0^{+\infty} (\mathrm{e}^{-us} - 1)\, \rho(s)\, ds \right\} \prod_{i=1}^{k} \int_0^{+\infty} \kappa s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds, $$

with u > 0. We first show that

$$ \lim_{\varepsilon \to 0} f_\varepsilon(u; n_1, \ldots, n_k) = f_0(u; n_1, \ldots, n_k) \quad \text{for any } u > 0. \qquad (2.33) $$

In particular, we have that

$$ \lim_{\varepsilon \to 0} \int_\varepsilon^{+\infty} s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds = \int_0^{+\infty} s^{n_i} \mathrm{e}^{-us} \rho(s)\, ds $$

and

$$ \lim_{\varepsilon \to 0} \mathrm{e}^{\Lambda_{\varepsilon,u} - \Lambda_\varepsilon} = \exp\!\left\{ \kappa \int_0^{+\infty} (\mathrm{e}^{-us} - 1)\, \rho(s)\, ds \right\}, $$

this limit being finite for any u > 0. Using standard integrability criteria, it is
straightforward to check that, for any u > 0, $\lim_{\varepsilon \to 0} \Lambda_{\varepsilon,u} = \lim_{\varepsilon \to 0} \Lambda_\varepsilon = +\infty$ and
they are equivalent infinities, i.e.

$$ \lim_{\varepsilon \to 0} \frac{k + \Lambda_{\varepsilon,u}}{\Lambda_\varepsilon} = \lim_{\varepsilon \to 0} \frac{\Lambda_{\varepsilon,u}}{\Lambda_\varepsilon} = 1. $$

We can therefore conclude that (2.33) holds true. The rest of the proof follows
as in the second part of the proof of Lemma 2 in Argiento et al. (2016a), where
we prove that (i) $\lim_{\varepsilon \to 0} \sum_{C \in \Pi_n} p_\varepsilon(n_1, \ldots, n_k) = 1$; (ii) $\liminf_{\varepsilon \to 0} p_\varepsilon(n_1, \ldots, n_k) = p_0(n_1, \ldots, n_k)$ for all C = (C1, . . . , Ck) ∈ Πn, the set of all partitions of {1, 2, . . . , n};
(iii) $\sum_{C \in \Pi_n} p_0(n_1, \ldots, n_k) = 1$. By Lemma 1 in Argiento et al. (2016a), equation
(2.14) follows.
Proof of formula 2.15
First of all, observe that

$$ \begin{aligned} \left( x_1 + \cdots + x_{N_\varepsilon^*} \right)^m &= \sum_{\substack{m_1 + \cdots + m_{N_\varepsilon^*} = m \\ m_1, \ldots, m_{N_\varepsilon^*} \geq 0}} \binom{m}{m_1, \ldots, m_{N_\varepsilon^*}} \prod_{j=1}^{N_\varepsilon^*} x_j^{m_j} \qquad (2.34) \\ &= \sum_{k=1}^{m} \mathbf{1}_{\{1,\ldots,N_\varepsilon^*\}}(k)\, \frac{1}{k!} \sum_{\substack{n_1 + \cdots + n_k = m \\ n_j = 1, 2, \ldots}} \binom{m}{n_1, \ldots, n_k} \sum_{j_1, \ldots, j_k} \prod_{i=1}^{k} x_{j_i}^{n_i}, \end{aligned} $$

where $N_\varepsilon^* = N_\varepsilon + 1$, $x_j^0 = 1$ for all $x_j \geq 0$, and the last summation is over all
positive integers, (2.34) being the multinomial theorem. The second equality follows
straightforwardly from different identifications of the set of all partitions of m (see
Pitman, 2006, Section 1.2). Therefore, for any B ∈ B(Θ), m = 1, 2, . . ., we have
(here, instead of P0 and τ0 as in (2.7), there are $P_{N_\varepsilon^*}$ and $\tau_{N_\varepsilon^*}$):
$$ \begin{aligned} \mathrm{E}(P_\varepsilon(B)^m) &= \mathrm{E}\!\left( \mathrm{E}\!\left( \Big( \sum_{j=1}^{N_\varepsilon^*} P_j \delta_{\tau_j}(B) \Big)^{\!m} \,\Big|\, N_\varepsilon \right) \right) \\ &= \mathrm{E}\!\left( \mathrm{E}\!\left( \sum_{\substack{m_1 + \cdots + m_{N_\varepsilon^*} = m \\ m_1, \ldots, m_{N_\varepsilon^*} \geq 0}} \binom{m}{m_1, \ldots, m_{N_\varepsilon^*}} \prod_{j=1}^{N_\varepsilon^*} (P_j \delta_{\tau_j}(B))^{m_j} \,\Big|\, N_\varepsilon \right) \right) \\ &= \mathrm{E}\!\left( \sum_{k=1}^{m} \mathbf{1}_{\{1,\ldots,N_\varepsilon^*\}}(k)\, \frac{1}{k!} \sum_{\substack{n_1 + \cdots + n_k = m \\ n_j = 1,2,\ldots}} \binom{m}{n_1, \ldots, n_k} \sum_{j_1, \ldots, j_k} \mathrm{E}\Big( \prod_{i=1}^{k} P_{j_i}^{n_i} \,\Big|\, N_\varepsilon \Big) \prod_{i=1}^{k} \mathrm{E}\big( \delta_{\tau_{j_i}}(B) \,\big|\, N_\varepsilon \big) \right) \\ &= \mathrm{E}\!\left( \sum_{k=1}^{m} \mathbf{1}_{\{1,\ldots,N_\varepsilon^*\}}(k)\, \frac{1}{k!} \sum_{\substack{n_1 + \cdots + n_k = m \\ n_j = 1,2,\ldots}} \binom{m}{n_1, \ldots, n_k} p_\varepsilon(n_1, \ldots, n_k)\, (P_0(B))^k \right). \end{aligned} $$

We identify this last expression as $\mathrm{E}\big( \sum_{k=1}^{m} P_0(B)^k\, \mathbb{P}(K_m = k \mid N_\varepsilon) \big)$, where Km is
the number of distinct values in a sample of size m from Pε. Hence, we have proved that

$$ \mathrm{E}(P_\varepsilon(B)^m) = \mathrm{E}\!\left( \mathrm{E}\big( P_0(B)^{K_m} \,\big|\, N_\varepsilon \big) \right) = \mathrm{E}\!\left( P_0(B)^{K_m} \right). $$
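The moment identity just proved can be checked numerically in a toy case of my own construction: a finite discrete measure with symmetric Dirichlet weights and iid uniform atoms, a stand-in rather than the ε-NormCRM itself. For m = 2 both sides reduce to pε(2)P0(B) + (1 − pε(2))P0(B)², and the tie probability E(Σⱼ Pⱼ²) = 2/(J + 1) is known in closed form for Dirichlet(1, . . . , 1) weights.

```python
import numpy as np

rng = np.random.default_rng(2)
J, S = 4, 200_000          # number of atoms, Monte Carlo sample size
B_prob = 0.5               # P0(B) for B = [0, 0.5] under P0 = Uniform(0, 1)

# Left side: Monte Carlo estimate of E(P(B)^2) for P = sum_j P_j delta_{tau_j},
# (P_1, ..., P_J) ~ Dirichlet(1, ..., 1), tau_j iid Uniform(0, 1).
W = rng.dirichlet(np.ones(J), size=S)
T = rng.uniform(size=(S, J)) < B_prob
lhs = np.mean((W * T).sum(axis=1) ** 2)

# Right side: E(P0(B)^{K_2}), with K_2 = 1 iff two draws from P pick the same
# atom, which happens with probability E(sum_j P_j^2) = 2/(J + 1) here.
p_tie = 2.0 / (J + 1)
rhs = p_tie * B_prob + (1 - p_tie) * B_prob ** 2
```

The two quantities agree up to Monte Carlo error (here both are close to 0.35).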
Proof of formula 2.17
Suppose that B1, B2 ∈ B(Θ) are disjoint. Then

$$ \begin{aligned} \mathrm{E}(P_\varepsilon(B_1) P_\varepsilon(B_2)) &= \mathrm{E}\!\left( \mathrm{E}\!\left( \Big( \sum_{j=1}^{N_\varepsilon^*} P_j \delta_{\tau_j}(B_1) \Big) \Big( \sum_{l=1}^{N_\varepsilon^*} P_l \delta_{\tau_l}(B_2) \Big) \,\Big|\, N_\varepsilon \right) \right) \\ &= \mathrm{E}\!\left( \sum_{\substack{l \neq j \\ j,l = 1, \ldots, N_\varepsilon^*}} \mathrm{E}(P_j P_l \mid N_\varepsilon)\, \mathrm{E}(\delta_{\tau_j}(B_1))\, \mathrm{E}(\delta_{\tau_l}(B_2)) \right) \\ &= \mathrm{E}\!\left( P_0(B_1) P_0(B_2) \sum_{\substack{l \neq j \\ j,l = 1, \ldots, N_\varepsilon^*}} \mathrm{E}(P_j P_l \mid N_\varepsilon) \right) = P_0(B_1) P_0(B_2)\, p_\varepsilon(1, 1). \end{aligned} $$
The general case, when B1 and B2 are not disjoint, follows easily:

$$ \begin{aligned} \mathrm{E}(P_\varepsilon(B_1) P_\varepsilon(B_2)) &= \mathrm{E}\!\left( (P_\varepsilon(B_1 \cap B_2))^2 \right) + \mathrm{E}\!\left( P_\varepsilon(B_1 \setminus B_2)\, P_\varepsilon(B_1 \cap B_2) \right) \\ &\quad+ \mathrm{E}\!\left( P_\varepsilon(B_2 \setminus B_1)\, P_\varepsilon(B_1 \cap B_2) \right) + \mathrm{E}\!\left( P_\varepsilon(B_1 \setminus B_2)\, P_\varepsilon(B_2 \setminus B_1) \right), \end{aligned} $$

where now the sets are disjoint. Applying the result above, we find that

$$ \mathrm{E}(P_\varepsilon(B_1) P_\varepsilon(B_2)) = p_\varepsilon(2)\, P_0(B_1 \cap B_2) + (1 - p_\varepsilon(2))\, P_0(B_1) P_0(B_2), $$

and consequently formula 2.17 holds true.
Proof of Proposition 2.4
The EPPF of the Dirichlet process first appeared in Antoniak (1974) (see Pitman, 1996);
anyhow, it is straightforward to derive it from (2.13):

$$ \begin{aligned} p_D(n_1, \ldots, n_k; \kappa) &= \int_0^{+\infty} \frac{u^{n-1}}{\Gamma(n)}\, \mathrm{e}^{-\kappa \log \frac{u+\omega}{\omega}} \prod_{j=1}^{k} \frac{\kappa\, \Gamma(n_j)}{(u+\omega)^{n_j}}\, du \\ &= \kappa^k \int_0^{+\infty} \frac{u^{n-1}}{\Gamma(n)} \left( \frac{\omega}{\omega+u} \right)^{\!\kappa} \frac{1}{(u+\omega)^n} \prod_{j=1}^{k} \Gamma(n_j)\, du = \frac{\Gamma(\kappa)}{\Gamma(\kappa+n)}\, \kappa^k \prod_{j=1}^{k} \Gamma(n_j), \end{aligned} $$

where the last equality follows from formula (3.194.3) in Gradshteyn and Ryzhik
(2007). By definition of the hypergeometric function, we have

$$ 1 \leq {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{(u+\omega)^2} \right) \leq {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{\omega^2} \right). $$

Moreover,

$$ \frac{\omega + \sqrt{\omega^2 - 1}}{(u+\omega) + \sqrt{(u+\omega)^2 - 1}} = \frac{\omega}{u+\omega}\, \frac{1 + \sqrt{1 - 1/\omega^2}}{1 + \sqrt{1 - 1/(u+\omega)^2}} $$

and

$$ \frac{1 + \sqrt{1 - 1/\omega^2}}{2} \leq \frac{1 + \sqrt{1 - 1/\omega^2}}{1 + \sqrt{1 - 1/(u+\omega)^2}} \leq 1, $$

so that

$$ \left( \frac{1 + \sqrt{1 - 1/\omega^2}}{2} \right)^{\!\kappa} p_D(n_1, \ldots, n_k; \kappa) \leq p_B(n_1, \ldots, n_k; \omega, \kappa) \leq \prod_{j=1}^{k} {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{\omega^2} \right) p_D(n_1, \ldots, n_k; \kappa). $$

The left-hand side of these inequalities obviously converges to $p_D(n_1, \ldots, n_k; \kappa)$ as
ω goes to +∞. On the other hand,

$$ {}_2F_1\!\left( \frac{n_j}{2}, \frac{n_j+1}{2}; 1; \frac{1}{\omega^2} \right) \to 1 \quad \text{as } \omega \to +\infty, $$

thanks to the uniform convergence of the hypergeometric series ${}_2F_1(\frac{n_j}{2}, \frac{n_j+1}{2}; 1; z)$
on a disk of radius smaller than 1. We conclude that, for any n1, . . . , nk such that
n1 + · · · + nk = n, k = 1, . . . , n, and any κ > 0,

$$ \lim_{\omega \to +\infty} p_B(n_1, \ldots, n_k; \omega, \kappa) = p_D(n_1, \ldots, n_k; \kappa). $$
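The first display of this proof can be verified numerically: the u-integral representation of the Dirichlet process EPPF should match the closed form κᵏ Γ(κ)Γ(n₁)···Γ(nₖ)/Γ(κ + n) for any ω. Below is a sketch of my own (plain trapezoidal quadrature, no external integrator), using the partition counts (3, 2) as an arbitrary example.

```python
import numpy as np
from math import exp, lgamma, log

def dp_eppf_closed(counts, kappa):
    """Closed form: kappa^k * Gamma(kappa) * prod_j Gamma(n_j) / Gamma(kappa + n)."""
    n, k = sum(counts), len(counts)
    return exp(k * log(kappa) + lgamma(kappa) - lgamma(kappa + n)
               + sum(lgamma(nj) for nj in counts))

def dp_eppf_integral(counts, kappa, omega, upper=500.0, m=500_000):
    """Trapezoidal evaluation of the u-integral representation of the DP EPPF."""
    n, k = sum(counts), len(counts)
    u = np.linspace(1e-9, upper, m)
    log_f = ((n - 1) * np.log(u) - lgamma(n)
             + kappa * (log(omega) - np.log(omega + u))
             - n * np.log(omega + u)
             + k * log(kappa) + sum(lgamma(nj) for nj in counts))
    f = np.exp(log_f)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(u)) / 2.0)
```

Note that the integral does not depend on ω, consistently with the closed form; for counts (3, 2) and κ = 2 both sides equal 1/90.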
Chapter 3
Covariate driven clustering: an application to blood donors data
Blood is an important resource in global healthcare, and an efficient blood
supply chain is therefore required. Predicting arrivals of blood donors is fundamental, since it allows
for better planning of donation sessions; with the goal of characterizing the behaviors of donors,
we analyze gap times between consecutive blood donations. In particular, we take into
account population heterogeneity via model-based clustering.

Defining the model boils down to assigning the prior for the random partition itself and
flexibly assigning the cluster-specific distribution, since, conditionally on the partition, data
are assumed independent and identically distributed within each cluster and independent
between different clusters.

In particular, we aim at taking into account possible patterns within the available covariates,
which can be either continuous or categorical; the additional covariate information should
drive the prior knowledge on the random partition by increasing the probability that two
donors with similar covariates belong to the same cluster. This is done through a covariate-dependent
nonparametric prior, thus departing from the standard exchangeability assumption.
We introduce a covariate-dependent product partition model by modifying the prior on the
partition prescribed by the class of normalized completely random measures. We include
in such a prior a term that takes into account the distance between covariates. After a
brief discussion of the model and a simple illustrative example on simulated data, we fit
our model to a large dataset provided by the Milan department of AVIS (Italian Volunteer
Blood-donors Association), the largest provider of blood donations in Italy.
3.1 Introduction
Section 1.2.4 of Chapter 1 presented the wide family of product partition mod-
els: in particular, they can be seen, under the assumption of exchangeability, as
an alternative parametrization of nonparametric mixture models with NormCRMs
as mixing measures. This is especially useful when the focus of the analysis is on
clustering, since in this case the prior on the random partition is made explicit.
However, exchangeability should not be assumed in presence of item-specic infor-
mation, which should be included in the prior for the partition. In presence of
covariates such as time, space or external measurements, the exchangeability as-
sumption is, indeed, unreasonable. We aim at assuming a model where two subjects
are more likely to co-cluster a-priori if their corresponding covariate values are sim-
ilar, i.e. they are close in time or space, or they have similar characteristics. Thus,
the goal of this chapter is to develop a model for the random partition depending on
covariates: in the Bayesian nonparametric literature, there are various approaches
that can be adopted, whose focus is the random measure or the random partition
(see the review paper by Foti and Williamson (2015)).
The first viewpoint adopted to include covariate dependence in random measures
can be found in MacEachern (1999), where the dependent Dirichlet process appeared
for the first time. The idea is to include dependence on covariates in the support or
in the jumps of the mixing measure G of a mixture model, as in (1.14). Recent papers
have investigated this approach more deeply: among others, we mention Chung and Dunson
(2009), Rodriguez and Dunson (2011) (probit stick-breaking with covariates), Ren
et al. (2011) (logistic stick-breaking) and Di Lucca et al. (2013) (time-dependent
DP). These works originate from the stick-breaking representation and modify the
way the weights of the measure are built. However, it is not clear how the covariates
affect the prior on the random partition, which is our main interest here.
Furthermore, other works are based on the augmentation of the space where
the Lévy measure is defined (see, for instance, Griffin and Leisen (2014), Foti and
Williamson (2015) and Ranganath and Blei (2017)).
However, the main application of this chapter focuses on clustering: for this
reason, we base our model on the work of Müller et al. (2011), who proposed
the PPMx model, a product partition model with covariate information. In that
work, as well as in its generalizations, the cohesion function is restricted to be the
one induced by the Dirichlet process, namely c(Aj) = κ(nj − 1)!. The desired
dependence on covariates is induced by an additional factor, the similarity function,
which depends on the covariates of the items in each cluster: this coefficient multiplies the
cohesion function as follows,

$$ p(\rho_n = \{A_1, \ldots, A_K\}) \propto \prod_{j=1}^{K} c(A_j)\, g(x_j^*), $$

where g(xj*) is a non-negative function that formalizes the similarity among the
covariates in the j-th cluster (recall that xj* denotes the collection of covariates
3.1. Introduction 59
corresponding to items belonging to cluster j). As a default choice, they propose to define the similarity g(·) as the marginal probability in an auxiliary probability model, even if the xi are not considered random. The use of a probability density in the construction of g(·) is convenient since it allows for easy computation: indeed, posterior inference is identical to the posterior inference we would obtain if the covariate vector xi were part of the random response vector Yi, i.e. their model can be rewritten using a DPM formulation on the response and the covariates jointly, so that efficient Gibbs sampling schemes are available. For more details, see Müller et al. (2011) and Müller and Quintana (2010). A similar approach can be found in Park and Dunson (2010), where the dependence on the predictors is included directly in the cohesion function, as

c(Aj, x∗j) = α(nj − 1)! ∫ ∏_{i∈Aj} f(xi|γ) dG0(γ).

Generalizations to achieve variable selection and to include spatial dependence can be found in Quintana et al. (2015) and Barcella et al. (2016), and in Page and Quintana (2015), respectively.
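As a quick numerical illustration of the construction above, the following sketch evaluates the unnormalized PPMx prior for a one-dimensional covariate, using the Dirichlet-process cohesion c(Aj) = κ(nj − 1)! and the default similarity taken as the marginal of an auxiliary conjugate Gaussian model. The values s2, tau2 and kappa are illustrative choices, not the ones used later in the thesis.

```python
import math

def log_gaussian_marginal(xs, s2=1.0, tau2=4.0):
    """log m(x) = log ∫ Π_i N(x_i; mu, s2) N(mu; 0, tau2) dmu (conjugate, closed form)."""
    n = len(xs)
    A = n / s2 + 1.0 / tau2                 # posterior precision of mu
    B = sum(xs) / s2
    return (-0.5 * n * math.log(2 * math.pi * s2)
            - 0.5 * math.log(2 * math.pi * tau2)
            + 0.5 * math.log(2 * math.pi / A)
            - sum(x * x for x in xs) / (2 * s2)
            + B * B / (2 * A))

def log_ppmx_prior(partition, x, kappa=1.0):
    """Unnormalized log prior: sum over clusters of log c(A_j) + log g(x*_j),
    with the DP cohesion c(A_j) = kappa * (n_j - 1)!."""
    total = 0.0
    for cluster in partition:
        total += math.log(kappa) + math.lgamma(len(cluster))   # log(kappa (nj-1)!)
        total += log_gaussian_marginal([x[i] for i in cluster])
    return total

x = [0.0, 0.1, 5.0]
p_close = log_ppmx_prior([[0, 1], [2]], x)   # groups the two similar items
p_far   = log_ppmx_prior([[0, 2], [1]], x)   # groups two distant items
```

For this toy covariate vector the prior assigns higher (unnormalized) mass to the partition that groups the two nearby items, which is exactly the behavior the similarity factor is designed to induce.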
However, there is no need to consider similarities that are marginal laws of an underlying model for the covariates: in general, g can be any non-negative function of some similarity measure such that the prior probability that two items belong to the same cluster increases as their similarity increases. As we will see later, the way we define g does not worsen the complexity of the algorithm for posterior inference. Indeed, we will be able to devise a general MCMC sampler to perform posterior analysis that does not depend on the specific choice of similarity. The full conditionals of the Gibbs sampler are relatively easy to implement thanks to the conjugate structure of the PPMx.
Other non-exchangeable random partition models have recently appeared in the literature. In a recent paper, Dahl et al. (2017) adopt an approach similar to ours; they define a distribution that depends on pairwise similarities, defined in terms of distances among subjects. In particular, their prior is defined sequentially, as a product of conditional probabilities, i.e. the probability of assigning a group to the next subject conditional on the previously allocated items. These probabilities are modified according to the attraction of the current item to the previously allocated items. This model shows valuable properties: the prior on the number of clusters and on their cardinalities is not affected by the covariates, while the prior nonetheless places more probability on partitions that group similar items. A related approach can be found in Dahl (2008) and Blei and Frazier (2011). In particular, in the latter paper the authors propose a variation of the Blackwell-MacQueen urn scheme where the subject assignments are draws from probabilities that depend on distance measurements.
The outline of the chapter is as follows: in Section 3.2 we propose a variation of the product partition model with covariate dependence and discuss the influence of various choices of similarity function. In Section 3.3 we apply our model to a simple simulated dataset. Then, we present the main application that motivated this work: clustering the behavior of blood donors and predicting the time of the next donation. We consider a large dataset provided by the Milan department of AVIS (Italian Volunteer Blood-donors Association), the largest provider of blood donations in Italy (Section 3.4). We conclude with a brief discussion of the achievements and possible future developments. Details about the MCMC algorithms for posterior inference are described in the Appendices.
3.2 A covariate driven model for clustering
Our aim is to estimate the clusters in the data as well as their density. As described earlier, the main novelty here is the elicitation of a prior for the random partition that, on the one hand, exhibits the positive aspects of NormCRM processes, and, on the other hand, is driven by covariate information by means of a similarity function that depends on the distance among subject-specific covariates. We start from parametric densities f(·; θ) and specify a hierarchical model that achieves the goals previously described. In particular, we assume that the data are independent across groups, conditionally on the covariates and the cluster-specific parameters; the latter are i.i.d. from a base distribution P0. The prior on the partition depends on the covariates and can be represented as a mixture of product partition models. Concretely, we propose:
Y1, . . . , Yn | x1, . . . , xn, θ∗1, . . . , θ∗K, ρn ∼ ∏_{j=1}^K f(y∗j | x∗j, θ∗j)   (3.1)

θ∗1, . . . , θ∗K | ρn iid∼ P0

p(ρn = {A1, . . . , AK} | x1, . . . , xn) ∝ ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du   (3.2)

where the notation q∗j stands for the collection of values of a quantity q for all items belonging to cluster Aj, and nj for the cardinality of the j-th group. Note that f(y∗j | x∗j, θ∗j) is a generic regression model, such as f(y | x, θ) = N(y; xᵀβ, σ²) for the linear regression model, or logit(P(y = 1)) = xᵀβ for logistic regression. In the former case θ = (β, σ), while in the latter θ = β.
It is worth noticing that the likelihood specification in (3.1) may be any model: in Section 3.4 we will deal with recurrent events, so that more complex regression models for gap times will be needed. Moreover, the prior p(ρn | x1, . . . , xn) in (3.2) can be equivalently written as

p(ρn | x1, . . . , xn, u) ∝ ∏_{j=1}^K c(u, nj) g(x∗j)

where we implicitly assume that g(∅) = 1 and the prior on the auxiliary variable u is as
follows:

p(u) = (u^{n−1} / Γ(n)) (−1)^n (d^n/du^n) e^{−Ψ(u)}, u > 0,

where Ψ(u) is the Laplace exponent of the Lévy intensity ρ(s) of the NormCRM. In this case, the marginal prior for ρn, given the covariates, is

p(ρn | x1, . . . , xn) = (1 / H(g(x))) ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du,

where H(g(x)) is the intractable normalizing constant of the law of the random partition ρn, defined as

H(g(x)) = ∑_{ρn∈Pn} ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du.   (3.3)
The joint law of ({θ∗j}_{j=1}^K, ρn, {yi}_{i=1}^n) is equal to

∏_{j=1}^K f(y∗j | x∗j, θ∗j) P0(θ∗j) × (1 / H(g(x))) ∫_0^{+∞} D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j) du.
The intractability of the normalizing constant H(g(x)) prevents us from treating any parameter of the similarity function g(·) or of the NGG process as a random variable (for instance, to perform variable selection in the similarity function g). The issue has also been raised in Dahl et al. (2017) when comparing their model to the product partition model with covariates of Müller et al. (2011). Finally, the joint law of the data and all parameters, including the auxiliary variable u, is

L({yi}_{i=1}^n, ρn, θ∗1, . . . , θ∗K, u | {xi}_{i=1}^n) = ∏_{j=1}^K f(y∗j | x∗j, θ∗j) P0(θ∗j) × (1 / H(g(x))) D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j)   (3.4)

and

L(ρn, u | {xi}_{i=1}^n) = (1 / H(g(x))) D(u, n) ∏_{j=1}^K c(u, nj) g(x∗j)

is the joint law of the random partition and the auxiliary variable u.
Even if the proposed approach is in fact quite general with respect to the choice of the NormCRM, and thus to the form of the cohesion function, in the following we focus on the specific case of the normalized generalized gamma process, denoted by NGG(κ, σ, P0). The main reason is that it induces a prior on the number of groups which is more dispersed than that induced by the Dirichlet process. Its Lévy intensity is

ρ(ds) = κ s^{−1−σ} e^{−s} ds,
where κ is a mass parameter and σ a discount parameter. The cohesion function becomes c(Aj) = (1 − σ)_{nj−1}, where (α)_n is the Pochhammer symbol, or rising factorial (note that it does not depend on the auxiliary variable u). It is clear that the parameter σ has a strong influence on the clustering behavior. In particular, the discount parameter affects the variance: the larger it is, the more dispersed is the distribution of the number of clusters. This feature mitigates the annoying rich-gets-richer effect, typical of the Dirichlet process and discussed in Section 1.2.1, leading to more homogeneous clusters. For more details on the behavior of σ, see for instance Argiento et al. (2016a), Lijoi et al. (2007) and Argiento et al. (2010). The prior on u becomes, in this case,

p(u) ∝ u^{n−1} (u + 1)^{σK−n} e^{−(κ/σ)((u+1)^σ−1)}, u > 0.
Appendix 3.A reports details about the Gibbs sampler for posterior inference for
the model described in this section.
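To get a feel for this auxiliary-variable density, the sketch below evaluates the unnormalized log density written above and integrates it numerically on a crude grid. The parameter values (κ = 0.3, σ = 0.2, the ones used in Section 3.3, and n = 200, K = 3) are illustrative; the grid bounds are arbitrary choices for the sketch.

```python
import math

def log_p_u_unnorm(u, n, K, kappa=0.3, sigma=0.2):
    """Unnormalized log density of the NGG auxiliary variable:
    p(u) ∝ u^(n-1) (u+1)^(sigma*K - n) exp(-(kappa/sigma)((u+1)^sigma - 1))."""
    if u <= 0.0:
        return float("-inf")
    return ((n - 1) * math.log(u)
            + (sigma * K - n) * math.log(u + 1.0)
            - (kappa / sigma) * ((u + 1.0) ** sigma - 1.0))

def normalizing_constant(n, K, grid_max=2000.0, m=20000):
    """Crude midpoint-rule integration of the unnormalized density on (0, grid_max)."""
    h = grid_max / m
    return h * sum(math.exp(log_p_u_unnorm((i + 0.5) * h, n, K)) for i in range(m))

Z = normalizing_constant(n=200, K=3)
```

The density vanishes near the origin (the factor u^{n−1} dominates there) and has its mass at moderately large u, where the exponential factor eventually takes over.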
3.2.1 The choice of the similarity function
It is quite natural to let the similarity be a non-increasing function of the distance among the covariates in the cluster, namely of

D_{Aj} = ∑_{i∈Aj} d(xi, c_{Aj})   (3.5)

where c_{Aj} is the centroid of the set of covariates in cluster j and d is a suitable distance function, discussed later. Moreover, we assume the similarity takes value 1 if the cardinality of the set Aj is 1, i.e. |Aj| = 1.
Analytical results about specific quantities of interest, such as the probability of having a specific number k of groups or the probability of observing two items in the same cluster, are not easy to compute in closed form. A simple calculation, which can be useful to intuitively understand the behavior of our prior, gives the probability of observing one cluster in the case n = 2:

p({1, 2}; κ, u, σ) = c(u, 2) g(x1, x2) / (c(u, 2) g(x1, x2) + c(u, 1)²) = (1 − σ) g(x1, x2) / ((1 − σ) g(x1, x2) + κ(1 + u)^σ),

which tends to 1 when g(x1, x2) goes to +∞ (i.e., the distance goes to 0) and tends to 0 when g(x1, x2) goes to 0, i.e. when the covariates are far apart.
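The limiting behavior of this n = 2 probability is easy to check numerically; the sketch below implements the formula above, with κ = 0.3, σ = 0.2 and u = 1 as illustrative values.

```python
def prob_one_cluster(g12, kappa=0.3, sigma=0.2, u=1.0):
    """P(the two items form a single cluster), from the n = 2 formula:
    (1 - sigma) g / ((1 - sigma) g + kappa (1 + u)^sigma)."""
    num = (1.0 - sigma) * g12
    return num / (num + kappa * (1.0 + u) ** sigma)
```

The probability is increasing in the similarity g(x1, x2), approaching 1 for highly similar items and 0 for very dissimilar ones.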
However, choosing the similarity function is not an easy task, since it heavily affects the results, as we will see later. For this reason, we propose a list of similarity functions that proved to work reasonably well in practice:

1. gA(x∗j; λ) = e^{−t^α}, for α > 0 (α = 0.5, 1, 2), with t = λ D_{Aj};
Figure 3.1: Proposed similarity functions gA, gB and gC (similarity as a function of the distance).
2. gB(x∗j; λ) = 1/t^α, for α > 0 (α = 0.5, 1, 2), with t = λ D_{Aj};

3. gC(x∗j; λ) = e^{−t log t} if t ≥ 1/e, and e^{1+1/e} t if t < 1/e, with t = λ D_{Aj} (the two branches match at t = 1/e).
The three cases are displayed in Figure 3.1.
Obviously, a deeper understanding of the theoretical properties implied by the choice of the similarity function is needed, and this is part of future work. However, the three cases above gave us quite satisfactory results, at least numerically, as we will show in Section 3.3.
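The three proposed similarity functions can be sketched as follows; the branch of gC for t < 1/e is written here as reconstructed from the continuity requirement at t = 1/e, so it should be read as an illustration rather than a definitive transcription.

```python
import math

def g_A(t, alpha=1.0):
    """gA: exponential decay in t = lambda * D_Aj."""
    return math.exp(-t ** alpha)

def g_B(t, alpha=1.0):
    """gB: power-law decay; blows up as t -> 0."""
    return 1.0 / t ** alpha

def g_C(t):
    """gC: e^(-t log t) for t >= 1/e, linear below; both branches
    equal e^(1/e) at t = 1/e, so the function is continuous."""
    if t >= 1.0 / math.e:
        return math.exp(-t * math.log(t))
    return math.exp(1.0 + 1.0 / math.e) * t
```

Note that gA and gC stay bounded as the distance goes to 0, while gB diverges, which is visible in Figure 3.1.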
In what follows, we provide guidelines for the elicitation of the parameter λ driving the influence of the covariates in the prior of the random partition. This parameter is analogous to the temperature parameter defined in Dahl et al. (2017) but, unlike in their work, the presence of the intractable normalizing constant (3.3) prevents us from assigning a prior to λ. Therefore, some empirical rules to set this parameter are needed. Varying this parameter has the effect of rescaling the range of values at which we evaluate the similarity function, since we move along different parts of the horizontal axis in Figure 3.1. For this reason, in order to select a value for λ, before running the model we should display the histogram of all λ d(xi, xj), i = 1, . . . , n, j = 1, . . . , n, j ≠ i, for some possible values of λ and adjust the range of values the similarity can take (more or less variability, for example). For instance, suppose we adopt the function gA (the blue line in Fig. 3.1): if we choose a very small λ, we concentrate the values of λD around the origin, and hence we obtain similar values of gA(·): in this case, the effect of the covariate information on the prior of ρn will be very mild, since the range of values the similarity can assume is very limited. A similar argument holds for large values of λ. In conclusion, we calibrate λ such that gA is evaluated in a range of, say, (0, 3), for this particular choice of similarity.
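One simple way to operationalize this guideline is sketched below: rescale so that the bulk of the values λ d(xi, xj) falls in the target range (0, 3). Anchoring the largest pairwise distance at 3 is one possible heuristic among many; the distances are toy values.

```python
def calibrate_lambda(distances, target_upper=3.0):
    """Heuristic: choose lambda so the largest pairwise distance lands at
    target_upper, keeping lambda*d inside the range where the similarity
    (e.g. g_A) still varies appreciably."""
    return target_upper / max(distances)

dists = [0.5, 1.0, 1.5, 2.0, 4.0, 8.0, 10.0]   # toy pairwise distances d(x_i, x_j)
lam = calibrate_lambda(dists)
scaled = [lam * d for d in dists]               # values at which g is evaluated
```

In practice one would inspect the histogram of the scaled distances, as suggested above, rather than rely on a single summary.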
In order to define the distance d appearing in (3.5), we consider covariates that are continuous or binary (categorical or ordinal covariates are transformed into dummies, i.e. vectors of binary covariates). If x1 and x2 are vectors of covariates, xj = (xcj, xbj), where xcj is the subvector of all the continuous covariates and xbj is the subvector of all the binary covariates, we define

d(x1, x2) = dc(xc1, xc2) + db(xb1, xb2),

where dc is the Mahalanobis distance between vectors, i.e. the Euclidean distance between standardized vectors of covariates, and db is the Hamming distance between vectors of binary covariates (see Zhang et al., 2006). Instead of the Mahalanobis distance, we could consider the l1-norm, lp-norm or sup-norm distances. As far as sensitivity with respect to the distance d is concerned, we noticed that the choice of the distance only moderately affects the results: in our numerical experiments, we tested different choices of distance (Euclidean vs. Mahalanobis, Hamming vs. a standardized version of it, . . . ). However, a deeper understanding of how to calibrate the distance function is part of future work.
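A minimal sketch of this mixed distance is given below, assuming coordinate-wise standardization (i.e. a diagonal covariance in the Mahalanobis term); the means and standard deviations would be computed from the full sample.

```python
import math

def mixed_distance(x1c, x2c, x1b, x2b, means, sds):
    """d(x1, x2) = Euclidean distance between the standardized continuous parts
    + Hamming distance between the binary parts."""
    dc = math.sqrt(sum(((a - m) / s - (b - m) / s) ** 2
                       for a, b, m, s in zip(x1c, x2c, means, sds)))
    db = sum(a != b for a, b in zip(x1b, x2b))   # number of disagreeing binaries
    return dc + db
```

Note the two terms are simply summed, so their relative scales matter; this is one reason the calibration of λ and of the distance is flagged above as future work.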
Other modeling choices we made are the subject of current research: for instance, note that in formula (3.5) we decided not to normalize the distance with respect to the number of elements. This is because we do not want to introduce undesirable smoothing effects. In fact, dividing the quantity by the number of elements in the group produces a smoothing effect that lowers the influence of the covariates.
Lastly, we note that the similarity function gC mimics the asymptotic behavior of the cohesion function induced by the NGG: the aim is to balance the effect of the similarity function against the specific form of the cohesion of the NGG process.
3.3 Simulated data
The performance of the model described in Section 3.2 is illustrated through a simple example on a simulated dataset.
In particular, we simulated a dataset of points (yi, xi1, . . . , xip) for i = 1, . . . , n, with n = 200 and p = 4. The last two covariates are binary and were generated from Bernoulli distributions, while the first two were generated from Gaussian densities. The responses yi were generated from a linear regression model with linear predictor xiᵀβ0, where β0 := (β00, β01, β02, β03, β04), and variance σe² = 0.5. We generated three different groups by simulating both covariates and responses from distributions with different parameters:
Figure 3.2: (Top) Simulated data: histogram of the responses. (Bottom) Scatterplot of the simulated dataset; different colors represent the three groups the data have been generated from.
Group 1: 75 covariate vectors and responses were independently generated as follows:

(xi1, xi2) ∼ N2(µ1, σ0² I2), µ1 = (−3, 3), σ0² = 0.5
xi3, xi4 iid∼ Bern(0.1)
Yi ∼ N(xiᵀβ0, σe²), β0 = (1, 5, 2, 1, 0), σe² = 0.5

Group 2: 75 covariate vectors and responses were independently generated as follows:

(xi1, xi2) ∼ N2(µ2, σ0² I2), µ2 = (0, 0), σ0² = 0.5
xi3, xi4 iid∼ Bern(0.5)
Yi ∼ N(xiᵀβ0, σe²), β0 = (4, 2, −2, 1, −1), σe² = 0.5

Group 3: 50 covariate vectors and responses were independently generated as follows:

(xi1, xi2) ∼ N2(µ3, σ0² I2), µ3 = (3, 3), σ0² = 0.5
xi3, xi4 iid∼ Bern(0.9)
Yi ∼ N(xiᵀβ0, σe²), β0 = (−1, −5, −2, −1, 1), σe² = 0.5
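The simulation scheme above can be reproduced with a short script; the random seed is an arbitrary choice, and β0[0] plays the role of the intercept.

```python
import random

random.seed(42)

def simulate_group(n, mu, bern_p, beta, sd0=0.5 ** 0.5, sde=0.5 ** 0.5):
    """Draw n (y, x) pairs for one group: two Gaussian covariates around mu,
    two Bernoulli(bern_p) covariates, and a Gaussian linear-model response."""
    out = []
    for _ in range(n):
        x1 = random.gauss(mu[0], sd0)
        x2 = random.gauss(mu[1], sd0)
        x3 = 1.0 if random.random() < bern_p else 0.0
        x4 = 1.0 if random.random() < bern_p else 0.0
        xs = [1.0, x1, x2, x3, x4]                 # leading 1 for the intercept
        y = sum(b * x for b, x in zip(beta, xs)) + random.gauss(0.0, sde)
        out.append((y, x1, x2, x3, x4))
    return out

data = (simulate_group(75, (-3, 3), 0.1, [1, 5, 2, 1, 0])
        + simulate_group(75, (0, 0), 0.5, [4, 2, -2, 1, -1])
        + simulate_group(50, (3, 3), 0.9, [-1, -5, -2, -1, 1]))
```

Since σ0² = 0.5, the Gaussian covariates are drawn with standard deviation √0.5, and likewise for the response noise.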
Figure 3.2 shows the histogram of the responses yi, i = 1, . . . , n: the three groups are clearly visible. The bottom panel displays the scatterplot of covariates and responses. As far as the prior is concerned, we included the whole vector xi in the similarity and assumed the cohesion function of the NGG process with κ = 0.3, σ = 0.2, such that E(Kn) = 5.9. Moreover, the base measure P0 is Np(0, σ²/κ0 Ip×p) × IG(σ²; a, b) with κ0 = 0.01 and (a, b) = (2, 1). Here, λ = 1.
We ran the algorithm described in Appendix 3.A to obtain 5,000 final iterations, after a burn-in of 2,000 and a thinning of 10 iterations. A posteriori, we classified all datapoints according to the optimal partition under the different similarity functions, obtaining the following misclassification rates:

similarity function: gA, gC, g ≡ 1
misclassification rate: 0%, 4%, 16%
where g ≡ 1 stands for the model without covariate dependence in the prior for the partition. By optimal partition we mean the realization, in the MCMC chain, of the random partition ρn which minimizes the posterior expected value of Binder's loss function with equal misclassification weights (Lau and Green, 2007a). In this work we employed this specific loss function; however, the issue of finding an appropriate point estimate of the clustering structure of the data based on the posterior is still a subject of research. Various loss functions have been proposed, satisfying principles such as invariance with respect to permutations of indices and labels. Alternative loss functions may be found in Fritsch and Ickstadt (2009) and Meilă (2007). The latter introduced the Variation of Information criterion, which quantifies the information shared between different partitions. Moreover, the loss function defined in Fritsch and Ickstadt (2009) is inspired by the Rand index.
Recently, Wade and Ghahramani (2017) proposed a new approach that also quantifies uncertainty: a credible region of level (1 − α) is defined as the smallest ball around the point estimate with posterior probability greater than (1 − α). The metric on the space of partitions is induced by the Binder or Variation of Information loss functions.
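The Binder-loss point estimate can be sketched in a few lines: estimate the pairwise co-clustering probabilities from the MCMC samples, then pick the sampled partition minimizing the posterior expected loss (equal weights, up to a constant). The toy samples below are made up for illustration.

```python
from itertools import combinations

def coclustering_probs(samples):
    """pi[i][j] = fraction of MCMC partition samples in which items i and j
    share a cluster; each sample is a list of cluster labels, one per item."""
    n = len(samples[0])
    pi = [[0.0] * n for _ in range(n)]
    for s in samples:
        for i, j in combinations(range(n), 2):
            if s[i] == s[j]:
                pi[i][j] += 1.0
    for i, j in combinations(range(n), 2):
        pi[i][j] /= len(samples)
    return pi

def binder_loss(labels, pi):
    """Posterior expected Binder loss with equal misclassification weights,
    up to a constant: sum over pairs of |1{same cluster} - pi_ij|."""
    return sum(abs(float(labels[i] == labels[j]) - pi[i][j])
               for i, j in combinations(range(len(labels)), 2))

samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
pi = coclustering_probs(samples)
best = min(samples, key=lambda s: binder_loss(s, pi))
```

Restricting the search to the sampled partitions, as done here, is the simple strategy used for the optimal partition above; search over a wider space is also possible.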
Figure 3.3: Posterior distribution of Kn under gA (left), gC (center) and g ≡ 1 (right).
We computed the posterior distribution of Kn, the number of clusters, in the three cases: see Figure 3.3. Figure 3.4 displays the predictive distribution corresponding to the covariates x1 of the first subject. The green vertical line corresponds to the actual observation y1. It is clear that in the last case, i.e. when we do not include covariate information in the prior for the random partition, the predictive law is not able to distinguish which of the three groups the item belongs to (thus, the law shows three peaks). In cases A and C the predictive law exhibits only one main peak: the covariate information helps, in this case, in selecting the right group for the observation. This is also confirmed by the misclassification table above.
Figure 3.4: Predictive distribution of Y1 under gA (left), gC (center) and g ≡ 1 (right); vertical lines denote the true value.
Figure 3.5 reports the cluster estimate under gA (no misclassification error). Compare it with Figure 3.6, where we display the cluster estimate under the PPM model without covariates in the prior, i.e. g ≡ 1.
Figure 3.5: Cluster estimate under gA.
This very simple illustrative example shows the good performance of our model; it is worth including covariate information when eliciting the prior for the random partition, both in terms of clustering and of predictive accuracy. Moreover, the covariates that enter the similarity function need not be the same as those in the regression part of the likelihood: one could, for instance, use some of them to drive the prior on the partition and the others in the regression component. The next section describes the analysis of a real dataset about blood donations.
3.4 The AVIS data on blood donations
The Associazione Volontari Italiani del Sangue (AVIS, Association of Voluntary Italian Blood Donors) is the major Italian non-profit and charitable organization for blood donation, bringing together over a million volunteer blood donors across Italy. The main aim of the association is to foster voluntary, recurring, anonymous and non-remunerated blood donation at the community level. Predicting the arrivals of blood donors is fundamental, since it allows for better planning of donation sessions. In the next sections we propose a model that allows subject-specific prediction of the next donation and, at the same time, computes cluster
Figure 3.6: Cluster estimate under the PPM model, i.e. g ≡ 1.
estimates in order to gain deeper insight into the population of donors. A general framework for the analysis of gap times is briefly described in the following section. For a thorough review of the models available in the literature, see Cook and Lawless (2007).
3.4.1 A framework for recurrent events
Let us consider a single recurrent event process, where the events occur in continuous time, starting without loss of generality at t = 0. Let the sequence T1, T2, . . . , with 0 ≤ T1 ≤ T2 ≤ . . . , denote the event times, where Tk is the time of the k-th event. There are two main ways of describing and modeling this process: through event counts over a certain time interval, or through gap times between successive events; the choice of framework is usually driven by the objective of the analysis. The former approach is often useful when individuals frequently experience the events of interest; the latter is instead used when events are relatively infrequent and prediction of the time to the next event is of interest, which is the case in this application. An important setting in which models based on gap times between successive events are particularly attractive is the one where a subject i is restored to a similar state after each event, in the same way as a system
returns to a new state after a repair. These are known as renewal processes, in which the gaps Wj = Tj − Tj−1 (j = 1, 2, . . . ) are conditionally independent and identically distributed. Assume now that we have several subjects in a sample, each with its own recurrent event process. Moreover, let us assume that individual i is observed over the time interval [0, τi]. If ni events are observed at times 0 < ti1 < ti2 < · · · < tini ≤ τi, let wij = tij − ti,j−1 (j = 1, . . . , ni) and wi,ni+1 = τi − tini, where ti0 = 0. These are the observed gap times for individual i, the last of which is possibly right censored. Let f(·|x) and S(·|x) denote the density and survival functions of a gap time given covariates x. In terms of these functions, the likelihood for N subjects can be written as

L(wij, j = 1, . . . , ni + 1, i = 1, . . . , N | xij) = ∏_{i=1}^N [∏_{j=1}^{ni} fij(wij | xij)] Si(wi,ni+1 | xi,ni+1)   (3.6)

If wi,ni+1 = 0, the observation terminates after the ni-th event, hence the final time is not censored and the term involving the survival function in (3.6) disappears.
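The per-subject contribution to the likelihood (3.6) can be sketched as follows; the unit-rate exponential gap model used in the illustration is an arbitrary choice, not the one adopted later for the AVIS data.

```python
import math

def log_gap_likelihood(gaps, log_density, log_survival):
    """Log contribution of one subject to (3.6): observed gaps enter through the
    density, the final (censored) gap through the survival function; a zero
    final gap means no censoring and the survival term is dropped."""
    *observed, last = gaps
    ll = sum(log_density(w) for w in observed)
    if last > 0.0:
        ll += log_survival(last)
    return ll

# illustrative unit-rate exponential gap model: f(w) = e^{-w}, S(w) = e^{-w}
log_f = lambda w: -w
log_S = lambda w: -w
```

Summing this quantity over the N subjects yields the full log-likelihood corresponding to (3.6).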
Observe that the likelihood in (3.6) is the one used by standard survival analysis methods to model a sample of failure times in which the last observation of each subject is censored. An important inferential model of this kind is the AFT (accelerated failure time) model, in which the response time Tij is such that Yij = log(Tij) has a distribution of the form

Yij = ui + βjᵀ xij + σi εij

where xij is the vector of covariates, fixed within gap times but possibly varying across gaps, the βj are gap-dependent regression coefficients, and the εij are i.i.d. random variables whose distribution does not depend on the covariates. Moreover, ui and σi are a subject-specific random effect and scale parameter, respectively. In order to choose an appropriate likelihood, namely a density (and, consequently, a survival function) for our observations, we first need a brief explorative study of our data.
3.4.2 Data pre-processing and choice of the covariates
We confine our interest to whole blood donations performed between January 1st, 2010 and May 15th, 2016 in the main building of AVIS; donations in the mobile collection centres or within hospitals (which represent a small fraction) are neglected. From the perspective of treating donations as recurrent events, the date May 15th, 2016 is the censoring time of the last observation for almost all donors, except those having their last donation exactly on that date. Before tackling the inferential task that is our main goal, the initial raw dataset containing 18,305 total events was subjected to a filtering and cleaning process. In particular, we removed the donors whose number of donations was greater than 15 (these are very few and would increase the variance of the estimates) and those marked by a definitive suspension, i.e. declared no longer eligible to donate. In the end, we obtained a clean dataset containing 17,198 donations, made by 3,333 donors.
Consistently with Section 3.4.1, we denote by Tij the time (in days) passed from donation j − 1 to donation j for donor i: following the approach of accelerated failure time models, the response variable is the logarithm of the gap time, Yij = log(Tij). However, the modeling scheme presented in Section 3.4.1 is enriched by including a random partition model for clustering donors' trajectories via the mixture model discussed in Section 3.2. Before describing the model we are going to fit to our data, a brief descriptive analysis is due.
Figure 3.7: Histogram of the logarithm of the observed gap times, divided according to gender: men (left) and women (right).
An important premise to highlight is that, according to Italian law, the maximum number of whole blood donations is 4 per year for men and 2 for women, with a minimum of 90 days between consecutive donations; this fact causes the lack of a left tail in the histograms of the gap times (in log scale) for men and women displayed in Figure 3.7. Note that the minimum for men is about 4.5, and that exp(4.5) ≈ 90 days, the minimum time between two donations required by law. For women, the distribution has a mode at approximately 5.3 on the log scale, i.e. about 200 days, corresponding to roughly six and a half months. This means that most donors decide to donate as soon as they are allowed to. That said, one may wonder why there are observed gaps shorter than the imposed minimum. This may happen in situations in which the doctor, given good donor health conditions, requires an anticipated donation; it may also happen that, when planning a vacation, a donor decides to donate earlier rather than skip the donation. The strong asymmetry in the distribution of the logarithm of the gap times motivated the choice of a skew normal distribution for the likelihood; see Section 3.4.3.
In Figure 3.8, the mean and the median of the gap times Tij over i, for each value of j, are reported. Both quantities are higher for small j and then decrease as j increases; a possible explanation is that the more a donor proceeds with the donations, the more loyal to the activity he becomes, developing a kind of regularity in donating.

Figure 3.8: Mean and median of the gap times (in days) Tij, i = 1, . . . , 3333, for each j ∈ {1, . . . , 15}.

Moreover, the number of observations for each j
from 1 to 15 is clearly decreasing: 765, 604, 415, 351, 312, 217, 174, 119, 84, 76, 75, 43, 39, 38, 21. We notice a change at time t∗ = 8: indeed, until that time the average percentage of people who do not return to donate again is around 20%-30%, while after t∗ it decreases to 10%. This can be interpreted as a sort of loyalty of donors, which strengthens after a handful of donations.
As far as the covariates are concerned, the association recorded the following at each blood donation:
- age: continuous (in years);
- BMI: an indicator of body fatness, calculated as a person's weight in kilograms divided by the square of the height in meters;
- gender: 1 if the donor is a woman, 0 if a man;
- blood type: four levels, depending on the blood type (0, A, B, AB);
- Rh factor: 1 if positive, 0 if negative;
- smoking habits: 1 if the donor regularly smokes, 0 otherwise;
- physical activity: 1 if the donor regularly practices a sport, 0 otherwise.
Note that all the covariates not strictly related to blood features (such as weight, height and life habits) are declared by the donor and not measured by a doctor: therefore, they may be quite inaccurate. Table 3.1 shows the empirical frequencies of the static covariates described above.
In what follows, we consider age and body mass index (BMI) as time-varying continuous covariates. On the other hand, gender, blood type, Rh factor, smoking habits and physical activity are treated as static covariates.
Female 31.56% | A 37.86% | B 12.27% | AB 3.81% | 0 46.06% | Rh+ 86.7% | Smoker 32.43% | Activity 69.78%

Table 3.1: Empirical relative frequencies for the different categories of the static covariates.
Figure 3.9: Histogram of the two time varying covariates: age (left) and BMI (right).
Figure 3.9 shows the histograms for the two donation-dependent covariates: the
1st empirical quartile for the covariate BMI is 21.98, and the 3rd quartile is 26.03,
with a median of 23.9. For the age, the 1st empirical quartile is 28, and the 3rd
quartile is 44, with a median of 35.
3.4.3 Skew normal mixture model
Skew normal mixture models have been successfully employed in various contexts: in the Bayesian framework see, for instance, Bayes and Branco (2007), Frühwirth-Schnatter and Pyne (2010), Canale and Scarpa (2013) and Arellano-Valle et al. (2007).
The skew normal distribution is a continuous probability distribution that generalizes the normal distribution to allow for non-zero skewness: for its properties, see Azzalini (2005) and Arellano-Valle and Azzalini (2006). The univariate skew normal distribution has three parameters: location ξ ∈ R, scale ω ∈ R+ and skewness λ ∈ R, and it is denoted by SN(ξ, ω, λ). The probability density function is
f(x) = (2/ω) φ((x − ξ)/ω) Φ(λ (x − ξ)/ω)

where φ and Φ denote the probability density function and the cumulative distribution function of a standard normal random variable, respectively. For λ = 0 the normal distribution N(ξ, ω²) is recovered. Figure 3.10 compares the density for different values of the skewness parameter: it is clear how it drives the asymmetry of the distribution.
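The density above is straightforward to implement with the standard-normal pdf and cdf (the latter via the error function):

```python
import math

def skew_normal_pdf(x, xi=0.0, omega=1.0, lam=0.0):
    """Skew normal density f(x) = (2/omega) * phi(z) * Phi(lam * z),
    with z = (x - xi)/omega."""
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))    # standard normal cdf
    return 2.0 / omega * phi * Phi
```

For λ = 0 the factor Φ(0) = 1/2 cancels the 2, recovering the normal density, and f(x; λ) = f(−x; −λ) gives the mirror-image behavior visible in Figure 3.10.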
Figure 3.10: Probability density function of the skew normal distribution with ξ = 0, ω = 1 and skewness parameter λ ∈ {−4, −1, 0, 1, 4}.
We use a stochastic representation of the skew normal distribution. Let Z and ε be two independent random variables, where Z has a half-normal distribution, Z ∼ TN[0,+∞)(0, 1), and ε ∼ N(0, 1). Then, the random variable X defined by

X = δZ + √(1 − δ²) ε, for any δ ∈ (−1, 1),

is skew-normal distributed with λ = δ/√(1 − δ²): this is a one-to-one correspondence which maps (−1, 1) onto R (i.e. λ ∈ R). In order to also account for location and scale parameters, we may introduce the random variable Y, defined through an affine transformation of X, as

Y = ξ + ωX = ξ + ω(δZ + √(1 − δ²) ε)   (3.7)
and we have that Y ∼ SN(ξ, ω, λ). However, the parameterization adopted in what follows is different (see Frühwirth-Schnatter and Pyne (2010)), and it is convenient when performing posterior inference, as made clear later. Thus, we define

ξ = ξ,  ψ = ωδ = ωλ/√(1 + λ²),  σ² = ω²(1 − δ²) = ω²/(1 + λ²)   (3.8)

and we use the following stochastic representation:

Y = ξ + ψZ + σε,  Y ∼ SN(ξ, ω², λ),  λ = ψ/σ,  ω² = σ² + ψ²,

where ψ ∈ R and σ² ∈ R+. To conclude, we recall that E(Y) = ξ + ψ√(2/π) and Var(Y) = ψ²(1 − 2/π) + σ².
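The stochastic representation above translates directly into a sampler. The following sketch (assuming NumPy is available; function and variable names are ours, not the thesis's) draws from SN(ξ, ω, λ) via the half-normal construction and checks the closed-form moments just recalled.

```python
import numpy as np

def rskewnormal(n, xi=0.0, omega=1.0, lam=0.0, rng=None):
    """Draw n samples from SN(xi, omega, lam) using the representation
    X = delta*Z + sqrt(1 - delta^2)*eps, Y = xi + omega*X, with
    Z half-normal and eps standard normal, independent."""
    rng = rng or np.random.default_rng()
    delta = lam / np.sqrt(1.0 + lam**2)
    z = np.abs(rng.standard_normal(n))       # Z ~ TN_[0,inf)(0, 1)
    eps = rng.standard_normal(n)             # eps ~ N(0, 1)
    return xi + omega * (delta * z + np.sqrt(1.0 - delta**2) * eps)

# Compare empirical moments with E(Y) = xi + psi*sqrt(2/pi) and
# Var(Y) = psi^2*(1 - 2/pi) + sigma2, where psi = omega*delta.
xi, omega, lam = 1.0, 2.0, 4.0
delta = lam / np.sqrt(1.0 + lam**2)
psi, sigma2 = omega * delta, omega**2 * (1.0 - delta**2)
y = rskewnormal(200_000, xi, omega, lam, np.random.default_rng(0))
print(y.mean())   # close to xi + psi*sqrt(2/pi)
print(y.var())    # close to psi^2*(1 - 2/pi) + sigma2
```

The same construction, after adding the regression terms, is what makes the Gibbs sampler of Appendix 3.B conditionally Gaussian.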
3.4. The AVIS data on blood donations 75
The skew normal distribution is chosen to describe the likelihood of the logarithm of the observations (gap times), after accounting for a regression term that considers both static covariates and donation-dependent covariates, as follows:

Yij = log Tij | si = l, βj, β0, ul, ωl, λl ∼ SN(ul + x′ij βj + x′i β0, ω²l, λl),

for j = 1, . . . , (ni + 1), i = 1, . . . , Nd, where each observation is independent of the others, across i and j. Here, Nd denotes the number of donors and ni the number of recorded donations for the i-th subject. This is an AFT model. Note that the cluster-dependent parameters are the intercept ul, the scale ωl and the skewness λl, while the regression coefficients do not vary among the groups. Using the stochastic representation (3.7) and reparametrizing as in (3.8), we obtain the equivalent model for the likelihood:

Yij | si = l, βj, β0, ul, ψl, σ²l, ηij ∼ N(ul + x′ij βj + x′i β0 + ψl ηij, σ²l)
ηij ∼ TN_[0,∞)(0, 1)

for j = 1, . . . , (ni + 1), i = 1, . . . , Nd. The priors we assume are the following:
β0 ∼ Np1(0, Σ0)
β1, . . . , βJ+1 | τ²1, . . . , τ²p2 iid∼ Np2(0, diag(τ²1, . . . , τ²p2))
τ²1, . . . , τ²p2 iid∼ IG(ν0, τ0)
p(ρn = (e1, . . . , en)) ∼ PPMx
(θl = (ul, ψl)T, σ²l), l = 1, . . . , K | K iid∼ N2(θl; θ0, σ²l K0) × IG(σ²l; a, b)   (3.9)

where diag(τ²1, . . . , τ²p2) is a diagonal matrix whose diagonal entries are τ²1, . . . , τ²p2 and p2 is the number of time-varying covariates in the regression. The prior distribution for the vector of unique values (θl = (ul, ψl)T, σ²l), l = 1, . . . , K, is the same as in Frühwirth-Schnatter and Pyne (2010), where the couple (ul, ψl)T is a priori Gaussian distributed with mean θ0 = (ξ0, ψ0)T and variance-covariance matrix σ²l K0 = σ²l diag(κ0, κ1). These priors are conditionally conjugate, helping us when devising the Gibbs sampler of Appendix 3.B.
As far as the covariate information for our case study is concerned, we need to distinguish between what enters as a standard dependent variable through linear regression and what influences the prior on the partition through the similarity function g(·). Clearly, the variables can be repeated, if one thinks that a certain covariate is important from the viewpoint of both regression and clustering. In the regression setting, after a thorough investigation (see Gianoli (2016)), we decided to consider the following covariates: the time-varying continuous covariates are age and the body mass index (BMI), which is defined as the body weight divided by the square of the body height (p2 = 2). Gender, blood type, Rh factor and smoking habits are treated as static covariates (hence, p1 = 6). On the other hand,
Test   λ      κ    σ      (κ0, κ1)   (a, b)          LPML      Epost(K)  Vpost(K)
a      0.01   0.5  0.001  (5, 5)     (2.04, 0.208)   6094.255  5.043     0.041
b      0.01   0.5  0.2    (5, 5)     (2.04, 0.208)   5594.895  5.034     0.033
c      0.01   0.5  0.5    (5, 5)     (2.04, 0.208)   5528.061  5.891     0.446
d      0.005  0.5  0.2    (5, 5)     (2.04, 0.208)   6053.80   4.038     0.038
e      0      0.5  0.5    (5, 5)     (2.04, 0.208)   6397.37   3.6155    0.305
f      0.01   0.5  0.1    (10, 10)   (2.5, 0.3)      5660.43   7.174     1.084

Table 3.2: Test settings for the AVIS dataset: the parameter λ and the pair (κ, σ) refer to the similarity function g(·) and the NGG process, respectively. Moreover, (κ0, κ1) are the diagonal entries of the matrix K0 and (a, b) the parameters of the inverse gamma in (3.9). Vpost denotes posterior variance.
the covariates that enter in the similarity function driving the prior on the partition are all considered static and take the value recorded at the first blood donation: age, gender, BMI, blood type, physical activity habits, Rh factor and smoking habits.
3.4.4 Case study

In order to manage the complexity of the algorithm and the size of the dataset, we implemented the algorithm in C++. Every run of the Gibbs sampler produced a final sample of 2,000 iterations, after a thinning of 5 and an initial burn-in of 2,000 iterations. In all cases, convergence was checked using both visual inspection of the chains and standard diagnostics available in the R package CODA. We fix hyperparameters as follows: for the variance-covariance matrix of the static regression coefficients, Σ0 = diag(1, . . . , 1), and (a0, b0) = (2.25, 0.625), such that the variances τ²1, . . . , τ²p2 have a priori mean 0.5 and variance 1. Moreover, θ0 = (0, 0)T. The other hyperparameters vary as described in Table 3.2. In the tests named a, b, c the parameter σ increases while the others are fixed; in b, d and e, on the other hand, λ varies. Note that, since λ = 0 in test e, we are not considering the effect of the covariates in the prior for the partition. For all the tests, we chose the similarity function gC of Section 3.2.1, with the single parameter λ, which rescales the distances among covariates in a cluster.
As far as the donation-dependent covariates are concerned, all the tests agree, so we report inference for test c in Table 3.2. Posterior summaries of βj, j = 1, . . . , 15, show that the covariate age has little effect: indeed, almost all the 90% posterior credibility intervals (except the 8th) contain 0 (see the left panel of Figure 3.11, for test c). Intuitively, the covariate age changes little over time, since the temporal window of the study is 6 years. The effect of age on the 8th gap time seems slightly negative: young people tend to donate more frequently, once they have become loyal donors. On the other hand, the covariate BMI has a stronger effect, which increases moderately over the subsequent donations. The effect is positive, meaning that donors with a higher BMI usually experience larger gap times; this is due to the fact that donors with high BMI undergo more detailed
Figure 3.11: Posterior distribution of the regression coefficients corresponding to donation-dependent covariates, for Age (left panel) and BMI (right panel), plotted against the donation number with 90% credibility intervals, under test c in Table 3.2.
medical check-ups, such as cardiological examinations, thus extending the gap time. The right panel of Figure 3.11 displays the posterior mean and 90% credibility interval for the donation-dependent coefficients related to the covariate BMI under test c in Table 3.2. Here, the 15th intervals are not displayed because their variance is significantly larger than that of the previous ones.
Concerning the interpretation of static covariates, Figure 3.12 shows posterior estimates and 90% credibility intervals for the 6 regression coefficients. We found that the covariate smoking is of little importance. On the contrary, the effect of the covariate gender is very strong: due to the regulation, women tend to experience longer gap times, as expected. Also the posterior distribution of the Rh factor coefficient has support concentrated on (small) positive values. Donors with a positive Rh factor tend to show slightly longer gap times than donors with a negative Rh factor: the latter are less common (13.3% in our population) and cannot receive blood from those with a positive factor.

Also the blood type seems to have a mild effect: even if the posterior of the parameter related to blood type A contains 0 in all the cases of Table 3.2, blood type B has a positive effect, as does AB. Note that we tried to replace the two covariates related to blood type and Rh factor with their interaction: the results were unsatisfactory because of identifiability issues for β0.
Now, we compare the tests in Table 3.2 in order to gain insight into the role of the parameters. Comparing the posterior of the number of groups K in tests a, b and c, we see that for smaller values of σ the mass is concentrated at 5, while in case c (σ = 0.5) the mode is at 6 and the variance is also larger, giving mass
Figure 3.12: Posterior distribution of the regression coefficients corresponding to static covariates (one panel each for Gender, A, AB, B, RH and Smoke, with 90% credibility intervals marked) under test c in Table 3.2.
to the set {5, 6, 7, 8, 9}. The optimal partition, obtained by minimizing the Binder loss function, becomes more interpretable when σ is large; see Figure 3.13. The six groups have different features in terms of number of donations (the green and purple clusters gather donors with a smaller number of donations with respect to the others). The red cluster clearly contains those donors that are very regular (loyal), since their gap time is approximately constant in time. The blue and orange groups seem to contain donors that become loyal with time, since the trajectories decrease with the donations; however, the orange group shows a faster decrease on average and a slightly smaller number of donations. The small yellow group gathers the donors that are regular in time but have a relatively small number of donations. As far as the effect of covariates on the optimal partition is concerned, it seems that the covariates age, Rh and smoking influence the partition. We also tried to apply the Variation of Information criterion of Meilă (2007) in order to obtain the optimal partition. According to this method, we obtained 4 groups under setting c (not reported here). The cardinalities of the groups are (407, 1180, 1736, 10). Differently from the result obtained with the Binder loss function, the clusters do not differ in terms of number of donations.
Figure 3.13: Optimal partition, according to the Binder loss function minimization method, under test c of Table 3.2. Each of the six panels shows a cluster (gap time in days versus number of donations; cluster sizes 394, 1133, 251, 502, 1041 and 11 elements): the thick black line represents the mean curve within each group, computed for each donation. Different behaviors may be noticed.
The hyperparameters of P0 in test f are different from those in the other tests; in this case, we obtain a larger a-posteriori mean of Kn. This behavior is common among nonparametric mixture models, which are not particularly robust with respect to the choice of the base measure.

The choice of the temperature parameter λ related to the similarity function is not straightforward, since it quite strongly influences the posterior on the number of clusters: decreasing λ leads to a reduction of the number of clusters K. Indeed, test e, corresponding to a prior on the partition that is not influenced by the covariates, puts mass on the set {3, 4, 5, 6} with mode at 4. Figure 3.14 displays the optimal partition under test e. We found 2 large and 2 very small clusters, which are harder to interpret than those in Figure 3.13. In general, we suggest the following approach to fix λ: first, compute all the possible pairwise distances among the set of covariates, then evaluate the range of these values and choose λ so as to rescale the range of the distances into a given interval.
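This heuristic can be sketched in a few lines; in the snippet below the distance (Euclidean, on standardized covariates) and the target upper bound for the rescaled distances are our own illustrative choices, not prescribed by the text.

```python
import numpy as np

def choose_lambda(X, target=1.0):
    """Pick lambda so that the largest pairwise covariate distance,
    once multiplied by lambda, equals `target`; this rescales the
    whole range of distances into (0, target]."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))   # all pairwise distances
    return target / dist.max()

# illustrative covariates: 100 subjects, 3 standardized covariates
X = np.random.default_rng(1).standard_normal((100, 3))
lam = choose_lambda(X, target=1.0)
```

Other quantiles of the distance distribution (instead of the maximum) could serve as the anchor if outlying subjects are a concern.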
Figure 3.14: Optimal partition, according to the Binder loss function minimization method, under test e of Table 3.2 (gap time in days versus number of donations; cluster sizes 1158, 2160, 11 and 4 elements). In this test, the effect of the similarity function in the prior for the partition is null.
Finally, Figure 3.15 shows the predictive distribution for one donor of the population and different donations: the vertical line is the actual observation. The blue lines represent the 90% credibility interval for the predictive law, and the black line is the prediction. Note the skewness of the distributions.

We also mention that we computed cross-validated prediction errors. We randomly removed 333 donors from the dataset and used the remaining 3,000 to compute the posterior distribution of the parameters. Then, we computed the following index of goodness-of-fit:

Σ_{i=1}^{333} Σ_{j=1}^{ni} |ŷij − y^test_ij|,

where the predicted value ŷij is the mean of the predictive law. We obtained the value 39,338. This can be used for comparison with other methods: this will be the object of further investigation.
Figure 3.15: Predictive distribution (density of the log gap time, three panels for Time 1, Time 2 and Time 3) under test a of Table 3.2.
We conclude the analysis of the AVIS dataset by stating the main findings we obtained; indeed, we illustrated our findings to the general manager and the medical director of the Milan department of AVIS and discussed how the policy-makers may take advantage of the results.

From their viewpoint, it is important to understand the different profiles of donors and how covariates affect the donors' behavior, in order to better plan awareness campaigns and to launch surveys investigating why some donors stop donating without any medical reason.
3.5 Discussion and future work
In this chapter we presented a product partition model with dependence on covariates; differently from Müller et al. (2011), the similarity function we assume is a generic non-increasing function of the distances among covariates in a cluster. We proposed three possible choices; as a future development of this work, a thorough investigation of the properties of the prior on the random partition induced by the functional form of the similarity is needed. In particular, it is still not clear how to balance the effect of the covariates against the cohesion function induced by the underlying completely random measure. For this reason, we provided empirical guidelines for choosing the hyperparameters.
Moreover, we mention that we tried to depart from the product partition form of the similarity, by considering g(xA1, . . . , xAK) as a trade-off between compactness and separation, as follows:

g_comp(xA1, . . . , xAK) = (1/K) Σ_{l=1}^K g(xAl),

where the normalization by the number of groups K is made in order not to promote a large number of groups (since we are summing over the groups Al). Essentially, we compute an average of the compactness of the clusters. As far as the separation is concerned, the objective is to encourage clusters that are well separated, i.e. whose centroids are far from each other. We defined the separation as

g_sep(xA1, . . . , xAK) = g_sep(c1, . . . , cK) = (1/K) Σ_{l=1}^K (1 − e^{−λ t*_l})/(1 + e^{−λ t*_l}),

where t*_l = nl d(cl, c̄). Here, nl and cl stand for the cardinality and the centroid of group l, and c̄ represents the global centroid. Again, the normalization with respect to K is made in order not to favor a large K, while the logistic transformation rescales the contribution to (0, 1), to balance the effect of the compactness (which is a number in (0, 1), indeed). Summing up, we have

g(xA1, . . . , xAK) = g_comp(xA1, . . . , xAK) + γ g_sep(xA1, . . . , xAK),

where γ is a weight for the effect of the separation. However, the preliminary numerical tests were unsatisfactory, since the results were very similar to the case without covariates in the prior; this may be due to the averaging effect we introduce by summing over all the components of the partition. However, different ways of defining separation can be employed, and this will be the subject of future research.
As far as the application to the blood donation data is concerned, the main development is to improve the predictive accuracy of our model: indeed, other possible models for the likelihood could be employed.
Appendix 3.A: Gibbs sampler

In this section, we illustrate the Gibbs sampler Pólya urn scheme for the conjugate case, i.e. when the base distribution P0 is conjugate to the mixture kernel f(·|x, θ). In particular, in our case we consider f as the kernel of a Gaussian distribution, θ = (β, σ²), where β is the vector of the regression coefficients, namely f(y; x, β, σ²) = N(y; xTβ, σ²). The base measure P0 is given by

θ = (β, σ²) ∼ Np(β; μ0, σ²B0) × IG(σ²; a0, b0).
The algorithm is an extension of Algorithm 8 in Neal (2000), later generalized to NormCRMs in Favaro and Teh (2013) (see Section 3 there). It is a marginal MCMC sampler in its simplest form, thanks to the above-mentioned conjugacy. In this case, the cluster parameters θ*1, . . . , θ*K can be efficiently marginalized out from the joint distribution (3.4), obtaining

∏_{j=1}^K ( m(y*j) c(u, nj) g(x*j) ) D(u, n),

where m(y*j) is the marginal distribution of the data in the j-th cluster. Therefore, the Gibbs sampler is obtained by repeatedly sampling from the following full conditionals (for simplicity of notation, we use the term rest to denote all the variables except the one on the left of the conditioning bar):
1. For the auxiliary variable u, note that, given the partition ρn, it is independent of the observations. Thus, we sample from L(u|ρn, x, y, θ*1, . . . , θ*K) ∝ u^{n−1} e^{−Ψ(u)} ∏_{j=1}^K c(nj, u), which in the case of the NGG simplifies to

L(du|rest) ∝ u^{n−1} (u + 1)^{σK−n} e^{−(κ/σ)((u+1)^σ − 1)} du.

We use a simple Metropolis-Hastings update with a Gaussian proposal kernel truncated to (0, +∞).
2. Sampling from L(θ*1, . . . , θ*K | rest): each θ*j = (β*j, σ²*j), for j = 1, . . . , K, is updated within each cluster according to the usual parametric update in the conjugate Normal - Normal-inverse gamma case. In particular, for each j = 1, . . . , K, the cluster-specific parameters can be sampled independently from the following distributions:

β*j | σ²*j, rest ∼ N(μ*, σ²*j B*),

where B* = (B0^{−1} + Σ_{i∈Aj} xi xiT)^{−1} and μ* = B*(B0^{−1} μ0 + Σ_{i∈Aj} yi xi), and

σ²*j | rest ∼ IG( a*j = a0 + nj/2,  b*j = b0 + (1/2)( μ0T B0^{−1} μ0 + Σ_{i∈Aj} yi² − μ*T B*^{−1} μ* ) ).
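For reference, the within-cluster conjugate update can be sketched as follows; this is a minimal NumPy implementation under the notation above (function and variable names are ours).

```python
import numpy as np

def nig_update(X, y, mu0, B0, a0, b0):
    """Posterior parameters of the Normal - Normal-inverse gamma model:
    beta | sigma2 ~ N(mu0, sigma2*B0), sigma2 ~ IG(a0, b0),
    y_i | beta, sigma2 ~ N(x_i' beta, sigma2), x_i' the rows of X."""
    B0_inv = np.linalg.inv(B0)
    prec = B0_inv + X.T @ X          # posterior precision (up to sigma2)
    B_star = np.linalg.inv(prec)
    mu_star = B_star @ (B0_inv @ mu0 + X.T @ y)
    a_star = a0 + len(y) / 2.0
    # mu*' B*^{-1} mu* computed with `prec` to avoid a second inversion
    b_star = b0 + 0.5 * (mu0 @ B0_inv @ mu0 + y @ y - mu_star @ prec @ mu_star)
    return mu_star, B_star, a_star, b_star
```

A draw of θ*j is then obtained by sampling σ²* from IG(a*, b*) and β* from N(μ*, σ²* B*).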
3. The random partition ρn can be updated using a form of Gibbs sampling whereby the cluster assignment of one item Yi is updated at a time. Denote by ρ^{(−i)}_{n−1} the partition of the n − 1 items after the i-th item has been removed, and by ei = l the event that Yi is assigned to cluster l, where l varies in {1, . . . , k^{(−i)}_{n−1}, k^{(−i)}_{n−1} + 1} and k^{(−i)}_{n−1} is the number of clusters in the partition without i. Note that k^{(−i)}_{n−1} + 1 is included to consider the case where the item forms a new cluster. Therefore, we have to sequentially sample, for i = 1, . . . , n, from the following conditional distribution:

L(ei = l | u, {xi}, {yi}, ρ^{(−i)}_{n−1}) = L({yi} | u, x, ρ^{(−i)}_{n−1}, ei = l) p(ei = l | ρ^{(−i)}_{n−1}) / L({yi} | u, x, ρ^{(−i)}_{n−1}),   (3.10)

where l = 1, . . . , k^{(−i)}_{n−1} + 1. Moreover, observe that, for any l = 1, . . . , k^{(−i)}_{n−1}, k^{(−i)}_{n−1} + 1, the prior on the partition can be written as:

p(ρ^{(−i)}_{n−1}, ei = l | rest) ∝ D(u, n) ∏_{j=1}^{K} ( c(u, nj) g(x*j) )
 ∝ D(u, n) ∏_{j=1}^{k^{(−i)}_{n−1}} ( c(u, nj) g(x*j) ) · c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) )
 ∝ p(ρ^{(−i)}_{n−1}, u) · c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) ).
Note that g(∅) = 1 and, in the new-cluster case, c(nl, u) = c(0, u) = 1. It is important to highlight the contribution given by the similarity function here, compared to the case without covariates, as in Favaro and Teh (2013): the probability of assigning item i to an already existing cluster l is modified according to the ratio g(x*l ∪ {xi})/g(x*l), which quantifies how the total similarity among the covariates varies when adding i to the group. If xi is very similar to the others, the ratio will be greater than 1; on the other hand, if the covariate is very different, the contribution of the similarity will be less than 1, thus decreasing the probability of assigning the item to that cluster. Moreover, the contribution of the likelihood in (3.10) is:

L({yi} | u, {xi}, ρ^{(−i)}_{n−1}, ei = l) = ∏_{j=1, j≠l}^{k^{(−i)}_{n−1}} m(y*j) · m(y*l ∪ {yi}) = L(y^{(−i)} | ρ^{(−i)}_{n−1}) m(y*l ∪ {yi}) / m(y*l),

where m(∅) = 1 in the case of a new cluster. Therefore, (3.10) becomes

L(ei = l | rest) = L(y^{(−i)} | ρ^{(−i)}_{n−1}) · ( m(y*l ∪ {yi}) / m(y*l) ) · ( 1 / L(y1, . . . , yn | ρ^{(−i)}_{n−1}) ) · ( p(ei = l, ρ^{(−i)}_{n−1}) / p(ρ^{(−i)}_{n−1}) )
 ∝ ( m(y*l ∪ {yi}) / m(y*l) ) · ( c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) ) ),

so that each ei is sequentially assigned according to this law.
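The resulting allocation rule can be sketched generically. In the snippet below, `marg` and `g` are placeholders for the marginal likelihood m(·) and the similarity g(·) (both problem-specific and user-supplied), and the NGG cohesion ratios n_l − σ for an existing cluster and κ(u+1)^σ for a new one are used, as recalled later for the blood donation application.

```python
import numpy as np

def allocation_probs(i, clusters, marg, g, u, kappa=0.5, sigma=0.5):
    """Normalized probabilities of assigning item i to each existing
    cluster (a list of index sets) or to a new one. `marg(idx)` and
    `g(idx)` must return m(y_idx) and g(x_idx), respectively."""
    w = []
    for A in clusters:
        cohesion = len(A) - sigma                 # c(n_l + 1, u) / c(n_l, u)
        similarity = g(A + [i]) / g(A)            # g(x*_l ∪ {x_i}) / g(x*_l)
        likelihood = marg(A + [i]) / marg(A)      # m(y*_l ∪ {y_i}) / m(y*_l)
        w.append(cohesion * similarity * likelihood)
    # new cluster: c(0, u) = g(∅) = m(∅) = 1
    w.append(kappa * (1.0 + u) ** sigma * g([i]) * marg([i]))
    w = np.array(w)
    return w / w.sum()
```

With constant `marg` and `g`, the rule reduces to the usual NGG urn scheme; the similarity ratio is exactly the covariate-driven correction discussed above.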
Appendix 3.B: Gibbs sampler for the blood donations application

In this section we develop a Gibbs sampler to sample from the posterior of our model in Section 3.4: thanks to a careful choice of the priors, most of the full conditionals are conjugate, thus accelerating the computation. This is important, given the cardinality of the sample and the dimensionality of the parameters. Indeed, the state space is given by

{(ηij)_{j=1}^{ni+1}, i = 1, . . . , Nd}, {Y^cens_{i(ni+1)}, i = 1, . . . , Nd}, β0, (βj)_{j=1}^{J+1}, (τ²m)_{m=1}^{p2}, ρn, (ul, ψl, σ²l)_{l=1}^{K}.

The full conditionals are listed below:
The full-conditionals are listed below:
Parameters (ηij)ni+1j=1 , i = 1, .., Nd: each ηij can be independently sampled accord-
ing to
L(ηij |rest) ∝ exp
(− 1
2σ2l
(yij − (ul + x′ijβj + x′iβ0 + ψlηij)
)2 − 1
2η2ij
)I (ηij > 0)
which turns out to be a truncated normal, namely
ηij |rest ∼ T N [0,∞)
(ψl
σ2l + ψ2
l
(yij − (ul + x′ijβj + x′iβ0)
),
σ2l
σ2l + ψ2
l
)for j = 1, . . . , ni + 1 and i = 1, . . . , Nd.
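This truncated-normal draw can be implemented with a simple inverse-cdf step; the sketch below uses only the Python standard library (function name and interface are ours).

```python
import random
from statistics import NormalDist

def sample_eta(y_tilde, psi, sigma2, rng=random):
    """Draw from TN_[0,inf)(m, s^2) with m = psi/(sigma2 + psi^2)*y_tilde
    and s^2 = sigma2/(sigma2 + psi^2), i.e. the full conditional of
    eta_ij; y_tilde is the observation minus the regression terms."""
    m = psi / (sigma2 + psi * psi) * y_tilde
    s = (sigma2 / (sigma2 + psi * psi)) ** 0.5
    nd = NormalDist(m, s)
    lo = nd.cdf(0.0)                      # normal mass below the truncation
    u = lo + rng.random() * (1.0 - lo)    # uniform on [F(0), 1)
    return nd.inv_cdf(u)
```

The same routine, with the lower bound replaced by the censoring time, serves for the censored gap times in the next full conditional.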
Parameters Y^cens_{i(ni+1)}, i = 1, . . . , Nd: the censored observations are independently sampled according to

Y^cens_{i(ni+1)} | rest ∼ TN_{[yi(ni+1),+∞)}( ul + x′ij βj + x′i β0 + ψl ηij, σ²l )

for i = 1, . . . , Nd (with j = ni + 1).
Parameter β0: thanks to conjugacy, the full conditional is a p1-dimensional multivariate Gaussian, with mean β*0 and variance-covariance matrix Σ*0, where

Σ*0 = ( Σ0^{−1} + Σ_{i=1}^{Nd} ((ni + 1)/σ²l) xi xiT )^{−1}

and

β*0 = Σ*0 ( Σ_{i=1}^{Nd} Σ_{j=1}^{ni+1} ( yij − (ul + x′ij βj + ψl ηij) )/σ²l · xi ).
Parameters (βj)_{j=1}^{J+1}: each coefficient vector βj can be sampled independently from a p2-dimensional multivariate Gaussian, with mean β*j and variance-covariance matrix Σ*j, where

Σ*j = ( diag(τ²1, . . . , τ²p2)^{−1} + Σ_{i:(ni+1)≥j} (1/σ²l) xij xijT )^{−1}

and

β*j = Σ*j ( Σ_{i:(ni+1)≥j} ( yij − (ul + x′i β0 + ψl ηij) )/σ²l · xij ).
Parameters (τ²m)_{m=1}^{p2}: each parameter τ²m is independently sampled from

τ²m | rest ∼ IG( ν0 + (J + 1)/2, τ0 + (1/2) Σ_{j=1}^{J+1} β²jm ),  m = 1, . . . , p2.
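Sampling from this inverse gamma full conditional only needs a gamma draw, since X ∼ IG(a, b) if and only if 1/X ∼ Gamma(a, rate b); a standard-library sketch (names ours, default hyperparameters taken from Section 3.4.4):

```python
import random

def sample_tau2(beta_m, nu0=2.25, tau0=0.625):
    """One draw of tau2_m | rest ~ IG(nu0 + (J+1)/2, tau0 + 0.5*sum_j beta_jm^2),
    where beta_m collects the m-th coordinate of beta_1, ..., beta_{J+1}."""
    a = nu0 + len(beta_m) / 2.0
    b = tau0 + 0.5 * sum(v * v for v in beta_m)
    # random.gammavariate takes a scale parameter, so rate b -> scale 1/b
    return 1.0 / random.gammavariate(a, 1.0 / b)
```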
Parameters (ul, ψl, σ²l)_{l=1}^{K}: the likelihood for the data in cluster Al that is used to build the joint distribution of (ul, ψl, σ²l) is proportional to

∏_{i∈Al} ∏_{j=1}^{ni+1} (2πσ²l)^{−1/2} exp( −(1/(2σ²l)) ( ỹij − (ul + ψl ηij) )² ),

with ỹij = yij − x′ij βj − x′i β0. This is equivalent to having data {Ỹi : i ∈ Al}, where each Ỹi has dimension ni + 1. If we denote the stacked vector by Ỹl, of dimension Nl = Σ_{i∈Al}(ni + 1), its conditional distribution is Gaussian with mean Xl θl and variance-covariance matrix σ²l I_{Nl×Nl}. The design matrix Xl has rows γij = (1, ηij), for i ∈ Al and j = 1, . . . , ni + 1. Moreover, θl = (ul, ψl)T.
The prior we chose is conjugate, therefore we only need to update the parameters. In particular, we have

σ²l | rest ∼ IG( a + Nl/2, b + (1/2)( Σ_{i,j} ỹ²ij + θ0T K0^{−1} θ0 − θ*0T K*^{−1} θ*0 ) ),

where K* = ( Σ_{i,j} γij γijT + K0^{−1} )^{−1} and θ*0 = K*( Σ_{i,j} ỹij γij + K0^{−1} θ0 ). The intercept ul and the skewness parameter ψl are no longer independent a posteriori; they are conditionally Gaussian with mean θ*0 and variance-covariance matrix σ²l K*.
Partition ρn: in order to sample the partition, we need to resort to a generalization of Algorithm 8 of Neal (2000), since we are in the non-conjugate case. The steps are very similar to those described in Section 3.2 of Favaro and Teh (2013), except for the presence of the similarity function g(·). In particular, the probability of assigning the i-th subject to cluster l, for l = 1, 2, . . . , k^{(−i)}_{n−1}, is

p(ei = l | rest) ∝ p(ρ^{(−i)}_{n−1}, u) · ( c(nl + 1, u) g(x*l ∪ {xi}) / ( c(nl, u) g(x*l) ) ) · ∏_{j=1}^{ni+1} N( yij; ul + xijT βj + xiT β0 + ψl ηij, σ²l ),   (3.11)

where the superscript (−i) denotes a quantity referring to the partition of the n − 1 subjects after i has been removed. Moreover, we need to take into account M possible new clusters, whose parameters {um, ψm, σ²m}, m = 1, . . . , M, are generated from the prior distribution. The probability of allocating the subject to one of these new clusters is as in (3.11) divided by M, with the convention that c(∅, u) = 1 and g(∅) = 1. We recall that, under the NGG assumption,

c(nl + 1, u)/c(nl, u) = nl − σ if nl > 0, and κ(u + 1)^σ otherwise.
Auxiliary parameter u: the auxiliary parameter for the normalized completely random measure can be simply drawn using a Metropolis-Hastings step from

L(u | rest) ∝ u^{n−1} exp(−Ψ(u)) ∏_{l=1}^{Kn} c(nl, u),

which in the NGG case turns out to be

u^{n−1} (1 + u)^{σKn−n} exp( −(κ/σ)((u + 1)^σ − 1) ).
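A single update of u can be sketched as follows, with the truncated-Gaussian proposal mentioned in Appendix 3.A (standard library only; the step width s is a tuning choice of ours, and we write the NGG exponent as σK − n, consistently with the cohesion ratios recalled above).

```python
import math, random

def log_target_u(u, n, K, kappa=0.5, sigma=0.5):
    """Log full conditional of u under the NGG:
    u^(n-1) * (1+u)^(sigma*K - n) * exp(-(kappa/sigma)*((1+u)^sigma - 1))."""
    return ((n - 1) * math.log(u) + (sigma * K - n) * math.log1p(u)
            - (kappa / sigma) * ((1.0 + u) ** sigma - 1.0))

def _phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mh_step_u(u, n, K, s=1.0, kappa=0.5, sigma=0.5):
    """One Metropolis-Hastings step with a Gaussian proposal truncated
    to (0, +inf); s is a tuning parameter, not taken from the thesis."""
    prop = random.gauss(u, s)
    while prop <= 0.0:                    # rejection-sample the truncation
        prop = random.gauss(u, s)
    # the symmetric Gaussian kernels cancel in the acceptance ratio;
    # only the truncation normalizers Phi(u/s) and Phi(prop/s) remain
    log_ratio = (log_target_u(prop, n, K, kappa, sigma)
                 - log_target_u(u, n, K, kappa, sigma)
                 + math.log(_phi(u / s)) - math.log(_phi(prop / s)))
    return prop if math.log(random.random()) < log_ratio else u
```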
Chapter 4
Determinantal point process mixtures
via spectral density approach
This chapter is based on Bianchini et al. (2017).
In this chapter, we consider mixture models with a finite and random number of components, rather than assuming it infinite. However, in the usual framework described in Chapter 1, it is often the case that we observe an overestimation of the number of groups (both in the nonparametric case and in finite mixture models). This motivates the introduction of a model that induces a-priori separation among the location parameters; this can be achieved by dropping the conditional i.i.d. assumption typical of models such as (1.4). We explore a class of determinantal point process (DPP) mixture models defined via spectral representation, which leads to the required repulsion among the points of the process. We focus on a power exponential spectral density, even if the proposed approach is in fact quite general. In the second part of the chapter we generalize our model to account for the presence of covariates, both in the likelihood, as linear regression, and in the weights of the mixture, by means of a mixture-of-experts approach. This yields a trade-off between repulsiveness of the locations in the mixture and attraction among subjects with similar covariates.

We develop full Bayesian inference through a Gibbs sampler involving a reversible jump step. Finally, we evaluate the effectiveness of the proposed model through several simulation scenarios and data illustrations.
4.1 Introduction
As we discussed in Chapter 1, mixture models are an extremely popular class of models that have been successfully used in many applications; for a review, see, e.g., Frühwirth-Schnatter (2006). Such models are typically stated as

yi | k, θ, π iid∼ Σ_{j=1}^k πj f(yi | θj), i = 1, . . . , n,   (4.1)

where the weights π = (π1, . . . , πk) are constrained to be nonnegative and to sum to 1, θ = (θ1, . . . , θk), and 1 ≤ k ≤ ∞, with k = ∞ corresponding to a nonparametric model. A common prior assumption when k < +∞ is that π ∼ Dirichlet(δ1, . . . , δk) and that the components of θ are drawn i.i.d. from some suitable prior p0. However, the weights π may be constructed differently, e.g. using a (finite or infinite) stick-breaking representation, which poses a well-known connection with more general models, including nonparametric ones; see, e.g., Ishwaran and James (2001b) and Miller and Harrison (2017). A popular class of Bayesian nonparametric models is the Dirichlet process mixture (DPM) model, introduced in Ferguson (1983) and Lo (1984). It is well known that this class of mixtures usually overestimates the number of clusters, mainly because of the rich-get-richer property of the Dirichlet process. By this we mean that both prior and posterior distributions are concentrated on a relatively large number of clusters, but a few are very large, and the rest of them have very small sample sizes. Mixture models may even be inconsistent; see Rousseau and Mengersen (2011), where concerns about overfitted mixtures are illustrated, and Miller and Harrison (2013), for the inconsistency of the posterior distribution of k in DPMs.
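Model (4.1) can be simulated in two steps, first a component label and then a draw from the corresponding kernel; a minimal sketch with Gaussian kernels (all numerical values below are illustrative, not from the chapter):

```python
import random

def sample_mixture(n, weights, locs, scale=1.0, rng=random):
    """Draw n observations from (4.1) with f(. | theta_j) = N(theta_j, scale^2):
    pick component j with probability pi_j, then sample from its kernel."""
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(locs[j], scale) for j in labels]

random.seed(0)
y = sample_mixture(1000, weights=[0.5, 0.3, 0.2], locs=[-4.0, 0.0, 4.0], scale=0.5)
```

The repulsive priors discussed below act on the locations `locs`, keeping the components well separated rather than a priori i.i.d.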
Despite their success, mixture models like (4.1) tend to use excessively many mixture components. As pointed out in Xu et al. (2016), this is due to the fact that the component-specific parameters are a priori i.i.d., and therefore free to move. This motivated Petralia et al. (2012), Fúquene et al. (2016) and Quinlan et al. (2017) to explicitly define joint distributions for θ having the property of repulsion among their components, i.e. such that p(θ1, . . . , θk) puts higher mass on configurations where the components are well separated. For a different approach, via sparsity in the prior, see Malsiner-Walli et al. (2016).
Xu et al. (2016) explored a similar way to accomplish separation of mixture components, by means of a determinantal point process (DPP) acting on the parameter space. DPPs have recently received increased attention in the statistical literature (Lavancier et al., 2015). DPPs are point processes having a product density function expressed as the determinant of a certain matrix, constructed using a covariance function evaluated at the pairwise distances among points, in such a way that higher mass is assigned to configurations of well-separated points; we give details below. DPPs have been used to make inference mostly on spatial data. Bardenet and Titsias (2015) and Affandi et al. (2014) applied DPPs to model spatial patterns of nerve fibers in diabetic patients, a basic motivation being that such fibers become more clustered as diabetes progresses. The latter also discussed applications to image search, showing how such processes can be used to study human perception of diversity in different image categories. Similarly, Kulesza et al. (2012) show how DPPs can be applied to various problems that are relevant to the machine learning community, such as finding diverse sets of high-quality search results, building informative summaries by selecting diverse sentences from documents, modeling non-overlapping human poses in images or video, and automatically building timelines of important news stories. More recently, Shirota and Gelfand (2017) described an approximate Bayesian computation method to fit DPPs to spatial point pattern data. The first paper where DPPs were adopted as a prior for statistical inference in mixture models is Affandi et al. (2013). The statistical literature also includes a number of papers illustrating theoretical properties of estimators of DPPs from a non-Bayesian viewpoint; see, for instance, Biscio and Lavancier (2016, 2017) and Bardenet and Titsias (2015).
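The repulsion mechanism can be seen in a few lines: with a Gaussian covariance function (an illustrative kernel choice of ours, not the power exponential spectral density adopted later in the chapter), the determinant, and hence the DPP density, is larger for well-separated configurations than for clumped ones.

```python
import numpy as np

def config_det(points, rho=1.0, alpha=0.5):
    """Determinant of C(x_i, x_j) = rho * exp(-(x_i - x_j)^2 / alpha^2)
    over a one-dimensional point configuration; a DPP assigns density
    proportional to such a determinant, up to normalization."""
    pts = np.asarray(points, dtype=float)
    C = rho * np.exp(-((pts[:, None] - pts[None, :]) ** 2) / alpha**2)
    return np.linalg.det(C)

close = config_det([0.0, 0.1, 0.2])    # nearly coincident points
spread = config_det([0.0, 1.0, 2.0])   # well-separated points
```

Here `spread` is close to 1 while `close` is nearly 0, so near-coincident locations are strongly penalized; this is exactly the property exploited when the DPP is placed on the mixture locations.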
We discuss full Bayesian inference for a class of mixture densities where the
locations follow stationary DPPs. Our rst contribution is the introduction of an
approach that generalizes and extends the model studied in Xu et al. (2016) who
base their analysis on a special case of DPPs called L-ensambles, which consider a
nite state space. Instead, we resort to the spectral representation of the covariance
function dening the determinant as the joint distribution of component-specic
parameters. Our methods can thus be used with any such valid spectral represen-
tation, as described by Lavancier et al. (2015), which implies great generality of the
proposal. The extensions considered here are stated in the context of both uni- and
multi-dimensional responses, and are detailed in Section 4.2.4.
For the sake of concreteness, our illustrations focus on the case of the power exponential spectral representation; see examples with different spectral densities in Section 4.4.2. This particular specification allows for flexible repulsion patterns, and we discuss how to set up different types of prior behavior, shedding light on the practical use of our approach in that particular scenario. Although we limit ourselves to the case of isotropic DPPs, inhomogeneous DPPs can be obtained by transforming or thinning a stationary process (Lavancier et al., 2015). A crucial point in our models and algorithms is the expression of the DPP density, which is only defined for DPPs restricted to compact subsets S of the state space, with respect to the unit rate Poisson process. When this density exists, it explicitly depends on S. A sufficient condition for the existence of the density is that all the eigenvalues of the covariance function, restricted to S, are smaller than 1. We follow the spectral approach and assume that the covariance function defining the DPP has a spectral representation. A basic motivation for our choice is that conditions for the existence of a density become easier to check. We review here the basic theory on DPPs, making an effort to be as clear and concise as possible in the presentation of our subsequent models. We discuss applications in the context of both synthetic and real data.
A second contribution of this work is the extension of the proposed spectral DPP model to incorporate covariate information in the likelihood and also in the assignment to mixture components. In particular, subjects with similar covariates are a priori more likely to co-cluster, just as in mixtures of experts models (see, e.g., McLachlan and Peel, 2005), where weights are defined as normalized exponential functions. From a computational viewpoint, a third contribution of our work is the generalization of the reversible jump (RJ) MCMC posterior simulation scheme proposed by Xu et al. (2016) to the general spectral approach and also to the covariate-dependent extensions we consider. We consider two RJ MCMC versions for uni- and multi-variate responses, as discussed later. In all cases the algorithms require computing the DPP density with respect to the unit rate Poisson process. We explain how to carry out the calculations, and discuss the need to restrict the process to (any) compact subset. When extending the model to incorporate covariate information in both the likelihood and the prior assignment to mixture components, the RJ MCMC algorithm requires modifications, as discussed below.
We explicitly consider the estimation of clusters of subjects in the sample, by considering the partition that minimizes the posterior expectation of Binder's loss function (Binder, 1978) under equal misclassification costs. This is a common choice in the applied Bayesian nonparametric literature (Lau and Green, 2007b). In particular, we emphasize one conceptual advantage of the separation induced by the prior assumption, namely, the reduction in the number of clusters a posteriori compared to the usual mixture models that do not include this feature. Reducing the effective number of clusters a posteriori helps our model scale better than alternatives with no separation when the sample size grows. We illustrate this particular point in our data illustrations.
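The partition estimate used throughout can be sketched as follows: given posterior samples of the label vectors, we estimate the co-clustering probabilities and, as a common practical shortcut, search only among the sampled partitions. This is a minimal sketch; the toy `labels` array stands in for real MCMC output.

```python
import numpy as np

def coclustering(labels):
    """Posterior co-clustering probabilities p_ij from an (M, n) array
    of sampled label vectors."""
    M, n = labels.shape
    P = np.zeros((n, n))
    for s in labels:
        P += (s[:, None] == s[None, :])
    return P / M

def expected_binder_loss(partition, P):
    """Posterior expectation of Binder's loss with equal
    misclassification costs, up to a constant factor."""
    delta = (partition[:, None] == partition[None, :]).astype(float)
    iu = np.triu_indices(len(partition), k=1)
    return np.sum(P[iu] * (1 - delta[iu]) + (1 - P[iu]) * delta[iu])

def binder_estimate(labels):
    """Minimize the expected loss over the sampled partitions only,
    a common practical restriction of the search space."""
    P = coclustering(labels)
    losses = [expected_binder_loss(s, P) for s in labels]
    return labels[int(np.argmin(losses))]

# toy posterior: a dominant labeling plus some noisy draws
labels = np.array([[0, 0, 1, 1]] * 8 + [[0, 1, 1, 1]] * 2)
best = binder_estimate(labels)
```

The restriction to sampled partitions avoids the combinatorial search over all partitions of the data.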
4.2 Using DPPs to induce repulsion
We review here the basic theory on DPPs to the extent required to explain our
mixture model. We use the same notation as in Lavancier et al. (2015), where
further details on this theory may be found.
4.2.1 Basic theory on DPPs
Let B ⊆ R^d; we mainly consider the cases B = R^d and B = S, a compact subset of R^d. By X we denote a simple locally finite spatial point process defined on B, i.e. the number of points of the process in any bounded region is a finite random variable, and there is at most one point at any location. See Daley and Vere-Jones (2003; 2007) for a general presentation of point processes. The class of DPPs we consider is defined in terms of their moments, expressed by their product density functions ρ^{(n)} : B^n → [0, +∞), n = 1, 2, .... Intuitively, for any pairwise distinct points x_1, ..., x_n ∈ B, ρ^{(n)}(x_1, ..., x_n) dx_1 ⋯ dx_n is the probability that X has a point in an infinitesimally small region around x_i of volume dx_i, for each i = 1, ..., n. More formally, X has n-th order product density function ρ^{(n)} : B^n → [0, +∞) if this function is locally integrable (i.e. ∫_S |ρ^{(n)}(x)| dx < +∞ for any compact S) and,
for any Borel-measurable function h : B^n → [0, +∞),

E[ ∑^{≠}_{x_1, ..., x_n ∈ X} h(x_1, ..., x_n) ] = ∫_{B^n} ρ^{(n)}(x_1, ..., x_n) h(x_1, ..., x_n) dx_1 ⋯ dx_n,

where the ≠ sign over the summation means that x_1, ..., x_n are pairwise distinct.
See also Møller and Waagepetersen (2007). Let C : B × B → R denote a covariance function. A simple locally finite spatial point process X on B is called a determinantal point process with kernel C if its product density functions are

ρ^{(n)}(x_1, ..., x_n) = det[C](x_1, ..., x_n), (x_1, ..., x_n) ∈ B^n, n = 1, 2, ...,

where [C](x_1, ..., x_n) is the n × n matrix with entries C(x_i, x_j). We write X ∼ DPP_B(C); when B = R^d we write X ∼ DPP(C).
Note that, if A is a Borel subset of B, then the restriction X_A := X ∩ A of X to A is a DPP with kernel given by the restriction of C to A × A. By Theorem 2.3 in Lavancier et al. (2015), first proved by Macchi (1975), such DPPs exist under the two following conditions:

(i) C is a continuous covariance function; hence, by Mercer's theorem,

C(x, y) = ∑_{k=1}^{+∞} λ_k^S φ_k(x) φ_k(y), (x, y) ∈ S × S, S a compact subset,

where the λ_k^S and φ_k are the eigenvalues and eigenfunctions of C restricted to S × S, respectively;

(ii) λ_k^S ≤ 1 for all compact S ⊂ R^d and all k.
Formula (2.9) in Lavancier et al. (2015) reports the distribution of the number N(S) of points of X in S, for any compact S:

N(S) =^d ∑_{k=1}^{+∞} B_k,  E(N(S)) = ∑_{k=1}^{+∞} λ_k^S,  Var(N(S)) = ∑_{k=1}^{+∞} λ_k^S (1 − λ_k^S),  (4.2)

where the B_k ∼ Be(λ_k^S) are independent Bernoulli random variables with means λ_k^S. When restricted to any compact subset S, the DPP has a density with respect to the unit rate Poisson process which, when λ_k^S < 1 for all k = 1, 2, ..., has the following expression:

f(x_1, ..., x_n) = e^{|S| − D_S} det[C̃](x_1, ..., x_n), n = 1, 2, ...,  (4.3)

where |S| = ∫_S dx, D_S = −∑_{k=1}^{+∞} log(1 − λ_k^S) and

C̃(x, y) = ∑_{k=1}^{+∞} (λ_k^S / (1 − λ_k^S)) φ_k(x) φ_k(y), x, y ∈ S.
When n = 0, the density (as well as the determinant) is defined to be equal to 0. See Møller and Waagepetersen (2007) for a thorough definition of absolute continuity of a spatial process with respect to the unit rate Poisson process. However, note that from the first part of (4.2) we have P(N(S) = 0) = ∏_{k=1}^{+∞} (1 − λ_k^S); this probability can be positive under the assumption λ_k^S < 1 for all k = 1, 2, ....
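The mixed-binomial representation (4.2) is easy to exploit numerically; a small sketch, for a hypothetical set of eigenvalues:

```python
import numpy as np

def counts_from_eigenvalues(lam, size=10_000, rng=None):
    """Sample N(S) = sum_k B_k with B_k ~ Bernoulli(lambda_k^S), as in
    (4.2), and return the samples plus the closed-form summaries."""
    rng = np.random.default_rng(rng)
    lam = np.asarray(lam)
    draws = rng.random((size, lam.size)) < lam   # independent Bernoulli draws
    n = draws.sum(axis=1)
    mean = lam.sum()                             # E(N(S))
    var = (lam * (1 - lam)).sum()                # Var(N(S))
    p_empty = np.prod(1 - lam)                   # P(N(S) = 0)
    return n, mean, var, p_empty

# hypothetical eigenvalues of C restricted to S, all strictly below 1
lam = np.array([0.9, 0.6, 0.3, 0.1])
n, mean, var, p_empty = counts_from_eigenvalues(lam, size=200_000, rng=1)
```

Note that P(N(S) = 0) is strictly positive here, in line with the remark above.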
From now on we restrict our attention to stationary DPPs, that is, C(x, y) = C_0(x − y), where C_0 ∈ L²(R^d) is such that its spectral density ϕ exists, i.e.

C_0(x) = ∫_{R^d} ϕ(y) cos(2π x · y) dy, x ∈ R^d,

and x · y is the scalar product in R^d. If ϕ ∈ L¹(R^d) and 0 ≤ ϕ ≤ 1, then the DPP(C) process exists. Summing up, the distribution of a stationary DPP can be assigned through its spectral density; see Corollary 3.3 in Lavancier et al. (2015).
To explicitly evaluate (4.3) over S = [−1/2, 1/2]^d, we approximate C̃ as suggested in Lavancier et al. (2015). In other words, we approximate the density of X on S by

f_app(x_1, ..., x_n) = e^{|S| − D_app} det[C̃_app](x_1, ..., x_n), {x_1, ..., x_n} ⊂ S,  (4.4)

where

C̃_app(x, y) = C̃_app,0(x − y) = ∑_{k ∈ Z^d} (ϕ(k) / (1 − ϕ(k))) cos(2πk · (x − y)), x, y ∈ S,  (4.5)

D_app = ∑_{k ∈ Z^d} log(1 + ϕ(k) / (1 − ϕ(k))).
To understand why the approximation C(x, y) ≈ C_app,0(x − y) (x − y ∈ S) follows, as well as the corresponding approximation for the tilted versions of these functions, we observe that the exact Fourier expansion of C_0(x − y) on S is as in (4.5) with the real part of ∫_S C_0(y) e^{−2πik·y} dy in place of ϕ(k); if we assume C_0 such that C_0(t) ≈ 0 for t ∉ S, then

Re( ∫_S C_0(y) e^{−2πik·y} dy ) ≈ ϕ(k) := Re( ∫_{R^d} C_0(y) e^{−2πik·y} dy ).
See also Lavancier et al. (2015), Section 4.1. Figure 4.1 displays the value of C_0(t) corresponding to the Gaussian spectral density where s = 0.5 and ρ varies as in the legend; the vertical dashed line marks the right endpoint of the set S = [−1/2, 1/2].

Figure 4.1: Value of C_0(t) corresponding to the Gaussian spectral density when s = 0.5 and ρ is equal to 0.1, 2, 5.

The approximation C_0(t) ≈ 0 for t ∉ S holds very accurately when ρ is small. The higher ρ is, the slower the decay rate of the function C_0(t).
When R is a rectangle in R^d, we can always find an affine transformation T such that T(R) = S = [−1/2, 1/2]^d. Define Y = T(X). If f_Y^app is the approximate density of Y as in (4.4), we can then approximate the density of X_R by

f_app(x_1, ..., x_n) = |R|^{−n} e^{|R| − |S|} f_Y^app(T(x_1), ..., T(x_n)), {x_1, ..., x_n} ⊂ R.  (4.6)

In practice, the summation over Z^d in (4.5) above is truncated to Z_N^d, where Z_N := {−N, −N + 1, ..., 0, ..., N − 1, N} (see Section 4.3 in Lavancier et al., 2015).
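The truncated evaluation of the approximate density can be sketched as follows, for d = 1 and a toy spectral density; any valid ϕ taking values in [0, 1) can be plugged in, and this is an illustration rather than the chapter's exact implementation:

```python
import numpy as np

def f_app(points, phi, N=50):
    """Approximate DPP density on S = [-1/2, 1/2] (d = 1), truncating
    the spectral sum over Z to Z_N = {-N, ..., N}."""
    ks = np.arange(-N, N + 1)
    ph = phi(ks)
    tilt = ph / (1 - ph)                    # phi(k) / (1 - phi(k))
    D_app = np.log1p(tilt).sum()
    x = np.asarray(points, dtype=float)
    diff = x[:, None] - x[None, :]
    C_app = np.tensordot(tilt, np.cos(2 * np.pi * ks[:, None, None] * diff),
                         axes=1)
    sign, logdet = np.linalg.slogdet(C_app)
    return sign * np.exp(1.0 - D_app + logdet)   # |S| = 1 here

# toy Gaussian-type spectral density with phi(0) = 0.5 < 1
phi = lambda k: 0.5 * np.exp(-0.1 * k**2)
spread = f_app([-0.3, 0.0, 0.3], phi)
clumped = f_app([-0.1, 0.0, 0.1], phi)
```

The determinant penalizes nearby points, so well-separated configurations receive higher density than clumped ones; `slogdet` is used because the determinant can be numerically small for nearly coincident points.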
A particular example of spectral density that we found useful is

ϕ(x; ρ, ν) = s^d exp( −(s/√π)^ν (Γ(d/2 + 1) / Γ(d/ν + 1))^{ν/d} ρ^{ν/d} ‖x‖^ν ), ρ, ν > 0,  (4.7)

for fixed s ∈ (0, 1) (e.g. s = 1/2), where ‖x‖ is the Euclidean norm of x ∈ R^d. This function is the spectral density of a power exponential spectral model (see (3.22) in Lavancier et al. (2015) when α = s α_max(ρ, ν)). In this case, we write X ∼ PES-DPP(ρ, ν). The corresponding spatial process is isotropic. When ν = 2, the spectral density is

ϕ(x; ρ, 2) = s^d exp( −(s/√π)² ρ^{2/d} ‖x‖² ), ρ > 0,

corresponding to the Gaussian spectral density. We discuss the choice of (4.7) more specifically later, in Section 4.4.
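A sketch of (4.7) as reconstructed here (defaults s = 1/2 and d = 1); the checks below only use properties stated in the text, namely ϕ(0) = s^d and 0 ≤ ϕ ≤ s^d < 1:

```python
import numpy as np
from math import gamma, pi

def phi_pes(r, rho, nu, s=0.5, d=1):
    """Power exponential spectral density (4.7), evaluated at r = ||x||."""
    c = (s / np.sqrt(pi))**nu * (gamma(d / 2 + 1) / gamma(d / nu + 1))**(nu / d)
    return s**d * np.exp(-c * rho**(nu / d) * np.abs(r)**nu)

r = np.linspace(0.0, 5.0, 101)
vals = phi_pes(r, rho=2.0, nu=2.0)
```

Since the maximum ϕ(0) = s^d is strictly below 1, the existence condition on the spectral density is satisfied for every ρ, ν > 0.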
4.2.2 The mixture model with repulsive means
To deal with the limitations of model (4.1) or DPMs, we consider repulsive mixtures. Our aim is to estimate a random partition of the available subjects, and we want to do so using few groups. By repulsion we mean that cluster locations are a priori encouraged to be well separated, thus inducing fewer clusters than if they were allowed to be selected independently. We start from parametric densities f(·; θ), which we take to be Gaussian, and assume that the collection of location parameters follows a DPP. We specify a hierarchical model that achieves the goals previously described. Concretely, we propose:
y_i | s_i = k, µ_k, σ²_k, K ind∼ N(y_i; µ_k, σ²_k), i = 1, ..., n  (4.8)
X = {µ_1, µ_2, ..., µ_K}, K ∼ PES-DPP(ρ, ν)  (4.9)
(ρ, ν) ∼ π  (4.10)
p(s_i = k) = w_k, k = 1, ..., K, for each i  (4.11)
(w_1, ..., w_K) | K ∼ Dirichlet(δ, δ, ..., δ)  (4.12)
σ²_k | K iid∼ IG(a_0, b_0), k = 1, ..., K,  (4.13)
where the PES-DPP(ρ, ν) assumption (4.9) is regarded as a default choice that could be replaced by any other valid DPP alternative. The choice of π in (4.10) will be discussed below in Section 4.4. We note that, as stated, the prior model may assign positive probability to the case K = 0. This case of course makes no sense from the viewpoint of the model described above. Nevertheless, we adopt the working convention of redefining the prior to condition on K ≥ 1, i.e., truncating the DPP to have at least one point. In practice, the posterior simulation scheme described later simply ignores the case K = 0, which produces the desired result. Note also that we have assumed prior independence among blocks of parameters not involving the locations µ_k.
Model (4.8)-(4.13) is a DPP mixture model along the lines proposed in Xu et al. (2016). Indeed, both works use DPPs as priors for the location points in a mixture of parametric densities. However, the specific DPP priors are different: Xu et al. (2016) restrict to a particular case of DPPs (L-ensembles) and choose a Gaussian covariance function for which eigenvalues and eigenfunctions are analytically available. We adopt instead the more general spectral approach for assigning the prior (4.9). Similarly to Xu et al. (2016), we carry out posterior simulation using a reversible jump step as part of the Gibbs sampler. However, when updating the location points µ_1, ..., µ_K we refer to formulas (4.4)-(4.6). Xu et al. (2016) take advantage of analytical expressions that are not available in our case, and that are also unavailable under other possible specific choices of the spectral density. As a general comment, we underline that the numerical evaluation of the DPP density, involving the computation of the determinant of a K × K matrix, is not particularly expensive, even for a large dataset; in this case, the repulsion property will favor a moderate number K of clusters. See Section 4.4.4, where we describe applications of this model to datasets, using the posterior simulation algorithms described in Section 4.2.4. In our experience, the proposed model scales well compared to mixtures with independent components.
4.2.3 Competitor repulsive models
We briefly introduce the class of parsimonious mixture models in Quinlan et al. (2017), to be used as a competitor model in our applications. Quinlan et al. (2017) exploit the idea of repulsion, i.e. any two mixture components are encouraged to be well separated, as we do. For the sake of comparison, we introduce their model for unidimensional data: similarly to our case, they consider a mixture of K Gaussian components, but assume a fixed value k for K in (4.8) and (4.11)-(4.13). The prior for the location parameters µ_1, ..., µ_k is called a repulsive distribution, denoted by NRep_k(µ, Σ, τ), where µ ∈ R and Σ, τ > 0; see (3.4)-(3.6) in Quinlan et al. (2017). This prior is characterized by a repulsion potential of the following form:

φ_1(r; τ) = −log( 1 − e^{−(1/2)τr²} ) 1_{(0,+∞)}(r), τ > 0.

Petralia et al. (2012) use a similar model, where the repulsion potential is

φ_2(r; τ) = (τ / r²) 1_{(0,+∞)}(r), τ > 0.

Potential φ_2 introduces a stronger repulsion than φ_1, in the sense that in Petralia et al. (2012) locations are encouraged to be further apart than in Quinlan et al. (2017). Note also that, by the nature of the point process, our approach does not require an upper bound on the allowed number of mixture components (similarly to DPM models), contrary to the approaches in Quinlan et al. (2017) and Petralia et al. (2012). The posterior simulation algorithm we propose for our model is described in Section 4.2.4.
4.2.4 Gibbs sampler for model in Section 4.2.2
Posterior inference for our DPP mixture model (4.8)-(4.13) is carried out using a Gibbs sampler algorithm. The full conditionals are outlined below; we provide the details of the computation only when the conditional posterior distribution is not straightforward. In what follows, "rest" refers to the data and all parameters except for the one to the left of the conditioning bar.

The labels s_1, ..., s_n are independently distributed according to a discrete distribution with support {1, 2, ..., K}:

p(s_i = k | rest) ∝ w_k N(y_i; µ_k, σ²_k).  (4.14)
The distribution of the weights w_1, ..., w_K is conjugate: the conditional distribution is still a Dirichlet distribution, with parameters δ + n_k, k = 1, ..., K.

The variances σ²_1, ..., σ²_K of the mixture components are generated independently according to the following distribution:

σ²_k | rest ∼ IG( a_0 + n_k/2, b_0 + (1/2) ∑_{i: s_i=k} (y_i − µ_k)² ), k = 1, ..., K.
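The conjugate updates above can be combined into one Gibbs sweep; a minimal sketch for the univariate model, with the locations held fixed (their update requires the Metropolis step for the DPP factor, discussed next in the text):

```python
import numpy as np

def gibbs_scan_conjugate(y, mu, sigma2, w, delta, a0, b0, rng):
    """One sweep of the conjugate updates: labels, Dirichlet weights,
    inverse-gamma variances. The locations mu are kept fixed here."""
    K = len(mu)
    # labels: p(s_i = k | rest) proportional to w_k N(y_i; mu_k, sigma2_k)
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * (y[:, None] - mu)**2 / sigma2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    s = np.array([rng.choice(K, p=row) for row in p])
    # weights: Dirichlet(delta + n_1, ..., delta + n_K)
    nk = np.bincount(s, minlength=K)
    w = rng.dirichlet(delta + nk)
    # variances: sigma2_k | rest ~ IG(a0 + n_k/2, b0 + 0.5 sum (y_i - mu_k)^2)
    sigma2 = np.array([
        1.0 / rng.gamma(a0 + nk[k] / 2,
                        1.0 / (b0 + 0.5 * np.sum((y[s == k] - mu[k])**2)))
        for k in range(K)])
    return s, w, sigma2

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3, 1, 40), rng.normal(3, 1, 60)])
mu = np.array([-3.0, 3.0])
s, w, sigma2 = gibbs_scan_conjugate(y, mu, np.ones(2), np.ones(2) / 2,
                                    delta=1.0, a0=3.0, b0=3.0, rng=rng)
```

An inverse-gamma draw is obtained as the reciprocal of a gamma draw with the same shape and matching rate.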
Sampling the means µ_1, ..., µ_K needs more care: following the reasoning in Xu et al. (2016), this full conditional can be written as

p(µ_1, ..., µ_K | rest) ∝ det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) ∏_{k=1}^{K} ∏_{i: s_i=k} N(y_i; µ_k, σ²_k)
∝ ∏_{k=1}^{K} ( C̃(µ̃_k, µ̃_k) − b C̃_{−k}^{−1} bᵀ ) ∏_{i: s_i=k} N(y_i; µ_k, σ²_k),

thanks to the Schur determinant identity. Note that det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) in the above expression follows from the expression of the density of a DPP on a compact set; see (4.6). Here b is the vector b = C̃(µ̃_k, µ̃_{−k}), with µ̃_{−k} = {µ̃_j}_{j≠k}, and C̃_{−k} is the (K − 1) × (K − 1) matrix C̃(µ̃_{−k}; ρ, ν). Moreover, µ̃_k = T(µ_k) is the transformed variable that takes values in the set S = [−1/2, 1/2]^d. Typically, the rectangle R such that T(R) = S is fixed so that it is large and comfortably contains all the data points. We update each mean µ_k separately, for k = 1, ..., K, using a Metropolis-Hastings step.
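The single-location Metropolis step can be sketched as follows; this is a self-contained toy version with a hypothetical Gaussian-type kernel, working directly on rescaled locations and ignoring the transformation T:

```python
import numpy as np

def schur_factor(k, mus, kernel):
    """The k-th factor C(mu_k, mu_k) - b C_{-k}^{-1} b^T of the DPP
    determinant, via the Schur identity."""
    C = kernel(mus[:, None], mus[None, :])
    idx = [j for j in range(len(mus)) if j != k]
    if not idx:
        return C[k, k]
    b = C[k, idx]
    return C[k, k] - b @ np.linalg.solve(C[np.ix_(idx, idx)], b)

def mh_update_mu(k, mus, y, s, sigma2, kernel, step, rng):
    """Random-walk Metropolis step for one location; the DPP prior
    enters only through the Schur factor of component k."""
    prop = mus.copy()
    prop[k] += step * rng.standard_normal()
    yk = y[s == k]
    def log_target(m):
        loglik = -0.5 * np.sum((yk - m[k])**2) / sigma2[k]
        return loglik + np.log(max(schur_factor(k, m, kernel), 1e-300))
    if np.log(rng.random()) < log_target(prop) - log_target(mus):
        return prop
    return mus

rng = np.random.default_rng(1)
kernel = lambda x, y: np.exp(-((x - y)**2) / 0.05)   # toy repulsive kernel
mus = np.array([-0.25, 0.0, 0.25])
y = np.concatenate([rng.normal(m, 0.05, 30) for m in mus])
s = np.repeat([0, 1, 2], 30)
new_mus = mh_update_mu(1, mus, y, s, np.full(3, 0.05**2), kernel, 0.02, rng)
```

Updating one location at a time means only one Schur factor changes per step, which is what makes the factorized form above convenient.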
The full conditional for the parameters (ρ, ν) is

p(ρ, ν | rest) ∝ det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) exp( −∑_{k=−N}^{N} log(1 + ϕ(k; ρ, ν)/(1 − ϕ(k; ρ, ν))) ) π(ρ, ν).

The adaptive Metropolis-Hastings algorithm of Roberts and Rosenthal (2009) is employed in this case, in order to obtain better mixing of the chains and to avoid hand-tuning the parameters of the proposal distribution.
In order to sample K we need a reversible jump step: standard proposals to estimate mixtures of densities with a variable number of components are based on moment matching (Richardson and Green, 1997) and have been used relatively often in the literature. The idea is to build a proposal that preserves the first two moments before and after the move, as in Xu et al. (2016). In particular, the only possible moves are the split move, passing from K to K + 1, and the combine move, from K to K − 1.
(i) Choose move type: uniformly choose between the split and the combine move (however, if K = 1 the only possibility is to split).

(ii.a) Combine: randomly select a pair (j_1, j_2) to merge into a new component indexed by j_1. The following relations must hold:

w^{new}_{j_1} = w_{j_1} + w_{j_2}
w^{new}_{j_1} µ^{new}_{j_1} = w_{j_1} µ_{j_1} + w_{j_2} µ_{j_2}
w^{new}_{j_1} ( (µ^{new}_{j_1})² + (σ^{new}_{j_1})² ) = w_{j_1} (µ_{j_1}² + σ_{j_1}²) + w_{j_2} (µ_{j_2}² + σ_{j_2}²)

(ii.b) Split: randomly select a component j to split into two new components. In this case, we impose the following relationships:

w^{new}_{j_1} = α w_j,  w^{new}_{j_2} = (1 − α) w_j
µ^{new}_{j_1} = µ_j − (w^{new}_{j_2}/w^{new}_{j_1})^{1/2} r (σ_j²)^{1/2},  µ^{new}_{j_2} = µ_j + (w^{new}_{j_1}/w^{new}_{j_2})^{1/2} r (σ_j²)^{1/2}
(σ^{new}_{j_1})² = β (1 − r²) (w_j/w^{new}_{j_1}) σ_j²,  (σ^{new}_{j_2})² = (1 − β) (1 − r²) (w_j/w^{new}_{j_2}) σ_j²

where α ∼ Beta(1, 1), β ∼ Beta(1, 1) and r ∼ Beta(2, 2).
(iii) Probability of acceptance: the proposed move is accepted with probability min(1, 1/q(proposed, old)) if we selected a combine step, and min(1, q(old, proposed)) in the split case. In particular,

q(old, proposed) = |det(J)| ( p(K+1, w^{new}, µ^{new}, σ^{2,new} | y) / p(K, w^{old}, µ^{old}, σ^{2,old} | y) ) × ( p^{split}_{K+1} (1/(K+1)) ) / ( (K+1) p^{comb}_K p(α) p(β) p(r) ),

where

|det(J)| = ( w_j⁴ / (w^{new}_{j_1} w^{new}_{j_2})^{3/2} ) (σ_j²)^{3/2} (1 − r²)

and

p(K+1, w^{new}, µ^{new}, σ^{2,new} | y) / p(K, w^{old}, µ^{old}, σ^{2,old} | y) = ( likelihood(w^{new}, µ^{new}, σ^{2,new}) / likelihood(w^{old}, µ^{old}, σ^{2,old}) ) × ( π(σ^{2,new}_{j_1}) π(σ^{2,new}_{j_2}) / π(σ_j²) ) × ( Dirichlet_{K+1}(w^{new}) / Dirichlet_K(w^{old}) ) × ( det(C̃_{K+1}) / det(C̃_K) ).

Moreover, p^{split}_{K+1} = 0.5 if K > 1 and 1 otherwise; p^{comb}_K = 0.5 if K > 1 and 0 otherwise.
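The moment-matching relations in step (ii.b) can be verified numerically. A sketch follows; note the opposite signs on the two proposed locations, which is what preserves the overall mean:

```python
import numpy as np

def split_component(w, mu, sigma2, alpha, beta, r):
    """Moment-matching split of (w, mu, sigma2) into two components,
    in the Richardson-Green style used in Section 4.2.4."""
    w1, w2 = alpha * w, (1 - alpha) * w
    mu1 = mu - np.sqrt(w2 / w1) * r * np.sqrt(sigma2)
    mu2 = mu + np.sqrt(w1 / w2) * r * np.sqrt(sigma2)
    s1 = beta * (1 - r**2) * (w / w1) * sigma2
    s2 = (1 - beta) * (1 - r**2) * (w / w2) * sigma2
    return (w1, mu1, s1), (w2, mu2, s2)

(w1, m1, s1), (w2, m2, s2) = split_component(0.6, 2.0, 1.5,
                                             alpha=0.3, beta=0.7, r=0.4)
# the combine relations of (ii.a) hold exactly for these proposals:
# total weight, weighted mean, and weighted second moment are preserved
```

Because the split is exactly invertible via the combine relations, the pair of moves defines a valid dimension-changing proposal.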
We note that in the case of multidimensional data points, the parameters include covariance matrices Σ_1, Σ_2, ..., Σ_K instead of the scalar variances σ²_1, ..., σ²_K; however, the marginal inverse-Wishart prior distribution is semi-conjugate, yielding an update of these parameters similar to that of a standard Normal-Normal/inverse-Wishart model. The main difficulty again lies in the reversible jump step: we modified points (ii.a), (ii.b) and (iii) described above according to the algorithm in Dellaportas and Papageorgiou (2006), Section 3.1. The basic idea is to build moves on the space of eigenvectors and eigenvalues of the current covariance matrix, so that the proposed covariance matrices are positive definite.
4.3 Generalization to covariate-dependent models
The methods discussed in Section 4.2 were devised for density estimation-like problems. We now extend the previous modeling to the case where p-dimensional covariates x_1, ..., x_n are recorded as well. We do so by allowing the mixture weights to depend on such covariates. In this case, there is a trade-off between the repulsiveness of the locations in the mixture and the attraction among subjects with similar covariates. We also entertain the case where covariate dependence is added to the likelihood part of the model. Our modeling choice here is akin to mixtures of experts models (see, e.g., McLachlan and Peel, 2005), i.e., the weights are defined by means of normalized exponential functions.

Building on the model from Section 4.2.2, we assume the same likelihood (4.8) and the DPP prior for X = {µ_1, µ_2, ..., µ_K}, K in (4.9)-(4.10), but change (4.11) and (4.12) to

p(s_i = k) = w_k(x_i) = exp(β_kᵀ x_i) / ∑_{l=1}^{K} exp(β_lᵀ x_i), k = 1, ..., K  (4.15)

β_2, ..., β_K | K iid∼ N_p(β_0, Σ_0), β_1 = 0,  (4.16)

where the β_1 = 0 assumption ensures identifiability. To complete the model, we assume (4.13) as the conditional marginal for σ²_k; the prior for (ρ, ν) in (4.10) is specified later. Here β_0 ∈ R^p, and to choose Σ_0 we use a g-prior approach, namely Σ_0 = φ × (XᵀX)^{−1}, where φ is fixed, typically of the same order of magnitude as the sample size (see Zellner, 1986).
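The normalized-exponential weights (4.15) and the g-prior choice of Σ_0 can be sketched as follows (the scale φ appears as the `scale` argument):

```python
import numpy as np

def mixture_weights(X, betas):
    """Normalized-exponential weights (4.15); betas is (K, p) with its
    first row fixed to zero for identifiability."""
    eta = X @ betas.T                          # (n, K) linear predictors
    eta -= eta.max(axis=1, keepdims=True)      # numerical stabilization
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

def g_prior_cov(X, scale):
    """Sigma_0 = scale * (X^T X)^{-1}, Zellner's g-prior choice."""
    return scale * np.linalg.inv(X.T @ X)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
betas = np.array([[0.0, 0.0], [1.0, -0.5], [-0.8, 0.3]])   # K = 3, beta_1 = 0
W = mixture_weights(X, betas)
Sigma0 = g_prior_cov(X, scale=100.0)
```

Subtracting the row maximum before exponentiating leaves the weights unchanged but avoids overflow for large linear predictors.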
Assuming (4.8) on top of (4.15)-(4.16) rules out the case of a likelihood explicitly depending on covariates, which would generally achieve a better fit. Of course, there are many ways in which such dependence may be added. For the sake of concreteness, we assume here a Gaussian regression likelihood, where only the intercept parameters arise from the DPP prior. More precisely, we assume

y_i | s_i = k, x_i, µ_k, σ²_k, K ind∼ N(y_i; µ_k + x_iᵀ γ_k, σ²_k), i = 1, ..., n  (4.17)

(γ_1, σ²_1), ..., (γ_K, σ²_K) | K iid∼ N-IG(γ_0, Λ_0, a_0, b_0),  (4.18)

where the γ_k's are p-dimensional regression coefficients. The notation in (4.18) means that γ_k | σ²_k ∼ N_p(γ_0, σ²_k Λ_0) and σ²_k ∼ IG(a_0, b_0), where γ_0 ∈ R^p and Λ_0 is a covariance matrix. The prior for the s_i's and the β_j's is given in (4.15)-(4.16), as in the previous model. Note that (4.17) implies that only the intercept term is distributed according to the repulsive prior. Thus, we allow the response mean to be corrected by a linear combination of the covariates with cluster-specific coefficients, with the repulsion acting only on the residual of this regression. The result is a more flexible model than the repulsive mixture (4.8)-(4.13). Observe that there is no need to assume the same covariate vector in (4.17) and (4.15), but we do so for illustration purposes only.
The Gibbs sampler algorithm employed to carry out posterior inference for this model is detailed in Section 4.3.1. However, it is worth noting that the reversible jump step related to updating the number of mixture components K, and the update of the coefficients β_2, β_3, ..., β_K, are complicated by the presence of the covariates. For the β coefficients, we resort to a Metropolis-Hastings step, with a multivariate Gaussian proposal centered at the current value. For K, we employ an ad hoc reversible jump move.
4.3.1 Gibbs sampler in presence of covariates
The Gibbs sampler algorithm employed to carry out posterior inference for model (4.17)-(4.18), (4.9)-(4.10), (4.15)-(4.16) differs from the one in Section 4.2.4 except for the full conditional of (ρ, ν). The sampling of the labels {s_i}_{i=1}^{n} differs from (4.14), since now

p(s_i = k | rest) ∝ w_k(x_i) N(y_i; µ_k + x_iᵀ γ_k, σ²_k) ∝ exp(β_kᵀ x_i) N(y_i; µ_k + x_iᵀ γ_k, σ²_k).

The sampling of {µ_k}_{k=1}^{K} is similar to the corresponding step in Section 4.2.4, but now

p(µ_1, ..., µ_K | rest) ∝ det[C̃](µ̃_1, ..., µ̃_K; ρ, ν) ∏_{k=1}^{K} ∏_{i: s_i=k} N(y_i − x_iᵀ γ_k; µ_k, σ²_k).
However, the substantial change from the model without covariates to the model with covariates lies in the update of K, the number of components in the mixture, and of β_2, ..., β_K (recall that β_1 = 0 for identifiability reasons); these are indeed complicated by the presence of the covariates. Moreover, the update of σ²_k is now replaced by

p(γ_k, σ²_k | rest) ∝ ∏_{i: s_i=k} N(y_i; µ_k + x_iᵀ γ_k, σ²_k) N_p(γ_k; 0, σ²_k Λ_0) IG(σ²_k; a_0, b_0)
∝ (2πσ²_k)^{−n_k/2} exp( −(1/(2σ²_k)) ∑_{i: s_i=k} (y_i − µ_k − x_iᵀ γ_k)² ) N_p(γ_k; 0, σ²_k Λ_0) IG(σ²_k; a_0, b_0),

where n_k = #{i : s_i = k}; here we assume the prior mean vector of γ_k, γ_0, to be equal to the zero vector. This full conditional is the posterior of the standard conjugate normal likelihood, normal/inverse-gamma regression model. In particular, we have that

γ_k | σ²_k, rest ∼ N_p(m*, σ²_k Λ*)

with Λ* = ( Λ_0^{−1} + ∑_{i: s_i=k} x_i x_iᵀ )^{−1} and m* = Λ* ( ∑_{i: s_i=k} y_i x_i ). Moreover,

σ²_k | rest ∼ IG( a_0 + n_k/2, b_0 + (1/2)( ∑_{i: s_i=k} y_i² − m*ᵀ (Λ*)^{−1} m* ) ).
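The conjugate (γ_k, σ²_k) update above can be sketched as follows; this is an illustration with prior mean γ_0 = 0 as in the text, and the response vector is taken net of the intercept µ_k:

```python
import numpy as np

def update_gamma_sigma2(yk, Xk, Lambda0, a0, b0, rng):
    """Conjugate normal / inverse-gamma update for (gamma_k, sigma2_k),
    mirroring the full conditionals of the regression model."""
    Lstar = np.linalg.inv(np.linalg.inv(Lambda0) + Xk.T @ Xk)
    mstar = Lstar @ (Xk.T @ yk)
    shape = a0 + len(yk) / 2
    rate = b0 + 0.5 * (yk @ yk - mstar @ np.linalg.inv(Lstar) @ mstar)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    gamma_k = rng.multivariate_normal(mstar, sigma2 * Lstar)
    return gamma_k, sigma2

rng = np.random.default_rng(3)
Xk = rng.standard_normal((50, 2))
yk = Xk @ np.array([1.0, -2.0]) + 0.3 * rng.standard_normal(50)
gamma_k, sigma2 = update_gamma_sigma2(yk, Xk, np.eye(2), a0=3.0, b0=3.0, rng=rng)
```

With informative data, the posterior mean m* shrinks only mildly toward the zero prior mean, so the draw concentrates near the data-generating coefficients.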
The full conditional for the coefficients β_k, k = 2, ..., K, is

p(β_2, ..., β_K | rest) ∝ ∏_{k=1}^{K} ∏_{i: s_i=k} ( exp(β_kᵀ x_i) / ∑_{ℓ=1}^{K} exp(β_ℓᵀ x_i) ) N_p(β_k; β_0, Σ_0),

which has no closed form. Therefore we resort to a Metropolis-Hastings step with a multivariate Gaussian proposal, centered at the current value of the vector and with a diagonal covariance matrix ζ I_{p×p}, where ζ is a tuning parameter chosen to guarantee good convergence of the chains.
On the other hand, the update of K requires a reversible jump-type move. However, the approach used in Section 4.2.4 above is difficult to implement when the mixing weights depend on covariates, as in this case, so we need another way to define a transition probability. Our approach is similar to that of Norets (2015), with some differences highlighted below.

As before, there are two available moves: split or combine. The probability of proposing each of them is 0.5, except if K = 1, when only the split move can be proposed.

Split: if this move is picked, K^prop = K + 1, so we need to create a new group and its corresponding parameters (the other parameters are kept fixed):

(i) randomly pick one cluster, say j, containing at least two items;

(ii) randomly divide the data associated to this group, y_j, into two subgroups, y_{j_1} and y_{j_2};
(iii) set γ_{j_1} = γ_j, σ²_{j_1} = σ²_j, β_{j_1} = β_j, µ_{j_1} = µ_j. Now we need to choose values for γ_{j_2}, σ²_{j_2}, β_{j_2} and µ_{j_2}. In Norets (2015), this is done by sampling the new values from the posterior, conditioning also on the other parameters (even if, for practical purposes, Gaussian approximations of the conditional posteriors are used in the implementation of the algorithm). Instead, we sample (µ_{j_2}, γ_{j_2}, σ²_{j_2}) from the posterior of the following auxiliary model:

y_{j_2} | µ, γ, σ² iid∼ N(µ + x_{j_2}ᵀ γ, σ²)
γ̄ = [µ, γ]ᵀ | σ² ∼ N_{p+1}(0, σ² Γ_0)
σ² ∼ IG(ξ_0, ν_0),

where x_{j_2} and y_{j_2} represent covariates and responses in the new group with label j_2, respectively. The parameter β_{j_2} is sampled from a p-dimensional Gaussian distribution with mean β_mode and covariance matrix Σ_mode. In particular, β_mode is the argmax of the following expression:

∏_{i: s_i=j_2} ( exp(β_{j_2}ᵀ x_i) / ( exp(β_{j_2}ᵀ x_i) + ∑_{j≠j_2} E(exp(β_jᵀ x_i)) ) ) N_p(β_{j_2}; β_0, Σ_0),

which corresponds to an approximation of the full conditional of β_{j_2} (we dropped the dependence on the other β_j's by considering the expected value in the denominator). Note that E(exp(β_jᵀ x_i)) is nothing but the moment generating function, and thus equals exp(β_0ᵀ x_i + x_iᵀ Σ_0 x_i / 2).
Combine: here K^prop = K − 1, so it suffices to collapse two groups into one. Specifically, we randomly choose one group to delete, say j_1, and remove the corresponding parameters β_{j_1}, µ_{j_1} and σ²_{j_1}. Then, we choose another group, j_2, and assign all the data y_{j_1} to it.
Acceptance rate: this is simply given by

α(K → K+1) = ( p(y | K+1, θ_{K+1}) π(K+1, θ_{K+1}) / ( p(y | K, θ_K) π(K, θ_K) ) ) × ( 1 / f(µ_{j_2}, γ_{j_2}, σ²_{j_2}, β_{j_2}) ) × ( p^S_{K+1} / p^C_{K+1} ) × ( p_c(j_1, j_2) / p_s(j) )

α(K → K−1) = ( p(y | K−1, θ_{K−1}) π(K−1, θ_{K−1}) / ( p(y | K, θ_K) π(K, θ_K) ) ) × f(µ_{j_1}, γ_{j_1}, σ²_{j_1}, β_{j_1}) × ( p^C_{K−1} / p^S_K ) × ( p_s(j) / p_c(j_1, j_2) )

where θ_K = (σ²_{1:K}, γ_{1:K}, µ_{1:K}, β_{1:K}) and f denotes the proposal density of the newly created (or deleted) parameters. Moreover, p_s(j) is the probability of splitting component j, and similarly for the other terms.
4.4 Simulated data and reference datasets
Before illustrating the application of our models to specific datasets, we discuss some general choices that apply to all examples. Every run of the Gibbs sampler (implemented in R) produced a final sample size of 5,000 or 10,000 iterations (unless otherwise specified), after a thinning of 10 and an initial burn-in of 5,000 iterations. In all cases, convergence was checked using both visual inspection of the chains and the standard diagnostics available in the CODA package. Elicitation of the prior for (ρ, ν) requires some care, as the role of these parameters is difficult to interpret. Therefore, an extensive robustness analysis with respect to π(ρ, ν) was carried out for those datasets; see Sections 4.4.1 and 4.4.3. We point out that an initial prior independence assumption π(ρ, ν) = π(ρ) π(ν) produced bad mixing of the chain. In particular, when ρ is small with respect to ν, the spectral function ϕ(·) has a very narrow support, concentrated near the origin, forcing the covariance function C_app(x, y) to become nearly constant for x, y ∈ S and thus producing nearly singular matrices. We next investigated the case π(ρ, ν) = π(ρ | ν) × π_ν(ν), where

ρ | ν =^d M(s, ε, ν) + ρ_0, ρ_0 ∼ gamma(a_ρ, b_ρ).  (4.19)

Here, M(s, ε, ν) is a constant, namely the maximum value of ρ such that ϕ(2; ρ, ν) > ε (the argument x = 2 serves as a reference point, chosen to avoid too narrow a support), and ε is a
threshold value, assumed to be small (0.05, for instance).

Figure 4.2: Power exponential spectral density ϕ(x; ρ, ν) when ρ is 2 (left) and 100 (right) and ν varies in {1, 2, 5, 10, 30}.

From Figure 4.2 it is clear that ϕ(·; ρ, ν) goes to 0 too fast when ν is small relative to ρ. It follows that, in the case d = 1,

M(s, ε, ν) = ( log(s/ε) )^{1/ν} Γ(1/ν + 1) / s.
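Under the parametrization used here, M(s, ε, ν) for d = 1 and the conditional prior (4.19) can be coded as follows; the agreement with the ν = 2, s = 1/2 special case of (4.20) serves as an internal check:

```python
import numpy as np
from math import gamma, log, pi, sqrt

def M(s, eps, nu):
    """The shift M(s, eps, nu) for d = 1: the largest rho such that
    phi(2; rho, nu) > eps."""
    return log(s / eps)**(1 / nu) * gamma(1 / nu + 1) / s

def sample_rho(nu, s=0.5, eps=0.05, a_rho=1.0, b_rho=1.0, size=1, rng=None):
    """Draw rho | nu = M(s, eps, nu) + rho0 with rho0 ~ gamma(a_rho, b_rho),
    as in (4.19)."""
    rng = np.random.default_rng(rng)
    return M(s, eps, nu) + rng.gamma(a_rho, 1.0 / b_rho, size=size)

rho_draws = sample_rho(nu=2.0, size=1000, rng=4)
```

Shifting the gamma draw by M(s, ε, ν) keeps the prior support away from the degenerate region where the spectral density is too narrow.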
On the other hand, two different choices for π_ν were considered: a gamma distribution, which gave bad chain mixing, and a discrete distribution on V_2 = {0.5, 1, 2, 3, 5, 10, 15, 20, 30, 50} (or on one of its subsets). In this case, the mixing of the chain was better, but the posterior for ν did not discriminate among the values in the support. For this reason, in Sections 5.3, 6 and 7, we assume ν = 2, s = 1/2 and

ρ =^d √( π log(1/(2ε)) ) + ρ_0, ρ_0 ∼ gamma(a_ρ, b_ρ).  (4.20)
4.4.1 Data illustration without covariates: reference datasets
We illustrate our model via two datasets without covariates, with unidimensional (Galaxy data) and bidimensional (Air Quality data) observations, both publicly available in R (galaxy from the DPpackage and airquality in the base version). For the latter dataset we removed 42 incomplete observations.

The popular Galaxy dataset contains n = 82 measured velocities of different galaxies from six well-separated conic sections of space. Values are expressed in km/s, scaled by a factor of 10^{−3}. We set the hyperparameters as follows: for the variance σ²_k of the components, (a_0, b_0) = (3, 3) (so that the prior mean is 1.5 and the prior variance is 9/4), and for the weights w_k the Dirichlet distribution has parameter (1, 1, ..., 1).
The other hyperparameters are varied across the tests, as in Table 4.1, where we report summaries of interest, such as the prior and posterior mean and variance of the number of components K. In addition, we also display the mean squared error (MSE) and the log-pseudo marginal likelihood (LPML) as indices of goodness of fit, defined as MSE = ∑_{i=1}^{n} (y_i − ŷ_i)² and LPML = ∑_{i=1}^{n} log( f(y_i | y^{(−i)}) ), where ŷ_i is the posterior predictive mean and f(y_i | y^{(−i)}) is the i-th conditional predictive ordinate, that is, the predictive distribution evaluated at y_i using the dataset without the i-th observation. Figure 4.3 (left) shows density estimates and the estimated partition of the data, obtained as the partition that minimizes the posterior expectation of Binder's loss function under equal misclassification costs (see Lau and Green, 2007b). The points at the bottom of the plots represent observations, while colors refer to the corresponding cluster. Figure 4.3 (right) displays the posterior distribution of K for Tests 4 and 6 in Table 4.1.
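The LPML can be estimated from MCMC output with the standard harmonic-mean identity for conditional predictive ordinates; this is a generic estimator, not necessarily the exact computation used for Table 4.1:

```python
import numpy as np

def lpml_from_samples(loglik):
    """Monte Carlo LPML from an (M, n) array of pointwise log-likelihoods
    log f(y_i | theta^(m)); each CPO_i is estimated by the harmonic mean
    of the likelihoods across the M posterior draws."""
    M = loglik.shape[0]
    neg = -loglik
    mx = neg.max(axis=0)
    # log CPO_i = log M - logsumexp_m(-loglik[m, i]), computed stably
    log_cpo = np.log(M) - (mx + np.log(np.exp(neg - mx).sum(axis=0)))
    return log_cpo.sum()

def mse(y, y_hat):
    """MSE = sum_i (y_i - yhat_i)^2, with yhat_i the posterior
    predictive mean."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat))**2))

# sanity check: constant log-likelihood -1 gives CPO_i = e^{-1} for each i
lpml_const = lpml_from_samples(np.full((10, 3), -1.0))
```

The log-sum-exp trick avoids underflow when the pointwise likelihoods are very small, which is common for n in the hundreds.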
As a comparison, the same posterior quantities as in Table 4.1 were computed using the DPM, the Repulsive Gaussian Mixture Model (RGMM) of Quinlan et al. (2017), and also the proposal of Petralia et al. (2012). To make results comparable, we assumed the same prior information on the hyperparameters common to all the mixture models. See Table 4.5. From these tables, it is clear that the alternative repulsive models are good competitors to ours, and that they generally achieve a better fit to the dataset. The tests showing the best indices of goodness of fit are typically those overestimating the number of clusters.
Finally, we recall that in Section 4.4.2 we report some further tests on the Galaxy
dataset to show the influence of various choices of spectral density on the inference.
We conclude that there is evidence of robustness with respect to the choice of
Test ρ ν E(K) V(K) Epost(K) Vpost(K) MSE LPML
1 2 2 2 1.67 6.09 1.10 78.95 -171.72
2 5 10 5.00 7.12 6.07 1.09 78.33 -167.96
3 aρ = 1, bρ = 1 2 2.18 1.978 6.10 1.10 73.89 -164.47
4 aρ = 1, bρ = 1 10 2.73 2.15 6.11 1.12 74.93 -162.71
5 aρ = 1, bρ = 1 discr(V1) 2.47 2.21 6.06 1.08 74.02 -172.54
6 aρ = 1, bρ = 1 discr(V2) 2.51 2.27 6.10 1.13 76.64 -170.94
Table 4.1: Prior specification for (ρ, ν) and K and posterior summaries for the Galaxy dataset; (aρ, bρ) appear in (4.20); here V1 = {1, 2, 5, 10, 20} and V2 = {0.5, 1, 2, 3, 5, 10, 15, 20, 30, 50}. V denotes the variance.
Figure 4.3: Density estimates and estimated partition for the Galaxy dataset under Test 4 in Table 4.1, including 90% credibility bands (light blue).
spectral density.
We have considered one further application, this time using the same variables
from the Air Quality dataset (ozone and solar radiation) as considered in
Quinlan et al. (2017). Instead of (4.8), we assume a bidimensional Gaussian
likelihood, with bidimensional mean vectors distributed according to the
PES−DPP(ρ, ν) prior as before, and with covariance matrices Σk independent
and identically distributed according to the inverse-Wishart distribution. See Section
4.2.4 for the changes in the Gibbs sampler required by multidimensional data points, this
time adapted from Dellaportas and Papageorgiou (2006). Table 4.2 reports summaries
of interest for a few tests carried out, including the prior and posterior mean
and variance of the number of components K, and the LPML. As usual in the
context of mixture models, we find that the inference depends on the chosen
hyperparameters. Compared with the corresponding inference in Quinlan et al.
(2017), we obtain lower estimates of K and a better fit of the model to the data. The
posterior predictive densities, not shown here, appear very similar to those in Fig. 9
(b) of Quinlan et al. (2017).
Test ρ ν E(K) Var(K) E(K|data) Var(K|data) LPML
7 3 2 3 2.62 2.18 0.39 -246.81
8 ρ0 ∼ gamma(1, 0.5) 2 2.7 2.37 2.15 0.21 -257.66
Table 4.2: Prior specification for (ρ, ν) and K and posterior summaries for the airquality dataset; ρ0 appears in (4.20).
4.4.2 Different spectral densities: application to the Galaxy dataset
We consider the proposed model with different spectral densities, to check the
robustness of the inference. All the models presented in this chapter are, in fact,
general: in principle any spectral density ϕ(·) satisfying the conditions
for the existence of the DPP can be employed. The choice of the spectral
density in (4.7) is motivated by its strong repulsiveness (see Lavancier et al., 2015).
However, in this section we show inference on the Galaxy dataset obtained when
spectral representations other than the power spectral density drive the DPP.
We choose isotropic covariance functions that are well known in the spatial statistics
literature: the Whittle-Matérn and the generalized Cauchy. Both densities depend
on three parameters: intensity ρ > 0, scale α > 0 and shape ν > 0. To
ensure ϕ(x) < 1 for all x, ρ must be smaller than ρmax = α^(−d) M, where M needs
to be specified for each of the two cases. For the Whittle-Matérn we have
ϕ(x; ρ, α, ν) = ρ Γ(ν + d/2) (2α√π)^d / [Γ(ν) (1 + ‖2παx‖²)^(ν+d/2)],    M = Γ(ν) / [2^d π^(d/2) Γ(ν + d/2)]
and for the generalized Cauchy
ϕ(x; ρ, α, ν) = ρ 2^(1−ν) (α√π)^d / Γ(ν + d/2) · ‖2παx‖^ν Kν(‖2παx‖),    M = Γ(ν + d/2) / [Γ(ν) π^(d/2)]
where d is the dimension of the space where x lives (d = 1 in what follows) and
Kν(·) is the modified Bessel function of the second kind.
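The existence check for the Whittle-Matérn case can be sketched numerically as follows (assuming the density in the form written above; the function names are ours). Since the spectral density attains its maximum at x = 0, setting ρ = ρmax makes ϕ(0) = 1, and any smaller intensity keeps ϕ < 1 everywhere.

```python
import math

def wm_spectral(x, rho, alpha, nu, d=1):
    """Whittle-Matérn spectral density, in the parameterization written above."""
    r = abs(2.0 * math.pi * alpha * x)  # for d = 1; use a Euclidean norm for d > 1
    num = rho * math.gamma(nu + d / 2) * (2.0 * alpha * math.sqrt(math.pi)) ** d
    return num / (math.gamma(nu) * (1.0 + r ** 2) ** (nu + d / 2))

def wm_rho_max(alpha, nu, d=1):
    """Largest admissible intensity: rho_max = alpha^(-d) * M."""
    M = math.gamma(nu) / (2 ** d * math.pi ** (d / 2) * math.gamma(nu + d / 2))
    return alpha ** (-d) * M

alpha, nu = 0.1, 2.0
rho = 0.5 * wm_rho_max(alpha, nu)  # the rho = rho_max / 2 setting used in the tests below
```

The same check applies, with the appropriate M, to the generalized Cauchy case.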
We fix ρ = ρmax/2 and (α, ν) equal to: (i) (0.1, 0.1), (ii) (0.1, 2), (iii) (1, 0.1) in
the tests below. To fit the Galaxy data to the model in Section 4.2.2, the selected
hyperparameter values are δ = 1, the parameter of the Dirichlet, and (a0, b0) =
(3, 3), the parameters of the inverse gamma (see (4.12) and (4.13)).
Table 4.3 displays posterior summaries for the two families of spectral densities
under hyperparameter settings (i), (ii) and (iii). Posterior summaries of the number
of components K and goodness-of-fit values are close to those of Table 4.1. This
Whittle-Matérn
Test E(K) Var(K) E(K | data) Var(K | data) MSE LPML
(i) 10.21 17.29 6.07 1.09 73.67 -167.22
(ii) 2.09 2.15 6.08 1.09 73.89 -167.68
(iii) 3.53 9.87 6.07 1.10 75.80 -167.33
Generalized Cauchy
Test E(K) Var(K) E(K | data) Var(K | data) MSE LPML
(i) 5.65 14.49 6.09 1.09 76.98 -166.60
(ii) 1.84 1.73 6.07 1.10 75.75 -167.42
(iii) 0.25 0.06 6.07 1.12 80.66 -169.84
Table 4.3: Prior mean and variance of K and posterior summaries for the Galaxy dataset with Whittle-Matérn (top) and generalized Cauchy (bottom) spectral densities.
gives evidence of robustness with respect to the choice of the spectral density.
4.4.3 Tests on data from a mixture with 8 components
We simulated a dataset with n = 100 observations from a mixture of 8 components.
Each component is a Gaussian density with mean θk and common variance σ²k = σ² = 0.05;
the means θk are evenly spaced in the interval (−10, 10). In the model (4.8)-(4.13),
we set aρ = 2.0025, bρ = 0.050125 so that E(ρ0) = 0.05 and Var(ρ0) = 1;
again, s = 0.5 and δ = 1. We recall that ρ0 is defined in (4.19).
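The reported values of (aρ, bρ) are consistent with an inverse-gamma prior for ρ0: matching a target mean m = b/(a − 1) and variance v = b²/((a − 1)²(a − 2)) gives a = m²/v + 2 and b = m(a − 1). A sketch of this moment matching (assuming the inverse-gamma parameterization; the function name is ours):

```python
def invgamma_params(mean, var):
    """Shape/scale (a, b) of an inverse-gamma distribution with given mean
    and variance, using mean = b/(a-1) and var = b^2 / ((a-1)^2 (a-2))."""
    a = mean ** 2 / var + 2.0
    b = mean * (a - 1.0)
    return a, b

a, b = invgamma_params(0.05, 1.0)  # reproduces (2.0025, 0.050125) above
```

The same matching applies to the inverse-gamma hyperparameters (a0, b0) used elsewhere in the chapter; for instance (a0, b0) = (3, 3) corresponds to mean 1.5 and variance 9/4.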
Table 4.4 reports hyperparameter values for the different tests and posterior summaries
of interest, as well as the prior mean and variance of K. In particular, we show
the posterior mean and variance of the number of components K (with which we
assess the effectiveness of the model for clustering), the mean squared error (MSE) and
the log-pseudo marginal likelihood (LPML) (which helps quantify the goodness
of fit). In all cases we obtained a quite satisfactory estimate of the exact number
of components, which is 8: the posterior is concentrated around the true value with
a very small variance. See also Figure 4.4.
From the density estimation viewpoint, Table 4.4 shows that both MSE
and LPML are similar across tests, thus indicating robustness with respect to
the prior choice of the parameters ρ and ν. However, Tests S2 and S7 seem
preferable; see Figure 4.5, where density estimates and estimated partitions for these two
cases are displayed. The posterior density of ρ under Tests S2 and S7 is shown in
Figure 4.6.
4.4.4 Comparison to alternative models
We now consider fitting alternative models to the Galaxy and two simulated
datasets, one from the mixture with 8 components introduced in the previous sec-
Prior specification
Test ρ ν E(K) V (K)
S0 9.00 1 8.98 45.12
S1 9 10 9 23.05
S2 aρ = 1, bρ = 1 1 1.94 1.99
S3 aρ = 1, bρ = 1 2 2.18 1.99
S4 aρ = 1, bρ = 1 10 2.74 2.17
S5 aρ = 1, bρ = 1 discr(2,5,20) 2.52 2.11
S6 aρ = 1, bρ = 1 discr(V1) 2.45 2.18
S7 aρ = 1, bρ = 1 discr(V2) 2.5 2.25
Posterior summaries
Test E(K | data) V (K | data) MSE LPML
S0 7.98 0.20 4.65 2.39
S1 7.99 0.19 4.62 3.10
S2 8.00 0.17 4.62 3.66
S3 7.99 0.16 4.62 3.03
S4 7.99 0.17 4.63 2.96
S5 7.99 0.16 4.63 3.61
S6 7.99 0.17 4.65 3.42
S7 7.99 0.18 4.63 3.36
Table 4.4: Prior specification for (ρ, ν) and the corresponding mean and variance induced on K (top). Hyperparameters (aρ, bρ) appear in (4.20), while V1 = {1, 2, 5, 10, 20} and V2 = V1 ∪ {0.5, 3, 15, 30, 50}. Posterior summaries for the simulated dataset from a mixture with 8 components are in the bottom subtable.
Figure 4.4: Posterior distribution of K for the simulated dataset from the mixture of 8 components under Tests S2 (left) and S7 (right) in Table 4.4.
tion, and the second consisting of 10,000 observations generated from a mixture
Figure 4.5: Density estimate and estimated partition for the simulated dataset from the mixture of 8 components under Tests S2 (left) and S7 (right) in Table 4.4. The points at the bottom of the density estimate represent the data, and each color represents one of the eight estimated clusters.
of 20 components. We consider first the gold standard of Bayesian nonparametric
models, the DPM, then the RGMM by Quinlan et al. (2017), and the similar
specification in Petralia et al. (2012). The same prior information on the hyperparameters
common to all the mixture models was assumed, i.e. the same marginal prior
for σ²k and (w1, . . . , wK). The hyperparameter τ in the potentials φ1 and φ2 was set
according to the suggestion in Quinlan et al. (2017) (τ = 5.54). As a comparison, the
DPM
Test α E(K) E(K|data) Var(K|data) MSE LPML
7 gamma(0.5, 1) 2.9 6.166 1.549 62.703 -151.797
8 0.8 4.3 5.936 1.25 61.255 -151.146
9 0.45 3 4.371 1.142 139.659 -169.978
10 gamma(4, 2) 7.7 7.271 1.594 36.708 -149.258
Repulsive models
Model E(K|data) Var(K|data) MSE LPML
Quinlan et al. (2017) 6.462 0.440 38.122 -162.574
Petralia et al. (2012) 7.621 0.757 20.964 -156.522
Table 4.5: Prior specification for α and posterior summaries for the Galaxy dataset using the function DPdensity in DPpackage (top) and repulsive models (bottom).
same posterior quantities as in Table 4.1 were computed; see Tables 4.5 and 4.6.
The DPM was fitted via the function DPdensity available in DPpackage (Jara
et al., 2011), while the code for the alternative repulsive models was kindly provided
Figure 4.6: Posterior distribution of ρ for the simulated dataset from the mixture of 8 components under Tests S2 (left) and S7 (right) in Table 4.4.
by José Quinlan and Garritt Page.
DPM
Test α E(K) E(K|data) Var(K|data) MSE LPML
11 0.43 3 7.961 0.5 4.779 -11.246
12 gamma(4, 2) 8.17 8.665 0.910 4.248 -10.116
Repulsive models
Model E(K|data) Var(K|data) MSE LPML
Quinlan et al. (2017) 10.73 1.407 3.121 -4.754
Petralia et al. (2012) 8.51 0.357 4.152 -4.022
Table 4.6: Posterior summaries for the simulated dataset from the mixture of 8 components using the function DPdensity in DPpackage (top) and repulsive models (bottom).
Comparison of the tables above with Tables 4.1 and 4.4 shows that the alternative
repulsive models are good competitors to ours and, depending on the dataset and
hyperparameter specification, they may achieve a better (Galaxy) or worse (simulated
data) fit to the data. The tests showing the best goodness-of-fit indices
are typically those overestimating the number of clusters. It is well known that,
in general, clustering in the context of nonparametric mixture models such as DPMs is
strongly affected by the base measure (see, e.g., Miller and Harrison, 2017). The
same disadvantage affects the mixture models in Quinlan et al. (2017) and Petralia
et al. (2012). Our model, on the other hand, avoids the delicate choice of the base
measure, leading to more robust estimates of K.
As a further comparison, see also Figure 4.7, which displays the posterior
distribution of K under the DPM and under our model for the Galaxy dataset.
Figure 4.7: Posterior distribution of the number K of components for the Galaxy dataset under Tests 4 (black) and 6 (blue) in Table 4.1 and under the DPM model (red) as in Test 7 in Table 4.5.
For the second simulated dataset, we considered applicability for a moderately
large sample size, generating 10,000 observations from a 20-component mixture, 10
of the components being Gaussian and the rest skew-normal distributions with positive
and negative skewness. The true density is shown in Figure 4.8. To estimate the
true number of clusters, we fitted different alternative models to this dataset: our
model, the repulsive mixture models by Quinlan et al. (2017) and Petralia et al.
(2012), and the finite mixture model implemented in the mclust R package via the
function Mclust (Fraley et al., 2012) with a number of components between 10 and
25. The same prior information on the hyperparameters common to all the Bayesian
mixture models was assumed. The Mclust function returns the estimates of the
number of components corresponding to the best three models, in this case 11, 17
and 18. Though the run-time for this application is around 15 times longer than in
Model E(K|data) Var(K|data) MSE LPML
PES−DPP 16.41 1.38 1356.43 -13239.54
Quinlan et al. (2017) 14.13 0.146 1475.98 -13771.56
Petralia et al. (2012) 20.81 0.564 1002.05 -12940.49
Table 4.7: Posterior summaries for the large simulated dataset.
the case of the Galaxy data, our algorithm reduces the effective number of clusters
Figure 4.8: Histogram, true density (red) and density estimate (black) of the large simulated dataset, including 90% credibility bands (light blue).
a posteriori, thus helping our model scale up. Intuitively, the increase in the run-time
is mostly due to the larger number of mixture components and the much larger
sample size compared with the other datasets illustrated here.
4.4.5 Simulated data with covariates
We consider the same simulated dataset as in Müller et al. (2011), Section 5.2; the
"simulation truth" consists of 12 different distributions, corresponding to different
covariate settings (see Figure 1 of that paper). Model (4.8)-(4.10), (4.13)-(4.16) was
fitted to the dataset, assuming β0 = 0, Σ0 = 400 (XᵀX)⁻¹, aρ = 1, bρ = 1.2, and
a0, b0 such that the prior mean of σ²k is 50 and its variance is 300. Recall also that here
we assume ν = 2.
As an initial step, inference for the complete dataset (1000 observations) was
carried out, yielding a posterior of K, not reported here, mostly concentrated over
the set {8, 9, . . . , 16}, with a mode at 11. Figure 4.9 shows posterior predictive
distributions for the 12 different reference covariate values, along with 90% credibility
intervals. These are in good accordance with the simulation truth (compare Figure
1 in Müller et al., 2011).
To replicate the tests in Müller et al. (2011), a total of M = 100 datasets of size
200 were generated by randomly subsampling 200 out of the 1000 available
observations. The computational burden over multiple repetitions was controlled by limiting
the posterior sample sizes to 2,000. Table 4.8 displays the root MSE for estimating
E(y | x1, x2, x3) for each of the 12 covariate combinations defining the different clusters,
for our model and for the PPMx, as in Table 1 of Müller et al. (2011). The
computations also include evaluation of the root MSE and LPML over all 100 datasets
for the data used to train the model, with MSEtrain = ∑_{i=1}^n (yi − ŷi)²,
Figure 4.9: Predictive distributions corresponding to the 12 different reference values of the covariates. The simulation truth can be found in Figure 1 of Müller et al. (2011).
where ŷi is the expected value of the estimated predictive distribution, and, for a
test dataset of 100 new data points, MSEtest = ∑_{i=1}^n (yi^test − ŷi)². In addition, we report
LPMLtrain, the value of the log-pseudo marginal likelihood for the training dataset.
Table 4.9 shows the values compared to other competitor models, i.e. the linear
dependent Dirichlet process mixture (LDDP) defined in De Iorio et al. (2004), the
product partition model with covariates (PPMx) in Müller et al. (2011) and the linear
dependent tailfree process model (LDTFP) in Jara and Hanson (2011). The best values are in
bold: our model performs well according to the LPML, while the MSE favors the
PPMx or the LDTFP. In general, our model is competitive with respect to other
popular models in the literature. Moreover, in the LDDP case, the
average number of clusters is 20.6 with variance 2.266, indicating a less
parsimonious model compared to ours.
x1 x2 x3 DPP PPMX
-1 0 0 6.1 7.9
0 0 0 6.7 3.9
1 0 0 7.2 2.8
-1 1 0 6.5 5.4
0 1 0 6.5 4.6
1 1 0 6.8 4.0
-1 0 1 6.8 6.1
0 0 1 6.1 4.2
1 0 1 5.7 4.5
-1 1 1 5.9 9.5
0 1 1 6.6 8.3
1 1 1 5.8 6.2
avg 6.4 5.6
Table 4.8: Root MSE for estimating E(y | x1, x2, x3) for the 12 combinations of covariates (x1, x2, x3), with the PPMx as competing model of reference (compare also the results in Table 1 of Müller et al., 2011).
DPPx LDDP PPMX LDTFP
Root MSEtrain 324.531 304.742 278.395 304.374
Root MSEtest 216.675 215.1694 217.2459 212.761
LPMLtrain -871.8 -902.2295 -873.1671 -901.465
Table 4.9: Comparison with competitors for the simulated dataset with covariates; best values according to each index are in bold. DPPx denotes our model, LDDP the linear dependent Dirichlet process mixture, PPMx the product partition model with covariates, and LDTFP the linear dependent tailfree process model.
In summary, our extensive simulations suggest that the proposed approach tends
to require fewer mixture components than other well-known alternative models, while
still providing a reasonably good fit to the data.
4.5 Biopic movies dataset
For this illustrative example we consider the Biopics data available in the R
package fivethirtyeight (Ismay and Chunn, 2017). This dataset is based on the
IMDB database and relates to biographical films released from 1915 through 2014. An
interesting explorative analysis of the data can be found at goo.gl/M2QWFt.
We consider the logarithm of the gross earnings at the US box office as the response
variable, with the following covariates: (i) year of release of the movie (on a suitable
scale, continuous); (ii) a binary variable that indicates whether the main character
is a person of color; and (iii) a categorical variable recording whether the country of
the movie is US, UK or other. After removing the missing data from the dataset,
we were left with n = 437 observations and p = 4 covariates. We
note that 76 biopics have a person of color as subject, and the frequencies of
the category origin are (256, 79, 64) for US, UK and other, respectively; "other"
means mixed productions (e.g. US and Canada, or US and UK). In what follows,
the hyperparameters in model (4.17)-(4.18), (4.9)-(4.10), (4.15)-(4.16) are chosen
as β0 = 0 and (aρ, bρ) = (1, 1). The prior mean and variance of K induced by these
hyperparameters are 2.162 and 1.978, respectively. The scale hyperparameter φ in
the g-prior for β and (a0, b0) vary as reported in Table 4.10, where m and v
denote the prior mean b0/(a0 − 1) and variance b0²/((a0 − 1)²(a0 − 2)), respectively,
of the inverse gamma distribution for σ²k as in (4.18). We also assume γ0 equal to
the vector of all 0's, while Λ0 is such that the marginal a priori variance of γk is
equal to diag(0.01, 0.1, 0.1, 0.1), in accordance with the variances of the corresponding
frequentist estimators.
Test φ m v E(K | data) sd(K |data) MSE LPML
A 50 5 1 4.49 1.10 1126.32 -960.89
B 200 5 10 4.45 1.19 983.55 -954.55
C 50 3 +∞ 5.66 1.27 501.22 -918.74
D 200 10 5 4.21 1.33 1805.83 -980.61
E 100 2 1 5.31 1.21 564.26 -935.56
F 200 2 10 5.51 1.26 557.44 -925.22
Table 4.10: Prior specification for the βk's and σ²k's and posterior summaries for the Biopics dataset; m and v are the prior mean and variance, respectively, of σ²k. The posterior mean and standard deviation of the number K of mixture components are in the fifth and sixth columns, respectively, while the last two columns report MSE and LPML, respectively.
The posterior of K is robust with respect to the choice of prior hyperparameters;
on the other hand, our results show that when covariates are not included in the likelihood,
i.e. setting all γk's equal to 0, inference on K is much more sensitive to the choice
of (a0, b0) (results not shown here).
Predictive inference was also considered, by evaluating the posterior predictive
distribution at the following combinations of covariate values: (i) (mean value for
covariate year, US, white); (ii) (mean value for covariate year, US, color); (iii)
(mean value for covariate year, UK, white); (iv) (mean value for covariate year,
UK, color); (v) (mean value for covariate year, other, white); and (vi) (mean value
for covariate year, other, color). Corresponding plots are shown in Figure 4.10.
Figure 4.10: Predictive distribution of the log gross earnings for cases (i)-(vi) under Test E in Table 4.10 for the Biopics dataset.
These distributions appear to be quite different in the six cases: in particular, we
can observe that in cases (i) and (ii) the posterior is shifted towards higher values.
This is easy to interpret, since the measurements are earnings at the US box
office; therefore, we expect US movies to be, in general, more profitable in that
market. The difference due to race is, on the other hand, less
evident. However, the predictive densities show slightly higher earnings for movies
whose subject is a person of color when the origin is other ((v) and (vi)). Movies
from the UK, on the other hand, exhibit the opposite behavior ((iii) and (iv)).
We report here the posterior cluster estimate for Test B in Table 4.10. We found
three groups, with sizes 10, 193 and 234, respectively; see Figure 4.11 for the estimated
clusters and boxplots of the response. As a comparison, it is useful to report
the overall average value of the response, 15.36, and of the covariates: 7.89 (year),
0.18 (UK), 0.15 (other), 0.83 (white). These three groups have a nice interpretation in
terms of covariates: group 1 is the smallest, with a high average response (17.18),
Figure 4.11: Cluster estimate (left) under our model (Test B in Table 4.10) for the Biopics dataset. Each color represents one of the three estimated clusters. The y coordinate is the response, i.e. the log box-office earnings, while the x coordinate is the covariate year of release. The boxplot of the response per group is in the right panel.
and it is characterized by a high percentage of movies from other countries, with a
person of color as subject. Group 2 also corresponds to a high average response
(16.42), but the average values of UK, other and person of color are similar to the
overall averages (0.14, 0.09 and 0.84, respectively). The average response in group 3 is
smaller (14.40) than the total sample mean, while the average values of UK, other
and person of color are 0.22, 0.17 and 0.84, respectively.
To assess the effectiveness of the proposed model, we compare the results with the
linear dependent Dirichlet process mixture model introduced in De Iorio et al. (2004)
and implemented in the LDDPdensity function of DPpackage (Jara et al., 2011).
Prior information was fixed as follows: for Test G the mass parameter α of
the Dirichlet process is set equal to 0.3, so that E(K) = 2.87 and Var(K) =
1.81, which approximately match the prior information we gave on the parameter K.
Similarly, under Test H, α is distributed according to the gamma(1/4, 1/2), so
that the prior mean of K is 3.6 and the variance is 22.18. The baseline distribution
is a multivariate Gaussian with mean vector 0 and a random covariance matrix with
a non-informative prior; the inverse-gamma distribution for the variances
of the mixture components has parameters such that its mean and variance are equal
to 5 and 1, respectively, similarly to Table 4.10. Posterior summaries can be found in
Table 4.11.
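These mass-parameter calibrations can be reproduced from the standard Chinese-restaurant representation of the Dirichlet process, in which the prior number of clusters K among n observations is a sum of independent Bernoulli(α/(α + i − 1)) indicators, i = 1, . . . , n. A minimal sketch (the function name is ours):

```python
def dp_prior_K_moments(alpha, n):
    """Prior mean and variance of the number of clusters K among n draws
    from a Dirichlet process with mass alpha: K is a sum of independent
    Bernoulli(alpha / (alpha + i - 1)) indicators, i = 1, ..., n."""
    probs = [alpha / (alpha + i) for i in range(n)]  # i = 0, ..., n-1
    mean = sum(p for p in probs)
    var = sum(p * (1.0 - p) for p in probs)
    return mean, var

m, v = dp_prior_K_moments(0.3, 437)  # n = 437 observations in the Biopics dataset
```

For α = 0.3 and n = 437 this returns values close to the (2.87, 1.81) reported above; the same function with α = 0.15 and n = 1147 reproduces the calibration used for the Air Quality Index data in Section 4.6.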
As a comparison between the estimated partitions under our model (Figure 4.11)
and under the LDDP mixture model, Figure 4.12 displays the estimated partition obtained
under the LDDP model for Test G, which has 3 groups with sizes 300, 127 and 10.
Figure 4.12: Cluster estimate obtained under a linear dependent Dirichlet process model with prior specification G in Table 4.11.
Case E(K | data) sd(K | data) MSE LPML
G 2.95 1.03 1282.49 -937.51
H 3.56 2.36 682.98 -914.00
Table 4.11: Posterior summaries for the tests on the Biopics dataset under a linear dependent Dirichlet process mixture.
4.6 Air quality index dataset
The Air Quality Index (AQI) is an index for reporting air quality; see for instance
https://airnow.gov/index.cfm?action=aqibasics.aqi. It describes how clean
or polluted the air is, and what associated health effects might be a concern for the
population. The Environmental Protection Agency calculates the AQI for five major
air pollutants regulated by the Clean Air Act: ground-level ozone, particle pollution,
carbon monoxide, sulfur dioxide, and nitrogen dioxide. Data can be obtained from
several sources, for instance from http://aqicn.org/. For a real-time map, see
https://clarity.io/airmap/.
For the purpose of this illustration, we investigate the spatial relations in
measurements of the AQI made on September 13th, 2015, at 16:00. We consider 1147
locations scattered across North and South America, shown in Figure 4.13 (the
values of the AQI have been standardized). Note that the highest AQI values, indicating
the most polluted air, are depicted in red on the map. We ran the MCMC algorithm
to fit model (4.8)-(4.10), (4.13)-(4.16), with a burn-in of 10,000 iterations, a thinning of 10
and a final sample size of 5,000. As before, β0 = 0 and ν = 2. Table 4.12 displays
different settings of the hyperparameters, for all of which the prior mean of the number of
groups is 1.996 and the prior standard deviation is 1.290 (computed using a Monte
Carlo approach). The hyperparameter settings differ in the specification
of φ, the scale hyperparameter in the g-prior in (4.16), and the prior mean m and
variance v of σ²k; see (4.13).
Figure 4.14 shows the estimated clusters obtained under Test AQ1. The north-east
coast seems to be associated with better environmental conditions, and it is
Figure 4.13: Air quality index dataset, where the number of locations is 1147. Yellow points denote areas with the smallest values of the AQI, while red denotes points with the highest values.
Test φ m v mean(K) sd(K) MSE LPML
AQ1 1000 2 1 6.999 1.469 861.048 -1101.013
AQ2 500 10 +∞ 5.192 1.143 870.689 -1235.988
AQ3 1000 0.1 1 9.045 2.243 840.0685 -1071.931
AQ4 500 5 +∞ 7.811 2.665 835.596 -1160.631
Table 4.12: Prior specification for the Air quality index dataset. The scale parameter φ appears in the g-prior specification of (4.16), while m and v denote the prior mean and variance, respectively, of σ²k as in (4.13).
clear that important urban sprawls are generally grouped together. In more detail,
the Binder loss function method estimated 6 groups characterized by the following
means and standard deviations of the AQI: (0.95, 0.44) in the red group, (-0.27, 0.45)
in the yellow, (-0.70, 0.21) in the green, (1.7, 1.64) in the light blue, (0.28, 0.54) in the
blue, and (-0.51, 0.29) in the pink group; yellow, pink and green points are associated with
lower values of the AQI, while red and light blue points with higher values. The boxplots of the
AQI by cluster in Figure 4.14 are clearly interpretable: the cluster depicted in light
blue gathers the polluted cities in South America and big cities on the West Coast of
the U.S. (Las Vegas, Los Angeles, Seattle, for instance). On the other hand, yellow
and green points indicate the less dangerous environmental conditions that characterize
the North-East coast; however, the small red cluster contains the big cities of
this area (Chicago, New York, Philadelphia, Boston).
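The point-estimation step behind these cluster estimates can be sketched as follows: from the MCMC cluster allocations one estimates the posterior co-clustering probabilities and then selects, for instance among the sampled partitions, the one minimizing the posterior expected Binder loss, which under equal misclassification costs reduces to a sum of squared deviations from the co-clustering matrix. The function below is a generic illustration on toy labels, not the implementation used for the analyses; its name is ours.

```python
from itertools import combinations

def binder_best(partitions):
    """Among sampled partitions (lists of cluster labels), return the one
    minimizing sum_{i<j} (1[c_i = c_j] - p_ij)^2, where p_ij is the posterior
    co-clustering probability estimated from the MCMC samples themselves."""
    n = len(partitions[0])
    M = len(partitions)
    pairs = list(combinations(range(n), 2))
    # posterior similarity matrix: fraction of samples placing i and j together
    p = {(i, j): sum(c[i] == c[j] for c in partitions) / M for i, j in pairs}
    def loss(c):
        return sum(((c[i] == c[j]) - p[(i, j)]) ** 2 for i, j in pairs)
    return min(partitions, key=loss)

samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
best = binder_best(samples)
```

Restricting the search to the sampled partitions is a common practical shortcut; Lau and Green (2007b) discuss more elaborate optimization strategies.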
Figure 4.15 displays three different predictive laws corresponding to different
locations: Sacramento, which shows the lowest predicted values of the AQI; New York,
where the environmental conditions are worse; and Monterrey, which presents an
intermediate situation. Figure 4.16 shows the posterior predictive mean for a grid
Figure 4.14: Estimated partition of the Air Quality index dataset under the hyperparameters of Test AQ1 in Table 4.12. The number of estimated clusters is 6, each denoted by a different color, with sizes 17, 221, 183, 136, 306, 284, respectively.
of locations scattered around North America.
Similarly to the Biopics dataset, we compare the inference under our model
with the linear dependent Dirichlet process mixture model introduced in De Iorio
et al. (2004). Prior information is fixed as follows: α is distributed according to
the gamma(1, 1) distribution for Test AQ5, so that the prior mean and variance of
K are 7.15 and 36, respectively, i.e. the prior of K is vague. On the other hand,
in Test AQ6 the mass parameter α of the Dirichlet process is set equal to 0.15,
so that E(K) = 2.09 and Var(K) = 1.02, which approximately matches the prior
information given on K (mean 1.996 and variance 1.29). The baseline distribution is
Figure 4.15: Predictive distributions corresponding to 3 different locations (New York, Sacramento, Monterrey) under Test AQ4 in Table 4.12 for the Air Quality Index dataset.
Figure 4.16: Prediction over a grid of coordinates for the Air Quality Index dataset under Test AQ4 in Table 4.12.
a multivariate Gaussian with mean vector 0 and a random covariance matrix with
a non-informative prior, and the hyperparameters of the inverse-gamma
distribution for the variances of the mixture components are such that the prior mean
and variance are equal to 5 and 1, respectively. Posterior summaries can be found
in Table 4.13.
Test E(K | data) Var(K | data) MSE LPML
AQ5 5.14 0.38 827.72 -1100.73
AQ6 5.03 0.16 827.04 -1100.06
Table 4.13: Posterior summaries for the Air Quality Index dataset under the linear dependent Dirichlet process mixture for two different prior specifications.
4.7 Conclusion
This work deals with mixture models where the prior has the property of repulsion
across location parameters. Specifically, the discussion is centered on mixtures
built on determinantal point processes (DPPs), which can be constructed using a
general spectral representation. The methods work with any valid spectral density,
but for the sake of concreteness, the illustrations were discussed in the context of the
power exponential case.
Though we limit ourselves to the case of isotropic DPPs, inhomogeneous DPPs
can be obtained by transforming or thinning a stationary process. However, we
believe that this case is not very interesting, unless there is a strong reason to
assume non-homogeneous locations a priori.
Our computational experiments and data illustrations show that the repulsion
induced by the DPP priors indeed tends to eliminate the annoying case of very
small clusters that commonly arises when using models that do not constrain
location/centering parameters. This comes at a very small sacrifice of model fit
compared to the usual mixture models.
Another advantage of our model over DPMs is that we avoid the delicate choice of the base measure of the Dirichlet process, leading to more robust estimates of the number K of components in the mixture.
Chapter 5
Constructing stationary time series of
completely random measures via
Bayesian conjugacy
One flexible approach to building stationary time-dependent processes exploits the mathematical notion of conjugacy in a Bayesian framework. Under this approach, the transition law L(X_t | X_{t−1}) of a process X_t is defined as the predictive distribution of an underlying Bayesian model (see e.g. Pitt and Walker (2005)). Then, if the model is conjugate, the transition kernel can be analytically derived, making the approach particularly appealing. We aim at achieving such convenient mathematical tractability in the context of completely random measures (CRMs), i.e. when the variables exhibiting time dependence are CRMs. In order to take advantage of conjugacy, here we consider the large class of exponential-family completely random measures (see Broderick et al. (2017)). This leads to a simple description of the process, which has an AR(1)-type structure and offers a framework for generalizations to more complicated forms of time dependence. The proposed process can be straightforwardly employed to extend CRM-based Bayesian nonparametric models, such as feature allocation models, to time-dependent data. These processes can be applied to problems from modern real-life applications in very different fields, from computer science to biology. In particular, we develop a dependent latent feature model for the identification of features in images and a dynamic Poisson factor analysis for topic modelling, which are fitted to synthetic and real data.
5.1 Stationary autoregressive-type AR(1) models for univariate data
An intense research activity of the past decades has focused on constructing strictly stationary autoregressive-type (AR-type) models with arbitrary stationary distributions (see, for instance, Mena and Walker (2007), Pitt and Walker (2005), Jørgensen and Song (1998)). We will focus on the approach introduced in Pitt et al. (2002) and later generalized in various frameworks (e.g. more general time dependences and nonparametric approaches). The aim is to build a strictly stationary process X_t whose marginal laws are fixed and denoted by p(x). A suitable auxiliary random variable Y, with conditional distribution p(y|x), is introduced and the transition density driving the AR(1)-type model X_t is obtained as
p(x | x_{t−1}) = ∫ p(x | y) p(y | x_{t−1}) ν(dy)  (5.1)

with

p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) η(dx),
where ν and η are reference measures, such as the Lebesgue or counting measures.
This construction implies that p(·) is the invariant density for the transition in (5.1), i.e.

p(x) = ∫ p(x | x_{t−1}) p(x_{t−1}) η(dx_{t−1}).
Note that the transition density of the process X_t has the interpretation

p(x | x_{t−1}) = E_{Y|X_{t−1}}( p(x | y) | x_{t−1} ),
where the expectation is with respect to p(y | x_{t−1}). The latter can be seen as the posterior distribution under the model X | Y ∼ p(x|y), Y ∼ p(y), where p(x|y) acts as a conditional sampling model (the likelihood) and p(y) as the prior.
In this parametric framework, Pitt et al. (2002) studied a wide class of models for X_t, obtained when p(x|y) belongs to the exponential family and p(y) is the corresponding conjugate prior. One of the advantages of this approach is that the integral defining the transition kernel in (5.1) has a closed analytical form. This characteristic makes these models particularly appealing from an applied point of view, since the latent variable Y amounts to a mathematical trick to build the desired dependence.
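To fix ideas, the construction above can be sketched in a few lines. The Gamma-Poisson pair below is an illustrative choice of ours (a Gamma(a, b) marginal with a Poisson(cX) auxiliary variable, a conjugate pair), not a model used later in the chapter; the transition is exactly the composition "draw p(y | x_{t−1}), then draw from the conjugate posterior p(x | y)".

```python
import numpy as np

rng = np.random.default_rng(0)

def pitt_walker_gamma_chain(T, a=3.0, b=1.0, c=2.0):
    """AR(1)-type chain with invariant Gamma(a, b) marginal, built as in
    Pitt et al. (2002): draw the auxiliary Y | X ~ Poisson(c X), then
    X' | Y ~ Gamma(a + Y, b + c), the conjugate posterior."""
    x = rng.gamma(a, 1.0 / b)                # start from the marginal
    xs = np.empty(T)
    for t in range(T):
        xs[t] = x
        y = rng.poisson(c * x)               # auxiliary draw p(y | x)
        x = rng.gamma(a + y, 1.0 / (b + c))  # posterior draw p(x | y)
    return xs

xs = pitt_walker_gamma_chain(200_000)
# the Gamma(a, b) marginal moments are preserved: mean a/b, variance a/b^2
```

Since E(X' | x) = (a + cx)/(b + c) is linear in x, the chain has an AR(1)-type conditional mean with coefficient c/(b + c).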
In this work we aim at achieving the same mathematical tractability but in a more general context, where the observations are completely random measures, introduced in Chapter 1; nevertheless, it is quite often the case in practical applications that the CRMs are merely latent variables. Other works extending the approach to a nonparametric framework are, among others, Mena and Walker (2005) and Antoniano-Villalobos and Walker (2016). More general time dependences can be found in Mena and Walker (2007) and
Pitt and Walker (2005).
5.2 Exponential completely random measures
One of the main ingredients of the model we are going to propose in Section 5.3 is
the exponential family of CRMs, introduced in Broderick et al. (2017). As mentioned
in Chapter 1, a broad class of Bayesian nonparametric priors can be viewed as models
for the allocation of data points to traits. These processes give us traits paired with
rates or frequencies with which the traits occur in some population. Corresponding
likelihoods assign each data point in the population to some finite subset of traits, conditioned on the trait frequencies. What makes these models nonparametric is that the number of traits in the prior is countably infinite. That is, such a model allows the number of traits in any dataset to grow with the size of the data. Thus, nonparametric models allow for great flexibility but also present many challenges from a computational viewpoint, since an infinite number of parameters is involved. In this sense, having conjugacy is a valuable advantage when dealing with this kind of models in real-life applications. Conjugacy asserts that the posterior belongs to the same family of distributions as the prior: the exponential family of CRMs provides the opportunity of building models with a conjugate structure. Hence, we are able to consider marginal processes, which take a particularly straightforward form, and to avoid handling the infinite-dimensional parameters, namely the prior and the posterior.
prior and the posterior. For instance, in Section 1.3.2 in Chapter 1 we gave a useful
marginal representation of a general class of completely random measures. In order
to keep the chapter self-contained, we recall some basic notions that are needed to
present the family of exponential CRMs.
From now on we will represent each trait by a point ψ in some (Polish) space
Ψ of traits. Further, let Jk be the frequency, or rate, of the trait represented by
ψ_k, where k ≥ 1 indexes the countably many traits; in particular, J_k ∈ ℝ+. Then (J_k, ψ_k) is a pair consisting of the frequency of the k-th trait together with the trait itself. We can represent the full collection of traits paired with their frequencies by a discrete measure on Ψ that places weight J_k at location ψ_k, namely

G = ∑_{k≥1} J_k δ_{ψ_k}.
Next, we form the data point X conditionally on G, viewing X as a discrete measure as well. Each atom of X represents a pair consisting of a trait to which the individual is allocated and the degree to which the individual is allocated to this particular trait. That is, X is a discrete measure whose support coincides with the support of G, and

X = ∑_{k≥1} x_k δ_{ψ_k},  (5.2)

where x_k ∈ ℝ+ represents the degree to which the data point belongs to trait ψ_k.
Recall that any (homogeneous) completely random measure may be uniquely characterized by its Lévy intensity, which can be factorized as

ν(ds × dψ) = ρ(ds) P_0(dψ),

where ρ is a σ-finite deterministic measure on ℝ+ and P_0 is a proper (diffuse) probability distribution on Ψ.
Each jump xk in (5.2) is drawn according to some distribution H that takes Jk,
the weight of G at location ψk, as a parameter; i.e.,
xk ∼ H(dx|Jk) independently across k.
Some assumptions are needed on the prior and the likelihood:

1. ρ(ℝ+) = +∞: we require that the measure has a countably infinite number of atoms.

2. Each data point can be allocated to only a finite number of traits. Thus, we require the number of atoms in every X to be finite, that is, H(dx|J) must be discrete with support ℕ = {0, 1, 2, . . .} for all J, and we write h(x|J) for the probability mass function of x given J. Moreover, note that, by construction, the pairs {(J_k, ψ_k)}_k form a marked Poisson point process with rate measure μ_mark(ds × dx) := ρ(ds) h(x|s), so we assume

∑_{x=1}^∞ ν_x(ℝ+) < +∞, for ν_x(ds) := ρ(ds) h(x|s).
Given these assumptions, the exponential family of completely random measures offers a convenient framework for developing our model. We recall Definition 4.1 of Broderick et al. (2017), discarding the fixed part of the measure, which is not of interest here:
Definition 5.1
We say that a CRM G is an exponential CRM if the ordinary component has rate measure μ(ds × dψ) = ρ(ds) P_0(dψ) for some probability distribution P_0 and weight rate measure ρ of the form

ρ(ds) = γ exp{ ⟨η(s), ξ⟩ − λ A(s) } ds,  (5.3)

where γ > 0, ξ and λ are hyperparameters, η(·) is the natural parameter and A(·) is known as the cumulant function.
Theorem 4.2 of Broderick et al. (2017) states that these random measures are automatically conjugate priors for an exponential CRM likelihood, as follows:

Theorem 5.1
Let G = ∑_{k=1}^∞ J_k δ_{ψ_k}. Let X be generated conditionally on G according to an exponential CRM with fixed-location atoms at {ψ_k}_{k=1}^∞ and no ordinary component. In particular, the distribution of the weight x_k of X at ψ_k has the following density, belonging to the exponential family of parametric distributions, conditioning on the weight J_k of G at ψ_k:

h(x | J_k) = κ(x) exp{ ⟨η(J_k), φ(x)⟩ − A(J_k) }.

Here, κ(x) is a function of the data and φ(x) is the sufficient statistic. Then, a conjugate prior for X is the exponential CRM distribution, with weight rate measure as in (5.3).
As a consequence, very simple marginal and size-biased representations can be derived (for more details, see Broderick et al. (2017)). We also remark that many well-known models in the literature fit within this framework: the Beta-Bernoulli, Poisson-Gamma and Beta-Negative Binomial processes, among others.
5.3 Building a stationary time dependent model for a
sequence of discrete random measures
We start by motivating the model presented here: all the papers mentioned in Section 5.1 focused on flexibly modeling time-dependent univariate continuous or count data. However, it is important to build models that reflect the complexity of the data available nowadays, coming from very different sources (for instance, images, documents, etc.). Driven by this challenge, we propose an extension of those models where we consider discrete measures, in the same framework as Broderick et al. (2017). Some related works are given, among others, by Srebro and Roweis (2005), Williamson et al. (2010) and Caron et al. (2012). We end up with a model that is very flexible, thanks to the nonparametric structure offered by completely random measures, but at the same time mathematically tractable, thanks to the conjugacy property guaranteed by the exponential family of CRMs. The main purpose here is to extend the nonparametric generalized latent trait model discussed in Section 1.3.3 to include time dependence in the underlying process. Therefore, we are going to define a model expressing formulas (1.26)-(1.27) and (1.29)-(1.30) in Chapter 1. Note that in what follows, the process X_t is nothing other than Θ_t in Section 1.3.3.
5.3.1 The model
The aim is to build a model for discrete measures evolving in time of the form (5.2), namely a time series

X_t = ∑_{k=1}^{+∞} x_{tk} δ_{ψ_k},  t = 0, 1, . . . , T,  (5.4)

where x_{tk} ∼ind h(·|J_k) and h has the exponential family form above. We are going to exploit the construction described in Section 5.1; however, the auxiliary random variable that we consider is an exponential CRM. This choice allows us to write down the posterior p(G | X_{t−1}), which is composed of two components:
– the ordinary part, a CRM whose Lévy intensity is updated as

ρ_post(s) = γ κ(0) exp{ ⟨ξ + φ(0), η(s)⟩ − (λ+1) A(s) };

– the fixed-location component, which can be written as

∑_{j=1}^{K_new} J_{new,j} δ_{ψ_{new,j}},

where K_new is the number of components of X_{t−1} that have actually been observed (the number of k such that x_{(t−1)k} > 0) and

J_{new,j} ∼ f_{new,j}(s) ∝ exp{ ⟨ξ + φ(x_{new,j}), η(s)⟩ − (λ+1) A(s) }.
We are also able to compute the transition kernel in (5.1), p(X_t | X_{t−1}), obtained by integrating out the latent random measure G: this is specified in the next proposition.
Proposition 1
The transition kernel for a sequence of discrete random measures belonging to the exponential family can be described by two parts:

1. the values of x_{tk} corresponding to the ψ_k that have been observed in X_{t−1} = ∑_k x_{(t−1)k} δ_{ψ_k} are sampled according to

h_cond(x_{tk} = x | x_{(t−1)k}) = κ(x) exp{ −B(ξ + φ(x_{(t−1)k}), λ+1) + B(ξ + φ(x_{(t−1)k}) + φ(x), λ+2) },

where x_{(t−1)k} > 0 and exp(B(a, b)) = ∫ exp( ⟨a, η(θ)⟩ − b A(θ) ) dθ;

2. for every x = 1, 2, . . ., new atoms are observed: their number is ρ^new_{t,x} ∼ Poisson(M_{t,x}), with locations ψ_{t,x,j} ∼iid P_0, j = 1, . . . , ρ^new_{t,x}. Here, M_{t,x} = γ κ(0) κ(x) exp{ B(ξ + φ(0) + φ(x), λ+2) }.

The result follows easily by applying Corollary 6.2 in Broderick et al. (2017).
Looking closely at p(X_t | X_{t−1}), it is clear that this distribution is given by two contributions: an innovation term, which consists of sampling new items ψ_k according to a thinned Poisson process, and an inserting/deleting (thinning) process, where we re-sample the value x_{tk} related to each location ψ_k that has been observed at time t−1. Thus, we can write the two contributions as follows:

X_t | X_{t−1} =^d ∑_k x^thin_{tk} δ_{ψ_{(t−1)k}} + ∑_{x≥1} ∑_{j=1}^{ρ_{t,x}} x δ_{ψ^new_{j,x}}.  (5.5)
The following proposition specifies the likelihood of the model for (X_0, X_1, . . . , X_T).

Proposition 2
The likelihood of our model is the following:

L(X_0, X_1, . . . , X_T | γ, ξ, λ, Δ) = L(X_0 | γ, ξ, λ, Δ) ∏_{t=1}^T L(X_t | X_{t−1}, γ, ξ, λ, Δ)

∝ ∏_{x≥1} L(ρ^new_{0,x} | γ, ξ, λ) ∏_{j=1}^{ρ_0} P_0(ψ_{0j} | Δ) × ∏_{t=1}^T [ ∏_{l=1}^{ρ_{t−1}} h_cond(x^thin_{tl} | x_{(t−1)l} > 0, ξ, λ) × ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ) ∏_{x≥1} L(ρ^new_{t,x} | ξ, λ, γ) ]

∝ ∏_{x≥1} ∏_{t=0}^T Poisson(ρ^new_{t,x}; M_{t,x}) × ∏_{j=1}^{ρ_0} P_0(ψ_{0j} | Δ) ∏_{t=1}^T ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ)
  × ∏_{t=1}^T ∏_{l=1}^{ρ_{t−1}} κ(x^thin_{tl}) exp{ −B(ξ + φ(x_{(t−1)l}), λ+1) + B(ξ + φ(x_{(t−1)l}) + φ(x^thin_{tl}), λ+2) }

∝ exp{ −∑_{x≥1} ∑_{t=0}^T ( M_{t,x} − ρ^new_{t,x} log(M_{t,x}) ) + ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} ( −B(ξ + φ(x_{(t−1)l}), λ+1) + B(ξ + φ(x_{(t−1)l}) + φ(x^thin_{tl}), λ+2) ) }
  × ∏_{x≥1} ∏_{t=0}^T (1 / ρ^new_{t,x}!) × ∏_{t=1}^T ∏_{l=1}^{ρ_{t−1}} κ(x^thin_{tl}) ∏_{j=1}^{ρ_0} P_0(ψ_{0j} | Δ) ∏_{t=1}^T ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ)

where Δ and (γ, ξ, λ) are the parameters of P_0 and of the Lévy density, respectively. M_{t,x} has been defined above at point 2., and it depends on γ, ξ and λ. Moreover, ρ^new_{0,x} = #{k : x_{0k} = x} and ρ^new_{t,x} = #{k : x_{tk} = x and ψ_{tk} is new}, x = 1, 2, . . ., are the numbers of items with label x observed for the first time at times 0 and t; ρ_0 = ∑_{x≥1} ρ^new_{0,x} and ρ^new_t = ∑_{x≥1} ρ^new_{t,x} are the numbers of items observed for the first time at time t = 0, 1, . . . , T. Lastly, ρ_{t−1} is the number of observed items/traits at the previous time.
As a further analysis of the proposed model, it is interesting to investigate whether an autoregressive relationship between X_t and X_{t−1} holds. In particular, an autoregressive model specifies that the mean depends linearly on previous values. Can we recover a similar relationship? By exploiting relation (5.5), we have

E(X_t(A) | X_{t−1}) = ∫ [ ∑_k x^thin_{tk} δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} ∑_{j=1}^{ρ_{t,x}} x δ_{ψ^new_{j,x}}(A) ] × L( dx^thin_{tk}, k ≥ 1, dψ^new_{j,x}, j = 1, . . . , ρ_{t,x}, dρ_{t,x}, x ≥ 1 | X_{t−1} )

= ∑_{k≥1} E( x^thin_{tk} | x_{(t−1)k} ) δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} ∑_{L=1}^{+∞} x L Poisson(L; M_{t,x}) P_0(A),

so that we end up with

E(X_t(A) | X_{t−1}) = ∑_{k≥1} E( x^thin_{tk} | x_{(t−1)k} ) δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} x M_{t,x} P_0(A).  (5.6)
We are going to specify this relationship for the three special cases below.
Other quantities that may be useful for interpreting and fixing the hyperparameters of the exponential CRM, namely (ξ, λ, γ), are the expected value of the number of features associated with a (strictly) positive weight and the expected value of the total mass of the CRM.
Proposition 3
Let X be a completely random measure defined as in (5.4): then, the expected value of the number of features associated with a positive weight is

E( ∑_{k=1}^∞ I(x_k > 0) ) = ∑_{x=1}^∞ γ κ(x) exp( B(ξ + φ(x), λ+1) )  (5.7)

and the expected value of the total mass is

E( X(Ψ) ) = ∑_{x=1}^∞ γ x κ(x) exp( B(ξ + φ(x), λ+1) ).  (5.8)
Proof. Formula (5.7) can be computed by using the law of total expectation and the size-biased representation of a CRM of Corollary 5.2 in Broderick et al. (2017) with m = 1:

E( ∑_{k=1}^∞ I(x_k > 0) ) = E_G[ E_X( ∑_{i≥1} ∑_{j=1}^{ρ_i} I(x_{i,j} > 0) | G = ∑_{i≥1} ∑_{j=1}^{ρ_i} J_{i,j} δ_{ψ_{i,j}} ) ] = E_G( ∑_{i≥1} ρ_i ) = ∑_{i=1}^∞ γ κ(i) exp( B(ξ + φ(i), λ+1) ),

since ρ_i ∼ Poisson(M_i) and M_i = γ κ(i) exp( B(ξ + φ(i), λ+1) ) (see formula (28) in Broderick et al. (2017)); moreover, the size-biased representation allows us to reorder the atoms so that the weights x_k appear in increasing order, i.e. we have ρ_1 weights taking value 1, ρ_2 weights taking value 2, etc. This helps in the computation. Formula (5.8) is obtained with a similar reasoning, as follows:

E( X(Ψ) ) = E( ∑_{k=1}^∞ x_k ) = E_G[ E_X( ∑_{i≥1} ∑_{j=1}^{ρ_i} x_{i,j} | G = ∑_{i≥1} ∑_{j=1}^{ρ_i} J_{i,j} δ_{ψ_{i,j}} ) ] = E_G( ∑_{i≥1} i ρ_i ) = ∑_{i=1}^∞ i M_i = ∑_{i=1}^∞ γ i κ(i) exp( B(ξ + φ(i), λ+1) ).
The two formulas above are specified for the Poisson-Gamma case in Section 5.3.3. Before looking closely at three special cases, a remark on the growth of the total number of traits is due. We saw that, at each time instant, some new features appear, generated from the base distribution P_0: denote by K_T the total number of features appeared in the process up to time T. Then, the growth rate of K_T is linear. This is due to stationarity, since

ρ^new_t ∼ Poisson( ∑_{x≥1} M_{t,x} ),

where M_{t,x} = γ κ(0) κ(x) exp( B(ξ + φ(0) + φ(x), λ+2) ) does not depend on t. Thus

E(K_T) = ∑_{t=1}^T ∑_{x≥1} M_{t,x} = γ κ(0) T ∑_{x≥1} κ(x) exp( B(ξ + φ(0) + φ(x), λ+2) )

grows linearly with T.
5.3.2 Example 1: Beta - Bernoulli
As prior for G, consider the well-known Beta process, rst introduced by Hjort
(1990), corresponding to the following choice of Lévy intensity,
ρ(s) = γs−1(1− s)c−1, γ > 0, c > 0
and a Bernoulli process likelihood (Thibaux and Jordan (2007)), where
xtk|Jkind∼ Be(Jk)
134
so that the jumps xtk ∈ 0, 1.Regarding the predictive law, simple calculations lead to the following results
for the conditional distribution of xtk given x(t−1)k = 1,
xtk|x(t−1)k = 1 ∼ Be(
1
c+ 1
)and
ρnewt ∼ Poisson(
γ
c+ 1
).
Equation (5.6) turns out to be
E (Xt(A)|Xt−1) =1
c+ 1Xt−1(A) +
γ
c+ 1P0
which is of AR(1) type.
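These two ingredients (Bernoulli thinning with survival probability 1/(c+1) and Poisson(γ/(c+1)) innovations) are easy to simulate forward. The helper below is our own sketch; it starts from the empty measure, so early iterations are burn-in, after which the number of active atoms fluctuates around the fixed point γ/c of the recursion m = m/(c+1) + γ/(c+1).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_beta_bernoulli(T, gamma=5.0, c=0.5):
    """Forward simulation of the Beta-Bernoulli process of Section 5.3.2:
    an atom active at time t-1 survives with probability 1/(c+1)
    (thinning), and Poisson(gamma/(c+1)) brand-new atoms appear at each
    step (innovation). Returns the number of active atoms at each time."""
    p_keep = 1.0 / (c + 1.0)
    rate_new = gamma / (c + 1.0)
    active, next_label = set(), 0
    counts = np.empty(T, dtype=int)
    for t in range(T):
        active = {k for k in active if rng.random() < p_keep}  # thinning
        n_new = rng.poisson(rate_new)                          # innovation
        active |= set(range(next_label, next_label + n_new))
        next_label += n_new
        counts[t] = len(active)
    return counts

counts = simulate_beta_bernoulli(20_000)
# after burn-in the count fluctuates around gamma / c
```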
Moreover, the likelihood is the following:

L(X_0, X_1, . . . , X_T | γ, c, Δ) = Poisson( ρ_0; γ/(c+1) ) ∏_{j=1}^{ρ_0} P_0(ψ_{0,j} | Δ)
  × ∏_{t=1}^T [ ∏_{l=1}^{ρ_{t−1}} Be( x^thin_{tl}; 1/(c+1) ) ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ) Poisson( ρ^new_t | γ/(c+1) ) ]

∝ c^{N_T − S_T} (c+1)^{−N_T − N^new_T} γ^{N^new_T} exp{ −γ(T+1)/(c+1) } ∏_{t=0}^T ∏_{j=1}^{ρ^new_t} P_0(ψ^new_{tj} | Δ),

where N_T = ∑_{t=1}^T ρ_{t−1} (total number of observed traits), S_T = ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl} (total number of items that survived the thinning) and N^new_T = ∑_{t=0}^T ρ^new_t.
Figure 5.1 shows some simulated data, where the number of time steps is 6 and the values of γ and c vary. In particular, γ acts as a mass parameter: increasing γ leads to more atoms from the innovation part. On the other hand, if the parameter c increases, the items are less persistent, since the probability of observing them again is smaller.
Now, suppose we assign two independent gamma priors to the parameters γ and c, γ ∼ gamma(a, b) and c ∼ gamma(s, r). In this case, the full-conditional for γ is

γ | data, c, Δ ∼ gamma( a + N^new_T, b + (T+1)/(c+1) )

and the full-conditional for c is proportional to

c^{N_T − S_T + s − 1} (c+1)^{−N^new_T − N_T} exp{ −γ(T+1)/(c+1) − rc } I(c > 0),

so that a Metropolis-Hastings step is needed.
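For the c update, a random-walk Metropolis-Hastings step on log c is one simple option (our choice here; any valid proposal would do). The sufficient-statistic values below are hypothetical toy numbers, only meant to show the shape of the update.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_fc_c(c, gamma, N_T, S_T, N_new_T, T, s=1.0, r=1.0):
    """Log full-conditional of c (up to an additive constant)."""
    if c <= 0:
        return -np.inf
    return ((N_T - S_T + s - 1) * np.log(c)
            - (N_new_T + N_T) * np.log(c + 1.0)
            - gamma * (T + 1) / (c + 1.0)
            - r * c)

def mh_step_c(c, gamma, N_T, S_T, N_new_T, T, step=0.3):
    """Random-walk MH on log(c); the log transform requires the Jacobian
    correction log(c_prop) - log(c) in the acceptance ratio."""
    c_prop = c * np.exp(step * rng.normal())
    log_acc = (log_fc_c(c_prop, gamma, N_T, S_T, N_new_T, T)
               - log_fc_c(c, gamma, N_T, S_T, N_new_T, T)
               + np.log(c_prop) - np.log(c))
    return c_prop if np.log(rng.random()) < log_acc else c

# hypothetical sufficient statistics, for illustration only
c, draws = 1.0, []
for _ in range(5_000):
    c = mh_step_c(c, gamma=3.0, N_T=50, S_T=25, N_new_T=30, T=100)
    draws.append(c)
```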
Figure 5.1: Simulated data, where T = 6, and (γ, c) = (1, 0.1) (left column), (γ, c) = (5, 0.1) (right column).
Figure 5.2: Posterior distribution for c in the two settings: (a, b, s, r) = (3, 1, 1, 1) (left) and (a, b, s, r) = (3, 1, 0.1, 0.05) (right). The vertical line represents the true value of the parameter c.
As a toy example, we simulated from this process a time series with T = 100 time instants, γ = 3 and c = 1. Two different sets of priors have been chosen: first, we assigned (a, b, s, r) = (3, 1, 1, 1), and second, (a, b, s, r) = (3, 1, 0.1, 0.05), so that the a-priori variance for c is first small (1) and then larger (40). After running 10000 iterations of the Gibbs sampler, we obtained the posterior distributions for c shown in Figure 5.2. In both cases, the posterior is concentrated around the exact value. Figure 5.3 shows the predictive distribution for some quantities of interest: the number of new items observed at time t+1, L(ρ^new_{t+1} | data) (left), and the posterior probability that an item is observed s times consecutively, s = 1, 2, . . . (right). The latter quantity, conditionally on c, is given by (1/(c+1))^s.
5.3.3 Example 2: Poisson - Gamma
Consider now another pair of conjugate processes in the exponential CRM family, namely the Poisson likelihood and the Gamma process. Suppose the weight x_k at location ψ_k has support on ℕ and a Poisson density with parameter J_k ∈ ℝ+:

h(x | J_k) = (1/x!) J_k^x e^{−J_k} = (1/x!) e^{x log(J_k) − J_k},

so that

κ(x) = 1/x!,  φ(x) = x,  η(s) = log(s),  A(s) = s.

The conjugate process in this case is the so-called generalized Gamma process, where

ρ(s) = γ s^ξ e^{−λs},  γ > 0, ξ ∈ (−2, −1], λ > 0.
Figure 5.3: Predictive distribution for the number of new items observed at time t+1, L(ρ^new_{t+1} | data) (left), and posterior probability that an item is observed s times consecutively, where s = 1, 2, . . . (right).
From now on, set ξ = −1, as usual in the literature. Broderick et al. (2017) first established the conjugacy of the Poisson-Gamma processes. From an applied viewpoint, this model recently emerged in the literature as a prior in Bayesian nonparametric learning scenarios; in particular, a Poisson-Gamma process may be employed when we assume multiple latent features associated with the observations, where each feature can have multiple occurrences within each data point. See, for instance, Titsias (2008) and Roychowdhury and Kulis (2015). In the latter work, the authors propose a variational algorithm for inference under models involving Gamma processes and derive an error bound for that approximation; the model is then applied to the problem of learning latent topics in document corpora. In Titsias (2008), the issue of learning visual object recognition systems from unlabelled images is investigated.
In our model, we have

h_cond( x_{tk} = x | x_{(t−1)k} = y ) = (x+y−1 choose x) (1 − 1/(λ+2))^y (1/(λ+2))^x,

namely x_{tk} | x_{(t−1)k} ∼ NegBin( x_{(t−1)k}; 1/(λ+2) ), and

ρ^new_{t,x} ∼ Poisson( (γ/x) (λ+2)^{−x} ),  x = 1, 2, . . . .
Figure 5.4 shows two simulated datasets, where the mass parameter γ is fixed to 10 and λ takes values 0.1 (left panel) and 2.5 (right). It is clear that λ controls the thinning part, since a larger value of λ implies fewer repetitions and, in general, smaller values of the degrees.
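One transition of this Poisson-Gamma process is straightforward to simulate. The helper below is our own sketch; note that NumPy's `negative_binomial(n, p)` counts failures before `n` successes with success probability `p`, so the NegBin(y; 1/(λ+2)) of the text corresponds to `negative_binomial(y, (λ+1)/(λ+2))`. The innovation rates (γ/x)(λ+2)^{−x} decay geometrically in x, so truncating at a moderate x_max is harmless.

```python
import numpy as np

rng = np.random.default_rng(3)

def step_poisson_gamma(x_prev, gamma=10.0, lam=2.5, x_max=50):
    """One transition X_{t-1} -> X_t of the Poisson-Gamma process.
    x_prev: dict {atom_label: degree > 0}. Surviving degrees follow
    NegBin(x_{(t-1)k}; 1/(lam+2)); for each x >= 1, a Poisson number of
    new atoms of degree x appears with rate gamma/x * (lam+2)**(-x)."""
    q = 1.0 / (lam + 2.0)                      # NegBin parameter of the text
    state = {}
    for k, y in x_prev.items():                # thinning step
        x = rng.negative_binomial(y, 1.0 - q)
        if x > 0:
            state[k] = x
    label = max(x_prev, default=-1) + 1
    for x in range(1, x_max + 1):              # innovation step
        for _ in range(rng.poisson(gamma / x * (lam + 2.0) ** (-x))):
            state[label] = x
            label += 1
    return state

# conditional mean of the total mass matches (5.6):
# E[X_t(Psi) | X_{t-1}] = X_{t-1}(Psi)/(lam+1) + gamma/(lam+1)
masses = [sum(step_poisson_gamma({0: 5}, gamma=4.0, lam=1.0).values())
          for _ in range(20_000)]
```

With x_{t−1} = 5, γ = 4 and λ = 1 the conditional mean is 5/2 + 4/2 = 4.5, which the Monte Carlo average reproduces.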
Figure 5.4: Simulated data, where T = 6, and (γ, λ) = (10, 0.1) (left column), (γ, λ) = (10, 2.5) (right column).
Relation (5.6) is, in this case,

E(X_t(A) | X_{t−1}) = ∑_{k≥1} (1/(λ+2)) (1 − 1/(λ+2))^{−1} x_{(t−1)k} δ_{ψ_{(t−1)k}}(A) + ∑_{x≥1} x (γ/x) (λ+2)^{−x} P_0(A)
  = (1/(λ+1)) X_{t−1}(A) + (γ/(λ+1)) P_0(A),

which is of AR(1) type.
The likelihood is

L(X_0, X_1, . . . , X_T | γ, λ, Δ) = ∏_{x≥1} Poisson( ρ^new_{0,x} | γ, λ ) × ∏_{t=1}^T [ ∏_{x≥1} Poisson( ρ^new_{t,x} | γ, λ ) × ∏_{l=1}^{ρ_{t−1}} NegBin( x^thin_{tl} | x_{(t−1)l} > 0, 1/(λ+2) ) ] × ∏_{t=0}^T ∏_{j=1}^{ρ_t} P_0(ψ_{tj} | Δ)

∝ exp( −γ(T+1) log((λ+2)/(λ+1)) ) γ^{∑_{x≥1} ∑_{t=0}^T ρ^new_{t,x}} (λ+2)^{−∑_{x≥1} x ∑_{t=0}^T ρ^new_{t,x}} × (λ+1)^{N_T} (λ+2)^{−N_T − S_T},

where N_T = ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x_{(t−1)l} and S_T = ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl}. As prior on (λ, γ), we assume

(λ, γ) ∼ gamma(s, r) × gamma(a, b).
In the recent literature there are works on dynamic modeling of count matrices that can be compared to our proposal, among others Acharya et al. (2015) (gamma process dynamic Poisson factor analysis) and Han et al. (2014) (dynamic rank factor model).
We conclude with the computation of formulas (5.7) and (5.8) in this case:

E( ∑_k I(x_k > 0) ) = γ ∑_{i=1}^∞ Γ(ξ+1+i) / ( i! (λ+1)^{ξ+1+i} ) = γ (Γ(ξ+2)/(ξ+1)) ( λ^{−(ξ+1)} − (λ+1)^{−(ξ+1)} ),

which for ξ = −1 reduces to γ log((λ+1)/λ), and

E( X(Ψ) ) = E( ∑_k x_k ) = γ Γ(ξ+2) / λ^{ξ+2},

which for ξ = −1 reduces to γ/λ.
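Both closed forms can be checked numerically: for ξ = −1 the Poisson rate of the number of atoms with weight x is M_x = γ κ(x) exp(B(ξ + φ(x), λ+1)) = γ/(x(λ+1)^x), and summing M_x and x M_x reproduces γ log((λ+1)/λ) and γ/λ.

```python
import math

gamma_, lam = 3.0, 2.0
# Poisson rates of the number of atoms with weight x, for xi = -1
M = {x: gamma_ / (x * (lam + 1.0) ** x) for x in range(1, 200)}

n_active = sum(M.values())                     # E[# atoms with x_k > 0]
total_mass = sum(x * m for x, m in M.items())  # E[X(Psi)]

assert abs(n_active - gamma_ * math.log((lam + 1.0) / lam)) < 1e-10
assert abs(total_mass - gamma_ / lam) < 1e-10
```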
5.3.4 Example 3: Beta prime - Odds Bernoulli
The last example regards another process where x_k ∈ {0, 1}, called the Odds Bernoulli process, introduced in Broderick et al. (2017). In this case, the probability mass function of x_k is

h(x | J_k) = J_k^x (1 + J_k)^{−1} = exp( x log(J_k) − log(1 + J_k) ),

that is, if J = p/(1 − p), where p is the probability of a successful Bernoulli draw, J can be seen as an odds ratio. Then,

κ(x) = 1,  φ(x) = x,  η(s) = log(s),  A(s) = log(1 + s).

The conjugate process is the so-called Beta prime process, with

ρ(s) = γ s^ξ (1 + s)^{−λ},  s > 0, γ > 0, ξ ∈ (−2, −1], λ > ξ + 1.

Simple calculations lead to

exp(−B(a, b)) = Γ(b) / ( Γ(a+1) Γ(b−a−1) ),

so that the conditional law is

x_{tk} | x_{(t−1)k} = 1 ∼ Be( (ξ+2)/(λ+1) )

and

ρ^new_t ∼ Poisson( γ Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) ).
Moreover, relation (5.6) is, in this case,

E(X_t(A) | X_{t−1}) = ((ξ+2)/(λ+1)) X_{t−1}(A) + γ ( Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) ) P_0(A),

which is again of AR(1) type.
We conclude by specifying the likelihood:

L(X_0, X_1, . . . , X_T | γ, ξ, λ, Δ) ∝ exp{ −γ ( Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) ) (T+1) } ( γ Γ(ξ+2) Γ(λ−ξ) / Γ(λ+2) )^{N_T}
  × (λ+1)^{−S_T} (ξ+2)^{∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl}} (λ−ξ−1)^{S_T − ∑_{t=1}^T ∑_{l=1}^{ρ_{t−1}} x^thin_{tl}},

where S_T = ∑_{t=1}^T ρ_{t−1} and N_T = ∑_{t=0}^T ρ^new_t (with ρ^new_0 = ρ_0).
5.4 Application: latent feature model on a synthetic
dataset of images
The Indian Buffet process (Griffiths and Ghahramani (2011)) and its dependent extensions (Williamson et al. (2010) and Miller et al. (2012), among others) have been used in models for unsupervised learning in which a linear Gaussian latent feature model is employed to investigate the hidden binary features. In particular,
suppose that each datum x_t is generated from a Gaussian distribution with mean z_t A, where A is a feature matrix, so that x_t = z_t A + noise. We refer to the model described in Section 5 of Griffiths and Ghahramani (2011), but a similar example can be found in Ruiz et al. (2014). Consider a simulated dataset consisting of an 8 × 8 image evolving over T time steps. The features that generated our data are represented in Figure 5.5.
Figure 5.5: Features.
The noise is Gaussian with mean zero and variance σ²_X, to be specified later. Note that the features, namely the rows of the matrix A, are the location points of the measure in our representation: a priori, they are Gaussian distributed too, with zero mean and variance given by σ²_A I.
The goal is to build an MCMC algorithm to sample from

L(A, z_1, z_2, . . . , z_T, γ, c, σ²_X, σ²_A | x^(1), x^(2), . . . , x^(T)),

where x^(t), t = 1, . . . , T, are vectors of dimension 64 representing the images; we wish to recover the underlying features of Figure 5.5. The complete model is the following:

L(X | Z, σ²_X) = ∏_{t=1}^T N_D( x^(t) | z^(t) A, σ²_X I )
A ∼ N_{K+ × D}( 0, σ²_A )
σ²_X ∼ invgamma( α_X, β_X )
σ²_A ∼ invgamma( α_A, β_A )
L( z^(1), z^(2), . . . , z^(T) | γ, c ) = L( z^(1) | γ, c ) ∏_{t=2}^T L( z^(t) | z^(t−1), γ, c )
(γ, c) ∼ gamma(a, b) × gamma(s, r)
where K+ is the number of columns of Z = (z_1, z_2, . . . , z_T)^T with sum greater than 0, and D is the dimension of each datum, D = 64 in this case. The prior for the feature matrix A is the Gaussian distribution for matrices of dimension K+ × D: each column has mean 0 and variance-covariance matrix σ²_A I. For the law of the binary matrices assigning the features to data, Z = (z_1, . . . , z_T), see the example in Section 5.3.2.

Note that, in CRM notation, the z_t are the jumps of the measure, and the features a_1, a_2, . . ., namely the rows of A, are the traits. Thus, one could define the underlying time-dependent CRMs of our model as

Y_t = ∑_{k≥1} z_{tk} δ_{a_k}.

The model described above is entailed in the general class of models in Chapter 1, formulas (1.28)-(1.30). There, the kernel K is given by the multivariate Gaussian distribution, the link function is the identity and the base measure P_0 is again multivariate Gaussian. The generalized Indian Buffet (GIB) process has been replaced by the autoregressive lag-1 type process described in Section 5.3.2.
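To illustrate, data from such a linear Gaussian latent feature model can be generated as below. This is our own sketch with K truncated to four fixed features and an ad-hoc re-activation probability standing in for the unbounded innovation part; in the actual model K+ is random and may grow.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in for the synthetic setup of Section 5.4:
# D = 64 pixels (8 x 8 images), K = 4 binary features, x_t = z_t A + noise.
D, K, T = 64, 4, 50
sigma_X, sigma_A = 0.5, 1.0
c = 0.1
p_keep = 1.0 / (c + 1.0)      # persistence probability from Section 5.3.2

A = rng.normal(0.0, sigma_A, size=(K, D))   # rows are feature images
Z = np.zeros((T, K), dtype=int)
Z[0] = rng.integers(0, 2, size=K)
for t in range(1, T):
    survive = rng.random(K) < p_keep        # thinning of active features
    activate = rng.random(K) < 0.1          # ad-hoc innovation for fixed K
    Z[t] = np.where(Z[t - 1] == 1, survive, activate).astype(int)

X = Z @ A + rng.normal(0.0, sigma_X, size=(T, D))  # observed images
```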
5.4.1 Devising a particle Gibbs sampler
Sequential Monte Carlo (SMC) methods are a popular class of algorithms employed for sampling from general high-dimensional probability distributions; they proved to be very efficient when dealing with state-space (or hidden Markov) models, which is in fact our case (see Doucet and Johansen (2009)). Our goal is to sample from a distribution of the form p(θ, A, z_1, . . . , z_T | x_1, . . . , x_T), where the hidden Markov state process is defined on the (ideally) infinite-dimensional vector of binary variables representing the presence or absence of features at a certain time. Moreover, θ = (σ²_X, σ²_A, c, γ) here. We employ the particle Gibbs sampler of Andrieu et al. (2010) to reach this goal. In that work, the authors proposed a valid particle approximation to the Gibbs sampler, where a conditional SMC step is required.
In a nutshell, the algorithm alternates two steps: the update of the static parameters θ through their full-conditionals, θ ∼ p(θ | Z, x_1, . . . , x_T, A) (using Metropolis-Hastings steps when needed), and a run of a conditional SMC algorithm whose target is p_θ(z_1, . . . , z_T, A | x_1, . . . , x_T), conditional on the previously drawn θ and its ancestral lineage. After this step, we have N particle/weight couples (Z^(i), A^(i), w_i), i = 1, . . . , N, that approximate the target distribution, where Z^(i) is a sequence of T binary vectors and A^(i) is a K+(i) × D matrix whose rows contain the features. Sampling from this approximation simply requires drawing an index n from a discrete distribution with weights (w_1, . . . , w_N) and taking the corresponding particle (Z^(n), A^(n)). In the following, we specify the steps of the algorithm that are related to our specific model; for an overview of the method, see Andrieu et al. (2010).
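The distinguishing ingredient of the conditional SMC step is the resampling move in which the retained particle is forced to survive. A minimal multinomial version (our sketch; Andrieu et al. (2010) allow other resampling schemes) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(5)

def conditional_resample(weights, b, rng=rng):
    """Multinomial resampling for a conditional SMC step: every particle
    draws its ancestor from Discrete(weights), except particle b, which
    keeps itself so that the prespecified path survives resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    ancestors = rng.choice(len(w), size=len(w), p=w)
    ancestors[b] = b          # the retained path keeps its own ancestry
    return ancestors

anc = conditional_resample([0.1, 0.5, 0.2, 0.2], b=2)
```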
Before that, we need a sequential representation of the feature matrix $A$. Recall that our target is $\mathcal{L}(A, z_1, \ldots, z_T, \theta \mid x_1, \ldots, x_T)$, which is proportional to the joint law.
A convenient way of expressing this joint law is the following, where each line highlights the dependence of the parameters on a certain time (note that $\theta$ is considered fixed):
$$
\begin{aligned}
t = 1 &: \; \mathcal{L}(z_1)\,\mathcal{L}(A^1 \mid z_1, \sigma^2_A)\,\mathcal{N}(x_1 \mid z_1 A^1, \sigma^2_X I)\\
t = 2 &: \; \mathcal{L}(z_2 \mid z_1)\,\mathcal{L}(A^2 \mid A^1, z_{1:2}, \sigma^2_A)\,\mathcal{N}(x_2 \mid z_2 A^{1:2}, \sigma^2_X I)\\
&\;\;\vdots\\
t = T &: \; \mathcal{L}(z_T \mid z_{T-1})\,\mathcal{L}(A^T \mid A^{1:(T-1)}, z_{1:T}, \sigma^2_A)\,\mathcal{N}(x_T \mid z_T A^{1:T}, \sigma^2_X I)
\end{aligned}
$$
where $A^t$ is the submatrix of $A$ containing $K^+_t$ draws from the prior, $K^+_t$ being the number of new features discovered at time $t$, i.e. appearing for the first time at time $t$. Thus $A^{1:t}$ is the union of all the features observed up to time $t$, namely the union of the rows of $A^1, \ldots, A^t$. Therefore, thanks to the specific choice of prior, $\mathcal{L}(A^t \mid A^{1:(t-1)}, z_{1:T}, \sigma^2_A)$ can be written as $\prod_{d=1}^D \mathcal{N}(a^t_d \mid 0, \sigma^2_A I)$, where $a^t_d$ is the $d$-th column of the matrix $A^t$, of length $K^+_t$. The total number of features is simply $K^+ = \sum_{t=1}^T K^+_t$.
This representation helps us in devising a particle Gibbs sampler. We give details on the conditional SMC step: this update is similar to a standard SMC algorithm, but it ensures that a prespecified path $(Z, A)$ with ancestral lineage $B_{1:T}$ survives all the resampling steps, whereas the remaining $N-1$ particles are generated according to a proposal distribution which is hopefully close to the target:
Time $n = 1$

a. For $i = 1, 2, \ldots, N$, $i \neq B_1$, sample
$$
(z_1^{(i)}, A^{1,(i)}) \sim q_\theta(z_1^{(i)}, A^{1,(i)}) = \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A, \sigma^2_X, X_1)\, \pi(z_1^{(i)} \mid \gamma, c).
$$

b. Compute the unnormalized weight of each particle $i = 1, 2, \ldots, N$:
$$
w_1^{(i)} = \frac{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)\, \pi(z_1^{(i)} \mid \gamma, c)}{\mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A, \sigma^2_X, X_1)}.
$$
Time $n \in \{2, \ldots, T\}$

a. For $i \neq B_n$, resample the particles according to a discrete distribution with weights proportional to $w_{n-1}^{(1)}, w_{n-1}^{(2)}, \ldots, w_{n-1}^{(N)}$, denoted $\mathcal{D}(w_{n-1})$. Define $\xi_{n-1}(i) \sim \mathcal{D}(w_{n-1})$ and update the path of the $i$-th particle, $\xi(i) = (\xi_1(i), \ldots, \xi_{n-1}(i))$.

b. For $i = 1, 2, \ldots, N$, $i \neq B_n$, sample
$$
(z_n^{(i)}, A^{n,(i)}) \sim q_\theta(z_n^{(i)}, A^{n,(i)}) = \mathcal{L}(A^{n,(i)} \mid z_{1:n}^{(i \cup \xi(i))}, A^{1:(n-1),(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n}) \times \pi(z_n^{(i)} \mid z_{n-1}^{(\xi_{n-1}(i))}, \gamma, c).
$$

c. Compute the unnormalized weight of each particle $i = 1, 2, \ldots, N$:
$$
w_n^{(i)} = \frac{\mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i \cup \xi(i))}, \sigma^2_X I)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i \cup \xi(i))}, \sigma^2_A)}{\mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n})}
$$

Here $q_\theta$ is the proposal distribution for the particles, consisting of drawing $z_n$ from its prior distribution (namely applying thinning to the parent particle, plus innovation) and sampling the $K^+_n$ new features of $A$, i.e. $A^n$, from their conditional distribution.
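The resampling in step a., which must leave the conditioned particle untouched, can be sketched as follows. This is illustrative Python (not the thesis implementation), with `b` playing the role of the ancestor index $B_n$ of the reference path:

```python
import random

def conditional_resample(weights, b, rng=random):
    """Draw ancestor indices for N particles; the reference particle keeps itself.

    Returns a list `anc` with anc[b] == b, the other entries drawn from the
    discrete distribution D(w) with probabilities proportional to `weights`.
    """
    N = len(weights)
    total = sum(weights)
    probs = [w / total for w in weights]
    anc = rng.choices(range(N), weights=probs, k=N)
    anc[b] = b  # the conditioned path survives every resampling step
    return anc
```

Forcing `anc[b] = b` is exactly what distinguishes the conditional SMC step from a standard SMC resampling step.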
The distribution $\mathcal{L}(A^{n,(i)} \mid z_{1:n}^{(i \cup \xi(i))}, A^{1:(n-1),(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n})$ can be easily computed by observing that the full conditional of the complete matrix $A^{1:n}$ is
$$
\mathcal{L}(A^{1:n} \mid z_{1:n}, x_{1:n}) = \prod_{d=1}^D \mathcal{N}(A^{1:n}_d \mid \mu^d, \Sigma)
$$
where $A^{1:n}_d$ is the $d$-th column of the matrix, $\Sigma$ is the variance-covariance matrix with inverse given by
$$
\Sigma^{-1} = \frac{1}{\sigma^2_X} \sum_{t=1}^n z_t z_t^T + \frac{1}{\sigma^2_A} I
$$
and the mean is
$$
\mu^d = \Sigma \times \sum_{t=1}^n \frac{x_{td}}{\sigma^2_X}\, z_t.
$$
The properties of Gaussian vectors then guarantee that
$$
\mathcal{L}(A^{n,(i)} \mid z_{1:n}^{(i \cup \xi(i))}, A^{1:(n-1),(i \cup \xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n}) = \prod_{d=1}^D \mathcal{N}(A^n_d \mid \mu^d_n, \Sigma_n),
$$
namely we sample the $D$ columns (of length $K^+_n$) independently from a Gaussian distribution with variance-covariance matrix
$$
\Sigma_n = \Sigma_+ - \Lambda^T \bar{\Sigma}^{-1} \Lambda
$$
and mean
$$
\mu^d_n = \mu^d_2 + \Lambda^T \bar{\Sigma}^{-1} (a^d - \mu^d_1),
$$
where $\Sigma_+$, $\bar{\Sigma}$ and $\Lambda$ are submatrices of $\Sigma$, defined according to Figure 5.6.

Figure 5.6: Decomposition of matrix Σ into submatrices.

Here $\bar{\Sigma}$ has dimensions $K^+_{1:(n-1)} \times K^+_{1:(n-1)}$, $\Lambda$ has dimensions $K^+_{1:(n-1)} \times K^+_n$ and $\Sigma_+$ has dimensions $K^+_n \times K^+_n$, where $K^+_{1:n}$ stands for $\sum_{t=1}^n K^+_t$, the number of features appeared up to time $n$. Moreover, $\mu^d_1$ is the subvector of $\mu^d$ containing the first $K^+_{1:(n-1)}$ elements and $\mu^d_2$ is the subvector of $\mu^d$ containing the last $K^+_n$ elements. Finally, $a^d$ is the subvector of $A^{1:n}_d$ containing the first $K^+_{1:(n-1)}$ elements.
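The conditional mean and covariance above are the standard Gaussian conditioning (Schur complement) formulas. A small illustrative sketch, with our own function name and numpy standing in for the linear algebra:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, a_obs, k_old):
    """Condition N(mu, Sigma) on its first k_old coordinates being a_obs.

    With Sigma partitioned as [[S_bar, Lam], [Lam.T, S_plus]] (as in
    Figure 5.6), the remaining coordinates are Gaussian with
        mean = mu2 + Lam.T @ inv(S_bar) @ (a_obs - mu1)
        cov  = S_plus - Lam.T @ inv(S_bar) @ Lam   (Schur complement).
    """
    S_bar, Lam = Sigma[:k_old, :k_old], Sigma[:k_old, k_old:]
    S_plus = Sigma[k_old:, k_old:]
    mu1, mu2 = mu[:k_old], mu[k_old:]
    W = Lam.T @ np.linalg.inv(S_bar)
    return mu2 + W @ (a_obs - mu1), S_plus - W @ Lam
```

In the sampler, `mu` and `Sigma` would be $\mu^d$ and $\Sigma$, `a_obs` the already-observed block $a^d$, and `k_old` the number $K^+_{1:(n-1)}$ of old features.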
Note that the proposal requires sampling $z_t$ from the prior: we simply sample $z^{(t)}_{\text{prop},k} \sim \text{Bernoulli}(1/(c+1))$ for those $k$ such that $z^{(t-1)}_k = 1$, and then add $\rho^{(t)}_{\text{new}}$ new components to the vector, where $\rho^{(t)}_{\text{new}} \sim \text{Poisson}(\gamma/(c+1))$.
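This prior draw (thinning plus Poisson innovation) can be sketched as follows. The names `poisson_draw` and `propose_z` are illustrative, not from the thesis, and the Poisson sampler is a basic Knuth-style implementation suitable only for small rates:

```python
import math
import random

def poisson_draw(lam, rng=random):
    """Knuth's multiplication method for a Poisson(lam) draw (small lam only)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def propose_z(z_prev, c, gamma_, rng=random):
    """Prior draw of z_t given z_{t-1}: a feature active at t-1 survives with
    probability 1/(c+1) (thinning); dead features stay dead; then
    rho_new ~ Poisson(gamma/(c+1)) brand-new features are appended."""
    p_keep = 1.0 / (c + 1.0)
    kept = [1 if (zk == 1 and rng.random() < p_keep) else 0 for zk in z_prev]
    rho_new = poisson_draw(gamma_ / (c + 1.0), rng)
    return kept + [1] * rho_new
```

Note that entries with $z^{(t-1)}_k = 0$ remain 0, which is exactly the no-reappearance property discussed in Section 5.6.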
As far as the static parameters $\theta$ are concerned, the full conditionals are as follows:
$$
\sigma^2_A \mid \text{rest} \sim \text{inv-gamma}\Big(\alpha_A + \frac{K^+ D}{2},\; \beta_A + \frac{1}{2} \sum_{d=1}^D \|a_d\|^2\Big)
$$
$$
\sigma^2_X \mid \text{rest} \sim \text{inv-gamma}\Big(\alpha_X + \frac{TD}{2},\; \beta_X + \frac{1}{2} \sum_{t=1}^T \|x_t - z_t A\|^2\Big)
$$
$$
\gamma \mid \text{rest} \sim \text{gamma}\Big(a_\gamma + N^{\text{new}}_T,\; b_\gamma + \frac{T}{c+1}\Big), \quad \text{where } N^{\text{new}}_T = \sum_{t=1}^T \rho^{\text{new}}_t;
$$
$$
\mathcal{L}(c \mid \text{rest}) \propto c^{N_T - S_T + a_c - 1} (c+1)^{-N^{\text{new}}_T - N_T} \exp\Big(-\frac{\gamma T}{c+1} - b_c c\Big)
$$
with $N_T = \sum_{t=2}^T \rho_{t-1}$ and $S_T = \sum_{t=2}^T \sum_{l=1}^{\rho_{t-1}} x^{\text{thin}}_{tl}$.
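For illustration, the two inverse-gamma updates can be sketched with Python's standard library. The function names are ours (not the thesis implementation, which is in Rcpp); note that `random.gammavariate` is parameterized by shape and scale, hence the `1/rate`:

```python
import random

def sample_inv_gamma(shape, rate, rng=random):
    """Inverse-gamma(shape, rate) draw as 1 / Gamma(shape, rate)."""
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

def sample_sigma2_A(alpha_A, beta_A, A_cols, D, K_plus, rng=random):
    """Draw sigma^2_A from its inverse-gamma full conditional above;
    A_cols lists the D columns a_d of the feature matrix."""
    shape = alpha_A + K_plus * D / 2.0
    rate = beta_A + 0.5 * sum(sum(a * a for a in col) for col in A_cols)
    return sample_inv_gamma(shape, rate, rng)
```

The $\sigma^2_X$ update is identical with shape $\alpha_X + TD/2$ and rate $\beta_X + \tfrac12 \sum_t \|x_t - z_t A\|^2$; the $c$ update, whose density is not of standard form, would instead use a Metropolis-Hastings step.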
The algorithm has been implemented in Rcpp (see Eddelbuettel et al., 2011), a language extension of R that allows us to combine C++ and R. In this way we are able to improve the performance of the algorithm by rewriting key functions in C++.
The following proposition establishes that the weights do not depend on the features, namely the images:

Proposition 4
The weights of the particle Gibbs sampler do not depend on the features.
Proof.
A convenient characteristic of the particle filter is that the weights $w_n^{(i)}$ do not depend on the new features that appear at time $n$, $n = 1, \ldots, T$. Indeed, for every $n \geq 1$, applying Bayes' theorem to the full conditional of $A^{n,(i)}$ in the denominator,
$$
\begin{aligned}
w_n^{(i)} &= \frac{\mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i\cup\xi(i))}, \sigma^2_X I)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A)}{\mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A, \sigma^2_X, X_{1:n})}\\[4pt]
&= \mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i\cup\xi(i))}, \sigma^2_X I)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A)\\
&\quad \times \left[ \frac{\mathcal{L}(X_{1:n} \mid A^{1:n,(i\cup\xi(i))}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)\, \mathcal{L}(A^{n,(i)} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_A)}{\mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)} \right]^{-1}\\[4pt]
&= \mathcal{N}(X_n \mid z_n^{(i)} A^{1:n,(i\cup\xi(i))}, \sigma^2_X I)\, \frac{\mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)}{\mathcal{L}(X_{1:n} \mid A^{1:n,(i\cup\xi(i))}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)}\\[4pt]
&= \frac{\mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)}{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)}\\[4pt]
&= \frac{\int \mathcal{L}(X_{1:n} \mid A^{1:(n-1),\xi(i)}, A_+, z_{1:n}^{(i\cup\xi(i))}, \sigma^2_X)\, \mathcal{L}(\mathrm{d}A_+ \mid z_{1:n}^{(i\cup\xi(i))})}{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)}
\end{aligned}
$$
where the fourth equality uses the fact that, given $A^{1:n}$, the observations are conditionally independent with $X_t \sim \mathcal{N}(z_t A^{1:t}, \sigma^2_X I)$, so the factor for $t = n$ cancels; here $A_+$ is a $(K^+_n \times D)$ matrix whose entries are independent Gaussians with mean $0$ and variance $\sigma^2_A$. Therefore,
$$
\begin{aligned}
w_n^{(i)} &= \frac{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)}{\prod_{t=1}^{n-1} \mathcal{N}(X_t \mid z_t^{\xi_t(i)} A^{1:t,\xi_{1:t}(i)}, \sigma^2_X I)} \times \int \mathcal{N}\big(X_n \mid z_n^{(i)} \big[A^{1:(n-1),\xi(i)}, A_+\big], \sigma^2_X I\big)\, \mathcal{L}(\mathrm{d}A_+ \mid z_{1:n}^{(i\cup\xi(i))})\\[4pt]
&= \int \mathcal{N}\big(X_n \mid z_n^{(i)} \big[A^{1:(n-1),\xi(i)}, A_+\big], \sigma^2_X I\big)\, \mathcal{L}(\mathrm{d}A_+ \mid z_{1:n}^{(i\cup\xi(i))})\\[4pt]
&= \int_{\mathbb{R}^{K^+_n \times D}} \prod_{d=1}^D \mathcal{N}\Big(x_{nd} - \sum_{l=1}^{K^+_{1:(n-1)}} z_{n,l} A^{1:(n-1),\xi(i)}_{l,d}\;\Big|\; \sum_{j=1}^{K^+_n} z_{n,K^+_{1:(n-1)}+j} A^+_{j,d},\; \sigma^2_X\Big) \prod_{k=1}^{K^+_n} \prod_{d=1}^D \mathcal{N}(A^+_{k,d} \mid 0, \sigma^2_A)\, \mathrm{d}A_+,
\end{aligned}
$$
which does not involve the newly proposed features $A^{n,(i)}$.
Figure 5.7: Images at times $t \in \{1, 5, 10, 15, 19, 25\}$.

As far as time $n = 1$ is concerned, the unnormalized weight for
the i-th particle is, using Bayes' theorem,
$$
w_1^{(i)} = \frac{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)\, \pi(z_1^{(i)} \mid \gamma, c)}{\mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A, \sigma^2_X, X_1)} = \frac{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)\, \pi(z_1^{(i)} \mid \gamma, c)\, m(X_1 \mid z_1^{(i)})}{\mathcal{N}(X_1 \mid z_1^{(i)} A^{1,(i)}, \sigma^2_X I)\, \mathcal{L}(A^{1,(i)} \mid z_1^{(i)}, \sigma^2_A)} = m(X_1 \mid z_1^{(i)})\, \pi(z_1^{(i)} \mid \gamma, c).
$$
5.4.2 Numerical results

To assess the effectiveness of our algorithm, we simulated 3 different datasets: the first (i) with $T = 25$, 4 features (the first 4 of Figure 5.5) and a medium/low noise level, $\sigma^2_X = 0.01$; the second (ii) with a higher noise level, $\sigma^2_X = 0.05$; and the third (iii) with $T = 100$ observations, 6 latent features and a medium noise level, $\sigma^2_X = 0.02$. Figure 5.7 shows a sample of six observations from dataset (i).
We set the parameters of the priors as follows:
$(\alpha_A, \beta_A)$: parameters of an inverse gamma with mean 1 and variance 2;
$(\alpha_X, \beta_X)$: parameters of an inverse gamma with mean 0.1 and variance 0.1;
$(a_\gamma, b_\gamma)$: parameters of a gamma with mean 0.5 and variance 2;
$(a_c, b_c)$: parameters of a gamma with mean 1.1 and variance 1.
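Since the hyperparameters above are specified through means and variances, the corresponding shape/rate pairs follow from the standard moment formulas. A small illustrative helper (our own convenience code, assuming the shape/rate parameterization):

```python
def gamma_from_moments(mean, var):
    """Shape and rate of a gamma distribution with the given mean and variance."""
    rate = mean / var
    return mean * rate, rate

def inv_gamma_from_moments(mean, var):
    """Shape and rate of an inverse gamma with the given mean and variance
    (the variance is finite only for shape > 2)."""
    shape = mean ** 2 / var + 2.0
    rate = mean * (shape - 1.0)
    return shape, rate
```

For instance, an inverse gamma with mean 1 and variance 2 corresponds to shape 2.5 and rate 1.5.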
As far as the MCMC parameters are concerned, we set $N = 2000$ particles for the conditional SMC step and 2000 total iterations after a burn-in of 1000 and a thinning of 5 for all the tests. For the first dataset, the algorithm identified exactly 4 features, represented in Figure 5.8 (the estimated features at the last iteration). The latent states $z_t$ are also perfectly estimated in this case, for every $t = 1, \ldots, 25$.

Estimating features with more robust approaches is not a trivial problem: we are not aware of any established method that estimates the features by minimizing some posterior loss function between true and estimated features, in the spirit of Lau and Green (2007a) for clustering. Recently, an R package called sdols has been released by David Dahl and Peter Müller. The package provides methods for summarizing distributions on feature allocations, based on a sequentially allocated latent structure optimization algorithm that minimizes different loss functions; however, the associated paper is not yet available, so the use of the package is limited by the lack of documentation. Moreover, we would need more general approaches, going beyond feature allocation (see the next example on Poisson Factor Analysis). A simpler idea is the use of a 0-1 loss function, leading to the MAP estimator. However, this approach also presents many disadvantages: in such a high-dimensional parameter space, it is very unlikely that the same parameter value is observed more than once.
Figure 5.8: Features A at the last iteration for simulated data (i).
The traceplots of the parameters $\sigma^2_X$, $\sigma^2_A$, $c$ and $\gamma$ do not exclude convergence and show good mixing (plots not reported here).

Figure 5.9 shows the expected value of the predictive distribution for a new image, at time $T + 1$ (b), and for the image at time $t = 19$ (a).

For the simulated data (ii), where the noise is higher, the algorithm found 6 features, two of which contain only noise. However, the predictive means
Figure 5.9: (a) Predictive distribution (left) and real observation (right) for time t = 19 and (b) for a new time instant T + 1, x_{T+1}. The last image shows the observation at time T.
remain satisfactory for all $t = 1, \ldots, T + 1$.

Finally, under simulation truth (iii), we obtained 7 features at the last iteration: six of them are a good estimate of the features in Figure 5.5 and the last one contains only noise. We note, however, that some observations actually contain only noise, since no features are present: the algorithm is thus able to capture these situations as well.
5.5 Application: Poisson Factor Analysis for time dependent topic modelling

We consider the problem of learning latent topics in a time series of documents. A popular approach considers a binary matrix recording whether a word is present in each document: in this framework, Perrone et al. (2016) provide a model that includes time dependence. Here, on the other hand, the observations are given by $d_{wt}$, the number of times word $w$ appears in the document observed at time $t$; the count matrix $D$ therefore has dimensions $V \times T$, where $V$ is the vocabulary size and $T$ is the number of documents. Our aim is to learn the latent factors, namely the topics, and to investigate how the popularity of each topic changes over time. Taking time dependence into account, we allow for the possibility that a new topic may be discovered, or may stop being relevant, at some point in time.
We model the count matrix $D$ via a Poisson likelihood
$$
D \sim \text{Poisson}(\Phi I) \tag{5.9}
$$
where the notation stands for $d_{wt} \sim \text{Poisson}\big(\sum_{l=1}^{K^+} \phi_{wl} I_{lt}\big)$, independently across $w = 1, \ldots, V$, $t = 1, \ldots, T$. In (5.9), the matrix $\Phi$ has dimensions $V \times K^+$, where $K^+$ is the number of topics appeared up to time $T$, and is called the factor loading matrix; its columns, denoted $\{\phi_k\}_{k=1}^{K^+}$, are vectors of length $V$ representing the topics. Each topic is interpreted as a distribution on the dictionary and therefore modeled as
$$
\phi_1, \ldots, \phi_K \mid \delta \overset{\text{iid}}{\sim} \text{Dirichlet}(\delta, \delta, \ldots, \delta).
$$
This approach is called Poisson Factor Analysis (PFA) and has been successfully applied to topic modeling (in some cases also considering time dependence) in Zhou and Carin (2015), Acharya et al. (2015), Roychowdhury and Kulis (2015) and Zhou et al. (2012), among others.

On the other hand, $I$ is a matrix of dimensions $K^+ \times T$ called the factor counts: its columns $\{I_t\}_{t=1}^T$ are a sequence from a time dependent Poisson-Gamma process, as in Section 5.3.3. It contains the strength, or importance, of each topic at time $t$, $t = 1, \ldots, T$. Therefore, $K^+$ can be interpreted as the random variable representing the number of topics in the observations: ideally, it can grow to infinity as $T$ grows.
We now introduce some notation that will be useful in what follows: $L_t$ denotes the number of words in the $t$-th document, namely $L_t = \sum_{w=1}^V d_{wt}$, $t = 1, \ldots, T$. Moreover, if $x$ is a matrix of dimension $M \times N$, we denote by $x_{\cdot j} = \sum_{i=1}^M x_{ij}$, $j = 1, \ldots, N$, the sum over the rows and by $x_{i \cdot} = \sum_{j=1}^N x_{ij}$, $i = 1, \ldots, M$, the sum over the columns. Finally, we call $K_t$ the number of topics appeared up to time $t$ (so that $K^+ = K_T$).
In order to simplify the MCMC inference, it is common to augment (5.9) as
$$
d_{wt} = \sum_{l=1}^{K_t} d^l_{wt}, \qquad d^l_{wt} \sim \text{Poisson}(\phi_{wl} I_{lt}), \quad \text{ind. } w = 1, \ldots, V,\; t = 1, \ldots, T,\; l = 1, \ldots, K_t \tag{5.10}
$$
so that the entries of $D$ can be explained as a sum of smaller counts, each produced by a hidden topic.
Lemma 5.1
An equivalent representation of the model in (5.10) is the following:
$$
\begin{aligned}
d_{wt} &= \sum_{l=1}^{K_t} d^l_{wt},\\
(d^l_{1t}, d^l_{2t}, \ldots, d^l_{Vt}) \mid d^l_{\cdot t}, \phi_l &\sim \text{Multin}_V(d^l_{\cdot t}; \phi_{1l}, \ldots, \phi_{Vl}),\\
d^l_{\cdot t} \mid I_{lt} &\sim \text{Poisson}(I_{lt}).
\end{aligned} \tag{5.11}
$$
We assume independence across $l = 1, \ldots, K_t$, $t = 1, \ldots, T$.
The equivalence can be readily proved: under (5.10) we have
$$
\begin{aligned}
\mathcal{L}\big(d^l_{wt},\, w = 1, \ldots, V,\, l = 1, \ldots, K_t,\, t = 1, \ldots, T \mid I, \Phi\big) &= \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \text{Poisson}(d^l_{wt}; \phi_{wl} I_{lt})\\
&= \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \left( \frac{(\phi_{wl} I_{lt})^{d^l_{wt}}}{d^l_{wt}!}\, e^{-\phi_{wl} I_{lt}} \right)\\
&= e^{-\sum_{t=1}^T \sum_{l=1}^{K_t} I_{lt}} \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \frac{(\phi_{wl} I_{lt})^{d^l_{wt}}}{d^l_{wt}!},
\end{aligned}
$$
since $\sum_{w=1}^V \phi_{wl} = 1$, while under (5.11)
$$
\begin{aligned}
\mathcal{L}\big(d^l_{wt},\, w = 1, \ldots, V,\, l = 1, \ldots, K_t,\, t = 1, \ldots, T \mid I, \Phi\big) &= \prod_{t=1}^T \prod_{l=1}^{K_t} \text{Multin}_V(d^l_{1t}, \ldots, d^l_{Vt}; d^l_{\cdot t}; \phi_{1l}, \ldots, \phi_{Vl})\, \text{Poisson}(d^l_{\cdot t}; I_{lt})\\
&= \prod_{t=1}^T \prod_{l=1}^{K_t} \left( \frac{d^l_{\cdot t}!}{d^l_{1t}! \cdots d^l_{Vt}!}\, \phi_{1l}^{d^l_{1t}} \cdots \phi_{Vl}^{d^l_{Vt}}\, \frac{I_{lt}^{d^l_{\cdot t}} e^{-I_{lt}}}{d^l_{\cdot t}!} \right)\\
&= e^{-\sum_{t=1}^T \sum_{l=1}^{K_t} I_{lt}} \prod_{t=1}^T \prod_{l=1}^{K_t} \prod_{w=1}^V \frac{(\phi_{wl} I_{lt})^{d^l_{wt}}}{d^l_{wt}!},
\end{aligned}
$$
which are in fact the same.
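The identity proved above can also be checked numerically for a single topic-time pair; the helper functions below are illustrative code of ours, not from the thesis:

```python
import math

def poisson_pmf(k, lam):
    """Poisson(k; lam) probability mass function."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def multinomial_pmf(counts, n, probs):
    """Multinomial pmf of `counts` given total n and cell probabilities."""
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    p = 1.0
    for c, q in zip(counts, probs):
        p *= q ** c
    return coef * p

def joint_510(counts, phi_l, I_lt):
    """Joint law of one topic's word counts under the independent Poissons (5.10)."""
    p = 1.0
    for c, q in zip(counts, phi_l):
        p *= poisson_pmf(c, q * I_lt)
    return p

def joint_511(counts, phi_l, I_lt):
    """Same law under the multinomial / total-count augmentation (5.11)."""
    n = sum(counts)
    return multinomial_pmf(counts, n, phi_l) * poisson_pmf(n, I_lt)
```

Both functions return the same value for any count vector, topic $\phi_l$ and strength $I_{lt}$, mirroring the algebra above.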
We adopt representation (5.11) for its convenience when building an algorithm for the following model:
$$
\begin{aligned}
d_{wt} &= \sum_{l=1}^{K^+} d^l_{wt}, \qquad d^l_{wt} \sim \text{Poisson}(\phi_{wl} I_{lt}), \quad w = 1, \ldots, V,\; t = 1, \ldots, T\\
\phi_1, \ldots, \phi_K \mid \delta &\overset{\text{iid}}{\sim} \text{Dirichlet}(\delta, \delta, \ldots, \delta)\\
I_1, \ldots, I_T \mid \gamma, \lambda &\sim \text{TD-PoissonGamma}(\gamma, \lambda)\\
(\lambda, \gamma) &\sim \text{gamma}(s, r) \times \text{gamma}(a, b)\\
\delta &\sim \text{gamma}(a_0, b_0)
\end{aligned} \tag{5.12}
$$
The model above can be employed to analyze real-world datasets of texts observed over a time period. For instance, a dataset that has been widely considered in the literature is the State of the Union dataset. It contains the transcripts of 65 US State of the Union addresses, from Truman in 1945 to Bush in 2006. After removing stop words and terms that occur less than 10 times in total, 2755 words are left in our dictionary. Figure 5.10 shows the wordclouds for two presidents, Clinton in 1997 and Bush in 2003: a wordcloud highlights popular or trending terms based on the word frequencies of that particular address.
Figure 5.10: (Left) Wordcloud for the address of Clinton in 1997 and (right) for Bush in 2003.
In the following section, we describe an algorithm for posterior inference under our model.

5.5.1 Particle Gibbs sampler

We specify the same algorithm as in Section 5.4.1, the particle Gibbs sampler of Andrieu et al. (2010), for the model under consideration. The parameters sampled according to their full conditionals are, in this case, $\theta = (\delta, \gamma, \lambda)$, while the parameters addressed by the conditional particle filter step are the topics and their trends, $(\phi_l)$, $l = 1, 2, \ldots$, and $I_t$, $t = 1, \ldots, T$. The full conditionals for $\theta$ are the following (where "rest" denotes all the variables but the one on the left of the expression):
Parameter $\delta$: the full conditional for this parameter is
$$
\mathcal{L}(\delta \mid \text{rest}) \propto \prod_{k=1}^K \left( \frac{\Gamma(\delta V)}{\Gamma(\delta)^V} \prod_{w=1}^V \phi_{kw}^{\delta - 1} \right) \delta^{a_0 - 1} e^{-\delta b_0}, \quad \delta > 0.
$$
Parameter $\lambda$: its full conditional is
$$
\mathcal{L}(\lambda \mid \text{rest}) \propto \left( \frac{\lambda + 2}{\lambda + 1} \right)^{-\gamma T} (\lambda + 1)^{N_T} (\lambda + 2)^{-N_T - S_T - L_T} \lambda^{a_\lambda - 1} e^{-b_\lambda \lambda}, \quad \lambda > 0,
$$
with $N_T = \sum_{t=2}^T \sum_{l=1}^{\rho_{t-1}} I_{(t-1)l}$, $S_T = \sum_{t=2}^T \sum_{l=1}^{\rho_{t-1}} I^{\text{thin}}_{tl}$ and $L_T = \sum_{x \geq 1} x \sum_{t=1}^T \rho^{\text{new}}_{t,x}$. A Metropolis-Hastings step is used, since the distribution is not of a known form.
Parameter $\gamma$: its full conditional is
$$
\mathcal{L}(\gamma \mid \text{rest}) = \text{gamma}\left( a_\gamma + \sum_{x \geq 1} \sum_{t=1}^T \rho^{\text{new}}_{t,x},\; b_\gamma + T \log\left( \frac{\lambda + 2}{\lambda + 1} \right) \right).
$$
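A sketch of the Metropolis-Hastings update for $\lambda$, with the log full conditional transcribed from the display above; the function names and the Gaussian random-walk proposal are illustrative choices of ours, not the thesis implementation:

```python
import math
import random

def log_target_lambda(lam, gamma_, T, N_T, S_T, L_T, a_lam, b_lam):
    """Log of the lambda full conditional above, up to an additive constant."""
    if lam <= 0:
        return -math.inf
    return (-gamma_ * T * math.log((lam + 2.0) / (lam + 1.0))
            + N_T * math.log(lam + 1.0)
            - (N_T + S_T + L_T) * math.log(lam + 2.0)
            + (a_lam - 1.0) * math.log(lam) - b_lam * lam)

def mh_step(value, step, log_target, rng=random):
    """One Gaussian random-walk Metropolis-Hastings update."""
    prop = value + rng.gauss(0.0, step)
    log_acc = log_target(prop) - log_target(value)
    if log_acc >= 0 or rng.random() < math.exp(log_acc):
        return prop
    return value
```

Proposals outside $(0, \infty)$ get log density $-\infty$ and are therefore always rejected.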
As far as the conditional sequential Monte Carlo step is concerned, we first need to write down the law we aim to sample from:
$$
\begin{aligned}
\mathcal{L}(I_t,\, t = 1, \ldots, T,\, \phi_l,\, l = 1, 2, \ldots \mid D, \theta) &\propto \prod_{t=1}^T \left[ \prod_{w=1}^V \Big( \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big)^{d_{wt}} \exp\Big( -\sum_{w=1}^V \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big) \mathcal{L}(I_t \mid I_{t-1}, \gamma, \lambda) \right] \prod_{l=1}^{K_T} \text{Dirichlet}(\phi_l; \delta, \ldots, \delta)\\
&\propto \prod_{t=1}^T \left[ \prod_{w=1}^V \Big( \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big)^{d_{wt}} \exp\Big( -\sum_{w=1}^V \sum_{l=1}^{K_t} \phi_{wl} I_{lt} \Big) \mathcal{L}(I_t \mid I_{t-1}, \gamma, \lambda) \prod_{l=1}^{\rho_t} \text{Dirichlet}(\phi_l; \delta, \ldots, \delta) \right]
\end{aligned}
$$
where $\rho_t$ stands for the number of new topics appeared at time $t$ (innovation).
Suppose we have $N$ particles.

Time $t = 1$

a. Propose $\big( I_1^{(i)}, (\phi_l^{(i)}),\, l = 1, \ldots, K_1^{(i)} \big)$ as follows:

- Sample $K_1^{(i)} \sim \text{Poisson}\big( \sum_{x \geq 1} M_x \big)$, where $M_x = \dfrac{\gamma}{x} \Big( \dfrac{1}{\lambda + 2} \Big)^x$, $x = 1, 2, \ldots$

- Perform an EM (expectation maximization) step to compute the values $\big( \hat{I}^{(i)}, (\hat{\phi}_l^{(i)}),\, l = 1, \ldots, K_1^{(i)},\, (\hat{d}_{wl}),\, l = 1, \ldots, K_1^{(i)},\, w = 1, \ldots, V \big)$: (i) initialize the $\phi_l$'s and $I_l$'s; (ii) at iteration $m$ calculate
$$
\hat{d}_{wl} = d_{w1}\, \frac{\phi^{(m-1)}_{wl} I^{(m-1)}_{l1}}{\sum_{k=1}^K \phi^{(m-1)}_{wk} I^{(m-1)}_{k1}};
$$
(iii) set $I^{(m)}_{l1} = \sum_{w=1}^V \hat{d}_{wl}$; (iv) set $\phi^{(m)}_{wl} \propto \hat{d}_{wl} / I^{(m)}_{l1}$. Repeat steps (ii), (iii) and (iv) until a convergence criterion is satisfied.

- Propose $I_1^{(i)}$ from a truncated normal of dimension $K_1^{(i)}$:
$$
I_1^{(i)} \sim \mathcal{TN}_{K_1^{(i)}}\big( \hat{I}^{(i)}, \mathcal{I}(\hat{I}^{(i)}) \big)
$$
where $\mathcal{I}(\hat{I}^{(i)}) = \text{diag}(\hat{I}^{(i)})$ is the inverse of the Fisher information matrix. The truncation forces the values to lie in $[0, +\infty)$. In order to obtain integer values, we apply the ceiling function to each element of the vector.

- Propose $\phi_l^{(i)}$ according to its full conditional, i.e.
$$
\phi_l^{(i)} \sim \text{Dirichlet}_V\big( \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)
$$
where the $\hat{d}_{wl}$ are the values computed during the EM step.

b. The (unnormalized) weight of each particle is given by
$$
w_1^{(i)} = \frac{\prod_{w=1}^V \Big( \sum_{l=1}^{K_1^{(i)}} \phi_{wl} I^{(i)}_{l1} \Big)^{d_{w1}} \exp\Big( -\sum_{l=1}^{K_1^{(i)}} I^{(i)}_{l1} \Big)\, \mathcal{L}\big( I_1^{(i)} \mid \gamma, \lambda \big)}{q\big( K_1^{(i)}; \phi_l^{(i)}, l = 1, \ldots, K_1^{(i)}; I_1^{(i)} \big)} \times \prod_{l=1}^{K_1^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta, \ldots, \delta \big)
$$
where $q\big( K_1^{(i)}; \phi_l^{(i)}, l = 1, \ldots, K_1^{(i)}; I_1^{(i)} \big)$ is the proposal distribution that generated the $i$-th particle. It can be computed by evaluating three contributions as follows:
1. $q_K(K_1^{(i)}) = \text{Poisson}\big( K_1^{(i)}; \sum_{x \geq 1} M_x \big)$;
2. $q_I(I_1^{(i)}) = \prod_{l=1}^{K_1^{(i)}} \dfrac{ \Phi\big( I^{(i)}_{l1}; \hat{I}_{l1}, \sqrt{\hat{I}_{l1}} \big) - \Phi\big( I^{(i)}_{l1} - 1; \hat{I}_{l1}, \sqrt{\hat{I}_{l1}} \big) }{ 1 - \Phi\big( 0; \hat{I}_{l1}, \sqrt{\hat{I}_{l1}} \big) }$, where $\Phi(x; \mu, \sigma)$ is the cumulative distribution function of a univariate Gaussian distribution with mean $\mu$ and standard deviation $\sigma$;
3. $q_\phi\big( \phi_l^{(i)}, l = 1, 2, \ldots \big) = \prod_{l=1}^{K_1^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)$.
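Steps (ii)-(iv) of the EM refinement can be sketched for a single document as follows. This is illustrative code of ours; initialization and the convergence criterion are left generic, as in the text:

```python
def em_refine(d_w, phi, I_l, n_iter=100, tol=1e-10):
    """Steps (ii)-(iv) above for one document: d_w are its V word counts,
    phi a list of K topic vectors, I_l the K topic strengths.
    Returns refined (phi, I_l) and the allocations d_wl."""
    V, K = len(d_w), len(I_l)
    d_wl = [[0.0] * K for _ in range(V)]
    for _ in range(n_iter):
        # (ii) expected split of each word count across the K topics
        for w in range(V):
            denom = sum(phi[l][w] * I_l[l] for l in range(K))
            for l in range(K):
                d_wl[w][l] = d_w[w] * phi[l][w] * I_l[l] / denom if denom > 0 else 0.0
        # (iii) topic strengths as column sums of the allocations
        new_I = [sum(d_wl[w][l] for w in range(V)) for l in range(K)]
        # (iv) topics renormalized proportionally to the allocations
        phi = [[d_wl[w][l] / new_I[l] if new_I[l] > 0 else 1.0 / V
                for w in range(V)] for l in range(K)]
        converged = max(abs(a - b) for a, b in zip(new_I, I_l)) < tol
        I_l = new_I
        if converged:
            break
    return phi, I_l, d_wl
```

After the first pass the strengths satisfy $\sum_l I_l = \sum_w d_w$, i.e. every word count is fully allocated to some topic.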
Time $t \in \{2, \ldots, T\}$

a. Propose $\big( I_t^{(i)}, (\phi_l^{(i)}),\, l = 1, \ldots, K_{t,\text{inn}}^{(i)} \big)$ as follows:

- Sample $K_{t,\text{inn}}^{(i)} \sim \text{Poisson}\big( \sum_{x \geq 1} M_x \big)$, where $M_x = \dfrac{\gamma}{x} \Big( \dfrac{1}{\lambda + 2} \Big)^x$, $x = 1, 2, \ldots$;

- Perform the same EM step as for $t = 1$ in order to compute $\big( \hat{I}^{(i)}, (\hat{\phi}_l^{(i)}), (\hat{d}_{wl}),\, l = 1, \ldots, (K_{t-1}^{(\xi(i))} + K_{t,\text{inn}}^{(i)}),\, w = 1, \ldots, V \big)$. Note that the first $K_{t-1}^{(\xi(i))}$ topics are fixed.

- Propose $I_t^{(i)}$ from a truncated normal of dimension $K_t^{(i)} = K_{t-1}^{(\xi(i))} + K_{t,\text{inn}}^{(i)}$:
$$
I_t^{(i)} \sim \mathcal{TN}_{K_t^{(i)}}\big( \hat{I}^{(i)}, \mathcal{I}(\hat{I}^{(i)}) \big)
$$
where $\mathcal{I}(\hat{I}^{(i)}) = \text{diag}(\hat{I}^{(i)})$ is the inverse of the Fisher information matrix. The truncation forces the values to lie in $[0, +\infty)$. In order to obtain integer values, we apply the ceiling function to each element of the vector.

- Propose the new topics appeared at time $t$, $\phi_l^{(i)}$ for $l \in \{1, 2, \ldots, K_{t,\text{inn}}^{(i)}\}$, according to their full conditional, i.e.
$$
\phi_l^{(i)} \sim \text{Dirichlet}_V\big( \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)
$$
where the $\hat{d}_{wl}$ are the values computed during the EM step.

b. The (unnormalized) weight of each particle is given by
$$
w_t^{(i)} = \frac{\prod_{w=1}^V \Big( \sum_{l=1}^{K_t^{(i)}} \phi_{wl} I^{(i)}_{lt} \Big)^{d_{wt}} \exp\Big( -\sum_{l=1}^{K_t^{(i)}} I^{(i)}_{lt} \Big)\, \mathcal{L}\big( I_t^{(i)} \mid I_{t-1}^{\xi(i)}, \gamma, \lambda \big)}{q\big( K_{t,\text{inn}}^{(i)}; \phi_l^{(i)}, l = 1, \ldots, K_{t,\text{inn}}^{(i)}; I_t^{(i)} \big)} \times \prod_{l=1}^{K_{t,\text{inn}}^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta, \ldots, \delta \big)
$$
where $q$ is the proposal distribution that generated the $i$-th particle. It can be computed by evaluating three contributions as follows:
1. $q_K(K_{t,\text{inn}}^{(i)}) = \text{Poisson}\big( K_{t,\text{inn}}^{(i)}; \sum_{x \geq 1} M_x \big)$;
2. $q_I(I_t^{(i)}) = \prod_{l=1}^{K_t^{(i)}} \dfrac{ \Phi\big( I^{(i)}_{lt}; \hat{I}_{lt}, \sqrt{\hat{I}_{lt}} \big) - \Phi\big( I^{(i)}_{lt} - 1; \hat{I}_{lt}, \sqrt{\hat{I}_{lt}} \big) }{ 1 - \Phi\big( 0; \hat{I}_{lt}, \sqrt{\hat{I}_{lt}} \big) }$;
3. $q_\phi\big( \phi_l^{(i)}, l = 1, 2, \ldots \big) = \prod_{l=1}^{K_{t,\text{inn}}^{(i)}} \text{Dirichlet}_V\big( \phi_l^{(i)}; \delta + \hat{d}_{1l}, \ldots, \delta + \hat{d}_{Vl} \big)$.
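The ceiling-of-truncated-normal proposal and the corresponding mass $q_I$ of point 2. can be sketched componentwise as follows (illustrative code of ours; the identity used is that $\lceil v \rceil = k$ exactly when $v \in (k-1, k]$):

```python
import math
import random

def norm_cdf(x, mu, sd):
    """Phi(x; mu, sd), the Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def propose_I_component(I_hat, rng=random):
    """Ceiling of a draw from N(I_hat, I_hat) truncated to (0, +inf);
    plain rejection sampling is enough for a sketch."""
    sd = math.sqrt(I_hat)
    while True:
        v = rng.gauss(I_hat, sd)
        if v > 0:
            return math.ceil(v)

def q_I_component(k, I_hat):
    """Proposal mass of the integer k, as in point 2. above: the truncated
    normal hits ceil(v) = k when v lies in (k - 1, k]."""
    sd = math.sqrt(I_hat)
    num = norm_cdf(k, I_hat, sd) - norm_cdf(k - 1, I_hat, sd)
    return num / (1.0 - norm_cdf(0, I_hat, sd))
```

Summed over $k = 1, 2, \ldots$ these masses total one, confirming that the ceiling map turns the truncated normal into a proper proposal on the positive integers.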
In order to reduce the computational time, the sequential Monte Carlo step has been implemented in C++ via Rcpp (Eddelbuettel et al., 2011).
5.5.2 Application to a simulated dataset

We simulated a very simple dataset consisting of $T = 23$ documents with three well-separated topics. The true topics are depicted in Figure 5.12 and the
Figure 5.11: Simulated dataset $d_{wt}$, $w = 1, 2, \ldots$, $t = 1, \ldots, 23$: the horizontal axis represents time, the vertical axis the vocabulary. The color purple denotes the value 0.
observations in Figure 5.11, where the horizontal axis shows the time $t \in \{1, 2, \ldots, 23\}$ and the vertical axis represents the vocabulary: brighter colors depict higher counts. It is clear that the first six documents contain the first topic only, then from $t = 7$ to $t = 14$ only the second topic appears, and the last eight documents contain only the third topic (see the simulation truth in the left panel of Figure 5.14). As far as the prior information is concerned, we
Figure 5.12: The real three topics that generated the data in Figure 5.11.
fixed the parameter $\delta$ of the Dirichlet distribution in (5.12) at 0.0001. Moreover, $\lambda \sim \text{gamma}(2, 1)$ and $\gamma$ is gamma distributed with mean 3 and variance 5.

We ran the algorithm described in Section 5.5.1 with $N = 2000$ particles and 1000 final iterations after a burn-in of 100 iterations. The last iteration of the algorithm ended up with $K = 6$ estimated topics, displayed in Figure 5.13; the corresponding trends can be found in the right panel of Figure 5.14. The truth is recovered fairly well, even if there are three topics, $\phi_2$, $\phi_3$, $\phi_5$, that contain only noise, i.e. only one or two words are selected. These are, indeed, associated with trends $I_{lt}$, $l \in \{2, 3, 5\}$,
Figure 5.13: The six estimated topics, as in the last iteration of the algorithm.
with very low values compared to the three main topics, namely $\phi_1$, $\phi_4$, $\phi_6$. On the other hand, the corresponding trends are reasonably recovered.

Finally, Figure 5.15 compares the posterior predictive mean (in purple) and the true observations (in black) at times $t \in \{1, 15, 23\}$: posterior inference under model (5.12) was able to correctly recreate the observed documents.

As far as the test cases are concerned, we still need to perform a sensitivity analysis with respect to the hyperparameters and investigate the behavior of the particle Gibbs sampler when increasing the number of observed documents and/or topics.
5.5.3 Application to the State of the Union dataset

We provide some preliminary studies on the real-data application to the State of the Union dataset mentioned in Section 5.5. We explore the dataset consisting of the full text of 65 speeches of American presidents from 1945 (Truman) to 2006 (G.W. Bush). We pre-processed the data by removing stop words and punctuation and discarded words appearing fewer than 10 times. We ended up with a vocabulary of 2723 words. Our goal is to discover what topics appear in the
Figure 5.14: The real trends for the three topics that generated the data in Figure 5.11 (left). The estimated trends for the six topics obtained at the last iteration of the algorithm (right).
corpus and to track the evolution of their popularity over time.

The particle Gibbs sampler was run for 1000 iterations with a burn-in period of 500 iterations and a thinning of 5. As far as the hyperparameters are concerned, we set $\delta \sim \text{gamma}(1, 5)$, $\lambda \sim \text{gamma}(2, 10)$ and $\gamma \sim \text{gamma}(2, 5)$.

The model estimated 20 topics; some of them are meaningful and easily interpretable. Remember that topics are defined by their distribution over words, so it is possible to label them by looking at their most likely words. Figure 5.16 shows 9 interpretable topics: in the top-left part of each plot, the 10 most likely words of the topic are listed. The thick line represents the estimated temporal evolution of the topic weights. One qualitative advantage of modeling time dependence explicitly is that it yields interesting insights into the importance of topics and how it changes over time: topic (b), for example, refers to the terrorism related to the Iraqi conflict and appears late in time. Similarly, the topic related to the Internet and education, (a), appears just before 2000.

On the other hand, there are topics that persist in time, such as (e) and (f): in particular, by looking at their most representative words, we can deduce that topic (e) is related to money and the economy, and topic (f) to peace and patriotism.
5.6 Discussion and future developments

In this chapter we illustrated a strategy for defining a time dependent process whose values are completely random measures. We provided a simple description
Figure 5.15: Posterior predictive mean for observations at times t ∈ {1, 15, 23} (in purple, from left to right) and real observations (in black).
of the proposed process, which has an AR(1)-type structure and offers a framework for generalizations to more complicated forms of time dependence.

In particular, as a further development of this work, we aim at investigating the extension to $p$-lagged dependence: from the analogy between linear time series processes for real valued random variables and for point processes, an AR($p$) process may be seen as
$$
X_t = \bigcup_{j=1}^p \phi_j(X_{t-j}) \cup \varepsilon_t
$$
where $\phi_1(\cdot), \phi_2(\cdot), \ldots, \phi_p(\cdot)$ are thinning operators and $\varepsilon_t$ is an innovation term. In this framework, we can generalize our model through mixtures of transition distributions as in Mena and Walker (2007):
$$
f(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-p}) = \sum_{k=1}^p w_k f_k(x_t \mid x_{t-k})
$$
where $w_k \geq 0$ and $\sum_k w_k = 1$. Stationarity is preserved (see Proposition 1 in Mena and Walker, 2007). If we consider a transition kernel $f_k$ that is the same for every $k$, we have
$$
f(x_t \mid x_{t-1}, \ldots, x_{t-p}) = \sum_{k=1}^p w_k \int p(x_t \mid G)\, p(\mathrm{d}G \mid x_{t-k}) = \int p(x_t \mid G) \sum_{k=1}^p w_k\, p(\mathrm{d}G \mid x_{t-k}),
$$
which can be interpreted as
$$
X_t = \bigcup_{j=1}^p w_j \phi_j(X_{t-j}) \cup \varepsilon_t
$$
since only the thinning part is affected by the conditioning on $X_{t-k}$.
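The mixture-of-transitions step just described can be sketched as follows; this is a minimal illustration of ours (draw a lag $k$ with probability $w_k$, then apply a generic one-lag kernel standing in for thinning plus innovation):

```python
import random

def mixture_transition(history, weights, transition, rng=random):
    """One step of the p-lag mixture of transitions: draw a lag k with
    probability w_k, then apply the single-lag kernel to x_{t-k}.
    `history` lists the last p states, most recent first; `transition`
    would be the thinning-plus-innovation kernel in our setting."""
    p = len(weights)
    k = rng.choices(range(1, p + 1), weights=weights, k=1)[0]
    return transition(history[k - 1])
```

Because each step conditions on a single randomly chosen lag, the one-lag machinery developed in this chapter could be reused unchanged.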
Another extension we would like to investigate is motivated by the following remark: the time dependent model proposed in this chapter does not allow for the re-appearance of traits. In more detail, suppose that at time $\bar{t}$ we observe a trait $\psi$ and that this trait is deleted by the thinning process, namely at time $\bar{t} + 1$ it is not observed anymore. Then, under our model, trait $\psi$ has null probability of being observed again at any time $t > \bar{t}$. This is due to the fact that the centering measure $P_0$ is absolutely continuous.

However, in real-data applications, allowing for the re-appearance of traits may be of interest: in the case of topic modeling, for example, a topic may disappear and then appear again after a few years.
Finally, as a further development, we aim at providing a general algorithm to tackle posterior inference for the wide class of models we proposed. Consider, indeed, the general framework described in Section 1.3.3; the distribution of the vector of scores in (1.29) can be replaced by the time dependent prior developed in Section 5.3 of this chapter. Then, a generalization of the particle Gibbs samplers devised for the specific applications in Sections 5.4.1 and 5.5.1 may be defined to perform posterior inference.
Figure 5.16: Posterior topic weights over the years 1945-2006 and the 10 most likely words for each topic (State of the Union address data set). The listed words per panel are: (a) hire, forge, classrooms, celebrate, internet, rid, millennium, partnership, easier, teen; (b) saddam, hussein, terrorist, iraqi, iraqs, brutal, murder, never, coalition, throughout; (c) guns, neighborhood, neighborhoods, testing, literally, shes, seniors, violent, richardson; (d) missiles, soviets, extra, technology, advanced, ballistic, lets, missile, diplomacy, intellectual; (e) dollars, year, million, fiscal, expenditures, program, government, billion, economic, years; (f) peace, world, must, america, hope, nations, might, great, entire, hard; (g) low, income, gain, earners, aspirations, suffer, enjoyed, unfortunately, crisis, bear, helped; (h) rulers, stood, ready, skilled, aspects, communist, asia, subversion, tide, russia; (i) tonight, late, police, black, gas, illegal, democrats, says, knew, hatred.
Bibliography
Acharya, A., Ghosh, J., and Zhou, M. (2015). Nonparametric Bayesian factor anal-
ysis for dynamic count matrices. In AISTATS.
Aandi, R. H., Fox, E., Adams, R. P., and Taskar, B. (2014). Learning the param-
eters of determinantal point process kernels. In ICML, pages 12241232.
Aandi, R. H., Fox, E., and Taskar, B. (2013). Approximate inference in continuous
determinantal processes. In Advances in Neural Information Processing Systems,
pages 14301438.
Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle markov chain monte
carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269342.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian
nonparametric problems. The Annals of Statistics, 2:11521174.
Antoniano-Villalobos, I. and Walker, S. G. (2016). A nonparametric model for stationary time series. Journal of Time Series Analysis, 37(1):126–142.
Arbel, J. and Prünster, I. (2017). A moment-matching Ferguson & Klass algorithm. Statistics and Computing, 27(1):3–17.
Arellano-Valle, R. and Azzalini, A. (2006). On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3):561–574.
Arellano-Valle, R., Bolfarine, H., and Lachos, V. (2007). Bayesian inference for skew-normal linear mixed models. Journal of Applied Statistics, 34(6):663–682.
Argiento, R., Bianchini, I., and Guglielmi, A. (2016a). A blocked Gibbs sampler for NGG-mixture models via a priori truncation. Statistics and Computing, 26(3):641–661.
Argiento, R., Bianchini, I., and Guglielmi, A. (2016b). Posterior sampling from ε-approximation of normalized completely random measure mixtures. Electronic Journal of Statistics, 10(2):3516–3547.
Argiento, R., Guglielmi, A., Hsiao, C., Ruggeri, F., and Wang, C. (2015). Modelling the association between clusters of SNPs and disease responses. In Mitra, R. and Mueller, P., editors, Nonparametric Bayesian Methods in Biostatistics and Bioinformatics. Springer.
Argiento, R., Guglielmi, A., and Pievatolo, A. (2010). Bayesian density estimation and model selection using nonparametric hierarchical mixtures. Computational Statistics and Data Analysis, 54:816–832.
Asmussen, S. and Glynn, P. W. (2007). Stochastic simulation: algorithms and
analysis, volume 57. Springer Science & Business Media.
Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2):159–188.
Barcella, W., Iorio, M. D., Baio, G., and Malone-Lee, J. (2016). Variable selection in covariate dependent random partition models: an application to urinary tract infection. Statistics in Medicine, 35(8):1373–1389.
Bardenet, R. and Titsias, M. (2015). Inference for determinantal point processes without spectral knowledge. In Advances in Neural Information Processing Systems, pages 3393–3401.
Barndorff-Nielsen, O. E. (2000). Probability densities and Lévy densities. University of Aarhus, Centre for Mathematical Physics and Stochastics.
Barrientos, A. F., Jara, A., and Quintana, F. A. (2012). On the support of MacEachern's dependent Dirichlet processes and extensions. Bayesian Analysis, 7(2):277–310.
Barrios, E., Lijoi, A., Nieto-Barajas, L. E., and Prünster, I. (2013). Modeling with normalized random measure mixture models. Statistical Science, 28:313–334.
Barry, D. and Hartigan, J. A. (1993). A Bayesian analysis for change point problems. Journal of the American Statistical Association, 88(421):309–319.
Basford, K., McLachlan, G., and York, M. (1997). Modelling the distribution of stamp paper thickness via finite normal mixtures: The 1872 Hidalgo stamp issue of Mexico revisited. Journal of Applied Statistics, 24(2):169–180.
Bayes, C. L. and Branco, M. D. (2007). Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Brazilian Journal of Probability and Statistics, pages 141–163.
Bianchini, I., Guglielmi, A., and Quintana, F. A. (2017). Determinantal point
process mixtures via spectral density approach. arXiv preprint arXiv:1705.05181.
Binder, D. A. (1978). Bayesian cluster analysis. Biometrika, 65:31–38.
Biscio, C. A. N. and Lavancier, F. (2016). Quantifying repulsiveness of determinantal point processes. Bernoulli, 22:2001–2028.
Biscio, C. A. N. and Lavancier, F. (2017). Contrast estimation for parametric stationary determinantal point processes. Scandinavian Journal of Statistics, 44:204–229.
Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, pages 353–355.
Blei, D. M. and Frazier, P. I. (2011). Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461–2488.
Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7.
Bondesson, L. (1982). On simulation from infinitely divisible distributions. Advances in Applied Probability, 14(4):855–869.
Broderick, T., Wilson, A. C., and Jordan, M. I. (2017). Posteriors, conjugacy, and
exponential families for completely random measures. Bernoulli (Forthcoming
papers).
Canale, A. and Scarpa, B. (2013). Informative Bayesian inference for the skew-
normal distribution. arXiv preprint arXiv:1305.3080.
Caron, F., Davy, M., and Doucet, A. (2012). Generalized Pólya urn for time-varying Dirichlet process mixtures. arXiv preprint arXiv:1206.5254.
Chung, Y. and Dunson, D. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104:1646–1660.
Cook, R. D. and Weisberg, S. (1994). An introduction to regression graphics. John
Wiley & Sons.
Cook, R. J. and Lawless, J. (2007). The statistical analysis of recurrent events.
Springer Science & Business Media.
da Silva, A. F. and da Silva, M. A. F. (2012). Package "dpmixsim".
Dahl, D. B. (2008). Distance-based probability distribution for set partitions with
applications to Bayesian nonparametrics. JSM Proceedings. Section on Bayesian
Statistical Science, American Statistical Association.
Dahl, D. B., Day, R., and Tsai, J. W. (2017). Random partition distribution indexed by pairwise information. Journal of the American Statistical Association, pages 1–12.
Daley, D. J. and Vere-Jones, D. (2003). Basic properties of the Poisson process. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods, pages 19–40.
Daley, D. J. and Vere-Jones, D. (2007). An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer.
De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):212–229.
De Iorio, M., Johnson, W. O., Müller, P., and Rosner, G. L. (2009). Bayesian nonparametric nonproportional hazards survival modeling. Biometrics, 65(3):762–771.
De Iorio, M., Müller, P., Rosner, G. L., and MacEachern, S. N. (2004). An ANOVA model for dependent random measures. Journal of the American Statistical Association, 99:205–215.
Delatola, E.-I. and Griffin, J. E. (2011). Bayesian nonparametric modelling of the return distribution with stochastic volatility. Bayesian Analysis, 6(4):901–926.
Dellaportas, P. and Papageorgiou, I. (2006). Multivariate mixtures of normals with unknown number of components. Statistics and Computing, 16(1):57–68.
Di Lucca, M. A., Guglielmi, A., Müller, P., and Quintana, F. A. (2013). A simple class of Bayesian nonparametric autoregression models. Bayesian Analysis, 8(1):63.
Doucet, A. and Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3.
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society: Series B, 62(2):355–366.
Dunson, D. B. (2003). Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association, 98(463):555–563.
Eddelbuettel, D., François, R., Allaire, J., Chambers, J., Bates, D., and Ushey, K. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18.
Erdélyi, A., Magnus, W., Oberhettinger, F., Tricomi, F. G., and Bateman, H.
(1953). Higher transcendental functions, volume 2. McGraw-Hill New York.
Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588.
Favaro, S. and Teh, Y. (2013). MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, vol. II. John Wiley, New York, second edition.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230.
Ferguson, T. S. (1983). Bayesian density estimation by mixtures of normal distributions. In M. H. Rizvi, J. R. and Siegmund, D., editors, Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday, pages 287–302. Academic Press.
Ferguson, T. S. and Klass, M. (1972). A representation of independent increment processes without Gaussian components. Ann. Math. Statist., 43:1634–1643.
Foti, N. and Williamson, S. (2015). A survey of non-exchangeable priors for Bayesian nonparametric models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37:359–371.
Fraley, C., Raftery, A. E., Murphy, T. B., and Scrucca, L. (2012). mclust (Version 4) for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.
Fritsch, A. and Ickstadt, K. (2009). Improved criteria for clustering based on the
posterior similarity matrix. Bayesian Analysis.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models.
Springer Series in Statistics. Springer, New York.
Frühwirth-Schnatter, S. and Pyne, S. (2010). Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics, 11(2):317–336.
Fúquene, J., Steel, M., and Rossell, D. (2016). On choosing mixture components
via non-local priors. arXiv preprint arXiv:1604.00314.
Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6):997–1016.
Ghahramani, Z. and Griffiths, T. L. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482.
Gianoli, I. (2016). Analysis of gap times of recurrent blood donations via Bayesian
nonparametric models. Master thesis, Politecnico di Milano, Italy.
Gradshteyn, I. and Ryzhik, L. (2007). Table of Integrals, Series, and Products, seventh edition. Academic Press, San Diego (USA).
Griffin, J. and Walker, S. G. (2011). Posterior simulation of normalized random measure mixtures. Journal of Computational and Graphical Statistics, 20:241–259.
Griffin, J. E. (2013). An adaptive truncation method for inference in Bayesian nonparametric models. arXiv preprint arXiv:1308.2045.
Griffin, J. E. and Leisen, F. (2014). Compound random measures and their use in Bayesian nonparametrics. arXiv preprint arXiv:1410.0611.
Griffiths, T. L. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr):1185–1224.
Han, S., Du, L., Salazar, E., and Carin, L. (2014). Dynamic rank factor model for text streams. In Advances in Neural Information Processing Systems, pages 2663–2671.
Hartigan, J. A. (1990). Partition models. Communications in Statistics - Theory and Methods, 19(8):2745–2756.
Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259–1294.
Ishwaran, H. and James, L. (2001a). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc., 96:161–173.
Ishwaran, H. and James, L. F. (2001b). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173.
Ishwaran, H. and James, L. F. (2002). Approximate Dirichlet process computing in finite normal mixtures. Journal of Computational and Graphical Statistics, 11(3).
Ismay, C. and Chunn, J. (2017). fivethirtyeight: Data and Code Behind the Stories and Interactives at 'FiveThirtyEight'. R package version 0.1.0.
James, L., Lijoi, A., and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36:76–97.
Jara, A. and Hanson, T. E. (2011). A class of mixtures of dependent tail-free
processes. Biometrika, 98:553.
Jara, A., Hanson, T. E., Quintana, F. A., Müller, P., and Rosner, G. L. (2011). DPpackage: Bayesian semi- and nonparametric modeling in R. Journal of Statistical Software, 40(5):1.
Jørgensen, B. and Song, P. X.-K. (1998). Stationary time series models with exponential dispersion model margins. Journal of Applied Probability, pages 78–92.
Kallenberg, O. (1983). Random Measures. Academic Press.
Kingman, J. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1):59–78.
Kingman, J. F. C. (1993). Poisson Processes, volume 3. Oxford University Press.
Kulesza, A., Taskar, B., et al. (2012). Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5:123–286.
Lau, J. W. and Green, P. J. (2007a). Bayesian model based clustering procedures. Journal of Computational and Graphical Statistics, 16:526–558.
Lau, J. W. and Green, P. J. (2007b). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16:526–558.
Lavancier, F., Møller, J., and Rubak, E. (2015). Determinantal point process models and statistical inference: Extended version. Journal of the Royal Statistical Society: Series B, 77:853–877.
Lawrence, N. (2005). Probabilistic non-linear principal component analysis with
Gaussian process latent variable models. Journal of Machine Learning Research,
6.
Lijoi, A., Mena, R. H., and Prünster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association, 100(472):1278–1291.
Lijoi, A., Mena, R. H., and Prünster, I. (2007). Controlling the reinforcement in Bayesian nonparametric mixture models. Journal of the Royal Statistical Society B, 69:715–740.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12:351–357.
Lomelí, M., Favaro, S., and Teh, Y. W. (2017). A marginal sampler for σ-stable Poisson-Kingman mixture models. Journal of Computational and Graphical Statistics.
Lopes, H. F. and West, M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica, pages 41–67.
Macchi, O. (1975). The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122.
MacEachern, S. N. (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, pages 50–55.
MacEachern, S. N. (2000). Dependent Dirichlet processes. Technical report, De-
partment of Statistics, The Ohio State University.
Malsiner-Walli, G., Frühwirth-Schnatter, S., and Grün, B. (2016). Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing, 26:303–324.
McAuliffe, J. D., Blei, D. M., and Jordan, M. I. (2006). Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing, 16(1):5–14.
McCulloch, C. E. and Neuhaus, J. M. (2001). Generalized linear mixed models.
Wiley Online Library.
McLachlan, G. and Peel, D. (2005). Finite Mixture Models. John Wiley & Sons,
Inc.
Meilă, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis.
Mena, R. H. and Walker, S. G. (2005). Stationary autoregressive models via a Bayesian nonparametric approach. Journal of Time Series Analysis, 26(6):789–805.
Mena, R. H. and Walker, S. G. (2007). Stationary mixture transition distribution models via predictive distributions. Journal of Statistical Planning and Inference, 137(10):3103–3112.
Miller, J. W. and Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206.
Miller, J. W. and Harrison, M. T. (2017). Mixture models with a prior on the number
of components. Journal of the American Statistical Association. In Press.
Miller, K. T., Griffiths, T., and Jordan, M. I. (2012). The phylogenetic Indian buffet process: A non-exchangeable nonparametric prior for latent features. arXiv preprint arXiv:1206.3279.
Møller, J. and Waagepetersen, R. P. (2007). Modern statistics for spatial point processes. Scandinavian Journal of Statistics, 34:643–684.
Moustaki, I. and Knott, M. (2000). Generalized latent trait models. Psychometrika, 65(3):391–411.
Müller, P. and Quintana, F. (2010). Random partition models with regression on covariates. Journal of Statistical Planning and Inference, 140(10):2801–2808.
Müller, P., Quintana, F., and Rosner, G. L. (2011). A product partition model with regression on covariates. Journal of Computational and Graphical Statistics, 20:260–278.
Neal, R. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265.
Nieto-Barajas, L. E. (2013). Lévy-driven processes in Bayesian nonparametric inference. Boletín de la Sociedad Matemática Mexicana, 19.
Norets, A. (2015). Optimal retrospective sampling for a class of variable dimension models. Unpublished manuscript, Brown University, available at http://www.brown.edu/Departments/Economics/Faculty/Andriy_Norets/papers/optretrsampling.pdf.
Page, G. L. and Quintana, F. A. (2015). Spatial product partition models. Bayesian
Analysis.
Park, J.-H. and Dunson, D. B. (2010). Bayesian generalized product partition model. Statistica Sinica, pages 1203–1226.
Perrone, V., Jenkins, P. A., Spano, D., and Teh, Y. W. (2016). Poisson random fields for dynamic feature models. arXiv preprint arXiv:1611.07460.
Petralia, F., Rao, V., and Dunson, D. B. (2012). Repulsive mixtures. In Advances
in Neural Information Processing Systems.
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Ferguson, T. S., Shapley, L. S., and MacQueen, J. B., editors, Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, volume 30 of IMS Lecture Notes-Monograph Series, pages 245–267. Institute of Mathematical Statistics, Hayward (USA).
Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: a Festschrift for Terry Speed, volume 40 of IMS Lecture Notes-Monograph Series, pages 1–34. Institute of Mathematical Statistics, Hayward (USA).
Pitman, J. (2006). Combinatorial Stochastic Processes. LNM n. 1875. Springer,
New York.
Pitt, M. K., Chatfield, C., and Walker, S. G. (2002). Constructing first order stationary autoregressive models via latent processes. Scandinavian Journal of Statistics, 29(4):657–663.
Pitt, M. K. and Walker, S. G. (2005). Constructing stationary time series models using auxiliary variables with applications. Journal of the American Statistical Association, 100(470):554–564.
Quinlan, J. J., Quintana, F. A., and Page, G. L. (2017). Parsimonious Hierarchical
Modeling Using Repulsive Distributions. arXiv preprint arXiv:1701.04457.
Quintana, F. A. and Iglesias, P. L. (2003). Bayesian clustering and product partition models. Journal of the Royal Statistical Society: Series B, 65(2):557–574.
Quintana, F. A., Müller, P., and Papoila, A. L. (2015). Cluster-specific variable selection for product partition models. Scandinavian Journal of Statistics.
Ranganath, R. and Blei, D. M. (2017). Correlated random measures. Journal of the
American Statistical Association.
Regazzini, E., Lijoi, A., and Prünster, I. (2003). Distributional results for means of random measures with independent increments. The Annals of Statistics, 31:560–585.
Ren, L., Du, L., Carin, L., and Dunson, D. (2011). Logistic stick-breaking process. The Journal of Machine Learning Research, 12:203–239.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B, 59:731–792.
Roberts, G. O. and Rosenthal, J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18:349–367.
Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6(1).
Rosiński, J. (2001). Series representations of Lévy processes from the perspective of point processes. In Lévy Processes, pages 401–415. Springer.
Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B, 73:689–710.
Roychowdhury, A. and Kulis, B. (2015). Gamma processes, stick-breaking, and
variational inference. In AISTATS.
Ruiz, F. J., Valera, I., Blanco, C., and Perez-Cruz, F. (2014). Bayesian nonparametric comorbidity analysis of psychiatric disorders. Journal of Machine Learning Research, 15(1):1215–1247.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650.
Shirota, S. and Gelfand, A. E. (2017). Approximate Bayesian Computation and
Model Assessment for Repulsive Spatial Point Processes. Journal of Computa-
tional and Graphical Statistics. In press.
Srebro, N. and Roweis, S. (2005). Time-varying topic models using dependent
Dirichlet processes. Univ. Toronto, Canada, Tech. Rep. TR, 3:2005.
Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In AISTATS.
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3):611–622.
Titsias, M. K. (2008). The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems, pages 1513–1520.
Trippa, L. and Favaro, S. (2012). A class of normalized random measures with an exact predictive sampling scheme. Scandinavian Journal of Statistics, 39(3):444–460.
Wade, S. and Ghahramani, Z. (2017). Bayesian cluster analysis: Point estimation
and credible balls. Bayesian Analysis.
Wallach, H., Jensen, S., Dicker, L., and Heller, K. (2010). An alternative prior process for nonparametric Bayesian clustering. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 892–899.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. The Journal of Machine Learning Research, 11:3571–3594.
Williamson, S., Orbanz, P., and Ghahramani, Z. (2010). Dependent Indian buffet processes. In AISTATS.
Wilson, I. (1983). Add a new dimension to your philately. The American Philatelist, 97:342–349.
Xu, Y., Müller, P., and Telesca, D. (2016). Bayesian inference for latent biological structure with determinantal point processes. Biometrics, 72:955–964.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti, 6:233–243.
Zhang, P., Wang, X., and Song, P. X.-K. (2006). Clustering categorical data based on distance vectors. Journal of the American Statistical Association, 101(473):355–367.
Zhou, M. and Carin, L. (2015). Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307–320.
Zhou, M., Hannah, L., Dunson, D. B., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In International Conference on Artificial Intelligence and Statistics, pages 1462–1471.