Bayesian Joint Analysis of Heterogeneous Data
Priyadip Ray¹, Lingling Zheng², Yingjian Wang¹, Joseph Lucas², David Dunson³ and Lawrence Carin¹
¹Electrical & Computer Engineering Department, ²Institute for Genomic Science and Policy, ³Department of Statistics
Duke University
Durham, NC, USA
Abstract
A Bayesian factor model is proposed for integrating multiple disparate,
but related datasets. The approach is based on factoring the latent space
(feature space) into a shared component and a data-specific component, with
the dimension of these spaces inferred via a beta-Bernoulli process. For cases
in which there are space-time covariates, the factor scores and/or loadings are
modeled via a Gaussian process (GP), with inhomogeneity addressed through
a novel kernel stick-breaking process (KSBP) based mixture of GPs. Theoret-
ical properties of the KSBP-GP factor model are discussed, and an MCMC
algorithm is developed for posterior inference. The proposed approach is
first demonstrated by jointly analyzing multiple types of genomic data (gene
expression, copy number variations, and methylation) for ovarian cancer pa-
tients, showing that the model can uncover key attributes related to cancer;
these heterogeneous data allow consideration of model performance in the ab-
sence of space-time covariates. Analysis of space-time-dependent data is con-
sidered in the form of multi-year unemployment rates at various geographical
locations, with these data analyzed jointly with time-evolving stock prices of
companies in the S & P 500.
Key words: Data fusion, joint factor analysis, Gaussian process, DNA microarray
analysis, spatio-temporal data
1 Introduction
An important research problem in statistical signal processing and machine learning
concerns integration/fusion of multiple disparate, but statistically related datasets.
For example, in genomic signal processing, integration of DNA copy number vari-
ations and gene expressions may help identify key drivers in cancer mechanisms.
In econometrics and social sciences, joint analysis of multiple heterogeneous finan-
cial and social databases may reveal interesting underlying socio-economic trends.
Though the range of potential applications is immense, the increase in data di-
mension, data heterogeneity and the presence of noise often make such data fusion
problems challenging.
A key assumption employed when modeling such high-dimensional data is that
the intrinsic dimension of the data is much lower than the observed data dimen-
sion, i.e., the data lie in or are close to a low-dimensional subspace. For modeling
multiple disparate datasets, approaches often rely on the assumption that the data
are different manifestations of a single shared low-dimensional latent space (feature
space). The problem then lies in identifying this low-dimensional shared feature
space and the data specific mappings from this shared space to the observed data.
Classical data analysis techniques for multiple datasets, such as canonical correla-
tion analysis (CCA) (Hotelling, 1936; Borga, 1998; Hardoon et al., 2004), compute
a low-dimensional shared linear embedding of a set of variables, such that the cor-
relations among the variables is maximized in the embedded space. Probabilistic
approaches to CCA have been proposed in (Bach and Jordan, 2005; Wang, 2007;
Rai and Daume, 2009). For joint analysis of multiple data sets, (Bach and Jordan,
2005; Wang, 2007; Rai and Daume, 2009) assume the existence of underlying shared
latent variables and conditional independence of the data given the latent variables.
However, the assumption of a single shared latent space may be limiting, and a more
flexible approach is to factorize the latent space into a component that is shared
among all datasets and a component that is specific to each. Such models are more
likely to capture the shared features among all datasets while still preserving the
idiosyncratic features unique to each.
Bayesian and semi-Bayesian latent variable models have been developed to factorize
the latent space into a shared and a data-specific part (Archambeau and Bach, 2008;
Klami and Kaski, 2008). However, in these approaches the number of latent factors
is chosen a priori. Alternatively, one may consider multiple factor models, each
with a different number of factors, and perform model selection based on information
criteria such as AIC (Akaike, 1987) or BIC (Schwarz, 1978). However, as it is often
challenging to check modeling assumptions in high-dimensions, a nonparametric
or semiparametric model is desirable. In this paper we propose a Bayesian factor
analysis approach for integrating multiple heterogeneous datasets, with the number
of factors inferred from the available data. Our proposed approach is based on
factoring the latent space into shared and data-specific components, employing a
beta-Bernoulli process (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007;
Paisley and Carin, 2009) to infer the dimension of these latent spaces.
We further extend the proposed approach to problems with multiple heteroge-
neous data sources exhibiting spatio-temporal dependencies. Gaussian Process (GP)
priors (Rasmussen and Williams, 2005) provide a particularly effective solution for
incorporating knowledge of the spatial locations and the time stamps in spatio-
temporal data. In (Luttinen and Ilin, 2009; Schmidt and Laurberg, 2008; Schmidt,
2009), the authors propose GP factor analysis (GPFA) techniques for modeling a
single spatio-temporal dataset, with spatial dependence incorporated in the factor
loadings and with temporal dependence incorporated in the factor scores. These
approaches are primarily parametric, i.e., the number of factor loadings is assumed
known a priori. Further in the GPFA model considered in (Luttinen and Ilin, 2009),
the factor loadings capture the spatial correlation and each factor loading is drawn
from a single GP. However, the underlying assumption, that the correlation pattern
is spatially invariant, may not be valid for many real data. For example, it is likely
that the spatial correlation pattern among the cities in densely populated regions is
different from that among less densely populated regions.
To model data with such non-stationary spatial covariance structure, we first
propose a new GP factor model, in which the factor loadings are drawn from a
mixture of GPs. Though not specifically in the context of factor models, mixtures
of GPs have been applied previously to model data with non-stationary covariance
structure. Most mixture-of-GPs approaches, such as (Shi et al., 2003), assume a
known number of mixture components. Nonparametric approaches, with the num-
ber of mixture components inferred from the data, have been proposed in (Tresp,
2001; Rasmussen and Ghahramani, 2002; Meeds and Osindero, 2006; Gramacy and
Lee, 2007). In (Gramacy and Lee, 2007), the authors propose a tree-based parti-
tioning approach to divide the input space and fit different base models to data
independently in each region. In (Rasmussen and Ghahramani, 2002; Meeds and
Osindero, 2006), the authors propose an input-dependent Dirichlet process mixture
of GPs to model non-stationary covariance functions.
We propose a novel GP mixture model based on the kernel stick breaking process
(KSBP) (Dunson and Park, 2008). Our approach is similar in spirit to (Rasmussen
and Ghahramani, 2002), but it has several advantages, discussed in depth below.
We also provide a detailed theoretical analysis of the properties of the imposed prior.
The proposed KSBP-GP approach (where a smaller subset of the data, instead of
the entire data, is associated with each GP) has the added advantage of improved
computational efficiency relative to a single GP. The computational efficiency of
our approach may be improved further by adopting various existing approximations
for GP regression. A detailed overview of such approximations may be found in
(Rasmussen and Williams, 2005; Candela et al., 2007).
To first isolate the component of the model that shares heterogeneous data
sources, without consideration of space-time phenomena, we consider the joint anal-
ysis of genomic data for ovarian cancer patients. Three different datasets are con-
sidered: gene expression, copy number variations, and DNA methylation levels for
ovarian cancer patients. We demonstrate that the joint analysis of gene expres-
sion/copy number variations and gene expression/DNA methylation levels can po-
tentially identify genomic and epigenomic regulators influencing cancer pathophys-
iology outcomes. To clearly illustrate the performance of the KSBP-GP mixture
prior, we isolate it from the factor model and apply it separately on the classic mo-
torcycle data (Silverman, 1985), which consists of measurements of the acceleration
of the head of a motorcycle rider in the first moments after impact.
To demonstrate the model on data with space-time dependencies, we next an-
alyze time-evolving unemployment rates at various geographical locations in the
United States. We consider two types of data: one contains time series of unem-
ployment rates at 83 counties in the state of Michigan (relatively uniform spatial
population density) and the other contains time series of unemployment rates at
187 metro cities across the entire United States (highly non-uniform spatial popu-
lation density). Further, by jointly analyzing other readily available auxiliary time-
dependent data sources that have significant correlation with the job market, we
may be able to improve the learning of space-time unemployment rates. An example of
such an auxiliary data source, considered in this paper, is the time-dependent stock
prices of companies listed in the S & P 500. The fundamental idea is to appro-
priately borrow statistical strength between these distinct but correlated data, to
obtain a better representation of each data type. We demonstrate that the KSBP-
GP joint factor model can exploit the shared statistical features across multiple data
types, to learn a better model for each data type, and considerably improve spatial
imputation of unemployment rates.
The remainder of the paper is organized as follows. In Section 2 we present the
proposed hierarchical Bayesian model for jointly analyzing heterogeneous data. In
Section 3 we discuss priors on factor loadings and in Section 4 we discuss priors
on factor scores. In Section 5 we provide theoretical properties of the proposed
KSBP-GP mixture prior. Section 6 outlines an MCMC inference algorithm, and
Section 7 provides experimental results for the proposed model on the analysis of
multiple genomic data, motorcycle data and econometric data. Finally we provide
concluding remarks in Section 8.
2 Joint Bayesian Factor Analysis
Let $\{X^{(r)}\}_{r=1}^{R}$ represent data from $R$ different modalities, where $X^{(r)} = (x_1^{(r)}, \ldots, x_M^{(r)}) \in \mathbb{R}^{N_r \times M}$. In sparse factor modeling, learning a single shared matrix of factor load-
RNr×M . In sparse factor modeling, learning a single shared matrix of factor load-
ings for different signal classes has been proposed in (Mairal et al., 2008). However
for heterogeneous data such as that considered here, learning a shared set of factor
loadings is more difficult.
The joint factor model may be represented as
$$X^{(r)} = D^{(r)}\left(W^{(c)} + W^{(r)}\right) + E^{(r)} \qquad (1)$$
The matrix $D^{(r)} = (d_1^{(r)}, \ldots, d_K^{(r)}) \in \mathbb{R}^{N_r \times K}$ consists of the factor loadings specific
to data modality $r$, the factor scores $W^{(r)} = (w_1^{(r)}, \ldots, w_M^{(r)}) \in \mathbb{R}^{K \times M}$ are specific to
data from modality $r$, $W^{(c)} = (w_1^{(c)}, \ldots, w_M^{(c)}) \in \mathbb{R}^{K \times M}$ consists of the factor scores
common to all modalities, and $E^{(r)} = (\varepsilon_1^{(r)}, \ldots, \varepsilon_M^{(r)}) \in \mathbb{R}^{N_r \times M}$ consists of the
noise/residual specific to data of modality $r$.
Note that one may alternatively consider
$$X^{(r)} = D^{(rc)} W^{(c)} + D^{(r)} W^{(r)} + E^{(r)} \qquad (2)$$
where D(rc) correspond to factor loadings associated with common factors, with
D(rc) reflective of how these common factors are viewed by modality r; D(r) are
factor loadings associated with factors specific only to modality r. The framework
in (1), which we employ throughout, allows sharing of factor loadings
between the common and modality-specific factors, and the manner in which $W^{(c)}$
and $W^{(r)}$ are modeled allows sufficient flexibility to yield (2) if the data so warrant.
We wish to impose the condition that any $x_i^{(r)}$ is a sparse linear combination of
the factor loadings. Hence, the factor scores are represented as
$$w_i^{(r)} = s_i^{(r)} \circ b_i^{(r)} \quad \text{and} \quad w_i^{(c)} = s_i^{(c)} \circ b_i^{(c)} \qquad (3)$$
where $s_i^{(r)} \in \mathbb{R}^K$, $s_i^{(c)} \in \mathbb{R}^K$, $b_i^{(r)} \in \{0,1\}^K$, $b_i^{(c)} \in \{0,1\}^K$, and $\circ$ represents the
Hadamard product (elementwise vector product).
The choice of priors for $D^{(r)}$, $s_i^{(r)}$ and $s_i^{(c)}$ are application-dependent and, in
the case of $D^{(r)}$, also modality-dependent; these are discussed below for specific
examples. The sparse binary vectors $b_i^{(r)}$ are drawn from the following beta-Bernoulli
process (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007; Paisley and
Carin, 2009)
$$b_i^{(r)} \sim \prod_{k=1}^{K} \mathrm{Bernoulli}(\pi_k)\,, \qquad \pi \sim \prod_{k=1}^{K} \mathrm{Beta}(c\alpha, c(1-\alpha)) \qquad (4)$$
with $\pi_k$ representing the $k$th component of $\pi$ and $\alpha \in (0,1)$. In practice $K$ is
finite, and (4) represents a finite approximation to the beta-Bernoulli
process, where the number of non-zero components of each $b_i^{(r)}$ is a random variable
drawn from Binomial$(K, \alpha)$. If $\alpha$ is set to $\rho/K$, in the limit $K \to \infty$ this reduces
to the number of non-zero components in $b_i^{(r)}$ being drawn from Poisson$(\rho)$; this
corresponds to the Indian buffet process (IBP) (Griffiths and Ghahramani, 2005;
Thibaux and Jordan, 2007; Paisley and Carin, 2009). We may therefore explicitly
impose a prior belief on the number of non-zero components in $w_i^{(r)}$. The shared
binary vectors $b_i^{(c)}$ are modeled in the same manner as $b_i^{(r)}$. The noise or residual in (1) is
modeled as
$$\varepsilon_i^{(r)} \sim \mathcal{N}\left(0, \gamma_\varepsilon^{(r)\,-1} I_{N_r}\right), \qquad \gamma_\varepsilon^{(r)} \sim \mathrm{Gamma}(a_0, b_0) \qquad (5)$$
where $I_{N_r}$ represents the $N_r \times N_r$ identity matrix.
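As an illustration, the finite beta-Bernoulli draw in (4) can be sketched in a few lines of numpy (the function name and parameter values below are ours, for illustration only):

```python
import numpy as np

def draw_sparse_indicators(M, K, c, alpha, seed=None):
    """Finite beta-Bernoulli approximation of (4): one usage probability
    pi_k per factor, shared across the M binary vectors b_i."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(c * alpha, c * (1.0 - alpha), size=K)  # pi_k ~ Beta(c*alpha, c*(1-alpha))
    B = (rng.random((K, M)) < pi[:, None]).astype(int)   # b_ik ~ Bernoulli(pi_k)
    return pi, B

# With alpha = rho/K and K large, the number of non-zero entries per
# column is approximately Poisson(rho), recovering the IBP behavior.
pi, B = draw_sparse_indicators(M=50, K=200, c=2.0, alpha=0.05, seed=0)
```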
The construction in (1) imposes the belief that there are underlying (low-dimensional)
features represented by the factor scores that may be shared across modalities, via
W (c); however, each modality has a unique mapping from these low-dimensional fac-
tor scores to the high-dimensional data, reflected by D(r). Further, each modality
may also have idiosyncratic low-dimensional features, characterized by W (r). The
common and idiosyncratic features are learned jointly, via the simultaneous anal-
ysis of all modalities. A unique feature of the above construction is that it allows
complete sharing of some low-dimensional features across different data modalities
as well as partial sharing, i.e., a shared feature may be slightly perturbed via W (r)
and shared across different modalities.
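To make the generative structure concrete, the following numpy sketch simulates from (1) under the priors above (dimensions, sparsity levels, and noise scale are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
R, K, M = 2, 10, 40                      # modalities, factors, samples
N = [30, 25]                             # observed dimension N_r per modality

pi = rng.beta(0.2, 1.8, size=K)          # beta-Bernoulli factor-usage probabilities
# Shared scores w_i^(c) = s_i^(c) o b_i^(c), drawn for all M samples at once
W_c = rng.standard_normal((K, M)) * (rng.random((K, M)) < pi[:, None])

X = []
for r in range(R):
    D_r = rng.standard_normal((N[r], K))                                    # modality-specific loadings
    W_r = rng.standard_normal((K, M)) * (rng.random((K, M)) < pi[:, None])  # idiosyncratic scores
    E_r = 0.1 * rng.standard_normal((N[r], M))                              # Gaussian residual, eq. (5)
    X.append(D_r @ (W_c + W_r) + E_r)                                       # eq. (1)
```

Note that each modality has its own mapping `D_r` from the shared low-dimensional scores to its observed dimension, while `W_r` carries features idiosyncratic to that modality.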
3 Imposing Structure on Factor Loadings
3.1 Simple construction
In the absence of covariates, the factor loadings may be drawn i.i.d. from a Gaussian
distribution (for ease of notation, we henceforth drop the modality index $r$, unless
referring to multiple data modalities simultaneously),
$$d_k \sim \mathcal{N}(0, \gamma_s^{-1} I_N), \qquad \gamma_s \sim \mathrm{Gamma}(a_5, b_5) \qquad (6)$$
3.2 Imposing sparsity
In many biological applications, it is desirable that the factor loading matrix is
sparse (Carvalho et al., 2008). To impose sparsity on the factor loadings, we employ
a Student-t sparseness-promoting prior (Tipping, 2001). In this construction $d_{jk}$,
the $j$th component of $d_k$, is drawn
$$d_{jk} \sim \mathcal{N}(0, \tau_{jk}^{-1})\,, \qquad \tau_{jk} \sim \mathrm{Gamma}(a_1, b_1) \qquad (7)$$
However, there are multiple ways one may desire to impose sparsity, such as using the
spike-slab prior (Ishwaran and Rao, 2005; Carvalho et al., 2008; Chen et al., 2011).
This consists of a discrete-continuous mixture of a point mass at zero, referred to
as the ‘spike’, and any other distribution, such as the Gaussian distribution, known
as the ‘slab’. A hierarchical beta-Bernoulli construction of the spike-slab prior for
imposing sparsity on the factor loadings is provided in (Chen et al., 2011). We found
that the spike-slab prior works as well as the model presented above; however, for
the sake of brevity, we include only the results for the Student-t sparseness prior in
this paper.
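The Student-t construction in (7) is a Gamma-Gaussian scale mixture; a minimal sketch (the hyperparameter values here are illustrative, not those used in the experiments):

```python
import numpy as np

def student_t_loadings(N, K, a1=2.0, b1=0.5, seed=None):
    """Scale-mixture draw of (7): tau_jk ~ Gamma(a1, rate=b1), then
    d_jk ~ N(0, 1/tau_jk). Marginally each d_jk is Student-t with 2*a1
    degrees of freedom: mass concentrates near zero while occasional
    large loadings remain possible."""
    rng = np.random.default_rng(seed)
    tau = rng.gamma(a1, 1.0 / b1, size=(N, K))   # numpy parameterizes by scale = 1/rate
    return rng.normal(0.0, 1.0 / np.sqrt(tau))

D = student_t_loadings(100, 20, seed=0)
```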
3.3 Covariate dependent factor loadings
3.3.1 Stationary covariance structure
We may model the covariate-dependent factor loadings as being drawn from a GP,
with the $k$th column of $D$, denoted $d_k$, drawn
$$d_k \sim \mathcal{N}(0, \Sigma) \qquad (8)$$
$$\Sigma(n,m) = \tau \exp\left(-\beta \|r_n - r_m\|_2^2\right) + \sigma \delta_{n,m} \qquad (9)$$
where rn and rm represent the nth and mth instances of the covariates, respectively.
The kernel (9) embeds the covariates into the covariance matrix, and is characterized
by three parameters: $\tau$ controls the signal variance, $\sigma$ controls the noise variance,
and $\beta$ is the bandwidth or scale parameter, which controls the amount of smoothing.
Other popular kernel choices may be found in (Rasmussen and Williams, 2005) and
references therein. Equation (9) imposes the belief that data that have similar
covariates are likely to be correlated and the correlation decays with increasing
distance in the covariate space. This presumes a stationary covariance structure.
We will utilize this prior for spatio-temporal data of unemployment rates over an
approximately uniformly populated region.
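A sketch of the squared-exponential construction (8)-(9), assuming 2-D spatial covariates and illustrative kernel parameters:

```python
import numpy as np

def sq_exp_cov(r, tau=1.0, beta=0.5, sigma=1e-3):
    """Covariance of (9): tau * exp(-beta * ||r_n - r_m||^2) + sigma * delta_nm."""
    d2 = np.sum((r[:, None, :] - r[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    return tau * np.exp(-beta * d2) + sigma * np.eye(len(r))

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 1.0, size=(30, 2))              # 30 locations in the unit square
Sigma = sq_exp_cov(r)
d_k = rng.multivariate_normal(np.zeros(30), Sigma)   # one factor loading, eq. (8)
```

The `sigma` term keeps the matrix well conditioned; nearby locations receive strongly correlated loading components, which is exactly the stationarity assumption relaxed in Section 3.3.2.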
3.3.2 Non-stationary covariance structure
For data with non-stationary covariance structure, we propose to divide the data
into clusters, with each cluster sharing a unique GP. We utilize a mixture-of-GP
approach based on the kernel stick-breaking process (KSBP) (Dunson and Park,
2008). The KSBP is based on the stick-breaking process (Sethuraman, 1994), which
involves sequential breaks of “sticks” of length $w_l$ from an original stick of unit
length, $\sum_{l=1}^{\infty} w_l = 1$. In a stick-breaking process, the cluster indicators are drawn as
$$z(i) \sim \sum_{l=1}^{\infty} w_l \delta_l\,, \; i = 1, \ldots, N; \qquad w_l = V_l \prod_{j=1}^{l-1} (1 - V_j)\,; \qquad V_l \sim \mathrm{Beta}(1, \gamma) \qquad (10)$$
Due to the property of the beta distribution for small γ, it is likely that only a
relatively small set of “sticks” will have appreciable weight.
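A truncated version of the stick-breaking draw (10), with illustrative values:

```python
import numpy as np

def stick_breaking_weights(gamma, L, seed=None):
    """Truncated stick-breaking construction of (10): w_l = V_l * prod_{j<l}(1 - V_j)."""
    rng = np.random.default_rng(seed)
    V = rng.beta(1.0, gamma, size=L)
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

w = stick_breaking_weights(gamma=1.0, L=50, seed=0)
z = np.random.default_rng(1).choice(50, size=100, p=w / w.sum())  # cluster indicators z(i)
```

For small `gamma`, Beta(1, gamma) draws concentrate near 1, so the first few sticks absorb most of the weight, matching the remark above.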
The primary difference between a stick-breaking process and a kernel stick-
breaking process is that the stick weights are further modulated by an additional
bounded kernel $\mathcal{K}(r; r_l^*, \phi_l) \in [0,1]$, a function of the covariates $r$. This
imposes the belief that data that are closer in covariate space will have similar
stick weights $w_l(r)$, and hence are likely to share the same cluster. In a KSBP, the
cluster indicators are drawn as
$$z(i) \sim \sum_{l=1}^{\infty} w_l(r_i) \delta_l\,, \qquad w_l(r) = V_l \mathcal{K}(r; r_l^*, \phi_l) \prod_{j=1}^{l-1} \left[1 - V_j \mathcal{K}(r; r_j^*, \phi_j)\right] \qquad (11)$$
where we employ a radial basis function (RBF) kernel,
$$\mathcal{K}(r; r_l^*, \phi_l) = \exp\left(-\frac{\|r - r_l^*\|_2^2}{\phi_l}\right) \qquad (12)$$
and $V_l \sim \mathrm{Beta}(1, \gamma)$, $\gamma \sim \mathrm{Gamma}(a_4, b_4)$, $r_l^* \sim F$, and $\phi_l \sim H$. The expression $\delta_l$
corresponds to a unit point measure at the index $l$.
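The covariate-dependent weights (11)-(12) can be sketched as follows (the centers, widths, and stick variables below are arbitrary illustrative draws):

```python
import numpy as np

def ksbp_weights(r, r_star, phi, V):
    """Stick weights w_l(r) of (11) at a single covariate location r,
    modulated by the RBF kernel (12)."""
    K_vals = np.exp(-np.sum((r - r_star) ** 2, axis=1) / phi)   # kernel values in [0, 1]
    VK = V * K_vals
    return VK * np.concatenate(([1.0], np.cumprod(1.0 - VK)[:-1]))

rng = np.random.default_rng(0)
L = 20
r_star = rng.uniform(0.0, 1.0, size=(L, 2))   # candidate kernel centers (a grid in the paper)
phi = np.full(L, 0.1)                         # kernel widths
V = rng.beta(1.0, 1.0, size=L)

w_a = ksbp_weights(np.array([0.20, 0.20]), r_star, phi, V)
w_b = ksbp_weights(np.array([0.21, 0.20]), r_star, phi, V)   # nearby location
w_c = ksbp_weights(np.array([0.90, 0.90]), r_star, phi, V)   # distant location
```

Nearby locations obtain similar stick weights and hence tend to share clusters; this covariate dependence is what distinguishes the KSBP from an ordinary stick-breaking (DP) construction.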
Here $F$ denotes the prior distribution over potential kernel-center locations. Ideally
$F$ would vary continuously over the entire space; however, this may
complicate computations. Hence, we choose a discrete prior, i.e., $F = \sum_h e_h \delta_{r_h^*}$,
where the $\{r_h^*\}$ constitute a grid of potential basis locations ($e_h > 0$ and $\sum_h e_h = 1$).
Similarly, the KSBP kernel widths, which may be interpreted as the regions of
influence of the basis functions, are inferred from the data. For ease of posterior
inference, a discrete prior $H = \sum_h p_h \delta_{\phi_h}$ is chosen, where $\{\phi_h\}$ represents a set of
potential kernel widths ($p_h > 0$ and $\sum_h p_h = 1$). Details on inference of these
parameters are provided in Appendix B.
For the case of covariate-dependent factor loadings, a single KSBP is drawn, with
associated covariates. For each factor loading dk, the components of this vector are
apportioned to clusters, via the KSBP. Each such cluster corresponds to a unique
GP, with associated GP parameters; all components of dk associated with a given
cluster are drawn jointly from this GP. Specifically, the lth GP for dk is characterized
by covariance
$$\Sigma_{kl}(n,m) = \tau_l \exp\left(-\beta_l \|r_n - r_m\|_2^2\right) + \sigma_l \delta_{n,m}$$
for $n,m \in A_{kl}$, where $A_{kl} = \{n : z_k(n) = l\}$ for $n = 1, \ldots, N$, and $z_k(n)$ represents
the cluster indicator for the $n$th element of $d_k$; $A_{kl}$ is the set of components of $d_k$
that are associated with GP component $l$, as apportioned via the KSBP. In practice
we truncate the number of GP mixture components to a large value J .
We will utilize this prior for spatio-temporal data of unemployment rates of cities
over a large non-uniformly populated region. The KSBP-GP not only imposes
the belief that cities form clusters based on similarities in their unemployment
rates, but it additionally imposes that cities that are geographically proximate are
likely to belong to the same cluster. This is the fundamental difference from a
Dirichlet process mixture of GPs (Rasmussen and Ghahramani, 2002), which does
not incorporate spatial information in the clustering process; other related mixture
models are discussed in Muller et al. (1996); Gelfand et al. (2005); Duan et al.
(2007). The benefits of the KSBP-GP mixture over a DP-GP mixture, for modeling
non-stationary spatial data, are illustrated in the results provided later.
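Given cluster indicators from the KSBP, drawing a factor loading under the mixture-of-GPs prior amounts to an independent joint GP draw per cluster; a sketch (the fixed two-cluster partition and GP parameters are ours, for illustration):

```python
import numpy as np

def draw_loading_ksbp_gp(r, z, taus, betas, sigmas, seed=None):
    """Draw one factor loading d_k under the KSBP-GP mixture: components of
    d_k in cluster l are drawn jointly from the l-th GP, independently
    across clusters (a sketch; notation follows Section 3.3.2)."""
    rng = np.random.default_rng(seed)
    d = np.zeros(len(r))
    for l in np.unique(z):
        idx = np.where(z == l)[0]                   # A_kl = {n : z_k(n) = l}
        rl = r[idx]
        d2 = np.sum((rl[:, None, :] - rl[None, :, :]) ** 2, axis=-1)
        Sigma_l = taus[l] * np.exp(-betas[l] * d2) + sigmas[l] * np.eye(len(idx))
        d[idx] = rng.multivariate_normal(np.zeros(len(idx)), Sigma_l)
    return d

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 1.0, size=(40, 2))
z = (r[:, 0] > 0.5).astype(int)   # a fixed toy partition into two clusters
d_k = draw_loading_ksbp_gp(r, z, taus=[1.0, 2.0], betas=[0.5, 5.0],
                           sigmas=[1e-3, 1e-3], seed=1)
```

Because each cluster carries its own `(tau, beta, sigma)`, the resulting loading can exhibit different smoothness in different regions, which a single stationary GP cannot.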
4 Imposing Structure on Factor Scores
In the simplest scenario (for example, for the genomic data considered in this paper)
the factor scores may be drawn i.i.d. from a Gaussian distribution,
$$s_i^{(r)} \sim \mathcal{N}(0, \gamma_r^{-1} I_K)\,, \qquad s_i^{(c)} \sim \mathcal{N}(0, \gamma_c^{-1} I_K) \qquad (13)$$
We impose broad gamma priors on $\gamma_r$ and $\gamma_c$: $\gamma_r \sim \mathrm{Gamma}(a_2, b_2)$ and $\gamma_c \sim \mathrm{Gamma}(a_3, b_3)$.
For data with covariates and a stationary covariance structure, the scores
may be drawn in a manner similar to the factor loadings, as discussed earlier in
Section 3.3.1. Let $S^{(r)} = (s_1^{(r)}, \ldots, s_M^{(r)}) \in \mathbb{R}^{K \times M}$. The rows of $S^{(r)}$, of length $M$
and denoted $s_{(k)}^{(r)}$ for the $k$th row, are drawn from a Gaussian process (for example,
for time-dependent data, $s_{(k)}^{(r)}$ represents the time dependence of factor $k$, at $M$ time
points). We employ a GP construction
$$s_{(k)}^{(r)} \sim \mathcal{N}(0, \Sigma') \qquad (14)$$
$$\Sigma'(p,q) = \tau^{(t)} \exp\left(-\beta^{(t)} \|t_p - t_q\|_2^2\right) + \sigma^{(t)} \delta_{p,q} \qquad (15)$$
where $t_p$ and $t_q$ represent the $p$th and $q$th covariates, respectively. Note that the rows
of $S^{(c)} = (s_1^{(c)}, \ldots, s_M^{(c)})$ are modeled in the same manner as the rows of $S^{(r)}$.
We have employed this construction for the space-time econometric data and
time-dependent stock-price data. For data with non-stationary covariance structure
(non-stationary temporal behavior), the factor scores may be drawn from a KSBP-
GP mixture, as discussed earlier in Section 3.3.2. However, for the data considered
in this paper, it was deemed not necessary to employ a KSBP-GP mixture for the
factor scores.
5 Conditional and Marginal Properties of KSBP-
GP Mixture
We next present analytical results on the correlation properties of the KSBP-GP
mixture prior. We are particularly interested in observing the effect of the KSBP
parameters (GP bandwidth parameters and the stick-breaking parameter γ) on the
covariance structure. For ease of understanding, the interpretations of the properties
are provided in context to the space-time data considered in this paper. The proofs
of the following properties are provided in Appendix A.
5.1 Conditional spatial covariance
Let $\Theta_k = \{r_{kl}^*, \phi_{kl}, V_{kl}\}_{l=1}^{\infty}$ represent the parameter set of the KSBP and $\Omega_k = \{\tau_{kl}^{(s)}, \beta_{kl}^{(s)}, \sigma_{kl}^{(s)}\}_{l=1}^{\infty}$ represent the parameter set for the GPs corresponding to the
$k$th factor loading. It can be shown that the conditional spatial covariance is
$$\mathrm{Cov}(d_{ik}, d_{i'k} \mid \Theta_k, \Omega_k) = \langle \psi_{ik}, \psi_{i'k} \rangle \qquad (16)$$
where $d_{ik}$ represents the $i$th element of $d_k$, $\psi_{ik} = [\sigma_{k1} w_{ik1}, \sigma_{k2} w_{ik2}, \ldots, \sigma_{k\infty} w_{ik\infty}]^T$, $\psi_{i'k} = [\sigma_{k1} w_{i'k1}, \sigma_{k2} w_{i'k2}, \ldots, \sigma_{k\infty} w_{i'k\infty}]^T$, $\sigma_{kl} = \sqrt{\Sigma_{k,l}(i,i')}$ and $w_{ikl} = V_{kl}\mathcal{K}(r_i; r_{kl}^*, \phi_{kl}) \prod_{j=1}^{l-1} \left[1 - V_{kj}\mathcal{K}(r_i; r_{kj}^*, \phi_{kj})\right]$.
Equation (16) shows the dependence of the conditional spatial covariance on the
KSBP kernel parameter $\phi_{kl}$: a smaller $\phi_{kl}$ implies little borrowing of spatial
information, whereas a larger $\phi_{kl}$ implies greater sharing of spatial information.
As $\phi_{kl}$ becomes smaller, the elements of $d_k$ are less correlated spatially; in the
limit $\phi_{kl} \to 0$, $\mathrm{Cov}(d_{ik}, d_{i'k} \mid \Theta_k, \Omega_k) = 0$ and the model reduces to the elements of $d_k$
being drawn independently from a normal distribution. As $\phi_{kl}$ becomes larger, more
long-range spatial correlation is encouraged; in the limit $\phi_{kl} \to \infty$, the kernel
effect vanishes and the model reduces to a Dirichlet process (DP) mixture-of-GPs
prior on $d_k$. Note that in our model the factor loadings $\{d_k\}_{k=1}^{K}$ are drawn from a
mixture of Gaussians and hence are non-Gaussian; this is a powerful
feature of our proposed model, since most spatio-temporal factor models assume
that the factor loadings are Gaussian (Luttinen and Ilin, 2009; Lopes et al., 2008,
2011).
5.2 Marginal spatial covariance
The marginal spatial covariance is
$$\mathrm{Cov}(d_{ik}, d_{i'k}) = \sum_{l=1}^{\infty} \int_{\Theta_k, \Omega_k} \Sigma_{k,l}(i,i')\, p\left(z_k(i) = z_k(i') = l \mid \Theta_k\right) P(d\Theta_k)\, P(d\Omega_k)$$
To demonstrate the properties of the marginal spatial covariance, we analyze its
behavior under some simplifying assumptions. We assume that all the GPs share the
same kernel parameters, i.e., $\Sigma_{k,l}(i,i') = \Sigma_k(i,i')$. We further assume a rectangular
kernel, $\mathcal{K}(r, r^*; \phi) = 1$ for $\|r - r^*\|_2 \le \Delta$ and $\mathcal{K}(r, r^*; \phi) = 0$ for $\|r - r^*\|_2 > \Delta$; we also assume that $r^*$ is drawn uniformly. The marginal spatial covariance then
reduces to
$$\mathrm{Cov}(d_{ik}, d_{i'k}) = \frac{E[\Sigma_k(i,i')]}{2 + \gamma \left\{\dfrac{2}{\pi}\left[\arcsin\!\left(\dfrac{\sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta}\right) - \dfrac{\left\|\frac{r_i - r_{i'}}{2}\right\|_2 \sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta^2}\right]\right\}^{-1} - 1} \qquad (17)$$
where $E$ denotes the expectation operator. Equation (17) provides insight into the effects
of the stick-breaking parameter $\gamma$, the distance between two spatial locations $\|r_i - r_{i'}\|_2$, and the kernel width $\Delta$ on the marginal spatial covariance. For example,
when $\gamma$ increases, $\mathrm{Cov}(d_{ik}, d_{i'k})$ decreases, and vice versa. This result is intuitively
pleasing: when $\gamma$ increases, the number of inferred clusters increases, which
reduces the probability of $r_i$ and $r_{i'}$ sharing the same GP, which in turn reduces the
correlation between them. We also observe that when the distance $\|r_i - r_{i'}\|_2$ between the two
spatial locations increases, $\mathrm{Cov}(d_{ik}, d_{i'k})$ decreases, and when the width of
the kernel $\Delta$ increases, $\mathrm{Cov}(d_{ik}, d_{i'k})$ increases. A limiting property is obtained
as the kernel width $\Delta \to \infty$: $\mathrm{Cov}(d_{ik}, d_{i'k}) = E[\Sigma_k(i,i')]/(1+\gamma)$.
5.3 Spatio-temporal covariance
For data point $x_{ij}$ located at $r_i$ at time $t_j$ and data point $x_{i'j'}$ located at $r_{i'}$ at time
$t_{j'}$, the covariance is given by
$$\mathrm{Cov}(x_{ij}, x_{i'j'}) = \alpha^2 \sum_{k=1}^{K} \mathrm{Cov}(d_{ik}, d_{i'k})\, \mathrm{Cov}(s_{kj}, s_{kj'}) \qquad (18)$$
Equation (18) reveals that the spatio-temporal covariance is equal to the sum of
the products of the spatial and temporal covariances along every dimension. It is
evident that when $d_{ik}$ and $d_{i'k}$ do not share the same GP along any dimension, then
$\mathrm{Cov}(x_{ij}, x_{i'j'}) = 0$.
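A toy numeric check of the factorization in (18), with made-up per-factor covariances (the values below are arbitrary; a zero spatial covariance mimics loadings drawn from different GPs):

```python
import numpy as np

cov_d = np.array([0.9, 0.0, 0.4])   # Cov(d_ik, d_i'k) per factor; 0 => different GPs
cov_s = np.array([0.8, 0.5, 0.3])   # Cov(s_kj, s_kj') per factor
alpha = 0.2                         # beta-Bernoulli sparsity level

cov_x = alpha ** 2 * np.sum(cov_d * cov_s)   # eq. (18): sum over factors of spatial x temporal
# The second factor contributes nothing: its loadings come from different GPs.
```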
6 MCMC Inference
The conditional posterior distribution of all the model parameters for the joint
factor model (for data without covariates, such as the heterogeneous genomic data
considered in this paper) may be derived analytically. We use a Gibbs sampler to
draw samples from the posterior distribution of the model parameters. For the factor
analysis results on multiple genomic data presented in Sections 7.1.2 and 7.1.3, the
number of Gibbs burn-in samples is set to 3000 and the number of collection samples
is set to 1000. Broad gamma hyperpriors are chosen for the variance terms with
a0 = b0 = a2 = b2 = a3 = b3 = 10−5. The results are relatively insensitive to these
settings and various other settings such as a0 = b0 = a2 = b2 = a3 = b3 = 10−3 or
a0 = b0 = a2 = b2 = a3 = b3 = 10−6 yielded very similar results. The shrinkage
parameters on the factor loadings are set at a1 = 10−3 and b1 = 10−6 (for Gene-copy
number analysis results) and a1 = 1 and b1 = 10−2 (for Gene-Methylation analysis
results).
For the joint KSBP-GP factor model, the conditional posteriors for all the model
parameters, except the GP parameters, are available in closed form. For the GP
parameters, we obtain point estimates via restricted range maximum likelihood
estimation (Casella and Berger, 2001). For the specific application considered here
(joint modeling of econometric data), we would like the shared latent space to learn
more macroscopic similarities (such as global economic trends) between the different
data and the unique latent space to learn more microscopic features unique to any
particular data. This is imposed by specifying different intervals for the bandwidth
parameter during the restricted range MLE based optimization to update the GP
parameters. For the kernel stick-breaking process we truncate the sum in (11) to
$J = 50$ terms, with $w_J(r) = 1 - \sum_{l=1}^{J-1} w_l(r)$. For the results presented on motorcycle
data, unless otherwise noted, the number of Gibbs burn-in samples is set to 3000
and the number of collection samples is set to 1000. For the results presented on
econometric data, the number of Gibbs burn-in samples is set to 1500 and the
number of collection samples is set to 500. All important (unique) update equations
are provided in Appendix B.
Since much correlation is encoded in the priors, the mixing of the MCMC sampler
was also carefully examined. The sampler was run extensively for different numbers
of burn-in and collection samples. It was also run multiple times in parallel with
different initial values. The results of these experiments were found to be consistent
and repeatable across such runs.
7 Experimental Results
We perform a number of experiments to validate the joint factor model as well as its
various components. The proposed model (for data without covariates) is applied
to heterogeneous genomic data for ovarian cancer patients. Next, the KSBP-GP
mixture prior is isolated from the factor model, and validated separately on motor-
cycle data (measurements of the acceleration of the head of a motorcycle rider after
impact). Finally, the KSBP-GP factor model is demonstrated first on the analy-
sis of spatially inhomogeneous data of a single modality (multi-year unemployment
rates of metro cities across US), and then on multi-modal space-time data (US and
Michigan unemployment rates and stock prices of companies in the S & P 500).
7.1 Joint analysis of heterogeneous genomic data
There are numerous publications on combining different types of DNA modifications
with gene expression. Perhaps the most natural of these are methods such as expres-
sion quantitative trait loci (eQTL) analysis (Kendziorski et al., 2006); more recent
eQTL formulations are discussed in (Scott-Boyer et al., 2012). CNAmet (Louhimo
and Hautaniemi, 2011) defines a similar approach to relate gene expression changes
with either copy number change or DNA methylation. Other approaches use well
established models for each of the individual data types, then combine the results
into a statistic that addresses the problem of interest. The approach of Jeong et al.
(2010) is an example of this for the identification of genes that are regulated by DNA
methylation. A shortcoming of all of these approaches is that they do not reduce
the dimension of the individual data sets through an accounting of their respective
correlation structures.
In (Lanckriet et al., 2004) the authors used kernel functions predefined for each
data type, and mapped to the same vector space, which allows joint analysis in the
common range of the kernels. Copy number and expression in cancer (CONEXIC)
(Akavia et al., 2010) has been proposed as a Bayesian scoring function that measures
how well a set of candidate gene regulators correlate with the expression of gene
modules (groups of genes that are correlated with each other). Another approach
(Lucas et al., 2010) utilizes a sparse factor model to model the correlation structure
of the gene expression data, but uses post-hoc hypothesis tests to draw connections
between gene expression and copy number data. These approaches do allow for
effective dimension reduction, but don’t use correlation structure in one data set to
inform estimations of correlation in the others.
The most direct approach to jointly modeling the correlation structure of heterogeneous genomic data is to require the factor matrix to be shared, as in Shen et al.
(2009). Their model does not contain a data-type-specific factor structure equiva-
lent to W (r) in our model, and is therefore somewhat less flexible. In addition, they
utilize standard normal distributions on the elements of the factor matrix, eliminat-
ing the possibility of discovering factors that are relevant for only a subset of the
subjects.
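The shared-plus-specific factorization discussed above can be sketched as a generative model. The following minimal illustration uses hypothetical dimensions and names (S_c, S_r, A_r, W_r) and omits the beta-Bernoulli sparsity and other priors of the actual model; it only shows how shared scores induce correlation across modalities while each modality retains its own factor structure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 74                    # subjects
k_c, k_s = 3, 2           # shared and modality-specific factor counts (hypothetical)
p1, p2 = 100, 80          # feature dimensions of modalities r = 1, 2 (hypothetical)

# Shared factor scores S_c and modality-specific scores S_1, S_2
S_c = rng.standard_normal((k_c, n))
S_1 = rng.standard_normal((k_s, n))
S_2 = rng.standard_normal((k_s, n))

# Loadings: A_r maps the shared factors, W_r the modality-specific ones
A_1 = rng.standard_normal((p1, k_c))
W_1 = rng.standard_normal((p1, k_s))
A_2 = rng.standard_normal((p2, k_c))
W_2 = rng.standard_normal((p2, k_s))

# Each modality is a shared part plus a specific part plus noise; the shared
# scores S_c are what tie the two modalities together statistically
X_1 = A_1 @ S_c + W_1 @ S_1 + 0.1 * rng.standard_normal((p1, n))
X_2 = A_2 @ S_c + W_2 @ S_2 + 0.1 * rng.standard_normal((p2, n))
```

Dropping the W_r S_r term recovers the purely shared model of Shen et al. (2009), which is the restriction criticized above.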
7.1.1 Data description
The data in this study include ovarian cancer gene expression, copy number varia-
tion (CNV) and methylation data collected from the Cancer Genome Atlas (TCGA)
project (http://cancergenome.nih.gov/). We aim to integrate gene expression/CNVs
and gene expression/methylation from 74 ovarian cancer patients. For computing
purposes, we downsized the original massive data into smaller sets. Independent
gene-by-gene filtering (based on criteria such as overall mean and overall variance)
is typically employed to reduce data dimension as well as increase the number of
discoveries in high-throughput experiments (Bourgon et al., 2010; Gentleman et al.,
2005; Talloen et al., 2007). In our analysis, a filtering criterion was established for the gene expression data to eliminate probes with sample mean below 6 or standard deviation below 0.4, which reduced the gene expression dataset from 22277 to 5976 probes. Comparative genomic hybridization (CGH) data was filtered
to remove Agilent Human Genome CGH 244A probes containing missing values.
This set was further filtered by keeping only one in 50 probes, leaving 4443 probes.
Methylation data (Illumina Infinium human methylation 27K bead assay) was fil-
tered to retain only higher variance samples (resulting in 4722 probes) and was
inverse-probit transformed to lie on the real line.
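The preprocessing steps above (mean/variance filtering, probe subsampling, and mapping methylation values to the real line) can be sketched as follows. The data here are synthetic placeholders with arbitrary dimensions, and the transform shown is the standard-normal quantile function, our reading of the inverse-probit transform mentioned above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Synthetic expression matrix (probes x patients) with varying per-probe mean/sd
means = rng.uniform(4.0, 9.0, size=22277)
sds = rng.uniform(0.1, 1.5, size=22277)
expr = means[:, None] + sds[:, None] * rng.standard_normal((22277, 74))

# Keep probes with sample mean >= 6 and sample standard deviation >= 0.4
keep = (expr.mean(axis=1) >= 6.0) & (expr.std(axis=1) >= 0.4)
expr_filtered = expr[keep]

# CGH: drop probes with missing values, then keep one probe in 50
cgh = rng.standard_normal((10000, 74))          # synthetic, no NaNs here
cgh_sub = cgh[~np.isnan(cgh).any(axis=1)][::50]

# Methylation beta values in (0, 1) mapped to the real line via the
# standard-normal quantile (probit) function
beta = rng.uniform(0.01, 0.99, size=(4722, 74))
meth_real = norm.ppf(beta)
```

Independent gene-by-gene filters of this kind do not peek at class labels, which is why they can increase discovery power without inflating error rates (Bourgon et al., 2010).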
7.1.2 Analysis of gene expression and copy number variation data
Figure 1: The inferred feature selection matrices unique to data of modality r (B(r)) and common to both data modalities (B(c)). From left to right, the panels show the binary matrices unique to gene expression, unique to CNVs, and shared between gene expression and CNVs, respectively. The y-axis indexes the factors and the x-axis the 74 subjects. Factors and samples selected by the model are marked 1 (red), and 0 (blue) otherwise. Results are shown for the maximum-likelihood collection sample (for illustration purposes).
We applied the joint Bayesian factor model to gene expression and CGH in order
to identify factors that are representative of correlated changes in gene expression
and DNA copy number variations. We set the upper bound on the number of factors
as K = 60, and obtained 1 specific to gene expression, 4 unique to CNVs and 19
shared between both modalities (Figure 1). Figure 2 shows the correlation structure
Figure 2: Correlation structure between gene expression and CNVs of the top-loaded genes from factor 41. The figure displays correlation coefficients between the two data types. Panels (a) and (b) show the correlation results for patients selected and dropped by the model, respectively. CNVs from chromosome 6 and genes from chromosome 17 have a reverse correlation pattern.
of the probe sets (gene expression) and CGH clones (CNV) that are included in joint factor number 41 (the factor numbering is arbitrary and changes between collection samples, so these results are illustrative; in these and related results we depict the maximum-likelihood collection sample). As expected, correlation between the factor
genes for those patients who were included in this factor is higher than for those
not included. In results shown in Section 7.2.2, when analyzing the econometric
data, we demonstrate how the approximate posterior distribution may be utilized
(beyond just the maximum likelihood collection sample).
It is well known that some variations in cancer gene expression are caused by
gene dosage changes due to CNVs. In addition, because of the mechanism by which
CNV occurs, it tends to happen in contiguous regions. Of the 20 CNV factors
identified, one is a nearly perfect representation of batch effects in the data and the
remaining 19 display copy number amplification/deletion in specific chromosomal
regions. Most of these show similar gene expression changes in the same region. We
demonstrate this behavior in Figure 3, which shows that the largest factor loadings
from both CNV and gene expression for factor 18 are clustered around the same
region of chromosome 8.
We identified highly associated copy number variations in the chromosomal arm
8q12.3-8q24.13 (factor 18), which is a known region for frequent high-level amplification associated with disease progression in human cancers (Frank et al., 2007; Pils et al., 2005).

Figure 3: Factor analytic relationship between CNV and gene expression. The panels show the factor loadings from the 18th factor of the joint factor model fit to CNV (Panel A) and gene expression (Panel B) data, respectively. These results are for the maximum-likelihood collection sample, for ease of interpretation, although a full approximate posterior distribution is inferred. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

The rediscovery of genes in this region also validates our approach. For example, E2F5 (8q21.2, Unique ID: 1875), an important gene in the
regulation of cell cycle, is known to be overexpressed in ovarian epithelial cancer
(Kothandaraman et al., 2010). Over-expressed genes, MTDH (8q22.1, Unique ID:
92140) and EBAG9 (8q23, Unique ID: 9166), have been recognized in a variety of
cancers including ovarian and breast cancers (Akahira et al., 2004; Rennstam et al.,
2003; Emdad et al., 2007). Another gene in this region whose expression level is
known to be important in tumor biology is WWP1 (8q21, Unique ID: 11059). This
recapitulation of some of the well known features of aneuploidy in cancer suggests
that our joint model is appropriately capturing correlation structure between gene
expression and CGH data.
As described above, many factors we obtained are associated with individual
chromosomal locations, as demonstrated in Figure 3. However, there is also a sub-
set of factors (1, 14, 32, 41, 45, 57) which are representative of multiple regions.
Figure 4 shows that the largest factor loadings in CNV/gene expression for factor
41 come from both chromosomes 6 and 17. This also explains the checkerboard regions of positive and negative correlation in Figure 2.

Figure 4: Dual peaks in the loadings of factor 41 of the joint factor model fit to CNV (Panel A) and gene expression (Panel B) data. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

The copy
number variations from the top ranked CGH probes in the two locations are highly
negatively correlated, with copy number gain in chromosome 6 and loss in the other.
There are a number of possible mechanistic explanations for this feature. For ex-
ample, it is possible that wholesale duplication of one region is lethal to the cells
without shutting down the apoptosis pathway. Such a shut down might be accom-
plished by deletion of other regions. Previous approaches to the joint analysis of
gene expression and CNV through the use of factor models, such as Lucas et al.
(2010), have failed to find these relationships.
The proposed joint factor model provides the flexibility of discovering factors that
are relevant only for a subset of the subjects. It is interesting to note that a similar model, which enforces that all subjects are included in the inferred factors, performed poorly compared to the proposed model and discovered far fewer factors that captured correlated changes in gene expression and copy number variation.
7.1.3 Analysis of gene expression and DNA methylation data
In this analysis, 18 common factors were inferred between methylation and gene ex-
pression. Unlike CNVs, methylation does not typically occur in contiguous regions,
therefore it is not surprising that no regional peaks were detected.

Figure 5: SPON1 gene identified in the loadings peak from factor 5 of the joint factor model fit to DNA methylation (Panel A) and gene expression (Panel B) data. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.

Methylation acts as an epigenetic regulator and silences tumor suppressor genes by changing chromosomal structures. We detected a gene, SPON1 (11p15.2, Unique ID: 10418), which
appears to be predominantly regulated by methylation of its CpG site (Figure 5).
Elevated expression of this gene relative to normal tissue is a known hallmark of
ovarian cancer (Pyle-Chenault et al., 2005), however, the mechanism of this overex-
pression was previously unknown. SPON1 encodes VSGP/F-spondin protein pro-
moting proliferation in vascular smooth cell during ovarian folliculogenesis, which
has been identified as a potential diagnostic marker or therapeutic target for ovarian
carcinoma (Pyle-Chenault et al., 2005; Miyamoto et al., 2001).
In contrast to the almost single gene precision of factor 5, factor 24 shows strong
correlation between methylation and gene expression in many different loci across
the entire genome (Figure 6). The list of CpG sites heavily loaded on this factor
are displayed in Table 1. Pathway analysis on these candidate genes reveals that
many are involved in DNA binding and regulation of transcription. The correla-
tion of methylation levels at all of these sites combined with their correlated gene
expression levels suggests that they are all the targets of a single methylation program; however, the existence of coordinated methylation enzymes that target these
locations is unconfirmed.
Figure 6: Loadings from factor 24, with strong correlations between methylation (Panel A) and gene expression (Panel B) at many different loci. Red denotes odd-numbered chromosomes and blue denotes even-numbered chromosomes.
We implemented the joint factor model for the analysis of multiple genomic data types in non-optimized Matlab on a quad-core PC with a 2.2 GHz CPU and 4 GB RAM. The average time per Gibbs-sampler iteration is 72 seconds for the results in Section 7.1.2 and 55 seconds for the results in Section 7.1.3.
7.2 Joint analysis of space-time-varying data
7.2.1 Motorcycle Data
Among the primary contributions in this paper is the introduction of a new KSBP-
GP mixture prior for modeling non-stationary data, as well as its integration in the
factor model. To clearly illustrate the performance of the KSBP-GP mixture model,
we first isolate it from the factor model and apply it separately on the classic mo-
torcycle data (Silverman, 1985), which has been used frequently in recent literature
to demonstrate the success of non-stationary models (Rasmussen and Ghahramani,
2002; Meeds and Osindero, 2006). The motorcycle data, represented as xm ∈ R94×1,
constitutes measurements of the acceleration of the head of a motorcycle rider in
the first moments after impact (over the first 60 milliseconds). We model xm via a
KSBP-GP mixture, and therefore components of xm are clustered via KSBP, using
HOXA13 SOCS2 SLC38A4 SPAG6 EPHX3 BST2 PITX2 FERD3L
GPR133 FCRL3 F10 BCAN ALX4 CDIPT CPT1C BCAP31
SOX1 ZNF385D IGF2AS ADCY4 EOMES GATA4 CABP5 PAX7
FLRT1 LEP TRPA1 HOXD10 PLEKHB1 GPR142 STK19 EVX1
SLC4A11 ZDHHC11 ZNF750 NXN AJAP1 VSX2 TRH FOXI1
RAC3 RENBP MYO3A GATA4 GRIK3 CARD14 APCDD1L CA3
PCDHAC1 BCAP31 SNCAIP CYP4F22 FCN1 SSNA1 GBP4 CASQ1
ARHGAP4 KLHL6 CEACAM3 CEBPG ABCB4 LYZL4 TRHDE CDX2
SCML1 PTHLH KLF11 SLC22A18 DENND2D C2orf43 PI3 ESX1
CBLN4 MAGEB6 AIM2 ZDHHC8P HEPACAM2 A2BP1 TERC C3
Table 1: Genes showing a significantly differential methylation pattern in their CpG sites. The list is generated from the candidates displayed in Figure 6A.
time as a covariate; all components of xm within the same KSBP-defined cluster
are drawn from an associated GP.
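The covariate-dependent clustering rests on the kernel stick-breaking weights of the KSBP. A minimal sketch is given below, assuming a squared-exponential kernel K(x, r*_j) = exp(-phi (x - r*_j)^2) and beta-distributed sticks; the specific kernel form, hyperparameter values, and names here are illustrative rather than those fixed in the paper:

```python
import numpy as np

def ksbp_weights(x, anchors, V, phi):
    """Kernel stick-breaking weights pi_j(x) at covariate x (here, time).

    pi_j(x) = V_j K(x, r*_j) * prod_{l<j} (1 - V_l K(x, r*_l)),
    with K(x, r*) = exp(-phi * (x - r*)**2), so clusters anchored near x
    receive larger weight.
    """
    K = np.exp(-phi * (x - anchors) ** 2)
    broken = V * K                       # piece of the stick broken off at step j
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - broken[:-1])))
    return broken * stick_left

rng = np.random.default_rng(2)
J, phi = 50, 0.05                        # truncation level and bandwidth (assumed)
anchors = rng.uniform(0.0, 60.0, size=J) # anchor points r*_j on the 0-60 ms axis
V = rng.beta(1.0, 1.0, size=J)           # stick-breaking beta draws

pi = ksbp_weights(25.0, anchors, V, phi)
```

Each component of x_m is then assigned to cluster j with probability pi_j(x), and all points in a cluster share one GP; under truncation the weights sum to at most one, with the remainder absorbed by the final stick.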
As shown in Figure 7 (left), the data roughly show three regions: a low-noise flat region, a curved region, and a high-noise flat region. Typically, as shown in
Figure 8(b), the model infers three to four dominant clusters (GPs); by “dominant”
we mean that approximately 94% of the data points are in these clusters. As shown
in Figure 7 (right), the KSBP-GP identifies the three regions of the data (time
intervals 0-10 ms, 10-40 ms and 40-60 ms), as distinct clusters, drawn from GPs
with different kernel parameters. Figure 8(a) shows 100 samples drawn from the
posterior (predictive distribution) of the KSBP-GP function evaluated (using GP
regression) at intervals of 0.5 ms. The choice of 100 Gibbs samples is to facilitate
comparison to earlier literature (Rasmussen and Ghahramani, 2002; Meeds and
Osindero, 2006), which provide results based on a similar number of Gibbs samples.
The inferred levels of uncertainty are in close agreement with those reported in
(Rasmussen and Ghahramani, 2002; Meeds and Osindero, 2006), but arguably, the
KSBP-GP captures the varying levels of uncertainty over the time interval 30 ms to
60 ms slightly better than (Rasmussen and Ghahramani, 2002; Meeds and Osindero,
2006), where they are flatter.
In Figure 8(c) we also show the frequency of the number of inferred clusters com-
puted over 1000 Gibbs collection samples. We observe that the KSBP-GP mixture
model primarily infers four to nine clusters to model the data. However, as seen from Figure 8(b), approximately 94% of the data points are associated
with 3 to 4 dominant clusters. This is in close agreement with the approximately
Figure 7: Motorcycle data (left) and typical inferred clusters for the maximum-likelihood Gibbs collection sample (right). Different colors correspond to different KSBP clusters (clustering shown for the most-likely collection sample).
three non-stationary regions as seen in Figure 7 (left) and is significantly better
than (Rasmussen and Ghahramani, 2002), which reported that the posterior distri-
bution uses between 3 and 15 experts to fit the data and with a low probability of
using up to 31 experts. The proposed KSBP-GP model offers additional advantages over the gating model of Rasmussen and Ghahramani (2002). The gating model is purely conditional (although Meeds and Osindero (2006) extended it to a fully generative model and reported results similar to those of Rasmussen and Ghahramani (2002)). Further, the gating parameter in
(Rasmussen and Ghahramani, 2002) is not conjugate to its prior and has to be sam-
pled via Metropolis-Hastings. The proposed KSBP-GP model is a fully generative
model and the KSBP kernel parameters φ (equivalent to the gating parameter in
(Rasmussen and Ghahramani, 2002)) as well as the anchor points denoted by r∗j are
conjugate to their priors and may be efficiently sampled by Gibbs sampling. Details
regarding sampling of φ and r∗j are provided in Appendix B.
7.2.2 Unemployment data across United States
In this section we provide results for the proposed KSBP-GP factor model on unemployment data across the United States (we first consider space-time data from a single
Figure 8: (a) 100 samples from the posterior for interpolation at intervals of 0.5 ms; (b) frequency of the number of dominant clusters over 1000 Gibbs collection samples; (c) frequency of the number of inferred clusters over 1000 Gibbs collection samples.
data modality, and below we consider multiple modalities). The data contain the unemployment rates of 187 metro cities across the United States, sampled monthly,
over the period 1991 to 2009. We first show typical clustering results obtained via
the KSBP-GP mixture model. We have set the truncation level for the KSBP to
J = 50 and typically about 15 clusters are inferred. In Figure 9(a)-(d), we show the
probabilities of the cities being associated with 4 dominant inferred clusters. Also,
in our model, we infer the bandwidth of the GP associated with each cluster. For
example, the inferred bandwidth associated with the GP shown in Figure 9(a) is
18.35 whereas the inferred bandwidth for the cluster shown in Figure 9(b) is 0.4028.
It is interesting to observe that the inferred bandwidth associated with the northeastern cities is much smaller than that for the midwestern cities. It is intuitively pleasing that the model infers a short-range correlation pattern among the more densely populated northeastern states and a longer-range correlation pattern among the sparsely populated midwestern states.
In Table 2, we provide comparative interpolation results for our proposed model
and two other simpler models. In our experiment, we divide the unemployment
data across the US (consisting of the unemployment rates of 187 cities) into two parts:
87 cities (approximately 46% of the total number of cities) are drawn uniformly
at random once, and the unemployment rates of these cities are used for model
learning. The unemployment rates of the remaining 100 cities are assumed to be
unknown and are interpolated based on the learned model (using Gaussian process
regression (Rasmussen and Williams, 2005; Lopes et al., 2008) methods) and the
results are provided in Table 2. The results demonstrate that the KSBP-GP is a
better model for the spatially varying US unemployment data.
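The interpolation to the held-out cities relies on standard GP regression: condition the GP on the observed locations and evaluate the posterior mean at the missing ones. A minimal sketch, using synthetic coordinates and a squared-exponential kernel with arbitrary bandwidth and noise level (the paper's inferred, cluster-specific bandwidths are not reproduced here):

```python
import numpy as np

def rbf(A, B, bandwidth):
    """Squared-exponential kernel between coordinate sets A (n x d) and B (m x d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(3)
obs_xy = rng.uniform(0, 10, size=(87, 2))   # coordinates of the 87 training cities (synthetic)
new_xy = rng.uniform(0, 10, size=(100, 2))  # coordinates of the 100 held-out cities
y_obs = np.sin(obs_xy[:, 0]) + 0.1 * rng.standard_normal(87)  # stand-in for unemployment rates

bw, noise = 2.0, 0.1                        # bandwidth and noise variance (assumed)
K = rbf(obs_xy, obs_xy, bw) + noise * np.eye(87)
K_star = rbf(new_xy, obs_xy, bw)

# GP posterior mean at the held-out locations
y_new = K_star @ np.linalg.solve(K, y_obs)
```

In the full model this conditioning is carried out per KSBP cluster and per factor-loading column, and the MSEs in Table 2 average the resulting reconstructions over Gibbs collection samples.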
Table 2: Row 1: average MSE of reconstruction of the unemployment rates (which are in units of percent) for the 87 US cities used for model learning. Row 2: average MSE of interpolation of the unemployment rates for the 100 missing US cities. The results are posterior means computed over 500 Gibbs collection samples. B1 refers to a factor model in which the factor loadings are drawn from a Dirichlet process (Sethuraman, 1994) mixture of GPs, and B2 to a factor model in which the factor loadings are drawn from a GP.

                  KSBP-GP   B1       B2
Reconstruction    0.0802    0.0700   0.5366
Interpolation     1.3288    1.4329   1.8046
Figure 9: (a)-(d) Four dominant inferred clusters obtained via the KSBP-GP mixture model. Darker shades indicate higher probability of a city being associated with a specific cluster. Note that the stick weights of the KSBP capture the probability of association of a city with a particular cluster.
7.2.3 Joint Modeling: Michigan/United States plus S & P 500
To demonstrate the performance of the proposed joint KSBP-GP factor model, in
this paper we consider the integration of three different data types: unemployment rates of 83 counties in Michigan (modality r = 1), unemployment rates of 187 metro cities across the United States (modality r = 2), and the stock prices of the 500 companies in the S & P 500 (modality r = 3). For the Michigan data, the columns
of D(1) are drawn from a GP as in (8) (we assume spatial homogeneity across
Michigan) and for the US data the columns of D(2) are drawn from the KSBP-GP
(we assume spatial heterogeneity across the US). For the stock price data, we have no spatial information, and the components of D(3) are drawn i.i.d. from a Gaussian as in (6).
Figure 10: Typical interpolation result for the GPFA and joint GPFA models with 79% missing data, for a single missing county (Delta County) in Michigan (results averaged over 500 collection samples). The curves show the true unemployment rates, the interpolation by the joint GPFA (Michigan + US + S & P 500), and the interpolation by GPFA using only the Michigan data.
Sampling of unemployment rates on a fine spatial scale is a difficult task. Hence,
we perform the following experiment to demonstrate our proposed joint GPFA
model. We assume that we have coarsely sampled unemployment data across Michi-
gan (unemployment rates of 18 counties, i.e., approximately 21% of the total coun-
ties, drawn uniformly at random once) and across the US (unemployment rates of
40 cities, i.e., 21% of the total cities, drawn uniformly at random once), with data
Figure 11: Typical inferred latent feature unique to Michigan (an inferred row of S(r), where r = Michigan) and one common to all data modalities (an inferred row of S(c)), for the maximum-likelihood Gibbs collection sample. A recession period is indicated.
sampled monthly from these sites over the period 1991-2009. However, we have the
stock prices of all the 500 companies listed in the S & P 500, sampled monthly over
the same time period. Next, using GP regression (Rasmussen and Williams, 2005;
Lopes et al., 2008), we obtain the posterior distributions of the unemployment rates
at the remaining 79% counties of Michigan conditioned on (a) 21% observed data in
Michigan and (b) 21% observed data in Michigan and the US plus the stock prices
of all companies in the S & P 500. For these two scenarios, we compute the posterior
mean of the unemployment rates at the missing counties, using 500 Gibbs collection
samples. The MSE of interpolation of the unemployment rates (which are in units
of percent) for a typical example missing county in Michigan for the two cases are
1.2447 (using only undersampled Michigan data) and 0.4922 (using Michigan + US
+ S & P data). We also show the time series of the interpolation for the missing
county in Figure 10. The results clearly demonstrate that jointly analyzing the
Michigan, the United States and the S & P data provides significant improvement
in the MSE of interpolation for the missing counties in Michigan, relative to interpo-
lating based only on observed Michigan data. In Figure 11 we show typical inferred
factor scores (latent features) unique to Michigan and common to all data modalities. It is interesting to observe that the latent features unique to Michigan capture
more microscopic and data specific behavior. For example, the latent feature in
Figure 11, specific to Michigan, captures the well known phenomenon of seasonal
variation in unemployment rates. On the other hand, latent features shared across all data modalities capture more macroscopic and global behavior pertaining to the
general economy. For example, the shared latent feature in Figure 11 captures the
US economic cycle over the last two decades.
We implemented the KSBP-GP factor model for the analysis of econometric data in non-optimized Matlab on a quad-core PC with a 2.2 GHz CPU and 4 GB RAM. The average time per iteration is 1 minute for the analysis in Section 7.2.2 and 1.5 minutes for the analysis in Section 7.2.3. Note that these results are for the most naive
implementation of a GP and the computational times may be significantly improved
by adopting efficient approximations for GP regression as provided in (Rasmussen
and Williams, 2005; Candela et al., 2007).
8 Conclusions
A joint factor analysis method is introduced for modeling multiple disparate but
statistically related data. The proposed approach was first demonstrated on the
joint analysis of heterogeneous genomic data related to ovarian cancer. The pro-
posed model uncovered key drivers of cancer, some previously reported in the literature, as well as some potentially new genomic causes of cancer.
Next the proposed approach was extended to a Gaussian process based factor analy-
sis approach to integrate space-time varying data. The approach can readily handle
spatial inhomogeneity and is also applicable to large and realistic datasets. This is
achieved via the introduction of a novel KSBP based mixture of GP prior. The-
oretical properties of the KSBP-GP factor model are also derived and discussed.
The joint KSBP-GP factor analysis model is shown to be particularly effective in
improving the model learning of undersampled data with the aid of other available
correlated data. Further, the joint model produced interpretable low dimensional
features (shared as well as data-specific features).
In this paper we have focused on integrating multiple heterogeneous but statistically correlated datasets via a joint factor analysis approach in which the latent
space is factorized into shared and data specific components. Moreover, data spe-
29
cific linear mappings from the latent space to the observation spaces where obtained
via joint analysis of all data modalities. However, for certain applications, the as-
sumption that the data lie in or close to a low-dimensional subspace is restrictive
and a better assumption is that the data lie on some low-dimensional manifold. In
the future we wish to relax the linearity assumption of our joint factor model via a
mixture of factor analyzers (MFA) approach.
References
Akahira, J.-I., Aoki, M., Suzuki, T., Moriya, T., Niikura, H., Ito, K., Inoue, S., Oka-
mura, K., Sasano, H., and Yaegashi, N. (2004). “Expression of EBAG9/RCAS1
is associated with advanced disease in human epithelial ovarian cancer.” Br J
Cancer , 90, 11, 2197–2202.
Akaike, H. (1987). “Factor analysis and AIC.” Psychometrika, 52, 317–332.
Akavia, U. D., Litvin, O., Kim, J., Sanchez-Garcia, F., Kotliar, D., Causton, H. C.,
Pochanard, P., Mozes, E., Garraway, L., and Pe’er, D. (2010). “An Integrated
Approach to Uncover Drivers of Cancer.” Cell , 143, 6, 1005–1017.
Archambeau, C. and Bach, F. (2008). “Sparse probabilistic projections.” In Neural
Information Processing Systems .
Bach, F. and Jordan, M. I. (2005). “A probabilistic interpretation of canonical
correlation analysis.” Tech. rep.
Borga, M. (1998). “Learning Multidimensional Signal Processing.” Linköping Studies in Science and Technology, Dissertations No. 531, Linköping University, Sweden.
Bourgon, R., Gentleman, R., and Huber, W. (2010). “Independent filtering increases
detection power for high-throughput experiments.” Proceedings of the National
Academy of Sciences , 107, 21, 9546–9551.
Candela, J. Q., Rasmussen, C. E., and Williams, C. K. I. (2007). “Approximation
Methods for Gaussian Process Regression.” Tech. rep., Microsoft Research.
Carvalho, C., Chang, J., Lucas, J., Nevins, J., Wang, Q., and West, M. (2008).
“High-Dimensional Sparse Factor Modelling: Applications in Gene Expression
Genomics.” Journal of the American Statistical Association, 103, 484, 1438–1456.
Casella, G. and Berger, R. (2001). Statistical Inference. Duxbury Resource Center.
Chen, M., Zaas, A., Woods, C., Ginsburg, G., Lucas, J., Dunson, D., and Carin, L.
(2011). “Predicting Viral Infection From High-Dimensional Biomarker Trajecto-
ries.” Journal of the American Statistical Association, 1–21.
Duan, J., Guindani, M., and Gelfand, A. (2007). “Generalized spatial Dirichlet
process models.” Biometrika, 94, 809–825.
Dunson, D. and Park, J.-H. (2008). “Kernel stick-breaking processes.” Biometrika,
95, 2, 307–323.
Emdad, L., Sarkar, D., Su, Z.-Z., Lee, S.-G., Kang, D.-C., Bruce, J., Volsky, D., and
Fisher, P. (2007). “Astrocyte elevated gene-1: Recent insights into a novel gene
involved in tumor progression, metastasis and neurodegeneration.” Pharmacology
& Therapeutics , 114, 2, 155 – 170.
Frank, B., Bermejo, J. L., Hemminki, K., Sutter, C., Wappenschmidt, B., Meindl,
A., Kiechle-Bahat, M., Bugert, P., Schmutzler, R., Bartram, C. R., and Bur-
winkel, B. (2007). “Copy number variant in the candidate tumor suppressor gene
MTUS1 and familial breast cancer risk.” Carcinogenesis , 28, 7, 1442–1445.
Gelfand, A., Kottas, A., and MacEachern, S. (2005). “Bayesian nonparametric
spatial modeling with Dirichlet process mixing.” J. Am. Stat. Ass., 100, 1021–
1035.
Gentleman, R., Carey, V., Huber, W., Irizarry, R., and Dudoit, S. (2005). Bioinfor-
matics and Computational Biology Solutions Using R and Bioconductor (Statistics
for Biology and Health). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Gramacy, R. and Lee, H. (2007). “Bayesian Treed Gaussian Process Models with
an Application to Computer Modeling.” Journal of the American Statistical As-
sociation.
Griffiths, T. and Ghahramani, Z. (2005). “Infinite latent feature models and the
Indian buffet process.” In Neural Information Processing Systems , 475–482.
Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004). “Canonical Correlation
Analysis: An Overview with Application to Learning Methods.” Neural Compu-
tation, 16, 12, 2639–2664.
Hotelling, H. (1936). “Relations Between Two Sets of Variates.” Biometrika, 28,
3/4, 321–377.
Ishwaran, H. and Rao, J. S. (2005). “Spike and slab variable selection: Frequentist
and Bayesian strategies.” Annals of Statistics , 33, 730–773.
Jeong, J., Li, L., Liu, Y., Nephew, K., Huang, T., and Shen, C. (2010). “An empirical
Bayes model for gene expression and methylation profiles in antiestrogen resistant
breast cancer.” BMC Medical Genomics , 3, 1, 55.
Kendziorski, C. M., Chen, M., Yuan, M., Lan, H., and Attie, A. D. (2006). “Sta-
tistical Methods for Expression Quantitative Trait Loci (eQTL) Mapping.” Bio-
metrics , 62, 1, 19–27.
Klami, A. and Kaski, S. (2008). “Probabilistic approach to detecting dependencies
between data sets.” Neurocomputing , 72, 1-3, 39–46.
Kothandaraman, N., Bajic, V., Brendan, P., Huak, C., Keow, P., Razvi, K., Salto-
Tellez, M., and Choolani, M. (2010). “E2F5 status significantly improves malig-
nancy diagnosis of epithelial ovarian cancer.” BMC Cancer , 10, 1, 64.
Lanckriet, G. G., De Bie, T., Cristianini, N., Jordan, M., and Noble, W. S. (2004).
“A statistical framework for genomic data fusion.” Bioinformatics , 20, 16, 2626–
2635.
Lopes, H. F., Gamerman, D., and Salazar, E. (2011). “Generalized spatial dynamic
factor models.” Comput. Stat. Data Anal., 55, 1319–1330.
Lopes, H. F., Salazar, E., and Gamerman, D. (2008). “Spatial Dynamic Factor
Analysis.” Bayesian Analysis , 3, 4, 759–792.
Louhimo, R. and Hautaniemi, S. (2011). “CNAmet: an R package for integrating
copy number, methylation and expression data.” Bioinformatics , 27, 6, 887–888.
Lucas, J. E., Kung, H. N., and Chi, J. T. A. (2010). “Latent Factor Analysis to Dis-
cover Pathway-Associated Putative Segmental Aneuploidies in Human Cancers.”
Plos Computational Biology , 6, 9.
Luttinen, J. and Ilin, A. (2009). “Variational Gaussian-process factor analysis
for modeling spatio-temporal data.” In Neural Information Processing Systems ,
1177–1185.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. (2008). "Supervised Dictionary Learning." In Neural Information Processing Systems, 1033–1040.
Meeds, E. and Osindero, S. (2006). "An alternative infinite mixture of Gaussian process experts." In Neural Information Processing Systems, 883–890.
Miyamoto, K., Morishita, Y., Yamazaki, M., Minamino, N., Kangawa, K., Matsuo, H., Mizutani, T., Yamada, K., and Minegishi, T. (2001). "Isolation and Characterization of Vascular Smooth Muscle Cell Growth Promoting Factor from Bovine Ovarian Follicular Fluid and Its cDNA Cloning from Bovine and Human Ovary." Archives of Biochemistry and Biophysics, 390, 1, 93–100.
Muller, P., Erkanli, A., and West, M. (1996). "Bayesian curve fitting using multivariate normal mixtures." Biometrika, 83, 1, 67–79.
Paisley, J. and Carin, L. (2009). "Nonparametric factor analysis with beta process priors." In Proceedings of the 26th International Conference on Machine Learning, 777–784.
Pils, D., Horak, P., Gleiss, A., Sax, C., Fabjani, G., Moebus, V., Zielinski, C., Reinthaller, A., Zeillinger, R., and Krainer, M. (2005). "Five genes from chromosomal band 8p22 are significantly down-regulated in ovarian carcinoma." Cancer, 104, 11, 2417–2429.
Pyle-Chenault, R., Stolk, J., Molesh, D., Boyle-Harlan, D., McNeill, P., Repasky, E., Jiang, Z., Fanger, G., and Xu, J. (2005). "VSGP/F-spondin: a new ovarian cancer marker." Tumor Biol.
Rai, P. and Daume, H. (2009). "Multi-Label Prediction via Sparse Infinite CCA." In Advances in Neural Information Processing Systems 22, eds. Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1518–1526.
Rasmussen, C. and Ghahramani, Z. (2002). "Infinite mixtures of Gaussian process experts." In Neural Information Processing Systems, 881–888.
Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Rennstam, K., Ahlstedt-Soini, M., Baldetorp, B., Bendahl, P.-O., Borg, A., Karhu, R., Tanner, M., Tirkkonen, M., and Isola, J. (2003). "Patterns of Chromosomal Imbalances Defines Subgroups of Breast Cancer with Distinct Clinical Features and Prognosis. A Study of 305 Tumors by Comparative Genomic Hybridization." Cancer Research, 63, 24, 8861–8868.
Schmidt, M. N. (2009). "Function Factorization using Warped Gaussian Processes." In Proceedings of the 26th International Conference on Machine Learning, 921–928. Montreal.
Schmidt, M. N. and Laurberg, H. (2008). "Non-negative matrix factorization with Gaussian process priors." Computational Intelligence and Neuroscience.
Schwarz, G. (1978). "Estimating the dimension of a model." The Annals of Statistics, 6, 461–464.
Scott-Boyer, M., Imholte, G., Tayeb, A., Labbe, A., Deschepper, C., and Gottardo, R. (2012). "An Integrated Hierarchical Bayesian Model for Multivariate eQTL Mapping." Stat. Appl. Genetics Molecular Biology, 11.
Sethuraman, J. (1994). "A constructive definition of Dirichlet priors." Statistica Sinica, 4, 639–650.
Shen, R., Olshen, A. B., and Ladanyi, M. (2009). "Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis." Bioinformatics, 25, 22, 2906–2912.
Shi, J. Q., Murray-Smith, R., and Titterington, M. (2003). "Bayesian Regression and Classification Using Mixtures of Gaussian Processes." International Journal of Adaptive Control and Signal Processing, 149–161.
Silverman, B. W. (1985). "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting." Journal of the Royal Statistical Society. Series B (Methodological), 47, 1, 1–52.
Talloen, W., Clevert, D.-A., Hochreiter, S., Amaratunga, D., Bijnens, L., Kass, S., and Gohlmann, H. (2007). "I/NI-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data." Bioinformatics, 23, 21, 2897–2902.
Thibaux, R. and Jordan, M. I. (2007). "Hierarchical beta processes and the Indian buffet process." In Proceedings of the 11th Conference on Artificial Intelligence and Statistics.
Tipping, M. E. (2001). "Sparse Bayesian learning and the relevance vector machine." J. Mach. Learn. Res., 1, 211–244.
Tresp, V. (2001). "Mixtures of Gaussian processes." In Neural Information Processing Systems, 654–660.
Wang, C. (2007). "Variational Bayesian Approach to Canonical Correlation Analysis." IEEE Transactions on Neural Networks, 18, 3, 905–910.
Appendix A.

In this appendix we provide proofs of the correlation properties stated in Section 5.
Proof of conditional spatial covariance

Let $\Theta_k = \{r^*_{kl}, \phi_{kl}, V_{kl}\}_{l=1}^{\infty}$ represent the parameter set of the KSBP and $\Omega_k = \{\tau^{(s)}_{kl}, \beta^{(s)}_{kl}, \sigma^{(s)}_{kl}\}_{l=1}^{\infty}$ represent the parameter set for the GPs corresponding to the $k$th factor loading. The conditional spatial covariance is

$$
\mathrm{Cov}(d_{ik}, d_{i'k} \mid \Theta_k, \Omega_k)
= \sum_{l=1}^{\infty} \Sigma_{k,l}(i,i')\, p\big(z_k(i) = z_k(i') = l \mid \Theta_k\big)
= \sum_{l=1}^{\infty} \Sigma_{k,l}(i,i')\, w_{ikl} w_{i'kl}
= \langle \boldsymbol{\psi}_{ik}, \boldsymbol{\psi}_{i'k} \rangle
\tag{19}
$$

where $\boldsymbol{\psi}_{ik} = [\sigma_{k1}w_{ik1}, \sigma_{k2}w_{ik2}, \cdots]^T$, $\boldsymbol{\psi}_{i'k} = [\sigma_{k1}w_{i'k1}, \sigma_{k2}w_{i'k2}, \cdots]^T$, $\sigma_{kl} = \sqrt{\Sigma_{k,l}(i,i')}$, and $w_{ikl} = V_{kl}K(r_i, r^*_{kl}; \phi_{kl}) \prod_{j=1}^{l-1}\big[1 - V_{kj}K(r_i, r^*_{kj}; \phi_{kj})\big]$.
Proof of marginal spatial covariance

The marginal spatial covariance is

$$
\mathrm{Cov}(d_{ik}, d_{i'k}) = \sum_{l=1}^{\infty} \int_{\Theta, \Omega} \Sigma_{k,l}(i,i')\, p\big(z(i) = z(i') = l \mid \Theta\big)\, P(d\Theta)\, P(d\Omega)
\tag{20}
$$

As it is difficult to obtain a closed-form expression for (20), we make the following simplifying assumptions. If all the GPs share the same kernel parameters, i.e., $\Sigma_{k,l}(i,i') = \Sigma_k(i,i')$, then the marginal spatial covariance is

$$
\begin{aligned}
\mathrm{Cov}(d_{ik}, d_{i'k})
&= E[\Sigma_k(i,i')] \sum_{l=1}^{\infty} \int_{\Theta} p\big(z(i) = z(i') = l \mid \Theta\big)\, P(d\Theta) \\
&= E[\Sigma_k(i,i')] \sum_{l=1}^{\infty} E\Big[ V_l^2 K(r_i, r^*_l; \phi^*_l) K(r_{i'}, r^*_l; \phi^*_l) \prod_{\ell=1}^{l-1} \big[1 - V_\ell K(r_i, r^*_\ell; \phi^*_\ell)\big]\big[1 - V_\ell K(r_{i'}, r^*_\ell; \phi^*_\ell)\big] \Big] \\
&= E[\Sigma_k(i,i')] \sum_{l=1}^{\infty} E\big[ V^2 K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*) \big] \cdot \Big( E\big[ (1 - VK(r_i, r^*; \phi^*))(1 - VK(r_{i'}, r^*; \phi^*)) \big] \Big)^{l-1} \\
&= E[\Sigma_k(i,i')]\, \frac{E\big[ V^2 K(r_i, r^*; \phi^*) K(r_{i'}, r^*; \phi^*) \big]}{1 - E\big[ (1 - VK(r_i, r^*; \phi^*))(1 - VK(r_{i'}, r^*; \phi^*)) \big]} \\
&= \frac{E[\Sigma_k(i,i')]}{\dfrac{E[V]}{E[V^2]} \cdot \dfrac{E[K(r_i, r^*; \phi^*)] + E[K(r_{i'}, r^*; \phi^*)]}{E[K(r_i, r^*; \phi^*)K(r_{i'}, r^*; \phi^*)]} - 1}
\end{aligned}
\tag{21}
$$
where $E$ denotes the expectation operator. Since $\Theta = \{r^*_l, \phi^*_l, V_l\}_{l=1}^{\infty}$ are drawn i.i.d., we drop the subscript $l$ in the latter steps of (21). For $V \sim \mathrm{Beta}(1, \gamma)$, we have

$$
\frac{E[V]}{E[V^2]} = \frac{2 + \gamma}{2}
\tag{22}
$$
To obtain a simplified expression for $\frac{E[K(r_i, r^*; \phi^*)] + E[K(r_{i'}, r^*; \phi^*)]}{E[K(r_i, r^*; \phi^*)K(r_{i'}, r^*; \phi^*)]}$, we make another simplifying assumption: instead of a Gaussian kernel, we assume a rectangular kernel for the KSBP, given by $K(r, r^*; \phi^*) = 1$ for $\|r - r^*\|_2 \leq \Delta$ and $K(r, r^*; \phi^*) = 0$ for $\|r - r^*\|_2 > \Delta$; we also assume that $r^*$ is drawn uniformly, with density function $P_{r^*}(r^*) = \frac{1}{S}$, where $S$ denotes the area of the entire support of $r^*$. Hence we have

$$
E[K(r_i, r^*; \phi^*)] = E[K(r_{i'}, r^*; \phi^*)] = \frac{1}{S} S_{\mathrm{circle}} = \frac{1}{S}\pi\Delta^2
\tag{23}
$$

$$
E[K(r_i, r^*; \phi^*)K(r_{i'}, r^*; \phi^*)] = \frac{1}{S} S_{\cap}
\tag{24}
$$

where $S_{\cap}$ denotes the area of the intersection of two circles with centers at $r_i$ and $r_{i'}$ and with identical radius $\Delta$. Hence

$$
S_{\cap} = 2\big(S_{\mathrm{sector}} - S_{\mathrm{triangle}}\big)
= 2\left[ \frac{2\arcsin\!\left( \frac{\sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta} \right)}{2\pi}\,\pi\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2 \sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}\; \right]
\tag{25}
$$
Hence we have

$$
\frac{E[K(r_i, r^*; \phi^*)] + E[K(r_{i'}, r^*; \phi^*)]}{E[K(r_i, r^*; \phi^*)K(r_{i'}, r^*; \phi^*)]}
= \frac{1}{\frac{1}{\pi}\left[ \arcsin\!\left( \frac{\sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta} \right) - \frac{\left\|\frac{r_i - r_{i'}}{2}\right\|_2 \sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta^2} \right]}
\tag{26}
$$
Substituting (22) and (26) in (21), we have

$$
\mathrm{Cov}(d_{ik}, d_{i'k}) = \frac{E[\Sigma_k(i,i')]}{\dfrac{2+\gamma}{2} \cdot \dfrac{1}{\frac{1}{\pi}\left[ \arcsin\!\left( \frac{\sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta} \right) - \frac{\left\|\frac{r_i - r_{i'}}{2}\right\|_2 \sqrt{\Delta^2 - \left\|\frac{r_i - r_{i'}}{2}\right\|_2^2}}{\Delta^2} \right]} - 1}
\tag{27}
$$
Proof of spatio-temporal covariance

For data point $x_{ij}$ located at $r_i$ at time $t_j$ and data point $x_{i'j'}$ located at $r_{i'}$ at time $t_{j'}$, the covariance is given by

$$
\begin{aligned}
\mathrm{Cov}(x_{ij}, x_{i'j'})
&= \mathrm{Cov}\Big( \sum_{k=1}^{K} d_{ik}b_{kj}s_{kj},\; \sum_{k=1}^{K} d_{i'k}b_{kj'}s_{kj'} \Big) \\
&= \sum_{k=1}^{K} \mathrm{Cov}\big( d_{ik}b_{kj}s_{kj},\, d_{i'k}b_{kj'}s_{kj'} \big) + \sum_{k \neq k'} \mathrm{Cov}\big( d_{ik}b_{kj}s_{kj},\, d_{i'k'}b_{k'j'}s_{k'j'} \big) \\
&\overset{(a)}{=} \sum_{k=1}^{K} \mathrm{Cov}\big( d_{ik}b_{kj}s_{kj},\, d_{i'k}b_{kj'}s_{kj'} \big) \\
&\overset{(b)}{=} \sum_{k=1}^{K} E(b_{kj})E(b_{kj'})\, \mathrm{Cov}\big( d_{ik}s_{kj},\, d_{i'k}s_{kj'} \big) \\
&\overset{(c)}{=} \alpha^2 \sum_{k=1}^{K} \mathrm{Cov}\big( d_{ik}s_{kj},\, d_{i'k}s_{kj'} \big) \\
&\overset{(d)}{=} \alpha^2 \sum_{k=1}^{K} \mathrm{Cov}(d_{ik}, d_{i'k})\, \mathrm{Cov}(s_{kj}, s_{kj'})
\end{aligned}
\tag{28}
$$

where the steps of the proof are justified as follows: (a) when $k \neq k'$, $d_{ik}b_{kj}s_{kj}$ and $d_{i'k'}b_{k'j'}s_{k'j'}$ are independent, and therefore $\mathrm{Cov}(d_{ik}b_{kj}s_{kj}, d_{i'k'}b_{k'j'}s_{k'j'}) = 0$; (b) $b_{kj}$ and $b_{kj'}$ are independent of $d_{ik}s_{kj}$ and $d_{i'k}s_{kj'}$; (c) $b_{kj}, b_{kj'}$ are drawn independently from a Bernoulli distribution with expectation $\alpha$; (d) the zero-mean processes $\{d_{ik}, d_{i'k}\}$ and $\{s_{kj}, s_{kj'}\}$ are drawn from independent GPs, and therefore $\mathrm{Cov}(d_{ik}s_{kj}, d_{i'k}s_{kj'}) = \mathrm{Cov}(d_{ik}, d_{i'k})\,\mathrm{Cov}(s_{kj}, s_{kj'})$.
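Step (d) relies on the product-covariance identity for zero-mean independent pairs. A quick simulation, our own check with arbitrary correlation values, illustrates it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
# zero-mean correlated pairs: (d_ik, d_i'k) and (s_kj, s_kj') from independent "GPs"
Cd = np.array([[1.0, 0.6], [0.6, 1.0]])       # Cov(d_ik, d_i'k) = 0.6
Cs = np.array([[1.0, 0.3], [0.3, 1.0]])       # Cov(s_kj, s_kj') = 0.3
d = rng.multivariate_normal([0, 0], Cd, size=n)
s = rng.multivariate_normal([0, 0], Cs, size=n)

# step (d): Cov(d_ik s_kj, d_i'k s_kj') = Cov(d_ik, d_i'k) * Cov(s_kj, s_kj')
prod_cov = np.cov(d[:, 0] * s[:, 0], d[:, 1] * s[:, 1])[0, 1]
assert abs(prod_cov - 0.6 * 0.3) < 0.02
```

The identity depends on the processes being zero mean; with nonzero means, cross terms in $E[d s\, d' s']$ would not cancel.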
Appendix B.

An MCMC algorithm for posterior inference of the joint KSBP-GP factor model proposed in the paper is provided below.

Sample $d^{(r)}_k$

$$
p(d^{(r)}_k \mid -) \propto \prod_{i=1}^{M} N\Big( x^{(r)}_i;\; D^{(r)}\big(s^{(c)}_i \circ b^{(c)}_i + s^{(r)}_i \circ b^{(r)}_i\big),\; \gamma^{(r)\,-1}_{\varepsilon} I_{N_r} \Big)\, N\big( d^{(r)}_k; 0, \Sigma_k \big)
\tag{29}
$$

In this and the notation below, $p(d^{(r)}_k \mid -)$ is the probability of $d^{(r)}_k$ conditioned on all other parameters being fixed to their most recent values in the sequence of Gibbs update equations, and $\circ$ denotes the element-wise product.
39
It can be shown that $d^{(r)}_k$ is drawn from a normal distribution

$$
p(d^{(r)}_k \mid -) \sim N\big( \mu_{d^{(r)}_k}, \Sigma_{d^{(r)}_k} \big)
\tag{30}
$$

where

$$
\Sigma_{d^{(r)}_k} = \Big( \Sigma_k^{-1} + \gamma^{(r)}_{\varepsilon} \sum_{i=1}^{M} \big( s^{(c)}_{ki}b^{(c)}_{ki} + s^{(r)}_{ki}b^{(r)}_{ki} \big)^2 I_{N_r} \Big)^{-1}
\tag{31}
$$

$$
\mu_{d^{(r)}_k} = \gamma^{(r)}_{\varepsilon}\, \Sigma_{d^{(r)}_k} \sum_{i=1}^{M} \big( s^{(c)}_{ki}b^{(c)}_{ki} + s^{(r)}_{ki}b^{(r)}_{ki} \big)\, x^{-k,(r)}_i
\tag{32}
$$

where

$$
x^{-k,(r)}_i = x^{(r)}_i - D^{(r)}\big( s^{(c)}_i \circ b^{(c)}_i + s^{(r)}_i \circ b^{(r)}_i \big) + d^{(r)}_k \big( s^{(c)}_{ki}b^{(c)}_{ki} + s^{(r)}_{ki}b^{(r)}_{ki} \big)
\tag{33}
$$

Note, for modeling the S & P 500 data the factor loadings are drawn i.i.d. from a Gaussian distribution and $\Sigma_k = \gamma_s^{-1} I_{N_r}$, while for modeling the Michigan data the factor loadings are drawn from a GP and $\Sigma_k(n,m) = \tau^{(s)}_k \exp\big( -\beta^{(s)}_k \|r_n - r_m\|_2^2 \big) + \sigma^{(s)}_k \delta_{n,m}$. Finally, for modeling the US data, the factor loadings are drawn from a mixture of GPs,

$$
d^{(r)}_k \sim \prod_{l=1}^{J} N(0, \Sigma_{k,l})
\tag{34}
$$

Hence the elements of $d^{(r)}_k$ which belong to the $l$th cluster, denoted by $d^{(r)}_{k,l}$, are drawn from a GP with covariance $\Sigma_{k,l}$, and the update equations for the $l$th GP cluster are identical to (30), (31) and (32), with $d^{(r)}_k$ replaced by $d^{(r)}_{k,l}$ and $\Sigma_k$ replaced by $\Sigma_{k,l}$.
Sample $b^{(c)}_k$, $b^{(r)}_k$

$$
p(b^{(c)}_{ik} \mid -) \propto \prod_{r=1}^{R} N\Big( x^{(r)}_i;\; D^{(r)}\big(s^{(c)}_i \circ b^{(c)}_i + s^{(r)}_i \circ b^{(r)}_i\big),\; \gamma^{(r)\,-1}_{\varepsilon} I_{N_r} \Big)\, \mathrm{Bernoulli}\big( b^{(c)}_{ik}; \pi_k \big)
\tag{35}
$$

The posterior probability that $b^{(c)}_{ik} = 1$ is proportional to

$$
p_1 = \pi_k \prod_{r=1}^{R} \exp\Big( -\frac{\gamma^{(r)}_{\varepsilon}}{2} \big( s^{(c)\,2}_{ik} d^{(r)\,T}_k d^{(r)}_k - 2 s^{(c)}_{ik} d^{(r)\,T}_k x^{-k,(r)}_i \big) \Big)
\tag{36}
$$

The posterior probability that $b^{(c)}_{ik} = 0$ is proportional to

$$
p_0 = 1 - \pi_k
\tag{37}
$$

Hence, $b^{(c)}_{ik}$ may be drawn from a Bernoulli distribution

$$
b^{(c)}_{ik} \sim \mathrm{Bernoulli}\Big( \frac{p_1}{p_1 + p_0} \Big)
\tag{38}
$$

Similarly, $b^{(r)}_{ik}$ may be drawn from a Bernoulli distribution

$$
b^{(r)}_{ik} \sim \mathrm{Bernoulli}\Big( \frac{p'_1}{p'_1 + p'_0} \Big)
\tag{39}
$$

where

$$
p'_1 = \pi_k \exp\Big( -\frac{\gamma^{(r)}_{\varepsilon}}{2} \big( s^{(r)\,2}_{ik} d^{(r)\,T}_k d^{(r)}_k - 2 s^{(r)}_{ik} d^{(r)\,T}_k x^{-k,(r)}_i \big) \Big)
\tag{40}
$$

$$
p'_0 = 1 - \pi_k
\tag{41}
$$
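The Bernoulli update (36)-(38) can be sketched as follows. This is our own illustration (one dataset, $R = 1$, made-up values), computed in log space for numerical stability; the function name is an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
gam, pi_k = 4.0, 0.5
d_k = rng.normal(size=20)            # loading for factor k
s_ik = 1.5                           # factor score s_ik
# residual x_i^{-k}: here generated so that factor k genuinely explains it
x_res = s_ik * d_k + rng.normal(scale=gam ** -0.5, size=20)

def p_b_equals_1(x_res, s_ik, d_k):
    """P(b_ik = 1 | -) from Eqs. (36)-(38), evaluated stably in log space."""
    log_p1 = np.log(pi_k) - 0.5 * gam * (s_ik ** 2 * (d_k @ d_k)
                                         - 2.0 * s_ik * (d_k @ x_res))
    log_p0 = np.log(1.0 - pi_k)
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))

assert p_b_equals_1(x_res, s_ik, d_k) > 0.9           # residual supports the factor
assert p_b_equals_1(rng.normal(size=20), 0.0, d_k) == 0.5   # s_ik = 0: likelihood is flat
```

With `s_ik = 0` the likelihood carries no information about `b_ik`, so the posterior probability falls back to the prior `pi_k`.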
Sample $s^{(c)}_{(k)}$, $s^{(r)}_{(k)}$

In this and the notation below, $x^{(r)}_{(i)}$ represents the $i$th row of the matrix $X^{(r)}$, written as a column vector.

$$
p(s^{(c)}_{(k)} \mid -) \propto \prod_{r=1}^{R} \prod_{i=1}^{N_r} N\Big( x^{(r)}_{(i)};\; \big( S^{(c)T} \circ B^{(c)T} + S^{(r)T} \circ B^{(r)T} \big) d^{(r)}_{(i)},\; \gamma^{(r)\,-1}_{\varepsilon} I_M \Big)\, N\big( s^{(c)}_{(k)}; 0, \Sigma'_k \big)
\tag{42}
$$

Note, $s^{(c)}_{(k)}$ represents the $k$th row of the matrix $S^{(c)}$, written as a column vector.

It can be shown that $s^{(c)}_{(k)}$ is drawn from a normal distribution

$$
p(s^{(c)}_{(k)} \mid -) \sim N\big( \mu_{s^{(c)}_{(k)}}, \Sigma_{s^{(c)}_{(k)}} \big)
\tag{43}
$$

where

$$
\Sigma_{s^{(c)}_{(k)}} = \Big( \Sigma'^{-1}_k + \sum_{r=1}^{R} \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon} \big( d^{(r)}_{ki} \big)^2 \big( b^{(c)}_{(k)} b^{(c)T}_{(k)} \big) \circ I_M \Big)^{-1}
\tag{44}
$$

$$
\mu_{s^{(c)}_{(k)}} = \Sigma_{s^{(c)}_{(k)}} \sum_{r=1}^{R} \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon}\, d^{(r)}_{ki} \big( b^{(c)}_{(k)} \circ x^{-k,(r)}_{(i)} \big)
\tag{45}
$$

where

$$
x^{-k,(r)}_{(i)} = x^{(r)}_{(i)} - \big( S^{(c)T} \circ B^{(c)T} \big) d^{(r)}_{(i)} + \big( s^{(c)}_{(k)} \circ b^{(c)}_{(k)} \big) d^{(r)}_{ki} - \big( S^{(r)T} \circ B^{(r)T} \big) d^{(r)}_{(i)}
\tag{46}
$$
Similarly, it can be shown that $s^{(r)}_{(k)}$ is drawn from a normal distribution

$$
p(s^{(r)}_{(k)} \mid -) \sim N\big( \mu_{s^{(r)}_{(k)}}, \Sigma_{s^{(r)}_{(k)}} \big)
\tag{47}
$$

where

$$
\Sigma_{s^{(r)}_{(k)}} = \Big( \Sigma'^{-1}_k + \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon} \big( d^{(r)}_{ki} \big)^2 \big( b^{(r)}_{(k)} b^{(r)T}_{(k)} \big) \circ I_M \Big)^{-1}
\tag{48}
$$

$$
\mu_{s^{(r)}_{(k)}} = \Sigma_{s^{(r)}_{(k)}} \sum_{i=1}^{N_r} \gamma^{(r)}_{\varepsilon}\, d^{(r)}_{ki} \big( b^{(r)}_{(k)} \circ x^{-k,(r)}_{(i)} \big)
\tag{49}
$$

where

$$
x^{-k,(r)}_{(i)} = x^{(r)}_{(i)} - \big( S^{(r)T} \circ B^{(r)T} \big) d^{(r)}_{(i)} + \big( s^{(r)}_{(k)} \circ b^{(r)}_{(k)} \big) d^{(r)}_{ki} - \big( S^{(c)T} \circ B^{(c)T} \big) d^{(r)}_{(i)}
\tag{50}
$$
Sample $z_k$ (cluster labels for the $k$th factor loading)

$$
p(z_k(i) = l \mid -) \propto N\big( d^{(r)}_{ki}; \mu, \sigma^2 \big)\, w_{ikl}(r_i)
\tag{51}
$$

where

$$
w_{ikl}(r_i) = V_{kl}K(r_i; r^*_{kl}, \phi_{kl}) \prod_{j=1}^{l-1} \big[ 1 - V_{kj}K(r_i; r^*_{kj}, \phi_{kj}) \big]
\tag{52}
$$

and (Rasmussen and Williams, 2005),

$$
\mu = \Sigma_{k,l}(r_i, r_{\backslash i})^T\, \Sigma_{k,l}(r_{\backslash i}, r_{\backslash i})^{-1}\, d^{(r)}_{k,l \backslash i}
\tag{53}
$$

$$
\sigma^2 = \Sigma_{k,l}(r_i, r_i) - \Sigma_{k,l}(r_i, r_{\backslash i})^T\, \Sigma_{k,l}(r_{\backslash i}, r_{\backslash i})^{-1}\, \Sigma_{k,l}(r_i, r_{\backslash i})
\tag{54}
$$

Here the elements of $d^{(r)}_k$ which belong to the $l$th cluster are denoted by $d^{(r)}_{k,l}$. The notation $d^{(r)}_{k,l \backslash i}$ denotes all elements of $d^{(r)}_{k,l}$ except the $i$th, and $r_{\backslash i}$ denotes the spatial locations corresponding to the elements in $d^{(r)}_{k,l \backslash i}$.
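The leave-one-out GP predictive in (53)-(54) is a standard Gaussian conditional. Below is a minimal sketch of our own on a toy one-dimensional grid; the squared-exponential covariance, grid, and test function are all illustrative assumptions.

```python
import numpy as np

def gp_loo_predictive(Sigma, f, i):
    """Leave-one-out GP predictive mean/variance for element i, Eqs. (53)-(54)."""
    mask = np.arange(len(f)) != i
    K_oo = Sigma[np.ix_(mask, mask)]      # Sigma_{k,l}(r_{\i}, r_{\i})
    k_io = Sigma[i, mask]                 # Sigma_{k,l}(r_i, r_{\i})
    sol = np.linalg.solve(K_oo, k_io)
    mu = sol @ f[mask]                    # Eq. (53)
    s2 = Sigma[i, i] - k_io @ sol         # Eq. (54)
    return mu, s2

# toy squared-exponential covariance on a 1-D grid of 9 locations
r = np.linspace(0.0, 1.0, 9)
Sigma = np.exp(-10.0 * (r[:, None] - r[None, :]) ** 2) + 1e-6 * np.eye(9)
f = np.sin(2 * np.pi * r)                 # stand-in for the loading elements d_{k,l}
mu, s2 = gp_loo_predictive(Sigma, f, 4)

assert s2 > 0 and s2 < Sigma[4, 4]        # conditioning shrinks the variance
assert abs(mu - f[4]) < 0.5               # smooth signal: LOO prediction is close
```

In the sampler, this likelihood `N(d_ki; mu, s2)` would be multiplied by the KSBP weight `w_ikl` of (52) and the products normalized over `l` before drawing the label.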
Sample $V_k$

A data augmentation approach to update $\{V_{kl}\}_{l=1}^{J}$ has been proposed in Dunson and Park (2008). It requires the introduction of two auxiliary variables, $A_{ikl} \sim \mathrm{Bernoulli}(V_{kl})$ and $B_{ikl} \sim \mathrm{Bernoulli}\big( K(r_i; r^*_{kl}, \phi_{kl}) \big)$, independently for each $l$, with $z_k(i) = \min\{ l : A_{ikl} = B_{ikl} = 1 \}$. We can then alternate between sampling $(A_{ikl}, B_{ikl})$ from their conditional distribution given $z_k(i)$ and updating $V_{kl}$ by sampling from the conditional posterior distribution

$$
V_{kl} \sim \mathrm{Beta}\Big( 1 + \sum_{i: z_k(i) \geq l} A_{ikl},\;\; \gamma_k + \sum_{i: z_k(i) \geq l} (1 - A_{ikl}) \Big)
\tag{55}
$$
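The augmentation can be sketched as follows. This is our own reading of the scheme, not the authors' code: for $l = z_k(i)$ the constraint forces $A_{ikl} = B_{ikl} = 1$, while for $l < z_k(i)$ the pair is drawn conditioned on *not* both being one, giving $P(A_{ikl} = 1 \mid \cdot) = V_{kl}(1 - \kappa_{il}) / (1 - V_{kl}\kappa_{il})$ with $\kappa_{il} = K(r_i; r^*_{kl}, \phi_{kl})$; the toy labels and kernel values are made up.

```python
import numpy as np

rng = np.random.default_rng(5)

def update_Vl(l, z, kappa_l, V_l, gamma_k):
    """Resample stick variable V_kl via the Dunson-Park augmentation, Eq. (55).

    z: labels z_k(i); kappa_l: kernel values K(r_i; r*_kl, phi_kl) per point.
    """
    idx = np.where(z >= l)[0]                 # only points that reached stick l matter
    A = np.empty(len(idx))
    for j, i in enumerate(idx):
        if z[i] == l:
            A[j] = 1.0                        # z_k(i) = l forces A_il = B_il = 1
        else:                                 # z_k(i) > l: condition on NOT(A = B = 1)
            p = V_l * (1.0 - kappa_l[i]) / (1.0 - V_l * kappa_l[i])
            A[j] = float(rng.random() < p)
    return rng.beta(1.0 + A.sum(), gamma_k + (1.0 - A).sum())

z = np.array([1, 1, 2, 3, 2, 1])              # toy cluster labels
kappa = rng.uniform(0.2, 0.9, size=6)         # toy kernel evaluations
V_new = update_Vl(1, z, kappa, V_l=0.5, gamma_k=1.0)
assert 0.0 < V_new < 1.0
```

Points with $z_k(i) < l$ never evaluated stick $l$ and so contribute nothing to its update, which is why the sums in (55) run only over $i$ with $z_k(i) \geq l$.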
Sample $\phi_k$

A discrete prior is placed on the KSBP kernel width parameters $\{\phi_{kj}\}_{j=1}^{J}$,

$$
p(\phi_{kj}) = \sum_h p_h\, \delta_{\phi_h}
\tag{56}
$$

where the $\phi_h$ are potential kernel widths. The posterior takes the form

$$
p(\phi_{kj} \mid -) = \sum_h p^{\mathrm{new}}_h\, \delta_{\phi_h}
\tag{57}
$$

where

$$
p^{\mathrm{new}}_h \propto p_h \prod_{i} \big[ V_{kj}K(r_i; r^*_{kj}, \phi_h) \big]^{I(z_k(i) = j)} \big( 1 - V_{kj}K(r_i; r^*_{kj}, \phi_h) \big)^{I(z_k(i) > j)}
\tag{58}
$$
Sample $r^*_k$

A discrete prior is placed on the KSBP basis locations $\{r^*_{kj}\}_{j=1}^{J}$,

$$
p(r^*_{kj}) = \sum_h e_h\, \delta_{r^*_h}
\tag{59}
$$

where the $r^*_h$ constitute a grid of potential locations. The posterior takes the form

$$
p(r^*_{kj} \mid -) = \sum_h e^{\mathrm{new}}_h\, \delta_{r^*_h}
\tag{60}
$$

where

$$
e^{\mathrm{new}}_h \propto e_h \prod_{i} \big[ V_{kj}K(r_i; r^*_h, \phi_{kj}) \big]^{I(z_k(i) = j)} \big( 1 - V_{kj}K(r_i; r^*_h, \phi_{kj}) \big)^{I(z_k(i) > j)}
\tag{61}
$$
Sample $\pi_k$

$$
p(\pi_k \mid -) \propto \mathrm{Beta}\big( \pi_k; c\alpha, c(1-\alpha) \big) \prod_{i=1}^{M} \mathrm{Bernoulli}\big( b_{ki}; \pi_k \big)
\tag{62}
$$

It can be shown that $\pi_k$ may be drawn from a Beta distribution as

$$
\pi_k \sim \mathrm{Beta}\Big( c\alpha + \sum_{i=1}^{M} b_{ki},\;\; c(1-\alpha) + M - \sum_{i=1}^{M} b_{ki} \Big)
\tag{63}
$$

Sample $\gamma_k$

$$
p(\gamma_k \mid -) \propto \prod_{l=1}^{J} \mathrm{Beta}\big( V_{kl}; 1, \gamma_k \big)\, \mathrm{Gamma}\big( \gamma_k; a_2, b_2 \big)
\tag{64}
$$

It can be shown that $\gamma_k$ may be drawn from a Gamma distribution as

$$
\gamma_k \sim \mathrm{Gamma}\Big( a_2 + J - 1,\;\; b_2 - \sum_{l=1}^{J-1} \ln(1 - V_{kl}) \Big)
\tag{65}
$$
Sample $\gamma^{(r)}_{\varepsilon}$

$$
p(\gamma^{(r)}_{\varepsilon} \mid -) \propto \prod_{i=1}^{M} N\Big( x^{(r)}_i;\; D^{(r)}\big(s^{(c)}_i \circ b^{(c)}_i + s^{(r)}_i \circ b^{(r)}_i\big),\; \gamma^{(r)\,-1}_{\varepsilon} I_{N_r} \Big)\, \mathrm{Gamma}\big( \gamma^{(r)}_{\varepsilon}; a_0, b_0 \big)
\tag{66}
$$

It can be shown that $\gamma^{(r)}_{\varepsilon}$ may be drawn from a Gamma distribution as

$$
\gamma^{(r)}_{\varepsilon} \sim \mathrm{Gamma}\Big( a_0 + \frac{1}{2}MN_r,\;\; b_0 + \frac{1}{2} \sum_{i=1}^{M} \big\| x^{(r)}_i - D^{(r)}\big(s^{(c)}_i \circ b^{(c)}_i + s^{(r)}_i \circ b^{(r)}_i\big) \big\|_2^2 \Big)
\tag{67}
$$
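The scalar hyperparameter updates (63), (65) and (67) are all standard conjugate draws. The sketch below is our own illustration with made-up hyperparameter values; it mirrors the stated formulas (excluding the final truncated stick in the $\gamma_k$ update), using NumPy's shape/scale Gamma parameterization.

```python
import numpy as np

rng = np.random.default_rng(6)

# Eq. (63): pi_k | b ~ Beta (conjugate Beta-Bernoulli update)
c_, alpha, M = 10.0, 0.3, 50
b = (rng.random(M) < 0.7).astype(float)          # current binary indicators b_ki
pi_k = rng.beta(c_ * alpha + b.sum(), c_ * (1 - alpha) + M - b.sum())

# Eq. (65): gamma_k | V ~ Gamma; numpy uses scale = 1 / rate
a2, b2, J = 1.0, 1.0, 20
V = rng.beta(1.0, 2.0, size=J)                   # stick variables V_kl
gamma_k = rng.gamma(a2 + J - 1, 1.0 / (b2 - np.log(1.0 - V[:-1]).sum()))

# Eq. (67): noise precision gamma_eps | residuals ~ Gamma
a0, b0, Nr = 1e-6, 1e-6, 30
E = rng.normal(scale=0.5, size=(Nr, M))          # residuals x_i - D(s_i o b_i)
gamma_eps = rng.gamma(a0 + 0.5 * M * Nr, 1.0 / (b0 + 0.5 * np.sum(E ** 2)))

assert pi_k > 0 and gamma_k > 0
assert abs(gamma_eps - 4.0) < 1.0                # residual scale 0.5 -> precision near 1/0.25 = 4
```

With many residuals, the drawn precision concentrates near the inverse of the true noise variance, as the last assertion checks.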
Update of GP parameters $\Omega_{kl}$

For the GP parameters, the full conditional posteriors are difficult to obtain in closed form. We therefore obtain point estimates for these parameters via maximum likelihood. Let $\Omega_{kl} = \{\tau^{(s)}_{kl}, \beta^{(s)}_{kl}, \sigma^{(s)}_{kl}\}$ represent the parameter set for the $l$th (spatial) GP corresponding to the $k$th factor loading. The MLE $\hat{\Omega}_{kl}$ is obtained as (Rasmussen and Williams, 2005)

$$
\hat{\Omega}_{kl} = \arg\max_{\Omega_{kl}} \Big\{ -\frac{1}{2}\ln\det(\Sigma_{k,l}) - \frac{1}{2} d^{(r)\,T}_{k,l} \Sigma_{k,l}^{-1} d^{(r)}_{k,l} - \frac{S_l}{2}\ln(2\pi) \Big\}
\tag{68}
$$

where $\det(\Sigma_{k,l})$ denotes the determinant of the matrix $\Sigma_{k,l}$ and $S_l$ denotes the number of elements in $d^{(r)}_{k,l}$. The parameters for the temporal GPs may be updated in a similar manner.
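Maximizing the log marginal likelihood in (68) can be sketched with a generic optimizer. This is our own illustration, not the authors' implementation: it assumes the covariance form $\Sigma_{k,l} = \tau\exp(-\beta\|r_n - r_m\|^2) + \sigma I$ from the text, optimizes in log space to enforce positivity, and uses SciPy's derivative-free Nelder-Mead method with arbitrary synthetic data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
r = rng.uniform(0, 1, size=(40, 2))           # spatial locations of one cluster

def cov(params, r):
    """Sigma_{k,l}(n, m) = tau * exp(-beta ||r_n - r_m||^2) + sigma * I."""
    tau, beta, sig = np.exp(params)           # log-parameterization keeps all three positive
    D2 = np.sum((r[:, None] - r[None]) ** 2, axis=-1)
    return tau * np.exp(-beta * D2) + sig * np.eye(len(r))

true = np.log([1.5, 4.0, 0.05])
d = rng.multivariate_normal(np.zeros(len(r)), cov(true, r))   # synthetic loading d_{k,l}

def neg_log_lik(params):
    """Negative of the objective in Eq. (68)."""
    S = cov(params, r)
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (logdet + d @ np.linalg.solve(S, d) + len(d) * np.log(2 * np.pi))

x0 = np.log([1.0, 1.0, 0.1])
res = minimize(neg_log_lik, x0=x0, method="Nelder-Mead")
assert res.fun <= neg_log_lik(x0) + 1e-9      # optimization improved (or matched) the start
```

Gradient-based optimizers with analytic derivatives of the log marginal likelihood would scale better, but the derivative-free sketch keeps the correspondence with (68) explicit.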