Mining Massive Amounts of Genomic Data: A
Semiparametric Topic Modeling Approach
Ethan X. Fang∗ Min-Dian Li† Michael I. Jordan‡ Han Liu§
January 1, 2015
Abstract
Characterizing the functional relevance of transcription factors (TFs) in different biological
contexts is pivotal in systems biology. Given the massive amount of genomic data, computa-
tional identification of TFs is often necessary to generate new hypotheses for experimentalists.
In this paper, we use large gene expression and chromatin immunoprecipitation (ChIP) data
corpuses to conduct high-throughput TF-biological context association analysis. This work
makes two contributions: (i) From a methodological perspective, we propose a unified topic
modeling framework for exploring and analyzing large and complex genomic datasets. Under
this framework, we develop new statistical optimization algorithms and semiparametric theo-
retical analysis which are also applicable to a variety of large-scale data analyses. (ii) From a
scientific perspective, our method provides an informative list of new discoveries in biology. Our
data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures
of epigenetic regulators, such as SUZ12 and SET-DB1, and nuclear receptors, in many tumor
types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types,
suggesting the important role of SUZ12-mediated histone methylation in tumor biology.
1 Introduction
A fundamental goal of systems biology and functional genomics is to understand global regulation
of gene expression. Transcription factors (TFs), cofactors, and epigenetic proteins represent major
regulators of gene expression, the disturbance of which contributes to the pathogenesis of a plethora
of human diseases, including cancer (Darnell, 2002; Arrowsmith et al., 2012). A major approach
∗Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; e-mail: [email protected]
†Department of Cellular and Molecular Physiology, Section of Comparative Medicine and Program in Integrative Cell Signaling and Neurobiology of Metabolism, Yale University School of Medicine, New Haven, CT 06520, USA; e-mail: [email protected]
‡Department of EECS and Statistics, University of California, Berkeley, CA 94720, USA; e-mail: [email protected]
§Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; e-mail: [email protected]
to understand such regulators is to integrate their genomic binding patterns with gene expression
profiles in different biological contexts, including physiologic or pathologic conditions from different
organisms. Currently, chromatin immunoprecipitation (ChIP) followed by microarray or DNA
sequencing (ChIP-chip or ChIP-seq, collectively referred to as ChIPx) is widely used to examine the function
of TFs, cofactors and epigenetic proteins in different biological contexts. Since the launch of
the ENCODE and modENCODE consortia, ChIPx data from more than 140 TFs and histone
modifications in more than 100 cell types from four organisms including human have been made
available (Landt et al., 2012). Meanwhile, more than 36,728 series of gene expression profiles in
different organisms from frogs to human have been deposited to the Gene Expression Omnibus
(GEO) public database (Edgar et al., 2002; Zhu et al., 2008; Barrett et al., 2013).
Although a variety of computational methods have been proposed to integrate ChIPx and gene
expression data and to predict associations between TFs and biological contexts (Boulesteix and
Strimmer, 2005; Faith et al., 2007; Zhu et al., 2008; Wu et al., 2013a), there are several critical
challenges associated with such methods. First, current methods are often not able to cope with
the high dimensionality associated with gene-expression datasets, where the dimension of the data
is generally much larger than the sample size. Second, current methods often assume idealized
distributions, such as the Gaussian, that are generally a poor match to the complex distributions
that arise in these datasets, particularly in the tails. Third, current methods
do not address issues of heterogeneity that arise when the overall dataset is formed from different
sources.
In this paper, we exploit the transelliptical distribution recently studied by Han and Liu (2012)
to propose a semiparametric transelliptical topic (TROPIC) modeling framework to address the
challenges of integrating genomic binding patterns with gene expression profiles across biological
contexts. A semiparametric model contains both finite- and infinite-dimensional parameters, and
is used in our framework to model gene expression data. The infinite-dimensional component of
the model provides flexibility, and the finite-dimensional component comprises the parameters of
major scientific interest. Such a topic modeling framework, combined with a hierarchical mixture
architecture, enables us to effectively extract common information from large aggregated datasets
that exhibit high heterogeneity. In addition, we develop new statistical optimization algorithms to
estimate the parameters in these topic models efficiently. We will show that under mild assumptions,
our estimation procedure converges at a near-optimal rate.
We show how to use our method to perform high-throughput TF-biological context analysis
of ChIPx and gene expression profiles by matching target genes of a TF with feature genes of a
biological context. We first show that the TROPIC method reveals the TF signature of c-MYC
in a conserved cohort of tumors with ChIPx data from different sources. We next show that the
TROPIC method is more reliable than ChIP-PED (Wu et al., 2013a), the first of such analytic
methods, while still providing a simple platform. To illustrate the effectiveness of TROPIC, we further
apply our framework to ChIPx data involving 38 TFs and epigenetic proteins, and gene expression
profiles of 68 tumor types. Figure 1 illustrates the workflow. We find that the TF signature
of the epigenetic regulator SUZ12 is prevalent in a broad range of tumors. Classic tumor-related
TFs, such as NF-κB, c-FOS, c-JUN, ESR1 and PAX5, are also prevalent in tumors. Interestingly,
several nuclear receptors, e.g., HNF4A and RXRA, exhibit significant relevance in a wide spectrum
of tumor types.
[Figure 1 graphic. Recoverable labels: 3000+ ChIPx samples for multiple TFs; 60,000+ gene expression samples from multiple biological contexts (breast cancer, Ewing sarcoma, kidney cancer); gene topics and top target genes (TTG); confirmed connections, potential connection, unexplored area.]
Figure 1: (A) Our method integrates datasets arising from gene expression and ChIPx. (B) We assess whether top
target genes (red) have significant overlap with topic genes (purple). (C) We systematically explore the associations
between biological contexts and transcription factors. The current state of the art is that only a small proportion
(red) of the joint ChIPx and expression data in human has been investigated; we analyze the unexplored area (grey)
in order to guide biologists in the design of new experiments.
2 Data and Methodology
We exploit a gene expression dataset (McCall et al., 2011) consisting of n = 13,182 samples from
M = 2,631 biological contexts generated from Affymetrix Human 133A (GPL96) arrays. The data
were downloaded from GEO, then preprocessed and normalized using frozen-RMA (McCall et al., 2010) to
reduce batch effects. For each probeset, we standardize its expression values to have zero mean and
unit standard deviation across all array samples. The data contain 20,248 probes, corresponding
to d = 12,704 genes.
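The per-probeset standardization described above can be sketched as follows. This is a minimal illustration on a toy matrix, assuming samples in rows and probesets in columns; the function name `standardize_columns` and the handling of constant columns are our own assumptions and are not taken from the frozen-RMA pipeline.

```python
import numpy as np

def standardize_columns(X):
    """Center each column to zero mean and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns (an assumption; the paper does not discuss them).
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

rng = np.random.default_rng(0)
# Toy samples-by-probesets matrix standing in for the expression data.
expr = rng.lognormal(mean=2.0, sigma=0.5, size=(50, 4))
Z = standardize_columns(expr)
```

After this step every column of `Z` has mean zero and standard deviation one across the toy samples.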
2.1 Data Modeling
The gene expression data $X \in \mathbb{R}^{n \times d}$ is highly heterogeneous since it is collected from multiple
biological contexts and labs. Such heterogeneity invalidates the classical Gaussian model and
motivates us to adopt a more flexible model based on the transelliptical distribution (Han and Liu,
2012).
A random vector $X = (X_1, \ldots, X_d)^T \in \mathbb{R}^d$ follows a transelliptical distribution, denoted
$X \sim TE(\mu, \Sigma; Z, f_1, \ldots, f_d)$, if there exist monotone univariate functions $f_1, \ldots, f_d : \mathbb{R} \to \mathbb{R}$ such that the
transformed data $f(X) = (f_1(X_1), \ldots, f_d(X_d))^T$ follows an elliptical distribution with mean $\mu$ and
covariance matrix $\Sigma$. More details regarding this distribution are provided in Appendix A.
To model the heterogeneity of the gene expression data $X$, we assume the expression data
from the $m$-th biological context are generated from a transelliptical random vector $X_m$. This
results in a transelliptical mixture model, i.e., each gene expression sample is generated from
$X \sim \sum_{m=1}^{M} \pi_m X_m \in \mathbb{R}^d$, where $M$ is the total number of biological contexts and $\sum_{m=1}^{M} \pi_m = 1$.
The transelliptical mixture model has a natural hierarchical interpretation (Liu et al., 2012).
Specifically, for each biological context $m$, we assume that there exists a latent Gaussian random
vector $Y_m \sim N_d(\mu_m, \Sigma_m)$. As shown in Figure 2, the Gaussian random vector can be converted
into an elliptical random vector $Z_m \sim EC_d(g, \mu_m, \Sigma_m)$ via a global stochastic scaling factor $\xi_m$.
Compared to the Gaussian distribution, elliptical distributions are powerful at modeling heavy-tail
distributions with possibly nontrivial tail dependency. However, elliptical distributions are still re-
strictive since they must be symmetric. The elliptical random vector can be further converted into
a possibly asymmetric transelliptical random vector through marginal monotone transformations.
The transelliptical model is semiparametric since it contains both finite-dimensional parameters
(the mean and covariance matrix) and infinite-dimensional parameters (the stochastic scaling vari-
able and marginal transformations). Such a semiparametric architecture naturally addresses the
heterogeneity issue in modeling the expression data. For the purposes of statistical inference, we
treat the stochastic scaling factor $\xi_m$ and the marginal transformations as nuisance parameters and
directly infer the latent means $\mu_m$ and covariance matrices $\Sigma_m$. We define $Y \sim \sum_{m=1}^{M} \pi_m Y_m$
to be the latent Gaussian mixture random vector associated with $X$.
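The three-layer construction for a single biological context can be illustrated with a short simulation. This is a sketch under stated assumptions: the chi-square scaling (which yields a multivariate t vector with 3 degrees of freedom) and the marginal maps `exp` and cubing are illustrative choices of ours, not the distributions used in the paper. The example also checks that Kendall's tau is unchanged by the marginal transformations, which is why rank statistics suit this model.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau for samples without ties (O(n^2) pairwise form)."""
    x = np.asarray(x)
    y = np.asarray(y)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    return (dx * dy).sum() / (n * (n - 1))

rng = np.random.default_rng(1)
n = 400
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])

# Step 1: latent Gaussian vector Y ~ N(mu, Sigma).
Y = rng.multivariate_normal(mu, Sigma, size=n)
# Step 2: one global random scale per sample gives a heavy-tailed elliptical vector.
# The chi-square choice is an assumption made for illustration.
xi = np.sqrt(3.0 / rng.chisquare(3, size=n))
Z = mu + xi[:, None] * (Y - mu)
# Step 3: strictly increasing marginal maps give an asymmetric transelliptical vector.
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3])

tau_Z = kendall_tau(Z[:, 0], Z[:, 1])
tau_X = kendall_tau(X[:, 0], X[:, 1])
```

Because both marginal maps are strictly increasing, every pairwise ordering is preserved, so `tau_X` and `tau_Z` agree.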
2.2 Transelliptical Topic Model
We assume the gene expression data $X \in \mathbb{R}^{n \times d}$ can be summarized by a small number of “topic”
vectors $v_1, v_2, \ldots, v_T \in \mathbb{R}^d$ with $T \ll n$. This general approach has been used in many applications,
including text mining (Blei et al., 2003; Mimno, 2012), social media analysis (Purushotham et al.,
2012), image processing (Wang et al., 2009) and others (Bakalov et al., 2012; Yao et al., 2009; Shalit
et al., 2013). In particular, motivated by the approach to topic modeling based on the singular
value decomposition (Deerwester et al., 1990), we define the topics of the transelliptical mixture
random vector X to be the leading eigenvectors of the latent mean-adjusted covariance matrix
$S = \Sigma + \mu\mu^T$, where $\Sigma$ and $\mu$ are the covariance matrix and mean of the latent Gaussian mixture
random vector $Y$, i.e.,
$$S = \mathrm{Cov}(Y) + \mathbb{E}(Y)\mathbb{E}(Y^T) = \sum_{m=1}^{M} \pi_m S_m, \qquad (1)$$
where each $S_m = \Sigma_m + \mu_m\mu_m^T$.
[Figure 2 graphic. Recoverable labels: mixture distribution Multi($\pi_1, \ldots, \pi_M$) over biological contexts (breast cancer, Ewing sarcoma, kidney cancer); latent $Y_m \sim N(\mu_m, \Sigma_m)$ for each context; layers Gaussian (light tail), elliptical (heavy tail), and transelliptical (asymmetric); observed expression data for each context.]
Figure 2: The hierarchical structure of a transelliptical mixture distribution. Each biological context
$m$ has an underlying normal distribution $Y_m \sim N(\mu_m, \Sigma_m)$. Each $Y_m$ is transformed to an elliptical
random vector and then to a transelliptical random vector. The observed data are generated from
the transelliptical random vector.
The first term $\mathrm{Cov}(Y)$ captures population-level variability, and the second term $\mathbb{E}(Y)\mathbb{E}(Y^T)$
captures location information. Recall that for a positive semidefinite matrix $S \in \mathbb{R}^{d \times d}$, we can
write $S = \sum_{i=1}^{d} \lambda_i v_i v_i^T$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ are the eigenvalues of $S$ and $v_i$ are the
corresponding eigenvectors, such that the best rank-$k$ approximation of $S$ is $\sum_{i=1}^{k} \lambda_i v_i v_i^T$ for all
$1 \le k \le d$ (Trefethen and Bau III, 1997). Thus, the leading topics provide a latent representation that
summarizes important aspects of the first- and second-order statistical structure of the distribution
of $X$. We additionally assume that the topics $v_1, v_2, \ldots, v_T \in \mathbb{R}^d$ are $s$-sparse; i.e., we assume at
most $s$ of the $d$ elements of each $v_t$ are non-zero, where $s \ll d$. Such sparsity assumptions have
been widely adopted in the latent variable modeling literature as a tool for addressing the curse of
dimensionality; see, e.g., Carvalho et al. (2008) and Wang and Blei (2009). The nonzero components
of the topics represent features which are important in one or more $X_m$’s. To summarize, the
transelliptical topic model is defined as:
Definition 2.1 (Transelliptical topic model). The transelliptical topic model, denoted by $\mathcal{T}(S; M, s)$,
is the set of distributions $X \sim \sum_{m=1}^{M} \pi_m X_m$, where each $X_m \sim TE_d(\mu_m, \Sigma_m; Z, f^{(m)}_1, \ldots, f^{(m)}_d)$,
such that $S = \sum_{m=1}^{M} \pi_m(\Sigma_m + \mu_m\mu_m^T)$ and the first $T$ leading eigenvectors of $S$ are $s$-sparse.
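Equation (1) and the definition of topics as leading eigenvectors of $S$ can be checked numerically. The sketch below uses a small synthetic mixture; the dimensions, weights, and parameters are arbitrary illustrative values. It verifies that the two sides of (1) coincide, using the law of total covariance for the mixture, and forms the best rank-$k$ approximation from the leading eigenpairs.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 6, 3
pi = np.array([0.5, 0.3, 0.2])          # mixing weights, summing to 1
mus = rng.normal(size=(M, d))           # context means mu_m
Sigmas = []
for _ in range(M):
    B = rng.normal(size=(d, d))
    Sigmas.append(B @ B.T / d)          # a valid covariance matrix Sigma_m
Sigmas = np.array(Sigmas)

# Right-hand side of equation (1): sum_m pi_m (Sigma_m + mu_m mu_m^T).
S = sum(p * (Sig + np.outer(mu, mu)) for p, Sig, mu in zip(pi, Sigmas, mus))

# Left-hand side: Cov(Y) + E(Y) E(Y)^T for the mixture Y.
mean_Y = pi @ mus
cov_Y = sum(p * (Sig + np.outer(mu - mean_Y, mu - mean_Y))
            for p, Sig, mu in zip(pi, Sigmas, mus))
S_alt = cov_Y + np.outer(mean_Y, mean_Y)

# Topics are the leading eigenvectors of S; eigh returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
S_k = (eigvecs[:, :k] * eigvals[:k]) @ eigvecs[:, :k].T   # rank-k approximation
```

The spectral-norm error of the rank-$k$ approximation equals the $(k+1)$-th eigenvalue, which is what makes the leading eigenvectors a natural summary.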
Since transelliptical distributions can be heavy-tailed or asymmetric, we exploit a combination
of rank correlation (Han and Liu, 2012) and an M-estimator proposed by Catoni (2012) to estimate
the mean-adjusted covariance matrix S. For parameter estimation, we adopt the truncated power
(TPower) method (Yuan and Zhang, 2013) initialized by a semidefinite program that is known as
the Fantope Projection and Selection (FPS) method (Vu et al., 2013). More details regarding these
estimators can be found in Appendix B.
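The two main ingredients of this pipeline, a rank-based correlation estimate and a truncated power iteration, can be sketched as follows. This is a simplified illustration and not the estimator of Appendix B: it uses the map $\sin(\pi\tau/2)$ from Kendall's tau to a latent correlation, omits the Catoni M-estimator, and replaces the FPS initialization with an all-ones starting vector. The function names and the synthetic data are our own assumptions.

```python
import numpy as np

def kendall_correlation(X):
    """Latent correlation estimate sin(pi/2 * tau) from pairwise Kendall's tau."""
    n, d = X.shape
    R = np.eye(d)
    for j in range(d):
        dj = np.sign(X[:, j][:, None] - X[:, j][None, :])
        for k in range(j + 1, d):
            dk = np.sign(X[:, k][:, None] - X[:, k][None, :])
            tau = (dj * dk).sum() / (n * (n - 1))
            R[j, k] = R[k, j] = np.sin(np.pi * tau / 2.0)
    return R

def truncated_power(A, s, v0, n_iter=200):
    """Truncated power iteration: multiply, keep the s largest entries, renormalize."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(n_iter):
        w = A @ v
        keep = np.argsort(np.abs(w))[-s:]   # indices of the s largest magnitudes
        t = np.zeros_like(w)
        t[keep] = w[keep]
        v = t / np.linalg.norm(t)
    return v

rng = np.random.default_rng(4)
d, n, s = 6, 300, 3
# Latent correlation with one strongly correlated block of three variables.
Sigma = np.eye(d)
Sigma[:3, :3] = 0.8
np.fill_diagonal(Sigma, 1.0)
latent = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
X = np.exp(latent)                          # monotone marginal distortion
R = kendall_correlation(X)
v = truncated_power(R, s, v0=np.ones(d))    # all-ones start is an assumption
```

In this toy setting the sparse estimate `v` is supported on the correlated block even though the observed margins are log-normal rather than Gaussian.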
We now present a theorem which shows that our proposed method achieves the minimax optimal
rate of convergence, $O_P(\sqrt{(s \log d)/n})$, for estimating the sparse topic vectors.
Theorem 2.2. Let $X \sim \mathcal{T}(S; M, s)$. We assume the first $T$ eigenvalues of $S$, $\lambda_1, \ldots, \lambda_T$, have a
smallest spectral gap such that $\lambda_t - \lambda_{t+1} \ge C_d$ for all $t = 1, \ldots, T-1$ and $C_d > 0$. Denote the
estimated topics by $\hat{v}_1, \ldots, \hat{v}_T$. Under the “sign sub-Gaussian condition” (Han and Liu, 2013), with
a suitable choice of tuning parameters, with probability at least $1 - O(d^{-1})$, we have
$$\|\hat{v}_t - v_t\|_2 \le C \cdot \sqrt{\frac{s \log d}{n}}, \qquad (2)$$
for some constant $C$.
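To give a feel for the bound in (2), the quantity $\sqrt{(s \log d)/n}$ can be evaluated at the dimensions of the dataset in Section 2. The sparsity levels below are hypothetical values chosen for illustration; the paper does not report $s$, and the constant $C$ is left out.

```python
import math

def rate(s, d, n):
    """The quantity sqrt(s * log(d) / n) that bounds the error in (2), up to C."""
    return math.sqrt(s * math.log(d) / n)

d, n = 12704, 13182            # gene and sample counts reported in Section 2
for s in (10, 100, 1000):      # hypothetical sparsity levels, not from the paper
    print(s, round(rate(s, d, n), 4))
```

The bound grows with the square root of $s$ and shrinks with the square root of $n$, so quadrupling the sample size halves it.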
Note that when $X$ follows a Gaussian or elliptical mixture distribution, the topics are the leading
eigenvectors of $\mathbb{E}(XX^T)$. To connect our topic model with existing work, suppose we have $T$ topics
$[v_1, \ldots, v_T] = W \in \mathbb{R}^{d \times T}$, where each column $v_t \in \mathbb{R}^d$ is a topic. We assume that the observed data
matrix $X \in \mathbb{R}^{n \times d}$ is generated through some random combination of the topics $v_1, \ldots, v_T$; i.e., we
assume that $X^T = WA$, where the random matrix $A \in \mathbb{R}^{T \times n}$ is generated
from some unknown distribution. In Deerwester et al. (1990), a singular value decomposition of
the observed data matrix, $X^T = UDV^T$, is conducted, such that if the columns of $U$ are viewed
as the topics, then $A = DV^T$ can be viewed as a random combination matrix. It is easily seen
that if $d$ is fixed and $n \to \infty$, the columns of $U$ converge to the leading eigenvectors of $\mathbb{E}(XX^T)$
asymptotically. Thus, our definition of topics can be viewed as a generalization of that of Deerwester
et al. (1990).
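The link between the singular value decomposition and the eigenvectors of the second-moment matrix can be illustrated on synthetic data. The sketch below assumes two orthonormal topics and Gaussian combination weights with different scales; these choices are ours and serve only to make the leading directions well separated.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, T = 500, 8, 2
# d x T matrix with orthonormal topic columns.
W = np.linalg.qr(rng.normal(size=(d, T)))[0]
# Random combination weights; the two rows have different scales.
A = rng.normal(size=(T, n)) * np.array([[3.0], [1.5]])
X = (W @ A).T                                   # observed n x d data, so X^T = W A

# Thin SVD of X^T = U D V^T; the columns of U play the role of topics.
U, D, Vt = np.linalg.svd(X.T, full_matrices=False)
A_hat = np.diag(D) @ Vt                         # combination matrix D V^T

# The same directions are the leading eigenvectors of the sample second moment.
second_moment = X.T @ X / n
evals, evecs = np.linalg.eigh(second_moment)    # ascending eigenvalues
lead = evecs[:, ::-1][:, :T]                    # top-T eigenvectors
```

Up to sign, the first $T$ columns of `U` coincide with `lead`, and the squared singular values divided by $n$ match the leading eigenvalues.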
Our topic modeling framework, based on a transelliptical mixture distribution, is nongenerative,
in distinction to the bulk of the literature on topic modeling, which focuses on generative models
(Blei et al., 2003; Mimno, 2012). Our topics are defined in the latent space and the transformations
to the observed data are treated as nuisance parameters; however, the topics in the latent space
can be viewed as informative summaries of the distribution of the random vector X.
2.3 TROPIC for TF-Biological Context Analysis
We now introduce the TROPIC method for conducting TF-biological context analysis. Given a
transcription factor and a biological context, we first identify the biological context’s feature genes
using the estimated topics from the gene expression data. Next, we exploit the ChIPx data to
identify the top target genes of the TF. We then test if the feature genes of the biological context
and the top target genes of the TF have significant overlap. If so, we conclude that the feature
genes and target genes significantly match, and the TF is deemed functionally significant in the
biological context.
In more detail, let $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T$ be the estimated topics from the gene expression profiles $X$. We
let $\hat{v}^{(m)}$ denote the leading eigenvector of the estimated latent mean-adjusted covariance matrix
of $X_m$, which can also be viewed as the leading “topic” of the $m$-th biological context. We can
view $\hat{v}^{(m)}$ as encoding summary information for the $m$-th biological context. However, the sample
size of the $m$-th biological context is possibly very small, which makes the estimate $\hat{v}^{(m)}$
unstable. To resolve this problem, we regress $\hat{v}^{(m)}$ on the population topics $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T$
to identify a subset $\mathcal{S}_m = \{\hat{v}_{m_1}, \ldots, \hat{v}_{m_K}\}$ which explains the greatest fraction of the variability.
We then construct a binary feature vector $v^{(g)}_m$, where $v^{(g)}_m(i) = 1$ if there exists some $k$ such that
$\hat{v}_{m_k}(i) \ne 0$, and $v^{(g)}_m(i) = 0$ otherwise, where $v(i)$ denotes the $i$-th component of $v$.
We further construct a binary target gene vector $u^{(m)}_j$ corresponding to the $j$-th TF. The
elements of $u^{(m)}_j$ corresponding to the top target genes of the $j$-th TF are set to 1, where we
first use CisGenome (Ji et al., 2008) to perform peak detection using the ChIPx data of the $j$-th
TF, and then use ChIPXpress (Wu and Ji, 2013) to identify the top target genes of the TF.
We then test if $v^{(g)}_m$ and $u^{(m)}_j$ have significant overlap. If so, we conclude that the feature genes
of the $m$-th biological context significantly match the target genes of the $j$-th TF, and infer that
the regulation of the $j$-th TF is functionally important in the $m$-th biological context. A more
detailed presentation of the protocol can be found in Appendix C.
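The construction of the binary feature vector and the overlap test can be sketched as follows. The exact test is deferred to Appendix C, so the one-sided hypergeometric test used here is an assumption, as are the function names and the toy vectors.

```python
from math import comb

def binary_feature_vector(selected_topics):
    """Entry i is 1 if any selected topic has a nonzero i-th component."""
    d = len(selected_topics[0])
    return [int(any(topic[i] != 0 for topic in selected_topics)) for i in range(d)]

def overlap_pvalue(feature, target):
    """One-sided hypergeometric P-value for the overlap of two binary gene vectors."""
    d = len(feature)
    K = sum(feature)                                    # number of feature genes
    n = sum(target)                                     # number of top target genes
    k = sum(f and t for f, t in zip(feature, target))   # genes in both sets
    # P(overlap >= k) when n genes are drawn at random from d, K of them features.
    tail = sum(comb(K, i) * comb(d - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(d, n)

# Two toy sparse topics over d = 8 genes (hypothetical values).
topics = [[0.7, 0.0, -0.7, 0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.6, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0]]
feature = binary_feature_vector(topics)     # genes 0 to 3 are feature genes
target = [1, 1, 1, 0, 0, 0, 0, 0]           # toy top target genes of one TF
k, p = overlap_pvalue(feature, target)
```

In this toy case all three target genes are feature genes, and the P-value is the probability of such a complete overlap arising by chance.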
3 Results and Discussions
We apply the TROPIC method to analyze the association between 38 TFs and a total of 68
tumor-related biological contexts, each of which has a sample size greater than
20. In this section, we discuss several important biological findings that arise from this analysis.
3.1 TROPIC Reliably Predicts TF Signature in a Conserved Cohort of Tumor
Types with ChIPx Data from Different Sources
To test the hypothesis that the adaptively selected target genes from the ChIPx data represent
the major targets of a TF, we use the TROPIC method to examine the association between the major
targets of MYC and 68 tumor contexts, using ChIPx data from 6 sources. The ChIPx
sources differ in the laboratory that prepared them and in cell type. As shown in Figure 3A, ChIPx data for
MYC predict a conserved cohort of tumor types (14/18), suggesting that our selection criteria faithfully
preserve the major targets of MYC regardless of the origins of the data. In particular, as shown in
Figure 3B, ChIPx data from three different cell types predict the MYC signature in 12 tumors shared
by all cell types. The cell types chosen are originally from umbilical vein endothelium (HUVEC),
lymphoblastoid tumor (GM12878), myelogenous leukemia (K562), and cervical malignant tumor
(HeLa), which have distinct cellular physiology. Experimental variance is another concern for
extrapolating TF function to a new biological context. We compare the outcomes of TROPIC
from two laboratories and find that K562 cell-derived and GM12878 cell-derived ChIPx data predict
the MYC signature in highly overlapping cohorts of tumors, 12/14 and 14/16, respectively, as shown
in Figure 3C. Together, the results indicate that our selection criteria to process ChIPx data can
reliably predict TF signatures in new biological contexts.
[Figure 3 graphic. Recoverable labels: axes “Biological Context” and “ChIP Sources”; ChIPx sources Yale: HeLa, UTA: HUVEC, UTA: GM12878, UTA: K562, Yale: GM12878, Yale: K562; Venn diagrams for UTA HUVEC, UTA GM12878, and UTA K562; for Yale HeLa, Yale GM12878, and Yale K562; for Yale K562 and UTA K562; and for UTA GM12878 and Yale GM12878.]
Figure 3: TROPIC predicts the TF signature in a conserved cohort of tumor types with ChIPx
data from different sources. (A) The diagram that shows significant biological contexts from 68
tumors for MYC. The horizontal panel shows significant biological contexts. The vertical panel
shows sources of ChIPx data for MYC. The red color indicates an adjusted P-value < 0.05. (B)
A Venn diagram of the number of significant biological contexts for ChIPx data from different cell
types. (C) A Venn diagram of the number of significant biological contexts for ChIPx data from
different laboratories.
3.2 TROPIC Predicts TF Signature in a Bigger Cohort of Tumor Types than
ChIP-PED
ChIP-PED is an alternative method to predict TF signatures in biological contexts where ChIPx
data are not available. To estimate the accuracy of our TROPIC method, we choose ChIPx data
for MYC and SET-DB1, which represent a TF and an epigenetic protein, and apply the TROPIC
method to predict associations of TFs with 68 tumors. Note that throughout the paper, we use
the FDR method (Benjamini and Hochberg, 1995) to adjust the P-values for multiple comparisons.
However, for a fair comparison in Figure 4, we adjust the P-values of the two methods using
Bonferroni’s method as ChIP-PED does. By applying Bonferroni’s adjusted P-value of 0.05 as the
threshold, the results show that the tumor types predicted by ChIP-PED have significant overlap
with those predicted by the TROPIC method, as shown in Figure 4. In particular, the TROPIC
method predicts MYC signature in seven tumors (Figure 4A, ChIPx source: UTA GM12878 without
MCF7) whereas ChIP-PED predicts MYC signature in a sub-cohort of four tumors. Two types
of lymphoma and K562 cell line are predicted by both methods, which is supported by previous
studies (Li et al., 2003; Slack and Gascoyne, 2011). Melanoma is another common tumor type
affected by MYC (Zhuang et al., 2008; Leonetti et al., 1996), which is predicted by our method.
Similarly, ChIP-PED predicts SET-DB1 signature in a sub-cohort of 6 tumors out of 11 predicted
by the TROPIC method, as shown in Figure 4B. Both the TROPIC and ChIP-PED methods predict
melanoma as a significant biological context, which is consistent with a recent study (Ceol et al.,
2011). The difference is likely due to an additional assumption made by the ChIP-PED method,
namely that the target genes and the TF both have significantly high or low expression.
Meanwhile, our TROPIC method sets no threshold value for the expression level of TFs and does
not match the expression level of target genes to the expression level of TFs. It is reasonable that
altered expression of TF contributes to changes in its target genes, especially given that tumor
cells are known to show increased activity of oncogenic TFs (Darnell, 2002). However, increased
activity of TFs is not necessarily associated with increased level of expression. It is known that
chromosomal translocations and point mutations in oncogenic TFs, cofactors, or epigenetic proteins
can contribute to increased activity of TFs. In addition, decreased activity of TFs, cofactors, or
epigenetic proteins can be counted as features of the biological context by the TROPIC method,
so long as the inactivation leads to a dramatic change in target genes. This extends the power of
TROPIC to predict the TF signature in a biological context that has inactivated TFs, as commonly
observed in chromosomal translocations and truncations. In summary, the TROPIC method can
predict the TF signature regardless of the expression level and the activation status of the protein,
and thus provides a bigger cohort of tumor types for a specific TF.
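The two multiple-testing adjustments mentioned in this section can be compared on a small example. The sketch below implements the Bonferroni adjustment and the Benjamini-Hochberg step-up adjustment directly; the raw P-values are toy numbers and are not results from the paper.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (step-up FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.02, 0.03, 0.30]      # toy P-values
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
n_bonf = sum(a < 0.05 for a in bonf)        # contexts significant under Bonferroni
n_bh = sum(a < 0.05 for a in bh)            # contexts significant under FDR control
```

The FDR-adjusted values never exceed the Bonferroni-adjusted ones, so FDR control declares at least as many contexts significant at the same threshold.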
3.3 TROPIC Predicts Novel Biological Contexts in Tumors
To test whether the TROPIC method is applicable to other regulators of gene expression, we
further apply the transelliptical topic modeling framework to context-specific analysis of ChIPx
data comprising 38 TFs, cofactors, and epigenetic proteins, and gene expression of 68 tumor types.
3.3.1 Epigenetic Regulators are Relevant to Many Tumor Types
Epigenetic control of gene expression is emerging as a crucial contributor to tumorigenesis and
metastasis (Suva et al., 2013). Histone methylation is an important and widespread form of epige-
netic mechanism. Emerging evidence indicates that deregulation of histone methylation contributes
to tumor formation (Martin and Zhang, 2005; Greer and Shi, 2012; Dawson and Kouzarides, 2012;
Chi et al., 2010). We include several epigenetic regulators in the TROPIC analysis and present the
results in Figure 5.
[Figure 4 graphic. Recoverable labels: panel A (MYC) and panel B (SET-DB1); axes “Biological Context” and “Method”; rows labeled TROPIC and ChIP-PED in one panel and TROPIC 1, ChIP-PED 1, ChIP-PED 2, TROPIC 2 in the other.]
Figure 4: Comparison between TROPIC and ChIP-PED. (A) Diagram that shows significant biological
contexts from 68 tumors for MYC computed from TROPIC and ChIP-PED. The red square
indicates an adjusted P-value < 0.05. (B) Diagram that shows significant biological contexts from
68 tumors for SET-DB1 computed from TROPIC and ChIP-PED, where the first two rows indicate
the results from one ChIPx dataset, and the last two rows show the results from another ChIPx
dataset. The red square indicates an adjusted P-value < 0.05.
SUZ12: Multiple subunits of polycomb repressive complex 2 (PRC2), which trimethylates histone
3 lysine 27, are either mutated or dysregulated in different tumors (Sparmann and van Lohuizen,
2006). SUZ12 is a core subunit of PRC2. Previous studies report altered expression levels
of PRC2/SUZ12 in a wide range of human primary tumors, such as T cell acute lymphoblastic
leukemia (T-ALL) (Ntziachristos et al., 2012), ovarian (Li et al., 2012, 2007), metastatic prostate
(Yu et al., 2007), lung (Martín-Pérez et al., 2010), melanoma (Martín-Pérez et al., 2010), brain and
glial tumors (Crea et al., 2010). To test whether the SUZ12 signature is present in tumors, we apply
the TROPIC method to analyze SUZ12 in human tumor samples. The results indicate that the SUZ12
signature is present in 48 out of 68 tumor samples (70.59%), including most of the reported tumor
types, as shown in Figure 5. Genetic manipulation of SUZ12 alters tumor proliferation
in the context of ovarian cancer and mantle cell lymphoma (Li et al., 2012; Martín-Pérez et al.,
2010). However, whether the function of SUZ12 in other tumor types is significant is largely unknown.
[Figure 5 graphic. Recoverable labels: axes “Biological Context” and “ChIP Proteins”; rows SUZ12, JUND, SETDB1, EP300, GABP, NFKB, FOS, JUN, IRF4, ESR1, RXRA, HNF4A, and PAX5, grouped as epigenetic proteins, Hippo pathway, nuclear receptors, and oncogenic TFs; columns for the tumor-related biological contexts.]
Figure 5: Results of TROPIC on 13 TFs on 68 tumor-related biological contexts. The red square
indicates an adjusted P-value < 0.05.
In addition, traditional screening via expression profiling, somatic mutation mapping, and
knockdown underestimates the functional relevance of TFs and transcriptional regulators. It has
been reported that portions of SUZ12 are commonly fused to the JAZF1 gene in normal and neoplastic
endometrial cells (Li et al., 2007, 2008). JAZF1-SUZ12 contributes to tumorigenesis independently
of the expression and sequence of the SUZ12 gene, but exhibits a TF signature that can be identified by
the TROPIC method (Figure 5, see ovarian tumor: endometroid). Despite a large body of evidence
supporting PRC2/SUZ12 as an oncoprotein, a recent study shows that PRC2/SUZ12 acts as a tumor
suppressor in T-ALL (Ntziachristos et al., 2012). Our results identify the SUZ12 signature in T-ALL,
suggesting that the TROPIC method captures the functional significance of TFs regardless of
whether they play a positive or negative role. Together, our data suggest that SUZ12 is an important
regulator of gene expression in a broad range of tumor types.
SET-DB1: SET-DB1 is another epigenetic regulator that methylates the histone 3 lysine 9 residue
into its mono-, di-, and tri-methylated forms (Greer and Shi, 2012). Originally discovered in fruit flies,
mammalian SET-DB1 is involved in the maintenance of embryonic stem cells by repressing the
expression of developmental regulators (Bilodeau et al., 2009). A recent study reports that SET-DB1
is amplified in melanoma and accelerates tumor onset (Ceol et al., 2011). The same
study also finds that the copy number of SET-DB1 is increased in breast, liver, lung, and ovarian
tumors. An increased copy number of a gene does not necessarily lead to increased activity,
but a significant representation of its TF signature would support a tumor-relevant role for that gene.
To verify whether the SET-DB1 signature is present in tumor samples, we include SET-DB1 in the
TROPIC analysis. The results show that the SET-DB1 signature is present in 20 sources of tumors,
including melanoma, breast, and lung tumors, as shown in Figure 5. In particular, melanoma and
Wilms tumors are predicted by both TROPIC and ChIP-PED to be significant biological contexts
for SET-DB1 (Figure 4B, first two rows). Whether SET-DB1 is involved in tumorigenesis in Wilms
tumor awaits further study. In addition to confirming the previously reported tumor types, the data
suggest that SET-DB1 is an important regulator in several types of blood and solid tumors.
3.3.2 Ets Family Protein GABP is Significantly Associated with Leukemia
GABP: Hippo tumor suppressor signaling is a conserved molecular pathway for the control of
organ size and has been implicated in cancer (Harvey et al., 2013; Halder and Johnson, 2011). The Hippo
pathway gauges organ size by restricting both cell growth and cell proliferation, as well as
inducing cell death. Dysregulation of Hippo signaling is observed in a broad range of human
cancers; however, somatic or germline mutations in the Hippo pathway are uncommon (Harvey and
Tapon, 2007). Recently, GA-binding protein (GABP), a member of the ETS transcription factor family,
has been found to drive the expression of YAP (Wu et al., 2013b), the effector TF of the Hippo pathway.
Loss of GABP down-regulates the level of YAP, resulting in a block at the G1/S phase of the cell cycle
and increased cell death, which establishes GABP as an important regulator of the Hippo pathway. We
test whether the GABP signature is associated with tumors by the TROPIC method. The results show
that the GABP signature is present in 19 sources of tumors as shown in Figure 5, including melanoma,
breast, lung, and prostate tumors, with an enrichment of lymphoblastic leukemia (8/19, 42.11%). It
has been reported that the Hippo-YAP pathway is deregulated in solid tumors (breast,
lung, colorectal, and liver) (Halder and Johnson, 2011). Our analysis suggests that GABP is a
contributing pathogenic factor in breast and lung tumors through the Hippo-YAP pathway.
3.3.3 Classic Oncogenic TFs are Implicated in Many Tumor Types
NF-κB and AP-1: Historically, transcription factors, such as NF-κB (RELA) and AP-1 (FOS,
JUN, JUND, etc), are among the first cohort of oncogenes. These TFs are master regulators in cell
proliferation, differentiation, survival, stress response, and inflammation, most of which represent
hallmarks of tumor cells (Li and Yang, 2011; Piette et al., 1997; Shaulian and Karin, 2002; Hanahan
and Weinberg, 2011). A large body of studies has implicated the critical role of NF-κB (RELA)
and AP-1 in lymphoma and leukemia (Eferl and Wagner, 2003; Rayet and Gelinas, 1999). We
test whether our method can reveal lymphoma and leukemia as the significant biological contexts
for those proteins. Importantly, chromosomal amplification, over-expression and rearrangement of
these genes contribute to tumorigenesis, which is likely to be filtered out by existing methods. We
apply the TROPIC method to NF-κB (RELA), FOS, JUN, and JUND. The results show that many
biological contexts of blood tumors are significant for these TFs whereas IRF4 is not significant in
most of the tumors except in myeloma and T-cell acute lymphoblastic leukemia (T-ALL) (Figure
5) as reported previously (Yoshida et al., 1999). These results demonstrate the high credibility of
TROPIC in predicting biological contexts.
ESR1: Estrogen receptor 1 (ESR1 or estrogen receptor alpha) is a classic steroid nuclear
receptor that is activated by the hormone estrogen, which regulates behavior
and physiology. ESR1-deficient mice are sterile with incomplete development of sex organs (Ogawa
et al., 1998; Dupont et al., 2000). ESR1 also acts in other tissues, such as bone and adipose
tissue (Heine et al., 2000; Nakamura et al., 2007). It has been reported that estrogen promotes
apoptosis of osteoblasts by ESR1 and induction of FAS death ligand (Nakamura et al., 2007).
ESR1-deficient mice are obese with increased number and size of adipose tissue (Heine et al.,
2000). ESR1 is involved in the pathogenesis of breast cancer and endometrial cancer. Expression
of ESR1 is widely used as a prognostic marker for breast cancer (Knight et al., 1977; Gruvberger
et al., 2001). The wide spectrum of physiological functions of ESR1 suggests that its pathogenic role extends
beyond reproductive tissue-derived cancers. To test this hypothesis, we run the TROPIC
analysis with ChIPx data for ESR1 and find that ESR1 is significantly associated with 30 out of 68
tumor-related biological contexts (Figure 5). As expected, breast cancer and ovarian cancer exhibit
an ESR1 signature. Surprisingly, many types of B-cell (acute or chronic) lymphoblastic leukemia are
significantly associated with ESR1. It is known that estrogen promotes proliferation and survival
of B cells (Grimaldi et al., 2002; Thurmond et al., 2000). ESR1 pathway may contribute to the
pathogenesis of B-cell lymphoma and leukemia via increasing cell proliferation and survival.
PAX5: Paired box protein 5 (PAX5) is a transcription factor in B cell development and has been
implicated in several types of lymphoma (Shaffer et al., 2002). PAX5 activates a transcriptional
program of various B-cell-specific genes, which is required for directing bone-marrow progenitor
cells to differentiate into B cells (Morrison et al., 1998b). Urbanek et al. (1994) reported that loss
of PAX5 in mice leads to a complete arrest of B cell development at an early precursor stage. PAX5
is also important in the late stage of B cell differentiation. De-regulation of PAX5 is commonly observed
in several types of lymphoma in the form of chromosomal translocation. A t(9;14)(p13;q32)
chromosomal translocation brings the potent Eμ enhancer of the IgH gene (a gene expressed in
mature B cells) into close proximity of the PAX5 promoter and results in increased expression of
PAX5 in late B-cell differentiation (Busslinger et al., 1996; Iida et al., 1996; Morrison et al., 1998a).
As expected, the significant biological contexts of PAX5 include B-cell lymphoma and leukemia
(i.e. B-ALL and B-CLL) (Figure 5). PAX5 is not only a master regulator of B cell biology, but also
an important pattern organizer in the development of central nervous system and genital tracts
(Urbanek et al., 1997; Bouchard et al., 2000). However, whether PAX5 is implicated in other types
of cancers is not known. Our TROPIC analysis shows that PAX5 is associated significantly with
28 out of 68 tumor biological contexts (Figure 5), including solid tumors from the brain (brain and
glial cells) and reproductive organs (ovary and bladder). These observations indicate that the
tumorigenic role of PAX5 extends beyond B-cell lymphoma.
3.3.4 Nuclear Receptor RXRA and HNF4A are Broadly Implicated in Tumors
Nuclear receptors represent a superfamily of ligand-activated transcription factors that modulate
cell growth, differentiation, survival and metabolism (Mangelsdorf et al., 1995; Evans, 1988). The
ligands for nuclear receptors include hormones and metabolites, ranging from retinoic acids (RAs),
vitamin D, and steroid hormones to lipid species. Retinoid X receptor A (RXRA) recognizes 9-cis
retinoic acid (9-cis RA) and heterodimerizes with other nuclear receptors to modulate cellular
function. RAs are widely explored as therapeutics for both blood and solid tumors (Altucci et al.,
2007). The presence of an RXRA signature in a tumor would help assess the plausibility of
RA-based therapy in that specific tumor. We examine the significant tumor contexts for RXRA
and find that more than 65% of tumor contexts (47/68) are relevant to RXRA (Figure 5). This
highlights the important role of RXRA biology in those tumors. Similarly, another nuclear receptor,
hepatocyte nuclear factor 4 A (HNF4A), is significantly associated with a broad spectrum of tumors
(33/68, 48.53%), including melanoma, leukemia, breast, cervical, lung, and ovarian tumors (Figure
5). HNF4A has long been regarded as a critical regulator of metabolism that contributes to the
pathogenesis of type I diabetes. It is well known that there is a strong link between diabetes and
cancer (Gullo et al., 1994; Vigneri et al., 2009). Our results suggest that HNF4A may be a genetic
link between diabetes and cancer.
3.4 The Estimate of the True Positive Rate
To evaluate the quality of our results, we randomly select 100 of the TF-biological context pairs
that our method identifies as functionally important. Next, we search the existing literature to
determine whether the connection in each TF-biological context pair has been experimentally
verified. In total, we find that 48/100 pairs have been explicitly verified by biologists.
Furthermore, 78/100 pairs have been mentioned in
the literature. Thus, very conservatively, the true positive rate of our results is 48-78%. This
provides strong evidence that our method can guide biologists to conduct experiments
more efficiently.
4 Discussion
We present a semiparametric topic modeling framework to conduct high-throughput TF-biological
contexts analysis. Our approach addresses several key challenges in Big Data analysis, including
high dimensionality, distributional complexity, and data heterogeneity. Theoretically, our method
guarantees a nearly optimal rate of convergence across a wide family of possibly heavy-tailed distributions.
Practically, our method is computationally simple and robust to very noisy data. Considering
the limited supply of ChIPx data and the massively expanding pool of gene expression profiles,
TROPIC has the potential to assist in the construction of the global regulatory networks of large
numbers of genes.
One drawback of our method compared with ChIP-PED is that our method does not reveal
the detailed regulation pattern of a TF in different biological contexts (e.g., whether this TF acts as
an activator or repressor). A natural way to address this issue is to further divide the feature genes
and target genes into different groups according to the signs of their correlations, and consider the
topics in different groups of genes.
Acknowledgement
Han Liu is supported by NSF Grants III-1116730 and NSF III-1332109, NIH R01MH102339, NIH
R01GM083084, NIH R01HG06841, and FDA HHSF223201000072C. The authors are also grateful
for the hospitality of the Simons Institute of Theory of Computation at UC Berkeley. Min-Dian Li
is supported by a scholarship from the CSC-Yale World Scholars Program and the Glenn/AFAR
Scholarship for Research in the Biology of Aging.
Appendix
A Elliptical and Transelliptical Models
In this section, we briefly review the transelliptical distribution (Han and Liu, 2012) and discuss
its relationship with the other distribution families.
We start with some notation. For a vector $u = (u_1, \dots, u_d)^T \in \mathbb{R}^d$, the $\ell_0$, $\ell_p$ and $\ell_\infty$ vector
norms are defined as $\|u\|_0 := \mathrm{card}(\mathrm{supp}(u))$, $\|u\|_p := \big(\sum_{j=1}^d |u_j|^p\big)^{1/p}$ and $\|u\|_\infty := \max_{1 \le j \le d} |u_j|$.
For a matrix $A = [a_{jk}]_{d \times d}$, the $\ell_{\max}$-norm is defined as $\|A\|_{\max} := \max_{1 \le j,k \le d} |a_{jk}|$.
Let $S^{d-1} := \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$ be the $d$-dimensional unit sphere. For any two vectors $a, b \in \mathbb{R}^d$ and two
square matrices $A, B \in \mathbb{R}^{d \times d}$, we denote their inner products by $\langle a, b \rangle := a^T b$ and
$\langle A, B \rangle := \mathrm{Tr}(A^T B)$ respectively. Throughout the Appendix, we use a generic constant $C$ whose value may
vary from line to line.
The transelliptical model is a semiparametric distribution family in which the nonparametric
components provide modeling flexibility, while the parametric components encode the important
information we can estimate efficiently. Before describing the transelliptical distribution, we briefly
overview the elliptical model, which can be viewed as a subfamily of the transelliptical model.
Recall that a random vector X = (X1, X2, ..., Xd)T ∈ Rd is continuous if the marginal distribu-
tions of X1, ..., Xd are all continuous, and we say X possesses density if X is absolutely continuous
with respect to Lebesgue measure. The elliptical distribution is defined below.
Definition A.1. A random vector $X \in \mathbb{R}^d$ (assuming its density exists) follows an elliptical
distribution if its density is of the following form:
$$f(x) = c\,|\Sigma|^{-1/2}\, g\big((x - \mu)^T \Sigma^{-1} (x - \mu)\big), \tag{3}$$
where $\mu \in \mathbb{R}^d$; $\Sigma \in \mathbb{R}^{d \times d}$ is positive definite; $g : \mathbb{R}_+ \to \mathbb{R}_+$ is a univariate function on $[0, \infty)$; and $c$
is a normalization constant. We denote $X \sim EC_d(\mu, \Sigma, g)$.
Remark A.2. In general, we say a random vector $X \in \mathbb{R}^d$ follows an elliptical distribution if it can
be represented as $X \overset{d}{=} \mu + \xi A U$, where $\mu \in \mathbb{R}^d$; $A \in \mathbb{R}^{d \times p}$ with $p \le d$, $p = \mathrm{rank}(\Sigma)$ and $A A^T = \Sigma$;
$\xi \ge 0$ is a random variable independent of $U$; and $U \in S^{p-1}$ is uniformly distributed on the unit sphere
in $\mathbb{R}^p$. Under this representation, $X$ does not necessarily possess a density, as $\xi$ does not always possess a
density and $\Sigma$ is only assumed to be positive semidefinite, so it might not be of full rank. In this
paper, we restrict our discussion to elliptical distributions which possess densities.
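The stochastic representation in Remark A.2 can be illustrated with a short sampling sketch. This is not part of the original analysis: the names `sample_elliptical` and `chi_radius` are hypothetical, and the chi-distributed radius is an assumption chosen because it recovers the Gaussian member of the elliptical family.

```python
import math
import random

def sample_elliptical(mu, A, draw_xi, rng):
    """Draw one sample X = mu + xi * A U, following Remark A.2.

    mu: list of length d; A: d x p matrix given as a list of rows;
    draw_xi: callable returning a nonnegative radius (hypothetical helper).
    Returns the sample x and the direction u that was used.
    """
    p = len(A[0])
    # U uniform on the unit sphere S^{p-1}: normalize a standard Gaussian vector.
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    u = [v / norm for v in g]
    xi = draw_xi(rng)
    x = [m + xi * sum(a * uj for a, uj in zip(row, u)) for m, row in zip(mu, A)]
    return x, u

def chi_radius(p):
    """Radius following a chi distribution with p degrees of freedom (Gaussian case)."""
    return lambda rng: math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)))

rng = random.Random(0)
mu = [1.0, -1.0]
A = [[2.0, 0.0], [1.0, 1.0]]   # the scatter matrix is Sigma = A A^T
x, u = sample_elliptical(mu, A, chi_radius(2), rng)
```

Replacing `chi_radius` with a heavier-tailed radius distribution yields heavy-tailed elliptical samples with the same scatter matrix.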
Assume that a random vector X ∈ Rd possesses a density and a covariance matrix (i.e., the
second moments of X are finite). The next proposition (Anderson and Fang, 1990) characterizes
the relationship between the matrix Σ and the covariance matrix of X.
Proposition A.3. If a random vector $X \in \mathbb{R}^d$ follows an elliptical distribution possessing density
as defined in (3), then the matrix $\Sigma \in \mathbb{R}^{d \times d}$ in (3) is a scatter matrix of $X$, i.e., $\Sigma$ is proportional
to the covariance matrix of $X$.
The next proposition provides a condition for $(\mu, \Sigma, g)$ to be identifiable for $X$.
Proposition A.4. Let $X = (X_1, \dots, X_d)^T \in \mathbb{R}^d$ be a random vector. If $X \sim EC_d(\mu, \Sigma, g)$ is
continuous and possesses a density, then (i) $\Sigma_{jj} > 0$ for all $j \in \{1, \dots, d\}$; (ii) $(\mu, \Sigma, g)$ is identifiable
for $X$ under the constraint that $\Sigma_{jj} = \mathrm{Var}(X_j)$ for all $j \in \{1, \dots, d\}$.
In the sequel, we adopt the identifiability condition that $\mathrm{Var}(X_j) = \Sigma_{jj}$ for all $j \in \{1, \dots, d\}$. In
order to model more complex distributions, Han and Liu (2012) extend the elliptical family to the
more flexible transelliptical family.
Definition A.5. (Transelliptical Distribution). A continuous random vector $X = (X_1, X_2, \dots, X_d)^T$
follows a transelliptical distribution, denoted by $X \sim TE_d(\mu, \Sigma; Z, f_1, \dots, f_d)$, if there exist monotone
univariate functions $f_1, \dots, f_d$ such that
$$(f_1(X_1), \dots, f_d(X_d))^T \overset{d}{=} Z \sim EC_d(\mu, \Sigma, g). \tag{4}$$
We further assume that each $f_j(\cdot)$ preserves the marginal mean and variance of $X_j$, i.e.,
$E(X_j) = E(Z_j)$ and $\mathrm{Var}(X_j) = \mathrm{Var}(Z_j)$. This identifiability condition is motivated by the “normal reference
rule” (i.e., the model should reduce to a Gaussian model if the data are actually Gaussian).
We call the matrix $\Sigma$ the latent covariance matrix of $X$.
Note that the definition of transelliptical distribution is slightly different from the original
definition in Han and Liu (2012), as we impose a different identifiability condition. Namely, the
aim of Han and Liu (2012) is to conduct scale-invariant PCA on the latent correlation matrix of X.
Thus, the identifiability condition in Han and Liu (2012) is that µ = 0 and the diagonal components
of Σ are all 1’s. While this form of identifiability provides ease of estimation, it loses the marginal
location and scale information. Thus we assume that E(Xj) = E(Zj) and Var(Xj) = Var(Zj).
B Estimating Leading Topics
The leading topics of the transelliptical topic model can be estimated using a combination of sparse
semidefinite programming and algorithmic statistics.
Let $X \sim \sum_{m=1}^M \pi_m X_m$, where $X_m \sim TE_d(\mu_m, \Sigma_m; Z_m, f_1^{(m)}, \dots, f_d^{(m)})$. To conduct transelliptical
topic analysis, we first need to estimate each $\mu_m$ and $\Sigma_m$ in order to estimate the pooled
mean-adjusted covariance matrix. As the transelliptical family contains heavy-tailed and asymmetric
distributions, the classical sample mean and covariance matrix do not achieve the desired rate
of convergence, and new estimation procedures are needed.
B.1 Estimating the Latent Means
Let $X \sim TE_d(\mu_m, \Sigma_m; Z_m, f_1, \dots, f_d)$. We exploit an M-estimator proposed by Catoni (2012) to
estimate the mean of $X$. Let $\mu = (\mu_1, \dots, \mu_d)^T$. Given $n$ independent samples $x_1, \dots, x_n$ of $X$, where
each $x_i = (x_{i1}, \dots, x_{id})^T$, we estimate $\mu_j$ using the marginal data $x_{1j}, \dots, x_{nj}$.
The estimator is defined as follows. Suppose we want to estimate the mean of a random variable
$Z$. Let $z_1, \dots, z_n$ be $n$ independent realizations of $Z$ and $\psi : \mathbb{R} \to \mathbb{R}$ be a continuous and strictly
increasing function satisfying $-\log(1 - z + z^2/2) \le \psi(z) \le \log(1 + z + z^2/2)$.
The estimator for the mean of $Z$ is defined as the unique value $\hat\mu$ such that
$$\sum_{i=1}^n \psi\big(\alpha_\delta (z_i - \hat\mu)\big) = 0, \tag{5}$$
where $\delta$ and $\alpha_\delta$ are two parameters chosen adaptively from the data. For the choices of $\psi$, $\delta$ and
$\alpha_\delta$, see Catoni (2012) for more detailed discussions.
For $n$ samples $x_1, x_2, \dots, x_n$ independently drawn from a random vector $X \in \mathbb{R}^d$, let
$E(X) = (\mu_1, \dots, \mu_d)^T$. Choosing $\delta = d^{-2}/2$, we exploit the estimator in (5) to estimate the marginal means
$\mu_1, \dots, \mu_d$. Theoretically, Catoni (2012) shows that, with probability at least $1 - O(d^{-1})$, we have
$$\max_{1 \le j \le d} |\hat\mu_j - \mu_j| \le C \sqrt{\frac{\log d}{n}}. \tag{6}$$
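The estimating equation (5) can be solved numerically in a few lines. The sketch below is illustrative only and rests on two assumptions: it uses the widest influence function permitted by the bounds on $\psi$, and it takes the scale parameter `alpha` as a user-supplied number rather than choosing it adaptively as in Catoni (2012). The function names are hypothetical.

```python
import math

def catoni_psi(x):
    """Influence function attaining the upper bound for x >= 0 and the lower bound for x < 0."""
    if x >= 0:
        return math.log1p(x + 0.5 * x * x)
    return -math.log1p(-x + 0.5 * x * x)

def catoni_mean(z, alpha, tol=1e-10):
    """Solve sum_i psi(alpha * (z_i - mu)) = 0 for mu by bisection."""
    lo, hi = min(z), max(z)

    def score(mu):
        return sum(catoni_psi(alpha * (zi - mu)) for zi in z)

    # score is nonincreasing in mu, with score(lo) >= 0 >= score(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z = [0.9, 1.1, 1.0, 0.8, 1.2, 50.0]        # five typical values and one gross outlier
sample_mean = sum(z) / len(z)
robust_mean = catoni_mean(z, alpha=0.5)    # pulled far less toward the outlier
```

Because $\psi$ grows only logarithmically, a single extreme observation shifts the solution of (5) much less than it shifts the sample mean.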
B.2 Estimating Latent Covariance Matrices
In order to estimate the pooled covariance matrix of X, we need to estimate the latent covariance
matrix Σm of each Xm.
For a transelliptically distributed random vector X ∼ TEd(µ,Σ;Z, f1, ..., fd), it is easy to see
that the sample covariance matrix is not a consistent estimator of Σ due to the transformations
f1, ..., fd. It has been demonstrated in Han and Liu (2012) that we can efficiently estimate the latent
correlation matrix, i.e., the correlation matrix of the latent Gaussian random vector Y associated
with X. More specifically, we make use of the Kendall tau correlation matrix as defined below.
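Definition B.1. The sample Kendall tau correlation matrix $\hat{\mathbf{C}} = [\hat\rho_{jk}]_{d \times d}$ is defined as
$$\hat\rho_{jk} = \sin\Big(\frac{\pi}{2}\, \hat\tau_{jk}\Big), \tag{7}$$
for all $j, k \in \{1, \dots, d\}$, where
$$\hat\tau_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \mathrm{sign}\big((x_{ij} - x_{i'j})(x_{ik} - x_{i'k})\big)$$
if $j \ne k$, and $\hat\tau_{jk} = 1$ otherwise.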
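Equation (7) translates directly into code. The following brute-force sketch (quadratic in the number of samples, with hypothetical function names) is meant only to make the definition concrete; it is not the implementation used in the paper.

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def kendall_tau_correlation(X):
    """Return the d x d matrix with entries sin(pi/2 * tau_jk) from equation (7).

    X is a list of n samples, each a list of d coordinates.
    """
    n, d = len(X), len(X[0])
    R = [[1.0] * d for _ in range(d)]          # diagonal: tau_jj = 1, so sin(pi/2) = 1
    for j in range(d):
        for k in range(j + 1, d):
            s = 0
            for i in range(n):
                for i2 in range(i + 1, n):
                    s += sign((X[i][j] - X[i2][j]) * (X[i][k] - X[i2][k]))
            tau = 2.0 * s / (n * (n - 1))
            R[j][k] = R[k][j] = math.sin(0.5 * math.pi * tau)
    return R

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
R = kendall_tau_correlation(X)

# Strictly increasing marginal transformations leave the estimate unchanged.
Y = [[math.exp(a), b ** 3] for a, b in X]
R_transformed = kendall_tau_correlation(Y)
```

The invariance shown in the last lines is what makes the rank-based estimator suitable for transelliptical data, where the marginal transformations $f_1, \dots, f_d$ are unknown.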
The next proposition from Han and Liu (2012) shows that the Kendall tau correlation matrix
enjoys a parametric rate of convergence in high-dimensional setting with respect to the `max norm.
Proposition B.2. Suppose we are given $n$ independent samples $x_1, \dots, x_n$ of a random vector $X \in \mathbb{R}^d$ following
a transelliptical distribution $X \sim TE_d(\mu, \Sigma; Z, f_1, \dots, f_d)$. Let $\mathbf{C}$ be the latent correlation matrix,
i.e., the correlation matrix of the latent Gaussian random vector $Y$ associated with $X$, and let
$\hat{\mathbf{C}}$ be the Kendall tau correlation matrix introduced in (7). We have, with probability at least
$1 - O(d^{-1})$,
$$\|\hat{\mathbf{C}} - \mathbf{C}\|_{\max} \le C \sqrt{\frac{\log d}{n}}, \tag{8}$$
where $C$ is a generic constant which does not depend on $d$ and $n$.
In our application, we need to estimate the latent covariance matrix. A direct approach is to
use the relationship between the correlation matrix $\mathbf{C} = [\rho_{jk}]_{d \times d}$ and the covariance matrix $\Sigma$, namely
$\Sigma_{jk} = \rho_{jk} \sigma_j \sigma_k$, where each $\sigma_j$ is the marginal standard deviation of $X_j$.
Next, we construct an estimator for the marginal standard deviations based on (5). Given
$n$ samples $z_1, \dots, z_n$ independently drawn from a random variable $Z$, we first estimate the marginal
mean by the estimator defined in (5). Then, we use the same estimator to estimate the mean of
$Z^2$ from $z_1^2, \dots, z_n^2$. Denoting the estimated means of $Z$ and $Z^2$ by $\hat\mu$ and $\hat M$ respectively, we
construct an estimator of the standard deviation of $Z$ by
$$\hat\sigma := \sqrt{\max\{\hat M - \hat\mu^2, \epsilon\}}, \quad \text{where } \epsilon > 0 \text{ is a small positive number.} \tag{9}$$
Denote the estimated marginal standard deviations of $X_1, \dots, X_d$ by $\hat\sigma_1, \dots, \hat\sigma_d$. It is easy to see that,
with probability at least $1 - O(d^{-1})$,
$$|\hat\sigma_j - \sigma_j| \le C \sqrt{\log d / n} \quad \text{for all } j. \tag{10}$$
Combining the M-estimator for the standard deviations with the Kendall tau correlation matrix
defined in (7) gives us a covariance matrix estimator $\hat\Sigma = [\hat\Sigma_{jk}]_{d \times d}$, where
$$\hat\Sigma_{jk} = \hat\sigma_j \hat\sigma_k \hat\rho_{jk}, \quad \text{for all } 1 \le j, k \le d, \tag{11}$$
where $\hat\rho_{jk}$ is the Kendall tau correlation defined in (7). We will show in the next sections that this
estimator of the covariance matrix enjoys a parametric rate of convergence in the family of transelliptical
distributions and is robust in more complex settings.
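The two building blocks (9) and (11) are simple enough to sketch directly. In the example below the first and second moment estimates are passed in as plain numbers, which is an assumption made to keep the snippet self-contained; in the procedure described above they would come from the estimator in (5). The function names are hypothetical.

```python
import math

def robust_std(mean_z, mean_z2, eps=1e-8):
    """Equation (9): sqrt(max(M - mu^2, eps)) from estimated first and second moments."""
    return math.sqrt(max(mean_z2 - mean_z ** 2, eps))

def latent_covariance(sigmas, rho):
    """Equation (11): Sigma_jk = sigma_j * sigma_k * rho_jk, i.e. D C D with D = diag(sigmas)."""
    d = len(sigmas)
    return [[sigmas[j] * sigmas[k] * rho[j][k] for k in range(d)] for j in range(d)]

# Moment estimates for two coordinates (illustrative numbers only).
sigmas = [robust_std(1.0, 5.0), robust_std(0.0, 0.25)]
rho = [[1.0, 0.5], [0.5, 1.0]]      # e.g. a Kendall tau correlation matrix from (7)
Sigma_hat = latent_covariance(sigmas, rho)
```

The floor `eps` in `robust_std` guards against a negative plug-in variance, which can occur because the two moments are estimated separately.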
After estimating the latent mean-adjusted covariance matrix and mean of each $X_m$, we estimate
the pooled mean-adjusted covariance matrix $\mathbf{S}$. Suppose that for each $m = 1, \dots, M$, we have $s_m$
samples of $X_m$, and let $S = \sum_{m=1}^M s_m$ denote the total number of samples. The estimator for the
pooled latent mean-adjusted covariance matrix $\mathbf{S}$ is constructed as
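$$\hat{\mathbf{S}} = \sum_{m=1}^M \frac{s_m}{S} \big(\hat\Sigma_m + \hat\mu_m \hat\mu_m^T\big), \tag{12}$$
where $\hat\Sigma_m$ and $\hat\mu_m$ are estimated by (11) and (5) respectively, and $\hat\mu = \sum_{m=1}^M \frac{s_m}{S} \hat\mu_m$. It follows
immediately that, with probability at least $1 - O(d^{-1})$,
$$\|\hat{\mathbf{S}} - \mathbf{S}\|_{\max} \le C \sqrt{\log d / n}. \tag{13}$$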
B.3 Estimating Leading Topics
As we have discussed in Section 2.2, the topics of the random vector $X \sim \sum_{m=1}^M \pi_m X_m$, where each
$X_m \sim TE_d(\mu_m, \Sigma_m; Z, f_1^{(m)}, \dots, f_d^{(m)})$, are defined as the leading eigenvectors of the pooled latent
mean-adjusted covariance matrix $\mathbf{S}$ defined in (1). We further assume that the leading eigenvectors
$v_1, \dots, v_T$ are $s$-sparse, i.e., $\|v_t\|_0 \le s$ for each $t = 1, \dots, T$.
Given the estimators $\hat\Sigma_m$ of the latent covariance matrices defined in (11), we first analyze
the concentration of the spectral norm of $\hat\Sigma_m - \Sigma_m$, where $\Sigma_m$ is the covariance matrix of the latent
Gaussian mixture random vector $Y_m$ associated with $X_m$.
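Theorem B.3. Given $n$ i.i.d. samples $x_1, \dots, x_n$ of a random vector $X \in \mathbb{R}^d$, where $X$ follows a
transelliptical distribution, i.e., $X \sim TE_d(\mu, \Sigma; Z, f_1, \dots, f_d)$, let $\hat\sigma = (\hat\sigma_1, \dots, \hat\sigma_d)^T$ be the estimated
marginal standard deviations derived from Catoni’s estimator and $\hat{\mathbf{C}} = [\hat\rho_{jk}]_{d \times d}$ be the estimated
Kendall tau correlation matrix defined in (7), and let $\hat{\mathbf{D}} = \mathrm{diag}(\hat\sigma)$. Let $\hat\Sigma = \hat{\mathbf{D}} \hat{\mathbf{C}} \hat{\mathbf{D}}$. Under the “sign
sub-Gaussian condition” (Han and Liu, 2013), we have, with probability at least $1 - O(d^{-1})$,
$$\|\hat\Sigma - \Sigma\|_2 \le C \sqrt{\frac{d \log d}{n}}, \tag{14}$$
where $C$ is a constant.
Furthermore, let $\eta(\hat\Sigma - \Sigma, s) = \sup_{v \in S^{d-1} \cap B_0(s)} v^T (\hat\Sigma - \Sigma) v$. With probability at least $1 - O(d^{-1})$,
$$\eta(\hat\Sigma - \Sigma, s) \le C \sqrt{\frac{s \log d}{n}}, \tag{15}$$
where $C$ is a constant.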
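Proof. We have
$$\begin{aligned}
&\|\hat{\mathbf{D}}\hat{\mathbf{C}}\hat{\mathbf{D}} - \mathbf{D}\mathbf{C}\mathbf{D}\|_2 && (16) \\
&\quad\le \|\hat{\mathbf{D}}\hat{\mathbf{C}}(\hat{\mathbf{D}} - \mathbf{D}) + (\hat{\mathbf{D}}\hat{\mathbf{C}} - \mathbf{D}\mathbf{C})\mathbf{D}\|_2 \\
&\quad\le \|\hat{\mathbf{D}}\hat{\mathbf{C}} - \mathbf{D}\mathbf{C}\|_2 \|\mathbf{D}\|_2 + \|\hat{\mathbf{D}}\hat{\mathbf{C}}\|_2 \|\hat{\mathbf{D}} - \mathbf{D}\|_2 \\
&\quad\le \|\mathbf{D}\hat{\mathbf{C}} + (\hat{\mathbf{D}} - \mathbf{D})\hat{\mathbf{C}} - \mathbf{D}\mathbf{C}\|_2 \|\mathbf{D}\|_2 + \|\hat{\mathbf{D}}\hat{\mathbf{C}}\|_2 \|\hat{\mathbf{D}} - \mathbf{D}\|_2 \\
&\quad\le \|\mathbf{D}\|_2^2 \|\hat{\mathbf{C}} - \mathbf{C}\|_2 + \|\mathbf{D}\|_2 \|\hat{\mathbf{D}} - \mathbf{D}\|_2 \|\hat{\mathbf{C}}\|_2 + \|\hat{\mathbf{D}}\hat{\mathbf{C}}\|_2 \|\hat{\mathbf{D}} - \mathbf{D}\|_2 \\
&\quad\le \sigma_{\max}^2 \|\hat{\mathbf{C}} - \mathbf{C}\|_2 + \|\mathbf{D}\|_2 \|\hat{\mathbf{D}} - \mathbf{D}\|_2 \|\hat{\mathbf{C}}\|_2 + \|\hat{\mathbf{D}}\|_2 \|\hat{\mathbf{D}} - \mathbf{D}\|_2 \|\hat{\mathbf{C}}\|_2. && (17)
\end{aligned}$$
We then consider the three terms of (17) one by one.
For the first term, by Han and Liu (2013), with probability at least $1 - O(d^{-1})$,
$$\sigma_{\max}^2 \|\hat{\mathbf{C}} - \mathbf{C}\|_2 \le C \sqrt{d \log d / n}, \tag{18}$$
for some constant $C$.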
For the second term, we have, with probability at least 1−O(d−1),
‖D‖2‖D−D‖2‖C‖2 = ‖D−D‖22‖C‖2 + ‖D‖2‖D−D‖2‖C‖2 ≤ C√
log d/n, (19)
where C is a constant, and the inequality holds by (10).
For the last term, we have that, with probability at least 1−O(d−1),
‖D‖2‖D−D‖2‖C‖2 ≤ ‖D‖2‖D−D‖2‖C−C‖2 + ‖D‖2‖D−D‖‖C‖2≤ σmaxC1
√d log d/n+ C2
√log d/n
≤ C√d log d/n, (20)
where C1, C2, C are constants, and the first inequality has been proved by Han and Liu (2013)
under the assumption that maxj=1,...,d σj ≤ σmax and (10).
Combining (18), (19) and (20) together, (14) holds as desired.
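Next, we establish a concentration result for the sparse spectral norm $\eta(\hat\Sigma - \Sigma, s)$. For any
$v \in B_0(s) \cap S^{d-1}$, we have
$$\begin{aligned}
|v^T(\hat{\mathbf{D}}\hat{\mathbf{C}}\hat{\mathbf{D}} - \mathbf{D}\mathbf{C}\mathbf{D})v|
&= \big|v^T\big(\hat{\mathbf{D}}\hat{\mathbf{C}}(\hat{\mathbf{D}} - \mathbf{D}) + (\hat{\mathbf{D}}\hat{\mathbf{C}} - \mathbf{D}\mathbf{C})\mathbf{D}\big)v\big| \\
&\le |v^T(\hat{\mathbf{D}} - \mathbf{D})\hat{\mathbf{C}}\mathbf{D}v| + |v^T\mathbf{D}(\hat{\mathbf{C}} - \mathbf{C})\mathbf{D}v| + |v^T\hat{\mathbf{D}}\hat{\mathbf{C}}(\hat{\mathbf{D}} - \mathbf{D})v|. && (21)
\end{aligned}$$
Now, we bound the three terms in (21) one by one. For the first term, we have that for any
$v \in B_0(s) \cap S^{d-1}$, with probability at least $1 - O(d^{-1})$,
$$|v^T(\hat{\mathbf{D}} - \mathbf{D})\hat{\mathbf{C}}\mathbf{D}v| \le |v^T(\hat{\mathbf{D}} - \mathbf{D})\mathbf{C}\mathbf{D}v| + |v^T(\hat{\mathbf{D}} - \mathbf{D})(\hat{\mathbf{C}} - \mathbf{C})\mathbf{D}v| \le C \sqrt{\log d / n} \cdot \sqrt{s \log d / n}, \tag{22}$$
for some constant $C$, where the last inequality is by (10) and Han and Liu (2013).
For the second term, we have that for any $v \in B_0(s) \cap S^{d-1}$, with probability at least $1 - O(d^{-1})$,
$$|v^T\mathbf{D}(\hat{\mathbf{C}} - \mathbf{C})\mathbf{D}v| \le \sigma_{\max}^2 \max_{v \in B_0(s) \cap S^{d-1}} |v^T(\hat{\mathbf{C}} - \mathbf{C})v| \le C \sqrt{s \log d / n}, \tag{23}$$
for some constant $C$, where the last inequality is by Han and Liu (2013).
For the third term, we have that for any $v \in B_0(s) \cap S^{d-1}$, with probability at least $1 - O(d^{-1})$,
$$|v^T\hat{\mathbf{D}}\hat{\mathbf{C}}(\hat{\mathbf{D}} - \mathbf{D})v| \le |v^T\hat{\mathbf{D}}(\hat{\mathbf{C}} - \mathbf{C})(\hat{\mathbf{D}} - \mathbf{D})v| + |v^T\hat{\mathbf{D}}\mathbf{C}(\hat{\mathbf{D}} - \mathbf{D})v| \le C_1 \sqrt{\log d / n} \cdot \sqrt{s \log d / n} + C_2 \sqrt{\log d / n} \cdot \sqrt{s \log d / n}, \tag{24}$$
for some constants $C_1$ and $C_2$, where the last inequality follows from (10) and Han and Liu (2013).
Plugging (22), (23) and (24) into (21), (15) follows as desired.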
As a consequence of Theorem B.3 and (5), by the triangle inequality and induction, we have the
following result for the rate of convergence of the estimated pooled mean-adjusted latent covariance.
Corollary B.4. The estimated pooled mean-adjusted latent covariance matrix $\hat{\mathbf{S}}$ defined in (12)
satisfies that, with probability at least $1 - O(d^{-1})$,
$$\|\hat{\mathbf{S}} - \mathbf{S}\|_2 \le C_1 \sqrt{\frac{d \log d}{n}}, \quad \text{and} \quad \eta(\hat{\mathbf{S}} - \mathbf{S}, s) \le C_2 \sqrt{\frac{s \log d}{n}}, \tag{25}$$
for some constants $C_1$ and $C_2$.
Given the fast rate of convergence of S, we exploit the truncated power method (Yuan and
Zhang, 2013) and the Fantope Projection and Selection method (Vu et al., 2013) to estimate the
topics.
The truncated power (TPower) method is a modification of the power method to compute the
leading eigenvector. Note that to compute the $s$-sparse leading eigenvector of a matrix $\hat{\mathbf{S}}$, we are
solving the following optimization problem:
$$\max_{v} \; v^T \hat{\mathbf{S}} v, \quad \text{subject to } v \in S^{d-1} \text{ and } \|v\|_0 \le s. \tag{26}$$
The TPower method approximately solves (26) iteratively. At the $k$-th iteration, we have an intermediate
eigenvector $y_k$. We sort the absolute values of the elements of $y_k$, then truncate all the
elements of $y_k$ except for the elements with the largest $s$ absolute values. For $y = (y_1, \dots, y_d)^T \in \mathbb{R}^d$,
we denote the truncation with respect to the set $A$ by $y(A) = (y_1 \cdot I(1 \in A), \dots, y_d \cdot I(d \in A))^T$. In
more detail, using the power method, at each iteration, we project the intermediate eigenvector
onto the set $S^{d-1}$ and the $\ell_0$ ball with radius $s$ by letting $y_{k+1} = y_k(A_k)/\|y_k(A_k)\|_2$, where $A_k$ is
the set of indices of $y_k$ with the largest $s$ absolute values. It has been shown by Yuan and Zhang
(2013) that, with suitable initialization, the solution of the truncated power method converges to
the sparse leading eigenvector at a fast parametric rate.
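A compact sketch of the truncated power iteration for problem (26) is given below. It follows the general scheme of Yuan and Zhang (2013), in which each iteration applies a power step and then truncates and renormalizes; the function names, the stopping rule, and the toy matrix are assumptions made for illustration, and the input matrix is assumed to be positive semidefinite.

```python
import math

def truncate(y, s):
    """Keep the s entries of y with the largest absolute values, zero the rest, renormalize."""
    keep = sorted(range(len(y)), key=lambda i: abs(y[i]), reverse=True)[:s]
    out = [0.0] * len(y)
    for i in keep:
        out[i] = y[i]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out]

def tpower(S, v0, s, iters=200, tol=1e-12):
    """Approximately solve max v^T S v subject to ||v||_2 = 1 and ||v||_0 <= s."""
    d = len(v0)
    v = truncate(v0, s)
    for _ in range(iters):
        y = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]   # power step
        v_new = truncate(y, s)                                          # projection step
        if max(abs(a - b) for a, b in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v

# Toy matrix whose leading eigenvector (1, 1, 0, 0) / sqrt(2) is 2-sparse.
S = [[3.5, 2.5, 0.0, 0.0],
     [2.5, 3.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
v = tpower(S, v0=[1.0, 0.5, 0.4, 0.1], s=2)
```

The quality of the result depends on the starting vector, which is why the initialization step matters in practice.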
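Algorithm 1 Algorithm to estimate the first $T$ latent topics
Input: $n$ independent samples drawn from random variable $X$.
Output: $\hat v_1, \hat v_2, \dots, \hat v_T$, which are $k$-sparse.
$\hat{\mathbf{S}} \leftarrow \sum_{m=1}^M \frac{s_m}{S} \big(\hat\Sigma_m + \hat\mu_m \hat\mu_m^T\big)$
Let $\hat{\mathbf{S}}_1 = \hat{\mathbf{S}}$.
for $t = 1, \dots, T$ do
    $\hat{\mathbf{P}}_t \leftarrow \arg\max_{\mathbf{P} \in \mathcal{F}^1} \langle \hat{\mathbf{S}}, \mathbf{P} \rangle - \lambda \|\mathbf{P}\|_{1,1}$, where $\mathcal{F}^1 := \{\mathbf{P} : 0 \preceq \mathbf{P} \preceq I_d \text{ and } \mathrm{tr}(\mathbf{P}) = 1\}$.
    $\tilde v_t = \Lambda_1(\hat{\mathbf{P}}_t)$
    $\hat v_t \leftarrow \mathrm{TPower}(\hat{\mathbf{S}}_t, \tilde v_t, k)$.
    $\hat{\mathbf{S}}_{t+1} \leftarrow (I_d - \hat v_t \hat v_t^T)\, \hat{\mathbf{S}}_t\, (I_d - \hat v_t \hat v_t^T)$.
end for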
We initialize the truncated power algorithm using the Fantope Projection and Selection (FPS)
method (Vu et al., 2013). To estimate the subspace projection matrix of the $p$ leading eigenvectors,
FPS proposes a sparse principal subspace estimator $\hat{\mathbf{P}}$ that is defined to be the solution of the
semidefinite program
$$\max_{\mathbf{P}} \; \langle \hat{\mathbf{S}}, \mathbf{P} \rangle - \lambda \|\mathbf{P}\|_{1,1}, \quad \text{subject to } \mathbf{P} \in \mathcal{F}^p, \tag{27}$$
where $\mathcal{F}^p := \{\mathbf{P} : 0 \preceq \mathbf{P} \preceq I_d \text{ and } \mathrm{tr}(\mathbf{P}) = p\}$. Note that when $p = 1$, (27) coincides with the
formulation of sparse PCA by d’Aspremont et al. (2007). Under mild assumptions, it has been
shown in Vu et al. (2013) that, with properly chosen $\lambda$, the solution $\hat{\mathbf{P}}$ of (27) converges at a fast
parametric rate. In our particular application, we take $p = 1$. Note that it has been shown in Vu
et al. (2013) that when $p = 1$, the leading eigenvector of $\hat{\mathbf{P}}$ converges to the leading eigenvector of
the covariance matrix at a fast parametric rate.
More specifically, denote $\hat{\mathbf{S}}_1 = \hat{\mathbf{S}}$. At the $t$-th iteration, we first estimate the projection matrix of
the subspace spanned by the leading eigenvector of $\mathbf{S}$ by solving the semidefinite programming
problem (27) with $p = 1$. Denote this estimator by $\hat{\mathbf{P}}$ and its leading eigenvector by $\tilde v_t$. Next,
we use the truncated power method to refine $\tilde v_t$. The output of the truncated power method is our
estimator $\hat v_t$ for the $t$-th topic.
After estimating the $t$-th leading eigenvector, we deflate the matrix $\hat{\mathbf{S}}_t$ by the vector $\hat v_t$ to generate a
new matrix $\hat{\mathbf{S}}_{t+1}$:
$$\hat{\mathbf{S}}_{t+1} := (I_d - \hat v_t \hat v_t^T)\, \hat{\mathbf{S}}_t\, (I_d - \hat v_t \hat v_t^T).$$
The resulting matrix $\hat{\mathbf{S}}_{t+1}$ annihilates $\hat v_t$, i.e., $\hat{\mathbf{S}}_{t+1} \hat v_t = 0$. Next, we adopt the truncated power method to
estimate $v_{t+1}$ using the input matrix $\hat{\mathbf{S}}_{t+1}$, with the starting point $\tilde v_{t+1}$ computed by FPS.
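The deflation step can be checked numerically with a few lines of code. The helper name `deflate` and the toy matrix are assumptions made for illustration; the sketch only demonstrates that the deflated matrix maps the removed unit vector to zero.

```python
def deflate(S, v):
    """Return (I - v v^T) S (I - v v^T) for a unit vector v."""
    d = len(v)
    P = [[(1.0 if i == j else 0.0) - v[i] * v[j] for j in range(d)] for i in range(d)]
    PS = [[sum(P[i][k] * S[k][j] for k in range(d)) for j in range(d)] for i in range(d)]
    return [[sum(PS[i][k] * P[k][j] for k in range(d)) for j in range(d)] for i in range(d)]

S = [[3.5, 2.5, 0.0],
     [2.5, 3.5, 0.0],
     [0.0, 0.0, 1.0]]
v = [2 ** -0.5, 2 ** -0.5, 0.0]      # leading eigenvector of S (eigenvalue 6)
S_next = deflate(S, v)

# The deflated matrix maps v to (numerically) zero and keeps the rest of the spectrum.
S_next_v = [sum(S_next[i][j] * v[j] for j in range(3)) for i in range(3)]
```

Applying the projection on both sides keeps the deflated matrix symmetric, so it can be passed to the next round of the truncated power method unchanged.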
Denote the output of the TPower method with input matrix A, starting point w and tuning
parameter k by TPower(A, w, k), and denote the leading eigenvector of a matrix A by Λ1(A). The
algorithm to estimate the first T latent topics is summarized in Algorithm 1.
Next, we establish a concentration result as well as the consistency of the topic model estimator.
In particular, we prove convergence of the estimators vt computed by Algorithm 1.
Since the optimization problem in (26) is combinatorial and NP-hard, we solve (26) approximately
by adopting the FPS and TPower methods. In the next theorem, we prove that our
estimator $\hat v_t$ generated by Algorithm 1 enjoys a fast parametric rate of convergence.
Theorem B.5. Let $X \sim T(\mathbf{S}; M, s)$. We assume that the first $T$ eigenvalues of $\mathbf{S}$, $\lambda_1, \dots, \lambda_T$, are separated by
a minimum gap, i.e., $\lambda_t - \lambda_{t+1} \ge C_d$ for all $t = 1, \dots, T$, with $C_d > 0$. If $k \ge s$ and $k \le cs$ for some
constant $c > 1$, we have, with probability at least $1 - O(d^{-1})$,
$$\|\hat v_t - v_t\|_2 \le C \sqrt{\frac{s \log d}{n}}, \tag{28}$$
for some constant $C$.
Proof. We prove the result for the case $t = 1$. The results for $t = 2, \dots, T$ follow by induction.
We first characterize the initial point $\tilde v_t$ computed by the FPS method. By (13) and Theorem 3.3 of
Vu et al. (2013), it holds that, with probability at least $1 - O(d^{-1})$,
$$\sin(\tilde v_t, v_t) \le C s \sqrt{\log d / n},$$
for some constant $C$.
Assume $s < k < Cs$ for some constant $C$. This shows that if the sample size $n$ is large enough
that $\log d = o(n)$, then, since $\sin(\tilde v_t, v_t) = \sqrt{1 - (\tilde v_t^T v_t)^2}$,
$$\tilde v_t^T v_t \ge c\big(\eta(\hat{\mathbf{S}} - \mathbf{S}, s + 2k) + \sqrt{s/k}\big). \tag{29}$$
Recall that $\eta(\hat{\mathbf{S}} - \mathbf{S}, k) = \max_{v \in S^{d-1} \cap B_0(k)} v^T (\hat{\mathbf{S}} - \mathbf{S}) v$.
By Yuan and Zhang (2013), we have that when (29) is satisfied with $k > s$,
$$\|\hat v_t - v_t\|_2 \le \eta(\hat{\mathbf{S}} - \mathbf{S}, k).$$
By (25), we have that, with probability at least $1 - O(d^{-1})$,
$$\|\hat v_t - v_t\|_2 \le C \sqrt{s \log d / n},$$
for some constant $C$, as desired.
Therefore our estimator achieves the minimax optimal rate of convergence (Cai et al., 2013).
C TROPIC for TF-Biological Context Analysis
We provide a protocol to utilize the estimated gene topics to conduct high-throughput transcription
factor-biological context analysis.
We first estimate the latent topics using Algorithm 1. Note that we need to properly choose the
tuning parameter $k$. In particular, given the initialization $\tilde v_t$ provided by the FPS method (27), we
start with $k = 100$ and test whether the gene set corresponding to the non-zero entries of the
output vector $\hat v_t$ is functionally enriched (Efron and Tibshirani, 2007). If not, we decrease $k$ by
one and recompute the sparse leading eigenvector until the gene set is enriched. In our analysis, we
found that $k = 30$ is a good heuristic value. We choose $T = 500$ topics since their nonzero elements
cover more than 99% of the genes in our dataset.
Given the $j$-th TF, we adopt the methods introduced by Ji et al. (2008) and Wu and Ji (2013)
to obtain top-ranked target genes. More specifically, given the ChIPx data for the $j$-th TF, we
use CisGenome (Ji et al., 2008) to detect the significant peaks. We then annotate the top
$r = \min\{w, 1{,}000\}$ significant ChIPx peaks to identify the TF-bound gene targets. Finally, we
use ChIPXpress (Wu and Ji, 2013) to rank the target genes and denote by $G_j$ the collection
of the first $r$ target genes ranked by ChIPXpress.
Suppose we have $n_m$ samples of the $m$-th biological context, $x_1, \ldots, x_{n_m}$, drawn from a
transelliptical vector $X_m$. We estimate its latent mean-adjusted covariance matrix by $\hat{S}_m$ using (5) and
(11). Next, we compute the leading eigenvector $v^{(m)}$ of $\hat{S}_m$, which is called the $m$-th context topic.
Since the sample size $n_m$ is in general small (e.g., we have only 26 samples of Ewing sarcoma),
the estimated context topic is unreliable. To handle this issue, we stabilize the estimation of $v^{(m)}$ by
regressing it on the population topic dictionary $\{v_1, \ldots, v_T\}$ to identify which population topics
explain most of the variability of $v^{(m)}$ (those with adjusted P-value less than 0.05 by Bonferroni's method).
For each $v^{(m)}$, we denote the set of its $K$ significant population topics as $\mathcal{S}_m = \{v_{m_1}, \ldots, v_{m_K}\}$. We
construct a binary vector $v^{(g)}_m \in \mathbb{R}^d$ named the $m$-th feature gene vector. Let $v^{(g)}_m(i) = 1$ if there
exists $k \in \{1, \ldots, K\}$ such that $v_{m_k}(i) \neq 0$, and $v^{(g)}_m(i) = 0$ otherwise. The genes corresponding to
the nonzero entries of the vector $v^{(g)}_m$ represent the important features of biological context $m$.
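As an illustration of the last two steps, the following sketch builds a feature gene vector from a topic dictionary. It assumes the per-topic regression P-values have already been computed elsewhere; the function names `significant_topics` and `feature_gene_vector` and the toy inputs are hypothetical and only meant to make the definition concrete.

```python
def significant_topics(pvals, alpha=0.05):
    # Bonferroni adjustment: multiply each P-value by the number of topics T.
    T = len(pvals)
    return [t for t, p in enumerate(pvals) if min(1.0, p * T) < alpha]

def feature_gene_vector(topics, significant):
    # Entry i is 1 if gene i lies in the support of any significant topic.
    d = len(topics[0])
    return [1 if any(topics[t][i] != 0 for t in significant) else 0
            for i in range(d)]

# Toy example: three sparse topics over four genes (assumed values).
topics = [[0.5, 0.0, 0.0, 0.5],
          [0.0, 0.7, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
pvals = [0.001, 0.5, 0.01]            # assumed regression P-values
sig = significant_topics(pvals)       # topics 0 and 2 survive Bonferroni
v_g = feature_gene_vector(topics, sig)
```

The feature gene vector is simply the indicator of the union of the supports of the selected topics, so gene 1 (present only in the non-significant topic) is excluded.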
To identify the TF-biological context association, let $k_{mj}$ be the number of genes
present in both $v^{(g)}_m$ and $G_j$. We further trim the identified target gene list of the
$j$-th TF to include only the top $k_{mj}$ significant target genes. Let $u^{(m)}_j$ be a binary vector where
$u^{(m)}_j(i) = 1$ if and only if the gene corresponding to the $i$-th position is among the top $k_{mj}$ target
genes. We then test whether $u^{(m)}_j$ and $v^{(g)}_m$ overlap significantly more than under random selection (with
adjusted P-values less than 0.05 by FDR control (Benjamini and Hochberg, 1995) among the
$M$ biological contexts). If so, this gives strong evidence that the feature genes of the $m$-th biological
context are regulated by the top target genes of the $j$-th TF, and we conclude that the $j$-th TF is
functionally associated with the $m$-th biological context.
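The text does not name the overlap test. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test, followed by the Benjamini-Hochberg adjustment across the M contexts. The function names and arguments below are hypothetical.

```python
from math import comb

def overlap_pvalue(d, n_feature, n_target, overlap):
    # P(X >= overlap) when n_target genes are drawn at random from d genes,
    # of which n_feature are feature genes (hypergeometric upper tail).
    upper = min(n_feature, n_target)
    tail = sum(comb(n_feature, x) * comb(d - n_feature, n_target - x)
               for x in range(overlap, upper + 1))
    return tail / comb(d, n_target)

def benjamini_hochberg(pvals):
    # Step-up adjusted P-values, returned in the original order.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under these assumptions, a TF would be declared associated with context m when the adjusted P-value for that context falls below 0.05.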
References
Altucci, L., Leibowitz, M. D., Ogilvie, K. M., de Lera, A. R. and Gronemeyer, H.
(2007). RAR and RXR modulation in cancer and metabolic disease. Nat. Rev. Drug. Discov., 6
793–810.
Anderson, T. W. and Fang, K.-T. (1990). Statistical Inference in Elliptically Contoured and
Related Distributions. Allerton Press New York.
Arrowsmith, C. H., Bountra, C., Fish, P. V., Lee, K. and Schapira, M. (2012). Epigenetic
protein families: A new frontier for drug discovery. Nat. Rev. Drug. Discov., 11 384–400.
Bakalov, A., McCallum, A., Wallach, H. and Mimno, D. (2012). Topic models for taxonomies.
In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM,
237–240.
Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M.,
Marshall, K. A., Phillippy, K. H., Sherman, P. M., Holko, M. et al. (2013). NCBI
GEO: Archive for functional genomics data sets update. Nucleic. Acids. Res., 41 D991–D995.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and
powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 289–300.
Bilodeau, S., Kagey, M. H., Frampton, G. M., Rahl, P. B. and Young, R. A. (2009).
SetDB1 contributes to repression of genes encoding developmental regulators and maintenance
of ES cell state. Gene. Dev., 23 2484–2489.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn.
Res., 3 993–1022.
Bouchard, M., Pfeffer, P. and Busslinger, M. (2000). Functional equivalence of the tran-
scription factors PAX2 and PAX5 in mouse development. Development, 127 3703–3713.
Boulesteix, A.-L. and Strimmer, K. (2005). Predicting transcription factor activities from
combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol.
Med. Model., 2 23.
Busslinger, M., Klix, N., Pfeffer, P., Graninger, P. G. and Kozmik, Z. (1996). Deregula-
tion of PAX-5 by translocation of the Emu enhancer of the IgH locus adjacent to two alternative
PAX-5 promoters in a diffuse large-cell lymphoma. P. Natl. Acad. Sci., 93 6129–6134.
Cai, T. T., Ma, Z. and Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation.
Ann. Stat., 41 3074–3110.
Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q. and West, M. (2008).
High-dimensional sparse factor modeling: Applications in gene expression genomics. J. Am. Stat.
Assoc., 103.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study.
Ann. I. H. Poincare-Pr., 48 1148–1185.
Ceol, C. J., Houvras, Y., Jane-Valbuena, J., Bilodeau, S., Orlando, D. A., Battisti,
V., Fritsch, L., Lin, W. M., Hollmann, T. J., Ferre, F. et al. (2011). The histone
methyltransferase SETDB1 is recurrently amplified in melanoma and accelerates its onset. Na-
ture, 471 513–517.
Chi, P., Allis, C. D. and Wang, G. G. (2010). Covalent histone modifications: miswritten,
misinterpreted and mis-erased in human cancers. Nat. Rev. Cancer, 10 457–469.
Crea, F., Hurt, E. M. and Farrar, W. L. (2010). Clinical significance of polycomb gene
expression in brain tumors. Mol. Cancer, 9 265.
Darnell, J. E. (2002). Transcription factors as targets for cancer therapy. Nat. Rev. Cancer, 2
740–749.
d’Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. (2007). A direct
formulation for sparse PCA using semidefinite programming. SIAM Rev., 49 434–448.
Dawson, M. A. and Kouzarides, T. (2012). Cancer epigenetics: From mechanism to therapy.
Cell, 150 12–27.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R.
(1990). Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci., 41 391–407.
Dupont, S., Krust, A., Gansmuller, A., Dierich, A., Chambon, P. and Mark, M. (2000).
Effect of single and compound knockouts of estrogen receptors alpha (ERalpha) and beta (ER-
beta) on mouse reproductive phenotypes. Development, 127 4277–4291.
Edgar, R., Domrachev, M. and Lash, A. E. (2002). Gene expression omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Res., 30 207–210.
Eferl, R. and Wagner, E. F. (2003). AP-1: A double-edged sword in tumorigenesis. Nat. Rev.
Cancer, 3 859–868.
Efron, B. and Tibshirani, R. (2007). On testing the significance of sets of genes. Ann. Appl.
Stat. 107–129.
Evans, R. M. (1988). The steroid and thyroid hormone receptor superfamily. Science, 240
889–895.
Faith, J. J., Hayete, B., Thaden, J. T., Mogno, I., Wierzbowski, J., Cottarel, G.,
Kasif, S., Collins, J. J. and Gardner, T. S. (2007). Large-scale mapping and validation of
Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol.,
5 e8.
Greer, E. L. and Shi, Y. (2012). Histone methylation: A dynamic mark in health, disease and
inheritance. Nat. Rev. Genet., 13 343–357.
Grimaldi, C. M., Cleary, J., Dagtas, A. S., Moussai, D., Diamond, B. et al. (2002).
Estrogen alters thresholds for B cell apoptosis and activation. J. Clin. Invest., 109 1625–1633.
Gruvberger, S., Ringner, M., Chen, Y., Panavally, S., Saal, L. H., Borg, A., Ferno,
M., Peterson, C. and Meltzer, P. S. (2001). Estrogen receptor status in breast cancer is
associated with remarkably distinct gene expression patterns. Cancer Res., 61 5979–5984.
Gullo, L., Pezzilli, R. and Morselli-Labate, A. M. (1994). Diabetes and the risk of pan-
creatic cancer. New Engl. J. Med., 331 81–84.
Halder, G. and Johnson, R. L. (2011). Hippo signaling: Growth control and beyond. Develop-
ment, 138 9–22.
Han, F. and Liu, H. (2012). Transelliptical component analysis. In NIPS. 368–376.
Han, F. and Liu, H. (2013). Optimal rates of convergence of transelliptical component analysis.
arXiv preprint arXiv:1305.6916.
Hanahan, D. and Weinberg, R. A. (2011). Hallmarks of cancer: The next generation. Cell,
144 646–674.
Harvey, K. and Tapon, N. (2007). The Salvador–Warts–Hippo pathway: An emerging
tumour-suppressor network. Nat. Rev. Cancer, 7 182–191.
Harvey, K. F., Zhang, X. and Thomas, D. M. (2013). The hippo pathway and human cancer.
Nat. Rev. Cancer, 13 246–257.
Heine, P., Taylor, J., Iwamoto, G., Lubahn, D. and Cooke, P. (2000). Increased adipose
tissue in male and female estrogen receptor-α knockout mice. P. Natl. Acad. Sci., 97 12729–
12734.
Iida, S., Rao, P., Nallasivam, P., Hibshoosh, H., Butler, M., Louie, D., Dyomin, V.,
Ohno, H., Chaganti, R. and Dalla-Favera, R. (1996). The t (9; 14)(p13; q32) chromosomal
translocation associated with lymphoplasmacytoid lymphoma involves the PAX-5 gene. Blood,
88 4110–4117.
Ji, H., Jiang, H., Ma, W., Johnson, D. S., Myers, R. M. and Wong, W. H. (2008). An
integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol., 26
1293–1300.
Knight, W. A., Livingston, R. B., Gregory, E. J. and McGuire, W. L. (1977). Estrogen
receptor as an independent prognostic factor for early recurrence in breast cancer. Cancer Res.,
37 4669–4671.
Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou,
S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P. et al. (2012). ChIP-seq
guidelines and practices of the ENCODE and modENCODE consortia. Genome Res., 22 1813–
1831.
Leonetti, C., D’Agnano, I., Lozupone, F., Valentini, A., Geiser, T., Zon, G., Cal-
abretta, B., Citro, G. and Zupi, G. (1996). Antitumor effect of c-MYC antisense phospho-
rothioate oligodeoxynucleotides on human melanoma cells in vitro and in mice. J. Natl. Cancer
I., 88 419–429.
Li, H., Cai, Q., Wu, H., Vathipadiekal, V., Dobbin, Z. C., Li, T., Hua, X., Landen, C. N.,
Birrer, M. J., Sanchez-Beato, M. et al. (2012). SUZ12 promotes human epithelial ovarian
cancer by suppressing apoptosis via silencing HRK. Mol. Cancer Res., 10 1462–1472.
Li, H., Ma, X., Wang, J., Koontz, J., Nucci, M. and Sklar, J. (2007). Effects of rearrange-
ment and allelic exclusion of JJAZ1/SUZ12 on cell proliferation and survival. P. Natl. Acad.
Sci., 104 20001–20006.
Li, H., Wang, J., Mor, G. and Sklar, J. (2008). A neoplastic gene fusion mimics trans-splicing
of RNAs in normal human cells. Science, 321 1357–1361.
Li, M.-D. and Yang, X. (2011). A retrospective on nuclear receptor regulation of inflammation:
lessons from GR and PPARs. PPAR Res., 2011.
Li, Z., Van Calcar, S., Qu, C., Cavenee, W. K., Zhang, M. Q. and Ren, B. (2003). A
global transcriptional regulatory role for c-MYC in Burkitt’s lymphoma cells. P. Natl. Acad.
Sci., 100 8164–8169.
Liu, H., Han, F. and Zhang, C.-h. (2012). Transelliptical graphical models. In NIPS. 800–808.
Mangelsdorf, D. J., Thummel, C., Beato, M., Herrlich, P., Schutz, G., Umesono, K.,
Blumberg, B., Kastner, P., Mark, M., Chambon, P. et al. (1995). The nuclear receptor
superfamily: The second decade. Cell, 83 835–839.
Martin, C. and Zhang, Y. (2005). The diverse functions of histone lysine methylation. Nat.
Rev. Mol. Cell. Bio., 6 838–849.
Martín-Pérez, D., Sánchez, E., Maestre, L., Suela, J., Vargiu, P., Di Lisio, L.,
Martínez, N., Alves, J., Piris, M. A. and Sánchez-Beato, M. (2010). Deregulated
expression of the polycomb-group protein SUZ12 target genes characterizes mantle cell lymphoma.
Am. J. Pathol., 177 930–942.
McCall, M. N., Bolstad, B. M. and Irizarry, R. A. (2010). Frozen robust multiarray analysis
(fRMA). Biostatistics, 11 242–253.
McCall, M. N., Uppal, K., Jaffee, H. A., Zilliox, M. J. and Irizarry, R. A. (2011). The
gene expression barcode: leveraging public data repositories to begin cataloging the human and
murine transcriptomes. Nucleic Acids Res., 39 D1011–1015.
Mimno, D. (2012). Computational historiography: Data mining in a century of classics journals.
J. Comp. Cul. Herit., 5 1–20.
Morrison, A. M., Jager, U., Chott, A., Schebesta, M., Haas, O. A. and Busslinger, M.
(1998a). Deregulated PAX-5 transcription from a translocated IgH promoter in marginal zone
lymphoma. Blood, 92 3865–3878.
Morrison, A. M., Nutt, S. L., Thevenin, C., Rolink, A. and Busslinger, M. (1998b).
Loss- and gain-of-function mutations reveal an important role of BSAP (Pax-5) at the start and
end of B cell differentiation. In Seminars in Immunology, vol. 10. Academic Press, 133–142.
Nakamura, T., Imai, Y., Matsumoto, T., Sato, S., Takeuchi, K., Igarashi, K., Harada,
Y., Azuma, Y., Krust, A., Yamamoto, Y. et al. (2007). Estrogen prevents bone loss via
estrogen receptor α and induction of Fas ligand in osteoclasts. Cell, 130 811–823.
Ntziachristos, P., Tsirigos, A., Van Vlierberghe, P., Nedjic, J., Trimarchi, T., Fla-
herty, M. S., Ferres-Marco, D., da Ros, V., Tang, Z., Siegle, J. et al. (2012). Genetic
inactivation of the polycomb repressive complex 2 in T cell acute lymphoblastic leukemia. Nat.
Med., 18 298–303.
Ogawa, S., Eng, V., Taylor, J., Lubahn, D. B., Korach, K. S. and Pfaff, D. W. (1998).
Roles of estrogen receptor-α gene expression in reproduction-related behaviors in female mice.
Endocrinology, 139 5070–5081.
Piette, J., Piret, B., Bonizzi, G., Schoonbroodt, S., Merville, M.-P., Legrand-Poels,
S. and Bours, V. (1997). Multiple redox regulation in NF-kappaB transcription factor activa-
tion. Biol. Chem., 378 1237–1245.
Purushotham, S., Liu, Y. and Kuo, C.-C. J. (2012). Collaborative topic regression with
social matrix factorization for recommendation systems. In Proceedings of the 29th International
Conference on Machine Learning. 759–766.
Rayet, B. and Gelinas, C. (1999). Aberrant REl/NFKB genes and activity in human cancer.
Oncogene, 18.
Shaffer, A., Rosenwald, A. and Staudt, L. M. (2002). Lymphoid malignancies: The dark
side of B-cell differentiation. Nat. Rev. Immunol., 2 920–933.
Shalit, U., Weinshall, D. and Chechik, G. (2013). Modeling musical influence with topic
models. In ICML. 244–252.
Shaulian, E. and Karin, M. (2002). AP-1 as a regulator of cell life and death. Nat. Cell Biol.,
4 E131–E136.
Slack, G. W. and Gascoyne, R. D. (2011). MYC and aggressive B-cell lymphomas. Adv. Anat.
Pathol., 18 219–228.
Sparmann, A. and van Lohuizen, M. (2006). Polycomb silencers control cell fate, development
and cancer. Nat. Rev. Cancer, 6 846–856.
Suva, M. L., Riggi, N. and Bernstein, B. E. (2013). Epigenetic reprogramming in cancer.
Science, 339 1567–1570.
Thurmond, T. S., Murante, F. G., Staples, J. E., Silverstone, A. E., Korach, K. S. and
Gasiewicz, T. A. (2000). Role of estrogen receptor α in hematopoietic stem cell development
and B lymphocyte maturation in the male mouse. Endocrinology, 141 2309–2318.
Trefethen, L. N. and Bau III, D. (1997). Numerical Linear Algebra. 50, SIAM.
Urbanek, P., Fetka, I., Meisler, M. H. and Busslinger, M. (1997). Cooperation of PAX2
and PAX5 in midbrain and cerebellum development. P. Natl. Acad. Sci., 94 5703–5708.
Urbanek, P., Wang, Z.-Q., Fetka, I., Wagner, E. F. and Busslinger, M. (1994). Complete
block of early B cell differentiation and altered patterning of the posterior midbrain in mice
lacking PAX5/BSAP. Cell, 79 901–912.
Vigneri, P., Frasca, F., Sciacca, L., Pandini, G. and Vigneri, R. (2009). Diabetes and
cancer. Endocr.-Relat. Cancer, 16 1103–1123.
Vu, V. Q., Cho, J., Lei, J. and Rohe, K. (2013). Fantope Projection and Selection: A near-
optimal convex relaxation of sparse PCA. In NIPS. 2670–2678.
Wang, C., Blei, D. and Li, F.-F. (2009). Simultaneous image classification and annotation. In
IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1903–1910.
Wang, C. and Blei, D. M. (2009). Decoupling sparsity and smoothness in the discrete hierarchical
Dirichlet process. In NIPS. 1982–1989.
Wu, G. and Ji, H. (2013). ChIPXpress: Using publicly available gene expression data to improve
ChIP-seq and ChIP-chip target gene ranking. BMC Bioinformatics, 14 188.
Wu, G., Yustein, J. T., McCall, M. N., Zilliox, M., Irizarry, R. A., Zeller, K., Dang,
C. V. and Ji, H. (2013a). ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data.
Bioinformatics.
Wu, H., Xiao, Y., Zhang, S., Ji, S., Wei, L., Fan, F., Geng, J., Tian, J., Sun, X., Qin,
F. et al. (2013b). The Ets transcription factor GABP is a component of the hippo pathway
essential for growth and antioxidant defense. Cell Reports, 3 1663–1677.
Yao, L., Mimno, D. and McCallum, A. (2009). Efficient methods for topic model inference on
streaming document collections. In ACM SIGKDD. ACM, 937–946.
Yoshida, S., Nakazawa, N., Iida, S., Hayami, Y., Sato, S., Wakita, A., Shimizu, S.,
Taniwaki, M. and Ueda, R. (1999). Detection of MUM1/IRF4-IgH fusion in multiple myeloma.
Leukemia, 13 1812.
Yu, J., Yu, J., Rhodes, D. R., Tomlins, S. A., Cao, X., Chen, G., Mehra, R., Wang, X.,
Ghosh, D., Shah, R. B. et al. (2007). A polycomb repression signature in metastatic prostate
cancer predicts cancer outcome. Cancer Research, 67 10657–10663.
Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. J.
Mach. Learn. Res., 14 899–925.
Zhu, J., Zhang, B., Smith, E. N., Drees, B., Brem, R. B., Kruglyak, L., Bumgarner,
R. E. and Schadt, E. E. (2008). Integrating large-scale functional genomic data to dissect the
complexity of yeast regulatory networks. Nat. Genet., 40 854–861.
Zhuang, D., Mannava, S., Grachtchouk, V., Tang, W., Patil, S., Wawrzyniak, J.,
Berman, A., Giordano, T., Prochownik, E., Soengas, M. et al. (2008). c-MYC over-
expression is required for continuous suppression of oncogene-induced senescence in melanoma
cells. Oncogene, 27 6623–6634.