
Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach

Ethan X. Fang∗ Min-Dian Li† Michael I. Jordan‡ Han Liu§

January 1, 2015

Abstract

Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is often necessary to generate new hypotheses for experimentalists. In this paper, we use large gene expression and chromatin immunoprecipitation (ChIP) data corpuses to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis which are also applicable to a variety of large-scale data analyses. (ii) From a scientific perspective, our method provides an informative list of new discoveries in biology. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SET-DB1, and nuclear receptors, in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, suggesting the important role of SUZ12-mediated histone methylation in tumor biology.

∗ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

† Department of Cellular and Molecular Physiology, Section of Comparative Medicine and Program in Integrative Cell Signaling and Neurobiology of Metabolism, Yale University School of Medicine, New Haven, CT 06520, USA.

‡ Department of EECS and Statistics, University of California, Berkeley, CA 94720, USA.

§ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

1 Introduction

A fundamental goal of systems biology and functional genomics is to understand global regulation of gene expression. Transcription factors (TFs), cofactors, and epigenetic proteins represent major regulators of gene expression, the disturbance of which contributes to the pathogenesis of a plethora of human diseases, including cancer (Darnell, 2002; Arrowsmith et al., 2012). A major approach to understanding such regulators is to integrate their genomic binding patterns with gene expression profiles in different biological contexts, including physiologic or pathologic conditions from different organisms. Currently, chromatin immunoprecipitation (ChIP) followed by microarray or DNA sequencing (-chip or -seq, collectively referred to as ChIPx) is widely used to examine the function of TFs, cofactors and epigenetic proteins in different biological contexts. Since the launch of the ENCODE and modENCODE consortia, ChIPx data from more than 140 TFs and histone modifications in more than 100 cell types from four organisms including human have been made available (Landt et al., 2012). Meanwhile, more than 36,728 series of gene expression profiles in different organisms from frogs to human have been deposited in the Gene Expression Omnibus (GEO) public database (Edgar et al., 2002; Zhu et al., 2008; Barrett et al., 2013).

Although a variety of computational methods have been proposed to integrate ChIPx and gene

expression data and to predict associations between TFs and biological contexts (Boulesteix and

Strimmer, 2005; Faith et al., 2007; Zhu et al., 2008; Wu et al., 2013a), there are several critical

challenges associated with such methods. First, current methods are often not able to cope with

the high dimensionality associated with gene-expression datasets, where the dimension of the data

is generally much larger than the sample size. Second, current methods often assume idealized

distributions such as the Gaussian that are generally not a good match to the complex distributions

that arise in these datasets, particularly in the tails of the distributions. Third, current methods

do not address issues of heterogeneity that arise when the overall dataset is formed from different

sources.

In this paper, we exploit the transelliptical distribution recently studied by Han and Liu (2012)

to propose a semiparametric transelliptical topic (TROPIC) modeling framework to address the

challenges of integrating genomic binding patterns with gene expression profiles across biological

contexts. A semiparametric model contains both finite- and infinite-dimensional parameters, and

is used in our framework to model gene expression data. The infinite-dimensional component of

the model provides flexibility, and the finite-dimensional component comprises the parameters of

major scientific interest. Such a topic modeling framework, combined with a hierarchical mixture

architecture, enables us to effectively extract common information from large aggregated datasets

that exhibit high heterogeneity. In addition, we develop new statistical optimization algorithms to

estimate the parameters in these topic models efficiently. We will show that under mild assumptions,

our estimation procedure converges at a near-optimal rate.

We show how to use our method to perform high-throughput TF-biological context analysis

of ChIPx and gene expression profiles by matching target genes of a TF with feature genes of a

biological context. We first show that the TROPIC method reveals the TF signature of c-MYC

in a conserved cohort of tumors with ChIPx data from different sources. We next show that the

TROPIC method is more reliable than ChIP-PED (Wu et al., 2013a), the first such analytic

method, while still providing a simple platform. To illustrate the effectiveness of TROPIC, we further

apply our framework to ChIPx data involving 38 TFs and epigenetic proteins, and gene expression

profiles of 68 tumor types. Figure 1 illustrates the workflow. We find that the TF signature

of the epigenetic regulator SUZ12 is prevalent in a broad range of tumors. Classic tumor-related

TFs, such as NF-κB, c-FOS, c-JUN, ESR1 and PAX5, are also prevalent in tumors. Interestingly, several nuclear receptors, e.g., HNF4A and RXRA, exhibit significant relevance in a wide spectrum of tumor types.

[Figure 1 image; recoverable labels: panels A, B, C; "3000+ ChIPx samples for multiple TFs"; "60,000+ gene expression samples from multiple biological contexts"; "TF-DNA interaction"; "Gene Expression Data (Microarray)"; "Breast Cancer", "Ewing Sarcoma", "Kidney Cancer"; "Gene Topics" versus "Genes" with "Top target gene (TTG)", "Topic gene match TTG", "Topic gene does not match TTG", "Significant Topic"; "BRCA1", "EWSR1", "MYC"; "Confirmed connections", "Potential connection", "Unexplored area"; "Application: Data-Driven Science".]

Figure 1: (A) Our method integrates datasets arising from gene expression and ChIPx. (B) We assess whether top target genes (red) have significant overlap with topic genes (purple). (C) We systematically explore the associations between biological contexts and transcription factors. The current state of the art is that only a small proportion (red) of the joint ChIPx and expression data in human has been investigated; we analyze the unexplored area (grey) in order to guide biologists in the design of new experiments.

2 Data and Methodology

We exploit a gene expression dataset (McCall et al., 2011) consisting of $n = 13{,}182$ samples of $M = 2{,}631$ biological contexts generated from Affymetrix Human 133A (GPL96) arrays. The data were downloaded from GEO, preprocessed and normalized using frozen-RMA (McCall et al., 2010) to reduce batch effects. For each probeset, we standardize its expression values to have zero mean and unit standard deviation across all array samples. The data contain 20,248 probes, corresponding to $d = 12{,}704$ genes.
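The per-probeset standardization described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper name `standardize_columns` and the toy matrix are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is used.

```python
from statistics import fmean, pstdev

def standardize_columns(rows):
    """Scale each column (probeset) to zero mean and unit standard deviation.

    `rows` is a list of array samples; each sample is a list of expression values.
    Assumption: the population standard deviation (pstdev) is used.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    if any(sd == 0 for sd in sds):
        raise ValueError("a constant column cannot be standardized")
    return [[(x - m) / sd for x, m, sd in zip(row, means, sds)] for row in rows]

# Hypothetical toy data: 4 array samples (rows) by 2 probesets (columns).
toy = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0], [4.0, 70.0]]
z = standardize_columns(toy)
```

After this step every probeset is on the same scale, so no single probe dominates the second-moment structure that the topics summarize.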


2.1 Data Modeling

The gene expression data $X \in \mathbb{R}^{n \times d}$ are highly heterogeneous since they are collected from multiple biological contexts and labs. Such heterogeneity invalidates the classical Gaussian model and motivates us to adopt a more flexible model based on the transelliptical distribution (Han and Liu, 2012).

A random vector $X = (X_1, \ldots, X_d)^T \in \mathbb{R}^d$ follows a transelliptical distribution, denoted $X \sim \mathrm{TE}_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$, if there exist monotone univariate functions $f_1, \ldots, f_d : \mathbb{R} \to \mathbb{R}$ such that the transformed vector $f(X) = (f_1(X_1), \ldots, f_d(X_d))^T$ follows an elliptical distribution with mean $\mu$ and covariance matrix $\Sigma$. More details regarding this distribution are provided in Appendix A.

To model the heterogeneity of the gene expression data $X$, we assume the expression data from the $m$-th biological context are generated from a transelliptical random vector $X_m$. This results in a transelliptical mixture model, i.e., each gene expression sample is generated from $X \sim \sum_{m=1}^{M} \pi_m X_m \in \mathbb{R}^d$, where $M$ is the total number of biological contexts and $\sum_{m=1}^{M} \pi_m = 1$.

The transelliptical mixture model has a natural hierarchical interpretation (Liu et al., 2012). Specifically, for each biological context $m$, we assume that there exists a latent Gaussian random vector $Y_m \sim N_d(\mu_m, \Sigma_m)$. As shown in Figure 2, the Gaussian random vector can be converted into an elliptical random vector $Z_m \sim \mathrm{EC}_d(g, \mu_m, \Sigma_m)$ via a global stochastic scaling factor $\xi_m$. Compared to the Gaussian distribution, elliptical distributions are powerful at modeling heavy-tailed distributions with possibly nontrivial tail dependence. However, elliptical distributions are still restrictive since they must be symmetric. The elliptical random vector can be further converted into a possibly asymmetric transelliptical random vector through marginal monotone transformations. The transelliptical model is semiparametric since it contains both finite-dimensional parameters (the mean and covariance matrix) and infinite-dimensional parameters (the stochastic scaling variable and marginal transformations). Such a semiparametric architecture naturally addresses the heterogeneity issue in modeling the expression data. For the purposes of statistical inference, we treat the stochastic scaling factor $\xi_m$ and the marginal transformations as nuisance parameters and directly infer the latent means $\mu_m$ and covariance matrices $\Sigma_m$. We define $Y \sim \sum_{m=1}^{M} \pi_m Y_m$ to be the latent Gaussian mixture random vector associated with $X$.
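The Gaussian, elliptical, transelliptical hierarchy described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' generative procedure: the scaling factor is taken to be the one that yields a multivariate t distribution, the marginal maps (`math.exp`, `math.sinh`, a cube) are hypothetical choices of monotone functions, and all parameter values and function names are invented for illustration.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small symmetric positive definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_transelliptical(mu, sigma, df, marginal_maps, rng):
    """One draw: latent Gaussian -> elliptical (multivariate t) -> transelliptical."""
    d = len(mu)
    L = cholesky(sigma)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    centered = [sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(d)]  # Y - mu
    # Global stochastic scaling factor shared by all coordinates (assumed chi-square based).
    xi = math.sqrt(df / rng.gammavariate(df / 2.0, 2.0))
    z = [m + xi * c for m, c in zip(mu, centered)]  # elliptical, still symmetric
    return [h(zj) for h, zj in zip(marginal_maps, z)]  # monotone marginal maps

def sample_mixture(weights, components, n, seed=0):
    """Draw n (context index, sample) pairs from the mixture."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        m = rng.choices(range(len(weights)), weights=weights)[0]
        mu, sigma, df, maps = components[m]
        out.append((m, sample_transelliptical(mu, sigma, df, maps, rng)))
    return out

# Two hypothetical biological contexts in d = 2 dimensions.
components = [
    ([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 5.0, [math.exp, math.exp]),
    ([2.0, -1.0], [[1.0, -0.3], [-0.3, 2.0]], 8.0, [math.sinh, lambda t: t ** 3]),
]
draws = sample_mixture([0.6, 0.4], components, n=200)
```

In this sketch the maps applied to the elliptical vector play the role of the inverses of the $f_j$ in the definition, since the definition requires $f(X)$ to be elliptical.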

2.2 Transelliptical Topic Model

We assume the gene expression data $X \in \mathbb{R}^{n \times d}$ can be summarized by a small number of “topic” vectors $v_1, v_2, \ldots, v_T \in \mathbb{R}^d$ with $T \ll n$. This general approach has been used in many applications, including text mining (Blei et al., 2003; Mimno, 2012), social media analysis (Purushotham et al., 2012), image processing (Wang et al., 2009) and others (Bakalov et al., 2012; Yao et al., 2009; Shalit et al., 2013). In particular, motivated by the approach to topic modeling based on the singular value decomposition (Deerwester et al., 1990), we define the topics of the transelliptical mixture random vector $X$ to be the leading eigenvectors of the latent mean-adjusted covariance matrix $S = \Sigma + \mu\mu^T$, where $\Sigma$ and $\mu$ are the covariance matrix and mean of the latent Gaussian mixture random vector $Y$, i.e.,

$$S = \mathrm{Cov}(Y) + \mathbb{E}(Y)\,\mathbb{E}(Y^T) = \sum_{m=1}^{M} \pi_m S_m, \tag{1}$$

where each $S_m = \Sigma_m + \mu_m \mu_m^T$.

[Figure 2 image; recoverable labels: a mixture distribution $\mathrm{Multi}(\pi_1, \ldots, \pi_M)$ over biological contexts (Breast Cancer, Ewing Sarcoma, Kidney Cancer); for each context a Gaussian layer $Y \mid m \sim N(\mu_m, \Sigma_m)$ ("Lighted tail"), an elliptical layer ("Heavy tail"), and a transelliptical layer ("Asymmetric") that produces the observed expression for that context.]

Figure 2: The hierarchical structure of a transelliptical mixture distribution. Each biological context $m$ has an underlying normal distribution $Y_m \sim N(\mu_m, \Sigma_m)$. Each $Y_m$ is transformed to an elliptical random vector and then to a transelliptical random vector. The observed data are generated from the transelliptical random vector.

The first term $\mathrm{Cov}(Y)$ captures population-level variability, and the second term $\mathbb{E}(Y)\,\mathbb{E}(Y^T)$ captures location information. Recall that for a positive semidefinite matrix $S \in \mathbb{R}^{d \times d}$, we can write $S = \sum_{i=1}^{d} \lambda_i v_i v_i^T$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ are the eigenvalues of $S$ and $v_i$ are the corresponding eigenvectors, such that the best rank-$k$ approximation of $S$ is $\sum_{i=1}^{k} \lambda_i v_i v_i^T$ for all $1 \le k \le d$ (Trefethen and Bau III, 1997). Thus, the leading topics provide a latent representation that summarizes important aspects of the first- and second-order statistical structure of the distribution of $X$. We additionally assume that the topics $v_1, v_2, \ldots, v_T \in \mathbb{R}^d$ are $s$-sparse; i.e., we assume at most $s$ of the $d$ elements of each $v_t$ are non-zero, where $s \ll d$. Such sparsity assumptions have been widely adopted in the latent variable modeling literature as a tool for addressing the curse of dimensionality; see, e.g., Carvalho et al. (2008) and Wang and Blei (2009). The nonzero components of the topics represent features which are important in one or more of the $X_m$. To summarize, the transelliptical topic model is defined as:


Definition 2.1 (Transelliptical topic model). The transelliptical topic model, denoted by $\mathcal{T}(S; M, s)$, is the set of distributions $X \sim \sum_{m=1}^{M} \pi_m X_m$, where each $X_m \sim \mathrm{TE}_d(\mu_m, \Sigma_m; Z, f_1^{(m)}, \ldots, f_d^{(m)})$, such that $S = \sum_{m=1}^{M} \pi_m (\Sigma_m + \mu_m \mu_m^T)$ and the first $T$ leading eigenvectors of $S$ are $s$-sparse.
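To make the definition concrete, the sketch below assembles $S = \sum_m \pi_m(\Sigma_m + \mu_m\mu_m^T)$ for a toy mixture and extracts its leading eigenvector by plain power iteration. All parameter values and function names are hypothetical, and the latent means and covariances are assumed known here, whereas in practice they must be estimated.

```python
import math

def mean_adjusted_covariance(weights, means, covs):
    """S = sum_m pi_m * (Sigma_m + mu_m mu_m^T) for latent means and covariances."""
    d = len(means[0])
    S = [[0.0] * d for _ in range(d)]
    for w, mu, sigma in zip(weights, means, covs):
        for i in range(d):
            for j in range(d):
                S[i][j] += w * (sigma[i][j] + mu[i] * mu[j])
    return S

def leading_eigenpair(S, iters=500):
    """Leading eigenvalue and eigenvector of a symmetric PSD matrix by power iteration."""
    d = len(S)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * S[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v

# Two hypothetical contexts whose means differ only in the first coordinate.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
S = mean_adjusted_covariance(
    weights=[0.5, 0.5],
    means=[[3.0, 0.0, 0.0], [-3.0, 0.0, 0.0]],
    covs=[identity, identity],
)
lam, topic = leading_eigenpair(S)
```

In this toy case the overall mixture mean is zero, yet the leading topic still points along the first coordinate, because $S$ retains the per-context location term $\mu_m\mu_m^T$. The topic is also 1-sparse, matching the sparsity assumption.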

Since transelliptical distributions can be heavy-tailed or asymmetric, we exploit a combination

of rank correlation (Han and Liu, 2012) and an M-estimator proposed by Catoni (2012) to estimate

the mean-adjusted covariance matrix S. For parameter estimation, we adopt the truncated power

(TPower) method (Yuan and Zhang, 2013) initialized by a semidefinite program that is known as

the Fantope Projection and Selection (FPS) method (Vu et al., 2013). More details regarding these

estimators can be found in Appendix B.
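The truncated power iteration at the core of the TPower method can be sketched as below. This is a simplified illustration, not the estimator of Appendix B: the input matrix is a hypothetical toy matrix rather than the rank-based estimate of $S$, and the FPS initialization is replaced by a simple heuristic (a unit vector at the largest diagonal entry), which is an assumption made only to keep the example short.

```python
import math

def truncate(x, k):
    """Keep the k largest-magnitude entries of x, zero the rest, and renormalize."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    y = [xi if i in keep else 0.0 for i, xi in enumerate(x)]
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]

def tpower(A, k, x0, iters=100):
    """Truncated power iteration: multiply by A, truncate to k nonzeros, normalize."""
    x = truncate(x0, k)
    for _ in range(iters):
        Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
        x = truncate(Ax, k)
    return x

# Hypothetical toy matrix A = I + 9 u u^T with a 2-sparse leading eigenvector u.
d = 5
r = 1.0 / math.sqrt(2.0)
u = [r, r, 0.0, 0.0, 0.0]
A = [[(1.0 if i == j else 0.0) + 9.0 * u[i] * u[j] for j in range(d)] for i in range(d)]

# Heuristic start (assumption, stands in for FPS): unit vector at the largest diagonal entry.
start = max(range(d), key=lambda i: A[i][i])
x0 = [1.0 if i == start else 0.0 for i in range(d)]
v_hat = tpower(A, k=2, x0=x0)
```

The truncation step is what enforces $s$-sparsity at every iteration; without it the loop reduces to the ordinary power method.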

We now present a theorem which shows that our proposed method achieves the minimax optimal rate of convergence, $O_P\big(\sqrt{(s \log d)/n}\big)$, for estimating the sparse topic vectors.

Theorem 2.2. Let $X \sim \mathcal{T}(S; M, s)$. We assume the first $T$ eigenvalues of $S$, $\lambda_1, \ldots, \lambda_T$, have a smallest spectral gap such that $\lambda_t - \lambda_{t+1} \ge C_d$ for all $t = 1, \ldots, T-1$ and $C_d > 0$. Denote the estimated topics by $\hat{v}_1, \ldots, \hat{v}_T$. Under the “sign sub-Gaussian condition” (Han and Liu, 2013), with a suitable choice of tuning parameters, with probability at least $1 - O(d^{-1})$, we have

$$\|\hat{v}_t - v_t\|_2 \le C \cdot \sqrt{\frac{s \log d}{n}}, \tag{2}$$

for some constant $C$.
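To get a feel for the scale of the bound in (2), the snippet below evaluates $\sqrt{(s \log d)/n}$ at the dimensions reported in Section 2. The sparsity level `s = 50` is a hypothetical value chosen only for illustration, and the constant $C$ is unknown, so the result indicates the order of magnitude of the rate and not an actual error bound.

```python
import math

def topic_rate(s, d, n):
    """The rate sqrt(s * log(d) / n) from Theorem 2.2, without the unknown constant C."""
    return math.sqrt(s * math.log(d) / n)

# d and n are taken from Section 2; s = 50 is a hypothetical sparsity level.
rate_full = topic_rate(s=50, d=12704, n=13182)

# With a tenth of the samples the rate grows by a factor of sqrt(10).
rate_small = topic_rate(s=50, d=12704, n=1318.2)
```

Because $d$ enters only through $\log d$, the rate is driven mainly by the ratio $s/n$, which is why the sparsity assumption matters when $d$ is comparable to $n$.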

Note that when $X$ follows a Gaussian or elliptical mixture distribution, the topics are the leading eigenvectors of $\mathbb{E}(XX^T)$. To connect our topic model with existing work, suppose we have $T$ topics $W = [v_1, \ldots, v_T] \in \mathbb{R}^{d \times T}$, where each column $v_t \in \mathbb{R}^d$ is a topic. We assume that the observed data matrix $X \in \mathbb{R}^{n \times d}$ is generated through some random combination of the topics $v_1, \ldots, v_T$; i.e., we assume that $X^T = WA$, where the random matrix $A \in \mathbb{R}^{T \times n}$ is generated from some unknown distribution. In Deerwester et al. (1990), a singular value decomposition of the observed data matrix, $X^T = UDV^T$, is conducted, such that if the columns of $U$ are viewed as the topics, then $A = DV^T$ can be viewed as a random combination matrix. It is easily seen that if $d$ is fixed and $n \to \infty$, the columns of $U$ converge to the leading eigenvectors of $\mathbb{E}(XX^T)$ asymptotically. Thus, our definition of topics can be viewed as a generalization of that of Deerwester et al. (1990).
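The convergence claim for the columns of $U$ can be spelled out in one line. The following is a sketch under two assumptions that the text does not state in this form: the rows $x_1, \ldots, x_n$ of the data matrix are i.i.d. copies of the random vector with finite second moments, and the leading eigenvalues of the population second-moment matrix are distinct.

```latex
\frac{1}{n} X^T X
  = \frac{1}{n}\,(U D V^T)(V D U^T)
  = U \Big( \tfrac{1}{n} D^2 \Big) U^T ,
\qquad
\frac{1}{n} X^T X
  = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T
  \;\longrightarrow\; \mathbb{E}\big( x_1 x_1^T \big)
  \quad (n \to \infty,\ d \text{ fixed}).
```

The first identity uses $V^T V = I$, so the columns of $U$ are exactly the eigenvectors of the sample second-moment matrix. The second is the law of large numbers, and with distinct eigenvalues the eigenvectors of the sample matrix then converge, up to sign, to those of the population matrix.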

Our topic modeling framework, based on a transelliptical mixture distribution, is nongenerative,

in distinction to the bulk of the literature on topic modeling, which focuses on generative models

(Blei et al., 2003; Mimno, 2012). Our topics are defined in the latent space and the transformations

to the observed data are treated as nuisance parameters; however, the topics in the latent space

can be viewed as informative summaries of the distribution of the random vector X.

2.3 TROPIC for TF-Biological Context Analysis

We now introduce the TROPIC method for conducting TF-biological context analysis. Given a

transcription factor and a biological context, we first identify the biological context’s feature genes


using the estimated topics from the gene expression data. Next, we exploit the ChIPx data to

identify the top target genes of the TF. We then test if the feature genes of the biological context

and the top target genes of the TF have significant overlap. If so, we conclude that the feature

genes and target genes significantly match, and the TF is deemed functionally significant in the

biological context.

In more detail, let $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T$ be the estimated topics from the gene expression profiles $X$. We let $\hat{v}^{(m)}$ denote the leading eigenvector of the estimated latent mean-adjusted covariance matrix of $X_m$, which can also be viewed as the leading “topic” of the $m$-th biological context. We can view $\hat{v}^{(m)}$ as encoding summary information for the $m$-th biological context. However, the sample size of the $m$-th biological context is possibly very small, which results in instability of the estimated $\hat{v}^{(m)}$. To resolve this problem, we regress $\hat{v}^{(m)}$ on the population topics $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T$ to identify a subset $\mathcal{S}_m = \{\hat{v}_{m_1}, \ldots, \hat{v}_{m_K}\}$ which explains the greatest fraction of the variability. We then construct a binary feature vector, $v_m^{(g)}$, where $v_m^{(g)}(i) = 1$ if there exists some $k$ such that $\hat{v}_{m_k}(i) \neq 0$, and $v_m^{(g)}(i) = 0$ otherwise, where $v(i)$ denotes the $i$-th component of $v$.
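The construction of the binary feature vector can be sketched as follows. The selection rule is an assumption: the text only says that $\hat{v}^{(m)}$ is regressed on the population topics, so this sketch ranks topics by squared inner product, which matches the regression's explained variability only when the topics are (near-)orthonormal. The function names, toy topics, and `K = 2` are hypothetical; the actual protocol is the one in Appendix C.

```python
def select_topics(context_vec, topics, K):
    """Indices of the K topics with the largest squared inner product with context_vec.

    Assumption: topics are close to orthonormal, so each squared coefficient
    measures the variability of context_vec explained by that topic.
    """
    scores = [sum(a * b for a, b in zip(t, context_vec)) ** 2 for t in topics]
    order = sorted(range(len(topics)), key=lambda t: scores[t], reverse=True)
    return order[:K]

def binary_feature_vector(topics, selected):
    """Entry i is 1 if any selected topic has a nonzero i-th component, else 0."""
    d = len(topics[0])
    return [1 if any(topics[t][i] != 0 for t in selected) else 0 for i in range(d)]

# Three hypothetical sparse topics over d = 5 genes.
topics = [
    [0.8, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.8],
]
context_vec = [0.7, 0.5, 0.0, 0.3, 0.4]  # noisy leading "topic" of one context

selected = select_topics(context_vec, topics, K=2)
features = binary_feature_vector(topics, selected)
```

The feature genes come from the supports of the stable population topics rather than from the noisy context-level vector itself, which is the point of the regression step.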

We further construct a binary target gene vector $u_j^{(m)}$ corresponding to the $j$-th TF. The elements of $u_j^{(m)}$ corresponding to the top target genes of the $j$-th TF are set to 1: we first use CisGenome (Ji et al., 2008) to perform peak detection on the ChIPx data of the $j$-th TF, and then use ChIPXpress (Wu and Ji, 2013) to identify the top target genes of the TF. We then test whether $v_m^{(g)}$ and $u_j^{(m)}$ have significant overlap. If so, we conclude that the feature genes of the $m$-th biological context significantly match the target genes of the $j$-th TF, and infer that the regulation of the $j$-th TF is functionally important in the $m$-th biological context. A more detailed presentation of the protocol can be found in Appendix C.
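One common way to test whether two binary gene vectors overlap more than expected by chance is a one-sided hypergeometric test. The sketch below uses it purely as an illustration: the text does not name the test (the protocol is in Appendix C), so the choice of test, the function name `overlap_pvalue`, and the toy vectors are all assumptions.

```python
from math import comb

def overlap_pvalue(feature, target):
    """Overlap size and one-sided hypergeometric P-value for two binary vectors.

    Returns (k, p), where k is the number of genes flagged in both vectors and
    p = P(overlap >= k) if the target genes were drawn at random from all genes.
    """
    N = len(feature)          # total number of genes
    K = sum(feature)          # feature genes of the biological context
    n = sum(target)           # top target genes of the TF
    k = sum(f * t for f, t in zip(feature, target))
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)

# Hypothetical toy example with 20 genes: 5 feature genes, 5 target genes, 4 shared.
feature = [1] * 5 + [0] * 15
target = [1, 1, 1, 1, 0] + [0] * 5 + [1] + [0] * 9
k, p = overlap_pvalue(feature, target)
```

In a full analysis one such P-value would be computed per (TF, biological context) pair and then adjusted for multiple comparisons.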

3 Results and Discussions

We apply the TROPIC method to analyze the association between 38 TFs and a total of 68

tumor-related biological contexts, where the sample size of each biological context is greater than

20. In this section, we discuss several important biological findings that arise from this analysis.

3.1 TROPIC Reliably Predicts TF Signature in a Conserved Cohort of Tumor

Types with ChIPx Data from Different Sources

To test the hypothesis that the adaptively selected target genes from the ChIPx data represent the major targets of a TF, we use the TROPIC method to examine the association between major targets of MYC and 68 tumor types, with ChIPx data from 6 sources. The ChIPx datasets differ in the laboratory that prepared them and in cell type. As shown in Figure 3A, ChIPx data for MYC predict a conserved cohort of tumor types (14/18), suggesting that our selection criteria faithfully preserve the major targets of MYC regardless of the origin of the data. In particular, as shown in Figure 3B, ChIPx data from three different cell types predict MYC signature in 12 tumors shared by all cell types. The cell types chosen are originally from umbilical vein endothelium (HUVEC), lymphoblastoid tumor (GM12878), myelogenous leukemia (K562), and cervical malignant tumor (HeLa), which have distinct cellular physiology. Experimental variance is another concern for extrapolating TF function to a new biological context. We compare the outcomes of TROPIC from two laboratories and find that K562 cell-derived and GM12878 cell-derived ChIPx data predict MYC signature in a highly overlapping cohort of tumors, 12/14 and 14/16, respectively, as shown in Figure 3C. Together, the results indicate that our selection criteria to process ChIPx data can reliably predict TF signatures in new biological contexts.

[Figure 3 image; recoverable labels: panels A, B, C; axes "Biological Context" and "ChIP Sources"; ChIP sources Yale: HeLa, UTA: HUVEC, UTA: GM12878, UTA: K562, Yale: GM12878, Yale: K562.]

Figure 3: TROPIC predicts the TF signature in a conserved cohort of tumor types with ChIPx data from different sources. (A) A diagram showing the significant biological contexts among 68 tumors for MYC. The horizontal panel shows significant biological contexts. The vertical panel shows sources of ChIPx data for MYC. The red color indicates an adjusted P-value < 0.05. (B) A Venn diagram of the number of significant biological contexts for ChIPx data from different cell types. (C) A Venn diagram of the number of significant biological contexts for ChIPx data from different laboratories.

3.2 TROPIC Predicts TF Signature in a Bigger Cohort of Tumor Types than

ChIP-PED

ChIP-PED is an alternative method to predict TF signatures in biological contexts where ChIPx data are not available. To estimate the accuracy of our TROPIC method, we choose ChIPx data for MYC and SET-DB1, which represent a TF and an epigenetic protein, respectively, and apply the TROPIC method to predict associations of TFs with 68 tumors. Note that throughout the paper, we use the FDR method (Benjamini and Hochberg, 1995) to adjust the P-values for multiple comparisons. However, for a fair comparison in Figure 4, we adjust the P-values of the two methods using Bonferroni’s method, as ChIP-PED does. Using a Bonferroni-adjusted P-value of 0.05 as the threshold, the tumor types predicted by ChIP-PED have significant overlap with those predicted by the TROPIC method, as shown in Figure 4. In particular, the TROPIC method predicts MYC signature in seven tumors (Figure 4A, ChIPx source: UTA GM12878 without MCF7) whereas ChIP-PED predicts MYC signature in a sub-cohort of four tumors. Two types of lymphoma and the K562 cell line are predicted by both methods, which is supported by previous studies (Li et al., 2003; Slack and Gascoyne, 2011). Melanoma is another common tumor type affected by MYC (Zhuang et al., 2008; Leonetti et al., 1996), which is predicted by our method. Similarly, ChIP-PED predicts SET-DB1 signature in a sub-cohort of 6 tumors out of 11 predicted by the TROPIC method, as shown in Figure 4B. Both the TROPIC and ChIP-PED methods predict melanoma as a significant biological context, which is consistent with a recent study (Ceol et al., 2011). The difference between the two methods is likely due to an additional assumption made by ChIP-PED, namely that the target genes and the TF both have significantly high or low expression. In contrast, our TROPIC method sets no threshold on the expression level of TFs and does not match the expression level of target genes to that of TFs. It is reasonable that altered expression of a TF contributes to changes in its target genes, especially given that tumor cells are known to show increased activity of oncogenic TFs (Darnell, 2002). However, increased activity of TFs is not necessarily associated with an increased level of expression. It is known that chromosomal translocations and point mutations in oncogenic TFs, cofactors, or epigenetic proteins can contribute to increased activity of TFs. In addition, decreased activity of TFs, cofactors, or epigenetic proteins can be counted as a feature of the biological context by the TROPIC method, so long as the inactivation leads to a dramatic change in target genes. This extends the power of TROPIC to predict TF signature in a biological context that has inactivated TFs, as commonly observed with chromosomal translocations and truncations. In summary, the TROPIC method can predict the TF signature regardless of the expression level and the activation status of the protein, and thus provides a bigger cohort of tumor types for a specific TF.
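The two multiple-comparison adjustments mentioned in this subsection behave quite differently, which is why the comparison with ChIP-PED puts both methods on the same (Bonferroni) footing. The sketch below implements the standard Bonferroni and Benjamini-Hochberg adjustments on a hypothetical list of P-values; the values are invented for illustration and are not results from the paper.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the adjustment.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values for five biological contexts.
pvals = [0.001, 0.008, 0.02, 0.03, 0.30]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

n_sig_bonf = sum(p < 0.05 for p in bonf)
n_sig_bh = sum(p < 0.05 for p in bh)
```

On this toy input the FDR adjustment flags more contexts than Bonferroni at the same 0.05 threshold, so applying different adjustments to the two methods would bias the comparison.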

3.3 TROPIC Predicts Novel Biological Contexts in Tumors

To test whether the TROPIC method is applicable to other regulators of gene expression, we

further apply the transelliptical topic modeling framework to context-specific analysis of ChIPx

data comprising 38 TFs, cofactors, and epigenetic proteins, and gene expression of 68 tumor types.

3.3.1 Epigenetic Regulators are Relevant to Many Tumor Types

Epigenetic control of gene expression is emerging as a crucial contributor to tumorigenesis and metastasis (Suva et al., 2013). Histone methylation is an important and widespread epigenetic mechanism. Emerging evidence indicates that deregulation of histone methylation contributes to tumor formation (Martin and Zhang, 2005; Greer and Shi, 2012; Dawson and Kouzarides, 2012; Chi et al., 2010). We include several epigenetic regulators in the TROPIC analysis and present the results in Figure 5.

[Figure 4 image; recoverable labels: panel A (MYC) and panel B (SET-DB1); axes "Biological Context" and "Method"; rows TROPIC and ChIP-PED in panel A, and TROPIC 1, ChIP-PED 1, ChIP-PED 2, TROPIC 2 in panel B.]

Figure 4: Comparison between TROPIC and ChIP-PED. (A) Diagram that shows significant biological contexts from 68 tumors for MYC computed from TROPIC and ChIP-PED. The red square indicates an adjusted P-value < 0.05. (B) Diagram that shows significant biological contexts from 68 tumors for SET-DB1 computed from TROPIC and ChIP-PED, where the first two rows indicate the results from one ChIPx dataset, and the last two rows show the results from another ChIPx dataset. The red square indicates an adjusted P-value < 0.05.

SUZ12: Multiple subunits of polycomb repressive complex 2 (PRC2), which trimethylates histone 3 lysine 27, are either mutated or dysregulated in different tumors (Sparmann and van Lohuizen, 2006). SUZ12 is a core subunit of PRC2. Previous studies report altered expression levels of PRC2/SUZ12 in a wide range of human primary tumors, such as T cell acute lymphoblastic leukemia (T-ALL) (Ntziachristos et al., 2012), ovarian (Li et al., 2012, 2007), metastatic prostate (Yu et al., 2007), lung (Martín-Pérez et al., 2010), melanoma (Martín-Pérez et al., 2010), and brain and glial tumors (Crea et al., 2010). To test whether the SUZ12 signature is present in tumors, we apply the TROPIC method to analyze SUZ12 in human tumor samples. The results indicate that the SUZ12 signature is present in 48 out of 68 tumor samples (70.59%), including most of the reported tumor types, as shown in Figure 5. Genetic manipulation of SUZ12 alters tumor proliferation in the context of ovarian cancer and mantle cell lymphoma (Li et al., 2012; Martín-Pérez et al., 2010). However, whether the function of SUZ12 in other tumor types is significant is largely unknown. In addition, traditional screening via expression profiling, somatic mutation mapping, and knockdown underestimates the functional relevance of TFs and transcriptional regulators. It has been reported that portions of SUZ12 are commonly fused to the JAZF1 gene in normal and neoplastic endometrial cells (Li et al., 2007, 2008). JAZF1-SUZ12 contributes to tumorigenesis independently of the expression and sequence of the SUZ12 gene but exhibits a TF signature that can be identified by the TROPIC method (Figure 5, see ovarian tumor: endometroid). Despite a large body of evidence supporting PRC2/SUZ12 as an oncoprotein, a recent study shows that PRC2/SUZ12 acts as a tumor suppressor in T-ALL (Ntziachristos et al., 2012). Our results identify the SUZ12 signature in T-ALL, suggesting that the TROPIC method captures the functional significance of TFs regardless of whether they play a positive or negative role. Together, our data suggest that SUZ12 is an important regulator of gene expression in a broad range of tumor types.

[Figure 5 image omitted. Rows: ChIP proteins (SUZ12, JUND, SETDB1, EP300, GABP, NFKB, FOS, JUN, IRF4, ESR1, RXRA, HNF4A, PAX5), grouped as epigenetic proteins, Hippo pathway, nuclear receptors, and oncogenic TFs; columns: Biological Context.]

Figure 5: Results of TROPIC on 13 TFs on 68 tumor-related biological contexts. A red square indicates an adjusted P-value < 0.05.

SET-DB1: SET-DB1 is another epigenetic regulator; it methylates the histone 3 lysine 9 residue into mono-, di-, and tri-methylated forms (Greer and Shi, 2012). Originally discovered in fruit flies, SET-DB1 in mammals is involved in the maintenance of embryonic stem cells by repressing the expression of developmental regulators (Bilodeau et al., 2009). A recent study reports that SET-DB1 is amplified in melanoma and accelerates tumor onset (Ceol et al., 2011). The same study also finds that the copy number of SET-DB1 is increased in breast, liver, lung, and ovarian tumors. An increased copy number of a gene does not necessarily lead to increased activity, but a significant representation of its TF signature would support a tumor-relevant role for that gene. To verify whether the SET-DB1 signature is present in tumor samples, we include SET-DB1 in the TROPIC analysis. The results show that the SET-DB1 signature is present in 20 sources of tumors, including melanoma, breast, and lung tumors, as shown in Figure 5. In particular, melanoma and Wilms tumors are predicted by both TROPIC and ChIP-PED to be significant biological contexts for SET-DB1 (Figure 4B, first two rows). Whether SET-DB1 is involved in tumorigenesis in Wilms tumor awaits further study. In addition to confirming the reported tumor types, the data suggest that SET-DB1 is an important regulator in several types of blood and solid tumors.

3.3.2 Ets Family Protein GABP is Significantly Associated with Leukemia

GABP: Hippo tumor suppressor signaling is a conserved molecular pathway for the control of organ size and has been implicated in cancer (Harvey et al., 2013; Halder and Johnson, 2011). The Hippo pathway gauges organ size by restricting both cell growth and cell proliferation, as well as by inducing cell death. Dysregulation of Hippo signaling is observed in a broad range of human cancers; however, somatic or germline mutations in the Hippo pathway are uncommon (Harvey and Tapon, 2007). Recently, GA-binding protein (GABP), a member of the ETS transcription factor family, has been found to drive the expression of YAP (Wu et al., 2013b), the effector TF of the Hippo pathway. Loss of GABP down-regulates the level of YAP, resulting in a block at the G1/S phase of the cell cycle and increased cell death, which establishes GABP as an important regulator of the Hippo pathway. We test whether the GABP signature is associated with tumors using the TROPIC method. The results show that the GABP signature is present in 19 sources of tumors, as shown in Figure 5, including melanoma, breast, lung, and prostate tumors, with an enrichment of lymphoblastic leukemia (8/19, 42.11%). It has been reported that the Hippo-YAP pathway is deregulated in solid tumors (breast, lung, colorectal, and liver) (Halder and Johnson, 2011). Our analysis suggests that GABP is a contributing pathogenic factor in breast and lung tumors through the Hippo-YAP pathway.

3.3.3 Classic Oncogenic TFs are Implicated in Many Tumor Types

NF-κB and AP-1: Historically, transcription factors, such as NF-κB (RELA) and AP-1 (FOS,

JUN, JUND, etc.), are among the first cohort of oncogenes. These TFs are master regulators of cell

proliferation, differentiation, survival, stress response, and inflammation, most of which represent

hallmarks of tumor cells (Li and Yang, 2011; Piette et al., 1997; Shaulian and Karin, 2002; Hanahan

and Weinberg, 2011). A large body of studies has implicated the critical role of NF-κB (RELA)

and AP-1 in lymphoma and leukemia (Eferl and Wagner, 2003; Rayet and Gelinas, 1999). We

test whether our method can reveal lymphoma and leukemia as the significant biological contexts

for those proteins. Importantly, chromosomal amplification, over-expression and rearrangement of

these genes contribute to tumorigenesis, and such events are likely to be filtered out by existing methods. We

apply the TROPIC method to NF-κB (RELA), FOS, JUN, and JUND. The results show that many

biological contexts of blood tumors are significant for these TFs whereas IRF4 is not significant in

most of the tumors except in myeloma and T-cell acute lymphoblastic leukemia (T-ALL) (Figure

5) as reported previously (Yoshida et al., 1999). These results demonstrate the high credibility of

TROPIC in predicting biological contexts.

ESR1: Estrogen receptor 1 (ESR1 or estrogen receptor alpha) is a classic steroid nuclear

receptor that is activated by estrogen. Estrogen is a hormone that regulates behavior and physiology. ESR1-deficient mice are sterile with incomplete development of sex organs (Ogawa

et al., 1998; Dupont et al., 2000). ESR1 also acts in other tissues, such as bone and adipose

tissue (Heine et al., 2000; Nakamura et al., 2007). It has been reported that estrogen promotes

apoptosis of osteoblasts by ESR1 and induction of FAS death ligand (Nakamura et al., 2007).

ESR1-deficient mice are obese with increased number and size of adipose tissue (Heine et al.,

2000). ESR1 is involved in the pathogenesis of breast cancer and endometrial cancer. Expression

of ESR1 is widely used as a prognostic marker for breast cancer (Knight et al., 1977; Gruvberger

et al., 2001). The wide spectrum of physiological functions of ESR1 suggests that its pathogenic role extends beyond reproductive tissue-derived cancers. To test this hypothesis, we run the TROPIC analysis with ChIPx data for ESR1 and find that ESR1 is significantly associated with 30 out of 68

tumor-related biological contexts (Figure 5). As expected, breast cancer and ovarian cancer exhibit

ESR1 signature. Surprisingly, many types of B-cell (acute or chronic) lymphoblastic leukemia are

significantly associated with ESR1. It is known that estrogen promotes proliferation and survival

of B cells (Grimaldi et al., 2002; Thurmond et al., 2000). ESR1 pathway may contribute to the

pathogenesis of B-cell lymphoma and leukemia via increasing cell proliferation and survival.

PAX5: Paired box protein 5 (PAX5) is a transcription factor in B cell development and has been

implicated in several types of lymphoma (Shaffer et al., 2002). PAX5 activates a transcriptional

program of various B-cell-specific genes, which is required for directing bone-marrow progenitor

cells to differentiate into B cells (Morrison et al., 1998b). Urbanek et al. (1994) reported that loss

of PAX5 in mice leads to a complete arrest of B cell development at an early precursor stage. PAX5

is also important in the late stage of B cell differentiation. De-regulation of PAX5 is commonly observed in several types of lymphoma in the form of chromosomal translocation. A t(9;14)(p13;q32) chromosomal translocation brings the potent Eμ enhancer of the IgH gene (a gene expressed in

mature B cells) into close proximity of the PAX5 promoter and results in increased expression of

PAX5 in late B-cell differentiation (Busslinger et al., 1996; Iida et al., 1996; Morrison et al., 1998a).

As expected, the significant biological contexts of PAX5 include B-cell lymphoma and leukemia

(i.e. B-ALL and B-CLL) (Figure 5). PAX5 is not only a master regulator of B cell biology, but also

an important pattern organizer in the development of central nervous system and genital tracts

(Urbanek et al., 1997; Bouchard et al., 2000). However, whether PAX5 is implicated in other types

of cancers is not known. Our TROPIC analysis shows that PAX5 is associated significantly with

28 out of 68 tumor biological contexts (Figure 5), including solid tumors from brain (brain and

glial cells) and urogenital organs (ovary and bladder). These observations suggest that the tumorigenic role of PAX5 extends beyond B-cell lymphoma.

3.3.4 Nuclear Receptors RXRA and HNF4A are Broadly Implicated in Tumors

Nuclear receptors represent a superfamily of ligand-activated transcription factors that modulate cell growth, differentiation, survival and metabolism (Mangelsdorf et al., 1995; Evans, 1988). The ligands for nuclear receptors include hormones and metabolites, ranging from retinoic acids (RAs), vitamin D, and steroid hormones to lipid species. Retinoid X receptor A (RXRA) recognizes 9-cis retinoic acid (9-cis RA) and heterodimerizes with other nuclear receptors to modulate cellular function. RAs are widely explored as therapeutics for both blood and solid tumors (Altucci et al., 2007). The presence of an RXRA signature in a tumor would be useful for estimating the plausibility of RA-based therapy in that specific tumor. We examine the significant tumor contexts for RXRA and find that more than 65% of tumor contexts (47/68) are relevant to RXRA (Figure 5). This highlights the important roles of RXRA biology in those tumors. Similarly, another nuclear receptor, hepatocyte nuclear factor 4 A (HNF4A), is significantly associated with a broad spectrum of tumors (33/68, 48.53%), including melanoma, leukemia, breast, cervical, lung, and ovarian tumors (Figure 5). HNF4A has long been regarded as a critical regulator of metabolism and contributes to the pathogenesis of type I diabetes. It is well known that there is a strong link between diabetes and cancer (Gullo et al., 1994; Vigneri et al., 2009). Our results suggest that HNF4A may be a genetic link between diabetes and cancer.

3.4 The Estimate of the True Positive Rate

To evaluate the quality of our results, we randomly select 100 of the TF-biological context pairs that our method identifies as functionally important. Next, we search the existing literature to determine whether the connection in each TF-biological context pair has been experimentally verified. In total, we find that 48/100 pairs have been explicitly verified by biologists. Furthermore, 78/100 pairs have been mentioned in the literature. Thus, very conservatively, the true positive rate of our results is 48-78%. This provides strong evidence that our method can guide biologists to conduct experiments more efficiently.


4 Discussion

We present a semiparametric topic modeling framework to conduct high-throughput TF-biological

contexts analysis. Our approach addresses several key challenges in Big Data analysis, including

high dimensionality, distributional complexity, and data heterogeneity. Theoretically, our method

guarantees a nearly optimal rate of convergence across a wide family of possibly heavy-tailed distributions. Practically, our method is computationally simple and robust to very noisy data. Considering the limited supply of ChIPx data and the massively expanding pool of gene expression profiles,

TROPIC has the potential to assist in the construction of the global regulatory networks of large

numbers of genes.

One drawback of our method compared with ChIP-PED is that our method does not reveal

the detailed regulation pattern of a TF in different biological contexts (e.g., whether this TF acts as

an activator or repressor). A natural way to address the issue is to further divide the feature genes

and target genes into different groups according to the signs of their correlations, and consider the

topics in different groups of genes.

Acknowledgement

Han Liu is supported by NSF Grants III-1116730 and NSF III-1332109, NIH R01MH102339, NIH

R01GM083084, NIH R01HG06841, and FDA HHSF223201000072C. The authors are also grateful for the hospitality of the Simons Institute for the Theory of Computing at UC Berkeley. Min-Dian Li

is supported by a scholarship from the CSC-Yale World Scholars Program and the Glenn/AFAR

Scholarship for Research in the Biology of Aging.

Appendix

A Elliptical and Transelliptical Models

In this section, we briefly review the transelliptical distribution (Han and Liu, 2012) and discuss

its relationship with the other distribution families.

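We start with some notation. For a vector $u = (u_1, \ldots, u_d)^T \in \mathbb{R}^d$, the $\ell_0$, $\ell_p$ and $\ell_\infty$ vector norms are defined as $\|u\|_0 := \mathrm{card}(\mathrm{supp}(u))$, $\|u\|_p := \big(\sum_{j=1}^d |u_j|^p\big)^{1/p}$ and $\|u\|_\infty := \max_{1 \le j \le d} |u_j|$. For a matrix $A = [a_{jk}] \in \mathbb{R}^{d \times d}$, the $\ell_{\max}$-norm is defined as $\|A\|_{\max} := \max_{1 \le j,k \le d} |a_{jk}|$. Let $\mathbb{S}^{d-1} := \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^d$. For any two vectors $a, b \in \mathbb{R}^d$ and two square matrices $A, B \in \mathbb{R}^{d \times d}$, we denote their inner products by $\langle a, b \rangle := a^T b$ and $\langle A, B \rangle := \mathrm{Tr}(A^T B)$ respectively. Throughout the Appendix, we use a generic constant $C$ whose value may vary from line to line.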

The transelliptical model is a semiparametric distribution family in which the nonparametric components provide modeling flexibility, while the parametric components encode the important information we can estimate efficiently. Before describing the transelliptical distribution, we briefly review the elliptical model, which can be viewed as a subfamily of the transelliptical model.

Recall that a random vector $X = (X_1, X_2, \ldots, X_d)^T \in \mathbb{R}^d$ is continuous if the marginal distributions of $X_1, \ldots, X_d$ are all continuous, and we say $X$ possesses a density if $X$ is absolutely continuous with respect to Lebesgue measure. The elliptical distribution is defined below.

Definition A.1. A random vector $X \in \mathbb{R}^d$ (assuming its density exists) follows an elliptical distribution if its density is of the following form:

\[ f(x) = c\,|\Sigma|^{-1/2}\, g\big((x-\mu)^T \Sigma^{-1} (x-\mu)\big), \tag{3} \]

where $\mu \in \mathbb{R}^d$; $\Sigma \in \mathbb{R}^{d \times d}$ is positive definite; $g : \mathbb{R}_+ \to \mathbb{R}_+$ is a univariate function on $[0, \infty)$; and $c$ is a normalization constant. We denote $X \sim EC_d(\mu, \Sigma, g)$.

Remark A.2. In general, we say a random vector $X \in \mathbb{R}^d$ follows an elliptical distribution if it can be represented as $X \overset{d}{=} \mu + \xi A U$, where $\mu \in \mathbb{R}^d$; $A \in \mathbb{R}^{d \times p}$ with $p \le d$, $p = \mathrm{rank}(\Sigma)$ and $AA^T = \Sigma$; $\xi \ge 0$ is a random variable independent of $U$; and $U$ is uniformly distributed on the unit sphere $\mathbb{S}^{p-1}$ in $\mathbb{R}^p$. Here $X$ does not necessarily possess a density, since $\xi$ does not always possess a density and $\Sigma$ is only assumed to be positive semidefinite, so it may not be of full rank. In this paper, we restrict our discussion to elliptical distributions that possess densities.
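The stochastic representation in Remark A.2 can be illustrated with a short sampling sketch. This is only an illustration under stated assumptions: the function names `sample_elliptical` and `chi_radius` are hypothetical, and the radius $\xi$ is taken to be chi-distributed so that the resulting elliptical law is Gaussian with covariance $\Sigma$; other choices of $\xi$ give heavier or lighter tails.

```python
import numpy as np

def sample_elliptical(n, mu, A, xi_sampler, rng):
    """Draw n samples of X = mu + xi * A @ U (representation in Remark A.2)."""
    d, p = A.shape
    G = rng.standard_normal((n, p))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)  # uniform on the unit sphere S^{p-1}
    xi = xi_sampler(n, rng)                           # nonnegative radii, independent of U
    return mu + (xi[:, None] * U) @ A.T

def chi_radius(p):
    # Hypothetical helper: xi ~ chi_p, so that xi * U is standard Gaussian in R^p.
    return lambda n, rng: np.sqrt(rng.chisquare(p, size=n))

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
A = np.linalg.cholesky(Sigma)                         # A @ A.T equals Sigma
mu = np.array([1.0, -1.0])
X = sample_elliptical(200_000, mu, A, chi_radius(2), rng)
```

With this choice of radius, the sample mean and sample covariance of `X` should be close to `mu` and `Sigma`; replacing `chi_radius` with a heavier-tailed radius keeps the elliptical shape but changes the tails.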

Assume that a random vector $X \in \mathbb{R}^d$ possesses a density and a covariance matrix (i.e., the second moments of $X$ are finite). The next proposition (Anderson and Fang, 1990) characterizes the relationship between the matrix $\Sigma$ and the covariance matrix of $X$.

Proposition A.3. If a random vector $X \in \mathbb{R}^d$ follows an elliptical distribution possessing a density as defined in (3), then the matrix $\Sigma \in \mathbb{R}^{d \times d}$ in (3) is a scatter matrix of $X$, i.e., $\Sigma$ is proportional to the covariance matrix of $X$.

The next proposition provides a condition for $(\mu, \Sigma, g)$ to be identifiable for $X$.

Proposition A.4. Let $X = (X_1, \ldots, X_d)^T \in \mathbb{R}^d$ be a random vector. If $X \sim EC_d(\mu, \Sigma, g)$ is continuous and possesses a density, then (i) $\Sigma_{jj} > 0$ for all $j \in \{1, \ldots, d\}$; (ii) $(\mu, \Sigma, g)$ is identifiable for $X$ under the constraint that $\Sigma_{jj} = \mathrm{Var}(X_j)$ for all $j \in \{1, \ldots, d\}$.

In the sequel, we adopt the identifiability condition that $\mathrm{Var}(X_j) = \Sigma_{jj}$ for all $j \in \{1, \ldots, d\}$. In order to model more complex distributions, Han and Liu (2012) extend the elliptical family to the more flexible transelliptical family.

Definition A.5 (Transelliptical Distribution). A continuous random vector $X = (X_1, X_2, \ldots, X_d)^T$ follows a transelliptical distribution, denoted by $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$, if there exist monotone univariate functions $f_1, \ldots, f_d$ such that

\[ (f_1(X_1), \ldots, f_d(X_d))^T \overset{d}{=} Z \sim EC_d(\mu, \Sigma, g). \tag{4} \]

We further assume that each $f_j(\cdot)$ preserves the marginal mean and variance of $X_j$, i.e., $E(X_j) = E(Z_j)$ and $\mathrm{Var}(X_j) = \mathrm{Var}(Z_j)$. This identifiability condition is motivated by the “normal reference rule” (i.e., the model should reduce to a Gaussian model if the data are actually Gaussian). We call the matrix $\Sigma$ the latent covariance matrix of $X$.


Note that the definition of transelliptical distribution is slightly different from the original

definition in Han and Liu (2012), as we impose a different identifiability condition. Namely, the

aim of Han and Liu (2012) is to conduct scale-invariant PCA on the latent correlation matrix of X.

Thus, the identifiability condition in Han and Liu (2012) is that µ = 0 and the diagonal components

of Σ are all 1’s. While this form of identifiability provides ease of estimation, it loses the marginal

location and scale information. Thus we assume that E(Xj) = E(Zj) and Var(Xj) = Var(Zj).

B Estimating Leading Topics

The leading topics of the transelliptical topic model can be estimated using a combination of sparse

semidefinite programming and algorithmic statistics.

Let $X \sim \sum_{m=1}^{M} \pi_m X_m$, where $X_m \sim TE_d(\mu_m, \Sigma_m; Z_m, f_1^{(m)}, \ldots, f_d^{(m)})$. To conduct transelliptical topic analysis, we first need to estimate each $\mu_m$ and $\Sigma_m$ in order to estimate the pooled mean-adjusted covariance matrix. As the transelliptical family contains heavy-tailed and asymmetric distributions, the classical sample mean and sample covariance matrix do not achieve the desired rate of convergence, and new estimation procedures are needed.

B.1 Estimating the Latent Means

Let $X \sim TE_d(\mu_m, \Sigma_m; Z_m, f_1, \ldots, f_d)$. We exploit an M-estimator proposed by Catoni (2012) to estimate the mean of $X$. Let $\mu = (\mu_1, \ldots, \mu_d)^T$. Given $n$ independent samples $x_1, \ldots, x_n$ of $X$, where each $x_i = (x_{i1}, \ldots, x_{id})^T$, we estimate $\mu_j$ using the marginal data $x_{1j}, \ldots, x_{nj}$.

The estimator is defined as follows. Suppose we want to estimate the mean of a random variable $Z$. Let $z_1, \ldots, z_n$ be $n$ independent realizations of $Z$ and let $\psi : \mathbb{R} \to \mathbb{R}$ be a continuous and strictly increasing function satisfying

\[ -\log(1 - z + z^2/2) \le \psi(z) \le \log(1 + z + z^2/2). \]

The estimator for the mean of $Z$ is defined as the unique value $\hat{\mu}$ such that

\[ \sum_{i=1}^{n} \psi\big(\alpha_\delta (z_i - \hat{\mu})\big) = 0, \tag{5} \]

where $\delta$ and $\alpha_\delta$ are two parameters chosen adaptively from the data. For the choices of $\psi$, $\delta$ and $\alpha_\delta$, see Catoni (2012) for more detailed discussions.

For $n$ samples $x_1, x_2, \ldots, x_n$ independently drawn from a random vector $X \in \mathbb{R}^d$, let $E(X) = (\mu_1, \ldots, \mu_d)^T$. Choosing $\delta = d^{-2}/2$, we exploit the estimator in (5) to estimate the marginal means $\mu_1, \ldots, \mu_d$. Theoretically, Catoni (2012) shows that, with probability at least $1 - O(d^{-1})$, we have

\[ \max_{1 \le j \le d} |\hat{\mu}_j - \mu_j| \le C \sqrt{\frac{\log d}{n}}. \tag{6} \]
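The estimating equation (5) can be solved numerically because the left-hand side is monotone in the unknown mean. The following is a minimal sketch, not the authors' implementation: it assumes the widest influence function permitted by the two logarithmic bounds, takes the scale `alpha` as a user-supplied argument (the data-adaptive choice of $\alpha_\delta$ in Catoni (2012) is not reproduced here), and uses plain bisection. The function names are hypothetical.

```python
import numpy as np

def psi(x):
    # Widest influence function allowed by the bounds above:
    # psi(x) = sign(x) * log(1 + |x| + x^2 / 2).
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(z, alpha, n_iter=200):
    """Solve sum_i psi(alpha * (z_i - mu)) = 0 for mu by bisection."""
    z = np.asarray(z, dtype=float)
    lo, hi = z.min(), z.max()      # the score is >= 0 at lo and <= 0 at hi
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if psi(alpha * (z - mid)).sum() > 0:
            lo = mid               # score decreases in mu, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

z = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])   # five inliers and one gross outlier
est = catoni_mean(z, alpha=1.0)                   # alpha = 1.0 is an arbitrary illustrative value
```

In this toy example the estimate stays close to the bulk of the data, whereas the sample mean is pulled far toward the outlier; this is the robustness property that motivates using (5) for heavy-tailed marginals.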

B.2 Estimating Latent Covariance Matrices

In order to estimate the pooled covariance matrix of X, we need to estimate the latent covariance

matrix Σm of each Xm.


For a transelliptically distributed random vector X ∼ TEd(µ,Σ;Z, f1, ..., fd), it is easy to see

that the sample covariance matrix is not a consistent estimator of Σ due to the transformations

f1, ..., fd. It has been demonstrated in Han and Liu (2012) that we can efficiently estimate the latent

correlation matrix, i.e., the correlation matrix of the latent Gaussian random vector Y associated

with X. More specifically, we make use of the Kendall tau correlation matrix as defined below.

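Definition B.1. The sample Kendall tau correlation matrix $\hat{\mathbf{C}} = [\hat{\rho}_{jk}] \in \mathbb{R}^{d \times d}$ is defined as

\[ \hat{\rho}_{jk} = \sin\Big(\frac{\pi}{2} \hat{\tau}_{jk}\Big), \tag{7} \]

for all $j, k \in \{1, \ldots, d\}$, where

\[ \hat{\tau}_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \mathrm{sign}\big((x_{ij} - x_{i'j})(x_{ik} - x_{i'k})\big) \]

if $j \ne k$, and $\hat{\tau}_{jk} = 1$ otherwise.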
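Definition B.1 translates directly into a few lines of array code. The sketch below is an illustration only: the function name is hypothetical, and it forms all $n(n-1)/2$ sample pairs at once, so it is suited to moderate $n$ rather than to very large data sets.

```python
import numpy as np

def kendall_tau_correlation(X):
    """Return the d x d matrix with entries sin(pi/2 * tau_jk), as in (7)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    i, k = np.triu_indices(n, k=1)        # all sample pairs i < i'
    signs = np.sign(X[i] - X[k])          # shape (n(n-1)/2, d)
    tau = signs.T @ signs / len(i)        # average of sign products over pairs
    np.fill_diagonal(tau, 1.0)            # tau_jj = 1 by definition
    return np.sin(0.5 * np.pi * tau)

a = np.array([0.3, -1.2, 2.0, 0.7, -0.5])
X = np.column_stack([a, np.exp(a), -a])   # monotone transformations of one variable
R = kendall_tau_correlation(X)
```

Because the statistic depends only on the signs of pairwise differences, it is unchanged by monotone increasing transformations of each coordinate: here the first two columns are perfectly concordant and the third is perfectly discordant with both, which is why this estimator is a natural fit for the transelliptical family.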

The next proposition from Han and Liu (2012) shows that the Kendall tau correlation matrix enjoys a parametric rate of convergence in the high-dimensional setting with respect to the $\ell_{\max}$ norm.

Proposition B.2. Let $x_1, \ldots, x_n$ be $n$ independent samples of a random vector $X \in \mathbb{R}^d$ following a transelliptical distribution $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$. Let $\mathbf{C}$ be the latent correlation matrix, i.e., the correlation matrix of the latent Gaussian random vector $Y$ associated with $X$, and let $\hat{\mathbf{C}}$ be the Kendall tau correlation matrix introduced in (7). We have, with probability at least $1 - O(d^{-1})$,

\[ \|\hat{\mathbf{C}} - \mathbf{C}\|_{\max} \le C \sqrt{\frac{\log d}{n}}, \tag{8} \]

where $C$ is a generic constant which does not depend on $d$ and $n$.

In our application, we need to estimate the latent covariance matrix. A direct approach is to use the relationship between the correlation matrix $\mathbf{C}$ and the covariance matrix $\Sigma$, namely $\Sigma_{jk} = \rho_{jk} \sigma_j \sigma_k$, where each $\sigma_j$ is the marginal standard deviation of $X_j$.

Next, we construct an estimator for the marginal standard deviations based on (5). Given $n$ samples $z_1, \ldots, z_n$ independently drawn from a random variable $Z$, we first estimate the marginal mean by the estimator defined in (5). Then, we use the same estimator to estimate the mean of $Z^2$ using $z_1^2, \ldots, z_n^2$. Denoting the estimated means of $Z$ and $Z^2$ by $\hat{\mu}$ and $\hat{M}$ respectively, we construct an estimator of the standard deviation of $Z$ by

\[ \hat{\sigma} := \sqrt{\max\{\hat{M} - \hat{\mu}^2, \varepsilon\}}, \tag{9} \]

where $\varepsilon > 0$ is a small positive number. Denote the estimated marginal standard deviations of $X_1, \ldots, X_d$ by $\hat{\sigma}_1, \ldots, \hat{\sigma}_d$. It is easy to see that, with probability at least $1 - O(d^{-1})$,

\[ |\hat{\sigma}_j - \sigma_j| \le C \sqrt{\log d / n} \quad \text{for all } j. \tag{10} \]

Combining the M-estimator for the standard deviations with the Kendall tau correlation matrix defined in (7) gives us a covariance matrix estimator $\hat{\Sigma} = [\hat{\Sigma}_{jk}] \in \mathbb{R}^{d \times d}$, where

\[ \hat{\Sigma}_{jk} = \hat{\sigma}_j \hat{\sigma}_k \hat{\rho}_{jk}, \quad \text{for all } 1 \le j, k \le d, \tag{11} \]

and $\hat{\rho}_{jk}$ is the Kendall tau correlation defined in (7). We will show in the next sections that this covariance matrix estimator enjoys a parametric rate of convergence in the family of transelliptical distributions and is robust in more complex settings.
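The two building blocks, the truncated standard deviation in (9) and the rescaled correlation matrix in (11), can be sketched as follows. The function names are hypothetical, and the inputs (robust estimates of the first and second moments and a Kendall tau correlation matrix) are assumed to have been computed beforehand; the numbers below are made up purely for illustration.

```python
import numpy as np

def robust_std(mu_hat, m2_hat, eps=1e-8):
    # Estimator (9): sqrt(max(M_hat - mu_hat^2, eps)), applied coordinate-wise.
    mu_hat = np.asarray(mu_hat, dtype=float)
    m2_hat = np.asarray(m2_hat, dtype=float)
    return np.sqrt(np.maximum(m2_hat - mu_hat**2, eps))

def latent_covariance(sigma_hat, rho_hat):
    # Estimator (11): Sigma_jk = sigma_j * sigma_k * rho_jk,
    # i.e. D C D with D = diag(sigma_hat).
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    return np.asarray(rho_hat, dtype=float) * np.outer(sigma_hat, sigma_hat)

sigma_hat = robust_std(mu_hat=[1.0, 0.0], m2_hat=[2.0, 4.0])
rho_hat = np.array([[1.0, 0.5], [0.5, 1.0]])
Sigma_hat = latent_covariance(sigma_hat, rho_hat)
```

The floor `eps` plays the role of $\varepsilon$ in (9): it guards against a negative value of $\hat{M} - \hat{\mu}^2$, which can occur because the two moments are estimated separately.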

After estimating the latent mean-adjusted covariance matrix and mean of each $X_m$, we estimate the pooled mean-adjusted covariance matrix $\mathbf{S}$. Suppose that for each $m = 1, \ldots, M$, we have $s_m$ samples of $X_m$. Let $S = \sum_{m=1}^{M} s_m$. The estimator for the pooled latent mean-adjusted covariance matrix $\mathbf{S}$ is constructed as

\[ \hat{\mathbf{S}} = \sum_{m=1}^{M} \frac{s_m}{S} \big(\hat{\Sigma}_m + \hat{\mu}_m \hat{\mu}_m^T\big), \tag{12} \]

where the $\hat{\Sigma}_m$ and $\hat{\mu}_m$ are estimated by (11) and (5) respectively, and $\hat{\mu} = \sum_{m=1}^{M} \frac{s_m}{S} \hat{\mu}_m$. It follows immediately that, with probability at least $1 - O(d^{-1})$,

\[ \|\hat{\mathbf{S}} - \mathbf{S}\|_{\max} \le C \sqrt{\log d / n}. \tag{13} \]

B.3 Estimating Leading Topics

As we have discussed in Section 2.2, the topics of the random vector $X \sim \sum_{m=1}^{M} \pi_m X_m$, where each $X_m \sim TE_d(\mu_m, \Sigma_m; Z, f_1^{(m)}, \ldots, f_d^{(m)})$, are defined as the leading eigenvectors of the pooled latent mean-adjusted covariance matrix $\mathbf{S}$ defined in (1). We further assume that the leading eigenvectors $v_1, \ldots, v_T$ are $s$-sparse, i.e., $\|v_t\|_0 \le s$ for each $t = 1, \ldots, T$.

Given the estimators $\hat{\Sigma}_m$ of the latent covariance matrices defined in (11), we first analyze the concentration of the spectral norm of $\hat{\Sigma}_m - \Sigma_m$, where $\Sigma_m$ is the covariance matrix of the latent Gaussian mixture random vector $Y_m$ associated with $X_m$.

Theorem B.3. Given $n$ i.i.d. samples $x_1, \ldots, x_n$ of a random vector $X \in \mathbb{R}^d$, where $X$ follows a transelliptical distribution, i.e., $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$, let $\hat{\sigma} = (\hat{\sigma}_1, \ldots, \hat{\sigma}_d)^T$ be the estimated marginal standard deviations derived from Catoni's estimator, let $\hat{\mathbf{C}} = [\hat{\rho}_{jk}] \in \mathbb{R}^{d \times d}$ be the estimated Kendall tau correlation matrix defined in (7), and let $\hat{\mathbf{D}} = \mathrm{diag}(\hat{\sigma})$. Let $\hat{\Sigma} = \hat{\mathbf{D}} \hat{\mathbf{C}} \hat{\mathbf{D}}$. Under the “sign sub-Gaussian condition” (Han and Liu, 2013), we have, with probability at least $1 - O(d^{-1})$,

\[ \|\hat{\Sigma} - \Sigma\|_2 \le C \sqrt{\frac{d \log d}{n}}, \tag{14} \]

where $C$ is a constant.

Furthermore, let $\eta(\hat{\Sigma} - \Sigma, s) = \sup_{v \in \mathbb{S}^{d-1} \cap B_0(s)} v^T (\hat{\Sigma} - \Sigma) v$. Then, with probability at least $1 - O(d^{-1})$,

\[ \eta(\hat{\Sigma} - \Sigma, s) \le C \sqrt{\frac{s \log d}{n}}, \tag{15} \]

where $C$ is a constant.


Proof. We have

‖DCD−DCD‖2 (16)

≤ ‖DC(D−D) + (DC−DC)D‖2≤ ‖DC−DC‖2‖D‖2 + ‖DC‖2‖D− D‖2≤ ‖DC + (D−D)C−DC‖2‖D‖2 + ‖DC‖2‖D− D‖2≤ ‖D‖22‖C−C‖2 + ‖D‖2‖D−D‖2‖C‖2 + ‖DC‖2‖D− D‖2≤ σ2max‖C−C‖2 + ‖D‖2‖D−D‖2‖C‖2 + ‖D‖2‖D−D‖2‖C‖2. (17)

We then consider the three terms of (17) one by one.

For the first term, by Han and Liu (2013), with probability at least 1−O(d−1),

σ2max‖C−C‖2 ≤ C√dlog d/n, (18)

for some constant C.

For the second term, we have, with probability at least 1−O(d−1),

‖D‖2‖D−D‖2‖C‖2 = ‖D−D‖22‖C‖2 + ‖D‖2‖D−D‖2‖C‖2 ≤ C√

log d/n, (19)

where C is a constant, and the inequality holds by (10).

For the last term, we have that, with probability at least 1−O(d−1),

‖D‖2‖D−D‖2‖C‖2 ≤ ‖D‖2‖D−D‖2‖C−C‖2 + ‖D‖2‖D−D‖‖C‖2≤ σmaxC1

√d log d/n+ C2

√log d/n

≤ C√d log d/n, (20)

where C1, C2, C are constants, and the first inequality has been proved by Han and Liu (2013)

under the assumption that maxj=1,...,d σj ≤ σmax and (10).

Combining (18), (19) and (20) together, (14) holds as desired.

Next, we establish a concentration result for the sparse spectral norm η(S − S, s). For any

v ∈ B0(s) ∩ Sd−1, we have

|vT (DCD−DCD)v|= |vT (DCD−DCD)v|= |vT (DC

(D−D) + (DC−DC)D

)v|

≤ |vT (D−D)CDv|+ |vTD(C−C)Dv|+ |vT DC(D−D)v|. (21)

Now, we bound the three terms in (21) one by one. For the first term, we have that for any $v \in \mathbb{B}_0(s) \cap \mathbb{S}^{d-1}$, with probability at least $1-O(d^{-1})$,

$$|v^T(\hat{D}-D)\hat{C}Dv| \le |v^T(\hat{D}-D)CDv| + |v^T(\hat{D}-D)(\hat{C}-C)Dv| \le C\sqrt{\frac{\log d}{n}}\cdot\sqrt{\frac{s\log d}{n}}, \tag{22}$$

for some constant $C$, where the last inequality is by (10) and Han and Liu (2013).

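For the second term, we have that for any $v \in \mathbb{B}_0(s) \cap \mathbb{S}^{d-1}$, with probability at least $1-O(d^{-1})$,

$$|v^T D(\hat{C}-C)Dv| \le \sigma_{\max}^2 \max_{u \in \mathbb{B}_0(s) \cap \mathbb{S}^{d-1}} |u^T(\hat{C}-C)u| \le C\sqrt{\frac{s\log d}{n}}, \tag{23}$$

for some constant $C$, where the first inequality holds because $D$ is diagonal, so that $Dv$ is $s$-sparse with $\|Dv\|_2 \le \sigma_{\max}$, and the last inequality is by Han and Liu (2013).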

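For the third term, we have that for any $v \in \mathbb{B}_0(s) \cap \mathbb{S}^{d-1}$, with probability at least $1-O(d^{-1})$,

$$|v^T\hat{D}\hat{C}(\hat{D}-D)v| \le |v^T\hat{D}(\hat{C}-C)(\hat{D}-D)v| + |v^T\hat{D}C(\hat{D}-D)v| \le C_1\sqrt{\frac{\log d}{n}}\cdot\sqrt{\frac{s\log d}{n}} + C_2\sqrt{\frac{\log d}{n}}\cdot\sqrt{\frac{s\log d}{n}}, \tag{24}$$

for some constants $C_1$ and $C_2$, where the last inequality follows from (10) and Han and Liu (2013).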

Plugging (22), (23) and (24) into (21), (15) follows as desired.

As a consequence of Theorem B.3 and (5), by the triangle inequality and induction, we have the

following result for the rate of convergence of the estimated pooled mean-adjusted latent covariance.

Corollary B.4. The estimated pooled mean-adjusted latent covariance matrix $\hat{S}$ defined in (13) satisfies, with probability at least $1-O(d^{-1})$,

$$\|\hat{S}-S\|_2 \le C_1\sqrt{\frac{d\log d}{n}}, \quad \text{and} \quad \eta(\hat{S}-S, s) \le C_2\sqrt{\frac{s\log d}{n}}, \tag{25}$$

for some constants $C_1$ and $C_2$.

Given the fast rate of convergence of S, we exploit the truncated power method (Yuan and

Zhang, 2013) and the Fantope Projection and Selection method (Vu et al., 2013) to estimate the

topics.

The truncated power (TPower) method is a modification of the power method for computing the leading eigenvector. To compute the $s$-sparse leading eigenvector of a matrix $S$, we solve the following optimization problem:

$$\max_{v}\ v^T S v, \quad \text{subject to } v \in \mathbb{S}^{d-1} \text{ and } \|v\|_0 \le s. \tag{26}$$

The TPower method approximately solves (26) iteratively. At the $k$-th iteration, we have an intermediate eigenvector $y_k$. We sort the absolute values of the elements of $y_k$, then set to zero all the elements of $y_k$ except for the $s$ elements with the largest absolute values. For $y = (y_1, \dots, y_d)^T \in \mathbb{R}^d$, we denote the truncation with respect to the set $A$ by $y(A) = (y_1 \cdot I(1 \in A), \dots, y_d \cdot I(d \in A))^T$. In more detail, at each iteration of the power method, we project the intermediate eigenvector onto the set $\mathbb{S}^{d-1}$ and the $\ell_0$ ball with radius $s$ by letting $y_{k+1} = y_k(A_k)/\|y_k(A_k)\|_2$, where $A_k$ is the set of indices of $y_k$ with the largest $s$ absolute values. Yuan and Zhang (2013) show that, with suitable initialization, the solution of the truncated power method converges to the sparse leading eigenvector at a fast parametric rate.
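The truncation and renormalization step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the intermediate vector at each iteration is taken to be the power step `S @ v`, the names `truncate` and `tpower`, the iteration cap, and the toy matrix are all hypothetical.

```python
import numpy as np


def truncate(y, s):
    """Keep the s entries of y with largest absolute value and zero out the rest."""
    out = np.zeros_like(y)
    keep = np.argsort(np.abs(y))[-s:]
    out[keep] = y[keep]
    return out


def tpower(S, v0, s, n_iter=200, tol=1e-10):
    """Truncated power iteration: power step, truncate to s entries, renormalize."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(n_iter):
        y = truncate(S @ v, s)
        norm = np.linalg.norm(y)
        if norm == 0.0:
            break  # degenerate direction; keep the previous iterate
        y = y / norm
        converged = np.linalg.norm(y - v) < tol
        v = y
        if converged:
            break
    return v


# Toy example: S has a 2-sparse leading eigenvector v_true.
d = 6
v_true = np.zeros(d)
v_true[:2] = 1.0 / np.sqrt(2.0)
S = 4.0 * np.outer(v_true, v_true) + np.eye(d)
v_hat = tpower(S, np.ones(d), s=2)
```

The starting point matters: a start that is nearly orthogonal to the target can lock the iteration onto the wrong support, which is why a careful initialization is used in practice.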


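Algorithm 1 Algorithm to estimate the first T latent topics

Input: $n$ independent samples drawn from the random variable $X$.

Output: $\hat{v}_1, \hat{v}_2, \dots, \hat{v}_T$, which are $k$-sparse.

    Compute the pooled mean-adjusted latent covariance estimate $\hat{S}$ of (13), built from $\hat{\Sigma}_m + \hat{\mu}_m\hat{\mu}_m^T$ for $m = 1, \dots, M$.
    Let $\hat{S}_1 = \hat{S}$.
    for $t = 1, \dots, T$ do
        $\hat{P}_t \leftarrow \arg\max_{P \in \mathcal{F}^1} \langle \hat{S}_t, P\rangle - \lambda\|P\|_{1,1}$, where $\mathcal{F}^1 := \{P : 0 \preceq P \preceq I_d \text{ and } \mathrm{tr}(P) = 1\}$.
        $\tilde{v}_t \leftarrow \Lambda_1(\hat{P}_t)$.
        $\hat{v}_t \leftarrow \mathrm{TPower}(\hat{S}_t, \tilde{v}_t, k)$.
        $\hat{S}_{t+1} \leftarrow (I_d - \hat{v}_t\hat{v}_t^T)\hat{S}_t(I_d - \hat{v}_t\hat{v}_t^T)$.
    end for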

We initialize the truncated power algorithm using the Fantope Projection and Selection (FPS) method (Vu et al., 2013). To estimate the projection matrix onto the subspace spanned by the $p$ leading eigenvectors, FPS defines a sparse principal subspace estimator $\hat{P}$ as the solution of the semidefinite program

$$\max_{P}\ \langle \hat{S}, P\rangle - \lambda\|P\|_{1,1}, \quad \text{subject to } P \in \mathcal{F}^p, \tag{27}$$

where $\mathcal{F}^p := \{P : 0 \preceq P \preceq I_d \text{ and } \mathrm{tr}(P) = p\}$. Note that when $p = 1$, (27) coincides with the formulation of sparse PCA by d'Aspremont et al. (2007). Under mild assumptions, Vu et al. (2013) show that, with properly chosen $\lambda$, the solution $\hat{P}$ of (27) converges at a fast parametric rate. In our particular application, we take $p = 1$. In this case, Vu et al. (2013) also show that the leading eigenvector of $\hat{P}$ converges to the leading eigenvector of the covariance matrix at a fast parametric rate.
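One common way to solve a program of the form (27) is ADMM with a Fantope projection step and an entrywise soft-thresholding step. The sketch below is written under that assumption and is not taken from the paper: the function names, the penalty parameter `rho`, the fixed iteration count, and the toy input are all hypothetical choices made for illustration.

```python
import numpy as np


def fantope_projection(A, p=1, n_bisect=100):
    """Frobenius projection of symmetric A onto {P : 0 <= P <= I, tr(P) = p}."""
    gamma, U = np.linalg.eigh(A)
    # Find theta with sum(clip(gamma - theta, 0, 1)) = p by bisection.
    lo, hi = gamma.min() - 1.0, gamma.max()
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if np.clip(gamma - mid, 0.0, 1.0).sum() > p:
            lo = mid
        else:
            hi = mid
    shrunk = np.clip(gamma - hi, 0.0, 1.0)
    return (U * shrunk) @ U.T


def soft_threshold(A, tau):
    """Entrywise soft-thresholding, the proximal map of tau * ||.||_{1,1}."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)


def fps(S, lam, p=1, rho=1.0, n_iter=200):
    """ADMM sketch for max <S, P> - lam * ||P||_{1,1} over the Fantope."""
    d = S.shape[0]
    Y = np.zeros((d, d))
    U = np.zeros((d, d))
    for _ in range(n_iter):
        X = fantope_projection(Y - U + S / rho, p)
        Y = soft_threshold(X + U, lam / rho)
        U = U + X - Y
    return X


# Toy example: the leading eigenvector of S is supported on two coordinates.
d = 6
v_true = np.zeros(d)
v_true[:2] = 1.0 / np.sqrt(2.0)
S = 4.0 * np.outer(v_true, v_true) + np.eye(d)
P_hat = fps(S, lam=0.1)
v_init = np.linalg.eigh(P_hat)[1][:, -1]  # leading eigenvector of P_hat
```

Returning the Fantope-projected iterate `X` keeps the trace constraint satisfied up to the bisection tolerance. The leading eigenvector `v_init` is the kind of starting point that would then be refined by the truncated power method.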

More specifically, denote $\hat{S}_1 = \hat{S}$. At the $t$-th iteration, we first estimate the projection matrix onto the subspace spanned by the leading eigenvector of $\hat{S}_t$ by solving the semidefinite programming problem (27) with $p = 1$ and input matrix $\hat{S}_t$. Denote this estimator by $\hat{P}_t$ and its leading eigenvector by $\tilde{v}_t$. Next, we use the truncated power method to refine $\tilde{v}_t$. The output of the truncated power method is our estimator $\hat{v}_t$ for the $t$-th topic.

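After estimating the $t$-th leading vector, we deflate the matrix $\hat{S}_t$ by the vector $\hat{v}_t$ to generate a new matrix $\hat{S}_{t+1}$:

$$\hat{S}_{t+1} := (I_d - \hat{v}_t\hat{v}_t^T)\hat{S}_t(I_d - \hat{v}_t\hat{v}_t^T).$$

The resulting matrix satisfies $\hat{S}_{t+1}\hat{v}_t = 0$, that is, its row and column spaces are orthogonal to $\hat{v}_t$. Next, we adopt the truncated power method to estimate $\hat{v}_{t+1}$ using the input matrix $\hat{S}_{t+1}$, with the starting point $\tilde{v}_{t+1}$ computed by FPS.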

Denote the output of the TPower method with input matrix A, starting point w and tuning

parameter k by TPower(A, w, k), and denote the leading eigenvector of a matrix A by Λ1(A). The

algorithm to estimate the first T latent topics is summarized in Algorithm 1.
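The overall loop of Algorithm 1 (initialize, refine with TPower, deflate) can be sketched as follows. To keep the example short and self-contained, the FPS initialization is replaced by the dense leading eigenvector of the current matrix. This substitution is an assumption made purely for illustration and is not the paper's initializer; the names `estimate_topics` and `init` and the toy matrix are likewise hypothetical.

```python
import numpy as np


def truncate(y, k):
    """Keep the k entries of y with largest absolute value and zero out the rest."""
    out = np.zeros_like(y)
    keep = np.argsort(np.abs(y))[-k:]
    out[keep] = y[keep]
    return out


def tpower(S, v0, k, n_iter=100):
    """Truncated power iteration started from v0."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(n_iter):
        y = truncate(S @ v, k)
        norm = np.linalg.norm(y)
        if norm == 0.0:
            break
        v = y / norm
    return v


def estimate_topics(S_hat, T, k, init=None):
    """Estimate T sparse topics by initialize -> TPower -> projection deflation."""
    if init is None:
        # Stand-in for FPS (assumption): dense leading eigenvector of the current matrix.
        init = lambda A: np.linalg.eigh(A)[1][:, -1]
    d = S_hat.shape[0]
    S_t = S_hat.copy()
    topics = []
    for _ in range(T):
        v_tilde = init(S_t)
        v_hat = tpower(S_t, v_tilde, k)
        topics.append(v_hat)
        proj = np.eye(d) - np.outer(v_hat, v_hat)
        S_t = proj @ S_t @ proj  # deflation step
    return np.column_stack(topics)


# Toy example with two sparse topics on disjoint supports.
d = 6
v1 = np.zeros(d)
v1[:2] = 1.0 / np.sqrt(2.0)
v2 = np.zeros(d)
v2[2:4] = 1.0 / np.sqrt(2.0)
S = 5.0 * np.outer(v1, v1) + 3.0 * np.outer(v2, v2) + 0.1 * np.eye(d)
topics = estimate_topics(S, T=2, k=2)
```

Passing a different `init` callable, for example one that solves (27) and returns the leading eigenvector of its solution, recovers the structure of Algorithm 1 more closely.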

Next, we establish a concentration result as well as the consistency of the topic model estimator.

In particular, we prove convergence of the estimators vt computed by Algorithm 1.

Since the optimization problem in (26) is combinatorial and NP-hard, we solve (26) approximately by adopting the FPS and TPower methods. In the next theorem, we prove that our estimator $\hat{v}_t$ generated by Algorithm 1 enjoys a fast parametric rate of convergence.

Theorem B.5. Let $X \sim T(S; M, s)$. We assume that the first $T$ eigenvalues of $S$, $\lambda_1, \dots, \lambda_T$, are separated by a gap $\lambda_t - \lambda_{t+1} \ge C_d$ for all $t = 1, \dots, T$, where $C_d > 0$. If $k \ge s$ and $k \le cs$ for some constant $c > 1$, we have, with probability at least $1-O(d^{-1})$,

$$\|\hat{v}_t - v_t\|_2 \le C\sqrt{\frac{s\log d}{n}}, \tag{28}$$

for some constant $C$.

Proof. We prove the result for the case $t = 1$. The results for $t = 2, \dots, T$ follow by induction. We first characterize the initial points $\tilde{v}_t$ computed by the FPS method. By (13) and Theorem 3.3 of Vu et al. (2013), it holds that, with probability at least $1-O(d^{-1})$,

$$\sin(\tilde{v}_t, v_t) \le Cs\sqrt{\frac{\log d}{n}},$$

for some constant $C$.

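Assume $s < k < Cs$ for some constant $C$. Since the FPS initial point $\tilde{v}_t$ satisfies $\sin(\tilde{v}_t, v_t) = \sqrt{1 - (\tilde{v}_t^T v_t)^2}$, the bound above shows that, if the sample size $n$ is large enough that $\log d = o(n)$,

$$\tilde{v}_t^T v_t \ge c\left(\eta(\hat{S}-S, s+2k) + \sqrt{s/k}\right). \tag{29}$$

Recall that $\eta(\hat{S}-S, k) = \max_{v \in \mathbb{S}^{d-1} \cap \mathbb{B}_0(k)} v^T(\hat{S}-S)v$.

By Yuan and Zhang (2013), we have that when (29) is satisfied with $k > s$,

$$\|\hat{v}_t - v_t\|_2 \le \eta(\hat{S}-S, k).$$

By (25) and $k < Cs$, we have that, with probability at least $1-O(d^{-1})$,

$$\|\hat{v}_t - v_t\|_2 \le C\sqrt{\frac{s\log d}{n}},$$

for some constant $C$, as desired.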

Therefore our estimator achieves the minimax optimal rate of convergence (Cai et al., 2013).

C TROPIC for TF-Biological Context Analysis

We provide a protocol to utilize the estimated gene topics to conduct high-throughput transcription

factor-biological context analysis.

We first estimate the latent topics using Algorithm 1. Note that we need to properly choose the tuning parameter $k$. In particular, given the initialization $\tilde{v}_t$ provided by the FPS method (27), we start with $k = 100$ and test whether the gene set corresponding to the nonzero entries of the output vector $\hat{v}_t$ is functionally enriched (Efron and Tibshirani, 2007). If not, we decrease $k$ by one and recompute the sparse leading eigenvector until the gene set is enriched. In our analysis, we found that $k = 30$ is a good heuristic value. We choose $T = 500$ topics since their nonzero elements cover more than 99% of the genes in our dataset.
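The search over $k$ described above can be sketched as a simple loop. Both callables are placeholders and not part of any real library: `fit_topic(k)` is assumed to return the $k$-sparse topic vector, and `is_enriched(genes)` is assumed to wrap whatever gene-set enrichment test is in use. The stopping behavior when no value of $k$ passes is also an assumption.

```python
def select_sparsity(fit_topic, is_enriched, k_start=100, k_min=1):
    """Decrease k from k_start until the topic's gene set passes the enrichment test.

    fit_topic(k)       -> sequence of length d, the k-sparse topic (hypothetical callable)
    is_enriched(genes) -> bool, enrichment decision for a set of gene indices (hypothetical)
    """
    for k in range(k_start, k_min - 1, -1):
        topic = fit_topic(k)
        genes = {i for i, value in enumerate(topic) if value != 0}
        if is_enriched(genes):
            return k, genes
    raise ValueError("no sparsity level in the search range gave an enriched gene set")


# Toy stand-ins: the topic keeps its first k coordinates, and a set counts as
# "enriched" once it has at most 30 genes.
toy_fit = lambda k: [1.0] * k + [0.0] * (200 - k)
toy_enriched = lambda genes: len(genes) <= 30
chosen_k, chosen_genes = select_sparsity(toy_fit, toy_enriched)
```

Because each step refits the topic, the loop costs one TPower run per candidate value of $k$.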

Given the $j$-th TF, we adopt the methods introduced by Ji et al. (2008) and Wu and Ji (2013) to obtain top-ranked target genes. More specifically, given the ChIPx data for the $j$-th TF, we use CisGenome (Ji et al., 2008) to detect the significant peaks. We then annotate the top $r = \min\{w, 1000\}$ significant ChIPx peaks to identify the TF-bound gene targets of the TF. Finally, we use ChIPXpress (Wu and Ji, 2013) to rank the target genes, and we denote by $G_j$ the collection of the first $r$ target genes ranked by ChIPXpress.

Suppose we have $n_m$ samples of the $m$-th biological context, $x_1, \dots, x_{n_m}$, drawn from a transelliptical vector $X_m$. We estimate its latent mean-adjusted covariance matrix by $\hat{S}_m$ using (5) and (11). Next, we compute the leading eigenvector $\hat{v}^{(m)}$ of $\hat{S}_m$, which is called the $m$-th context topic. Since the sample size $n_m$ is in general small (e.g., we have only 26 samples of Ewing sarcoma), the estimated context topic is unreliable. To handle this issue, we stabilize the estimation of $\hat{v}^{(m)}$ by regressing it on the population topic dictionary $\{\hat{v}_1, \dots, \hat{v}_T\}$ to identify which population topics explain most of the variability of $\hat{v}^{(m)}$ (with adjusted P-value less than 0.05 by Bonferroni's method). For each $\hat{v}^{(m)}$, we denote the set of its $K$ significant population topics by $\mathcal{S}_m = \{\hat{v}_{m_1}, \dots, \hat{v}_{m_K}\}$. We construct a binary vector $v_m^{(g)} \in \mathbb{R}^d$, named the $m$-th feature gene vector. Let $v_m^{(g)}(i) = 1$ if there exists $k \in \{1, \dots, K\}$ such that $\hat{v}_{m_k}(i) \ne 0$, and $v_m^{(g)}(i) = 0$ otherwise. The genes corresponding to the nonzero entries of the vector $v_m^{(g)}$ represent the important features of biological context $m$.
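The construction of the feature gene vector can be sketched as follows. The regression itself is not reproduced here: the per-topic p-values are assumed to be supplied by an upstream regression of the context topic on the dictionary. The function names and the toy inputs are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Indices whose Bonferroni-adjusted p-value, min(1, m * p), is below alpha."""
    m = len(p_values)
    return [k for k, p in enumerate(p_values) if min(1.0, p * m) < alpha]


def feature_gene_vector(topics, selected):
    """Binary vector marking genes that are nonzero in at least one selected topic."""
    d = len(topics[0])
    return [1 if any(topics[k][i] != 0 for k in selected) else 0 for i in range(d)]


# Toy dictionary of three sparse population topics over five genes.
topics = [
    [0.6, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, -0.8],
]
# Hypothetical regression p-values, one per population topic.
p_values = [0.001, 0.2, 0.01]

selected = bonferroni_select(p_values)
v_g = feature_gene_vector(topics, selected)
```

The feature gene vector is the indicator of the union of the supports of the significant topics, so its number of ones is at most $K$ times the topic sparsity.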

To identify the TF-biological context association, let $k_{mj}$ be the number of genes present in both $v_m^{(g)}$ and $G_j$. We further trim the identified target gene list of the $j$-th TF to include only the top $k_{mj}$ significant target genes. Let $u_j^{(m)}$ be a binary vector where $u_j^{(m)}(i) = 1$ if and only if the gene corresponding to the $i$-th position is among the top $k_{mj}$ target genes. We then test whether $u_j^{(m)}$ and $v_m^{(g)}$ have significantly more overlap than expected under random selection (with adjusted P-values less than 0.05 by FDR control (Benjamini and Hochberg, 1995) among the $M$ biological contexts). If so, this gives strong evidence that the feature genes of the $m$-th biological context are regulated by the top target genes of the $j$-th TF, and we conclude that the $j$-th TF is functionally associated with the $m$-th biological context.
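One way to make the overlap test concrete is a one-sided hypergeometric tail probability followed by the Benjamini-Hochberg adjustment across contexts. The text only states that the overlap is compared with random selection, so the hypergeometric null used below is an assumption, as are the function names and the toy vectors.

```python
from math import comb


def overlap_p_value(u, v):
    """P(overlap >= observed) when the ones of u are placed uniformly at random.

    u, v are binary sequences of equal length d. The null distribution is
    hypergeometric (an assumed reading of 'random selection').
    """
    d = len(u)
    n_u, n_v = sum(u), sum(v)
    observed = sum(1 for a, b in zip(u, v) if a and b)
    tail = sum(
        comb(n_v, i) * comb(d - n_v, n_u - i)
        for i in range(observed, min(n_u, n_v) + 1)
    )
    return tail / comb(d, n_u)


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with d = 10 genes.
v_g = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]     # feature genes of one context
u_full = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # TF targets coincide with the feature genes
u_none = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]  # TF targets miss the feature genes entirely
p_full = overlap_p_value(u_full, v_g)
p_none = overlap_p_value(u_none, v_g)

# Adjusting a hypothetical list of per-context p-values for one TF.
p_adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A TF-context pair would then be called associated when its adjusted p-value falls below 0.05, matching the threshold stated in the protocol.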

References

Altucci, L., Leibowitz, M. D., Ogilvie, K. M., de Lera, A. R. and Gronemeyer, H.

(2007). RAR and RXR modulation in cancer and metabolic disease. Nat. Rev. Drug. Discov., 6

793–810.

Anderson, T. W. and Fang, K.-T. (1990). Statistical Inference in Elliptically Contoured and

Related Distributions. Allerton Press New York.

Arrowsmith, C. H., Bountra, C., Fish, P. V., Lee, K. and Schapira, M. (2012). Epigenetic

protein families: A new frontier for drug discovery. Nat. Rev. Drug. Discov., 11 384–400.

Bakalov, A., McCallum, A., Wallach, H. and Mimno, D. (2012). Topic models for taxonomies. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 237–240.

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M.,

Marshall, K. A., Phillippy, K. H., Sherman, P. M., Holko, M. et al. (2013). NCBI

GEO: Archive for functional genomics data sets update. Nucleic. Acids. Res., 41 D991–D995.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and

powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 289–300.

Bilodeau, S., Kagey, M. H., Frampton, G. M., Rahl, P. B. and Young, R. A. (2009).

SetDB1 contributes to repression of genes encoding developmental regulators and maintenance

of ES cell state. Gene. Dev., 23 2484–2489.

Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn.

Res., 3 993–1022.

Bouchard, M., Pfeffer, P. and Busslinger, M. (2000). Functional equivalence of the transcription factors PAX2 and PAX5 in mouse development. Development, 127 3703–3713.

Boulesteix, A.-L. and Strimmer, K. (2005). Predicting transcription factor activities from

combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol.

Med. Model., 2 23.

Busslinger, M., Klix, N., Pfeffer, P., Graninger, P. G. and Kozmik, Z. (1996). Deregulation of PAX-5 by translocation of the Emu enhancer of the IgH locus adjacent to two alternative PAX-5 promoters in a diffuse large-cell lymphoma. P. Natl. Acad. Sci., 93 6129–6134.

Cai, T. T., Ma, Z. and Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation.

Ann. Stat., 41 3074–3110.

Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q. and West, M. (2008).

High-dimensional sparse factor modeling: Applications in gene expression genomics. J. Am. Stat.

Assoc., 103.

Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study.

Ann. I. H. Poincare-Pr., 48 1148–1185.

Ceol, C. J., Houvras, Y., Jane-Valbuena, J., Bilodeau, S., Orlando, D. A., Battisti, V., Fritsch, L., Lin, W. M., Hollmann, T. J., Ferre, F. et al. (2011). The histone methyltransferase SETDB1 is recurrently amplified in melanoma and accelerates its onset. Nature, 471 513–517.

Chi, P., Allis, C. D. and Wang, G. G. (2010). Covalent histone modifications-miswritten,

misinterpreted and mis-erased in human cancers. Nat. Rev. Cancer, 10 457–469.


Crea, F., Hurt, E. M. and Farrar, W. L. (2010). Clinical significance of polycomb gene

expression in brain tumors. Mol. Cancer, 9 265.

Darnell, J. E. (2002). Transcription factors as targets for cancer therapy. Nat. Rev. Cancer, 2

740–749.

d’Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. (2007). A direct

formulation for sparse PCA using semidefinite programming. SIAM Rev., 49 434–448.

Dawson, M. A. and Kouzarides, T. (2012). Cancer epigenetics: From mechanism to therapy.

Cell, 150 12–27.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R.

(1990). Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci., 41 391–407.

Dupont, S., Krust, A., Gansmuller, A., Dierich, A., Chambon, P. and Mark, M. (2000). Effect of single and compound knockouts of estrogen receptors alpha (ERalpha) and beta (ERbeta) on mouse reproductive phenotypes. Development, 127 4277–4291.

Edgar, R., Domrachev, M. and Lash, A. E. (2002). Gene expression omnibus: NCBI gene

expression and hybridization array data repository. Nucleic Acids Res., 30 207–210.

Eferl, R. and Wagner, E. F. (2003). AP-1: A double-edged sword in tumorigenesis. Nat. Rev.

Cancer, 3 859–868.

Efron, B. and Tibshirani, R. (2007). On testing the significance of sets of genes. Ann. Appl.

Stat. 107–129.

Evans, R. M. (1988). The steroid and thyroid hormone receptor superfamily. Science, 240

889–895.

Faith, J. J., Hayete, B., Thaden, J. T., Mogno, I., Wierzbowski, J., Cottarel, G.,

Kasif, S., Collins, J. J. and Gardner, T. S. (2007). Large-scale mapping and validation of

escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol.,

5 e8.

Greer, E. L. and Shi, Y. (2012). Histone methylation: A dynamic mark in health, disease and

inheritance. Nat. Rev. Genet., 13 343–357.

Grimaldi, C. M., Cleary, J., Dagtas, A. S., Moussai, D., Diamond, B. et al. (2002).

Estrogen alters thresholds for B cell apoptosis and activation. J. Clin. Invest., 109 1625–1633.

Gruvberger, S., Ringner, M., Chen, Y., Panavally, S., Saal, L. H., Borg, A., Ferno,

M., Peterson, C. and Meltzer, P. S. (2001). Estrogen receptor status in breast cancer is

associated with remarkably distinct gene expression patterns. Cancer Res., 61 5979–5984.

Gullo, L., Pezzilli, R. and Morselli-Labate, A. M. (1994). Diabetes and the risk of pancreatic cancer. New Engl. J. Med., 331 81–84.

Halder, G. and Johnson, R. L. (2011). Hippo signaling: Growth control and beyond. Development, 138 9–22.

Han, F. and Liu, H. (2012). Transelliptical component analysis. In NIPS. 368–376.

Han, F. and Liu, H. (2013). Optimal rates of convergence of transelliptical component analysis.

arXiv preprint arXiv:1305.6916.

Hanahan, D. and Weinberg, R. A. (2011). Hallmarks of cancer: The next generation. Cell,

144 646–674.

Harvey, K. and Tapon, N. (2007). The Salvador–Warts–Hippo pathway: An emerging tumour-suppressor network. Nat. Rev. Cancer, 7 182–191.

Harvey, K. F., Zhang, X. and Thomas, D. M. (2013). The hippo pathway and human cancer.

Nat. Rev. Cancer, 13 246–257.

Heine, P., Taylor, J., Iwamoto, G., Lubahn, D. and Cooke, P. (2000). Increased adipose tissue in male and female estrogen receptor-α knockout mice. P. Natl. Acad. Sci., 97 12729–12734.

Iida, S., Rao, P., Nallasivam, P., Hibshoosh, H., Butler, M., Louie, D., Dyomin, V.,

Ohno, H., Chaganti, R. and Dalla-Favera, R. (1996). The t (9; 14)(p13; q32) chromosomal

translocation associated with lymphoplasmacytoid lymphoma involves the PAX-5 gene. Blood,

88 4110–4117.

Ji, H., Jiang, H., Ma, W., Johnson, D. S., Myers, R. M. and Wong, W. H. (2008). An

integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol., 26

1293–1300.

Knight, W. A., Livingston, R. B., Gregory, E. J. and McGuire, W. L. (1977). Estrogen

receptor as an independent prognostic factor for early recurrence in breast cancer. Cancer Res.,

37 4669–4671.

Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P. et al. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res., 22 1813–1831.

Leonetti, C., D’Agnano, I., Lozupone, F., Valentini, A., Geiser, T., Zon, G., Calabretta, B., Citro, G. and Zupi, G. (1996). Antitumor effect of c-MYC antisense phosphorothioate oligodeoxynucleotides on human melanoma cells in vitro and in mice. J. Natl. Cancer I., 88 419–429.

Li, H., Cai, Q., Wu, H., Vathipadiekal, V., Dobbin, Z. C., Li, T., Hua, X., Landen, C. N.,

Birrer, M. J., Sanchez-Beato, M. et al. (2012). SUZ12 promotes human epithelial ovarian

cancer by suppressing apoptosis via silencing HRK. Mol. Cancer Res., 10 1462–1472.


Li, H., Ma, X., Wang, J., Koontz, J., Nucci, M. and Sklar, J. (2007). Effects of rearrangement and allelic exclusion of JJAZ1/SUZ12 on cell proliferation and survival. P. Natl. Acad. Sci., 104 20001–20006.

Li, H., Wang, J., Mor, G. and Sklar, J. (2008). A neoplastic gene fusion mimics trans-splicing

of RNAs in normal human cells. Science, 321 1357–1361.

Li, M.-D. and Yang, X. (2011). A retrospective on nuclear receptor regulation of inflammation:

lessons from GR and PPARs. PPAR Res., 2011.

Li, Z., Van Calcar, S., Qu, C., Cavenee, W. K., Zhang, M. Q. and Ren, B. (2003). A

global transcriptional regulatory role for c-MYC in Burkitt’s lymphoma cells. P. Natl. Acad.

Sci., 100 8164–8169.

Liu, H., Han, F. and Zhang, C.-h. (2012). Transelliptical graphical models. In NIPS. 800–808.

Mangelsdorf, D. J., Thummel, C., Beato, M., Herrlich, P., Schutz, G., Umesono, K.,

Blumberg, B., Kastner, P., Mark, M., Chambon, P. et al. (1995). The nuclear receptor

superfamily: The second decade. Cell, 83 835–839.

Martin, C. and Zhang, Y. (2005). The diverse functions of histone lysine methylation. Nat.

Rev. Mol. Cell. Bio., 6 838–849.

Martín-Perez, D., Sanchez, E., Maestre, L., Suela, J., Vargiu, P., Di Lisio, L., Martínez, N., Alves, J., Piris, M. A. and Sanchez-Beato, M. (2010). Deregulated expression of the polycomb-group protein SUZ12 target genes characterizes mantle cell lymphoma. Am. J. Pathol., 177 930–942.

McCall, M. N., Bolstad, B. M. and Irizarry, R. A. (2010). Frozen robust multiarray analysis

(fRMA). Biostatistics, 11 242–253.

McCall, M. N., Uppal, K., Jaffee, H. A., Zilliox, M. J. and Irizarry, R. A. (2011). The

gene expression barcode: leveraging public data repositories to begin cataloging the human and

murine transcriptomes. Nucleic Acids Res., 39 D1011–1015.

Mimno, D. (2012). Computational historiography: Data mining in a century of classics journals.

J. Comp. Cul. Herit., 5 1–20.

Morrison, A. M., Jager, U., Chott, A., Schebesta, M., Haas, O. A. and Busslinger, M.

(1998a). Deregulated PAX-5 transcription from a translocated IgH promoter in marginal zone

lymphoma. Blood, 92 3865–3878.

Morrison, A. M., Nutt, S. L., Thevenin, C., Rolink, A. and Busslinger, M. (1998b). Loss- and gain-of-function mutations reveal an important role of BSAP (Pax-5) at the start and end of B cell differentiation. In Seminars in Immunology, vol. 10. Academic Press, 133–142.

Nakamura, T., Imai, Y., Matsumoto, T., Sato, S., Takeuchi, K., Igarashi, K., Harada,

Y., Azuma, Y., Krust, A., Yamamoto, Y. et al. (2007). Estrogen prevents bone loss via

estrogen receptor α and induction of Fas ligand in osteoclasts. Cell, 130 811–823.

Ntziachristos, P., Tsirigos, A., Van Vlierberghe, P., Nedjic, J., Trimarchi, T., Flaherty, M. S., Ferres-Marco, D., da Ros, V., Tang, Z., Siegle, J. et al. (2012). Genetic inactivation of the polycomb repressive complex 2 in T cell acute lymphoblastic leukemia. Nat. Med., 18 298–303.

Ogawa, S., Eng, V., Taylor, J., Lubahn, D. B., Korach, K. S. and Pfaff, D. W. (1998).

Roles of estrogen receptor-α gene expression in reproduction-related behaviors in female mice 1.

Endocrinology, 139 5070–5081.

Piette, J., Piret, B., Bonizzi, G., Schoonbroodt, S., Merville, M.-P., Legrand-Poels, S. and Bours, V. (1997). Multiple redox regulation in NF-kappaB transcription factor activation. Biol. Chem., 378 1237–1245.

Purushotham, S., Liu, Y. and Kuo, C.-C. J. (2012). Collaborative topic regression with

social matrix factorization for recommendation systems. In Proceedings of the 29th International

Conference on Machine Learning. 759–766.

Rayet, B. and Gelinas, C. (1999). Aberrant REl/NFKB genes and activity in human cancer.

Oncogene, 18.

Shaffer, A., Rosenwald, A. and Staudt, L. M. (2002). Lymphoid malignancies: The dark

side of B-cell differentiation. Nat. Rev. Immunol., 2 920–933.

Shalit, U., Weinshall, D. and Chechik, G. (2013). Modeling musical influence with topic

models. In ICML. 244–252.

Shaulian, E. and Karin, M. (2002). AP-1 as a regulator of cell life and death. Nat. Cell Biol.,

4 E131–E136.

Slack, G. W. and Gascoyne, R. D. (2011). MYC and aggressive B-cell lymphomas. Adv. Anat.

Pathol., 18 219–228.

Sparmann, A. and van Lohuizen, M. (2006). Polycomb silencers control cell fate, development

and cancer. Nat. Rev. Cancer, 6 846–856.

Suva, M. L., Riggi, N. and Bernstein, B. E. (2013). Epigenetic reprogramming in cancer.

Science, 339 1567–1570.

Thurmond, T. S., Murante, F. G., Staples, J. E., Silverstone, A. E., Korach, K. S. and

Gasiewicz, T. A. (2000). Role of estrogen receptor α in hematopoietic stem cell development

and B lymphocyte maturation in the male mouse 1. Endocrinology, 141 2309–2318.

Trefethen, L. N. and Bau III, D. (1997). Numerical Linear Algebra. 50, SIAM.


Urbanek, P., Fetka, I., Meisler, M. H. and Busslinger, M. (1997). Cooperation of PAX2

and PAX5 in midbrain and cerebellum development. P. Natl. Acad. Sci., 94 5703–5708.

Urbanek, P., Wang, Z.-Q., Fetka, I., Wagner, E. F. and Busslinger, M. (1994). Complete

block of early B cell differentiation and altered patterning of the posterior midbrain in mice

lacking PAX5/BSAP. Cell, 79 901–912.

Vigneri, P., Frasca, F., Sciacca, L., Pandini, G. and Vigneri, R. (2009). Diabetes and

cancer. Endocr.-Relat. Cancer, 16 1103–1123.

Vu, V. Q., Cho, J., Lei, J. and Rohe, K. (2013). Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA. In NIPS. 2670–2678.

Wang, C., Blei, D. and Li, F.-F. (2009). Simultaneous image classification and annotation. In

IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1903–1910.

Wang, C. and Blei, D. M. (2009). Decoupling sparsity and smoothness in the discrete hierarchical

Dirichlet process. In NIPS. 1982–1989.

Wu, G. and Ji, H. (2013). ChIPXpress: Using publicly available gene expression data to improve

ChIP-seq and ChIP-chip target gene ranking. BMC Bioinformatics, 14 188.

Wu, G., Yustein, J. T., McCall, M. N., Zilliox, M., Irizarry, R. A., Zeller, K., Dang,

C. V. and Ji, H. (2013a). ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data.

Bioinformatics.

Wu, H., Xiao, Y., Zhang, S., Ji, S., Wei, L., Fan, F., Geng, J., Tian, J., Sun, X., Qin,

F. et al. (2013b). The Ets transcription factor GABP is a component of the hippo pathway

essential for growth and antioxidant defense. Cell Reports, 3 1663–1677.

Yao, L., Mimno, D. and McCallum, A. (2009). Efficient methods for topic model inference on

streaming document collections. In ACM SIGKDD. ACM, 937–946.

Yoshida, S., Nakazawa, N., Iida, S., Hayami, Y., Sato, S., Wakita, A., Shimizu, S.,

Taniwaki, M. and Ueda, R. (1999). Detection of MUM1/IRF4-IgH fusion in multiple myeloma.

Leukemia, 13 1812.

Yu, J., Yu, J., Rhodes, D. R., Tomlins, S. A., Cao, X., Chen, G., Mehra, R., Wang, X.,

Ghosh, D., Shah, R. B. et al. (2007). A polycomb repression signature in metastatic prostate

cancer predicts cancer outcome. Cancer Research, 67 10657–10663.

Yuan, X.-T. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. J.

Mach. Learn. Res., 14 899–925.

Zhu, J., Zhang, B., Smith, E. N., Drees, B., Brem, R. B., Kruglyak, L., Bumgarner,

R. E. and Schadt, E. E. (2008). Integrating large-scale functional genomic data to dissect the

complexity of yeast regulatory networks. Nat. Genet., 40 854–861.


Zhuang, D., Mannava, S., Grachtchouk, V., Tang, W., Patil, S., Wawrzyniak, J., Berman, A., Giordano, T., Prochownik, E., Soengas, M. et al. (2008). c-MYC overexpression is required for continuous suppression of oncogene-induced senescence in melanoma cells. Oncogene, 27 6623–6634.