rna-seq co-expression analysis using mixture modelsjouy.inra.fr rna-seq co-expression analysis 3 /...

Post on 07-Mar-2018

228 Views

Category:

Documents

3 Downloads

Preview:

Click to see full reader

TRANSCRIPT

RNA-seq co-expression analysisusing mixture models

A. Rau, C. Maugis-Rabusseau, M.-L. Martin-Magniette, G. Celeux

September 29, 2015

Netbio @ Paris

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 1 / 25

Introduction Biological context

Gene (co-)expression

Transcriptome data: main source of ’omic information available forliving organisms

Microarrays (∼1995 - )High-throughput sequencing (HTS): RNA-seq (∼2008 - )

Comparison of two conditions (hypothesis tests) → Differentialexpression analysis

Co-expression (clustering) analysis

Study gene expression behavior across several conditions

Co-expressed genes may be involved in similar biological process(es)⇒ study genes without known or predicted function (orphan genes)

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 2 / 25

Introduction Co-expression analysis with RNA-seq data

High-throughput transcriptome sequencing data (RNA-seq)

Reads aligned or directly mapped to the genome to get counts pergenomic feature (discrete data) ⇒ digital measures of gene expression

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 3 / 25

Introduction Co-expression analysis with RNA-seq data

RNA-seq data, continued

Some statistical challenges of RNA-seq data analysis

Discrete, non-negative, and skewed data with very large dynamicrange (up to 5+ orders of magnitude)

Sequencing depth (= “library size”) varies among experiments, andother technical biases...

Counts correlated with gene length

Gene 1 Gene 2

Gene 1 Gene 2

Sample 1

Sample 2

To date, most methodological developments are for experimental design,normalization, and differential analysis...

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 4 / 25

Introduction Co-expression analysis with RNA-seq data

Some notation

Notation

Let Yij` be the count (expression measure) for gene i in replicate ` ofcondition j , with corresponding observed value yij`.

Let sj` be the library size in replicate ` of condition j

Let y = (yij`) be the n ×∑

j Lj matrix of counts for all genes andvariables and yi the ith row of the matrix

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 5 / 25

Introduction Finite mixture models

Finite mixture models

Model-based clustering

Rigourous framework for parameter estimation and model selection

Output: each gene assigned a probability of cluster membership

Assume data y come from K distinct subpopulations, each modeledseparately:

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

ΨK = (π1, . . . , πK−1,θ′)′

π = (π1, . . . , πK )′ are the mixing proportions, where∑K

k=1 πk = 1

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 6 / 25

Poisson mixture model

Finite mixture models for RNA-seq data

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

For microarray data, we often assume yi |k ∼ MVN(µk ,Σk)...

For RNA-seq data, we need to choose the family andparameterization of fk(·). One possibility:

yi |k ∼J∏

j=1

Lj∏`=1

P(yij`|µij`k)

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 7 / 25

Poisson mixture model

Finite mixture models for RNA-seq data

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

For microarray data, we often assume yi |k ∼ MVN(µk ,Σk)...

For RNA-seq data, we need to choose the family andparameterization of fk(·). One possibility:

yi |k ∼J∏

j=1

Lj∏`=1

P(yij`|µij`k)

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 7 / 25

Poisson mixture model

Finite mixture models for RNA-seq data

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

For microarray data, we often assume yi |k ∼ MVN(µk ,Σk)...

For RNA-seq data, we need to choose the family andparameterization of fk(·). One possibility:

yi |k ∼J∏

j=1

Lj∏`=1

P(yij`|µij`k)

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 7 / 25

Poisson mixture model Parameterization

Which genes should be clustered?

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 8 / 25

Poisson mixture model Parameterization

Which genes should be clustered?

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 8 / 25

Poisson mixture model Parameterization

Which genes should be clustered?

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 8 / 25

Poisson mixture model Parameterization

Poisson mixture model for RNA-seq data

Consider yij`|k ∼ Poisson(yij`|µij`k), where

µij`k = wi sj`λjk

wi : overall expression level of gene i (= yi ··)

sj` : normalized library sizea

λk = (λjk) : parameters that define profiles of genes in each clusterb

aEstimated from data using standard techniques and considered to be fixedbFor identifiability of model, we assume

∑j,` λjksj` = 1 for all k

Genes assigned to the same cluster if they share the same profile ofvariation around their mean count across all conditions

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 9 / 25

Poisson mixture model Parameterization

Poisson mixture model for RNA-seq data

Consider yij`|k ∼ Poisson(yij`|µij`k), where

µij`k = wi sj`λjk

wi : overall expression level of gene i (= yi ··)

sj` : normalized library sizea

λk = (λjk) : parameters that define profiles of genes in each clusterb

aEstimated from data using standard techniques and considered to be fixedbFor identifiability of model, we assume

∑j,` λjksj` = 1 for all k

Genes assigned to the same cluster if they share the same profile ofvariation around their mean count across all conditions

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 9 / 25

Poisson mixture model Implementation

Parameter estimation

The log likelihood is

L(ΨK |y,K ) = log

[n∏

i=1

f (yi |K ,ΨK )

]=

n∑i=1

log

[K∑

k=1

πk f (yi |θk)

],

where θk = (wi , λ1k , . . . , λdK )′

Estimation approach (EM): mixture parameters are estimated for agiven model K by computing the maximum likelihood estimate(Dempster et al. 1977) Details...

Note: the EM algorithm is sensitive to initialization, so we make useof a splitting small-EM initialization Details...

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 10 / 25

Poisson mixture model Implementation

Parameter estimation

The log likelihood is

L(ΨK |y,K ) = log

[n∏

i=1

f (yi |K ,ΨK )

]=

n∑i=1

log

[K∑

k=1

πk f (yi |θk)

],

where θk = (wi , λ1k , . . . , λdK )′

Estimation approach (EM): mixture parameters are estimated for agiven model K by computing the maximum likelihood estimate(Dempster et al. 1977) Details...

Note: the EM algorithm is sensitive to initialization, so we make useof a splitting small-EM initialization Details...

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 10 / 25

Poisson mixture model Implementation

Classification by the MAP rule

“Maximum a posteriori” (MAP) rule:

Each individual is attributed to the cluster for which it has the largestconditional probability of membership given the estimated parameters:

τik(θ) =πk fk(yi |θk)∑K`=1 π`f`(yi |θ`)

MAP rule with θK :

zik =

{1 if τik

(θK

)> τi`

(θK

)∀` 6= k

0 otherwise

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 11 / 25

Poisson mixture model Model selection

Model selection

1 Collection of models (SK )K∈K indexed by number of clusters K

2 In each model SK , parameter estimation via MLE: ΨK

3 Selection of the “best” model K using a penalized criterion:

K = arg minK∈K

{−1

n

n∑i=1

log f (yi |K , ΨK ) + penalty(K )

}

⇒ Asymptotic penalized criteria include Bayesian InformationCriterion (BIC) and Integrated Completed Likelihood (ICL) Details...

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 12 / 25

Poisson mixture model Model selection

Slope heuristics for model selection (Birge and Massart, 2006)

Non-asymptotic framework: construct a penalized criterion1 such thatthe selected model has a risk close to the oracle model

Optimal penalty for model of dimension D:

penaltyopt ≈ 2κD

n

In large dimensions:

Linear behavior of loglikelihood with respect to model dimension D

⇒ Estimation of slope to calibrate κ in a data-driven manner(Data-Driven Slope Estimation = DDSE), capushe R package

1Theoretically validated in Gaussian framework, but encouraging applications in othercontexts (Baudry et al., 2012)

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 13 / 25

Poisson mixture model Model selection

Slope heuristics for model selection (Birge and Massart, 2006)

Non-asymptotic framework: construct a penalized criterion1 such thatthe selected model has a risk close to the oracle model

Optimal penalty for model of dimension D:

penaltyopt ≈ 2κD

n

In large dimensions:

Linear behavior of loglikelihood with respect to model dimension D

⇒ Estimation of slope to calibrate κ in a data-driven manner(Data-Driven Slope Estimation = DDSE), capushe R package

1Theoretically validated in Gaussian framework, but encouraging applications in othercontexts (Baudry et al., 2012)

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 13 / 25

Poisson mixture model Model selection

Slope heuristics in practice for RNA-seq

●●●●●

●●●●

●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●● ● ● ● ● ● ● ● ●

−2.

0e+

08−

1.5e

+08

−1.

0e+

08−

5.0e

+07

Contrast representation

penshape(m) (labels : Model names)

−γ n

(sm)

* * *

*The regression line is computed with 17 pointsValidation points

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 65 70 75 80 85 90 95 100

110

120

130

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 14 / 25

Poisson mixture model HTSCluster

HTSCluster R package

> PMM <- PoisMixClusWrapper(y=data, gmin=1, gmax=35,

conds=conds, split.init=TRUE, norm="TMM")

>

> summary(PMM)

*************************************************

Selected number of clusters via ICL = 10

Selected number of clusters via BIC = 30

Selected number of clusters via Djump = 15

Selected number of clusters via DDSE = 14

*************************************************

>

> summary(PMM$DDSE.results)

*************************************************

Number of clusters = 14

Model selection via DDSE

*************************************************

Cluster sizes:

Cluster 1 Cluster 2 Cluster 3 ...

540 192 235 ...

...andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 15 / 25

RNA-seq application Embryonic fly data

Real data analysis: Embryonic fly development

modENCODE project to provide functional annotation of Drosophila(Graveley et al., 2011)

Expression dynamics over 27 distinct stages of development duringlife cycle studied with RNA-seq

12 embryonic samples (collected at 2-hr intervals over 24 hrs) for13,164 genes downloaded from ReCount database (Frazee et al.,2011)

3 independent runs, used HTSCluster to fit Poisson mixture modelsfor K ∈ {1, . . . , 60, 65, . . . , 100, 110, . . . , 130}Using slope heuristics, selected model is K = 48

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 16 / 25

RNA-seq application Embryonic fly data

Real data analysis: Embryonic fly development

modENCODE project to provide functional annotation of Drosophila(Graveley et al., 2011)

Expression dynamics over 27 distinct stages of development duringlife cycle studied with RNA-seq

12 embryonic samples (collected at 2-hr intervals over 24 hrs) for13,164 genes downloaded from ReCount database (Frazee et al.,2011)

3 independent runs, used HTSCluster to fit Poisson mixture modelsfor K ∈ {1, . . . , 60, 65, . . . , 100, 110, . . . , 130}Using slope heuristics, selected model is K = 48

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 16 / 25

RNA-seq application Embryonic fly data

HTSCluster model diagnostics

Maximum conditional probability

Fre

quen

cy

0.0 0.2 0.4 0.6 0.8 1.0

020

0040

0060

0080

0010

000

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 17 / 25

RNA-seq application Embryonic fly data

HTSCluster model diagnostics

●●●

●●

●●

●●●

●●

●●

●●●●●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●●●●●●●●●

●●

●●●

●●

● ●●

●●

●●

●●●●●

●●

● ●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●●●●●●

● ●

●●

●●●

●●●

●●

●●●

●●

●●●●●

●●●●●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●●●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●

●●

●●●

●●●●●●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●●●●●

●●

●● ●●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●●

●●●●●●

●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●●●

●●●●

●●●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●

●●●●●●●●●

●● ●●●

●●●

●●●●●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 42 45 48

0.2

0.4

0.6

0.8

1.0

Cluster

Max

imum

con

ditio

nal p

roba

bilit

y

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 18 / 25

RNA-seq application Embryonic fly data

HTSCluster model diagnostics

18 19 30 44 20 48 46 10 35 16 25 22 14 3 9 6

τ ≤ 0.8τ > 0.8

Cluster

Fre

quen

cy

010

020

030

040

050

060

0

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 19 / 25

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

0e+

002e

+05

4e+

05

Cluster 1

2 4 6 8 10 120e

+00

4e+

058e

+05

Cluster 2

2 4 6 8 10 12

010

0000

2500

00

Cluster 3

2 4 6 8 10 12

010

0000

2500

00

Cluster 4

2 4 6 8 10 12

0e+

002e

+05

4e+

05

Cluster 5

2 4 6 8 10 12

010

0000

2000

00

Cluster 6

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 20 / 25

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

02

46

810

Cluster 1

2 4 6 8 10 120

24

68

12

Cluster 2

2 4 6 8 10 12

02

46

810

Cluster 3

2 4 6 8 10 12

02

46

810

Cluster 4

2 4 6 8 10 12

02

46

810

Cluster 5

2 4 6 8 10 12

02

46

810

Cluster 6

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 20 / 25

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 1

2 4 6 8 10 120.

00.

20.

40.

6

Cluster 2

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 3

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 4

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 5

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

Cluster 6

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 20 / 25

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 1

2 4 6 8 10 12

0.0

0.2

0.4

0.6

Cluster 2

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 3

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 4

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 5

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

Cluster 6

121110987654321

0.0

0.2

0.4

0.6

0.8

1.0

λ jks

j.

Cluster

1 2 3 4 5 6

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 20 / 25

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

Embryos 22−24hrEmbryos 20−22hrEmbryos 18−20hrEmbryos 16−18hrEmbryos 14−16hrEmbryos 12−14hrEmbryos 10−12hrEmbryos 8−10hrEmbryos 6−8hrEmbryos 4−6hrEmbryos 2−4hrEmbryos 0−2hr

0.0

0.2

0.4

0.6

0.8

1.0

λ jks

j.

Cluster

1 2 3 4 5 6 7 8 9 10 11 121314 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

Functional enrichment analysis: 33 of 48 clusters associated with atleast one Gene Ontology Biological Process term (e.g., cluster 6associated with muscle attachment)

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 21 / 25

Discussion

HTSCluster for clustering count-based RNA-seq profiles

Interpretable parameterization for RNA-seq co-expression analyses,straightforward parameter estimation, and a sound mechanism formodel selection

Performs well on real and simulated data compared to otherapproaches especially when the number of clusters is unknown

Details...

HTSCluster (v2.0.4): R package on CRAN

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 22 / 25

Discussion Future work

Some limits (opportunities!) for HTSCluster

Computational time can be a drawback: a full collection of models isestimated to allow for selection of a single “best” model, splittingsmall-EM initialization prevents parallelization...

Samples are currently assumed to be conditionally independent giventhe cluster

Conditions are currently assumed to be a single multi-level factor:how to correctly account for more complex experimental designs?(e.g. factorial, time series)

Is a Poisson mixture model the most appropriate choice for RNA-seqdata in practice? ...

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 23 / 25

Discussion Future work

Future work: Model comparisons for co-expression

Is it better to model the raw counts yij using a Poisson distribution orappropriately transformed counts t(yij) using a Gaussian distribution?2

f (yi |K , θK ) =K∑

k=1

πk

J∏j=1

P(yij |θk)

- vs -

g(t(yi )|K , ηK ) =K∑

k=1

πkΦ(t(yi )|µk ,Σk)

For example,

t(yij) = log

(yij/y·j + 1

mi + 1

)

2Ph.D. work of Melina Gallopinandrea.rau@jouy.inra.fr RNA-seq co-expression analysis 24 / 25

Discussion Future work

Future work: Model comparisons for co-expression

Is it better to model the raw counts yij using a Poisson distribution orappropriately transformed counts t(yij) using a Gaussian distribution?2

f (yi |K , θK ) =K∑

k=1

πk

J∏j=1

P(yij |θk)

- vs -

g(t(yi )|K , ηK ) =K∑

k=1

πkΦ(t(yi )|µk ,Σk)

For example,

t(yij) = log

(yij/y·j + 1

mi + 1

)2Ph.D. work of Melina Gallopin

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 24 / 25

Discussion Future work

Future work: Model comparisons for RNA-seqco-expression

BIC model selection criterion enables an objective comparison:

BICf (K ; y) =∑n

i=1 log f (yi ;K , θK )− υf2 log n

BICg (K ; y) =∑n

i=1 log g(t(yi );K , ηK ) +∑n

i=1 log t ′(yi )− υg2 log n

Left: Sultan et al. (2008). Right: Mach et al. (2014)

Further comparisons of transformations / models in progress...

andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 25 / 25

Discussion Future work

Future work: Model comparisons for RNA-seqco-expression

BIC model selection criterion enables an objective comparison:

BICf (K ; y) =∑n

i=1 log f (yi ;K , θK )− υf2 log n

BICg (K ; y) =∑n

i=1 log g(t(yi );K , ηK ) +∑n

i=1 log t ′(yi )− υg2 log n

Left: Sultan et al. (2008). Right: Mach et al. (2014)

Further comparisons of transformations / models in progress...andrea.rau@jouy.inra.fr RNA-seq co-expression analysis 25 / 25

Thank you!

In collaboration with...

Gilles Celeux (Inria Saclay - Ile-de-France)

Cathy Maugis-Rabusseau (INSA / IMT Toulouse)

Marie-Laure Martin-Magniette (AgroParisTech / INRA URGV)

Panos Papastamoulis (University of Manchester)

Melina Gallopin (current Ph.D. student)

Estimation of finite mixture models

A finite mixture model may be seen as an incomplete datastructure model

The complete data are

x = (y, z) = (x1, . . . , xn) = ((y1, . . . , yn), (z1, . . . , zn))

where the missing data are z = (z1, . . . , zn) = (zik)zi component of i , where zik = 1 if i arises from group k and 0otherwisez defines a partition P = (P1, . . . ,PK ) of the observed data y withPk = {i |zik = 1}

Expected completed likelihood:

L(ΨK ; y, z) =n∑

i=1

K∑k=1

zik {log πk + log fk(y;θ)}+ λπ

(K∑

k=1

πk − 1

)

where λπ is the Lagrange multiplier for the constraint on π

Estimation: EM algorithm (Dempster et al., 1977)

E-step Compute the conditional probabilities:

τik

(b)k

)=

π(b)k f (yi |θ

(b)k )∑K

m=1 π(b)m f (yi |θ

(b)m )

M-step Update Ψk to maximize the expected value of the completedlikelihood by weighting observation i for cluster k with

τik

(b)k

):

π(b+1)k =

1

n

n∑i=1

τik

(b)k

),

wi = yi ··

λ(b+1)jk =

∑ni=1 τik

(b)k

)yij ·

sj ·∑n

i=1 τik

(b)k

)yi ··

Back...

Splitting initialization (Papastamoulis et al., 2014)

for K ← 2 to gmax do– Calculate per-class entropy ek = −

∑i∈k log tK−1

ik for model with (K − 1)clusters– Select cluster k? = arg maxk ek to be splitfor i ← 1 to init.runs do

– Randomly split the observations in cluster k? into two clusters– Calculate corresponding λ(0,i),K and π(0,i),K

– Update values of λ(0,i),K and π(0,i),K via EM algorithm withinit.iter iterations– Calculate the log-likelihood L(i),K = L(λ(0,i),K , π(0,i),K )

end

Let i? = arg maxi L(i),K . Fix new initial values λ(0),K = λ(0,i?),K and

π(0),K = π(0,i?),K .end

Back...

Penalized criteria for model selection: BIC (Schwarz, 1978)

Maximization of integrated likelihood:

m = arg minm∈M

−f (y|m)

where

f (y|m) =

∫Θm

f (y|θ,m)Π(θ|m)dθ

Asymptotic approximation (where Dm = (m − 1) + m × J is thedimension of Sm):

−ln(f (y|m)) ≈ −L(y|θm) +Dm

2ln(n)

= nγn(sm) +Dm

2ln(n)

Bayesian information criterion (BIC):

m = arg minm∈M

{γn(sm) + penBIC(m)} with penBIC(m) =Dm

2nln(n)

Back...

Penalized criteria for model selection: ICL (Biernacki et al., 2000)

Alternative based on maximization of integrated completed likelihood:

m = arg minm∈M

−f (y, z|m)

where

f (y, z|m) =

∫Θm

f (y, z|θ,m)Π(θ|m)dθ

BIC-like asymptotic approximation for Integrated CompletedLikelihood (ICL):

m = arg minm∈M

{−1

nL(θm|y, z) +

Dm

2nln(n)}

= arg minm∈M

{γn(sm) + penBIC(m) + Ent(m)}

where Ent(m) = − 1n

∑i

∑k zik ln τik(θm)

Back...

Behavior of BIC and ICL in practice for RNA-seq data

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

logL

ike

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clustersB

IC

X

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

ICL

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

logL

ike

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

BIC

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

ICL

X

Back...

Behavior of BIC and ICL in practice for RNA-seq data

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

logL

ike

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clustersB

IC

X

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

ICL

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

logL

ike

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

BIC

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

ICL

X

Back...

Description of competing models

1 PoisL (Cai et al., 2004): K-means type algorithm using Poissonloglinear model

Equivalent to HTSCluster when equal library sizes, unreplicated data,equiprobable Poisson mixtures, and parameter estimation via theClassification EM (CEM) algorithm

2 Witten (2011): hierarchical clustering of dissimilarity measure basedon a Poisosn loglinear model

Originally intended to cluster samples

3 Si et al. (2014): model-based hierarchical algorithm using Poissonand negative binomial models

4 Classic K-means algorithm on expression profiles (yij`/yi ··)

Model selection not addressed by any of the above ⇒ Calinski andHarabasz index (1974) used for comparison

Simulation procedure (based on fly and human data)

For each setting (fly and human), 50 individual datasets:

K fixed to 15, true experimental design used

All parameters (λjk), (sj`), and (wi ) fixed to estimated values fromreal data analysis

n = 3000 genes randomly sampled from fly or human data, weightedby their maximum conditional probability

For each selected gene, we sample from the appropriate Poissondistribution:

Yijl ∼ P(µijlk)

where µijlk = wi sjlλjk if zik = 1.

Simulation results

All models fit for K ∈ 1, . . . , 40

Model selection via the slope heuristics (HTSCluster, PoisL) orCH-index (Si-Pois, Si-NB, Witten)

Models compared using the adjusted Rand Index (ARI, Hubert &Arabie 1985)

For comparison, also consider the oracle ARI (based on assignment ofobservations to clusters using the true parameter values)

Simulation results

Table: Mean (sd) ARI for simulations with parameters based on the fly and human liver.

Method Model selection Fly Human

HTSClustercapushe 0.93 (0.05) 0.61 (0.02)True K 0.84 (0.09) 0.60 (0.02)

PoisLcapushe 0.79 (0.15) 0.53 (0.05)True K 0.82 (0.05) 0.53 (0.04)

WittenCH index 0.15 (0.07) 0.11 (0.03)True K 0.67 (0.09) 0.39 (0.04)

Si-PoisCH index 0.26 (0.17) 0.48 (0.04)True K 0.95 (0.02) 0.61 (0.02)

Si-NBCH index 0.23 (0.16) 0.47 (0.04)True K 0.94 (0.02) 0.60 (0.02)

K-means True K 0.79 (0.08) 0.42 (0.02)

Oracle True K 0.95 (0.01) 0.63 (0.01)

Back...

top related