rna-seq co-expression analysis using mixture modelsjouy.inra.fr rna-seq co-expression analysis 3 /...

50
RNA-seq co-expression analysis using mixture models A. Rau , C. Maugis-Rabusseau, M.-L. Martin-Magniette, G. Celeux September 29, 2015 Netbio @ Paris [email protected] RNA-seq co-expression analysis 1 / 25

Upload: trinhhanh

Post on 07-Mar-2018

228 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq co-expression analysisusing mixture models

A. Rau, C. Maugis-Rabusseau, M.-L. Martin-Magniette, G. Celeux

September 29, 2015

Netbio @ Paris

[email protected] RNA-seq co-expression analysis 1 / 25

Page 2: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Introduction Biological context

Gene (co-)expression

Transcriptome data: main source of ’omic information available forliving organisms

Microarrays (∼1995 - )High-throughput sequencing (HTS): RNA-seq (∼2008 - )

Comparison of two conditions (hypothesis tests) → Differentialexpression analysis

Co-expression (clustering) analysis

Study gene expression behavior across several conditions

Co-expressed genes may be involved in similar biological process(es)⇒ study genes without known or predicted function (orphan genes)

[email protected] RNA-seq co-expression analysis 2 / 25

Page 3: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Introduction Co-expression analysis with RNA-seq data

High-throughput transcriptome sequencing data (RNA-seq)

Reads aligned or directly mapped to the genome to get counts pergenomic feature (discrete data) ⇒ digital measures of gene expression

[email protected] RNA-seq co-expression analysis 3 / 25

Page 4: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Introduction Co-expression analysis with RNA-seq data

RNA-seq data, continued

Some statistical challenges of RNA-seq data analysis

Discrete, non-negative, and skewed data with very large dynamicrange (up to 5+ orders of magnitude)

Sequencing depth (= “library size”) varies among experiments, andother technical biases...

Counts correlated with gene length

Gene 1 Gene 2

Gene 1 Gene 2

Sample 1

Sample 2

To date, most methodological developments are for experimental design,normalization, and differential analysis...

[email protected] RNA-seq co-expression analysis 4 / 25

Page 5: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Introduction Co-expression analysis with RNA-seq data

Some notation

Notation

Let Yij` be the count (expression measure) for gene i in replicate ` ofcondition j , with corresponding observed value yij`.

Let sj` be the library size in replicate ` of condition j

Let y = (yij`) be the n ×∑

j Lj matrix of counts for all genes andvariables and yi the ith row of the matrix

[email protected] RNA-seq co-expression analysis 5 / 25

Page 6: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Introduction Finite mixture models

Finite mixture models

Model-based clustering

Rigourous framework for parameter estimation and model selection

Output: each gene assigned a probability of cluster membership

Assume data y come from K distinct subpopulations, each modeledseparately:

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

ΨK = (π1, . . . , πK−1,θ′)′

π = (π1, . . . , πK )′ are the mixing proportions, where∑K

k=1 πk = 1

[email protected] RNA-seq co-expression analysis 6 / 25

Page 7: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model

Finite mixture models for RNA-seq data

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

For microarray data, we often assume yi |k ∼ MVN(µk ,Σk)...

For RNA-seq data, we need to choose the family andparameterization of fk(·). One possibility:

yi |k ∼J∏

j=1

Lj∏`=1

P(yij`|µij`k)

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

[email protected] RNA-seq co-expression analysis 7 / 25

Page 8: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model

Finite mixture models for RNA-seq data

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

For microarray data, we often assume yi |k ∼ MVN(µk ,Σk)...

For RNA-seq data, we need to choose the family andparameterization of fk(·). One possibility:

yi |k ∼J∏

j=1

Lj∏`=1

P(yij`|µij`k)

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

[email protected] RNA-seq co-expression analysis 7 / 25

Page 9: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model

Finite mixture models for RNA-seq data

f (y|K ,ΨK ) =n∏

i=1

K∑k=1

πk fk(yi |θk)

For microarray data, we often assume yi |k ∼ MVN(µk ,Σk)...

For RNA-seq data, we need to choose the family andparameterization of fk(·). One possibility:

yi |k ∼J∏

j=1

Lj∏`=1

P(yij`|µij`k)

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

[email protected] RNA-seq co-expression analysis 7 / 25

Page 10: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Parameterization

Which genes should be clustered?

[email protected] RNA-seq co-expression analysis 8 / 25

Page 11: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Parameterization

Which genes should be clustered?

[email protected] RNA-seq co-expression analysis 8 / 25

Page 12: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Parameterization

Which genes should be clustered?

[email protected] RNA-seq co-expression analysis 8 / 25

Page 13: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Parameterization

Poisson mixture model for RNA-seq data

Consider yij`|k ∼ Poisson(yij`|µij`k), where

µij`k = wi sj`λjk

wi : overall expression level of gene i (= yi ··)

sj` : normalized library sizea

λk = (λjk) : parameters that define profiles of genes in each clusterb

aEstimated from data using standard techniques and considered to be fixedbFor identifiability of model, we assume

∑j,` λjksj` = 1 for all k

Genes assigned to the same cluster if they share the same profile ofvariation around their mean count across all conditions

[email protected] RNA-seq co-expression analysis 9 / 25

Page 14: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Parameterization

Poisson mixture model for RNA-seq data

Consider yij`|k ∼ Poisson(yij`|µij`k), where

µij`k = wi sj`λjk

wi : overall expression level of gene i (= yi ··)

sj` : normalized library sizea

λk = (λjk) : parameters that define profiles of genes in each clusterb

aEstimated from data using standard techniques and considered to be fixedbFor identifiability of model, we assume

∑j,` λjksj` = 1 for all k

Genes assigned to the same cluster if they share the same profile ofvariation around their mean count across all conditions

[email protected] RNA-seq co-expression analysis 9 / 25

Page 15: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Implementation

Parameter estimation

The log likelihood is

L(ΨK |y,K ) = log

[n∏

i=1

f (yi |K ,ΨK )

]=

n∑i=1

log

[K∑

k=1

πk f (yi |θk)

],

where θk = (wi , λ1k , . . . , λdK )′

Estimation approach (EM): mixture parameters are estimated for agiven model K by computing the maximum likelihood estimate(Dempster et al. 1977) Details...

Note: the EM algorithm is sensitive to initialization, so we make useof a splitting small-EM initialization Details...

[email protected] RNA-seq co-expression analysis 10 / 25

Page 16: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Implementation

Parameter estimation

The log likelihood is

L(ΨK |y,K ) = log

[n∏

i=1

f (yi |K ,ΨK )

]=

n∑i=1

log

[K∑

k=1

πk f (yi |θk)

],

where θk = (wi , λ1k , . . . , λdK )′

Estimation approach (EM): mixture parameters are estimated for agiven model K by computing the maximum likelihood estimate(Dempster et al. 1977) Details...

Note: the EM algorithm is sensitive to initialization, so we make useof a splitting small-EM initialization Details...

[email protected] RNA-seq co-expression analysis 10 / 25

Page 17: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Implementation

Classification by the MAP rule

“Maximum a posteriori” (MAP) rule:

Each individual is attributed to the cluster for which it has the largestconditional probability of membership given the estimated parameters:

τik(θ) =πk fk(yi |θk)∑K`=1 π`f`(yi |θ`)

MAP rule with θK :

zik =

{1 if τik

(θK

)> τi`

(θK

)∀` 6= k

0 otherwise

[email protected] RNA-seq co-expression analysis 11 / 25

Page 18: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Model selection

Model selection

1 Collection of models (SK )K∈K indexed by number of clusters K

2 In each model SK , parameter estimation via MLE: ΨK

3 Selection of the “best” model K using a penalized criterion:

K = arg minK∈K

{−1

n

n∑i=1

log f (yi |K , ΨK ) + penalty(K )

}

⇒ Asymptotic penalized criteria include Bayesian InformationCriterion (BIC) and Integrated Completed Likelihood (ICL) Details...

[email protected] RNA-seq co-expression analysis 12 / 25

Page 19: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Model selection

Slope heuristics for model selection (Birge and Massart, 2006)

Non-asymptotic framework: construct a penalized criterion1 such thatthe selected model has a risk close to the oracle model

Optimal penalty for model of dimension D:

penaltyopt ≈ 2κD

n

In large dimensions:

Linear behavior of loglikelihood with respect to model dimension D

⇒ Estimation of slope to calibrate κ in a data-driven manner(Data-Driven Slope Estimation = DDSE), capushe R package

1Theoretically validated in Gaussian framework, but encouraging applications in othercontexts (Baudry et al., 2012)

[email protected] RNA-seq co-expression analysis 13 / 25

Page 20: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Model selection

Slope heuristics for model selection (Birge and Massart, 2006)

Non-asymptotic framework: construct a penalized criterion1 such thatthe selected model has a risk close to the oracle model

Optimal penalty for model of dimension D:

penaltyopt ≈ 2κD

n

In large dimensions:

Linear behavior of loglikelihood with respect to model dimension D

⇒ Estimation of slope to calibrate κ in a data-driven manner(Data-Driven Slope Estimation = DDSE), capushe R package

1Theoretically validated in Gaussian framework, but encouraging applications in othercontexts (Baudry et al., 2012)

[email protected] RNA-seq co-expression analysis 13 / 25

Page 21: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model Model selection

Slope heuristics in practice for RNA-seq

●●●●●

●●●●

●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●● ● ● ● ● ● ● ● ●

−2.

0e+

08−

1.5e

+08

−1.

0e+

08−

5.0e

+07

Contrast representation

penshape(m) (labels : Model names)

−γ n

(sm)

* * *

*The regression line is computed with 17 pointsValidation points

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 65 70 75 80 85 90 95 100

110

120

130

[email protected] RNA-seq co-expression analysis 14 / 25

Page 22: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Poisson mixture model HTSCluster

HTSCluster R package

> PMM <- PoisMixClusWrapper(y=data, gmin=1, gmax=35,

conds=conds, split.init=TRUE, norm="TMM")

>

> summary(PMM)

*************************************************

Selected number of clusters via ICL = 10

Selected number of clusters via BIC = 30

Selected number of clusters via Djump = 15

Selected number of clusters via DDSE = 14

*************************************************

>

> summary(PMM$DDSE.results)

*************************************************

Number of clusters = 14

Model selection via DDSE

*************************************************

Cluster sizes:

Cluster 1 Cluster 2 Cluster 3 ...

540 192 235 ...

[email protected] RNA-seq co-expression analysis 15 / 25

Page 23: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

Real data analysis: Embryonic fly development

modENCODE project to provide functional annotation of Drosophila(Graveley et al., 2011)

Expression dynamics over 27 distinct stages of development duringlife cycle studied with RNA-seq

12 embryonic samples (collected at 2-hr intervals over 24 hrs) for13,164 genes downloaded from ReCount database (Frazee et al.,2011)

3 independent runs, used HTSCluster to fit Poisson mixture modelsfor K ∈ {1, . . . , 60, 65, . . . , 100, 110, . . . , 130}Using slope heuristics, selected model is K = 48

[email protected] RNA-seq co-expression analysis 16 / 25

Page 24: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

Real data analysis: Embryonic fly development

modENCODE project to provide functional annotation of Drosophila(Graveley et al., 2011)

Expression dynamics over 27 distinct stages of development duringlife cycle studied with RNA-seq

12 embryonic samples (collected at 2-hr intervals over 24 hrs) for13,164 genes downloaded from ReCount database (Frazee et al.,2011)

3 independent runs, used HTSCluster to fit Poisson mixture modelsfor K ∈ {1, . . . , 60, 65, . . . , 100, 110, . . . , 130}Using slope heuristics, selected model is K = 48

[email protected] RNA-seq co-expression analysis 16 / 25

Page 25: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster model diagnostics

Maximum conditional probability

Fre

quen

cy

0.0 0.2 0.4 0.6 0.8 1.0

020

0040

0060

0080

0010

000

[email protected] RNA-seq co-expression analysis 17 / 25

Page 26: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster model diagnostics

●●●

●●

●●

●●●

●●

●●

●●●●●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●●●●●●●●●

●●

●●●

●●

● ●●

●●

●●

●●●●●

●●

● ●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●●●●●●

● ●

●●

●●●

●●●

●●

●●●

●●

●●●●●

●●●●●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●●●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●

●●

●●●

●●●●●●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●●●●●

●●

●● ●●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●●

●●●●●●

●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●●●

●●●●

●●●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●

●●●●●●●●●

●● ●●●

●●●

●●●●●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 42 45 48

0.2

0.4

0.6

0.8

1.0

Cluster

Max

imum

con

ditio

nal p

roba

bilit

y

[email protected] RNA-seq co-expression analysis 18 / 25

Page 27: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster model diagnostics

18 19 30 44 20 48 46 10 35 16 25 22 14 3 9 6

τ ≤ 0.8τ > 0.8

Cluster

Fre

quen

cy

010

020

030

040

050

060

0

[email protected] RNA-seq co-expression analysis 19 / 25

Page 28: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

0e+

002e

+05

4e+

05

Cluster 1

2 4 6 8 10 120e

+00

4e+

058e

+05

Cluster 2

2 4 6 8 10 12

010

0000

2500

00

Cluster 3

2 4 6 8 10 12

010

0000

2500

00

Cluster 4

2 4 6 8 10 12

0e+

002e

+05

4e+

05

Cluster 5

2 4 6 8 10 12

010

0000

2000

00

Cluster 6

[email protected] RNA-seq co-expression analysis 20 / 25

Page 29: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

02

46

810

Cluster 1

2 4 6 8 10 120

24

68

12

Cluster 2

2 4 6 8 10 12

02

46

810

Cluster 3

2 4 6 8 10 12

02

46

810

Cluster 4

2 4 6 8 10 12

02

46

810

Cluster 5

2 4 6 8 10 12

02

46

810

Cluster 6

[email protected] RNA-seq co-expression analysis 20 / 25

Page 30: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 1

2 4 6 8 10 120.

00.

20.

40.

6

Cluster 2

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 3

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 4

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 5

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

Cluster 6

[email protected] RNA-seq co-expression analysis 20 / 25

Page 31: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 1

2 4 6 8 10 12

0.0

0.2

0.4

0.6

Cluster 2

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 3

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 4

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

1.0

Cluster 5

2 4 6 8 10 12

0.0

0.2

0.4

0.6

0.8

Cluster 6

121110987654321

0.0

0.2

0.4

0.6

0.8

1.0

λ jks

j.

Cluster

1 2 3 4 5 6

[email protected] RNA-seq co-expression analysis 20 / 25

Page 32: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

RNA-seq application Embryonic fly data

HTSCluster: Visualization of results

Embryos 22−24hrEmbryos 20−22hrEmbryos 18−20hrEmbryos 16−18hrEmbryos 14−16hrEmbryos 12−14hrEmbryos 10−12hrEmbryos 8−10hrEmbryos 6−8hrEmbryos 4−6hrEmbryos 2−4hrEmbryos 0−2hr

0.0

0.2

0.4

0.6

0.8

1.0

λ jks

j.

Cluster

1 2 3 4 5 6 7 8 9 10 11 121314 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

Functional enrichment analysis: 33 of 48 clusters associated with atleast one Gene Ontology Biological Process term (e.g., cluster 6associated with muscle attachment)

[email protected] RNA-seq co-expression analysis 21 / 25

Page 33: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Discussion

HTSCluster for clustering count-based RNA-seq profiles

Interpretable parameterization for RNA-seq co-expression analyses,straightforward parameter estimation, and a sound mechanism formodel selection

Performs well on real and simulated data compared to otherapproaches especially when the number of clusters is unknown

Details...

HTSCluster (v2.0.4): R package on CRAN

[email protected] RNA-seq co-expression analysis 22 / 25

Page 34: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Discussion Future work

Some limits (opportunities!) for HTSCluster

Computational time can be a drawback: a full collection of models isestimated to allow for selection of a single “best” model, splittingsmall-EM initialization prevents parallelization...

Samples are currently assumed to be conditionally independent giventhe cluster

Conditions are currently assumed to be a single multi-level factor:how to correctly account for more complex experimental designs?(e.g. factorial, time series)

Is a Poisson mixture model the most appropriate choice for RNA-seqdata in practice? ...

[email protected] RNA-seq co-expression analysis 23 / 25

Page 35: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Discussion Future work

Future work: Model comparisons for co-expression

Is it better to model the raw counts yij using a Poisson distribution orappropriately transformed counts t(yij) using a Gaussian distribution?2

f (yi |K , θK ) =K∑

k=1

πk

J∏j=1

P(yij |θk)

- vs -

g(t(yi )|K , ηK ) =K∑

k=1

πkΦ(t(yi )|µk ,Σk)

For example,

t(yij) = log

(yij/y·j + 1

mi + 1

)

2Ph.D. work of Melina [email protected] RNA-seq co-expression analysis 24 / 25

Page 36: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Discussion Future work

Future work: Model comparisons for co-expression

Is it better to model the raw counts yij using a Poisson distribution orappropriately transformed counts t(yij) using a Gaussian distribution?2

f (yi |K , θK ) =K∑

k=1

πk

J∏j=1

P(yij |θk)

- vs -

g(t(yi )|K , ηK ) =K∑

k=1

πkΦ(t(yi )|µk ,Σk)

For example,

t(yij) = log

(yij/y·j + 1

mi + 1

)2Ph.D. work of Melina Gallopin

[email protected] RNA-seq co-expression analysis 24 / 25

Page 37: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Discussion Future work

Future work: Model comparisons for RNA-seqco-expression

BIC model selection criterion enables an objective comparison:

BICf (K ; y) =∑n

i=1 log f (yi ;K , θK )− υf2 log n

BICg (K ; y) =∑n

i=1 log g(t(yi );K , ηK ) +∑n

i=1 log t ′(yi )− υg2 log n

Left: Sultan et al. (2008). Right: Mach et al. (2014)

Further comparisons of transformations / models in progress...

[email protected] RNA-seq co-expression analysis 25 / 25

Page 38: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Discussion Future work

Future work: Model comparisons for RNA-seqco-expression

BIC model selection criterion enables an objective comparison:

BICf (K ; y) =∑n

i=1 log f (yi ;K , θK )− υf2 log n

BICg (K ; y) =∑n

i=1 log g(t(yi );K , ηK ) +∑n

i=1 log t ′(yi )− υg2 log n

Left: Sultan et al. (2008). Right: Mach et al. (2014)

Further comparisons of transformations / models in [email protected] RNA-seq co-expression analysis 25 / 25

Page 39: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Thank you!

In collaboration with...

Gilles Celeux (Inria Saclay - Ile-de-France)

Cathy Maugis-Rabusseau (INSA / IMT Toulouse)

Marie-Laure Martin-Magniette (AgroParisTech / INRA URGV)

Panos Papastamoulis (University of Manchester)

Melina Gallopin (current Ph.D. student)

Page 40: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Estimation of finite mixture models

A finite mixture model may be seen as an incomplete datastructure model

The complete data are

x = (y, z) = (x1, . . . , xn) = ((y1, . . . , yn), (z1, . . . , zn))

where the missing data are z = (z1, . . . , zn) = (zik)zi component of i , where zik = 1 if i arises from group k and 0otherwisez defines a partition P = (P1, . . . ,PK ) of the observed data y withPk = {i |zik = 1}

Expected completed likelihood:

L(ΨK ; y, z) =n∑

i=1

K∑k=1

zik {log πk + log fk(y;θ)}+ λπ

(K∑

k=1

πk − 1

)

where λπ is the Lagrange multiplier for the constraint on π

Page 41: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Estimation: EM algorithm (Dempster et al., 1977)

E-step Compute the conditional probabilities:

τik

(b)k

)=

π(b)k f (yi |θ

(b)k )∑K

m=1 π(b)m f (yi |θ

(b)m )

M-step Update Ψk to maximize the expected value of the completedlikelihood by weighting observation i for cluster k with

τik

(b)k

):

π(b+1)k =

1

n

n∑i=1

τik

(b)k

),

wi = yi ··

λ(b+1)jk =

∑ni=1 τik

(b)k

)yij ·

sj ·∑n

i=1 τik

(b)k

)yi ··

Back...

Page 42: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Splitting initialization (Papastamoulis et al., 2014)

for K ← 2 to gmax do– Calculate per-class entropy ek = −

∑i∈k log tK−1

ik for model with (K − 1)clusters– Select cluster k? = arg maxk ek to be splitfor i ← 1 to init.runs do

– Randomly split the observations in cluster k? into two clusters– Calculate corresponding λ(0,i),K and π(0,i),K

– Update values of λ(0,i),K and π(0,i),K via EM algorithm withinit.iter iterations– Calculate the log-likelihood L(i),K = L(λ(0,i),K , π(0,i),K )

end

Let i? = arg maxi L(i),K . Fix new initial values λ(0),K = λ(0,i?),K and

π(0),K = π(0,i?),K .end

Back...

Page 43: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Penalized criteria for model selection: BIC (Schwarz, 1978)

Maximization of integrated likelihood:

m = arg minm∈M

−f (y|m)

where

f (y|m) =

∫Θm

f (y|θ,m)Π(θ|m)dθ

Asymptotic approximation (where Dm = (m − 1) + m × J is thedimension of Sm):

−ln(f (y|m)) ≈ −L(y|θm) +Dm

2ln(n)

= nγn(sm) +Dm

2ln(n)

Bayesian information criterion (BIC):

m = arg minm∈M

{γn(sm) + penBIC(m)} with penBIC(m) =Dm

2nln(n)

Back...

Page 44: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Penalized criteria for model selection: ICL (Biernacki et al., 2000)

Alternative based on maximization of integrated completed likelihood:

m = arg minm∈M

−f (y, z|m)

where

f (y, z|m) =

∫Θm

f (y, z|θ,m)Π(θ|m)dθ

BIC-like asymptotic approximation for Integrated CompletedLikelihood (ICL):

m = arg minm∈M

{−1

nL(θm|y, z) +

Dm

2nln(n)}

= arg minm∈M

{γn(sm) + penBIC(m) + Ent(m)}

where Ent(m) = − 1n

∑i

∑k zik ln τik(θm)

Back...

Page 45: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Behavior of BIC and ICL in practice for RNA-seq data

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

logL

ike

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clustersB

IC

X

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

ICL

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

logL

ike

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

BIC

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

ICL

X

Back...

Page 46: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Behavior of BIC and ICL in practice for RNA-seq data

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

logL

ike

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clustersB

IC

X

0 20 40 60

−25

0000

−20

0000

−15

0000

−10

0000

Number of clusters

ICL

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

logL

ike

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

BIC

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

ICL

X

Back...

Page 47: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Description of competing models

1 PoisL (Cai et al., 2004): K-means type algorithm using Poissonloglinear model

Equivalent to HTSCluster when equal library sizes, unreplicated data,equiprobable Poisson mixtures, and parameter estimation via theClassification EM (CEM) algorithm

2 Witten (2011): hierarchical clustering of dissimilarity measure basedon a Poisosn loglinear model

Originally intended to cluster samples

3 Si et al. (2014): model-based hierarchical algorithm using Poissonand negative binomial models

4 Classic K-means algorithm on expression profiles (yij`/yi ··)

Model selection not addressed by any of the above ⇒ Calinski andHarabasz index (1974) used for comparison

Page 48: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Simulation procedure (based on fly and human data)

For each setting (fly and human), 50 individual datasets:

K fixed to 15, true experimental design used

All parameters (λjk), (sj`), and (wi ) fixed to estimated values fromreal data analysis

n = 3000 genes randomly sampled from fly or human data, weightedby their maximum conditional probability

For each selected gene, we sample from the appropriate Poissondistribution:

Yijl ∼ P(µijlk)

where µijlk = wi sjlλjk if zik = 1.

Page 49: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Simulation results

All models fit for K ∈ 1, . . . , 40

Model selection via the slope heuristics (HTSCluster, PoisL) orCH-index (Si-Pois, Si-NB, Witten)

Models compared using the adjusted Rand Index (ARI, Hubert &Arabie 1985)

For comparison, also consider the oracle ARI (based on assignment ofobservations to clusters using the true parameter values)

Page 50: RNA-seq co-expression analysis using mixture modelsjouy.inra.fr RNA-seq co-expression analysis 3 / 25 Introduction Co-expression analysis with RNA-seq data RNA-seq data, continued

Simulation results

Table: Mean (sd) ARI for simulations with parameters based on the fly and human liver.

Method Model selection Fly Human

HTSClustercapushe 0.93 (0.05) 0.61 (0.02)True K 0.84 (0.09) 0.60 (0.02)

PoisLcapushe 0.79 (0.15) 0.53 (0.05)True K 0.82 (0.05) 0.53 (0.04)

WittenCH index 0.15 (0.07) 0.11 (0.03)True K 0.67 (0.09) 0.39 (0.04)

Si-PoisCH index 0.26 (0.17) 0.48 (0.04)True K 0.95 (0.02) 0.61 (0.02)

Si-NBCH index 0.23 (0.16) 0.47 (0.04)True K 0.94 (0.02) 0.60 (0.02)

K-means True K 0.79 (0.08) 0.42 (0.02)

Oracle True K 0.95 (0.01) 0.63 (0.01)

Back...