aspects of statistical modelling & data analysis in gene ...mw/.downloads/semstat05/mw... ·...

40
SemStat05 SemStat05 Warwick Warwick - - Sept 11 & 12 Sept 11 & 12 th th 2005 2005 Mike West Mike West Duke University Duke University Aspects of Statistical Modelling Aspects of Statistical Modelling & Data Analysis & Data Analysis in Gene Expression Genomics in Gene Expression Genomics

Upload: buidan

Post on 16-Aug-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Mike West Mike West Duke UniversityDuke University

Aspects of Statistical Modelling Aspects of Statistical Modelling & Data Analysis & Data Analysis

in Gene Expression Genomicsin Gene Expression Genomics

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

FlightplanFlightplan

#1

#2

#3

#4

#5

#6

Genomics, Microarrays, Data:Big picture

Bayesics - Regression and Shrinkage:Gene expression as predictors

Patterns and Factors:Prediction via pattern profiling

Sparse Modelling:Regression subset-structure uncertainty

Sparse Models and Profiling:Gene expression as response: Designed experiments

Sparse Models and Profiling:Gene expression as response: Latent factor models

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Animal Model to Human: Predictive Pattern Translation Animal Model to Human: Predictive Pattern Translation

Gene expression profiles: Gene expression profiles:

Signatures of statesSignatures of states

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Cardiovascular Genomics: AtherosclerosisCardiovascular Genomics: Atherosclerosis

Human aorta gene expression:Human aorta gene expression:Gene/pathway identification

Disease state prediction

Mice models:Mice models:Disease state characterisation

Mouse Human mapping/fusion

Large-scale multifactoral designGene expression (aorta) response

Action is interactionsAge, Diet, Gender, WT/ApoE

24x5+ balanced factorial design

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Multivariate ANOVAMultivariate ANOVA

12,500 genes in parallel

X = β WT, 6wk, chow, fem (baseline)

+ µ male

+ δ fat diet

+ α age=12wk/old

+ γ ApoE genotype

+ µδ fat diet & male

+ µα 12wk/old & male

+ µγ ApoE & male

+ δα, δγ, αγ

+ µδα, µδγ, µαγ, δαγ, µδαγ+ noise

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Highly Multivariate Highly Multivariate MultifactorialMultifactorial ExperimentsExperiments

Multiplicities: Full multivariate analysis – simultaneous inference: identify “real” effects

Bayesian multivariate Anova: Multiple shrinkage estimation

A precursor experiment: Beware the flood (RNAi, etc)

Analysis goals: Gene identification, pattern profiling of states/interactions of design variables

Translation to human samples: Predict expression signatures

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Multivariate AnovaMultivariate Anova

Genes by samples

Gene-specific regression parameters = +

Fixed design

Design factor j: Main effect, interaction, …

Multiple tests/comparisons – simultaneous inference

Substantial noise component

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

SPARSE Multivariate AnovaSPARSE Multivariate Anova

Sparsity priors: small

Extends traditional sparsity priors:

Gene g, Design factor j: ~ sparsity

Invites informative, structural modelling

Hierarchical models/priors within design factors

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Large p Large p -- Shrinkage and SparsityShrinkage and Sparsity

Model-based, automatic shrinkage - Simultaneous ”multiple tests”

Decision theory/false discovery? Estimation versus Decision?

How many comparisons/hypotheses?

Computation and

posterior summarisation

MCMC methods

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Hierarchical Sparse Anova Model Hierarchical Sparse Anova Model

0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 10

5 0

1 0 0

1 5 0

2 0 0

2 5 0

3 0 0

3 5 0

4 0 0

4 5 0

5 0 0

Freq

uenc

y

P o s te r io r fo r π

New hierarchical model:

It matters: improved signal isolation

MCMC methods vital

(Lucas et al 05)(Lucas et al 05)0 0 .2 0 .4 0 .6 0 .8 10

5 0

1 0 0

1 5 0

2 0 0

2 5 0

3 0 0

3 5 0

4 0 0

4 5 0

5 0 0

P o s te r io r fo r π

Freq

uenc

y

Design factor j: ~ sparsity

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Multivariate Sparse RegressionMultivariate Sparse Regression

0 20 40 60 80 100-0 .2

-0 .15

-0 .1

-0 .05

0

0 .05

0 .1

0 .15

0 .2

0 .25

0 .3

O bserv ation #

Expr

essi

on L

evel

0 20 40 60 80 100-0 .2

-0 .15

-0 .1

-0 .05

0

0 .05

0 .1

0 .15

0 .2

O bserv ation #

Expr

essi

on L

evel

Regressions example: “housekeeping gene control factors”

Gene-Sample-Study artifacts

(Lucas et al 05)(Lucas et al 05)

Oncogene interventions experiments

PC 1

PC 2

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Multivariate Sparse Regression: Bias CorrectionsMultivariate Sparse Regression: Bias Corrections

Major but Sparse effects: Selective impact across genes

X Fitted component on control factor 1

° Gene expression data

(Lucas et al 05)(Lucas et al 05)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Multivariate Sparse Regression: Bias CorrectionsMultivariate Sparse Regression: Bias Corrections

Oncogene interventions experiments:

Myc up-regulated gene

0 20 40 60 80 1005

5 .5

6

6 .5

7

7 .5

8

O bserv ation #

Expr

essi

on L

evel No

control factors

0 20 40 60 80 1005

5 .5

6

6 .5

7

7 .5

8

O bserv ation #

Expr

essi

on L

evel With

control factors False discovery

(Lucas et al 05)(Lucas et al 05)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Multivariate Sparse Anova Model for VariancesMultivariate Sparse Anova Model for Variances

Sparse ANOVA for noise variances

0 20 40 60 80 100

8 .4

8 .6

8 .8

9

9 .2

9 .4

9 .6

O bserv ation #

Expr

essi

on L

evel

Sparsity prior for

0 20 40 60 80 100

8 .4

8 .6

8 .8

9

9 .2

9 .4

9 .6

O bserv ation #

Expr

essi

on L

evel

Oncogene interventions experiments: False discovery

MCMC methods vital

(Lucas et al 05)(Lucas et al 05)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

GeneGene--Environment Associations in Mice Experiments Environment Associations in Mice Experiments

Probabilities and log-likelihood ratios

- SHRINKAGE

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Pattern Profiles: Signatures of StatesPattern Profiles: Signatures of States

‘significant’ effects define characterising metagenes

ApoE- WT/ApoE+

26

194

128

93 60

19

ApoE.Age65 genes

ApoE.Diet321 genes

ApoE.Age.Diet292 genes

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Pattern Profiling: MousePattern Profiling: Mouse--Human TranslationHuman Translation

Mapping genes from mouse to human normalise/evaluate/extrapolate signatures

Mouse WT/ApoE+

Mouse ApoE-

Human D-

Human D+

Human D+/-

Aorta lesions, advanced

atherosclerosis

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Pattern Profiling: MousePattern Profiling: Mouse--Human TranslationHuman Translation

ApoE.Age

Intersection of ApoE,

ApoE+{Age,Diet}

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

HumanHuman--Mouse Model Data Fusion: Clinical Research?Mouse Model Data Fusion: Clinical Research?

Improved pathway characterisation: implicated genes

More/better phenotypes in humans

Blood/serum assays for gene expression

Metabolites in serum and blood: metabolic/genomic fusion

Disease Risk Signature

(Karra et al 04; Seo et al 05)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

FlightplanFlightplan

#1

#2

#3

#4

#5

#6

Genomics, Microarrays, Data:Big picture

Bayesics - Regression and Shrinkage:Gene expression as predictors

Patterns and Factors:Prediction via pattern profiling

Sparse Modelling:Regression subset-structure uncertainty

Sparse Models and Profiling:Gene expression as response: Designed experiments

Sparse Models and Profiling:Gene expression as response: Latent factor models

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Latent Factor Models & Patterns of AssociationLatent Factor Models & Patterns of Association

Decompositions of p(x)

Latent structure underlying associations

Cancer Studies:

Multiple deregulated pathway components

Latent Factors:

intersecting sub-pathways

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Latent Factor Model StructureLatent Factor Model Structure

One sample

- column p-vector

Vector of k<<p latent

- underlying –

factor variables

Idiosyncratic variation

Latent factors:

Model of covariance matrix:

(West 03, Valencia 7; +Aguilar 01 JBES; +Lopes 04 Stat Sinica)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Sparsity and Structure Sparsity and Structure

Sparse Models:One factor – few or many variables

One variable – 0,1, or few factors

x17

x13

x12x11

x10

x9

x8

x7

x6x5

x4x3

x2x1

- - -

- - -

x16

x15

x14

Row (variable) g, factor j:

0,1, .., small

(West 2003, Valencia 7)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Sparsity and StructureSparsity and Structure

Uncertain sparsity patterns:

+ ++

B =

Structure:Parametrisation of B - Identification

“founders” of factors

(West 2003, Valencia 7)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Computation: Global MCMCComputation: Global MCMC

Other (hyper) parameters “easy”

Not parallelizable:

Serial Gibbs/MCMC

Parallel within Gibbs iterates

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Inference on Sparsity and StructureInference on Sparsity and Structure

Aspects of p(B,Π|X)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Factor Models and Association PatternsFactor Models and Association Patterns

Sample correlations

Fitted correlationsin B’TB+Ψ

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Decomposing Associations by FactorDecomposing Associations by Factor

Covariance decompositions:

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Decomposition by Factor Signatures Decomposition by Factor Signatures

Data decompositions:

Breast cancer gene expression Factor 2: ER, 9: HER2

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Decomposition by Factor Signatures Decomposition by Factor Signatures

Discovery of gene-factor associations

Isolation of non –associated genes

… noise/idiosyncratic

Factors =

Patterns, Signatures

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Sparse Latent Factor RegressionSparse Latent Factor Regression

Response variables z

Evaluate p(z,x)

Predict z

z linked to some latent factors in x space?

… and to some individual x variables?

Response factorsx17

x13

x12x11

x10

x9

x8

x7

x6x5

x4x3

x2x1

- - -

- - -

x16

x15

x14

z

(West 2003, Valencia 7; Carvalho et al, 2005)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Sparse Factor Regression: Predictive SignaturesSparse Factor Regression: Predictive Signatures

“bad samples” factors

A “cleaned-up” ER factor

SVD Sparse Model

Factor models “clean up” SVD

SVD “noisy” factors(West 2003, Valencia 7)

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Sparse Factor Regression: Predictive SignaturesSparse Factor Regression: Predictive Signatures

Expression assays of hormonal status

Jointly predicted: Multiple factors

4+ studies

ER

HER2

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Sparse Factor Regression: Predictive SignaturesSparse Factor Regression: Predictive Signatures

Jointly predicted:

Multiple tests/assays

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Exploratory/Discovery AnalysesExploratory/Discovery Analyses

Evolutionary gene set enrichmentExperiments: gene set responses

to perturbations? Promotor TF interactions

Duke/NCI Systems Biology CenterGenomic Data Synthesis

2 4 (apoptosis) (DNA replication)

MDM2GSNAGC1RBM8AMTHFSCUGBP2PKN2SSR4FANCGNME3POU4F1AGC1MDM2CRHR1H1FXRPS3AABCB8RGS12GLG1DOC-1RTNPO3MDM2MAPTLORGUCA1AGRIA1CDC34COL11A2MYCTBL3BTF3UCP3LBA1CDKN2CHTR6CDC6CYP2A13KHDRBS1KIAA0284PEX5CYP2A6LTKSSTR3MDM2

MCM6VCAM1CYP2A13PITX1MCM2DDX39MCM7GSNGSTM1CCNE1MCM3MCM4MFGE8ABCA3CDKN2AMCM5KIAA1026TOMM70ACDC2SASPOLR2HCSNK1DNME3CDC6CHC1TNFRSF14BACHMVDFANCGLIG1SF3B4CCNE1ABCF3 2

4

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Structure and Skeletons of VariableStructure and Skeletons of Variable--Factor WiringFactor WiringHER2

ER

Myc.a

Myc.b Myc.b

Myc.a

E2F1

E2F2

E2F4

E2F2

bcta-cat

E2F1

FactorsGe

nes

Biologically “Hard” vs “Soft”interactions

TF binding sites

P-P interactions

Expression expts

Priors over sparsity probabilities

Status, relative “activation”, deregulation

Priors over non-zero effects

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Graphs, Networks and Factor ModelsGraphs, Networks and Factor Models

Graphs of sparse factor models –covariance or association graphs

Graphical models – precision (inverse covariance) graphs

Multiple cancer data sets – Fusion studies

Constant/consonant structure

- network “motifs”(Zhou et al USC/PNAS 05)

( Dobra et al 04, J Multivariate Analysis;

Jones et al 05, Statistical Science;

Rich et al 05, Cancer Research )

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Comprehensive Sparse Models for Gene ExpressionComprehensive Sparse Models for Gene Expression

Sparsity models for:

• Design effects

• Control/correction factor effects

• Latent structure: Complex associations

• Response prediction

Combined model(Carvalho et al 05; Lucas et al 05)

Computation, Model Search: MCMC and MCMC-inspired Stochastic Search

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

Papers, software, many many links:www.isds.duke.edu/~mw

ABS04 web site: Lecture slides, course notes, papers, data, links: www.isds.duke.edu/~mw/ABS04

Integrated Cancer Biology Programicbp.genome.duke.edu

Genome Institute @ Dukewww.genome.duke.edu

SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005

CoCo--PilotsPilots

www.isds.duke.edu/~mwwww.isds.duke.edu/~mw

Chris Hans

Quanli Wang

Carlos Carvalho

Beatrix JonesJoe Lucas

Joe Nevins

End of FlightEnd of Flight