aspects of statistical modelling & data analysis in gene ...mw/.downloads/semstat05/mw... ·...
TRANSCRIPT
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Mike West Mike West Duke UniversityDuke University
Aspects of Statistical Modelling Aspects of Statistical Modelling & Data Analysis & Data Analysis
in Gene Expression Genomicsin Gene Expression Genomics
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
FlightplanFlightplan
#1
#2
#3
#4
#5
#6
Genomics, Microarrays, Data:Big picture
Bayesics - Regression and Shrinkage:Gene expression as predictors
Patterns and Factors:Prediction via pattern profiling
Sparse Modelling:Regression subset-structure uncertainty
Sparse Models and Profiling:Gene expression as response: Designed experiments
Sparse Models and Profiling:Gene expression as response: Latent factor models
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Animal Model to Human: Predictive Pattern Translation Animal Model to Human: Predictive Pattern Translation
Gene expression profiles: Gene expression profiles:
Signatures of statesSignatures of states
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Cardiovascular Genomics: AtherosclerosisCardiovascular Genomics: Atherosclerosis
Human aorta gene expression:Human aorta gene expression:Gene/pathway identification
Disease state prediction
Mice models:Mice models:Disease state characterisation
Mouse Human mapping/fusion
Large-scale multifactoral designGene expression (aorta) response
Action is interactionsAge, Diet, Gender, WT/ApoE
24x5+ balanced factorial design
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Multivariate ANOVAMultivariate ANOVA
12,500 genes in parallel
X = β WT, 6wk, chow, fem (baseline)
+ µ male
+ δ fat diet
+ α age=12wk/old
+ γ ApoE genotype
+ µδ fat diet & male
+ µα 12wk/old & male
+ µγ ApoE & male
+ δα, δγ, αγ
+ µδα, µδγ, µαγ, δαγ, µδαγ+ noise
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Highly Multivariate Highly Multivariate MultifactorialMultifactorial ExperimentsExperiments
Multiplicities: Full multivariate analysis – simultaneous inference: identify “real” effects
Bayesian multivariate Anova: Multiple shrinkage estimation
A precursor experiment: Beware the flood (RNAi, etc)
Analysis goals: Gene identification, pattern profiling of states/interactions of design variables
Translation to human samples: Predict expression signatures
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Multivariate AnovaMultivariate Anova
Genes by samples
Gene-specific regression parameters = +
Fixed design
Design factor j: Main effect, interaction, …
Multiple tests/comparisons – simultaneous inference
Substantial noise component
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
SPARSE Multivariate AnovaSPARSE Multivariate Anova
Sparsity priors: small
Extends traditional sparsity priors:
Gene g, Design factor j: ~ sparsity
Invites informative, structural modelling
Hierarchical models/priors within design factors
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Large p Large p -- Shrinkage and SparsityShrinkage and Sparsity
Model-based, automatic shrinkage - Simultaneous ”multiple tests”
Decision theory/false discovery? Estimation versus Decision?
How many comparisons/hypotheses?
Computation and
posterior summarisation
MCMC methods
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Hierarchical Sparse Anova Model Hierarchical Sparse Anova Model
0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 10
5 0
1 0 0
1 5 0
2 0 0
2 5 0
3 0 0
3 5 0
4 0 0
4 5 0
5 0 0
Freq
uenc
y
P o s te r io r fo r π
New hierarchical model:
It matters: improved signal isolation
MCMC methods vital
(Lucas et al 05)(Lucas et al 05)0 0 .2 0 .4 0 .6 0 .8 10
5 0
1 0 0
1 5 0
2 0 0
2 5 0
3 0 0
3 5 0
4 0 0
4 5 0
5 0 0
P o s te r io r fo r π
Freq
uenc
y
Design factor j: ~ sparsity
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Multivariate Sparse RegressionMultivariate Sparse Regression
0 20 40 60 80 100-0 .2
-0 .15
-0 .1
-0 .05
0
0 .05
0 .1
0 .15
0 .2
0 .25
0 .3
O bserv ation #
Expr
essi
on L
evel
0 20 40 60 80 100-0 .2
-0 .15
-0 .1
-0 .05
0
0 .05
0 .1
0 .15
0 .2
O bserv ation #
Expr
essi
on L
evel
Regressions example: “housekeeping gene control factors”
Gene-Sample-Study artifacts
(Lucas et al 05)(Lucas et al 05)
Oncogene interventions experiments
PC 1
PC 2
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Multivariate Sparse Regression: Bias CorrectionsMultivariate Sparse Regression: Bias Corrections
Major but Sparse effects: Selective impact across genes
X Fitted component on control factor 1
° Gene expression data
(Lucas et al 05)(Lucas et al 05)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Multivariate Sparse Regression: Bias CorrectionsMultivariate Sparse Regression: Bias Corrections
Oncogene interventions experiments:
Myc up-regulated gene
0 20 40 60 80 1005
5 .5
6
6 .5
7
7 .5
8
O bserv ation #
Expr
essi
on L
evel No
control factors
0 20 40 60 80 1005
5 .5
6
6 .5
7
7 .5
8
O bserv ation #
Expr
essi
on L
evel With
control factors False discovery
(Lucas et al 05)(Lucas et al 05)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Multivariate Sparse Anova Model for VariancesMultivariate Sparse Anova Model for Variances
Sparse ANOVA for noise variances
0 20 40 60 80 100
8 .4
8 .6
8 .8
9
9 .2
9 .4
9 .6
O bserv ation #
Expr
essi
on L
evel
Sparsity prior for
0 20 40 60 80 100
8 .4
8 .6
8 .8
9
9 .2
9 .4
9 .6
O bserv ation #
Expr
essi
on L
evel
Oncogene interventions experiments: False discovery
MCMC methods vital
(Lucas et al 05)(Lucas et al 05)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
GeneGene--Environment Associations in Mice Experiments Environment Associations in Mice Experiments
Probabilities and log-likelihood ratios
- SHRINKAGE
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Pattern Profiles: Signatures of StatesPattern Profiles: Signatures of States
‘significant’ effects define characterising metagenes
ApoE- WT/ApoE+
26
194
128
93 60
19
ApoE.Age65 genes
ApoE.Diet321 genes
ApoE.Age.Diet292 genes
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Pattern Profiling: MousePattern Profiling: Mouse--Human TranslationHuman Translation
Mapping genes from mouse to human normalise/evaluate/extrapolate signatures
Mouse WT/ApoE+
Mouse ApoE-
Human D-
Human D+
Human D+/-
Aorta lesions, advanced
atherosclerosis
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Pattern Profiling: MousePattern Profiling: Mouse--Human TranslationHuman Translation
ApoE.Age
Intersection of ApoE,
ApoE+{Age,Diet}
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
HumanHuman--Mouse Model Data Fusion: Clinical Research?Mouse Model Data Fusion: Clinical Research?
Improved pathway characterisation: implicated genes
More/better phenotypes in humans
Blood/serum assays for gene expression
Metabolites in serum and blood: metabolic/genomic fusion
Disease Risk Signature
(Karra et al 04; Seo et al 05)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
FlightplanFlightplan
#1
#2
#3
#4
#5
#6
Genomics, Microarrays, Data:Big picture
Bayesics - Regression and Shrinkage:Gene expression as predictors
Patterns and Factors:Prediction via pattern profiling
Sparse Modelling:Regression subset-structure uncertainty
Sparse Models and Profiling:Gene expression as response: Designed experiments
Sparse Models and Profiling:Gene expression as response: Latent factor models
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Latent Factor Models & Patterns of AssociationLatent Factor Models & Patterns of Association
Decompositions of p(x)
Latent structure underlying associations
Cancer Studies:
Multiple deregulated pathway components
Latent Factors:
intersecting sub-pathways
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Latent Factor Model StructureLatent Factor Model Structure
One sample
- column p-vector
Vector of k<<p latent
- underlying –
factor variables
Idiosyncratic variation
Latent factors:
Model of covariance matrix:
(West 03, Valencia 7; +Aguilar 01 JBES; +Lopes 04 Stat Sinica)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Sparsity and Structure Sparsity and Structure
Sparse Models:One factor – few or many variables
One variable – 0,1, or few factors
x17
x13
x12x11
x10
x9
x8
x7
x6x5
x4x3
x2x1
- - -
- - -
x16
x15
x14
Row (variable) g, factor j:
0,1, .., small
(West 2003, Valencia 7)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Sparsity and StructureSparsity and Structure
Uncertain sparsity patterns:
+ ++
B =
Structure:Parametrisation of B - Identification
“founders” of factors
(West 2003, Valencia 7)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Computation: Global MCMCComputation: Global MCMC
Other (hyper) parameters “easy”
Not parallelizable:
Serial Gibbs/MCMC
Parallel within Gibbs iterates
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Inference on Sparsity and StructureInference on Sparsity and Structure
Aspects of p(B,Π|X)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Factor Models and Association PatternsFactor Models and Association Patterns
Sample correlations
Fitted correlationsin B’TB+Ψ
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Decomposing Associations by FactorDecomposing Associations by Factor
Covariance decompositions:
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Decomposition by Factor Signatures Decomposition by Factor Signatures
Data decompositions:
Breast cancer gene expression Factor 2: ER, 9: HER2
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Decomposition by Factor Signatures Decomposition by Factor Signatures
Discovery of gene-factor associations
Isolation of non –associated genes
… noise/idiosyncratic
Factors =
Patterns, Signatures
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Sparse Latent Factor RegressionSparse Latent Factor Regression
Response variables z
Evaluate p(z,x)
Predict z
z linked to some latent factors in x space?
… and to some individual x variables?
Response factorsx17
x13
x12x11
x10
x9
x8
x7
x6x5
x4x3
x2x1
- - -
- - -
x16
x15
x14
z
(West 2003, Valencia 7; Carvalho et al, 2005)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Sparse Factor Regression: Predictive SignaturesSparse Factor Regression: Predictive Signatures
“bad samples” factors
A “cleaned-up” ER factor
SVD Sparse Model
Factor models “clean up” SVD
SVD “noisy” factors(West 2003, Valencia 7)
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Sparse Factor Regression: Predictive SignaturesSparse Factor Regression: Predictive Signatures
Expression assays of hormonal status
Jointly predicted: Multiple factors
4+ studies
ER
HER2
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Sparse Factor Regression: Predictive SignaturesSparse Factor Regression: Predictive Signatures
Jointly predicted:
Multiple tests/assays
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Exploratory/Discovery AnalysesExploratory/Discovery Analyses
Evolutionary gene set enrichmentExperiments: gene set responses
to perturbations? Promotor TF interactions
Duke/NCI Systems Biology CenterGenomic Data Synthesis
2 4 (apoptosis) (DNA replication)
MDM2GSNAGC1RBM8AMTHFSCUGBP2PKN2SSR4FANCGNME3POU4F1AGC1MDM2CRHR1H1FXRPS3AABCB8RGS12GLG1DOC-1RTNPO3MDM2MAPTLORGUCA1AGRIA1CDC34COL11A2MYCTBL3BTF3UCP3LBA1CDKN2CHTR6CDC6CYP2A13KHDRBS1KIAA0284PEX5CYP2A6LTKSSTR3MDM2
MCM6VCAM1CYP2A13PITX1MCM2DDX39MCM7GSNGSTM1CCNE1MCM3MCM4MFGE8ABCA3CDKN2AMCM5KIAA1026TOMM70ACDC2SASPOLR2HCSNK1DNME3CDC6CHC1TNFRSF14BACHMVDFANCGLIG1SF3B4CCNE1ABCF3 2
4
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Structure and Skeletons of VariableStructure and Skeletons of Variable--Factor WiringFactor WiringHER2
ER
Myc.a
Myc.b Myc.b
Myc.a
E2F1
E2F2
E2F4
E2F2
bcta-cat
E2F1
FactorsGe
nes
Biologically “Hard” vs “Soft”interactions
TF binding sites
P-P interactions
Expression expts
Priors over sparsity probabilities
Status, relative “activation”, deregulation
Priors over non-zero effects
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Graphs, Networks and Factor ModelsGraphs, Networks and Factor Models
Graphs of sparse factor models –covariance or association graphs
Graphical models – precision (inverse covariance) graphs
Multiple cancer data sets – Fusion studies
Constant/consonant structure
- network “motifs”(Zhou et al USC/PNAS 05)
( Dobra et al 04, J Multivariate Analysis;
Jones et al 05, Statistical Science;
Rich et al 05, Cancer Research )
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Comprehensive Sparse Models for Gene ExpressionComprehensive Sparse Models for Gene Expression
Sparsity models for:
• Design effects
• Control/correction factor effects
• Latent structure: Complex associations
• Response prediction
Combined model(Carvalho et al 05; Lucas et al 05)
Computation, Model Search: MCMC and MCMC-inspired Stochastic Search
SemStat05 SemStat05 –– Warwick Warwick -- Sept 11 & 12Sept 11 & 12thth 20052005
Papers, software, many many links:www.isds.duke.edu/~mw
ABS04 web site: Lecture slides, course notes, papers, data, links: www.isds.duke.edu/~mw/ABS04
Integrated Cancer Biology Programicbp.genome.duke.edu
Genome Institute @ Dukewww.genome.duke.edu