Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Upload: pamela-rodgers

Post on 17-Jan-2016

TRANSCRIPT

Page 1: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Special Topics in Genomics

Cis-regulatory Modules and Phylogenetic Footprinting

Page 2: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

The slides for module discovery are provided by Prof. Qing Zhou @ UCLA

Cis-regulatory Modules and Module Discovery

Page 3: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Motif Discovery

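[Figure: mixture modeling of a sequence with a background model and a motif weight matrix; rows of the matrix are the bases A, C, G, T]

Background: $\theta_0 = (\theta_{0A}, \theta_{0C}, \theta_{0G}, \theta_{0T})$

Motif (weight matrix): $\Theta = [\theta_1, \theta_2, \ldots, \theta_w]$, where $\theta_j = (\theta_{jA}, \theta_{jC}, \theta_{jG}, \theta_{jT})$ for motif positions $j = 1, \ldots, w$

Mixture modeling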
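As a rough illustration of the mixture idea, the sketch below scores each window of a sequence by its log-likelihood ratio under a motif weight matrix versus the background. The matrix values, the uniform background, and the function names are hypothetical and chosen only for this example; they are not taken from the slides.

```python
import math

# Hypothetical toy parameters (assumptions for illustration only).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # theta_0
# Weight matrix Theta: one probability vector per motif position (width w = 3).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def log_likelihood_ratio(segment, motif=MOTIF, background=BACKGROUND):
    """Return log P(segment | Theta) - log P(segment | theta_0)."""
    if len(segment) != len(motif):
        raise ValueError("segment length must equal the motif width")
    return sum(math.log(col[base] / background[base])
               for col, base in zip(motif, segment))


def best_site(sequence, motif=MOTIF, background=BACKGROUND):
    """Return (start, score) of the highest-scoring window (0-based start)."""
    w = len(motif)
    scores = [log_likelihood_ratio(sequence[i:i + w], motif, background)
              for i in range(len(sequence) - w + 1)]
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]


start, score = best_site("TTTACGTT")
```

A full motif sampler would treat site positions as missing data and re-estimate the matrix; this sketch only shows the scoring step for fixed parameters.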

Page 4: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Difficulties in motif discovery in higher organisms

• Upstream sequences are longer.

• Motifs are less conserved and shorter.

• Background sequence structures are more complicated.

• To solve the problem, utilize more biological knowledge in our model.

1) module structure

2) multiple species conservation

Page 5: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Cis-regulatory module

• Combinatorial control of genes: cis-regulatory modules

[Figure: schematic with two regions labeled "module"]

Page 6: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

CisModule: modeling module structure (Zhou and Wong, PNAS 2004)

• Module structure: consider co-localization of motif sites.

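[Figure: a module containing sites of Motif 1, Motif 2, and Motif 3]

Hierarchical mixture modeling

K: # of motifs

• Sequence S: a position starts a module (M) with probability r, or is background (B) with probability 1 - r.

• Within a module: a position starts a site of motif k (weight matrix $\Theta_k$) with probability $q_k$, k = 1, ..., K, or is background ($\theta_0$) with probability $q_0$.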

Page 7: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Parameters and missing data

• Missing data problem.

Given:
K    # of motifs
l    Module length

Observed data:
S    Set of sequences

Missing data:
M    Indicators for a module start
A    Indicators for a motif site start

Parameters Ψ:
θ_0  Background model
Θ    Weight matrices for motifs
W    Motif widths
r    Probability of a module start
q    Probability of starting a motif site

Page 8: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Bayesian inference by posterior sampling

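Module-motif detection

Given Θ, r, q, and W:

1) Sample modules.

2) Within each module, sample motif sites.

[Figure: sequence regions labeled M=1 (inside a module) and M=0 (outside)]

Parameter update

Given M and A:

1) Infer Θ from aligned sites.

2) Update r, q, and W.

Aligned sites:

TTTGC
TATCC
CTTGC
TTTAC
GTTGC

Counts per motif position from the aligned sites (columns θ_1, θ_2, ..., θ_w):

     1  2  3  4  5
T    3  4  5  0  0
G    1  0  0  3  0
C    1  0  0  1  5
A    0  1  0  1  0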
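The count step of the parameter update can be sketched as follows, using the five aligned sites from this slide. The pseudocount in `weight_matrix` is an assumption added for illustration; it gives a smoothed point estimate, whereas a posterior sampler would typically draw Θ from a distribution conditioned on the same counts.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all aligned sites must have the same width")
    columns = [Counter(site[j] for site in sites) for j in range(width)]
    return {base: [col[base] for col in columns] for base in ALPHABET}


def weight_matrix(sites, pseudocount=1.0):
    """Smoothed per-position base frequencies (pseudocount is an assumption)."""
    counts = count_matrix(sites)
    total = len(sites) + pseudocount * len(ALPHABET)
    return {base: [(c + pseudocount) / total for c in counts[base]]
            for base in ALPHABET}


sites = ["TTTGC", "TATCC", "CTTGC", "TTTAC", "GTTGC"]
counts = count_matrix(sites)
theta = weight_matrix(sites)
```

With these sites, position 3 is all T and position 5 is all C, which is what makes those columns of the weight matrix sharply peaked.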

Page 9: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Module sampling

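• Denote $P(S \mid \Psi) = \sum_{M} P(S, M \mid \Psi)$.

To sample from $P(M \mid S, \Psi)$, we need to calculate $P(S \mid \Psi)$. Write $S = x_1 x_2 \cdots x_L = x_{[1,L]}$ and $f_n(\Psi) = P(x_{[1,n]} \mid \Psi)$.

Forward summation:

$f_n(\Psi) = r\, h(n-l+1, n)\, f_{n-l}(\Psi) + (1-r)\, P(x_n \mid \theta_0)\, f_{n-1}(\Psi) = A_n(\Psi) + B_n(\Psi).$

Module: $h(n-l+1, n)$, the likelihood of $x_{[n-l+1,n]}$ as a module (term $A_n$).

Background: $P(x_n \mid \theta_0)$ (term $B_n$).

[Figure: sequence from position 1 to L; a module covers positions n-l+1 to n, after position n-l]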
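A minimal sketch of the forward summation, assuming the recursion f_n = r h(n-l+1, n) f_{n-l} + (1-r) P(x_n | θ_0) f_{n-1}, with boundary condition f_0 = 1 and no module term when n < l (both boundary choices are assumptions). The module likelihood h is passed in as a callable, and the function and argument names are hypothetical. Plain probabilities are used for readability; real sequences would need log-space or scaling to avoid underflow.

```python
def forward_summation(bg_prob, module_prob, r, l):
    """Forward pass over a sequence of length L.

    bg_prob[n-1]          -- P(x_n | theta_0) for n = 1..L
    module_prob(start, n) -- h(start, n), positions 1-based and inclusive
    r                     -- probability of a module start
    l                     -- module length
    Returns (f, A) with f[n] = f_n and A[n] = module term A_n, n = 0..L.
    """
    L = len(bg_prob)
    f = [0.0] * (L + 1)
    A = [0.0] * (L + 1)
    f[0] = 1.0  # assumed boundary condition: the empty prefix has probability 1
    for n in range(1, L + 1):
        if n >= l:  # a module ending at n must fit inside the sequence
            A[n] = r * module_prob(n - l + 1, n) * f[n - l]
        B = (1.0 - r) * bg_prob[n - 1] * f[n - 1]
        f[n] = A[n] + B
    return f, A


# Toy usage: uniform background and a constant module likelihood (made up).
bg = [0.25] * 4
f, A = forward_summation(bg, lambda start, end: 0.5, r=0.1, l=2)
```

The returned `A` and `f` are exactly the quantities needed for sampling module positions from the end of the sequence backward.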

Page 10: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Module sampling

• Backward sampling

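$P(M_{n-l+1} = 1 \mid M_{[n+1,L]}) = \dfrac{A_n(\Psi)}{f_n(\Psi)},$

where $A_n(\Psi)$ is the module term of the forward summation and $f_n(\Psi) = P(x_{[1,n]} \mid \Psi)$.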
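Backward sampling can be sketched as a walk from position L down to 1: at position n, a module ending at n is drawn with probability A_n / f_n; if drawn, the walk jumps back l positions, otherwise it steps back by one. The function name and the list layout (index 0 unused for positions) are assumptions for this example.

```python
import random


def backward_sample_modules(f, A, l, rng=None):
    """Sample module start positions (1-based) given forward quantities.

    f[n] -- forward sum f_n for n = 0..L
    A[n] -- module term A_n for n = 0..L
    l    -- module length
    """
    rng = rng or random.Random()
    starts = []
    n = len(f) - 1
    while n >= 1:
        p_module = A[n] / f[n] if f[n] > 0 else 0.0
        if n >= l and rng.random() < p_module:
            starts.append(n - l + 1)  # module covers n-l+1 .. n
            n -= l
        else:
            n -= 1  # position n is background
    return sorted(starts)


# Toy inputs where every position n >= l is certain to end a module.
toy_f = [1.0] * 7
toy_A = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
toy_starts = backward_sample_modules(toy_f, toy_A, l=3, rng=random.Random(0))
```

Repeating this draw across sampler iterations yields the module indicator samples used later for posterior summaries.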

How to calculate $h(n-l+1, n)$:

$h(i, m) = \sum_{k=1}^{K} q_k\, P(x_{[m-w_k+1,\,m]} \mid \Theta_k)\, h(i, m-w_k) + q_0\, P(x_m \mid \theta_0)\, h(i, m-1).$
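The within-module recursion can be sketched the same way, assuming h(i, m) = Σ_k q_k P(x_{[m-w_k+1, m]} | Θ_k) h(i, m-w_k) + q_0 P(x_m | θ_0) h(i, m-1) with h of the empty prefix set to 1 (an assumed boundary condition). The data layout (a weight matrix as a list of per-position dictionaries) and all names are hypothetical.

```python
def module_likelihood(window, motifs, q, background):
    """Likelihood h of one module window, summing over motif-site placements.

    window     -- the module's bases as a string
    motifs     -- list of K weight matrices; each is a list of dicts base -> prob
    q          -- [q_0, q_1, ..., q_K]; q_0 is the within-module background weight
    background -- dict base -> prob (theta_0)
    """
    n = len(window)
    h = [0.0] * (n + 1)
    h[0] = 1.0  # assumed boundary condition for the empty prefix
    for m in range(1, n + 1):
        # Position m is background inside the module.
        total = q[0] * background[window[m - 1]] * h[m - 1]
        # Or a site of motif k ends at position m.
        for k, motif in enumerate(motifs, start=1):
            w = len(motif)
            if m >= w:
                p_site = 1.0
                for col, base in zip(motif, window[m - w:m]):
                    p_site *= col[base]
                total += q[k] * p_site * h[m - w]
        h[m] = total
    return h[n]


# Toy check with one deterministic motif "AC" (made-up values).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ac_motif = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
            {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}]
h_ac = module_likelihood("AC", [ac_motif], [0.9, 0.1], uniform)
h_gg = module_likelihood("GG", [ac_motif], [0.9, 0.1], uniform)
```

A window containing the motif gets extra likelihood from the site term, which is what lets the forward summation favor module placements over plain background.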

Page 11: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Posterior inference

• Motif sites: marginal posterior probability of being a motif start position > 0.5.

• Modules: marginal posterior probability of being within a module > 0.5.
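The two decision rules amount to averaging sampled indicators and thresholding at 0.5. A small sketch, assuming each posterior sample is stored as a 0/1 list over sequence positions (the storage format and function names are assumptions):

```python
def marginal_posterior(samples):
    """Per-position fraction of samples in which the indicator equals 1."""
    n_samples = len(samples)
    length = len(samples[0])
    return [sum(s[j] for s in samples) / n_samples for j in range(length)]


def call_positions(samples, threshold=0.5):
    """Positions whose marginal posterior probability exceeds the threshold."""
    return [j for j, p in enumerate(marginal_posterior(samples)) if p > threshold]


# Four hypothetical posterior samples over four positions.
samples = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
marginals = marginal_posterior(samples)
called = call_positions(samples)
```

The same functions apply to motif-start indicators and to within-module indicators; only the meaning of the 0/1 entries changes.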

Page 12: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Simulation study

• Generate 30 data sets independently, each containing:

1) 20 sequences, each of length 1000;

2) 25 modules, each of length 150;

3) each module contains 1 E2F site, 1 YY1 site, and 1 cMyc site.

           CisModule               Do not consider module
Motifs     Fail    TP      FP      Fail    TP      FP
E2F        0.03    17.9    7.5     0.37    17.1    11.6
YY1        0.07    16.0    8.7     0.20    17.1    11.0
cMyc       0       15.7    9.9     0.63    13.6    12.4
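A data set with this layout could be generated roughly as below. This is a simplified sketch under several assumptions not stated on the slide: the background is uniform random, modules are placed in non-overlapping 150-base slots, the three sites appear in a fixed order within each module, and each site is a fixed placeholder string rather than a draw from a real weight matrix. The strings in `SITES` are illustrative placeholders only.

```python
import random

# Placeholder site strings (assumptions, not the matrices used in the study).
SITES = {"E2F": "TTTCGCGC", "YY1": "GCCATCTT", "cMyc": "CACGTG"}


def simulate_dataset(motif_sites, n_seqs=20, seq_len=1000,
                     n_modules=25, module_len=150, seed=0):
    """Return (sequences, planted) where planted lists (seq_index, pos, name)."""
    rng = random.Random(seed)
    seqs = [[rng.choice("ACGT") for _ in range(seq_len)] for _ in range(n_seqs)]
    # Non-overlapping candidate module slots in every sequence.
    slots_per_seq = seq_len // module_len
    all_slots = [(s, k) for s in range(n_seqs) for k in range(slots_per_seq)]
    segment = module_len // len(motif_sites)  # one segment per motif
    planted = []
    for s, k in rng.sample(all_slots, n_modules):
        module_start = k * module_len
        for j, (name, site) in enumerate(motif_sites.items()):
            offset = rng.randrange(segment - len(site) + 1)
            pos = module_start + j * segment + offset
            seqs[s][pos:pos + len(site)] = site  # overwrite with the site
            planted.append((s, pos, name))
    return ["".join(x) for x in seqs], planted


sequences, planted = simulate_dataset(SITES)
```

Keeping the planted positions makes it possible to count true and false positives for any predictor run on the simulated sequences.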

Page 13: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Example: Discovery of tissue-specific modules in Ciona

• The Sidow lab collected 21 genes that are co-expressed during the development of muscle tissue in Ciona.

• Want to find motifs and modules in the upstream sequences (average length = 1330) of these genes.

• Found 3 motifs in 28 modules (4860 bps).

Are they real motifs that determine gene expression?

Page 14: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Experimental validation

• Positive element: the shortest sufficient, non-overlapping sequence that drives strong expression in muscle. Average length: 289 bp.

Page 15: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Experimental validation

• 70% of our predicted motif sites are located in the positive elements!

Page 16: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Other tools

• Gibbs Module Sampler (Thompson et al. Genome Res. 2004)

• EMCMODULE (Gupta and Liu, PNAS, 2005)

Page 17: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Phylogenetic Footprinting

Page 18: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Functional elements tend to be conserved across species

For example, exons are conserved because of selection pressure. Introns and intergenic regions are less likely to be conserved.

Page 19: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Phylogenetic footprinting

Miller et al. Annu. Rev. Genomics Hum. Genet. 2004

Page 20: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Incorporating cross-species conservation into motif discovery

• A threshold method (Wasserman et al. Nature Genetics, 2000)

STEP1: construct cross-species alignment

STEP2: compute conservation measure from the alignment

STEP3: non-conserved regions are filtered out

STEP4: Gibbs motif sampler is applied to conserved regions of the target genome
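Steps 2 and 3 can be sketched for a pairwise alignment given as two equal-length gapped strings. The window size, the identity threshold, and the choice to mask filtered positions with "N" are assumptions for illustration, not the settings of the cited method.

```python
def window_identity(target_aln, other_aln, window=10):
    """Fraction of matching columns in a window centered on each column."""
    n = len(target_aln)
    match = [int(a == b and a != "-") for a, b in zip(target_aln, other_aln)]
    half = window // 2
    identity = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        identity.append(sum(match[lo:hi]) / (hi - lo))
    return identity


def mask_nonconserved(target_aln, other_aln, window=10, threshold=0.7):
    """Return the ungapped target with low-conservation bases replaced by N."""
    identity = window_identity(target_aln, other_aln, window)
    out = []
    for base, score in zip(target_aln, identity):
        if base == "-":
            continue  # alignment gap: no target base at this column
        out.append(base if score >= threshold else "N")
    return "".join(out)


masked_same = mask_nonconserved("ACGTACGT", "ACGTACGT")
masked_diff = mask_nonconserved("AAAA", "CCCC")
```

A motif sampler would then be run on the masked target sequence, skipping the "N" positions.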

Page 21: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Phylogenetic footprinting & motif discovery

• CompareProspector (Liu Y. et al. Genome Res. 2004)

STEP1: construct cross-species alignment

STEP2: compute conservation measure (window percent identity, WPID) from the alignment

STEP3: multiply the likelihood ratio at a position by the corresponding WPID, thus the likelihood landscape is changed to favor conserved sites

STEP4: apply a Gibbs motif sampler based algorithm
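The effect of STEP3 can be shown with a toy calculation: multiplying per-position likelihood ratios by WPID values can move the best-scoring candidate from a non-conserved position to a conserved one. The numbers and function name below are made up for illustration.

```python
def conservation_weighted(likelihood_ratios, wpid):
    """Multiply each position's likelihood ratio by its WPID (0 to 1)."""
    if len(likelihood_ratios) != len(wpid):
        raise ValueError("inputs must have the same length")
    return [lr * c for lr, c in zip(likelihood_ratios, wpid)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical candidate sites: the first scores higher but is poorly conserved.
ratios = [8.0, 6.0]
wpid = [0.3, 0.9]
weighted = conservation_weighted(ratios, wpid)
best_before = argmax(ratios)
best_after = argmax(weighted)
```

Unlike a hard threshold, this weighting keeps weakly conserved positions in play; they are only down-weighted.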

Page 22: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Phylogenetic footprinting & motif discovery

• Evolutionary model based approach

EMnEM (Moses et al. 2004)

PhyME (Sinha et al. 2004)

PhyloGibbs (Siddharthan et al. 2005)

Tree Sampler (Li and Wong, 2005)

Page 23: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Incorporating cross-species conservation into motif discovery

• PhyloCon (Wang and Stormo, Bioinformatics, 2003)

STEP 1: construct alignment among orthologous sequences;

STEP 2: convert conserved regions into profiles;

STEP 3: use profiles in the first sequence as seeds;

STEP 4: find matches of each seed in the second sequence;

STEP 5: update seeds;

STEP 6: repeat step 2 and 3 for all sequences.

Page 24: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Phylogenetic footprinting & module discovery

• MultiModule (Zhou and Wong, The Annals of Applied Statistics, 2007)

Page 25: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

MultiModule

• Module structure of each sequence is modeled by an HMM.

• Couple HMMs via multiple alignment: Aligned states are coupled and collapsed into one common state.

• Uncoupled states: similar to single species model.

• Coupled states: evolutionary model.

Page 26: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Comparing with other methods

• Three data sets with experimental validation reported previously, which contain 9 known motifs with 152 validated sites.

• CompareProspector (Liu et al. 2004): conservation score

• PhyloCon (Wang and Stormo 2003): progressive alignment of profiles

• EMnEM (Moses et al. 2004): Phylogenetic motif discovery

• CisModule (Zhou and Wong 2004): Single-species module discovery.

Page 27: Special Topics in Genomics Cis-regulatory Modules and Phylogenetic Footprinting

Comparing with other methods

The last four columns are computed for the motifs correctly identified by each method.

Method              # known motifs   # predicted   # overlaps   Sensitivity   Specificity
                    identified       sites                      (%)           (%)
CompareProspector   7                75            36           24            48
PhyloCon            3                50            26           17            52
EMnEM               6                130           44           29            34
CisModule           5                110           35           23            32
MultiModule         8                157           79           52            50

# of known sites = 152
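The percentages in the table are consistent with sensitivity = overlaps / known sites and specificity = overlaps / predicted sites (the latter is often called precision elsewhere). The slide does not state these formulas, so this is an inference from the numbers; the short check below recomputes them.

```python
KNOWN_SITES = 152

# method -> (# predicted sites, # overlaps), copied from the table
RESULTS = {
    "CompareProspector": (75, 36),
    "PhyloCon": (50, 26),
    "EMnEM": (130, 44),
    "CisModule": (110, 35),
    "MultiModule": (157, 79),
}


def sensitivity_specificity(predicted, overlaps, known=KNOWN_SITES):
    """Return (sensitivity %, specificity %) under the inferred definitions."""
    return 100.0 * overlaps / known, 100.0 * overlaps / predicted


summary = {
    method: tuple(round(v) for v in sensitivity_specificity(pred, over))
    for method, (pred, over) in RESULTS.items()
}

for method, (sens, spec) in summary.items():
    print(f"{method:18s} sensitivity={sens:3d}%  specificity={spec:3d}%")
```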