Special Topics in Genomics
Cis-regulatory Modules and Phylogenetic Footprinting
The slides for module discovery are provided by Prof. Qing Zhou @ UCLA
Cis-regulatory Modules and Module Discovery
Motif Discovery
[Figure: background model θ0 = (θ0A, θ0C, θ0G, θ0T) and motif weight matrix Θ = [θ1, θ2, ..., θw], where column θj = (θjA, θjC, θjG, θjT) holds the base probabilities at motif position j.]
Mixture modeling
Difficulties in motif discovery in higher organisms
• Upstream sequences are longer.
• Motifs are less conserved and shorter.
• Background sequence structures are more complicated.
• To address these difficulties, build more biological knowledge into the model:
1) module structure
2) multiple species conservation
Cis-regulatory module
• Combinatorial control of genes: cis-regulatory modules
CisModule: modeling module structure (Zhou and Wong, PNAS 2004)
• Module structure: consider co-localization of motif sites.
[Figure: sites of Motif 1, Motif 2, and Motif 3 located together within a module.]
Hierarchical Mixture modeling
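K: # of motifs
[Figure: hierarchical mixture model. Each position of a sequence S either starts a module (M, probability r) or is background (B, probability 1 − r). Within a module, each position either starts a site of motif k (probability q_k, weight matrix Θ_k, k = 1, ..., K) or is background (probability q_0, model θ_0).]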
Parameters and missing data
• Missing data problem.
Given:
  K    # of motifs
  l    Module length
Observed data:
  S    Set of sequences
Missing data:
  M    Indicators for a module start
  A    Indicators for a motif site start
Parameters Ψ:
  θ0   Background model
  Θ    Weight matrices for motifs
  W    Motif widths
  r    Probability of a module start
  q    Probability of starting a motif site
Bayesian inference by posterior sampling
Module-motif detection: given Θ, r, q, and W,
1) Sample modules.
2) Within each module, sample motif sites.
[Figure: a sequence with regions labeled M=0, M=1, M=0.]
Parameter update: given M and A,
1) Infer Θ from aligned sites.
2) Update r, q and W.
Aligned sites:
TTTGC
TATCC
CTTGC
TTTAC
GTTGC
Base counts at each motif position:
     θ1  θ2  ...  θw
T     3   4  ...   0
G     1   0  ...   0
C     1   0  ...   5
A     0   1  ...   0
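The step "infer Θ from aligned sites" can be illustrated with the five aligned sites listed above. The following is a minimal sketch, not the CisModule implementation: it tallies base counts per motif position and turns them into column probabilities using the posterior mean under a symmetric Dirichlet prior. The function names and the pseudocount value are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_matrix(sites):
    """Return one {base: count} dict per motif position for aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("aligned sites must all have the same width")
    columns = []
    for j in range(width):
        tally = Counter(site[j] for site in sites)
        columns.append({base: tally.get(base, 0) for base in ALPHABET})
    return columns

def weight_matrix(counts, pseudocount=1.0):
    """Posterior-mean column probabilities under a symmetric Dirichlet prior.

    The pseudocount is an assumption of this sketch, not a value from the slides.
    """
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(ALPHABET)
        matrix.append({b: (column[b] + pseudocount) / total for b in ALPHABET})
    return matrix

sites = ["TTTGC", "TATCC", "CTTGC", "TTTAC", "GTTGC"]
counts = count_matrix(sites)   # base counts per position
theta = weight_matrix(counts)  # estimated weight matrix columns
```

In a sampler, one would typically draw each column from its Dirichlet posterior rather than plug in the mean; the mean is used here only to keep the example deterministic.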
Module sampling
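• To sample from P(M | S, Ψ), we need to calculate
$$P(S \mid \Psi) = \sum_{M} P(S, M \mid \Psi).$$
• Denote $S = [x_1 x_2 \cdots x_L] = x_{[1,L]}$ and $f_n(\Psi) = P(x_{[1,n]} \mid \Psi)$.
Forward summation:
$$f_n(\Psi) = r\, h(n-l+1, n)\, f_{n-l}(\Psi) + (1-r)\, P(x_n \mid \theta_0)\, f_{n-1}(\Psi) \equiv A_n(\Psi) + B_n(\Psi).$$
Module: positions $n-l+1, \ldots, n$ form a module, with likelihood $h(n-l+1, n)$.
Background: position $n$ is background, with likelihood $P(x_n \mid \theta_0)$.
[Figure: sequence positions 1 to L, with n − l, n − 1 and n marked.]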
Module sampling
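• Backward sampling: starting from n = L,
$$P(M_{n-l+1} = 1 \mid M_{[n+1,L]}) = \frac{A_n(\Psi)}{f_n(\Psi)}.$$
If a module is drawn, continue from position n − l; otherwise position n is background and sampling continues from n − 1.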
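The forward summation and backward sampling steps can be sketched as follows. This is a toy illustration under stated assumptions, not the published code: the module likelihood h(i, n) is passed in as a hypothetical callable, positions are 1-based with index 0 used as padding, f_0 is taken to be 1, and no rescaling is done, so probabilities would underflow on sequences of realistic length (a practical version would rescale or work in log space).

```python
import random

def forward_sum(L, l, r, h, bg):
    """Forward summation: f[n] = A[n] + B_n with f[0] = 1 (assumed boundary).

    h(i, n): likelihood of x_i..x_n as one module (1-based, inclusive).
    bg[n]:   P(x_n | theta_0); bg[0] is unused padding.
    """
    f = [0.0] * (L + 1)
    A = [0.0] * (L + 1)
    f[0] = 1.0
    for n in range(1, L + 1):
        if n >= l:  # a module of length l can end at n only if it fits
            A[n] = r * h(n - l + 1, n) * f[n - l]
        B_n = (1.0 - r) * bg[n] * f[n - 1]
        f[n] = A[n] + B_n
    return f, A

def backward_sample(L, l, f, A, rng=random):
    """Draw module-start indicators M[1..L], walking from n = L down to 1."""
    M = [0] * (L + 1)
    n = L
    while n >= 1:
        if rng.random() < A[n] / f[n]:
            M[n - l + 1] = 1  # x_{n-l+1}..x_n is a module
            n -= l
        else:
            n -= 1            # x_n is background
    return M

# Toy usage with made-up numbers: uniform background, constant module likelihood.
L, l, r = 8, 3, 0.2
bg = [0.0] + [0.25] * L
f, A = forward_sum(L, l, r, lambda i, n: 0.05, bg)
M = backward_sample(L, l, f, A, random.Random(0))
```

Passing h as a callable keeps the sampler separate from the motif-level model inside a module, which is computed by its own recursion.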
How to calculate $h(n-l+1, n)$:
$$h(i, m) = \sum_{k=1}^{K} q_k\, P(x_{[m-w_k+1,\,m]} \mid \Theta_k)\, h(i, m-w_k) + q_0\, P(x_m \mid \theta_0)\, h(i, m-1).$$
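The recursion for h can be sketched in a few lines. This is an illustrative version under assumptions not stated on the slide: the boundary value h(i, i − 1) = 1, the rule that a motif site must fit entirely inside the module, and the data layout (weight matrices as lists of per-column dictionaries, `q[0]` for background and `q[k]` for motif k) are all choices made here.

```python
def site_prob(seq, start, pwm):
    """P(seq[start:start+w] | pwm), where pwm is a list of {base: prob} columns."""
    p = 1.0
    for j, column in enumerate(pwm):
        p *= column[seq[start + j]]
    return p

def module_likelihood(seq, i, n, pwms, q, bg):
    """h(i, n) for the 1-based inclusive segment x_i..x_n of seq.

    q[0] is the background weight q_0; q[k] is the weight of motif k (k >= 1).
    bg maps each base to P(base | theta_0).
    """
    h = {i - 1: 1.0}  # assumed boundary: empty segment has likelihood 1
    for m in range(i, n + 1):
        # last position x_m is background
        total = q[0] * bg[seq[m - 1]] * h[m - 1]
        # or x_m is the last base of a site of motif k
        for k, pwm in enumerate(pwms, start=1):
            w = len(pwm)
            if m - w >= i - 1:  # the site must fit inside the module
                total += q[k] * site_prob(seq, m - w, pwm) * h[m - w]
        h[m] = total
    return h[n]

# Toy usage: one width-2 motif that matches only "CG".
uniform = {b: 0.25 for b in "ACGT"}
cg_motif = [{"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
            {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
value = module_likelihood("ACGT", 1, 4, [cg_motif], [0.8, 0.2], uniform)
```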
Posterior inference
• Motif sites: marginal posterior probability of being a motif start position > 0.5.
• Modules: marginal posterior probability of being within a module > 0.5.
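The two reporting rules above amount to thresholding per-position frequencies over the posterior samples. A minimal sketch, assuming each sample is stored as a 0/1 indicator list of equal length (a hypothetical layout chosen for illustration):

```python
def marginal_posterior(samples):
    """Fraction of samples in which each position's indicator equals 1."""
    n_samples = len(samples)
    length = len(samples[0])
    return [sum(s[pos] for s in samples) / n_samples for pos in range(length)]

def call_positions(samples, threshold=0.5):
    """Positions whose estimated marginal posterior probability exceeds threshold."""
    probs = marginal_posterior(samples)
    return [pos for pos, p in enumerate(probs) if p > threshold]

# Toy usage: four sampled indicator vectors over three positions.
samples = [[0, 1, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1]]
called = call_positions(samples)
```

The same function serves both rules: pass motif-start indicators for motif sites, or within-module indicators for modules.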
Simulation study
• Generate 30 data sets independently, each containing:
1) 20 sequences, each of length 1000;
2) 25 modules, each of length 150;
3) 1 E2F site, 1 YY1 site, and 1 cMyc site in each module.
            CisModule               Not considering modules
Motif    Fail    TP      FP      Fail    TP      FP
E2F      0.03    17.9    7.5     0.37    17.1    11.6
YY1      0.07    16.0    8.7     0.20    17.1    11.0
cMyc     0       15.7    9.9     0.63    13.6    12.4
Example: Discovery of tissue-specific modules in Ciona
• The Sidow lab collected 21 genes that are co-expressed during the development of muscle tissue in Ciona.
• Goal: find motifs and modules in the upstream sequences (average length = 1330 bps) of these genes.
• Found 3 motifs in 28 modules (4860 bps).
Are they real motifs that determine gene expression?
Experimental validation
• Positive element: the shortest sufficient and non-overlapping sequence that drives strong expression in muscle: average length of 289 bps.
Experimental validation
• 70% of our predicted motif sites are located in the positive elements!
Other tools
• Gibbs Module Sampler (Thompson et al. Genome Res. 2004)
• EMCMODULE (Gupta and Liu, PNAS, 2005)
Phylogenetic Footprinting
Functional elements tend to be conserved across species
For example, exons are conserved because of selection pressure, whereas introns and intergenic regions are less likely to be conserved.
Phylogenetic footprinting
Miller et al. Annu. Rev. Genomics Hum. Genet. 2004
Incorporating cross-species conservation into motif discovery
• A threshold method (Wasserman et al. Nature Genetics, 2000)
STEP 1: construct cross-species alignment
STEP 2: compute conservation measure from the alignment
STEP 3: filter out non-conserved regions
STEP 4: apply the Gibbs motif sampler to conserved regions of the target genome
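Steps 2 and 3 of the threshold method can be sketched for a pairwise alignment. This is an illustration only: the conservation measure used here (fraction of matching columns in a sliding window), the window size, and the threshold are assumptions, not the settings of Wasserman et al.

```python
def window_identity(target_aln, other_aln, half_window=2):
    """Fraction of matching columns in a window centred on each alignment column."""
    n = len(target_aln)
    matches = [int(a == b and a != "-") for a, b in zip(target_aln, other_aln)]
    scores = []
    for c in range(n):
        lo, hi = max(0, c - half_window), min(n, c + half_window + 1)
        scores.append(sum(matches[lo:hi]) / (hi - lo))
    return scores

def conserved_regions(target_aln, other_aln, threshold=0.7, half_window=2):
    """Maximal runs of target bases whose window identity is >= threshold."""
    scores = window_identity(target_aln, other_aln, half_window)
    regions, current = [], []
    for base, score in zip(target_aln, scores):
        if base == "-":
            continue  # column absent from the target genome
        if score >= threshold:
            current.append(base)
        elif current:
            regions.append("".join(current))
            current = []
    if current:
        regions.append("".join(current))
    return regions

# Toy usage: the middle of the target is not conserved in the other species.
kept = conserved_regions("ACGTTTGA", "ACGAAAGA", threshold=0.7, half_window=0)
```

The returned regions are what a motif sampler would then be run on in step 4.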
Phylogenetic footprinting & motif discovery
• CompareProspector (Liu Y. et al. Genome Res. 2004)
STEP 1: construct cross-species alignment
STEP 2: compute conservation measure (window percent identity, WPID) from the alignment
STEP 3: multiply the likelihood ratio at a position by the corresponding WPID, so that the likelihood landscape changes to favor conserved sites
STEP 4: apply a Gibbs motif sampler based algorithm
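Step 3 can be illustrated with a small sketch. It is not CompareProspector's code: the function names are hypothetical, and it assumes the WPID value "corresponding" to a candidate site is the one stored at the site's start position, on a 0 to 1 scale.

```python
def likelihood_ratio(site, pwm, bg):
    """Motif-versus-background likelihood ratio of one candidate site."""
    ratio = 1.0
    for base, column in zip(site, pwm):
        ratio *= column[base] / bg[base]
    return ratio

def conservation_weighted_scores(seq, wpid, pwm, bg):
    """Likelihood ratio of every candidate site, multiplied by its WPID."""
    w = len(pwm)
    return [likelihood_ratio(seq[s:s + w], pwm, bg) * wpid[s]
            for s in range(len(seq) - w + 1)]

# Toy usage: a width-2 motif preferring "CG", scored under two WPID profiles.
bg = {b: 0.25 for b in "ACGT"}
pwm = [{"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
conserved = conservation_weighted_scores("AACGA", [1.0, 1.0, 0.5, 1.0], pwm, bg)
diverged = conservation_weighted_scores("AACGA", [1.0, 1.0, 0.01, 1.0], pwm, bg)
```

In the second call the "CG" site sits in a poorly conserved window, so its weighted score falls below that of its neighbours even though its raw likelihood ratio is the highest.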
Phylogenetic footprinting & motif discovery
• Evolutionary model based approaches:
EMnEM (Moses et al. 2004)
PhyME (Sinha et al. 2004)
PhyloGibbs (Siddharthan et al. 2005)
Tree Sampler (Li and Wong, 2005)
Incorporating cross-species conservation into motif discovery
• PhyloCon (Wang and Stormo, Bioinformatics, 2003)
STEP 1: construct alignment among orthologous sequences;
STEP 2: convert conserved regions into profiles;
STEP 3: use profiles in the first sequence as seeds;
STEP 4: find matches of each seed in the second sequence;
STEP 5: update seeds;
STEP 6: repeat steps 2 and 3 for all sequences.
Phylogenetic footprinting & module discovery
• MultiModule (Zhou and Wong, The Annals of Applied Statistics, 2007)
MultiModule
• Module structure of each sequence is modeled by an HMM.
• Couple HMMs via multiple alignment: Aligned states are coupled and collapsed into one common state.
• Uncoupled states: similar to single species model.
• Coupled states: evolutionary model.
Comparing with other methods
• Three data sets with experimental validation reported previously, which contain 9 known motifs with 152 validated sites.
• CompareProspector (Liu et al. 2004): conservation score
• PhyloCon (Wang and Stormo 2003): progressive alignment of profiles
• EMnEM (Moses et al. 2004): Phylogenetic motif discovery
• CisModule (Zhou and Wong 2004): Single-species module discovery.
Comparing with other methods
Method              # known motifs   # predicted   # overlaps   Sensitivity   Specificity
                    identified       sites                      (%)           (%)
CompareProspector   7                75            36           24            48
PhyloCon            3                50            26           17            52
EMnEM               6                130           44           29            34
CisModule           5                110           35           23            32
MultiModule         8                157           79           52            50
The last four columns are computed over the motifs correctly identified by each method.
# of known sites = 152
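The percentages in the table are consistent with sensitivity = overlaps / known sites and specificity = overlaps / predicted sites. These definitions are inferred from the numbers rather than stated on the slide; note that "specificity" in this sense is what is often called positive predictive value. A short check:

```python
KNOWN_SITES = 152

# (number of predicted sites, number of overlaps with known sites), from the table
results = {
    "CompareProspector": (75, 36),
    "PhyloCon": (50, 26),
    "EMnEM": (130, 44),
    "CisModule": (110, 35),
    "MultiModule": (157, 79),
}

def sensitivity_specificity(predicted, overlaps, known=KNOWN_SITES):
    """Return (sensitivity %, specificity %) under the inferred definitions."""
    return 100.0 * overlaps / known, 100.0 * overlaps / predicted

rounded = {}
for method, (predicted, overlaps) in results.items():
    sens, spec = sensitivity_specificity(predicted, overlaps)
    rounded[method] = (round(sens), round(spec))
    print(f"{method:18s} sensitivity {sens:5.1f}%  specificity {spec:5.1f}%")
```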