genes and regulatory elements zhiping weng u mass medical school
Post on 19-Dec-2015
Genes and Regulatory Elements
Zhiping Weng, U Mass Medical School
ENCODE: ENCyclopedia Of DNA Elements
(The ENCODE Project Consortium, Science 2004, Nature 2007)
[Figure: locations of the ENCODE pilot regions (m001 to m014 and r111 to r334) on chromosomes 1 to 22, X, and Y]
Goal: Identify all functional elements in the human genome.
Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence).
Now genome-wide
The ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, Vol 306, 636-640.
Gene
RNA-seq
Epigenomics
Regulatory Elements
The human genome
2% genes (25,000)
53% unique and segmental duplicated DNA
45% repetitive DNA
Where are the gene regulatory elements?
G. Crawford
DNase hypersensitive (HS) sites identify active gene regulatory elements
DNase I HS sites
Regions hypersensitive to DNase:
• Promoters
• Enhancers
• Silencers
• Insulators
• Locus control regions
• Meiotic recombination hotspots
HS sites identify “open” regions of chromatin
Crawford et al., Nature Methods 2006
DNase-chip to identify DNase HS sites
Crawford et al., Nature Methods 2006
or sequence directly.
Arrays used for DNase-chip
NimbleGen arrays
• 385,000 50-mer oligos
• oligos spaced every 38 bases (12 base overlap)
• non-repetitive unique regions
• 1% of the genome (44 ENCODE regions)
Crawford et al., Nature Methods 2006
DNase-chip Quality Assessment
Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J., Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007). Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20. (+Co-corresponding authors)
Cell types: GM, CD4, HeLa, H9, K562, IMR90
Ubiquitous HS sites: 20%
Cell-type specific and common HS sites: 80%
Unique, common, and ubiquitous DNase HS sites
Collectively, the DHS cover 8.3% of the ENCODE regions.
Have we reached saturation in identifying most DNase HS sites?
CpG content of DNase HS sites
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site
(TSS)
Ubiquitous DNase HS sites are enriched for promoters (TSS).
What about ubiquitous distal DNase HS sites?
Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF
ChIP
Kim T.H. et al. Direct Isolation and Identification of Promoters in the Human Genome. Genome Research (2005)
Antibody against CTCF
Tiling array
Direct sequencing ChIP-seq
Chromatin immunoprecipitation (ChIP)-chip
The H19/IGF2 Locus is well insulated
DNase HS sites identify insulator in the Hox locus
Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.
CTCF motif sites are conserved
CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site
(TSS)
Ubiquitous DNase HS sites
are enriched for promoters
(TSS)
Ubiquitous proximal DNase HS sites
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site
(TSS)
Antibody against histone modification
• Tiling array
• Sequencing
Enrichment between tissue-specific H3K4me2 and DNase HS sites
Cell type-specific DNase HS sites correlate with cell type-specific histone modifications
Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.
Cell type-specific DNase HS sites correlate with cell type-specific enhancers
Cell type-specific DNase HS sites correlate with cell type-specific gene expression
Transcriptional Motifs
Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements
Finding enriched motifs in tissue-specific DNase HS sites
Screen against a motif library, e.g., JASPAR or TRANSFAC
STAT
Myc/Max
YY1
(etc.)
the Clover algorithm
DHS #1
DHS #2
DHS #3
DHS #4
DHS #5
JASPAR: a database of transcription factor motifs
Clover: Cis-eLement OVERrepresentation
Myc/Max
DHS sequences
17.3
Raw score
The Clover Algorithm
Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.
Lk: nucleotide at position k
W: motif width
S: a promoter sequence
Ms: number of motif locations in a sequence
A: all possibilities of choosing a subset of sequences
N: the total number of promoter sequences
Clover Raw score
Clover: Cis-eLement OVERrepresentation
Myc/Max
DHS sequences
P-value = 1/4
Control DNA sequences
17.3
Raw score
4.2 6.6 18 9.1
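The P-value on this slide can be read as an empirical one: of the four control raw scores, one (18) is at least as large as the observed raw score of 17.3, giving 1/4. A minimal sketch of that counting step follows; the function name is hypothetical and this is not Clover's actual implementation, which also defines how the control sequences are drawn.

```python
def empirical_p_value(observed, control_scores):
    """Fraction of control raw scores that are at least as large as the observed one."""
    if not control_scores:
        raise ValueError("need at least one control score")
    hits = sum(1 for score in control_scores if score >= observed)
    return hits / len(control_scores)


# Numbers taken from the slide: DHS raw score 17.3, four control raw scores.
p = empirical_p_value(17.3, [4.2, 6.6, 18, 9.1])
print(p)  # 0.25
```

With only four control sets the smallest attainable non-zero P-value is 1/4, so in practice many more control sets would be needed for a fine-grained estimate.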
Motifs enriched in cell-type specific DNase HS sites
Cell type
Motif Family Proximal Distal Far distal
H9 ES Oct x
Sp-1 x x
STAT x
SOX x
K562 GATA x x x
PR x x
GEN_INI x x
Tel-2 x
IMR90 AP-4 x x
Motifs enriched in cell-type specific DNase HS sites
Cell type Motif Family Proximal Distal Far distal
TAL1, E2A, E12,Lmo2 x x
ETS x x
GM06990 Lmo2,
E12, E47 x x
T3R x
IRF x
PAX6 x
HeLa AP-1 x x x
IPF1 x x
NF-1 x
CD4
Genome-wide DNase-chip and DNase-sequencing data
• CD4 cells
• 23 k proximal DNaseI HS sites
• 72 k distal DNaseI HS sites
Enriched transcription factor binding motifs in distal DNaseI HS sites
• Hematopoietic system:
– TAL1
– AML
– PU.1
– C/EBPα
• Immune system:
– STAT1, STAT3, STAT5
– IRF1, IRF3 and IRF5
acgtcggctgacaccaggtctgcttgattcgatgagattgaattcgtaggagctggattagag
ggcttggggcttgaggcttgacaccatatcgtagcgctgagttgctgagtttcgtatggcgct
cgatgcttattagcggctattataggctagctaggcaatacacatcgctgatatagcggctta
tgagatagcgtgctagctatatggattggaatattcggcgctgaaaggtcttagctagtcgta
aatatatgcgcgtatgcgtatggcgggtatatgggggcttggtcttttttttcgcttaggtcg
Enriched motifs
Distal DHS sequences
Find motif clusters in the human genome
Identify motif clusters (modules)
[Figure: motif score (y-axis) versus location in DNA (x-axis)]
Red = motif type 1 (e.g. TAL1)
Blue = motif type 2 (e.g. ETS)
Finding motif clusters with a hidden Markov model
Cluster-Buster
MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8.
http://zlab.bu.edu/cluster-buster/
[Figure: hidden Markov model with transition probabilities 0.8, 0.1, and 0.1]
Overlap between predicted motif clusters and distal DNase HS sites
[Figure: DNase HS sites and predicted motif clusters above a score cutoff]
Enrichment of the overlap = (Overlap × Sequence space) / (DHS × Motif clusters)
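As a worked illustration of the enrichment formula, the sketch below treats every quantity as a number of base pairs, which is an assumption (the slide does not state the units), and the input numbers are made up. The ratio compares the observed overlap with the overlap expected if DHS and motif clusters were placed independently in the sequence space.

```python
def overlap_enrichment(overlap_bp, dhs_bp, cluster_bp, sequence_space_bp):
    """Observed overlap divided by the overlap expected under independence."""
    expected_overlap = dhs_bp * cluster_bp / sequence_space_bp
    return overlap_bp / expected_overlap


# Hypothetical numbers, for illustration only.
fold = overlap_enrichment(overlap_bp=50, dhs_bp=1000, cluster_bp=500,
                          sequence_space_bp=100_000)
print(fold)  # 10.0
```

A value of 1 means the overlap is what chance alone would give; values above 1 indicate that predicted clusters coincide with DNase HS sites more often than expected.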
Motif clusters can predict distal DNase HS sites genome-wide
[Plot: fold enrichment (y-axis, 0 to 12) versus cutoff of cluster score (x-axis, 7 to 17)]
Summary
• DNase HS sites identified from 6 cell types
– Cell-type specific
– Common
– Ubiquitous (found in all cell types studied)
• Ubiquitous DNase HS sites are likely to function as…
– Promoters (TSS)
– Insulators (CTCF)
– (no enhancers?)
• Ubiquitous sites indicative of housekeeping chromatin structure
• Cell-type specific DNase HS sites
– Correlate with histone modifications in a cell type-specific manner
– Correlate with gene expression in a cell type-specific manner
– Correlate with enhancer elements in a cell type-specific manner
– Contain cell type-specific motifs
• Motif clusters can predict DNase HS sites genome-wide
Motif Finding
Many Slides by Bill Noble @ UW
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
What is a “Motif”?
• Generally, a recurring pattern, e.g.
– Sequence motif
– Structure motif
– Network motif
• More specifically, a set of similar substrings, within a family of diverged sequences.
– Protein sequence motifs
– DNA sequence motifs
Example motif
Motif in Logos Format
Gene 3’-Processing Signals
RNA
A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA.
JH Graber et al. (2002) Nucleic Acids Research 30(8):1851-8.
Splice site motif in logo format
weblogo.berkeley.edu
Exonic Splicing Enhancers
These motifs occur within exons and enhance splicing of introns from mRNA.
Letter height indicates its frequency at that position.
Fairbrother WG et al. (2002) Science 297(5583):1007-13
Transcription Factor Binding Sites
ERE
Estrogen Receptor
Transcription start
DNA
Gene ERE Sequence
Efp … a g g g t c a t g g t g a c c c t …
TERT … t t g g t c a g g c t g a t c t c …
Oxytocin … g c g g t g a c c t t g a c c c c …
Lactoferrin … c a g g t c a a g g c g a t c t t …
Angiotensin … t a g g g c a t c g t g a c c c g …
VEGF … a t a a t c a g a c t g a c t g g …
(estrogen response element)
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
Weight matrix
• Probabilistic model: How likely is each letter at each motif position?
    1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78
A. K. A.
Weight matrices are also known as
• Position-specific scoring matrices
• Position-specific probability matrices
• Position-specific weight matrices
Scoring a motif model
• A motif is interesting if it is very different from the background distribution
more interesting
less interesting
    1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78
Relative entropy
• A motif is interesting if it is very different from the background distribution
• Use relative entropy*:
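$$\sum_{\text{position } i} \; \sum_{\text{letter } \beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_{\beta}}$$
$p_{i,\beta}$ = probability of letter $\beta$ in matrix position $i$
$b_{\beta}$ = background frequency of letter $\beta$ (in non-motif sequence)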
* Relative entropy is sometimes called information content.
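To make the relative entropy score concrete, here is a small sketch that evaluates it for the 9-column weight matrix shown on these slides. The uniform background (0.25 per letter) and the base-2 logarithm are assumptions; the slides specify neither.

```python
import math

# Weight matrix from the slides: one row per letter, one entry per motif position.
MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform


def column_relative_entropy(matrix, background, i):
    """Relative entropy contributed by motif position i (0-based), in bits."""
    total = 0.0
    for letter, probs in matrix.items():
        p = probs[i]
        if p > 0:  # a zero-probability letter contributes nothing
            total += p * math.log2(p / background[letter])
    return total


def relative_entropy(matrix, background):
    """Sum of the per-position relative entropies over the whole motif."""
    width = len(next(iter(matrix.values())))
    return sum(column_relative_entropy(matrix, background, i) for i in range(width))


print(round(relative_entropy(MATRIX, BACKGROUND), 2))
```

Positions dominated by a single letter (such as position 1, where A has probability .89) contribute far more than near-uniform positions (such as position 5), which is what makes a motif "interesting" under this score.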
Scoring motif instances
• A motif instance matches if it looks like it was generated by the weight matrix
Matches weight matrix
Hard to tell
“ A C G G C G C C T”
Not likely!
    1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78
Log likelihood ratio
• A motif instance matches if it looks like it was generated by the weight matrix
• Use log likelihood ratio
• Measures how much more like the weight matrix than like the background.
$$\sum_{\text{position } i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$$
$\beta_i$: the character at position $i$ of the instance
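The log likelihood ratio can be sketched for the same weight matrix as follows. A uniform background and base-2 logarithms are assumed here, and the function name is hypothetical. The example scores the slide's instance "ACGGCGCCT" next to the matrix consensus.

```python
import math

# Weight matrix from the slides: one row per letter, one entry per motif position.
MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform


def log_likelihood_ratio(instance, matrix, background):
    """Sum over positions of log2(p[i][letter] / b[letter]) for one motif instance."""
    width = len(next(iter(matrix.values())))
    if len(instance) != width:
        raise ValueError("instance length must equal the motif width")
    return sum(
        math.log2(matrix[letter][i] / background[letter])
        for i, letter in enumerate(instance)
    )


for candidate in ("ACGAGCTGT", "ACGGCGCCT", "TTTTTTTTT"):
    print(candidate, round(log_likelihood_ratio(candidate, MATRIX, BACKGROUND), 2))
```

A positive score means the instance is better explained by the weight matrix than by the background; a negative score means the opposite.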
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
Position-specific scoring matrix
• This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.
A -1 -2 -1 0 -1 -2 0 -2
R 5 0 5 -2 1 -3 -2 0
N 0 6 0 0 0 -3 0 1
D -2 1 -2 -1 0 -3 -1 -1
C -3 -3 -3 -3 -3 -2 -3 -3
Q 1 0 1 -2 5 -3 -2 0
E 0 0 0 -2 2 -3 -2 0
G -2 0 -2 6 -2 -3 6 -2
H 0 1 0 -2 0 -1 -2 8
I -3 -3 -3 -4 -3 0 -4 -3
L -2 -3 -2 -4 -2 0 -4 -3
K 2 0 2 -2 1 -3 -2 -1
M -1 -2 -1 -3 0 0 -3 -2
F -3 -3 -3 -3 -3 6 -3 -1
P -2 -2 -2 -2 -1 -4 -2 -2
S -1 1 -1 0 0 -2 0 -1
T -1 0 -1 -2 -1 -2 -2 -2
W -3 -4 -3 -2 -2 1 -2 -2
Y -2 -2 -2 -3 -1 3 -3 2
V -3 -3 -3 -3 -2 -1 -3 -3
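The scoring rule can be checked directly against the table: the sketch below transcribes the PSSM from the slide and sums one entry per position. The function names are hypothetical; the scanning helper simply scores every window of the motif width, which is the simplest form of motif search.

```python
# PSSM transcribed from the slide: one row per amino acid, one column per position.
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}
WIDTH = 8


def pssm_score(window):
    """Sum the PSSM entry for each residue at its position in the window."""
    if len(window) != WIDTH:
        raise ValueError("window length must equal the PSSM width")
    return sum(PSSM[residue][i] for i, residue in enumerate(window))


def scan(sequence):
    """Score every window of the PSSM width; returns (offset, score) pairs."""
    return [(i, pssm_score(sequence[i:i + WIDTH]))
            for i in range(len(sequence) - WIDTH + 1)]


print(pssm_score("NMFWAFGH"))  # 12
```

Calling `scan` on a longer protein sequence gives one score per start position, which is the input to the significance question discussed next.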
Significance of scores
Motif
Scanning algorithm
LENENQGKCTIAEYKYDGKKASVYNSFVS
45
Low score = not a motif
High score = motif occurrence
How high is high enough?
Computing a p-value
• Consider the scores of all possible sequences whose length matches that of the motif.
• Use these scores to compute a p-value.
• The probability of observing a score >4 is the area under the curve to the right of 4.
• This probability is called a p-value.
• p-value = Pr(data|null)
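One way to obtain the score distribution without enumerating every sequence is dynamic programming over motif positions, which works when the PSSM entries are integers. The sketch below is an illustration under assumptions: the 3-column DNA PSSM is a made-up toy, the background is uniform, and the tail is taken as "score at least the threshold" rather than the strict ">" used on the slide.

```python
from collections import defaultdict

# Hypothetical toy PSSM with integer scores (not from the slides).
TOY_PSSM = {
    "A": [2, -1, -1],
    "C": [-1, 2, -1],
    "G": [-1, -1, 2],
    "T": [-1, -1, -1],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed null model


def score_distribution(pssm, background):
    """Return {score: probability} for a random sequence drawn from the background."""
    width = len(next(iter(pssm.values())))
    dist = {0: 1.0}
    for i in range(width):
        extended = defaultdict(float)
        for score, prob in dist.items():
            for letter, column_scores in pssm.items():
                extended[score + column_scores[i]] += prob * background[letter]
        dist = dict(extended)
    return dist


def p_value(pssm, background, threshold):
    """Probability under the null of observing a score >= threshold."""
    dist = score_distribution(pssm, background)
    return sum(prob for score, prob in dist.items() if score >= threshold)


print(p_value(TOY_PSSM, BACKGROUND, 6))  # 0.015625: only "ACG" reaches the top score
```

The number of distinct scores grows much more slowly than the number of sequences, which is why this approach stays tractable for realistic motif widths.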
Outline
• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation
Motif discovery problem
• Given sequences: seq. 1, seq. 2, seq. 3
• Find motif:
– seq. 1: IGRGGFGEVY at position 515
– seq. 2: LGEGCFGQVV at position 430
– seq. 3: VGSGGFGQVY at position 682
Motif discovery problem
• Given: a sequence or family of sequences.
• Find:
– the number of motifs
– the width of each motif
– the locations of motif occurrences
Why is this hard?
• Input sequences are long (thousands or millions of residues)
• Motif may be subtle
– Instances are short.
– Instances are only slightly similar.
Globin motifs
       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Examples:
• Gibbs Sampler (Lawrence et al.)
• MEME (expectation maximization / Bailey, Elkan)
• ANN-Spec (neural network / Workman, Stormo)
Three Ingredients of Almost any Bioinformatics Method
1. Search space
2. Scoring scheme
3. Search algorithm (= optimization technique)
Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.
Mathematically precise formulation of the problem
Expectation-Maximization
• Guarantees finding a local optimum.
• Widely used in bioinformatics:
– The Baum-Welch algorithm for training HMMs is an example
– So is K-means clustering (e.g. used to analyze microarray data).
Expectation-maximization (EM)
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score
EM
Sample DNA sequences
>ce1cg TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG
>ara GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG
>bglr1 ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT
>crp CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC
Motif occurrences
>ce1cg taatgtttgtgctggtttttgtggcatcgggcgagaatagcgcgtggtgtgaaagactgttttTTTGATCGTTTTCACaaaaatggaagtccacagtcttgacag
>ara gacaaaaacgcgtaacaaaagtgtctataatcacggcagaaaagtccacattgattaTTTGCACGGCGTCACactttgctatgccatagcatttttatccataag
>bglr1 acaaatcccaataacttaattattgggatttgttatatataactttataaattcctaaaattacacaaagttaataacTGTGAGCATGGTCATatttttatcaat
>crp cacaaagcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatgtaTGCAAAGGACGTCACattaccgtgcagtacagttgatagc
Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…
    T    T    T    G    A    T C G T T
A  0.17 0.17 0.17 0.17 0.50 ...
C  0.17 0.17 0.17 0.17 0.17
G  0.17 0.17 0.17 0.50 0.17
T  0.50 0.50 0.50 0.17 0.17
Re-estimating motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
    T    T    T    G    A    T C G T T
A  0.17 0.17 0.17 0.17 0.50 ...
C  0.17 0.17 0.17 0.17 0.17
G  0.17 0.17 0.17 0.50 0.17
T  0.50 0.50 0.50 0.17 0.17
Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...
Scoring each subsequence
Subsequences     Score
TGTGCTGGTTTTTGT  2.95
GTGCTGGTTTTTGTG  4.62
TGCTGGTTTTTGTGG  2.31
GCTGGTTTTTGTGGC  ...
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Select from each sequence the subsequence with maximal score.
Re-estimating motif matrix
Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC
Counts
A 0 0 0 1 3 2 0 1 1 0 0 0 0 4 0
C 0 0 1 0 1 0 3 0 0 2 0 0 4 0 3
G 0 2 0 3 0 1 1 3 1 1 3 0 0 0 0
T 4 2 3 0 0 1 0 0 2 1 1 4 0 0 1
Adding pseudocounts
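Counts
A 0 0 0 1 3 2 0 1 1 0 0 0 0 4 0
C 0 0 1 0 1 0 3 0 0 2 0 0 4 0 3
G 0 2 0 3 0 1 1 3 1 1 3 0 0 0 0
T 4 2 3 0 0 1 0 0 2 1 1 4 0 0 1
Counts + Pseudocounts
A 1 1 1 2 4 3 1 2 2 1 1 1 1 5 1
C 1 1 2 1 2 1 4 1 1 3 1 1 5 1 4
G 1 3 1 4 1 2 2 4 2 2 4 1 1 1 1
T 5 3 4 1 1 2 1 1 3 2 2 5 1 1 2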
Converting to frequencies
Counts + Pseudocounts
A 1 1 1 2 4 3 1 2 2 1 1 1 1 5 1
C 1 1 2 1 2 1 4 1 1 3 1 1 5 1 4
G 1 3 1 4 1 2 2 4 2 2 4 1 1 1 1
T 5 3 4 1 1 2 1 1 3 2 2 5 1 1 2
Frequencies
    T    T    T    G    A    T C G T T
A  0.13 0.13 0.13 0.25 0.50 ...
C  0.13 0.13 0.25 0.13 0.25
G  0.13 0.38 0.13 0.50 0.13
T  0.63 0.38 0.50 0.13 0.13
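The three steps on these slides (count letters per position, add pseudocounts, divide by the column total) can be reproduced with a short sketch. The function names are hypothetical; the pseudocount of 1 per letter matches the slide's "Counts + Pseudocounts" table.

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]


def count_matrix(occurrences, alphabet="ACGT"):
    """Count how often each letter appears at each motif position."""
    width = len(occurrences[0])
    counts = {letter: [0] * width for letter in alphabet}
    for occurrence in occurrences:
        for i, letter in enumerate(occurrence):
            counts[letter][i] += 1
    return counts


def to_frequencies(counts, pseudocount=1):
    """Add a pseudocount to every cell, then normalize each column to sum to 1."""
    letters = list(counts)
    width = len(counts[letters[0]])
    freqs = {letter: [0.0] * width for letter in letters}
    for i in range(width):
        column_total = sum(counts[letter][i] + pseudocount for letter in letters)
        for letter in letters:
            freqs[letter][i] = (counts[letter][i] + pseudocount) / column_total
    return freqs


counts = count_matrix(OCCURRENCES)
freqs = to_frequencies(counts)
print(counts["A"])       # [0, 0, 0, 1, 3, 2, 0, 1, 1, 0, 0, 0, 0, 4, 0]
print(freqs["T"][0])     # 0.625, shown as 0.63 on the slide
```

The pseudocounts keep every probability above zero, so a letter that was never observed at a position does not make an otherwise good instance impossible.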
Expectation-maximization
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score
Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle.
Solution: Associate with each start site a probability of motif occurrence.
Converting to probabilities
Occurrences      Score  Prob
TGTGCTGGTTTTTGT  2.95   0.023
GTGCTGGTTTTTGTG  4.62   0.037
TGCTGGTTTTTGTGG  2.31   0.018
GCTGGTTTTTGTGGC  ...    ...
Total            128.2  1.000
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
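Converting scores to probabilities amounts to dividing each score by the total over all start sites. A minimal sketch, assuming simple normalization as the table suggests; because the slide elides most rows, the stated total of 128.2 is passed in rather than recomputed, and the function name is hypothetical.

```python
def to_probabilities(scores, total=None):
    """Divide each score by the total so the values sum to 1 over all start sites."""
    if total is None:
        total = sum(scores)
    return [score / total for score in scores]


# The three scores shown on the slide, normalized by the slide's stated total.
probs = to_probabilities([2.95, 4.62, 2.31], total=128.2)
print([round(p, 3) for p in probs])
```

The printed values may differ from the slide in the last digit, since the slide's scores and total are themselves rounded.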
Computing weighted counts
Occurrences      Prob
TGTGCTGGTTTTTGT  0.023
GTGCTGGTTTTTGTG  0.037
TGCTGGTTTTTGTGG  0.018
GCTGGTTTTTGTGGC  ...
[Figure: weighted count matrix with rows A, C, G, T and columns for positions 1, 2, 3, 4, 5, …]
Include counts from all subsequences, weighted by the degree to which they match the motif model.
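The weighted count matrix can be sketched as follows: instead of adding 1 for each letter of a single chosen occurrence, every subsequence adds its probability to the cell for each of its letters. Only the three rows visible on the slide are used, so the result is partial, and the function name is hypothetical.

```python
def weighted_counts(subsequences, probs, alphabet="ACGT"):
    """Accumulate each subsequence's probability into the count matrix."""
    width = len(subsequences[0])
    counts = {letter: [0.0] * width for letter in alphabet}
    for subsequence, prob in zip(subsequences, probs):
        for i, letter in enumerate(subsequence):
            counts[letter][i] += prob
    return counts


subsequences = ["TGTGCTGGTTTTTGT", "GTGCTGGTTTTTGTG", "TGCTGGTTTTTGTGG"]
probs = [0.023, 0.037, 0.018]
counts = weighted_counts(subsequences, probs)
print(round(counts["T"][0], 3), round(counts["G"][0], 3))  # 0.041 0.037
```

Each column of the result sums to the total probability of the subsequences included, so with all start sites included every column would sum to 1 per input sequence.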
Problem: How do we estimate counts accurately when we have only a few examples?
Solution: Use Dirichlet mixture priors.
Problem: Too many possible starting points.
Solution: Save time by running only 1 iteration of EM at first.
Problem: Too many possible widths.
Solution: Consider widths that vary by 2 and adjust motifs afterwards.
Problem: Algorithm assumes exactly one motif occurrence per sequence.
Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.
Problem: The EM algorithm finds only one motif.
Solution: Probabilistically erase the motif from the data set, and repeat.
Problem: The motif model is too simplistic.
Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.
Problem: The EM algorithm does not tell you how many motifs there are.
Solution: Compute statistical significance of motifs and stop when they are no longer significant.
MEME algorithm
do
  for (width = min; width < max; width *= 2)
    foreach possible starting point
      run 1 iteration of EM
    select candidate starting points
    foreach candidate
      run EM to convergence
  select best motif
  erase motif occurrences
until (E-value of found motif > threshold)
Gibbs Sampling
a type of Markov chain Monte Carlo (MCMC) method
Maximization Versus Sampling
• We are given some huge search space. Every point Z in the search space has some score S_Z defined as before.
• Sampling: wander around the search space in such a way that how often we visit each point is proportional to π_Z = exp(S_Z).
• Maximization: find the point with the highest π_Z, a likelihood ratio value between 0 and +∞.
• EM does maximization and MCMC does sampling.
• MCMC attempts to escape local optima.
Gibbs Sampling
Use a Markov chain to wander around the search space. If we are at point X, move to point Y with probability M_XY.
Suppose the search space is a 2D rectangle. (Typically, many dimensions!)
1. Start at a random point X.
2. Randomly pick a dimension.
3. Look at all points along this dimension.
4. Move to one of them randomly, proportional to its score π.
5. Repeat from step 2.
Initialization
Randomly guess an instance si from each of t input sequences {S1, ..., St}.
sequence 1
sequence 2
sequence 3
sequence 4
sequence 5
ACAGTGTTTAGACCGTGACCAACCCAGGCAGGTTT
Gibbs sampler
• Initially: randomly guess an instance si from each of t input sequences {S1, ..., St}.
• Steps 2 & 3 (search):
– Throw away an instance si: remaining (t - 1) instances define weight matrix.
– Weight matrix defines instance probability at each position of input string Si
– Pick new si according to probability distribution
• Return highest-scoring motif seen
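The steps above can be sketched as a compact sampler for DNA sequences. This is an illustration under assumptions, not the Lawrence et al. implementation: the background is uniform, one pseudocount per letter is added, the sequence to resample is picked at random, and "highest-scoring" is taken to mean the largest sum of log likelihood ratios of the current instances. All function and parameter names are hypothetical.

```python
import math
import random


def build_matrix(instances, alphabet="ACGT", pseudocount=1.0):
    """Per-position letter probabilities estimated from instances, with pseudocounts."""
    width = len(instances[0])
    total = len(instances) + pseudocount * len(alphabet)
    return [
        {a: (sum(inst[i] == a for inst in instances) + pseudocount) / total
         for a in alphabet}
        for i in range(width)
    ]


def likelihood_ratio(window, matrix, background=0.25):
    """Product over positions of p[i][letter] / background (assumed uniform)."""
    ratio = 1.0
    for i, letter in enumerate(window):
        ratio *= matrix[i][letter] / background
    return ratio


def gibbs_sampler(sequences, width, iterations=500, seed=0):
    rng = random.Random(seed)
    # Initialization: one random start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best_starts, best_score = list(starts), float("-inf")
    for _ in range(iterations):
        i = rng.randrange(len(sequences))
        # Throw away instance i; the remaining t - 1 instances define the matrix.
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(sequences, starts)) if j != i]
        matrix = build_matrix(others)
        # Sample a new start in sequence i in proportion to its likelihood ratio.
        seq = sequences[i]
        weights = [likelihood_ratio(seq[p:p + width], matrix)
                   for p in range(len(seq) - width + 1)]
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        # Track the best configuration seen so far.
        instances = [s[p:p + width] for s, p in zip(sequences, starts)]
        full = build_matrix(instances)
        score = sum(math.log(likelihood_ratio(inst, full)) for inst in instances)
        if score > best_score:
            best_starts, best_score = list(starts), score
    return best_starts, best_score
```

Because the new position is sampled rather than chosen as the maximum, the sampler can leave a poor configuration that a greedy update would stay stuck in; the fixed seed only makes a given run reproducible.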
Sampler step illustration:
ACAGTGTTAGGCGTACACCGT???????CAGGTTT
A  .45 .45 .45 .05 .05 .05 .05
C  .25 .45 .05 .25 .45 .05 .05
G  .05 .05 .45 .65 .05 .65 .05
T  .25 .05 .05 .05 .45 .25 .85
ACGCCGT: 20%   ACGGCGT: 52%
ACAGTGTTAGGCGTACACCGTACGCCGTCAGGTTT
sequence 4
11%
Comparison
• Both EM and Gibbs sampling involve iterating over two steps
• Convergence:
– EM converges when the PSSM stops changing.
– Gibbs sampling runs until you ask it to stop.
• Solution:
– EM may not find the motif with the highest score.
– Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.
Comparison of motif finders
Summary
• Motifs are represented by weight matrices.
• Motif quality is measured by relative entropy.
• Motif occurrences are scored using log likelihood ratios.
• EM and the Gibbs sampler attempt to find a motif with maximal relative entropy.
• Both algorithms alternate between predicting instances and predicting the weight matrix.
Homework
• Go to UCSC genome browser to get the top 100 regions bound by CTCF
• Use MEME to find the binding motif of CTCF