genes and regulatory elements zhiping weng u mass medical school

Genes and Regulatory Elements

Zhiping Weng, UMass Medical School


ENCODE: ENCyclopedia Of DNA Elements

(The ENCODE Project Consortium, Science 2004, Nature 2007)

[Figure: the ENCODE pilot regions (m001 to m014 and r111 to r334) marked on human chromosomes 1 to 22, X, and Y]

Goal: Identify all functional elements in the human genome.

Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence).

Now genome-wide

The ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, Vol 306, 636-640.

Gene

RNA-seq

Epigenomics

Regulatory Elements

The human genome

2% genes (25,000)

53% unique and segmentally duplicated DNA

45% repetitive DNA

Where are the gene regulatory elements?

G. Crawford

DNase hypersensitive (HS) sites identify active gene regulatory elements

DNase I HS sites

Regions hypersensitive to DNase:
• Promoters
• Enhancers
• Silencers
• Insulators
• Locus control regions
• Meiotic recombination hotspots

HS sites identify “open” regions of chromatin

Crawford et al., Nature Methods 2006

DNase-chip to identify DNase HS sites


or sequence directly.

Arrays used for DNase-chip

NimbleGen arrays
• 385,000 50-mer oligos
• Oligos spaced every 38 bases (12-base overlap)
• Non-repetitive unique regions
• 1% of the genome (44 ENCODE regions)


DNase-chip Quality Assessment

Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J., Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007). Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20. (+Co-corresponding authors)

Cell types: GM, CD4, HeLa, H9, K562, IMR90

Ubiquitous HS sites: 20%

Cell-type specific and common HS sites: 80%

Unique, common, and ubiquitous DNase HS sites

Collectively, the DHS cover 8.3% of the ENCODE regions.

Have we reached saturation in identifying most DNase HS sites?

CpG content of DNase HS sites

Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Ubiquitous DNase HS sites are enriched for promoters (TSS).

What about ubiquitous distal DNase HS sites?

Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF

ChIP

Kim T.H. et al. (2005). Direct Isolation and Identification of Promoters in the Human Genome. Genome Research.

Antibody against CTCF

Tiling array

Direct sequencing ChIP-seq

Chromatin immunoprecipitation (ChIP)-chip

The H19/IGF2 Locus is well insulated

DNase HS sites identify insulator in the Hox locus

Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.

CTCF motif sites are conserved

CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers



Ubiquitous proximal DNase HS sites



Antibody against histone modification

• Tiling array
• Sequencing

Enrichment between tissue-specific H3K4me2 and DNase HS sites

Cell type-specific DNase HS sites correlate with cell type-specific histone modifications

Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.

Cell type-specific DNase HS sites correlate with cell type-specific enhancers

Cell type-specific DNase HS sites correlate with cell type-specific gene expression

Transcriptional Motifs

Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements

Finding enriched motifs in tissue-specific DNase HS sites

Screen against a motif library, e.g., JASPAR or TRANSFAC

STAT

Myc/Max

YY1

(etc.)

the Clover algorithm

DHS #1

DHS #2

DHS #3

DHS #4

DHS #5

JASPAR: a database of transcription factor motifs

Clover: Cis-eLement OVERrepresentation

Myc/Max

DHS sequences

17.3

Raw score

The Clover Algorithm

Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.

L_k: nucleotide at position k

W: motif width

S: a promoter sequence

M_S: number of motif locations in a sequence

A: all possibilities of choosing a subset of sequences

N: the total number of promoter sequences

Clover Raw score

Clover: Cis-eLement OVERrepresentation

Myc/Max

DHS sequences

P-value = 1/4

Control DNA sequences

17.3

Raw score

4.2 6.6 18 9.1
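
The P-value on this slide can be read as an empirical fraction: the share of control sequence sets whose raw score is at least as high as the raw score of the DHS sequences. A minimal Python sketch under that reading (the function name `empirical_p_value` is illustrative and is not part of Clover):

```python
def empirical_p_value(observed, control_scores):
    """Fraction of control raw scores that are >= the observed raw score."""
    if not control_scores:
        raise ValueError("need at least one control score")
    as_high = sum(1 for s in control_scores if s >= observed)
    return as_high / len(control_scores)


# Numbers from the slide: the DHS raw score and four control raw scores.
p = empirical_p_value(17.3, [4.2, 6.6, 18, 9.1])
print(p)  # 0.25, because only the control score 18 reaches 17.3
```

With only four control sets the smallest non-zero P-value is 1/4, so in practice many more control sets would be needed for a fine-grained estimate.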

Motifs enriched in cell-type specific DNase HS sites

Cell type

Motif Family Proximal Distal Far distal

H9 ES Oct x

Sp-1 x x

STAT x

SOX x

K562 GATA x x x

PR x x

GEN_INI x x

Tel-2 x

IMR90 AP-4 x x

Motifs enriched in cell-type specific DNase HS sites

Cell type Motif Family Proximal Distal Far distal

TAL1, E2A, E12, Lmo2 x x

ETS x x

GM06990 Lmo2, E12, E47 x x

T3R x

IRF x

PAX6 x

HeLa AP-1 x x x

IPF1 x x

NF-1 x

CD4

Genome-wide DNase-chip and DNase-sequencing data

• CD4 cells
• 23 k proximal DNaseI HS sites
• 72 k distal DNaseI HS sites

Enriched transcription factor binding motifs in distal DNaseI HS sites

• Hematopoietic system:
– TAL1
– AML
– PU.1
– C/EBPα

• Immune system:
– STAT1, STAT3, STAT5
– IRF1, IRF3 and IRF5

acgtcggctgacaccaggtctgcttgattcgatgagattgaattcgtaggagctggattagag

ggcttggggcttgaggcttgacaccatatcgtagcgctgagttgctgagtttcgtatggcgct

cgatgcttattagcggctattataggctagctaggcaatacacatcgctgatatagcggctta

tgagatagcgtgctagctatatggattggaatattcggcgctgaaaggtcttagctagtcgta

aatatatgcgcgtatgcgtatggcgggtatatgggggcttggtcttttttttcgcttaggtcg

Enriched motifs

Distal DHS sequences

Find motif clusters in the human genome

Identify motif clusters (modules)

[Figure axes: motif score against location in DNA]

Red = motif type 1 (e.g. TAL1)

Blue = motif type 2 (e.g. ETS)

Finding motif clusters with a hidden Markov model

Cluster-Buster

MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8.

http://zlab.bu.edu/cluster-buster/

[Figure: hidden Markov model diagram; values shown: 0.8, 0.1, 0.1]

Overlap between predicted motif clusters and distal DNase HS sites

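[Figure: DNase HS sites and predicted motif clusters above a score cutoff, and their overlap]

$$\text{Enrichment of the overlap} = \frac{\text{Overlap} \times \text{Sequence space}}{\text{DHS} \times \text{Motif clusters}}$$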
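
The enrichment above is the observed overlap divided by the overlap expected if DHS and motif clusters were placed independently in the sequence space. A small Python sketch, assuming all four quantities are measured in the same unit (base pairs here); the function name and the example numbers are hypothetical:

```python
def overlap_enrichment(overlap_bp, dhs_bp, cluster_bp, sequence_space_bp):
    """Observed overlap divided by the overlap expected by chance."""
    # Expected overlap if the two sets were independent of each other.
    expected_bp = dhs_bp * cluster_bp / sequence_space_bp
    return overlap_bp / expected_bp


# Hypothetical numbers: 1,000 bp of DHS and 2,000 bp of motif clusters
# in a 100,000 bp sequence space, sharing 200 bp.
print(overlap_enrichment(200, 1_000, 2_000, 100_000))  # 10.0
```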

Motif clusters can predict distal DNase HS sites genome-wide

[Figure: fold enrichment (y-axis, 0 to 12) against cutoff of cluster score (x-axis: 7, 12, 17)]

Summary

• DNase HS sites identified from 6 cell types
– Cell-type specific
– Common
– Ubiquitous (found in all cell types studied)

• Ubiquitous DNase HS sites are likely to function as…
– Promoters (TSS)
– Insulators (CTCF)
– (no enhancers?)

• Ubiquitous sites indicative of housekeeping chromatin structure

• Cell-type specific DNase HS sites
– Correlate with histone modifications in a cell type-specific manner
– Correlate with gene expression in a cell type-specific manner
– Correlate with enhancer elements in a cell type-specific manner
– Contain cell type-specific motifs

• Motif clusters can predict DNase HS sites genome-wide

Motif Finding

Many slides by Bill Noble @ UW

Outline

• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation

What is a “Motif”?

• Generally, a recurring pattern, e.g.
– Sequence motif
– Structure motif
– Network motif

• More specifically, a set of similar substrings, within a family of diverged sequences.
– Protein sequence motifs
– DNA sequence motifs

Example motif

Motif in Logos Format

Gene 3’-Processing Signals

RNA

A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA.

JH Graber et al. (2002) Nucleic Acids Research 30(8):1851-8.

Splice site motif in logo format

weblogo.berkeley.edu

Exonic Splicing Enhancers

These motifs occur within exons and enhance splicing of introns from mRNA.

Letter height indicates its frequency at that position.

Fairbrother WG et al. (2002) Science 297(5583):1007-13

Transcription Factor Binding Sites

[Figure: DNA with an ERE bound by the estrogen receptor, and the transcription start]

Gene ERE Sequence

Efp … a g g g t c a t g g t g a c c c t …

TERT … t t g g t c a g g c t g a t c t c …

Oxytocin … g c g g t g a c c t t g a c c c c …

Lactoferrin … c a g g t c a a g g c g a t c t t …

Angiotensin … t a g g g c a t c g t g a c c c g …

VEGF … a t a a t c a g a c t g a c t g g …

(estrogen response element)





Weight matrix

• Probabilistic model: How likely is each letter at each motif position?

     1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78

A.K.A.

Weight matrices are also known as
• Position-specific scoring matrices
• Position-specific probability matrices
• Position-specific weight matrices

Scoring a motif model

• A motif is interesting if it is very different from the background distribution

[Figure labels: more interesting; less interesting]

     1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78

Relative entropy

• A motif is interesting if it is very different from the background distribution

• Use relative entropy*:

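$$\sum_{\text{position } i} \; \sum_{\text{letter } \alpha} p_{i,\alpha} \log \frac{p_{i,\alpha}}{b_{\alpha}}$$

$p_{i,\alpha}$ = probability of letter $\alpha$ in matrix position $i$

$b_{\alpha}$ = background frequency of letter $\alpha$ (in non-motif sequence)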

* Relative entropy is sometimes called information content.
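
The relative entropy of a weight matrix can be computed column by column. The sketch below is one way to do it in Python, assuming a uniform DNA background of 0.25 per letter and base-2 logarithms (bits); neither choice is stated on the slide, and the function name is illustrative.

```python
import math


def relative_entropy(matrix, background=None):
    """Sum over positions and letters of p * log2(p / b).

    matrix: list of columns, each a dict mapping letter -> probability.
    background: dict mapping letter -> background frequency
                (assumed uniform over A, C, G, T if omitted).
    """
    if background is None:
        background = {letter: 0.25 for letter in "ACGT"}
    total = 0.0
    for column in matrix:
        for letter, p in column.items():
            if p > 0:  # a term with p = 0 contributes nothing
                total += p * math.log2(p / background[letter])
    return total


conserved = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]
uniform = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
print(relative_entropy(conserved))  # 2.0 bits for a fully conserved column
print(relative_entropy(uniform))    # 0.0 for a column equal to the background
```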

Scoring motif instances

• A motif instance matches if it looks like it was generated by the weight matrix

Matches weight matrix

Hard to tell

“ A C G G C G C C T”

Not likely!

     1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78

Log likelihood ratio

• A motif instance matches if it looks like it was generated by the weight matrix

• Use log likelihood ratio

• Measures how much more the instance looks like the weight matrix than like the background.

$$\sum_{\text{position } i} \log \frac{p_{i,\alpha_i}}{b_{\alpha_i}}$$

$\alpha_i$: the character at position $i$ of the instance
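
As an illustration, the log likelihood ratio can be evaluated for the example weight matrix shown earlier. This is a sketch that assumes a uniform background of 0.25 and base-2 logarithms, neither of which is specified on the slide; `ROWS` and `log_likelihood_ratio` are illustrative names.

```python
import math

# The 9-column example weight matrix from the slides, one row per letter.
ROWS = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}


def log_likelihood_ratio(instance, rows=ROWS, background=0.25):
    """Sum over positions of log2(p[i][letter] / background)."""
    width = len(rows["A"])
    if len(instance) != width:
        raise ValueError(f"instance must have length {width}")
    return sum(math.log2(rows[ch][i] / background)
               for i, ch in enumerate(instance))


# The most probable letter at each position versus the slide's example.
print(log_likelihood_ratio("ACGAGCTGT"))
print(log_likelihood_ratio("ACGGCGCCT"))
```

The first string takes the most probable letter at every position, so no other instance can score higher under this matrix.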





Position-specific scoring matrix

• This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.

A -1 -2 -1 0 -1 -2 0 -2

R 5 0 5 -2 1 -3 -2 0

N 0 6 0 0 0 -3 0 1

D -2 1 -2 -1 0 -3 -1 -1

C -3 -3 -3 -3 -3 -2 -3 -3

Q 1 0 1 -2 5 -3 -2 0

E 0 0 0 -2 2 -3 -2 0

G -2 0 -2 6 -2 -3 6 -2

H 0 1 0 -2 0 -1 -2 8

I -3 -3 -3 -4 -3 0 -4 -3

L -2 -3 -2 -4 -2 0 -4 -3

K 2 0 2 -2 1 -3 -2 -1

M -1 -2 -1 -3 0 0 -3 -2

F -3 -3 -3 -3 -3 6 -3 -1

P -2 -2 -2 -2 -1 -4 -2 -2

S -1 1 -1 0 0 -2 0 -1

T -1 0 -1 -2 -1 -2 -2 -2

W -3 -4 -3 -2 -2 1 -2 -2

Y -2 -2 -2 -3 -1 3 -3 2

V -3 -3 -3 -3 -2 -1 -3 -3
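
The score quoted above can be reproduced by looking up one table entry per residue and summing. A minimal Python sketch using the table exactly as printed (the names `parse_pssm`, `pssm_score`, and `scan` are illustrative):

```python
PSSM_TEXT = """
A -1 -2 -1  0 -1 -2  0 -2
R  5  0  5 -2  1 -3 -2  0
N  0  6  0  0  0 -3  0  1
D -2  1 -2 -1  0 -3 -1 -1
C -3 -3 -3 -3 -3 -2 -3 -3
Q  1  0  1 -2  5 -3 -2  0
E  0  0  0 -2  2 -3 -2  0
G -2  0 -2  6 -2 -3  6 -2
H  0  1  0 -2  0 -1 -2  8
I -3 -3 -3 -4 -3  0 -4 -3
L -2 -3 -2 -4 -2  0 -4 -3
K  2  0  2 -2  1 -3 -2 -1
M -1 -2 -1 -3  0  0 -3 -2
F -3 -3 -3 -3 -3  6 -3 -1
P -2 -2 -2 -2 -1 -4 -2 -2
S -1  1 -1  0  0 -2  0 -1
T -1  0 -1 -2 -1 -2 -2 -2
W -3 -4 -3 -2 -2  1 -2 -2
Y -2 -2 -2 -3 -1  3 -3  2
V -3 -3 -3 -3 -2 -1 -3 -3
"""


def parse_pssm(text):
    """Turn the printed table into {residue: [score at each position]}."""
    table = {}
    for line in text.strip().splitlines():
        fields = line.split()
        table[fields[0]] = [int(x) for x in fields[1:]]
    return table


PSSM = parse_pssm(PSSM_TEXT)
WIDTH = len(PSSM["A"])


def pssm_score(window, pssm=PSSM):
    """Sum the table entry for each residue at its position."""
    return sum(pssm[residue][i] for i, residue in enumerate(window))


def scan(sequence, pssm=PSSM, width=WIDTH):
    """Score every window of the motif width; returns (start, score) pairs."""
    return [(start, pssm_score(sequence[start:start + width], pssm))
            for start in range(len(sequence) - width + 1)]


print(pssm_score("NMFWAFGH"))  # 12, matching the sum on the slide
```

The `scan` helper applies the same scoring to each window of a longer sequence, which is the motif search step.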

Significance of scores

Motif scanning algorithm

LENENQGKCTIAEYKYDGKKASVYNSFVS

45

Low score = not a motif
High score = motif occurrence

How high is high enough?

Computing a p-value

• Compute the scores for all possible sequences whose length matches the motif.

• Use these scores to compute a p-value.

• The probability of observing a score >4 is the area under the curve to the right of 4.

• This probability is called a p-value.

• p-value = Pr(score at least this high | null)
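
For an integer-valued scoring matrix, the null distribution of scores can be built exactly, one position at a time, without listing every sequence. The sketch below assumes independent positions and a fixed background letter distribution; the toy scoring column is hypothetical, and the tail is taken as "at least the threshold".

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of total scores under the background model.

    pssm: list of columns, each a dict letter -> integer score.
    background: dict letter -> probability.
    """
    dist = {0: 1.0}  # before any position, the score is 0 with probability 1
    for column in pssm:
        new_dist = defaultdict(float)
        for score, prob in dist.items():
            for letter, s in column.items():
                new_dist[score + s] += prob * background[letter]
        dist = dict(new_dist)
    return dist


def p_value(pssm, background, threshold):
    """Probability that a background sequence scores >= threshold."""
    dist = score_distribution(pssm, background)
    return sum(prob for score, prob in dist.items() if score >= threshold)


# Hypothetical one-column scoring matrix, used twice for a width-2 motif.
toy_column = {"A": 2, "C": 0, "G": 0, "T": -1}
uniform = {letter: 0.25 for letter in "ACGT"}
print(p_value([toy_column], uniform, 2))              # 0.25
print(p_value([toy_column, toy_column], uniform, 4))  # 0.0625
```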





Motif discovery problem

• Given sequences

seq. 1
seq. 2
seq. 3

• Find motif

IGRGGFGEVY at position 515
LGEGCFGQVV at position 430
VGSGGFGQVY at position 682

Motif discovery problem

• Given: a sequence or family of sequences.
• Find:
– the number of motifs
– the width of each motif
– the locations of motif occurrences

Why is this hard?

• Input sequences are long (thousands or millions of residues)

• Motif may be subtle
– Instances are short.
– Instances are only slightly similar.

Globin motifs

       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Alternating approach

1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.

Examples:
• Gibbs Sampler (Lawrence et al.)
• MEME (expectation maximization / Bailey, Elkan)
• ANN-Spec (neural network / Workman, Stormo)

Three Ingredients of Almost any Bioinformatics Method

1. Search space
2. Scoring scheme
3. Search algorithm (= optimization technique)

Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.

Mathematically precise formulation of the problem

Expectation-Maximization

• Guarantees finding a local optimum.

• Widely used in bioinformatics:
– The Baum-Welch algorithm for training HMMs is an example
– So is K-means clustering (e.g. used to analyze microarray data).

Expectation-maximization (EM)

foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score

EM

Sample DNA sequences

>ce1cg TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG

>ara GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG

>bglr1 ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT

>crp CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC

Motif occurrences

>ce1cg taatgtttgtgctggtttttgtggcatcgggcgagaatagcgcgtggtgtgaaagactgttttTTTGATCGTTTTCACaaaaatggaagtccacagtcttgacag

>ara gacaaaaacgcgtaacaaaagtgtctataatcacggcagaaaagtccacattgattaTTTGCACGGCGTCACactttgctatgccatagcatttttatccataag

>bglr1 acaaatcccaataacttaattattgggatttgttatatataactttataaattcctaaaattacacaaagttaataacTGTGAGCATGGTCATatttttatcaat

>crp cacaaagcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatgtaTGCAAAGGACGTCACattaccgtgcagtacagttgatagc

Starting point

…gactgttttTTTGATCGTTTTCACaaaaatgg…

   T    T    T    G    A    T    C    G    T    T
A  0.17 0.17 0.17 0.17 0.50 ...
C  0.17 0.17 0.17 0.17 0.17
G  0.17 0.17 0.17 0.50 0.17
T  0.50 0.50 0.50 0.17 0.17
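
The starting matrix above gives 0.50 to the letter observed at each position and splits the remaining 0.50 evenly over the other three letters (0.17 each). A short Python sketch of that conversion; the function name `seed_matrix` and the `p_match` parameter are illustrative:

```python
def seed_matrix(subsequence, alphabet="ACGT", p_match=0.5):
    """Turn one subsequence into a starting weight matrix.

    Each column gives p_match to the observed letter and shares the
    rest equally among the other letters.
    """
    p_other = (1.0 - p_match) / (len(alphabet) - 1)
    return [
        {letter: (p_match if letter == ch else p_other) for letter in alphabet}
        for ch in subsequence
    ]


matrix = seed_matrix("TTTGATCGTTTTCAC")
print(matrix[0])  # T gets 0.5; A, C and G get about 0.17 each
```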

Re-estimating motif occurrences

TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

   T    T    T    G    A    T    C    G    T    T
A  0.17 0.17 0.17 0.17 0.50 ...
C  0.17 0.17 0.17 0.17 0.17
G  0.17 0.17 0.17 0.50 0.17
T  0.50 0.50 0.50 0.17 0.17

Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...

Scoring each subsequence

Subsequences      Score
TGTGCTGGTTTTTGT   2.95
GTGCTGGTTTTTGTG   4.62
TGCTGGTTTTTGTGG   2.31
GCTGGTTTTTGTGGC   ...

Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Select from each sequence the subsequence with maximal score.
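
The occurrence re-estimation step can be sketched as a sliding window. The code follows the slides in scoring a window by adding up the matrix entries of its letters; the three-column matrix and the sequence in the example are hypothetical, and the function names are illustrative.

```python
def window_score(window, matrix):
    """Add up the matrix entry for each letter at its position."""
    return sum(column[ch] for ch, column in zip(window, matrix))


def best_occurrence(sequence, matrix):
    """Return (start, subsequence, score) of the highest-scoring window."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    scored = [
        (window_score(sequence[i:i + width], matrix), i)
        for i in range(len(sequence) - width + 1)
    ]
    # max() keeps the first window when several share the top score.
    score, start = max(scored, key=lambda pair: pair[0])
    return start, sequence[start:start + width], score


# Hypothetical 3-column matrix favouring "ACG".
toy_matrix = [
    {"A": 0.50, "C": 0.17, "G": 0.17, "T": 0.17},
    {"A": 0.17, "C": 0.50, "G": 0.17, "T": 0.17},
    {"A": 0.17, "C": 0.17, "G": 0.50, "T": 0.17},
]
print(best_occurrence("TTACGTT", toy_matrix))  # (2, 'ACG', 1.5)
```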

Re-estimating motif matrix

Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC

Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Adding pseudocounts

Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

Converting to frequencies

Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

   T    T    T    G    A    T    C    G    T    T
A  0.13 0.13 0.13 0.25 0.50 ...
C  0.13 0.13 0.25 0.13 0.25
G  0.13 0.38 0.13 0.50 0.13
T  0.63 0.38 0.50 0.13 0.13
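
The three steps (counting, adding pseudocounts, converting to frequencies) can be reproduced from the four occurrences listed earlier. A Python sketch, assuming a pseudocount of 1 per letter as in the table; the function names are illustrative:

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]


def count_matrix(occurrences, alphabet="ACGT"):
    """Count each letter at each position across aligned occurrences."""
    width = len(occurrences[0])
    counts = {letter: [0] * width for letter in alphabet}
    for occurrence in occurrences:
        for i, ch in enumerate(occurrence):
            counts[ch][i] += 1
    return counts


def frequencies(counts, pseudocount=1):
    """Add a pseudocount to every cell, then normalise each column."""
    letters = list(counts)
    width = len(counts[letters[0]])
    freqs = {letter: [] for letter in letters}
    for i in range(width):
        column_total = sum(counts[l][i] + pseudocount for l in letters)
        for l in letters:
            freqs[l].append((counts[l][i] + pseudocount) / column_total)
    return freqs


counts = count_matrix(OCCURRENCES)
freqs = frequencies(counts)
print("".join(str(c) for c in counts["T"]))   # 423001002114001
print(freqs["T"][0], freqs["A"][0])           # 0.625 0.125
```

The first column has four T's, so with pseudocounts T gets 5/8 = 0.625 and every other letter gets 1/8 = 0.125, which the slide rounds to 0.63 and 0.13.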

Expectation-maximization

foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score

Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle.

Solution: Associate with each start site a probability of motif occurrence.

Converting to probabilities

Occurrences       Score   Prob
TGTGCTGGTTTTTGT   2.95    0.023
GTGCTGGTTTTTGTG   4.62    0.037
TGCTGGTTTTTGTGG   2.31    0.018
GCTGGTTTTTGTGGC   ...     ...
Total             128.2   1.000
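
Each probability in the table is the subsequence's score divided by the total score over all start sites in the sequence. A minimal Python sketch, taking the total of 128.2 from the table because only the first few scores are listed (the function name is illustrative):

```python
def to_probabilities(scores, total=None):
    """Divide each score by the total so the values can act as probabilities."""
    if total is None:
        total = sum(scores)
    return [score / total for score in scores]


# First three scores from the table, normalised by the table's total.
probs = to_probabilities([2.95, 4.62, 2.31], total=128.2)
print([round(p, 3) for p in probs])
```

The computed values agree with the table to within about 0.001.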

Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Computing weighted counts

Occurrences       Prob
TGTGCTGGTTTTTGT   0.023
GTGCTGGTTTTTGTG   0.037
TGCTGGTTTTTGTGG   0.018
GCTGGTTTTTGTGGC   ...

[Table: weighted counts of A, C, G, T at positions 1, 2, 3, 4, 5, …]

Include counts from all subsequences, weighted by the degree to which they match the motif model.
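
In code, weighted counting means each subsequence adds its probability, rather than 1, to the cell for each of its letters. A Python sketch using the three subsequences and probabilities listed above (the function name is illustrative, and the remaining start sites are left out for brevity):

```python
def weighted_counts(occurrences, probs, alphabet="ACGT"):
    """Accumulate letter counts, weighting each occurrence by its probability."""
    width = len(occurrences[0])
    counts = {letter: [0.0] * width for letter in alphabet}
    for occurrence, prob in zip(occurrences, probs):
        for i, ch in enumerate(occurrence):
            counts[ch][i] += prob
    return counts


subsequences = ["TGTGCTGGTTTTTGT", "GTGCTGGTTTTTGTG", "TGCTGGTTTTTGTGG"]
probabilities = [0.023, 0.037, 0.018]
wc = weighted_counts(subsequences, probabilities)
# Position 1: two subsequences start with T, one starts with G.
print(round(wc["T"][0], 3), round(wc["G"][0], 3))  # 0.041 0.037
```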


Problem: How do we estimate counts accurately when we have only a few examples?
Solution: Use Dirichlet mixture priors.

Problem: Too many possible starting points.
Solution: Save time by running only 1 iteration of EM at first.

Problem: Too many possible widths.
Solution: Consider widths that vary by 2 and adjust motifs afterwards.

Problem: Algorithm assumes exactly one motif occurrence per sequence.
Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.

Problem: The EM algorithm finds only one motif.
Solution: Probabilistically erase the motif from the data set, and repeat.

Problem: The motif model is too simplistic.
Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.

Problem: The EM algorithm does not tell you how many motifs there are.
Solution: Compute statistical significance of motifs and stop when they are no longer significant.

MEME algorithm

do
    for (width = min; width < max; width *= 2)
        foreach possible starting point
            run 1 iteration of EM
        select candidate starting points
        foreach candidate
            run EM to convergence
    select best motif
    erase motif occurrences
until (E-value of found motif > threshold)

Gibbs Sampling: a type of Markov chain Monte Carlo method

Maximization Versus Sampling

• We are given some huge search space. Every point Z in the search space has some score S_Z defined as before.

• Sampling: wander around the search space in such a way that how often we visit each point is proportional to π_Z = exp(S_Z).

• Maximization: find the point with the highest π_Z, a likelihood ratio value between 0 and +∞.

• EM does maximization and MCMC does sampling.
• MCMC attempts to escape local optima.

Gibbs Sampling

Use a Markov chain to wander around the search space. If we are at point X, move to point Y with probability M_XY.

Suppose the search space is a 2D rectangle. (Typically, many dimensions!)

• Start at a random point X.
• Randomly pick a dimension.
• Look at all points along this dimension.
• Move to one of them randomly, proportional to its score π.
• Repeat.

[Figure: 2D search space with dimensions 1 and 2 and the current point X]

Initialization

Randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.

sequence 1

sequence 2

sequence 3

sequence 4

sequence 5

ACAGTGTTTAGACCGTGACCAACCCAGGCAGGTTT

Gibbs sampler

• Initially: randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.

• Steps 2 & 3 (search):
– Throw away an instance s_i: remaining (t - 1) instances define weight matrix.
– Weight matrix defines instance probability at each position of input string S_i
– Pick new s_i according to probability distribution

• Return highest-scoring motif seen
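
The steps above can be sketched in Python as follows. This is a simplified illustration and not the published Gibbs sampler of Lawrence et al.: it assumes uppercase DNA input, a pseudocount of 1, no background model, and a fixed number of iterations, and for brevity it returns the final start positions instead of tracking the highest-scoring motif seen. All names and the example sequences are hypothetical.

```python
import random


def gibbs_motif_sampler(sequences, width, iterations=200,
                        pseudocount=1.0, seed=0):
    """Return one motif start position per sequence after sampling."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Initialization: a random instance in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]

    def build_matrix(skip):
        """Weight matrix from all current instances except sequence `skip`."""
        counts = [{a: pseudocount for a in alphabet} for _ in range(width)]
        for j, (seq, start) in enumerate(zip(sequences, starts)):
            if j == skip:
                continue
            for k in range(width):
                counts[k][seq[start + k]] += 1
        matrix = []
        for column in counts:
            total = sum(column.values())
            matrix.append({a: c / total for a, c in column.items()})
        return matrix

    for _ in range(iterations):
        i = rng.randrange(len(sequences))   # throw away one instance
        matrix = build_matrix(skip=i)
        seq = sequences[i]
        # Probability of the motif at each start position (unnormalised).
        weights = []
        for start in range(len(seq) - width + 1):
            w = 1.0
            for k in range(width):
                w *= matrix[k][seq[start + k]]
            weights.append(w)
        # Pick the new instance in proportion to those weights.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


example = ["TTTACGTACGTTT", "GGACGTACGGGGA", "CACGTACGCCCCC", "AAAAACGTACGAA"]
positions = gibbs_motif_sampler(example, width=7, iterations=50, seed=1)
print(positions)
```

Because the new position is drawn at random rather than taken as the maximum, repeated runs with different seeds can settle on different alignments.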

Sampler step illustration:

ACAGTGTTAGGCGTACACCGT???????CAGGTTT

A  .45 .45 .45 .05 .05 .05 .05
C  .25 .45 .05 .25 .45 .05 .05
G  .05 .05 .45 .65 .05 .65 .05
T  .25 .05 .05 .05 .45 .25 .85

ACGCCGT:20% ACGGCGT:52%

ACAGTGTTAGGCGTACACCGTACGCCGTCAGGTTT

sequence 4

11%

Comparison

• Both EM and Gibbs sampling involve iterating over two steps

• Convergence:
– EM converges when the PSSM stops changing.
– Gibbs sampling runs until you ask it to stop.

• Solution:
– EM may not find the motif with the highest score.
– Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.

Comparison of motif finders

Summary

• Motifs are represented by weight matrices.
• Motif quality is measured by relative entropy.
• Motif occurrences are scored using log likelihood ratios.
• EM and the Gibbs sampler attempt to find a motif with maximal relative entropy.
• Both algorithms alternate between predicting instances and predicting the weight matrix.

Homework

• Go to UCSC genome browser to get the top 100 regions bound by CTCF

• Use MEME to find the binding motif of CTCF
