Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Post on 19-Dec-2015


TRANSCRIPT

Page 1: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Genes and Regulatory Elements

Zhiping Weng, U Mass Medical School

Page 2: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

2

ENCODE: ENCyclopedia Of DNA Elements

(The ENCODE Project Consortium, Science 2004, Nature 2007)

                                                                                      

[Figure: chromosome map (chromosomes 1 to 22, X, and Y) showing the locations of the ENCODE pilot regions, labeled m001 to m014 and r111 to r334]

Goal: Identify all functional elements in the human genome.

Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence).

Now genome-wide

Page 3: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

The ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, Vol 306, 636-640.

Page 4: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Gene

RNA-seq

Page 5: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Epigenomics

Page 6: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Regulatory Elements

Page 7: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

The human genome

2% genes (25,000)

53% Unique and segmentally duplicated DNA

45% repetitive DNA

Where are the gene regulatory elements?

G. Crawford

Page 8: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

DNase hypersensitive (HS) sites identify active gene regulatory elements

DNase I HS sites

Regions hypersensitive to DNase:
• Promoters
• Enhancers
• Silencers
• Insulators
• Locus control regions
• Meiotic recombination hotspots

HS sites identify “open” regions of chromatin

Crawford et al., Nature Methods 2006

Page 9: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

DNase-chip to identify DNase HS sites

Crawford et al., Nature Methods 2006

or sequence directly.

Page 10: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Arrays used for DNase-chip

NimbleGen arrays
• 385,000 50-mer oligos
• Oligos spaced every 38 bases (12 base overlap)
• Non-repetitive unique regions
• 1% of the genome (44 ENCODE regions)

Crawford et al., Nature Methods 2006

Page 11: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

DNase-chip Quality Assessment

Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J., Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007). Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20.

+Co-corresponding authors

Page 12: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Cell types: GM, CD4, HeLa, H9, K562, IMR90

Ubiquitous HS sites: 20%

Cell-type specific and common HS sites: 80%

Unique, common, and ubiquitous DNase HS sites

Collectively, the DHS cover 8.3% of the ENCODE regions.

Page 13: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Have we reached saturation in identifying most DNase HS sites?

Page 14: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

CpG content of DNase HS sites

Page 15: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Ubiquitous DNase HS sites are enriched for promoters (TSS).

What about ubiquitous distal DNase HS sites?

Page 16: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF

ChIP

Page 17: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Kim T.H. et al. Direct Isolation and Identification of Promoters in the Human Genome. Genome Research (2005)

Chromatin immunoprecipitation (ChIP):
• Antibody against CTCF
• Tiling array (ChIP-chip)
• Direct sequencing (ChIP-seq)

Page 18: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

The H19/IGF2 Locus is well insulated

Page 19: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

DNase HS sites identify insulator in the Hox locus

Page 20: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.

Page 21: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

CTCF motif sites are conserved

Page 22: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers

Page 23: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Ubiquitous DNase HS sites are enriched for promoters (TSS).

Page 24: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Ubiquitous proximal DNase HS sites

Page 25: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Page 26: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Antibody against histone modification

• Tiling array
• Sequencing

Page 27: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Enrichment between tissue-specific H3K4me2 and DNase HS sites

Page 28: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Cell type-specific DNase HS sites correlate with cell type-specific histone modifications

Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.

Page 29: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Cell type-specific DNase HS sites correlate with cell type-specific enhancers

Page 30: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Cell type-specific DNase HS sites correlate with cell type-specific gene expression

Page 31: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Transcriptional Motifs

Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements

Page 32: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Finding enriched motifs in tissue-specific DNase HS sites

Screen against a motif library, e.g., JASPAR or TRANSFAC

STAT

Myc/Max

YY1

(etc.)

the Clover algorithm

DHS #1

DHS #2

DHS #3

DHS #4

DHS #5

Page 33: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

JASPAR: a database of transcription factor motifs

Page 34: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Clover: Cis-eLement OVERrepresentation

Myc/Max

DHS sequences

17.3

Raw score

Page 35: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

The Clover Algorithm

Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.

L_k: nucleotide at position k

W: motif width

S: a promoter sequence

M_s: number of motif locations in a sequence

A: all possibilities of choosing a subset of sequences

N: the total number of promoter sequences

Clover Raw score

Page 36: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Clover: Cis-eLement OVERrepresentation

Myc/Max

DHS sequences

P-value = 1/4

Control DNA sequences

17.3

Raw score

4.2 6.6 18 9.1
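The P-value on this slide can be read as an empirical one: the fraction of control sequence sets whose raw score is at least as high as the raw score of the DHS sequences. The sketch below assumes that reading; the function name `empirical_p_value` is illustrative and is not part of the Clover program.

```python
def empirical_p_value(observed, control_scores):
    """Fraction of control raw scores that are >= the observed raw score."""
    hits = sum(1 for score in control_scores if score >= observed)
    return hits / len(control_scores)

# Raw scores from the slide: the DHS set versus four control sets.
p = empirical_p_value(17.3, [4.2, 6.6, 18, 9.1])
print(p)  # 0.25
```

Only one of the four control sets (raw score 18) reaches the observed 17.3, which gives the 1/4 shown on the slide.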

Page 37: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motifs enriched in cell-type specific DNase HS sites

Cell type

Motif Family Proximal Distal Far distal

H9 ES Oct x

  Sp-1 x x

  STAT x

SOX x

K562 GATA x x x

  PR  x x

  GEN_INI x x

Tel-2 x

IMR90 AP-4 x x

Page 38: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motifs enriched in cell-type specific DNase HS sites

Cell type Motif Family Proximal Distal Far distal

TAL1, E2A, E12, Lmo2 x x

ETS x x

GM06990 Lmo2, E12, E47 x x

  T3R x

  IRF x

PAX6 x

HeLa AP-1 x x x

  IPF1 x x

NF-1 x

CD4

Page 39: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Genome-wide DNase-chip and DNase-sequencing data

• CD4 cells
• 23 k proximal DNaseI HS sites
• 72 k distal DNaseI HS sites

Page 40: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Enriched transcription factor binding motifs in distal DNaseI HS sites

• Hematopoietic system:
– TAL1
– AML
– PU.1
– C/EBPα

• Immune system:
– STAT1, STAT3, STAT5
– IRF1, IRF3 and IRF5

Page 41: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

acgtcggctgacaccaggtctgcttgattcgatgagattgaattcgtaggagctggattagag

ggcttggggcttgaggcttgacaccatatcgtagcgctgagttgctgagtttcgtatggcgct

cgatgcttattagcggctattataggctagctaggcaatacacatcgctgatatagcggctta

tgagatagcgtgctagctatatggattggaatattcggcgctgaaaggtcttagctagtcgta

aatatatgcgcgtatgcgtatggcgggtatatgggggcttggtcttttttttcgcttaggtcg

Enriched motifs

Distal DHS sequences

Find motif clusters in the human genome

Identify motif clusters (modules)

Page 42: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

[Plot: motif score (y-axis) versus location in DNA (x-axis)]

Red = motif type 1 (e.g. TAL1)

Blue = motif type 2 (e.g. ETS)

Finding motif clusters with a hidden Markov model

Cluster-Buster

MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8.

http://zlab.bu.edu/cluster-buster/

0.8

0.1

0.1

Page 43: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Overlap between predicted motif clusters and distal DNase HS sites

[Figure labels: cutoff, DNase HS sites, predicted motif clusters]

$$\text{Enrichment of the overlap} = \frac{\text{Overlap} \times \text{Sequence space}}{\text{DHS} \times \text{Motif clusters}}$$
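The enrichment ratio compares the observed overlap with the overlap expected if DHS and motif clusters were placed independently in the sequence space. A minimal sketch, assuming all four quantities are measured in base pairs (the slide does not state the unit) and using made-up numbers:

```python
def fold_enrichment(overlap_bp, dhs_bp, cluster_bp, sequence_space_bp):
    """Observed overlap divided by the overlap expected by chance."""
    expected = dhs_bp * cluster_bp / sequence_space_bp
    return overlap_bp / expected

# Hypothetical numbers, for illustration only.
print(fold_enrichment(overlap_bp=2_000, dhs_bp=50_000,
                      cluster_bp=40_000, sequence_space_bp=10_000_000))  # 10.0
```

A value of 1 means the overlap is what chance alone would produce; values above 1 indicate enrichment.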

Page 44: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motif clusters can predict distal DNase HS sites genome-wide

[Plot: fold enrichment (y-axis, 0 to 12) versus cutoff of cluster score (x-axis, 7 to 17)]

Page 45: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Summary

• DNase HS sites identified from 6 cell types
– Cell-type specific
– Common
– Ubiquitous (found in all cell types studied)

• Ubiquitous DNase HS sites are likely to function as…
– Promoters (TSS)
– Insulators (CTCF)
– (no enhancers?)

• Ubiquitous sites indicative of housekeeping chromatin structure

• Cell-type specific DNase HS sites
– Correlate with histone modifications in a cell type-specific manner
– Correlate with gene expression in a cell type-specific manner
– Correlate with enhancer elements in a cell type-specific manner
– Contain cell type-specific motifs

• Motif clusters can predict DNase HS sites genome-wide

Page 46: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motif Finding

Many Slides by Bill Noble @ UW

Page 47: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Outline

• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation

Page 48: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

What is a “Motif”?

• Generally, a recurring pattern, e.g.
– Sequence motif
– Structure motif
– Network motif

• More specifically, a set of similar substrings, within a family of diverged sequences.
– Protein sequence motifs
– DNA sequence motifs

Page 49: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Example motif

Page 50: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motif in Logos Format

Page 51: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Gene 3’-Processing Signals

RNA

A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA.

JH Graber et al. (2002) Nucleic Acids Research 30(8):1851-8.

Page 52: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Splice site motif in logo format

weblogo.berkeley.edu

Page 53: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Exonic Splicing Enhancers

These motifs occur within exons and enhance splicing of introns from mRNA.

Letter height indicates its frequency at that position.

Fairbrother WG et al. (2002) Science 297(5583):1007-13

Page 54: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Transcription Factor Binding Sites

ERE

Estrogen Receptor

Transcription start

DNA

Gene ERE Sequence

Efp … a g g g t c a t g g t g a c c c t …

TERT … t t g g t c a g g c t g a t c t c …

Oxytocin … g c g g t g a c c t t g a c c c c …

Lactoferrin … c a g g t c a a g g c g a t c t t …

Angiotensin … t a g g g c a t c g t g a c c c g …

VEGF … a t a a t c a g a c t g a c t g g …

(estrogen response element)

Page 55: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Outline

• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation

Page 56: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Weight matrix

• Probabilistic model: How likely is each letter at each motif position?

     1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78

Page 57: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

A. K. A.

Weight matrices are also known as
• Position-specific scoring matrices
• Position-specific probability matrices
• Position-specific weight matrices

Page 58: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Scoring a motif model

• A motif is interesting if it is very different from the background distribution

[Figure labels: more interesting, less interesting]

     1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78

Page 59: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Relative entropy

• A motif is interesting if it is very different from the background distribution

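• Use relative entropy*:

$$\sum_{\text{position } i} \; \sum_{\text{letter } \alpha} p_{i,\alpha} \log \frac{p_{i,\alpha}}{b_{\alpha}}$$

$p_{i,\alpha}$ = probability of letter $\alpha$ in matrix position $i$

$b_{\alpha}$ = background frequency of letter $\alpha$ (in non-motif sequence)

* Relative entropy is sometimes called information content.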
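As a worked example, the relative entropy of the weight matrix from the earlier slides can be computed directly from the definition. This sketch assumes a uniform background (0.25 per nucleotide) and base-2 logarithms, neither of which is stated on the slide; `relative_entropy` is an illustrative name.

```python
import math

# Weight matrix from the slides: one list per letter, one entry per position.
MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {"A": .25, "C": .25, "G": .25, "T": .25}  # assumed uniform

def relative_entropy(matrix, background):
    """Sum over positions and letters of p * log2(p / b), in bits."""
    width = len(next(iter(matrix.values())))
    total = 0.0
    for i in range(width):
        for letter, probs in matrix.items():
            p = probs[i]
            if p > 0:  # a zero-probability term contributes nothing
                total += p * math.log2(p / background[letter])
    return total

print(relative_entropy(MATRIX, BACKGROUND))
```

A matrix whose columns all equal the background has relative entropy 0, so the further a matrix departs from the background, the larger the value.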

Page 60: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Scoring motif instances

• A motif instance matches if it looks like it was generated by the weight matrix

Matches weight matrix

Hard to tell

“ A C G G C G C C T”

Not likely!

     1   2   3   4   5   6   7   8   9
A  .89 .02 .38 .34 .22 .27 .02 .03 .02
C  .04 .91 .20 .17 .28 .31 .30 .04 .02
G  .04 .05 .41 .18 .29 .16 .07 .92 .18
T  .03 .02 .01 .31 .21 .26 .61 .01 .78

Page 61: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Log likelihood ratio

• A motif instance matches if it looks like it was generated by the weight matrix

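• Use log likelihood ratio

• Measures how much more the instance looks like the weight matrix than like the background.

$$\sum_{\text{position } i} \log \frac{p_{i,\alpha_i}}{b_{\alpha_i}}$$

$\alpha_i$: the character at position $i$ of the instance; $p_{i,\alpha}$ is the weight matrix probability of letter $\alpha$ at position $i$, and $b_{\alpha}$ is the background frequency of letter $\alpha$.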
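The log likelihood ratio can be evaluated for the instance shown on the previous slide, "ACGGCGCCT". The sketch below assumes a uniform background and base-2 logarithms, which the slides do not specify; `log_likelihood_ratio` is an illustrative name.

```python
import math

# Weight matrix from the slides: one list per letter, one entry per position.
MATRIX = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {"A": .25, "C": .25, "G": .25, "T": .25}  # assumed uniform

def log_likelihood_ratio(instance, matrix, background):
    """Sum over positions of log2(p[i][letter] / b[letter])."""
    return sum(math.log2(matrix[letter][i] / background[letter])
               for i, letter in enumerate(instance))

print(log_likelihood_ratio("ACGGCGCCT", MATRIX, BACKGROUND))  # slide instance
print(log_likelihood_ratio("ACGAGCTGT", MATRIX, BACKGROUND))  # most likely letter per column
```

A positive value means the instance is better explained by the weight matrix than by the background; the second call uses the most probable letter in every column and therefore gives the highest score this matrix can assign.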

Page 62: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Outline

• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation

Page 63: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Position-specific scoring matrix

• This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.

A -1 -2 -1 0 -1 -2 0 -2

R 5 0 5 -2 1 -3 -2 0

N 0 6 0 0 0 -3 0 1

D -2 1 -2 -1 0 -3 -1 -1

C -3 -3 -3 -3 -3 -2 -3 -3

Q 1 0 1 -2 5 -3 -2 0

E 0 0 0 -2 2 -3 -2 0

G -2 0 -2 6 -2 -3 6 -2

H 0 1 0 -2 0 -1 -2 8

I -3 -3 -3 -4 -3 0 -4 -3

L -2 -3 -2 -4 -2 0 -4 -3

K 2 0 2 -2 1 -3 -2 -1

M -1 -2 -1 -3 0 0 -3 -2

F -3 -3 -3 -3 -3 6 -3 -1

P -2 -2 -2 -2 -1 -4 -2 -2

S -1 1 -1 0 0 -2 0 -1

T -1 0 -1 -2 -1 -2 -2 -2

W -3 -4 -3 -2 -2 1 -2 -2

Y -2 -2 -2 -3 -1 3 -3 2

V -3 -3 -3 -3 -2 -1 -3 -3
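Scoring a sequence against the PSSM is a lookup and a sum: for each position, read the entry in the row of the observed residue and the column of that position. The sketch below copies the matrix from the slide; the function name `pssm_score` is illustrative.

```python
# PSSM from the slide: one row per amino acid, one column per motif position.
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}

def pssm_score(sequence, pssm):
    """Sum the matrix entries selected by each residue at its position."""
    width = len(next(iter(pssm.values())))
    if len(sequence) != width:
        raise ValueError("sequence length must equal the motif width")
    return sum(pssm[residue][i] for i, residue in enumerate(sequence))

print(pssm_score("NMFWAFGH", PSSM))  # 12
```

Scanning a longer protein amounts to calling `pssm_score` on every window of width 8 and keeping the windows whose score is high enough.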

Page 64: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Significance of scores

Motif scanning algorithm

LENENQGKCTIAEYKYDGKKASVYNSFVS

45

Low score = not a motif

High score = motif occurrence

How high is high enough?

Page 65: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Computing a p-value

• Compute the scores for all possible sequences of the same length as the motif.

• Use these scores to compute a p-value.

• The probability of observing a score >4 is the area under the curve to the right of 4.

• This probability is called a p-value.

• p-value = Pr(data|null)
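One way to obtain the score distribution without listing every sequence is dynamic programming over motif positions, which works when scores are integers. The sketch below uses a small hypothetical DNA PSSM and a uniform null model (both are assumptions made for illustration, not values from the slides), and it defines the p-value as the probability of a score at least as high as the observed one.

```python
from collections import defaultdict

# Hypothetical integer log-odds PSSM for a width-3 DNA motif (illustration only).
PSSM = {
    "A": [2, -1, -2],
    "C": [-1, 3, -1],
    "G": [-2, -1, 2],
    "T": [-1, -2, -1],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed null model

def score_distribution(pssm, background):
    """Map each achievable score to its probability under the null model."""
    width = len(next(iter(pssm.values())))
    dist = {0: 1.0}
    for col in range(width):
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for letter, b in background.items():
                nxt[score + pssm[letter][col]] += prob * b
        dist = dict(nxt)
    return dist

def p_value(observed, dist):
    """Probability of a score at least as high as the observed one."""
    return sum(prob for score, prob in dist.items() if score >= observed)

dist = score_distribution(PSSM, BACKGROUND)
print(p_value(7, dist))  # 0.015625: only ACG reaches the maximum score of 7
```

The table grows with the number of distinct scores rather than with the number of sequences, which is why this approach stays practical for wide motifs.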

Page 66: Genes and Regulatory Elements Zhiping Weng U Mass Medical School
Page 67: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Outline

• What is a sequence motif?
• Weight matrix representation
• Motif search
• Motif discovery
– Expectation-maximization
– Gibbs sampling
• Patterns-with-mismatches representation

Page 68: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motif discovery problem

• Given sequences (seq. 1, seq. 2, seq. 3)

• Find motif

seq. 1: IGRGGFGEVY at position 515
seq. 2: LGEGCFGQVV at position 430
seq. 3: VGSGGFGQVY at position 682

Page 69: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motif discovery problem

• Given: a sequence or family of sequences.
• Find:
– the number of motifs
– the width of each motif
– the locations of motif occurrences

Page 70: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Why is this hard?

• Input sequences are long (thousands or millions of residues)

• Motif may be subtle
– Instances are short.
– Instances are only slightly similar.

?

?

Page 71: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Globin motifs

       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Page 72: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Alternating approach

1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.

Examples:
• Gibbs Sampler (Lawrence et al.)
• MEME (expectation maximization / Bailey, Elkan)
• ANN-Spec (neural network / Workman, Stormo)

Page 73: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Three Ingredients of Almost any Bioinformatics Method

1. Search space
2. Scoring scheme
3. Search algorithm (= optimization technique)

Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.

Mathematically precise formulation of the problem

Page 74: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Expectation-Maximization

• Guarantees finding a local optimum.

• Widely used in bioinformatics:
– The Baum-Welch algorithm for training HMMs is an example
– So is K-means clustering (e.g. used to analyze microarray data).

Page 75: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Expectation-maximization (EM)

foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score

EM

Page 76: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Sample DNA sequences

>ce1cg TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG

>ara GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG

>bglr1 ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT

>crp CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC

Page 77: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Motif occurrences

>ce1cg taatgtttgtgctggtttttgtggcatcgggcgagaatagcgcgtggtgtgaaagactgttttTTTGATCGTTTTCACaaaaatggaagtccacagtcttgacag

>ara gacaaaaacgcgtaacaaaagtgtctataatcacggcagaaaagtccacattgattaTTTGCACGGCGTCACactttgctatgccatagcatttttatccataag

>bglr1 acaaatcccaataacttaattattgggatttgttatatataactttataaattcctaaaattacacaaagttaataacTGTGAGCATGGTCATatttttatcaat

>crp cacaaagcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatgtaTGCAAAGGACGTCACattaccgtgcagtacagttgatagc

Page 78: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Starting point

…gactgttttTTTGATCGTTTTCACaaaaatgg…

     T    T    T    G    A    T  C  G  T  T
A  0.17 0.17 0.17 0.17 0.50 ...
C  0.17 0.17 0.17 0.17 0.17
G  0.17 0.17 0.17 0.50 0.17
T  0.50 0.50 0.50 0.17 0.17
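The starting matrix gives the observed letter of the seed subsequence a probability of 0.50 at each position and splits the remainder evenly over the other three letters (0.17 each after rounding). A minimal sketch; the function name `seed_matrix` and the `hit` parameter are illustrative.

```python
def seed_matrix(subsequence, hit=0.5, alphabet="ACGT"):
    """Turn one subsequence into a starting weight matrix (dict of rows)."""
    miss = (1.0 - hit) / (len(alphabet) - 1)
    return {letter: [hit if letter == observed else miss for observed in subsequence]
            for letter in alphabet}

matrix = seed_matrix("TTTGATCGTT")
for letter in "ACGT":
    print(letter, [round(p, 2) for p in matrix[letter][:5]])
# A [0.17, 0.17, 0.17, 0.17, 0.5]
# C [0.17, 0.17, 0.17, 0.17, 0.17]
# G [0.17, 0.17, 0.17, 0.5, 0.17]
# T [0.5, 0.5, 0.5, 0.17, 0.17]
```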

Page 79: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Re-estimating motif occurrences

TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

     T    T    T    G    A    T  C  G  T  T
A  0.17 0.17 0.17 0.17 0.50 ...
C  0.17 0.17 0.17 0.17 0.17
G  0.17 0.17 0.17 0.50 0.17
T  0.50 0.50 0.50 0.17 0.17

Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...

Page 80: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Scoring each subsequence

Subsequence       Score
TGTGCTGGTTTTTGT   2.95
GTGCTGGTTTTTGTG   4.62
TGCTGGTTTTTGTGG   2.31
GCTGGTTTTTGTGGC   ...

Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Select from each sequence the subsequence with maximal score.

Page 81: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Re-estimating motif matrix

Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC

Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Page 82: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Adding pseudocounts

Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

Page 83: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Converting to frequencies

Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

     T    T    T    G    A    T  C  G  T  T
A  0.13 0.13 0.13 0.25 0.50 ...
C  0.13 0.13 0.25 0.13 0.25
G  0.13 0.38 0.13 0.50 0.13
T  0.63 0.38 0.50 0.13 0.13
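The three re-estimation steps (count letters per column, add a pseudocount of 1, divide by the column total) can be reproduced from the four occurrences shown on the earlier slide. A minimal sketch; the function names are illustrative.

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]

def count_matrix(occurrences, alphabet="ACGT"):
    """Count how often each letter appears in each motif column."""
    width = len(occurrences[0])
    return {letter: [sum(occ[k] == letter for occ in occurrences)
                     for k in range(width)]
            for letter in alphabet}

def to_frequencies(counts, pseudocount=1):
    """Add a pseudocount to every cell, then normalize each column."""
    letters = list(counts)
    width = len(counts[letters[0]])
    freqs = {letter: [] for letter in letters}
    for k in range(width):
        column_total = sum(counts[letter][k] + pseudocount for letter in letters)
        for letter in letters:
            freqs[letter].append((counts[letter][k] + pseudocount) / column_total)
    return freqs

counts = count_matrix(OCCURRENCES)
freqs = to_frequencies(counts)
print("".join(str(c) for c in counts["A"]))  # 000132011000040
print(freqs["T"][0])                         # 0.625, shown as 0.63 on the slide
```

With four occurrences and a pseudocount of 1 per letter, every column total is 8, so a letter that never occurs still gets a frequency of 1/8 instead of 0.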

Page 84: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Expectation-maximization

foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score
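The inner loop of this procedure, for a single starting subsequence, can be sketched as follows. This is the "take the maximum" variant walked through on the preceding slides, not the MEME implementation. Two choices here are assumptions: a window is scored by the sum of log probabilities (the slide's illustrative score adds the matrix entries instead), and the loop stops when the chosen occurrences stop changing.

```python
import math

def seed_matrix(seed, hit=0.5):
    """Starting matrix: one dict of letter probabilities per motif column."""
    miss = (1.0 - hit) / 3.0
    return [{a: (hit if a == c else miss) for a in "ACGT"} for c in seed]

def log_score(window, matrix):
    return sum(math.log(column[c]) for c, column in zip(window, matrix))

def best_windows(sequences, matrix):
    """Pick the highest-scoring window of motif width from each sequence."""
    w = len(matrix)
    return [max((s[i:i + w] for i in range(len(s) - w + 1)),
                key=lambda window: log_score(window, matrix))
            for s in sequences]

def reestimate(windows, pseudocount=1.0):
    """Counts plus pseudocounts, normalized per column."""
    matrix = []
    for k in range(len(windows[0])):
        counts = {a: pseudocount for a in "ACGT"}
        for window in windows:
            counts[window[k]] += 1
        total = sum(counts.values())
        matrix.append({a: counts[a] / total for a in "ACGT"})
    return matrix

def hard_em(sequences, seed, max_iter=50):
    matrix = seed_matrix(seed)
    windows = None
    for _ in range(max_iter):
        new_windows = best_windows(sequences, matrix)
        if new_windows == windows:  # occurrences (and so the matrix) are stable
            break
        windows = new_windows
        matrix = reestimate(windows)
    return windows, matrix

# Toy sequences with a planted TTGACA, for illustration only.
toy = ["GGGGTTGACAGGGG", "CCTTGACACCCCCC", "AAAATTGACAAAAA"]
windows, matrix = hard_em(toy, seed="TTGACA")
print(windows)  # ['TTGACA', 'TTGACA', 'TTGACA']
```

The outer `foreach` of the pseudocode would call `hard_em` once per candidate seed and keep the matrix with the best score.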

Page 85: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle.

Solution: Associate with each start site a probability of motif occurrence.

Page 86: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Converting to probabilities

Occurrence        Score   Prob
TGTGCTGGTTTTTGT   2.95    0.023
GTGCTGGTTTTTGTG   4.62    0.037
TGCTGGTTTTTGTGG   2.31    0.018
GCTGGTTTTTGTGGC   ...     ...
Total             128.2   1.000

Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Page 87: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Computing weighted counts

Occurrence        Prob
TGTGCTGGTTTTTGT   0.023
GTGCTGGTTTTTGTG   0.037
TGCTGGTTTTTGTGG   0.018
GCTGGTTTTTGTGGC   ...

Weighted count matrix (one row per letter, one column per motif position):

    1 2 3 4 5 …
A
C
G
T

Include counts from all subsequences, weighted by the degree to which they match the motif model.
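The weighted update can be sketched as: divide each window's score by the total so the values sum to 1, then let every window add its probability (instead of a whole count) to the matrix cells for its letters. The toy sequence and scores below are hypothetical; the names are illustrative.

```python
def weighted_counts(sequence, width, scores, alphabet="ACGT"):
    """Return per-window probabilities and the probability-weighted counts."""
    total = sum(scores)
    probs = [score / total for score in scores]
    counts = {letter: [0.0] * width for letter in alphabet}
    for start, prob in enumerate(probs):
        window = sequence[start:start + width]
        for k, letter in enumerate(window):
            counts[letter][k] += prob
    return probs, counts

# Hypothetical toy input: windows ACG, CGT, GTA, TAC with made-up scores.
probs, counts = weighted_counts("ACGTAC", width=3, scores=[1.0, 3.0, 4.0, 2.0])
print(probs)
print(counts["G"])
```

Because the window probabilities sum to 1, each column of the weighted count matrix also sums to 1 for a single sequence; low-scoring windows still contribute, only less.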

Page 88: Genes and Regulatory Elements Zhiping Weng U Mass Medical School


Page 89: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Problem: How do we estimate counts accurately when we have only a few examples?
Solution: Use Dirichlet mixture priors.

Problem: Too many possible starting points.
Solution: Save time by running only 1 iteration of EM at first.

Problem: Too many possible widths.
Solution: Consider widths that vary by 2 and adjust motifs afterwards.

Problem: Algorithm assumes exactly one motif occurrence per sequence.
Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.

Problem: The EM algorithm finds only one motif.
Solution: Probabilistically erase the motif from the data set, and repeat.

Problem: The motif model is too simplistic.
Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.

Problem: The EM algorithm does not tell you how many motifs there are.
Solution: Compute statistical significance of motifs and stop when they are no longer significant.

Page 90: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

MEME algorithm

do
    for (width = min; width < max; width *= 2)
        foreach possible starting point
            run 1 iteration of EM
        select candidate starting points
        foreach candidate
            run EM to convergence
    select best motif
    erase motif occurrences
until (E-value of found motif > threshold)

Page 91: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Gibbs Sampling: a type of Markov chain Monte Carlo (MCMC) method

Page 92: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Maximization Versus Sampling

• We are given some huge search space. Every point Z in the search space has some score S_Z defined as before.

• Sampling: wander around the search space in such a way that how often we visit each point is proportional to π_Z = exp(S_Z).

• Maximization: find the point with the highest π_Z, a likelihood ratio value between 0 and +∞.

• EM does maximization and MCMC does sampling.

• MCMC attempts to escape local optima.

Page 93: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Gibbs Sampling

Use a Markov chain to wander around the search space. If we are at point X, move to point Y with probability M_XY.

Suppose the search space is a 2D rectangle. (Typically, many dimensions!)

1. Start at a random point X.
2. Randomly pick a dimension.
3. Look at all points along this dimension.
4. Move to one of them randomly, proportional to its score π.
5. Repeat steps 2 to 4.

Page 94: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Initialization

Randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.

sequence 1: ACAGTGT
sequence 2: TTAGACC
sequence 3: GTGACCA
sequence 4: ACCCAGG
sequence 5: CAGGTTT

Page 95: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Gibbs sampler

• Initially: randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.

• Steps 2 & 3 (search):
– Throw away an instance s_i: the remaining (t - 1) instances define the weight matrix.
– The weight matrix defines the instance probability at each position of input string S_i.
– Pick a new s_i according to this probability distribution.

• Return the highest-scoring motif seen.
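The search loop described above can be sketched in a few lines. This is a simplified illustration, not the sampler of Lawrence et al.: it assumes a uniform background (so sampling weights are plain window probabilities), a pseudocount of 0.25 per letter, and it scores a set of instances by their log-likelihood under their own weight matrix. All names are illustrative.

```python
import math
import random

def weight_matrix(instances, pseudocount=0.25, alphabet="ACGT"):
    """Column-wise letter frequencies with pseudocounts."""
    total = len(instances) + pseudocount * len(alphabet)
    return [{a: (sum(inst[k] == a for inst in instances) + pseudocount) / total
             for a in alphabet}
            for k in range(len(instances[0]))]

def window_probability(window, matrix):
    return math.prod(column[c] for c, column in zip(window, matrix))

def gibbs_sampler(sequences, width, iterations=500, seed=0):
    rng = random.Random(seed)
    # Initialization: one random start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best_starts, best_score = list(starts), -math.inf
    for _ in range(iterations):
        i = rng.randrange(len(sequences))  # instance to throw away
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(sequences, starts)) if j != i]
        matrix = weight_matrix(others)
        weights = [window_probability(sequences[i][p:p + width], matrix)
                   for p in range(len(sequences[i]) - width + 1)]
        # Sample (not maximize) the new position, proportional to its weight.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        instances = [s[p:p + width] for s, p in zip(sequences, starts)]
        full = weight_matrix(instances)
        score = sum(math.log(window_probability(inst, full)) for inst in instances)
        if score > best_score:
            best_starts, best_score = list(starts), score
    return best_starts, best_score

# Toy sequences with a planted TTGACA, for illustration only.
toy = ["GGCGTTGACAGCGG", "CCTTGACACGCCAC", "AGAATTGACAAGAA", "TTGACACCGGAGCG"]
print(gibbs_sampler(toy, width=6))
```

Because positions are sampled rather than maximized, different seeds can give different runs; keeping the best-scoring set seen so far corresponds to the final bullet above.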

Page 96: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Sampler step illustration:

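Current instances (sequence 4 left out): ACAGTGT, TAGGCGT, ACACCGT, ???????, CAGGTTT

Weight matrix from the remaining instances:

     1   2   3   4   5   6   7
A  .45 .45 .45 .05 .05 .05 .05
C  .25 .45 .05 .25 .45 .05 .05
G  .05 .05 .45 .65 .05 .65 .05
T  .25 .05 .05 .05 .45 .25 .85

Sampling probabilities of candidate instances in sequence 4: ACGCCGT: 20%, ACGGCGT: 52%, a third candidate (not shown): 11%

Instances after sampling: ACAGTGT, TAGGCGT, ACACCGT, ACGCCGT, CAGGTTT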
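The numbers in this illustration can be checked by hand. The matrix entries are consistent with counting letters in the four remaining instances and adding a pseudocount of 0.25 per letter (an inference from the values, not something the slide states). The two candidates differ only at position 4, so their sampling probabilities should differ by the factor .65/.25 = 2.6, which matches 52% versus 20%.

```python
import math

INSTANCES = ["ACAGTGT", "TAGGCGT", "ACACCGT", "CAGGTTT"]  # sequence 4 left out

def weight_matrix(instances, pseudocount=0.25, alphabet="ACGT"):
    """Column-wise letter frequencies; the pseudocount value is an assumption."""
    total = len(instances) + pseudocount * len(alphabet)
    return [{a: (sum(inst[k] == a for inst in instances) + pseudocount) / total
             for a in alphabet}
            for k in range(len(instances[0]))]

def window_probability(window, matrix):
    return math.prod(column[c] for c, column in zip(window, matrix))

matrix = weight_matrix(INSTANCES)
print([round(matrix[0][a], 2) for a in "ACGT"])  # [0.45, 0.25, 0.05, 0.25]

ratio = window_probability("ACGGCGT", matrix) / window_probability("ACGCCGT", matrix)
print(round(ratio, 1))  # 2.6
```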

Page 97: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Comparison

• Both EM and Gibbs sampling involve iterating over two steps

• Convergence:
– EM converges when the PSSM stops changing.
– Gibbs sampling runs until you ask it to stop.

• Solution:
– EM may not find the motif with the highest score.
– Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.

Page 98: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Comparison of motif finders

Page 99: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Summary

• Motifs are represented by weight matrices.
• Motif quality is measured by relative entropy.
• Motif occurrences are scored using log likelihood ratios.
• EM and the Gibbs sampler attempt to find a motif with maximal relative entropy.
• Both algorithms alternate between predicting instances and predicting the weight matrix.

Page 100: Genes and Regulatory Elements Zhiping Weng U Mass Medical School
Page 101: Genes and Regulatory Elements Zhiping Weng U Mass Medical School

Homework

• Go to UCSC genome browser to get the top 100 regions bound by CTCF

• Use MEME to find the binding motif of CTCF