Transcription Regulation Transcription Factor Motif Finding Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

Upload: aoife

Post on 13-Feb-2016


Page 1: Transcription Regulation Transcription Factor Motif Finding

Transcription Regulation
Transcription Factor Motif Finding

Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520

Page 2: Transcription Regulation Transcription Factor Motif Finding

Outline

• Biology of transcription regulation and challenges of computational motif finding

• Scan for known TF motif sites
– TRANSFAC and JASPAR, sequence logo

• De novo methods
– Regular expression enumeration: w-mer enumeration
– Position weight matrix update: EM and Gibbs sampling

• Motif finding in different organisms
– Motif clusters and conservation

Page 3: Transcription Regulation Transcription Factor Motif Finding

Imagine a Chef

Restaurant dinner, home lunch

Certain recipes are used to make certain dishes

Page 4: Transcription Regulation Transcription Factor Motif Finding

Each Cell Is Like a Chef

4

Page 5: Transcription Regulation Transcription Factor Motif Finding

Each Cell Is Like a Chef

[Figure labels: Infant Skin; Adult Liver; Glucose, Oxygen, Amino Acid; Fat, Alcohol, Nicotine; Healthy Skin Cell State; Disease Liver Cell State]

Certain genes are expressed to make certain proteins

Page 6: Transcription Regulation Transcription Factor Motif Finding

Understanding a Genome

Get the complete sequence (the encoded cookbook)

Observe gene expression at different cell states (meals prepared in different situations)

Decode gene regulation (decode the book, understand the rules)

Page 7: Transcription Regulation Transcription Factor Motif Finding

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

Information in DNA

Milk->Yogurt
Beef->Burger
Egg->Omelet
Fish->Sushi
Flour->Cake

Coding region (2%): what is to be made

Page 8: Transcription Regulation Transcription Factor Motif Finding

Information in DNA

Non-coding region (98%): regulation (when, where, amount, other conditions, etc.)

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATTTACCACATCGCATCACTACGACGGACTAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Milk->Yogurt
Beef->Burger
Egg->Omelet
Fish->Sushi
Flour->Cake

[Figure annotations on the non-coding region: Morning, Morning, Japanese Restaurant, 5 Oz, 9 Oz, Butter, Butter]

Coding region (2%)

Page 9: Transcription Regulation Transcription Factor Motif Finding

Measure Gene Expression

• Microarray or SAGE detects the expression of every gene at a certain cell state

• Clustering finds genes that are co-expressed (and potentially share regulation)

Page 10: Transcription Regulation Transcription Factor Motif Finding

STAT115, 04/01/2008

Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAGTTCAGACACGGACGGCGCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTACCACCCACATCGAGAGCGCGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together:

[Figure columns: Upstream Regions | Co-expressed Genes (Scrambled Egg, Bacon, Cereal, Hash Brown, Orange Juice)]


Page 12: Transcription Regulation Transcription Factor Motif Finding


Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAGTTCAGACACGGACGGC GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together:

[Figure columns: Upstream Regions | Co-expressed Genes (Scrambled Egg, Bacon, Cereal, Hash Brown, Orange Juice); annotation: Morning]

Page 13: Transcription Regulation Transcription Factor Motif Finding

Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...

...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...

...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...

[Figure: upstream sequences of four hemoglobin genes (Hemoglobin Beta, Hemoglobin Zeta, Hemoglobin Alpha, Hemoglobin Gamma) with highlighted segments such as atttgctt, ttcact, gcaacct, aactccagt, actca and ccagcgccg; a Transcription Factor (TF) is shown at its TF Binding Motif]

A motif can only be discovered computationally when there are enough cases for machine learning

13

Page 14: Transcription Regulation Transcription Factor Motif Finding

Computational Motif Finding

• Input data:
– Upstream sequences of a gene expression profile cluster
– 20-800 sequences, each 300-5000 bp long

• Output: enriched sequence patterns (motifs)

• Ultimate goals:
– Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?
– Which genes are regulated by this TF, and why is there disease when a TF goes wrong?
– Are there binding partners / competitors for a TF?

Page 15: Transcription Regulation Transcription Factor Motif Finding

Challenges: Where / what is the signal?

The motif should be abundant

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAAATAAGACACGGACGGCGCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAATCTCGTATTTACCATATTAAATACCCACATCGAGAGCGCGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

[Figure labels: Water, Water, Water, Water, Water]

Page 16: Transcription Regulation Transcription Factor Motif Finding

The motif should be abundant, and abundant with significance

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAAATAAGACACGGACGGCGCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAATCTCGTATTTACCATATTAAATACCCACATCGAGAGCGCGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

[Figure labels: Coconut, Coconut, Coconut, Coconut, Coconut]

Challenges: Where/what the signal

16

Page 17: Transcription Regulation Transcription Factor Motif Finding

Challenges: Double-stranded DNA

The motif appears on both strands

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
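Searching both strands can be sketched as scanning the given strand for the motif and for its reverse complement. This is a minimal illustration that assumes exact matching; the function names are hypothetical and not from the slides.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_on_both_strands(seq, motif):
    """Return (start, strand) for every exact motif hit on either strand.

    A hit on the '-' strand is reported at the position where the
    reverse complement of the motif occurs on the given strand.
    """
    rc = reverse_complement(motif)
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc:
            hits.append((start, "-"))
    return hits


if __name__ == "__main__":
    # Second example sequence from the slide: the ATTTACC motif sits on the other strand.
    print(find_on_both_strands("CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC", "ATTTACC"))
```

A motif finder that ignores the reverse strand would miss the site in the second sequence entirely, which is why most tools scan both orientations.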

17

Page 18: Transcription Regulation Transcription Factor Motif Finding

Challenges: Base substitutions

Sequences do not have to match the motif perfectly; base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATGTACCACCAGTTCAGACACGGACGGC GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT

18

Page 19: Transcription Regulation Transcription Factor Motif Finding

Challenges: Variable motif copies

Some sequences do not have the motif
Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGCCACATCGCAATGCAGCAATGCGTTCAGACACGGACGGCTCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCGGCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAACGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT

19

Page 20: Transcription Regulation Transcription Factor Motif Finding

Challenges: Variable motif copies

Some sequences do not have the motif
Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGCCACATCGCAATGCAGCAATGCGTTCAGACACGGACGGCTCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCGGCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAACGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT

[Figure labels: Sushi, Hand Roll, Sashimi, Tempura, Sake; Fish (repeated 7 times)]

Page 21: Transcription Regulation Transcription Factor Motif Finding

Challenges: Two-block motifs

Some motifs have two parts

GACACATTTACCTATGC TGGCCCTACGACCTCTCGC CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAATCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT

or palindromic patterns: AATGCGGCGTAA

[Figure label: Coconut Milk]

Page 22: Transcription Regulation Transcription Factor Motif Finding

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision):
    Consensus:  CACAAAA
    Degenerate: CRCAAAW (IUPAC: R = A/G, W = A/T)

Page 23: Transcription Regulation Transcription Factor Motif Finding

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision):
    Consensus:  CACAAAA
    Degenerate: CRCAAAW
– Position weight matrix (PWM): needs a score cutoff

Sites (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG

Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score:

$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$

where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$.
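The matrix and the segment score above can be reproduced with a short sketch. It assumes a uniform background of 0.25 per base and an optional pseudocount, neither of which is specified on the slide; the function names are illustrative only.

```python
from collections import Counter

BASES = "ACGT"

# The ten aligned sites shown on the slide.
SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]


def build_pwm(sites, pseudocount=0.0):
    """Column-wise base frequencies of equally long aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        total = len(sites) + len(BASES) * pseudocount
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm


def likelihood_ratio(segment, pwm, background):
    """p(segment | motif matrix) / p(segment | background)."""
    ratio = 1.0
    for pos, base in enumerate(segment):
        ratio *= pwm[pos][base] / background[base]
    return ratio


def scan(sequence, pwm, background, cutoff):
    """Return (start, score) for every window scoring at or above the cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = likelihood_ratio(sequence[start:start + width], pwm, background)
        if score >= cutoff:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    uniform = {b: 0.25 for b in BASES}  # assumed background, not from the slide
    pwm = build_pwm(SITES)
    print(pwm[0])  # position 1 of the motif matrix
    print(likelihood_ratio("ATGCAGCT", pwm, uniform))
```

Without pseudocounts, a single base never seen at a position (such as A at position 5) drives the whole score to zero, which is one reason pseudocounts are usually added before scanning.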

23

Page 24: Transcription Regulation Transcription Factor Motif Finding

IUPAC for DNA

A  adenosine
C  cytidine
G  guanine
T  thymidine
U  uridine
R  G A (purine)
Y  T C (pyrimidine)
K  G T (keto)
M  A C (amino)
S  G C (strong)
W  A T (weak)
B  C G T (not A)
D  A G T (not C)
H  A C T (not G)
V  A C G (not T)
N  A C G T (any)
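One way to use the IUPAC codes for scanning is to translate a degenerate consensus into a regular expression. The sketch below covers the DNA codes in the table (U is left out, since it only applies to RNA); the helper names are hypothetical.

```python
import re

# IUPAC DNA codes from the table, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC consensus such as CRCAAAW into a regex string."""
    return "".join(IUPAC[symbol] for symbol in pattern.upper())


def scan_iupac(sequence, pattern):
    """Return (start, matched_site) for every match, including overlapping ones."""
    # The lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(iupac_to_regex("CRCAAAW"))
    print(scan_iupac("TTCACAAAAGGCGCAAATCC", "CRCAAAW"))
```

This is the "binary decision" representation: a window either matches the pattern or it does not, with no notion of a partial score.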

24

Page 25: Transcription Regulation Transcription Factor Motif Finding

Protein Binding Microarrays

• In vitro protein-DNA interactions

• Better capture motifs

25

Page 26: Transcription Regulation Transcription Factor Motif Finding

JASPAR

• User defined cutoff to scan for a particular motif

26

Page 27: Transcription Regulation Transcription Factor Motif Finding

A Word on Sequence Logo

• SeqLogo consists of stacks of symbols, one stack for each position in the sequence

• The overall height of the stack indicates the sequence conservation at that position (information content, in bits)

• The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position

ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
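The stack heights of a logo can be sketched for the ten sites above as 2 bits minus the column entropy. This sketch omits the small-sample correction that logo tools commonly apply, so the numbers are illustrative only.

```python
import math
from collections import Counter

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]


def column_information(sites):
    """Stack height per position in bits: 2 - Shannon entropy of the column."""
    n = len(sites)
    info = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info


def letter_heights(sites):
    """Height of each letter: its relative frequency times the stack height."""
    n = len(sites)
    heights = []
    for pos, stack in enumerate(column_information(sites)):
        counts = Counter(site[pos] for site in sites)
        heights.append({base: (c / n) * stack for base, c in counts.items()})
    return heights


if __name__ == "__main__":
    for pos, (stack, letters) in enumerate(zip(column_information(SITES),
                                               letter_heights(SITES)), start=1):
        print(pos, round(stack, 3), letters)
```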

27

Page 28: Transcription Regulation Transcription Factor Motif Finding

Scan Known TF Motifs

• Drawbacks:
– Limited number of motifs
– Limited number of sites to represent each motif
    • Low sensitivity and specificity
– Poor description of motif
    • Binding site borders not clear
    • Binding sites have many mismatches
– Many motifs look very similar
    • E.g. GC-rich motif, E-box (CACGTG)

Page 29: Transcription Regulation Transcription Factor Motif Finding

De novo Sequence Motif Finding

• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

• Regular expression enumeration
– Pattern-driven approach
– Enumerate patterns, check significance in dataset
– Oligonucleotide analysis, MobyDick

• Position weight matrix update
– Data-driven approach, use data to refine motifs
– Consensus, EM & Gibbs sampling
– Motif score and Markov background

Page 30: Transcription Regulation Transcription Factor Motif Finding

Regular Expression Enumeration

• Oligonucleotide analysis: check over-representation for every w-mer:
– Expected occurrence of w in the data
    • Consider genome sequence + current data size
– Observed occurrence of w in the data
– An over-represented w is a potential TF binding motif

Compare the observed occurrence of w in the data with the expected occurrence:

$$\text{Expected occurrence of } w \text{ in the data} = p_w \text{ (from genome background)} \times \text{size of sequence data}$$
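A minimal sketch of this comparison follows. It assumes p_w is estimated from w-mer counts in a separate background sequence set with a small pseudocount, and that "size of sequence data" means the number of w-mer windows; a real tool would also attach a significance test. All names are hypothetical.

```python
from collections import Counter


def wmer_counts(seqs, w):
    """Count every overlapping w-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts


def over_representation(data_seqs, background_seqs, w, pseudocount=1.0):
    """Return {w-mer: (observed, expected, observed / expected)} for the data."""
    observed = wmer_counts(data_seqs, w)
    background = wmer_counts(background_seqs, w)
    n_data = sum(observed.values())      # number of w-mer windows in the data
    n_background = sum(background.values())
    n_words = 4 ** w                     # number of possible w-mers
    result = {}
    for word, obs in observed.items():
        # p_w from the background, smoothed so unseen words are not zero.
        p_w = (background[word] + pseudocount) / (n_background + pseudocount * n_words)
        expected = p_w * n_data
        result[word] = (obs, expected, obs / expected)
    return result


if __name__ == "__main__":
    stats = over_representation(["ACGACG"], ["ACGT" * 10], w=3)
    for word, (obs, exp, ratio) in sorted(stats.items()):
        print(word, obs, round(exp, 3), round(ratio, 2))
```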

30

Page 31: Transcription Regulation Transcription Factor Motif Finding

MobyDick

• Sequence data and a dictionary of motif words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}
Pw = {0.22, 0.28, 0.28, 0.22}

31

Page 32: Transcription Regulation Transcription Factor Motif Finding

MobyDick

• Sequence data and a dictionary of motif words

• Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}
Pw = {0.28, 0.22, 0.22, 0.28}

Word pairs:
     A   C   G   T
A    AA  AC  AG  AT
C    CA  CC  CG  CT
G    GA  GC  GG  GT
T    TA  TC  TG  TT

Page 33: Transcription Regulation Transcription Factor Motif Finding

MobyDick

• Sequence data and a dictionary of motif words

• Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}
Pw = {0.28, 0.28, 0.22, 0.22}

Word pairs:
     A   C   G   T
A    AA  AC  AG  AT
C    CA  CC  CG  CT
G    GA  GC  GG  GT
T    TA  TC  TG  TT

D = {A, C, G, T, AA, GA, TA, GG}
Pw = {?}

Page 34: Transcription Regulation Transcription Factor Motif Finding

MobyDick

• D = {A, C, G, T, AA, GA, TA, GG}
• Seq: AAGATAA
• Possible partitions:
    A A G A T A A    pA pA pG pA pT pA pA
    AA G A T A A     pAA pG pA pT pA pA
    AA GA T A A      pAA pGA pT pA pA
    AA GA TA A       pAA pGA pTA pA
    A A GA T AA      pA pA pGA pT pAA
    …
• Assign probabilities so as to maximize the total probability of generating the sequence
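The total probability over all partitions can be computed without listing them, using dynamic programming over prefixes of the sequence. The sketch below only evaluates that total for given word probabilities; the re-estimation of the probabilities themselves is not shown, and the function name is hypothetical.

```python
def total_probability(seq, word_probs):
    """Sum, over every partition of seq into dictionary words, of the product
    of the word probabilities.

    prefix[i] holds the total probability of generating seq[:i].
    """
    prefix = [0.0] * (len(seq) + 1)
    prefix[0] = 1.0  # the empty prefix is generated with probability 1
    for end in range(1, len(seq) + 1):
        for word, prob in word_probs.items():
            start = end - len(word)
            if start >= 0 and seq[start:end] == word:
                prefix[end] += prefix[start] * prob
    return prefix[len(seq)]


if __name__ == "__main__":
    dictionary = ["A", "C", "G", "T", "AA", "GA", "TA", "GG"]
    # Setting every word probability to 1 turns the total into a partition count.
    print(total_probability("AAGATAA", {w: 1.0 for w in dictionary}))
```

With all probabilities set to 1, the function simply counts how many ways AAGATAA can be split into words of D, which shows how quickly the number of partitions grows as the dictionary gains longer words.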

34

Page 35: Transcription Regulation Transcription Factor Motif Finding

MobyDick

• Sequence data and a dictionary of motif words

• Check over-representation of every word-pair

• Reassign word probability and consider every new word-pair to build even longer words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}
Pw = {0.28, 0.28, 0.22, 0.22}

Word pairs:
     A   C   G   T
A    AA  AC  AG  AT
C    CA  CC  CG  CT
G    GA  GC  GG  GT
T    TA  TC  TG  TT

D = {A, C, G, T, AA, GA, TA, GG}
Pw = {?}

Page 36: Transcription Regulation Transcription Factor Motif Finding

Regular Expression Enumeration

• RE enumeration derivatives:
– Oligo-analysis, spaced dyads w1.ns.w2
– IUPAC alphabet
– Markov background (later)
– 2-bit encoding, fast index access
– Enumerate limited RE patterns known for a TF protein structure or interaction theme
• Exhaustive, guaranteed to find the global optimum, and can find multiple motifs
• Not as flexible with base substitutions, long list of similar good motifs, and limited in motif width
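The "2-bit encoding, fast index access" idea can be sketched as follows: each base takes two bits, so a w-mer becomes an integer that indexes directly into a count table. The specific bit assignment (A=0, C=1, G=2, T=3) is an assumption made for illustration.

```python
# Assumed 2-bit code per base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(wmer):
    """Pack a w-mer into an integer, two bits per base."""
    index = 0
    for base in wmer:
        index = (index << 2) | CODE[base]
    return index


def decode(index, w):
    """Unpack an integer back into a w-mer of width w."""
    bases = []
    for _ in range(w):
        bases.append("ACGT"[index & 3])
        index >>= 2
    return "".join(reversed(bases))


def count_wmers(sequence, w):
    """Count all overlapping w-mers using a rolling 2-bit index."""
    counts = [0] * (4 ** w)
    mask = (1 << (2 * w)) - 1  # keeps only the last w bases
    index = 0
    for i, base in enumerate(sequence):
        index = ((index << 2) | CODE[base]) & mask
        if i >= w - 1:
            counts[index] += 1
    return counts


if __name__ == "__main__":
    counts = count_wmers("ACGACG", 3)
    print(encode("ACG"), counts[encode("ACG")], decode(encode("ACG"), 3))
```

The rolling update touches each base once, so counting every w-mer in the data takes time linear in the sequence length, at the cost of a table with 4^w entries.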

Page 37: Transcription Regulation Transcription Factor Motif Finding

Consensus

• Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

[Figure: Seq1, Seq2, ……]

Good motifs:
CACGTGC   GTCAGTC
CACGTTC   GTCAGTC

Bad motifs:
CACGTGC   GTCAGTC
GTGACAT   TGGAAAT

Page 38: Transcription Regulation Transcription Factor Motif Finding

Consensus

• Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

[Figure: Seq3 added to the remaining good motifs]

Good motifs:
CACGTGC   GTCAGTC
CACGTTC   GTCAGTC
CTCGTGC   GACAGTC

Bad motifs:
CACGTGC   GTCAGTC
CACGTTC   GTCAGTC
TTCAAAG   AGACTCA

Page 39: Transcription Regulation Transcription Factor Motif Finding

Consensus

• Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

• By G Stormo; the algorithm runs very fast

• Sequence order plays a big role in performance
– The first two sequences had better contain the motif
– Sites stop accumulating at the first bad sequence
– A newer version allowing [0-n] is much slower

Page 40: Transcription Regulation Transcription Factor Motif Finding

Expectation Maximization and Gibbs Sampling Model

• Objects:
– Seq: sequence data to search for motif
– θ0: non-motif (genome background) probability
– θ: motif probability matrix parameter
– π: motif site locations

• Problem: P(θ, π | seq, θ0)
• Approach: alternately estimate
– π by P(π | θ, seq, θ0)
– θ by P(θ | π, seq, θ0)
– EM and Gibbs differ in the estimation methods

Page 41: Transcription Regulation Transcription Factor Motif Finding

Expectation Maximization

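• E step: π | θ, seq, θ0

TTGACGACTGCACGTT
TTGAC              p1
 TGACG             p2
  GACGA            p3
   ACGAC           p4
    CGACT          p5
     GACTG         p6
      ACTGC        p7
       CTGCA       p8
...

$$p_1 = \text{likelihood ratio} = \frac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}$$

$$P(\text{TTGAC} \mid \theta_0) = p_{0T}\, p_{0T}\, p_{0G}\, p_{0A}\, p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$$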

Page 42: Transcription Regulation Transcription Factor Motif Finding

Expectation Maximization

• E step: π | θ, seq, θ0

TTGACGACTGCACGTT
TTGAC              p1
 TGACG             p2
  GACGA            p3
   ACGAC           p4
    CGACT          p5
     GACTG         p6
      ACTGC        p7
       CTGCA       p8
...

• M step: θ | π, seq, θ0

p1  TTGAC
p2  TGACG
p3  GACGA
p4  ACGAC
...

• Scale ACGT at each position; θ reflects the weighted average of π
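One EM iteration on a single sequence can be sketched as below. The sketch assumes exactly one site per sequence (so the window weights are normalized to sum to 1) and a small pseudocount in the M step; both are simplifying assumptions, and the function names are hypothetical.

```python
BASES = "ACGT"


def e_step(seq, theta, theta0):
    """Weight of each window: its likelihood ratio, normalized over all windows."""
    width = len(theta)
    ratios = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            base = seq[start + j]
            ratio *= theta[j][base] / theta0[base]
        ratios.append(ratio)
    total = sum(ratios)
    return [r / total for r in ratios]


def m_step(seq, weights, width, pseudocount=0.1):
    """Re-estimate theta as the weighted average of all windows."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for start, weight in enumerate(weights):
        for j in range(width):
            counts[j][seq[start + j]] += weight
    theta = []
    for column in counts:
        column_total = sum(column.values())
        theta.append({b: column[b] / column_total for b in BASES})
    return theta


if __name__ == "__main__":
    seq = "TTGACGACTGCACGTT"
    theta0 = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}   # background from the slide
    theta = [{b: 0.25 for b in BASES} for _ in range(5)]  # flat starting motif
    for _ in range(10):
        weights = e_step(seq, theta, theta0)
        theta = m_step(seq, weights, width=5)
    print([round(w, 3) for w in weights])
```

With several sequences, the same two steps would be applied with the weights of all sequences pooled in the M step.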

42

Page 43: Transcription Regulation Transcription Factor Motif Finding

EM Derivatives

• First EM motif finder (C Lawrence)
– Deterministic algorithm, guarantees a local optimum
• MEME (TL Bailey)
– Prior probability allows 0-n sites / sequence
– Runs multiple EMs in parallel with different seeds
– User-friendly results

Page 44: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampling

• Stochastic process, although it may still need multiple initializations
– Sample π from P(π | θ, seq, θ0)
– Sample θ from P(θ | π, seq, θ0)

• Collapsed form: θ estimated with counts, not sampled from Dirichlet
– Sample site from one seq based on sites from other seqs

• Converged motif matrix θ and converged motif sites π represent the stationary distribution of a Markov chain

Page 45: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

• Randomly initialize a probability matrix

[Figure: five sequences (1-5), each contributing one initial site to the initial motif matrix]

Matrix entries are estimated with counts:

$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$$

Page 46: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

• Take out one sequence with its sites from the current motif

[Figure: motif matrix without the segment from sequence 1]

Page 47: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: starting position of segment (1-20); y-axis: segment score (0-30); segment (1-8) of sequence 1 is scored against the motif matrix without the segment from sequence 1]

Page 48: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: starting position of segment (1-20); y-axis: segment score (0-30); segment (2-9) of sequence 1 is scored against the motif matrix without the segment from sequence 1]

Page 49: Transcription Regulation Transcription Factor Motif Finding

Segment Score

• Use current motif matrix to score a segment

Sites (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG

Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score:

$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$

where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$.

Page 50: Transcription Regulation Transcription Factor Motif Finding

Scoring Segments

Motif   1    2    3    4    5    bg
A       0.4  0.1  0.3  0.4  0.2  0.3
T       0.2  0.5  0.1  0.2  0.2  0.3
G       0.2  0.2  0.2  0.3  0.4  0.2
C       0.2  0.2  0.4  0.1  0.2  0.2

Ignore pseudocounts for now…

Sequence: TTCCATATTAATCAGATTCCG…

Segment  Score
TAATC    …
AATCA    0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
ATCAG    0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
TCAGA    0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
CAGAT    …
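The segment scores in this example can be checked with a few lines of code. The sketch takes the motif columns and the bg column exactly as tabulated above and, as on the slide, ignores pseudocounts; the function names are illustrative.

```python
# Motif probabilities per base for positions 1-5, as tabulated on the slide.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
# Background (bg) column of the same table.
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def segment_score(segment):
    """Product over positions of motif probability / background probability."""
    score = 1.0
    for pos, base in enumerate(segment):
        score *= MOTIF[base][pos] / BACKGROUND[base]
    return score


def score_all_segments(sequence, width=5):
    """Score every window of the given width along the sequence."""
    return [(sequence[i:i + width], segment_score(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]


if __name__ == "__main__":
    for segment, score in score_all_segments("TTCCATATTAATCAGATTCCG"):
        print(segment, round(score, 6))
```

Each factor divides by the background probability of the base actually observed at that position, so a C always contributes a denominator of 0.2 and an A or T a denominator of 0.3.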

50

Page 51: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

• Sample site from one seq based on sites from other seqs

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: starting position of segment (1-20); y-axis: segment score (0-30); a new site is sampled for sequence 1]

Modified motif matrix estimated with counts

Page 52: Transcription Regulation Transcription Factor Motif Finding

How to Sample?

Pos    1  2  3   4   5   6   7   8   9
Score  3  1  12  5   8   9   1   2   6
SubT   3  4  16  21  29  38  39  41  47

• Draw X = Rand(final subtotal), a random number between 0 and the last subtotal
• Find the first position with subtotal larger than X

Pos    1  2  3   4   5   6   7    8    9
Score  3  1  12  5   8   9   500  2    6
SubT   3  4  16  21  29  38  538  540  546
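The subtotal trick can be sketched with a cumulative sum and a binary search. Positions are 1-based as in the tables; the function names are hypothetical.

```python
import bisect
import itertools
import random


def subtotals(scores):
    """Running totals of the scores (the SubT row)."""
    return list(itertools.accumulate(scores))


def pick_position(scores, x):
    """1-based position of the first subtotal strictly larger than x."""
    position = bisect.bisect_right(subtotals(scores), x) + 1
    return min(position, len(scores))  # guard against x equal to the final subtotal


def sample_position(scores, rng=random):
    """Sample a position with probability proportional to its score."""
    x = rng.random() * sum(scores)  # uniform in [0, final subtotal)
    return pick_position(scores, x)


if __name__ == "__main__":
    scores = [3, 1, 12, 5, 8, 9, 500, 2, 6]
    print(subtotals(scores))
    rng = random.Random(0)
    draws = [sample_position(scores, rng) for _ in range(1000)]
    print(draws.count(7) / len(draws))  # position 7 dominates but is not guaranteed
```

The second table illustrates the point of sampling rather than taking the maximum: a position with a score of 500 is chosen most of the time, yet every other position keeps a small chance.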

Page 53: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

• Repeat the process until the motif converges

[Figure: motif matrix without the segment from sequence 2]
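Putting the steps together, a bare-bones site sampler might look like the sketch below. It assumes exactly one site per sequence, a fixed motif width, a fixed number of sweeps instead of a convergence test, and a pseudocount of 0.5; none of these choices come from the slides, and the function names are hypothetical.

```python
import random

BASES = "ACGT"


def motif_matrix(sites, width, pseudocount=0.5):
    """Estimate the motif matrix from aligned sites using counts plus pseudocounts."""
    matrix = []
    for j in range(width):
        column = {b: pseudocount for b in BASES}
        for site in sites:
            column[site[j]] += 1
        total = sum(column.values())
        matrix.append({b: column[b] / total for b in BASES})
    return matrix


def segment_score(segment, matrix, background):
    """Likelihood ratio of a segment under the motif matrix versus background."""
    score = 1.0
    for j, base in enumerate(segment):
        score *= matrix[j][base] / background[base]
    return score


def gibbs_sampler(seqs, width, background, n_sweeps=200, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    # Random initialization: one site per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_sweeps):
        for i, seq in enumerate(seqs):
            # Take out sequence i and build the matrix from the other sites.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(seqs, starts)) if k != i]
            matrix = motif_matrix(others, width)
            # Score every segment of sequence i, then sample a new site
            # with probability proportional to its score.
            scores = [segment_score(seq[p:p + width], matrix, background)
                      for p in range(len(seq) - width + 1)]
            starts[i] = rng.choices(range(len(scores)), weights=scores)[0]
    return starts


if __name__ == "__main__":
    seqs = ["TTCCATATTAATCAGATTCCG", "ACGTATCAGTTGCA", "GGATCAGCCTA", "CCCATCAGTT"]
    uniform = {b: 0.25 for b in BASES}
    starts = gibbs_sampler(seqs, width=5, background=uniform)
    print([s[p:p + 5] for s, p in zip(seqs, starts)])
```

Because the sampler is stochastic, different seeds can settle on different alignments, which is why multiple initializations are usually run and the best-scoring motif kept.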

53

Page 54: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler Intuition

• Beginning:
– Randomly initialized motif
– No preference towards any segment

[Figure: bar chart "Beginning Iterations"; x-axis: starting position of segments (1-20); y-axis ticks: 0-40]

Page 55: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler Intuition

• Motif appears:
– Motif should have enriched signal (more sites)
– By chance some correct sites come to alignment
– Sites bias motif to attract other similar sites

[Figure: bar chart "Some good aligned segments come"; x-axis: starting position of segments (1-20); y-axis ticks: 0-40]

Page 56: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler Intuition

• Motif converges:
– All sites come to alignment
– Motif totally biased to sample sites every time

[Figure: bar chart "Motif converges towards the end"; x-axis: starting position of segments (1-20); y-axis ticks: 0-40]

Page 57: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampler

[Figure: five sequences (1-5) with their current sites]

• Column shift

• Metropolis algorithm:
– Propose π* as π shifted 1 column to the left or right
– Calculate motif scores u(π) and u(π*)
– Accept π* with prob = min(1, u(π*) / u(π))
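The column-shift move can be sketched as a single Metropolis step. The motif score u is passed in as a caller-supplied function (`motif_score` is a hypothetical name) and is assumed to be positive; proposals that would run off a sequence end are simply rejected in this sketch.

```python
import random


def acceptance_probability(u_current, u_proposed):
    """Metropolis acceptance probability min(1, u(proposed) / u(current))."""
    return min(1.0, u_proposed / u_current)


def metropolis_shift(starts, seqs, width, motif_score, rng=random):
    """Propose shifting all sites one column left or right; accept or keep."""
    shift = rng.choice([-1, 1])
    proposal = [p + shift for p in starts]
    # Reject proposals that fall outside any sequence.
    if any(p < 0 or p + width > len(s) for p, s in zip(proposal, seqs)):
        return starts
    u_current = motif_score(starts)
    u_proposed = motif_score(proposal)
    if rng.random() < acceptance_probability(u_current, u_proposed):
        return proposal
    return starts


if __name__ == "__main__":
    print(acceptance_probability(4.0, 1.0))  # a worse proposal is accepted only sometimes
    print(acceptance_probability(2.0, 4.0))  # a better proposal is always accepted
```

The move helps when the sampler has locked onto a shifted version of the true motif, a local optimum that single-site updates leave only slowly.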

57

Page 58: Transcription Regulation Transcription Factor Motif Finding

Gibbs Sampling Derivatives

• Gibbs Motif Sampler (JS Liu)
– Add prior probability to allow 0-n sites / seq
– Sample motif positions to consider

• AlignACE (F Roth)
– Look for motifs on both strands
– Mask out one motif to find more, different motifs

• BioProspector (XS Liu)
– Use background model with Markov dependencies
– Sampling with threshold (0-n sites / seq), new scoring function
– Can find two-block motifs with variable gap

Page 59: Transcription Regulation Transcription Factor Motif Finding

Scoring Motifs

• Information Content (also known as relative entropy)
– Suppose you have x aligned segments for the motif
– pb(s1 from mtf) / pb(s1 from bg) *
  pb(s2 from mtf) / pb(s2 from bg) *
  …
  pb(sx from mtf) / pb(sx from bg)

Sites (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG

Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score:

$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$

where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$.


Page 61: Transcription Regulation Transcription Factor Motif Finding

Scoring Motifs

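$$\frac{pb(s_1 \text{ from mtf})}{pb(s_1 \text{ from bg})} \times \frac{pb(s_2 \text{ from mtf})}{pb(s_2 \text{ from bg})} \times \dots \times \frac{pb(s_x \text{ from mtf})}{pb(s_x \text{ from bg})}$$

$$= \left(\frac{p_{A1}}{p_{A0}}\right)^{A_1} \left(\frac{p_{T1}}{p_{T0}}\right)^{T_1} \left(\frac{p_{T2}}{p_{T0}}\right)^{T_2} \left(\frac{p_{G2}}{p_{G0}}\right)^{G_2} \left(\frac{p_{C2}}{p_{C0}}\right)^{C_2} \cdots$$

where A1 is the number of segments with A at position 1, and so on.

Take log of this:

$$= A_1 \log\frac{p_{A1}}{p_{A0}} + T_1 \log\frac{p_{T1}}{p_{T0}} + T_2 \log\frac{p_{T2}}{p_{T0}} + G_2 \log\frac{p_{G2}}{p_{G0}} + \dots$$

Divide by the number of segments (if all the motifs have the same number of segments):

$$= p_{A1} \log\frac{p_{A1}}{p_{A0}} + p_{T1} \log\frac{p_{T1}}{p_{T0}} + p_{T2} \log\frac{p_{T2}}{p_{T0}} + \dots$$

Sites (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG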
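The derivation can be checked numerically: the log of the product of likelihood ratios over all aligned segments, divided by the number of segments, should equal the sum of p log(p / p0) over positions and bases. The sketch uses the ten sites above and an assumed uniform background of 0.25.

```python
import math
from collections import Counter

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]


def column_frequencies(sites):
    """Observed base frequencies per position (no pseudocounts)."""
    n = len(sites)
    return [{b: c / n for b, c in Counter(site[pos] for site in sites).items()}
            for pos in range(len(sites[0]))]


def mean_log_likelihood_ratio(sites, background):
    """Log of the product of per-segment likelihood ratios, divided by the segment count."""
    freqs = column_frequencies(sites)
    total = 0.0
    for site in sites:
        for pos, base in enumerate(site):
            total += math.log(freqs[pos][base] / background[base])
    return total / len(sites)


def relative_entropy(sites, background):
    """Sum over positions and bases of p * log(p / p0)."""
    return sum(p * math.log(p / background[base])
               for column in column_frequencies(sites)
               for base, p in column.items())


if __name__ == "__main__":
    uniform = {b: 0.25 for b in "ACGT"}  # assumed background
    print(mean_log_likelihood_ratio(SITES, uniform))
    print(relative_entropy(SITES, uniform))
```

The two printed values agree up to floating-point rounding, which is the content of the "divide by the number of segments" step.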

61

Page 62: Transcription Regulation Transcription Factor Motif Finding

Scoring Motifs

• Original function: Information Content

$$= \sum_{i} \sum_{b} p_{bi} \log\frac{p_{bi}}{p_{b0}}$$

Motif Conservedness: How likely to see the current aligned segments from this motif model

Good    Bad
ATGCA   AGGCA
ATGCC   ATCCC
ATGCA   GCGCA
ATGCA   CGGTA
TTGCA   TGCCA
ATGGA   ATGGT
ATGCA   TTGAA

Page 63: Transcription Regulation Transcription Factor Motif Finding

Scoring Motifs

• Original function: Information Content

$$= \sum_{i} \sum_{b} p_{bi} \log\frac{p_{bi}}{p_{b0}}$$

Motif Specificity: How likely to see the current aligned segments from background

Good    Bad
AGTCC   ATAAA
AGTCC   ATAAA
AGTCC   ATAAA
AGTCC   ATAAA
AGTCC   ATAAA
AGTCC   ATAAA
AGTCC   ATAAA

Page 64: Transcription Regulation Transcription Factor Motif Finding

Scoring Motifs

• Original function: Information Content

$$= \sum_{i} \sum_{b} p_{bi} \log\frac{p_{bi}}{p_{b0}}$$

Which is better? (data = 8 seqs)

Motif 1
AGGCTAAC
AGGCTAAC

Motif 2
AGGCTAAC
AGGCTACC
AGGCTAAC
AGCCTAAC
AGGCCAAC
AGGCTAAC
TGGCTAAC
AGGCTTAC
AGGCTAAC
AGGGTAAC

Page 65: Transcription Regulation Transcription Factor Motif Finding

Scoring Motifs

• Motif scoring function:

[Figure: scoring formula with its terms annotated: motif signal abundant; positions conserved; specific (unlikely in genome background)]

• Prefer conserved motifs with many sites that are not often seen in the genome background

Page 66: Transcription Regulation Transcription Factor Motif Finding

Markov Background Increases Motif Specificity

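Prefers motif segments enriched only in the data, but not so likely to occur in the background

$$\text{Segment ATGTA score} = \frac{p(\text{generate ATGTA from } \theta)}{p(\text{generate ATGTA from } \theta_0)}$$

3rd order Markov dependency p( )

$$\text{TCAGC} = \frac{.25 \times .25 \times .25 \times .25 \times .25}{.3 \times .18 \times .16 \times .22 \times .24}$$

$$\text{ATATA} = \frac{.25 \times .25 \times .25 \times .25 \times .25}{.3 \times .41 \times .38 \times .42 \times .30}$$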
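A Markov background can be sketched as conditional base probabilities given the preceding bases, estimated from a background sequence. The pseudocount, the uniform fallback for unseen or missing contexts, and the function names are all assumptions of this sketch; the order is a parameter (3 in the slide's example).

```python
from collections import defaultdict

BASES = "ACGT"


def train_markov(background, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from a background sequence."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for i in range(order, len(background)):
        counts[background[i - order:i]][background[i]] += 1
    model = {}
    for context, column in counts.items():
        total = sum(column.values())
        model[context] = {b: column[b] / total for b in BASES}
    return model


def segment_background_prob(sequence, start, width, model, order):
    """Background probability of sequence[start:start + width], using upstream context."""
    prob = 1.0
    for i in range(start, start + width):
        context = sequence[i - order:i] if i >= order else None
        if context in model:
            prob *= model[context][sequence[i]]
        else:
            prob *= 0.25  # assumed uniform fallback when no context is available
    return prob


if __name__ == "__main__":
    # Toy background rich in alternating AT, so ATATA-like segments are expected.
    model = train_markov("AT" * 50 + "ACGTTGCA" * 5, order=3)
    print(segment_background_prob("GGATATAGG", 2, 5, model, order=3))
    print(segment_background_prob("GGTCAGCGG", 2, 5, model, order=3))
```

A segment that is common in the background, such as an AT-rich repeat, receives a high background probability and therefore a low likelihood ratio, which is how the Markov background improves specificity.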

66

Page 67: Transcription Regulation Transcription Factor Motif Finding

Position Weight Matrix Update

• Advantages:
– Can look for motifs of any width
– Flexible with base substitutions

• Disadvantages:
– EM and Gibbs sampling: no guaranteed convergence time
– No guaranteed global optimum

Page 68: Transcription Regulation Transcription Factor Motif Finding

Motif Finding in Bacteria

• Promoter sequences are short (200-300 bp)
• Motifs are usually long (10-20 bases)
– Some have two blocks with a gap, some are palindromes
– Long motifs are usually very degenerate
• A single microarray experiment sometimes already provides enough information to search for TF motifs

Page 69: Transcription Regulation Transcription Factor Motif Finding

Motif Finding in Lower Eukaryotes

• Upstream sequences are longer (500-1000 bp), with some simple repeats
• Motif width varies (5 – 17 bases)
• Expression clusters provide input sequences of decent quality for TF motif finding
• Motif combination and redundancy appear, although single motifs are usually significant enough for identification

Page 70: Transcription Regulation Transcription Factor Motif Finding

Yeast Promoter Architecture

• Co-occurring regulators suggest physical interaction between the regulators

Page 71: Transcription Regulation Transcription Factor Motif Finding

Motif Finding in Higher Eukaryotes

• Upstream sequences are very long (3-20 kb) with repeats; TF motifs could also appear downstream
• Motifs can be short or long (6-20 bases), and appear in combinations and clusters
• Gene expression clusters are not good enough input
• Need:
– Comparative genomics: phastCons score
– Motif modules: motif clusters
– ChIP-chip/seq

Page 72: Transcription Regulation Transcription Factor Motif Finding

72

Yeast Regulatory Sequence Conservation

Page 73: Transcription Regulation Transcription Factor Motif Finding

73

UCSC PhastCons Conservation

• Functional regulatory sequences are under stronger evolutionary constraint
• Align orthologous sequences together
• PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC

Page 74: Transcription Regulation Transcription Factor Motif Finding

74

Conserved Motif Clusters

• First find conserved regions in the genome

• Then look for repeated transcription factor (TF) binding sites

• They form transcription factor modules

Page 75: Transcription Regulation Transcription Factor Motif Finding

Summary

• Biology and challenges of transcription regulation
• Scan for known TF motif sites: TRANSFAC & JASPAR
• De novo methods
– Regular expression enumeration
    • Oligonucleotide analysis
    • MobyDick: build long motifs from short ones
– Position weight matrix update
    • CONSENSUS (sequence order)
    • EM (iterate θ, π; θ ~ weighted average)
    • Gibbs Sampler (sample θ, π; Markov chain convergence)
    • Motif score and Markov background
• Motif cluster and motif conservation