transcription regulation transcription factor motif finding

Transcription Regulation: Transcription Factor Motif Finding

Xiaole Shirley Liu

STAT115, STAT215, BIO298, BIST520

Outline

• Biology of transcription regulation and challenges of computational motif finding

• Scan for known TF motif sites
– TRANSFAC and JASPAR, sequence logo

• De novo methods
– Regular expression enumeration: w-mer enumeration
– Position weight matrix update: EM and Gibbs sampling

• Motif finding in different organisms
– Motif clusters and conservation

Imagine a Chef

Restaurant Dinner Home Lunch

Certain recipes are used to make certain dishes

Each Cell Is Like a Chef


Infant Skin: Glucose, Oxygen, Amino Acid (Healthy Skin Cell State)

Adult Liver: Fat, Alcohol, Nicotine (Disease Liver Cell State)

Certain genes are expressed to make certain proteins

Understanding a Genome

Get the complete sequence (encoded cook book)

Observe gene expression at different cell states (meals prepared in different situations)

Decode gene regulation (decode the book, understand the rules)

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

Information in DNA

Milk->Yogurt, Beef->Burger, Egg->Omelet, Fish->Sushi, Flour->Cake

Coding region 2%: what is to be made

Information in DNA

Non-coding region 98%: regulation (when, where, amount, other conditions, etc.)

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATTTACCACATCGCATCACTACGACGGACTAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Milk->Yogurt, Beef->Burger, Egg->Omelet, Fish->Sushi, Flour->Cake

Regulation labels: Morning, Morning, Japanese Restaurant, 5 Oz, 9 Oz, Butter, Butter

Coding region 2%

Measure Gene Expression

• Microarray or SAGE detects the expression of every gene at a certain cell state

• Clustering finds genes that are co-expressed (and potentially share regulation)

STAT115, 04/01/2008


Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAGTTCAGACACGGACGGC GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg, Bacon, Cereal, Hash Brown, Orange Juice

Look at genes always expressed together: upstream regions of co-expressed genes

Morning

Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...

...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...

...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...

[Figure: upstream regions of the Hemoglobin Beta, Hemoglobin Zeta, Hemoglobin Alpha, and Hemoglobin Gamma genes, annotated with short sequence patterns (atttgctt, ttcact, aactccagt, actca, ccagcgccg). A Transcription Factor (TF) binds the recurring TF Binding Motif gcaacct.]

Motifs can only be computationally discovered when there are enough cases for machine learning

Computational Motif Finding

• Input data:
– Upstream sequences of genes in an expression profile cluster
– 20-800 sequences, each 300-5000 bp long

• Output: enriched sequence patterns (motifs)

• Ultimate goals:
– Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?
– Which genes are regulated by this TF, and why is there disease when a TF goes wrong?
– Are there binding partners / competitors for a TF?

Challenges: Where/What Is the Signal

The motif should be abundant

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAAATAAGACACGGACGGCGCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAATCTCGTATTTACCATATTAAATACCCACATCGAGAGCGCGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Water, Water, Water, Water, Water

The motif should be abundant, and abundant with significance

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAAATAAGACACGGACGGCGCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAATCTCGTATTTACCATATTAAATACCCACATCGAGAGCGCGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Coconut, Coconut, Coconut, Coconut, Coconut

Challenges: Double-Stranded DNA

Motif appears in both strands

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
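A minimal sketch of how a scanner might handle both strands: search each window for the motif and for its reverse complement. The function names are hypothetical, and exact string matching is assumed here only to keep the example short.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(seq, motif):
    """Return (start, strand) for exact motif hits on either strand.

    start is 0-based on the given (forward) strand; strand is "+" when the
    motif itself matches and "-" when its reverse complement matches.
    """
    rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc:
            hits.append((start, "-"))
    return hits


hits = find_both_strands("CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC", "ATTTACC")
```

Here the motif ATTTACC is found on the opposite strand, because the forward strand contains GGTAAAT.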

Challenges: Base substitutions

Sequences do not have to match the motif perfectly; base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATGTACCACCAGTTCAGACACGGACGGC GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGCCACATCGCAATGCAGCAATGCGTTCAGACACGGACGGCTCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCGGCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAACGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT


Sushi, Hand Roll, Sashimi, Tempura, Sake

Fish, Fish, Fish, Fish, Fish, Fish, Fish

Challenges: Two-Block Motifs

Some motifs have two parts

GACACATTTACCTATGC TGGCCCTACGACCTCTCGC CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAATCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT

or palindromic patterns, e.g. AATGCGGCGTAA

Coconut Milk

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision):
  Consensus: CACAAAA
  Degenerate: CRCAAAW (IUPAC: R = A/G, W = A/T)
– Position weight matrix (PWM): needs a score cutoff

Sites
Pos 12345678
    ATGGCATG
    AGGGTGCG
    ATCGCATG
    TTGCCACG
    ATGGTATT
    ATTGCACG
    AGGGCGTT
    ATGACATG
    ATGGCATG
    ACTGGATG

Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

$$\text{score}(\text{ATGCAGCT}) = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$

where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$
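The PWM score can be sketched as follows, assuming the ten aligned sites listed on this slide. The function names and the uniform background used in the example are assumptions for illustration, not values from the lecture.

```python
BASES = "ACGT"

# The ten aligned sites shown on the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def build_pwm(sites, pseudocount=0.0):
    """Column-wise base frequencies of aligned sites (one dict per position)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return pwm


def likelihood_ratio(segment, pwm, background):
    """p(segment | motif matrix) / p(segment | background), positions independent."""
    ratio = 1.0
    for pos, base in enumerate(segment):
        ratio *= pwm[pos][base] / background[base]
    return ratio


pwm = build_pwm(SITES)
uniform = {b: 0.25 for b in BASES}  # assumed background, for illustration only
consensus_score = likelihood_ratio("ATGGCATG", pwm, uniform)
```

Without pseudocounts, any segment that uses a base never observed at some position scores exactly zero, which is one reason a nonzero `pseudocount` is commonly used in practice.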

IUPAC for DNA

Code  Meaning
A     adenosine
C     cytidine
G     guanine
T     thymidine
U     uridine
R     G A (purine)
Y     T C (pyrimidine)
K     G T (keto)
M     A C (amino)
S     G C (strong)
W     A T (weak)
B     C G T (not A)
D     A G T (not C)
H     A C T (not G)
V     A C G (not T)
N     A C G T (any)
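One way to scan with a degenerate consensus is to translate the IUPAC codes into a regular expression. The sketch below uses Python's standard `re` module; the helper names and the example sequence are made up for illustration.

```python
import re

# IUPAC DNA codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC consensus such as CRCAAAW into a regex string."""
    return "".join(IUPAC[code] for code in pattern)


def scan(seq, pattern):
    """Return (0-based start, matched site) for every hit, overlaps included."""
    # A lookahead lets overlapping sites be reported as separate hits.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]


hits = scan("TTCACAAAAGGCGCAAATCC", "CRCAAAW")
```

This is the binary decision mentioned earlier: a window either matches the pattern or it does not, with no score.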

Protein Binding Microarrays

• In vitro protein-DNA interactions

• Better capture motifs

25

JASPAR

• User defined cutoff to scan for a particular motif

26

A Word on Sequence Logo

• SeqLogo consists of stacks of symbols, one stack for each position in the sequence

• The overall height of the stack indicates the sequence conservation at that position

• The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position

ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
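The stack heights can be sketched with the usual information-content measure for DNA logos: 2 bits minus the Shannon entropy of the column. This sketch assumes that measure and omits the small-sample correction that logo tools typically apply; the function name is hypothetical.

```python
import math
from collections import Counter

SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def logo_heights(sites, pos):
    """Return (stack height in bits, per-letter heights) for one column."""
    counts = Counter(site[pos] for site in sites)
    n = len(sites)
    freqs = {base: count / n for base, count in counts.items()}
    # Shannon entropy of the observed column, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    stack = 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA
    return stack, {base: f * stack for base, f in freqs.items()}


stack, letters = logo_heights(SITES, 0)
```

For the first column (nine A, one T) the stack is about 1.53 bits tall, and A takes 90% of that height.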

27

Scan Known TF Motifs

• Drawbacks:
– Limited number of motifs
– Limited number of sites to represent each motif
  • Low sensitivity and specificity
– Poor description of motif
  • Binding site borders are not clear
  • Binding sites have many mismatches
– Many motifs look very similar
  • E.g. GC-rich motif, E-box (CACGTG)

De novo Sequence Motif Finding

• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

• Regular expression enumeration
– Pattern-driven approach
– Enumerate patterns, check significance in dataset
– Oligonucleotide analysis, MobyDick

• Position weight matrix update
– Data-driven approach, use data to refine motifs
– Consensus, EM & Gibbs sampling
– Motif score and Markov background

Regular Expression Enumeration

• Oligonucleotide analysis: check over-representation for every w-mer:
– Expected occurrence of w in data (consider genome sequence + current data size)
– Observed occurrence of w in data
– An over-represented w is a potential TF binding motif

Expected occurrence of w in the data = pw (from genome background) x size of sequence data

Compare with: observed occurrence of w in the data
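A minimal sketch of the observed-versus-expected comparison, assuming the background probability of each word is estimated from a separate set of background sequences. The function names, the pseudocount, and the toy sequences are illustrative assumptions; a real analysis would also attach a significance test.

```python
from collections import Counter


def count_wmers(seqs, w):
    """Count every overlapping w-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for start in range(len(seq) - w + 1):
            counts[seq[start:start + w]] += 1
    return counts


def enrichment(data_seqs, background_seqs, w, pseudocount=1.0):
    """Map each observed w-mer to (observed, expected, observed / expected)."""
    observed = count_wmers(data_seqs, w)
    background = count_wmers(background_seqs, w)
    bg_total = sum(background.values())
    n_windows = sum(observed.values())  # size of the sequence data, in windows
    result = {}
    for word, obs in observed.items():
        # p_w from the background, smoothed so unseen words are not zero.
        p_w = (background[word] + pseudocount) / (bg_total + pseudocount * 4 ** w)
        expected = p_w * n_windows
        result[word] = (obs, expected, obs / expected)
    return result


table = enrichment(["ATTTACCATTTACC"], ["ACGT" * 50], 4)
```

Words with a large observed/expected ratio are the candidates for TF binding motifs.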

MobyDick

• Sequence data and a dictionary of motif words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

D = {A, C, G, T}
Pw = {0.22, 0.28, 0.28, 0.22}

MobyDick

• Check over-representation of every word-pair

D = {A, C, G, T}
Pw = {0.28, 0.22, 0.22, 0.28}

    A   C   G   T
A   AA  AC  AG  AT
C   CA  CC  CG  CT
G   GA  GC  GG  GT
T   TA  TC  TG  TT

MobyDick

D = {A, C, G, T}
Pw = {0.28, 0.28, 0.22, 0.22}

D = {A, C, G, T, AA, GA, TA, GG}
Pw = {?}

MobyDick

• D = {A, C, G, T, AA, GA, TA, GG}
• Seq: AAGATAA
• Possible partitions:

Partition        Probability
A A G A T A A    pA pA pG pA pT pA pA
AA G A T A A     pAA pG pA pT pA pA
AA GA T A A      pAA pGA pT pA pA
AA GA TA A       pAA pGA pTA pA
A A GA T AA      pA pA pGA pT pAA
…

• Assign probabilities so as to maximize the total probability of generating the sequence
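The partitions can be enumerated with a short recursion. In this sketch the total probability of the sequence is taken as the sum, over all partitions, of the product of word probabilities; the function names are hypothetical, and the maximization step that fits the word probabilities is not shown.

```python
def partitions(seq, dictionary):
    """List every way to split seq into consecutive dictionary words."""
    if not seq:
        return [[]]
    result = []
    for word in dictionary:
        if seq.startswith(word):
            for rest in partitions(seq[len(word):], dictionary):
                result.append([word] + rest)
    return result


def sequence_probability(seq, word_probs):
    """Sum over partitions of the product of word probabilities."""
    total = 0.0
    for partition in partitions(seq, word_probs):
        prob = 1.0
        for word in partition:
            prob *= word_probs[word]
        total += prob
    return total


DICTIONARY = ["A", "C", "G", "T", "AA", "GA", "TA", "GG"]
all_partitions = partitions("AAGATAA", DICTIONARY)
```

For long sequences the number of partitions grows quickly, so a practical implementation would use dynamic programming over sequence positions rather than explicit enumeration.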

MobyDick

• Reassign word probabilities and consider every new word-pair to build even longer words

D = {A, C, G, T, AA, GA, TA, GG}
Pw = {?}

Regular Expression Enumeration

• RE enumeration derivatives:
– Oligo-analysis, spaced dyads w1.ns.w2
– IUPAC alphabet
– Markov background (later)
– 2-bit encoding, fast index access
– Enumerate limited RE patterns known for a TF protein structure or interaction theme

• Exhaustive, guaranteed to find the global optimum, and can find multiple motifs

• Not as flexible with base substitutions, long list of similar good motifs, and limited in motif width
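The 2-bit encoding idea can be sketched as follows: each base takes two bits, so a w-mer becomes an integer that indexes directly into a count array of size 4^w. The code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration, and the sketch assumes sequences contain only A, C, G, T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per base


def encode(word):
    """Pack a w-mer into an integer, two bits per base."""
    index = 0
    for base in word:
        index = (index << 2) | CODE[base]
    return index


def count_wmers_indexed(seq, w):
    """Count all w-mers in one pass using a rolling 2-bit index."""
    counts = [0] * (4 ** w)
    mask = (1 << (2 * w)) - 1  # keeps only the last w bases
    index = 0
    for i, base in enumerate(seq):
        index = ((index << 2) | CODE[base]) & mask
        if i >= w - 1:
            counts[index] += 1
    return counts


counts = count_wmers_indexed("ACGACG", 3)
```

Each step shifts in one new base and masks off the oldest one, so the index of the next window is obtained in constant time.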

Consensus

• Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence

Seq1, Seq2, ……

Good Motifs
CACGTGC   GTCAGTC
CACGTTC   GTCAGTC

Bad Motifs
CACGTGC   GTCAGTC
GTGACAT   TGGAAAT



Seq3, …

Good Motifs
CACGTGC   GTCAGTC
CACGTTC   GTCAGTC
CTCGTGC   GACAGTC

Bad Motifs
CACGTGC   GTCAGTC
CACGTTC   GTCAGTC
TTCAAAG   AGACTCA

Remaining good motifs



• G Stormo, algorithm runs very fast

• Sequence order plays a big role in performance
– First two sequences better contain the motif
– Sites stop accumulating at the first bad sequence
– Newer version allowing [0-n] is much slower

Expectation Maximization and Gibbs Sampling Model

• Objects:
– Seq: sequence data to search for motif
– $\theta_0$: non-motif (genome background) probability
– $\theta$: motif probability matrix parameter
– $\pi$: motif site locations

• Problem: $P(\theta, \pi \mid \text{seq}, \theta_0)$

• Approach: alternately estimate
– $\pi$ by $P(\pi \mid \theta, \text{seq}, \theta_0)$
– $\theta$ by $P(\theta \mid \pi, \text{seq}, \theta_0)$
– EM and Gibbs differ in the estimation methods

Expectation Maximization

• E step: $\pi \mid \theta, \text{seq}, \theta_0$

Sequence: TTGACGACTGCACGTTTGAC

Segment  Weight
TTGAC    p1
TGACG    p2
GACGA    p3
ACGAC    p4
CGACT    p5
GACTG    p6
ACTGC    p7
CTGCA    p8
...

$$p_1 = \text{likelihood ratio} = \frac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}$$

$$P(\text{TTGAC} \mid \theta_0) = p_{0T}\, p_{0T}\, p_{0G}\, p_{0A}\, p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$$

• M step: $\theta \mid \pi, \text{seq}, \theta_0$

p1 x TTGAC
p2 x TGACG
p3 x GACGA
p4 x ACGAC
...

• Scale ACGT at each position; $\theta$ reflects the weighted average of $\pi$
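One EM iteration on a single sequence can be sketched as below. The E step weights every segment by its likelihood ratio, and the M step rebuilds the motif matrix as the weighted average of the segments. Normalizing the weights to sum to one assumes exactly one site per sequence; that assumption, the pseudocount, and the function name are choices made for this illustration.

```python
BASES = "ACGT"


def em_iteration(seq, motif_matrix, background, pseudocount=0.01):
    """Run one E step and one M step; return (new matrix, segment weights)."""
    width = len(motif_matrix)
    segments = [seq[i:i + width] for i in range(len(seq) - width + 1)]

    # E step: likelihood ratio of each segment, motif vs. background.
    ratios = []
    for segment in segments:
        ratio = 1.0
        for pos, base in enumerate(segment):
            ratio *= motif_matrix[pos][base] / background[base]
        ratios.append(ratio)
    total = sum(ratios)
    weights = [r / total for r in ratios]  # assumes one site per sequence

    # M step: weighted base counts at each motif position, then rescale.
    new_matrix = []
    for pos in range(width):
        column = {b: pseudocount for b in BASES}
        for segment, weight in zip(segments, weights):
            column[segment[pos]] += weight
        norm = sum(column.values())
        new_matrix.append({b: column[b] / norm for b in BASES})
    return new_matrix, weights


background = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
start = [{b: (0.85 if b == m else 0.05) for b in BASES} for m in "TTGAC"]
new_matrix, weights = em_iteration("TTGACGACTGCACGTTTGAC", start, background)
```

With several sequences, the same E step is run per sequence and the M step pools the weighted segments from all of them.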

EM Derivatives

• First EM motif finder (C Lawrence)
– Deterministic algorithm, guaranteed local optimum

• MEME (TL Bailey)
– Prior probability allows 0-n sites / sequence
– Runs multiple EMs in parallel with different seeds
– User-friendly results

Gibbs Sampling

• Stochastic process, although it may still need multiple initializations
– Sample $\theta$ from $P(\theta \mid \pi, \text{seq}, \theta_0)$
– Sample $\pi$ from $P(\pi \mid \theta, \text{seq}, \theta_0)$

• Collapsed form: $\theta$ estimated with counts, not sampled from a Dirichlet
– Sample site from one seq based on sites from other seqs

• Converged motif matrix $\theta$ and converged motif sites $\pi$ represent the stationary distribution of a Markov chain

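Gibbs Sampler

• Randomly initialize a probability matrix

[Figure: sequences 1-5, each with a randomly chosen initial site, giving the initial motif matrix]

$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T} \quad \text{(estimated with counts)}$$

where $n_{A1}$ is the count of A at motif position 1 among the current sites and $s_A$ is the pseudocount for A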

Gibbs Sampler

• Take out one sequence with its sites from the current motif

[Figure: motif matrix re-estimated without the segment from sequence 1]

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: "Segment Scores of Sequence 1"; x-axis: starting position of segment (1-20); y-axis: segment score (0-30); segment (1-8) of sequence 1 is being scored against the motif matrix built without sequence 1]

[Figure: the same plot, with segment (2-9) of sequence 1 being scored]

Segment Score

• Use current motif matrix to score a segment

Sites
Pos 12345678
    ATGGCATG
    AGGGTGCG
    ATCGCATG
    TTGCCACG
    ATGGTATT
    ATTGCACG
    AGGGCGTT
    ATGACATG
    ATGGCATG
    ACTGGATG

Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Scoring Segments

Motif  1    2    3    4    5    bg
A      0.4  0.1  0.3  0.4  0.2  0.3
T      0.2  0.5  0.1  0.2  0.2  0.3
G      0.2  0.2  0.2  0.3  0.4  0.2
C      0.2  0.2  0.4  0.1  0.2  0.2

Ignore pseudocounts for now…

Sequence: TTCCATATTAATCAGATTCCG…

Segment  Score
TAATC    …
AATCA    0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
ATCAG    0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
TCAGA    0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
CAGAT    …
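The segment scores can be recomputed with a few lines of code, using the motif matrix and background column from the table on this slide. Each base's motif probability is divided by that same base's background probability (0.3 for A and T, 0.2 for G and C). The function names are hypothetical.

```python
# Motif probabilities per base at positions 1-5, copied from the slide table.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
# Background (bg) column from the same table.
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def segment_score(segment):
    """Product over positions of motif probability / background probability."""
    score = 1.0
    for pos, base in enumerate(segment):
        score *= MOTIF[base][pos] / BACKGROUND[base]
    return score


def score_all_segments(sequence, width=5):
    """Return (1-based start, segment, score) for every window."""
    rows = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        rows.append((start + 1, segment, segment_score(segment)))
    return rows


rows = score_all_segments("TTCCATATTAATCAGATTCCG")
```

Pseudocounts are ignored here, as on the slide.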


Gibbs Sampler

• Sample site from one seq based on sites from other seqs

[Figure: segment scores of sequence 1; x-axis: starting position of segment (1-20); y-axis: segment score (0-30). A new site is sampled for sequence 1, and the modified motif matrix is estimated with counts.]

How to Sample?

Pos    1  2  3   4   5   6   7   8   9
Score  3  1  12  5   8   9   1   2   6
SubT   3  4  16  21  29  38  39  41  47

• Rand(subtotal) = X
• Find the first position with subtotal larger than X

Pos    1  2  3   4   5   6   7    8    9
Score  3  1  12  5   8   9   500  2    6
SubT   3  4  16  21  29  38  538  540  546
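The subtotal trick can be sketched with the standard library: build running subtotals, draw X uniformly between 0 and the final subtotal, and return the first position whose subtotal exceeds X. The function names are hypothetical, and positions are 1-based to match the table.

```python
import bisect
import itertools
import random


def position_for(x, scores):
    """Return the first 1-based position whose running subtotal exceeds x."""
    subtotals = list(itertools.accumulate(scores))
    index = bisect.bisect_right(subtotals, x)
    # Guard against x landing on the final subtotal through rounding.
    return min(index, len(scores) - 1) + 1


def sample_position(scores, rng):
    """Sample a position with probability proportional to its score."""
    x = rng.random() * sum(scores)
    return position_for(x, scores)


scores = [3, 1, 12, 5, 8, 9, 1, 2, 6]
rng = random.Random(0)  # fixed seed only to make the example repeatable
draw = sample_position(scores, rng)
```

With the second table, position 7 covers 500 of the 546 units, so it is chosen most of the time but not always, which is what keeps the sampler from being purely greedy.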

Gibbs Sampler

• Repeat the process until motif converges

[Figure: with the new site for sequence 1 in place, the motif matrix is re-estimated without the segment from sequence 2]
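Putting the steps together, a minimal site sampler might look like the sketch below. It assumes exactly one site per sequence, a fixed motif width, a fixed number of sweeps instead of a convergence test, and a position-independent background; all names and default values are illustrative rather than taken from any published implementation.

```python
import random

BASES = "ACGT"


def build_matrix(sites, pseudocount):
    """Motif matrix estimated from site counts plus pseudocounts."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


def gibbs_sampler(seqs, width, background, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled 0-based site start per sequence."""
    rng = random.Random(seed)
    # Random initial site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Take out sequence i and build the matrix from the other sites.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            matrix = build_matrix(others, pseudocount)
            # Score every segment of sequence i against that matrix.
            scores = []
            for start in range(len(seq) - width + 1):
                score = 1.0
                for pos, base in enumerate(seq[start:start + width]):
                    score *= matrix[pos][base] / background[base]
                scores.append(score)
            # Sample the new site in proportion to the segment scores.
            starts[i] = rng.choices(range(len(scores)), weights=scores)[0]
    return starts


seqs = ["ACGTTTGACGCA", "GGTTGACATCGA", "CATTGACGGTAC", "TTGACCCGATAG"]
uniform = {b: 0.25 for b in BASES}  # assumed background for the example
starts = gibbs_sampler(seqs, 5, uniform, iterations=20)
```

Because the process is stochastic, different seeds can end at different alignments, which is why multiple initializations are often run.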

Gibbs Sampler Intuition

• Beginning:
– Randomly initialized motif
– No preference towards any segment

[Figure: "Beginning Iterations"; x-axis: starting position of segments (1-20); y-axis: segment score (0-40)]

Gibbs Sampler Intuition

• Motif appears:
– Motif should have enriched signal (more sites)
– By chance some correct sites come to alignment
– Sites bias motif to attract other similar sites

[Figure: "Some good aligned segments come"; x-axis: starting position of segments (1-20); y-axis: segment score (0-40)]


Gibbs Sampler Intuition

• Motif converges:
– All sites come to alignment
– Motif totally biased to sample sites every time

[Figure: "Motif converges towards the end"; x-axis: starting position of segments (1-20); y-axis: segment score (0-40)]


Gibbs Sampler

• Column shift

[Figure: sequences 1-5 with their current sites at iteration i]

• Metropolis algorithm:
– Propose $\pi^*$ as $\pi$ shifted 1 column to left or right
– Calculate motif score $u(\pi)$ and $u(\pi^*)$
– Accept $\pi^*$ with prob = $\min(1, u(\pi^*) / u(\pi))$
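The column-shift move can be sketched as a single Metropolis step. The scoring function is passed in as a parameter because the motif score itself is defined separately; `score_fn`, `metropolis_shift`, and the toy score used in the example are hypothetical, and the score is assumed to be strictly positive.

```python
import random


def metropolis_shift(starts, score_fn, seq_lengths, width, rng):
    """Propose shifting every site one column left or right; accept or keep."""
    shift = rng.choice([-1, 1])
    proposal = [s + shift for s in starts]
    # Reject proposals that would push any site off its sequence.
    for start, length in zip(proposal, seq_lengths):
        if start < 0 or start + width > length:
            return starts
    current_score = score_fn(starts)    # assumed > 0
    proposed_score = score_fn(proposal)
    accept_prob = min(1.0, proposed_score / current_score)
    return proposal if rng.random() < accept_prob else starts


rng = random.Random(1)


def toy_score(sites):
    # Toy score for illustration: any shifted alignment scores twice as high.
    return 1.0 if sites == [5, 5] else 2.0


shifted = metropolis_shift([5, 5], toy_score, [20, 20], 5, rng)
```

A proposal with a higher score is always accepted, and one with a lower score is accepted with probability equal to the score ratio.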

Gibbs Sampling Derivatives

• Gibbs Motif Sampler (JS Liu)
– Add prior probability to allow 0-n sites / seq
– Sample motif positions to consider

• AlignACE (F Roth)
– Look for motifs from both strands
– Mask out one motif to find more different motifs

• BioProspector (XS Liu)
– Use background model with Markov dependencies
– Sampling with threshold (0-n sites / seq), new scoring function
– Can find two-block motifs with variable gap

Scoring Motifs

• Information Content (also known as relative entropy)
– Suppose you have x aligned segments for the motif
– pb(s1 from mtf) / pb(s1 from bg) x pb(s2 from mtf) / pb(s2 from bg) x … x pb(sx from mtf) / pb(sx from bg)

Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G





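Scoring Motifs

$$\frac{pb(s_1 \text{ from mtf})}{pb(s_1 \text{ from bg})} \times \frac{pb(s_2 \text{ from mtf})}{pb(s_2 \text{ from bg})} \times \dots \times \frac{pb(s_x \text{ from mtf})}{pb(s_x \text{ from bg})}$$

$$= \left(\frac{p_{A1}}{p_{A0}}\right)^{A_1} \left(\frac{p_{T1}}{p_{T0}}\right)^{T_1} \left(\frac{p_{T2}}{p_{T0}}\right)^{T_2} \left(\frac{p_{G2}}{p_{G0}}\right)^{G_2} \left(\frac{p_{C2}}{p_{C0}}\right)^{C_2} \cdots$$

where $A_1$ is the number of segments with A at position 1, and so on.

Take the log of this:

$$= A_1 \log\frac{p_{A1}}{p_{A0}} + T_1 \log\frac{p_{T1}}{p_{T0}} + T_2 \log\frac{p_{T2}}{p_{T0}} + G_2 \log\frac{p_{G2}}{p_{G0}} + \dots$$

Divide by the number of segments (if all the motifs have the same number of segments):

$$= p_{A1} \log\frac{p_{A1}}{p_{A0}} + p_{T1} \log\frac{p_{T1}}{p_{T0}} + p_{T2} \log\frac{p_{T2}}{p_{T0}} + \dots$$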
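The derivation can be checked numerically: the average log likelihood ratio of the aligned segments equals the relative entropy of the motif matrix when the matrix is built from those same segments without pseudocounts. The sketch below assumes log base 2 and a uniform background for the example; the slides leave both unspecified, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def frequency_matrix(sites):
    """Base frequencies per position, no pseudocounts."""
    n = len(sites)
    return [{b: sum(s[pos] == b for s in sites) / n for b in BASES}
            for pos in range(len(sites[0]))]


def relative_entropy(matrix, background):
    """Sum over positions and bases of p * log2(p / background)."""
    total = 0.0
    for column in matrix:
        for base, p in column.items():
            if p > 0:  # terms with p = 0 contribute nothing
                total += p * math.log2(p / background[base])
    return total


def mean_log_ratio(sites, matrix, background):
    """Average over segments of log2(p(segment | motif) / p(segment | bg))."""
    total = 0.0
    for site in sites:
        for pos, base in enumerate(site):
            total += math.log2(matrix[pos][base] / background[base])
    return total / len(sites)


uniform = {b: 0.25 for b in BASES}  # assumed background for the example
matrix = frequency_matrix(SITES)
info = relative_entropy(matrix, uniform)
```

The two functions return the same value on these sites, which is the "divide by the number of segments" step of the derivation in code form.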

Scoring Motifs

• Original function: Information Content

$$= \sum_{i} \sum_{b \in \{A, C, G, T\}} p_{bi} \log\frac{p_{bi}}{p_{b0}}$$

Motif Conservedness: How likely to see the current aligned segments from this motif model

Good: ATGCA, ATGCC, ATGCA, ATGCA, TTGCA, ATGGA, ATGCA

Bad: AGGCA, ATCCC, GCGCA, CGGTA, TGCCA, ATGGT, TTGAA

Scoring Motifs

Motif Specificity: How likely to see the current aligned segments from background

Good: AGTCC, AGTCC, AGTCC, AGTCC, AGTCC, AGTCC, AGTCC

Bad: ATAAA, ATAAA, ATAAA, ATAAA, ATAAA, ATAAA, ATAAA

Scoring Motifs

Which is better? (data = 8 seqs)

Motif 1: AGGCTAAC, AGGCTAAC

Motif 2: AGGCTAAC, AGGCTACC, AGGCTAAC, AGCCTAAC, AGGCCAAC, AGGCTAAC, TGGCTAAC, AGGCTTAC, AGGCTAAC, AGGGTAAC

Scoring Motifs

• Motif scoring function:

• Prefer: conserved motifs with many sites that are not often seen in the genome background

Annotated terms of the scoring function: motif signal abundant; positions conserved; specific (unlikely in genome background)

Markov Background Increases Motif Specificity

Prefers motif segments enriched only in data, but not so likely to occur in the background

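$$\text{score}(\text{ATGTA}) = \frac{p(\text{generate ATGTA from } \theta)}{p(\text{generate ATGTA from } \theta_0)}$$

3rd order Markov dependency: each background base is conditioned on the previous three bases

Segment  Independent bases             3rd order Markov
TCAGC    .25 x .25 x .25 x .25 x .25   .3 x .18 x .16 x .22 x .24
ATATA    .25 x .25 x .25 x .25 x .25   .3 x .41 x .38 x .42 x .30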
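A Markov background can be sketched by counting which base follows each context of up to three preceding bases, then multiplying conditional probabilities along a segment. The training sequence, the pseudocount, and the function names below are assumptions made for illustration.

```python
def train_markov(seqs, order):
    """Count the base following each context of length 0..order."""
    counts = {}
    for seq in seqs:
        for i, base in enumerate(seq):
            for k in range(min(order, i) + 1):
                context = seq[i - k:i]
                row = counts.setdefault(context, {})
                row[base] = row.get(base, 0) + 1
    return counts


def markov_probability(segment, counts, order, pseudocount=1.0):
    """Probability of a segment under the background Markov model."""
    prob = 1.0
    for i, base in enumerate(segment):
        # Use up to `order` preceding bases; fewer at the segment start.
        context = segment[max(0, i - order):i]
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        prob *= (row.get(base, 0) + pseudocount) / total
    return prob


# Toy AT-repeat background, chosen so ATATA is common in the background.
counts = train_markov(["ATATATATATATATATATAT"], 3)
p_atata = markov_probability("ATATA", counts, 3)
p_tcagc = markov_probability("TCAGC", counts, 3)
```

Dividing a segment's motif probability by this background probability penalizes segments like ATATA that are already common in the background, which is how the Markov background increases specificity.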

Position Weight Matrix Update

• Advantage:
– Can look for motifs of any width
– Flexible with base substitutions

• Disadvantage:
– EM and Gibbs sampling: no guaranteed convergence time
– No guaranteed global optimum

Motif Finding in Bacteria

• Promoter sequences are short (200-300 bp)

• Motifs are usually long (10-20 bases)
– Some have two blocks with a gap, some are palindromes
– Long motifs are usually very degenerate

• A single microarray experiment sometimes already provides enough information to search for TF motifs

Motif Finding in Lower Eukaryotes

• Upstream sequences are longer (500-1000 bp), with some simple repeats

• Motif width varies (5-17 bases)

• Expression clusters provide input sequences of decent quality for TF motif finding

• Motif combination and redundancy appear, although single motifs are usually significant enough for identification

Yeast Promoter Architecture

• Co-occurring regulators suggest physical interaction between the regulators

Motif Finding in Higher Eukaryotes

• Upstream sequences are very long (3 kb-20 kb) with repeats; TF motifs could also appear downstream

• Motifs can be short or long (6-20 bases), and appear in combination and clusters

• Gene expression clusters alone are not good enough input

• Need:
– Comparative genomics: phastCons score
– Motif modules: motif clusters
– ChIP-chip/seq

Yeast Regulatory Sequence Conservation

73

UCSC PhastCons Conservation

• Functional regulatory sequences are under stronger evolutionary constraint

• Align orthologous sequences together

• PhastCons conservation score (0-1) for each nucleotide in the genome can be downloaded from UCSC

Conserved Motif Clusters

• First find conserved regions in the genome

• Then look for repeated transcription factor (TF) binding sites

• They form transcription factor modules

Summary

• Biology and challenges of transcription regulation
• Scan for known TF motif sites: TRANSFAC & JASPAR
• De novo method
– Regular expression enumeration
  • Oligonucleotide analysis
  • MobyDick: build long motifs from short ones
– Position weight matrix update
  • CONSENSUS (sequence order)
  • EM (iterate $\theta$, $\pi$; $\theta$ ~ weighted average)
  • Gibbs Sampler (sample $\theta$, $\pi$; Markov chain convergence)
  • Motif score and Markov background
• Motif cluster and motif conservation
