Transcription Regulation
Transcription Factor Motif Finding
Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520
Outline
• Biology of transcription regulation and challenges of computational motif finding
• Scan for known TF motif sites
– TRANSFAC and JASPAR, Sequence Logo
• De novo method
– Regular expression enumeration: w-mer enumeration
– Position weight matrix update: EM and Gibbs
• Motif finding in different organisms
– Motif clusters and conservation
Imagine a Chef
Restaurant Dinner Home Lunch
Certain recipes are used to make certain dishes
Each Cell Is Like a Chef
Infant skin: glucose, oxygen, amino acids -> healthy skin cell state
Adult liver: fat, alcohol, nicotine -> diseased liver cell state
Certain genes are expressed to make certain proteins
Understanding a Genome
• Get the complete sequence (the encoded cookbook)
• Observe gene expression at different cell states (meals prepared in different situations)
• Decode gene regulation (decode the book, understand the rules)
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
Information in DNA
Coding region (2%): what is to be made
Milk->Yogurt, Beef->Burger, Egg->Omelet, Fish->Sushi, Flour->Cake
Non-coding region (98%): regulation (when, where, amount, other conditions, etc.)
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATTTACCACATCGCATCACTACGACGGACTAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT
Regulatory annotations in the figure: Morning, Japanese Restaurant, 5 Oz, 9 Oz, Butter
Measure Gene Expression
• Microarray or SAGE detects the expression of every gene at a certain cell state
• Clustering finds genes that are co-expressed (and potentially share regulation)
STAT115, 04/01/2008
Decode Gene Regulation
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAGTTCAGACACGGACGGCGCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTACCACCCACATCGAGAGCGCGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Co-expressed genes: Scrambled Egg, Bacon, Cereal, Hash Brown, Orange Juice
Look at the upstream regions of genes that are always expressed together
Shared condition of the co-expressed genes: Morning
Biology of Transcription Regulation
...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...
...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
Upstream regions shown, top to bottom: Hemoglobin Beta, Hemoglobin Zeta, Hemoglobin Alpha, Hemoglobin Gamma
Transcription Factor (TF) and TF Binding Motif, e.g. gcaacct
A motif can only be discovered computationally when there are enough cases for machine learning
Computational Motif Finding
• Input data:
– Upstream sequences of a gene expression profile cluster
– 20-800 sequences, each 300-5000 bp long
• Output: enriched sequence patterns (motifs)
• Ultimate goals:
– Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?
– Which genes are regulated by this TF, and why is there disease when a TF goes wrong?
– Are there binding partners / competitors for a TF?
Challenges: Where/what the signal
The motif should be abundant
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGCCACATCGCATATTTACCACCAAATAAGACACGGACGGCGCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAATCTCGTATTTACCATATTAAATACCCACATCGAGAGCGCGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT
Analogy: Water
The motif should be abundant, and abundant with significance
Analogy: Coconut
Challenges: Double stranded DNA
Motif appears in both strands
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG
TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
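A minimal sketch of what scanning both strands can look like. The motif ATTTACC is taken from the earlier example slides; the function names are hypothetical, and the code assumes sequences contain only A, C, G, T. A site on the reverse strand shows up in the forward sequence as the reverse complement of the motif.

```python
# Complement table for DNA bases (assumes only A, C, G, T appear).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_on_both_strands(seq, motif):
    """Return (position, strand) hits; positions are 0-based on the forward strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))   # motif read on the given strand
        elif window == rc:
            hits.append((i, "-"))   # motif read on the opposite strand
    return hits

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC",
    "TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG",
]
for s in seqs:
    print(find_on_both_strands(s, "ATTTACC"))
```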
Challenges: Base substitutions
Sequences do not have to match the motif perfectly; base substitutions are allowed
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATGTACCACCAGTTCAGACACGGACGGC GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAATCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
Challenges: Variable motif copies
Some sequences do not have the motif
Some have multiple copies of the motif
GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGCCACATCGCAATGCAGCAATGCGTTCAGACACGGACGGCTCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCGGCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAACGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT
Analogy: among Sushi, Hand Roll, Sashimi, Tempura, and Sake, Fish appears a variable number of times
Challenges: Two-block motifs
Some motifs have two parts
GACACATTTACCTATGC TGGCCCTACGACCTCTCGC CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAATCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT
or palindromic patterns: AATGCGGCGTAA
Analogy: Coconut Milk
Scan for Known TF Motif Sites
• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision)
  Consensus: CACAAAA
  Degenerate: CRCAAAW (IUPAC: R = A/G, W = A/T)
– Position weight matrix (PWM): needs a score cutoff
Sites (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G
Segment ATGCAGCT score:
$\text{score} = \dfrac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$
where $p(\text{generate ATGCAGCT from background}) = p_{0A}\,p_{0T}\,p_{0G}\,p_{0C}\,p_{0A}\,p_{0G}\,p_{0C}\,p_{0T}$
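As an illustration, the motif matrix can be rebuilt from the ten aligned sites, and a segment scored as the likelihood ratio described on the slide. This is only a sketch: the function names are hypothetical and the uniform background (0.25 per base) is an assumption, since the slide leaves the background probabilities symbolic.

```python
SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def frequency_matrix(sites):
    """One dict of base frequencies per motif position (no pseudocounts)."""
    matrix = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        matrix.append({b: column.count(b) / len(sites) for b in "ACGT"})
    return matrix

def consensus(matrix):
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=lambda b: col[b]) for col in matrix)

def likelihood_ratio(segment, matrix, background):
    """p(segment | motif matrix) / p(segment | background)."""
    ratio = 1.0
    for column, base in zip(matrix, segment):
        ratio *= column[base] / background[base]
    return ratio

matrix = frequency_matrix(SITES)
background = {b: 0.25 for b in "ACGT"}  # assumption: uniform background
print(consensus(matrix))
print(likelihood_ratio("ATGGCATG", matrix, background))
print(likelihood_ratio("ATGCAGCT", matrix, background))
```

Without pseudocounts, any base never observed at a position drives the whole ratio to zero, which is one reason a score cutoff is usually applied to a smoothed matrix.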
IUPAC for DNA
Code  Bases    Meaning
A     A        adenosine
C     C        cytidine
G     G        guanine
T     T        thymidine
U     U        uridine
R     G A      purine
Y     T C      pyrimidine
K     G T      keto
M     A C      amino
S     G C      strong
W     A T      weak
B     C G T    not A
D     A G T    not C
H     A C T    not G
V     A C G    not T
N     A C G T  any
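A degenerate IUPAC pattern such as CRCAAAW can be scanned by translating it into an ordinary regular expression. The sketch below uses the table above (U is left out because it is an RNA base); the helper names and the test sequence are made up for illustration.

```python
import re

# IUPAC code -> regular expression character class (DNA only)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    return "".join(IUPAC[ch] for ch in pattern.upper())

def scan(seq, pattern):
    """Return (0-based start, matched site) for every match, overlaps included."""
    # The lookahead lets overlapping sites be reported as well.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]

print(iupac_to_regex("CRCAAAW"))
print(scan("TTCACAAAAGGCGCAAATCC", "CRCAAAW"))
```

This is the "binary decision" representation: a site either matches the pattern or it does not, with no score attached.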
Protein Binding Microarrays
• In vitro protein-DNA interactions
• Better capture motifs
JASPAR
• User defined cutoff to scan for a particular motif
A Word on Sequence Logo
• SeqLogo consists of stacks of symbols, one stack for each position in the sequence
• The overall height of a stack indicates the sequence conservation at that position
• The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position
Sites: ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
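The stack heights can be sketched numerically. The code below assumes the common convention that the stack height at a position is 2 bits minus the Shannon entropy of the column, and that each symbol takes its frequency's share of that height; the small-sample correction that some logo tools apply is deliberately left out. The function name is hypothetical.

```python
import math

def logo_heights(sites):
    """Per-position dict of symbol heights in bits (no small-sample correction)."""
    heights = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        freqs = {b: column.count(b) / len(sites) for b in "ACGT"}
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        info = 2.0 - entropy                      # overall stack height
        heights.append({b: p * info for b, p in freqs.items()})
    return heights

sites = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
for pos, stack in enumerate(logo_heights(sites), start=1):
    print(pos, round(sum(stack.values()), 3), stack)
```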
Scan Known TF Motifs
• Drawbacks:
– Limited number of motifs
– Limited number of sites to represent each motif
  • Low sensitivity and specificity
– Poor description of motif
  • Binding site borders are not clear
  • Binding sites have many mismatches
– Many motifs look very similar
  • E.g. GC-rich motifs, E-box (CACGTG)
De novo Sequence Motif Finding
• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)
• Regular expression enumeration
– Pattern driven approach
– Enumerate patterns, check significance in dataset
– Oligonucleotide analysis, MobyDick
• Position weight matrix update
– Data driven approach, use data to refine motifs
– Consensus, EM & Gibbs sampling
– Motif score and Markov background
Regular Expression Enumeration
• Oligonucleotide Analysis: check over-representation for every w-mer:
– Expected occurrence of w in the data
  • Consider genome sequence + current data size
– Observed occurrence of w in the data
– An over-represented w is a potential TF binding motif
• Compare the observed occurrence of w in the data with the expected occurrence of w in the data, which is computed from $p_w$ (from the genome background) and the size of the sequence data
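A rough sketch of w-mer enumeration. The expected count is taken as p_w times the number of w-mer positions in the data, and over-representation is judged by a simple observed/expected ratio. Both the add-one smoothing of p_w and the ratio threshold are assumptions made for illustration; the slide's own significance formula is not recoverable from the transcript.

```python
from collections import Counter

def count_wmers(seqs, w):
    """Count every w-mer (overlapping) in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts

def overrepresented(data_seqs, background_seqs, w, min_ratio=2.0):
    """Return (word, observed, expected) sorted by observed/expected."""
    observed = count_wmers(data_seqs, w)
    bg = count_wmers(background_seqs, w)
    bg_total = sum(bg.values())
    n_positions = sum(observed.values())        # size of the sequence data
    results = []
    for word, obs in observed.items():
        # p_w from the background, with add-one smoothing (assumption)
        p_w = (bg[word] + 1) / (bg_total + 4 ** w)
        expected = p_w * n_positions
        if obs / expected >= min_ratio:
            results.append((word, obs, expected))
    return sorted(results, key=lambda r: r[1] / r[2], reverse=True)

result = overrepresented(["GGGGGG"] * 3, ["ACGT" * 10], w=2)
print(result)
```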
MobyDick
• Sequence data and a dictionary of motif words
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTCATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACGGCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTATCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAGCGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
D = {A, C, G, T}
Pw = {0.22, 0.28, 0.28, 0.22}
MobyDick
• Sequence data and a dictionary of motif words
• Check over-representation of every word-pair
D = {A, C, G, T}
Pw = {0.28, 0.22, 0.22, 0.28}
Word-pairs:
   A   C   G   T
A  AA  AC  AG  AT
C  CA  CC  CG  CT
G  GA  GC  GG  GT
T  TA  TC  TG  TT
MobyDick
• Over-represented word-pairs are added to the dictionary as new words
Before: D = {A, C, G, T}, Pw = {0.28, 0.28, 0.22, 0.22}
After: D = {A,C,G,T,AA,GA,TA,GG}, Pw = {?}
MobyDick
• D = {A,C,G,T,AA,GA,TA,GG}
• Seq: AAGATAA
• Possible partitions:
A A G A T A A    pA pA pG pA pT pA pA
AA G A T A A     pAA pG pA pT pA pA
AA GA T A A      pAA pGA pT pA pA
AA GA TA A       pAA pGA pTA pA
AA GA T AA       pAA pGA pT pAA
…
• Assign word probabilities so as to maximize the total probability of generating the sequence
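The partition idea can be sketched as follows: enumerate every way to split AAGATAA into dictionary words, and sum the products of word probabilities. The word probabilities below are invented for illustration (the slide leaves them as "?"), and the function names are hypothetical. The dynamic-programming version gives the same total without listing partitions.

```python
import math

def partitions(seq, dictionary):
    """List every way to split seq into consecutive dictionary words."""
    if not seq:
        return [[]]
    result = []
    for word in dictionary:
        if seq.startswith(word):
            for rest in partitions(seq[len(word):], dictionary):
                result.append([word] + rest)
    return result

def total_probability(seq, word_prob):
    """Sum over all partitions of the product of word probabilities (forward DP)."""
    total = [0.0] * (len(seq) + 1)
    total[0] = 1.0
    for i in range(1, len(seq) + 1):
        for word, p in word_prob.items():
            if i >= len(word) and seq[i - len(word):i] == word:
                total[i] += total[i - len(word)] * p
    return total[-1]

# Hypothetical word probabilities (sum to 1); not from the slides.
word_prob = {"A": 0.2, "C": 0.2, "G": 0.15, "T": 0.15,
             "AA": 0.1, "GA": 0.05, "TA": 0.1, "GG": 0.05}

parts = partitions("AAGATAA", list(word_prob))
brute_force = sum(math.prod(word_prob[w] for w in p) for p in parts)
print(len(parts), brute_force, total_probability("AAGATAA", word_prob))
```

Fitting the word probabilities then amounts to choosing the values that maximize this total over the whole sequence data.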
MobyDick
• Reassign word probabilities and consider every new word-pair to build even longer words
Regular Expression Enumeration
• RE Enumeration Derivatives:
– oligo-analysis, spaced dyads w1.ns.w2
– IUPAC alphabet
– Markov background (later)
– 2-bit encoding, fast index access
– Enumerate limited RE patterns known for a TF protein structure or interaction theme
• Exhaustive, guaranteed to find the global optimum, and can find multiple motifs
• Not as flexible with base substitutions, produces a long list of similar good motifs, and is limited in motif width
Consensus
• Starting from the 1st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence
After Seq1 and Seq2:
Good Motifs
CACGTGC  GTCAGTC
CACGTTC  GTCAGTC
Bad Motifs
CACGTGC  GTCAGTC
GTGACAT  TGGAAAT
After adding Seq3:
Good Motifs (the remaining good motifs)
CACGTGC  GTCAGTC
CACGTTC  GTCAGTC
CTCGTGC  GACAGTC
Bad Motifs
CACGTGC  GTCAGTC
CACGTTC  GTCAGTC
TTCAAAG  AGACTCA
• By G Stormo; the algorithm runs very fast
• Sequence order plays a big role in performance
– The first two sequences had better contain the motif
– Sites stop accumulating at the first bad sequence
– Newer version allowing [0-n] is much slower
Expectation Maximization and Gibbs Sampling Model
• Objects:
– Seq: sequence data to search for motif
– $\theta_0$: non-motif (genome background) probability
– $\theta$: motif probability matrix parameter
– $\pi$: motif site locations
• Problem: $P(\theta, \pi \mid \text{seq}, \theta_0)$
• Approach: alternately estimate
– $\pi$ by $P(\pi \mid \theta, \text{seq}, \theta_0)$
– $\theta$ by $P(\theta \mid \pi, \text{seq}, \theta_0)$
– EM and Gibbs differ in the estimation methods
Expectation Maximization
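• E step: $\pi \mid \theta, \text{seq}, \theta_0$
Sequence: TTGACGACTGCACGTTTGAC
Segment  Weight
TTGAC    p1
TGACG    p2
GACGA    p3
ACGAC    p4
CGACT    p5
GACTG    p6
ACTGC    p7
CTGCA    p8
...
$p_1 = \text{likelihood ratio} = \dfrac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}$
where $P(\text{TTGAC} \mid \theta_0) = p_{0T}\,p_{0T}\,p_{0G}\,p_{0A}\,p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$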
• M step: $\theta \mid \pi, \text{seq}, \theta_0$
Weighted segments:
p1  TTGAC
p2  TGACG
p3  GACGA
p4  ACGAC
...
• Scale ACGT at each position; $\theta$ reflects the weighted average of $\pi$
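One EM iteration can be sketched as below. This is a simplified illustration, not the original EM motif finder: it assumes one motif occurrence per sequence (segment weights are normalized within each sequence), strictly positive starting probabilities, and a small pseudocount, and the names em_iteration, theta and theta0 are chosen here for the example. The background values follow the 0.3 / 0.2 numbers on the E-step slide.

```python
def em_iteration(seqs, theta, theta0, pseudo=0.01):
    """One E step + M step; theta is a list of per-position base-probability dicts."""
    w = len(theta)
    counts = [{b: pseudo for b in "ACGT"} for _ in range(w)]
    for seq in seqs:
        starts = range(len(seq) - w + 1)
        # E step: likelihood ratio of every segment under motif vs. background
        ratios = []
        for i in starts:
            r = 1.0
            for j in range(w):
                base = seq[i + j]
                r *= theta[j][base] / theta0[base]
            ratios.append(r)
        total = sum(ratios)
        # M step (accumulation): each segment contributes in proportion to its weight
        for i, r in zip(starts, ratios):
            weight = r / total
            for j in range(w):
                counts[j][seq[i + j]] += weight
    # Scale A, C, G, T at each position so the column sums to 1
    return [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]

theta0 = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
# Hypothetical starting matrix, mildly favoring TTGAC
theta = [{b: (0.4 if b == t else 0.2) for b in "ACGT"} for t in "TTGAC"]
for _ in range(10):
    theta = em_iteration(["TTGACGACTGCACGTTTGAC"], theta, theta0)
print(["".join(max("ACGT", key=lambda b: col[b])) for col in theta])
```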
EM Derivatives
• First EM motif finder (C Lawrence)
– Deterministic algorithm, guaranteed local optimum
• MEME (TL Bailey)
– Prior probability allows 0-n sites / sequence
– Runs multiple EMs in parallel with different seeds
– User friendly results
Gibbs Sampling
• Stochastic process, although it may still need multiple initializations
– Sample $\pi$ from $P(\pi \mid \theta, \text{seq}, \theta_0)$
– Sample $\theta$ from $P(\theta \mid \pi, \text{seq}, \theta_0)$
• Collapsed form: $\theta$ estimated with counts, not sampled from a Dirichlet
– Sample a site from one seq based on sites from other seqs
• Converged motif matrix $\theta$ and converged motif sites $\pi$ represent the stationary distribution of a Markov chain
Gibbs Sampler
• Randomly initialize a probability matrix
• $\theta$ estimated with counts:
$p_{A1} = \dfrac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$
[Figure: sequences 1-5 with their initial sites and the initial motif matrix]
Gibbs Sampler
• Take out one sequence with its sites from the current motif
[Figure: motif matrix estimated without the segment from sequence 1]
Gibbs Sampler
• Score each possible segment of this sequence
[Figure: segment scores of sequence 1, shown for segment (1-8) and segment (2-9); x-axis: starting position of segment (1-20), y-axis: segment score (0-30)]
Segment Score
• Use current motif matrix to score a segment
Sites (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
Motif Matrix
Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G
Segment ATGCAGCT score:
$\text{score} = \dfrac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$
where $p(\text{generate ATGCAGCT from background}) = p_{0A}\,p_{0T}\,p_{0G}\,p_{0C}\,p_{0A}\,p_{0G}\,p_{0C}\,p_{0T}$
Scoring Segments
Motif  1    2    3    4    5    bg
A      0.4  0.1  0.3  0.4  0.2  0.3
T      0.2  0.5  0.1  0.2  0.2  0.3
G      0.2  0.2  0.2  0.3  0.4  0.2
C      0.2  0.2  0.4  0.1  0.2  0.2
Ignore pseudo counts for now…
Sequence: TTCCATATTAATCAGATTCCG…
Segment  Score
TAATC    …
AATCA    0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
ATCAG    0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
TCAGA    0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
CAGAT    …
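The segment scores can be recomputed directly from the motif matrix and background column given on this slide. The sketch below ignores pseudocounts, as the slide does; the function name is hypothetical.

```python
# Motif matrix (positions 1-5) and background from the slide
motif = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
background = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}

def segment_score(segment):
    """Product over positions of p(base | motif position) / p(base | background)."""
    score = 1.0
    for j, base in enumerate(segment):
        score *= motif[base][j] / background[base]
    return score

sequence = "TTCCATATTAATCAGATTCCG"
width = 5
for i in range(len(sequence) - width + 1):
    segment = sequence[i:i + width]
    print(i + 1, segment, round(segment_score(segment), 6))
```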
Gibbs Sampler
• Sample site from one seq based on sites from other seqs
• Modified $\theta$ estimated with counts
[Figure: segment scores of sequence 1; x-axis: starting position of segment (1-20), y-axis: segment score (0-30)]
How to Sample?
Pos    1  2  3   4   5   6   7   8   9
Score  3  1  12  5   8   9   1   2   6
SubT   3  4  16  21  29  38  39  41  47
• Rand(subtotal) = X
• Find the first position with subtotal larger than X
Pos    1  2  3   4   5   6   7    8    9
Score  3  1  12  5   8   9   500  2    6
SubT   3  4  16  21  29  38  538  540  546
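The subtotal trick can be sketched with a running sum and a binary search. The helper names are hypothetical; positions are 1-based to match the tables.

```python
import bisect
import itertools
import random

def pick(subtotals, x):
    """1-based index of the first position whose subtotal is larger than x."""
    index = bisect.bisect_right(subtotals, x)
    return min(index, len(subtotals) - 1) + 1   # clamp guards against x == total

def sample_position(scores, rng=random):
    """Sample a position with probability proportional to its score."""
    subtotals = list(itertools.accumulate(scores))
    x = rng.random() * subtotals[-1]            # X drawn from [0, total)
    return pick(subtotals, x)

scores = [3, 1, 12, 5, 8, 9, 1, 2, 6]
subtotals = list(itertools.accumulate(scores))
print(subtotals)
print(sample_position(scores))
```

With the second table, position 7 covers 500 of the 546 units, so it is drawn most of the time, but other positions can still be sampled.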
Gibbs Sampler
• Repeat the process until the motif converges
[Figure: motif matrix estimated without the segment from sequence 2]
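Putting the steps together, a bare-bones collapsed Gibbs sampler might look like the sketch below. It is a simplification under stated assumptions: exactly one site per sequence, a zero-order background taken from the base composition of the input, a fixed pseudocount, and a fixed number of sweeps instead of a convergence check. Function and parameter names are hypothetical, and this is not the implementation of any published sampler.

```python
import random

def gibbs_motif_sampler(seqs, w, iterations=200, pseudo=0.5, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    all_bases = "".join(seqs)
    theta0 = {b: (all_bases.count(b) + pseudo) / (len(all_bases) + 4 * pseudo)
              for b in "ACGT"}
    # Random initialization of the site in every sequence
    sites = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for k, seq in enumerate(seqs):
            # Motif matrix estimated with counts from all OTHER sequences
            theta = []
            for j in range(w):
                counts = {b: pseudo for b in "ACGT"}
                for m, other in enumerate(seqs):
                    if m != k:
                        counts[other[sites[m] + j]] += 1
                total = sum(counts.values())
                theta.append({b: c / total for b, c in counts.items()})
            # Score each possible segment of the held-out sequence
            scores = []
            for i in range(len(seq) - w + 1):
                score = 1.0
                for j in range(w):
                    base = seq[i + j]
                    score *= theta[j][base] / theta0[base]
                scores.append(score)
            # Sample the new site in proportion to its score
            sites[k] = rng.choices(range(len(scores)), weights=scores)[0]
    return sites

toy = ["ACGTTGACAC", "TTGACGGGTA", "CCATTGACGT"]
print(gibbs_motif_sampler(toy, w=5, iterations=50, seed=1))
```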
Gibbs Sampler Intuition
• Beginning:
– Randomly initialized motif
– No preference towards any segment
[Figure: "Beginning Iterations"; x-axis: starting position of segments (1-20), y-axis scale 0-40]
Gibbs Sampler Intuition
• Motif appears:
– Motif should have enriched signal (more sites)
– By chance some correct sites come to alignment
– Sites bias motif to attract other similar sites
[Figure: "Some good aligned segments come"; x-axis: starting position of segments (1-20), y-axis scale 0-40]
Gibbs Sampler Intuition
• Motif converges:
– All sites come to alignment
– Motif totally biased to sample sites every time
[Figure: "Motif converges towards the end"; x-axis: starting position of segments (1-20), y-axis scale 0-40]
Gibbs Sampler
• Column shift
• Metropolis algorithm:
– Propose $\pi^*$ as $\pi$ shifted 1 column to the left or right
– Calculate motif scores $u(\pi)$ and $u(\pi^*)$
– Accept $\pi^*$ with prob = $\min(1, u(\pi^*) / u(\pi))$
[Figure: sequences 1-5 with their sites at iteration i]
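The column-shift move can be sketched as a single Metropolis step. The scoring function is passed in as a parameter and is assumed to return a positive motif score; the handling of proposals that run off the end of a sequence (they are simply rejected) is an assumption of this sketch, as are all the names.

```python
import random

def shift_sites(sites, seq_lengths, w, shift):
    """Shift every site by the same offset; None if any site leaves its sequence."""
    shifted = [s + shift for s in sites]
    for s, n in zip(shifted, seq_lengths):
        if s < 0 or s + w > n:
            return None
    return shifted

def metropolis_step(sites, seq_lengths, w, score, rng=random):
    """Propose a one-column shift and accept with prob min(1, u(new) / u(current))."""
    shift = rng.choice([-1, 1])
    proposal = shift_sites(sites, seq_lengths, w, shift)
    if proposal is None:
        return sites                      # out-of-range proposal: keep current sites
    u_current, u_new = score(sites), score(proposal)
    if rng.random() < min(1.0, u_new / u_current):
        return proposal
    return sites

# Toy score that prefers any alignment other than the current one
toy_score = lambda s: 1.0 if s == [2, 2] else 2.0
print(metropolis_step([2, 2], [10, 10], 5, toy_score))
```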
Gibbs Sampling Derivatives
• Gibbs Motif Sampler (JS Liu)
– Add prior probability to allow 0-n sites / seq
– Sample motif positions to consider
• AlignACE (F Roth)
– Look for motifs from both strands
– Mask out one motif to find more different motifs
• BioProspector (XS Liu)
– Use background model with Markov dependencies
– Sampling with threshold (0-n sites / seq), new scoring function
– Can find two-block motifs with variable gap
Scoring Motifs
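• Information Content (also known as relative entropy)
– Suppose you have x aligned segments for the motif
– $\dfrac{\text{pb}(s_1 \text{ from mtf})}{\text{pb}(s_1 \text{ from bg})} \times \dfrac{\text{pb}(s_2 \text{ from mtf})}{\text{pb}(s_2 \text{ from bg})} \times \cdots \times \dfrac{\text{pb}(s_x \text{ from mtf})}{\text{pb}(s_x \text{ from bg})}$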
Scoring Motifs
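$\prod_{i=1}^{x} \dfrac{\text{pb}(s_i \text{ from mtf})}{\text{pb}(s_i \text{ from bg})}$
$= (p_{A1}/p_{A0})^{A_1}\,(p_{T1}/p_{T0})^{T_1}\,(p_{T2}/p_{T0})^{T_2}\,(p_{G2}/p_{G0})^{G_2}\,(p_{C2}/p_{C0})^{C_2} \cdots$
Take log of this:
$= A_1 \log(p_{A1}/p_{A0}) + T_1 \log(p_{T1}/p_{T0}) + T_2 \log(p_{T2}/p_{T0}) + G_2 \log(p_{G2}/p_{G0}) + \cdots$
Divide by the number of segments (if all the motifs have the same number of segments):
$= p_{A1} \log(p_{A1}/p_{A0}) + p_{T1} \log(p_{T1}/p_{T0}) + p_{T2} \log(p_{T2}/p_{T0}) + \cdots$
Here $A_1$ is the number of aligned segments with A at position 1, so $A_1 / x = p_{A1}$, and $p_{A0}$ is the background probability of A.
Sites (Pos 12345678): ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG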
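The final expression, a sum over positions and bases of p log(p / p0), can be computed directly from aligned segments. The sketch below uses the natural log and a uniform background, both of which are assumptions (the slide only writes "log" and leaves the background symbolic); the function name is hypothetical. The two example alignments are the segments labelled Good and Bad on the motif conservedness slide.

```python
import math

def information_content(sites, background):
    """Relative entropy of the aligned sites against the background (natural log)."""
    x = len(sites)
    total = 0.0
    for j in range(len(sites[0])):
        for b in "ACGT":
            p = sum(1 for s in sites if s[j] == b) / x
            if p > 0:                       # 0 * log(0) is treated as 0
                total += p * math.log(p / background[b])
    return total

uniform = {b: 0.25 for b in "ACGT"}        # assumption: uniform background
good = ["ATGCA", "ATGCC", "ATGCA", "ATGCA", "TTGCA", "ATGGA", "ATGCA"]
bad = ["AGGCA", "ATCCC", "GCGCA", "CGGTA", "TGCCA", "ATGGT", "TTGAA"]
print(information_content(good, uniform))
print(information_content(bad, uniform))
```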
Scoring Motifs
• Original function: Information Content
• Motif Conservedness: how likely it is to see the current aligned segments from this motif model
Good: ATGCA ATGCC ATGCA ATGCA TTGCA ATGGA ATGCA
Bad: AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA
Scoring Motifs
• Original function: Information Content
• Motif Specificity: how likely it is to see the current aligned segments from background
Good: AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC
Bad: ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA
Scoring Motifs
• Original function: Information Content
• Which is better? (data = 8 seqs)
Motif 1: AGGCTAAC AGGCTAAC
Motif 2: AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC
Scoring Motifs
• Motif scoring function:
– Motif signal: abundant
– Positions: conserved
– Specific (unlikely in genome background)
• Prefer: conserved motifs with many sites that are not often seen in the genome background
Markov Background Increases Motif Specificity
Prefers motif segments enriched only in data, but not so likely to occur in the background
Segment ATGTA score = $\dfrac{p(\text{generate ATGTA from } \theta)}{p(\text{generate ATGTA from } \theta_0)}$
Background probability of a segment, independent bases vs. 3rd order Markov dependency:
Segment  Independent bases    3rd order Markov
TCAGC    .25 .25 .25 .25 .25  .3 .18 .16 .22 .24
ATATA    .25 .25 .25 .25 .25  .3 .41 .38 .42 .30
Position Weight Matrix Update
• Advantage:
– Can look for motifs of any width
– Flexible with base substitutions
• Disadvantage:
– EM and Gibbs sampling: no guaranteed convergence time
– No guaranteed global optimum
Motif Finding in Bacteria
• Promoter sequences are short (200-300 bp)
• Motifs are usually long (10-20 bases)
– Some have two blocks with a gap, some are palindromes
– Long motifs are usually very degenerate
• A single microarray experiment sometimes already provides enough information to search for TF motifs
Motif Finding in Lower Eukaryotes
• Upstream sequences are longer (500-1000 bp), with some simple repeats
• Motif width varies (5-17 bases)
• Expression clusters provide input sequences of decent quality for TF motif finding
• Motif combination and redundancy appear, although single motifs are usually significant enough for identification
Yeast Promoter Architecture
• Co-occurring regulators suggest physical interaction between the regulators
Motif Finding in Higher Eukaryotes
• Upstream sequences are very long (3 kb-20 kb) with repeats; TF motifs can also appear downstream
• Motifs can be short or long (6-20 bases), and appear in combinations and clusters
• Gene expression clusters alone are not good enough input
• Need:
– Comparative genomics: PhastCons score
– Motif modules: motif clusters
– ChIP-chip/seq
Yeast Regulatory Sequence Conservation
UCSC PhastCons Conservation
• Functional regulatory sequences are under stronger evolutionary constraint
• Align orthologous sequences together
• PhastCons conservation score (0-1) for each nucleotide in the genome can be downloaded from UCSC
Conserved Motif Clusters
• First find conserved regions in the genome
• Then look for repeated transcription factor (TF) binding sites
• They form transcription factor modules
Summary
• Biology and challenge of transcription regulation
• Scan for known TF motif sites: TRANSFAC & JASPAR
• De novo method
– Regular expression enumeration
  • Oligonucleotide analysis
  • MobyDick: build long motifs from short ones
– Position weight matrix update
  • CONSENSUS (sequence order)
  • EM (iterate $\theta$, $\pi$; $\theta$ ~ weighted average)
  • Gibbs Sampler (sample $\theta$, $\pi$; Markov chain convergence)
  • Motif score and Markov background
• Motif cluster and motif conservation