Motif Finding
PSSMs
Expectation Maximization
Gibbs Sampling
Complexity of Transcription
Representing Binding Sites for a TF
A set of sites represented as a consensus VDRTWRWWSHD (IUPAC degenerate DNA)
A matrix describing a set of sites:
A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3
C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12
G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2
T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4
A single site AAGTTAATGA
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
Nucleic acid codes
code description
A Adenine
C Cytosine
G Guanine
T Thymine
U Uracil
R Purine (A or G)
Y Pyrimidine (C, T, or U)
M C or A
K T, U, or G
W T, U, or A
S C or G
B C, T, U, or G (not A)
D A, T, U, or G (not C)
H A, T, U, or C (not G)
V A, C, or G (not T, not U)
N Any base (A, C, G, T, or U)
From frequencies to log scores
f matrix (counts):
A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1
w matrix (log scores):
A 1.6 -1.7 -0.2 -1.7 -1.7
C -1.7 0.5 0.5 1.3 -1.7
G -1.7 1.0 -0.2 -1.7 1.3
T -1.7 -1.7 -0.2 -0.2 -0.2
$w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$
Score of TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9
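The step from the f matrix to the w matrix can be sketched in MATLAB. The slide does not say how s(N) is defined or how the counts are normalized; the sketch below assumes a pseudocount of sqrt(N)*p(b) per base, a denominator of N + sqrt(N), a uniform background p(b) = 0.25 and log base 2. These are assumptions, chosen because they reproduce the w values shown on the slide.

```matlab
% Count matrix f from the slide (rows A, C, G, T; columns = motif positions).
f = [5 0 1 0 0;
     0 2 2 4 0;
     0 3 1 0 4;
     0 0 1 1 1];
N = sum(f(:, 1));           % number of aligned sites (5)
p = 0.25;                   % ASSUMPTION: uniform background frequency p(b)
s = sqrt(N) * p;            % ASSUMPTION: pseudocount sqrt(N)*p(b) per base

% Pseudocount-corrected frequency, then log2 ratio against the background.
w = log2(((f + s) ./ (N + sqrt(N))) ./ p);

% Score a site by summing one weight per position.
bases = 'ACGT';
site = 'TGCTG';
score = 0;
for i = 1:length(site)
    score = score + w(bases == site(i), i);
end

disp(round(w * 10) / 10);                  % w matrix rounded to one decimal
fprintf('score(%s) = %.1f\n', site, score);
```

With these assumptions the rounded matrix matches the w matrix above and TGCTG scores about 0.9.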
TFs do not act alone
http://www.bioinformatics.ca/
PSSMs for Liver TFs…
HNF1
C/EBP
HNF3
HNF4
PSSMs for Helix-Turn-Helix Motif
Promoter…
Promoter Weight Matrices (PWM)
E.Coli PWMs
Motif Logo
Motifs can mutate on less important bases.
The five motifs shown have mutations in positions 3 and 5.
Representations called motif logos illustrate the conserved regions of a motif.
Position: 1234567
          TGGGGGA
          TGAGAGA
          TGGGGGA
          TGAGAGA
          TGAGGGA
http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html
Example: Calmodulin-Binding Motif (calcium-binding proteins)
Sequence Motifs
• Motifs represent a short common sequence
  – Regulatory motifs (TF binding sites)
  – Functional site in proteins (DNA binding motif)
http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html
Regulatory Motifs
Transcription factors bind to regulatory motifs
Motifs are 6–20 nucleotides long
Activators and repressors
Usually located near the target gene, mostly upstream
Challenges
How to recognize a regulatory motif?
Can we identify new occurrences of known motifs in genome sequences?
Can we discover new motifs within upstream sequences of genes?
Motif Representation
Exact motif: CGGATATA
Consensus: represent only deterministic nucleotides.
Example: HAP1 binding sites in 5 sequences.
Consensus motif: CGGNNNTANCGG (N stands for any nucleotide).
Representing only the consensus loses information. How can this be avoided?
CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
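As a small illustration, the consensus above can be derived from the five aligned HAP1 sites by keeping a base only where every site agrees and writing N elsewhere. This is a minimal sketch of that one rule; it does not use the other IUPAC degenerate codes.

```matlab
% The five aligned HAP1 binding sites from the slide, one per row.
sites = ['CGGATATACCGG';
         'CGGTGATAGCGG';
         'CGGTACTAACGG';
         'CGGCGGTAACGG';
         'CGGCCCTAACGG'];

k = size(sites, 2);
consensus = repmat('N', 1, k);     % default: N = any nucleotide
for i = 1:k
    col = sites(:, i);
    if all(col == col(1))          % column is fully conserved
        consensus(i) = col(1);
    end
end
disp(consensus)                    % CGGNNNTANCGG
```

Positions 4 to 6 and position 9 vary across the sites, so they collapse to N, which is exactly the information the consensus discards.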
1 2 3 4 5
A 10 25 5 70 60
C 30 25 80 10 15
T 50 25 5 10 5
G 10 25 10 10 20
PSPM – Position Specific Probability Matrix
Represents a motif of length k (here k = 5).
Count the number of occurrences of each nucleotide in each position.
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
PSPM – Position Specific Probability Matrix
Defines Pi(n), n ∈ {A,C,G,T}, for i = 1,...,k.
Pi(A) is the frequency of nucleotide A in position i.
Identification of Known Motifs within Genomic Sequences
Motivation:
Identification of new genes controlled by the same TF.
Infer the function of these genes.
Enable better understanding of the regulation mechanism.
PSPM – Position Specific Probability Matrix
Each k-mer is assigned a probability.
Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014
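The two steps (counts to PSPM, then the probability of a k-mer) can be sketched as follows. The count matrix is the one shown earlier, whose columns each sum to 100; note that the rows follow the slide's order A, C, T, G. Variable names are illustrative.

```matlab
bases = 'ACTG';                    % row order used in the slide tables
counts = [10 25  5 70 60;
          30 25 80 10 15;
          50 25  5 10  5;
          10 25 10 10 20];

% Divide each column by its total so that every column sums to 1.
pspm = counts ./ repmat(sum(counts, 1), 4, 1);

% Probability of a k-mer: product of one matrix entry per position.
kmer = 'TCCAG';
P = 1;
for i = 1:length(kmer)
    P = P * pspm(bases == kmer(i), i);
end
fprintf('P(%s) = %.3f\n', kmer, P);   % P(TCCAG) = 0.014
```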
Detecting a Known Motif within a Sequence using PSPM
The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a match to the PSPM.
Example: sequence = ATGCAAGTCT…
Position 1: ATGCA 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
Position 2: TGCAA 0.5*0.25*0.8*0.7*0.6 = 0.042
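A minimal sketch of moving the PSPM along a query sequence is given below. It uses only the ten bases of the example sequence that are shown (the rest is elided in the slide), and keeps the slide's row order A, C, T, G.

```matlab
bases = 'ACTG';                    % row order used in the slide tables
pspm = [0.1 0.25 0.05 0.7 0.6;
        0.3 0.25 0.8  0.1 0.15;
        0.5 0.25 0.05 0.1 0.05;
        0.1 0.25 0.1  0.1 0.2];
seq = 'ATGCAAGTCT';                % visible part of the example sequence

k = size(pspm, 2);
nwin = length(seq) - k + 1;        % number of windows of width k
P = ones(1, nwin);
for start = 1:nwin
    window = seq(start:start + k - 1);
    for i = 1:k
        P(start) = P(start) * pspm(bases == window(i), i);
    end
    fprintf('Position %d: %s  %.2e\n', start, window, P(start));
end
```

The first two windows reproduce the values above: 1.5*10^-4 for ATGCA and 0.042 for TGCAA.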
Detecting a Known Motif within a Sequence using PSSM
Is it a random match, or is it indeed an occurrence of the motif?
PSPM -> PSSM (Position Specific Scoring Matrix)
Odds score matrix: Oi(n), where n ∈ {A,C,G,T} and i = 1,...,k, defined as Pi(n)/P(n), where P(n) is the background frequency of n.
As Oi(n) increases, so do the odds that n at position i is part of a real motif.
PSSM as Odds Score Matrix
Assumption: the background frequency of each nucleotide is 0.25.
Original PSPM (Pi):
  1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
Odds Matrix (Oi):
  1 2 3 4 5
A 0.4 1 0.2 2.8 2.4
Going to log scale we get an additive score. Log odds Matrix (log2 Oi):
  1 2 3 4 5
A -1.322 0 -2.322 1.485 1.263
1 2 3 4 5
A -1.32 0 -2.32 1.48 1.26
C 0.26 0 1.68 -1.32 -0.74
T 1 0 -2.32 -1.32 -2.32
G -1.32 0 -1.32 -1.32 -0.32
Calculating using Log Odds Matrix
Log odds = 0 implies a random match; log odds > 0 implies a real match (?).
Example: sequence = ATGCAAGTCT…
Position 1: ATGCA
-1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7, odds = 2^-2.7 = 0.15
Position 2: TGCAA
1 + 0 + 1.68 + 1.48 + 1.26 = 5.42, odds = 2^5.42 = 42.8
Calculating the probability of a match
ATGCAAG
Position 1: ATGCA = 0.15
Position 2: TGCAA = 42.8
Position 3: GCAAG = 0.18
P(i) = S(i) / ∑ S, where S(i) is the odds score at position i
Example: 0.15 / (0.15 + 42.8 + 0.18) = 0.003
P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
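The whole chain (PSPM to log-odds matrix, summed log-odds per window, odds, normalized probability) can be sketched in a few lines. A uniform background of 0.25 is assumed, as on the slide. Because the code does not round the log-odds entries, the odds for window 2 come out near 43.0 rather than 42.8; the normalized probabilities agree to three decimals.

```matlab
bases = 'ACTG';                    % row order used in the slide tables
pspm = [0.1 0.25 0.05 0.7 0.6;
        0.3 0.25 0.8  0.1 0.15;
        0.5 0.25 0.05 0.1 0.05;
        0.1 0.25 0.1  0.1 0.2];
background = 0.25;                 % assumed uniform background frequency
logodds = log2(pspm ./ background);

seq = 'ATGCAAG';
k = size(pspm, 2);
nwin = length(seq) - k + 1;        % 3 windows
S = zeros(1, nwin);                % summed log-odds score per window
for start = 1:nwin
    window = seq(start:start + k - 1);
    for i = 1:k
        S(start) = S(start) + logodds(bases == window(i), i);
    end
end

odds = 2 .^ S;                     % back from log scale to odds
P = odds ./ sum(odds);             % normalize over all windows
fprintf('P(%d) = %.3f\n', [1:nwin; P]);
```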
Building a PSSM
Collect all known sequences that bind a certain TF.
Align all sequences (using multiple sequence alignment).
Compute the frequency of each nucleotide in each position (PSPM).
Incorporate background frequency for each nucleotide (PSSM).
Finding new Motifs
We are given a group of genes, which presumably contain a common regulatory motif.
We know nothing of the TF that binds to the putative motif.
The problem: discover the motif.
Example
Predicting the cAMP Receptor Protein (CRP) binding site motif
GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA
Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACATGTGAGCGGATAACAATGTGAGTTAGCTCACTTGTGATCTCTGTTACACGAGGATGAGAACACACTCGGTTTAGTTCACCTGTGACACAGTGCAAACCTGACGGAGTTCACAAGTGTCTATAATCACGTGGAATATCCATCACATGCAAAGGACGTCACGGGCGACCTGGGTCATGTGTGATGTGTATCGAATTTGAACCACATCGCAGGTGAGAGCCATCACATGTAAGCTGTGCCACGTTTATTCCATGTCACGTGTTATACACATCACTCGTGCTCCCACTCGCATGTGATTCGATTCACA
Create a Multiple Sequence Alignment
A C G T
1 -0.43 0.1 -0.46 0.55
2 1.37 0.12 -1.59 -11.2
3 1.69 -1.28 -11.2 -1.43
4 -1.28 0.12 -11.2 1.32
5 0.91 -11.2 -0.46 0.47
6 1.53 -1.38 -1.48 -1.43
7 0.9 -0.48 -11.2 0.12
8 -1.37 -1.28 -11.2 1.68
9 -11.2 -11.2 1.73 -0.56
10 -11.2 -0.51 -11.2 1.72
11 -0.48 -11.2 1.72 -11.2
12 1.56 -1.59 -11.2 -0.46
13 -0.51 -0.38 -0.55 0.88
14 -11.2 0.5 0.57 0.13
15 0.17 -0.51 0.12 0.12
16 0.9 -11.2 0.5 -0.48
17 0.17 0.16 0.06 -0.48
18 -0.4 -0.38 0.82 -0.48
19 -1.38 -1.28 -11.2 1.68
20 -1.48 1.7 -11.2 -1.38
21 1.5 -1.38 -1.43 -1.28
Generate a PSSM
Shannon Entropy
Expected variation per column can be calculated
Low entropy means higher conservation
Entropy
The entropy (H) for a column is:
$H = -\sum_{a \in \text{residues}} f_a \log(p_a)$
a: a residue
fa: frequency of residue a in a column
pa: probability of residue a in that column
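A short sketch of the column entropy follows. It makes two assumptions that the slide leaves open: the observed column frequencies are used for both fa and pa, and the logarithm is base 2 so that the result is in bits. The three columns are hypothetical.

```matlab
% Hypothetical column frequencies (rows A, C, G, T), one column each for:
% fully conserved, uniform, and two equally likely residues.
freqs = [1.0 0.25 0.5;
         0.0 0.25 0.5;
         0.0 0.25 0.0;
         0.0 0.25 0.0];

ncol = size(freqs, 2);
H = zeros(1, ncol);
for c = 1:ncol
    p = freqs(:, c);
    p = p(p > 0);                  % treat 0*log2(0) as 0
    H(c) = -sum(p .* log2(p));     % ASSUMPTION: fa = pa, log base 2
end
disp(H)                            % 0 bits, 2 bits, 1 bit
```

The conserved column has the lowest entropy, consistent with "low entropy means higher conservation".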
Entropy
entropy measures can determine which evolutionary distance (PAM250, BLOSUM80, etc) should be used
Entropy yields amount of information per column (discussed with sequence logos in a bit)
Log-odds score
Profiles can also indicate log-odds score: Log2(observed:expected)
Result is a bit score
Matlab
Multialign
1 Enter an array of sequences.
seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
2 Promote terminations with gaps in the alignment.
multialign(seqs,'terminalGapAdjust',true)
ans =
--CACGTAACATCTC--
ACGACGTAACATCTTCT
-AAACGTAACATCTCGC
Matlab
3 Compare alignment without termination gap adjustment.
multialign(seqs)
ans =
CA--CGTAACATCT--C
ACGACGTAACATCTTCT
AA-ACGTAACATCTCGC
Matlab
>> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}
a =
'ATATAGGAG' 'AATTATAGA' 'TTAGAGAAA'
Char function
>> cseq=char(a)
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
Double function
>> intseq=double(cseq)
intseq =
65 84 65 84 65 71 71 65 71
65 65 84 84 65 84 65 71 65
84 84 65 71 65 71 65 65 65
double
>> double('A')
ans = 65
>> double('C')
ans = 67
>> double('G')
ans = 71
>> double('T')
ans = 84
Initiate PSPM matrix
>> Pspm=zeros(4,size(intseq,2))   % 4 rows (A,C,G,T), one column per alignment position
Pspm =
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
Use a for loop to count each nucleotide at each position
>> for i = 1:size(intseq,2)       % size(...,2), not length(): length() returns the largest dimension
Pspm(1,i)=sum(intseq(:,i)==65);   % A
Pspm(2,i)=sum(intseq(:,i)==67);   % C
Pspm(3,i)=sum(intseq(:,i)==71);   % G
Pspm(4,i)=sum(intseq(:,i)==84);   % T
end
>> Pspm
Pspm =
2 1 2 0 3 0 2 2 2
0 0 0 0 0 0 0 0 0
0 0 0 1 0 2 1 1 1
1 2 1 2 0 1 0 0 0
Add pseudocounts
>> Pspmp=Pspm+1
Pspmp =
3 2 3 1 4 1 3 3 3
1 1 1 1 1 1 1 1 1
1 1 1 2 1 3 2 2 2
2 3 2 3 1 2 1 1 1
Normalize to get frequencies
>> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)
Pspmnorm =
Columns 1 through 7
0.4286 0.2857 0.4286 0.1429 0.5714 0.1429 0.4286
0.1429 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429
0.1429 0.1429 0.1429 0.2857 0.1429 0.4286 0.2857
0.2857 0.4286 0.2857 0.4286 0.1429 0.2857 0.1429
Columns 8 through 9
0.4286 0.4286
0.1429 0.1429
0.2857 0.2857
0.1429 0.1429
Calculate odds score
>> Pswm=Pspmnorm/0.25
Pswm =
Columns 1 through 7
1.7143 1.1429 1.7143 0.5714 2.2857 0.5714 1.7143
0.5714 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714
0.5714 0.5714 0.5714 1.1429 0.5714 1.7143 1.1429
1.1429 1.7143 1.1429 1.7143 0.5714 1.1429 0.5714
Columns 8 through 9
1.7143 1.7143
0.5714 0.5714
1.1429 1.1429
0.5714 0.5714
Log odds ratio
>> logPswm=log2(Pswm)
logPswm =
Columns 1 through 7
0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926
0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074
Columns 8 through 9
0.7776 0.7776
-0.8074 -0.8074
0.1926 0.1926
-0.8074 -0.8074
Estimate the probability of the given sequence to belong to the defined PSWM
>> Unknown='TTAAGAAGG'
Unknown =
TTAAGAAGG
>> intunknown=double(Unknown)
intunknown =
84 84 65 65 71 65 65 71 71
Get the row index in the PSWM for each base of the unknown sequence
>> intunknown(intunknown==65)=1;   % A -> row 1
>> intunknown(intunknown==67)=2;   % C -> row 2
>> intunknown(intunknown==71)=3;   % G -> row 3
>> intunknown(intunknown==84)=4;   % T -> row 4
>> intunknown
intunknown =
4 4 1 1 3 1 1 3 3
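Calculate the log odds-ratio of the Unknown 'TTAAGAAGG'
>> idx=sub2ind(size(logPswm),intunknown,1:length(intunknown));   % row = base, column = position
>> logunknown=logPswm(idx)   % logPswm(intunknown) alone would read every value from column 1
logunknown =
Columns 1 through 7
0.1926 0.7776 0.7776 -0.8074 -0.8074 -0.8074 0.7776
Columns 8 through 9
0.1926 0.1926
>> Punknown=sum(logunknown)
Punknown =
0.4887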
Is this significant score or just random similarity?
>> cseq
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
>> Unknown
Unknown =
TTAAGAAGG
What would be the maximum score?
>> logPswm
logPswm =
Columns 1 through 7
0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926
0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074
Columns 8 through 9
0.7776 0.7776
-0.8074 -0.8074
0.1926 0.1926
-0.8074 -0.8074
>> maxscore=max(logPswm)
maxscore =
Columns 1 through 7
0.7776 0.7776 0.7776 0.7776 1.1926 0.7776 0.7776
Columns 8 through 9
0.7776 0.7776
>> totalmaxscore=sum(maxscore)
totalmaxscore =
7.4135
Write a function using the above statements to scan a sequence
Write a function named ‘logodds’ that calculates the log-odds ratio of a given alignment.
Write a function named ‘scanmotif’ that calls ‘logodds’ and searches through a sequence using a sliding window, calculating the log-odds score of each subsequence and storing these scores. The function should allow selection of a maximum number of locations that are likely to contain the motif, based on the scores obtained.
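One possible shape for these two functions is sketched below; it is an illustration, not a reference solution. The argument names (aln, background, nbest) are choices made here, each function would normally be saved in its own file (logodds.m, scanmotif.m), and the code assumes sequences contain only A, C, G and T. The pseudocount of 1 mirrors the session above.

```matlab
function lo = logodds(aln, background)
% LOGODDS  Log2-odds matrix (rows A,C,G,T) from a char-matrix alignment.
% aln: one aligned site per row; background: assumed base frequency (e.g. 0.25).
bases = 'ACGT';
k = size(aln, 2);
counts = zeros(4, k);
for b = 1:4
    counts(b, :) = sum(aln == bases(b), 1);   % count base b in every column
end
counts = counts + 1;                          % pseudocount of 1 per base
freqs = counts ./ repmat(sum(counts, 1), 4, 1);
lo = log2(freqs ./ background);
end

function [pos, scores] = scanmotif(seq, aln, nbest)
% SCANMOTIF  Slide the log-odds matrix along seq.
% Returns up to nbest window start positions (best first) and all scores.
bases = 'ACGT';
lo = logodds(aln, 0.25);
k = size(lo, 2);
nwin = length(seq) - k + 1;
scores = zeros(1, nwin);
for start = 1:nwin
    window = seq(start:start + k - 1);
    for i = 1:k
        scores(start) = scores(start) + lo(bases == window(i), i);
    end
end
[~, order] = sort(scores, 'descend');
pos = order(1:min(nbest, nwin));
end
```

A call such as `[pos, scores] = scanmotif('CCTTAAGAAGGCC', char({'ATATAGGAG','AATTATAGA','TTAGAGAAA'}), 2)` would return the two highest-scoring window starts together with the score of every window.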
Position Specific Scoring Matrices (PSSMs) incorporate information theory to indicate the information contained within each column of a multiple alignment.
Information is a logarithmic transformation of the frequency of each residue in the motif.
PSSMs and Pseudocounts
Problem: PSSMs are only as good as the initial MSA
Some residues may be underrepresented
Other columns may be too conserved
Solution: introduce pseudocounts to get a better indication
Pseudocounts
New estimated probability:
$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$
Pca: probability of residue a in column c
nca: count of a's in column c
bca: pseudocount of a's in column c
Nc: total count in column c
Bc: total pseudocount in column c
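As a worked instance of this formula, take the first column of the three-sequence alignment used in the Matlab session (A twice, T once) with a pseudocount of 1 for every residue. The choice bca = 1 is an assumption carried over from that session; other pseudocount schemes are possible.

```matlab
n = [2; 0; 0; 1];        % nca: counts of A, C, G, T in column c
b = [1; 1; 1; 1];        % bca: ASSUMED pseudocount of 1 per residue
Nc = sum(n);             % total count in the column (3)
Bc = sum(b);             % total pseudocount in the column (4)

P = (n + b) ./ (Nc + Bc);    % Pca = (nca + bca) / (Nc + Bc)
fprintf('%.4f ', P); fprintf('\n');
```

The result, 3/7, 1/7, 1/7 and 2/7, matches the first column of Pspmnorm, and no residue is left with probability zero.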
PSSMs and pseudocounts
probabilities converted into a log-odds form (usually log2 so the information
can be reported in bits) and placed in the PSSM.
Searching PSSMs
value for the first residue in the sequence occurring in the first column is calculated by searching the PSSM
the value for the residue occurring in each column is calculated
Searching PSSMs
values are added (since they are logarithms) to produce a summed log odds score, S
S can be converted to an odds score using the formula 2^S
odds scores for each position can be summed together and normalized to produce a probability of the motif occurring at each location.
Information in PSSMs
Information theory: amount of information contained within each sequence.
No information: the amount of uncertainty can be measured as log2(20) = 4.32 for amino acids, since there are 20 amino acids.
For nucleic acid sequences, the amount of uncertainty can be measured as log2(4) = 2.
Information in PSSMs
If a column is completely conserved then the uncertainty is 0 – there is only one choice.
If two residues occur with equal probability, there is uncertainty in deciding which residue it is (log2(2) = 1 bit).
Measure of Uncertainty
Measured as the entropy
$H_c = -\sum_{a \in \text{residues}} f_{ca} \log(p_{ca})$
Relative Entropy
Relative entropy takes into account the overall composition of the organism being studied
$R_c = \sum_{a \in \text{residues}} f_{ca} \log_2(p_{ca} / b_a)$
ba is the background frequency of residue a in the organism
PSSM Uncertainty
Uncertainty for the whole model is summed over all columns:
$H = \sum_{c \in \text{all columns}} H_c$
Sequence Logos
Information in PSSMs can be viewed visually
Sequence logos illustrate information in each column of a motif
height of logo is calculated as the amount by which uncertainty has been decreased
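The logo height of a column can be sketched as the maximum uncertainty (log2(4) = 2 bits for DNA) minus the column entropy. The sketch reuses the 5-column PSPM from earlier (rows A, C, T, G), takes log base 2, and uses the column frequencies for both fca and pca; those choices are assumptions. Scaling each letter by its frequency is the usual logo convention and is included only for illustration.

```matlab
pspm = [0.1 0.25 0.05 0.7 0.6;     % A
        0.3 0.25 0.8  0.1 0.15;    % C
        0.5 0.25 0.05 0.1 0.05;    % T
        0.1 0.25 0.1  0.1 0.2];    % G

Hmax = log2(4);                          % maximum uncertainty for DNA (2 bits)
H = -sum(pspm .* log2(pspm), 1);         % column entropy; all entries here are > 0
info = Hmax - H;                         % decrease in uncertainty = column height
heights = pspm .* repmat(info, 4, 1);    % height of each letter within its column

fprintf('%.2f ', info); fprintf('\n');
```

Column 2, where all four bases are equally likely, gets height 0, while the strongly conserved column 3 is close to 1 bit.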
Sequence Logos
Statistical Methods
Commonly used methods for locating motifs:
Expectation-Maximization (EM)
Gibbs Sampling
Expectation-Maximization
Begin with a set of sequences with an unknown signal in common
Signal may be subtle
Approximate length of signal must be given
Randomly assign locations of this motif in each sequence
Expectation-Maximization
Two steps:
Expectation Step
Maximization Step
Expectation-Maximization
Expectation step
Residue frequencies for each position are calculated
Residues not in a motif are treated as background
Frequencies are used to determine the probability of finding a site at any position in a sequence that fits the motif model
Maximization Step
Determine the location in each sequence that maximally aligns to the motif pattern
Once a new motif location is found for each sequence, the motif pattern is revised in the expectation step
E-M continues until the solution converges
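The loop described above can be sketched on toy data. This follows the simplified scheme on these slides, where each sequence contributes its single best location; MEME's actual EM weights every candidate position instead. The four sequences (each with TTGACA planted), the pseudocount of 1 and the iteration cap are all assumptions made for illustration, and because the start is random the run may settle on a local optimum.

```matlab
seqs = {'ACGTTGACAGCT', 'TTGACATCGGAC', 'GGCATTGACAAT', 'CATGCTTGACAG'};  % toy data
w = 6;                                   % motif width must be given
bases = 'ACGT';
nseq = numel(seqs);

pos = zeros(1, nseq);
for s = 1:nseq
    pos(s) = randi(length(seqs{s}) - w + 1);   % random initial locations
end

for iter = 1:50                          % assumed iteration cap
    % Expectation step: residue frequencies for motif columns and background.
    motif = ones(4, w);                  % start from a pseudocount of 1
    bg = ones(4, 1);
    for s = 1:nseq
        seq = seqs{s};
        inmotif = false(1, length(seq));
        inmotif(pos(s):pos(s) + w - 1) = true;
        for i = 1:w
            b = find(bases == seq(pos(s) + i - 1));
            motif(b, i) = motif(b, i) + 1;
        end
        for b = 1:4
            bg(b) = bg(b) + sum(seq(~inmotif) == bases(b));
        end
    end
    motif = motif ./ repmat(sum(motif, 1), 4, 1);
    bg = bg ./ sum(bg);
    lo = log2(motif ./ repmat(bg, 1, w));

    % Maximization step: best-scoring window in every sequence.
    newpos = pos;
    for s = 1:nseq
        seq = seqs{s};
        nwin = length(seq) - w + 1;
        score = zeros(1, nwin);
        for start = 1:nwin
            for i = 1:w
                score(start) = score(start) + lo(bases == seq(start + i - 1), i);
            end
        end
        [~, newpos(s)] = max(score);
    end
    if isequal(newpos, pos), break; end  % converged: locations unchanged
    pos = newpos;
end
disp(pos)
```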
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCTCCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTGTCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAGAAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTCGGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGCAGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGAGCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCACATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCTTCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGCGCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCCCATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGGGATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAGTCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGACCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGCATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGTAGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCAGCACACACACTTATCCAGTGGTAAATACACATCATTCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGATACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGATGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAGCAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAACTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAAGAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCTTGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACTGGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGTCAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTGCCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCAGGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTGCTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC
Residue Counts
Given motif alignment, count for each location is calculated:
Residue Frequencies
The counts are then converted to frequencies:
Example Maximization Step
Consider the first sequence:
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
There are 41 residues; for a motif of width 6 there are 41-6+1 = 36 sites to consider
MEME Software
One of three motif models:
OOPS: One expected occurrence per sequence
ZOOPS: Zero or one expected occurrence per sequence
TCM: Any number of occurrences of the motif
Gibbs Sampling
Similar to E-M algorithm
Combines E-M and simulated annealing
Goal: Find most probable pattern by sampling from motif probabilities to maximize ratio of model:background probabilities
Predictive Update Step
random motif start position chosen for all sequences except one
Initial alignment used to calculate residue frequencies for motif and background
similar to the Expectation Step of EM
Sampling Step
ratio of model:background probabilities normalized and weighted
motif start position chosen based on a random sampling with the given weights
Differs from the E-M algorithm, which takes the best-aligning position instead of sampling one
Gibbs Sampling
process repeated until residue frequencies in each column do not change
The sampling step is then repeated for a different initial random alignment
Sampling allows escape from local maxima
Gibbs Sampling
Dirichlet priors (pseudocounts) are added into the nucleotide counts to improve performance
shifting routine shifts motif a few bases to the left or the right
A range of motif sizes is checked
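A bare-bones version of the predictive update and sampling steps is sketched below, again on toy data with TTGACA planted in each sequence. The fixed number of sweeps, the pseudocount of 1 and the round-robin choice of the left-out sequence are assumptions; the restarts, shifting routine and motif-size search mentioned above are omitted.

```matlab
seqs = {'ACGTTGACAGCT', 'TTGACATCGGAC', 'GGCATTGACAAT', 'CATGCTTGACAG'};  % toy data
w = 6;                                   % assumed motif width
bases = 'ACGT';
nseq = numel(seqs);

pos = zeros(1, nseq);
for s = 1:nseq
    pos(s) = randi(length(seqs{s}) - w + 1);   % random start positions
end

for iter = 1:200                         % assumed fixed number of updates
    out = mod(iter - 1, nseq) + 1;       % sequence left out this round

    % Predictive update step: frequencies from all sequences except 'out'.
    motif = ones(4, w);                  % pseudocounts of 1
    bg = ones(4, 1);
    for s = 1:nseq
        if s == out, continue; end
        seq = seqs{s};
        inmotif = false(1, length(seq));
        inmotif(pos(s):pos(s) + w - 1) = true;
        for i = 1:w
            b = find(bases == seq(pos(s) + i - 1));
            motif(b, i) = motif(b, i) + 1;
        end
        for b = 1:4
            bg(b) = bg(b) + sum(seq(~inmotif) == bases(b));
        end
    end
    motif = motif ./ repmat(sum(motif, 1), 4, 1);
    bg = bg ./ sum(bg);
    ratio = motif ./ repmat(bg, 1, w);   % model:background ratio per base and column

    % Sampling step: weight every window of the left-out sequence, then sample.
    seq = seqs{out};
    nwin = length(seq) - w + 1;
    weight = ones(1, nwin);
    for start = 1:nwin
        for i = 1:w
            weight(start) = weight(start) * ratio(bases == seq(start + i - 1), i);
        end
    end
    weight = weight ./ sum(weight);      % normalize to sampling probabilities
    cw = cumsum(weight);
    cw(end) = 1;                         % guard against floating-point round-off
    pos(out) = find(rand <= cw, 1, 'first');
end
disp(pos)
```

Because the new position is drawn at random in proportion to its weight rather than taken as the maximum, a poor alignment can still be left behind, which is how sampling escapes local maxima.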
Gibbs Sampler Web Interface
http://bayesweb.wadsworth.org/gibbs/gibbs.html