cisGreedy Motif Finder for Cistematic
Sarah Aerni
Mentors: Ali Mortazavi, Barbara Wold
Post on 21-Dec-2015
cisGreedy
Algorithm similar to Consensus motif finder
– Greedy method over multiple iterations
– De novo motif finder based on input values
Implemented in Cistematic package using Python
Goal: To provide an efficient Greedy algorithm to be included in the Cistematic package that performs similarly to Consensus
Cistematic
One motif finder is generally insufficient
Further automated analysis performed to refine motifs
Enhances motif finder performance through additional steps
Image: Ali Mortazavi
Cistematic
Image: Ali Mortazavi
cisGreedy becomes part of “Bottom Tier”
Offers an alternative to downloading Consensus software
– Additional motif finders will be made available
What is a Motif?
cis-Regulatory elements
– Transcription Factor Binding Sites (TFBS)
– Binding by transcription factors may increase or decrease transcription of genes
Gene regulation is believed to be a major source of complexity
– Plants may have more genes or larger genomes than humans – are they more complex?
Multiple Products from One Gene
Other methods to increase complexity
– Polyadenylation: different “endings” available
– Alternative splicing: many more cDNAs
– Methylation
Identification of cis-regulatory elements will help us understand gene regulatory networks
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
5 sample sequences
But motifs are rarely conserved to such a degree (here the 8-mer acgtacgt appears unchanged in all 5 sequences)
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccaTgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacTtaGgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgAacgAgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacCtCcgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaaGgtGcgtc
Motifs are less discernible without 100% identity
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttCcaaccat
agtactggtgtAcAtttGatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaAAAtttt
agcctccgatgtaagtcatagctgtaactattacctgccacCcCtAttacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Other subsequences which are not motifs may appear more conserved
– Filtering out noise becomes challenging!
Motifs are degenerate
Only certain positions need to be specified
– Binding sites for different control elements may overlap – more complex regulation
Often represented by a Position Specific Frequency Matrix (PSFM), where each nucleotide is represented as a fraction – columns add to 1
Also represented by “Motif Logo”
How do we find motifs?
Hard to identify
– Relatively short sequences
– Many positions not well conserved
Factors improving identification
– Usually localized in certain proximity of a gene (search within 3 kb upstream)
– Some positions highly conserved
– Use other data (Microarray?)
Motif Finders
Greedy
– Maximizes similarity of motifs from sequences through a greedy approach
Gibbs Sampling
– Attempts to find best motifs using a combination of probability and scores, to avoid settling on local maxima
Expectation Maximization
Consensus Score
Determine number of occurrences of each base at each position
– Sum of the occurrences of each nucleotide at every index must add to the total number of sequences included
Identify the most common base at each position – Consensus Sequence
Add the occurrences of the consensus base at each index to determine the Consensus Score
Consensus Score = 31
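The scoring steps above can be sketched in a few lines of Python. This is only an illustration, not the Cistematic code: the function name `consensus_score` is hypothetical, and the five aligned sites are taken from the Position Specific Frequency Matrix slide, which give a score of 31.

```python
from collections import Counter

def consensus_score(sites):
    """Return (consensus sequence, consensus score) for equal-length aligned sites."""
    consensus = []
    score = 0
    for column in zip(*sites):            # walk the alignment column by column
        counts = Counter(column)          # occurrences of each base at this position
        base, count = counts.most_common(1)[0]
        consensus.append(base)            # most common base at this position
        score += count                    # its count contributes to the score
    return "".join(consensus), score

# Aligned sites copied from the PSFM slide of this deck.
sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
consensus, score = consensus_score(sites)
print(consensus, score)
```

With five sequences of width 7 the maximum possible score is 35, reached only when every column is perfectly conserved.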
Position Specific Frequency Matrix
T G G G G G A
T G A G A G A
T G G G G G A
T G A G A G A
T G A G G G A
A 0.0 0.0 0.6 0.0 0.4 0.0 1.0
C 0.0 0.0 0.0 0.0 0.0 0.0 0.0
G 0.0 1.0 0.4 1.0 0.6 1.0 0.0
T 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Frequencies are the number of each base at every position divided by the total number of sequences
Sum for each column is 1 (at least one base must occur)
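A minimal sketch of how such a matrix could be built from the aligned sites above, assuming the sites are equal-length uppercase strings. The helper name `build_psfm` and the dictionary layout (one row per base) are illustrative choices, not the Cistematic data structure.

```python
def build_psfm(sites):
    """Build a position specific frequency matrix as {base: [frequency per column]}."""
    n = len(sites)
    width = len(sites[0])
    return {
        base: [sum(site[j] == base for site in sites) / n for j in range(width)]
        for base in "ACGT"
    }

sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
psfm = build_psfm(sites)
for base in "ACGT":
    print(base, psfm[base])

# Every column sums to 1 because each site contributes exactly one base per position.
column_sums = [sum(psfm[base][j] for base in "ACGT") for j in range(len(sites[0]))]
```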
Motif Logo
bioalgorithms.info
Frequencies affect logo size
Size of each letter indicates its frequency at that position relative to the other bases
Size indicates confidence of letter
Consensus Scoring
Use equation similar to log likelihood called Information Content
Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577.
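Information content as defined by Hertz and Stormo:
$$I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i}$$
– L: number of columns in the matrix
– A = {A, C, G, T}: the alphabet
– f_{ij}: frequency of each letter i at each position j
– p_i: a priori probability of letter i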
Our implementation substitutes the a priori probability with a context-dependent probability based on the Markov Model (the probability of each letter given the letters preceding it)
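As a rough illustration of the Information Content score, the sketch below evaluates it for the PSFM shown earlier in the deck. It assumes a fixed a priori probability per base (uniform 0.25 by default) rather than the Markov-based substitution described above, and the function name `information_content` is hypothetical.

```python
import math

def information_content(psfm, background=None):
    """Sum over columns j and bases i of f_ij * ln(f_ij / p_i); zero frequencies add 0."""
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform prior
    width = len(psfm["A"])
    total = 0.0
    for j in range(width):
        for base in "ACGT":
            f = psfm[base][j]
            if f > 0:                      # f * ln(f) tends to 0 as f -> 0
                total += f * math.log(f / background[base])
    return total

# PSFM copied from the Position Specific Frequency Matrix slide.
example_psfm = {
    "A": [0.0, 0.0, 0.6, 0.0, 0.4, 0.0, 1.0],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "G": [0.0, 1.0, 0.4, 1.0, 0.6, 1.0, 0.0],
    "T": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
example_ic = information_content(example_psfm)
print(round(example_ic, 3))
```

Perfectly conserved columns contribute ln 4 each under a uniform background, so degenerate columns lower the total.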
cisGreedy
Input sequences are analyzed – possibly to establish a background model
– Background models are used to filter out noise
Randomly select 2 sequences to be compared
cisGreedy
The two selected sequences are independently analyzed
Windows of motif size are scanned starting at the beginning of each sequence
cisGreedy
Sequences are scanned in an attempt to locate the highest scoring alignment
– Alignments are ungapped
– Score is the Information Content
cisGreedy
Reverse Complements are analyzed (unless specified otherwise)
Once start locations are established with a top alignment score, these are left unchanged
cisGreedy
Select an additional sequence in which to identify the location of the motif
Windows of the additional sequence are aligned to the previously established windows (hence Greedy)
cisGreedy
Additional sequence is scanned as before, including its reverse complement (unless otherwise specified)
Alignment score established as before
cisGreedy
Final motif locations are taken in order to build position specific frequency matrices
If a motif occurrence was found on the reverse complement strand, the reverse complement sequence is used in building the PSFM
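The procedure on the cisGreedy slides can be sketched as follows. This is a simplified illustration under stated assumptions, not the Cistematic implementation: the first two input sequences act as the starting pair instead of a random choice, the score uses a uniform background, only a single pass is made, and the first sequence is kept on the forward strand so that the orientation of the motif is fixed. All function names are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(word):
    """Reverse complement of an uppercase DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def information_content(sites, background=0.25):
    """Information content of an ungapped alignment of equal-length sites."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base in set(column):
            f = column.count(base) / n
            total += f * math.log(f / background)
    return total

def windows(seq, width, use_revcomp=True):
    """Yield (start, strand, word) for every window, optionally on both strands."""
    for start in range(len(seq) - width + 1):
        word = seq[start:start + width]
        yield start, "+", word
        if use_revcomp:
            yield start, "-", revcomp(word)

def greedy_motif(seqs, width, use_revcomp=True):
    """Pick one window per sequence, fixing earlier choices as sequences are added."""
    first, second, *rest = seqs
    best_score, chosen = None, None
    # Step 1: exhaustively score every pair of windows from the two starting sequences.
    for a in windows(first, width, use_revcomp=False):
        for b in windows(second, width, use_revcomp):
            score = information_content([a[2], b[2]])
            if best_score is None or score > best_score:
                best_score, chosen = score, [a, b]
    # Step 2: add each remaining sequence, leaving earlier windows unchanged.
    for seq in rest:
        aligned = [w[2] for w in chosen]
        chosen.append(max(windows(seq, width, use_revcomp),
                          key=lambda w: information_content(aligned + [w[2]])))
    return chosen

# Toy input: the third sequence carries the motif CAGC on its reverse strand.
result = greedy_motif(["TTTCAGCTTT", "AACAGCAA", "GGGCTGGG"], 4)
print(result)
```

Because earlier windows are never revisited, the result depends on which sequences are scanned first, which is why the slides describe repeating the search over multiple iterations.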
cisGreedy User Input
Sequence input
Motif size (may be a range)
Number of motifs cisGreedy should find
Iterations to perform at each step before selecting a motif
Background model
Markov Model Size
Reverse complement – whether to include it
May designate which sequences will be “founder” sequences – select homologs
Designate percent identity between founder sequences
cisGreedy Output
Multiple motifs represented as PWMs or PSFMs
Motifs represented as symbols.
– Basic nucleotides represented by respective symbols (A-adenine, etc)
– Remaining symbols may require threshold
Nucleotides Symbol
AC M
AG R
AT W
CG S
CT Y
GT K
ACG V
ACT H
AGT D
CGT B
ACGT N
Symbol Example
A .75 .25 0 1 0 .25 0 1 0 0 1
C 0 0 .75 0 0 .75 0 0 .25 .25 0
G .25 .75 .25 0 1 0 1 0 .75 .75 0
T 0 0 0 0 0 0 0 0 0 0 0
R R S A G M G A S S A
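The symbol row in this example can be reproduced with a short sketch. The threshold of 0.25 (a base is included when its frequency reaches it) is an assumption chosen so the output matches the slide; the deck does not state the value cisGreedy uses. The function name `psfm_to_symbols` is hypothetical, and the lookup table follows the symbol table above.

```python
# Ambiguity symbols keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def psfm_to_symbols(psfm, threshold=0.25):
    """Translate each PSFM column into one symbol (threshold is an assumed cutoff)."""
    width = len(psfm["A"])
    symbols = []
    for j in range(width):
        present = "".join(b for b in "ACGT" if psfm[b][j] >= threshold)
        symbols.append(IUPAC.get(present, "N"))  # no base passes the cutoff: use N
    return "".join(symbols)

# Matrix copied from the Symbol Example slide.
symbol_psfm = {
    "A": [.75, .25, 0, 1, 0, .25, 0, 1, 0, 0, 1],
    "C": [0, 0, .75, 0, 0, .75, 0, 0, .25, .25, 0],
    "G": [.25, .75, .25, 0, 1, 0, 1, 0, .75, .75, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
symbols = psfm_to_symbols(symbol_psfm)
print(symbols)
```

Raising the threshold above 0.25 would drop the minor base in mixed columns, so the first column would read A instead of R.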
cisGreedy - Optimization
ZOOPS - Zero or One Occurrence per Sequence
– If no good motifs are identified in a sequence, it is removed
– If subsequence’s p-value is not greater than the average p-value
Background model (default Markov3 model)
– Can be input (Ex: C/G-rich regions)
– Markov model can be up to Markov6 (unreasonable for input sequences of a certain size)
Find multiple motifs
– Mask each motif after identification (windows cannot be reused)
Allow for ranges of motif lengths
Perform multiple iterations before choosing a motif
– Avoid local maxima
cisGreedy Markov3 Background Model
Collection of all 4-mers with corresponding frequency of word in input sequences
Use 4-mer frequencies in order to describe the p-value of the last nucleotide in the 4-mer
– Nucleotide p-value is not independent
Probability of any sequence is the product of the probabilities of the nucleotides which make up that sequence
A word is deemed significant if its probability is less than the average of all words of the same size in the background model
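A minimal sketch of a Markov3 background model of this kind, assuming the model is trained by counting 4-mers in the input. The pseudocount smoothing and the choice to skip the first three bases of a word (which lack a full 3-mer context) are assumptions made here for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter

def build_markov_model(sequences, order=3):
    """Count (order+1)-mers and their order-long prefixes in the input sequences."""
    kmer_counts, context_counts = Counter(), Counter()
    k = order + 1
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            kmer_counts[kmer] += 1
            context_counts[kmer[:-1]] += 1
    return kmer_counts, context_counts

def conditional_probability(kmer, model, pseudocount=1.0):
    """P(last base | preceding bases), smoothed so unseen 4-mers are not zero."""
    kmer_counts, context_counts = model
    return (kmer_counts[kmer] + pseudocount) / (context_counts[kmer[:-1]] + 4 * pseudocount)

def word_neg_log_probability(word, model, order=3):
    """-ln P(word) as the sum of -ln conditional probabilities along the word."""
    k = order + 1
    return -sum(
        math.log(conditional_probability(word[i:i + k], model))
        for i in range(len(word) - k + 1)
    )

model = build_markov_model(["ACGTACGTACGT"])
p_acgt = conditional_probability("ACGT", model)   # T after ACG: (3 + 1) / (3 + 4)
nlp = word_neg_log_probability("ACGTA", model)    # -ln(4/7 * 3/6)
print(p_acgt, nlp)
```

Working with -ln probabilities, as the plots in this deck do, turns the product over nucleotides into a sum and avoids underflow for long words.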
cisGreedy Markov3 - Example
Each word has a probability associated with it
– Probability of seeing the word based on its frequency in the model
5.3 * 10^-12
– Describes probability of seeing letter in the last position based on the 3-mer preceding it
5.3 * 10^-8
Calculating probability of a word
[Figure: Calculation of word probability based on Markov Models. x-axis: Sequence; y-axis: -ln probability of subsequence]
1kb upstream region of a yeast gene nucleotide distribution
[Figure: Distribution of nucleotide probabilities based on Markov Model. x-axis: Sequence Position (Upstream from Transcription Start Site); y-axis: -ln probability of nucleotide]
1kb upstream region of a yeast gene word probabilities
[Figure, repeated on three slides: -ln probabilities of all words based on Markov Model. x-axis: Sequence Position (Upstream from Transcription Start Site); y-axis: -ln probability of sequence]
1kb upstream region of a yeast gene Motif probability
[Figure: -ln probabilities based on Markov Model of all words within 50 nucleotides of a known yeast motif. x-axis: Sequence Position (Upstream from Transcription Start Site); y-axis: -ln probability of sequence]
Probability of a motif
Probabilities of seeing a motif given a background should be lower
– Chance of seeing the word at random should be low
A motif will not have an extremely low probability as it should be seen multiple times in a data set for it to be identified
MSP Results
Testing using nematode data
– C. elegans and C. briggsae
– Major Sperm Protein (MSP)
Cytoskeletal element required for motility of nematode spermatozoa
Multiple genes in genomes
Co-regulated
MSP Results – MEME motifs
Motifs identified by MEME plotted on input sequences
– Total 10 motifs identified (not all plotted)
MSP Results – cisGreedy motifs
Motifs identified by cisGreedy plotted on input sequences
– Total 10 motifs identified (not all plotted)
Future goals
Test cisGreedy with the dataset used in the paper assessing available motif finding tools
– Make adjustments to improve results
Build upon cisGreedy to make more complex algorithms - Weeder?
Additional motif finders based on different theories
– Gibbs Sampler
– Expectation maximization
References
Bioalgorithms.info
Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577.
Tompa, Martin et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005: 137-144.
http://cistematic.caltech.edu