cisGreedy Motif Finder for Cistematic

Sarah Aerni

Mentors: Ali Mortazavi, Barbara Wold

cisGreedy

Algorithm similar to Consensus motif finder

– Greedy method over multiple iterations

– De novo motif finder based on input values

Implemented in Cistematic package using Python

Goal: To provide an efficient Greedy algorithm to be included in the Cistematic package that performs similarly to Consensus

Cistematic

One motif finder is generally insufficient

Further automated analysis performed to refine motifs

Enhances motif finder performance through additional steps

Image:Ali Mortazavi

Cistematic

Image:Ali Mortazavi

cisGreedy becomes part of “Bottom Tier”

Offers an alternative to downloading Consensus software

– Additional motif finders will be made available

What is a Motif?

cis-Regulatory elements

– Transcription Factor Binding Sites (TFBS)

– Binding by transcription factors may increase or decrease transcription of genes

Gene regulation is believed to be a major source of complexity

– Plants may have more genes or larger genomes than humans – are they more complex?

Multiple Products from One Gene

Other methods to increase complexity

– Polyadenylation: different “endings” available

– Alternative splicing: many more cDNAs

– Methylation

Identification of cis-regulatory elements will help us understand gene regulatory networks

Motif Finding in DNA Sequences

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

5 sample sequences


cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca


But motifs are rarely conserved to such a degree


cctgatagacgctatctggctatccaTgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacTtaGgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgAacgAgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacCtCcgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaaGgtGcgtc

Motifs are less discernible without 100% identity


cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttCcaaccat

agtactggtgtAcAtttGatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaAAAtttt

agcctccgatgtaagtcatagctgtaactattacctgccacCcCtAttacatcttacgtacgtataca


Other subsequences which are not motifs may appear more conserved

– Filtering out noise becomes challenging!

Motifs are degenerate

Only certain positions need to be specified

– Binding sites for different control elements may overlap – more complex regulation

Often use Position Specific Frequency Matrix (PSFM) where each nucleotide is represented as a fraction - columns add to 1

Also represented by “Motif Logo”

How do we find motifs?

Hard to identify

– Relatively short sequences

– Many positions not well conserved

Factors improving identification

– Usually localized in certain proximity of a gene (search within 3 kb upstream)

– Some positions highly conserved

– Use other data (Microarray?)

Motif Finders

Greedy

– Maximizes similarity of motifs from sequences through a greedy approach

Gibbs Sampling

– Attempts to find best motifs using a combination of probability and scores to avoid being trapped in local maxima

Expectation Maximization

Consensus Score

Determine number of occurrences of each base at each position

Consensus Score


At every index, the occurrence counts of the four nucleotides must sum to the total number of sequences included

Consensus Score


Identify the most common base at each position – Consensus Sequence

Sum the occurrence count of the consensus base at each index to determine the Consensus Score

Consensus Sequence

Consensus Score = 31
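The counting steps above can be sketched in a few lines of Python. This is only an illustration, not the Cistematic code: the function name `consensus_and_score` is hypothetical, and the input is the set of five aligned 7-mers shown on the Position Specific Frequency Matrix slide, which also happen to give a score of 31.

```python
from collections import Counter

def consensus_and_score(aligned):
    """Return (consensus sequence, consensus score) for equal-length aligned sites."""
    consensus = []
    score = 0
    # zip(*aligned) walks the alignment one column (index) at a time
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]  # most common base and its count
        consensus.append(base)
        score += count
    return "".join(consensus), score

sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
print(consensus_and_score(sites))  # ('TGAGGGA', 31)
```

Ties between equally common bases are broken arbitrarily here; a real implementation would need an explicit rule.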

Position Specific Frequency Matrix

T G G G G G A

T G A G A G A

T G G G G G A

T G A G A G A

T G A G G G A

A 0.0 0.0 0.6 0.0 0.4 0.0 1.0

C 0.0 0.0 0.0 0.0 0.0 0.0 0.0

G 0.0 1.0 0.4 1.0 0.6 1.0 0.0

T 1.0 0.0 0.0 0.0 0.0 0.0 0.0

Frequencies are the number of each base at every position divided by the total number of sequences

Sum for each column is 1 (at least one base must occur)
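A minimal sketch of building the matrix above from the five aligned sites. The helper name `build_psfm` and the dict-of-lists layout are assumptions made for illustration, not the Cistematic data structure.

```python
def build_psfm(aligned, alphabet="ACGT"):
    """Return {base: [frequency per column]} for equal-length aligned sites."""
    n = len(aligned)
    width = len(aligned[0])
    counts = {base: [0] * width for base in alphabet}
    for site in aligned:
        for j, base in enumerate(site):
            counts[base][j] += 1
    # Divide each count by the number of sequences to get frequencies
    return {base: [c / n for c in row] for base, row in counts.items()}

sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
psfm = build_psfm(sites)
for base in "ACGT":
    print(base, psfm[base])
```

Each column sums to 1 because every site contributes exactly one base per position.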

Motif Logo

bioalgorithms.info

Frequencies affect logo size

Size of a letter indicates its frequency of occurrence at that position across the aligned sequences

Size indicates confidence of letter

Consensus Scoring

Use an equation similar to a log likelihood, called Information Content

$I = \sum_{j=1}^{L} \sum_{i \in A} f_{i,j} \ln \frac{f_{i,j}}{p_i}$

– $L$: number of columns in the matrix

– $A = \{A, C, G, T\}$

– $f_{i,j}$: frequency of each letter $i$ at each position $j$

– $p_i$: a priori probability of letter $i$

Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15(7-8), 1999: 563-577.

Our implementation substitutes the a priori probability with a specific dependent probability based on the Markov Model
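As a rough sketch, Information Content can be computed as the sum, over every column j and letter i, of f_ij * ln(f_ij / p_i), the form given by Hertz and Stormo. The function name, the dict-of-lists PSFM layout, and the uniform default background are assumptions for illustration; the Markov-based substitution described above is not modeled here.

```python
import math

def information_content(psfm, background=None):
    """psfm: {base: [frequency per column]}; background: {base: a priori probability}."""
    bases = list(psfm)
    if background is None:
        # Assumption: uniform a priori probabilities when none are supplied
        background = {b: 1.0 / len(bases) for b in bases}
    width = len(psfm[bases[0]])
    total = 0.0
    for j in range(width):
        for b in bases:
            f = psfm[b][j]
            if f > 0:  # the term for f = 0 is taken to be 0
                total += f * math.log(f / background[b])
    return total

conserved = {"A": [1.0, 0.0], "C": [0.0, 0.0], "G": [0.0, 1.0], "T": [0.0, 0.0]}
print(information_content(conserved))  # two fully conserved columns: 2 * ln(4)
```

A column whose frequencies equal the background contributes 0, so only positions that differ from the background raise the score.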

cisGreedy

Input sequences are analyzed – possibly establish background

– Background models are used to filter out noise

Randomly select 2 sequences to be compared

cisGreedy

The two selected sequences are independently analyzed

Windows of motif size are scanned starting at the beginning of each sequence

cisGreedy

Sequences are scanned in an attempt to locate the highest scoring alignment

– Alignments are ungapped

– Score is the Information Content
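The scan over two sequences might look like the sketch below. It is an exhaustive search over every pair of windows, scored by Information Content with a uniform background of 0.25; both function names and the uniform background are assumptions, and the real cisGreedy scan may be organized differently.

```python
import math
from collections import Counter

def ic_of_sites(sites, p=0.25):
    """Information Content of an ungapped alignment, assuming a uniform background p."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for count in Counter(column).values():
            f = count / n
            total += f * math.log(f / p)
    return total

def best_window_pair(seq_a, seq_b, width):
    """Score every pair of windows and return (score, start_a, start_b) for the best one."""
    best = None
    for i in range(len(seq_a) - width + 1):
        for j in range(len(seq_b) - width + 1):
            score = ic_of_sites([seq_a[i:i + width], seq_b[j:j + width]])
            if best is None or score > best[0]:
                best = (score, i, j)
    return best

# Toy input: lowercase flanks only make the planted uppercase site easy to see
print(best_window_pair("ccccACGTACGTcccc", "ggACGTACGTgggggg", 8))
```

With only two sequences, the best score simply goes to the window pair with the most matching positions.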

cisGreedy

Reverse Complements are analyzed (unless specified otherwise)

Once start locations are established with a top alignment score, these are left unchanged
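A small sketch of how reverse complements could be included in the scan. The helper names and the `use_reverse_complement` flag are hypothetical stand-ins for the "unless specified otherwise" option.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def candidate_windows(seq, width, use_reverse_complement=True):
    """Yield (start, strand, window) for the forward strand and, optionally, its reverse complement."""
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, "+", window
        if use_reverse_complement:
            yield start, "-", reverse_complement(window)

print(reverse_complement("ACGTT"))  # AACGT
print(list(candidate_windows("ACGT", 3)))
```

Each start location therefore contributes up to two candidate windows to be scored against the existing alignment.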

cisGreedy

Select an additional sequence in which to identify the location of the motif

Windows from the additional sequence are aligned to the previously established windows (hence Greedy)

cisGreedy

Additional sequence is scanned as before, including its reverse complement (unless otherwise specified)

Alignment score established as before

cisGreedy

Final motif locations are taken in order to build position specific frequency matrices

If a motif location was found on the reverse complement, the reverse complement sequence is used in building the PSFM
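The greedy extension step can be sketched as follows: windows already chosen stay fixed, and each additional sequence contributes the one window that gives the highest alignment score. The names are hypothetical, the background is assumed uniform, and reverse complements are left out to keep the sketch short.

```python
import math
from collections import Counter

def ic_of_sites(sites, p=0.25):
    """Information Content of an ungapped alignment, assuming a uniform background p."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for count in Counter(column).values():
            f = count / n
            total += f * math.log(f / p)
    return total

def greedy_extend(chosen_sites, remaining_seqs, width):
    """Add one window per remaining sequence without revisiting earlier choices."""
    sites = list(chosen_sites)
    for seq in remaining_seqs:
        best_score, best_window = None, None
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = ic_of_sites(sites + [window])
            if best_score is None or score > best_score:
                best_score, best_window = score, window
        sites.append(best_window)
    return sites

# Toy input: lowercase flanks only make the planted uppercase sites easy to see
founders = ["ACGTACGT", "ACGTACGT"]
print(greedy_extend(founders, ["ttACGTACGTtt", "gggggACGAACGTg"], 8))
```

Because earlier windows are never reconsidered, the result depends on the order in which sequences are added, which is one reason to run multiple iterations.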

cisGreedy User Input

Sequence input

Motif size (may be a range)

Number of motifs cisGreedy should find

Iterations to perform at each step before selecting a motif

Background model

Markov Model Size

Reverse complement – whether to include it

May designate which sequences will be “founder” sequences – select homologs

Designate percent identity between founder sequences

cisGreedy Output

Multiple motifs represented as PWMs or PSFMs

Motifs represented as symbols.

– Basic nucleotides represented by respective symbols (A-adenine, etc)

– Remaining symbols may require threshold

Nucleotides Symbol

AC M

AG R

AT W

CG S

CT Y

GT K

ACG V

ACT H

AGT D

CGT B

ACGT N

Symbol Example

A .75 .25 0 1 0 .25 0 1 0 0 1

C 0 0 .75 0 0 .75 0 0 .25 .25 0

G .25 .75 .25 0 1 0 1 0 .75 .75 0

T 0 0 0 0 0 0 0 0 0 0 0

R R S A G M G A S S A
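The symbol row above can be reproduced with a short sketch. The cutoff of 0.25 (a base counts as present when its frequency is at least 0.25) is an assumption chosen because it matches this example; the function name is hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def psfm_to_symbols(psfm, threshold=0.25):
    """Map each PSFM column to the symbol for the set of bases at or above the threshold."""
    width = len(psfm["A"])
    symbols = []
    for j in range(width):
        present = "".join(b for b in "ACGT" if psfm[b][j] >= threshold)
        symbols.append(IUPAC[present])
    return "".join(symbols)

example = {
    "A": [.75, .25, 0, 1, 0, .25, 0, 1, 0, 0, 1],
    "C": [0, 0, .75, 0, 0, .75, 0, 0, .25, .25, 0],
    "G": [.25, .75, .25, 0, 1, 0, 1, 0, .75, .75, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
print(psfm_to_symbols(example))  # RRSAGMGASSA
```

A threshold above 0.25 could leave a column with no base selected, which this sketch does not handle.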

cisGreedy - Optimization

ZOOPS - Zero or One Occurrence per Sequence

– If no good motifs identified in a sequence it is removed

– If subsequence’s P-value is not greater than the average P-value

Background model (default Markov3 model)

– can be input (Ex: C/G-rich regions)

– Markov model can be up to Markov6 (unreasonable for input sequences of a certain size)

Find multiple Motifs

– mask each motif after identification (windows cannot be reused)

Allow for ranges of motif lengths

Perform multiple iterations before choosing a motif

– Avoid local maxima

cisGreedy Markov3 Background Model

Collection of all 4-mers with corresponding frequency of word in input sequences

Use 4-mer frequencies in order to describe P-value of last nucleotide in the 4-mer

– Nucleotide P-value not independent

Probability of any sequence is the product of the probability of each nucleotide which make up that sequence

A word is deemed significant if its probability is less than the average of all words of the same size in the background model
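A minimal sketch of a Markov3 background, assuming the model is trained by counting 4-mers in the input sequences. The function names, the pseudocount, and the choice to skip the first three bases of a word (which have no full 3-mer context) are illustrative assumptions, not the Cistematic implementation.

```python
import math
from collections import Counter

def train_markov(seqs, order=3):
    """Count (order+1)-mers and their order-long prefixes in the input sequences."""
    kmers, contexts = Counter(), Counter()
    k = order + 1
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            kmers[word] += 1
            contexts[word[:-1]] += 1
    return kmers, contexts

def neg_ln_probability(word, kmers, contexts, order=3, pseudo=1.0):
    """-ln P(word): each base is conditioned on the `order` bases preceding it."""
    total = 0.0
    for i in range(order, len(word)):
        kmer = word[i - order:i + 1]
        # Pseudocount (an assumption) keeps unseen 4-mers from giving probability 0
        p = (kmers[kmer] + pseudo) / (contexts[kmer[:-1]] + 4 * pseudo)
        total -= math.log(p)
    return total

kmers, contexts = train_markov(["ACGTACGTACGT"])
print(neg_ln_probability("ACGT", kmers, contexts))  # common word: small -ln p
print(neg_ln_probability("ACGA", kmers, contexts))  # unseen word: larger -ln p
```

Summing -ln of the per-nucleotide probabilities is the same as taking -ln of their product, which avoids underflow for long words.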

cisGreedy Markov3 - Example

Each word has a probability associated with it

– Probability of seeing the word based on its frequency in the model

5.3 * 10^-12

cisGreedy Markov3 - Example

Each word has a probability associated with it

– Probability of seeing the word based on its frequency in the model

– Describes probability of seeing letter in the last position based on the 3-mer preceding it

5.3 * 10^-8

Calculating probability of a word

Calculation of word probability based on Markov Models

[Figure: -ln probability of subsequence (y-axis) vs. Sequence (x-axis)]

1kb upstream region of a yeast gene nucleotide distribution

Distribution of nucleotide probabilities based on Markov Model

[Figure: -ln probability of nucleotide (y-axis) vs. Sequence Position, Upstream from Transcription Start Site (x-axis)]

1kb upstream region of a yeast gene word probabilities

-ln probabilities of all words based on Markov Model

[Figure: -ln probability of sequence (y-axis)]


1kb upstream region of a yeast gene Motif probability

-ln probabilities based on Markov Model of all words within 50 nucleotides of a known Yeast motif

[Figure: -ln probability of sequence (y-axis)]

Probability of a motif

Probabilities of seeing a motif given a background should be lower

– Chance of seeing the word at random should be low

A motif will not have an extremely low probability as it should be seen multiple times in a data set for it to be identified

MSP Results

Testing using nematode data

– C. elegans and C. briggsae

– Major Sperm Protein (MSP)

Cytoskeletal element required for mobility of nematode spermatozoa

Multiple genes in genomes

Co-regulated

MSP Results – cisGreedy motifs

Motifs represented by symbols identified by MEME


Motifs represented by symbols identified by cisGreedy

MSP Results – MEME motifs

Motifs identified by MEME plotted on input sequences

– Total 10 motifs identified (not all plotted)


Motifs identified by cisGreedy plotted on input sequences

– Total 10 motifs identified (not all plotted)

Future goals

Test cisGreedy with dataset used in paper analyzing available motif finding tools

Make adjustments to improve results

Build upon cisGreedy to make more complex algorithms - Weeder?

Additional motif finders based on different theories

– Gibbs Sampler

– Expectation maximization

References

Bioalgorithms.info

Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.

Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15(7-8), 1999: 563-577.

Tompa, Martin et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005: 137-144.

http://cistematic.caltech.edu

Acknowledgements

Ali Mortazavi

Barbara Wold

Wold Lab funding provided by DOE & NASA

Additional funding by NSF & NIH

SoCalBSI faculty, staff and fellow students
