finding regulatory motifs in dna sequences an introduction to bioinformatics algorithms (jones and...
TRANSCRIPT
Finding Regulatory Motifs in DNA Sequences
An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info
Combinatorial Gene Regulation
A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed
How can one gene have such drastic effects?

Regulatory Proteins
Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF)
The 20 unexpressed genes rely on gene X’s TF to induce transcription
A single TF may regulate multiple genes

Regulatory Regions
Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site
Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor
TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region, the TFBS

Transcription Factor Binding Sites
A TFBS can be located anywhere within the Regulatory Region.
TFBS may vary slightly across different regulatory regions since non-essential bases could mutate

Motifs and Transcriptional Start Sites
geneATCCCG
geneTTCCGG
geneATCCCG
geneATGCCG
geneATGCCC
Transcription Factors and Motifs
Motif Logo
Motifs can mutate on non-important bases
The five motifs in five different genes have mutations in positions 3 and 5
Representations called motif logos illustrate the conserved and variable regions of a motif
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA

Motif Logos: An Example
Information content at site x: $I_x = 2 + \sum_i p_i \log_2 p_i$, where $p_i$ is the frequency of base $i$ at site $x$
Examples:
For one nucleotide at a site: $I_x = 2 + 1 \cdot \log_2 1 = 2$ bits
For two equally frequent nucleotides at a site: $I_x = 2 + \frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2} = 1$ bit
For four equally frequent nucleotides at a site: $I_x = 2 + 4 \left(\frac{1}{4}\log_2\frac{1}{4}\right) = 0$ bits
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

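The information content formula can be checked numerically. The following is a minimal sketch, assuming the logarithm is base 2 (as the worked examples imply) and ignoring any small-sample correction; the function name `information_content` is illustrative and not part of the slides. It is applied to the five motif instances from the Motif Logo slide.

```python
import math
from collections import Counter

def information_content(column):
    """Return I_x = 2 + sum_i p_i * log2(p_i) for one alignment column of DNA bases."""
    counts = Counter(column.upper())
    total = sum(counts.values())
    # Bases that do not occur contribute nothing, so only observed bases are summed.
    return 2.0 + sum((c / total) * math.log2(c / total) for c in counts.values())

# The five motif instances from the Motif Logo slide.
motifs = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]

# Transpose the alignment into columns, one string per site.
columns = ["".join(m[i] for m in motifs) for i in range(len(motifs[0]))]
bits = [information_content(col) for col in columns]

for site, value in enumerate(bits, start=1):
    print(f"site {site}: {value:.3f} bits")
```

Fully conserved sites score 2 bits, while the variable sites 3 and 5 score lower, which is what a logo draws as shorter letter stacks.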
Identifying Motifs
Genes are turned on or off by regulatory proteins
These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase
Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)
So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes

Identifying Motifs: Complications
We do not know the motif sequence
We do not know where it is located relative to the gene’s start
Motifs can differ slightly from one gene to the next
How to discern it from “random” motifs?

Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Why Is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG

Challenge Problem
Find a motif in a sample of
- 20 “random” sequences (e.g. 600 nt long)
- each sequence containing an implanted pattern of length 15,
- each pattern appearing with 4 mismatches as (15,4)-motif.

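A test sample for the challenge problem could be generated along the following lines. This is only a sketch under stated assumptions: bases are drawn uniformly at random, each mutated copy overwrites 15 background bases (as in the implanting slides, where sequence length is unchanged), and the names `mutate` and `implant` and the fixed seed are hypothetical choices, not something the slides specify.

```python
import random

def mutate(pattern, d, rng):
    """Return a copy of pattern with exactly d positions changed to a different base."""
    chars = list(pattern)
    for pos in rng.sample(range(len(pattern)), d):
        chars[pos] = rng.choice([b for b in "ACGT" if b != chars[pos]])
    return "".join(chars)

def implant(pattern, d, t=20, n=600, seed=0):
    """Build t random sequences of length n, each hiding one mutated copy of pattern."""
    rng = random.Random(seed)
    sample, starts = [], []
    for _ in range(t):
        background = [rng.choice("ACGT") for _ in range(n)]
        start = rng.randrange(n - len(pattern) + 1)
        # Overwrite part of the background so the sequence length stays n.
        background[start:start + len(pattern)] = mutate(pattern, d, rng)
        sample.append("".join(background))
        starts.append(start)
    return sample, starts

sample, starts = implant("AAAAAAAAGGGGGGG", d=4)
```

The returned `starts` list records where each instance was hidden, so a motif finder's answer can be checked against it.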
A Motif Finding Analogy
The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809 – 1849) in his Gold Bug story

The Gold Bug Problem
Given a secret message:
53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;
46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6
!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3
4;48)4+;161;:188;+?;
Decipher the message encrypted in the fragment

Hints for The Gold Bug Problem
Additional hints:
The encrypted message is in English
Each symbol corresponds to one letter in the English alphabet
No punctuation marks are encoded

The Gold Bug Problem: Symbol Counts
Naive approach to solving the problem:
Count the frequency of each symbol in the encrypted message
Find the frequency of each letter in the alphabet in the English language
Compare the frequencies of the previous steps, try to find a correlation and map the symbols to a letter in the alphabet

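The naive approach can be sketched in a few lines. The ciphertext below is transcribed from the slide with line breaks and spaces removed, and the English letter ordering is the one given on the frequency slide; the variable names are illustrative. Symbols with equal counts are ranked by first appearance, which is an arbitrary tie-break, so the decoded text may differ in detail from the one shown in the slides.

```python
from collections import Counter

# Ciphertext as shown on the slide, joined into one string.
cipher = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)
# English letters from most to least frequent, as listed in the slides.
english_order = "etaoinsrhldcumfpgwybvkxjqz"

# Step 1: count each symbol in the encrypted message.
symbol_counts = Counter(cipher)

# Steps 2 and 3: pair the i-th most frequent symbol with the i-th most frequent letter.
ranked = [symbol for symbol, _ in symbol_counts.most_common()]
naive_key = dict(zip(ranked, english_order))
first_attempt = "".join(naive_key[c] for c in cipher)

print(symbol_counts.most_common(5))
print(first_attempt)
```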
Symbol Frequencies in the Gold Bug Message
Gold Bug Message:
Symbol     8  ;  4  )  +  *  5  6  (  !  1  0  2  9  3  :  ?  `  -  ]  .
Frequency 34 25 19 16 15 14 12 11  9  8  7  6  5  5  4  4  3  2  1  1  1
English Language (most frequent to least frequent):
e t a o i n s r h l d c u m f p g w y b v k x j q z

The Gold Bug Message Decoding: First Attempt
By simply mapping the most frequent symbols to the most frequent letters of the alphabet:
sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt
arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorl
eoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfa
taeoaitdrdtpdeetiwt
The result does not make sense

The Gold Bug Problem: l-tuple count
A better approach:
Examine frequencies of l-tuples, combinations of 2 symbols, 3 symbols, etc.
“The” is the most frequent 3-tuple in English and “;48” is the most frequent 3-tuple in the encrypted text
Make inferences of unknown symbols by examining other frequent l-tuples

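Counting l-tuples is a sliding-window count over the message. The sketch below assumes the ciphertext as transcribed from the slide (joined into one string, so tuples spanning the slide's line breaks are counted too); `ltuple_counts` is a hypothetical helper name.

```python
from collections import Counter

cipher = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)

def ltuple_counts(text, l):
    """Count every window of l consecutive symbols in text (overlapping windows included)."""
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))

top3 = ltuple_counts(cipher, 3).most_common(1)[0]
print(top3)  # the most frequent 3-tuple and its count
```

The same function with `l=2` gives the 2-tuple counts that can guide further inferences.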
The Gold Bug Problem: the ;48 clue
Mapping “;48” to “the” and substituting all occurrences of the symbols:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e
)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht
he)h+t161t:1eet+?t

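The substitution step is a partial character mapping, which could be sketched with Python's built-in `str.maketrans` and `str.translate`. The ciphertext is again the slide's message joined into one string; symbols without a known mapping are left unchanged.

```python
cipher = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)

# Known so far: ";48" stands for "the".
clue = str.maketrans({";": "t", "4": "h", "8": "e"})
partial = cipher.translate(clue)
print(partial)

# Each further inference extends the key, e.g. "(" = "r" from "thet(ee" = "the tree".
clue_with_r = str.maketrans({";": "t", "4": "h", "8": "e", "(": "r"})
print(cipher.translate(clue_with_r))
```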
The Gold Bug Message Decoding: Second Attempt
Make inferences:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e
)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht
he)h+t161t:1eet+?t
“thet(ee” most likely means “the tree”
Infer “(” = “r”
“th(+?3h” becomes “thr+?3h”
Can we guess “+”, “?”, and “3”?

The Gold Bug Problem: The Solution
After figuring out all the mappings, the final message is:
AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGRE
ESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENT
HLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINE
FROMTHETREETHROUGHTHESHOTFIFTYFEETOUT

The Solution (cont’d)
Punctuation is important:
A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S SEA,
TWENY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH,
MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OF
THE DEATH’S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT,
FIFTY FEET OUT.

Solving the Gold Bug Problem
Prerequisites to solve the problem:
Need to know the relative frequencies of single letters, and combinations of two and three letters in English
Knowledge of all the words in the English dictionary is highly desired to make accurate inferences

Motif Finding and The Gold Bug Problem: Similarities
Nucleotides in motifs encode for a message in the “genetic” language. Symbols in “The Gold Bug” encode for a message in English
In order to solve the problem, we analyze the frequencies of patterns in DNA/Gold Bug message.
Knowledge of established regulatory motifs makes the Motif Finding problem simpler. Knowledge of the words in the English dictionary helps to solve the Gold Bug problem.

Similarities (cont’d)
Gold Bug Problem:
In order to solve the problem, we analyze the frequencies of patterns in the text written in English
Motif Finding:
In order to solve the problem, we analyze the frequencies of patterns in the nucleotide sequences

Similarities (cont’d)
Gold Bug Problem:
Knowledge of the words in the dictionary is highly desirable
Motif Finding:
Knowledge of established motifs reduces the complexity of the problem

Motif Finding and The Gold Bug Problem: Differences
Motif Finding is harder than Gold Bug problem:
We don’t have the complete dictionary of motifs
The “genetic” language does not have a standard “grammar”
Only a small fraction of nucleotide sequences encode for motifs; the size of data is enormous

The Motif Finding Problem
Given a random sample of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Find the pattern that is implanted in each of the individual sequences, namely, the motif

The Motif Finding Problem (cont’d)
Additional information:
The hidden sequence is of length 8
The pattern is not exactly the same in each sequence because random point mutations may occur in the sequences

The Motif Finding Problem (cont’d)
The patterns revealed with no mutations (each pattern set off by spaces):
cctgatagacgctatctggctatcc acgtacgt aggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgat acgtacgt acaccggcaacctgaaacaaacgctcagaaccagaagtgc
aa acgtacgt gcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatctt acgtacgt ataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgtta acgtacgt c
Consensus String: acgtacgt

The Motif Finding Problem (cont’d)
The patterns with 2 point mutations (each pattern set off by spaces, mutated bases in upper case):
cctgatagacgctatctggctatcc aGgtacTt aggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgat CcAtacgt acaccggcaacctgaaacaaacgctcagaaccagaagtgc
aa acgtTAgt gcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatctt acgtCcAt ataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgtta CcgtacgG c
Can we still find the motif, now that we have 2 mutations?
Defining Motifs
To define a motif, let’s say we know where the motif starts in the sequence
The motif start positions in their sequences can be represented as $s = (s_1, s_2, s_3, \ldots, s_t)$

Motifs: Profiles and Consensus
             a G g t a c T t
             C c A t a c g t
Alignment    a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
          A  3 0 1 0 3 1 1 0
Profile   C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _______________
Consensus    A C G T A C G T
Line up the patterns by Line up the patterns by their start indexes their start indexes
ss = ( = (ss11, , ss22, …, , …, sstt))
Construct profile matrix Construct profile matrix with frequencies of each with frequencies of each nucleotide in columnsnucleotide in columns
Consensus nucleotide in Consensus nucleotide in each position has the each position has the highest score in columnhighest score in column
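As a rough illustration of the three steps above, the sketch below builds the profile and consensus for the five aligned 8-mers on this slide. The function names `profile` and `consensus` are chosen here for illustration, and ties are broken in A, C, G, T order, which is an assumption the slide does not specify.

```python
from collections import Counter

# The five aligned 8-mers from the slide, upper-cased.
motifs = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]

def profile(motifs):
    """Return {nucleotide: [count in column 0, count in column 1, ...]}."""
    columns = [Counter(column) for column in zip(*motifs)]
    return {k: [c[k] for c in columns] for k in "ACGT"}

def consensus(motifs):
    """Most frequent nucleotide per column (ties go to the first of A, C, G, T)."""
    prof = profile(motifs)
    width = len(motifs[0])
    return "".join(max("ACGT", key=lambda k: prof[k][j]) for j in range(width))

for nucleotide, counts in profile(motifs).items():
    print(nucleotide, counts)
print(consensus(motifs))
```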
Consensus
Think of the consensus as an “ancestor” motif, from which mutated motifs emerged.
The distance between a real motif and the consensus sequence is generally less than that for two real motifs.
Evaluating Motifs
We have a guess about the consensus sequence, but how “good” is this consensus?
We need to introduce a scoring function to compare different guesses and choose the “best” one.
Defining Some Terms
t - number of sample DNA sequences
n - length of each DNA sequence
DNA - sample of DNA sequences (t x n array)
l - length of the motif (l-mer)
s_i - starting position of an l-mer in sequence i
s = (s_1, s_2, …, s_t) - array of motif starting positions
Parameters
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
l = 8
t = 5
n = 69
s = (s_1, s_2, s_3, s_4, s_5) = (26, 21, 3, 56, 60)
DNA = the t sequences above
Scoring Motifs
Given s = (s_1, …, s_t) and DNA:

$$Score(s, DNA) = \sum_{i=1}^{l} \max_{k \in \{A, C, G, T\}} count(k, i)$$

where count(k, i) is the number of occurrences of nucleotide k in column i of the alignment, which has t rows and l columns.

             a G g t a c T t
             C c A t a c g t
             a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
          A  3 0 1 0 3 1 1 0
          C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _______________
Consensus    a c g t a c g t
Score        3+4+4+5+3+4+3+4 = 30
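The scoring function can be sketched in a few lines. This version assumes 1-based start positions, as on the Parameters slide, and upper-cases the l-mers because the slide mixes cases to highlight mutations; the name `score` and its argument order are choices made here, not part of the slides.

```python
from collections import Counter

def score(starts, dna, l):
    """Consensus score of the l-mers that begin at the 1-based positions in `starts`."""
    lmers = [seq[s - 1 : s - 1 + l].upper() for s, seq in zip(starts, dna)]
    # For each column, count the most frequent nucleotide, then add the columns up.
    return sum(max(Counter(column).values()) for column in zip(*lmers))

dna = [
    "cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc",
]
print(score((26, 21, 3, 56, 60), dna, 8))  # the slide's example: 30
```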
The Motif Finding Problem
If starting positions s = (s_1, s_2, …, s_t) are given, finding the consensus is easy even with mutations in the sequences, because we can simply construct the profile to find the motif (consensus).
But… the starting positions s are usually not given. How can we find the “best” profile matrix?
The Motif Finding Problem: Formulation
Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score.
Input: A t x n matrix of DNA, and l, the length of the pattern to find.
Output: An array of t starting positions s = (s_1, s_2, …, s_t) maximizing Score(s, DNA).
The Motif Finding Problem: Brute Force Solution
Compute the scores for each possible combination of starting positions s.
The best score will determine the best profile and the consensus pattern in DNA.
The goal is to maximize Score(s, DNA) by varying the starting positions s_i, where:
s_i ∈ [1, …, n - l + 1] for i = 1, …, t
BruteForceMotifSearch
BruteForceMotifSearch(DNA, t, n, l)
  bestScore ← 0
  for each s = (s_1, s_2, …, s_t) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
    if Score(s, DNA) > bestScore
      bestScore ← Score(s, DNA)
      bestMotif ← (s_1, s_2, …, s_t)
  return bestMotif
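A minimal Python sketch of the pseudocode above, under a few assumptions: start positions are 0-based (the slides use 1-based), the function returns the best score along with the positions, and the three toy sequences are invented here so the search finishes instantly.

```python
from collections import Counter
from itertools import product

def score(starts, dna, l):
    """Consensus score of the l-mers at the given 0-based start positions."""
    lmers = [seq[s : s + l] for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*lmers))

def brute_force_motif_search(dna, l):
    """Score every combination of start positions; return (best starts, best score)."""
    n = len(dna[0])
    best_score, best_motif = 0, None
    # product(...) enumerates (0,...,0) through (n-l,...,n-l), one position per sequence.
    for starts in product(range(n - l + 1), repeat=len(dna)):
        current = score(starts, dna, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score

# Toy input (hypothetical): "ACGT" is planted once in each sequence.
dna = ["TTACGTTT", "ACGTGGGG", "CCCCACGT"]
print(brute_force_motif_search(dna, 4))
```

Even on this toy input the loop scores 5^3 = 125 combinations, which hints at why the approach does not scale.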
Running Time of BruteForceMotifSearch
Varying (n - l + 1) positions in each of t sequences, we’re looking at (n - l + 1)^t sets of starting positions.
For each set of starting positions, the scoring function makes l operations, so the complexity is l (n - l + 1)^t = O(l n^t).
For t = 8, n = 1000, and l = 10, how long will it take for a computer performing one million operations per second to complete the task?
Running Time of BruteForceMotifSearch (continued)
For t = 8, n = 1000, l = 10:
l n^t = 10 x 1000^8 = 10^25 operations
At 10^6 operations/second that is:
10^25 / 10^6 = 10^19 seconds ≈ 3.17 x 10^11 years
Let’s try something different…
The Median String Problem
Given a set of t DNA sequences, find a pattern that appears in all t sequences with the minimum number of mutations.
This pattern will be the motif.
Hamming Distance
Hamming distance: d_H(v, w) is the number of nucleotide pairs that do not match when v and w are aligned. For example:
d_H(AAAAAA, ACAAAC) = 2
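In code, the definition amounts to counting mismatching positions. A small sketch, with the equal-length check added here as a defensive assumption:

```python
def hamming_distance(v, w):
    """Number of positions at which the equal-length strings v and w differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))

print(hamming_distance("AAAAAA", "ACAAAC"))  # the slide's example: 2
```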
Total Distance: Example
Given v = “acgtacgt” and the starting positions s:

cctgatagacgctatctggctatcc[acgtacAt]aggtcctctgtgcgaatctatgcgtttccaaccat   d_H(v, x) = 1
agtactggtgtacatttgat[acgtacgt]acaccggcaacctgaaacaaacgctcagaaccagaagtgc   d_H(v, x) = 0
aa[aAgtCcgt]gcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt   d_H(v, x) = 2
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatctt[acgtacgt]ataca   d_H(v, x) = 0
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgtta[acgtaGgt]c   d_H(v, x) = 1

v is the candidate pattern; x is the bracketed l-mer in each sequence, with capitals marking the positions where x differs from v.
TotalDistance(v, DNA) = 1+0+2+0+1 = 4
Total Distance: Definition
For each DNA sequence i, compute all d_H(v, x), where x is an l-mer with starting position s_i (1 ≤ s_i ≤ n - l + 1).
Find the minimum of d_H(v, x) among all l-mers in sequence i.
TotalDistance(v, DNA) is the sum of the minimum Hamming distances for each DNA sequence i.
So, TotalDistance(v, DNA) = min_s d_H(v, s), where s is the set of starting positions s_1, s_2, …, s_t and d_H(v, s) is the sum of d_H(v, x) over the l-mers x selected by s.
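The definition translates directly into a sum of per-sequence minima. The sketch below is one possible reading; the toy sequences are made up for illustration and are not the ones on the example slide.

```python
def hamming_distance(v, w):
    return sum(a != b for a, b in zip(v, w))

def total_distance(v, dna):
    """Sum, over all sequences, of the smallest Hamming distance from v to any l-mer."""
    l = len(v)
    return sum(
        min(hamming_distance(v, seq[i : i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )

# Toy input (hypothetical): exact copy, one-mutation copy, one-mutation copy.
dna = ["TTACGTTT", "ACCTGGGG", "CCCCACGA"]
print(total_distance("ACGT", dna))  # 0 + 1 + 1
```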
The Median String Problem: Formulation
Goal: Given a set of DNA sequences, find a median string v.
Input: A t x n matrix DNA, and l, the length of the pattern to find.
Output: A string v of l nucleotides that minimizes TotalDistance(v, DNA) over all strings of that length.
Median String Search Algorithm
MedianStringSearch(DNA, t, n, l)
  bestWord ← AAA…A
  bestDistance ← ∞
  for each l-mer word from AAA…A to TTT…T
    if TotalDistance(word, DNA) < bestDistance
      bestDistance ← TotalDistance(word, DNA)
      bestWord ← word
  return bestWord
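A possible Python rendering of the pseudocode above. It relies on `itertools.product` to walk the l-mers from AAA…A to TTT…T, and returns the distance together with the word, which the pseudocode does not; the toy input is hypothetical.

```python
from itertools import product

def hamming_distance(v, w):
    return sum(a != b for a, b in zip(v, w))

def total_distance(v, dna):
    l = len(v)
    return sum(
        min(hamming_distance(v, seq[i : i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )

def median_string_search(dna, l):
    """Try all 4^l words; return (best word, its total distance)."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):  # AAA...A through TTT...T
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance

dna = ["TTACGTTT", "ACGTGGGG", "CCCCACGT"]  # toy input with "ACGT" planted
print(median_string_search(dna, 4))
```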
Key: Motif Finding Problem == Median String Problem
Motif Finding is a maximization problem, while Median String is a minimization problem.
However, the Motif Finding problem and the Median String problem are computationally equivalent.
To prove it, let’s show that minimizing TotalDistance is equivalent to maximizing Score…
We are looking for the same thing

Alignment      a G g t a c T t
(t = 5 rows,   C c A t a c g t
 l = 8 cols)   a c g t T A g t
               a c g t C c A t
               C c g t a c g G
               _______________
Profile     A  3 0 1 0 3 1 1 0
            C  2 4 0 0 1 4 0 0
            G  0 1 4 0 0 0 3 1
            T  0 0 0 5 1 0 1 4
               _______________
Consensus      a c g t a c g t
Score          3+4+4+5+3+4+3+4
TotalDistance  2+1+1+0+2+1+2+1
Sum            5 5 5 5 5 5 5 5

At any column j: Score_j + TotalDistance_j = t
Because there are l columns: Score + TotalDistance = l * t
Rearranging: Score = l * t - TotalDistance
Because l * t is constant, minimizing TotalDistance is equivalent to maximizing Score.
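The column-by-column identity can be checked numerically on the slide's alignment. In this sketch the distance is measured against the consensus of the fixed alignment; `column_summary` is a name invented here.

```python
from collections import Counter

motifs = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]

def column_summary(motifs):
    """Per column: (count of the consensus letter, number of rows that differ from it)."""
    scores, distances = [], []
    for column in zip(*motifs):
        letter, count = Counter(column).most_common(1)[0]
        scores.append(count)
        distances.append(sum(base != letter for base in column))
    return scores, distances

scores, distances = column_summary(motifs)
t, l = len(motifs), len(motifs[0])
print(scores, sum(scores))
print(distances, sum(distances))
print(sum(scores) + sum(distances) == l * t)
```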
Motif Finding Problem vs. Median String Problem
Why bother reformulating the Motif Finding problem into the Median String problem?
The Motif Finding Problem needs to examine all the combinations for s. That is (n - l + 1)^t combinations!
The Median String Problem needs to examine all 4^l combinations for v. This number is relatively small. By how much?
Median String Problem Efficiency
There are 4^l possible l-mers to try; each must be placed at each of n - l + 1 locations in t sequences, and the Hamming distance computed for each position.
This results in 4^l x t x (n - l + 1) x l operations (i.e., O(4^l t n l)).
Recall that the brute force motif finding algorithm for t = 8, n = 1000, and l = 10 was going to require 10^25 operations and 3.17 x 10^11 years at 10^6 ops/second.
For the median string algorithm and those same parameters we have 4^10 x 8 x 1000 x 10 = 8.39 x 10^10 ops. At 10^6 ops/second, this algorithm will require 8.39 x 10^4 secs, which is 23.3 hours. Hmmm…
Recall the BruteForceMotifSearch:
BruteForceMotifSearch(DNA, t, n, l)
  bestScore ← 0
  for each s = (s_1, s_2, …, s_t) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
    if Score(s, DNA) > bestScore
      bestScore ← Score(s, DNA)
      bestMotif ← (s_1, s_2, …, s_t)
  return bestMotif
Structuring the Search
How can we perform the line
for each s = (s_1, s_2, …, s_t) from (1, 1, …, 1) to (n-l+1, …, n-l+1) ?
We need a method for efficiently structuring and navigating the many possible motifs.
This is not very different from exploring all t-digit numbers.
MedianStringSearch(DNA, t, n, l)
  bestWord ← AAA…A
  bestDistance ← ∞
  for each l-mer s from AAA…A to TTT…T
    if TotalDistance(s, DNA) < bestDistance
      bestDistance ← TotalDistance(s, DNA)
      bestWord ← s
  return bestWord
Structuring the Search
For the Median String Problem we need to consider all 4^l possible l-mers:
aa…aa
aa…ac
aa…ag
aa…at
.
.
tt…tt
How to organize this search?
Alternative Representation of the Search Space
Let A = 1, C = 2, G = 3, T = 4.
Then the sequences from AA…A to TT…T become:
11…11
11…12
11…13
11…14
.
.
44…44
Notice that the sequences above simply list all l-digit numbers written with the four sequential digits 1 to 4.
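One way to walk this numeric search space is to increment the digit list like an odometer. The sketch below is an illustration only; `next_leaf` and `all_leaves` are hypothetical helper names, and `k` (the alphabet size) is a parameter added here.

```python
def next_leaf(a, k=4):
    """Advance digit list `a` (digits 1..k) to the next l-mer.
    After the last leaf (k, ..., k) it wraps around to (1, ..., 1)."""
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1
            return a
        a[i] = 1  # this digit overflows, carry into the digit to its left
    return a

def all_leaves(l, k=4):
    """Yield every l-digit number over the digits 1..k, in counting order."""
    a = [1] * l
    while True:
        yield tuple(a)
        a = next_leaf(a, k)
        if a == [1] * l:  # wrapped around: every leaf has been visited
            return

to_dna = {1: "A", 2: "C", 3: "G", 4: "T"}
print(["".join(to_dna[d] for d in leaf) for leaf in all_leaves(2)])
```

The same counter works for start positions in the motif finding search by setting k = n - l + 1 and using t digits.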
Search Tree
[Figure: search tree over the letters a, c, g, t for l = 2. Root: --; level 1: a- c- g- t-; leaves: aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt]
Search Tree
[Figure: search tree over the digits 1 and 2 for l = 2. Root: --; level 1: 1- 2-; leaves: 11 12 21 22]
An instance of the travelling salesperson problem
Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005
Branch and Bound in the Travelling Salesperson Problem
Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005
Can We Do Better?
Sets of s = (s_1, s_2, …, s_t) may have a weak profile for the first i positions (s_1, s_2, …, s_i).
Every row of the alignment may add at most l to Score.
Optimism: suppose all subsequent (t - i) positions (s_{i+1}, …, s_t) add (t - i) * l to Score(s, i, DNA), the consensus score of the first i rows…
If Score(s, i, DNA) + (t - i) * l < BestScore, it makes no sense to search in vertices of the current subtree.
Terminate the search below the current position…
Branch and Bound for Motif Search
Since each level of the tree goes deeper into the search, discarding a poor partial solution that cannot possibly get better discards all of the branches below it.
This eliminates consideration of (n - l + 1)^(t-i) combinations of starting positions for each partial solution discarded.
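The pruning rule can be sketched as a depth-first search over start positions. This is an illustrative version under stated assumptions: 0-based positions, a recursive helper named `extend`, a `visited` counter added only to show how many tree vertices survive pruning, and a made-up toy input.

```python
from collections import Counter

def partial_score(starts, dna, l):
    """Consensus score of the l-mers chosen so far (first len(starts) sequences)."""
    lmers = [seq[s : s + l] for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*lmers))

def branch_and_bound_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    best = {"score": 0, "starts": None, "visited": 0}

    def extend(starts):
        best["visited"] += 1
        i = len(starts)
        current = partial_score(starts, dna, l)
        if i == t:  # leaf: a complete set of start positions
            if current > best["score"]:
                best["score"], best["starts"] = current, starts
            return
        # Optimistic bound: each of the remaining t - i rows adds at most l.
        if current + (t - i) * l < best["score"]:
            return  # prune the whole subtree below this vertex
        for s in range(n - l + 1):
            extend(starts + (s,))

    extend(())
    return best["starts"], best["score"], best["visited"]

dna = ["TTACGTTT", "ACGTGGGG", "CCCCACGT"]  # toy input with "ACGT" planted
print(branch_and_bound_motif_search(dna, 4))
```

The full tree for this input has 1 + 5 + 25 + 125 = 156 vertices; the visited count reported by the sketch is smaller because pruned subtrees are never entered.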
A Greedy Approach to Motif Finding
GreedyMotifSearch(DNA, t, n, l)
  bestMotif ← (1, …, 1)
  s ← (1, …, 1)
  for s_1 ← 1 to n-l+1
    for s_2 ← 1 to n-l+1
      if Score(s, 2, DNA) > Score(bestMotif, 2, DNA)
        bestMotif_1 ← s_1
        bestMotif_2 ← s_2
  s_1 ← bestMotif_1
  s_2 ← bestMotif_2
  for i ← 3 to t
    for s_i ← 1 to n-l+1
      if Score(s, i, DNA) > Score(bestMotif, i, DNA)
        bestMotif_i ← s_i
    s_i ← bestMotif_i
  return bestMotif
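A compact Python sketch of the greedy idea above, assuming 0-based positions and using `max` in place of the explicit comparison loops (the first maximal candidate wins, which mirrors the strict ">" in the pseudocode). The toy input is invented for illustration.

```python
from collections import Counter
from itertools import product

def partial_score(starts, dna, l):
    """Consensus score of the l-mers in the first len(starts) sequences."""
    lmers = [seq[s : s + l] for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*lmers))

def greedy_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    positions = range(n - l + 1)
    # Exhaustively pick the best pair of l-mers in the first two sequences.
    first_two = max(product(positions, repeat=2),
                    key=lambda pair: partial_score(pair, dna, l))
    starts = list(first_two)
    # For each remaining sequence, keep the position that best extends the profile.
    for _ in range(2, t):
        best_s = max(positions, key=lambda s: partial_score(starts + [s], dna, l))
        starts.append(best_s)
    return tuple(starts), partial_score(starts, dna, l)

dna = ["TTACGTTT", "ACGTGGGG", "CCCCACGT"]  # toy input with "ACGT" planted
print(greedy_motif_search(dna, 4))
```

Because earlier choices are never revisited, the greedy result is not guaranteed to be the optimal motif; on this toy input it happens to coincide with the brute force answer.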
Complexity: l (n - l + 1)^2 operations to find the two closest l-mers in the first two sequences; about l (n - l + 1) operations per remaining sequence to find the l-mer that maximizes the score so far. The total number of operations is therefore l (n - l + 1)^2 + l (n - l + 1)(t - 2), so the complexity is O(l n^2 + l n t), compared with O(4^l t n l) for median string and O(l n^t) for brute force:
O(l n^2 + l n t) < O(4^l t n l) << O(l n^t)
Comparing Efficiency
Consider t = 8, n = 1000, and l = 10 at 10^6 ops/second:
Brute force motif finding: 3.17 x 10^11 years
Median string: 23.3 hours
Greedy approach: (l n^2 + l n t) / 10^6 = (10 * 1000^2 + 10 * 1000 * 8) / 10^6 ≈ 10 seconds
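The three estimates can be reproduced with a few lines of arithmetic. The sketch plugs the slide's parameters into the operation counts quoted above; the 365-day year is an assumption made here for the conversion.

```python
t, n, l = 8, 1000, 10
ops_per_second = 10**6
seconds_per_year = 365 * 24 * 3600  # assumed year length

estimates = {
    "brute force": l * n**t,            # O(l n^t)
    "median string": 4**l * t * n * l,  # O(4^l t n l)
    "greedy": l * n**2 + l * n * t,     # O(l n^2 + l n t)
}

for name, ops in estimates.items():
    seconds = ops / ops_per_second
    print(f"{name:13s} {ops:.3e} ops  {seconds:.3e} s  "
          f"{seconds / 3600:.3e} h  {seconds / seconds_per_year:.3e} years")
```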