Finding Regulatory Motifs in DNA Sequences

An Introduction to Bioinformatics Algorithms (Jones and Pevzner) www.bioalgorithms.info

Combinatorial Gene Regulation

A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed

How can one gene have such drastic effects?


Regulatory Proteins

Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF)

The 20 unexpressed genes rely on gene X’s TF to induce transcription

A single TF may regulate multiple genes


Regulatory Regions

Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site

Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region: the TFBS


Transcription Factor Binding Sites

A TFBS can be located anywhere within the regulatory region.

TFBS may vary slightly across different regulatory regions since non-essential bases could mutate


Motifs and Transcriptional Start Sites

[Figure: five genes, each labeled with the motif instance found upstream of its transcriptional start site: ATCCCG, TTCCGG, ATCCCG, ATGCCG, ATGCCC]


Transcription Factors and Motifs


Motif Logo

Motifs can mutate on non-important bases

The five motifs in five different genes have mutations in positions 3 and 5

Representations called motif logos illustrate the conserved and variable regions of a motif

TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA


Information content at site $x$: $I_x = 2 + \sum_i p_i \log_2 p_i$, where $p_i$ is the frequency of base $i$ at site $x$

Examples

For one nucleotide at a site: $I_x = 2 + 1 \cdot \log_2 1 = 2$ bits

For two equally frequent nucleotides at a site: $I_x = 2 + \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} = 1$ bit

For four equally frequent nucleotides at a site: $I_x = 2 + 4 \left( \frac{1}{4} \log_2 \frac{1}{4} \right) = 0$ bits

Motif Logos: An Example

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)


Identifying Motifs

Genes are turned on or off by regulatory proteins

These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)

So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes


Identifying Motifs: Complications

We do not know the motif sequence

We do not know where it is located relative to the gene’s start

Motifs can differ slightly from one gene to the next

How to discern it from “random” motifs?


Random Sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca


Implanting Motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa


Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga


Implanting Motif AAAAAAAAGGGGGGG with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa


Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga


Why Is Finding a (15,4) Motif Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

The first two implanted instances, aligned (| marks a match, . a mismatch):

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
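The difficulty can be quantified with the Hamming distance. The sketch below (the helper name `hamming_distance` is an illustrative choice, not from the slides) compares the two aligned instances with the implanted pattern and with each other: each instance is only 4 mismatches away from the pattern, yet the two instances are further apart from one another.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


pattern = "AAAAAAAAGGGGGGG"
first = "AgAAgAAAGGttGGG"   # instance from the first sequence
second = "cAAtAAAAcGGcGGG"  # instance from the second sequence

print(hamming_distance(pattern, first))   # distance of the first instance to the pattern
print(hamming_distance(pattern, second))  # distance of the second instance to the pattern
print(hamming_distance(first, second))    # distance between the two instances
```

Two instances of a (15,4) motif can differ from each other in up to 8 positions, so a pairwise comparison of sequences sees a much weaker signal than a comparison against the unknown pattern.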


Challenge Problem

Find a motif in a sample of

- 20 “random” sequences (e.g. 600 nt long)
- each sequence containing an implanted pattern of length 15,
- each pattern appearing with 4 mismatches as (15,4)-motif.
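A test sample of this kind can be generated in a few lines. This is a minimal sketch under stated assumptions: background bases are drawn uniformly at random, exactly `d` positions of each implanted copy are mutated, and start positions are 0-based. The names `mutate` and `implant_sample` and the `seed` parameter are hypothetical helpers for illustration.

```python
import random


def mutate(pattern, d, rng):
    """Return a copy of pattern with exactly d positions changed to a different base."""
    letters = list(pattern)
    for position in rng.sample(range(len(letters)), d):
        letters[position] = rng.choice([b for b in "acgt" if b != letters[position]])
    return "".join(letters)


def implant_sample(pattern, d, t, n, seed=0):
    """Build t random sequences of length n, each hiding one mutated copy of pattern."""
    rng = random.Random(seed)
    pattern = pattern.lower()
    sequences, starts = [], []
    for _ in range(t):
        background = [rng.choice("acgt") for _ in range(n)]
        start = rng.randrange(n - len(pattern) + 1)  # 0-based start of the implant
        background[start:start + len(pattern)] = mutate(pattern, d, rng)
        sequences.append("".join(background))
        starts.append(start)
    return sequences, starts


sequences, starts = implant_sample("AAAAAAAAGGGGGGG", d=4, t=20, n=600)
print(starts[:5])
print(sequences[0][starts[0]:starts[0] + 15])
```

Keeping the returned `starts` makes it possible to check whether a motif finder recovered the implanted positions.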


A Motif Finding Analogy

The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809 – 1849) in his Gold Bug story


The Gold Bug Problem

Given a secret message:

53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;
46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6
!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3
4;48)4+;161;:188;+?;

Decipher the message encrypted in the fragment


Hints for The Gold Bug Problem

Additional hints:

- The encrypted message is in English
- Each symbol corresponds to one letter in the English alphabet
- No punctuation marks are encoded


The Gold Bug Problem: Symbol Counts

Naive approach to solving the problem:

- Count the frequency of each symbol in the encrypted message
- Find the frequency of each letter in the alphabet in the English language
- Compare the frequencies of the previous steps, try to find a correlation and map the symbols to a letter in the alphabet


Symbol Frequencies in the Gold Bug Message

Gold Bug Message:

Symbol      8  ;  4  )  +  *  5  6  (  !  1  0  2  9  3  :  ?  `  -  ]  .
Frequency  34 25 19 16 15 14 12 11  9  8  7  6  5  5  4  4  3  2  1  1  1

English Language (most frequent to least frequent):

e t a o i n s r h l d c u m f p g w y b v k x j q z


The Gold Bug Message Decoding: First Attempt

By simply mapping the most frequent symbols to the most frequent letters of the alphabet:

sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt
arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorl
eoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfa
taeoaitdrdtpdeetiwt

The result does not make sense
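The naive rank-for-rank mapping is easy to reproduce. This is a rough sketch, assuming the message as transcribed on the slide and the letter ordering `etaoinsrhldcumfpgwybvkxjqz`; the names `rank_mapping` and `decode` are illustrative. Symbols with equal counts are ordered by first appearance here, which is an assumption, so the low-frequency letters may differ from the decoding shown on the slide.

```python
from collections import Counter

MESSAGE = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)
ENGLISH_BY_FREQUENCY = "etaoinsrhldcumfpgwybvkxjqz"


def rank_mapping(message, alphabet=ENGLISH_BY_FREQUENCY):
    """Pair the k-th most frequent symbol with the k-th most frequent letter."""
    ranked = [symbol for symbol, _ in Counter(message).most_common()]
    return dict(zip(ranked, alphabet))


def decode(message, mapping):
    """Replace every mapped symbol, leaving unmapped symbols untouched."""
    return "".join(mapping.get(symbol, symbol) for symbol in message)


counts = Counter(MESSAGE)
mapping = rank_mapping(MESSAGE)
print(counts.most_common(5))
print(decode(MESSAGE, mapping))
```

Single-symbol frequencies alone are too noisy for a message this short, which is why the decoded text is unreadable.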


The Gold Bug Problem: l-tuple count

A better approach:

- Examine frequencies of l-tuples, combinations of 2 symbols, 3 symbols, etc.
- “The” is the most frequent 3-tuple in English and “;48” is the most frequent 3-tuple in the encrypted text
- Make inferences of unknown symbols by examining other frequent l-tuples
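Counting l-tuples amounts to sliding a window of width l over the text. The sketch below assumes overlapping windows and uses the message as transcribed on the slide; `tuple_counts` is an illustrative name.

```python
from collections import Counter

MESSAGE = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)


def tuple_counts(text, l):
    """Count every overlapping window of l consecutive symbols."""
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))


for l in (1, 2, 3):
    print(l, tuple_counts(MESSAGE, l).most_common(3))
```

The same windowing idea carries over to DNA, where the windows are l-mers of nucleotides.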


The Gold Bug Problem: the ;48 clue

Mapping “;48” to “the” and substituting all occurrences of the symbols:

53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e
)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht
he)h+t161t:1eet+?t
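The substitution step is a character-for-character translation. A minimal sketch using Python's `str.maketrans` and `str.translate` follows; the helper name `substitute` is an illustrative choice, and the message is the one transcribed on the slide.

```python
MESSAGE = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)


def substitute(message, cipher_symbols, plain_letters):
    """Replace each cipher symbol with the plain letter at the same index."""
    return message.translate(str.maketrans(cipher_symbols, plain_letters))


partial = substitute(MESSAGE, ";48", "the")
print(partial)
```

Every `;`, `4`, and `8` is replaced, not only those inside `;48`, which is what exposes fragments such as `thet(ee` for the next round of inference.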


The Gold Bug Message Decoding: Second Attempt

Make inferences:

53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e
)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht
he)h+t161t:1eet+?t

“thet(ee” most likely means “the tree”

Infer “(” = “r”

“th(+?3h” becomes “thr+?3h”

Can we guess “+”, “?”, and “3”?


The Gold Bug Problem: The Solution

After figuring out all the mappings, the final message is:

AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGRE
ESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENT
HLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINE
FROMTHETREETHROUGHTHESHOTFIFTYFEETOUT


The Solution (cont’d)

Punctuation is important:

A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S SEA,
TWENY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH,
MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OF
THE DEATH’S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT,
FIFTY FEET OUT.


Solving the Gold Bug Problem

Prerequisites to solve the problem:

- Need to know the relative frequencies of single letters, and combinations of two and three letters in English
- Knowledge of all the words in the English dictionary is highly desired to make accurate inferences


Motif Finding and The Gold Bug Problem: Similarities

Nucleotides in motifs encode for a message in the “genetic” language. Symbols in “The Gold Bug” encode for a message in English

In order to solve the problem, we analyze the frequencies of patterns in DNA/Gold Bug message.

Knowledge of established regulatory motifs makes the Motif Finding problem simpler. Knowledge of the words in the English dictionary helps to solve the Gold Bug problem.


Similarities (cont’d)

Gold Bug Problem: In order to solve the problem, we analyze the frequencies of patterns in the text written in English

Motif Finding: In order to solve the problem, we analyze the frequencies of patterns in the nucleotide sequences


Similarities (cont’d)

Gold Bug Problem: Knowledge of the words in the dictionary is highly desirable

Motif Finding: Knowledge of established motifs reduces the complexity of the problem


Motif Finding and The Gold Bug Problem: Differences

Motif Finding is harder than the Gold Bug problem:

- We don’t have the complete dictionary of motifs
- The “genetic” language does not have a standard “grammar”
- Only a small fraction of nucleotide sequences encode for motifs; the size of data is enormous


The Motif Finding Problem

Given a random sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

Find the pattern that is implanted in each of the individual sequences, namely, the motif


The Motif Finding Problem (cont’d)

Additional information:

- The hidden sequence is of length 8
- The pattern is not exactly the same in each sequence because random point mutations may occur in the sequences


The Motif Finding Problem (cont’d)

The patterns revealed with no mutations (each pattern is set off by spaces):

cctgatagacgctatctggctatcc acgtacgt aggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgat acgtacgt acaccggcaacctgaaacaaacgctcagaaccagaagtgc
aa acgtacgt gcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatctt acgtacgt ataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgtta acgtacgt c

acgtacgt

Consensus String



The patterns with 2 point mutations (each pattern is set off by spaces; capitals mark the mutated positions):

cctgatagacgctatctggctatcc aGgtacTt aggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgat CcAtacgt acaccggcaacctgaaacaaacgctcagaaccagaagtgc
aa acgtTAgt gcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatctt acgtCcAt ataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgtta CcgtacgG c



Can we still find the motif, now that we have 2 mutations?


Defining Motifs

To define a motif, let’s say we know where the motif starts in the sequence

The motif start positions in their sequences can be represented as $s = (s_1, s_2, s_3, \ldots, s_t)$


Motifs: Profiles and Consensus

Line up the patterns by their start indexes $s = (s_1, s_2, \ldots, s_t)$

Construct profile matrix with frequencies of each nucleotide in columns

Consensus nucleotide in each position has the highest score in column

Alignment      a G g t a c T t
               C c A t a c g t
               a c g t T A g t
               a c g t C c A t
               C c g t a c g G
               _______________
Profile     A  3 0 1 0 3 1 1 0
            C  2 4 0 0 1 4 0 0
            G  0 1 4 0 0 0 3 1
            T  0 0 0 5 1 0 1 4
               _______________
Consensus      A C G T A C G T
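The three steps above (align, count per column, take the column maximum) can be sketched as follows. The function names `profile` and `consensus` are illustrative, and breaking ties in favor of the earlier base in the order A, C, G, T is an assumption that the slides do not specify.

```python
BASES = "ACGT"


def profile(instances):
    """Count how often each base occurs in each column of the aligned instances."""
    columns = list(zip(*(instance.upper() for instance in instances)))
    return {base: [column.count(base) for column in columns] for base in BASES}


def consensus(instances):
    """Pick the most frequent base in every column (ties go to the earlier base in BASES)."""
    counts = profile(instances)
    length = len(counts["A"])
    return "".join(max(BASES, key=lambda base: counts[base][i]) for i in range(length))


alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
for base, row in profile(alignment).items():
    print(base, row)
print(consensus(alignment))
```

The alignment used here is the set of five mutated 8-mers from the example sequences.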


Consensus

Think of consensus as an “ancestor” motif, from which mutated motifs emerged

The distance between a real motif and the consensus sequence is generally less than that for two real motifs


Consensus (cont’d)


Evaluating Motifs

We have a guess about the consensus sequence, but how “good” is this consensus?

Need to introduce a scoring function to compare different guesses and choose the “best” one.


Defining Some Terms

- $t$ - number of sample DNA sequences
- $n$ - length of each DNA sequence
- $DNA$ - sample of DNA sequences ($t \times n$ array)
- $l$ - length of the motif ($l$-mer)
- $s_i$ - starting position of an $l$-mer in sequence $i$
- $s = (s_1, s_2, \ldots, s_t)$ - array of motif starting positions


Parameters

[Figure: the $DNA$ array with the motif instances marked]

$l = 8$, $t = 5$, $n = 69$

$s = (s_1, s_2, s_3, s_4, s_5) = (26, 21, 3, 56, 60)$


Scoring Motifs

Given $s = (s_1, \ldots, s_t)$ and $DNA$:

$$Score(s, DNA) = \sum_{i=1}^{l} \max_{k \in \{A, T, C, G\}} count(k, i)$$

where $count(k, i)$ is the number of occurrences of nucleotide $k$ in column $i$ of the alignment ($t$ rows, $l$ columns).

               a G g t a c T t
               C c A t a c g t
               a c g t T A g t
               a c g t C c A t
               C c g t a c g G
               _______________
            A  3 0 1 0 3 1 1 0
            C  2 4 0 0 1 4 0 0
            G  0 1 4 0 0 0 3 1
            T  0 0 0 5 1 0 1 4
               _______________
Consensus      a c g t a c g t

Score          3+4+4+5+3+4+3+4 = 30
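The scoring function can be sketched directly from its definition. The code below assumes 1-based starting positions, as on the slides, and uses the five example sequences with two point mutations as transcribed here; the function name `score` and its argument order are illustrative.

```python
def score(s, dna, l):
    """Sum over the l alignment columns of the count of the most frequent base."""
    # Starting positions are 1-based, so position p maps to Python index p - 1.
    instances = [seq[p - 1:p - 1 + l].upper() for seq, p in zip(dna, s)]
    return sum(max(column.count(base) for base in "ACGT") for column in zip(*instances))


dna = [
    "cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc",
]
s = (26, 21, 3, 56, 60)

print(score(s, dna, 8))
```

A perfect alignment of t identical l-mers would score l times t, which is 40 here, so the score measures how close the chosen l-mers are to full agreement.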


The Motif Finding Problem

If starting positions $s = (s_1, s_2, \ldots, s_t)$ are given, finding consensus is easy even with mutations in the sequences because we can simply construct the profile to find the motif (consensus)

But… the starting positions $s$ are usually not given. How can we find the “best” profile matrix?


The Motif Finding Problem: Formulation

Goal: Given a set of DNA sequences, find a set of $l$-mers, one from each sequence, that maximizes the consensus score

Input: A $t \times n$ matrix of $DNA$, and $l$, the length of the pattern to find

Output: An array of $t$ starting positions $s = (s_1, s_2, \ldots, s_t)$ maximizing $Score(s, DNA)$


The Motif Finding Problem: Brute Force Solution

Compute the scores for each possible combination of starting positions $s$

The best score will determine the best profile and the consensus pattern in $DNA$

The goal is to maximize $Score(s, DNA)$ by varying the starting positions $s_i$, where:

$s_i \in [1, \ldots, n - l + 1]$ for $i = 1, \ldots, t$


BruteForceMotifSearch

BruteForceMotifSearch(DNA, t, n, l)
    bestScore ← 0
    for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n - l + 1, …, n - l + 1)
        if Score(s, DNA) > bestScore
            bestScore ← Score(s, DNA)
            bestMotif ← (s1, s2, …, st)
    return bestMotif
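A minimal Python sketch of the brute force search, assuming uppercase sequences of equal length and 0-based starting positions. The use of `itertools.product` to walk all combinations of starting positions is a choice made here, not something the slides prescribe.

```python
from itertools import product


def score(s, dna, l):
    """Consensus score of the l-mers that start at positions s (0-based)."""
    total = 0
    for j in range(l):
        column = [seq[start + j] for seq, start in zip(dna, s)]
        total += max(column.count(base) for base in "ACGT")
    return total


def brute_force_motif_search(dna, l):
    """Try every combination of starting positions and keep the best one."""
    n = len(dna[0])
    best_score, best_motif = 0, None
    for s in product(range(n - l + 1), repeat=len(dna)):
        current = score(s, dna, l)
        if current > best_score:
            best_score, best_motif = current, s
    return best_motif, best_score


# Toy input: ACGT is the only 4-mer shared by all three sequences.
toy_dna = ["AAACGTAA", "CCACGTCC", "GGGACGTG"]
print(brute_force_motif_search(toy_dna, 4))
```

This is only practical for toy inputs; the number of combinations grows as (n - l + 1)^t.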


Running Time of BruteForceMotifSearch

Varying (n - l + 1) positions in each of t sequences, we are looking at (n - l + 1)^t sets of starting positions.

For each set of starting positions, the scoring function makes l operations, so the complexity is l (n - l + 1)^t = O(l n^t).

For t = 8, n = 1000, and l = 10, how long will it take for a computer performing one million operations per second to complete the task?


Running Time of BruteForceMotifSearch (continued)

For t = 8, n = 1000, l = 10:
l n^t = 10 x 1000^8 = 10^25 operations

At 10^6 operations/second that is:
10^25 / 10^6 = 10^19 seconds ≈ 3.17 x 10^11 years

Let's try something different…

The Median String Problem

Given a set of t DNA sequences, find a pattern that appears in all t sequences with the minimum number of mutations.

This pattern will be the motif.


Hamming Distance

Hamming distance: dH(v, w) is the number of nucleotide pairs that do not match when v and w are aligned. For example:

dH(AAAAAA, ACAAAC) = 2
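As a small illustration, the Hamming distance can be computed by comparing the two strings position by position. The function name `hamming_distance` is illustrative.

```python
def hamming_distance(v, w):
    """Number of positions at which equal-length strings v and w differ."""
    if len(v) != len(w):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("AAAAAA", "ACAAAC"))
```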


Total Distance: Example

Given v = "acgtacgt" and the starting positions s:

In each sequence below, x is the bracketed l-mer at starting position si (blue in the original slide, with v drawn in red above it). Capital letters mark the positions where x differs from v.

cctgatagacgctatctggctatcc[acgtacAt]aggtcctctgtgcgaatctatgcgtttccaaccat    dH(v, x) = 1
agtactggtgtacatttgat[acgtacgt]acaccggcaacctgaaacaaacgctcagaaccagaagtgc    dH(v, x) = 0
aa[aAgtCcgt]gcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt    dH(v, x) = 2
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatctt[acgtacgt]ataca    dH(v, x) = 0
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgtta[acgtaGgt]c    dH(v, x) = 1

TotalDistance(v, DNA) = 1+0+2+0+1 = 4


Total Distance: Definition

For each DNA sequence i, compute all dH(v, x), where x is an l-mer with starting position si (1 ≤ si ≤ n - l + 1).

Find the minimum of dH(v, x) among all l-mers in sequence i.

TotalDistance(v, DNA) is the sum of the minimum Hamming distances for each DNA sequence i.

So, TotalDistance(v, DNA) = min over s of dH(v, s), where s is the set of starting positions s1, s2, …, st and dH(v, s) is the sum of the Hamming distances between v and the l-mers starting at s1, …, st.
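The definition can be sketched in Python and checked against the five-sequence example. The sequences are the ones from the example slide, lowercased; the helper names `min_distance` and `total_distance` are illustrative.

```python
def hamming_distance(v, w):
    """Number of mismatching positions between equal-length strings."""
    return sum(a != b for a, b in zip(v, w))


def min_distance(v, seq):
    """Smallest Hamming distance between v and any l-mer of seq."""
    l = len(v)
    return min(hamming_distance(v, seq[i:i + l]) for i in range(len(seq) - l + 1))


def total_distance(v, dna):
    """Sum over all sequences of the best (minimum) distance to v."""
    return sum(min_distance(v, seq) for seq in dna)


dna = [
    "cctgatagacgctatctggctatccacgtacataggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaaagtccgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaggtc",
]
print([min_distance("acgtacgt", seq) for seq in dna])
print(total_distance("acgtacgt", dna))
```

Note that the minimization is done independently per sequence, which is why no search over combinations of starting positions is needed once v is fixed.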


The Median String Problem: Formulation

Goal: Given a set of DNA sequences, find a median string v.

Input: A t x n matrix DNA, and l, the length of the pattern to find.

Output: A string v of l nucleotides that minimizes TotalDistance(v, DNA) over all strings of that length.


Median String Search Algorithm

MedianStringSearch(DNA, t, n, l)
    bestWord ← AAA…A
    bestDistance ← ∞
    for each l-mer word from AAA…A to TTT…T
        if TotalDistance(word, DNA) < bestDistance
            bestDistance ← TotalDistance(word, DNA)
            bestWord ← word
    return bestWord
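A minimal Python sketch of the median string search, assuming uppercase input. Enumerating the 4^l candidate words with `itertools.product` is a convenience chosen here; the slides structure the same enumeration as a search tree.

```python
from itertools import product


def total_distance(word, dna):
    """Sum over sequences of the minimum Hamming distance from word to any l-mer."""
    l = len(word)
    total = 0
    for seq in dna:
        total += min(
            sum(a != b for a, b in zip(word, seq[i:i + l]))
            for i in range(len(seq) - l + 1)
        )
    return total


def median_string_search(dna, l):
    """Try all 4^l words from AAA...A to TTT...T and keep the closest one."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance


toy_dna = ["AAACGTAA", "CCACGTCC", "GGGACGTG"]
print(median_string_search(toy_dna, 4))
```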


Key: Motif Finding Problem == Median String Problem

Motif Finding is a maximization problem, while Median String is a minimization problem.

However, the Motif Finding problem and Median String problem are computationally equivalent.

To prove it, let’s show that minimizing TotalDistance is equivalent to maximizing Score…


We are looking for the same thing

Alignment (t = 5 rows, l = 8 columns):

    a G g t a c T t
    C c A t a c g t
    a c g t T A g t
    a c g t C c A t
    C c g t a c g G
    _______________

Profile:

    A   3 0 1 0 3 1 1 0
    C   2 4 0 0 1 4 0 0
    G   0 1 4 0 0 0 3 1
    T   0 0 0 5 1 0 1 4
    _______________

Consensus       a c g t a c g t

Score           3+4+4+5+3+4+3+4
TotalDistance   2+1+1+0+2+1+2+1
Sum             5 5 5 5 5 5 5 5

At any column j: Score_j + TotalDistance_j = t

Because there are l columns: Score + TotalDistance = l * t

Rearranging: Score = l * t - TotalDistance

Because l * t is constant, minimizing TotalDistance on the right side is equivalent to maximizing Score on the left side.
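The column-by-column argument can be written out as a short derivation. It assumes that the rows are compared against the consensus string, so the rows that mismatch in column j are exactly those that do not carry the most frequent letter of that column.

```latex
\begin{align*}
\mathrm{Score}_j &= \max_{k \in \{A,C,G,T\}} \mathrm{count}(k, j) \\
\mathrm{TotalDistance}_j &= t - \max_{k \in \{A,C,G,T\}} \mathrm{count}(k, j) = t - \mathrm{Score}_j \\
\mathrm{Score} + \mathrm{TotalDistance}
  &= \sum_{j=1}^{l} \left( \mathrm{Score}_j + \mathrm{TotalDistance}_j \right) = l \cdot t \\
\text{Example: } 30 + 10 &= 8 \cdot 5 = 40
\end{align*}
```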


Motif Finding Problem vs. Median String Problem

Why bother reformulating the Motif Finding problem into the Median String problem?

The Motif Finding Problem needs to examine all the combinations for s. That is (n - l + 1)^t combinations!

The Median String Problem needs to examine all 4^l combinations for v. This number is relatively smaller. By how much?


Median String Problem Efficiency

There are 4^l possible l-mers to try. Each must be placed at each of the n - l + 1 locations in each of the t sequences, and the Hamming distance computed for each position.

This results in 4^l x t x (n - l + 1) x l operations, i.e., O(4^l t n l).

Recall that brute force motif finding for t = 8, n = 1000, and l = 10 was going to require 10^25 operations and 3.17 x 10^11 years at 10^6 ops/second.

For the median string algorithm and those same parameters we have 4^10 x 8 x 1000 x 10 = 8.39 x 10^10 ops. At 10^6 ops/second, this algorithm will require 8.39 x 10^4 seconds, which is 23.3 hours. Hmmm…


Recall the BruteForceMotifSearch:

BruteForceMotifSearch(DNA, t, n, l)
    bestScore ← 0
    for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n - l + 1, …, n - l + 1)
        if Score(s, DNA) > bestScore
            bestScore ← Score(s, DNA)
            bestMotif ← (s1, s2, …, st)
    return bestMotif


Structuring the Search

How can we perform the line

    for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n - l + 1, …, n - l + 1) ?

We need a method for efficiently structuring and navigating the many possible motifs.

This is not very different from exploring all t-digit numbers.
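One way to picture "exploring all t-digit numbers" is an odometer that increments the last digit and carries to the left. The sketch below is an illustration under that assumption; the names `next_leaf` and `all_leaves` are chosen here. For motif search each of the t digits would range over 1 … n - l + 1.

```python
def next_leaf(a, k):
    """Advance the digit tuple a (digits 1..k) by one, like an odometer.

    After the last tuple (k, ..., k) it wraps around to (1, ..., 1).
    """
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1
            return tuple(a)
        a[i] = 1  # carry: reset this digit and move one place left
    return tuple(a)


def all_leaves(length, k):
    """Yield every tuple of `length` digits in 1..k, in increasing order."""
    first = (1,) * length
    a = first
    while True:
        yield a
        a = next_leaf(a, k)
        if a == first:
            return


print(list(all_leaves(2, 2)))
```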


MedianStringSearch(DNA, t, n, l)
    bestWord ← AAA…A
    bestDistance ← ∞
    for each l-mer s from AAA…A to TTT…T
        if TotalDistance(s, DNA) < bestDistance
            bestDistance ← TotalDistance(s, DNA)
            bestWord ← s
    return bestWord




For the Median String Problem we need to consider all 4^l possible l-mers:

aa…aa
aa…ac
aa…ag
aa…at
.
.
tt…tt

How to organize this search?


Alternative Representation of the Search Space

Let A = 1, C = 2, G = 3, T = 4. Then the sequences from AA…A to TT…T become:

11…11
11…12
11…13
11…14
.
.
44…44

Notice that the sequences above simply list all l-digit numbers written with the four sequential digits 1 through 4.
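The digit encoding can be sketched directly: count through all l-digit numbers over the digits 1 to 4 and translate each one back into an l-mer. The names `DIGIT_TO_BASE`, `digits_to_lmer`, and `all_lmers` are illustrative, and `itertools.product` stands in for the counting.

```python
from itertools import product

# Encoding from the slide: A = 1, C = 2, G = 3, T = 4.
DIGIT_TO_BASE = {1: "a", 2: "c", 3: "g", 4: "t"}


def digits_to_lmer(digits):
    """Translate a tuple of digits in 1..4 into the corresponding l-mer."""
    return "".join(DIGIT_TO_BASE[d] for d in digits)


def all_lmers(l):
    """Yield all 4^l l-mers in the order 11...1, 11...2, ..., 44...4."""
    for digits in product((1, 2, 3, 4), repeat=l):
        yield digits_to_lmer(digits)


print(list(all_lmers(2)))
```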


Search Tree

[Figure: search tree for l-mers with l = 2. Root: --. Level 1: a-, c-, g-, t-. Leaves: aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt.]

[Figure: search tree for 2-digit numbers over the digits 1 and 2. Root: --. Level 1: 1-, 2-. Leaves: 11 12 21 22.]


An instance of the travelling salesperson problem

Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

Branch and Bound in the Travelling Salesperson Problem

Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

Can We Do Better?

Sets of s = (s1, s2, …, st) may have a weak profile for the first i positions (s1, s2, …, si).

Every row of the alignment may add at most l to Score.

Optimism: suppose all subsequent (t - i) positions (s(i+1), …, st) add the maximum, (t - i) * l, to Score(s, i, DNA).

If Score(s, i, DNA) + (t - i) * l < BestScore, it makes no sense to search in vertices of the current subtree. Terminate the search below the current position.


Branch and Bound for Motif Search

Since each level of the tree goes deeper into the search, discarding a poor partial solution that cannot possibly get better discards all of the following branches.

This eliminates consideration of (n - l + 1)^(t-i) positions (per candidate motif).
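A minimal sketch of branch and bound for motif search, assuming uppercase sequences of equal length and 0-based positions. The recursive depth-first traversal and the names `partial_score` and `branch_and_bound_motif_search` are choices made for this illustration; the pruning test is the one stated above.

```python
def partial_score(s, dna, l):
    """Score(s, i, DNA): consensus score using only the first len(s) sequences."""
    total = 0
    for j in range(l):
        column = [dna[i][start + j] for i, start in enumerate(s)]
        total += max(column.count(base) for base in "ACGT")
    return total


def branch_and_bound_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    best = {"score": 0, "motif": None}

    def visit(prefix):
        i = len(prefix)
        if i > 0:
            current = partial_score(prefix, dna, l)
            if i == t:  # leaf: a complete set of starting positions
                if current > best["score"]:
                    best["score"], best["motif"] = current, tuple(prefix)
                return
            # Optimistic bound: each remaining row adds at most l.
            if current + (t - i) * l < best["score"]:
                return  # prune the whole subtree below this prefix
        for start in range(n - l + 1):
            visit(prefix + [start])

    visit([])
    return best["motif"], best["score"]


toy_dna = ["AAACGTAA", "CCACGTCC", "GGGACGTG"]
print(branch_and_bound_motif_search(toy_dna, 4))
```

Pruning never discards a subtree that could beat the current best, so the result matches an exhaustive search; only the amount of work changes.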


A Greedy Approach to Motif Finding

GreedyMotifSearch(DNA, t, n, l)
    bestMotif ← (1, …, 1)
    s ← (1, …, 1)
    for s1 ← 1 to n - l + 1
        for s2 ← 1 to n - l + 1
            if Score(s, 2, DNA) > Score(bestMotif, 2, DNA)
                bestMotif1 ← s1
                bestMotif2 ← s2
    s1 ← bestMotif1
    s2 ← bestMotif2
    for i ← 3 to t
        for si ← 1 to n - l + 1
            if Score(s, i, DNA) > Score(bestMotif, i, DNA)
                bestMotifi ← si
        si ← bestMotifi
    return bestMotif




Complexity: l (n - l + 1)^2 operations to find the first two closest l-mers; about l (n - l + 1) operations per remaining sequence to find the l-mer that maximizes the score so far. The total number of operations is therefore l (n - l + 1)^2 + l (n - l + 1)(t - 2), so the complexity is O(l n^2 + l n t).

O(l n^2 + l n t) (greedy) < O(4^l t n l) (median string) << O(l n^t) (brute force)
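The greedy strategy can be sketched as follows, assuming at least two uppercase sequences of equal length and 0-based positions. The names are illustrative; Python's `max` returns the first maximal candidate, which matches the strict ">" comparison in the pseudocode.

```python
def partial_score(s, dna, l):
    """Consensus score using only the first len(s) sequences."""
    total = 0
    for j in range(l):
        column = [dna[i][start + j] for i, start in enumerate(s)]
        total += max(column.count(base) for base in "ACGT")
    return total


def greedy_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    positions = range(n - l + 1)
    # Step 1: exhaustively pick the best pair of l-mers in the first two sequences.
    pairs = ((s1, s2) for s1 in positions for s2 in positions)
    motif = list(max(pairs, key=lambda s: partial_score(s, dna, l)))
    # Step 2: for each further sequence, fix the earlier choices and
    # add the single position that maximizes the score so far.
    for _ in range(2, t):
        best_si = max(positions, key=lambda si: partial_score(motif + [si], dna, l))
        motif.append(best_si)
    return tuple(motif)


toy_dna = ["AAACGTAA", "CCACGTCC", "GGGACGTG"]
print(greedy_motif_search(toy_dna, 4))
```

Because earlier choices are never revisited, the result is not guaranteed to be optimal in general; on this toy input it happens to coincide with the exhaustive answer.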

Comparing Efficiency

Consider t = 8, n = 1000, and l = 10 at 10^6 ops/second:

Brute force motif finding: 3.17 x 10^11 years

Median string: 23.3 hours

Greedy approach: l n^2 + l n t = (10 * 1000^2 + 10 * 1000 * 8) / 10^6 ≈ 10 seconds
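The three estimates can be rechecked with a few lines of arithmetic. Like the slides, this sketch approximates n - l + 1 by n and assumes a 365-day year; the variable names are illustrative.

```python
t, n, l = 8, 1000, 10
ops_per_second = 10**6
seconds_per_year = 365 * 24 * 3600

# Operation counts, using the same approximations as the slides.
brute_force_ops = l * n**t              # O(l n^t)
median_string_ops = 4**l * t * n * l    # O(4^l t n l)
greedy_ops = l * n**2 + l * n * t       # O(l n^2 + l n t)

brute_force_years = brute_force_ops / ops_per_second / seconds_per_year
median_string_hours = median_string_ops / ops_per_second / 3600
greedy_seconds = greedy_ops / ops_per_second

print(f"brute force:   {brute_force_years:.3g} years")
print(f"median string: {median_string_hours:.1f} hours")
print(f"greedy:        {greedy_seconds:.2f} seconds")
```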

