exhaustive search (cont’d) cs 466 saurabh sinha. finding motifs ab initio enumerate all possible...

Exhaustive search (cont’d)

CS 466

Saurabh Sinha

Finding motifs ab initio

• Enumerate all possible strings of some fixed (small) length

• For each such string (“motif”) count its occurrences in the promoters

• Report the most frequently occurring motif

• Does the true motif pop out ?

Chapter 4.5-4.9

Simple statistics

• Consider 10 promoters, each 100 bp long

• Suppose a secret motif ATGCAACT has been “planted”

in each promoter

• Our enumerative method counts every possible “8-mer”

• Expected number of occurrences of an 8-mer is 10 x

100 x (1/4)8 ≈ 0.015

• Most likely, an arbitrary 8-mer will occur at most once,

may be twice

• 10 occurrences of ATGCAACT will stand out

Variation in binding sites

• Motif occurrences will not always be exact copies of the consensus string

• The transcription factor can usually tolerate some variability in its binding sites

• It’s possible that none of the 10 occurrences of our motif ATGCAACT is actualy this precise string

A new motif model

• To define a motif, lets say we know where the motif occurrence starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Motifs: Profiles and Consensus

a G g t a c T t

C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Profile matrices

• Suppose there were t sequences to begin with

• Consider a column of a profile matrix

• The column may be (t, 0, 0, 0)

– A perfectly conserved column

• The column may be (t/4, t/4, t/4, t/4)

– A completely uniform column

• “Good” profile matrices should have more

conserved columns

Scoring Motifs

• Given s = (s1, … st) and DNA

Score(s,DNA) =

a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

∑= ∈

i GCTAk

ikcount1 },,,{

),(max

Good profile matrices

• Goal is to find the starting positions s=(s1,…st) to maximize the score(s,DNA) of the resulting profile matrix

• This is one formulation of the “motif finding problem”

Another formulation

• Hamming distance between two strings v and w is

dH(v,w) = number of mismatches between v and w

• Given an array of starting positions s=(s1,…st), we

define dH(v, s) = ∑idH(v,si)

• Define: TotalDist(v, DNA) = mins dH(v,s)

• Computing TotalDist is easy

– find closest string to v in each input sequence

The median string problem

• Find v that minimizes TotalDist(v)

• A double minimization (mins, minv)

• Equivalent to motif finding problem– Show this

Naïve time complexity

• Motif finding problem: Consider every (s1,…st): O((n-l+1)t)

• Median string problem: Consider every l-mer: O(4l). Relatively fast !

• Common form of both expressions: Find a vector of L variables, each variable can take k values: O(kL)

An algorithm to enumerate !

• Want to generate all strings in {1,2,3,4}L

11…1111…1211…1311…14..

44…44

NEXTLEAF(a, L, k)for i := L to 1 if ai < k ai := ai+1 return a ai := 1return a

Increment the least significant digit; and “carry over” to next position if necessary

ALLLEAVES(L, k)a := (1,..,1)while true output a a := NEXTLEAF(a,L,k) if a = (1,..,1) return

“Seach Tree” for enumeration

Order of steps

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

1- 2- 3- 4-

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Visiting all the vertices in tree

• Not just the leaves, but the internal nodes also– Why? Why not only the leaves of the tree?– We’ll see later.

• PreOrder traversal of a tree• PreOrder(node):

1. Visit node2. PreOrder(left child of node)3. PreOrder(right child of node)

• How about a non- “recursive” version?

Visit the Next Vertex

1. NextVertex(a,i,L,k) // a : the array of digits2. if i < L // i : prefix length 3. a i+1 1 // L: max length4. return ( a,i+1) // k : max digit value5. else6. for j L to 17. if aj < k8. aj aj +19. return( a,j )10. return(a,0)

In words

• If at an internal node, just go one level deeper.

• If at a leaf node, – go to next leaf– if moved to a non-sibling in this process,

jump up

Bypassing

• What if we wish to skip an entire subtree (at some internal node) during the tree traversal ?

BYPASS(a, i, L, k)for j := i to 1 if aj < k aj := aj+1 return (a,j)return (a,0)

Brute Force Solution for the Motif finding problem

1. BruteForceMotifSearchAgain(DNA, t, n, l)2. s (1,1,…, 1)3. bestScore Score(s,DNA)4. while forever5. s NextLeaf (s, t, n-l+1)6. if (Score(s,DNA) > bestScore)7. bestScore Score(s, DNA)8. bestMotif (s1,s2 , . . . , st) 9. return bestMotif

O(l(n-l+1)t)

Can We Do Better?

• Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si)

• Every row of alignment may add at most l to Score• Optimism: if all subsequent (t-i) positions (si+1, …st) add

(t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree– Use ByPass()

• “Branch and bound” strategy– This saves us from looking at (n – l + 1)t-i leaves

Pseudocode for Branch and Bound Motif Search

1. BranchAndBoundMotifSearch(DNA,t,n,l)2. s (1,…,1)3. bestScore 04. i 15. while i > 06. if i < t7. optimisticScore Score(s, i, DNA) +(t – i ) * l8. if optimisticScore < bestScore9. (s, i) Bypass(s,i, n-l +1)10. else 11. (s, i) NextVertex(s, i, n-l +1)12. else 13. if Score(s,DNA) > bestScore14. bestScore Score(s)15. bestMotif (s1, s2, s3, …, st)16. (s,i) NextVertex(s,i,t,n-l + 1)17. return bestMotif

The median string problem

• Enumerate 4l strings v

• For each v, compute TotalDist(v, DNA)– This requires linear scan of DNA, i.e., O(nt)

• Overall: O(nt4l)

• Improvement by branch and bound ?

• During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’,DNA) > BestDistanceSoFar.

• Why enumerate further ?

Bounded Median String Search1. BranchAndBoundMedianStringSearch(DNA,t,n,l )2. s (1,…,1)3. bestDistance ∞4. i 15. while i > 06. if i < l7. prefix string corresponding to the first i nucleotides of s8. optimisticDistance TotalDistance(prefix,DNA)9. if optimisticDistance > bestDistance10. (s, i ) Bypass(s,i, l, 4)11. else 12. (s, i ) NextVertex(s, i, l, 4)13. else 14. word nucleotide string corresponding to s15. if TotalDistance(s,DNA) < bestDistance16. bestDistance TotalDistance(word, DNA)17. bestWord word18. (s,i ) NextVertex(s,i,l, 4)19. return bestWord

Greedy Algorithms

A greedy approach to the motif finding problem

• Given t sequences of length n each, to find a profile matrix of length l.

• Enumerative approach O(l nt) – Impractical

• Instead consider a more practical algorithm called “GREEDYMOTIFSEARCH”

Chapter 5.5

Greedy Motif Search

• Find two closest l-mers in sequences 1 and 2 and form 2 x l alignment matrix with Score(s,2,DNA)• At each of the following t-2 iterations, finds a “best” l-mer in

sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• Sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Greedy Motif Search pseudocode

• GREEDYMOTIFSEARCH (DNA, t, n, l)• bestMotif := (1,…,1)• s := (1,…,1)• for s1=1 to n-l+1

for s2 = 1 to n-l+1 if (Score(s,2,DNA) > Score(bestMotif,2,DNA)

bestMotif1 := s1

bestMotif2 := s2

• s1 := bestMotif1; s2 := bestMotif2

• for i = 3 to t

for si = 1 to n-l+1 if (Score(s,i,DNA) > Score(bestMotif,i,DNA)

bestMotifi := si

si := bestMotifi • Return bestMotif

A digression

• Score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution

• The issue of non-uniform “background” frequencies of bases in the genome

• A better “score” of a profile matrix ?

Information Content• First convert a “profile matrix” to a “position

weight matrix” or PWM– Convert frequencies to probabilities

• PWM W: Wk = frequency of base at position k

• q = frequency of base by chance• Information content of W:

Wβk logWβk

qββ ∈{A ,C ,G,T}

Information Content

• If Wk is always equal to q, i.e., if W is similar to random sequence, information content of W is 0.

• If W is different from q, information content is high.

Greedy Motif Search• Can be trivially modified to use “Information Content” as the

• Use statistical criteria to evaluate significance of Information Content

• At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs– “Beam search”

• The program “CONSENSUS” from Stormo lab.

• Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990) http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/hertz90.pdf

More on Greedy algorithms in next lecture

exhaustive search (cont’d) cs 466 saurabh sinha. finding motifs ab initio enumerate all possible...

Documents

saurabh arora.pdf

confidential and proprietary information of enumerate...

enumerate conceptual framework

hero motifs

japanese motifs

saurabh presentation

maruti saurabh

combinational motifs

vision motifs

scsc 555 frank li. introduction to enumeration enumerate...

mini motifs

hsv saurabh

saurabh training

ch04 motifs

saurabh ppt.ppsx

saurabh ahuja

mutiple motifs charles yan spring 2006. 2 mutiple motifs

sequence motifs

cvt saurabh

modernist motifs