exhaustive search (cont’d) cs 466 saurabh sinha. finding motifs ab initio enumerate all possible...

Post on 18-Jan-2016

216 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Exhaustive search (cont’d)

CS 466

Saurabh Sinha

Finding motifs ab initio

• Enumerate all possible strings of some fixed (small) length

• For each such string (“motif”) count its occurrences in the promoters

• Report the most frequently occurring motif

• Does the true motif pop out ?

Chapter 4.5-4.9

Simple statistics

• Consider 10 promoters, each 100 bp long

• Suppose a secret motif ATGCAACT has been “planted”

in each promoter

• Our enumerative method counts every possible “8-mer”

• Expected number of occurrences of an 8-mer is 10 x

100 x (1/4)8 ≈ 0.015

• Most likely, an arbitrary 8-mer will occur at most once,

may be twice

• 10 occurrences of ATGCAACT will stand out

Variation in binding sites

• Motif occurrences will not always be exact copies of the consensus string

• The transcription factor can usually tolerate some variability in its binding sites

• It’s possible that none of the 10 occurrences of our motif ATGCAACT is actualy this precise string

A new motif model

• To define a motif, lets say we know where the motif occurrence starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Motifs: Profiles and Consensus

a G g t a c T t

C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Profile matrices

• Suppose there were t sequences to begin with

• Consider a column of a profile matrix

• The column may be (t, 0, 0, 0)

– A perfectly conserved column

• The column may be (t/4, t/4, t/4, t/4)

– A completely uniform column

• “Good” profile matrices should have more

conserved columns

Scoring Motifs

• Given s = (s1, … st) and DNA

Score(s,DNA) =

a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

l

t

∑= ∈

l

i GCTAk

ikcount1 },,,{

),(max

Good profile matrices

• Goal is to find the starting positions s=(s1,…st) to maximize the score(s,DNA) of the resulting profile matrix

• This is one formulation of the “motif finding problem”

Another formulation

• Hamming distance between two strings v and w is

dH(v,w) = number of mismatches between v and w

• Given an array of starting positions s=(s1,…st), we

define dH(v, s) = ∑idH(v,si)

• Define: TotalDist(v, DNA) = mins dH(v,s)

• Computing TotalDist is easy

– find closest string to v in each input sequence

The median string problem

• Find v that minimizes TotalDist(v)

• A double minimization (mins, minv)

• Equivalent to motif finding problem– Show this

Naïve time complexity

• Motif finding problem: Consider every (s1,…st): O((n-l+1)t)

• Median string problem: Consider every l-mer: O(4l). Relatively fast !

• Common form of both expressions: Find a vector of L variables, each variable can take k values: O(kL)

An algorithm to enumerate !

• Want to generate all strings in {1,2,3,4}L

11…1111…1211…1311…14..

44…44

NEXTLEAF(a, L, k)for i := L to 1 if ai < k ai := ai+1 return a ai := 1return a

Increment the least significant digit; and “carry over” to next position if necessary

ALLLEAVES(L, k)a := (1,..,1)while true output a a := NEXTLEAF(a,L,k) if a = (1,..,1) return

“Seach Tree” for enumeration

--

Order of steps

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

1- 2- 3- 4-

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

12 3

4

12 3

4

Visiting all the vertices in tree

• Not just the leaves, but the internal nodes also– Why? Why not only the leaves of the tree?– We’ll see later.

• PreOrder traversal of a tree• PreOrder(node):

1. Visit node2. PreOrder(left child of node)3. PreOrder(right child of node)

• How about a non- “recursive” version?

Visit the Next Vertex

1. NextVertex(a,i,L,k) // a : the array of digits2. if i < L // i : prefix length 3. a i+1 1 // L: max length4. return ( a,i+1) // k : max digit value5. else6. for j L to 17. if aj < k8. aj aj +19. return( a,j )10. return(a,0)

In words

• If at an internal node, just go one level deeper.

• If at a leaf node, – go to next leaf– if moved to a non-sibling in this process,

jump up

Bypassing

• What if we wish to skip an entire subtree (at some internal node) during the tree traversal ?

BYPASS(a, i, L, k)for j := i to 1 if aj < k aj := aj+1 return (a,j)return (a,0)

Brute Force Solution for the Motif finding problem

1. BruteForceMotifSearchAgain(DNA, t, n, l)2. s (1,1,…, 1)3. bestScore Score(s,DNA)4. while forever5. s NextLeaf (s, t, n-l+1)6. if (Score(s,DNA) > bestScore)7. bestScore Score(s, DNA)8. bestMotif (s1,s2 , . . . , st) 9. return bestMotif

O(l(n-l+1)t)

Can We Do Better?

• Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si)

• Every row of alignment may add at most l to Score• Optimism: if all subsequent (t-i) positions (si+1, …st) add

(t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree– Use ByPass()

• “Branch and bound” strategy– This saves us from looking at (n – l + 1)t-i leaves

Pseudocode for Branch and Bound Motif Search

1. BranchAndBoundMotifSearch(DNA,t,n,l)2. s (1,…,1)3. bestScore 04. i 15. while i > 06. if i < t7. optimisticScore Score(s, i, DNA) +(t – i ) * l8. if optimisticScore < bestScore9. (s, i) Bypass(s,i, n-l +1)10. else 11. (s, i) NextVertex(s, i, n-l +1)12. else 13. if Score(s,DNA) > bestScore14. bestScore Score(s)15. bestMotif (s1, s2, s3, …, st)16. (s,i) NextVertex(s,i,t,n-l + 1)17. return bestMotif

The median string problem

• Enumerate 4l strings v

• For each v, compute TotalDist(v, DNA)– This requires linear scan of DNA, i.e., O(nt)

• Overall: O(nt4l)

• Improvement by branch and bound ?

• During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’,DNA) > BestDistanceSoFar.

• Why enumerate further ?

Bounded Median String Search1. BranchAndBoundMedianStringSearch(DNA,t,n,l )2. s (1,…,1)3. bestDistance ∞4. i 15. while i > 06. if i < l7. prefix string corresponding to the first i nucleotides of s8. optimisticDistance TotalDistance(prefix,DNA)9. if optimisticDistance > bestDistance10. (s, i ) Bypass(s,i, l, 4)11. else 12. (s, i ) NextVertex(s, i, l, 4)13. else 14. word nucleotide string corresponding to s15. if TotalDistance(s,DNA) < bestDistance16. bestDistance TotalDistance(word, DNA)17. bestWord word18. (s,i ) NextVertex(s,i,l, 4)19. return bestWord

Greedy Algorithms

A greedy approach to the motif finding problem

• Given t sequences of length n each, to find a profile matrix of length l.

• Enumerative approach O(l nt) – Impractical

• Instead consider a more practical algorithm called “GREEDYMOTIFSEARCH”

Chapter 5.5

Greedy Motif Search

• Find two closest l-mers in sequences 1 and 2 and form 2 x l alignment matrix with Score(s,2,DNA)• At each of the following t-2 iterations, finds a “best” l-mer in

sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• Sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Greedy Motif Search pseudocode

• GREEDYMOTIFSEARCH (DNA, t, n, l)• bestMotif := (1,…,1)• s := (1,…,1)• for s1=1 to n-l+1

for s2 = 1 to n-l+1 if (Score(s,2,DNA) > Score(bestMotif,2,DNA)

bestMotif1 := s1

bestMotif2 := s2

• s1 := bestMotif1; s2 := bestMotif2

• for i = 3 to t

for si = 1 to n-l+1 if (Score(s,i,DNA) > Score(bestMotif,i,DNA)

bestMotifi := si

si := bestMotifi • Return bestMotif

A digression

• Score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution

• The issue of non-uniform “background” frequencies of bases in the genome

• A better “score” of a profile matrix ?

Information Content• First convert a “profile matrix” to a “position

weight matrix” or PWM– Convert frequencies to probabilities

• PWM W: Wk = frequency of base at position k

• q = frequency of base by chance• Information content of W:

Wβk logWβk

qββ ∈{A ,C ,G,T}

∑k

Information Content

• If Wk is always equal to q, i.e., if W is similar to random sequence, information content of W is 0.

• If W is different from q, information content is high.

Greedy Motif Search• Can be trivially modified to use “Information Content” as the

score

• Use statistical criteria to evaluate significance of Information Content

• At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs– “Beam search”

• The program “CONSENSUS” from Stormo lab.

• Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990) http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/hertz90.pdf

More on Greedy algorithms in next lecture

top related