pres suffix trees.ppt

What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences. Jacob Kleerekoper & Marjolijn Elsinga

Upload: sruthy-sasi

Post on 17-Nov-2015


TRANSCRIPT

  • What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences. Jacob Kleerekoper & Marjolijn Elsinga

  • Overview: Suffix; Suffix array; Suffix tree; Suffix links tree; Demo

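  • Suffix: suffixes of mississippi, listed by start position and then sorted alphabetically

    By start position     Sorted alphabetically
     1 mississippi        11 i
     2 ississippi          8 ippi
     3 ssissippi           5 issippi
     4 sissippi            2 ississippi
     5 issippi             1 mississippi
     6 ssippi             10 pi
     7 sippi               9 ppi
     8 ippi                7 sippi
     9 ppi                 4 sissippi
    10 pi                  6 ssippi
    11 i                   3 ssissippi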

  • Suffix array (for the string m i s s i s s i p p i)
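
The sorted order shown on the previous slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `suffix_array` is an assumption made for illustration, and the naive sort is not the construction method used in the article.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes of text, in sorted order."""
    # Naive construction: sort the start positions by the suffix that begins there.
    # Simple, but far slower than dedicated suffix-array algorithms on long inputs.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


text = "mississippi"
sa = suffix_array(text)
for index, start in enumerate(sa):
    # index: 0-based position in the suffix array, start: 1-based suffix position
    print(index, start, text[start - 1:])
```

Only the start positions are stored; the suffix itself is recovered by slicing the original string.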

  • Search in suffix array. Idea: two binary searches
    - search for the leftmost position of X
    - search for the rightmost position of X

    In between are all suffixes that begin with X

  • Search in suffix array: search for the leftmost occurrence of "is"

    m i s s i s s i p p i

    [Figure: binary search steps. Suffixes shown: pi, issippi, ippi. Slide note: more occurrences of "is" left of this one are possible! Result: found leftmost.]

  • Search in suffix array: search for the rightmost occurrence of "is"

    m i s s i s s i p p i

    [Figure: binary search steps. Suffixes shown: issippi, pi, ississippi, mississippi. Slide note: more occurrences of "is" right of this one are possible! Result: found rightmost.]

  • Result of the search in the suffix array. Leftmost occurrence of "is": suffix 5, at index 2. Rightmost occurrence of "is": suffix 2, at index 3.

    "is" can be found at [2..3] in the suffix array (array indices are counted from 0, suffix positions from 1)
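
The two binary searches can be sketched as follows. The names `suffix_array` and `find_range` are hypothetical helpers, not taken from the article, and the range is returned half-open, so the slide's [2..3] appears as `(2, 4)`.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text, in sorted order (naive)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_range(text, sa, x):
    """Return (lo, hi) such that sa[lo:hi] holds exactly the suffixes starting with x."""
    n = len(x)

    def prefix(k):
        # First n characters of the suffix stored at suffix-array index k.
        return text[sa[k] - 1: sa[k] - 1 + n]

    # First binary search: leftmost index whose prefix is >= x.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < x:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Second binary search: leftmost index whose prefix is > x.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= x:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


text = "mississippi"
sa = suffix_array(text)
lo, hi = find_range(text, sa, "is")
print(lo, hi - 1, sa[lo:hi])
```

Each search inspects O(log k) suffixes, and each comparison looks at no more than the length of the query. An empty range (`lo == hi`) means the query does not occur.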

  • Tree & Trie: a suffix tree is a compressed digital (suffix) trie

  • Suffix tree definition: a suffix tree is a rooted directed tree with m leaves, where m is the length of S (the database string)

    For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i
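
The definition can be checked on a small example. The sketch below builds the tree by naively inserting every suffix into a compressed trie, which takes quadratic time and is not the linear-time method discussed later. `Node`, `build_suffix_tree` and `leaves` are names invented for this illustration, and the code assumes the string ends in a unique end symbol such as `$`.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}  # first character of edge label -> (edge label, child node)
        self.start = start  # 1-based suffix start for a leaf, None for an internal node


def build_suffix_tree(s):
    """Naive O(m^2) construction; assumes s ends in a symbol that occurs nowhere else."""
    if not s or s[-1] in s[:-1]:
        raise ValueError("s must end in a unique end symbol, e.g. '$'")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[first] = (rest, Node(start=i + 1))
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves(node, path=""):
    """Yield (suffix start, concatenated edge labels from the root) for every leaf."""
    if not node.children:
        yield node.start, path
    for label, child in node.children.values():
        yield from leaves(child, path + label)


s = "mississippi$"
found = dict(leaves(build_suffix_tree(s)))
print(len(found) == len(s), all(found[i] == s[i - 1:] for i in found))
```

With the end symbol counted as part of S, the tree has exactly m leaves, and the path to leaf i spells the suffix that starts at position i.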

  • Suffix tree building. Suffixes of mississippi: 1 mississippi, 2 ississippi, 3 ssissippi, 4 sissippi, 5 issippi, 6 ssippi, 7 sippi, 8 ippi, 9 ppi, 10 pi, 11 i

    [Figure: tree diagram with root and edge labels taken from these suffixes]

  • Result of suffix tree building

    [Figure: suffix tree of mississippi, with root and leaves numbered 1 to 11]

  • Question (Adriano): How is the tree created for ANA$?

  • Answer: tree creation for ANA$. Nodes: 1 root, 2 ANA$, 3 NA$, 4 A$, 5 $

    [Figure: tree for ANA$ with root and nodes 2 to 5; edge labels include A, NA$ and $]

  • Implicit vs. explicit: trees in which a special end symbol is used are called explicit

    A search in these trees can only stop at this end symbol, which is always in a leaf

    A search in an implicit tree can stop at any internal or external node, at the last matching symbol

  • Question (Peter): How does this method search for homologous sequences as is done in BLAST and CAFE?

  • Searching in a suffix tree: the query issi matches suffixes 2 (ississippi) and 5 (issippi)

    [Figure: suffix tree of mississippi, with root and leaves numbered 1 to 11]

  • Time analysis of suffix tree: building a suffix tree can be done in O(k), where k is the length of the database string

    Searching a suffix tree can be done in O(n), where n is the length of the query string

    (Note: only in Ukkonen's implementation)
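
The O(n) search bound comes from following one character of the query per step from the root. The sketch below shows this on an uncompressed suffix trie, which is an assumption made to keep the example short: it takes quadratic time and space to build and is not Ukkonen's linear-time construction. `TrieNode`, `build_suffix_trie` and `occurrences` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.starts = []    # 1-based starts of all suffixes passing through this node


def build_suffix_trie(s):
    """Insert every suffix of s character by character (O(k^2) time and space)."""
    root = TrieNode()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, TrieNode())
            node.starts.append(i + 1)
    return root


def occurrences(root, query):
    """Return the 1-based start positions of query, walking one node per character."""
    node = root
    for ch in query:  # n steps for a query of length n
        node = node.children.get(ch)
        if node is None:
            return []  # mismatch: the query does not occur
    return list(node.starts)


trie = build_suffix_trie("mississippi")
print(occurrences(trie, "issi"))
```

The walk itself never looks at the database string, so its cost depends only on the query length; reporting the matches adds time proportional to their number.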

  • Question (Laurence): Can you explain the suffix links tree?

  • Suffix links: a necessary implementation trick to achieve a linear time and space bound while building the tree

    A suffix link is a pointer from an internal node xS to another internal node S, where x is an arbitrary character and S is a possibly empty substring

    [Figure: nodes xS and S connected by a suffix link]
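
To make the definition concrete, the sketch below lists the internal (branching) nodes of a suffix tree by their path labels and derives the suffix link of each one by dropping its first character. It does not build the tree or use the links during construction. The example string ACACAC$ and the names `internal_nodes` and `suffix_links` are chosen for this illustration only, and the input is assumed to end in a unique end symbol.

```python
def internal_nodes(s):
    """Path labels of the branching nodes in the suffix tree of s.

    A substring labels an internal node exactly when it is followed by at
    least two different characters somewhere in s. Assumes s ends in a
    unique end symbol. The empty string stands for the root.
    """
    followers = {}
    for i in range(len(s)):
        for j in range(i, len(s)):
            # s[i:j] occurs at position i and is followed there by s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map every non-root internal node xS to the node S it links to."""
    nodes = internal_nodes(s)
    links = {w: w[1:] for w in nodes if w}
    # Every link target is itself an internal node (or the root).
    assert all(target in nodes for target in links.values())
    return links


for source, target in sorted(suffix_links("ACACAC$").items()):
    print(repr(source), "->", repr(target))
```

Following the links from ACAC leads to CAC, AC, C and finally the root, which is how a construction algorithm can move to the next shorter suffix without restarting from the root.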

  • Suffix linked tree

    [Figure: suffix tree with suffix links; root and nodes numbered 1 to 9, edge labels made up of A, C and $]

  • Question (Ingmar): Why is the memory bottleneck a problem, and how is it solved with the use of suffix links?

    Answer: we interpreted the article in such a way that the suffix links cause the memory bottleneck, and not the other way around

  • Question (Lee): How can suffix links cause the memory bottleneck, and why is its reliance on virtual memory impractical?

    Answer: Suffix links are designed to take you from one region of the tree to another. Because of the size of the tree, the region pointed to may not be available in memory. The same holds for virtual memory.

  • Question (Bram): Why do we need random access of the memory?

    Answer: a tree is based on pointers; these are not inserted sequentially into memory, so random access is necessary

  • Question (Bogdan): How does this index cope with partial matches, gapped alignments and so forth, or is it just used for exact matches, which usually don't help a lot?

    Answer: Your intuition is correct here. Suffix trees as described in the article can only be used for exact (local) matches

  • Question (Lee): Can this method be used for protein data as well / can this method also be used for similar matches?

    Answer: Suffix trees can probably be used for protein data, but it is not possible to implement wildcards, or to account for the fact that amino acids are evolutionarily related and in some cases do not match exactly.

  • Question (Peter): Why is it a problem that DNA cannot be broken into words, and why doesn't it use the overlapping intervals as in CAFE?

    Answer: the beginning and end of a base string cannot be determined. Suffixes are a special kind of overlapping intervals.

  • Question (Bogdan): Why do we have to change the index for each search instead of building the index once and updating it when the database is changed?

    Answer: the index mentioned is the BLAST index, and in BLAST the index has to be updated for every search. It does not have much to do with suffix trees.

  • Question (Adriano): What is the meaning of "cold store" and "warm store"?

    Answer: We think that cold store means that the database is not entirely available in memory, and that in the case of warm store the used part of the database is in physical memory. This can be concluded from the fact that in warm store only short queries are run.

  • Question (Bogdan): What is the checkpointing which is done?

    Answer: Checkpointing is the process of associating a resource with one or more registry keys so that when the resource is moved to a new node, the required keys are propagated to the local registry on the new node. We think that the checkpointing is used to first build a portion of the tree in memory and then put the finished (checkpointed) portion onto the disk

  • Demo: Ukkonen's linear-time suffix tree algorithm (available online at: http://www.i.kyushu-u.ac.jp/~takeda/PM/SuffixTree/STreeDemo.html)