lecture 135 full text new - ut · •burrows wheeler transform (bwt) –compact suffix tree...

21
12/29/17 1 Text Algorithms (6EAP) Full text indexing Jaak Vilo 2017 fall 1 MTAT.03.190 Text Algorithms Jaak Vilo Key take home messages Suffix trie, Suffix tree Application examples Suffix array Burrows Wheeler Transform (BWT) Compact Suffix tree representation: BWT + succinct Problem Given P and S – find all exact or approximate occurrences of P in S You are allowed to preprocess S (and P, of course) Goal: to speed up the searches E.g. Dictionary problem Does P belong to a dictionary D={d 1 ,…,d n } Build a binary search tree of D B-Tree of D Hashing Sorting + Binary search Build a keyword trie: search in O(|P|) Assuming alphabet has up to a constant size c See Aho-Corasick algorithm, Trie construction Sorted array and binary search he hers his global index happy head header info informal search show stop 1 13 Sorted array and binary search he hers his global index happy head header info informal search show stop 1 13 O( |P| log n )

Upload: others

Post on 20-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

1

TextAlgorithms(6EAP)

Fulltextindexing

JaakVilo2017fall

1MTAT.03.190TextAlgorithmsJaakVilo

Keytakehomemessages

• Suffixtrie,Suffixtree– Applicationexamples

• Suffixarray

• BurrowsWheelerTransform(BWT)– CompactSuffixtreerepresentation:BWT+succinct

Problem

• GivenPandS– findallexactorapproximateoccurrencesofPinS

• YouareallowedtopreprocessS(andP,ofcourse)

• Goal:tospeedupthesearches

E.g.Dictionaryproblem

• DoesPbelongtoadictionaryD={d1,…,dn}– BuildabinarysearchtreeofD– B-TreeofD– Hashing– Sorting+Binarysearch

• Buildakeywordtrie:searchinO(|P|)– Assumingalphabethasuptoaconstantsizec– SeeAho-Corasickalgorithm,Trieconstruction

Sortedarrayandbinarysearch

he

hershis

global

indexhappy

head

header

info

informal

search

show

stop

1 13

Sortedarrayandbinarysearch

he

hershis

global

indexhappy

head

header

info

informal

search

show

stop

1 13

O( |P| log n )

Page 2: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

2

TrieforD={he,hers,his,she}

0

1

2

h

e

3

s

4

5

e

h

8

i

7

s

9

r

6

s

O( |P| )

S!=setofwords

• Soflengthn

• Howtoindex?

• Indexfromeverypositionofatext

• Prefixofeverypossiblesuffixisimportant

a

b

b

aaa

aa

b

b

b

babaababaab

baabaab

abb

Trie(babaab)

b

a

a

b

Suffixtree• Definition: Acompactrepresentationofatriecorrespondingtothe

suffixesofagivenstringwhereallnodeswithonechildaremergedwiththeirparents.

• Definition(suffixtree).AsuffixtreeTforastringS(withn=|S|)isarooted,labeledtreewithaleafforeachnon-emptysuffixofS.Furthermore,asuffixtreesatisfiesthefollowingproperties:

• Eachinternalnode,otherthantheroot,hasatleasttwochildren;• Eachedgeleavingaparticularnodeislabeledwithanon-emptysubstring

ofSofwhichthefirstsymbolisuniqueamongallfirstsymbolsoftheedgelabelsoftheedgesleavingthisparticularnode;

• Foranyleafinthetree,theconcatenationoftheedgelabelsonthepathfromtheroottothisleafexactlyspellsoutanon-emptysuffixofs.

• DanGusfield: AlgorithmsonStrings,Trees,andSequences:ComputerScienceandComputationalBiology.Hardcover- 534pages1stedition(January15,1997).CambridgeUnivPr(Short);ISBN:0521585198.

Literatureonsuffixtrees• http://en.wikipedia.org/wiki/Suffix_tree• DanGusfield: AlgorithmsonStrings,Trees,andSequences:Computer

ScienceandComputationalBiology.Hardcover- 534pages1stedition(January15,1997).CambridgeUnivPr(Short);ISBN:0521585198.(pages:89--208)

• E.Ukkonen.On-lineconstructionofsuffixtrees.Algorithmica,14:249-60,1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751

• Ching-FungCheung,JeffreyXuYu,HongjunLu."ConstructingSuffixTreeforGigabyteSequenceswithMegabyteMemory,"IEEETransactionsonKnowledgeandDataEngineering,vol.17,no.1,pp.90-105,January,2005.http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3

• CPMarticlesarchive:http://www.cs.ucr.edu/~stelo/cpm/

• MarkNelson.FastStringSearchingWithSuffixTreesDr.Dobb'sJournal,August,1996.http://www.dogma.net/markn/articles/suffixt/suffixt.htm

http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

Page 3: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

3

13

Suffix tree andsuffix array techniques forpatternanalysis instringsEskoUkkonenUniv Helsinki

Erice School30Oct 2005E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Partlybasedon:

High-throughput genome-scale sequence analysis andmapping using compressed datastructures

VeliMäkinenDepartmentofComputerScience

University ofHelsinki

14

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

15

Analysisofastringofsymbols

• T=hattivatti’text’• P=att’pattern’

• FindtheoccurrencesofPinT:hattivatti

• Patternsynthesis:#(t)=4#(atti)=2#(t****t)=2

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 16

Solution:backtrackingwithsuffixtree

...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....

17

Patternfinding&synthesisproblems• T=t1t2 …tn,P=p1p2 …pn ,stringsofsymbolsinfinitealphabet

• Indexingproblem:PreprocessT(buildanindexstructure)suchthattheoccurrencesofdifferentpatternsPcanbefoundfast– statictext,anygivenpatternP

• Patternsynthesisproblem:LearnfromTnewpatternsthatoccursurprisinglyoften

• Whatisapattern?Exactsubstring,approximatesubstring,withgeneralizedsymbols,withgaps,… 18

1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Page 4: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

4

19

ThesuffixtreeTree(T)ofT

• datastructuresuffixtree, Tree(T),iscompactedtrie thatrepresentsallthesuffixesofstringT

• linearsize:|Tree(T)|=O(|T|)• canbeconstructedinlineartimeO(|T|)• hasmyriadvirtues (A.Apostolico)• iswell-known:366000Googlehits

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt 20

Suffixtrieandsuffixtree

a

b

b

aaa

aa

b

b

b

abaabbaabaababb

Trie(abaab)

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

21

Suffixtrieandsuffixtree

a

b

b

aaa

aa

b

b

b

a

baab

baabab

abaabbaabaababb

Trie(abaab) Tree(abaab)

22

Trie(T)canbelarge

• |Trie(T)|=O(|T|2)• badexample:T=anbn

• Trie(T)canbeseenasaDFA:languageaccepted=thesuffixesofT

• minimizetheDFA=>directedcyclicwordgraph(’DAWG’)

23

Tree(T)isoflinearsize

• onlytheinternalbranchingnodesandtheleavesrepresentedexplicitly

• edgeslabeledbysubstringsofT• v=node(α)ifthepathfromroottovspellsα• one-to-onecorrespondenceofleavesandsuffixes

• |T|leaves,hence<|T|internalnodes• |Tree(T)|=O(|T|+size(edgelabels))

24

Tree(hattivatti)hattivatti

attivatti

ttivattitivatti

ivattivatti

attitti

tii

hattivattiattivatti ttivatti

tivatti

ivatti

vatti

vattivatti

attiti

i

i

tti

ti

t

i

vatti

vatti

vatti

hattivatti

atti

Page 5: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

5

25

Tree(hattivatti)hattivatti

attivatti

ttivattitivatti

ivattivatti

attitti

tii

hattivattiattivatti ttivatti

tivatti

ivatti

vatti

vattivatti

attiti

i

i

tti

ti

t

i

vatti

vatti

vatti

hattivatti

hattivatti

atti

substring labels of edges represented as pairs of pointers

26

Tree(hattivatti)hattivatti

attivatti

ttivattitivatti

ivattivatti

attitti

tii

1 2 34

5

6

6,106,10

2,54,5

i

10

8

9

3,3

i

vatti

vatti

vatti

hattivatti

hattivatti

7

27

Tree(T)isfull textindexTree(T)

P

31 8

P occurs in T at locations 8, 31, …

P occurs in T ó P is a prefix of some suffix of T ó Path for P exists in Tree(T)

All occurrences of P in time O(|P| + #occ)28

Findatt fromTree(hattivatti)hattivatti

attivatti

ttivattitivatti

ivattivatti

attitti

tii

hattivattiattivatti ttivatti

tivatti

ivatti

vatti

vattivatti

attiti

2

i

tti

ti

t

i

vatti

vatti

vatti

hattivatti

atti7

29

LineartimeconstructionofTree(T)

hattivattiattivatti

ttivattitivatti

ivattivatti

attitti

tii

Weiner (1973),

’algorithm of the year’

McCreight (1976)

’on-line’ algorithm (Ukkonen 1992) 30

On-lineconstructionofTrie(T)

• T=t1t2 …tn$• Pi =t1t2 …ti i:th prefix ofT• on-lineidea:updateTrie(Pi) toTrie(Pi+1)• =>verysimpleconstruction

Page 6: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

6

31

Trie(abaab)

a a

b

b a

b

b

aa

Trie(a) Trie(ab) Trie(aba)

chain of links connects the end points of current suffixes

abaabaa

aaεaε

32

Trie(abaab)

a a

b

b a

b

b

aa

a

b

b

aaa

aa

Trie(abaa)

33

Trie(abaab)

a a

b

b a

b

b

aa

a

b

b

aaa

aa

Trie(abaa)

Add next symbol = b

34

Trie(abaab)

a a

b

b a

b

b

aa

a

b

b

aaa

aa

Trie(abaa)

Add next symbol = b

From here on b-arc already exists

35

Trie(abaab)

a a

b

b a

b

b

aa

a

b

b

aaa

aa

a

b

b

aaa

aa

b

b

b

Trie(abaab)

36

WhathappensinTrie(Pi) =>Trie(Pi+1) ?

ai

ai

aiai

aiai

Before

After

New nodes

New suffix links

From here on the ai-arc exists already => stop updating here

Page 7: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

7

37

WhathappensinTrie(Pi) =>Trie(Pi+1) ?

• time:O(sizeofTrie(T))• suffixlinks:

slink(node(aα))=node(α)

38

On-lineprocedureforsuffixtrie

1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root

2. for i := 2 to n do begin3. vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)

4. v := vi-1; v´ := 0

5. while node v has no outgoing arc for ti do begin

6. Create a new node v´´ and an arc son(v,ti) = v´´

7. if v´ ≠ 0 then slink(v) := v´´

8. v := slink(v); v´ := v´´ end9. for the node v´´ such that v´´= son(v,ti) do

if v´´ = v´ then slink(v’) := root else slink(v´) := v´´

39

Suffixtreeson-line

• ’compactedversion’oftheon-linetrieconstruction:simulatetheconstructiononthelinearsizetreeinsteadofthetrie=>timeO(|T|)

• alltrienodesareconceptuallystillneeded=>implicit andreal nodes

40

Implicitandrealnodes

• Pair(v,α)isanimplicitnode inTree(T)ifvisanodeofTreeandα isa(proper)prefixofthelabelofsomearcfrom v.Ifα istheemptystringthen (v,α)isa ’real’ node(=v).

• Let v=node(α´)in Tree(T). Then implicitnode(v,α)representsnode(α´α)ofTrie(T)

41

Implicitnode

v

(v, α)α…

α´

42

Suffixlinksandopenarcs

v

α

root

slink(v)

label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T

w

Page 8: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

8

43

Bigpicture

… … …

suffix link path traversed: total work O(n)

new arcs and nodes created: total work O(size(Tree(T)) 44

On-lineprocedureforsuffixtree

Input: string T = t1t2 … tn$

Output: Tree(T)

Notation: son(v,α) = w iff there is an arc from v to w with label α

son(v,ε) = v

Function Canonize(v, α):

while son(v, α´) ≠ 0 where α = α´ α´´, | α´| > 0 dov := son(v, α´); α := α´´

return (v, α)

45

Suffix-treeon-line:mainprocedure

Create Tree(t1); slink(root) := root(v, α) := (root, ε) /* (v, α) is the start node */for i := 2 to n+1 do

v´ := 0while there is no arc from v with label prefix αti do

if α ≠ ε then /* divide the arc w = son(v, αη) into two */son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w

elseson(v,ti) := v´´´; v´´ := v

if v´ ≠ 0 then slink(v´) := v´´v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)

if v´ ≠ 0 then slink(v´) := v(v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */

http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

47

Theactualtimeandspace

• |Tree(T)|isabout20|T|inpractice• brute-forceconstructionisO(|T|log|T|)forrandomstringsastheaveragedepthofinternalnodesisO(log|T|)

• differencebetweenlinearandbrute-forceconstructionsnotnecessarilylarge(Giegerich&Kurtz)

• truncatedsuffixtrees:ksymbolslongprefixofeachsuffixrepresented(Naetal.2003)

• alphabetindependentlineartime(Farach1997)

abc

Page 9: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

9

abcabxabcd

ApplicationsofSuffixTrees

• DanGusfield: AlgorithmsonStrings,Trees,andSequences:ComputerScienceandComputationalBiology.Hardcover-534pages1stedition(January15,1997).CambridgeUnivPr(Short);ISBN:0521585198.- book

• APL1:ExactStringMatchingSearchforPfromtextS.Solution1:buildSTree(S)- oneachievesthesameO(n+m)asKnuth-Morris-Pratt,forexample!

• SearchfromthesuffixtreeisO(|P|)• APL2:ExactsetmatchingSearchforasetofpatternsP

Page 10: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

10

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 55

C

Backtobacktracking

AC

T

4 2 1 5 36

CT

TAT

TAC

T

AT C

T

A

ACA, 1 mismatch

Same idea can be used to many otherforms of approximate search, like Smith-Waterman, position-restricted scoringmatrices, regular expression search, etc.

ApplicationsofSuffixTrees

• APL3:substringproblemforadatabaseofpatternsGivenasetofstringsS=S1,...,Sn--- adatabaseFindallSithathavePasasubstring

• GeneralizedsuffixtreecontainsallsuffixesofallSi• QueryintimeO(|P|),andcanidentifytheLONGESTcommonprefixofPinallSi

ApplicationsofSuffixTrees

• APL4:Longestcommonsubstringoftwostrings• FindthelongestcommonsubstringofSandT.• OveralltherearepotentiallyO(n2 )suchsubstrings,ifnisthelengthofashorterofSandT

• Donald Knuthonce(1970)conjectured thatlinear-timealgorithmisimpossible.

• Solution:constructtheSTree(S+T)andfindthenodedeepestinthetreethathassuffixesfrombothSandTinsubtreeleaves.

• Ex:S=superiorcalifornialives T=sealiver havebothasubstringalive. ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 58

Simpleanalysistask:LCSS

• LetLCSSA(A,B) denotethelongestcommonsubstringtwosequencesA andB. E.g.:– LCSS(AGATCTATCT,CGCCTCTATG)=TCTAT.

• Agoodsolutionistobuildsuffixtreefortheshortersequenceandmakeadescendingsuffixwalk withtheothersequence.

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 59

Suffixlink

X

aX

suffix link

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 60

Descendingsuffixwalk

suffix tree of A Read B left-to-right,always going down in thetree when possible.If the next symbol of B doesnot match any edge labelon current position, takesuffix link, and try again.(Suffix link in the root to itself emits a symbol).The node v encountered with largest string depthis the solution.v

Page 11: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

11

ApplicationsofSuffixTrees

• APL5:RecognizingDNAcontaminationRelatedtoDNAsequencing,searchforlongeststrings(longerthanthreshold)thatarepresentintheDBofsequencesofothergenomes.

• APL6: CommonsubstringsofmorethantwostringsGeneralizationofAPL4,canbedoneinlinear(intotallengthofallstrings)time

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 62

Anothercommontool:Generalizedsuffixtree

ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$

AC

C

node info:subtree size 47813871sequence count 87

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 63

Generalizedsuffixtreeapplication

...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#...

...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#......#...

AC

C

node info:subtree size 4398blue sequences 12/15red sequences 2/62

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 64

Casestudycontinued

genome

regions with ChIP-seq matches

suffix tree of genome

5 blue1 red

TAC..........T

motif?

ApplicationsofSuffixTrees

• APL7:Buildingadirectedgraphforexactmatching: Suffixgraph - directedacyclicwordgraph(DAWG),asmallestfinitestateautomaton recognizingallsuffixesofastringS.Thisautomatoncanrecognize membership,butnottellwhichsuffixwasmatched.

• Construction:mergeisomorficsubtrees.• IsomorficinSuffixTreewhenexistssuffixlinkpath,andsubtreeshaveequalnr.ofleaves.

ApplicationsofSuffixTrees

• APL8:Areverseroleforsuffixtrees,andmajorspacereductionIndexthepattern,nottree...

• Matchingstatistics.• APL10:All-pairssuffix-prefixmatchingForallpairsSi, Sj, findthelongestmatchingsuffix-prefixpair.Usedinshortestcommonsuperstringgeneration(e.g.DNAsequenceassembly),ESTalignmentmetc.

Page 12: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

12

ApplicationsofSuffixTrees

• APL11:Findingallmaximalrepetitivestructuresinlineartime

• APL12:Circularstringlinearizatione.g.circularchemicalmoleculesinthedatabase,onewantstolienarizetheminacanonicalway...

• APL13:Suffixarrays- morespacereductionwilltouchthatseparately

ApplicationsofSuffixTrees

• APL14:Suffixtreesingenome-scaleprojects• APL15:ABoyer-Mooreapproachtoexactsetmatching

• APL16:Ziv-Lempeldatacompression• APL17:MinimumlengthencodingofDNA

ApplicationsofSuffixTrees• AdditionalapplicationsMostlyexercises...• Extrafeature:CONSTANTtimelowestcommonancestorretrieval(LCA)

Andmestruktuurmisvõimaldableidakonstantseajagaalumistühistvanemat(seevastabpikimaleühiseleprefixile!)onvõimalikkoostadalineaarseajaga.

• APL:Longestcommonextension:abridgetoinexactmatching• APL:Findingallmaximalpalindromesinlineartime

Palindromereadsfromcentralpositionthesametoleftandright.E.g.:kirik,saippuakivikauppias.

• BuildthesuffixtreeofSandinvertedS(aabcbad=>aabcbad#dabcbaa)andusingtheLCAonecanaskforanypositionpair(i,2i-1),thelongestcommonprefixinconstanttime.

• ThewholeproblemcanbesolvedinO(n).

ApplicationsofSuffixTrees

• APL:Exactmatchingwithwildcards• APL:Thek-mismatchproblem• Approximatepalindromesandrepeats• Fastermethodsfortandemrepeats• Alinear-timesolutiontothemultiplecommonsubstringproblem

• Andmany-manymore...

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 71

Propertiesofsuffixtree

• Suffixtreehasn leavesandatmostn-1internalnodes,wheren isthetotallengthofallsequencesindexed.

• Eachnoderequiresconstantnumberofintegers(pointerstofirstchild,sibling,parent,textrangeofincomingedge,statisticscounters,etc.).

• Canbeconstructedinlineartime.

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 72

Propertiesofsuffixtree...inpractice

• Hugeoverheadduetopointerstructure:– Standardimplementationofsuffixtreeforhumangenomerequiresover200GB memory!

– Acarefulimplementation(usinglogn -bitfieldsforeachvalueandarraylayoutforthetree)stillrequiresover40GB.

– Humangenomeitselftakeslessthan1GB using2-bitsperbp.

Page 13: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

13

73

1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs

74

Suffixes- sorted

• Sortallsuffixes.Allowstoperformbinarysearch!

hattivattiattivatti

ttivattitivattiivattivatti

attittitiiε

εattiattivattihattivattiiivattititivattittittivattivatti

75

Suffixarray:example

• suffixarray=lexicographicorderofthesuffixes

hattivattiattivattittivattitivattiivattivattiattittitiiε

εattiattivattihattivattiiivattititivattittittivattivatti

1172

1105948

36

1234567891011

1172110594836

76

Suffixarrayconstruction:sort!

• suffixarray=lexicographicorderofthesuffixes

hattivattiattivattittivattitivattiivattivattiattittitiiε

1172

1105948

36

123

456789

1011

1172110594836

77

Suffixarray

• suffixarray SA(T)=anarraygivingthelexicographicorderofthesuffixesofT

• spacerequirement:5|T|• practitionerslikesuffixarrays(simplicity,spaceefficiency)

• theoreticianslikesuffixtrees(explicitstructure)

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 78

Reducingspace:suffixarray

AC

T

4 2 1 5 36

C A T A C T1 2 3 4 5 6

=[3,3]=[3,3]=[2,2]

suffix array

=[4,6]=[6,6]=[2,6]

=[3,6]=[5,6]

CC

TTA

T

TAC

T CT

A

T

A

Page 14: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

14

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 79

Suffixarray

• Manyalgorithmsonsuffixtreecanbesimulatedusingsuffixarray...– ...andcoupleofadditionalarrays...– ...formingso-calledenhancedsuffixarray...– ...leadingtothesimilarspacerequirementascarefulimplementationofsuffixtree

• Notasatisfactorysolutiontothespaceissue.

80

Patternsearchfromsuffixarrayhattivattiattivattittivattitivattiivattivattiattittitiiε

εattiattivattihattivattiiivattititivattittittivattivatti

1172110594836

att binary search

ISMB2009Tutorial VeliMäkinen:"...analysisandmapping..." 81

Whatwelearntoday?

• Welearnthatitispossibletoreplacesuffixtreeswithcompressedsuffixtrees thattake8.8GB forthehumangenome.

• Welearnthatbacktracking canbedoneusingcompressedsuffixarrays requiringonly2.1GBforthehumangenome.

• Welearnthatdiscovering interestingmotifseedsfromthehumangenometakes40hoursandrequires9.3GB space.

82

Recentsuffixarrayconstructions

• Manber&Myers (1990):O(|T|log|T|)• linear time viasuffix tree• January /June 2003:direct linear timeconstruction ofsuffix array- Kim,Sim,Park,Park(CPM03)- Kärkkäinen&Sanders (ICALP03)- Ko &Aluru (CPM03)

83

Kärkkäinen-Sandersalgorithm

1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.

2. Construct the suffix array of the remaining suffixes using the result of the first step.

3. Merge the two suffix arrays into one.

84

Notation

• stringT=T[0,n)=t0t1 …tn-1• suffixSi =T[i,0)=titi+1 …tn-1• forC\subset[0,n]:SC ={Si|iinC}

• suffixarray SA[0,n]ofTisapermutationof[0,n]satisfyingSSA[0] <SSA[1] <…<SSA[n]

Page 15: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

15

85

Runningexample

• T[0,n)=yabbadabbado00…

• SA=(12,1,6,4,9,3,8,2,7,5,10,11,0)

0 1 2 3 4 5 6 7 8 9 10 11

86

Step0:Constructasample

• fork=0,1,2Bk={iє [0,n]|imod3=k}

• C=B1UB2samplepositions• SC samplesuffixes

• Example:B1={1,4,7,10},B2={2,5,8,11},C={1,4,7,10,2,5,8,11}

87

Step1:Sortsamplesuffixes• fork=1,2,construct

Rk=[tktk+1tk+2][tk+3tk+4tk+5]…[tmaxBktmaxBk+1tmaxBk+2]

R=R1^R2concatenationofR1andR2

SuffixesofRcorrespondtoSC:suffix[titi+1ti+2]…correspondstoSi;correspondenceisorderpreserving.

SortthesuffixesofR:radixsortthecharactersandrenamewithrankstoobtainR´.Ifallcharactersdifferent,theirorderdirectlygivestheorderofsuffixes.Otherwise,sortthesuffixesofR´ usingKärkkäinen-Sanders.Note:|R´|=2n/3.

88

Step1(cont.)

• oncethesamplesuffixesaresorted,assignaranktoeach:rank(Si)=therankofSiinSC;rank(Sn+1)=rank(Sn+2)=0

• Example:R=[abb][ada][bba][do0][bba][dab][bad][o00]R´ =(1,2,4,6,4,5,3,7)SAR´ =(8,0,1,6,4,2,5,3,7)rank(Si)- 14- 26- 53– 78– 00

89

Step2:Sortnonsamplesuffixes

• foreachnon-sampleSi є SB0 (notethatrank(Si+1)isalwaysdefinedforiє B0):

Si ≤Sj ↔(ti,rank(Si+1))≤(tj,rank(Sj+1))• radixsortthepairs(ti,rank(Si+1)).

• Example:S12 <S6 <S9 <S3 <S0because(0,0)<(a,5)<(a,7)<(b,2)<(y,1)

90

Step3:Merge• mergethetwosortedsetsofsuffixesusingastandardcomparison-basedmerging:

• tocompareSi є SC withSj є SB0,distinguishtwocases:

• iє B1:Si ≤Sj ↔(ti,rank(Si+1))≤(tj,rank(Sj+1))• iє B2:Si ≤Sj ↔(ti,ti+1,rank(Si+2))≤(tj,tj+1,rank(Sj+2))

• notethattheranksaredefinedinallcases!• S1 <S6 as(a,4)<(a,5)andS3 <S8 as(b,a,6)<(b,a,7)

Page 16: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

16

91

RunningtimeO(n)

• excludingtherecursivecall,everythingcanbedoneinlineartime

• therecursionisonastringoflength2n/3• thusthetimeisgivenbyrecurrence

T(n)=T(2n/3)+O(n)• henceT(n)=O(n)

92

Implementation

• about50linesofC++• codeavailablee.g.viaJuhaKärkkäinen’shomepage

93

LCPtable

• LongestCommonPrefixofsuccessiveelementsofsuffixarray:

• LCP[i]=lengthofthelongestcommonprefixofsuffixesSSA[i] andSSA[i+1]

• buildinversearraySA-1 fromSAinlineartime• thenLCPtablefromSA-1 inlineartime(Kasaietal,CPM2001)

• OxfordEnglishDisctionary http://www.oed.com/• Example- WordoftheDay,Fourth

http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.htmlhttp://www.oed.com/cgi/display/wotd

• PATindex- byGastonGonnet(taonsamutiMapletarkvaraüksloojatestninghiljemmolekulaarbioloogiatarkvarapaketiväljatöötajaid)

• PATindexisessentiallyasuffixarray.Tosavespace,indexedonlyfromfirstcharacterofeveryword

• XML-tagging(orSGML,atthattime!)alsoindexed• TomarkcertainfieldsofXML,thebitvectorswereused.• Mainconcern- improvethespeedofsearchontheCD- minimizerandom

accesses.• Forslowmediumeven15-20accessesistooslow...• G.H.Gonnet,R.A.Baeza-Yates,andT.Snider,Lexicographicalindicesfor

text:Invertedfilesvs.PATtrees,TechnicalReportOED-91-01,CentrefortheNewOED,UniversityofWaterloo,1991.

95

Suffixtreevssuffixarray

• suffixtreeó suffixarray+LCPtable

96

1. Suffix tree

2. Suffix array

3. Some applications

4. Finding motifs

Page 17: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

17

97

SubstringmotifsofstringT

• stringT =t1 …tninalphabetA.• Problem:whatarethefrequentlyoccurring(ungapped)substringsofT?Longestsubstringthatoccursatleastq times?

• Thm:SuffixtreeTree(T) givescompleteoccurrencecountsofallsubstringmotifsofTinO(n) time(althoughT mayhaveO(n2)substrings!)

98

Countingthesubstringmotifs

• internalnodesofTree(T)↔repeatingsubstringsofT

• numberofleavesofthesubtreeofanodeforstringP=numberofoccurrencesofPinT

99

Substringmotifsofhattivatti

hattivattiattivatti ttivatti

tivatti

ivatti

vatti

vattivatti

attiti

i

i

tti

ti

t

i

vatti

vatti

vatti

hattivatti

atti

2

2 2

24

Counts for the O(n) maximal motifs shown100

FindingrepeatsinDNA

• humanchromosome3• thefirst48999930bases• 31mincputime(8processors,4GB)

• Humangenome:3x109 bases• Tree(HumanGenome)feasible

101

Longestrepeat?

Occurrences at: 28395980, 28401554r Length: 2559

ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg

102

Tenoccurrences?

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt

Length: 277

Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125In the reversed complement at: 17858493, 41463059, 42431718, 42580925

Page 18: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

18

103

Usingsuffixtrees:plagiarism

• findlongestcommonsubstringofstringsXandY

• buildTree(X$Y)andfindthedeepestnodewhichhasaleafpointingtoXandanotherpointingtoY

104

Usingsuffixtrees:approximatematching

• editdistance:insertions,deletions,changes

• STOCKHOLMvsTUKHOLMA

105

Stringdistance/similarityfunctions

STOCKHOLM vs TUKHOLMA

STOCKHOLM__TU_ KHOLMA

=> 2 deletions, 1 insertion, 1 change

106

Dynamicprogrammingdi,j = min(if ai=bj then di-1,j-1 else ¥,

di-1,j + 1, di,j-1 + 1)

= distance between i-prefix of A and j-prefix of B(substitution excluded)

di,j

di-1,j-1

di,j-1

di-1,j

dm,n

mxn table d

A

B

ai

bj

+1

+1

107

A\B s t o c k h o l m0 1 2 3 4 5 6 7 8 9

t 1 2 1 2 3 4 5 6 7 8u 2 3 2 3 4 5 6 7 8 9k 3 4 3 4 5 4 5 6 7 8h 4 5 4 5 6 5 4 5 6 7o 5 6 5 4 5 6 5 4 5 6l 6 7 6 5 6 7 6 5 4 5m 7 8 7 6 7 8 7 6 5 4a 8 9 8 7 8 9 8 7 6 5

di,j = min(if ai=bj then di-1,j-1 else ¥, di-1,j + 1, di,j-1 + 1)

dID(A,B)optimal alignment by trace-back 108

Searchproblem

• findapproximateoccurrencesofpatternPintextT:substringsP’ofTsuchthatd(P,P’)small

• dynprogrwithsmallmodification:O(mn)• lotsof(practical)improvementtricks

P

T P’

Page 19: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

19

109

Indexforapproximatesearching?

• dynamicprogramming:PxTree(T)withbacktracking

P

Tree(T)

Burrows-WheelerTransformation

• BWTfortextcompressionandindexing

Burrows-Wheeler• SeeFAQ http://www.faqs.org/faqs/compression-faq/part2/section-9.html• Themethoddescribedintheoriginalpaperisreallyacompositeofthreedifferent

algorithms:– theblocksortingmainengine(alossless,veryslightlyexpansivepreprocessor),– themove-to-frontcoder(abyte-for-bytesimple,fast,locallyadaptivenoncompressivecoder)and– asimplestatisticalcompressor(firstorderHuffmanismentionedasacandidate)eventuallydoing

thecompression.• Ofthesethreemethodsonlythefirsttwoarediscussedhereastheyarewhatconstitutesthe

heartofthealgorithm.Thesetwoalgorithmscombinedformacompletelyreversible(lossless)transformationthat- withtypicalinput- skewsthefirstordersymboldistributionstomakethedatamorecompressiblewithsimplemethods.Intuitivelyspeaking,themethodtransformsslackinthehigherorderprobabilitiesoftheinputblock(thusmakingthemmoreeven,whiteningthem)toslackinthelowerorderstatistics.Thiseffectiswhatisseeninthehistogramoftheresultingsymboldata.

• Please,readthearticlebyMarkNelson:• DataCompressionwiththeBurrows-WheelerTransform MarkNelson,Dr.Dobb'sJournal

September,1996.http://marknelson.us/1996/09/01/bwt/

BWT

Page 20: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

20

DRD

1645*=>5240163023

DRDOBBS

0123456

1645023

Page 21: Lecture 135 Full Text new - ut · •Burrows Wheeler Transform (BWT) –Compact Suffix tree representation: BWT + succinct Problem •Given P and S –find all exact or approximate

12/29/17

21

CODE:t: hat acts like this:<13><10><1

t: hat buffer to the constructor

t: hat corrupted the heap, or wo

W: hat goes up must come down<13

t: hat happens, it isn't likely

w: hat if you want to dynamicall

t: hat indicates an error.<13><1

t: hat it removes arguments from

t: hat looks like this:<13><10><

t: hat looks something like this

t: hat looks something like this

t: hat once I detect the mangled

Example

• Decode:errktreteoe.e

• Hint:. Isthelastcharacter,alphabeticallyfirst…