Text Algorithms (6EAP)

Approximate Matching

Jaak Vilo, 2017 fall

MTAT.03.190 Text Algorithms, Jaak Vilo

Exact vs approximate search

• In exact search we searched for a string or a set of strings in a long text

• Then we learned how to measure the similarity between sequences

• There are plenty of applications that require approximate search

• Approximate matching, i.e. find those regions in a long text that are similar to the query string

• E.g. find the substrings of S that have edit distance < k to the query string m.

Problem

• Given P and S, find all approximate occurrences of P in S

• Reviews
  – P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381--402, 1980.
  – G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001. (Technical Report TR/DCC-99-5, Dept. of Computer Science, Univ. of Chile, 1999.)

• Algorithms
  – S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83--91, 1992.
  – G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395--415, 1999.
  – A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. In Proc. 11th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 794--803, 2000.

• Multiple approximate matching
  – R. Muth and U. Manber. Approximate multiple string search. In Proc. CPM'96, pages 75--86, 1996.
  – Kimmo Fredriksson, publications: http://www.cs.uku.fi/~fredriks/publications.html

• Applications
  – Udi Manber. A simple scheme to make passwords based on one-way functions much harder to crack. Computers and Security, 15(2):171--176, 1996. (TR 94-34)

• Tools
  – Webglimpse: glimpse, agrep
  – agrep for Win/DOS
  – Original agrep

• Links
  – Pattern Matching Pointers (Stefano Lonardi)
  – Articles

Problem statement

• Let S = s1 s2 ... sn ∈ Σ* be a text and P = p1 p2 ... pm the pattern. Let k be a given constant.

• Main problems
  – k mismatches: find from S all substrings X, |X| = |P|, that differ from P in at most k positions (Hamming distance)
  – k differences: find from S all substrings X where D(X, P) ≤ k (edit distance)
  – best match: find from S the substrings X for which D(X, P) is minimal

• The distance D can be defined using one of the ways from the previous chapters
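The k mismatches problem can be illustrated with a direct O(mn) scan over all windows of length m. This is only a minimal Python sketch: the function name `k_mismatch_search` and the 0-based start positions are choices made for this illustration.

```python
def k_mismatch_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (start, mismatches) for every window of len(pattern)
    in text whose Hamming distance to pattern is at most k."""
    m, n = len(pattern), len(text)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:   # give up on this window early
                    break
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


print(k_mismatch_search("aste", "lasteaed", 0))
```

With k = 0 the scan degenerates to exact search; larger k only changes the point at which a window is abandoned.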

Measure edit distance

Find approximate occurrences

Algorithm for approximate search, k edit operations

Input: P, S, k
Output: Approximate occurrences of P in S (with edit distance ≤ k)

for j = 0 to m do h[j,0] = j          // Initialize first column
for i = 1 to n do
    h[0,i] = 0                        // an occurrence may start anywhere in S
    for j = 1 to m do
        h[j,i] = min( h[j-1,i-1] + (if p[j] == s[i] then 0 else 1),
                      h[j-1,i] + 1,
                      h[j,i-1] + 1 )
    if h[m,i] ≤ k
        Report match at i
        Trace back and report the minimizing path (from-to)
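The recurrence above can be sketched in Python as follows. The sketch keeps only one column of the matrix (so it reports end positions and distances but does no traceback); the name `approx_search` and the 1-based end positions are assumptions of this illustration.

```python
def approx_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_position, distance) for every 1-based text position i
    where some substring ending at i has edit distance <= k to pattern."""
    m = len(pattern)
    column = list(range(m + 1))          # h[j,0] = j
    hits = []
    for i, ch in enumerate(text, start=1):
        diag = column[0]                 # h[0,i-1]
        column[0] = 0                    # h[0,i] = 0: a match may start anywhere
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            new = min(diag + cost,       # h[j-1,i-1] + substitution/match
                      column[j] + 1,     # h[j,i-1] + 1
                      column[j - 1] + 1) # h[j-1,i] + 1
            diag = column[j]
            column[j] = new
        if column[m] <= k:
            hits.append((i, column[m]))
    return hits


print(approx_search("rada", "abracadabra", 1))
```

For P = rada and S = abracadabra with k = 1 this reports end positions 6 and 8, each with distance 1.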

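Example

Text S = abracadabra, pattern P = rada. Row 0 is initialized to 0 and column 0 to 0, 1, 2, 3, 4; the first text column (a) then becomes 0, 1, 1, 2, 3.

[Figure: partially filled dynamic programming matrix for this example]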

• Theorem: Let's assume that in the matrix h the path that leads to the value h[m,j] in the last row starts from the square h[0,r]. Then the edit distance D(P, s_{r+1} s_{r+2} ... s_j) = h[m,j], and h[m,j] is the minimal such distance for any substring starting before the j'th position: h[m,j] = min{ D(P, s_t s_{t+1} ... s_j) | t ≤ j }

• Proof by induction
• Every minimizing path starts from some value in row 0
• Since it is possible to reach the same result via multiple paths, the approximate match is not always unique

• Time and space complexity O(mn)
• As n can be large, it is sufficient to keep only the last m+k columns, which can hold the full optimal path.
  – Space complexity O(m^2)
• Or, one can keep just the single last column and, in case of a match, recalculate the exact path.
  – Space complexity O(m)
• If there is no need to find the path, O(m) space suffices

• The diagonal lemma still holds
• If one needs to find only the regions with at most k edit operations, then one can restrict the depth of the calculations
• It suffices to compute until the k-border
• The modified algorithm (home assignment) works in average time O(kn)
• There are better methods which work in O(kn) in the worst case.
  – Landau & Vishkin (1988), Chang & Lampe (1991).

Improved average case

E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.

// Preprocessing
for j = 0..m do C[j] = j
lact = k+1                            // last active row
// Searching
for i = 1..n
    pC = 0; nC = 0                    // previous and new column value
    for j = 1..lact
        if S[i] == P[j] then nC = pC  // why?
        else
            if pC < nC then nC = pC
            if C[j] < nC then nC = C[j]
            nC = nC + 1
        pC = C[j]
        C[j] = nC
    while C[lact] > k do lact = lact - 1
    if lact = m then report match at position i
    else lact = lact + 1

Ukkonen 1985; average time O(kn)
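The cut-off idea can be tried out with a direct Python transcription of the pseudocode. This is a sketch under the assumption k < m (otherwise `lact` would start beyond the last row); the function name `ukkonen_cutoff` and the 1-based end positions are choices of this illustration.

```python
def ukkonen_cutoff(pattern: str, text: str, k: int) -> list[int]:
    """1-based end positions of substrings with edit distance <= k to pattern.
    Only rows up to the last active row (value <= k) are recomputed."""
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("this sketch assumes 0 <= k < len(pattern)")
    C = list(range(m + 1))      # C[j] = j: initial column
    lact = k + 1                # last active row
    ends = []
    for i, ch in enumerate(text, start=1):
        pC = nC = 0             # diagonal value and value just computed above
        for j in range(1, lact + 1):
            if pattern[j - 1] == ch:
                nC = pC         # a match costs nothing: take the diagonal
            else:
                nC = min(pC, nC, C[j]) + 1
            pC = C[j]
            C[j] = nC
        while C[lact] > k:      # shrink the active region
            lact -= 1
        if lact == m:
            ends.append(i)
        else:
            lact += 1           # the region can grow by at most one row
    return ends


print(ukkonen_cutoff("rada", "abracadabra", 1))
```

Rows below `lact` keep stale values, but those are always greater than k, so they never win the `min` against a diagonal value that is at most k.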

Four Russians technique

• This is a general technique that can be applied in different contexts
• It improves the speed of matrix multiplications
• Has been used for regular expression and approximate matching
• Let the column vector d[*,j] = (d[0,j], ..., d[m,j]) represent the current state
• Let's preprocess the automaton from each state
• F(X, a) = Y, s.t. the column vector X after reading character a becomes the column vector Y.
• Example: Let's find the approximate matches of P = abc when at most 1 operation is allowed.

Four Russians technique

• There are 13 different possibilities (each column below is one possible column vector):

  d[0,j]   0 0 0 0 0 0 0 0 0 0 0 0 0
  d[1,j]   0 0 0 0 0 1 1 1 1 1 1 1 1
  d[2,j]   0 0 1 1 1 0 0 1 1 1 2 2 2
  d[3,j]   0 1 0 1 2 0 1 0 1 2 1 2 3

• From each state compute the possible next states for all characters a, b, c, and x (x not in P)
• The states with d[m,j] ≤ 1 are final states.
• This can become too large to handle.
• Cut the regions into smaller pieces, use that to reduce the complexity.
• Navarro and Raffinot, Flexible Pattern Matching in Strings (Cambridge University Press, 2002), p. 152, Fig. 6.5.
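The count of 13 can be checked by enumerating every column vector that starts with 0 and whose neighbouring entries differ by at most 1, and the function F(X, a) can then be tabulated by one step of the edit distance recurrence. A minimal Python sketch; the names `all_columns`, `next_column` and `build_table` are hypothetical helpers for this illustration.

```python
from itertools import product


def all_columns(m: int) -> list[tuple[int, ...]]:
    """All vectors (d0, ..., dm) with d0 = 0, entries >= 0 and
    neighbouring entries differing by at most 1."""
    columns = []
    for deltas in product((-1, 0, 1), repeat=m):
        col = [0]
        for d in deltas:
            col.append(col[-1] + d)
        if min(col) >= 0:
            columns.append(tuple(col))
    return columns


def next_column(col: tuple[int, ...], pattern: str, ch: str) -> tuple[int, ...]:
    """F(X, a): the column that follows col after reading character ch."""
    new = [0]
    for j in range(1, len(col)):
        cost = 0 if pattern[j - 1] == ch else 1
        new.append(min(col[j - 1] + cost, col[j] + 1, new[j - 1] + 1))
    return tuple(new)


def build_table(pattern: str, alphabet: str) -> dict:
    """Precompute F for every possible column and every character."""
    return {(col, ch): next_column(col, pattern, ch)
            for col in all_columns(len(pattern)) for ch in alphabet}


table = build_table("abc", "abcx")   # 'x' stands for any character not in P
print(len(all_columns(3)), len(table))
```

Searching then becomes a table lookup per text character, starting from the column (0, 1, 2, 3) and reporting whenever the last entry of the current column is at most k.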

Four-Russians version

NFA/DFA

• Create an automaton for matching a word approximately

• Allow 0, 1, ..., n errors

Regular expressions

Filtering techniques

• q-gram (also k-mer, oligomer)
  – a (sub)string of length q
• Let's have a pattern P of length m
• Assume the pattern P is rather long and k is small; find the occurrences with at most k mismatches

• How long substrings of P must have an exact match?

• If the mismatches are spread most evenly, they cut P into k+1 pieces of length ~m/(k+1)

K mismatches

• K = 3

• For a match with 3 mismatches, at least one substring of P of length (m-3)/4 must occur exactly.

Filtering techniques with q-grams

• If P occurs in S with k mismatches, then S must contain exactly at least one substring of P whose length is at least ⌈(m-k)/(k+1)⌉ (the k mismatches split the m-k matching positions into at most k+1 runs)

• Filter for all possible q-mers, where q is carefully selected.
  – Be careful with overlapping and non-overlapping q-grams.
  – If non-overlapping, then how long exact matches can we find?

• Use multiple exact matching O(n) (or sublinear) algorithms
• When an exact match of such a substring is found, there is a possibility for an approximate overall match.
• Check for the actual match

Filter and verify!
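Filter and verify can be sketched for the k mismatches case as follows. The sketch assumes the simplest partition, k+1 non-overlapping pieces of P (the last piece takes the remainder), uses Python's built-in `str.find` as the exact matcher instead of a dedicated multi-pattern algorithm, and the name `filter_and_verify` is hypothetical.

```python
def filter_and_verify(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (start, mismatches) for all windows with at most k mismatches.
    Filter: with k mismatches, at least one of k+1 disjoint pieces of the
    pattern must occur exactly at its own offset. Verify: count mismatches."""
    m, n = len(pattern), len(text)
    piece_len = m // (k + 1)
    if piece_len == 0:
        raise ValueError("pattern too short for k+1 non-empty pieces")
    candidates = set()
    for p in range(k + 1):
        offset = p * piece_len
        end = offset + piece_len if p < k else m
        piece = pattern[offset:end]
        pos = text.find(piece)
        while pos != -1:
            start = pos - offset          # where the whole pattern would begin
            if 0 <= start <= n - m:
                candidates.add(start)
            pos = text.find(piece, pos + 1)
    hits = []
    for start in sorted(candidates):      # verification step
        mismatches = sum(a != b for a, b in zip(pattern, text[start:start + m]))
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


print(filter_and_verify("abcdefgh", "xxabcdefghxxabXdeYghxxaXcXeXgXxx", 2))
```

Only the candidate windows are verified, so the cost of verification depends on how often the pieces occur by chance, which is why the filter works best for long patterns and small k.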


Filtering techniques cont.

• Lots of research on approximate matching using q-gram techniques

• The wheel has been reinvented many times in different fields

Indexing using q-grams

• Filtering can also be used for indexing. E.g. index all q-grams and their matches in S.
• If one searches for P, first search for its q-grams in the index. If a sufficient number of matches is found, then make the comparison to see if the match is real.
• Filtering should be efficient for cases where a high-similarity match for a long pattern is looked for.
• This is like a reverse index for texts:

  word     doc_id:word_id doc_id:pos_id
  word1    1:5 7:9 167:987 ...
  word2    2:5 3:67 8:106 7:3 ...
  word3    3:5 5:67 7:10 16:3 ...

• Q: where do word1 and word3 occur together?
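A q-gram index and the "sufficient number of matches" test can be sketched in a few lines of Python. The names `build_qgram_index` and `candidate_starts`, and the voting threshold `min_hits`, are assumptions of this illustration; the candidates still have to be verified against the text.

```python
from collections import defaultdict


def build_qgram_index(text: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def candidate_starts(index, pattern: str, q: int, min_hits: int) -> list[int]:
    """Start positions supported by at least min_hits q-grams of the pattern."""
    votes = defaultdict(int)
    for offset in range(len(pattern) - q + 1):
        for pos in index.get(pattern[offset:offset + q], ()):
            votes[pos - offset] += 1      # each hit votes for an alignment
    return sorted(s for s, v in votes.items() if v >= min_hits and s >= 0)


index = build_qgram_index("abracadabra", 2)
print(candidate_starts(index, "abra", 2, 3))
```

Lowering `min_hits` admits alignments in which some q-grams are destroyed by errors, at the price of more candidates to verify.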

Bit-parallel search

• Can we use bit-parallelism for approximate search?

• T = lasteaed, P = aste

      l a s t e a e d
  a   0 1 0 0 0 1
  s   0 0 1 0 0 0
  t   0 0 0 1 0 0
  e   0 0 0 0 1 0
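The bit matrix above is what an exact bit-parallel matcher maintains column by column: bit j of the state is 1 when the first j+1 pattern characters match the text ending at the current position. A minimal Shift-And sketch in Python (the function name `shift_and` and the 0-based end positions are choices of this illustration):

```python
def shift_and(pattern: str, text: str) -> list[int]:
    """0-based end positions of exact occurrences of pattern in text."""
    m = len(pattern)
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)   # bit j set where pattern[j] == ch
    accept = 1 << (m - 1)
    state = 0
    ends = []
    for i, ch in enumerate(text):
        # extend every matched prefix by one character, and start a new one
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(i)
    return ends


print(shift_and("aste", "lasteaed"))
```

Python integers are unbounded, so the sketch does not depend on the machine word size; in a language with fixed-width integers the pattern length is limited by the word size.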

Generalized patterns

• A generalized pattern P = p1 p2 ... pm consists of generalized characters pi such that each pi represents a non-empty subset of the alphabet Σ:
  – pi = a, a ∈ Σ
  – pi = #, "wildcard" (any number of any symbols)
  – pi = [group]; e.g. [abc], [^abc], [a-h], ...
  – pi = ¬C; characters from the set Σ - C.
• Example: [Tt][aeiou][kpt]#[^aeiou][mnr] matches Tekstialgoritm but not the word tekstuur.
• Problem: search for generalized patterns in a text
• Compare to the SHIFT-OR algorithm!

P = a[b-h]a¬a    // agrep a[b-h]a[^a]

           p  a  g  a  n  a  m  a  a
a       1  1  0  1  0  1
[b-h]   2  2  1  0  1  1
a       3  3  2  1  0  1
¬a      4  3  3  2  1  0

(the first value in each row is the initial column, before any text is read)

Zero in the last row: exact match!

• What about mismatches?
• A mismatch occurs if a character does not belong to the class defined by the pattern. Unit cost 1.

• SHIFT-ADD: similar to SHIFT-OR, but instead of OR an ADD is used.
• (no insertions or deletions in this example)

P = a[kpt]a¬a    // agrep a[kpt]a[^a]

[Figure: mismatch-count matrix for P = a[kpt]a¬a against the text paganamaa; row 0 is all zeros]

1 at the last position: match with 1 mismatch!

• Each value of the matrix d[i,j] can be represented with b bits (4 bits allow values up to 15). Columns can be simple integers.

• B_j = d[m,j]·2^(b(m-1)) + d[m-1,j]·2^(b(m-2)) + ... + d[1,j]. (d[0,j] is always 0 and can be omitted)

• To get the next column, shift B_j by b bits and add another integer that has 0 at position i if the next character (at the j'th position) belongs to the set represented by p_i, and 1 otherwise.

  010 001 000 001 011
+ 001 001 000 000 001
---------------------
= 011 010 000 001 100

• One needs to be very careful not to have overflow (111 + 001 = 1000).

• Shift by 3 positions == multiply by 8

  010 001 000 001 011 * 8 = 001 000 001 011 000
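The shift-and-add step can be sketched in Python with one b-bit mismatch counter per pattern position packed into a single integer. Several things here are assumptions of the sketch rather than part of the slides: a character class is represented as a hypothetical `(chars, negated)` pair, b is chosen as `m.bit_length()` so that a counter can hold values up to m without overflowing into its neighbour, and the name `shift_add` is made up for the illustration.

```python
def shift_add(classes: list[tuple[str, bool]], text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_position, mismatches), 0-based, for every window of
    len(classes) characters with at most k class mismatches."""
    m = len(classes)
    b = m.bit_length()                 # bits per counter: m < 2**b, so no overflow
    field = (1 << b) - 1
    full = (1 << (b * m)) - 1
    cache = {}

    def mismatch_mask(ch: str) -> int:
        """Integer with 1 in field i if ch does NOT belong to class i."""
        if ch not in cache:
            mask = 0
            for i, (chars, negated) in enumerate(classes):
                matches = (ch in chars) != negated
                if not matches:
                    mask |= 1 << (b * i)
            cache[ch] = mask
        return cache[ch]

    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # shift by one field (multiply by 2**b), then add the mismatch bits
        state = ((state << b) & full) + mismatch_mask(ch)
        if pos >= m - 1:               # a full window has been read
            mismatches = (state >> (b * (m - 1))) & field
            if mismatches <= k:
                hits.append((pos, mismatches))
    return hits


# P = a[kpt]a¬a on the text paganamaa, at most one mismatch
pattern = [("a", False), ("kpt", False), ("a", False), ("a", True)]
print(shift_add(pattern, "paganamaa", 1))
```

Field i of the new state is the old field i-1 plus the mismatch bit for position i, which is exactly the diagonal dependency of the mismatch-count matrix.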

Use multiple vectors, one for each k value

• One can also use several individual 1-bit vectors, each corresponding to a different k

• Can be extended to mask out regions where mismatches are NOT allowed

• Can introduce wildcards of arbitrary length

Bit-parallelism

• Maintain a list of possible "states"

• Update the lists using bit-level operations

Example (note: pattern position 1 is the rightmost, least significant bit in this output)

Pattern = AC#T<GA>[TG]A, length 7, # = .*

CV[char]
A 65       11111111111111111111111110101110
C 67       11111111111111111111111111111101
G 71       11111111111111111111111111010111
T 84       11111111111111111111111111011011
WILDCARD   11111111111111111111111111111101
ENDMASK    00000000000000000000000001000000
NO_ERROR   00000000000000000000000000011000
position                            7654321

• 0: the position is "active"
• R[0]: vector for (so far) 0 mismatches
• R[1]: vector for (so far) 1 mismatch
• R[2]: vector for (so far) 2 mismatches

• "Minimum" by bitwise AND
  – If (even) one of the vectors has 0, then bitwise AND produces 0 (which is the smaller of 0 and 1, 1 and 0, 0 and 0)
  – If both (or all) of the vectors have 1, then bitwise AND produces 1 (which is the smaller of 1 and 1)

• How to get the new values from the old ones
  – P[0] P[1] ... => R[0] R[1] ... (P[i] are the previous vectors, R[i] the new ones)
  – R[i] is the min of three possibilities:

    (P[i] shift 1) bitor CV[textchar]      // previously active, now match with character
    (P[i] bitor WILDCARD)                  // wildcard match: the same position remains active
    (P[i-1] shift 1 bitor NO_ERROR)        // previously 1 error less (unless no error is allowed at this position)

The algorithm

• R[i] in general is the minimum of 3 possibilities:

    (P[i] shift 1) bitor CV[textchar] &    // match
    (P[i] bitor WILDCARD) &                // wildcard
    (P[i-1] shift 1 bitor NO_ERROR)        // mismatch

• The last line adds one mismatch unless errors are not allowed

diktorantuur

BPR (p = p1 p2 ... pm, T = t1 t2 ... tn, k)
Preprocessing
    for c ∈ Σ Do B[c] <- 0^m
    for j ∈ 1 ... m Do B[pj] <- B[pj] | 0^(m-j) 1 0^(j-1)
Searching
    for i ∈ 0 ... k Do Ri <- 0^(m-i) 1^i
    for pos ∈ 1 ... n Do
        oldR <- R0
        newR <- ((oldR << 1) | 1) & B[tpos]
        R0 <- newR
        for i ∈ 1 ... k Do
            // the final | 1 keeps the first pattern position reachable with one error;
            // without it a substituted first character (e.g. "xb" for P = ab, k = 1) is missed
            newR <- ((Ri << 1) & B[tpos]) | oldR | ((oldR | newR) << 1) | 1
            oldR <- Ri, Ri <- newR
        End of for
        If newR & 1 0^(m-1) <> 0 Then report an occurrence at pos
    End of for

using System;

public static class ApproximateSearch
{
    public static void BPR(string pattern, string text, int errors)
    {
        // One bit mask per UTF-16 code unit; C# zero-initializes the array.
        int[] B = new int[char.MaxValue + 1];
        // Initialize all character positions
        for (int i = 0; i < pattern.Length; i++)
        {
            B[pattern[i]] |= 1 << i;
        }
        // Initialize NFA states: row i starts with its i lowest bits set
        int[] states = new int[errors + 1];
        for (int i = 0; i <= errors; i++)
        {
            states[i] = (i == 0) ? 0 : (1 << (i - 1) | states[i - 1]);
        }
        int oldR, newR;
        // Patterns longer than 32 characters do not fit into an int.
        int exitCriteria = 1 << (pattern.Length - 1);

        for (int i = 0; i < text.Length; i++)
        {
            oldR = states[0];
            newR = ((oldR << 1) | 1) & B[text[i]];
            states[0] = newR;

            for (int j = 1; j <= errors; j++)
            {
                // The final | 1 keeps the first pattern position reachable with one error.
                newR = ((states[j] << 1) & B[text[i]]) | oldR | ((oldR | newR) << 1) | 1;
                oldR = states[j];
                states[j] = newR;
            }

            if ((newR & exitCriteria) != 0)
                Console.WriteLine("Occurrence at position {0}", i + 1);
        }
    }
}
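For experimenting with the row-wise bit-parallel search, the same update can be written in Python. This is a sketch, not a reference implementation: the name `bpr_search` is made up, occurrences are returned as 1-based end positions instead of being printed, and every row i ≥ 1 is OR-ed with 1 on the assumption that the first pattern position must always be reachable with one error (a substituted or deleted first character).

```python
def bpr_search(pattern: str, text: str, k: int) -> list[int]:
    """1-based end positions of substrings within edit distance k of pattern,
    simulating the k+1 rows of the approximate-matching NFA with bit vectors."""
    m = len(pattern)
    full = (1 << m) - 1
    B = {}
    for j, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << j)            # bit j set where pattern[j] == ch
    R = [((1 << i) - 1) & full for i in range(k + 1)]  # row i: i lowest bits set
    accept = 1 << (m - 1)
    ends = []
    for pos, ch in enumerate(text, start=1):
        mask = B.get(ch, 0)
        old = R[0]
        new = ((old << 1) | 1) & mask              # row 0: exact matching only
        R[0] = new
        for i in range(1, k + 1):
            new = (((R[i] << 1) & mask)            # match
                   | old                           # insertion (old row i-1)
                   | ((old | new) << 1)            # substitution / deletion
                   | 1) & full                     # first position always reachable
            old, R[i] = R[i], new
        if new & accept:
            ends.append(pos)
    return ends


print(bpr_search("rada", "abracadabra", 1))
```

The result can be cross-checked against the plain dynamic programming matrix: a position is reported exactly when the last-row value there is at most k.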

agrep

• S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83--91, 1992.

• Insertions, deletions
• Wildcards
• Non-uniform costs for substitution, insertion, deletion
• Find best match
• Mask regions for no errors
• Record oriented, not line oriented

Agrep examples (from man agrep)

• agrep -2 -c ABCDEFG foo
  gives the number of lines in file foo that contain ABCDEFG within two errors.

• agrep -1 -D2 -S2 'ABCD#YZ' foo
  outputs the lines containing ABCD followed, within arbitrary distance, by YZ, with up to one additional insertion (-D2 and -S2 make deletions and substitutions too "expensive").

• agrep -5 -p abcdefghij /usr/dict/words
  outputs the list of all words containing at least 5 of the first 10 letters of the alphabet in order. (Try it: any list starting with academia and ending with sacrilegious must mean something!)

• agrep -1 'abc[0-9](de|fg)*[x-z]' foo
  outputs the lines containing, within up to one error, the string that starts with abc followed by one digit, followed by zero or more repetitions of either de or fg, followed by either x, y, or z.

• agrep -d '^From ' 'breakdown;internet' mbox
  outputs all mail messages (the pattern '^From ' separates mail messages in a mail file) that contain the keywords 'breakdown' and 'internet'.

• agrep -d '$$' -1 '<word1> <word2>' foo
  finds all paragraphs that contain word1 followed by word2 with one error in place of the blank. In particular, if word1 is the last word in a line and word2 is the first word in the next line, then the space will be substituted by a newline symbol and it will match. Thus, this is a way to overcome separation by a newline. Note that -d '$$' (or another delimiter which spans more than one line) is necessary, because otherwise agrep searches only one line at a time.

• agrep '^agrep'
  outputs all the examples of the use of agrep in this man page.

• Gene Myers: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM), Volume 46, Issue 3 (May 1999). http://doi.acm.org/10.1145/316542.316550

• Abstract
  – The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences.
  – Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep.
  – These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice), and σ is the size of the pattern alphabet.
  – Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem.
  – Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
  – Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. (1996).
  – This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large.
  – In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup.
  – This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.

• Writing an overview, implementing the algorithm and creating a useful tool could be a big topic for a BSc or MSc thesis.
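As a starting point for such an implementation, the basic bit-vector step can be sketched in Python. The sketch follows a commonly used formulation of the column update (vertical differences in `vp`/`vn`, horizontal differences in `hp`/`hn`) for patterns that fit into one bit vector; it is an illustration under these assumptions rather than the paper's own code, and the name `myers_search` is hypothetical.

```python
def myers_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_position, distance), 1-based, wherever the last row of the
    edit distance matrix is <= k. The column is stored as bit vectors of
    +1 (vp) and -1 (vn) vertical differences."""
    m = len(pattern)
    full = (1 << m) - 1
    top = 1 << (m - 1)
    peq = {}
    for j, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << j)
    vp, vn, score = full, 0, m        # initial column 0, 1, ..., m
    hits = []
    for pos, ch in enumerate(text, start=1):
        eq = peq.get(ch, 0)
        d0 = ((((eq & vp) + vp) ^ vp) | eq | vn) & full   # diagonal difference is 0
        hp = (vn | ~(d0 | vp)) & full                     # horizontal +1
        hn = d0 & vp                                      # horizontal -1
        if hp & top:
            score += 1
        elif hn & top:
            score -= 1
        hp = (hp << 1) & full         # row 0 stays 0, so a 0 bit is shifted in
        hn = (hn << 1) & full
        vp = (hn | ~(d0 | hp)) & full
        vn = hp & d0
        if score <= k:
            hits.append((pos, score))
    return hits


print(myers_search("rada", "abracadabra", 1))
```

The work per text character is a constant number of word operations regardless of k, which is the property the abstract emphasizes.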

Multiple approximate string matching

• How to find simultaneously the approximate matches for a set of words, e.g. a dictionary.

• Or a set of regular expressions, generalized patterns, etc.

• One can build automata for sets of words, and then match the automata approximately.

• Filtering approaches: if close enough, test
• Not many (good) methods have been proposed

• Superimpose NFA automata
• Filter on all (necessary) factors