Text Algorithms (6EAP) - Arvutiteaduse Instituut
Exact vs approximate search
• In exact search we searched for a string or set of strings in a long text
• Then we learned how to measure the similarity between sequences
• There are plenty of applications that require approximate search
• Approximate matching, i.e. find those regions in a long text that are similar to the query string
• E.g. find the substrings of S that have edit distance < k to the query string m.
• Reviews
– P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381-402, 1980.
– G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001. (Technical Report TR/DCC-99-5, Dept. of Computer Science, Univ. of Chile, 1999.)
• Algorithms
– S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
– G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395-415, 1999.
– A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. In Proc. 11th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 794-803, 2000.
• Multiple approximate matching
– R. Muth and U. Manber. Approximate multiple string search. In Proc. CPM'96, pages 75-86, 1996.
– Kimmo Fredriksson - publications http://www.cs.uku.fi/~fredriks/publications.html
• Applications
– Udi Manber. A simple scheme to make passwords based on one-way functions much harder to crack. Computers and Security, 15(2):171-176, 1996. TR 94-34.
• Tools
– Webglimpse - glimpse, agrep
– agrep for Win/DOS
– Original agrep
• Links
– Pattern Matching Pointers (Stefano Lonardi)
– Articles
Problem statement
• Let S = s_1 s_2 ... s_n ∈ Σ* be a text and P = p_1 p_2 ... p_m the pattern. Let k be a pregiven constant.
• Main problems
• k mismatches
– Find from S all substrings X, |X| = |P|, that differ from P at max k positions (Hamming distance)
• k differences
– Find from S all substrings X, where D(X, P) ≤ k (Edit distance)
• best match
– Find from S such substrings X, that D(X, P) is minimal
• Distance D can be defined using one of the ways from previous chapters
Algorithm for approximate search, k edit operations
Input: P, S, k
Output: Approximate occurrences of P in S (with edit distance ≤ k)
for j = 0 to m do h_{j,0} = j    // Initialize first column
for i = 1 to n do
    h_{0,i} = 0
    for j = 1 to m do
        h_{j,i} = min( h_{j-1,i-1} + (if p_j == s_i then 0 else 1),
                       h_{j-1,i} + 1,
                       h_{j,i-1} + 1 )
    if h_{m,i} ≤ k
        Report match at i
        Trace back and report the minimizing path (from-to)
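The recurrence can be tried out directly. The following is a minimal Python sketch of it, keeping the full (m+1) × (n+1) matrix and skipping the traceback; the function name `approx_search` and the choice to return (end position, distance) pairs are assumptions made for illustration only.

```python
def approx_search(pattern, text, k):
    """Return (end_position, distance) pairs; end positions are 1-based."""
    m, n = len(pattern), len(text)
    # h[j][i]: smallest edit distance between pattern[:j] and a substring
    # of text that ends at position i (row 0 stays 0: a match may start anywhere)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        h[j][0] = j  # first column: j pattern characters against empty text
    matches = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == text[i - 1] else 1
            h[j][i] = min(h[j - 1][i - 1] + cost,  # match or substitution
                          h[j - 1][i] + 1,         # pattern character unmatched
                          h[j][i - 1] + 1)         # text character unmatched
        if h[m][i] <= k:
            matches.append((i, h[m][i]))
    return matches


print(approx_search("abc", "xabdxabc", 1))
```

Neighbouring end positions are often reported together, because a match ending at i usually also leaves a match with one more error ending at i-1 or i+1.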
• Theorem: Let's assume that in the matrix h_{ij} the path that leads to the value h_{m,j} in the last row starts from square h_{0,r}. Then the edit distance D(P, s_{r+1} s_{r+2} ... s_j) = h_{m,j}, and h_{m,j} is the minimal such distance for any substring starting before the j'th position, h_{m,j} = min{ D(P, s_t s_{t+1} ... s_j) | t ≤ j }
• Proof by induction
• Every minimizing path starts from some value in row 0
• Since it is possible to reach the same result via multiple paths, the approximate match is not always unique
• Time and space complexity O(mn)
• As n can be large, it is sufficient to keep only the last m+k columns, which can fully fit the full optimal path.
• Space complexity O(m^2)
• Or, one can keep just the single last column and, in case of a match, recalculate the exact path.
• Space complexity O(m)
• If there is no need to find the path, O(m) space is enough as well
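A single-column variant is sketched below in Python. Instead of recalculating the path after a match, it carries along with every cell the row-0 column r where that cell's minimizing path starts, which is one possible way (an assumption of this sketch, not something stated on the slides) to get the "from" position in O(m) space. The name `approx_search_linear_space` is hypothetical.

```python
def approx_search_linear_space(pattern, text, k):
    """Return (start, end, distance) triples; text[start:end] is the matched substring."""
    m = len(pattern)
    col = list(range(m + 1))  # current column h[0..m]
    start = [0] * (m + 1)     # row-0 column where each cell's best path begins
    matches = []
    for i, ch in enumerate(text, start=1):
        diag_val, diag_start = col[0], start[0]  # h[j-1][i-1] for j = 1
        col[0], start[0] = 0, i                  # a new path may start at column i
        for j in range(1, m + 1):
            left_val, left_start = col[j], start[j]  # h[j][i-1]
            best_val = diag_val + (pattern[j - 1] != ch)
            best_start = diag_start
            if left_val + 1 < best_val:        # text character left unmatched
                best_val, best_start = left_val + 1, left_start
            if col[j - 1] + 1 < best_val:      # pattern character left unmatched
                best_val, best_start = col[j - 1] + 1, start[j - 1]
            diag_val, diag_start = left_val, left_start
            col[j], start[j] = best_val, best_start
        if col[m] <= k:
            matches.append((start[m], i, col[m]))
    return matches


for s, e, d in approx_search_linear_space("abc", "xabdxabc", 1):
    print(s, e, d, "xabdxabc"[s:e])
```

When several paths give the same minimum, the sketch reports only one of them (ties are resolved in favour of the diagonal), in line with the remark that the approximate match is not always unique.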
• Diagonal lemma will hold
• If one needs to find only the regions with at most k edit operations, then one can restrict the depth of the calculations
• It suffices to compute until the k-border
• Modified algorithm (home assignment) will work in average time O(kn)
• There are better methods which work in O(kn) in the worst case.
• Landau & Vishkin (1988), Chang & Lampe (1991).
Improved average case
E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.
// Preprocessing
for j = 0..m do C[j] = j
lact = k+1    // last active row
// Searching
for i = 1..n
    pC = 0; nC = 0    // previous and new column value
    for j = 1..lact
        if S[i] == P[j] then nC = pC    // why?
        else
            if pC < nC then nC = pC
            if C[j] < nC then nC = C[j]
            nC = nC + 1
        pC = C[j]
        C[j] = nC
    while C[lact] > k do lact = lact - 1
    if lact = m then report match at position i
    else lact = lact + 1
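The cut-off idea can be sketched in Python as follows. The variable names C, pC, nC and lact follow the pseudocode; the function name `cutoff_search` and the clamp `min(k + 1, m)`, which keeps the sketch safe when k ≥ m, are assumptions added here.

```python
def cutoff_search(pattern, text, k):
    """Return 1-based end positions of matches with edit distance <= k."""
    m = len(pattern)
    C = list(range(m + 1))  # current column; only rows 0..lact are kept up to date
    lact = min(k + 1, m)    # last active row
    ends = []
    for pos, ch in enumerate(text, start=1):
        pC = nC = 0         # diagonal value and newly computed value
        for j in range(1, lact + 1):
            if pattern[j - 1] == ch:
                nC = pC     # a match can take the diagonal value unchanged
            else:
                nC = min(pC, nC, C[j]) + 1
            pC = C[j]       # old C[j] is the diagonal for row j + 1
            C[j] = nC
        while C[lact] > k:  # shrink the active region from below
            lact -= 1
        if lact == m:
            ends.append(pos)
        else:
            lact += 1       # the region can grow by at most one row per column
    return ends


print(cutoff_search("abc", "xabdxabc", 1))
```

Rows below lact are never touched, so the work per text character is proportional to the size of the active region rather than to m.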
Four Russians technique
• This is a general technique that can be applied in different contexts
• It improves the speed of matrix multiplications
• Has been used for regular expression and approximate matching
• Let the column vector d_{*j} = (d_{0j}, ..., d_{mj}) represent the current state
• Let's preprocess the automaton from each state
• F(X, a) = Y, s.t. column vector X after reading character a becomes column vector Y.
• Example: Let's find approximate matches of P = abc when at most 1 operation is allowed.
Four Russians technique cont.
• There are 13 different possibilities (each column below is one state vector):
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 1 1 1 0 0 1 1 1 2 2 2
0 1 0 1 2 0 1 0 1 2 1 2 3
• From each state compute possible next states for all characters a, b, c, and x (x not in P)
• The states with d_{mj} ≤ 1 are final states.
• This can become too large to handle.
• Cut the regions into smaller pieces, use that to reduce the complexity.
• Navarro and Raffinot, Flexible Pattern Matching in Strings (Cambridge University Press, 2002), p. 152, Fig. 6.5.
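The transition function F(X, a) for the P = abc example can be tabulated with a few lines of Python. This is only a sketch of the preprocessing step: it explores the column vectors reachable from the initial column (0, 1, 2, 3) and does not cap values or cut the column into smaller blocks. The names `next_column` and `build_automaton` are hypothetical.

```python
def next_column(column, ch, pattern):
    """F(X, a): the DP column that follows `column` after reading character `ch`."""
    new = [0]  # row 0 is always 0: a match may start anywhere
    for j in range(1, len(pattern) + 1):
        cost = 0 if pattern[j - 1] == ch else 1
        new.append(min(column[j - 1] + cost, column[j] + 1, new[j - 1] + 1))
    return tuple(new)


def build_automaton(pattern, alphabet):
    """Tabulate F for every column vector reachable from the initial column."""
    start = tuple(range(len(pattern) + 1))
    table, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in table:
            continue
        table[state] = {a: next_column(state, a, pattern) for a in alphabet}
        todo.extend(table[state].values())
    return start, table


start, table = build_automaton("abc", "abcx")  # x stands for any character not in P
final_states = {s for s in table if s[-1] <= 1}
print(len(table), "states reachable,", len(final_states), "of them final")
```

Once the table exists, searching is one dictionary lookup per text character; the cost has moved into the preprocessing, which is why the state set has to stay small.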
Filtering techniques
• q-gram (also k-mer, oligomer)
• (sub)string of length q
• Let's have a pattern P of length m
• Assume pattern P is rather long and k is small; find occurrences with at most k mismatches
• How long substrings of P must have an exact match?
• If the mismatches are spread most evenly, we get exactly matching pieces of length about m/k
Filtering techniques with q-grams
• If P matches with k mismatches, then S must contain an exact copy of at least one substring of P whose length is at least ⌈(m-k)/(k+1)⌉, because k mismatches split the m-k matching positions into at most k+1 runs
• Filter for all possible q-mers where q is carefully selected.
– Be careful with overlapping and non-overlapping q-grams.
– If non-overlapping, then how long exact matches can we find?
• Use multiple exact matching O(n) (or sublinear) algorithms
• When an exact match of such a substring is found, there is a possibility for an approximate overall match.
• Check for the actual match
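A minimal Python sketch of this filter for the k-mismatch case: the pattern is cut into k+1 non-overlapping pieces, each piece is searched exactly, and every hit is verified with the Hamming distance. `str.find` stands in for the multiple exact matching algorithm; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def filter_k_mismatches(pattern, text, k):
    """Return 0-based start positions where pattern matches with <= k mismatches."""
    m = len(pattern)
    if m < k + 1:
        raise ValueError("pattern too short to split into k+1 non-empty pieces")
    # With at most k mismatches, at least one of the k+1 pieces must match exactly.
    bounds = [j * m // (k + 1) for j in range(k + 2)]
    candidates = set()
    for lo, hi in zip(bounds, bounds[1:]):
        piece = pattern[lo:hi]
        hit = text.find(piece)
        while hit != -1:
            start = hit - lo  # where the whole pattern would have to begin
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            hit = text.find(piece, hit + 1)
    # Verification step: check every candidate for an actual match.
    return sorted(s for s in candidates if hamming(pattern, text[s:s + m]) <= k)


print(filter_k_mismatches("abcdef", "xxabcxefxxabcdefxx", 1))
```

The filter pays off when the pieces are long enough to be rare in the text, i.e. when k/m is small; otherwise most of the time goes into verification.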
Filtering techniques cont.
• Lots of research on approximate matching using q-gram techniques
• The wheel has been reinvented many times in different fields
Indexing using q-grams
• Filtering can also be used for indexing. E.g. index all q-grams and their matches in S.
• If one searches for P, first search for its q-grams in the index. If a sufficient number of matches is found, then make the comparison to see if the match is real.
• Filtering should be efficient for cases where a high-similarity match for a long pattern is looked for.
• This is like the reverse index for texts:
word     doc_id:word_id   doc_id:pos_id
word1    1:5  7:9  167:987 ...
word2    2:5  3:67  8:10  67:3 ...
word3    3:5  5:67  7:10  16:3 ...
...
• Q: where do word1 and word3 occur together?
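A q-gram index with a simple voting filter might look like the Python sketch below. Every q-gram of the pattern that is found in the index votes for the text position where the pattern would have to start; positions with enough votes become candidates for verification. The names `build_qgram_index` and `candidate_starts` and the `min_hits` threshold are assumptions for illustration.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of the text to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(text) - q + 1):
        index[text[pos:pos + q]].append(pos)
    return index


def candidate_starts(index, pattern, q, min_hits):
    """Start positions supported by at least min_hits q-grams of the pattern."""
    votes = defaultdict(int)
    for offset in range(len(pattern) - q + 1):
        for pos in index.get(pattern[offset:offset + q], ()):
            votes[pos - offset] += 1  # pattern would start here
    return sorted(s for s, v in votes.items() if v >= min_hits and s >= 0)


index = build_qgram_index("xxabcxefxxabcdefxx", 2)
print(candidate_starts(index, "abcdef", 2, 3))
```

This voting only aligns q-grams on the same diagonal, so it fits the mismatch model; with insertions and deletions the votes of one occurrence spread over neighbouring start positions and would have to be pooled.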
Generalized patterns
• A generalized pattern P = p_1 p_2 ... p_m consists of generalized characters p_i such that each p_i represents a non-empty subset of alphabet Σ*;
• p_i = a, a ∈ Σ
• p_i = #, "wildcard" (any number of any symbols)
• p_i = [group]; e.g.: [abc], [^abc], [a-h], ...
• p_i = ¬C; characters from the set Σ - C.
• Example: [Tt][aeiou][kpt]#[^aeiou][mnr] matches Tekstialgoritm but not the word tekstuur.
• Problem: Search for generalized patterns from text
• Compare to the SHIFT-OR algorithm!
P = a[b-h]a¬a    // agrep a[b-h]a[^a]
           p  a  g  a  n  a  m  a  a
a       1  1  0  1  0  1
[b-h]   2  2  1  0  1  1
a       3  3  2  1  0  1
¬a      4  3  3  2  1  0
zero at last row – exact match!
• What about mismatches?
• Mismatch if the character does not belong to the class defined by the pattern. Unit cost 1.
• SHIFT-ADD - similar to SHIFT-OR, but instead of OR an ADD is used (no insertions or deletions in this example).
P = a[kpt]a¬a    // agrep a[kpt]a[^a]
           p  a  g  a  n  a  m  a  a
        0  0  0  0  0  0  0  0  0  0
a       1  1  0  1  0  1
[kpt]   2  2  1  1  2  1
a       3  3  2  2  1  3
¬a      4  3  3  2  2  1
1 at last pos - match with 1 mismatch!
• Each value of the matrix d_{ij} can be represented with b bits (4 bits allow values up to 15). Columns can be simple integers.
• B_j = d_{m,j} 2^{b(m-1)} + d_{m-1,j} 2^{b(m-2)} + ... + d_{1,j} (d_{0,j} is always 0 and can be omitted)
• Updating a column means adding another integer that has 0 at position i if the next character, at the j'th position of the text, belongs to the set represented by p_i, and 1 otherwise.
  010 001 000 001 011
+ 001 001 000 000 001
---------------------
= 011 010 000 001 100
• One needs to be very careful not to have overflow (111 + 001 = 1000).
• Shift by 3 positions == multiply by 8
010 001 000 001 011 * 8 = 001 000 001 011 000
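The shift-and-add update can be sketched in Python for generalized patterns under the mismatch-only model. Each pattern position gets a b-bit counter inside one integer; b is chosen so that 2^b > m, which sidesteps the overflow problem because a counter can never exceed m. Representing a character class as a (characters, negated) pair and the name `shift_add_search` are assumptions of this sketch.

```python
def shift_add_search(classes, text, k):
    """classes: one (chars, negated) pair per pattern position.
    Returns 0-based end positions of windows with at most k mismatches."""
    m = len(classes)
    b = m.bit_length()         # 2**b > m, so a b-bit counter cannot overflow
    field = (1 << b) - 1
    mask = (1 << (b * m)) - 1  # keep exactly m counters
    cache = {}

    def mismatch_vector(ch):
        # Counter j gets 1 if ch does NOT belong to the class at position j.
        if ch not in cache:
            vec = 0
            for j, (chars, negated) in enumerate(classes):
                if (ch in chars) == negated:
                    vec |= 1 << (b * j)
            cache[ch] = vec
        return cache[ch]

    state, ends = 0, []
    for i, ch in enumerate(text):
        # Shift by one counter (multiply by 2**b), then add the mismatch vector.
        state = ((state << b) + mismatch_vector(ch)) & mask
        if i >= m - 1 and (state >> (b * (m - 1))) & field <= k:
            ends.append(i)
    return ends


pattern = [("a", False), ("kpt", False), ("a", False), ("a", True)]  # a[kpt]a¬a
print(shift_add_search(pattern, "paganamaa", 1))
```

In a fixed-width machine word one would instead cap each counter at k+1 or track an overflow bit per field, which needs fewer bits per position than the conservative choice made here.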
Use multiple vectors, one for each k value
• One can also use several individual 1-bit vectors, each corresponding to a different k
• Can be extended to mask out regions where mismatches are NOT allowed
• Can introduce wildcards of arbitrary length
Example (note: least significant bit is left in this output)
Pattern = AC#T<GA>[TG]A, length 7, # = .*
CV[char]
A 65       11111111111111111111111110101110
C 67       11111111111111111111111111111101
G 71       11111111111111111111111111010111
T 84       11111111111111111111111111011011
WILDCARD   11111111111111111111111111111101
ENDMASK    00000000000000000000000001000000
NO_ERROR   00000000000000000000000000011000
                                    7654321
• 0 – position is “active”
• R[0] – vector for (so far) 0 mismatches
• R[1] – vector for (so far) 1 mismatch
• R[2] – vector for (so far) 2 mismatches
• “Minimum” by bitwise AND
• If (even) one of the vectors has 0, then bitwise AND produces 0 (which is the smaller of 0 and 1, 1 and 0, 0 and 0)
• If both (or all) of the vectors have 1, then bitwise AND produces 1 (which is the smaller of 1 and 1)
• How to get new values from old ones
• P[0] P[1] ... => R[0] R[1] ...
– R[i] is the min of three possibilities:
(P[i] shift 1) bitor CV[textchar]    // previously active, now match with character
(P[i] bitor WILDCARD)    // wildcard match – the same position remains active
(P[i-1] shift 1 bitor NO_ERROR)    // previously 1 error less (unless NO_ERROR forbids an error at this position)
The algorithm
• R[i] in general is the minimum of 3 possibilities:
(P[i] shift 1) bitor CV[textchar] &    // match
(P[i] bitor WILDCARD) &    // wildcard
(P[i-1] shift 1 bitor NO_ERROR)    // mismatch
The last term adds one mismatch unless errors are not allowed at that position
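A stripped-down Python sketch of this update, for plain patterns and mismatches only: it keeps one 0-active bit vector per error level and combines the "match" and "mismatch" terms with bitwise AND. The WILDCARD and NO_ERROR masks are left out to keep the sketch short, and the function name `k_mismatch_search` is hypothetical.

```python
def k_mismatch_search(pattern, text, k):
    """Return 0-based end positions where pattern matches with <= k mismatches."""
    m = len(pattern)
    full = (1 << m) - 1
    cv = {}  # cv[c] has a 0 bit at every pattern position that holds c
    for j, ch in enumerate(pattern):
        cv[ch] = cv.get(ch, full) & ~(1 << j)
    R = [full] * (k + 1)  # R[i]: 0 bit = prefix matched with at most i mismatches
    ends = []
    for pos, ch in enumerate(text):
        mask = cv.get(ch, full)
        prev = R[0]       # old value of the previous error level (P[i-1])
        R[0] = ((R[0] << 1) | mask) & full
        for i in range(1, k + 1):
            cur = R[i]
            # "minimum" of: extend with a match, or extend level i-1 with a mismatch
            R[i] = ((cur << 1) | mask) & (prev << 1) & full
            prev = cur
        if (R[k] & (1 << (m - 1))) == 0:
            ends.append(pos)
    return ends


print(k_mismatch_search("abc", "xabdxabc", 1))
```

Because 0 means active, shifting left brings in a 0 at the lowest bit, which is exactly the "a match may start here" state.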
diktorantuur
BPR (p = p_1 p_2 ... p_m, T = t_1 t_2 ... t_n, k)
Preprocessing
    for c ∈ Σ Do B[c] <- 0^m
    for j ∈ 1 ... m Do B[p_j] <- B[p_j] | 0^{m-j} 1 0^{j-1}
Searching
    for i ∈ 0 ... k Do R_i <- 0^{m-i} 1^i
    for pos ∈ 1 ... n Do
        oldR <- R_0
        newR <- ((oldR << 1) | 1) & B[t_pos]
        R_0 <- newR
        for i ∈ 1 ... k Do
            newR <- ((R_i << 1) & B[t_pos]) | oldR | ((oldR | newR) << 1)
            oldR <- R_i, R_i <- newR
        End of for
        If newR & 1 0^{m-1} <> 0 Then report an occurrence at pos
    End of for
// Requires "using System;" and an enclosing class.
// The pattern must fit into one int, i.e. at most 31 characters.
public static void BPR(string pattern, string text, int errors)
{
    // One bit mask per UTF-16 code unit; a new int[] is already zero-initialized.
    int[] B = new int[char.MaxValue + 1];
    // Bit i of B[c] is set if pattern[i] == c
    for (int i = 0; i < pattern.Length; i++)
    {
        B[pattern[i]] |= 1 << i;
    }
    // Initialize NFA states: row i starts with its i lowest bits set
    int[] states = new int[errors + 1];
    for (int i = 0; i <= errors; i++)
    {
        states[i] = (i == 0) ? 0 : (1 << (i - 1) | states[i - 1]);
    }
    int oldR, newR;
    int exitCriteria = 1 << (pattern.Length - 1);
    for (int i = 0; i < text.Length; i++)
    {
        oldR = states[0];
        newR = ((oldR << 1) | 1) & B[text[i]];
        states[0] = newR;
        for (int j = 1; j <= errors; j++)
        {
            newR = ((states[j] << 1) & B[text[i]]) | oldR | ((oldR | newR) << 1);
            oldR = states[j];
            states[j] = newR;
        }
        if ((newR & exitCriteria) != 0)
            Console.WriteLine("Occurrence at position {0}", i + 1);
    }
}
agrep
• S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
• Insertions, deletions
• Wildcards
• Non-uniform costs for substitution, insertion, deletion
• Find best match
• Mask regions for no errors
• Record oriented, not line oriented
Agrep examples (from man agrep)
• agrep -2 -c ABCDEFG foo
gives the number of lines in file foo that contain ABCDEFG within two errors.
• agrep -1 -D2 -S2 'ABCD#YZ' foo
outputs the lines containing ABCD followed, within arbitrary distance, by YZ, with up to one additional insertion (-D2 and -S2 make deletions and substitutions too "expensive").
• agrep -5 -p abcdefghij /usr/dict/words
outputs the list of all words containing at least 5 of the first 10 letters of the alphabet in order. (Try it: any list starting with academia and ending with sacrilegious must mean something!)
• agrep -1 'abc[0-9](de|fg)*[x-z]' foo
outputs the lines containing, within up to one error, the string that starts with abc followed by one digit, followed by zero or more repetitions of either de or fg, followed by either x, y, or z.
• agrep -d '^From ' 'breakdown;internet' mbox
outputs all mail messages (the pattern '^From ' separates mail messages in a mail file) that contain keywords 'breakdown' and 'internet'.
• agrep -d '$$' -1 '<word1> <word2>' foo
finds all paragraphs that contain word1 followed by word2 with one error in place of the blank. In particular, if word1 is the last word in a line and word2 is the first word in the next line, then the space will be substituted by a newline symbol and it will match. Thus, this is a way to overcome separation by a newline. Note that -d '$$' (or another delim which spans more than one line) is necessary, because otherwise agrep searches only one line at a time.
• agrep '^agrep'
outputs all the examples of the use of agrep in this man pages.
• Gene Myers: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM), Volume 46, Issue 3 (May 1999). http://doi.acm.org/10.1145/316542.316550
• Abstract
• The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences.
• Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep.
• These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time where w is the word size of the machine (e.g., 32 or 64 in practice), and σ is the size of the pattern alphabet.
• Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem.
• Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
• Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. (1996).
• This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large.
• In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup.
• This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
• Writing an overview, implementing the algorithm and creating a useful tool could be a big topic for a BSc or MSc thesis.
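As a starting point for such an implementation, here is a Python sketch of the basic bit-vector step for the case where the pattern fits into one word. It follows the commonly used reformulation of Myers' algorithm with vertical (pv, mv) and horizontal (ph, mh) delta vectors; the variable names and the function name `myers_search` are assumptions, and Python's unbounded integers are masked to m bits to imitate a machine word.

```python
def myers_search(pattern, text, k):
    """Return 1-based end positions of matches with edit distance <= k."""
    m = len(pattern)
    if m == 0:
        return []
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    peq = {}  # peq[c]: bit j set if pattern[j] == c
    for j, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << j)
    pv, mv, score = mask, 0, m  # first column is 0, 1, ..., m: all vertical deltas +1
    ends = []
    for pos, ch in enumerate(text, start=1):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask  # horizontal delta +1
        mh = pv & xh                   # horizontal delta -1
        if ph & top:
            score += 1                 # bottom-row value grew by one
        elif mh & top:
            score -= 1                 # bottom-row value shrank by one
        ph = (ph << 1) & mask          # row 0 is all zeros, so shift in a 0
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            ends.append(pos)
    return ends


print(myers_search("abc", "xabdxabc", 1))
```

The work per text character is a constant number of word operations regardless of k, which is the property the abstract emphasizes.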
Multiple approximate string matching
• How to find simultaneously the approximate matches for a set of words, e.g. a dictionary.
• Or a set of regular expressions, generalized patterns, etc.
• One can build automata for sets of words, and then match the automata approximately.
• Filtering approaches – if close enough, test
• Not many (good) methods have been proposed