Aho-Corasick String Matching
![Page 1: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/1.jpg)
Aho-Corasick String Matching
An Efficient String Matching
![Page 2: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/2.jpg)
Introduction
Locate all occurrences of any of a finite number of keywords in a string of text.
Consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass.
![Page 3: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/3.jpg)
Pattern Matching Machine(1)
Let $K = \{y_1, y_2, \ldots, y_k\}$ be a finite set of strings which we shall call keywords and let x be an arbitrary string which we shall call the text string.
The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.
![Page 4: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/4.jpg)
![Page 5: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/5.jpg)
Pattern Matching Machine(2)
Goto function g : maps a pair consisting of a state and an input symbol into a state or the message fail.
Failure function f : maps a state into a state, and is consulted whenever the goto function reports fail.
Output function output: associates a (possibly empty) set of keywords with every state.
![Page 6: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/6.jpg)
![Page 7: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/7.jpg)
Start state is state 0. Let s be the current state and a the current symbol of the input string x.
Operating cycle:
1. If $g(s, a) = s'$, the machine makes a goto transition: it enters state s' and the next symbol of x becomes the current input symbol.
2. If $g(s, a) = \mathit{fail}$, the machine makes a failure transition by consulting f. If $f(s) = s'$, the machine repeats the cycle with s' as the current state and a as the current input symbol.
![Page 8: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/8.jpg)
![Page 9: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/9.jpg)
Example
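Text: u s h e r s
State: 0 0 3 4 5 8 9 (with an intermediate failure transition through state 2 between states 5 and 8)
In state 4, since $g(4, e) = 5$, the machine enters state 5, finds that the keywords “she” and “he” end at position four of the text string, and emits $\mathit{output}(5)$.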
![Page 10: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/10.jpg)
Example Cont’d
In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.
Since $g(5, r) = \mathit{fail}$, M enters state $2 = f(5)$. Then since $g(2, r) = 8$, M enters state 8 and advances to the next input symbol.
No output is generated in this operating cycle.
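The operating cycle and the "ushers" trace can be reproduced with a short Python sketch. The tables below are copied by hand and assume the keyword set {he, she, his, hers}, which is consistent with the state numbers quoted on these slides but is not spelled out in the transcript. The names `GOTO`, `FAILURE`, `OUTPUT`, `g`, and `run` are illustrative, not taken from the slides.

```python
FAIL = None  # stands for the "fail" message of the goto function

# Hand-copied tables, assuming the keyword set {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, symbol):
    """Goto function; state 0 loops on every symbol that begins no keyword."""
    nxt = GOTO.get((state, symbol), FAIL)
    if nxt is FAIL and state == 0:
        return 0
    return nxt


def run(text):
    """Return (visited states, matches); a match is (1-based end position, keyword)."""
    state = 0
    visited = [state]
    matches = []
    for pos, symbol in enumerate(text, start=1):
        # Failure transitions: repeat the cycle on the same input symbol.
        while g(state, symbol) is FAIL:
            state = FAILURE[state]
            visited.append(state)
        # Goto transition: enter the new state and advance to the next symbol.
        state = g(state, symbol)
        visited.append(state)
        for keyword in sorted(OUTPUT.get(state, ())):
            matches.append((pos, keyword))
    return visited, matches


visited, matches = run("ushers")
print(visited)   # states entered, including the detour through state 2
print(matches)   # keywords found, with the position where each one ends
```

On the symbol r the inner loop runs once (state 5 to state 2) before the goto transition to state 8, which is the two-transition cycle described above.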
![Page 11: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/11.jpg)
Construction of the functions
The construction has two parts.
First: determine the states and the goto function.
Second: compute the failure function.
Computation of the output function begins in the first part and is completed in the second.
![Page 12: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/12.jpg)
Construction of Goto function
Construct a goto graph like the one on the next page.
Add new vertices and edges to the graph one keyword at a time, starting at the start state.
Add new edges only when necessary.
Add a loop from state 0 to state 0 on all input symbols that do not begin a keyword.
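A minimal sketch of this construction in Python, assuming states are numbered in the order they are created and that the keyword set is {he, she, his, hers} (an assumption; the transcript does not list the keywords). The function names `build_goto` and `g` are hypothetical. The loop at state 0 is not stored as edges here; it is applied at lookup time instead, which is a design choice of this sketch.

```python
def build_goto(keywords):
    """Build the goto graph (a trie) and the initial output sets."""
    goto = {}      # (state, symbol) -> state; a missing entry means "fail"
    output = {}    # state -> set of keywords that end at that state
    new_state = 0
    for keyword in keywords:
        state = 0  # every keyword is entered starting at the start state
        for symbol in keyword:
            if (state, symbol) not in goto:  # add an edge only when necessary
                new_state += 1
                goto[(state, symbol)] = new_state
            state = goto[(state, symbol)]
        output.setdefault(state, set()).add(keyword)
    return goto, output, new_state + 1


def g(goto, state, symbol):
    """Goto lookup with the loop at state 0; None stands for fail."""
    nxt = goto.get((state, symbol))
    return 0 if nxt is None and state == 0 else nxt


goto, output, n_states = build_goto(["he", "she", "his", "hers"])
print(n_states)
print(sorted(goto.items()))
```

At this point the output set of a state holds only the keyword that ends exactly there; the remaining outputs are added when the failure function is computed.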
![Page 13: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/13.jpg)
![Page 14: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/14.jpg)
![Page 15: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/15.jpg)
![Page 16: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/16.jpg)
Construction of Failure function
Depth of a state s: the length of the shortest path from the start state to s.
The states of depth d can be determined from the states of depth d-1.
Make $f(s) = 0$ for all states s of depth 1.
![Page 17: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/17.jpg)
Construction of Failure function Cont’d
To compute the failure function for the states of depth d, consider each state r of depth d-1:
1. If $g(r, a) = \mathit{fail}$ for all a, do nothing.
2. Otherwise, for each a such that $g(r, a) = s$, do the following:
   a. Set $\mathit{state} = f(r)$.
   b. Execute $\mathit{state} \leftarrow f(\mathit{state})$ zero or more times, until a value for state is obtained such that $g(\mathit{state}, a) \neq \mathit{fail}$.
   c. Set $f(s) = g(\mathit{state}, a)$.
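The depth-by-depth procedure above amounts to a breadth-first traversal of the goto graph. The following Python sketch is one way to write it, under the assumption that the goto graph is the hand-copied one for {he, she, his, hers}; `build_failure` and `g` are illustrative names. It also merges the output sets as each failure value is determined.

```python
from collections import deque

# Goto graph and initial outputs, copied by hand for the assumed
# keyword set {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
OUTPUT = {2: {"he"}, 5: {"she"}, 7: {"his"}, 9: {"hers"}}


def g(goto, state, symbol):
    """Goto lookup with the loop at state 0; None stands for fail."""
    nxt = goto.get((state, symbol))
    return 0 if nxt is None and state == 0 else nxt


def build_failure(goto, output):
    """Compute f breadth-first and return (failure, merged output)."""
    children = {}
    for (state, symbol), nxt in goto.items():
        children.setdefault(state, []).append((symbol, nxt))
    output = {s: set(kws) for s, kws in output.items()}  # work on a copy
    failure = {}
    queue = deque()
    for _, s in children.get(0, []):
        failure[s] = 0                     # f(s) = 0 for all states of depth 1
        queue.append(s)
    while queue:
        r = queue.popleft()
        for a, s in children.get(r, []):   # each a with g(r, a) = s
            queue.append(s)
            state = failure[r]             # step a
            while g(goto, state, a) is None:
                state = failure[state]     # step b
            failure[s] = g(goto, state, a) # step c
            inherited = output.get(failure[s])
            if inherited:                  # merge outputs of f(s) into s
                output.setdefault(s, set()).update(inherited)
    return failure, output


failure, merged_output = build_failure(GOTO, OUTPUT)
print(failure)
print(merged_output)
```

The inner `while` loop always terminates because state 0 never reports fail.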
![Page 18: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/18.jpg)
![Page 19: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/19.jpg)
About construction
When we determine $f(s) = s'$, we merge the outputs of state s with the outputs of state s'.
The failure function computed this way can make unnecessary failure transitions. In fact, if the keyword “his” were not present, then the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1.
To avoid such unnecessary transitions, we can use a deterministic finite automaton, which is discussed later.
![Page 20: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/20.jpg)
Time Complexity of Algorithms 1, 2, and 3
Algorithm 1 (the pattern matching machine) makes fewer than 2n state transitions in processing a text string of length n.
Algorithm 2 (construction of the goto function) requires time linearly proportional to the sum of the lengths of the keywords.
Algorithm 3 (construction of the failure function) can be implemented to run in time proportional to the sum of the lengths of the keywords.
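The bound of fewer than 2n transitions follows from a depth-counting argument, sketched here. The notation is introduced only for this sketch and does not appear on the slides: $d_i$ is the depth of the state after the i-th text symbol (with $d_0 = 0$), and $F_i$ is the number of failure transitions made while processing symbol i. Each failure transition lowers the depth by at least 1, the depth never drops below 0, and each goto transition raises the depth by at most 1.

```latex
\begin{align*}
F_i &\le d_{i-1}, & d_i &\le d_{i-1} - F_i + 1 \qquad (1 \le i \le n),\\
\sum_{i=1}^{n-1} F_i &\le \sum_{i=1}^{n-1} \bigl(d_{i-1} - d_i + 1\bigr)
  = d_0 - d_{n-1} + (n-1) = (n-1) - d_{n-1},\\
\sum_{i=1}^{n} F_i &\le (n-1) - d_{n-1} + F_n \le n - 1,\\
\text{total transitions} &= n + \sum_{i=1}^{n} F_i \le 2n - 1 < 2n.
\end{align*}
```

The n in the last line counts the goto transitions, one per text symbol.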
![Page 21: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/21.jpg)
Eliminating Failure Transitions
Use in Algorithm 1 a next move function $\delta$ such that $\delta(s, a)$ is a state for each state s and input symbol a.
By using the next move function $\delta$, we can dispense with all failure transitions, and make exactly one state transition per input character.
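One way to obtain such a next move function is to fill it in breadth-first from the goto and failure functions. The Python sketch below does this for hand-copied tables, again assuming the keyword set {he, she, his, hers}; the names `build_delta` and `search` are hypothetical. Symbols that occur in no keyword are not stored and default to state 0.

```python
from collections import deque

# Hand-copied tables for the assumed keyword set {he, she, his, hers}.
GOTO = {
    (0, "h"): 1, (0, "s"): 3, (1, "e"): 2, (1, "i"): 6, (2, "r"): 8,
    (3, "h"): 4, (4, "e"): 5, (6, "s"): 7, (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def build_delta(goto, failure):
    """Return delta as {state: {symbol: next state}} over the keyword alphabet."""
    alphabet = {symbol for (_, symbol) in goto}
    delta = {0: {a: goto.get((0, a), 0) for a in alphabet}}
    queue = deque(s for s in delta[0].values() if s != 0)
    while queue:
        r = queue.popleft()
        delta[r] = {}
        for a in alphabet:
            s = goto.get((r, a))
            if s is not None:
                queue.append(s)
                delta[r][a] = s                      # keep the goto transition
            else:
                delta[r][a] = delta[failure[r]][a]   # precompute the failure chain
    return delta


def search(text, delta, output):
    """Scan text with exactly one state transition per input character."""
    state = 0
    matches = []
    for pos, symbol in enumerate(text, start=1):
        state = delta[state].get(symbol, 0)
        for keyword in sorted(output.get(state, ())):
            matches.append((pos, keyword))
    return matches


delta = build_delta(GOTO, FAILURE)
print(delta[5]["r"])
print(search("ushers", delta, OUTPUT))
```

The lookup `delta[failure[r]]` is safe because the failure state of r has smaller depth than r, so the breadth-first order has already filled it in.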
![Page 22: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/22.jpg)
![Page 23: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/23.jpg)
![Page 24: Aho-Corasick String Matching](https://reader035.vdocument.in/reader035/viewer/2022081502/56815723550346895dc4c1d7/html5/thumbnails/24.jpg)
Conclusion
Attractive for large numbers of keywords, since all keywords can be matched simultaneously in one pass.
Using the next move function can reduce the number of state transitions by up to 50%, but it needs more memory.
In practice the saving is smaller, because the machine spends most of its time in state 0, from which there are no failure transitions.