Fast Submatch Extraction using OBDDs
Liu Yang (1), Pratyusa Manadhata (2), William Horne (2), Prasad Rao (2), Vinod Ganapathy (1)
(1) Rutgers University
(2) HP Laboratories
Applications of Regular Expressions
[Figure: NIDS, with labels Signatures, Network traffic, and Alerts]
Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures.
Applications of Regular Expressions (cont.)
[Figure: SIEM, with labels Connectors (rule set), Web security compliance, and Email security compliance]
Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems.
Submatch Extraction
Rule set: …username=(.*), hostname=(.*) …
Input: username=Bob, hostname=Foo
Submatch extraction: $1 = Bob, $2 = Foo
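The example above can be reproduced with any regex engine that supports capturing groups. The following is a minimal sketch using Python's standard `re` module; the surrounding "…" context of the rule is omitted, and the helper name `extract` is hypothetical.

```python
import re

# Rule from the slide, without the elided "…" context (assumption).
RULE = re.compile(r"username=(.*), hostname=(.*)")

def extract(line):
    """Return the captured substrings ($1, $2), or None if the rule does not match."""
    m = RULE.search(line)
    return m.groups() if m else None

print(extract("username=Bob, hostname=Foo"))  # ('Bob', 'Foo')
```

Python's `re` is itself a backtracking engine, so this only illustrates what submatch extraction returns, not how Submatch-OBDD computes it.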
Signature Matching
• Non-deterministic finite automata (NFAs)
  – Space efficient, time inefficient
• Deterministic finite automata (DFAs)
  – Time efficient, state blow-up
• Recursive backtracking
  – Fast in general
  – Vulnerable to algorithmic complexity attacks
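To make the algorithmic complexity risk concrete, here is a toy sketch of a recursive backtracking matcher, hard-coded for the hypothetical pattern `^(a|aa)*b$`. It is not how PCRE is implemented; it only counts recursive calls to show how the work grows when the input is a long run of `a` with no trailing `b`.

```python
def count_backtracking_steps(text):
    """Match text against ^(a|aa)*b$ by naive backtracking; return (matched, calls)."""
    steps = 0

    def match_from(pos):
        nonlocal steps
        steps += 1
        # The loop (a|aa)* may stop here: the rest must then be exactly "b".
        if text[pos:] == "b":
            return True
        # Otherwise try each alternative of the loop body and recurse.
        for alt in ("a", "aa"):
            if text.startswith(alt, pos) and match_from(pos + len(alt)):
                return True
        return False

    return match_from(0), steps

for n in (5, 10, 15, 20):
    print(n, count_backtracking_steps("a" * n))
```

On non-matching inputs of this shape the call count grows roughly like the Fibonacci numbers, which is the kind of behaviour an attacker can trigger with crafted input.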
Motivation: Time/Space Tradeoff
[Figure: space vs. time plot comparing DFA (deterministic finite automaton), NFA (non-deterministic finite automaton), backtracking, the ideal point, and our approach]
Our Contributions
• A novel way of annotating capturing groups: tagged NFAs
• Design of a novel technique for submatch extraction (called Submatch-OBDD)
  – Extending Thompson’s algorithm
  – Using Boolean functions to represent tagged NFAs
  – Using ordered binary decision diagrams (OBDDs) to improve time efficiency
• Evaluation and comparison with RE2 and PCRE
Note: RE2 is a hybrid approach, using a mix of DFA/NFA, while PCRE uses recursive backtracking.
Solution Overview
RegExps with capturing groups → Tagged NFAs → Boolean representations → OBDD representations
NFA Representation of RegExps
E = a*aa
[Figure: NFA of regexp “a*aa”]
Transition table T(x,i,y):
Current state (x)   Input symbol (i)   Next state (y)
1                   a                  1
1                   a                  2
2                   a                  3
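A transition table of this form can be simulated directly by tracking the set of active states (the frontier). The sketch below assumes state 1 is the only start state and state 3 the only accept state, as the later slides suggest; the function names are illustrative.

```python
# Transition table T(x, i, y) for the NFA of a*aa.
TRANSITIONS = {(1, "a", 1), (1, "a", 2), (2, "a", 3)}
START, ACCEPT = {1}, {3}  # assumed start and accept states

def step(frontier, symbol):
    """Return all states reachable from the frontier on one input symbol."""
    return {y for (x, i, y) in TRANSITIONS if x in frontier and i == symbol}

def matches(text):
    """Run the NFA over the whole input and test whether an accept state is active."""
    frontier = set(START)
    for symbol in text:
        frontier = step(frontier, symbol)
    return bool(frontier & ACCEPT)

print(matches("aaaa"), matches("a"))  # True False
```

Each input symbol costs a scan over the table, which is where the time inefficiency of NFA simulation comes from.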
Submatch Tagging: Tagged NFAs
E = (a*)aa
Tag(E) = (a*)_{t1} aa
[Figure: tagged NFA of “(a*)aa” with submatch tag t1; the self-loop on state 1 is labeled a/t1]
Extended transition table T(x,i,y,t) of the tagged NFA:
Current state (x)   Input symbol (i)   Next state (y)   Output tags (t)
1                   a                  1                {t1}
1                   a                  2                {}
2                   a                  3                {}
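Match Test
RegExp = (a*)aa; Input: aaaa
[Figure: tagged NFA with states 1, 2, 3 processing the input; the run ends in accept]
Input symbol:          a       a        a        a
Frontier:      {1}   {1,2}  {1,2,3}  {1,2,3}  {1,2,3}
Output tags:          {t1}    {t1}     {t1}     {t1}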
Submatch Extraction
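[Figure: tagged NFA with states 1, 2, 3 processing the input; the run ends in accept]
Input symbol:          a       a        a        a
Frontier:      {1}   {1,2}  {1,2,3}  {1,2,3}  {1,2,3}
Output tags:          {t1}    {t1}     {t1}     {t1}
Any path from an accept state to a start state generates a valid assignment of submatches.
$1=aa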
Complexity of Tagged NFAs
Match test: $O(nl)$
Submatch extraction: $O(nl)$
n – size of tagged NFA; l – length of input string
Can we make the operations faster?
Submatch-OBDD
• Representing tagged NFAs using Boolean functions
  – Updating frontiers in one step using a single Boolean formula
• Using OBDDs to manipulate Boolean functions
Transitions as Boolean Functions
RegExp: (a*)aa
Current state (x)   Input symbol (i)   Next state (y)   Output tag (t)
1                   a                  1                {t1}
1                   a                  2                {}
2                   a                  3                {}
T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
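One way to read T(x,i,y,t) as a Boolean function is to encode each column as a bit vector, so that every table row becomes one conjunction of literals (a minterm). The sketch below assumes 2 bits per state, 8 bits per input symbol, and a single bit for the tag (1 for t1, 0 for {}); these widths and the helper names are illustrative choices, not taken from the slides.

```python
from itertools import product

STATE_BITS, SYMBOL_BITS = 2, 8  # assumed encoding widths

def bits(value, width):
    """Big-endian bit tuple of a non-negative integer."""
    return tuple((value >> k) & 1 for k in reversed(range(width)))

# Rows (x, i, y, t) of the tagged NFA for (a*)aa; t = 1 means {t1}, t = 0 means {}.
ROWS = [(1, "a", 1, 1), (1, "a", 2, 0), (2, "a", 3, 0)]

def minterm(x, i, y, t):
    """Bit-level assignment that encodes one table row."""
    return bits(x, STATE_BITS) + bits(ord(i), SYMBOL_BITS) + bits(y, STATE_BITS) + (t,)

MINTERMS = {minterm(*row) for row in ROWS}

def T(assignment):
    """T(x, i, y, t) over 13 Boolean variables: true exactly on the encoded rows."""
    return tuple(assignment) in MINTERMS

# The function is a disjunction of three minterms, one per table row.
print(sum(T(a) for a in product((0, 1), repeat=13)))  # 3
```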
Match Test using Boolean Functions
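Input: aaaa; start states: {1}; transition table: T(x,i,y,t)
Symbol 1, current states {1}:
  {1} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{})   [intermediate transitions]
  Next states: {1,2}
Symbol 2, current states {1,2}:
  {1,2} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})   [intermediate transitions]
  Next states: {1,2,3}
…
Symbol 4, current states {1,2,3}:
  {1,2,3} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})   [intermediate transitions]
  Next states: {1,2,3}; Accept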
Submatch Extraction using Boolean Functions
Start from the last symbol, going backwards.
Symbol 4 (the last input symbol), accept state 3:
  a ∧ 3 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})) = 2∧a∧3∧{}   [using intermediate transitions [4]]
  Previous state of 3: 2. No output submatch tag.
  Rename previous state as current state and continue.
Symbol 3, current state 2:
  a ∧ 2 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})) = 1∧a∧2∧{}   [using intermediate transitions [3]]
  Previous state of 2: 1. No output submatch tag.
Submatch Extraction using Boolean Functions (cont.)
Symbol 2, current state 1:
  a ∧ 1 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})) = 1∧a∧1∧t1   [using intermediate transitions [2]]
  Previous state of 1: 1. Output submatch tag t1.
Symbol 1, current state 1:
  a ∧ 1 ∧ ((1∧a∧1∧t1) ∨ (1∧a∧2∧{})) = 1∧a∧1∧t1   [using intermediate transitions [1]]
  Previous state of 1: 1. Output submatch tag t1.
Tags over the input aaaa: t1 t1 on the first two symbols, so $1=aa
More Formal: Match Test
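Finding new frontiers after processing an input symbol:
$$\text{Next frontiers} = \mathit{Map}_{y \to x}\Big(\exists\, x,i,t.\ \big(\mathit{InputSymbol}(i) \wedge \mathit{Frontier}(x) \wedge \mathit{TransFunction}(x,i,y,t)\big)\Big)$$
Checking acceptance:
$$\mathit{SAT}\big(\mathit{AcceptStates}(x) \wedge \mathit{Frontier}(x)\big)$$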
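The frontier update and acceptance check on this slide use only three operations: conjunction, existential quantification, and variable renaming. The sketch below imitates them on explicit sets of satisfying assignments instead of OBDDs, so it shows the shape of the computation but none of the efficiency. A Boolean function is stored as a pair (variable names, set of rows); all helper names are hypothetical.

```python
def conj(a, b):
    """Conjunction of two functions: join rows that agree on shared variables."""
    (va, ra), (vb, rb) = a, b
    shared = [v for v in va if v in vb]
    extra = [v for v in vb if v not in va]
    rows = set()
    for ta in ra:
        da = dict(zip(va, ta))
        for tb in rb:
            db = dict(zip(vb, tb))
            if all(da[v] == db[v] for v in shared):
                rows.add(ta + tuple(db[v] for v in extra))
    return (tuple(va) + tuple(extra), rows)

def exists(f, drop):
    """Existentially quantify away the variables in `drop`."""
    vs, rows = f
    keep = [k for k, v in enumerate(vs) if v not in drop]
    return (tuple(vs[k] for k in keep), {tuple(r[k] for k in keep) for r in rows})

def rename(f, mapping):
    """Rename variables, e.g. y -> x after computing the next states."""
    vs, rows = f
    return (tuple(mapping.get(v, v) for v in vs), rows)

# Tagged NFA for (a*)aa; "" stands for the empty tag set {}.
TRANS = (("x", "i", "y", "t"), {(1, "a", 1, "t1"), (1, "a", 2, ""), (2, "a", 3, "")})
ACCEPT = (("x",), {(3,)})

def next_frontier(frontier, symbol):
    """Map_{y->x}( exists x,i,t . InputSymbol(i) and Frontier(x) and TransFunction )."""
    input_symbol = (("i",), {(symbol,)})
    image = exists(conj(conj(input_symbol, frontier), TRANS), {"x", "i", "t"})
    return rename(image, {"y": "x"})

def accepts(text):
    frontier = (("x",), {(1,)})  # start states
    for symbol in text:
        frontier = next_frontier(frontier, symbol)
    return bool(conj(ACCEPT, frontier)[1])  # SAT(AcceptStates(x) and Frontier(x))

print(accepts("aaaa"), accepts("a"))  # True False
```

With OBDDs, each of the three helpers would be a single library operation on diagrams rather than a loop over rows.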
More Formal: Submatch Extraction
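$$\mathit{OneReverseTransition} = \mathit{PickOne}\big(\mathit{CurrentState}(y) \wedge \mathit{InputSymbol}(i) \wedge \mathit{IntermTransitions}(x,i,y,t)\big)$$
$$\mathit{previousState} = \mathit{Map}_{x \to y}\big(\exists\, i,y,t.\ \mathit{OneReverseTransition}\big)$$
$$\mathit{OutputTag} = \exists\, x,i,y.\ \mathit{OneReverseTransition}$$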
Submatch extraction: the submatch for tag ti is the last consecutive sequence of input characters that are assigned ti.
A backward-traversal approach: start from the last input symbol.
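The backward traversal can be sketched on explicit sets as follows. This is an illustration under simplifying assumptions, not the OBDD-based implementation: each transition carries at most one tag (with "" standing for {}), PickOne is realised by taking the smallest candidate, and the function names are hypothetical.

```python
# Extended transition table T(x, i, y, t) for (a*)aa; "" stands for {}.
T = {(1, "a", 1, "t1"), (1, "a", 2, ""), (2, "a", 3, "")}
START, ACCEPT = {1}, {3}

def forward(text):
    """Forward pass: remember the intermediate transitions of every step."""
    frontier, interm = set(START), []
    for symbol in text:
        # Frontier(x) and InputSymbol(i) and T(x, i, y, t)
        step = {(x, i, y, t) for (x, i, y, t) in T if x in frontier and i == symbol}
        interm.append(step)
        frontier = {y for (_, _, y, _) in step}
    return frontier, interm

def backward_tags(text):
    """Walk from an accept state back to a start state, one tag per symbol."""
    frontier, interm = forward(text)
    accepting = frontier & ACCEPT
    if not accepting:
        return None
    current = min(accepting)
    tags = [""] * len(text)
    for k in range(len(text) - 1, -1, -1):
        # CurrentState(y) and IntermTransitions[k]; any candidate is valid.
        one = min(tr for tr in interm[k] if tr[2] == current)  # PickOne
        tags[k] = one[3]   # OutputTag
        current = one[0]   # previousState, renamed to current state
    return tags

def submatch(text, tag="t1"):
    """Last consecutive run of characters assigned `tag`."""
    tags = backward_tags(text)
    if tags is None:
        return None
    end = len(tags)
    while end > 0 and tags[end - 1] != tag:
        end -= 1
    begin = end
    while begin > 0 and tags[begin - 1] == tag:
        begin -= 1
    return text[begin:end]

print(backward_tags("aaaa"), submatch("aaaa"))  # ['t1', 't1', '', ''] aa
```

Because the traversal only follows transitions stored during the forward pass, every step has at least one candidate and the walk ends in a start state.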
Submatch-OBDD
• Representation of tagged NFAs, match test, and submatch extraction using OBDDs
• OBDD representations for
  – Transitions with submatch tags
  – Intermediate transitions
  – Submatch tags
  – Set of start states
  – Set of accept states
  – Set of frontiers
  – Input symbols
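To give a feel for what an OBDD representation of the transition relation looks like, here is a toy reduced ordered BDD in Python. It is not CUDD and does not mirror CUDD's API; the class, the 6-variable encoding (2 bits for x, 1 bit for i, 2 bits for y, 1 bit for t), and the variable order are all assumptions made for illustration.

```python
from itertools import product

class BDD:
    """Minimal reduced ordered BDD with shared (hash-consed) nodes."""

    def __init__(self, num_vars):
        self.num_vars = num_vars
        self.unique = {}  # (var, low, high) -> node id
        self.nodes = {}   # node id -> (var, low, high)

    def mk(self, var, low, high):
        if low == high:   # both branches agree: the test is redundant
            return low
        key = (var, low, high)
        if key not in self.unique:
            node = len(self.unique) + 2  # ids 0 and 1 are the terminals
            self.unique[key] = node
            self.nodes[node] = key
        return self.unique[key]

    def build(self, func, var=0, prefix=()):
        """Shannon expansion over all variables (exponential; tiny inputs only)."""
        if var == self.num_vars:
            return 1 if func(prefix) else 0
        low = self.build(func, var + 1, prefix + (0,))
        high = self.build(func, var + 1, prefix + (1,))
        return self.mk(var, low, high)

    def evaluate(self, node, assignment):
        while node > 1:
            var, low, high = self.nodes[node]
            node = high if assignment[var] else low
        return bool(node)

    def size(self, root):
        """Number of internal nodes reachable from root."""
        seen, stack = set(), [root]
        while stack:
            n = stack.pop()
            if n > 1 and n not in seen:
                seen.add(n)
                stack.extend(self.nodes[n][1:])
        return len(seen)

def bits(value, width):
    return tuple((value >> k) & 1 for k in reversed(range(width)))

# Rows (x, i, y, t) for (a*)aa; i = 1 encodes 'a', t = 1 encodes {t1}.
ROWS = [(1, 1, 1, 1), (1, 1, 2, 0), (2, 1, 3, 0)]
MINTERMS = {bits(x, 2) + (i,) + bits(y, 2) + (t,) for x, i, y, t in ROWS}

bdd = BDD(6)
root = bdd.build(lambda a: a in MINTERMS)
print(bdd.size(root), "internal nodes for a truth table of", 2 ** 6, "rows")
```

A production package additionally provides conjunction, quantification, and renaming directly on the diagrams, which is what makes one-step frontier updates practical.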
Implementation
Pipeline: RegExps → RE2TNFA → Tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH
PATTERNMATCH input: input strings / network traffic
PATTERNMATCH output: either “Matched at reg#” with submatches $1 = …, $2 = …, or “No match”
Toolchain in C++, interfacing with CUDD*
*CUDD is a package for manipulation of binary decision diagrams
Feasibility Study
• Data sets
  – Snort-2009
    • RegExps: 115 regexps with capturing groups from HTTP rules
    • Traces
      – 1.2GB department network traffic (average packet size 126 bytes)
      – 1.3GB Twitter traffic (average packet size 1202 bytes)
      – 1MB synthetic trace (average string length 311 bytes)
  – Snort-2012
    • RegExps: 403 regexps with capturing groups from HTTP rules
    • Traces
      – 1.2GB department network traffic (average packet size 126 bytes)
      – 1.3GB Twitter traffic (average packet size 1202 bytes)
      – 1MB synthetic trace (average string length 689 bytes)
  – Firewall-504
    • RegExps: 504 patterns from a commercial firewall F
    • Trace: 87MB of firewall logs (average line size 87 bytes)
Experimental Setup
• Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM
• Two configurations for pattern matching
  – Conf. S
    • Patterns compiled individually
    • Compiled patterns matched sequentially against input traces
  – Conf. C
    • Patterns combined with UNION and compiled
    • Combined pattern matched against input traces
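The difference between the two configurations can be sketched with Python's `re` module. The patterns and function names below are hypothetical, and `re` is a backtracking engine, so this only illustrates how the patterns are arranged, not the measured systems.

```python
import re

PATTERNS = [r"username=(\w+)", r"hostname=(\w+)"]  # hypothetical rule set

def match_sequential(patterns, line):
    """Conf. S: compile each pattern on its own and try them one after another."""
    compiled = [re.compile(p) for p in patterns]
    return [k for k, rx in enumerate(compiled) if rx.search(line)]

def match_combined(patterns, line):
    """Conf. C: join all patterns with UNION (|) and run one combined pattern."""
    combined = re.compile("|".join(f"(?P<p{k}>{p})" for k, p in enumerate(patterns)))
    m = combined.search(line)
    if m is None:
        return None
    # Report which original pattern produced the match.
    return next(k for k in range(len(patterns)) if m.group(f"p{k}") is not None)

print(match_sequential(PATTERNS, "username=Bob, hostname=Foo"))  # [0, 1]
print(match_combined(PATTERNS, "hostname=Foo"))                  # 1
```

In Conf. S the input is scanned once per pattern, while in Conf. C it is scanned once for the whole rule set, which is why the size of the combined automaton matters.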
Performance
Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set
Performance
Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set
Performance
Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set
Related Work
• NFA-OBDD [Yang et al., RAID’10; Chasaki and Wolf, ANCS’10]
• RE2 [Cox, code.google.com/p/re2]
• PCRE [www.pcre.org]
• TNFA [Laurikari et al., SPIRE’00]
• MDFA [Yu et al., ANCS’06]
• Hybrid FA [Becchi and Crowley, CoNEXT’07]
• XFA [Smith et al., Oakland’08]
• More: see paper for details
Conclusion
• A novel way of annotating capturing groups
• Submatch-OBDD: a novel technique for submatch extraction using OBDDs
• Feasibility study
  – Submatch-OBDD achieves ideal performance when patterns are combined
  – Faster than RE2 and PCRE when patterns are combined