Kleene Would Be Shocked: Redrawing the Link Between Theory and Modern Regex Engines

A Presentation by Ian Graham, Carnegie Mellon University, August 2, 2002

Page 1: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Kleene Would Be Shocked
Redrawing the Link Between Theory and Modern Regex Engines

A Presentation by Ian Graham

Carnegie Mellon University
August 2, 2002

Page 2: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

The March of Progress

1. Literal string search (exact substring)
2. Extended string search (character classes)
3. Regular expression matching
4. Approximate matching
5. “Extended” regular expression matching

Page 3: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Begin at the Beginning

The simplest case of a regular expression is a literal string search
  Literal: any symbol in the alphabet
  Literal string: a concatenation of literals
  Literal string search: the problem of finding all occurrences of one literal string within another literal string (find “cad” in “abracadabra”)
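As a baseline for the algorithms that follow, the problem statement above can be sketched as a naive scan. This is only an illustration: the function name `find_all` is an assumption, not something from the slides, and the scan makes no attempt at the O(m+n) behaviour discussed next.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return every start index at which pattern occurs in text (naive scan)."""
    m, n = len(pattern), len(text)
    # Try every alignment of the pattern against the text.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_all("cad", "abracadabra"))   # [4]
print(find_all("abra", "abracadabra"))  # [0, 7]
```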

Page 4: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Quick Review

Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM)
  Two classical literal string search algorithms
  About 25 years old
  Used to achieve O(m+n) search performance, where m is the length of the search pattern and n is the length of the text to be searched

Page 5: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Quick Review

KMP scans from left to right, shifting by aligning the longest prefix of the search pattern which matches a suffix of the text scanned

BM scans from right to left along a window that shifts from left to right by choosing the largest shift amount from multiple shift rules
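A minimal sketch of the KMP idea described above, assuming the usual textbook formulation with a precomputed failure table. The names `kmp_search` and `fail` are illustrative only.

```python
def kmp_search(pattern: str, text: str) -> list[int]:
    """Return all start indices of pattern in text using the KMP failure table."""
    m = len(pattern)
    if m == 0:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of pattern[:i + 1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back to the longest prefix that still
        # matches a suffix of the text scanned so far.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            hits.append(i - m + 1)
            k = fail[k - 1]
    return hits
```

The text pointer never moves backwards, which is where the O(m+n) bound comes from.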

Page 6: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Practical Developments

For any alphabet size, there is always an algorithm which achieves better experimental results than KMP or BM.

The Horspool algorithm (1980) simplifies BM, using only the bad character shift rule instead of calculating multiple shift amounts and using the best
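A rough sketch of Horspool's simplification, using only a bad-character table. For brevity the window is compared with a slice rather than an explicit right-to-left character loop, which is an implementation shortcut and not part of the algorithm as published. The name `horspool_search` is illustrative.

```python
def horspool_search(pattern: str, text: str) -> list[int]:
    """Return all start indices of pattern in text using Horspool's shift rule."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for each character in pattern[:-1], the distance
    # from its last occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the text character under the last
        # pattern position; characters absent from the table allow a full shift.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```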

Page 7: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Practical Developments

Horspool is O(m+n) in the average case (assuming equal probability of all alphabet characters), O(mn) in the worst case

BM is O(m+n) average, O(m+n) worst

Evaluating multiple shift rules for BM greatly increases its runtime constant

Horspool is much faster in practice, and is extremely hard for any algorithm to beat over large alphabets

Page 8: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Bit-Parallelism

Recent algorithms (1992~) create nondeterministic automata to keep track of each possible match along the length of the pattern

States of these NFAs are mapped to bits in a word, and transitions are simulated utilizing the parallelism of bitwise operations

Page 9: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Bit-Parallelism

Possible matches may be represented by “1”s, and proceed in parallel along the pattern until they reach the end, indicating a match
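One common way to realise this is the Shift-And scheme, sketched below. The slides do not name a specific algorithm, so treating Shift-And as the example is an assumption. Python integers are unbounded, so the word-size limit discussed on the following slides does not show up here.

```python
def shift_and_search(pattern: str, text: str) -> list[int]:
    """Return all start indices of pattern in text using a bit-parallel NFA."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set when pattern[i] == ch.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit of the final NFA state
    state = 0              # bit i set: pattern[:i + 1] matches ending here
    hits = []
    for j, ch in enumerate(text):
        # Advance every partial match by one position, start a new one,
        # then keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

Each text character costs one shift, one OR and one AND, regardless of how many partial matches are alive.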

Page 10: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Bit-Parallelism

Savings due to parallelism depend on the word size

Bit-parallel algorithms often only perform well for patterns of size near to or less than the word size

Page 11: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Bit-Parallelism

Most analysis assumes constant word size, either 32 or 64 bits

Savings under this assumption are constant, but result in extremely good performance for practical applications

Page 12: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

A Wrench

Let a “character class” be an item which matches a single character from a range or explicit list.

Examples
  [0-9] matches any digit
  [Aa] matches A or a
  [A-Za-z] matches any English letter

Page 13: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

A Wrench

Let an “extended string” be a literal string with the additional property that it may contain character classes in place of literals.

Examples:
  “abc[de]f” matches “abcdf” or “abcef”
  “[Aa][Nn][Ee][Uu][Rr][Ii][Ss][Mm]” matches “aneurism”, “ANEURISM”, “aNeUrIsM”…
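Bit-parallel matchers extend to character classes fairly naturally: a class simply sets the same pattern bit for several characters. The sketch below assumes a deliberately tiny class syntax (bracketed lists and `x-y` ranges only, no negation or escapes). The helper names `parse_extended` and `extended_search` are hypothetical.

```python
def parse_extended(pattern: str) -> list[set[str]]:
    """Split an extended string into one set of allowed characters per position."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            allowed = set()
            k = 0
            while k < len(body):
                if k + 2 < len(body) and body[k + 1] == "-":
                    # Range such as 0-9 or A-Z.
                    lo, hi = ord(body[k]), ord(body[k + 2])
                    allowed.update(chr(c) for c in range(lo, hi + 1))
                    k += 3
                else:
                    allowed.add(body[k])
                    k += 1
            positions.append(allowed)
            i = end + 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def extended_search(pattern: str, text: str) -> list[int]:
    """Return all start indices where the extended string matches (Shift-And)."""
    positions = parse_extended(pattern)
    m = len(positions)
    if m == 0:
        return []
    masks: dict[str, int] = {}
    for i, allowed in enumerate(positions):
        for ch in allowed:
            masks[ch] = masks.get(ch, 0) | (1 << i)
    accept, state, hits = 1 << (m - 1), 0, []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


print(extended_search("abc[de]f", "abcdf abcef abcxf"))  # [0, 6]
```

Only the preprocessing changes; the scanning loop is the same as for a literal string.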

Page 14: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

A Wrench

Moving from literal string searches to extended string searches confounds many algorithms

Horspool may be extended, but its performance suffers greatly

Boyer-Moore may also be extended, and performs better than other well-known extensions

Page 15: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Bit-Parallelism on Top?

A recent (Navarro and Raffinot, 1998) bit-parallel algorithm claims to be 10-40% faster than any known variant of BM

Appears to be the fastest algorithm given:
  moderate-sized alphabet (e.g. English)
  moderate pattern sizes (5-110 characters)

Page 16: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

What is a Regular Expression?

Says Stephen Kleene:
  “A notation to describe regular languages.”
  “A description of the behavior of a finite state machine.”
  “Regular.”

Page 17: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

A Familiar Definition

1. a for some a in the alphabet Σ
2. ε
3. the null language
4. R1 ∪ R2 (R1, R2 regular languages)
5. R1 ◦ R2 (R1, R2 regular languages)
6. R1* (R1 regular language)

Page 18: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Efficiently Matching Regular Expressions

Attempts to extend classical literal search algorithms to process regular expressions have largely been fruitless

Efficient algorithms involve clever ways of simulating an NFA equivalent of the regular expression
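To make "simulating an NFA" concrete, here is a plain, unoptimised sketch: a Thompson-style construction over the six cases of the familiar definition, followed by a set-of-states simulation. The tuple-based syntax tree (`("lit", "a")`, `("alt", ...)` and so on) and all function names are assumptions made for this example. The function tests whether the whole text is in the language rather than searching inside it.

```python
from itertools import count


def build_nfa(node, trans, eps, ids):
    """Add states for node to trans/eps and return its (start, accept) pair."""
    kind = node[0]
    s, a = next(ids), next(ids)
    if kind == "lit":                       # a single alphabet symbol
        trans.setdefault((s, node[1]), set()).add(a)
    elif kind == "eps":                     # the empty string
        eps.setdefault(s, set()).add(a)
    elif kind == "null":                    # the null language: no path to a
        pass
    elif kind == "cat":                     # R1 followed by R2
        s1, a1 = build_nfa(node[1], trans, eps, ids)
        s2, a2 = build_nfa(node[2], trans, eps, ids)
        eps.setdefault(s, set()).add(s1)
        eps.setdefault(a1, set()).add(s2)
        eps.setdefault(a2, set()).add(a)
    elif kind == "alt":                     # R1 union R2
        for child in node[1:]:
            sc, ac = build_nfa(child, trans, eps, ids)
            eps.setdefault(s, set()).add(sc)
            eps.setdefault(ac, set()).add(a)
    elif kind == "star":                    # zero or more repetitions of R1
        sc, ac = build_nfa(node[1], trans, eps, ids)
        eps.setdefault(s, set()).update({sc, a})
        eps.setdefault(ac, set()).update({sc, a})
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return s, a


def closure(states, eps):
    """Return all states reachable from states through epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def nfa_match(node, text: str) -> bool:
    """Return True when the whole text belongs to the language of node."""
    trans, eps = {}, {}
    start, accept = build_nfa(node, trans, eps, count())
    current = closure({start}, eps)
    for ch in text:
        moved = set()
        for q in current:
            moved |= trans.get((q, ch), set())
        current = closure(moved, eps)
    return accept in current


# (a|b)*abb
ab_star = ("star", ("alt", ("lit", "a"), ("lit", "b")))
abb = ("cat", ("lit", "a"), ("cat", ("lit", "b"), ("lit", "b")))
print(nfa_match(("cat", ab_star, abb), "babb"))  # True
```

The bit-parallel approaches mentioned on the next slide replace the Python sets with machine words.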

Page 19: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Efficiently Matching Regular Expressions

For small to moderate pattern sizes, optimizations using bit-parallelism appear to result in the fastest algorithms (Navarro, Raffinot)

For large pattern sizes (greater than about 4 times the word size), partial conversion from NFA to DFA results in good performance

Page 20: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Where can we go from here?

Approximate matching: match a literal string to within some “difference”
  Edit distance is commonly used
  Rules much more complex for computational biology applications

Extensions to regular expressions
  Used by most languages and applications
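For the edit-distance case, a standard dynamic-programming sketch looks like the following. It uses the well-known column-by-column formulation in which a match may start anywhere in the text. The name `approx_find` and the choice to report end positions are assumptions for this example.

```python
def approx_find(pattern: str, text: str, k: int) -> list[int]:
    """Return end positions (exclusive) of substrings within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and some
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # pattern character left out
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_find("cad", "abracadabra", 0))  # [7]
```

With k = 0 this degenerates to exact search, so "cad" is reported ending at position 7 of "abracadabra".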

Page 21: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Where can we go from here?

Efficiently handling regular expressions and approximate matching are problems in much of today’s research

Flexible Pattern Matching in Strings, by Navarro and Raffinot, referenced here, was published June 15, 2002

Page 22: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

What is a Regular Expression?

Say modern developers:
  A pattern that can be matched against a string
  Not necessarily a model of any particular machine
  Not necessarily (and not usually) regular
  A very powerful tool for solving text-based problems

Page 23: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Who uses regular expressions?

Where to find built-in “regular expression” support today?
  awk, grep, sed, vi, emacs, find, more, less, lex, Perl, Ruby, Tcl, MySQL, Javascript, PHP, Python, Java, Microsoft .NET, and many, many more

Built-in support has become more frequent and more advanced in the past few years

Page 24: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Irregular Regular Expressions?

Matching with the patterns accepted by most popular “regular expression” engines is NP-hard

Constructing a “regular expression” in Perl which matches representations of 3-colorable graphs is fairly straightforward

Page 25: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Irregular Regular Expressions?

Perl “regular expression” which matches any 3-colorable graph, given a number of vertices V and an edge-list E:

    $string = (join "\n", (("rgb") x $V))
            . "\n:\n"
            . join "\n", (("rgbrbgr") x @E) ;

    $regex = '^'
           . (join "\\n", (".*(.).*") x $V)
           . "\\n:\\n"
           . (join "\\n", map {".*\\$_->[0]\\$_->[1].*"} @E)
           . '$' ;

3-colorable iff $regex matches $string

(http://perl.plover.com/NPC/NPC-3COL.html)
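The same reduction can be tried in Python, whose `re` module also supports numbered backreferences. The sketch below is a port of the Perl idea, not the original code. The function name `three_colorable` is hypothetical, vertices are assumed to be numbered from 1, and the sketch is only meant for small graphs (Python reads a three-digit octal-looking `\NNN` as a character escape, so keep the vertex count under 100).

```python
import re


def three_colorable(num_vertices: int, edges: list[tuple[int, int]]) -> bool:
    """Decide 3-colorability by matching a backreference regex (exponential time)."""
    # One "rgb" line per vertex, then one "rgbrbgr" line per edge.
    # "rgbrbgr" contains every ordered pair of distinct colors side by side.
    string = ("\n".join(["rgb"] * num_vertices)
              + "\n:\n"
              + "\n".join(["rgbrbgr"] * len(edges)))
    # Group i captures the color chosen for vertex i; each edge line
    # requires the two endpoint colors to appear next to each other.
    regex = ("^"
             + r"\n".join([".*(.).*"] * num_vertices)
             + r"\n:\n"
             + r"\n".join(f".*\\{u}\\{v}.*" for u, v in edges)
             + "$")
    return re.match(regex, string) is not None


triangle = [(1, 2), (2, 3), (1, 3)]
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(three_colorable(3, triangle))  # True
print(three_colorable(4, k4))        # False
```

The engine finds a coloring by backtracking over the capture groups, which is exactly why matching time can blow up.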

Page 26: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Irregular Regular Expressions?

Usage of the term “regular expression” in modern development conflicts with its theoretical definition

Many are unaware of or ignore this conflict, while others choose different terminology:
  “Extended regular expression”
  “Regex”

Page 27: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Clear Definitions

Regular expression: a description of a regular language, as defined by Kleene

Regex: any pattern matched against a string, not necessarily regular

Page 28: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

The Main Culprit

Backreferences
  Ability to refer to text that has been matched in a previous part of the regex
  Typically expressed as \n, where n is a number; refers to the text matched by the regex inside the nth set of parentheses
  “(.*)\1\1” matches “abcabcabc”, “abaabaaba”...
  “\b(\w+)\b\s+\b\1\b” matches “the the”, “a a”… (double words)
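Both example patterns can be tried directly with Python's `re` module, which uses the same `\n` backreference notation. The variable names `triple` and `double_word` are just for illustration.

```python
import re

# The same text repeated three times in a row.
triple = re.compile(r"(.*)\1\1")
print(bool(triple.fullmatch("abcabcabc")))  # True
print(bool(triple.fullmatch("abcabc")))     # False

# A whole word immediately repeated (a "double word").
double_word = re.compile(r"\b(\w+)\b\s+\b\1\b")
print(double_word.search("Paris in the the spring").group(0))  # the the
print(double_word.search("the theory"))                        # None
```

The `\b` anchors matter: without them the second pattern would also fire on "the theory".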

Page 29: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Backreferences

Supported in limited number by vi, sed, grep, emacs, Ruby, Python, PHP

POSIX standard for Basic Regular Expressions includes capability to process nine backreferences

Bounding the number available places a bound on performance

Page 30: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Backreferences

Supported without quantity bounds by Perl 5 and later, Tcl, Java 1.4, .NET

Number of backreferences limited only by physical memory restrictions

Language                  vi  sed  grep  emacs  Ruby  PHP  Perl  Tcl  Java  .NET
Backreferences supported  9   9    9     9      10    9    ∞     ∞    ∞     ∞

Page 31: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Backreferences

Slow: processing a regex becomes NP-hard (for an unbounded number of backreferences)

Extremely useful: add a great deal of expressive power to a regex

Largely untouched by theoretical analysis
  No real bounds on efficiency

Page 32: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Lookahead

Also known as “zero-width matching”

Ability to check text ahead without consuming it in a match

Typically expressed as (?=text)

Example
  “abc(?=def)” will match “abc”, but only if followed by “def”

Language            sed  grep  emacs  lex  Ruby  PHP  Perl  Tcl  Java  .NET
Lookahead support?  No   No    No     Yes  Yes   Yes  Yes   Yes  Yes   Yes
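The lookahead example behaves the same way in Python's `re` module, which uses the Perl-style `(?=...)` syntax. The variable name `lookahead` is illustrative.

```python
import re

lookahead = re.compile(r"abc(?=def)")

m = lookahead.search("xxabcdef")
print(m.group(0))  # abc
print(m.end())     # 5

# No match when "abc" is not followed by "def".
print(lookahead.search("abcxyz"))  # None
```

The match ends at index 5, so "def" is checked but not consumed.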

Page 33: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Thank Larry Wall

Perl 5 regexes offer the ability to embed code within a regex

Perl 6 will support recursive regexes

Page 34: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Why the divide?

Very little theory has touched on extended regular expressions.

Backreferences are indispensable for many programmers, and often even in non-development use of *NIX systems

Page 35: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Why the divide?

Developers implemented regular expression processors shortly after Kleene created regular expressions in the 1950s

Page 36: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Why the divide?

New and more powerful features were quickly added to practical “regular expressions” so that users and programmers could express more languages

Regexes soon left theory in the dust

Page 37: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

Moral of the Story

It’s much easier to hack than to make a good proof

Page 38: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

The Future

Unbounded backreferences are becoming a standard feature in regex libraries and languages

The idea of implementing regexes in a common module and sharing it among different languages and platforms is growing in popularity
  PCRE (Perl Compatible Regular Expressions) is used by Python, PHP, Apache, KDE…

Page 39: Kleene Would Be Shocked Redrawing the Link Between Theory and Modern Regex Engines

The Future

Regex implementations seem to be moving towards more standardization

Meanwhile, a solid theoretical foundation has been laid down for regular expressions and modest extensions

Practice may not come to theory, but theory may soon come to practice