Introduction to Regular Expressions
Ben Brumfield
THATCamp Texas 2011
What are Regular Expressions?
• Very small language for describing text.
• Not a programming language.
• Incredibly powerful tool for search/replace operations.
• Arcane art.
• Ubiquitous.
Why Use Regular Expressions?
• Finding every instance of a string in a file, e.g. every mention of “chickens” in a farm diary.
• Counting how many times “sing” appears in a text in all tenses and conjugations.
• Reformatting dirty data.
• Validating input.
• Command line work: listing files, grepping log files.
The Basics
• A regex is a pattern enclosed within delimiters.
• Most characters match themselves.
• /THATCamp/ is a regular expression that matches “THATCamp”.
  – Slash is the delimiter enclosing the expression.
  – “THATCamp” is the pattern.
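A minimal sketch of the same match in Python, which is an assumption here since the slides name no language. Python has no slash delimiters, so the pattern is written as a raw string; the sample sentence is made up.

```python
import re

# The pattern is passed as a raw string instead of being enclosed in /.../.
pattern = re.compile(r"THATCamp")

# Hypothetical sample text, not from the slides.
text = "See you at THATCamp Texas"

match = pattern.search(text)          # looks for the pattern anywhere in the text
found = match.group(0) if match else None
print(found)
```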
/at/
• Matches strings with “a” immediately followed by “t”.
at hat
that atlas
aft Athens
• Of these, “at”, “hat”, “that” and “atlas” match.
• “aft” and “Athens” do not: in “aft” the “t” does not directly follow the “a”, and “Athens” has a capital “A”.
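The six sample strings can be checked mechanically. This is a small sketch using Python's `re.search`, which succeeds when the pattern occurs anywhere in the string.

```python
import re

strings = ["at", "hat", "that", "atlas", "aft", "Athens"]

# Keep the strings in which "a" is immediately followed by "t".
matches = [s for s in strings if re.search(r"at", s)]
non_matches = [s for s in strings if s not in matches]

print(matches)      # ['at', 'hat', 'that', 'atlas']
print(non_matches)  # ['aft', 'Athens']
```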
Some Theory
• Finite State Machine for the regex /at/
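The diagram itself is not reproduced in this transcript. As a rough stand-in, here is a hand-written state machine in Python that accepts any text containing “at”. The state names and the function name are illustrative assumptions, not taken from the slide.

```python
def contains_at(text: str) -> bool:
    """Simulate a small finite state machine for the regex /at/."""
    state = "start"
    for ch in text:
        if state == "start":
            # Wait for an "a"; anything else keeps us at the start.
            state = "saw_a" if ch == "a" else "start"
        elif state == "saw_a":
            if ch == "t":
                return True  # accepting state: "a" then "t" was seen
            # Another "a" keeps the machine primed; anything else resets it.
            state = "saw_a" if ch == "a" else "start"
    return False


accepted = [s for s in ["at", "hat", "aft", "aat", "Athens"] if contains_at(s)]
print(accepted)  # ['at', 'hat', 'aat']
```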
Characters
• Matching is case sensitive.
• Special characters: ( ) ^ $ { } [ ] \ | . + ? *
• To match a special character in your text, precede it with \ in your pattern:
  – /ironic [sic]/ does not match “ironic [sic]”
  – /ironic \[sic\]/ matches “ironic [sic]”
• Regular expressions can support Unicode.
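The escaping rule can be tried out in Python (an assumed choice of language). `re.escape` is a standard-library helper that adds the backslashes for you.

```python
import re

text = "ironic [sic]"

# Unescaped, [sic] is a character class: ONE of "s", "i" or "c".
unescaped = re.search(r"ironic [sic]", text)

# Escaped brackets match a literal "[" and "]".
escaped = re.search(r"ironic \[sic\]", text)

# re.escape builds an escaped pattern from a literal string.
auto = re.search(re.escape(text), text)

print(unescaped)         # None
print(escaped.group(0))  # ironic [sic]
print(auto.group(0))     # ironic [sic]
```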
Character Classes
• Characters within [ ] are choices for a single-character match.
• Think of a set operation, or a type of or.
• Order within the set is unimportant.
• /x[01]/ matches “x0” and “x1”.
• /[10][23]/ matches “02”, “03”, “12” and “13”.
• An initial ^ negates the class:
  – /[^45]/ matches any single character except 4 or 5.
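A short Python sketch of the class rules above. The candidate strings beyond those named on the slide are made up for contrast.

```python
import re

candidates = ["02", "03", "12", "13", "20", "31"]

# fullmatch requires the whole string to fit the pattern.
pairs = [s for s in candidates if re.fullmatch(r"[10][23]", s)]

# Order inside the brackets does not matter.
reordered = [s for s in candidates if re.fullmatch(r"[01][32]", s)]

# A leading ^ negates the class: every character except 4 or 5.
not_4_or_5 = re.findall(r"[^45]", "345 6")

print(pairs)       # ['02', '03', '12', '13']
print(not_4_or_5)  # ['3', ' ', '6']
```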
/[ch]at/
• Matches strings with “c” or “h”, followed by “a”, followed by “t”.
that at
chat cat
fat phat
• Of these, “that”, “chat”, “cat” and “phat” match.
• “at” and “fat” do not: neither has a “c” or “h” directly before “at”.
Ranges
• Ranges define sets of characters within a class.
  – /[1-9]/ matches any non-zero digit.
  – /[a-zA-Z]/ matches any unaccented letter.
  – /[12][0-9]/ matches numbers between 10 and 29.
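The three ranges can be tried out in Python as follows. The sample inputs are invented for illustration.

```python
import re

# [1-9] skips the zero in "2011".
nonzero = re.findall(r"[1-9]", "2011")

# [a-zA-Z] covers only the 52 unaccented Latin letters, so "é" is skipped.
letters = re.findall(r"[a-zA-Z]", "Mr. é7")

# [12][0-9] as a whole-string test: 10 through 29.
tens = [s for s in ["9", "10", "29", "30"] if re.fullmatch(r"[12][0-9]", s)]

print(nonzero)  # ['2', '1', '1']
print(letters)  # ['M', 'r']
print(tens)     # ['10', '29']
```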
Shortcuts
Shortcut  Name        Equivalent Class
\d        digit       [0-9]
\D        not digit   [^0-9]
\w        word        [a-zA-Z0-9_]
\W        not word    [^a-zA-Z0-9_]
\s        space       [\t\n\r\f\v ]
\S        not space   [^\t\n\r\f\v ]
.         everything  [^\n] (depends on mode)
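A sketch of the shortcuts in Python. One assumption to flag: on ordinary Python 3 strings these shortcuts are Unicode-aware, so the `re.ASCII` flag is used here to get exactly the classes in the table. The sample string is made up.

```python
import re

sample = "Room 42\tB_7!"

digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)
non_word = re.findall(r"\W", sample, re.ASCII)

# "." skips newlines unless the DOTALL mode is switched on.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.DOTALL)

print(digits)       # ['4', '2', '7']
print(words)        # ['Room', '42', 'B_7']
print(dot_default)  # ['a', 'b']
```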
/\d\d\d[- ]\d\d\d\d/
• Matches strings with:
  – Three digits
  – Space or dash
  – Four digits
501-1234 234 1252
652.2648 713-342-7452
PE6-5000 653-6464x256
• Of these, “501-1234” and “234 1252” match in full; “713-342-7452” and “653-6464x256” match in part (“342-7452” and “653-6464”).
• “652.2648” and “PE6-5000” do not match.
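To see exactly which part of each sample string the pattern picks up, one option is the following Python sketch (it needs Python 3.8 or later for the `:=` operator).

```python
import re

phone = re.compile(r"\d\d\d[- ]\d\d\d\d")

candidates = ["501-1234", "234 1252", "652.2648",
              "713-342-7452", "PE6-5000", "653-6464x256"]

# Map each string that matches to the exact text that was matched.
hits = {s: m.group(0) for s in candidates if (m := phone.search(s))}

for source, matched in hits.items():
    print(source, "->", matched)
```

Note that the pattern is not anchored, so it happily matches inside longer strings such as “713-342-7452”.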
Repeaters
• Symbols indicating that the preceding element of the pattern can repeat.
• /runs?/ matches runs or run
• /1\d*/ matches any number beginning with “1”.
Repeater  Count
?         zero or one
+         one or more
*         zero or more
{n}       exactly n
{n,m}     between n and m times
{,m}      no more than m times
{n,}      at least n times
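The repeaters can be exercised in Python like this. The inputs are invented, and the `{,m}` form is accepted by Python's `re` module but not by every regex flavor.

```python
import re

runs = re.findall(r"runs?", "run runs running")
ones = re.findall(r"1\d*", "1 15 1999 27")

# Counted repetition, tested against the whole string.
two_to_four = [s for s in ["7", "42", "2011", "12345"] if re.fullmatch(r"\d{2,4}", s)]
at_most_two = [s for s in ["", "a", "aa", "aaa"] if re.fullmatch(r"a{,2}", s)]
at_least_two = [s for s in ["", "a", "aa", "aaa"] if re.fullmatch(r"a{2,}", s)]

print(runs)         # ['run', 'runs', 'run']
print(ones)         # ['1', '15', '1999']
print(two_to_four)  # ['42', '2011']
```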
Repeaters
Strings:
1: “at”      2: “art”
3: “arrrrt”  4: “aft”
Patterns:
A: /ar?t/    B: /a[fr]?t/
C: /ar*t/    D: /ar+t/
E: /a.*t/    F: /a.+t/
Repeaters
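• /ar?t/ matches “at” and “art” but not “arrrrt”.
• /a[fr]?t/ matches “at”, “art”, and “aft”.
• /ar*t/ matches “at”, “art”, and “arrrrt”.
• /ar+t/ matches “art” and “arrrrt” but not “at”.
• /a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’.
• /a.+t/ matches anything with an ‘a’, at least one more character, and then a ‘t’: “art”, “arrrrt”, and “aft” but not “at”.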
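The whole pattern-against-string grid can be worked out in a few lines of Python. The labels A to F follow the exercise; the dictionary layout is just one convenient way to hold them.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]
patterns = {
    "A": r"ar?t",
    "B": r"a[fr]?t",
    "C": r"ar*t",
    "D": r"ar+t",
    "E": r"a.*t",
    "F": r"a.+t",
}

# For each pattern, list the strings it matches.
results = {label: [s for s in strings if re.search(p, s)]
           for label, p in patterns.items()}

for label, matched in results.items():
    print(label, matched)
```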
Lab Session I
• http://gskinner.com/RegExr/
• https://gist.github.com/922838
• Match the titles “Mr.” and “Ms.”.
• Find all conjugations and tenses of “sing”.
• Find all places where more than one space follows punctuation.
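One possible set of answers, sketched in Python against a made-up sentence rather than the workshop text. These are not the presenter's solutions, and the “sing” pattern deliberately over-matches (it would also accept non-words such as “sangs”).

```python
import re

# Hypothetical sample text; the lab's own text is in the gist linked above.
text = "Mr. Gray sang.  Ms. Dunn sings, and Mrs. Hall was singing!   They had sung."

# Titles "Mr." and "Ms." (but not "Mrs.").
titles = re.findall(r"M[rs]\.", text)

# Forms of "sing": sing/sang/sung plus an optional "s" or "ing".
sing_forms = [m.group(0) for m in re.finditer(r"\bs[aiu]ng(s|ing)?\b", text)]

# Punctuation followed by two or more spaces.
extra_spaces = re.findall(r"[.,;:!?] {2,}", text)

print(titles)      # ['Mr.', 'Ms.']
print(sing_forms)  # ['sang', 'sings', 'singing', 'sung']
```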
Lab Reference
Repeater  Count
?         zero or one
+         one or more
*         zero or more
{n}       exactly n
{n,m}     between n and m times
{,m}      no more than m times
{n,}      at least n times

Shortcut  Name
\d        digit
\D        not digit
\w        word
\W        not word
\s        space
\S        not space
.         everything
Anchors
• Anchors match between characters.
• Used to assert that the characters you’re matching must appear in a certain place.
• /\bat\b/ matches “at work” but not “batch”.
Anchor  Matches
^       start of line
$       end of line
\b      word boundary
\B      not boundary
\A      start of string
\Z      end of string (in Perl and Ruby, also just before a final newline)
\z      absolute end of string (rare)
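A sketch of the anchors in Python. One flavor difference to be aware of: in Python, `^` and `$` only act as line anchors when the `re.MULTILINE` flag is set; by default they anchor to the whole string. The two-line sample is made up.

```python
import re

word_at = re.search(r"\bat\b", "at work")  # matches the standalone word
inside = re.search(r"\bat\b", "batch")     # None: "at" is buried in a word

lines = "first line\nsecond line"

starts_default = re.findall(r"^\w+", lines)
starts_multiline = re.findall(r"^\w+", lines, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", lines, re.MULTILINE)

# \A always means start of string, whatever the mode.
string_start = re.findall(r"\A\w+", lines, re.MULTILINE)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(string_start)      # ['first']
```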
Alternation
• In Regex, | means “or”.
• You can put a full expression on the left and another full expression on the right.
• Either can match.
• /seeks?|sought/ matches “seek”, “seeks”, or “sought”.
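For instance, in Python (the sentence is invented):

```python
import re

sentence = "She seeks what he sought; we seek too."

# Either side of the | may match at each position.
found = re.findall(r"seeks?|sought", sentence)

print(found)  # ['seeks', 'sought', 'seek']
```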
Grouping
• Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation.
• The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”.
• /schema(ta)?/ matches “schema” and “schemata”. It also finds the “schema” inside “schematic”; adding word boundaries, /\bschema(ta)?\b/, excludes it.
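A quick Python check of grouping. It also shows that an unanchored pattern still finds “schema” at the start of “schematic”, while a version with word boundaries does not.

```python
import re

# (la)+ repeats the whole group, not just the final "a".
la = [s for s in ["la", "lala", "lalalala", "all"] if re.search(r"(la)+", s)]
longest = re.search(r"(la)+", "lalalala").group(0)

# Unanchored: the optional group is skipped and "schema" still matches.
loose = re.search(r"schema(ta)?", "schematic")

# With word boundaries, "schematic" is excluded.
strict = [s for s in ["schema", "schemata", "schematic"]
          if re.search(r"\bschema(ta)?\b", s)]

print(la)              # ['la', 'lala', 'lalalala']
print(loose.group(0))  # schema
print(strict)          # ['schema', 'schemata']
```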
Grouping Example
• What regular expression matches “eat”, “eats”, “ate” and “eaten”?
• /eat(s|en)?|ate/
• Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/
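The effect of the word boundaries can be seen by running both versions over one made-up sentence in Python:

```python
import re

sentence = "I eat, she eats, we ate, they have eaten; sate eating"

# finditer + group(0) returns whole matches even when the pattern has groups.
loose = [m.group(0) for m in re.finditer(r"eat(s|en)?|ate", sentence)]
strict = [m.group(0) for m in re.finditer(r"\b(eat(s|en)?|ate)\b", sentence)]

print(loose)   # ['eat', 'eats', 'ate', 'eaten', 'ate', 'eat']
print(strict)  # ['eat', 'eats', 'ate', 'eaten']
```

The two extra hits in the unanchored version come from inside “sate” and “eating”.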
Replacement
• Regexes are most often used for search/replace.
• Syntax varies; many scripting languages and CLI tools (Perl, sed, vi) use s/pattern/replacement/.
• s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”.
• s/\bsheeps\b/sheep/ converts
  – “sheepskin is made from sheeps” to
  – “sheepskin is made from sheep”
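Python spells the same operation `re.sub(pattern, replacement, text)` rather than s/…/…/. A sketch follows; the “dog eat dog” input is made up. Unlike a plain s/…/…/, `re.sub` replaces every occurrence unless `count` is given.

```python
import re

hounds = re.sub(r"dog", "hound", "slobbery dogs")
sheep = re.sub(r"\bsheeps\b", "sheep", "sheepskin is made from sheeps")

# count=1 limits the replacement to the first occurrence.
first_only = re.sub(r"dog", "hound", "dog eat dog", count=1)

print(hounds)      # slobbery hounds
print(sheep)       # sheepskin is made from sheep
print(first_only)  # hound eat dog
```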
Capture
• During searches, ( … ) groups capture the text they match for use in replacement.
• Special variables $1, $2, $3 etc. contain the captures, numbered by opening parenthesis from left to right.
• Matching /(\d\d\d)-(\d\d\d\d)/ against “123-4567”:
  – $1 contains “123”
  – $2 contains “4567”
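Python has no $1 or $2 variables; the captures are read from the match object instead. A minimal sketch:

```python
import re

m = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")

first = m.group(1)   # what the slide calls $1
second = m.group(2)  # what the slide calls $2
both = m.groups()    # all captures as a tuple

print(first, second)  # 123 4567
```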
Capture
• How do you convert
  – “Smith, James” and “Jones, Sally” to
  – “James Smith” and “Sally Jones”?
• s/(\w+), (\w+)/$2 $1/
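The same swap in Python, where the replacement string refers to captures as `\1` and `\2` instead of $1 and $2:

```python
import re

names = ["Smith, James", "Jones, Sally"]

# \2 is the first name, \1 the surname.
flipped = [re.sub(r"(\w+), (\w+)", r"\2 \1", n) for n in names]

print(flipped)  # ['James Smith', 'Sally Jones']
```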
Capture
• Given a file containing URLs, create a script that wgets each URL:
  – http://bit.ly/DHapiTRANSCRIBE
• becomes:
  – wget "http://bit.ly/DHapiTRANSCRIBE"
• s/^(.*)$/wget "$1"/
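A Python sketch of the same substitution over a small file's worth of text. The second URL is a made-up placeholder, and the sketch assumes the input has no blank lines or trailing newline. `re.MULTILINE` makes `^` and `$` work line by line.

```python
import re

# One URL per line; the example.org address is hypothetical.
urls = "http://bit.ly/DHapiTRANSCRIBE\nhttp://example.org/page"

# Wrap each whole line in a wget command with straight quotes.
script = re.sub(r"^(.*)$", r'wget "\1"', urls, flags=re.MULTILINE)

print(script)
```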
Lab Session II
• Convert all Miss and Mrs. to Ms.
• Convert infinitives to gerunds
  – “to sing” -> “singing”
• Extract last name, first name from (title first name last name)
  – Dr. Thelma Dunn
  – Mr. Clay Shirky
  – Dana Gray
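One possible set of answers in Python. These are not the presenter's solutions, and the patterns are deliberately simple: the gerund rule, for example, would also turn “to Boston” into “Bostoning”. The first sample sentence and the helper name `last_first` are made up.

```python
import re

# Miss / Mrs. -> Ms.
ms = re.sub(r"\bMiss\b|\bMrs\.", "Ms.", "Miss Hall and Mrs. Dunn")

# "to <verb>" -> "<verb>ing"
gerund = re.sub(r"\bto (\w+)", r"\1ing", "to sing")


def last_first(name: str) -> str:
    """Turn '[title ]first last' into 'last, first' (hypothetical helper)."""
    # Group 1 is the optional title, group 2 the first name, group 3 the last.
    return re.sub(r"^(\w+\. )?(\w+) (\w+)$", r"\3, \2", name)


people = [last_first(n) for n in ["Dr. Thelma Dunn", "Mr. Clay Shirky", "Dana Gray"]]

print(ms)      # Ms. Hall and Ms. Dunn
print(gerund)  # singing
print(people)  # ['Dunn, Thelma', 'Shirky, Clay', 'Gray, Dana']
```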
Caveats
• Do not use regular expressions to parse (complicated) XML!
• Check the language/application-specific documentation: some common shortcuts are not universal.
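To illustrate the first caveat, here is a small Python sketch in which a naive regex trips over nested elements while a real XML parser (the standard library's `xml.etree.ElementTree`) does not. The XML snippet is made up, and the `*?` “lazy” repeater is an extra not covered in these slides.

```python
import re
import xml.etree.ElementTree as ET

xml = "<div><div>inner</div> tail</div>"

# The regex stops at the FIRST closing tag, cutting the outer element short.
naive = re.search(r"<div>(.*?)</div>", xml).group(1)

# A parser understands the nesting.
root = ET.fromstring(xml)
inner = root.find("div").text

print(naive)  # <div>inner
print(inner)  # inner
```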
Acknowledgments
• James Edward Gray II and Dana Gray
  – Much of the structure and some of the wording of this presentation comes from
  – http://www.slideshare.net/JamesEdwardGrayII/regular-expressions-7337223