regular expressions
An old, but still very relevant, short course on regular expressions, with examples of how to use them and references for further reading.
Regular Expressions
Powerful string validation and extraction
Ignaz Wanders – Architect @ Archimiddle, @ignazw
Topics
• What are regular expressions?
• Patterns
• Character classes
• Quantifiers
• Capturing groups
• Boundaries
• Internationalization
• Regular expressions in Java
• Quiz
• References
What are regular expressions?
• A regex is a string pattern used to search and manipulate text
• A regex has its own special syntax
• Very powerful for any type of string manipulation, ranging from simple to very complex structures:
  – Input validation
  – S(ubs)tring replacement
  – ...
• Example (matches an e-mail address when applied case-insensitively):
  [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
History
• Originates from the automata and formal-language theories of computer science
• Stephen Kleene, 1950s: Kleene algebra
• Kenneth Thompson, 1969: Unix: qed, ed
• 1970s to 1990s: Unix: grep, awk, sed, emacs
• Programming languages:
  – C, Perl
  – JavaScript, Java
Patterns
• Regex is based on pattern matching: strings are searched for certain patterns
• The simplest regex is a string-literal pattern
• Metacharacters: ([{\^$|)?*+.
  – A period means “any character”
  – To search for a period as a string literal, escape it with “\”
REGEX: fox    TEXT: The quick brown fox    RESULT: fox
REGEX: fo.    TEXT: The quick brown fox    RESULT: fox
REGEX: .o.    TEXT: The quick brown fox    RESULT: row, fox
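The three searches above can be reproduced in Java with a small helper. This is a minimal sketch: the class name PatternBasics and the helper findAll are made up for illustration and are not part of the original slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class PatternBasics {

    // Collects every non-overlapping match of regex in text, scanning left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text)); // string-literal pattern: [fox]
        System.out.println(findAll("fo.", text)); // period matches any character: [fox]
        System.out.println(findAll(".o.", text)); // [row, fox]
        // An escaped period only matches a literal period.
        System.out.println(findAll("3\\.14", "3.14 or 3x14")); // [3.14]
    }
}
```

Note that the backslash has to be doubled inside a Java string literal, so the regex 3\.14 is written as "3\\.14".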
Character classes (1/3)
• Syntax: any characters between [ and ]
• A character class matches exactly one character
• Negation: ^

REGEX: [rcb]at     TEXT: bat    RESULT: bat
REGEX: [rcb]at     TEXT: rat    RESULT: rat
REGEX: [rcb]at     TEXT: cat    RESULT: cat
REGEX: [rcb]at     TEXT: hat    RESULT: -
REGEX: [^rcb]at    TEXT: rat    RESULT: -
REGEX: [^rcb]at    TEXT: hat    RESULT: hat
Character classes (2/3)
• Ranges: [a-z], [0-9], [i-n], [a-zA-Z]...
• Unions: [0-4[6-8]], [a-p[r-w]], ...
• Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
• Subtractions: [a-f&&[^efg]], ...
REGEX: [rcb]at[1-5]           TEXT: bat4    RESULT: bat4
REGEX: [rcb]at[1-5[7-8]]      TEXT: hat7    RESULT: -
REGEX: [rcb]at[1-7&&[78]]     TEXT: rat7    RESULT: rat7
REGEX: [rcb]at[1-5&&[^34]]    TEXT: bat4    RESULT: -
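The character-class operations above (negation, union, intersection, subtraction) can be checked with Pattern.matches, which tests whether the whole input matches. A minimal sketch; the class name CharacterClassDemo and the method fullMatch are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class CharacterClassDemo {

    // True only if the entire text matches the regex.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[rcb]at", "bat"));              // simple class
        System.out.println(fullMatch("[^rcb]at", "hat"));             // negation
        System.out.println(fullMatch("[rcb]at[1-5[7-8]]", "cat7"));   // union: 1-5 or 7-8
        System.out.println(fullMatch("[rcb]at[1-7&&[78]]", "rat7"));  // intersection: only 7
        System.out.println(fullMatch("[rcb]at[1-5&&[^34]]", "bat4")); // subtraction: 1, 2 or 5
    }
}
```

The nested-bracket union and the && operator are Java syntax; other regex flavours may not support them.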
Character classes (3/3)
Predefined character class    Meaning                             Equivalence
.                             any character
\d                            any digit                           [0-9]
\D                            any non-digit                       [^0-9], [^\d]
\s                            any white-space character           [ \t\n\x0B\f\r]
\S                            any non-white-space character       [^\s]
\w                            any word character                  [a-zA-Z_0-9]
\W                            any non-word character              [^\w]
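A few everyday uses of the predefined classes, sketched with the String convenience methods. The class and method names (PredefinedClassDemo, digitsOnly, words, isWordCharsOnly) and the sample inputs are assumptions made for this example.

```java
// Hypothetical demo class, for illustration only.
final class PredefinedClassDemo {

    // Removes every non-digit (\D) from the input.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of white space (\s+) after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // True if the input consists of one or more word characters (\w).
    static boolean isWordCharsOnly(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("+31 (0)20-555 12 34"));
        System.out.println(String.join("|", words("  The quick\tbrown\nfox ")));
        System.out.println(isWordCharsOnly("user_42"));
    }
}
```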
Quantifiers (1/5)
• Quantifiers allow character classes to match more than one character at a time.
Quantifiers for character classes X

Quantifier    X occurs
X?            zero or one time
X*            zero or more times
X+            one or more times
X{n}          exactly n times
X{n,}         at least n times
X{n,m}        at least n and at most m times
Quantifiers (2/5)
• Examples of X?, X*, X+
REGEX: "a?"    TEXT: ""       RESULT: ""
REGEX: "a*"    TEXT: ""       RESULT: ""
REGEX: "a+"    TEXT: ""       RESULT: -
REGEX: "a?"    TEXT: "a"      RESULT: "a"
REGEX: "a*"    TEXT: "a"      RESULT: "a"
REGEX: "a+"    TEXT: "a"      RESULT: "a"
REGEX: "a?"    TEXT: "aaa"    RESULT: "a", "a", "a"
REGEX: "a*"    TEXT: "aaa"    RESULT: "aaa"
REGEX: "a+"    TEXT: "aaa"    RESULT: "aaa"
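When these examples are run with Matcher.find() in Java, a? and a* additionally report an empty match at the end of the input, because "zero times" also matches there. The sketch below makes that visible; QuantifierBasics and findAll are illustrative names, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class QuantifierBasics {

    // Collects every match found by repeated find() calls, including empty ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            for (String text : new String[] {"", "a", "aaa"}) {
                // Printing the match count makes the zero-length matches easy to spot.
                List<String> hits = findAll(regex, text);
                System.out.println(regex + " on \"" + text + "\": " + hits.size() + " match(es) " + hits);
            }
        }
    }
}
```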
Quantifiers (3/5)
REGEX: "[abc]{3}"    TEXT: "abccabaaaccbbbc"       RESULT: "abc", "cab", "aaa", "ccb", "bbc"
REGEX: "abc{3}"      TEXT: "abccabaaaccbbbc"       RESULT: -
REGEX: "(dog){3}"    TEXT: "dogdogdogdogdogdog"    RESULT: "dogdogdog", "dogdogdog"
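The point of these three examples is what the quantifier binds to: a character class, a single character, or a group. A minimal Java sketch, with CountedQuantifierDemo and countMatches as made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class CountedQuantifierDemo {

    // Counts the non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // {3} applies to the whole class: any three of a, b, c.
        System.out.println(countMatches("[abc]{3}", "abccabaaaccbbbc"));
        // {3} applies to the 'c' only: the pattern means "abccc".
        System.out.println(countMatches("abc{3}", "abccabaaaccbbbc"));
        // {3} applies to the group: "dogdogdog".
        System.out.println(countMatches("(dog){3}", "dogdogdogdogdogdog"));
    }
}
```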
Quantifiers (4/5)
• Greedy quantifiers:
  – read the complete string
  – work backwards until a match is found
  – syntax: X?, X*, X+, ...
• Reluctant quantifiers:
  – read one character at a time
  – work forward until a match is found
  – syntax: X??, X*?, X+?, ...
• Possessive quantifiers:
  – read the complete string
  – try to match only once
  – syntax: X?+, X*+, X++, ...
Quantifiers (5/5)
greedy:        REGEX: ".*foo"     TEXT: "xfooxxxxxxfoo"    RESULT: "xfooxxxxxxfoo"
reluctant:     REGEX: ".*?foo"    TEXT: "xfooxxxxxxfoo"    RESULT: "xfoo", "xxxxxxfoo"
possessive:    REGEX: ".*+foo"    TEXT: "xfooxxxxxxfoo"    RESULT: -
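The three behaviours can be compared side by side in Java. In the possessive case .*+ consumes the whole input and never gives anything back, so the trailing foo can no longer match. A minimal sketch; GreedinessDemo and findAll are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class GreedinessDemo {

    // Collects every match found by repeated find() calls.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        System.out.println("greedy:     " + findAll(".*foo", text));  // [xfooxxxxxxfoo]
        System.out.println("reluctant:  " + findAll(".*?foo", text)); // [xfoo, xxxxxxfoo]
        System.out.println("possessive: " + findAll(".*+foo", text)); // []
    }
}
```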
Capturing groups (1/2)
• Capturing groups treat multiple characters as a single unit
• Syntax: between parentheses ( and )
• Example: (dog){3}
• Groups are numbered by their opening parenthesis, from left to right
  – Example: ((A)(B(C)))
    • Group 1: ((A)(B(C)))
    • Group 2: (A)
    • Group 3: (B(C))
    • Group 4: (C)
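The numbering can be verified with Matcher.group(int). In Java, group 0 always denotes the entire match, so the four groups above appear at indices 1 to 4. A minimal sketch; GroupNumberingDemo and groups are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class GroupNumberingDemo {

    // Returns group 0 (the whole match) followed by every capturing group,
    // or an empty list if the text does not match the regex as a whole.
    static List<String> groups(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // [ABC, ABC, A, BC, C]: whole match, then groups 1 to 4
        System.out.println(groups("((A)(B(C)))", "ABC"));
    }
}
```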
Capturing groups (2/2)
• Backreferences to capturing groups are denoted by \i, where i is the number of the group

REGEX: "(\d\d)\1"    TEXT: "1212"    RESULT: "1212"
REGEX: "(\d\d)\1"    TEXT: "1234"    RESULT: -
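A backreference matches the text that the group actually captured, not the group's pattern. The sketch below repeats the example above and adds a second, assumed use case (finding an accidentally doubled word); BackreferenceDemo and its methods are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class BackreferenceDemo {

    // True if the input is two digits followed by the same two digits.
    static boolean isRepeatedPair(String text) {
        return Pattern.matches("(\\d\\d)\\1", text);
    }

    // Returns the first word that is immediately repeated, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212")); // true
        System.out.println(isRepeatedPair("1234")); // false
        System.out.println(firstDoubledWord("this is is a test")); // is
    }
}
```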
Boundaries (1/2)
Boundary characters

^     beginning of line
$     end of line
\b    a word boundary
\B    a non-word boundary
\A    beginning of input
\G    end of previous match
\z    end of input
\Z    end of input, but before final terminator, if any
Boundaries (2/2)
• Be aware:
• End-of-line marker is $
  – Unix EOL is \n
  – Windows EOL is \r\n
  – The JDK treats any of the following as EOL:
    • '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029'
• Always test your regular expressions on the target OS
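One Java detail worth knowing here: by default ^ and $ anchor to the whole input, and they only match at every line break when the Pattern.MULTILINE flag is set. The sketch below counts matches with and without the flag on a text that mixes Unix and Windows line endings; BoundaryDemo and countMatches are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class BoundaryDemo {

    // Counts the non-overlapping matches of an already compiled pattern in text.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\r\nthree"; // mixed Unix and Windows EOLs

        // Default mode: ^ and $ refer to the start and end of the whole input.
        System.out.println(countMatches(Pattern.compile("^\\w+$"), text));
        // MULTILINE mode: ^ and $ also match around each line terminator.
        System.out.println(countMatches(Pattern.compile("^\\w+$", Pattern.MULTILINE), text));
        // \b matches between a word character and a non-word character.
        System.out.println(countMatches(Pattern.compile("\\bcat\\b"), "cat concat cat"));
    }
}
```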
Internationalization (1/2)
• Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
  – Thus “België” is not matched by ^\w+$
• The extension to Unicode character sets is denoted by \p{...}
• Character set: [\p{InCharacterSet}]
  – Create character classes from symbols in character sets.
  – “België” is matched by ^[\w\p{InLatin-1Supplement}]+$
Internationalization (2/2)
• Note that there are non-letters in character sets as well:
  – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª« ®¯°±²³´µ·¸¹º»¼½¾¿÷
• Categories:
  – Letters: \p{L}
  – Uppercase letters: \p{Lu}
  – “België” is matched by ^\p{L}+$
• Other categories and POSIX classes:
  – Unicode currency symbols: \p{Sc}
  – ASCII punctuation characters: \p{Punct}
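The “België” examples can be tried out as follows. The sketch writes ë as the escape \u00eb so that it does not depend on the source-file encoding. The Pattern.UNICODE_CHARACTER_CLASS flag used in the last method is not mentioned in the slides; it exists since Java 7 and makes \w Unicode-aware. I18nDemo and its method names are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class I18nDemo {

    static final String BELGIE = "Belgi\u00eb"; // "België"

    // \w is ASCII-only by default in Java, so ë is not a word character here.
    static boolean asciiWord(String s) {
        return s.matches("^\\w+$");
    }

    // \p{L} matches any Unicode letter.
    static boolean unicodeLetters(String s) {
        return s.matches("^\\p{L}+$");
    }

    // Assumption: Java 7 or later, where this flag makes \w follow Unicode rules.
    static boolean unicodeWord(String s) {
        return Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiWord(BELGIE));         // false
        System.out.println(unicodeLetters(BELGIE));    // true
        System.out.println(unicodeWord(BELGIE));       // true
        // The pound sign is in Latin-1 Supplement but is a currency symbol, not a letter.
        System.out.println("\u00a3".matches("\\p{Sc}")); // true
    }
}
```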
Regular expressions in Java
• Since JDK 1.4
• Package java.util.regex
  – Pattern class
  – Matcher class
• Convenience methods in java.lang.String
• Alternative for JDK 1.3
  – Jakarta ORO project
java.util.regex.Pattern
• Compiled representation of a regular expression
• Useful methods:
  – compile(String regex): Pattern
  – matches(String regex, CharSequence text): boolean (static)
  – split(CharSequence text): String[]

String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);
java.util.regex.Matcher
• Useful methods:
  – matches(): boolean
  – find(): boolean
  – find(int start): boolean
  – group(): String
  – replaceFirst(String replace): String
  – replaceAll(String replace): String

String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);
String text = "1212";
Matcher m = p.matcher(text);
boolean matches = m.matches();
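The difference between matches() (the whole input must match) and find() (a match anywhere is enough), plus the two replace methods, can be sketched like this. MatcherDemo, its method names and the "####" mask are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class MatcherDemo {

    // Compile once and reuse: Pattern instances are immutable and thread-safe.
    private static final Pattern PAIR = Pattern.compile("(\\d\\d)\\1");

    // matches(): the entire input must match the pattern.
    static boolean wholeInputMatches(String text) {
        return PAIR.matcher(text).matches();
    }

    // find(): the pattern may match anywhere in the input.
    static boolean containsMatch(String text) {
        return PAIR.matcher(text).find();
    }

    // Replaces only the first match.
    static String maskFirst(String text) {
        return PAIR.matcher(text).replaceFirst("####");
    }

    // Replaces every match.
    static String maskAll(String text) {
        return PAIR.matcher(text).replaceAll("####");
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("x1212")); // false
        System.out.println(containsMatch("x1212"));     // true
        System.out.println(maskFirst("1212 and 3434")); // #### and 3434
        System.out.println(maskAll("1212 and 3434"));   // #### and ####
    }
}
```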
java.lang.String
• Pattern and Matcher methods in String:
  – matches(String regex): boolean
  – split(String regex): String[]
  – replaceFirst(String regex, String replace): String
  – replaceAll(String regex, String replace): String
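A short sketch of the four String convenience methods. The sample inputs, the class name StringRegexDemo and the method names are made up for illustration. In the replacement string, $1, $2 and $3 refer to the capturing groups of the match.

```java
// Hypothetical demo class, for illustration only.
final class StringRegexDemo {

    // String.matches: the whole string must match.
    static boolean isRepeatedPair(String s) {
        return s.matches("(\\d\\d)\\1");
    }

    // String.split: the regex describes the separator.
    static String[] splitList(String s) {
        return s.split("\\s*[,;]\\s*");
    }

    // String.replaceFirst: collapse only the first run of white space.
    static String collapseFirstGap(String s) {
        return s.replaceFirst("\\s+", " ");
    }

    // String.replaceAll with group references: yyyy-mm-dd becomes dd/mm/yyyy.
    static String toDayFirst(String s) {
        return s.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212"));
        System.out.println(String.join("|", splitList("a, b ;c")));
        System.out.println(collapseFirstGap("a  b   c"));
        System.out.println(toDayFirst("2024-01-15"));
    }
}
```

For a regex that is applied many times, compiling a Pattern once and reusing it is usually preferable to calling these String methods repeatedly.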
Examples
• Validation
• Searching text
• Filtering
• Parsing
• Removing duplicate lines
• On-the-fly editing
Examples: validation
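• Validate an e-mail address (apply case-insensitively):
  [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
• Validate a URL (written here as a Java string literal, hence the doubled backslashes):
  (http|https|ftp)://([a-zA-Z0-9](\\w+\\.)+\\w{2,7}|local\\w*)(:\\d+)?(/(\\w+[\\w/\\-\\.]*)?)?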
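Both validation patterns can be wrapped in small helper methods. This is a sketch under two assumptions: the e-mail pattern is compiled with Pattern.CASE_INSENSITIVE because it only lists uppercase letters, and validation means the whole input must match. ValidationDemo, isEmail and isUrl are illustrative names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class ValidationDemo {

    // The e-mail regex from the slide, applied case-insensitively.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // The URL regex from the slide, unchanged.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(http|https|ftp)://([a-zA-Z0-9](\\w+\\.)+\\w{2,7}|local\\w*)(:\\d+)?(/(\\w+[\\w/\\-\\.]*)?)?");

    static boolean isEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    static boolean isUrl(String s) {
        return URL_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("john.doe@example.com"));            // true
        System.out.println(isEmail("john.doe(at)example.com"));         // false
        System.out.println(isUrl("http://www.example.com/index.html")); // true
        System.out.println(isUrl("http://localhost:8080/"));            // true
    }
}
```

Both patterns are deliberately simple. For example, the {2,4} limit in the e-mail pattern rejects addresses whose top-level domain is longer than four characters.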
Examples: searching text
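• Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
9[0-9]{6}-[0-9]{6}
// text holds the body of the HTTP response
String regexp = "9[0-9]{6}-[0-9]{6}";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(text);
boolean ok = m.find();
// group() throws IllegalStateException if find() did not succeed
String nr = ok ? m.group() : null;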
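The same search as a self-contained helper, without HttpUnit. The sketch assumes the response body is already available as a String; FormNumberFinder, find and the sample HTML are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class FormNumberFinder {

    private static final Pattern FORM_NUMBER = Pattern.compile("9[0-9]{6}-[0-9]{6}");

    // Returns the first form number in the response text, if there is one.
    static Optional<String> find(String responseText) {
        Matcher m = FORM_NUMBER.matcher(responseText);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample response body.
        String html = "<p>Your form number is 9123456-654321.</p>";
        System.out.println(find(html).orElse("no form number found"));
    }
}
```

Returning an Optional keeps the "no match" case explicit instead of relying on the caller to check find() before calling group().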
Examples: filtering
• Filter e-mails whose subject is written in capitals only, allowing for any number of leading “Re:” prefixes
(R[eE]:)*[^a-z]*$
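A sketch of how this filter might be applied to a subject line, assuming the whole subject has to match. SubjectFilter and isCapitalsOnly are illustrative names.

```java
import java.util.regex.Pattern;

// Hypothetical filter class, for illustration only.
final class SubjectFilter {

    // Optional "Re:" / "RE:" prefixes, followed by anything without lowercase ASCII letters.
    private static final Pattern CAPITALS_ONLY = Pattern.compile("(R[eE]:)*[^a-z]*$");

    static boolean isCapitalsOnly(String subject) {
        return CAPITALS_ONLY.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCapitalsOnly("RE: URGENT NEWS"));     // true
        System.out.println(isCapitalsOnly("Re:RE: BUY NOW!!!"));   // true
        System.out.println(isCapitalsOnly("Re: lunch tomorrow?")); // false
    }
}
```

Because [^a-z] accepts anything except lowercase ASCII letters, subjects consisting only of digits or punctuation pass the filter as well.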
Examples: parsing
• Matches an opening XML tag together with its corresponding closing tag:
  – Note the use of the backreference
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
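In Java this pattern could be used as sketched below. Because the regex only lists uppercase letters, the sketch compiles it with Pattern.CASE_INSENSITIVE, which is an assumption on top of the slide. TagDemo and firstElement are made-up names, and a regex like this is no substitute for a real XML parser (nested elements with the same name, for instance, are not handled).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class TagDemo {

    // Group 1 is the tag name, group 2 the content; \1 requires the same name in the closing tag.
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Returns {tagName, content} of the first matching element, or null if there is none.
    static String[] firstElement(String text) {
        Matcher m = ELEMENT.matcher(text);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] element = firstElement("<p class=\"x\">text</p>");
        System.out.println(element[0] + " -> " + element[1]); // p -> text
        System.out.println(firstElement("<b>bold</i>") == null); // true: tags do not correspond
    }
}
```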
Examples: duplicate lines
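• Suppose you want to remove duplicate lines from a text.
  – The requirement here is that the lines are sorted alphabetically, so that duplicates are adjacent
  – Replace each match with the contents of group 1; ^ and $ must match at line breaks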
^(.*)(\r?\n\1)+$
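A sketch of this technique in Java. Two things are assumptions on top of the slide: the pattern is compiled with Pattern.MULTILINE so that ^ and $ match at every line, and each match is replaced by "$1", the first copy of the line. DuplicateLineRemover and removeAdjacentDuplicates are illustrative names.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DuplicateLineRemover {

    // A line, followed by one or more exact copies of itself on the next lines.
    private static final Pattern DUPLICATES =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Collapses each run of identical adjacent lines into a single line.
    static String removeAdjacentDuplicates(String text) {
        return DUPLICATES.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String sorted = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(sorted));
    }
}
```

Only adjacent duplicates are removed, which is why the input has to be sorted first.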
Examples: on-the-fly editing
• Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
• In Unix: use the sed command with a regex
• In Java: use string.replaceAll(regex, "mystring")
• In Ant: use the optional replaceregexp task to, e.g., edit deployment descriptors depending on the environment
Quiz
• What are the following regular expressions looking for?
\d+                            at least one digit
[-+]?\d+                       any integer
((\d*\.?)?\d+|\d+(\.?\d*))     any unsigned decimal number
[\p{L}']['\-\.\p{L} ]+         a place name
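The quiz answers can be checked by applying each regex to a few sample inputs. The inputs, the class name QuizDemo and the method names below are made up for illustration; each regex is applied to the whole input.

```java
// Hypothetical demo class, for illustration only.
final class QuizDemo {

    static boolean isInteger(String s) {
        return s.matches("[-+]?\\d+");
    }

    static boolean isUnsignedDecimal(String s) {
        return s.matches("((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))");
    }

    // A letter or apostrophe, followed by letters, spaces, apostrophes, hyphens or periods.
    static boolean looksLikePlaceName(String s) {
        return s.matches("[\\p{L}']['\\-\\.\\p{L} ]+");
    }

    public static void main(String[] args) {
        System.out.println(isInteger("-42"));                       // true
        System.out.println(isUnsignedDecimal(".5"));                // true
        System.out.println(isUnsignedDecimal("5."));                // true
        System.out.println(looksLikePlaceName("'s-Hertogenbosch")); // true
        System.out.println(looksLikePlaceName("Area 51"));          // false: digits not allowed
    }
}
```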
Conclusion
• When doing one of the following:
  – validating strings
  – on-the-fly editing of strings
  – searching strings
  – filtering strings
• think regex!
References
• http://www.regular-expressions.info/
• http://www.regexlib.com/
• http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
• http://java.sun.com/docs/books/tutorial/extra/regex/
• http://www.wellho.net/regex/javare.html
• >JDK 1.4 API
• Mastering Regular Expressions