regular expressions

Regular Expressions: Powerful string validation and extraction
Ignaz Wanders – Architect @ Archimiddle, @ignazw

Upload: ignaz-wanders

Post on 20-Nov-2014


DESCRIPTION

An old - but still very relevant - short course on regular expressions, plus examples on how to use them, and references where to find more.

TRANSCRIPT

Page 1: Regular expressions

Regular Expressions

Powerful string validation and extraction

Ignaz Wanders – Architect @ Archimiddle, @ignazw

Page 2: Regular expressions

Topics

• What are regular expressions?
• Patterns
• Character classes
• Quantifiers
• Capturing groups
• Boundaries
• Internationalization
• Regular expressions in Java
• Quiz
• References

Page 3: Regular expressions

What are regular expressions?

• A regex is a string pattern used to search and manipulate text
• A regex has special syntax
• Very powerful for any type of String manipulation, ranging from simple to very complex structures:
– Input validation
– S(ubs)tring replacement
– ...
• Example: [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}

Page 4: Regular expressions

History

• Originates from the automata and formal-language theories of computer science
• Stephen Kleene, 1950s: Kleene algebra
• Kenneth Thompson, 1969: Unix: qed, ed
• 1970s to 1990s: Unix: grep, awk, sed, emacs
• Programming languages:
– C, Perl
– JavaScript, Java

Page 5: Regular expressions

Patterns

• Regex is based on pattern matching: strings are searched for certain patterns
• Simplest regex is a string-literal pattern
• Metacharacters: ([{\^$|)?*+.
– Period means “any character”
– To search for a period as a string literal, escape it with “\”

REGEX: fox
TEXT: The quick brown fox
RESULT: fox

REGEX: fo.
TEXT: The quick brown fox
RESULT: fox

REGEX: .o.
TEXT: The quick brown fox
RESULT: row, fox
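The three searches above can be reproduced in Java with a small sketch. The class name LiteralAndDotDemo and the helper findAll are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class LiteralAndDotDemo {
    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text));  // [fox]
        System.out.println(findAll("fo.", text));  // [fox]
        System.out.println(findAll(".o.", text));  // [row, fox]
        // An escaped period matches only a literal period.
        System.out.println(findAll("\\.", "a.b")); // [.]
    }
}
```

Note that the backslash itself has to be doubled inside a Java string literal, so the regex \. is written as "\\.".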

Page 6: Regular expressions

Character classes (1/3)

• Syntax: any characters between [ and ]
• A character class matches exactly one character
• Negation: ^ as the first character inside the brackets

REGEX: [rcb]at
TEXT: bat
RESULT: bat

REGEX: [rcb]at
TEXT: rat
RESULT: rat

REGEX: [rcb]at
TEXT: cat
RESULT: cat

REGEX: [rcb]at
TEXT: hat
RESULT: -

REGEX: [^rcb]at
TEXT: rat
RESULT: -

REGEX: [^rcb]at
TEXT: hat
RESULT: hat

Page 7: Regular expressions

Character classes (2/3)

• Ranges: [a-z], [0-9], [i-n], [a-zA-Z]...

• Unions: [0-4[6-8]], [a-p[r-w]], ...

• Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...

• Subtractions: [a-f&&[^efg]], ...

REGEX: [rcb]at[1-5]
TEXT: bat4
RESULT: bat4

REGEX: [rcb]at[1-5[7-8]]
TEXT: hat7
RESULT: -

REGEX: [rcb]at[1-7&&[78]]
TEXT: rat7
RESULT: rat7

REGEX: [rcb]at[1-5&&[^34]]
TEXT: bat4
RESULT: -
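A short Java sketch of these character-class operations follows. The nested union and the && intersection are Java syntax and may not work in other regex flavours; the class name CharacterClassDemo and its helper are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class CharacterClassDemo {
    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("[rcb]at[1-5]", "bat4"));        // true
        System.out.println(matches("[rcb]at[1-5[7-8]]", "rat7"));   // true: 7 is in the union
        System.out.println(matches("[rcb]at[1-5[7-8]]", "hat7"));   // false: h is not in [rcb]
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));  // true: 7 is in both sets
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat6"));  // false: 6 is not in [78]
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4")); // false: 4 is subtracted
        System.out.println(matches("[^rcb]at", "hat"));             // true: negated class
    }
}
```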

Page 8: Regular expressions

Character classes (3/3)

Predefined character class   Meaning                         Equivalence

.                            any character
\d                           any digit                       [0-9]
\D                           any non-digit                   [^0-9], [^\d]
\s                           any white-space character       [ \t\n\x0B\f\r]
\S                           any non-white-space character   [^\s]
\w                           any word character              [a-zA-Z_0-9]
\W                           any non-word character          [^\w]

Page 9: Regular expressions

Quantifiers (1/5)

• Quantifiers allow character classes to match more than one character at a time.

Quantifiers for a character class X

X?       zero or one time
X*       zero or more times
X+       one or more times
X{n}     exactly n times
X{n,}    at least n times
X{n,m}   at least n and at most m times

Page 10: Regular expressions

Quantifiers (2/5)

• Examples of X?, X*, X+

REGEX: "a?"
TEXT: ""
RESULT: ""

REGEX: "a*"
TEXT: ""
RESULT: ""

REGEX: "a+"
TEXT: ""
RESULT: -

REGEX: "a?"
TEXT: "a"
RESULT: "a"

REGEX: "a*"
TEXT: "a"
RESULT: "a"

REGEX: "a+"
TEXT: "a"
RESULT: "a"

REGEX: "a?"
TEXT: "aaa"
RESULT: "a", "a", "a"

REGEX: "a*"
TEXT: "aaa"
RESULT: "aaa"

REGEX: "a+"
TEXT: "aaa"
RESULT: "aaa"

• Because a? and a* may match zero characters, a search with them also reports an empty match at the end of a non-empty text
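The following Java sketch runs the same quantifier searches with Matcher.find(). The class name QuantifierDemo and the helper findAll are made up for this illustration. The printed lists include the zero-length matches that a? and a* produce at the end of the text, which a summary of the results usually leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class QuantifierDemo {
    // Collects every match found by repeated find() calls, including empty ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a?", ""));    // one empty match
        System.out.println(findAll("a+", ""));    // no match at all
        System.out.println(findAll("a?", "aaa")); // "a" three times, then an empty match
        System.out.println(findAll("a*", "aaa")); // "aaa", then an empty match
        System.out.println(findAll("a+", "aaa")); // "aaa" only
    }
}
```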

Page 11: Regular expressions

Quantifiers (3/5)

REGEX: "[abc]{3}"
TEXT: "abccabaaaccbbbc"
RESULT: "abc", "cab", "aaa", "ccb", "bbc"

REGEX: "abc{3}"
TEXT: "abccabaaaccbbbc"
RESULT: -

REGEX: "(dog){3}"
TEXT: "dogdogdogdogdogdog"
RESULT: "dogdogdog", "dogdogdog"
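These three cases differ only in what the {3} applies to: a character class, the single character c, or a group. A minimal Java sketch, with the made-up class name RepetitionDemo and helper findAll:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class RepetitionDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "abccabaaaccbbbc";
        // {3} applies to the class: any three characters out of a, b, c.
        System.out.println(findAll("[abc]{3}", text)); // [abc, cab, aaa, ccb, bbc]
        // {3} applies to c only: "ab" followed by "ccc".
        System.out.println(findAll("abc{3}", text));   // []
        System.out.println(findAll("abc{3}", "abccc")); // [abccc]
        // {3} applies to the whole group.
        System.out.println(findAll("(dog){3}", "dogdogdogdogdogdog")); // [dogdogdog, dogdogdog]
    }
}
```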

Page 12: Regular expressions

Quantifiers (4/5)

• Greedy quantifiers:
– read complete string
– work backwards until match found
– syntax: X?, X*, X+, ...
• Reluctant quantifiers:
– read one character at a time
– work forward until match found
– syntax: X??, X*?, X+?, ...
• Possessive quantifiers:
– read complete string
– try match only once
– syntax: X?+, X*+, X++, ...

Page 13: Regular expressions

Quantifiers (5/5)

Greedy:
REGEX: ".*foo"
TEXT: "xfooxxxxxxfoo"
RESULT: "xfooxxxxxxfoo"

Reluctant:
REGEX: ".*?foo"
TEXT: "xfooxxxxxxfoo"
RESULT: "xfoo", "xxxxxxfoo"

Possessive:
REGEX: ".*+foo"
TEXT: "xfooxxxxxxfoo"
RESULT: -
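The difference between the three quantifier types can be checked with a small Java sketch. The class name GreedinessDemo and the helper findAll are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class GreedinessDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits.
        System.out.println(findAll(".*foo", text));  // [xfooxxxxxxfoo]
        // Reluctant: .*? takes as little as possible before each "foo".
        System.out.println(findAll(".*?foo", text)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes everything and never backs off, so "foo" cannot match.
        System.out.println(findAll(".*+foo", text)); // []
    }
}
```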

Page 14: Regular expressions

Capturing groups (1/2)

• Capturing groups treat multiple characters as a single unit
• Syntax: between parentheses ( and )
• Example: (dog){3}
• Numbering from left to right, counting opening parentheses
– Example: ((A)(B(C)))
  • Group 1: ((A)(B(C)))
  • Group 2: (A)
  • Group 3: (B(C))
  • Group 4: (C)
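The numbering can be verified in Java, where group 0 always denotes the whole match. This is a sketch only; the class name GroupNumberingDemo and the helper groups are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class GroupNumberingDemo {
    // Returns groups 0..groupCount() when the whole text matches, otherwise null.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Group " + i + ": " + g[i]);
        }
        // Group 0: ABC (whole match)
        // Group 1: ABC
        // Group 2: A
        // Group 3: BC
        // Group 4: C
    }
}
```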

Page 15: Regular expressions

Capturing groups (2/2)

• Backreferences to capturing groups are denoted by \i with i an integer number

REGEX: "(\d\d)\1"
TEXT: "1212"
RESULT: "1212"

REGEX: "(\d\d)\1"
TEXT: "1234"
RESULT: -
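A backreference matches the exact text that the group captured, not the group's pattern. The Java sketch below shows the two cases above plus one extra, assumed example (finding a doubled word) that is not in the slides; the class name BackreferenceDemo is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class BackreferenceDemo {
    // Returns the first doubled word ("is is") in text, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \1 must repeat whatever (\d\d) captured.
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1212")); // true
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1234")); // false
        System.out.println(firstDoubledWord("this is is a test"));  // is is
    }
}
```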

Page 16: Regular expressions

Boundaries (1/2)

Boundary characters

^     beginning of line
$     end of line
\b    a word boundary
\B    a non-word boundary
\A    beginning of input
\G    end of previous match
\z    end of input
\Z    end of input, but before final terminator, if any
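Two of these boundaries are sketched below in Java. One point worth knowing: in java.util.regex, ^ and $ match only at the beginning and end of the whole input unless the Pattern.MULTILINE flag is set. The class name BoundaryDemo and the helper count are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class BoundaryDemo {
    // Counts the non-overlapping matches of p in text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concat cats";
        System.out.println(count(Pattern.compile("cat"), text));       // 3
        System.out.println(count(Pattern.compile("\\bcat\\b"), text)); // 1: only the word "cat"

        String lines = "a\nb\nc";
        // Without MULTILINE, ^ and $ refer to the whole input.
        System.out.println(Pattern.compile("^b$").matcher(lines).find());                    // false
        // With MULTILINE, ^ and $ match at every line.
        System.out.println(Pattern.compile("^b$", Pattern.MULTILINE).matcher(lines).find()); // true
    }
}
```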

Page 17: Regular expressions

Boundaries (2/2)

• Be aware: the end-of-line marker is $
– Unix EOL is \n
– Windows EOL is \r\n
– JDK uses any of the following as EOL:
  • '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029'
• Always test your regular expressions on the target OS

Page 18: Regular expressions

Internationalization (1/2)

• Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
– Thus “België” is not matched by ^\w+$
• Extension to Unicode character sets is denoted by \p{...}
• Character set: [\p{InCharacterSet}]
– Create character classes from symbols in character sets.
– “België” is matched by ^[\w\p{InLatin-1Supplement}]+$

Page 19: Regular expressions

Internationalization (2/2)

• Note that there are non-letters in character sets as well:
– Latin-1 Supplement: ¡¢£¤¥¦§¨©ª« ®¯°±²³´µ·¸¹º»¼½¾¿÷
• Categories:
– Letters: \p{L}
– Uppercase letters: \p{Lu}
– “België” is matched by ^\p{L}+$
• Other categories (Unicode and POSIX):
– Unicode currency symbols: \p{Sc}
– ASCII punctuation characters: \p{Punct}
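The behaviour of \w and \p{L} on “België” can be checked with the Java sketch below. The class name UnicodeDemo is made up, and the Pattern.UNICODE_CHARACTER_CLASS flag shown at the end is an addition that is not in the slides: it exists only in Java 7 and later. The ë is written as \u00eb so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class UnicodeDemo {
    // True when the whole text matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String word = "Belgi\u00eb"; // België

        System.out.println(matches("^\\w+$", 0, word));     // false: \w is ASCII-only by default
        System.out.println(matches("^\\p{L}+$", 0, word));  // true: any Unicode letter
        System.out.println(matches("\\p{Lu}", 0, "B"));     // true: uppercase letter
        System.out.println(matches("\\p{Sc}", 0, "\u20ac")); // true: the euro sign is a currency symbol
        System.out.println(matches("\\p{Punct}", 0, "!"));  // true: ASCII punctuation

        // Assumption: Java 7 or later, where this flag makes \w Unicode-aware.
        System.out.println(matches("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS, word)); // true
    }
}
```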

Page 20: Regular expressions

Regular expressions in Java

• Since JDK 1.4
• Package java.util.regex
– Pattern class
– Matcher class
• Convenience methods in java.lang.String
• Alternative for JDK 1.3
– Jakarta ORO project

Page 21: Regular expressions

java.util.regex.Pattern

• Wrapper class for regular expressions
• Useful methods:
– compile(String regex): Pattern
– matches(String regex, CharSequence text): boolean
– split(CharSequence text): String[]

String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);

Page 22: Regular expressions

java.util.regex.Matcher

• Useful methods:
– matches(): boolean
– find(): boolean
– find(int start): boolean
– group(): String
– replaceFirst(String replace): String
– replaceAll(String replace): String

String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);
String text = "1212";
Matcher m = p.matcher(text);
boolean matches = m.matches();
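A slightly longer sketch shows how matches(), find(), group() and the replace methods relate. The sample text and the class name MatcherDemo are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class MatcherDemo {
    // Collects the text of every match that find() reports.
    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d\\d)\\1");
        String text = "1212 3434 1234";

        // matches() requires the whole text to match; find() looks for substrings.
        System.out.println(p.matcher(text).matches());         // false
        System.out.println(findAll(p, text));                  // [1212, 3434]
        System.out.println(p.matcher(text).replaceFirst("#")); // # 3434 1234
        System.out.println(p.matcher(text).replaceAll("#"));   // # # 1234
    }
}
```

Compiling the Pattern once and reusing it for several Matcher objects avoids recompiling the regex for every text.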

Page 23: Regular expressions

java.lang.String

• Pattern and Matcher methods in String:
– matches(String regex): boolean
– split(String regex): String[]
– replaceFirst(String regex, String replace): String
– replaceAll(String regex, String replace): String
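The String convenience methods can be tried without touching Pattern or Matcher directly, as in this sketch. The sample strings and the class name StringRegexDemo are made up for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original slides.
class StringRegexDemo {
    // Splits a comma-separated list, ignoring white space around the commas.
    static String[] splitList(String csv) {
        return csv.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println("1212".matches("(\\d\\d)\\1"));          // true
        System.out.println(Arrays.toString(splitList("a, b ,c")));   // [a, b, c]
        System.out.println("2014-11-20".replaceFirst("-", "/"));     // 2014/11-20
        System.out.println("2014-11-20".replaceAll("-", "/"));       // 2014/11/20
    }
}
```

Each of these calls compiles the regex again, so for repeated use of the same regex a precompiled Pattern is usually the better choice.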

Page 24: Regular expressions

Examples

• Validation
• Searching text
• Filtering
• Parsing
• Removing duplicate lines
• On-the-fly editing

Page 25: Regular expressions

Examples: validation

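• Validate an e-mail address (match case-insensitively, since the character classes list capitals only):

[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}

• Validate a URL (written here as a Java string literal, hence the doubled backslashes):

(http|https|ftp)://([a-zA-Z0-9](\\w+\\.)+\\w{2,7}|local\\w*)(:\\d+)?(/(\\w+[\\w/\\-\\.]*)?)?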
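A minimal Java sketch of the e-mail validation, assuming the pattern is meant to be used case-insensitively. The class name EmailValidationDemo, the method isEmail, and the example.com addresses are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class EmailValidationDemo {
    // The pattern lists A-Z only, so it is compiled with CASE_INSENSITIVE.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // True when the whole candidate string has the shape of an e-mail address.
    static boolean isEmail(String candidate) {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("ignaz@example.com")); // true
        System.out.println(isEmail("IGNAZ@EXAMPLE.COM")); // true
        System.out.println(isEmail("not-an-email"));      // false
        System.out.println(isEmail("a@b"));               // false: no dot-separated suffix
    }
}
```

The {2,4} at the end rejects addresses whose last label is longer than four characters, so this pattern is a rough shape check rather than a complete validator.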

Page 26: Regular expressions

Examples: searching text

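• Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:

9[0-9]{6}-[0-9]{6}

// text holds the body of the HTTP response
String regexp = "9[0-9]{6}-[0-9]{6}";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(text);
boolean ok = m.find();
// group() throws an IllegalStateException when find() did not succeed
String nr = ok ? m.group() : null;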

Page 27: Regular expressions

Examples: filtering

• Filter e-mails whose subject is in capitals only, allowing for a leading “Re:” (the regex must match the whole subject line):

(R[eE]:)*[^a-z]*$
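A sketch of this filter in Java, assuming the regex is applied to the complete subject line with matches(). With find() the regex would always succeed, because [^a-z]*$ can match the empty string at the end of any text. The class name SubjectFilterDemo and the sample subjects are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class SubjectFilterDemo {
    private static final Pattern ALL_CAPS = Pattern.compile("(R[eE]:)*[^a-z]*$");

    // True when the subject contains no lower-case letters apart from a leading "Re:".
    static boolean isAllCaps(String subject) {
        return ALL_CAPS.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllCaps("URGENT MEETING"));     // true
        System.out.println(isAllCaps("Re: URGENT MEETING")); // true
        System.out.println(isAllCaps("Re: urgent meeting")); // false
        System.out.println(isAllCaps("Hello"));              // false
    }
}
```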

Page 28: Regular expressions

Examples: parsing

• Matches any opening and closing XML tag:

– Note the use of the back reference

<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
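The tag regex can be tried in Java as sketched below, assuming case-insensitive matching because the pattern lists capitals only. The class name TagParsingDemo, the method firstElement, and the sample markup are made up. This is no substitute for an XML parser: the regex does not handle an element nested inside another element of the same name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class TagParsingDemo {
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Returns {tag name, content} of the first element found, or null if there is none.
    static String[] firstElement(String text) {
        Matcher m = ELEMENT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] e = firstElement("<p class=\"x\">Hello <b>world</b></p>");
        System.out.println(e[0]); // p
        System.out.println(e[1]); // Hello <b>world</b>
        // The back reference \1 rejects a closing tag with a different name.
        System.out.println(firstElement("<i>text</b>")); // null
    }
}
```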

Page 29: Regular expressions

Examples: duplicate lines

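• Suppose you want to remove duplicate lines from a text.

– requirement here is that the lines are sorted alphabetically, so that duplicates are adjacent
– match in multi-line mode and replace each match with the first group

^(.*)(\r?\n\1)+$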
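In Java this could look like the sketch below, assuming Pattern.MULTILINE so that ^ and $ apply to every line, and "$1" as the replacement so that one copy of each line is kept. The class name DuplicateLineDemo and the method dedupe are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class DuplicateLineDemo {
    private static final Pattern DUPLICATES =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Collapses each run of identical adjacent lines into a single line.
    static String dedupe(String sortedText) {
        return DUPLICATES.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "a\na\nb\nc\nc\nc";
        System.out.println(dedupe(text));
        // prints:
        // a
        // b
        // c
    }
}
```

The trailing $ is what keeps a line such as "a" from being treated as a duplicate of a following "ab".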

Page 30: Regular expressions

Examples: on-the-fly editing

• Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
• In Unix: use the sed command with a regex
• In Java: use string.replaceAll(regex, "mystring")
• In Ant: use the optional replaceregexp task to, e.g., edit deployment descriptors depending on environment

Page 31: Regular expressions

Quiz

• What are the following regular expressions looking for?

\d+                            at least one digit
[-+]?\d+                       any integer
((\d*\.?)?\d+|\d+(\.?\d*))     any positive decimal
[\p{L}']['\-\.\p{L} ]+         a place name
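The quiz answers can be checked with a few whole-string matches in Java. The class name QuizDemo, its constants, and the sample inputs are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class QuizDemo {
    static final String DIGITS = "\\d+";
    static final String INTEGER = "[-+]?\\d+";
    static final String DECIMAL = "((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))";
    static final String PLACE = "[\\p{L}']['\\-\\.\\p{L} ]+";

    public static void main(String[] args) {
        System.out.println(Pattern.matches(DIGITS, "2014"));   // true
        System.out.println(Pattern.matches(INTEGER, "-42"));   // true
        System.out.println(Pattern.matches(INTEGER, "4.2"));   // false
        System.out.println(Pattern.matches(DECIMAL, "3.14"));  // true
        System.out.println(Pattern.matches(DECIMAL, ".5"));    // true
        System.out.println(Pattern.matches(DECIMAL, "5."));    // true
        System.out.println(Pattern.matches(DECIMAL, "-1"));    // false: no sign allowed
        System.out.println(Pattern.matches(PLACE, "'s-Hertogenbosch")); // true
        System.out.println(Pattern.matches(PLACE, "St. Louis"));        // true
        System.out.println(Pattern.matches(PLACE, "123"));              // false
    }
}
```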

Page 32: Regular expressions

Conclusion

• When doing one of the following:
– validating strings
– on-the-fly editing of strings
– searching strings
– filtering strings
• think regex!

Page 33: Regular expressions

References

• http://www.regular-expressions.info/
• http://www.regexlib.com/
• http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
• http://java.sun.com/docs/books/tutorial/extra/regex/
• http://www.wellho.net/regex/javare.html
• >JDK 1.4 API
• Mastering Regular Expressions