Regular expressions in Java 101 (COMP204); source: Sun tutorial, …
What are they?
• a way to describe patterns in strings
• similar to regex in Perl
• cryptic syntax: “write once, ponder many times”
• used to search, parse, and modify textual data
• Java: package java.util.regex with the classes Pattern, Matcher, and PatternSyntaxException
• plus utility methods in the String class
String constants match
• regex: foo
  string: foo
  => 0:3 "foo"
• regex: foo
  string: foofoofoo
  => 0:3 "foo"
  => 3:6 "foo"
  => 6:9 "foo"
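The start:end results above can be reproduced with a small find loop. This is a minimal sketch; the class name FindAllDemo and the helper findAll are made up for illustration, and each result reads as start index, end index (exclusive), then the matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Hypothetical helper: collects every match as start:end "text".
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            // end() is exclusive: it is the index just after the match
            hits.add(m.start() + ":" + m.end() + " \"" + m.group() + "\"");
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : findAll("foo", "foofoofoo")) {
            System.out.println("=> " + hit);
        }
    }
}
```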
Meta characters
• Some characters are “special”, e.g. a single dot “.” matches any character:
• regex: cat.
  string: cats
  => 0:4 "cats"
• Others are: ([{\^-$|]})?*+.
• To use a meta character literally, “escape” it with a backslash (e.g. \.) or “quote” it, e.g. \Q.\E
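A short sketch of escaping and quoting from Java code. Note that inside a Java string literal the backslash itself must be doubled, so the regex \. is written "\\.". The class EscapeDemo and the helper literalMatch are illustrative names only; Pattern.quote is the library method that wraps a string in \Q...\E.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: treats 'literal' as plain text, not as a regex.
    static boolean literalMatch(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped dot: matches any character, so "cats" matches.
        System.out.println(Pattern.matches("cat.", "cats"));
        // Escaped dot: only a literal '.' is accepted.
        System.out.println(Pattern.matches("cat\\.", "cats"));
        System.out.println(Pattern.matches("cat\\.", "cat."));
        // Quoting with \Q...\E has the same effect as escaping.
        System.out.println(Pattern.matches("cat\\Q.\\E", "cat."));
        // Without quoting, '+' would be read as a quantifier.
        System.out.println(literalMatch("1+1=2", "1+1=2"));
    }
}
```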
Character classes
[abc] a, b, or c (simple class)
[^abc] any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
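To see what a class accepts, one can filter a string character by character. The sketch below does that for some of the classes in the table; CharClassDemo and keep are hypothetical names, not part of java.util.regex.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: keeps only the characters accepted by charClass.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            String ch = String.valueOf(input.charAt(i));
            if (p.matcher(ch).matches()) {
                kept.append(ch);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("[^abc]", "abcd"));                // negation
        System.out.println(keep("[a-d[m-p]]", "abcmnopqrs"));      // union
        System.out.println(keep("[a-z&&[def]]", "abcdefg"));       // intersection
        System.out.println(keep("[a-z&&[^m-p]]", "abcmnopqrs"));   // subtraction
    }
}
```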
Predefined classes (see Pattern)
. Any character (may or may not match line terminators)
\d digit: [0-9]
\D non-digit: [^0-9]
\s whitespace character: [ \t\n\x0B\f\r]
\S non-whitespace character: [^\s]
\w word character: [a-zA-Z_0-9]
\W non-word character: [^\w]
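A sketch of the predefined classes in use, pulling tokens out of a string. As before, \d and \w are written "\\d" and "\\w" in Java source. PredefinedDemo, tokens, and the sample input are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedDemo {
    // Hypothetical helper: returns the text of every match, in order.
    static List<String> tokens(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "order 66, item_7";
        System.out.println(tokens("\\d+", input)); // runs of digits
        System.out.println(tokens("\\w+", input)); // runs of word characters
        System.out.println(tokens("\\s", input).size()); // number of whitespace characters
    }
}
```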
Greedy Quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
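A few quantifiers checked against whole strings; a rough sketch with made-up inputs. QuantifierDemo and fullMatch are illustrative names; Pattern.matches requires the entire input to match.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true only if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));   // X?: the u is optional
        System.out.println(fullMatch("colou?r", "colour"));
        System.out.println(fullMatch("a*", ""));             // X*: zero times is allowed
        System.out.println(fullMatch("a+", ""));             // X+: needs at least one
        System.out.println(fullMatch("a{2,3}", "aa"));       // X{n,m}: within range
        System.out.println(fullMatch("a{2,3}", "aaaa"));     // X{n,m}: too many
        System.out.println(fullMatch("\\d{3}-\\d{4}", "555-1234")); // X{n}: exact counts
    }
}
```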
Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
Possessive Quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n but not more than m times
What’s the difference?
// greedy quantifier
regex: .*foo
string: xfooxxxxxxfoo
=> 0:13 "xfooxxxxxxfoo"

// reluctant quantifier
regex: .*?foo
string: xfooxxxxxxfoo
=> 0:4 "xfoo"
=> 4:13 "xxxxxxfoo"

// possessive quantifier
regex: .*+foo
string: xfooxxxxxxfoo
No match found.
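The three results can be checked with the sketch below. Roughly: the greedy .* first takes the whole input and backs off until foo fits; the reluctant .*? starts with nothing and grows only as needed; the possessive .*+ takes the whole input and never gives anything back, so foo can never match. QuantifierModes and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Hypothetical helper: every match formatted as start:end "text".
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.start() + ":" + m.end() + " \"" + m.group() + "\"");
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy:     " + findAll(".*foo", input));
        System.out.println("reluctant:  " + findAll(".*?foo", input));
        // An empty list here corresponds to "No match found."
        System.out.println("possessive: " + findAll(".*+foo", input));
    }
}
```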
Capturing groups
• Quantifiers apply to single characters (e.g. a*, which matches everything; why?), character classes (e.g. \s+), or groups (e.g. (dog){2})
• Groups are numbered left to right by their opening parenthesis: ((A)(B(C))) =>
  1 ((A)(B(C)))
  2 (A)
  3 (B(C))
  4 (C)
• Refer back to a group with a backreference, e.g. \2 for group two:
  regex: (\w)\1
  string: hello
  => 2:4 "ll"
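Group numbering and backreferences can be inspected through Matcher.group(int) and Matcher.groupCount(). A minimal sketch follows; GroupDemo, groups, and firstRepeat are names invented here. Group 0 always denotes the whole match and is not included in groupCount().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: groups 0..groupCount() of a full match, else an empty list.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    // Hypothetical helper: first doubled word character, as start:end text.
    static String firstRepeat(String input) {
        // (\w)\1 : one word character, followed by the same character again
        Matcher m = Pattern.compile("(\\w)\\1").matcher(input);
        return m.find() ? m.start() + ":" + m.end() + " " + m.group() : "no match";
    }

    public static void main(String[] args) {
        List<String> g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.size(); i++) {
            System.out.println(i + " " + g.get(i));
        }
        System.out.println(firstRepeat("hello"));
    }
}
```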
Boundaries
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
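A sketch of how boundary matchers narrow down where dog may match. The input string, the class BoundaryDemo, and the helper starts are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: start index of every match.
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "dog doggie dog";
        System.out.println(starts("dog", input));        // every occurrence
        System.out.println(starts("\\bdog\\b", input));  // dog as a whole word
        System.out.println(starts("\\bdog\\B", input));  // dog that continues into a longer word
        System.out.println(starts("^dog", input));       // only at the beginning
        System.out.println(starts("dog$", input));       // only at the end
    }
}
```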
Pattern class
boolean b = Pattern.matches("a*b", "aaaaab");
or
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();
The latter allows for efficient reuse of the compiled pattern.
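One way the reuse might look in practice: compile the pattern once, keep it in a constant, and create a new Matcher per input. ReuseDemo, AB, and countMatching are hypothetical names for this sketch.

```java
import java.util.regex.Pattern;

class ReuseDemo {
    // Compiled once; a Pattern is immutable and can be shared.
    private static final Pattern AB = Pattern.compile("a*b");

    // Hypothetical helper: how many inputs match the pattern completely.
    static int countMatching(String... inputs) {
        int count = 0;
        for (String input : inputs) {
            // Only a cheap Matcher is created per input; no recompilation.
            if (AB.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatching("aaaaab", "b", "ba", "ab"));
    }
}
```

Pattern.matches(regex, input) is convenient for a one-off test, but it compiles the regex on every call.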
Splitting a string using a regex
Pattern p = Pattern.compile("a*b");
String[] items = p.split("aabbab");
for (String s : items) System.out.println(s);
Similar to the split(regex) method in class String:
String[] items = "aabbab".split("a*b");
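Worth noting when trying this out: with the regex a*b and the input aabbab, the matches cover the whole string, so every piece between them is empty. Since split drops trailing empty strings, the resulting array is empty and the loop prints nothing. The sketch below contrasts that with an input that leaves visible pieces; SplitDemo, splitOn, and the second input are chosen for illustration only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Hypothetical helper: splits input around matches of regex.
    static String[] splitOn(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // All pieces are empty, so the result has length 0.
        System.out.println(splitOn("a*b", "aabbab").length);

        // Single digits act as delimiters between the words.
        System.out.println(Arrays.toString(splitOn("\\d", "one9two4three7four1five")));

        // String.split does the same for this simple case.
        System.out.println(Arrays.toString("x1y22z".split("\\d+")));
    }
}
```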
Matcher class
• loads of methods, e.g. to access groups (see test harness) or replace expressions:
Pattern p = Pattern.compile("dog");
Matcher m = p.matcher("the dog runs");
String result = m.replaceAll("cat");
System.out.println(result);
=> "the cat runs"
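Two further Matcher methods, sketched under made-up inputs: replaceFirst changes only the first match, and the replacement string of replaceAll may refer to captured groups as $1, $2, and so on. ReplaceDemo, swapWords, and replaceFirstDog are illustrative names.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    // Hypothetical helper: swaps each pair of space-separated words via group references.
    static String swapWords(String input) {
        return Pattern.compile("(\\w+) (\\w+)").matcher(input).replaceAll("$2 $1");
    }

    // Hypothetical helper: replaces only the first occurrence of "dog".
    static String replaceFirstDog(String input) {
        return Pattern.compile("dog").matcher(input).replaceFirst("cat");
    }

    public static void main(String[] args) {
        System.out.println(swapWords("hello world"));
        System.out.println(replaceFirstDog("the dog chased the dog"));
    }
}
```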